Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

2021-04-14 Thread Tom Talpey

On 4/12/2021 6:48 PM, Jason Gunthorpe wrote:

On Mon, Apr 12, 2021 at 04:20:47PM -0400, Tom Talpey wrote:


So the issue is only in testing all the providers and platforms,
to be sure this new behavior isn't tickling anything that went
unnoticed all along, because no RDMA provider ever issued RO.


The mlx5 ethernet driver has run in RO mode for a long time, and it
operates in basically the same way as RDMA. The issues with Haswell
have been worked out there already.

The only open question is if the ULPs have errors in their
implementation, which I don't think we can find out until we apply
this series and people start running their tests aggressively.


I agree that the core RO support should go in. But turning it on
by default for a ULP should be the decision of each ULP maintainer.
It's a huge risk to shift all the storage drivers overnight. How
do you propose to ensure the aggressive testing happens?

One thing that worries me is patch 2's on-by-default setting for the dma_lkey.
There's no way for a ULP to prevent IB_ACCESS_RELAXED_ORDERING
from being set in __ib_alloc_pd().
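
To illustrate the concern, a hypothetical per-ULP opt-out (not part of the
posted series; the IB_PD_STRICT_DMA_LKEY flag below is made up) might look
something like this in __ib_alloc_pd():

	if (device->attrs.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY) {
		pd->local_dma_lkey = device->local_dma_lkey;
	} else {
		mr_access_flags |= IB_ACCESS_LOCAL_WRITE;
		/* hypothetical flag, for illustration only */
		if (!(flags & IB_PD_STRICT_DMA_LKEY))
			mr_access_flags |= IB_ACCESS_RELAXED_ORDERING;
	}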

Tom.




Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

2021-04-12 Thread Tom Talpey

On 4/12/2021 2:32 PM, Haakon Bugge wrote:




On 10 Apr 2021, at 15:30, David Laight  wrote:

From: Tom Talpey

Sent: 09 April 2021 18:49
On 4/9/2021 12:27 PM, Haakon Bugge wrote:




On 9 Apr 2021, at 17:32, Tom Talpey  wrote:

On 4/9/2021 10:45 AM, Chuck Lever III wrote:

On Apr 9, 2021, at 10:26 AM, Tom Talpey  wrote:

On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:

On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:


We need to get a better idea what correctness testing has been done,
and whether positive correctness testing results can be replicated
on a variety of platforms.

RO has been rolling out slowly on mlx5 over a few years and storage
ULPs are the last to change. eg the mlx5 ethernet driver has had RO
turned on for a long time, userspace HPC applications have been using
it for a while now too.


I'd love to see RO be used more, it was always something the RDMA
specs supported and carefully architected for. My only concern is
that it's difficult to get right, especially when the platforms
have been running strictly-ordered for so long. The ULPs need
testing, and a lot of it.


We know there are platforms with broken RO implementations (like
Haswell) but the kernel is supposed to globally turn off RO on all
those cases. I'd be a bit surprised if we discover any more from this
series.
On the other hand there are platforms that get huge speed ups from
turning this on, AMD is one example, there are a bunch in the ARM
world too.


My belief is that the biggest risk is from situations where completions
are batched, and therefore polling is used to detect them without
interrupts (which would explicitly flush ordering). The RO pipeline will completely reorder
DMA writes, and consumers which infer ordering from memory contents may
break. This can even apply within the provider code, which may attempt
to poll WR and CQ structures, and be tripped up.

You are referring specifically to RPC/RDMA depending on Receive
completions to guarantee that previous RDMA Writes have been
retired? Or is there a particular implementation practice in
the Linux RPC/RDMA code that worries you?


Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
is hopefully unfounded, is that the RO pipeline might not have flushed
when a completion is posted *after* posting an interrupt.

Something like this...

RDMA Write arrives
PCIe RO Write for data
PCIe RO Write for data
...
RDMA Write arrives
PCIe RO Write for data
...
RDMA Send arrives
PCIe RO Write for receive data
PCIe RO Write for receive descriptor


Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct
implementation. Then it will ensure prior written RO data has global visibility
when the CQE can be observed.

I wasn't aware that a strongly-ordered PCIe Write will ensure that
prior relaxed-ordered writes went first. If that's the case, I'm
fine with it - as long as the providers are correctly coded!!


The PCIe spec (Table "Ordering Rules Summary") is quite clear here (a Posted
Request is a Memory Write Request in this context):

A Posted Request must not pass another Posted Request unless A2b applies.

A2b: A Posted Request with RO Set is permitted to pass another Posted Request.


Thxs, Håkon


Ok, good - a non-RO write (for example, to a CQE), or an interrupt
(which would be similarly non-RO), will "get behind" all prior writes.

So the issue is only in testing all the providers and platforms,
to be sure this new behavior isn't tickling anything that went
unnoticed all along, because no RDMA provider ever issued RO.

Honestly, the Haswell sounds like a great first candidate, because
if it has a known-broken RO behavior, verifying that it works with
this change is highly important. I'd have greater confidence in newer
platforms, in other words. They *all* have to work, provably.

Tom.


I remember trying to read the relevant section of the PCIe spec.
(Possibly in a book that was trying to make it easier to understand!)
It is about as clear as mud.

I presume this is all about allowing PCIe targets (eg ethernet cards)
to use relaxed ordering on write requests to host memory.
And that such writes can be completed out of order?

It isn't entirely clear that you aren't talking of letting the
cpu do 'relaxed order' writes to PCIe targets!

For a typical ethernet driver the receive interrupt just means
'go and look at the receive descriptor ring'.
So there is an absolute requirement that the writes for data
buffer complete before the write to the receive descriptor.
There is no requirement for the interrupt (requested after the
descriptor write) to have been seen by the cpu.

Quite often the driver will find the 'receive complete'
descriptor when processing frames from an earlier interrupt
(and nothing to do in response to the interrupt itself).

So the write to the receive descriptor would have to have RO clear
to ensure that all the buffer writes complete first.
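
A minimal sketch of the pattern being described, with a made-up descriptor
layout and a hypothetical process_frame() helper (illustration only, not any
particular driver):

	struct rx_desc {
		__le16 len;
		__le16 status;	/* RX_DONE set by the NIC after the data DMA */
	};
	#define RX_DONE 0x0001

	static void rx_poll(struct rx_desc *ring, void **bufs, int n, int *head)
	{
		while (le16_to_cpu(ring[*head].status) & RX_DONE) {
			/*
			 * Only safe if the status write could not pass the
			 * data writes, i.e. the NIC cleared RO (or used a
			 * strongly ordered write) for the descriptor update.
			 */
			dma_rmb();	/* order the status read before data reads */
			process_frame(bufs[*head], le16_to_cpu(ring[*head].len));
			ring[*head].status = 0;
			*head = (*head + 1) % n;
		}
	}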

Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

2021-04-09 Thread Tom Talpey

On 4/9/2021 12:27 PM, Haakon Bugge wrote:




On 9 Apr 2021, at 17:32, Tom Talpey  wrote:

On 4/9/2021 10:45 AM, Chuck Lever III wrote:

On Apr 9, 2021, at 10:26 AM, Tom Talpey  wrote:

On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:

On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
  

We need to get a better idea what correctness testing has been done,
and whether positive correctness testing results can be replicated
on a variety of platforms.

RO has been rolling out slowly on mlx5 over a few years and storage
ULPs are the last to change. eg the mlx5 ethernet driver has had RO
turned on for a long time, userspace HPC applications have been using
it for a while now too.


I'd love to see RO be used more, it was always something the RDMA
specs supported and carefully architected for. My only concern is
that it's difficult to get right, especially when the platforms
have been running strictly-ordered for so long. The ULPs need
testing, and a lot of it.


We know there are platforms with broken RO implementations (like
Haswell) but the kernel is supposed to globally turn off RO on all
those cases. I'd be a bit surprised if we discover any more from this
series.
On the other hand there are platforms that get huge speed ups from
turning this on, AMD is one example, there are a bunch in the ARM
world too.


My belief is that the biggest risk is from situations where completions
are batched, and therefore polling is used to detect them without
interrupts (which would explicitly flush ordering). The RO pipeline will completely reorder
DMA writes, and consumers which infer ordering from memory contents may
break. This can even apply within the provider code, which may attempt
to poll WR and CQ structures, and be tripped up.

You are referring specifically to RPC/RDMA depending on Receive
completions to guarantee that previous RDMA Writes have been
retired? Or is there a particular implementation practice in
the Linux RPC/RDMA code that worries you?


Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
is hopefully unfounded, is that the RO pipeline might not have flushed
when a completion is posted *after* posting an interrupt.

Something like this...

RDMA Write arrives
PCIe RO Write for data
PCIe RO Write for data
...
RDMA Write arrives
PCIe RO Write for data
...
RDMA Send arrives
PCIe RO Write for receive data
PCIe RO Write for receive descriptor


Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct
implementation. Then it will ensure prior written RO data has global visibility
when the CQE can be observed.


I wasn't aware that a strongly-ordered PCIe Write will ensure that
prior relaxed-ordered writes went first. If that's the case, I'm
fine with it - as long as the providers are correctly coded!!


PCIe interrupt (flushes RO pipeline for all three ops above)


Before the interrupt, the HCA will write the EQE (Event Queue Entry). This has to be a 
Strongly Ordered write to "push" prior written CQEs so that when the EQE is 
observed, the prior writes of CQEs have global visibility.

And the MSI-X write likewise, to avoid spurious interrupts.


Ok, and yes agreed the same principle would apply.

Is there any implication if a PCIe switch were present on the
motherboard? The switch is allowed to do some creative routing
if the operation is relaxed, correct?

Tom.


Thxs, Håkon



RPC/RDMA polls CQ
Reaps receive completion

RDMA Send arrives
PCIe RO Write for receive data
PCIe RO write for receive descriptor
Does *not* interrupt, since CQ not armed

RPC/RDMA continues to poll CQ
Reaps receive completion
PCIe RO writes not yet flushed
Processes incomplete in-memory data
Bzzzt

Hopefully, the adapter performs a PCIe flushing read, or something
to avoid this when an interrupt is not generated. Alternatively, I'm
overly paranoid.

Tom.


The Mellanox adapter, itself, historically has strict in-order DMA
semantics, and while it's great to relax that, changing it by default
for all consumers is something to consider very cautiously.


Still, obviously people should test on the platforms they have.


Yes, and "test" be taken seriously with focus on ULP data integrity.
Speedups will mean nothing if the data is ever damaged.

I agree that data integrity comes first.
Since I currently don't have facilities to test RO in my lab, the
community will have to agree on a set of tests and expected results
that specifically exercise the corner cases you are concerned about.
--
Chuck Lever




Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

2021-04-09 Thread Tom Talpey

On 4/9/2021 12:40 PM, Jason Gunthorpe wrote:

On Fri, Apr 09, 2021 at 10:26:21AM -0400, Tom Talpey wrote:


My belief is that the biggest risk is from situations where completions
are batched, and therefore polling is used to detect them without
interrupts (which would explicitly flush ordering).


We don't do this in the kernel.

All kernel ULPs only read data after they observe the CQE. We do not
have "last data polling" and our interrupt model does not support some
hacky "interrupt means go and use the data" approach.

ULPs have to be designed this way to use the DMA API properly.


Yep. Totally agree.

My concern was about the data being written as relaxed, and the CQE
racing it. I'll reply in the other fork.
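
For reference, a minimal sketch of that CQE-first pattern (hypothetical
receive path, not taken from any particular ULP; ulp_process_message() is
made up):

	static void ulp_poll_recv(struct ib_cq *cq, struct ib_device *dev,
				  u64 dma_addr, void *buf, size_t len)
	{
		struct ib_wc wc;

		while (ib_poll_cq(cq, 1, &wc) > 0) {
			if (wc.status != IB_WC_SUCCESS)
				continue;
			/* Hand the buffer back to the CPU before touching it. */
			ib_dma_sync_single_for_cpu(dev, dma_addr, len,
						   DMA_FROM_DEVICE);
			ulp_process_message(buf, wc.byte_len);
		}
	}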



Fencing a DMA before it is completed by the HW will cause IOMMU
errors.

Userspace is a different story, but that will remain as-is with
optional relaxed ordering.

Jason



Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

2021-04-09 Thread Tom Talpey

On 4/9/2021 10:45 AM, Chuck Lever III wrote:




On Apr 9, 2021, at 10:26 AM, Tom Talpey  wrote:

On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:

On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
  

We need to get a better idea what correctness testing has been done,
and whether positive correctness testing results can be replicated
on a variety of platforms.

RO has been rolling out slowly on mlx5 over a few years and storage
ULPs are the last to change. eg the mlx5 ethernet driver has had RO
turned on for a long time, userspace HPC applications have been using
it for a while now too.


I'd love to see RO be used more, it was always something the RDMA
specs supported and carefully architected for. My only concern is
that it's difficult to get right, especially when the platforms
have been running strictly-ordered for so long. The ULPs need
testing, and a lot of it.


We know there are platforms with broken RO implementations (like
Haswell) but the kernel is supposed to globally turn off RO on all
those cases. I'd be a bit surprised if we discover any more from this
series.
On the other hand there are platforms that get huge speed ups from
turning this on, AMD is one example, there are a bunch in the ARM
world too.


My belief is that the biggest risk is from situations where completions
are batched, and therefore polling is used to detect them without
interrupts (which would explicitly flush ordering). The RO pipeline will completely reorder
DMA writes, and consumers which infer ordering from memory contents may
break. This can even apply within the provider code, which may attempt
to poll WR and CQ structures, and be tripped up.


You are referring specifically to RPC/RDMA depending on Receive
completions to guarantee that previous RDMA Writes have been
retired? Or is there a particular implementation practice in
the Linux RPC/RDMA code that worries you?


Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
is hopefully unfounded, is that the RO pipeline might not have flushed
when a completion is posted *after* posting an interrupt.

Something like this...

RDMA Write arrives
PCIe RO Write for data
PCIe RO Write for data
...
RDMA Write arrives
PCIe RO Write for data
...
RDMA Send arrives
PCIe RO Write for receive data
PCIe RO Write for receive descriptor
PCIe interrupt (flushes RO pipeline for all three ops above)

RPC/RDMA polls CQ
Reaps receive completion

RDMA Send arrives
PCIe RO Write for receive data
PCIe RO write for receive descriptor
Does *not* interrupt, since CQ not armed

RPC/RDMA continues to poll CQ
Reaps receive completion
PCIe RO writes not yet flushed
Processes incomplete in-memory data
Bzzzt

Hopefully, the adapter performs a PCIe flushing read, or something
to avoid this when an interrupt is not generated. Alternatively, I'm
overly paranoid.

Tom.


The Mellanox adapter, itself, historically has strict in-order DMA
semantics, and while it's great to relax that, changing it by default
for all consumers is something to consider very cautiously.


Still, obviously people should test on the platforms they have.


Yes, and "test" be taken seriously with focus on ULP data integrity.
Speedups will mean nothing if the data is ever damaged.


I agree that data integrity comes first.

Since I currently don't have facilities to test RO in my lab, the
community will have to agree on a set of tests and expected results
that specifically exercise the corner cases you are concerned about.


--
Chuck Lever






Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

2021-04-09 Thread Tom Talpey

On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:

On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
  

We need to get a better idea what correctness testing has been done,
and whether positive correctness testing results can be replicated
on a variety of platforms.


RO has been rolling out slowly on mlx5 over a few years and storage
ULPs are the last to change. eg the mlx5 ethernet driver has had RO
turned on for a long time, userspace HPC applications have been using
it for a while now too.


I'd love to see RO be used more, it was always something the RDMA
specs supported and carefully architected for. My only concern is
that it's difficult to get right, especially when the platforms
have been running strictly-ordered for so long. The ULPs need
testing, and a lot of it.


We know there are platforms with broken RO implementations (like
Haswell) but the kernel is supposed to globally turn off RO on all
those cases. I'd be a bit surprised if we discover any more from this
series.

On the other hand there are platforms that get huge speed ups from
turning this on, AMD is one example, there are a bunch in the ARM
world too.


My belief is that the biggest risk is from situations where completions
are batched, and therefore polling is used to detect them without
interrupts (which would explicitly flush ordering). The RO pipeline will completely reorder
DMA writes, and consumers which infer ordering from memory contents may
break. This can even apply within the provider code, which may attempt
to poll WR and CQ structures, and be tripped up.

The Mellanox adapter, itself, historically has strict in-order DMA
semantics, and while it's great to relax that, changing it by default
for all consumers is something to consider very cautiously.


Still, obviously people should test on the platforms they have.


Yes, and "test" be taken seriously with focus on ULP data integrity.
Speedups will mean nothing if the data is ever damaged.

Tom.


Re: [PATCH rdma-next 02/10] RDMA/core: Enable Relaxed Ordering in __ib_alloc_pd()

2021-04-05 Thread Tom Talpey

On 4/5/2021 1:23 AM, Leon Romanovsky wrote:

From: Avihai Horon 

Enable Relaxed Ordering in __ib_alloc_pd() allocation of the
local_dma_lkey.

This will take effect only for devices that don't pre-allocate the lkey
but allocate it per PD allocation.

Signed-off-by: Avihai Horon 
Reviewed-by: Michael Guralnik 
Signed-off-by: Leon Romanovsky 
---
  drivers/infiniband/core/verbs.c  | 3 ++-
  drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c | 1 +
  2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index a1782f8a6ca0..9b719f7d6fd5 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -287,7 +287,8 @@ struct ib_pd *__ib_alloc_pd(struct ib_device *device, unsigned int flags,
if (device->attrs.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
pd->local_dma_lkey = device->local_dma_lkey;
else
-   mr_access_flags |= IB_ACCESS_LOCAL_WRITE;
+   mr_access_flags |=
+   IB_ACCESS_LOCAL_WRITE | IB_ACCESS_RELAXED_ORDERING;


So, do local_dma_lkey's get relaxed ordering unconditionally?


if (flags & IB_PD_UNSAFE_GLOBAL_RKEY) {
pr_warn("%s: enabling unsafe global rkey\n", caller);
diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
index b3fa783698a0..d74827694f92 100644
--- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
+++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
@@ -66,6 +66,7 @@ struct ib_mr *pvrdma_get_dma_mr(struct ib_pd *pd, int acc)
int ret;
  
  	/* Support only LOCAL_WRITE flag for DMA MRs */

+   acc &= ~IB_ACCESS_RELAXED_ORDERING;
if (acc & ~IB_ACCESS_LOCAL_WRITE) {
dev_warn(&dev->pdev->dev,
 "unsupported dma mr access flags %#x\n", acc);


Why does the pvrdma driver require relaxed ordering to be off?

Tom.


Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

2021-04-05 Thread Tom Talpey

On 4/5/2021 10:08 AM, Leon Romanovsky wrote:

On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:

On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:

From: Leon Romanovsky 

From Avihai,

Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
imposed on PCI transactions, and thus, can improve performance.

Until now, relaxed ordering could be set only by user space applications
for user MRs. The following patch series enables relaxed ordering for the
kernel ULPs as well. Relaxed ordering is an optional capability, and as
such, it is ignored by vendors that don't support it.

The following test results show the performance improvement achieved
with relaxed ordering. The test was performed on a NVIDIA A100 in order
to check performance of storage infrastructure over xprtrdma:


Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
What does that have to do with storage protocols?


This system is in use by our storage oriented customer who performed the
test. He runs the drivers/infiniband/* stack from upstream, simply backported
to a specific kernel version.

The performance boost is seen in other systems too.


We need to see more information about this test, and platform.

What correctness testing was done, and how was it verified? What
PCI bus type(s) were tested, and with what adapters? What storage
workload was generated, and were all possible RDMA exchanges by
each ULP exercised?


Also if you enable this for basically all kernel ULPs, why not have
an opt-out into strict ordering for the cases that need it (if there are
any).


The RO property is optional; it can only improve things. In addition, none of
the in-kernel ULPs need strict ordering. I may be mistaken here, and Jason will
correct me, but it is because of two things: the ULP doesn't touch data before
the CQE, and the DMA API prohibits it.


+1 on Christoph's comment.

I would hope most well-architected ULPs will support relaxed ordering,
but storage workloads, in my experience, can find ways to cause failure
in adapters. I would not suggest making this the default behavior
without extensive testing.

Tom.


Re: [Linux-cifsd-devel] [PATCH] cifsd: use kfree to free memory allocated by kzalloc

2021-04-01 Thread Tom Talpey

On 4/1/2021 9:36 AM, Namjae Jeon wrote:

2021-04-01 22:14 GMT+09:00, Ralph Boehme :

Am 4/1/21 um 2:59 PM schrieb Namjae Jeon:

2021-04-01 21:50 GMT+09:00, Ralph Boehme :

fwiw, while at it what about renaming everything that still references
"cifs" to "smb" ? This is not the 90's... :)

It is also used with the name "ksmbd", so functions and variables use the
ksmbd prefix.


well, I was thinking of this:

  > +++ b/fs/cifsd/...

We should really stop using the name cifs for modern implementation of
SMB{23} and the code should not be added as fs/cifsd/ to the kernel.

As far as I know, "cifs" is currently used for the subdirectory name
for historical reasons and to avoid confusion, even though the CIFS
(SMB1) dialect is no longer recommended.


I'm with Ralph. CIFS is history that we need to relegate to the past.

I also agree that wrappers around core memory allocators are to
be avoided.

Tom.


Re: [PATCH v3] cifs: Silently ignore unknown oplock break handle

2021-03-19 Thread Tom Talpey

LGTM feel free to add

Reviewed-By: Tom Talpey 

On 3/19/2021 9:57 AM, Vincent Whitchurch wrote:

Make SMB2 not print out an error when an oplock break is received for an
unknown handle, similar to SMB1.  The debug message which is printed for
these unknown handles may also be misleading, so fix that too.

The SMB2 lease break path is not affected by this patch.

Without this, a program which writes to a file from one thread, and
opens, reads, and writes the same file from another thread triggers the
below errors several times a minute when run against a Samba server
configured with "smb2 leases = no".

  CIFS: VFS: \\192.168.0.1 No task to wake, unknown frame received! NumMids 2
  00000000: 424d53fe 00000040 00000000 00000012  .SMB@...........
  00000010: 00000001 00000000 00000000 00000000  ................
  00000020: 00000000 00000000 00000000 00000000  ................
  00000030: 00000000 00000000 00000000 00000000  ................

Signed-off-by: Vincent Whitchurch 
---

Notes:
 v3:
 - Change debug print to Tom Talpey's suggestion
 
 v2:

 - Drop change to lease break
 - Rewrite commit message

  fs/cifs/smb2misc.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/cifs/smb2misc.c b/fs/cifs/smb2misc.c
index 60d4bd1eae2b..76cd05b8d53b 100644
--- a/fs/cifs/smb2misc.c
+++ b/fs/cifs/smb2misc.c
@@ -754,8 +754,8 @@ smb2_is_valid_oplock_break(char *buffer, struct TCP_Server_Info *server)
 		}
 	}
 	spin_unlock(&cifs_tcp_ses_lock);
-	cifs_dbg(FYI, "Can not process oplock break for non-existent connection\n");
-	return false;
+	cifs_dbg(FYI, "No file id matched, oplock break ignored\n");
+	return true;
 }
  
  void




Re: [PATCH v2] cifs: Silently ignore unknown oplock break handle

2021-03-16 Thread Tom Talpey

On 3/16/2021 1:36 PM, Rohith Surabattula wrote:

This issue will not be seen once changes related to deferred close for
files is committed.


That may be, but it's irrelevant to this.


Currently, changes are in review. I will address review comments by this week.


What do you mean by "in review"? Both threads are active on the
mailing list. If you or others have something to discuss, please
post it and don't leave us out of the discussion.

Tom.



Regards,
Rohith

On Tue, Mar 16, 2021 at 9:33 PM Tom Talpey  wrote:


On 3/16/2021 8:48 AM, Vincent Whitchurch via samba-technical wrote:

Make SMB2 not print out an error when an oplock break is received for an
unknown handle, similar to SMB1.  The SMB2 lease break path is not
affected by this patch.

Without this, a program which writes to a file from one thread, and
opens, reads, and writes the same file from another thread triggers the
below errors several times a minute when run against a Samba server
configured with "smb2 leases = no".

   CIFS: VFS: \\192.168.0.1 No task to wake, unknown frame received! NumMids 2
   00000000: 424d53fe 00000040 00000000 00000012  .SMB@...........
   00000010: 00000001 00000000 00000000 00000000  ................
   00000020: 00000000 00000000 00000000 00000000  ................
   00000030: 00000000 00000000 00000000 00000000  ................

Signed-off-by: Vincent Whitchurch 
---

Notes:
  v2:
  - Drop change to lease break
  - Rewrite commit message

   fs/cifs/smb2misc.c | 2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/cifs/smb2misc.c b/fs/cifs/smb2misc.c
index 60d4bd1eae2b..4d8576e202e3 100644
--- a/fs/cifs/smb2misc.c
+++ b/fs/cifs/smb2misc.c
@@ -755,7 +755,7 @@ smb2_is_valid_oplock_break(char *buffer, struct TCP_Server_Info *server)
 	}
 	spin_unlock(&cifs_tcp_ses_lock);
 	cifs_dbg(FYI, "Can not process oplock break for non-existent connection\n");
-	return false;
+	return true;
 }

   void



As an oplock-only approach, it looks good. But the old cifs_dbg message
"non-existent connection" is possibly misleading, since the connection
may be perfectly fine.

When breaking the loop successfully, the code emits
 cifs_dbg(FYI, "file id match, oplock break\n");
so perhaps
 cifs_dbg(FYI, "No file id matched, oplock break ignored\n");
?

Tom.




Re: [PATCH v2] cifs: Silently ignore unknown oplock break handle

2021-03-16 Thread Tom Talpey

On 3/16/2021 8:48 AM, Vincent Whitchurch via samba-technical wrote:

Make SMB2 not print out an error when an oplock break is received for an
unknown handle, similar to SMB1.  The SMB2 lease break path is not
affected by this patch.

Without this, a program which writes to a file from one thread, and
opens, reads, and writes the same file from another thread triggers the
below errors several times a minute when run against a Samba server
configured with "smb2 leases = no".

  CIFS: VFS: \\192.168.0.1 No task to wake, unknown frame received! NumMids 2
  00000000: 424d53fe 00000040 00000000 00000012  .SMB@...........
  00000010: 00000001 00000000 00000000 00000000  ................
  00000020: 00000000 00000000 00000000 00000000  ................
  00000030: 00000000 00000000 00000000 00000000  ................

Signed-off-by: Vincent Whitchurch 
---

Notes:
 v2:
 - Drop change to lease break
 - Rewrite commit message

  fs/cifs/smb2misc.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/cifs/smb2misc.c b/fs/cifs/smb2misc.c
index 60d4bd1eae2b..4d8576e202e3 100644
--- a/fs/cifs/smb2misc.c
+++ b/fs/cifs/smb2misc.c
@@ -755,7 +755,7 @@ smb2_is_valid_oplock_break(char *buffer, struct TCP_Server_Info *server)
 	}
 	spin_unlock(&cifs_tcp_ses_lock);
 	cifs_dbg(FYI, "Can not process oplock break for non-existent connection\n");
-	return false;
+	return true;
 }
  
  void




As an oplock-only approach, it looks good. But the old cifs_dbg message
"non-existent connection" is possibly misleading, since the connection
may be perfectly fine.

When breaking the loop successfully, the code emits
cifs_dbg(FYI, "file id match, oplock break\n");
so perhaps
cifs_dbg(FYI, "No file id matched, oplock break ignored\n");
?

Tom.


Re: [PATCH] CIFS: Prevent error log on spurious oplock break

2021-03-12 Thread Tom Talpey

On 3/12/2021 6:49 AM, Vincent Whitchurch wrote:

On Tue, Mar 09, 2021 at 04:29:14PM +0100, Steve French wrote:

On Tue, Mar 9, 2021, 07:42 Vincent Whitchurch via samba-technical wrote:

Thank you for the suggestions.  In my case, I've only received some
reports of this error being emitted very rarely (couple of times a month
in our stability tests).  Right now it looks like the problem may only
be with a particular NAS, and we're looking into triggering oplock
breaks more often and catching the problem with some more logging.


I lean toward reducing or skipping the logging of the 'normal' (or at
least possible) race between close and oplock break.

I see this eg spamming the log running xfstest 524

Can you repro it as well running that?


I haven't run xfstests, but we figured out how to easily trigger the
error in a normal use case in our application.  I can now easily get the
errors to spam the logs with a small program which writes to a file from
one thread in a loop and opens, reads, and closes the same file in
another thread in a loop.  This is against a Samba server configured
with "smb2 leases = no".

Logs show that the oplock break FileId is not found because of the race
between close and oplock break which you mentioned, and in some cases
because of another race between open and oplock break (the open was not
completed since it was waiting on the response to GetInfo).

If this is unavoidable, I think it really would be nice to at least
reduce the severity since it's scary-looking and so easy to trigger.

How about something like the below?  It prints an info message once, for
the first unhandled oplock break.


No, it's incorrect to state this:

pr_info_once("Received oplock break for unknown file\n");

Oplocks are properties of handles, not files. If the handle is gone,
there's no processing, therefore silence is totally appropriate.

But beyond that, pr_info_once() would seem to be a bad way to signal it,
because the condition could happen many times, from many servers, on
many handles. What's so special about the first one?

Other issues (bad packet, etc) in break processing are worth logging
however. Remember though they will generally point to server issues,
not client, so they should be logged appropriately.
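
If anything is to be printed at all, a quieter form would be either the
FYI-level debug message or a ratelimited print, for example (sketch only,
not proposed wording):

	cifs_dbg(FYI, "No file id matched, oplock break ignored\n");
	/* or, if an admin-visible hint is really wanted: */
	pr_info_ratelimited("oplock break for unknown handle ignored\n");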


(I'm not sure if the lease key path should be handled differently. If
  the concerns about removing the message were primarily for that path,
  perhaps my original patch but with the change to
  smb2_is_valid_lease_break() dropped could be acceptable?)


Leases are very different from oplocks so the answer is definitely yes.
Leases really are about files, and have additional ownership semantics
which are not merely per-client.

Tom.


diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index 3de3c5908a72..849c3721f8a2 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -256,7 +256,7 @@ struct smb_version_operations {
void (*dump_share_caps)(struct seq_file *, struct cifs_tcon *);
/* verify the message */
int (*check_message)(char *, unsigned int, struct TCP_Server_Info *);
-   bool (*is_oplock_break)(char *, struct TCP_Server_Info *);
+   int (*is_oplock_break)(char *, struct TCP_Server_Info *);
int (*handle_cancelled_mid)(char *, struct TCP_Server_Info *);
void (*downgrade_oplock)(struct TCP_Server_Info *server,
 struct cifsInodeInfo *cinode, __u32 oplock,
diff --git a/fs/cifs/cifsproto.h b/fs/cifs/cifsproto.h
index 75ce6f742b8d..2714b6cdf70a 100644
--- a/fs/cifs/cifsproto.h
+++ b/fs/cifs/cifsproto.h
@@ -135,7 +135,7 @@ extern int SendReceiveBlockingLock(const unsigned int xid,
int *bytes_returned);
  extern int cifs_reconnect(struct TCP_Server_Info *server);
  extern int checkSMB(char *buf, unsigned int len, struct TCP_Server_Info 
*srvr);
-extern bool is_valid_oplock_break(char *, struct TCP_Server_Info *);
+extern int is_valid_oplock_break(char *, struct TCP_Server_Info *);
  extern bool backup_cred(struct cifs_sb_info *);
  extern bool is_size_safe_to_change(struct cifsInodeInfo *, __u64 eof);
  extern void cifs_update_eof(struct cifsInodeInfo *cifsi, loff_t offset,
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 112692300fb6..5dc58f0c99b0 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -1009,6 +1009,8 @@ cifs_demultiplex_thread(void *p)
server->lstrp = jiffies;
  
  		for (i = 0; i < num_mids; i++) {

+   int oplockret = -EINVAL;
+
if (mids[i] != NULL) {
mids[i]->resp_buf_size = server->pdu_size;
  
@@ -1020,17 +1022,24 @@ cifs_demultiplex_thread(void *p)

mids[i]->callback(mids[i]);
  
  cifs_mid_q_entry_release(mids[i]);

-   } else if (server->ops->is_oplock_break &&
-  server->ops->is_oplock_break(bufs[i],
-  

Re: [PATCH] RDMA/hw/hfi1/tid_rdma: remove unnecessary conversion to bool

2021-02-25 Thread Tom Talpey

On 2/25/2021 4:26 AM, Jiapeng Chong wrote:

Fix the following coccicheck warnings:

./drivers/infiniband/hw/hfi1/tid_rdma.c:1118:36-41: WARNING: conversion
to bool not needed here.

Reported-by: Abaci Robot 
Signed-off-by: Jiapeng Chong 
---
  drivers/infiniband/hw/hfi1/tid_rdma.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/hfi1/tid_rdma.c b/drivers/infiniband/hw/hfi1/tid_rdma.c
index 0b1f9e4..8958ea3 100644
--- a/drivers/infiniband/hw/hfi1/tid_rdma.c
+++ b/drivers/infiniband/hw/hfi1/tid_rdma.c
@@ -1115,7 +1115,7 @@ static u32 kern_find_pages(struct tid_rdma_flow *flow,
}
  
  	flow->length = flow->req->seg_len - length;

-   *last = req->isge == ss->num_sge ? false : true;
+   *last = req->isge == !ss->num_sge;


Are you sure this is what you want? The new code seems to compare
an index to a bool (refactoring it, this reads as)

*last = req->isge == (ss->num_sge != 0);

Don't you actually want

*last = req->isge != ss->num_sge;

??

Even then, it seems really hard to read.


return i;
  }
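
For comparison, a small self-contained illustration of why the two forms
differ (names made up; this is not the driver code):

	#include <stdbool.h>
	#include <stdio.h>

	static bool last_orig(unsigned int isge, unsigned int num_sge)
	{
		return isge == num_sge ? false : true;	/* i.e. isge != num_sge */
	}

	static bool last_patch(unsigned int isge, unsigned int num_sge)
	{
		return isge == !num_sge;	/* compares an index against 0 or 1 */
	}

	int main(void)
	{
		/* isge=3, num_sge=5: original yields true, patched yields false */
		printf("%d %d\n", last_orig(3, 5), last_patch(3, 5));
		return 0;
	}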
  



Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions

2019-03-19 Thread Tom Talpey

On 3/19/2019 3:45 PM, Jerome Glisse wrote:

On Tue, Mar 19, 2019 at 03:43:44PM -0500, Tom Talpey wrote:

On 3/19/2019 4:03 AM, Ira Weiny wrote:

On Tue, Mar 19, 2019 at 04:36:44PM +0100, Jan Kara wrote:

On Tue 19-03-19 17:29:18, Kirill A. Shutemov wrote:

On Tue, Mar 19, 2019 at 10:14:16AM -0400, Jerome Glisse wrote:

On Tue, Mar 19, 2019 at 09:47:24AM -0400, Jerome Glisse wrote:

On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:

On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubb...@gmail.com wrote:

From: John Hubbard 


[...]


diff --git a/mm/gup.c b/mm/gup.c
index f84e22685aaa..37085b8163b1 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -28,6 +28,88 @@ struct follow_page_context {
unsigned int page_mask;
   };
+typedef int (*set_dirty_func_t)(struct page *page);
+
+static void __put_user_pages_dirty(struct page **pages,
+  unsigned long npages,
+  set_dirty_func_t sdf)
+{
+   unsigned long index;
+
+   for (index = 0; index < npages; index++) {
+   struct page *page = compound_head(pages[index]);
+
+   if (!PageDirty(page))
+   sdf(page);


How is this safe? What prevents the page from being cleared under you?

If it's safe to race clear_page_dirty*() it has to be stated explicitly
with a reason why. It's not very clear to me as it is.


The PageDirty() optimization above is fine to race with clear the
page flag as it means it is racing after a page_mkclean() and the
GUP user is done with the page so page is about to be write back
ie if (!PageDirty(page)) see the page as dirty and skip the sdf()
call while a split second after TestClearPageDirty() happens then
it means the racing clear is about to write back the page so all
is fine (the page was dirty and it is being clear for write back).

If it does call the sdf() while racing with write back then we
just redirtied the page just like clear_page_dirty_for_io() would
do if page_mkclean() failed so nothing harmful will come of that
neither. Page stays dirty despite write back it just means that
the page might be write back twice in a row.


Forgot to mention one thing, we had a discussion with Andrea and Jan
about set_page_dirty() and Andrea had the good idea of maybe doing
the set_page_dirty() at GUP time (when GUP with write) not when the
GUP user calls put_page(). We can do that by setting the dirty bit
in the pte for instance. They are few bonus of doing things that way:
  - amortize the cost of calling set_page_dirty() (ie one call for
GUP and page_mkclean()
  - it is always safe to do so at GUP time (ie the pte has write
permission and thus the page is in correct state)
  - safe from truncate race
  - no need to ever lock the page

Extra bonus from my point of view, it simplify thing for my generic
page protection patchset (KSM for file back page).

So maybe we should explore that ? It would also be a lot less code.


Yes, please. It sounds more sensible to me to dirty the page on get, not
on put.


I fully agree this is a desirable final state of affairs.


I'm glad to see this presented because it has crossed my mind more than once
that effectively a GUP pinned page should be considered "dirty" at all times
until the pin is removed.  This is especially true in the RDMA case.


But, what if the RDMA registration is readonly? That's not uncommon, and
marking dirty unconditionally would add needless overhead to such pages.


Yes and this is only when FOLL_WRITE is set ie when you are doing GUP and
asking for write. Doing GUP and asking for read is always safe.


Aha, ok great.

I guess it does introduce something for callers to be aware of, if
they GUP very large regions. I suppose if they're sufficiently aware
of the situation, e.g. pnfs LAYOUT_COMMIT notifications, they could
walk lists and reset page_dirty for untouched pages before releasing.
That's their issue though, and agreed it's safest for the GUP layer
to mark.
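
For context, a minimal sketch of that FOLL_WRITE distinction (hypothetical
driver helper, using the get_user_pages() signature of this era):

	static long pin_for_dma(unsigned long start, unsigned long nr_pages,
				bool writable, struct page **pages)
	{
		unsigned int gup_flags = writable ? FOLL_WRITE : 0;

		/* Request write (and hence dirty accounting) only if needed. */
		return get_user_pages(start, nr_pages, gup_flags, pages, NULL);
	}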

Tom.


Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions

2019-03-19 Thread Tom Talpey

On 3/19/2019 4:03 AM, Ira Weiny wrote:

On Tue, Mar 19, 2019 at 04:36:44PM +0100, Jan Kara wrote:

On Tue 19-03-19 17:29:18, Kirill A. Shutemov wrote:

On Tue, Mar 19, 2019 at 10:14:16AM -0400, Jerome Glisse wrote:

On Tue, Mar 19, 2019 at 09:47:24AM -0400, Jerome Glisse wrote:

On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:

On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubb...@gmail.com wrote:

From: John Hubbard 


[...]


diff --git a/mm/gup.c b/mm/gup.c
index f84e22685aaa..37085b8163b1 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -28,6 +28,88 @@ struct follow_page_context {
unsigned int page_mask;
  };
  
+typedef int (*set_dirty_func_t)(struct page *page);

+
+static void __put_user_pages_dirty(struct page **pages,
+  unsigned long npages,
+  set_dirty_func_t sdf)
+{
+   unsigned long index;
+
+   for (index = 0; index < npages; index++) {
+   struct page *page = compound_head(pages[index]);
+
+   if (!PageDirty(page))
+   sdf(page);


How is this safe? What prevents the page from being cleared under you?

If it's safe to race clear_page_dirty*() it has to be stated explicitly
with a reason why. It's not very clear to me as it is.


The PageDirty() optimization above is fine to race with clear the
page flag as it means it is racing after a page_mkclean() and the
GUP user is done with the page so page is about to be write back
ie if (!PageDirty(page)) see the page as dirty and skip the sdf()
call while a split second after TestClearPageDirty() happens then
it means the racing clear is about to write back the page so all
is fine (the page was dirty and it is being clear for write back).

If it does call the sdf() while racing with write back then we
just redirtied the page just like clear_page_dirty_for_io() would
do if page_mkclean() failed so nothing harmful will come of that
neither. Page stays dirty despite write back it just means that
the page might be write back twice in a row.


Forgot to mention one thing, we had a discussion with Andrea and Jan
about set_page_dirty() and Andrea had the good idea of maybe doing
the set_page_dirty() at GUP time (when GUP with write) not when the
GUP user calls put_page(). We can do that by setting the dirty bit
in the pte for instance. They are few bonus of doing things that way:
 - amortize the cost of calling set_page_dirty() (ie one call for
   GUP and page_mkclean()
 - it is always safe to do so at GUP time (ie the pte has write
   permission and thus the page is in correct state)
 - safe from truncate race
 - no need to ever lock the page

Extra bonus from my point of view, it simplify thing for my generic
page protection patchset (KSM for file back page).

So maybe we should explore that ? It would also be a lot less code.


Yes, please. It sounds more sensible to me to dirty the page on get, not
on put.


I fully agree this is a desirable final state of affairs.


I'm glad to see this presented because it has crossed my mind more than once
that effectively a GUP pinned page should be considered "dirty" at all times
until the pin is removed.  This is especially true in the RDMA case.


But, what if the RDMA registration is readonly? That's not uncommon, and
marking dirty unconditionally would add needless overhead to such pages.

Tom.


Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA

2019-02-07 Thread Tom Talpey

On 2/7/2019 11:57 AM, Ira Weiny wrote:

On Thu, Feb 07, 2019 at 10:28:05AM -0500, Tom Talpey wrote:

On 2/7/2019 10:04 AM, Chuck Lever wrote:




On Feb 7, 2019, at 12:23 AM, Jason Gunthorpe  wrote:

On Thu, Feb 07, 2019 at 02:52:58PM +1100, Dave Chinner wrote:


Requiring ODP capable hardware and applications that control RDMA
access to use file leases and be able to cancel/recall client side
delegations (like NFS is already able to do!) seems like a pretty


So, what happens on NFS if the revoke takes too long?


NFS distinguishes between "recall" and "revoke". Dave used "recall"
here, it means that the server recalls the client's delegation. If
the client doesn't respond, the server revokes the delegation
unilaterally and other users are allowed to proceed.


The SMB3 protocol has a similar "lease break" mechanism, btw.

SMB3 "push mode" has long-expected to allow DAX mapping of files
only when an exclusive lease is held by the requesting client.
The server may recall the lease if the DAX mapping needs to change.

Once local (MMU) and remote (RDMA) mappings are dropped, the
client may re-request that the server reestablish them. No
connection or process is terminated, and no data is silently lost.


How long does one wait for these remote mappings to be dropped?


The recall process depends on several things, but it certainly takes a
network round trip.

If recall fails, the file protocols allow the server to revoke. However,
since this results in loss of data, it's a last resort.

Tom.


Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA

2019-02-07 Thread Tom Talpey

On 2/7/2019 10:37 AM, Doug Ledford wrote:

On Thu, 2019-02-07 at 10:28 -0500, Tom Talpey wrote:

On 2/7/2019 10:04 AM, Chuck Lever wrote:



On Feb 7, 2019, at 12:23 AM, Jason Gunthorpe  wrote:

On Thu, Feb 07, 2019 at 02:52:58PM +1100, Dave Chinner wrote:


Requiring ODP capable hardware and applications that control RDMA
access to use file leases and be able to cancel/recall client side
delegations (like NFS is already able to do!) seems like a pretty


So, what happens on NFS if the revoke takes too long?


NFS distinguishes between "recall" and "revoke". Dave used "recall"
here, it means that the server recalls the client's delegation. If
the client doesn't respond, the server revokes the delegation
unilaterally and other users are allowed to proceed.


The SMB3 protocol has a similar "lease break" mechanism, btw.

SMB3 "push mode" has long-expected to allow DAX mapping of files
only when an exclusive lease is held by the requesting client.
The server may recall the lease if the DAX mapping needs to change.

Once local (MMU) and remote (RDMA) mappings are dropped, the
client may re-request that the server reestablish them. No
connection or process is terminated, and no data is silently lost.


Yeah, but you're referring to a situation where the communication agent
and the filesystem agent are one and the same and they work
cooperatively to resolve the issue.  With DAX under Linux, the
filesystem agent and the communication agent are separate, and right
now, to my knowledge, the filesystem agent doesn't tell the
communication agent about a broken lease, it want's to be able to do
things 100% transparently without any work on the communication agent's
part.  That works for ODP, but not for anything else.  If the filesystem
notified the communication agent of the need to drop the MMU region and
rebuild it, the communication agent could communicate that to the remote
host, and things would work.  But there's no POSIX message for "your
file is moving on media, redo your mmap".


Indeed, the MMU notifier and the filesystem need to be integrated.

I'm unmoved by the POSIX argument. This stuff didn't happen in 1990.

Tom.


Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA

2019-02-07 Thread Tom Talpey

On 2/7/2019 10:04 AM, Chuck Lever wrote:




On Feb 7, 2019, at 12:23 AM, Jason Gunthorpe  wrote:

On Thu, Feb 07, 2019 at 02:52:58PM +1100, Dave Chinner wrote:


Requiring ODP capable hardware and applications that control RDMA
access to use file leases and be able to cancel/recall client side
delegations (like NFS is already able to do!) seems like a pretty


So, what happens on NFS if the revoke takes too long?


NFS distinguishes between "recall" and "revoke". Dave used "recall"
here, it means that the server recalls the client's delegation. If
the client doesn't respond, the server revokes the delegation
unilaterally and other users are allowed to proceed.


The SMB3 protocol has a similar "lease break" mechanism, btw.

SMB3 "push mode" has long-expected to allow DAX mapping of files
only when an exclusive lease is held by the requesting client.
The server may recall the lease if the DAX mapping needs to change.

Once local (MMU) and remote (RDMA) mappings are dropped, the
client may re-request that the server reestablish them. No
connection or process is terminated, and no data is silently lost.

Tom.


Re: [PATCH 0/6] RFC v2: mm: gup/dma tracking

2019-02-05 Thread Tom Talpey

On 2/5/2019 3:22 AM, John Hubbard wrote:

On 2/4/19 5:41 PM, Tom Talpey wrote:

On 2/4/2019 12:21 AM, john.hubb...@gmail.com wrote:

From: John Hubbard 


Performance: here is an fio run on an NVMe drive, using this for the fio
configuration file:

 [reader]
 direct=1
 ioengine=libaio
 blocksize=4096
 size=1g
 numjobs=1
 rw=read
 iodepth=64

reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
4096B-4096B, ioengine=libaio, iodepth=64

fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=7011: Sun Feb  3 20:36:51 2019
    read: IOPS=190k, BW=741MiB/s (778MB/s)(1024MiB/1381msec)
 slat (nsec): min=2716, max=57255, avg=4048.14, stdev=1084.10
 clat (usec): min=20, max=12485, avg=332.63, stdev=191.77
  lat (usec): min=22, max=12498, avg=336.72, stdev=192.07
 clat percentiles (usec):
     |  1.00th=[  322],  5.00th=[  322], 10.00th=[  322], 20.00th=[  326],
     | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
     | 70.00th=[  326], 80.00th=[  330], 90.00th=[  330], 95.00th=[  330],
     | 99.00th=[  478], 99.50th=[  717], 99.90th=[ 1074], 99.95th=[ 1090],
     | 99.99th=[12256]


These latencies are concerning. The best results we saw at the end of
November (previous approach) were MUCH flatter. These really start
spiking at three 9's, and are sky-high at four 9's. The "stdev" values
for clat and lat are about 10 times the previous. There's some kind
of serious queuing contention here, that wasn't there in November.


Hi Tom,

I think this latency problem is also there in the baseline kernel, but...



    bw (  KiB/s): min=730152, max=776512, per=99.22%, avg=753332.00, stdev=32781.47, samples=2
    iops        : min=182538, max=194128, avg=188333.00, stdev=8195.37, samples=2

   lat (usec)   : 50=0.01%, 100=0.01%, 250=0.07%, 500=99.26%, 750=0.38%
   lat (usec)   : 1000=0.02%
   lat (msec)   : 2=0.24%, 20=0.02%
   cpu  : usr=15.07%, sys=84.13%, ctx=10, majf=0, minf=74


System CPU 84% is roughly double the November results of 45%. Ouch.


That's my fault. First of all, I had a few extra, supposedly minor debug
settings in the .config, which I'm removing now--I'm doing a proper run
with the original .config file from November, below. Second, I'm not
sure I controlled the run carefully enough.



Did you re-run the baseline on the new unpatched base kernel and can
we see the before/after?


Doing that now, I see:

-- No significant perf difference between before and after, but
-- Still high clat in the 99.99th

===
Before: using commit 8834f5600cf3 ("Linux 5.0-rc5")
===
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
4096B-4096B, ioengine=libaio, iodepth=64

fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1829: Tue Feb  5 00:08:08 2019
    read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1359msec)
     slat (nsec): min=1269, max=40309, avg=1493.66, stdev=534.83
     clat (usec): min=127, max=12249, avg=329.83, stdev=184.92
  lat (usec): min=129, max=12256, avg=331.35, stdev=185.06
     clat percentiles (usec):
  |  1.00th=[  326],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],
  | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
  | 70.00th=[  326], 80.00th=[  326], 90.00th=[  326], 95.00th=[  326],
  | 99.00th=[  347], 99.50th=[  519], 99.90th=[  529], 99.95th=[  537],
  | 99.99th=[12125]
    bw (  KiB/s): min=755032, max=781472, per=99.57%, avg=768252.00, stdev=18695.90, samples=2
    iops        : min=188758, max=195368, avg=192063.00, stdev=4673.98, samples=2

   lat (usec)   : 250=0.08%, 500=99.18%, 750=0.72%
   lat (msec)   : 20=0.02%
   cpu  : usr=12.30%, sys=46.83%, ctx=253554, majf=0, minf=74
   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, 
 >=64=100.0%
  submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, 
 >=64=0.0%

  issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
  latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
    READ: bw=753MiB/s (790MB/s), 753MiB/s-753MiB/s (790MB/s-790MB/s), 
io=1024MiB (1074MB), run=1359-1359msec


Disk stats (read/write):
   nvme0n1: ios=221246/0, merge=0/0, ticks=71556/0, in_queue=704, 
util=91.35%


===
After:
===
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
4096B-4096B, ioengine=libaio, iodepth=64

fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1803: Mon Feb  4 23:58:07 2019
    read: IOPS=193k, BW=753MiB/s (790MB/s)(1024

Re: [PATCH 0/6] RFC v2: mm: gup/dma tracking

2019-02-04 Thread Tom Talpey

On 2/4/2019 12:21 AM, john.hubb...@gmail.com wrote:

From: John Hubbard 


Performance: here is an fio run on an NVMe drive, using this for the fio
configuration file:

 [reader]
 direct=1
 ioengine=libaio
 blocksize=4096
 size=1g
 numjobs=1
 rw=read
 iodepth=64

reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=7011: Sun Feb  3 20:36:51 2019
read: IOPS=190k, BW=741MiB/s (778MB/s)(1024MiB/1381msec)
 slat (nsec): min=2716, max=57255, avg=4048.14, stdev=1084.10
 clat (usec): min=20, max=12485, avg=332.63, stdev=191.77
  lat (usec): min=22, max=12498, avg=336.72, stdev=192.07
 clat percentiles (usec):
  |  1.00th=[  322],  5.00th=[  322], 10.00th=[  322], 20.00th=[  326],
  | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
  | 70.00th=[  326], 80.00th=[  330], 90.00th=[  330], 95.00th=[  330],
  | 99.00th=[  478], 99.50th=[  717], 99.90th=[ 1074], 99.95th=[ 1090],
  | 99.99th=[12256]


These latencies are concerning. The best results we saw at the end of
November (previous approach) were MUCH flatter. These really start
spiking at three 9's, and are sky-high at four 9's. The "stdev" values
for clat and lat are about 10 times the previous. There's some kind
of serious queuing contention here, that wasn't there in November.


bw (  KiB/s): min=730152, max=776512, per=99.22%, avg=753332.00, stdev=32781.47, samples=2
iops        : min=182538, max=194128, avg=188333.00, stdev=8195.37, samples=2
   lat (usec)   : 50=0.01%, 100=0.01%, 250=0.07%, 500=99.26%, 750=0.38%
   lat (usec)   : 1000=0.02%
   lat (msec)   : 2=0.24%, 20=0.02%
   cpu  : usr=15.07%, sys=84.13%, ctx=10, majf=0, minf=74


System CPU 84% is roughly double the November results of 45%. Ouch.

Did you re-run the baseline on the new unpatched base kernel and can
we see the before/after?

Tom.


   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
  issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
  latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
READ: bw=741MiB/s (778MB/s), 741MiB/s-741MiB/s (778MB/s-778MB/s), 
io=1024MiB (1074MB), run=1381-1381msec

Disk stats (read/write):
   nvme0n1: ios=216966/0, merge=0/0, ticks=6112/0, in_queue=704, util=91.34%


Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

2018-12-13 Thread Tom Talpey

On 12/13/2018 10:18 AM, Jerome Glisse wrote:

On Thu, Dec 13, 2018 at 09:51:18AM -0500, Tom Talpey wrote:

On 12/13/2018 9:18 AM, Jerome Glisse wrote:

On Thu, Dec 13, 2018 at 08:40:49AM -0500, Tom Talpey wrote:

On 12/13/2018 7:43 AM, Jerome Glisse wrote:

On Wed, Dec 12, 2018 at 08:20:43PM -0700, Jason Gunthorpe wrote:

On Wed, Dec 12, 2018 at 07:01:09PM -0500, Jerome Glisse wrote:

On Wed, Dec 12, 2018 at 04:37:03PM -0700, Jason Gunthorpe wrote:

On Wed, Dec 12, 2018 at 04:53:49PM -0500, Jerome Glisse wrote:

Almost, we need some safety around assuming that DMA is complete the
page, so the notification would need to go all to way to userspace
with something like a file lease notification. It would also need to
be backstopped by an IOMMU in the case where the hardware does not /
can not stop in-flight DMA.


You can always reprogram the hardware right away it will redirect
any dma to the crappy page.


That causes silent data corruption for RDMA users - we can't do that.

The only way out for current hardware is to forcibly terminate the
RDMA activity somehow (and I'm not even sure this is possible, at
least it would be driver specific)

Even the IOMMU idea probably doesn't work, I doubt all current
hardware can handle a PCI-E error TLP properly.


What i saying is reprogram hardware to crappy page ie valid page
dma map but that just has random content as a last resort to allow
filesystem to reuse the block. So there should be no PCIe error unless
hardware freak out to see its page table reprogram randomly.


No, that isn't an option. You can't silently provide corrupted data
for RDMA to transfer out onto the network, or silently discard data
coming in!!

Think of the consequences of that - I have a fileserver process and
someone does ftruncate and now my clients receive corrupted data??


This is what happens _today_ ie today someone do GUP on page file
and then someone else do truncate the first GUP is effectively
streaming _random_ data to network as the page does not correspond
to anything anymore and once the RDMA MR goes aways and release
the page the page content will be lost. So i am not changing anything
here, what i proposed was to make it explicit to device driver at
least that they were streaming random data. Right now this is all
silent but this is what is happening wether you like it or not :)

Note that  i am saying do that only for truncate to allow to be
nice to fs. But again i am fine with whatever solution but you can
not please everyone here. Either block truncate and fs folks will
hate you or make it clear to device driver that you are streaming
random things and RDMA people hates you.



The only option is to prevent the RDMA transfer from ever happening,
and we just don't have hardware support (beyond destroy everything) to
do that.


The question is who do you want to punish ? RDMA user that pin stuff
and expect thing to work forever without worrying for other fs
activities ? Or filesystem to pin block forever :)


I don't want to punish everyone, I want both sides to have complete
data integrity as the USER has deliberately decided to combine DAX and
RDMA. So either stop it at the front end (ie get_user_pages_longterm)
or make it work in a way that guarantees integrity for both.


   S2: notify userspace program through device/sub-system
   specific API and delay ftruncate. After a while if there
   is no answer just be mean and force hardware to use
   crappy page as anyway this is what happens today


I don't think this happens today (outside of DAX).. Does it?


It does it is just silent, i don't remember anything in the code
that would stop a truncate to happen because of elevated refcount.
This does not happen with ODP mlx5 as it does abide by _all_ mmu
notifier. This is for anything that does ODP without support for
mmu notifier.


Wait - is it expected that the MMU notifier upcall is handled
synchronously? That is, the page DMA mapping must be torn down
immediately, and before returning?


Yes you must torn down mapping before returning from mmu notifier
call back. Any time after is too late. You obviously need hardware
that can support that. In the infiniband sub-system AFAIK only the
mlx5 hardware can do that. In the GPU sub-system everyone is fine.
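
To make that contract concrete, the shape being described is roughly the
following. This is a schematic sketch only, not the mlx5/ODP code; the
callback signature shown is roughly the 4.20-era mmu_notifier one, and the
mydev_* names are stand-ins for whatever driver-specific teardown the
hardware supports:

#include <linux/mmu_notifier.h>

struct mydev_mr {                       /* hypothetical driver object */
        struct mmu_notifier mn;
        /* ... hardware mapping state ... */
};

/* All DMA to [start, end) must be stopped, drained and unmapped
 * before this callback returns; anything later is too late.
 */
static int mydev_invalidate_range_start(struct mmu_notifier *mn,
                                        struct mm_struct *mm,
                                        unsigned long start,
                                        unsigned long end,
                                        bool blockable)
{
        struct mydev_mr *mr = container_of(mn, struct mydev_mr, mn);

        if (!blockable)
                return -EAGAIN;  /* cannot sleep here; caller retries */

        mydev_stop_and_drain_dma(mr, start, end);   /* hypothetical */
        mydev_unmap_pages(mr, start, end);          /* hypothetical */

        return 0;
}

static const struct mmu_notifier_ops mydev_mn_ops = {
        .invalidate_range_start = mydev_invalidate_range_start,
};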


I'm skeptical that MLX5 can actually make this guarantee. But we
can take that offline in linux-rdma.


It does unless the code lies about what the hardware do :) See umem_odp.c
in core and odp.c in mlx5 directories.


Ok, I did look and there are numerous error returns from these calls.
Some are related to resource shortages (including the rather ominous-
sounding "emergency_pages" in odp.c), others related to the generic
RDMA behaviors such as posting work requests and reaping their
completion status.

So I'd ask - what is the backup plan from the mmu notifier if the
unmap fails? Which it certainly will, in many real-world situations.

Tom.


I'm also skeptical that NVMe can do this.

Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

2018-12-13 Thread Tom Talpey

On 12/13/2018 9:18 AM, Jerome Glisse wrote:

On Thu, Dec 13, 2018 at 08:40:49AM -0500, Tom Talpey wrote:

On 12/13/2018 7:43 AM, Jerome Glisse wrote:

On Wed, Dec 12, 2018 at 08:20:43PM -0700, Jason Gunthorpe wrote:

On Wed, Dec 12, 2018 at 07:01:09PM -0500, Jerome Glisse wrote:

On Wed, Dec 12, 2018 at 04:37:03PM -0700, Jason Gunthorpe wrote:

On Wed, Dec 12, 2018 at 04:53:49PM -0500, Jerome Glisse wrote:

Almost, we need some safety around assuming that DMA is complete to the
page, so the notification would need to go all the way to userspace
with something like a file lease notification. It would also need to
be backstopped by an IOMMU in the case where the hardware does not /
can not stop in-flight DMA.


You can always reprogram the hardware right away it will redirect
any dma to the crappy page.


That causes silent data corruption for RDMA users - we can't do that.

The only way out for current hardware is to forcibly terminate the
RDMA activity somehow (and I'm not even sure this is possible, at
least it would be driver specific)

Even the IOMMU idea probably doesn't work, I doubt all current
hardware can handle a PCI-E error TLP properly.


What i saying is reprogram hardware to crappy page ie valid page
dma map but that just has random content as a last resort to allow
filesystem to reuse block. So their should be no PCIE error unless
hardware freak out to see its page table reprogram randomly.


No, that isn't an option. You can't silently provide corrupted data
for RDMA to transfer out onto the network, or silently discard data
coming in!!

Think of the consequences of that - I have a fileserver process and
someone does ftruncate and now my clients receive corrupted data??


This is what happens _today_ ie today someone do GUP on page file
and then someone else do truncate the first GUP is effectively
streaming _random_ data to network as the page does not correspond
to anything anymore and once the RDMA MR goes aways and release
the page the page content will be lost. So i am not changing anything
here, what i proposed was to make it explicit to device driver at
least that they were streaming random data. Right now this is all
silent but this is what is happening wether you like it or not :)

Note that  i am saying do that only for truncate to allow to be
nice to fs. But again i am fine with whatever solution but you can
not please everyone here. Either block truncate and fs folks will
hate you or make it clear to device driver that you are streaming
random things and RDMA people hates you.



The only option is to prevent the RDMA transfer from ever happening,
and we just don't have hardware support (beyond destroy everything) to
do that.


The question is who do you want to punish ? RDMA user that pin stuff
and expect thing to work forever without worrying for other fs
activities ? Or filesystem to pin block forever :)


I don't want to punish everyone, I want both sides to have complete
data integrity as the USER has deliberately decided to combine DAX and
RDMA. So either stop it at the front end (ie get_user_pages_longterm)
or make it work in a way that guarantees integrity for both.


  S2: notify userspace program through device/sub-system
  specific API and delay ftruncate. After a while if there
  is no answer just be mean and force hardware to use
  crappy page as anyway this is what happens today


I don't think this happens today (outside of DAX).. Does it?


It does it is just silent, i don't remember anything in the code
that would stop a truncate to happen because of elevated refcount.
This does not happen with ODP mlx5 as it does abide by _all_ mmu
notifier. This is for anything that does ODP without support for
mmu notifier.


Wait - is it expected that the MMU notifier upcall is handled
synchronously? That is, the page DMA mapping must be torn down
immediately, and before returning?


Yes you must torn down mapping before returning from mmu notifier
call back. Any time after is too late. You obviously need hardware
that can support that. In the infiniband sub-system AFAIK only the
mlx5 hardware can do that. In the GPU sub-system everyone is fine.


I'm skeptical that MLX5 can actually make this guarantee. But we
can take that offline in linux-rdma.

I'm also skeptical that NVMe can do this.


Dunno about other sub-systems.



That's simply not possible, since the hardware needs to get control
to do this. Even if there were an IOMMU that could intercept the
DMA, reprogramming it will require a flush, which cannot be guaranteed
to occur "inline".


If hardware can not do that then hardware should not use GUP, at
least not on file back page. I advocated in favor of forbiding GUP
for device that can not do that as right now this silently breaks
in few cases (truncate, mremap, splice, reflink, ...). So the device
in those cases can end up with GUPed pages that do not correspond
to anything anymore ie they do not correspond to the memory backing
t

Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

2018-12-13 Thread Tom Talpey

On 12/13/2018 7:43 AM, Jerome Glisse wrote:

On Wed, Dec 12, 2018 at 08:20:43PM -0700, Jason Gunthorpe wrote:

On Wed, Dec 12, 2018 at 07:01:09PM -0500, Jerome Glisse wrote:

On Wed, Dec 12, 2018 at 04:37:03PM -0700, Jason Gunthorpe wrote:

On Wed, Dec 12, 2018 at 04:53:49PM -0500, Jerome Glisse wrote:

Almost, we need some safety around assuming that DMA is complete to the
page, so the notification would need to go all the way to userspace
with something like a file lease notification. It would also need to
be backstopped by an IOMMU in the case where the hardware does not /
can not stop in-flight DMA.


You can always reprogram the hardware right away it will redirect
any dma to the crappy page.


That causes silent data corruption for RDMA users - we can't do that.

The only way out for current hardware is to forcibly terminate the
RDMA activity somehow (and I'm not even sure this is possible, at
least it would be driver specific)

Even the IOMMU idea probably doesn't work, I doubt all current
hardware can handle a PCI-E error TLP properly.


What i saying is reprogram hardware to crappy page ie valid page
dma map but that just has random content as a last resort to allow
filesystem to reuse block. So their should be no PCIE error unless
hardware freak out to see its page table reprogram randomly.


No, that isn't an option. You can't silently provide corrupted data
for RDMA to transfer out onto the network, or silently discard data
coming in!!

Think of the consequences of that - I have a fileserver process and
someone does ftruncate and now my clients receive corrupted data??


This is what happens _today_ ie today someone do GUP on page file
and then someone else do truncate the first GUP is effectively
streaming _random_ data to network as the page does not correspond
to anything anymore and once the RDMA MR goes aways and release
the page the page content will be lost. So i am not changing anything
here, what i proposed was to make it explicit to device driver at
least that they were streaming random data. Right now this is all
silent but this is what is happening wether you like it or not :)

Note that  i am saying do that only for truncate to allow to be
nice to fs. But again i am fine with whatever solution but you can
not please everyone here. Either block truncate and fs folks will
hate you or make it clear to device driver that you are streaming
random things and RDMA people hates you.



The only option is to prevent the RDMA transfer from ever happening,
and we just don't have hardware support (beyond destroy everything) to
do that.


The question is who do you want to punish ? RDMA user that pin stuff
and expect thing to work forever without worrying for other fs
activities ? Or filesystem to pin block forever :)


I don't want to punish everyone, I want both sides to have complete
data integrity as the USER has deliberately decided to combine DAX and
RDMA. So either stop it at the front end (ie get_user_pages_longterm)
or make it work in a way that guarantees integrity for both.


 S2: notify userspace program through device/sub-system
 specific API and delay ftruncate. After a while if there
 is no answer just be mean and force hardware to use
 crappy page as anyway this is what happens today


I don't think this happens today (outside of DAX).. Does it?


It does it is just silent, i don't remember anything in the code
that would stop a truncate to happen because of elevated refcount.
This does not happen with ODP mlx5 as it does abide by _all_ mmu
notifier. This is for anything that does ODP without support for
mmu notifier.


Wait - is it expected that the MMU notifier upcall is handled
synchronously? That is, the page DMA mapping must be torn down
immediately, and before returning?

That's simply not possible, since the hardware needs to get control
to do this. Even if there were an IOMMU that could intercept the
DMA, reprogramming it will require a flush, which cannot be guaranteed
to occur "inline".


.. and the remedy here is to kill the process, not provide corrupt
data. Kill the process is likely to not go over well with any real
users that want this combination.

Think Samba serving files over RDMA - you can't have random unpriv
users calling ftruncate and causing smbd to be killed or serve corrupt
data.


So what i am saying is there is a choice and it would be better to
decide something than let the existing status quo where we just keep
streaming random data after truncate to a GUPed page.


Let's also remember that any torn-down DMA mapping can't be recycled
until all uses of the old DMA addresses are destroyed. The whole
thing screams for reference counting all the way down, to me.
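
As a minimal sketch of that pairing, using the put_user_page() call this
series introduces (the helper names, the DMA direction and the complete
lack of error handling are illustrative only, not any driver's real code):

#include <linux/mm.h>
#include <linux/dma-mapping.h>

/* Pin user pages and DMA-map them; the page reference and the DMA
 * mapping must have the same lifetime.
 */
static long sketch_pin_and_map(struct device *dev, unsigned long uaddr,
                               unsigned long nr_pages, struct page **pages,
                               dma_addr_t *dma)
{
        long i, got;

        got = get_user_pages(uaddr, nr_pages, FOLL_WRITE, pages, NULL);
        if (got <= 0)
                return got ? got : -EFAULT;

        for (i = 0; i < got; i++)
                dma[i] = dma_map_page(dev, pages[i], 0, PAGE_SIZE,
                                      DMA_BIDIRECTIONAL);
        return got;
}

/* Only after the I/O has fully completed: unmap, then drop the pin. */
static void sketch_unmap_and_unpin(struct device *dev, long nr,
                                   struct page **pages, dma_addr_t *dma)
{
        long i;

        for (i = 0; i < nr; i++) {
                dma_unmap_page(dev, dma[i], PAGE_SIZE, DMA_BIDIRECTIONAL);
                put_user_page(pages[i]);  /* was put_page() before this series */
        }
}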

Tom.



Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-29 Thread Tom Talpey

On 11/29/2018 10:00 PM, John Hubbard wrote:

On 11/29/18 6:30 PM, Tom Talpey wrote:

On 11/29/2018 9:21 PM, John Hubbard wrote:

On 11/29/18 6:18 PM, Tom Talpey wrote:

On 11/29/2018 8:39 PM, John Hubbard wrote:

On 11/28/18 5:59 AM, Tom Talpey wrote:

On 11/27/2018 9:52 PM, John Hubbard wrote:

On 11/27/18 5:21 PM, Tom Talpey wrote:

On 11/21/2018 5:06 PM, John Hubbard wrote:

On 11/21/18 8:49 AM, Tom Talpey wrote:

On 11/21/2018 1:09 AM, John Hubbard wrote:

On 11/19/18 10:57 AM, Tom Talpey wrote:

[...]

Excerpting from below:


Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
   read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
  cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73


vs


With patches applied:
   read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
  cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73


Perfect results, not CPU limited, and full IOPS.

Curiously identical, so I trust you've checked that you measured
both targets, but if so, I say it's good.
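
The figures are also internally consistent: 193k IOPS x 4 KiB is about
790 MB/s (753 MiB/s), matching the reported bandwidth, and with
iodepth=64 against the ~330 usec average completion latency shown in the
full runs below, 64 / 330 usec works out to ~194k IOPS. So the device
queue, not the CPU, is what bounds this result.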



Argh, copy-paste error in the email. The real "before" is ever so slightly
better, at 194K IOPS and 759 MB/s:


Definitely better - note the system CPU is lower, which is probably the
reason for the increased IOPS.


     cpu  : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73


Good result - a correct implementation, and faster.



Thanks, Tom, I really appreciate your experience and help on what performance
should look like here. (I'm sure you can guess that this is the first time
I've worked with fio, heh.)


No problem, happy to chip in. Feel free to add my

Tested-By: Tom Talpey 

I know, that's not the personal email I'm posting from, but it's me.

I'll be hopefully trying the code with the Linux SMB client (cifs.ko)
next week, Long Li is implementing direct io in that and we'll see how
it helps.

Mainly, I'm looking forward to seeing this enable RDMA-to-DAX.

Tom.



I'll send out a new, non-RFC patchset soon, then.

thanks,



Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-29 Thread Tom Talpey

On 11/29/2018 9:21 PM, John Hubbard wrote:

On 11/29/18 6:18 PM, Tom Talpey wrote:

On 11/29/2018 8:39 PM, John Hubbard wrote:

On 11/28/18 5:59 AM, Tom Talpey wrote:

On 11/27/2018 9:52 PM, John Hubbard wrote:

On 11/27/18 5:21 PM, Tom Talpey wrote:

On 11/21/2018 5:06 PM, John Hubbard wrote:

On 11/21/18 8:49 AM, Tom Talpey wrote:

On 11/21/2018 1:09 AM, John Hubbard wrote:

On 11/19/18 10:57 AM, Tom Talpey wrote:

[...]

I'm super-limited here this week hardware-wise and have not been able
to try testing with the patched kernel.

I was able to compare my earlier quick test with a Bionic 4.15 kernel
(400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
test, and without your change.



So just to double check (again): you are running fio with these parameters,
right?

[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64


Correct, I copy/pasted these directly. I also ran with size=10g because
the 1g provides a really small sample set.

There was one other difference, your results indicated fio 3.3 was used.
My Bionic install has fio 3.1. I don't find that relevant because our
goal is to compare before/after, which I haven't done yet.
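
(On the small sample set: size=1g at blocksize=4096 is only 262,144 reads,
which at ~193k IOPS finishes in about 1.35 seconds - hence the
run=1350-1360msec figures in the fio output - so each data point covers
barely more than a second of device time.)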



OK, the 50 MB/s was due to my particular .config. I had some expensive debug 
options
set in mm, fs and locking subsystems. Turning those off, I'm back up to the 
rated
speed of the Samsung NVMe device, so now we should have a clearer picture of the
performance that real users will see.


Oh, good! I'm especially glad because I was having a heck of a time
reconfiguring the one machine I have available for this.


Continuing on, then: running a before and after test, I don't see any 
significant
difference in the fio results:


Excerpting from below:


Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
  read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
     cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73


vs


With patches applied:
  read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
     cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73


Perfect results, not CPU limited, and full IOPS.

Curiously identical, so I trust you've checked that you measured
both targets, but if so, I say it's good.



Argh, copy-paste error in the email. The real "before" is ever so slightly
better, at 194K IOPS and 759 MB/s:


Definitely better - note the system CPU is lower, which is probably the
reason for the increased IOPS.

>cpu  : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73

Good result - a correct implementation, and faster.

Tom.




  $ fio ./experimental-fio.conf
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1715: Thu Nov 29 17:07:09 2018
read: IOPS=194k, BW=759MiB/s (795MB/s)(1024MiB/1350msec)
 slat (nsec): min=1245, max=2812.7k, avg=1538.03, stdev=5519.61
 clat (usec): min=148, max=755, avg=326.85, stdev=18.13
  lat (usec): min=150, max=3483, avg=328.41, stdev=19.53
 clat percentiles (usec):
  |  1.00th=[  322],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],
  | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
  | 70.00th=[  326], 80.00th=[  326], 90.00th=[  326], 95.00th=[  326],
  | 99.00th=[  355], 99.50th=[  537], 99.90th=[  553], 99.95th=[  553],
  | 99.99th=[  619]
bw (  KiB/s): min=767816, max=783096, per=99.84%, avg=775456.00, 
stdev=10804.59, samples=2
iops: min=191954, max=195774, avg=193864.00, stdev=2701.15, 
samples=2
   lat (usec)   : 250=0.09%, 500=99.30%, 750=0.61%, 1000=0.01%
   cpu  : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73
   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
  issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
  latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
READ: bw=759MiB/s (795MB/s), 759MiB/s-759MiB/s (795MB/s-795MB/s), 
io=1024MiB (1074MB), run=1350-1350msec

Disk stats (read/write):
   nvme0n1: ios=222853/0, merge=0/0, ticks=71410/0, in_queue=71935, util=100.00%

thanks,



Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-29 Thread Tom Talpey

On 11/29/2018 8:39 PM, John Hubbard wrote:

On 11/28/18 5:59 AM, Tom Talpey wrote:

On 11/27/2018 9:52 PM, John Hubbard wrote:

On 11/27/18 5:21 PM, Tom Talpey wrote:

On 11/21/2018 5:06 PM, John Hubbard wrote:

On 11/21/18 8:49 AM, Tom Talpey wrote:

On 11/21/2018 1:09 AM, John Hubbard wrote:

On 11/19/18 10:57 AM, Tom Talpey wrote:

[...]

I'm super-limited here this week hardware-wise and have not been able
to try testing with the patched kernel.

I was able to compare my earlier quick test with a Bionic 4.15 kernel
(400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
test, and without your change.



So just to double check (again): you are running fio with these parameters,
right?

[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64


Correct, I copy/pasted these directly. I also ran with size=10g because
the 1g provides a really small sample set.

There was one other difference, your results indicated fio 3.3 was used.
My Bionic install has fio 3.1. I don't find that relevant because our
goal is to compare before/after, which I haven't done yet.



OK, the 50 MB/s was due to my particular .config. I had some expensive debug 
options
set in mm, fs and locking subsystems. Turning those off, I'm back up to the 
rated
speed of the Samsung NVMe device, so now we should have a clearer picture of the
performance that real users will see.


Oh, good! I'm especially glad because I was having a heck of a time
reconfiguring the one machine I have available for this.


Continuing on, then: running a before and after test, I don't see any 
significant
difference in the fio results:


Excerpting from below:

> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
> read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73

vs

> With patches applied:
> read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73

Perfect results, not CPU limited, and full IOPS.

Curiously identical, so I trust you've checked that you measured
both targets, but if so, I say it's good.

Tom.



fio.conf:

[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64

-
Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:

$ fio ./experimental-fio.conf
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018
read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
 slat (nsec): min=1381, max=46469, avg=1649.48, stdev=594.46
 clat (usec): min=162, max=12247, avg=330.00, stdev=185.55
  lat (usec): min=165, max=12253, avg=331.68, stdev=185.69
 clat percentiles (usec):
  |  1.00th=[  322],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],
  | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
  | 70.00th=[  326], 80.00th=[  326], 90.00th=[  326], 95.00th=[  326],
  | 99.00th=[  379], 99.50th=[  594], 99.90th=[  603], 99.95th=[  611],
  | 99.99th=[12125]
bw (  KiB/s): min=751640, max=782912, per=99.52%, avg=767276.00, 
stdev=22112.64, samples=2
iops: min=187910, max=195728, avg=191819.00, stdev=5528.16, 
samples=2
   lat (usec)   : 250=0.08%, 500=99.30%, 750=0.59%
   lat (msec)   : 20=0.02%
   cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
  issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
  latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
READ: bw=753MiB/s (790MB/s), 753MiB/s-753MiB/s (790MB/s-790MB/s), 
io=1024MiB (1074MB), run=1360-1360msec

Disk stats (read/write):
   nvme0n1: ios=220798/0, merge=0/0, ticks=71481/0, in_queue=71966, util=100.00%

-
With patches applied:

 fast_256GB $ fio ./experimental-fio.conf
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018
read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
 slat (nsec): min=1381, max=46469, avg=1649.48, stdev=594.46
 clat (usec): min=162, max=12247, avg=330.00, stdev=185.55
  lat (usec): min=165, max=12253, avg=331.68, stdev=185.69
 clat percentiles (usec):
  |  1.00th=[  322],  5.00th=[  326], 10.0

RE: [Patch v4 2/3] CIFS: Add support for direct I/O write

2018-11-29 Thread Tom Talpey
> -Original Message-
> From: linux-cifs-ow...@vger.kernel.org  On
> Behalf Of Long Li
> Sent: Thursday, November 29, 2018 4:30 PM
> To: Pavel Shilovsky 
> Cc: Steve French ; linux-cifs ;
> samba-technical ; Kernel Mailing List
> 
> Subject: RE: [Patch v4 2/3] CIFS: Add support for direct I/O write
> 
> > Subject: Re: [Patch v4 2/3] CIFS: Add support for direct I/O write
> >
> > Wed, 28 Nov 2018 at 18:20, Long Li :
> > >
> > > > Subject: Re: [Patch v4 2/3] CIFS: Add support for direct I/O write
> > > >
> > > > Wed, 31 Oct 2018 at 15:26, Long Li :
> > > > >
> > > > > From: Long Li 
> > > > >
> > > > > With direct I/O write, user supplied buffers are pinned to the
> > > > > memory and data are transferred directly from user buffers to the
> > transport layer.
> > > > >
> > > > > Change in v3: add support for kernel AIO
> > > > >
> > > > > Change in v4:
> > > > > Refactor common write code to __cifs_writev for direct and non-direct
> > I/O.
> > > > > Retry on direct I/O failure.
> > > > >
> > > > > Signed-off-by: Long Li 
> > > > > ---
> > > > >  fs/cifs/cifsfs.h |   1 +
> > > > >  fs/cifs/file.c   | 194
> > +++
> > > > 
> > > > >  2 files changed, 154 insertions(+), 41 deletions(-)
> > > > >
> > > > > diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h index
> > > > > 7fba9aa..e9c5103 100644
> > > > > --- a/fs/cifs/cifsfs.h
> > > > > +++ b/fs/cifs/cifsfs.h
> > > > > @@ -105,6 +105,7 @@ extern ssize_t cifs_user_readv(struct kiocb
> > > > > *iocb, struct iov_iter *to);  extern ssize_t
> > > > > cifs_direct_readv(struct kiocb *iocb, struct iov_iter *to);
> > > > > extern ssize_t cifs_strict_readv(struct kiocb *iocb, struct
> > > > > iov_iter *to);  extern ssize_t cifs_user_writev(struct kiocb
> > > > > *iocb, struct iov_iter *from);
> > > > > +extern ssize_t cifs_direct_writev(struct kiocb *iocb, struct
> > > > > +iov_iter *from);
> > > > >  extern ssize_t cifs_strict_writev(struct kiocb *iocb, struct
> > > > > iov_iter *from);  extern int cifs_lock(struct file *, int, struct
> > > > > file_lock *); extern int cifs_fsync(struct file *, loff_t, loff_t,
> > > > > int); diff --git a/fs/cifs/file.c b/fs/cifs/file.c index
> > > > > daab878..1a41c04 100644
> > > > > --- a/fs/cifs/file.c
> > > > > +++ b/fs/cifs/file.c
> > > > > @@ -2524,6 +2524,55 @@ wdata_fill_from_iovec(struct cifs_writedata
> > > > > *wdata, struct iov_iter *from,  }
> > > > >
> > > > >  static int
> > > > > +cifs_resend_wdata(struct cifs_writedata *wdata, struct list_head
> > > > > +*wdata_list, struct cifs_aio_ctx *ctx) {
> > > > > +   int wait_retry = 0;
> > > > > +   unsigned int wsize, credits;
> > > > > +   int rc;
> > > > > +   struct TCP_Server_Info *server =
> > > > > +tlink_tcon(wdata->cfile->tlink)->ses->server;
> > > > > +
> > > > > +   /*
> > > > > +* Try to resend this wdata, waiting for credits up to 3 
> > > > > seconds.
> > > > > +* Note: we are attempting to resend the whole wdata not
> > > > > + in
> > > > segments
> > > > > +*/
> > > > > +   do {
> > > > > +   rc = server->ops->wait_mtu_credits(server,
> > > > > + wdata->bytes, &wsize, &credits);
> > > > > +
> > > > > +   if (rc)
> > > > > +   break;
> > > > > +
> > > > > +   if (wsize < wdata->bytes) {
> > > > > +   add_credits_and_wake_if(server, credits, 0);
> > > > > +   msleep(1000);
> > > > > +   wait_retry++;
> > > > > +   }
> > > > > +   } while (wsize < wdata->bytes && wait_retry < 3);
> > > > > +
> > > > > +   if (wsize < wdata->bytes) {
> > > > > +   rc = -EBUSY;
> > > > > +   goto out;
> > > > > +   }
> > > > > +
> > > > > +   rc = -EAGAIN;
> > > > > +   while (rc == -EAGAIN)
> > > > > +   if (!wdata->cfile->invalidHandle ||
> > > > > +   !(rc = cifs_reopen_file(wdata->cfile, false)))
> > > > > +   rc = server->ops->async_writev(wdata,
> > > > > +
> > > > > + cifs_uncached_writedata_release);
> > > > > +
> > > > > +   if (!rc) {
> > > > > +   list_add_tail(&wdata->list, wdata_list);
> > > > > +   return 0;
> > > > > +   }
> > > > > +
> > > > > +   add_credits_and_wake_if(server, wdata->credits, 0);
> > > > > +out:
> > > > > +   kref_put(&wdata->refcount,
> > > > > +cifs_uncached_writedata_release);
> > > > > +
> > > > > +   return rc;
> > > > > +}
> > > > > +
> > > > > +static int
> > > > >  cifs_write_from_iter(loff_t offset, size_t len, struct iov_iter 
> > > > > *from,
> > > > >  struct cifsFileInfo *open_file,
> > > > >  struct cifs_sb_info *cifs_sb, struct
> > > > > list_head *wdata_list, @@ -2537,6 +2586,8 @@
> > > > > cifs_write_from_iter(loff_t offset,
> > > > size_t len, struct iov_iter *from,
> > > > > loff_t saved_offset = offset;
> > > > > pid_t pid;
> > > > >  
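
For context on the "user supplied buffers are pinned" description above,
the pinning step generally looks something like the sketch below. This is
not the cifs code; the exact helper and its signature vary by kernel
version, and sketch_pin_for_direct_write() is a made-up name:

#include <linux/uio.h>
#include <linux/mm.h>

/* Take page references on the user buffer behind an iov_iter so the data
 * can be handed to the transport with no intermediate copy.  The pages
 * stay pinned until the write (or its retry) completes, then each one is
 * released with put_page().
 */
static ssize_t sketch_pin_for_direct_write(struct iov_iter *from,
                                           struct page ***pages,
                                           size_t max_bytes, size_t *start)
{
        ssize_t bytes;

        bytes = iov_iter_get_pages_alloc(from, pages, max_bytes, start);
        if (bytes > 0)
                iov_iter_advance(from, bytes);
        return bytes;
}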

Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-28 Thread Tom Talpey

On 11/27/2018 9:52 PM, John Hubbard wrote:

On 11/27/18 5:21 PM, Tom Talpey wrote:

On 11/21/2018 5:06 PM, John Hubbard wrote:

On 11/21/18 8:49 AM, Tom Talpey wrote:

On 11/21/2018 1:09 AM, John Hubbard wrote:

On 11/19/18 10:57 AM, Tom Talpey wrote:

[...]


What I'd really like to see is to go back to the original fio parameters
(1 thread, 64 iodepth) and try to get a result that gets at least close
to the speced 200K IOPS of the NVMe device. There seems to be something
wrong with yours, currently.


I'll dig into what has gone wrong with the test. I see fio putting data files
in the right place, so the obvious "using the wrong drive" is (probably)
not it. Even though it really feels like that sort of thing. We'll see.



Then of course, the result with the patched get_user_pages, and
compare whichever of IOPS or CPU% changes, and how much.

If these are within a few percent, I agree it's good to go. If it's
roughly 25% like the result just above, that's a rocky road.

I can try this after the holiday on some basic hardware and might
be able to scrounge up better. Can you post that github link?



Here:

     g...@github.com:johnhubbard/linux (branch: gup_dma_testing)


I'm super-limited here this week hardware-wise and have not been able
to try testing with the patched kernel.

I was able to compare my earlier quick test with a Bionic 4.15 kernel
(400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
test, and without your change.



So just to double check (again): you are running fio with these parameters,
right?

[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64


Correct, I copy/pasted these directly. I also ran with size=10g because
the 1g provides a really small sample set.

There was one other difference, your results indicated fio 3.3 was used.
My Bionic install has fio 3.1. I don't find that relevant because our
goal is to compare before/after, which I haven't done yet.

Tom.






Say, that branch reports it has not had a commit since June 30. Is that
the right one? What about gup_dma_for_lpc_2018?



That's the right branch, but the AuthorDate for the head commit (only) somehow
got stuck in the past. I just now amended that patch with a new date and pushed
it, so the head commit now shows Nov 27:

https://github.com/johnhubbard/linux/commits/gup_dma_testing


The actual code is the same, though. (It is still based on Nov 19th's 
f2ce1065e767
commit.)


thanks,



Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-27 Thread Tom Talpey

On 11/21/2018 5:06 PM, John Hubbard wrote:

On 11/21/18 8:49 AM, Tom Talpey wrote:

On 11/21/2018 1:09 AM, John Hubbard wrote:

On 11/19/18 10:57 AM, Tom Talpey wrote:

~14000 4KB read IOPS is really, really low for an NVMe disk.


Yes, but Jan Kara's original config file for fio is *intended* to highlight
the get_user_pages/put_user_pages changes. It was *not* intended to get max
performance,  as you can see by the numjobs and direct IO parameters:

cat fio.conf
[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64
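
(For what the config exercises: direct=1 means O_DIRECT, so every read
DMAs straight into the pinned user buffer - roughly, each 4 KiB I/O goes
through get_user_pages() and a matching release on completion - and
ioengine=libaio with iodepth=64 keeps 64 of those reads outstanding from
the single job, which is why this workload highlights the pin/unpin cost.)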


To be clear - I used those identical parameters, on my lower-spec
machine, and got 400,000 4KB read IOPS. Those results are nearly 30x
higher than yours!


OK, then something really is wrong here...




So I'm thinking that this is not a "tainted" test, but rather, we're 
constraining
things a lot with these choices. It's hard to find a good test config to run 
that
allows decisions, but so far, I'm not really seeing anything that says "this
is so bad that we can't afford to fix the brokenness." I think.


I'm not suggesting we tune the benchmark, I'm suggesting the results
on your system are not meaningful since they are orders of magnitude
low. And without meaningful data it's impossible to see the performance
impact of the change...


Can you confirm what type of hardware you're running this test on?
CPU, memory speed and capacity, and NVMe device especially?

Tom.


Yes, it's a nice new system, I don't expect any strange perf problems:

CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
  (Intel X299 chipset)
Block device: nvme-Samsung_SSD_970_EVO_250GB
DRAM: 32 GB


The Samsung Evo 970 250GB is speced to yield 200,000 random read IOPS
with a 4KB QD32 workload:


https://www.samsung.com/us/computing/memory-storage/solid-state-drives/ssd-970-evo-nvme-m-2-250gb-mz-v7e250bw/#specs

And the I7-7800X is a 6-core processor (12 hyperthreads).


So, here's a comparison using 20 threads, direct IO, for the baseline vs.
patched kernel (below). Highlights:

 -- IOPS are similar, around 60k.
 -- BW gets worse, dropping from 290 to 220 MB/s.
 -- CPU is well under 100%.
 -- latency is incredibly long, but...20 threads.

Baseline:

$ ./run.sh
fio configuration:
[reader]
ioengine=libaio
blocksize=4096
size=1g
rw=read
group_reporting
iodepth=256
direct=1
numjobs=20


Ouch - 20 threads issuing 256 io's each!? Of course latency skyrockets.
That's going to cause tremendous queuing, and context switching, far
outside of the get_user_pages() change.
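
(20 jobs at iodepth=256 is 5,120 I/Os in flight; by Little's law, at the
57k-74k IOPS seen here that alone amounts to roughly 70-90 ms of average
queueing latency, independent of the get_user_pages() change under test.)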

But even so, it only brings IOPS to 74.2K, which is still far short of
the device's 200K spec.

Comparing anyway:



Patched:

 Running fio:
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=256
...
fio-3.3
Starting 20 processes
Jobs: 13 (f=8): 
[_(1),R(1),_(1),f(1),R(2),_(1),f(2),_(1),R(1),f(1),R(1),f(1),R(1),_(2),R(1),_(1),R(1)][97.9%][r=229MiB/s,w=0KiB/s][r=58.5k,w=0
 IOPS][eta 00m:02s]
reader: (groupid=0, jobs=20): err= 0: pid=2104: Tue Nov 20 22:01:58 2018
     read: IOPS=56.8k, BW=222MiB/s (232MB/s)(20.0GiB/92385msec)
...
Thoughts?


Concern - the 74.2K IOPS unpatched drops to 56.8K patched!
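
(That is a drop of about 17.4k IOPS, roughly 23%, which is the "roughly
25%" scale of concern mentioned below.)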


ACK. :)



What I'd really like to see is to go back to the original fio parameters
(1 thread, 64 iodepth) and try to get a result that gets at least close
to the speced 200K IOPS of the NVMe device. There seems to be something
wrong with yours, currently.


I'll dig into what has gone wrong with the test. I see fio putting data files
in the right place, so the obvious "using the wrong drive" is (probably)
not it. Even though it really feels like that sort of thing. We'll see.



Then of course, the result with the patched get_user_pages, and
compare whichever of IOPS or CPU% changes, and how much.

If these are within a few percent, I agree it's good to go. If it's
roughly 25% like the result just above, that's a rocky road.

I can try this after the holiday on some basic hardware and might
be able to scrounge up better. Can you post that github link?



Here:

g...@github.com:johnhubbard/linux (branch: gup_dma_testing)


I'm super-limited here this week hardware-wise and have not been able
to try testing with the patched kernel.

I was able to compare my earlier quick test with a Bionic 4.15 kernel
(400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
test, and without your change.

Say, that branch reports it has not had a commit since June 30. Is that
the right one? What about gup_dma_for_lpc_2018?

Tom.


Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-21 Thread Tom Talpey

On 11/21/2018 1:09 AM, John Hubbard wrote:

On 11/19/18 10:57 AM, Tom Talpey wrote:

~14000 4KB read IOPS is really, really low for an NVMe disk.


Yes, but Jan Kara's original config file for fio is *intended* to highlight
the get_user_pages/put_user_pages changes. It was *not* intended to get max
performance,  as you can see by the numjobs and direct IO parameters:

cat fio.conf
[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64


To be clear - I used those identical parameters, on my lower-spec
machine, and got 400,000 4KB read IOPS. Those results are nearly 30x
higher than yours!


So I'm thinking that this is not a "tainted" test, but rather, we're
constraining things a lot with these choices. It's hard to find a good
test config to run that allows decisions, but so far, I'm not really
seeing anything that says "this is so bad that we can't afford to fix
the brokenness." I think.


I'm not suggesting we tune the benchmark, I'm suggesting the results
on your system are not meaningful since they are orders of magnitude
low. And without meaningful data it's impossible to see the performance
impact of the change...


Can you confirm what type of hardware you're running this test on?
CPU, memory speed and capacity, and NVMe device especially?

Tom.


Yes, it's a nice new system, I don't expect any strange perf problems:

CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
 (Intel X299 chipset)
Block device: nvme-Samsung_SSD_970_EVO_250GB
DRAM: 32 GB


The Samsung Evo 970 250GB is spec'd to yield 200,000 random read IOPS
with a 4KB QD32 workload:


https://www.samsung.com/us/computing/memory-storage/solid-state-drives/ssd-970-evo-nvme-m-2-250gb-mz-v7e250bw/#specs

And the i7-7800X is a 6-core processor (12 hyperthreads).


So, here's a comparison using 20 threads, direct IO, for the baseline vs.
patched kernel (below). Highlights:

-- IOPS are similar, around 60k.
-- BW gets worse, dropping from 290 to 220 MB/s.
-- CPU is well under 100%.
-- latency is incredibly long, but...20 threads.

Baseline:

$ ./run.sh
fio configuration:
[reader]
ioengine=libaio
blocksize=4096
size=1g
rw=read
group_reporting
iodepth=256
direct=1
numjobs=20


Ouch - 20 threads issuing 256 io's each!? Of course latency skyrockets.
That's going to cause tremendous queuing, and context switching, far
outside of the get_user_pages() change.

But even so, it only brings IOPS to 74.2K, which is still far short of
the device's 200K spec.

Comparing anyway:



Patched:

 Running fio:
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=256
...
fio-3.3
Starting 20 processes
Jobs: 13 (f=8): 
[_(1),R(1),_(1),f(1),R(2),_(1),f(2),_(1),R(1),f(1),R(1),f(1),R(1),_(2),R(1),_(1),R(1)][97.9%][r=229MiB/s,w=0KiB/s][r=58.5k,w=0
 IOPS][eta 00m:02s]
reader: (groupid=0, jobs=20): err= 0: pid=2104: Tue Nov 20 22:01:58 2018
read: IOPS=56.8k, BW=222MiB/s (232MB/s)(20.0GiB/92385msec)
...
Thoughts?


Concern - the 74.2K IOPS unpatched drops to 56.8K patched!

What I'd really like to see is to go back to the original fio parameters
(1 thread, 64 iodepth) and try to get a result that gets at least close
to the spec'd 200K IOPS of the NVMe device. There seems to be something
wrong with yours, currently.

Then of course, the result with the patched get_user_pages, and
compare whichever of IOPS or CPU% changes, and how much.

If these are within a few percent, I agree it's good to go. If it's
roughly 25% like the result just above, that's a rocky road.

I can try this after the holiday on some basic hardware and might
be able to scrounge up better. Can you post that github link?

Tom.


Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-19 Thread Tom Talpey

John, thanks for the discussion at LPC. One of the concerns we
raised however was the performance test. The numbers below are
rather obviously tainted. I think we need to get a better baseline
before concluding anything...

Here's my main concern:

On 11/10/2018 3:50 AM, john.hubb...@gmail.com wrote:

From: John Hubbard 
...
--
WITHOUT the patch:
--
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=55.5MiB/s,w=0KiB/s][r=14.2k,w=0 IOPS][eta 
00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=1750: Tue Nov  6 20:18:06 2018
read: IOPS=13.9k, BW=54.4MiB/s (57.0MB/s)(1024MiB/18826msec)


~14000 4KB read IOPS is really, really low for an NVMe disk.


   cpu  : usr=2.39%, sys=95.30%, ctx=669, majf=0, minf=72


CPU is obviously the limiting factor. At these IOPS, it should be far
less.

--
OR, here's a better run WITH the patch applied, and you can see that this
is nearly as good as the "without" case:
--

reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=53.2MiB/s,w=0KiB/s][r=13.6k,w=0 IOPS][eta 
00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=2521: Tue Nov  6 20:01:33 2018
read: IOPS=13.4k, BW=52.5MiB/s (55.1MB/s)(1024MiB/19499msec)


Similar low IOPS.


   cpu  : usr=3.47%, sys=94.61%, ctx=370, majf=0, minf=73


Similar CPU saturation.





I get nearly 400,000 4KB IOPS on my tiny desktop, which has a 25W
i7-7500 and a Samsung PM961 128GB NVMe (stock Bionic 4.15 kernel
and fio version 3.1). Even then, the CPU saturates, so it's not
necessarily a perfect test. I'd like to see your runs both get to
"max" IOPS, i.e. CPU < 100%, and compare the CPU numbers. This would
give the best comparison for making a decision.
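As a rough sketch of what I mean, a config along these lines should make the
drive rather than a single CPU the limiting factor (the device path and job
count are placeholders, adjust for your setup):

[reader]
filename=/dev/nvme0n1
direct=1
ioengine=libaio
blocksize=4096
rw=randread
iodepth=32
numjobs=4
group_reporting
time_based
runtime=60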

Can you confirm what type of hardware you're running this test on?
CPU, memory speed and capacity, and NVMe device especially?

Tom.


Re: [Patch v7 21/22] CIFS: SMBD: Upper layer performs SMB read via RDMA write through memory registration

2018-09-23 Thread Tom Talpey

On 9/23/2018 2:24 PM, Stefan Metzmacher wrote:

Hi Tom,


I just tested that setting:

mr->iova &= (PAGE_SIZE - 1);
mr->iova |= 0x;

after the ib_map_mr_sg() and before doing the IB_WR_REG_MR, seems to
work.


Good! As you know, we were concerned about it after seeing that
the ib_dma_map_sg() code was unconditionally setting it to the
dma_mapped address. By salting those IOVAs with varying data,
this should give your FRWR regions stronger integrity in addition
to not leaking kernel "addresses" to the wire.


Just wondering... Isn't the thing we use called FRMR?


They're basically the same concept, it's a subtle difference.

FRMR = Fast Register Memory Region
FRWR = Fast Register Work Request

The memory region is the mr itself, this is created early on.

The work request is built when actually binding the physical
pages to the region, and setting the offset, length, etc, which
is what's happening in the routine that I made the comment on.
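A bare-bones sketch with the kernel verbs API, just to illustrate the split
(this is not the SMB Direct code itself; declarations and error handling are
mostly elided):

	struct ib_mr *mr;
	struct ib_reg_wr reg_wr = {};
	struct ib_send_wr *bad_wr;

	/* 1. The memory region itself: created once, early on. */
	mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, max_pages);

	/*
	 * 2. The fast-register work request: built per I/O, when the
	 *    physical pages are actually bound to the region.
	 */
	if (ib_map_mr_sg(mr, sgl, sg_nents, NULL, PAGE_SIZE) < sg_nents)
		return -EIO;	/* mapping failed */

	reg_wr.wr.opcode = IB_WR_REG_MR;
	reg_wr.mr = mr;
	reg_wr.key = mr->rkey;
	reg_wr.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
	ib_post_send(qp, &reg_wr.wr, &bad_wr);	/* this WR is the "FRWR" */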

So, for this discussion I chose to say FRWR. Sorry for any
confusion!

Tom.


Re: [Patch v7 21/22] CIFS: SMBD: Upper layer performs SMB read via RDMA write through memory registration

2018-09-22 Thread Tom Talpey

On 9/21/2018 8:56 PM, Stefan Metzmacher wrote:

Hi,


+    req->Channel = SMB2_CHANNEL_RDMA_V1_INVALIDATE;
+    if (need_invalidate)
+    req->Channel = SMB2_CHANNEL_RDMA_V1;
+    req->ReadChannelInfoOffset =
+    offsetof(struct smb2_read_plain_req, Buffer);
+    req->ReadChannelInfoLength =
+    sizeof(struct smbd_buffer_descriptor_v1);
+    v1 = (struct smbd_buffer_descriptor_v1 *) &req->Buffer[0];
+    v1->offset = rdata->mr->mr->iova;


It's unnecessary, and possibly leaking kernel information, to use
the IOVA as the offset of a memory region which is registered using
an FRWR. Because such regions are based on the exact bytes targeted
by the memory handle, the offset can be set to any value, typically
zero, but nearly arbitrary. As long as the (offset + length) does
not wrap or otherwise overflow, offset can be set to anything
convenient.

Since SMB reads and writes range up to 8MB, I'd suggest zeroing the
least significant 23 bits, which should guarantee it. The other 41
bits, party on. You could randomize them, pass some clever identifier
such as MID sequence, whatever.


I just tested that setting:

mr->iova &= (PAGE_SIZE - 1);
mr->iova |= 0x;

after the ib_map_mr_sg() and before doing the IB_WR_REG_MR, seems to work.


Good! As you know, we were concerned about it after seeing that
the ib_dma_map_sg() code was unconditionally setting it to the
dma_mapped address. By salting those IOVAs with varying data,
this should give your FRWR regions stronger integrity in addition
to not leaking kernel "addresses" to the wire.

Tom.


Re: [Patch v7 21/22] CIFS: SMBD: Upper layer performs SMB read via RDMA write through memory registration

2018-09-19 Thread Tom Talpey

Replying to a very old message, but it's something we
discussed today at the IOLab event so to capture it:

On 11/7/2017 12:55 AM, Long Li wrote:

From: Long Li 

---
  fs/cifs/file.c| 17 +++--
  fs/cifs/smb2pdu.c | 45 -
  2 files changed, 59 insertions(+), 3 deletions(-)
...
diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index c8afb83..8a5ff90 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -2379,7 +2379,40 @@ smb2_new_read_req(void **buf, unsigned int *total_len,
req->MinimumCount = 0;
req->Length = cpu_to_le32(io_parms->length);
req->Offset = cpu_to_le64(io_parms->offset);
+#ifdef CONFIG_CIFS_SMB_DIRECT
+   /*
+* If we want to do a RDMA write, fill in and append
+* smbd_buffer_descriptor_v1 to the end of read request
+*/
+   if (server->rdma && rdata &&
+   rdata->bytes >= server->smbd_conn->rdma_readwrite_threshold) {
+
+   struct smbd_buffer_descriptor_v1 *v1;
+   bool need_invalidate =
+   io_parms->tcon->ses->server->dialect == SMB30_PROT_ID;
+
+   rdata->mr = smbd_register_mr(
+   server->smbd_conn, rdata->pages,
+   rdata->nr_pages, rdata->tailsz,
+   true, need_invalidate);
+   if (!rdata->mr)
+   return -ENOBUFS;
+
+   req->Channel = SMB2_CHANNEL_RDMA_V1_INVALIDATE;
+   if (need_invalidate)
+   req->Channel = SMB2_CHANNEL_RDMA_V1;
+   req->ReadChannelInfoOffset =
+   offsetof(struct smb2_read_plain_req, Buffer);
+   req->ReadChannelInfoLength =
+   sizeof(struct smbd_buffer_descriptor_v1);
+   v1 = (struct smbd_buffer_descriptor_v1 *) &req->Buffer[0];
+   v1->offset = rdata->mr->mr->iova;


It's unnecessary, and possibly leaking kernel information, to use
the IOVA as the offset of a memory region which is registered using
an FRWR. Because such regions are based on the exact bytes targeted
by the memory handle, the offset can be set to any value, typically
zero, but nearly arbitrary. As long as the (offset + length) does
not wrap or otherwise overflow, offset can be set to anything
convenient.

Since SMB reads and writes range up to 8MB, I'd suggest zeroing the
least significant 23 bits, which should guarantee it. The other 41
bits, party on. You could randomize them, pass some clever identifier
such as MID sequence, whatever.
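Something like the following, done after ib_map_mr_sg() and before building
the IB_WR_REG_MR, is all I have in mind (a sketch only; random bytes are just
one choice of salt, the MID sequence would do as well):

	u64 salt;

	get_random_bytes(&salt, sizeof(salt));
	/* keep the low 23 bits clear so offset + length (<= 8MB) can't carry */
	mr->iova = salt & ~((u64)SZ_8M - 1);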

Tom.


+   v1->token = rdata->mr->mr->rkey;
+   v1->length = rdata->mr->mr->length;


Re: [Patch v2 02/15] CIFS: Add support for direct pages in rdata

2018-06-26 Thread Tom Talpey

On 6/25/2018 5:01 PM, Jason Gunthorpe wrote:

On Sat, Jun 23, 2018 at 09:50:20PM -0400, Tom Talpey wrote:

On 5/30/2018 3:47 PM, Long Li wrote:

From: Long Li 

Add a function to allocate rdata without allocating pages for data
transfer. This gives the caller an option to pass a number of pages
that point to the data buffer.

rdata is still responsible for freeing those pages after it's done.


"Caller" is still responsible? Or is the rdata somehow freeing itself
via another mechanism?



Signed-off-by: Long Li 
  fs/cifs/cifsglob.h |  2 +-
  fs/cifs/file.c | 23 ---
  2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index 8d16c3e..56864a87 100644
+++ b/fs/cifs/cifsglob.h
@@ -1179,7 +1179,7 @@ struct cifs_readdata {
unsigned inttailsz;
unsigned intcredits;
unsigned intnr_pages;
-   struct page *pages[];
+   struct page **pages;


Technically speaking, these are syntactically equivalent. It may not
be worth changing this historic definition.


[] is a C99 'flex array', it has a different allocation behavior than
** and is not interchangeable..


In that case, it's an even better reason to not change the declaration.
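For anyone following along, a minimal illustration (not the cifs structures
themselves) of why the two forms are not interchangeable:

	struct with_flex {
		unsigned int nr_pages;
		struct page *pages[];	/* C99 flex array: storage follows the struct */
	};

	struct with_ptr {
		unsigned int nr_pages;
		struct page **pages;	/* separate allocation, separate lifetime */
	};

	/* flex array: one allocation covers the struct and the page pointers */
	struct with_flex *a = kzalloc(sizeof(*a) + n * sizeof(a->pages[0]), GFP_KERNEL);

	/* pointer: two allocations, and the pages array must be freed on its own */
	struct with_ptr *b = kzalloc(sizeof(*b), GFP_KERNEL);
	b->pages = kcalloc(n, sizeof(*b->pages), GFP_KERNEL);	/* NULL checks omitted */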

Tom.



Re: [Patch v2 14/15] CIFS: Add support for direct I/O write

2018-06-26 Thread Tom Talpey

On 6/26/2018 12:39 AM, Long Li wrote:

Subject: Re: [Patch v2 14/15] CIFS: Add support for direct I/O write

On 5/30/2018 3:48 PM, Long Li wrote:

From: Long Li 

Implement the function for direct I/O write. It doesn't support AIO,
which will be implemented in a follow up patch.

Signed-off-by: Long Li 
---
   fs/cifs/cifsfs.h |   1 +
   fs/cifs/file.c   | 165

+++

   2 files changed, 166 insertions(+)

diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h index
7fba9aa..e9c5103 100644
--- a/fs/cifs/cifsfs.h
+++ b/fs/cifs/cifsfs.h
@@ -105,6 +105,7 @@ extern ssize_t cifs_user_readv(struct kiocb *iocb,

struct iov_iter *to);

   extern ssize_t cifs_direct_readv(struct kiocb *iocb, struct iov_iter *to);
   extern ssize_t cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to);
   extern ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter
*from);
+extern ssize_t cifs_direct_writev(struct kiocb *iocb, struct iov_iter
+*from);
   extern ssize_t cifs_strict_writev(struct kiocb *iocb, struct iov_iter *from);
   extern int cifs_lock(struct file *, int, struct file_lock *);
   extern int cifs_fsync(struct file *, loff_t, loff_t, int); diff
--git a/fs/cifs/file.c b/fs/cifs/file.c index e6e6f24..8c385b1 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2461,6 +2461,35 @@ cifs_uncached_writedata_release(struct kref
*refcount)

   static void collect_uncached_write_data(struct cifs_aio_ctx *ctx);

+static void cifs_direct_writedata_release(struct kref *refcount) {
+   int i;
+   struct cifs_writedata *wdata = container_of(refcount,
+   struct cifs_writedata, refcount);
+
+   for (i = 0; i < wdata->nr_pages; i++)
+   put_page(wdata->pages[i]);
+
+   cifs_writedata_release(refcount);
+}
+
+static void cifs_direct_writev_complete(struct work_struct *work) {
+   struct cifs_writedata *wdata = container_of(work,
+   struct cifs_writedata, work);
+   struct inode *inode = d_inode(wdata->cfile->dentry);
+   struct cifsInodeInfo *cifsi = CIFS_I(inode);
+
+   spin_lock(&inode->i_lock);
+   cifs_update_eof(cifsi, wdata->offset, wdata->bytes);
+   if (cifsi->server_eof > inode->i_size)
+   i_size_write(inode, cifsi->server_eof);
+   spin_unlock(&inode->i_lock);
+
+   complete(&wdata->done);
+   kref_put(&wdata->refcount, cifs_direct_writedata_release); }
+
   static void
   cifs_uncached_writev_complete(struct work_struct *work)
   {
@@ -2703,6 +2732,142 @@ static void collect_uncached_write_data(struct

cifs_aio_ctx *ctx)

complete(&ctx->done);
   }

+ssize_t cifs_direct_writev(struct kiocb *iocb, struct iov_iter *from)
+{
+   struct file *file = iocb->ki_filp;
+   ssize_t total_written = 0;
+   struct cifsFileInfo *cfile;
+   struct cifs_tcon *tcon;
+   struct cifs_sb_info *cifs_sb;
+   struct TCP_Server_Info *server;
+   pid_t pid;
+   unsigned long nr_pages;
+   loff_t offset = iocb->ki_pos;
+   size_t len = iov_iter_count(from);
+   int rc;
+   struct cifs_writedata *wdata;
+
+   /*
+* iov_iter_get_pages_alloc doesn't work with ITER_KVEC.
+* In this case, fall back to non-direct write function.
+*/
+   if (from->type & ITER_KVEC) {
+   cifs_dbg(FYI, "use non-direct cifs_user_writev for kvec

I/O\n");

+   return cifs_user_writev(iocb, from);
+   }
+
+   rc = generic_write_checks(iocb, from);
+   if (rc <= 0)
+   return rc;
+
+   cifs_sb = CIFS_FILE_SB(file);
+   cfile = file->private_data;
+   tcon = tlink_tcon(cfile->tlink);
+   server = tcon->ses->server;
+
+   if (!server->ops->async_writev)
+   return -ENOSYS;
+
+   if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_RWPIDFORWARD)
+   pid = cfile->pid;
+   else
+   pid = current->tgid;
+
+   do {
+   unsigned int wsize, credits;
+   struct page **pagevec;
+   size_t start;
+   ssize_t cur_len;
+
+   rc = server->ops->wait_mtu_credits(server, cifs_sb->wsize,
+  &wsize, &credits);
+   if (rc)
+   break;
+
+   cur_len = iov_iter_get_pages_alloc(
+   from, &pagevec, wsize, &start);
+   if (cur_len < 0) {
+   cifs_dbg(VFS,
+   "direct_writev couldn't get user pages "
+   "(rc=%zd) iter type %d iov_offset %lu count"
+   " %lu\n",
+   cur_len, from->type,
+   from->iov_offset, from->count);
+   dump_stack();
+   break;
+   }
+   if (cur_len < 0)
+   break;


This cur_len < 0 test is redundant with the prior if(), delete.

Re: [Patch v2 06/15] CIFS: Introduce helper function to get page offset and length in smb_rqst

2018-06-26 Thread Tom Talpey

On 6/25/2018 5:14 PM, Long Li wrote:

Subject: Re: [Patch v2 06/15] CIFS: Introduce helper function to get page
offset and length in smb_rqst

On 5/30/2018 3:47 PM, Long Li wrote:

From: Long Li 

Introduce a function rqst_page_get_length to return the page offset
and length for a given page in smb_rqst. This function is to be used
by following patches.

Signed-off-by: Long Li 
---
   fs/cifs/cifsproto.h |  3 +++
   fs/cifs/misc.c  | 17 +
   2 files changed, 20 insertions(+)

diff --git a/fs/cifs/cifsproto.h b/fs/cifs/cifsproto.h index
7933c5f..89dda14 100644
--- a/fs/cifs/cifsproto.h
+++ b/fs/cifs/cifsproto.h
@@ -557,4 +557,7 @@ int cifs_alloc_hash(const char *name, struct

crypto_shash **shash,

struct sdesc **sdesc);
   void cifs_free_hash(struct crypto_shash **shash, struct sdesc
**sdesc);

+extern void rqst_page_get_length(struct smb_rqst *rqst, unsigned int

page,

+   unsigned int *len, unsigned int *offset);
+
   #endif   /* _CIFSPROTO_H */
diff --git a/fs/cifs/misc.c b/fs/cifs/misc.c index 96849b5..e951417
100644
--- a/fs/cifs/misc.c
+++ b/fs/cifs/misc.c
@@ -905,3 +905,20 @@ cifs_free_hash(struct crypto_shash **shash,

struct sdesc **sdesc)

crypto_free_shash(*shash);
*shash = NULL;
   }
+
+/**
+ * rqst_page_get_length - obtain the length and offset for a page in
+smb_rqst
+ * Input: rqst - a smb_rqst, page - a page index for rqst
+ * Output: *len - the length for this page, *offset - the offset for
+this page  */ void rqst_page_get_length(struct smb_rqst *rqst,
+unsigned int page,
+   unsigned int *len, unsigned int *offset) {
+   *len = rqst->rq_pagesz;
+   *offset = (page == 0) ? rqst->rq_offset : 0;


Really? Page 0 always has a zero offset??


I think you are misreading this line. The offset for page 0 is rq_offset, the 
offset for all subsequent pages are 0.


Ah, yes, sorry I did read it incorrectly.

+
+   if (rqst->rq_npages == 1 || page == rqst->rq_npages-1)
+   *len = rqst->rq_tailsz;
+   else if (page == 0)
+   *len = rqst->rq_pagesz - rqst->rq_offset; }



This subroutine does what patch 5 does inline. Why not push this patch up in
the sequence and use the helper?


This is actually a little different. This function returns the length and 
offset for a given page in the request. There might be multiple pages in a 
request.


Ok, but I still think there is quite a bit of inline computation of
this stuff that would more clearly and more robustly be done in a set
of common functions. If someone ever touches the code to support a
new upper layer, or even integrate with more complex compounding,
things will get ugly fast.


The other function calculates the total length of all the pages in a request.


Again, a great candidate for a common set of subroutines, IMO.
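As a worked example of the helper above (numbers picked purely for
illustration): with rq_offset = 512, rq_pagesz = 4096, rq_npages = 3 and
rq_tailsz = 1000, rqst_page_get_length() gives

	page 0: *offset = 512, *len = 4096 - 512 = 3584
	page 1: *offset = 0,   *len = 4096
	page 2: *offset = 0,   *len = rq_tailsz = 1000

which adds back up to the 8680 bytes of data carried in the request.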

Tom.


Re: [Patch v2 14/15] CIFS: Add support for direct I/O write

2018-06-23 Thread Tom Talpey

On 5/30/2018 3:48 PM, Long Li wrote:

From: Long Li 

Implement the function for direct I/O write. It doesn't support AIO, which
will be implemented in a follow up patch.

Signed-off-by: Long Li 
---
  fs/cifs/cifsfs.h |   1 +
  fs/cifs/file.c   | 165 +++
  2 files changed, 166 insertions(+)

diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h
index 7fba9aa..e9c5103 100644
--- a/fs/cifs/cifsfs.h
+++ b/fs/cifs/cifsfs.h
@@ -105,6 +105,7 @@ extern ssize_t cifs_user_readv(struct kiocb *iocb, struct 
iov_iter *to);
  extern ssize_t cifs_direct_readv(struct kiocb *iocb, struct iov_iter *to);
  extern ssize_t cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to);
  extern ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from);
+extern ssize_t cifs_direct_writev(struct kiocb *iocb, struct iov_iter *from);
  extern ssize_t cifs_strict_writev(struct kiocb *iocb, struct iov_iter *from);
  extern int cifs_lock(struct file *, int, struct file_lock *);
  extern int cifs_fsync(struct file *, loff_t, loff_t, int);
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index e6e6f24..8c385b1 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2461,6 +2461,35 @@ cifs_uncached_writedata_release(struct kref *refcount)
  
  static void collect_uncached_write_data(struct cifs_aio_ctx *ctx);
  
+static void cifs_direct_writedata_release(struct kref *refcount)

+{
+   int i;
+   struct cifs_writedata *wdata = container_of(refcount,
+   struct cifs_writedata, refcount);
+
+   for (i = 0; i < wdata->nr_pages; i++)
+   put_page(wdata->pages[i]);
+
+   cifs_writedata_release(refcount);
+}
+
+static void cifs_direct_writev_complete(struct work_struct *work)
+{
+   struct cifs_writedata *wdata = container_of(work,
+   struct cifs_writedata, work);
+   struct inode *inode = d_inode(wdata->cfile->dentry);
+   struct cifsInodeInfo *cifsi = CIFS_I(inode);
+
+   spin_lock(&inode->i_lock);
+   cifs_update_eof(cifsi, wdata->offset, wdata->bytes);
+   if (cifsi->server_eof > inode->i_size)
+   i_size_write(inode, cifsi->server_eof);
+   spin_unlock(&inode->i_lock);
+
+   complete(&wdata->done);
+   kref_put(&wdata->refcount, cifs_direct_writedata_release);
+}
+
  static void
  cifs_uncached_writev_complete(struct work_struct *work)
  {
@@ -2703,6 +2732,142 @@ static void collect_uncached_write_data(struct 
cifs_aio_ctx *ctx)
complete(&ctx->done);
  }
  
+ssize_t cifs_direct_writev(struct kiocb *iocb, struct iov_iter *from)

+{
+   struct file *file = iocb->ki_filp;
+   ssize_t total_written = 0;
+   struct cifsFileInfo *cfile;
+   struct cifs_tcon *tcon;
+   struct cifs_sb_info *cifs_sb;
+   struct TCP_Server_Info *server;
+   pid_t pid;
+   unsigned long nr_pages;
+   loff_t offset = iocb->ki_pos;
+   size_t len = iov_iter_count(from);
+   int rc;
+   struct cifs_writedata *wdata;
+
+   /*
+* iov_iter_get_pages_alloc doesn't work with ITER_KVEC.
+* In this case, fall back to non-direct write function.
+*/
+   if (from->type & ITER_KVEC) {
+   cifs_dbg(FYI, "use non-direct cifs_user_writev for kvec I/O\n");
+   return cifs_user_writev(iocb, from);
+   }
+
+   rc = generic_write_checks(iocb, from);
+   if (rc <= 0)
+   return rc;
+
+   cifs_sb = CIFS_FILE_SB(file);
+   cfile = file->private_data;
+   tcon = tlink_tcon(cfile->tlink);
+   server = tcon->ses->server;
+
+   if (!server->ops->async_writev)
+   return -ENOSYS;
+
+   if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_RWPIDFORWARD)
+   pid = cfile->pid;
+   else
+   pid = current->tgid;
+
+   do {
+   unsigned int wsize, credits;
+   struct page **pagevec;
+   size_t start;
+   ssize_t cur_len;
+
+   rc = server->ops->wait_mtu_credits(server, cifs_sb->wsize,
+  &wsize, &credits);
+   if (rc)
+   break;
+
+   cur_len = iov_iter_get_pages_alloc(
+   from, &pagevec, wsize, &start);
+   if (cur_len < 0) {
+   cifs_dbg(VFS,
+   "direct_writev couldn't get user pages "
+   "(rc=%zd) iter type %d iov_offset %lu count"
+   " %lu\n",
+   cur_len, from->type,
+   from->iov_offset, from->count);
+   dump_stack();
+   break;
+   }
+   if (cur_len < 0)
+   break;


This cur_len < 0 test is redundant with the prior if(), delete.

+
+   nr_pages = (cur_len + start + PAGE_SIZE - 1) / PAGE_SIZE;


Am I misreading, or 

Re: [Patch v2 13/15] CIFS: Add support for direct I/O read

2018-06-23 Thread Tom Talpey




On 5/30/2018 3:48 PM, Long Li wrote:

From: Long Li 

Implement the function for direct I/O read. It doesn't support AIO, which
will be implemented in a follow up patch.

Signed-off-by: Long Li 
---
  fs/cifs/cifsfs.h |   1 +
  fs/cifs/file.c   | 149 +++
  2 files changed, 150 insertions(+)

diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h
index 5f02318..7fba9aa 100644
--- a/fs/cifs/cifsfs.h
+++ b/fs/cifs/cifsfs.h
@@ -102,6 +102,7 @@ extern int cifs_open(struct inode *inode, struct file 
*file);
  extern int cifs_close(struct inode *inode, struct file *file);
  extern int cifs_closedir(struct inode *inode, struct file *file);
  extern ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to);
+extern ssize_t cifs_direct_readv(struct kiocb *iocb, struct iov_iter *to);
  extern ssize_t cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to);
  extern ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from);
  extern ssize_t cifs_strict_writev(struct kiocb *iocb, struct iov_iter *from);
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 87eece6..e6e6f24 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2955,6 +2955,18 @@ cifs_read_allocate_pages(struct cifs_readdata *rdata, 
unsigned int nr_pages)
return rc;
  }
  
+static void cifs_direct_readdata_release(struct kref *refcount)

+{
+   struct cifs_readdata *rdata = container_of(refcount,
+   struct cifs_readdata, refcount);
+   unsigned int i;
+
+   for (i = 0; i < rdata->nr_pages; i++)
+   put_page(rdata->pages[i]);
+
+   cifs_readdata_release(refcount);
+}
+
  static void
  cifs_uncached_readdata_release(struct kref *refcount)
  {
@@ -3267,6 +3279,143 @@ collect_uncached_read_data(struct cifs_aio_ctx *ctx)
complete(&ctx->done);
  }
  
+static void cifs_direct_readv_complete(struct work_struct *work)

+{
+   struct cifs_readdata *rdata =
+   container_of(work, struct cifs_readdata, work);
+
+   complete(&rdata->done);
+   kref_put(&rdata->refcount, cifs_direct_readdata_release);
+}
+
+ssize_t cifs_direct_readv(struct kiocb *iocb, struct iov_iter *to)
+{
+   size_t len, cur_len, start;
+   unsigned int npages, rsize, credits;
+   struct file *file;
+   struct cifs_sb_info *cifs_sb;
+   struct cifsFileInfo *cfile;
+   struct cifs_tcon *tcon;
+   struct page **pagevec;
+   ssize_t rc, total_read = 0;
+   struct TCP_Server_Info *server;
+   loff_t offset = iocb->ki_pos;
+   pid_t pid;
+   struct cifs_readdata *rdata;
+
+   /*
+* iov_iter_get_pages_alloc() doesn't work with ITER_KVEC,
+* fall back to data copy read path
+*/
+   if (to->type & ITER_KVEC) {
+   cifs_dbg(FYI, "use non-direct cifs_user_readv for kvec I/O\n");
+   return cifs_user_readv(iocb, to);
+   }
+
+   len = iov_iter_count(to);
+   if (!len)
+   return 0;
+
+   file = iocb->ki_filp;
+   cifs_sb = CIFS_FILE_SB(file);
+   cfile = file->private_data;
+   tcon = tlink_tcon(cfile->tlink);
+   server = tcon->ses->server;
+
+   if (!server->ops->async_readv)
+   return -ENOSYS;
+
+   if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_RWPIDFORWARD)
+   pid = cfile->pid;
+   else
+   pid = current->tgid;
+
+   if ((file->f_flags & O_ACCMODE) == O_WRONLY)
+   cifs_dbg(FYI, "attempting read on write only file instance\n");


Confusing. Maybe "attempting read on write-only filehandle"?


+
+   do {
+   rc = server->ops->wait_mtu_credits(server, cifs_sb->rsize,
+   &rsize, &credits);
+   if (rc)
+   break;
+
+   cur_len = min_t(const size_t, len, rsize);
+
+   rc = iov_iter_get_pages_alloc(to, &pagevec, cur_len, &start);
+   if (rc < 0) {
+   cifs_dbg(VFS,
+   "couldn't get user pages (rc=%zd) iter type %d"
+   " iov_offset %lu count %lu\n",
+   rc, to->type, to->iov_offset, to->count);
+   dump_stack();
+   break;
+   }
+
+   rdata = cifs_readdata_direct_alloc(
+   pagevec, cifs_direct_readv_complete);
+   if (!rdata) {
+   add_credits_and_wake_if(server, credits, 0);
+   rc = -ENOMEM;
+   break;
+   }
+
+   npages = (rc + start + PAGE_SIZE-1) / PAGE_SIZE;
+   rdata->nr_pages = npages;
+   rdata->page_offset = start;
+   rdata->pagesz = PAGE_SIZE;
+   rdata->tailsz = npages > 1 ?
+   rc-(PAGE_SIZE-start)-(npages-2)*PAGE_SIZE :
+   rc;


This expression makes 
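Working the tail-size arithmetic above with assumed numbers, purely for
illustration: if iov_iter_get_pages_alloc() returns rc = 10000 bytes starting
at start = 3000 within the first 4096-byte page, then

	npages = (10000 + 3000 + 4095) / 4096 = 4
	tailsz = 10000 - (4096 - 3000) - (4 - 2) * 4096 = 712

i.e. 1096 bytes in the first page, two full pages, and 712 bytes in the last,
which sums back to 10000.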

Re: [Patch v2 13/15] CIFS: Add support for direct I/O read

2018-06-23 Thread Tom Talpey




On 5/30/2018 3:48 PM, Long Li wrote:

From: Long Li 

Implement the function for direct I/O read. It doesn't support AIO, which
will be implemented in a follow up patch.

Signed-off-by: Long Li 
---
  fs/cifs/cifsfs.h |   1 +
  fs/cifs/file.c   | 149 +++
  2 files changed, 150 insertions(+)

diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h
index 5f02318..7fba9aa 100644
--- a/fs/cifs/cifsfs.h
+++ b/fs/cifs/cifsfs.h
@@ -102,6 +102,7 @@ extern int cifs_open(struct inode *inode, struct file 
*file);
  extern int cifs_close(struct inode *inode, struct file *file);
  extern int cifs_closedir(struct inode *inode, struct file *file);
  extern ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to);
+extern ssize_t cifs_direct_readv(struct kiocb *iocb, struct iov_iter *to);
  extern ssize_t cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to);
  extern ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from);
  extern ssize_t cifs_strict_writev(struct kiocb *iocb, struct iov_iter *from);
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 87eece6..e6e6f24 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2955,6 +2955,18 @@ cifs_read_allocate_pages(struct cifs_readdata *rdata, 
unsigned int nr_pages)
return rc;
  }
  
+static void cifs_direct_readdata_release(struct kref *refcount)

+{
+   struct cifs_readdata *rdata = container_of(refcount,
+   struct cifs_readdata, refcount);
+   unsigned int i;
+
+   for (i = 0; i < rdata->nr_pages; i++)
+   put_page(rdata->pages[i]);
+
+   cifs_readdata_release(refcount);
+}
+
  static void
  cifs_uncached_readdata_release(struct kref *refcount)
  {
@@ -3267,6 +3279,143 @@ collect_uncached_read_data(struct cifs_aio_ctx *ctx)
complete(>done);
  }
  
+static void cifs_direct_readv_complete(struct work_struct *work)

+{
+   struct cifs_readdata *rdata =
+   container_of(work, struct cifs_readdata, work);
+
+   complete(>done);
+   kref_put(>refcount, cifs_direct_readdata_release);
+}
+
+ssize_t cifs_direct_readv(struct kiocb *iocb, struct iov_iter *to)
+{
+   size_t len, cur_len, start;
+   unsigned int npages, rsize, credits;
+   struct file *file;
+   struct cifs_sb_info *cifs_sb;
+   struct cifsFileInfo *cfile;
+   struct cifs_tcon *tcon;
+   struct page **pagevec;
+   ssize_t rc, total_read = 0;
+   struct TCP_Server_Info *server;
+   loff_t offset = iocb->ki_pos;
+   pid_t pid;
+   struct cifs_readdata *rdata;
+
+   /*
+* iov_iter_get_pages_alloc() doesn't work with ITER_KVEC,
+* fall back to data copy read path
+*/
+   if (to->type & ITER_KVEC) {
+   cifs_dbg(FYI, "use non-direct cifs_user_readv for kvec I/O\n");
+   return cifs_user_readv(iocb, to);
+   }
+
+   len = iov_iter_count(to);
+   if (!len)
+   return 0;
+
+   file = iocb->ki_filp;
+   cifs_sb = CIFS_FILE_SB(file);
+   cfile = file->private_data;
+   tcon = tlink_tcon(cfile->tlink);
+   server = tcon->ses->server;
+
+   if (!server->ops->async_readv)
+   return -ENOSYS;
+
+   if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_RWPIDFORWARD)
+   pid = cfile->pid;
+   else
+   pid = current->tgid;
+
+   if ((file->f_flags & O_ACCMODE) == O_WRONLY)
+   cifs_dbg(FYI, "attempting read on write only file instance\n");


Confusing. Maybe "attempting read on write-only filehandle"?


+
+   do {
+   rc = server->ops->wait_mtu_credits(server, cifs_sb->rsize,
+   , );
+   if (rc)
+   break;
+
+   cur_len = min_t(const size_t, len, rsize);
+
+   rc = iov_iter_get_pages_alloc(to, , cur_len, );
+   if (rc < 0) {
+   cifs_dbg(VFS,
+   "couldn't get user pages (rc=%zd) iter type %d"
+   " iov_offset %lu count %lu\n",
+   rc, to->type, to->iov_offset, to->count);
+   dump_stack();
+   break;
+   }
+
+   rdata = cifs_readdata_direct_alloc(
+   pagevec, cifs_direct_readv_complete);
+   if (!rdata) {
+   add_credits_and_wake_if(server, credits, 0);
+   rc = -ENOMEM;
+   break;
+   }
+
+   npages = (rc + start + PAGE_SIZE-1) / PAGE_SIZE;
+   rdata->nr_pages = npages;
+   rdata->page_offset = start;
+   rdata->pagesz = PAGE_SIZE;
+   rdata->tailsz = npages > 1 ?
+   rc-(PAGE_SIZE-start)-(npages-2)*PAGE_SIZE :
+   rc;


This expression makes 
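For reference, the tail-size arithmetic above can be collected into an
equivalent, easier-to-check form; this is only an illustrative rewrite,
assuming rc is the byte count returned by iov_iter_get_pages_alloc() and
start is the offset into the first page:

	/* bytes of data landing in the last page; same value as above */
	rdata->tailsz = npages > 1 ?
		rc + start - (npages - 1) * PAGE_SIZE : rc;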

Re: [Patch v2 12/15] CIFS: Pass page offset for encrypting

2018-06-23 Thread Tom Talpey

On 5/30/2018 3:48 PM, Long Li wrote:

From: Long Li 

Encryption function needs to read data starting page offset from input
buffer.

This doesn't affect decryption path since it allocates its own page
buffers.

Signed-off-by: Long Li 
---
  fs/cifs/smb2ops.c | 20 +---
  1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/cifs/smb2ops.c b/fs/cifs/smb2ops.c
index 1fa1c29..38d19b6 100644
--- a/fs/cifs/smb2ops.c
+++ b/fs/cifs/smb2ops.c
@@ -2189,9 +2189,10 @@ init_sg(struct smb_rqst *rqst, u8 *sign)
smb2_sg_set_buf(&sg[i], rqst->rq_iov[i].iov_base,
rqst->rq_iov[i].iov_len);
for (j = 0; i < sg_len - 1; i++, j++) {
-   unsigned int len = (j < rqst->rq_npages - 1) ? rqst->rq_pagesz
-   : rqst->rq_tailsz;
-   sg_set_page(&sg[i], rqst->rq_pages[j], len, 0);
+   unsigned int len, offset;
+
+   rqst_page_get_length(rqst, j, &len, &offset);
+   sg_set_page(&sg[i], rqst->rq_pages[j], len, offset);
}
smb2_sg_set_buf(&sg[sg_len - 1], sign, SMB2_SIGNATURE_SIZE);
return sg;
@@ -2332,6 +2333,7 @@ smb3_init_transform_rq(struct TCP_Server_Info *server, 
struct smb_rqst *new_rq,
return rc;
  
  	new_rq->rq_pages = pages;

+   new_rq->rq_offset = old_rq->rq_offset;
new_rq->rq_npages = old_rq->rq_npages;
new_rq->rq_pagesz = old_rq->rq_pagesz;
new_rq->rq_tailsz = old_rq->rq_tailsz;
@@ -2363,10 +2365,14 @@ smb3_init_transform_rq(struct TCP_Server_Info *server, 
struct smb_rqst *new_rq,
  
  	/* copy pages form the old */

for (i = 0; i < npages; i++) {
-   char *dst = kmap(new_rq->rq_pages[i]);
-   char *src = kmap(old_rq->rq_pages[i]);
-   unsigned int len = (i < npages - 1) ? new_rq->rq_pagesz :
-   new_rq->rq_tailsz;
+   char *dst, *src;
+   unsigned int offset, len;
+
+   rqst_page_get_length(new_rq, i, &len, &offset);
+
+   dst = (char *) kmap(new_rq->rq_pages[i]) + offset;
+   src = (char *) kmap(old_rq->rq_pages[i]) + offset;


Ouch! TWO kmap/kunmaps per page of data? Is there not already a kva,
at least for the destination (message)?
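For instance, the destination could be mapped once per request instead of
once per page, something like the sketch below (dst_kva is an illustrative
name; this assumes a vmap() of the whole rq_pages array is acceptable in
this path and that rq_pagesz == PAGE_SIZE):

	/* needs linux/vmalloc.h; one mapping for the whole destination */
	char *dst_kva = vmap(new_rq->rq_pages, new_rq->rq_npages,
			     VM_MAP, PAGE_KERNEL);
	if (!dst_kva)
		return -ENOMEM;

	for (i = 0; i < npages; i++) {
		unsigned int offset, len;
		char *src;

		rqst_page_get_length(new_rq, i, &len, &offset);
		src = (char *) kmap(old_rq->rq_pages[i]) + offset;
		memcpy(dst_kva + i * PAGE_SIZE + offset, src, len);
		kunmap(old_rq->rq_pages[i]);
	}
	vunmap(dst_kva);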

Tom.



+
memcpy(dst, src, len);
kunmap(new_rq->rq_pages[i]);
kunmap(old_rq->rq_pages[i]);



Re: [Patch v2 11/15] CIFS: Pass page offset for calculating signature

2018-06-23 Thread Tom Talpey

On 5/30/2018 3:48 PM, Long Li wrote:

From: Long Li 

When calculating signature for the packet, it needs to read into the
correct page offset for the data.

Signed-off-by: Long Li 
---
  fs/cifs/cifsencrypt.c | 9 +
  1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/cifs/cifsencrypt.c b/fs/cifs/cifsencrypt.c
index a6ef088..e88303c 100644
--- a/fs/cifs/cifsencrypt.c
+++ b/fs/cifs/cifsencrypt.c
@@ -68,11 +68,12 @@ int __cifs_calc_signature(struct smb_rqst *rqst,
  
  	/* now hash over the rq_pages array */

for (i = 0; i < rqst->rq_npages; i++) {
-   void *kaddr = kmap(rqst->rq_pages[i]);
-   size_t len = rqst->rq_pagesz;
+   void *kaddr;
+   unsigned int len, offset;
  
-		if (i == rqst->rq_npages - 1)

-   len = rqst->rq_tailsz;
+   rqst_page_get_length(rqst, i, &len, &offset);
+
+   kaddr = (char *) kmap(rqst->rq_pages[i]) + offset;


I suppose it's more robust to map a page at a time, but it's pretty
expensive. Is this the only way to iterate over a potentially very
large block of data? For example, a 1MB segment means 256 kmap/kunmaps.

Tom.

  
  		crypto_shash_update(shash, kaddr, len);
  



Re: [Patch v2 10/15] CIFS: SMBD: Support page offset in memory registration

2018-06-23 Thread Tom Talpey

On 5/30/2018 3:48 PM, Long Li wrote:

From: Long Li 

Change code to pass the correct page offset during memory registration for
RDMA read/write.

Signed-off-by: Long Li 
---
  fs/cifs/smb2pdu.c   | 18 -
  fs/cifs/smbdirect.c | 76 +++--
  fs/cifs/smbdirect.h |  2 +-
  3 files changed, 58 insertions(+), 38 deletions(-)

diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index f603fbe..fc30774 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -2623,8 +2623,8 @@ smb2_new_read_req(void **buf, unsigned int *total_len,
  
  		rdata->mr = smbd_register_mr(

server->smbd_conn, rdata->pages,
-   rdata->nr_pages, rdata->tailsz,
-   true, need_invalidate);
+   rdata->nr_pages, rdata->page_offset,
+   rdata->tailsz, true, need_invalidate);
if (!rdata->mr)
return -ENOBUFS;
  
@@ -3013,16 +3013,22 @@ smb2_async_writev(struct cifs_writedata *wdata,
  
  		wdata->mr = smbd_register_mr(

server->smbd_conn, wdata->pages,
-   wdata->nr_pages, wdata->tailsz,
-   false, need_invalidate);
+   wdata->nr_pages, wdata->page_offset,
+   wdata->tailsz, false, need_invalidate);
if (!wdata->mr) {
rc = -ENOBUFS;
goto async_writev_out;
}
req->Length = 0;
req->DataOffset = 0;
-   req->RemainingBytes =
-   cpu_to_le32((wdata->nr_pages-1)*PAGE_SIZE + 
wdata->tailsz);
+   if (wdata->nr_pages > 1)
+   req->RemainingBytes =
+   cpu_to_le32(
+   (wdata->nr_pages - 1) * wdata->pagesz -
+   wdata->page_offset + wdata->tailsz
+   );
+   else
+   req->RemainingBytes = cpu_to_le32(wdata->tailsz);


Again, I think a helper that computed and returned this size would be
much clearer and compact. And I still am incredulous that a single page
io always has an offset of zero. :-)
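Something like this would do (sketch only; the helper name is illustrative):

	static inline u32 cifs_wdata_bytes(struct cifs_writedata *wdata)
	{
		if (wdata->nr_pages <= 1)
			return wdata->tailsz;
		return (wdata->nr_pages - 1) * wdata->pagesz -
			wdata->page_offset + wdata->tailsz;
	}

and then the call site collapses to:

	req->RemainingBytes = cpu_to_le32(cifs_wdata_bytes(wdata));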


req->Channel = SMB2_CHANNEL_RDMA_V1_INVALIDATE;
if (need_invalidate)
req->Channel = SMB2_CHANNEL_RDMA_V1;
diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index ba53c52..e459c97 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -2299,37 +2299,37 @@ static void smbd_mr_recovery_work(struct work_struct 
*work)
if (smbdirect_mr->state == MR_INVALIDATED ||
smbdirect_mr->state == MR_ERROR) {
  
-			if (smbdirect_mr->state == MR_INVALIDATED) {

+   /* recover this MR entry */
+   rc = ib_dereg_mr(smbdirect_mr->mr);
+   if (rc) {
+   log_rdma_mr(ERR,
+   "ib_dereg_mr failed rc=%x\n",
+   rc);
+   smbd_disconnect_rdma_connection(info);
+   continue;
+   }


Ok, we discussed this ib_dereg_mr() call at the plugfest last week.
It's unnecessary - the MR is reusable and does not need to be destroyed
after each use.
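In other words, the MR_INVALIDATED case could simply unmap and re-arm the
existing MR, along these lines (sketch only; MR_ERROR would still go
through the dereg/alloc recovery below):

		if (smbdirect_mr->state == MR_INVALIDATED) {
			/* the MR itself is still usable; just drop the DMA mapping */
			ib_dma_unmap_sg(info->id->device, smbdirect_mr->sgl,
					smbdirect_mr->sgl_count,
					smbdirect_mr->dir);
			smbdirect_mr->state = MR_READY; /* see barrier note below */
			continue;
		}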


+
+   smbdirect_mr->mr = ib_alloc_mr(
+   info->pd, info->mr_type,
+   info->max_frmr_depth);
+   if (IS_ERR(smbdirect_mr->mr)) {
+   log_rdma_mr(ERR,
+   "ib_alloc_mr failed mr_type=%x "
+   "max_frmr_depth=%x\n",
+   info->mr_type,
+   info->max_frmr_depth);
+   smbd_disconnect_rdma_connection(info);
+   continue;
+   }
+


Not needed, for the same reason above.


+   if (smbdirect_mr->state == MR_INVALIDATED)
ib_dma_unmap_sg(
info->id->device, smbdirect_mr->sgl,
smbdirect_mr->sgl_count,
smbdirect_mr->dir);
-   smbdirect_mr->state = MR_READY;


As we observed, the smbdirect_mr is not protected by a lock, therefore
this MR_READY state transition needs a memory barrier in front of it!
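A release store would cover it, e.g. (sketch; this wants a matching
smp_load_acquire() or a lock on the side that picks up READY MRs):

			/* publish all prior writes to this MR before READY
			 * becomes visible
			 */
			smp_store_release(&smbdirect_mr->state, MR_READY);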


-   } else if (smbdirect_mr->state == MR_ERROR) {
-
-   /* recover this MR entry */
-   rc = 

Re: [Patch v2 09/15] CIFS: SMBD: Support page offset in RDMA recv

2018-06-23 Thread Tom Talpey

On 5/30/2018 3:48 PM, Long Li wrote:

From: Long Li 

RDMA recv function needs to place data to the correct place starting at
page offset.

Signed-off-by: Long Li 
---
  fs/cifs/smbdirect.c | 18 +++---
  1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index 6141e3c..ba53c52 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -2004,10 +2004,12 @@ static int smbd_recv_buf(struct smbd_connection *info, 
char *buf,
   * return value: actual data read
   */
  static int smbd_recv_page(struct smbd_connection *info,
-   struct page *page, unsigned int to_read)
+   struct page *page, unsigned int page_offset,
+   unsigned int to_read)
  {
int ret;
char *to_address;
+   void *page_address;
  
  	/* make sure we have the page ready for read */

ret = wait_event_interruptible(
@@ -2015,16 +2017,17 @@ static int smbd_recv_page(struct smbd_connection *info,
info->reassembly_data_length >= to_read ||
info->transport_status != SMBD_CONNECTED);
if (ret)
-   return 0;
+   return ret;
  
  	/* now we can read from reassembly queue and not sleep */

-   to_address = kmap_atomic(page);
+   page_address = kmap_atomic(page);
+   to_address = (char *) page_address + page_offset;
  
  	log_read(INFO, "reading from page=%p address=%p to_read=%d\n",

page, to_address, to_read);
  
  	ret = smbd_recv_buf(info, to_address, to_read);

-   kunmap_atomic(to_address);
+   kunmap_atomic(page_address);


Is "page" truly not mapped? This kmap/kunmap for each received 4KB is
very expensive. Is there not a way to keep a kva for the reassembly
queue segments?
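If the receive pages are guaranteed to come from lowmem - a sketch only,
and only valid under that assumption - the per-page map/unmap could be
dropped entirely:

	/* no kmap needed when the page can never be in highmem */
	to_address = (char *) page_address(page) + page_offset;
	ret = smbd_recv_buf(info, to_address, to_read);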

  
  	return ret;

  }
@@ -2038,7 +2041,7 @@ int smbd_recv(struct smbd_connection *info, struct msghdr 
*msg)
  {
char *buf;
struct page *page;
-   unsigned int to_read;
+   unsigned int to_read, page_offset;
int rc;
  
  	info->smbd_recv_pending++;

@@ -2052,15 +2055,16 @@ int smbd_recv(struct smbd_connection *info, struct 
msghdr *msg)
  
  	case READ | ITER_BVEC:

page = msg->msg_iter.bvec->bv_page;
+   page_offset = msg->msg_iter.bvec->bv_offset;
to_read = msg->msg_iter.bvec->bv_len;
-   rc = smbd_recv_page(info, page, to_read);
+   rc = smbd_recv_page(info, page, page_offset, to_read);
break;
  
  	default:

/* It's a bug in upper layer to get there */
cifs_dbg(VFS, "CIFS: invalid msg type %d\n",
msg->msg_iter.type);
-   rc = -EIO;
+   rc = -EINVAL;
}
  
  	info->smbd_recv_pending--;




Re: [Patch v2 08/15] CIFS: SMBD: Support page offset in RDMA send

2018-06-23 Thread Tom Talpey

On 5/30/2018 3:48 PM, Long Li wrote:

From: Long Li 

The RDMA send function needs to look at offset in the request pages, and
send data starting from there.

Signed-off-by: Long Li 
---
  fs/cifs/smbdirect.c | 27 +++
  1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index c62f7c9..6141e3c 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -17,6 +17,7 @@
  #include 
  #include "smbdirect.h"
  #include "cifs_debug.h"
+#include "cifsproto.h"
  
  static struct smbd_response *get_empty_queue_buffer(

struct smbd_connection *info);
@@ -2082,7 +2083,7 @@ int smbd_send(struct smbd_connection *info, struct 
smb_rqst *rqst)
struct kvec vec;
int nvecs;
int size;
-   int buflen = 0, remaining_data_length;
+   unsigned int buflen = 0, remaining_data_length;
int start, i, j;
int max_iov_size =
info->max_send_size - sizeof(struct smbd_data_transfer);
@@ -2113,10 +2114,17 @@ int smbd_send(struct smbd_connection *info, struct 
smb_rqst *rqst)
buflen += iov[i].iov_len;
}
  
-	/* add in the page array if there is one */

+   /*
+* Add in the page array if there is one. The caller needs to set
+* rq_tailsz to PAGE_SIZE when the buffer has multiple pages and
+* ends at page boundary
+*/
if (rqst->rq_npages) {
-   buflen += rqst->rq_pagesz * (rqst->rq_npages - 1);
-   buflen += rqst->rq_tailsz;
+   if (rqst->rq_npages == 1)
+   buflen += rqst->rq_tailsz;
+   else
+   buflen += rqst->rq_pagesz * (rqst->rq_npages - 1) -
+   rqst->rq_offset + rqst->rq_tailsz;
}


This code is really confusing and redundant. It tests npages > 0,
then tests npages == 1, then does an else. Why not call the helper
like the following smbd_send()?
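For example (sketch only, summing per-page lengths with the helper this
series introduces):

	/* add in the page array if there is one */
	for (i = 0; i < rqst->rq_npages; i++) {
		unsigned int len, offset;

		rqst_page_get_length(rqst, i, &len, &offset);
		buflen += len;
	}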

Tom.

  
  	if (buflen + sizeof(struct smbd_data_transfer) >

@@ -2213,8 +2221,9 @@ int smbd_send(struct smbd_connection *info, struct 
smb_rqst *rqst)
  
  	/* now sending pages if there are any */

for (i = 0; i < rqst->rq_npages; i++) {
-   buflen = (i == rqst->rq_npages-1) ?
-   rqst->rq_tailsz : rqst->rq_pagesz;
+   unsigned int offset;
+
+   rqst_page_get_length(rqst, i, &len, &offset);
nvecs = (buflen + max_iov_size - 1) / max_iov_size;
log_write(INFO, "sending pages buflen=%d nvecs=%d\n",
buflen, nvecs);
@@ -2225,9 +2234,11 @@ int smbd_send(struct smbd_connection *info, struct 
smb_rqst *rqst)
remaining_data_length -= size;
log_write(INFO, "sending pages i=%d offset=%d size=%d"
" remaining_data_length=%d\n",
-   i, j*max_iov_size, size, remaining_data_length);
+   i, j*max_iov_size+offset, size,
+   remaining_data_length);
rc = smbd_post_send_page(
-   info, rqst->rq_pages[i], j*max_iov_size,
+   info, rqst->rq_pages[i],
+   j*max_iov_size + offset,
size, remaining_data_length);
if (rc)
goto done;



Re: [Patch v2 06/15] CIFS: Introduce helper function to get page offset and length in smb_rqst

2018-06-23 Thread Tom Talpey

On 5/30/2018 3:47 PM, Long Li wrote:

From: Long Li 

Introduce a function rqst_page_get_length to return the page offset and
length for a given page in smb_rqst. This function is to be used by
following patches.

Signed-off-by: Long Li 
---
  fs/cifs/cifsproto.h |  3 +++
  fs/cifs/misc.c  | 17 +
  2 files changed, 20 insertions(+)

diff --git a/fs/cifs/cifsproto.h b/fs/cifs/cifsproto.h
index 7933c5f..89dda14 100644
--- a/fs/cifs/cifsproto.h
+++ b/fs/cifs/cifsproto.h
@@ -557,4 +557,7 @@ int cifs_alloc_hash(const char *name, struct crypto_shash 
**shash,
struct sdesc **sdesc);
  void cifs_free_hash(struct crypto_shash **shash, struct sdesc **sdesc);
  
+extern void rqst_page_get_length(struct smb_rqst *rqst, unsigned int page,

+   unsigned int *len, unsigned int *offset);
+
  #endif/* _CIFSPROTO_H */
diff --git a/fs/cifs/misc.c b/fs/cifs/misc.c
index 96849b5..e951417 100644
--- a/fs/cifs/misc.c
+++ b/fs/cifs/misc.c
@@ -905,3 +905,20 @@ cifs_free_hash(struct crypto_shash **shash, struct sdesc 
**sdesc)
crypto_free_shash(*shash);
*shash = NULL;
  }
+
+/**
+ * rqst_page_get_length - obtain the length and offset for a page in smb_rqst
+ * Input: rqst - a smb_rqst, page - a page index for rqst
+ * Output: *len - the length for this page, *offset - the offset for this page
+ */
+void rqst_page_get_length(struct smb_rqst *rqst, unsigned int page,
+   unsigned int *len, unsigned int *offset)
+{
+   *len = rqst->rq_pagesz;
+   *offset = (page == 0) ? rqst->rq_offset : 0;


Really? Page 0 always has a zero offset??


+
+   if (rqst->rq_npages == 1 || page == rqst->rq_npages-1)
+   *len = rqst->rq_tailsz;
+   else if (page == 0)
+   *len = rqst->rq_pagesz - rqst->rq_offset;
+}
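
To make the intended semantics concrete, a hypothetical request with
rq_npages = 3, rq_pagesz = 4096, rq_offset = 512 and rq_tailsz = 300 comes
back as: page 0 -> offset 512, len 3584; page 1 -> offset 0, len 4096;
page 2 -> offset 0, len 300. Only the first page carries a non-zero
offset, and only the last page uses rq_tailsz.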



This subroutine does what patch 5 does inline. Why not push this patch
up in the sequence and use the helper?

Tom.


Re: [Patch v2 05/15] CIFS: Calculate the correct request length based on page offset and tail size

2018-06-23 Thread Tom Talpey

On 5/30/2018 3:47 PM, Long Li wrote:

From: Long Li 

It's possible that the page offset is non-zero in the pages in a request,
change the function to calculate the correct data buffer length.

Signed-off-by: Long Li 
---
  fs/cifs/transport.c | 20 +---
  1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/fs/cifs/transport.c b/fs/cifs/transport.c
index 927226a..d6b5523 100644
--- a/fs/cifs/transport.c
+++ b/fs/cifs/transport.c
@@ -212,10 +212,24 @@ rqst_len(struct smb_rqst *rqst)
for (i = 0; i < rqst->rq_nvec; i++)
buflen += iov[i].iov_len;
  
-	/* add in the page array if there is one */

+   /*
+* Add in the page array if there is one. The caller needs to make
+* sure rq_offset and rq_tailsz are set correctly. If a buffer of
+* multiple pages ends at page boundary, rq_tailsz needs to be set to
+* PAGE_SIZE.
+*/
if (rqst->rq_npages) {
-   buflen += rqst->rq_pagesz * (rqst->rq_npages - 1);
-   buflen += rqst->rq_tailsz;
+   if (rqst->rq_npages == 1)
+   buflen += rqst->rq_tailsz;
+   else {
+   /*
+* If there is more than one page, calculate the
+* buffer length based on rq_offset and rq_tailsz
+*/
+   buflen += rqst->rq_pagesz * (rqst->rq_npages - 1) -
+   rqst->rq_offset;
+   buflen += rqst->rq_tailsz;
+   }


Wouldn't it be simpler to keep the original code, but then just subtract
the rq_offset?

buflen += rqst->rq_pagesz * (rqst->rq_npages - 1);
buflen += rqst->rq_tailsz;
buflen -= rqst->rq_offset;

It's kind of confusing as written. Also, what if it's just one page, but
has a non-zero offset? Is that somehow not possible? My suggested code
would take that into account, yours doesn't.
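
Concretely, with hypothetical numbers: one page, rq_offset = 512,
rq_tailsz = 1024. The patch adds 1024 to buflen; the subtract-rq_offset
form adds 512. The two only agree if a single-page buffer can never start
at a non-zero offset, or if rq_tailsz is defined to already exclude the
offset.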

Tom.


}
  
  	return buflen;




Re: [Patch v2 04/15] CIFS: Add support for direct pages in wdata

2018-06-23 Thread Tom Talpey

On 5/30/2018 3:47 PM, Long Li wrote:

From: Long Li 

Add a function to allocate wdata without allocating pages for data
transfer. This gives the caller an option to pass a number of pages that
point to the data buffer to write to.

wdata is reponsible for free those pages after it's done.


Same comment as for the earlier patch. "Caller" is responsible for
"freeing" those pages? Confusing, as worded.



Signed-off-by: Long Li 
---
  fs/cifs/cifsglob.h  |  2 +-
  fs/cifs/cifsproto.h |  2 ++
  fs/cifs/cifssmb.c   | 17 ++---
  3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index 56864a87..7f62c98 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -1205,7 +1205,7 @@ struct cifs_writedata {
unsigned inttailsz;
unsigned intcredits;
unsigned intnr_pages;
-   struct page *pages[];
+   struct page **pages;


Also same comment as for the earlier patch: these are syntactically
equivalent, so the change may not be needed.


  };
  
  /*

diff --git a/fs/cifs/cifsproto.h b/fs/cifs/cifsproto.h
index 1f27d8e..7933c5f 100644
--- a/fs/cifs/cifsproto.h
+++ b/fs/cifs/cifsproto.h
@@ -533,6 +533,8 @@ int cifs_async_writev(struct cifs_writedata *wdata,
  void cifs_writev_complete(struct work_struct *work);
  struct cifs_writedata *cifs_writedata_alloc(unsigned int nr_pages,
work_func_t complete);
+struct cifs_writedata *cifs_writedata_direct_alloc(struct page **pages,
+   work_func_t complete);
  void cifs_writedata_release(struct kref *refcount);
  int cifs_query_mf_symlink(unsigned int xid, struct cifs_tcon *tcon,
  struct cifs_sb_info *cifs_sb,
diff --git a/fs/cifs/cifssmb.c b/fs/cifs/cifssmb.c
index c8e4278..5aca336 100644
--- a/fs/cifs/cifssmb.c
+++ b/fs/cifs/cifssmb.c
@@ -1952,6 +1952,7 @@ cifs_writedata_release(struct kref *refcount)
if (wdata->cfile)
cifsFileInfo_put(wdata->cfile);
  
+	kvfree(wdata->pages);

kfree(wdata);
  }
  
@@ -2075,12 +2076,22 @@ cifs_writev_complete(struct work_struct *work)

  struct cifs_writedata *
  cifs_writedata_alloc(unsigned int nr_pages, work_func_t complete)
  {
+   struct page **pages =
+   kzalloc(sizeof(struct page *) * nr_pages, GFP_NOFS);


Why do you do a GFP_NOFS here but GFP_KERNEL in the earlier patch?


+   if (pages)
+   return cifs_writedata_direct_alloc(pages, complete);
+
+   return NULL;
+}
+
+struct cifs_writedata *
+cifs_writedata_direct_alloc(struct page **pages, work_func_t complete)
+{
struct cifs_writedata *wdata;
  
-	/* writedata + number of page pointers */

-   wdata = kzalloc(sizeof(*wdata) +
-   sizeof(struct page *) * nr_pages, GFP_NOFS);
+   wdata = kzalloc(sizeof(*wdata), GFP_NOFS);
if (wdata != NULL) {
+   wdata->pages = pages;
kref_init(&wdata->refcount);
INIT_LIST_HEAD(&wdata->list);
init_completion(&wdata->done);



Re: [Patch v2 03/15] CIFS: Use offset when reading pages

2018-06-23 Thread Tom Talpey

On 5/30/2018 3:47 PM, Long Li wrote:

From: Long Li 

With offset defined in rdata, transport functions need to look at this
offset when reading data into the correct places in pages.

Signed-off-by: Long Li 
---
  fs/cifs/cifsproto.h |  4 +++-
  fs/cifs/connect.c   |  5 +++--
  fs/cifs/file.c  | 52 +---
  fs/cifs/smb2ops.c   |  2 +-
  fs/cifs/smb2pdu.c   |  1 +
  5 files changed, 45 insertions(+), 19 deletions(-)

diff --git a/fs/cifs/cifsproto.h b/fs/cifs/cifsproto.h
index dc80f84..1f27d8e 100644
--- a/fs/cifs/cifsproto.h
+++ b/fs/cifs/cifsproto.h
@@ -203,7 +203,9 @@ extern void dequeue_mid(struct mid_q_entry *mid, bool 
malformed);
  extern int cifs_read_from_socket(struct TCP_Server_Info *server, char *buf,
 unsigned int to_read);
  extern int cifs_read_page_from_socket(struct TCP_Server_Info *server,
- struct page *page, unsigned int to_read);
+   struct page *page,
+   unsigned int page_offset,
+   unsigned int to_read);
  extern int cifs_setup_cifs_sb(struct smb_vol *pvolume_info,
   struct cifs_sb_info *cifs_sb);
  extern int cifs_match_super(struct super_block *, void *);
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 83b0234..8501da0 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -594,10 +594,11 @@ cifs_read_from_socket(struct TCP_Server_Info *server, 
char *buf,
  
  int

  cifs_read_page_from_socket(struct TCP_Server_Info *server, struct page *page,
- unsigned int to_read)
+   unsigned int page_offset, unsigned int to_read)
  {
struct msghdr smb_msg;
-   struct bio_vec bv = {.bv_page = page, .bv_len = to_read};
+   struct bio_vec bv = {
+   .bv_page = page, .bv_len = to_read, .bv_offset = page_offset};
iov_iter_bvec(&smb_msg.msg_iter, READ | ITER_BVEC, &bv, 1, to_read);
return cifs_readv_from_socket(server, &smb_msg);
  }
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 1c98293..87eece6 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3026,12 +3026,20 @@ uncached_fill_pages(struct TCP_Server_Info *server,
int result = 0;
unsigned int i;
unsigned int nr_pages = rdata->nr_pages;
+   unsigned int page_offset = rdata->page_offset;
  
  	rdata->got_bytes = 0;

rdata->tailsz = PAGE_SIZE;
for (i = 0; i < nr_pages; i++) {
struct page *page = rdata->pages[i];
size_t n;
+   unsigned int segment_size = rdata->pagesz;
+
+   if (i == 0)
+   segment_size -= page_offset;
+   else
+   page_offset = 0;
+
  
  		if (len <= 0) {

/* no need to hold page hostage */
@@ -3040,24 +3048,25 @@ uncached_fill_pages(struct TCP_Server_Info *server,
put_page(page);
continue;
}
+
n = len;
-   if (len >= PAGE_SIZE) {
+   if (len >= segment_size)
/* enough data to fill the page */
-   n = PAGE_SIZE;
-   len -= n;
-   } else {
-   zero_user(page, len, PAGE_SIZE - len);
+   n = segment_size;
+   else
rdata->tailsz = len;
-   len = 0;
-   }
+   len -= n;
+
if (iter)
-   result = copy_page_from_iter(page, 0, n, iter);
+   result = copy_page_from_iter(
+   page, page_offset, n, iter);
  #ifdef CONFIG_CIFS_SMB_DIRECT
else if (rdata->mr)
result = n;
  #endif


I hadn't noticed this type of conditional code before. It's kind of ugly
to modify the data handling code like this. Do you plan to break this
out into an smbdirect-specific handler, like the cases above and below?


else
-   result = cifs_read_page_from_socket(server, page, n);
+   result = cifs_read_page_from_socket(
+   server, page, page_offset, n);
if (result < 0)
break;
  
@@ -3130,6 +3139,7 @@ cifs_send_async_read(loff_t offset, size_t len, struct cifsFileInfo *open_file,

rdata->bytes = cur_len;
rdata->pid = pid;
rdata->pagesz = PAGE_SIZE;
+   rdata->tailsz = PAGE_SIZE;
rdata->read_into_pages = cifs_uncached_read_into_pages;
rdata->copy_into_pages = cifs_uncached_copy_into_pages;
rdata->credits = credits;
@@ -3574,6 +3584,7 @@ readpages_fill_pages(struct TCP_Server_Info *server,
u64 eof;

Re: [Patch v2 02/15] CIFS: Add support for direct pages in rdata

2018-06-23 Thread Tom Talpey

On 5/30/2018 3:47 PM, Long Li wrote:

From: Long Li 

Add a function to allocate rdata without allocating pages for data
transfer. This gives the caller an option to pass a number of pages
that point to the data buffer.

rdata is still reponsible for free those pages after it's done.


"Caller" is still responsible? Or is the rdata somehow freeing itself
via another mechanism?



Signed-off-by: Long Li 
---
  fs/cifs/cifsglob.h |  2 +-
  fs/cifs/file.c | 23 ---
  2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index 8d16c3e..56864a87 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -1179,7 +1179,7 @@ struct cifs_readdata {
unsigned inttailsz;
unsigned intcredits;
unsigned intnr_pages;
-   struct page *pages[];
+   struct page **pages;


Technically speaking, these are syntactically equivalent. It may not
be worth changing this historic definition.

Tom.



  };
  
  struct cifs_writedata;

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 23fd430..1c98293 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2880,13 +2880,13 @@ cifs_strict_writev(struct kiocb *iocb, struct iov_iter 
*from)
  }
  
  static struct cifs_readdata *

-cifs_readdata_alloc(unsigned int nr_pages, work_func_t complete)
+cifs_readdata_direct_alloc(struct page **pages, work_func_t complete)
  {
struct cifs_readdata *rdata;
  
-	rdata = kzalloc(sizeof(*rdata) + (sizeof(struct page *) * nr_pages),

-   GFP_KERNEL);
+   rdata = kzalloc(sizeof(*rdata), GFP_KERNEL);
if (rdata != NULL) {
+   rdata->pages = pages;
kref_init(&rdata->refcount);
INIT_LIST_HEAD(&rdata->list);
init_completion(&rdata->done);
@@ -2896,6 +2896,22 @@ cifs_readdata_alloc(unsigned int nr_pages, work_func_t 
complete)
return rdata;
  }
  
+static struct cifs_readdata *

+cifs_readdata_alloc(unsigned int nr_pages, work_func_t complete)
+{
+   struct page **pages =
+   kzalloc(sizeof(struct page *) * nr_pages, GFP_KERNEL);
+   struct cifs_readdata *ret = NULL;
+
+   if (pages) {
+   ret = cifs_readdata_direct_alloc(pages, complete);
+   if (!ret)
+   kfree(pages);
+   }
+
+   return ret;
+}
+
  void
  cifs_readdata_release(struct kref *refcount)
  {
@@ -2910,6 +2926,7 @@ cifs_readdata_release(struct kref *refcount)
if (rdata->cfile)
cifsFileInfo_put(rdata->cfile);
  
+	kvfree(rdata->pages);

kfree(rdata);
  }
  



Re: [RFC PATCH 09/09] Introduce cache=rdma moutning option

2018-05-18 Thread Tom Talpey

On 5/18/2018 1:58 PM, Long Li wrote:

Subject: Re: [RFC PATCH 09/09] Introduce cache=rdma moutning option

On Fri, May 18, 2018 at 12:00 PM, Long Li via samba-technical  wrote:

Subject: Re: [RFC PATCH 09/09] Introduce cache=rdma moutning option

On Thu, May 17, 2018 at 05:22:14PM -0700, Long Li wrote:

From: Long Li 

When cache=rdma is enabled on mount options, CIFS do not allocate
internal data buffer pages for I/O, data is read/writen directly to
user

memory via RDMA.

I don't think this should be an option.  For direct I/O without
signing or encryption CIFS should always use get_user_pages, with or

without RDMA.


Yes this should be done for all transport. If there are no objections, I'll send

patches to change this.

Would this help/change performance much?


On RDMA, it helps with I/O latency and reduces CPU usage on certain I/O 
patterns.

But I haven't tested on TCP. Maybe it will help a little bit.


Well, when the application requests direct i/o on a TCP connection,
you definitely don't want to cache it! So even if the performance
is different, correctness would dictate doing this.

You probably don't need to pin the buffer in the TCP case, which
might be worth avoiding.

Tom.


Re: [RFC PATCH 05/09] Change RDMA send to regonize page offset in the 1st page

2018-05-18 Thread Tom Talpey

On 5/17/2018 5:22 PM, Long Li wrote:

From: Long Li 


There's a typo in the patch title: "regonize" should be "recognize"



When doing RDMA send, the offset needs to be checked as data may start in an 
offset
in the 1st page.


Doesn't this patch alter the generic smb2pdu.c code too? I think this
should note "any" send, not just RDMA?

Tom.



Signed-off-by: Long Li 
---
  fs/cifs/smb2pdu.c   |  3 ++-
  fs/cifs/smbdirect.c | 25 +++--
  2 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index 5097f28..fdcf97e 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -3015,7 +3015,8 @@ smb2_async_writev(struct cifs_writedata *wdata,
  
  	rqst.rq_iov = iov;

rqst.rq_nvec = 2;
-   rqst.rq_pages = wdata->pages;
+   rqst.rq_pages = wdata->direct_pages ? wdata->direct_pages : 
wdata->pages;
+   rqst.rq_offset = wdata->page_offset;
rqst.rq_npages = wdata->nr_pages;
rqst.rq_pagesz = wdata->pagesz;
rqst.rq_tailsz = wdata->tailsz;
diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index b0a1955..b46586d 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -2084,8 +2084,10 @@ int smbd_send(struct smbd_connection *info, struct 
smb_rqst *rqst)
  
  	/* add in the page array if there is one */

if (rqst->rq_npages) {
-   buflen += rqst->rq_pagesz * (rqst->rq_npages - 1);
-   buflen += rqst->rq_tailsz;
+   if (rqst->rq_npages == 1)
+   buflen += rqst->rq_tailsz;
+   else
+   buflen += rqst->rq_pagesz * (rqst->rq_npages - 1) - 
rqst->rq_offset + rqst->rq_tailsz;
}
  
  	if (buflen + sizeof(struct smbd_data_transfer) >

@@ -2182,8 +2184,19 @@ int smbd_send(struct smbd_connection *info, struct 
smb_rqst *rqst)
  
  	/* now sending pages if there are any */

for (i = 0; i < rqst->rq_npages; i++) {
-   buflen = (i == rqst->rq_npages-1) ?
-   rqst->rq_tailsz : rqst->rq_pagesz;
+   unsigned int offset = 0;
+   if (i == 0)
+   offset = rqst->rq_offset;
+   if (rqst->rq_npages == 1 || i == rqst->rq_npages-1)
+   buflen = rqst->rq_tailsz;
+   else {
+   /* We have at least two pages, and this is not the last 
page */
+   if (i == 0)
+   buflen = rqst->rq_pagesz - rqst->rq_offset;
+   else
+   buflen = rqst->rq_pagesz;
+   }
+
nvecs = (buflen + max_iov_size - 1) / max_iov_size;
log_write(INFO, "sending pages buflen=%d nvecs=%d\n",
buflen, nvecs);
@@ -2194,9 +2207,9 @@ int smbd_send(struct smbd_connection *info, struct 
smb_rqst *rqst)
remaining_data_length -= size;
log_write(INFO, "sending pages i=%d offset=%d size=%d"
" remaining_data_length=%d\n",
-   i, j*max_iov_size, size, remaining_data_length);
+   i, j*max_iov_size+offset, size, 
remaining_data_length);
rc = smbd_post_send_page(
-   info, rqst->rq_pages[i], j*max_iov_size,
+   info, rqst->rq_pages[i], j*max_iov_size + 
offset,
size, remaining_data_length);
if (rc)
goto done;



Re: [RFC PATCH 00/09] Implement direct user I/O interfaces for RDMA

2018-05-18 Thread Tom Talpey

On 5/17/2018 11:03 PM, Long Li wrote:

Subject: Re: [RFC PATCH 00/09] Implement direct user I/O interfaces for
RDMA

On 5/17/2018 8:22 PM, Long Li wrote:

From: Long Li 

This patchset implements direct user I/O through RDMA.

In normal code path (even with cache=none), CIFS copies I/O data from
user-space to kernel-space for security reasons.

With this patchset, a new mounting option is introduced to have CIFS
pin the user-space buffer into memory and performs I/O through RDMA.
This avoids memory copy, at the cost of added security risk.


What's the security risk? This type of direct i/o behavior is not uncommon,
and can certainly be made safe, using the appropriate memory registration
and protection domains. Any risk needs to be stated explicitly, and mitigation
provided, or at least described.


I think it's an assumption that the user-mode buffer can't be trusted, so CIFS
always copies it into internal buffers, and calculates the signature and
encryption based on the protocol used.

With the direct buffer, the user can potentially modify the buffer while
signing or encryption is in progress, or after they are done.


I don't agree that the legacy copying behavior is because the buffer is
"untrusted". The buffer is the user's data, there's no trust issue here.
If the user application modifies the buffer while it's being sent, it's
a violation of the API contract, and the only victim is the application
itself. Same applies for receiving data. And as pointed out, most all
storage layers, file and block both, use this strategy for direct i/o.

Regarding signing, if the application alters the data then the integrity
hash will simply do its job and catch the application in the act. Again,
nothing suffers but the application.

Regarding encryption, I assume you're proposing to encrypt and decrypt
the data in a kernel buffer, effectively a copy. So in fact, in the
encryption case there's no need to pin and map the user buffer at all.

I'll mention however that Windows takes the path of not performing
RDMA placement when encrypting data. It saves nothing, and even adds
some overhead, because of the need to touch the buffer anyway to
manage the encryption/decryption.

Bottom line - no security implication for using user buffers directly.

Tom.



I also want to point out that I chose to implement .read_iter and .write_iter
from file_operations to implement direct I/O (CIFS already does this for
O_DIRECT, so following this code path avoids a big mess). The ideal choice is
to implement .direct_IO from address_space_operations, which I think we
eventually want to move to.



Tom.



This patchset is RFC. The work is in progress, do not merge.


Long Li (9):
Introduce offset for the 1st page in data transfer structures
Change wdata alloc to support direct pages
Change rdata alloc to support direct pages
Change function to support offset when reading pages
Change RDMA send to regonize page offset in the 1st page
Change RDMA recv to support offset in the 1st page
Support page offset in memory regsitrations
Implement no-copy file I/O interfaces
Introduce cache=rdma moutning option


   fs/cifs/cifs_fs_sb.h  |   2 +
   fs/cifs/cifsfs.c  |  19 +++
   fs/cifs/cifsfs.h  |   3 +
   fs/cifs/cifsglob.h|   6 +
   fs/cifs/cifsproto.h   |   4 +-
   fs/cifs/cifssmb.c |  10 +-
   fs/cifs/connect.c |  13 +-
   fs/cifs/dir.c |   5 +
   fs/cifs/file.c| 351

++

   fs/cifs/inode.c   |   4 +-
   fs/cifs/smb2ops.c |   2 +-
   fs/cifs/smb2pdu.c |  22 ++-
   fs/cifs/smbdirect.c   | 132 ++---
   fs/cifs/smbdirect.h   |   2 +-
   fs/read_write.c   |   7 +
   include/linux/ratelimit.h |   2 +-
   16 files changed, 489 insertions(+), 95 deletions(-)





Re: [RFC PATCH 02/09] Change wdata alloc to support direct pages

2018-05-18 Thread Tom Talpey

On 5/17/2018 5:22 PM, Long Li wrote:

From: Long Li 

When using direct pages from user space, there is no need to allocate pages.

Just ping those user pages for RDMA.


Did you mean "pin" those user pages? If so, where does that pinning
occur, it's not in this patch.

Perhaps this should just say "point to" those user pages.

I also don't think this is necessarily only "for RDMA". Perhaps
there are other transport scenarios where this is advantageous.




Signed-off-by: Long Li 
---
  fs/cifs/cifsproto.h |  2 +-
  fs/cifs/cifssmb.c   | 10 +++---
  fs/cifs/file.c  |  4 ++--
  3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/fs/cifs/cifsproto.h b/fs/cifs/cifsproto.h
index 365a414..94106b9 100644
--- a/fs/cifs/cifsproto.h
+++ b/fs/cifs/cifsproto.h
@@ -523,7 +523,7 @@ int cifs_readv_receive(struct TCP_Server_Info *server, 
struct mid_q_entry *mid);
  int cifs_async_writev(struct cifs_writedata *wdata,
  void (*release)(struct kref *kref));
  void cifs_writev_complete(struct work_struct *work);
-struct cifs_writedata *cifs_writedata_alloc(unsigned int nr_pages,
+struct cifs_writedata *cifs_writedata_alloc(unsigned int nr_pages, struct page 
**direct_pages,
work_func_t complete);
  void cifs_writedata_release(struct kref *refcount);
  int cifs_query_mf_symlink(unsigned int xid, struct cifs_tcon *tcon,
diff --git a/fs/cifs/cifssmb.c b/fs/cifs/cifssmb.c
index 1529a08..3b1731d 100644
--- a/fs/cifs/cifssmb.c
+++ b/fs/cifs/cifssmb.c
@@ -1983,7 +1983,7 @@ cifs_writev_requeue(struct cifs_writedata *wdata)
tailsz = rest_len - (nr_pages - 1) * PAGE_SIZE;
}
  
-		wdata2 = cifs_writedata_alloc(nr_pages, cifs_writev_complete);

+   wdata2 = cifs_writedata_alloc(nr_pages, NULL, 
cifs_writev_complete);
if (!wdata2) {
rc = -ENOMEM;
break;
@@ -2067,12 +2067,16 @@ cifs_writev_complete(struct work_struct *work)
  }
  
  struct cifs_writedata *

-cifs_writedata_alloc(unsigned int nr_pages, work_func_t complete)
+cifs_writedata_alloc(unsigned int nr_pages, struct page **direct_pages, 
work_func_t complete)
  {
struct cifs_writedata *wdata;
  
  	/* writedata + number of page pointers */

-   wdata = kzalloc(sizeof(*wdata) +
+   if (direct_pages) {
+   wdata = kzalloc(sizeof(*wdata), GFP_NOFS);
+   wdata->direct_pages = direct_pages;
+   } else
+   wdata = kzalloc(sizeof(*wdata) +
sizeof(struct page *) * nr_pages, GFP_NOFS);
if (wdata != NULL) {
kref_init(&wdata->refcount);
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 23fd430..a6ec896 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1965,7 +1965,7 @@ wdata_alloc_and_fillpages(pgoff_t tofind, struct 
address_space *mapping,
  {
struct cifs_writedata *wdata;
  
-	wdata = cifs_writedata_alloc((unsigned int)tofind,

+   wdata = cifs_writedata_alloc((unsigned int)tofind, NULL,
 cifs_writev_complete);
if (!wdata)
return NULL;
@@ -2554,7 +2554,7 @@ cifs_write_from_iter(loff_t offset, size_t len, struct 
iov_iter *from,
break;
  
nr_pages = get_numpages(wsize, len, &cur_len);

-   wdata = cifs_writedata_alloc(nr_pages,
+   wdata = cifs_writedata_alloc(nr_pages, NULL,
 cifs_uncached_writev_complete);
if (!wdata) {
rc = -ENOMEM;
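
For illustration only, a hedged sketch of where the pinning Tom asks about
could happen: take page references from the user iov_iter at the direct-I/O
entry point and pass the resulting array in as direct_pages. The helper name
and its placement are assumptions, not the series' actual code:

#include <linux/uio.h>
#include <linux/mm.h>

/* Sketch: grab up to 'maxpages' user pages backing 'from' and report the
 * byte offset of the data within the first page. The caller must
 * put_page() every returned page once the write completes. */
static ssize_t pin_user_iov_pages(struct iov_iter *from, struct page **pages,
                                  size_t maxsize, unsigned int maxpages,
                                  size_t *first_page_off)
{
        ssize_t bytes = iov_iter_get_pages(from, pages, maxsize, maxpages,
                                           first_page_off);

        if (bytes > 0)
                iov_iter_advance(from, bytes);
        return bytes;   /* < 0 on error, else bytes covered by 'pages' */
}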




Re: [RFC PATCH 00/09] Implement direct user I/O interfaces for RDMA

2018-05-17 Thread Tom Talpey

On 5/17/2018 8:22 PM, Long Li wrote:

From: Long Li 

This patchset implements direct user I/O through RDMA.

In normal code path (even with cache=none), CIFS copies I/O data from
user-space to kernel-space for security reasons.

With this patchset, a new mounting option is introduced to have CIFS pin the
user-space buffer into memory and performs I/O through RDMA. This avoids memory
copy, at the cost of added security risk.


What's the security risk? This type of direct i/o behavior is not
uncommon, and can certainly be made safe, using the appropriate
memory registration and protection domains. Any risk needs to be
stated explicitly, and mitigation provided, or at least described.

Tom.
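
To make the "appropriate memory registration" point concrete, a hedged
kernel-context sketch (illustrative name, not the smbdirect code): describing
a pinned page with the PD's local DMA lkey and mapping it DMA_TO_DEVICE never
grants the remote peer any access to the memory at all.

#include <linux/errno.h>
#include <rdma/ib_verbs.h>

/* Sketch: build a local send SGE from one pinned page. No remote-access
 * rights are ever granted; only the local HCA can read the page. */
static int sge_from_page(struct ib_device *dev, struct ib_pd *pd,
                         struct page *page, unsigned int off,
                         unsigned int len, struct ib_sge *sge)
{
        u64 addr = ib_dma_map_page(dev, page, off, len, DMA_TO_DEVICE);

        if (ib_dma_mapping_error(dev, addr))
                return -EIO;

        sge->addr   = addr;
        sge->length = len;
        sge->lkey   = pd->local_dma_lkey;
        return 0;
}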



This patchset is RFC. The work is in progress, do not merge.


Long Li (9):
   Introduce offset for the 1st page in data transfer structures
   Change wdata alloc to support direct pages
   Change rdata alloc to support direct pages
   Change function to support offset when reading pages
   Change RDMA send to regonize page offset in the 1st page
   Change RDMA recv to support offset in the 1st page
   Support page offset in memory regsitrations
   Implement no-copy file I/O interfaces
   Introduce cache=rdma moutning option
  


  fs/cifs/cifs_fs_sb.h  |   2 +
  fs/cifs/cifsfs.c  |  19 +++
  fs/cifs/cifsfs.h  |   3 +
  fs/cifs/cifsglob.h|   6 +
  fs/cifs/cifsproto.h   |   4 +-
  fs/cifs/cifssmb.c |  10 +-
  fs/cifs/connect.c |  13 +-
  fs/cifs/dir.c |   5 +
  fs/cifs/file.c| 351 ++
  fs/cifs/inode.c   |   4 +-
  fs/cifs/smb2ops.c |   2 +-
  fs/cifs/smb2pdu.c |  22 ++-
  fs/cifs/smbdirect.c   | 132 ++---
  fs/cifs/smbdirect.h   |   2 +-
  fs/read_write.c   |   7 +
  include/linux/ratelimit.h |   2 +-
  16 files changed, 489 insertions(+), 95 deletions(-)




RE: [PATCH v5] cifs: Allocate validate negotiation request through kmalloc

2018-04-26 Thread Tom Talpey
> -Original Message-
> From: linux-cifs-ow...@vger.kernel.org <linux-cifs-ow...@vger.kernel.org> On
> Behalf Of Long Li
> Sent: Wednesday, April 25, 2018 2:30 PM
> To: Steve French <sfre...@samba.org>; linux-c...@vger.kernel.org; samba-
> techni...@lists.samba.org; linux-kernel@vger.kernel.org; linux-
> r...@vger.kernel.org
> Cc: Long Li <lon...@microsoft.com>
> Subject: [PATCH v5] cifs: Allocate validate negotiation request through 
> kmalloc
> 
> From: Long Li <lon...@microsoft.com>
> 
> The data buffer allocated on the stack can't be DMA'ed, ib_dma_map_page will
> return an invalid DMA address for a buffer on stack. Even worse, this
> incorrect address can't be detected by ib_dma_mapping_error. Sending data
> from this address to hardware will not fail, but the remote peer will get
> junk data.
> 
> Fix this by allocating the request on the heap in smb3_validate_negotiate.
> 

Looks good.

Reviewed-By: Tom Talpey <ttal...@microsoft.com>
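
As a hedged sketch of the pattern this fix depends on (kernel context,
illustrative function name, not the cifs code): the request must come from the
heap so ib_dma_map_single() is handed a linear-map address it can translate;
for an on-stack buffer the mapping may "succeed" with a garbage address that
ib_dma_mapping_error() cannot catch.

#include <linux/slab.h>
#include <rdma/ib_verbs.h>

/* Sketch: allocate a request buffer on the heap and DMA-map it for a send. */
static void *alloc_and_map_req(struct ib_device *dev, size_t len, u64 *dma)
{
        void *req = kmalloc(len, GFP_NOFS);

        if (!req)
                return NULL;

        *dma = ib_dma_map_single(dev, req, len, DMA_TO_DEVICE);
        if (ib_dma_mapping_error(dev, *dma)) {
                kfree(req);
                return NULL;
        }
        return req;
}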
 




Re: [Patch v4] cifs: Allocate validate negotiation request through kmalloc

2018-04-20 Thread Tom Talpey

On 4/20/2018 2:41 PM, Long Li wrote:

Subject: Re: [Patch v4] cifs: Allocate validate negotiation request through
kmalloc

Looks good, but I have two possibly style-related comments.

On 4/19/2018 5:38 PM, Long Li wrote:

From: Long Li 

The data buffer allocated on the stack can't be DMA'ed,
ib_dma_map_page will return an invalid DMA address for a buffer on
stack. Even worse, this incorrect address can't be detected by
ib_dma_mapping_error. Sending data from this address to hardware will
not fail, but the remote peer will get junk data.

Fix this by allocating the request on the heap in smb3_validate_negotiate.

Changes in v2:
Removed duplicated code on freeing buffers on function exit.
(Thanks to Parav Pandit ) Fixed typo in the patch
title.

Changes in v3:
Added "Fixes" to the patch.
Changed several sizeof() to use *pointer in place of struct.

Changes in v4:
Added detailed comments on the failure through RDMA.
Allocate request buffer using GPF_NOFS.
Fixed possible memory leak.

Fixes: ff1c038addc4 ("Check SMB3 dialects against downgrade attacks")
Signed-off-by: Long Li 
---
   fs/cifs/smb2pdu.c | 61 ++--

---

   1 file changed, 33 insertions(+), 28 deletions(-)

diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c index
0f044c4..caa2a1e 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -729,8 +729,8 @@ SMB2_negotiate(const unsigned int xid, struct
cifs_ses *ses)

   int smb3_validate_negotiate(const unsigned int xid, struct cifs_tcon *tcon)
   {
-   int rc = 0;
-   struct validate_negotiate_info_req vneg_inbuf;
+   int ret, rc = -EIO;


Seems awkward to have "rc" and "ret", and based on the code below I don't
see why two variables are needed. Simplify? (see later comment)


This is to address a prior comment about reducing duplicate code. All the
failure paths (after issuing I/O) return -EIO; there are 5 of them. Setting rc
to -EIO at the beginning means we don't need to set it multiple times.


Well, ok but now there are semi-duplicate and rather confusing "rc"
and "ret" local variables, only one of which is actually returned.

How about a "goto err" with unconditional -EIO, and just leave the
"return 0" path alone, like it was? That would be much clearer IMO.

As-is, I needed to read it several times to convince myself the right
rc was returned.

Tom,
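
The shape being suggested, as a small self-contained sketch (plain user-space
C with placeholder checks; not the actual smb3_validate_negotiate()):

#include <errno.h>
#include <stdlib.h>

/* Sketch: one rc, success path untouched, every post-I/O failure funnels
 * through a single label that returns -EIO. */
static int validate_sketch(void)
{
        void *buf = malloc(64);         /* stands in for pneg_inbuf */

        if (!buf)
                return -ENOMEM;

        if (0 /* placeholder: reply length mismatch */)
                goto err;
        if (0 /* placeholder: dialect or security-mode mismatch */)
                goto err;

        free(buf);
        return 0;                       /* success path left as it was */

err:
        free(buf);
        return -EIO;
}

int main(void)
{
        return validate_sketch() ? 1 : 0;
}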







+   struct validate_negotiate_info_req *pneg_inbuf;
struct validate_negotiate_info_rsp *pneg_rsp = NULL;
u32 rsplen;
u32 inbuflen; /* max of 4 dialects */ @@ -741,7 +741,6 @@ int
smb3_validate_negotiate(const unsigned int xid, struct cifs_tcon *tcon)
if (tcon->ses->server->rdma)
return 0;
   #endif
-
/* In SMB3.11 preauth integrity supersedes validate negotiate */
if (tcon->ses->server->dialect == SMB311_PROT_ID)
return 0;
@@ -764,63 +763,67 @@ int smb3_validate_negotiate(const unsigned int

xid, struct cifs_tcon *tcon)

if (tcon->ses->session_flags & SMB2_SESSION_FLAG_IS_NULL)
cifs_dbg(VFS, "Unexpected null user (anonymous) auth flag

sent by

server\n");

-   vneg_inbuf.Capabilities =
+   pneg_inbuf = kmalloc(sizeof(*pneg_inbuf), GFP_NOFS);
+   if (!pneg_inbuf)
+   return -ENOMEM;
+
+   pneg_inbuf->Capabilities =
cpu_to_le32(tcon->ses->server->vals->req_capabilities);
-   memcpy(vneg_inbuf.Guid, tcon->ses->server->client_guid,
+   memcpy(pneg_inbuf->Guid, tcon->ses->server->client_guid,
SMB2_CLIENT_GUID_SIZE);

if (tcon->ses->sign)
-   vneg_inbuf.SecurityMode =
+   pneg_inbuf->SecurityMode =


cpu_to_le16(SMB2_NEGOTIATE_SIGNING_REQUIRED);

else if (global_secflags & CIFSSEC_MAY_SIGN)
-   vneg_inbuf.SecurityMode =
+   pneg_inbuf->SecurityMode =


cpu_to_le16(SMB2_NEGOTIATE_SIGNING_ENABLED);

else
-   vneg_inbuf.SecurityMode = 0;
+   pneg_inbuf->SecurityMode = 0;


if (strcmp(tcon->ses->server->vals->version_string,
SMB3ANY_VERSION_STRING) == 0) {
-   vneg_inbuf.Dialects[0] = cpu_to_le16(SMB30_PROT_ID);
-   vneg_inbuf.Dialects[1] = cpu_to_le16(SMB302_PROT_ID);
-   vneg_inbuf.DialectCount = cpu_to_le16(2);
+   pneg_inbuf->Dialects[0] = cpu_to_le16(SMB30_PROT_ID);
+   pneg_inbuf->Dialects[1] = cpu_to_le16(SMB302_PROT_ID);
+   pneg_inbuf->DialectCount = cpu_to_le16(2);
/* structure is big enough for 3 dialects, sending only 2 */
inbuflen = sizeof(struct validate_negotiate_info_req) - 2;


The "- 2" is a little confusing here. This was existing code, but I suggest you
change this to a sizeof (u16) construct for consistency.
It's reducing by the length of a single Dialects[n] entry.


 


Re: [Patch v4] cifs: Allocate validate negotiation request through kmalloc

2018-04-20 Thread Tom Talpey

Looks good, but I have two possibly style-related comments.

On 4/19/2018 5:38 PM, Long Li wrote:

From: Long Li 

The data buffer allocated on the stack can't be DMA'ed, ib_dma_map_page will
return an invalid DMA address for a buffer on stack. Even worse, this
incorrect address can't be detected by ib_dma_mapping_error. Sending data
from this address to hardware will not fail, but the remote peer will get
junk data.

Fix this by allocating the request on the heap in smb3_validate_negotiate.

Changes in v2:
Removed duplicated code on freeing buffers on function exit.
(Thanks to Parav Pandit )
Fixed typo in the patch title.

Changes in v3:
Added "Fixes" to the patch.
Changed several sizeof() to use *pointer in place of struct.

Changes in v4:
Added detailed comments on the failure through RDMA.
Allocate request buffer using GPF_NOFS.
Fixed possible memory leak.

Fixes: ff1c038addc4 ("Check SMB3 dialects against downgrade attacks")
Signed-off-by: Long Li 
---
  fs/cifs/smb2pdu.c | 61 ++-
  1 file changed, 33 insertions(+), 28 deletions(-)

diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index 0f044c4..caa2a1e 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -729,8 +729,8 @@ SMB2_negotiate(const unsigned int xid, struct cifs_ses *ses)
  
  int smb3_validate_negotiate(const unsigned int xid, struct cifs_tcon *tcon)

  {
-   int rc = 0;
-   struct validate_negotiate_info_req vneg_inbuf;
+   int ret, rc = -EIO;


Seems awkward to have "rc" and "ret", and based on the code below I
don't see why two variables are needed. Simplify? (see later comment)


+   struct validate_negotiate_info_req *pneg_inbuf;
struct validate_negotiate_info_rsp *pneg_rsp = NULL;
u32 rsplen;
u32 inbuflen; /* max of 4 dialects */
@@ -741,7 +741,6 @@ int smb3_validate_negotiate(const unsigned int xid, struct 
cifs_tcon *tcon)
if (tcon->ses->server->rdma)
return 0;
  #endif
-
/* In SMB3.11 preauth integrity supersedes validate negotiate */
if (tcon->ses->server->dialect == SMB311_PROT_ID)
return 0;
@@ -764,63 +763,67 @@ int smb3_validate_negotiate(const unsigned int xid, 
struct cifs_tcon *tcon)
if (tcon->ses->session_flags & SMB2_SESSION_FLAG_IS_NULL)
cifs_dbg(VFS, "Unexpected null user (anonymous) auth flag sent by 
server\n");
  
-	vneg_inbuf.Capabilities =

+   pneg_inbuf = kmalloc(sizeof(*pneg_inbuf), GFP_NOFS);
+   if (!pneg_inbuf)
+   return -ENOMEM;
+
+   pneg_inbuf->Capabilities =
cpu_to_le32(tcon->ses->server->vals->req_capabilities);
-   memcpy(vneg_inbuf.Guid, tcon->ses->server->client_guid,
+   memcpy(pneg_inbuf->Guid, tcon->ses->server->client_guid,
SMB2_CLIENT_GUID_SIZE);
  
  	if (tcon->ses->sign)

-   vneg_inbuf.SecurityMode =
+   pneg_inbuf->SecurityMode =
cpu_to_le16(SMB2_NEGOTIATE_SIGNING_REQUIRED);
else if (global_secflags & CIFSSEC_MAY_SIGN)
-   vneg_inbuf.SecurityMode =
+   pneg_inbuf->SecurityMode =
cpu_to_le16(SMB2_NEGOTIATE_SIGNING_ENABLED);
else
-   vneg_inbuf.SecurityMode = 0;
+   pneg_inbuf->SecurityMode = 0;
  
  
  	if (strcmp(tcon->ses->server->vals->version_string,

SMB3ANY_VERSION_STRING) == 0) {
-   vneg_inbuf.Dialects[0] = cpu_to_le16(SMB30_PROT_ID);
-   vneg_inbuf.Dialects[1] = cpu_to_le16(SMB302_PROT_ID);
-   vneg_inbuf.DialectCount = cpu_to_le16(2);
+   pneg_inbuf->Dialects[0] = cpu_to_le16(SMB30_PROT_ID);
+   pneg_inbuf->Dialects[1] = cpu_to_le16(SMB302_PROT_ID);
+   pneg_inbuf->DialectCount = cpu_to_le16(2);
/* structure is big enough for 3 dialects, sending only 2 */
inbuflen = sizeof(struct validate_negotiate_info_req) - 2;


The "- 2" is a little confusing here. This was existing code, but I
suggest you change this to a sizeof (u16) construct for consistency.
It's reducing by the length of a single Dialects[n] entry.
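
A self-contained illustration of the suggested sizeof construct, using a
simplified stand-in struct (not the real validate_negotiate_info_req layout):

#include <stdio.h>
#include <stdint.h>

struct vneg_req_sketch {
        uint32_t Capabilities;
        uint8_t  Guid[16];
        uint16_t SecurityMode;
        uint16_t DialectCount;
        uint16_t Dialects[3];   /* room for 3 dialects */
};

int main(void)
{
        struct vneg_req_sketch req;

        /* sending only 2 of the 3 dialect slots: spell the "- 2" as the
         * size of one Dialects[] entry instead of a magic number */
        size_t inbuflen = sizeof(req) - sizeof(req.Dialects[0]);

        printf("inbuflen = %zu\n", inbuflen);
        return 0;
}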


} else if (strcmp(tcon->ses->server->vals->version_string,
SMBDEFAULT_VERSION_STRING) == 0) {
-   vneg_inbuf.Dialects[0] = cpu_to_le16(SMB21_PROT_ID);
-   vneg_inbuf.Dialects[1] = cpu_to_le16(SMB30_PROT_ID);
-   vneg_inbuf.Dialects[2] = cpu_to_le16(SMB302_PROT_ID);
-   vneg_inbuf.DialectCount = cpu_to_le16(3);
+   pneg_inbuf->Dialects[0] = cpu_to_le16(SMB21_PROT_ID);
+   pneg_inbuf->Dialects[1] = cpu_to_le16(SMB30_PROT_ID);
+   pneg_inbuf->Dialects[2] = cpu_to_le16(SMB302_PROT_ID);
+   pneg_inbuf->DialectCount = cpu_to_le16(3);
/* structure is big enough for 3 


Re: [Patch v3 2/6] cifs: Allocate validate negotiation request through kmalloc

2018-04-18 Thread Tom Talpey

On 4/18/2018 1:16 PM, Long Li wrote:

Subject: Re: [Patch v3 2/6] cifs: Allocate validate negotiation request through
kmalloc

Two comments.

On 4/17/2018 8:33 PM, Long Li wrote:

From: Long Li 

The data buffer allocated on the stack can't be DMA'ed, and hence
can't send through RDMA via SMB Direct.


This comment is confusing. Any registered memory can be DMA'd, need to
state the reason for the choice here more clearly.



Fix this by allocating the request on the heap in smb3_validate_negotiate.

Changes in v2:
Removed duplicated code on freeing buffers on function exit.
(Thanks to Parav Pandit ) Fixed typo in the patch
title.

Changes in v3:
Added "Fixes" to the patch.
Changed sizeof() to use *pointer in place of struct.

Fixes: ff1c038addc4 ("Check SMB3 dialects against downgrade attacks")
Signed-off-by: Long Li 
Cc: sta...@vger.kernel.org
---
   fs/cifs/smb2pdu.c | 59 ++--

---

   1 file changed, 32 insertions(+), 27 deletions(-)

diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c index
0f044c4..5582a02 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -729,8 +729,8 @@ SMB2_negotiate(const unsigned int xid, struct
cifs_ses *ses)

   int smb3_validate_negotiate(const unsigned int xid, struct cifs_tcon *tcon)
   {
-   int rc = 0;
-   struct validate_negotiate_info_req vneg_inbuf;
+   int ret, rc = -EIO;
+   struct validate_negotiate_info_req *pneg_inbuf;
struct validate_negotiate_info_rsp *pneg_rsp = NULL;
u32 rsplen;
u32 inbuflen; /* max of 4 dialects */ @@ -741,6 +741,9 @@ int
smb3_validate_negotiate(const unsigned int xid, struct cifs_tcon *tcon)
if (tcon->ses->server->rdma)
return 0;
   #endif
+   pneg_inbuf = kmalloc(sizeof(*pneg_inbuf), GFP_KERNEL);
+   if (!pneg_inbuf)
+   return -ENOMEM;


Why is this a nonblocking allocation? It would seem more robust to use
GFP_NOFS here.


I agree it makes sense to use GFP_NOFS.

The choice here was made to be consistent with the rest of the CIFS code that
allocates protocol request buffers. Maybe we should do another patch to clean
up all that code.


It'll be required sooner or later. I'm agnostic as to how you apply it,
but I still suggest you change this one now rather than continue the
fragile behavior. It may not be a global search-and-replace since some
allocations may require nonblocking.






Tom.



/* In SMB3.11 preauth integrity supersedes validate negotiate */
if (tcon->ses->server->dialect == SMB311_PROT_ID) @@ -764,63
+767,63 @@ int smb3_validate_negotiate(const unsigned int xid, struct

cifs_tcon *tcon)

if (tcon->ses->session_flags & SMB2_SESSION_FLAG_IS_NULL)
cifs_dbg(VFS, "Unexpected null user (anonymous) auth flag

sent by

server\n");

-   vneg_inbuf.Capabilities =
+   pneg_inbuf->Capabilities =
cpu_to_le32(tcon->ses->server->vals->req_capabilities);
-   memcpy(vneg_inbuf.Guid, tcon->ses->server->client_guid,
+   memcpy(pneg_inbuf->Guid, tcon->ses->server->client_guid,
SMB2_CLIENT_GUID_SIZE);

if (tcon->ses->sign)
-   vneg_inbuf.SecurityMode =
+   pneg_inbuf->SecurityMode =


cpu_to_le16(SMB2_NEGOTIATE_SIGNING_REQUIRED);

else if (global_secflags & CIFSSEC_MAY_SIGN)
-   vneg_inbuf.SecurityMode =
+   pneg_inbuf->SecurityMode =


cpu_to_le16(SMB2_NEGOTIATE_SIGNING_ENABLED);

else
-   vneg_inbuf.SecurityMode = 0;
+   pneg_inbuf->SecurityMode = 0;


if (strcmp(tcon->ses->server->vals->version_string,
SMB3ANY_VERSION_STRING) == 0) {
-   vneg_inbuf.Dialects[0] = cpu_to_le16(SMB30_PROT_ID);
-   vneg_inbuf.Dialects[1] = cpu_to_le16(SMB302_PROT_ID);
-   vneg_inbuf.DialectCount = cpu_to_le16(2);
+   pneg_inbuf->Dialects[0] = cpu_to_le16(SMB30_PROT_ID);
+   pneg_inbuf->Dialects[1] = cpu_to_le16(SMB302_PROT_ID);
+   pneg_inbuf->DialectCount = cpu_to_le16(2);
/* structure is big enough for 3 dialects, sending only 2 */
inbuflen = sizeof(struct validate_negotiate_info_req) - 2;
} else if (strcmp(tcon->ses->server->vals->version_string,
SMBDEFAULT_VERSION_STRING) == 0) {
-   vneg_inbuf.Dialects[0] = cpu_to_le16(SMB21_PROT_ID);
-   vneg_inbuf.Dialects[1] = cpu_to_le16(SMB30_PROT_ID);
-   vneg_inbuf.Dialects[2] = cpu_to_le16(SMB302_PROT_ID);
-   vneg_inbuf.DialectCount = cpu_to_le16(3);
+   pneg_inbuf->Dialects[0] = cpu_to_le16(SMB21_PROT_ID);
+   pneg_inbuf->Dialects[1] = cpu_to_le16(SMB30_PROT_ID);
+   pneg_inbuf->Dialects[2] = cpu_to_le16(SMB302_PROT_ID);
+   pneg_inbuf->DialectCount = 


Re: [Patch v3 2/6] cifs: Allocate validate negotiation request through kmalloc

2018-04-18 Thread Tom Talpey

On 4/18/2018 1:11 PM, Long Li wrote:

Subject: Re: [Patch v3 2/6] cifs: Allocate validate negotiation request through
kmalloc

On 4/18/2018 9:08 AM, David Laight wrote:

From: Tom Talpey

Sent: 18 April 2018 12:32

...

On 4/17/2018 8:33 PM, Long Li wrote:

From: Long Li <lon...@microsoft.com>

The data buffer allocated on the stack can't be DMA'ed, and hence
can't send through RDMA via SMB Direct.


This comment is confusing. Any registered memory can be DMA'd, need
to state the reason for the choice here more clearly.


The stack could be allocated with vmalloc().
In which case the pages might not be physically contiguous and there
is no
(sensible) call to get the physical address required by the dma
controller (or other bus master).


Memory registration does not require pages to be physically contiguous.
RDMA Regions can and do support very large physical page scatter/gather,
and the adapter DMA's them readily. Is this the only reason?


ib_dma_map_page will return an invalid DMA address for a buffer on stack. Even 
worse, this incorrect address can't be detected by ib_dma_mapping_error. 
Sending data from this address to hardware will not fail, but the remote peer 
will get junk data.

I think it makes sense, as the stack is dynamic and can shrink as I/O
proceeds, so the buffer is gone. Other kernel code uses only data on the heap
for DMA; e.g., the block/SCSI layers never use a buffer on the stack to send
data.


I totally agree that registering the stack is a bad idea. I mainly
suggest that you capture these fundamental ib_dma* reasons in the
commit. There's no other practical reason why the original approach
would not work.

Tom.



Re: [Patch v3 2/6] cifs: Allocate validate negotiation request through kmalloc

2018-04-18 Thread Tom Talpey

On 4/18/2018 9:08 AM, David Laight wrote:

From: Tom Talpey

Sent: 18 April 2018 12:32

...

On 4/17/2018 8:33 PM, Long Li wrote:

From: Long Li <lon...@microsoft.com>

The data buffer allocated on the stack can't be DMA'ed, and hence can't send
through RDMA via SMB Direct.


This comment is confusing. Any registered memory can be DMA'd, need to
state the reason for the choice here more clearly.


The stack could be allocated with vmalloc().
In which case the pages might not be physically contiguous and there is no
(sensible) call to get the physical address required by the dma controller
(or other bus master).


Memory registration does not require pages to be physically contiguous.
RDMA Regions can and do support very large physical page scatter/gather,
and the adapter DMA's them readily. Is this the only reason?

Tom.



Re: [Patch v3 2/6] cifs: Allocate validate negotiation request through kmalloc

2018-04-18 Thread Tom Talpey

Two comments.

On 4/17/2018 8:33 PM, Long Li wrote:

From: Long Li 

The data buffer allocated on the stack can't be DMA'ed, and hence can't send
through RDMA via SMB Direct.


This comment is confusing. Any registered memory can be DMA'd, need to
state the reason for the choice here more clearly.



Fix this by allocating the request on the heap in smb3_validate_negotiate.

Changes in v2:
Removed duplicated code on freeing buffers on function exit.
(Thanks to Parav Pandit )
Fixed typo in the patch title.

Changes in v3:
Added "Fixes" to the patch.
Changed sizeof() to use *pointer in place of struct.

Fixes: ff1c038addc4 ("Check SMB3 dialects against downgrade attacks")
Signed-off-by: Long Li 
Cc: sta...@vger.kernel.org
---
  fs/cifs/smb2pdu.c | 59 ++-
  1 file changed, 32 insertions(+), 27 deletions(-)

diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index 0f044c4..5582a02 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -729,8 +729,8 @@ SMB2_negotiate(const unsigned int xid, struct cifs_ses *ses)
  
  int smb3_validate_negotiate(const unsigned int xid, struct cifs_tcon *tcon)

  {
-   int rc = 0;
-   struct validate_negotiate_info_req vneg_inbuf;
+   int ret, rc = -EIO;
+   struct validate_negotiate_info_req *pneg_inbuf;
struct validate_negotiate_info_rsp *pneg_rsp = NULL;
u32 rsplen;
u32 inbuflen; /* max of 4 dialects */
@@ -741,6 +741,9 @@ int smb3_validate_negotiate(const unsigned int xid, struct 
cifs_tcon *tcon)
if (tcon->ses->server->rdma)
return 0;
  #endif
+   pneg_inbuf = kmalloc(sizeof(*pneg_inbuf), GFP_KERNEL);
+   if (!pneg_inbuf)
+   return -ENOMEM;


Why is this a nonblocking allocation? It would seem more robust to
use GFP_NOFS here.

Tom.

  
  	/* In SMB3.11 preauth integrity supersedes validate negotiate */

if (tcon->ses->server->dialect == SMB311_PROT_ID)
@@ -764,63 +767,63 @@ int smb3_validate_negotiate(const unsigned int xid, 
struct cifs_tcon *tcon)
if (tcon->ses->session_flags & SMB2_SESSION_FLAG_IS_NULL)
cifs_dbg(VFS, "Unexpected null user (anonymous) auth flag sent by 
server\n");
  
-	vneg_inbuf.Capabilities =

+   pneg_inbuf->Capabilities =
cpu_to_le32(tcon->ses->server->vals->req_capabilities);
-   memcpy(vneg_inbuf.Guid, tcon->ses->server->client_guid,
+   memcpy(pneg_inbuf->Guid, tcon->ses->server->client_guid,
SMB2_CLIENT_GUID_SIZE);
  
  	if (tcon->ses->sign)

-   vneg_inbuf.SecurityMode =
+   pneg_inbuf->SecurityMode =
cpu_to_le16(SMB2_NEGOTIATE_SIGNING_REQUIRED);
else if (global_secflags & CIFSSEC_MAY_SIGN)
-   vneg_inbuf.SecurityMode =
+   pneg_inbuf->SecurityMode =
cpu_to_le16(SMB2_NEGOTIATE_SIGNING_ENABLED);
else
-   vneg_inbuf.SecurityMode = 0;
+   pneg_inbuf->SecurityMode = 0;
  
  
  	if (strcmp(tcon->ses->server->vals->version_string,

SMB3ANY_VERSION_STRING) == 0) {
-   vneg_inbuf.Dialects[0] = cpu_to_le16(SMB30_PROT_ID);
-   vneg_inbuf.Dialects[1] = cpu_to_le16(SMB302_PROT_ID);
-   vneg_inbuf.DialectCount = cpu_to_le16(2);
+   pneg_inbuf->Dialects[0] = cpu_to_le16(SMB30_PROT_ID);
+   pneg_inbuf->Dialects[1] = cpu_to_le16(SMB302_PROT_ID);
+   pneg_inbuf->DialectCount = cpu_to_le16(2);
/* structure is big enough for 3 dialects, sending only 2 */
inbuflen = sizeof(struct validate_negotiate_info_req) - 2;
} else if (strcmp(tcon->ses->server->vals->version_string,
SMBDEFAULT_VERSION_STRING) == 0) {
-   vneg_inbuf.Dialects[0] = cpu_to_le16(SMB21_PROT_ID);
-   vneg_inbuf.Dialects[1] = cpu_to_le16(SMB30_PROT_ID);
-   vneg_inbuf.Dialects[2] = cpu_to_le16(SMB302_PROT_ID);
-   vneg_inbuf.DialectCount = cpu_to_le16(3);
+   pneg_inbuf->Dialects[0] = cpu_to_le16(SMB21_PROT_ID);
+   pneg_inbuf->Dialects[1] = cpu_to_le16(SMB30_PROT_ID);
+   pneg_inbuf->Dialects[2] = cpu_to_le16(SMB302_PROT_ID);
+   pneg_inbuf->DialectCount = cpu_to_le16(3);
/* structure is big enough for 3 dialects */
inbuflen = sizeof(struct validate_negotiate_info_req);
} else {
/* otherwise specific dialect was requested */
-   vneg_inbuf.Dialects[0] =
+   pneg_inbuf->Dialects[0] =
cpu_to_le16(tcon->ses->server->vals->protocol_id);
-   vneg_inbuf.DialectCount = cpu_to_le16(1);
+   pneg_inbuf->DialectCount = cpu_to_le16(1);
		/* structure is big enough for 3 dialects, sending only 1 */
		inbuflen = sizeof(struct 
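
On the GFP_NOFS question raised in the review above, a minimal sketch of the
two usual ways a filesystem path can keep an allocation from recursing into
filesystem reclaim; both helpers are hypothetical and not the code that was
merged.

#include <linux/slab.h>
#include <linux/sched/mm.h>

static void *alloc_req_flagged(size_t len)
{
        /* One-shot: pass GFP_NOFS for this allocation alone. */
        return kmalloc(len, GFP_NOFS);
}

static void *alloc_req_scoped(size_t len)
{
        unsigned int nofs_flags;
        void *buf;

        /*
         * Scoped: every allocation between save and restore behaves as
         * GFP_NOFS, even when GFP_KERNEL is passed.
         */
        nofs_flags = memalloc_nofs_save();
        buf = kmalloc(len, GFP_KERNEL);
        memalloc_nofs_restore(nofs_flags);
        return buf;
}

Either form avoids the reclaim re-entry that a plain GFP_KERNEL allocation
permits in a filesystem code path, which is what the review comment is
getting at.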
