Re: [PATCH] net: hns3: Fix an error handling path in 'hclge_rss_init_hw()'

2017-10-01 Thread David Miller
From: Christophe JAILLET 
Date: Sat, 30 Sep 2017 07:34:34 +0200

> If this sanity check fails, we must free 'rss_indir'. Otherwise there is a
> memory leak.
> 'goto err' as done in the other error handling paths to fix it.
> 
> Fixes: 46a3df9f9718 ("net: hns3: Fix for setting rss_size incorrectly")
> Signed-off-by: Christophe JAILLET 

Applied.
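
For anyone reading the archive without the driver source at hand, the leak and the 'goto err' fix can be sketched in userspace C. This is only an illustration under stated assumptions: demo_rss_init(), DEMO_MAX_RSS_SIZE, DEMO_INDIR_SIZE and the allocation counter are hypothetical names, not code taken from hns3.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical limits for the demo; not taken from the hns3 driver. */
#define DEMO_MAX_RSS_SIZE 16
#define DEMO_INDIR_SIZE   64

/* Counts outstanding buffers so a leak would be visible to the caller. */
static int demo_live_allocs;

static void *demo_alloc(size_t n)
{
    void *p = calloc(1, n);

    if (p)
        demo_live_allocs++;
    return p;
}

static void demo_free(void *p)
{
    if (p)
        demo_live_allocs--;
    free(p);
}

/* Returns 0 or a negative errno; every exit after the allocation frees it. */
static int demo_rss_init(unsigned int rss_size)
{
    unsigned short *rss_indir;
    unsigned int i;
    int ret = 0;

    rss_indir = demo_alloc(DEMO_INDIR_SIZE * sizeof(*rss_indir));
    if (!rss_indir)
        return -ENOMEM;

    /*
     * Sanity check. Returning directly here would leak rss_indir,
     * so jump to the shared cleanup label instead.
     */
    if (rss_size == 0 || rss_size > DEMO_MAX_RSS_SIZE) {
        ret = -EINVAL;
        goto err;
    }

    for (i = 0; i < DEMO_INDIR_SIZE; i++)
        rss_indir[i] = (unsigned short)(i % rss_size);

err:
    demo_free(rss_indir);
    return ret;
}
```

Calling demo_rss_init(0) returns -EINVAL and leaves demo_live_allocs at 0, because the failed check goes through the same cleanup label as every other exit path.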


Re: [PATCH net] RDS: IB: Limit the scope of has_fr/has_fmr variables

2017-10-01 Thread David Miller
From: David Miller 
Date: Sun, 01 Oct 2017 22:54:19 -0700 (PDT)

> From: Avinash Repaka 
> Date: Fri, 29 Sep 2017 18:13:50 -0700
> 
>> This patch fixes the scope of has_fr and has_fmr variables as they are
>> needed only in rds_ib_add_one().
>> 
>> Signed-off-by: Avinash Repaka 
> 
> Applied.

Actually, reverted, this breaks the build.

net/rds/rdma_transport.c:38:10: fatal error: ib.h: No such file or directory
 #include "ib.h"

Although I can't see how in the world this patch is causing such
an error.


Re: [PATCH net] RDS: IB: Limit the scope of has_fr/has_fmr variables

2017-10-01 Thread David Miller
From: Avinash Repaka 
Date: Fri, 29 Sep 2017 18:13:50 -0700

> This patch fixes the scope of has_fr and has_fmr variables as they are
> needed only in rds_ib_add_one().
> 
> Signed-off-by: Avinash Repaka 

Applied.


Re: [PATCH v2 net] net: mvpp2: Fix clock resource by adding an optional bus clock

2017-10-01 Thread David Miller
From: Gregory CLEMENT 
Date: Fri, 29 Sep 2017 14:27:39 +0200

> On Armada 7K/8K we need to explicitly enable the bus clock. The bus clock
> is optional because not all the SoCs need it, but at least for Armada
> 7K/8K it is actually mandatory.
> 
> The binding documentation is updated accordingly.
> 
> Signed-off-by: Gregory CLEMENT 

Applied.


Re: [PATCH V4] r8152: add Linksys USB3GIGV1 id

2017-10-01 Thread David Miller
From: Grant Grundler 
Date: Thu, 28 Sep 2017 11:35:00 -0700

> This Linksys dongle by default comes up in cdc_ether mode.
> This patch allows r8152 to claim the device:
>    Bus 002 Device 002: ID 13b1:0041 Linksys
> 
> Signed-off-by: Grant Grundler 

Applied, thanks.


Re: [PATCH 00/18] use ARRAY_SIZE macro

2017-10-01 Thread Greg KH
On Sun, Oct 01, 2017 at 08:52:20PM -0400, Jérémy Lefaure wrote:
> On Mon, 2 Oct 2017 09:01:31 +1100
> "Tobin C. Harding"  wrote:
> 
> > > In order to reduce the size of the To: and Cc: lines, each patch of the
> > > series is sent only to the maintainers and lists concerned by the patch.
> > > This cover letter is sent to every list concerned by this series.  
> > 
> > Why don't you just send individual patches for each subsystem? I'm not a
> > maintainer but I don't see how any one person is going to be able to apply
> > this whole series, it is making it hard for maintainers if they have to
> > pick patches out from among the series (if indeed any will bother doing
> > that).
> Yeah, maybe it would have been better to send individual patches.
> 
> From my point of view it's a series because the patches are related (I
> did a git format-patch from my local branch). But from the maintainers'
> point of view, they are individual patches.

And the maintainers' view is what matters here, if you wish to get your
patches reviewed and accepted...

thanks,

greg k-h


Re: [PATCH net 3/3] net: skb_queue_purge(): lock/unlock the queue only once

2017-10-01 Thread Michael Witten
On Sun, 1 Oct 2017 17:59:09 -0700, Stephen Hemminger wrote:

> On Sun, 01 Oct 2017 22:19:20 - Michael Witten wrote:
>
>> +spin_lock_irqsave(&q->lock, flags);
>> +skb = q->next;
>> +__skb_queue_head_init(q);
>> +spin_unlock_irqrestore(&q->lock, flags);
>
> Other code manipulating lists uses splice operation and
> a sk_buff_head temporary on the stack. That would be easier
> to understand.
>
>   struct sk_buff_head head;
>
>   __skb_queue_head_init(&head);
>   spin_lock_irqsave(&q->lock, flags);
>   skb_queue_splice_init(q, &head);
>   spin_unlock_irqrestore(&q->lock, flags);
>
>
>> +while (skb != head) {
>> +next = skb->next;
>>  kfree_skb(skb);
>> +skb = next;
>> +}
>
> It would be cleaner if you could use
> skb_queue_walk_safe rather than open coding the loop.
>
>   skb_queue_walk_safe(&head, skb, tmp)
>           kfree_skb(skb);

I appreciate abstraction as much as anybody, but I do not believe
that such abstractions would actually be an improvement here.

* Splice-initing seems more like an idiom than an abstraction;
  at first blush, it wouldn't be clear to me what the intention
  is.

* Such abstractions are fairly unnecessary.

* The function as written is already so short as to be
  easily digested.

* More to the point, this function is not some generic,
  higher-level algorithm that just happens to employ the
  socket buffer interface; rather, it is a function that
  implements part of that very interface, and may thus
  twiddle the intimate bits of these data structures
  without being accused of abusing a leaky abstraction.

* Such abstractions add overhead, if only conceptually. In this
  case, a temporary socket buffer queue allocates *3* unnecessary
  struct members, including a whole `spinlock_t' member:
  
      prev
      qlen
      lock

  It's possible that the compiler will be smart enough to leave
  those out, but I have my suspicions that it won't, not only
  given that the interface contract requires that the temporary
  socket buffer queue be properly initialized before use, but
  also because splicing into the temporary will manipulate its
  `qlen'. Yet, why worry whether optimization happens? The whole
  issue can simply be avoided by exploiting the intimate details
  that are already philosophically available to us.

  Similarly, the function `skb_queue_walk_safe' is nice, but it
  loses value both because a temporary queue loses value (as just
  described), and because it ignores the fact that legitimate
  access to the internals of these data structures allows for
  setting up the requested loop in advance; that is to say, the
  two parts of the function that we are now debating can be woven
  together more tightly than `skb_queue_walk_safe' allows.

For these reasons, I stand by the way that the patch currently
implements this function; it does exactly what is desired, no more
or less.

Sincerely,
Michael Witten
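
For readers comparing the two shapes debated above, here is a small userspace sketch of both. It assumes a circular doubly linked list with a sentinel head, loosely modelled on sk_buff_head; struct demo_node, struct demo_queue and the no-op demo_lock()/demo_unlock() are hypothetical stand-ins, not kernel code, and nothing here measures the overhead in question.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_node {
    struct demo_node *next, *prev;
};

struct demo_queue {
    struct demo_node head;  /* sentinel; an empty queue points at itself */
    unsigned int qlen;
};

/* Placeholders marking where the spinlock would be taken and released. */
static void demo_lock(struct demo_queue *q)   { (void)q; }
static void demo_unlock(struct demo_queue *q) { (void)q; }

static void demo_queue_init(struct demo_queue *q)
{
    q->head.next = q->head.prev = &q->head;
    q->qlen = 0;
}

static void demo_queue_tail(struct demo_queue *q, struct demo_node *n)
{
    n->next = &q->head;
    n->prev = q->head.prev;
    q->head.prev->next = n;
    q->head.prev = n;
    q->qlen++;
}

/* Open-coded purge: take the chain and reset the head in one lock hold. */
static unsigned int demo_purge_open_coded(struct demo_queue *q)
{
    struct demo_node *n, *next;
    unsigned int freed = 0;

    demo_lock(q);
    n = q->head.next;
    demo_queue_init(q);
    demo_unlock(q);

    /* The detached chain still ends at the address of the sentinel. */
    while (n != &q->head) {
        next = n->next;
        free(n);
        freed++;
        n = next;
    }
    return freed;
}

/* Splice-based purge: move the chain onto a temporary head on the stack. */
static unsigned int demo_purge_spliced(struct demo_queue *q)
{
    struct demo_queue tmp;
    struct demo_node *n, *next;
    unsigned int freed = 0;

    demo_queue_init(&tmp);
    demo_lock(q);
    if (q->qlen) {
        tmp.head.next = q->head.next;
        tmp.head.prev = q->head.prev;
        tmp.head.next->prev = &tmp.head;
        tmp.head.prev->next = &tmp.head;
        tmp.qlen = q->qlen;
        demo_queue_init(q);
    }
    demo_unlock(q);

    for (n = tmp.head.next; n != &tmp.head; n = next) {
        next = n->next;
        free(n);
        freed++;
    }
    return freed;
}
```

Both variants hold the lock once and free the buffers outside it; the difference is only whether the detached chain is tracked through the original sentinel address or through a temporary head.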


Re: [PATCH 04/18] IB/mlx5: Use ARRAY_SIZE

2017-10-01 Thread Leon Romanovsky
On Sun, Oct 01, 2017 at 03:30:42PM -0400, Jérémy Lefaure wrote:
> Using the ARRAY_SIZE macro improves the readability of the code.
>
> Found with Coccinelle with the following semantic patch:
> @r depends on (org || report)@
> type T;
> T[] E;
> position p;
> @@
> (
>  (sizeof(E)@p /sizeof(*E))
> |
>  (sizeof(E)@p /sizeof(E[...]))
> |
>  (sizeof(E)@p /sizeof(T))
> )
>
> Signed-off-by: Jérémy Lefaure 
> ---
>  drivers/infiniband/hw/mlx5/odp.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>

Thanks Jérémy,
I took this into my tree and will forward it to Doug L. during this
cycle.
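
As a quick illustration of the idiom the semantic patch targets, here is a userspace sketch. The ARRAY_SIZE definition below is a simplified stand-in (the kernel's version additionally rejects plain pointers at compile time), and demo_sizes is a made-up table, not data from odp.c.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel macro. */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Hypothetical table, used only for the demonstration. */
static const int demo_sizes[] = { 4, 8, 16, 32, 64 };

/* The open-coded form that the semantic patch matches. */
static size_t demo_count_open_coded(void)
{
    return sizeof(demo_sizes) / sizeof(*demo_sizes);
}

/* The same computation spelled with the macro. */
static size_t demo_count_with_macro(void)
{
    return ARRAY_SIZE(demo_sizes);
}

/* Typical use: bound a loop by the element count. */
static int demo_sum(void)
{
    size_t i;
    int sum = 0;

    for (i = 0; i < ARRAY_SIZE(demo_sizes); i++)
        sum += demo_sizes[i];
    return sum;
}
```

Both counting functions yield the same element count; the macro form simply states the intent by name.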


Re: [RFC PATCH 3/3] fs: detect that the i_rwsem has already been taken exclusively

2017-10-01 Thread Dave Chinner
On Sun, Oct 01, 2017 at 07:42:42PM -0400, Mimi Zohar wrote:
> On Mon, 2017-10-02 at 09:34 +1100, Dave Chinner wrote:
> > On Sun, Oct 01, 2017 at 11:41:48AM -0700, Linus Torvalds wrote:
> > > > On Sun, Oct 1, 2017 at 5:08 AM, Mimi Zohar wrote:
> > > >
> > > > Right, re-introducing the iint->mutex and a new i_generation field in
> > > > the iint struct with a separate set of locks should work.  It will be
> > > > reset if the file metadata changes (eg. setxattr, chown, chmod).
> > > 
> > > Note that the "inner lock" could possibly be omitted if the
> > > invalidation can be just a single atomic instruction.
> > > 
> > > So particularly if invalidation could be just an atomic_inc() on the
> > > generation count, there might not need to be any inner lock at all.
> > > 
> > > You'd have to serialize the actual measurement with the "read
> > > generation count", but that should be as simple as just doing a
> > > smp_rmb() between the "read generation count" and "do measurement on
> > > file contents".
> > 
> > We already have a change counter on the inode, which is modified on
> > any data or metadata write (i_version) under filesystem locks.  The
> > i_version counter has well defined semantics - it's required by
> > NFSv4 to increment on any metadata or data change - so we should be
> > > able to rely on its behaviour to implement IMA as well. Filesystems
> > that support i_version are marked with [SB|MS]_I_VERSION in the
> > superblock (IS_I_VERSION(inode)) so it should be easy to tell if IMA
> > can be supported on a specific filesystem (btrfs, ext4, fuse and xfs
> > ATM).
> 
> Recently I received a patch to replace i_version with mtime/atime.

mtime is not guaranteed to change on data writes - the resolution of
the filesystem timestamps may mean mtime only changes once a second
regardless of the number of writes performed to that file. That's
why NFS can't use it as a change attribute, and hence we have
i_version.

>  Now, even more recently, I received a patch that claims that
> i_version is just a performance improvement.

Did you ask them to explain/quantify the performance improvement? 

e.g. Using i_version on XFS slows down performance on small
writes by 2-3% because all data writes log a version change
rather than only logging a change when mtime updates.
We take that penalty because NFS requires specific change attribute
behaviour, otherwise we wouldn't have implemented it at all in
XFS...

>  For file systems that
> don't support i_version, assume that the file has changed.
> 
> For file systems that don't support i_version, instead of assuming
> that the file has changed, we can at least use i_generation.

I'm not sure what you mean here - the struct inode already has an
i_generation variable. It's a lifecycle indicator used to
discriminate between alloc/free cycles on the same inode number.
i.e. It only changes at inode allocation time, not whenever the data
in the inode changes...

> With Linus' suggested changes, I think this will work nicely.
> 
> > The IMA code should be able to sample that at measurement time and
> > either fail or be retried if i_version changes during measurement.
> > We can then simply make the IMA xattr write conditional on the
> > i_version value being unchanged from the sample the IMA code passes
> > into the filesystem once the filesystem holds all the locks it needs
> > to write the xattr...
> 
> > I note that IMA already grabs the i_version in
> > ima_collect_measurement(), so this shouldn't be too hard to do.
> > > Perhaps we don't need any new locks or counters at all, maybe just
> > the ability to feed a version cookie to the set_xattr method?
> 
> The security.ima xattr is normally written out in
> ima_check_last_writer(), not in ima_collect_measurement().

Which, if IIUC, does this to measure and update the xattr:

ima_check_last_writer
  -> ima_update_xattr
      -> ima_collect_measurement
      -> ima_fix_xattr

>  ima_collect_measurement() calculates the file hash for storing in the
> measurement list (IMA-measurement), verifying the hash/signature (IMA-
> appraisal) already stored in the xattr, and auditing (IMA-audit).

Yup, and it samples the i_version before it calculates the hash and
stores it in the iint, which then gets passed to ima_fix_xattr().
Looks like all that is needed is to pass the i_version back to the
filesystem through the xattr call.

IOWs, sample the i_version early while we hold the inode lock and
check the writer count, then if it is the last writer drop the inode
lock and call ima_update_xattr(). The sampled i_version then tells
us if the file has changed before we write the updated xattr...
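
The sample-then-compare flow described here can be sketched in userspace C. This is a toy model under stated assumptions, not IMA or filesystem code: struct demo_inode, demo_set_hash() and the FNV-1a placeholder hash are all hypothetical, and a plain counter stands in for i_version.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct demo_inode {
    uint64_t version;       /* bumped on every data or metadata change */
    unsigned char data[16];
    uint32_t stored_hash;   /* stand-in for the security.ima xattr */
    bool has_hash;
};

static void demo_inode_init(struct demo_inode *ino)
{
    memset(ino, 0, sizeof(*ino));
}

static void demo_write(struct demo_inode *ino, size_t off, unsigned char byte)
{
    ino->data[off % sizeof(ino->data)] = byte;
    ino->version++;
}

/* FNV-1a, used here only as a cheap placeholder for the real file hash. */
static uint32_t demo_hash(const struct demo_inode *ino)
{
    uint32_t h = 2166136261u;
    size_t i;

    for (i = 0; i < sizeof(ino->data); i++) {
        h ^= ino->data[i];
        h *= 16777619u;
    }
    return h;
}

/* "xattr write" that only succeeds if the version cookie still matches. */
static bool demo_set_hash(struct demo_inode *ino, uint64_t cookie,
                          uint32_t hash)
{
    if (ino->version != cookie)
        return false;   /* file changed since the sample; caller retries */
    ino->stored_hash = hash;
    ino->has_hash = true;
    return true;
}

/* Sample the version, hash the data, then attempt the conditional update. */
static bool demo_measure(struct demo_inode *ino, bool racing_write)
{
    uint64_t cookie = ino->version;
    uint32_t hash = demo_hash(ino);

    if (racing_write)
        demo_write(ino, 0, 0xff);   /* simulate a writer slipping in */
    return demo_set_hash(ino, cookie, hash);
}
```

When demo_measure() returns false the stale hash is never stored, so the caller can retry the measurement or fail it.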

> The only time that ima_collect_measurement() writes the file xattr is
> in "fix" mode.  Writing the xattr will need to be deferred until after
> the iint->mutex is released.

ima_collect_measurement() doesn't write an xattr at all - it just
reads the file data and calculates the hash.

> There should be no open writers in

Re: [PATCH v16 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-10-01 Thread Michael S. Tsirkin
Looks good to me. Minor comments below.

On Sat, Sep 30, 2017 at 12:05:52PM +0800, Wei Wang wrote:
> @@ -141,13 +146,128 @@ static void set_page_pfns(struct virtio_balloon *vb,
> page_to_balloon_pfn(page) + i);
>  }
>  
> +
> +static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
> +{
> + unsigned int len;
> +
> + virtqueue_kick(vq);
> + wait_event(wq_head, virtqueue_get_buf(vq, &len));
> +}
> +
> +static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> + struct scatterlist sg;
> + unsigned int len;
> +
> + sg_init_one(&sg, addr, size);
> +
> + /* Detach all the used buffers from the vq */
> + while (virtqueue_get_buf(vq, &len))
> + ;
> +
> + return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
> +}
> +
> +static int send_balloon_page_sg(struct virtio_balloon *vb,
> +  struct virtqueue *vq,
> +  void *addr,
> +  uint32_t size,
> +  bool batch)
> +{
> + int err;
> +
> + err = add_one_sg(vq, addr, size);
> +
> + /* If batchng is requested, we batch till the vq is full */

typo

> + if (!batch || !vq->num_free)
> + kick_and_wait(vq, vb->acked);
> +
> + return err;
> +}

If add_one_sg fails, kick_and_wait will hang forever.

The reason this might work is because
1. with 1 sg there are no memory allocations
2. if adding fails on vq full, then something
   is in queue and will wake up kick_and_wait.

So in short this is expected to never fail.
How about a BUG_ON here then?
And make it void, and add a comment with above explanation.
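
To make the reasoning above concrete, here is a userspace toy model of the suggested shape. It is a sketch under stated assumptions, not the virtio API: struct demo_vq, demo_add_one() and demo_kick_and_wait() are hypothetical, the pretend device returns every queued buffer on a kick, and assert() stands in for BUG_ON().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define DEMO_RING_SIZE 4

/* Toy virtqueue: tracks only free slots and how often it was kicked. */
struct demo_vq {
    unsigned int num_free;
    unsigned int kicks;
};

/* Fails only when the ring is full; adding one entry allocates nothing. */
static int demo_add_one(struct demo_vq *vq)
{
    if (!vq->num_free)
        return -ENOSPC;
    vq->num_free--;
    return 0;
}

/* Pretend the device consumes everything queued and hands it all back. */
static void demo_kick_and_wait(struct demo_vq *vq)
{
    vq->kicks++;
    vq->num_free = DEMO_RING_SIZE;
}

/*
 * void variant: the add is expected never to fail, because the ring is
 * drained whenever it fills up, so a failure is treated as a bug rather
 * than reported to the caller.
 */
static void demo_send(struct demo_vq *vq, bool batch)
{
    int err = demo_add_one(vq);

    assert(!err);   /* userspace stand-in for BUG_ON(err) */

    /* When batching, only kick once the ring is full. */
    if (!batch || !vq->num_free)
        demo_kick_and_wait(vq);
}
```

Because the ring is emptied as soon as num_free reaches zero, demo_add_one() never sees a full ring in this model, which is the invariant the assertion documents.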

> +
> +/*
> + * Send balloon pages in sgs to host. The balloon pages are recorded in the
> + * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> + * The page xbitmap is searched for continuous "1" bits, which correspond
> + * to continuous pages, to chunk into sgs.
> + *
> + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
> + * need to be searched.
> + */
> +static void tell_host_sgs(struct virtio_balloon *vb,
> +   struct virtqueue *vq,
> +   unsigned long page_xb_start,
> +   unsigned long page_xb_end)
> +{
> + unsigned long sg_pfn_start, sg_pfn_end;
> + void *sg_addr;
> + uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
> + int err = 0;
> +
> + sg_pfn_start = page_xb_start;
> + while (sg_pfn_start < page_xb_end) {
> + sg_pfn_start = xb_find_next_set_bit(&vb->page_xb, sg_pfn_start,
> + page_xb_end);
> + if (sg_pfn_start == page_xb_end + 1)
> + break;
> + sg_pfn_end = xb_find_next_zero_bit(&vb->page_xb,
> +sg_pfn_start + 1,
> +page_xb_end);
> + sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
> + sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
> + while (sg_len > sg_max_len) {
> + err = send_balloon_page_sg(vb, vq, sg_addr, sg_max_len,
> +true);
> + if (unlikely(err < 0))
> + goto err_out;
> + sg_addr += sg_max_len;
> + sg_len -= sg_max_len;
> + }
> + err = send_balloon_page_sg(vb, vq, sg_addr, sg_len, true);
> + if (unlikely(err < 0))
> + goto err_out;
> + sg_pfn_start = sg_pfn_end + 1;
> + }
> +
> + /*
> +  * The last few sgs may not reach the batch size, but need a kick to
> +  * notify the device to handle them.
> +  */
> + if (vq->num_free != virtqueue_get_vring_size(vq))
> + kick_and_wait(vq, vb->acked);
> +
> + xb_clear_bit_range(&vb->page_xb, page_xb_start, page_xb_end);
> + return;
> +
> +err_out:
> + dev_warn(&vb->vdev->dev, "%s failure: %d\n", __func__, err);

so fundamentally just make send_balloon_page_sg void then.

> +}
> +
> +static inline void xb_set_page(struct virtio_balloon *vb,
> +struct page *page,
> +unsigned long *pfn_min,
> +unsigned long *pfn_max)
> +{
> + unsigned long pfn = page_to_pfn(page);
> +
> + *pfn_min = min(pfn, *pfn_min);
> + *pfn_max = max(pfn, *pfn_max);
> + xb_preload(GFP_KERNEL);
> + xb_set_bit(&vb->page_xb, pfn);
> + xb_preload_end();
> +}
> +
>  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
>   struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>   unsigned num_allocated_pages;
> + bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> + unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
> 

Re: [PATCH v16 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-10-01 Thread Michael S. Tsirkin
Looks good to me. minor comments below.

On Sat, Sep 30, 2017 at 12:05:52PM +0800, Wei Wang wrote:
> @@ -141,13 +146,128 @@ static void set_page_pfns(struct virtio_balloon *vb,
> page_to_balloon_pfn(page) + i);
>  }
>  
> +
> +static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
> +{
> + unsigned int len;
> +
> + virtqueue_kick(vq);
> + wait_event(wq_head, virtqueue_get_buf(vq, ));
> +}
> +
> +static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> + struct scatterlist sg;
> + unsigned int len;
> +
> + sg_init_one(, addr, size);
> +
> + /* Detach all the used buffers from the vq */
> + while (virtqueue_get_buf(vq, ))
> + ;
> +
> + return virtqueue_add_inbuf(vq, , 1, vq, GFP_KERNEL);
> +}
> +
> +static int send_balloon_page_sg(struct virtio_balloon *vb,
> +  struct virtqueue *vq,
> +  void *addr,
> +  uint32_t size,
> +  bool batch)
> +{
> + int err;
> +
> + err = add_one_sg(vq, addr, size);
> +
> + /* If batchng is requested, we batch till the vq is full */

typo

> + if (!batch || !vq->num_free)
> + kick_and_wait(vq, vb->acked);
> +
> + return err;
> +}

If add_one_sg fails, kick_and_wait will hang forever.

The reason this might work in because
1. with 1 sg there are no memory allocations
2. if adding fails on vq full, then something
   is in queue and will wake up kick_and_wait.

So in short this is expected to never fail.
How about a BUG_ON here then?
And make it void, and add a comment with above explanation.

> +
> +/*
> + * Send balloon pages in sgs to host. The balloon pages are recorded in the
> + * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> + * The page xbitmap is searched for continuous "1" bits, which correspond
> + * to continuous pages, to chunk into sgs.
> + *
> + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
> + * need to be searched.
> + */
> +static void tell_host_sgs(struct virtio_balloon *vb,
> +   struct virtqueue *vq,
> +   unsigned long page_xb_start,
> +   unsigned long page_xb_end)
> +{
> + unsigned long sg_pfn_start, sg_pfn_end;
> + void *sg_addr;
> + uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
> + int err = 0;
> +
> + sg_pfn_start = page_xb_start;
> + while (sg_pfn_start < page_xb_end) {
> + sg_pfn_start = xb_find_next_set_bit(>page_xb, sg_pfn_start,
> + page_xb_end);
> + if (sg_pfn_start == page_xb_end + 1)
> + break;
> + sg_pfn_end = xb_find_next_zero_bit(>page_xb,
> +sg_pfn_start + 1,
> +page_xb_end);
> + sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
> + sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
> + while (sg_len > sg_max_len) {
> + err = send_balloon_page_sg(vb, vq, sg_addr, sg_max_len,
> +true);
> + if (unlikely(err < 0))
> + goto err_out;
> + sg_addr += sg_max_len;
> + sg_len -= sg_max_len;
> + }
> + err = send_balloon_page_sg(vb, vq, sg_addr, sg_len, true);
> + if (unlikely(err < 0))
> + goto err_out;
> + sg_pfn_start = sg_pfn_end + 1;
> + }
> +
> + /*
> +  * The last few sgs may not reach the batch size, but need a kick to
> +  * notify the device to handle them.
> +  */
> + if (vq->num_free != virtqueue_get_vring_size(vq))
> + kick_and_wait(vq, vb->acked);
> +
> + xb_clear_bit_range(>page_xb, page_xb_start, page_xb_end);
> + return;
> +
> +err_out:
> + dev_warn(>vdev->dev, "%s failure: %d\n", __func__, err);

so fundamentally just make send_balloon_page_sg void then.

> +}
> +
> +static inline void xb_set_page(struct virtio_balloon *vb,
> +struct page *page,
> +unsigned long *pfn_min,
> +unsigned long *pfn_max)
> +{
> + unsigned long pfn = page_to_pfn(page);
> +
> + *pfn_min = min(pfn, *pfn_min);
> + *pfn_max = max(pfn, *pfn_max);
> + xb_preload(GFP_KERNEL);
> + xb_set_bit(&vb->page_xb, pfn);
> + xb_preload_end();
> +}
> +
>  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
>   struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>   unsigned num_allocated_pages;
> + bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> + unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
> 

Re: [PATCH 2/6] ath9k: add a quirk to set use_msi automatically

2017-10-01 Thread Daniel Drake
Hi AceLan,

On Thu, Sep 28, 2017 at 4:28 PM, AceLan Kao  wrote:
> Hi Daniel,
>
> I've tried your patch, but it doesn't work for me.
> Wifi can scan AP, but can't get connected.

Can you please clarify which patch(es) you have tried?

This is the base patch which adds the infrastructure to request
specific MSI IRQ vectors:
https://marc.info/?l=linux-wireless&m=150631274108016&w=2

This is the ath9k MSI patch which makes use of that:
https://github.com/endlessm/linux/commit/739c7a924db8f4434a9617657

If you were already able to use ath9k MSI interrupts without specific
consideration for which MSI vector numbers were used, these are the
possible explanations that spring to mind:

1. You got lucky and it picked a vector number that is 4-aligned. You
can check this in the "lspci -vvv" output. You'll see something like:
Capabilities: [50] MSI: Enable+ Count=1/4 Maskable+ 64bit+
Address: fee0300c  Data: 4142
The lower number is the vector number. In my example here 0x42 (66) is
not 4-aligned so the failure condition will be hit.

2. You are using interrupt remapping, which I suspect may provide a
high likelihood of MSI interrupt vectors being 4-aligned. See if
/proc/interrupts shows the IRQ type as IR-PCI-MSI.
Unfortunately interrupt remapping is not available here; see
https://lists.linuxfoundation.org/pipermail/iommu/2017-August/023717.html

3. My assumption that all ath9k hardware corrupts the MSI vector
number could be wrong. However we've seen this on different wifi modules
in laptops produced by different OEMs and ODMs, so it seems to be a
somewhat widespread problem at least.

4. My assumption that ath9k hardware is corrupting the MSI vector
number could be wrong; maybe another component is to blame, could it
be a BIOS issue? Admittedly I don't really know how I can debug the
layers in between seeing the MSI Message Data value disagree with the
vector number being handled inside do_IRQ().
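
The alignment check from point 1 can be sketched in a few lines of C. This
is only an illustration: the helper names are made up for this note, and it
assumes, as in the lspci output above, that the low 8 bits of the MSI Data
value carry the vector number.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: take the vector number from the low byte of MSI Data. */
static inline uint8_t msi_data_to_vector(uint16_t msi_data)
{
	return (uint8_t)(msi_data & 0xff);
}

/* A vector is 4-aligned when its two lowest bits are clear. */
static inline bool msi_vector_is_4_aligned(uint8_t vector)
{
	return (vector & 0x3) == 0;
}
```

With Data 0x4142 the vector is 0x42, whose low two bits are 0b10, so the
check fails; a vector of 0x44 would pass.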

Daniel


[PATCH v2] rpmsg: Allow RPMSG_VIRTIO to be enabled via menuconfig or defconfig

2017-10-01 Thread Anup Patel
Currently, RPMSG_VIRTIO can only be enabled if some other kconfig
option selects it. This does not allow it to be enabled for
virtualized systems where Virtio RPMSG is available over Virtio
MMIO or PCI transport.

This patch updates RPMSG_VIRTIO kconfig option so that we can
enable the VirtIO RPMSG driver via menuconfig or defconfig.

Signed-off-by: Anup Patel 
---

Changes since v1:
- Add depends on HAS_DMA to avoid build failures on
  archs (such as um) with NO_DMA=y. For most archs,
  HAS_DMA=y so having depends on HAS_DMA is fine. 

 drivers/rpmsg/Kconfig | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/rpmsg/Kconfig b/drivers/rpmsg/Kconfig
index 0fe6eac..65a9f6b 100644
--- a/drivers/rpmsg/Kconfig
+++ b/drivers/rpmsg/Kconfig
@@ -47,7 +47,8 @@ config RPMSG_QCOM_SMD
  platforms.
 
 config RPMSG_VIRTIO
-   tristate
+   tristate "Virtio RPMSG bus driver"
+   depends on HAS_DMA
select RPMSG
select VIRTIO
 
-- 
2.7.4



Re: [PATCH 01/18] sound: use ARRAY_SIZE

2017-10-01 Thread Joe Perches
On Sun, 2017-10-01 at 15:30 -0400, Jérémy Lefaure wrote:
> Using the ARRAY_SIZE macro improves the readability of the code.
> 
> Found with Coccinelle with the following semantic patch:
> @r depends on (org || report)@
> type T;
> T[] E;
> position p;
> @@
> (
>  (sizeof(E)@p /sizeof(*E))
> |
>  (sizeof(E)@p /sizeof(E[...]))
> |
>  (sizeof(E)@p /sizeof(T))
> )
[]
> diff --git a/sound/oss/ad1848.c b/sound/oss/ad1848.c
[]
> @@ -797,7 +798,7 @@ static int ad1848_set_speed(int dev, int arg)
>  
>   int i, n, selected = -1;
>  
> - n = sizeof(speed_table) / sizeof(speed_struct);
> + n = ARRAY_SIZE(speed_table);

These sorts of changes are OK, but for many
uses, it's more readable to use ARRAY_SIZE(foo)
in each location rather than using a temporary.
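
A small userspace sketch of what that looks like, using ARRAY_SIZE() directly
in the loop bound instead of copying it into 'n'. The macro below is a
simplified stand-in for the kernel one, and the table contents and the
nearest_speed_index() helper are invented for illustration; they are not
taken from ad1848.c.

```c
#include <stddef.h>

/* Simplified stand-in; the kernel macro additionally rejects pointers. */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

struct speed_entry {
	int speed;
	unsigned char bits;
};

/* Hypothetical table, values chosen only for the example. */
static const struct speed_entry speed_table[] = {
	{ 8000, 0x01 }, { 11025, 0x03 }, { 22050, 0x07 }, { 44100, 0x0b },
};

/* Return the index of the entry whose speed is closest to 'arg'. */
static int nearest_speed_index(int arg)
{
	int best = 0;
	size_t i;

	for (i = 1; i < ARRAY_SIZE(speed_table); i++) {
		int d_best = speed_table[best].speed - arg;
		int d_cur = speed_table[i].speed - arg;

		if (d_best < 0)
			d_best = -d_best;
		if (d_cur < 0)
			d_cur = -d_cur;
		if (d_cur < d_best)
			best = (int)i;
	}
	return best;
}
```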



[PATCH] platform/x86: peaq-wmi: Blacklist Lenovo ideapad 700-15ISK

2017-10-01 Thread Kai-Heng Feng
peaq-wmi on Lenovo ideapad 700-15ISK keeps sending KEY_SOUND,
which causes the user's repeated keys to be interrupted.

The system does not have a Dolby button, so blacklist it.

BugLink: https://bugs.launchpad.net/bugs/1720219
Signed-off-by: Kai-Heng Feng 
---
 drivers/platform/x86/peaq-wmi.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/platform/x86/peaq-wmi.c b/drivers/platform/x86/peaq-wmi.c
index bc98ef95514a..5673d5daebc3 100644
--- a/drivers/platform/x86/peaq-wmi.c
+++ b/drivers/platform/x86/peaq-wmi.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define PEAQ_DOLBY_BUTTON_GUID "ABBC0F6F-8EA1-11D1-00A0-C9062910"
 #define PEAQ_DOLBY_BUTTON_METHOD_ID 5
@@ -64,9 +65,22 @@ static void peaq_wmi_poll(struct input_polled_dev *dev)
}
 }
 
+static const struct dmi_system_id peaq_blacklist[] __initconst = {
+   {
+   /* Lenovo ideapad 700-15ISK does not have Dolby button */
+   .ident = "Lenovo ideapad 700-15ISK",
+   .matches = {
+   DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
+   DMI_MATCH(DMI_PRODUCT_NAME, "80RU"),
+   },
+   },
+   {}
+};
+
 static int __init peaq_wmi_init(void)
 {
-   if (!wmi_has_guid(PEAQ_DOLBY_BUTTON_GUID))
+   if (!wmi_has_guid(PEAQ_DOLBY_BUTTON_GUID) ||
+   dmi_check_system(peaq_blacklist))
return -ENODEV;
 
peaq_poll_dev = input_allocate_polled_device();
@@ -86,7 +100,8 @@ static int __init peaq_wmi_init(void)
 
 static void __exit peaq_wmi_exit(void)
 {
-   if (!wmi_has_guid(PEAQ_DOLBY_BUTTON_GUID))
+   if (!wmi_has_guid(PEAQ_DOLBY_BUTTON_GUID) ||
+   dmi_check_system(peaq_blacklist))
return;
 
input_unregister_polled_device(peaq_poll_dev);
-- 
2.14.1



Re: [RFC PATCH 3/3] fs: detect that the i_rwsem has already been taken exclusively

2017-10-01 Thread Dave Chinner
On Sun, Oct 01, 2017 at 04:15:07PM -0700, Linus Torvalds wrote:
> On Sun, Oct 1, 2017 at 3:34 PM, Dave Chinner  wrote:
> >
> > We already have a change counter on the inode, which is modified on
> > any data or metadata write (i_version) under filesystem locks.  The
> > i_version counter has well defined semantics - it's required by
> > NFSv4 to increment on any metadata or data change - so we should be
> > able to rely on its behaviour to implement IMA as well.
> 
> I actually think i_version has exactly the wrong semantics.
> 
> Afaik, it doesn't actually version the file _data_ at all, it only
> versions "inode itself changed".

No, the NFSv4 change attribute must change if either data or
metadata on the inode is changed, and be consistent and persistent
across server crashes. For data updates, they piggy back on mtime
updates 

> But I might have missed something obvious. The updates are hidden in
> some odd places sometimes.

... which are in file_update_time().

Hence every data write or write page fault will call
file_update_time() and trigger an i_version increment, even if the
mtime doesn't change.
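
As a rough userspace model of that behaviour (not the kernel code; every
toy_* name and field below is made up for illustration), a counter that is
bumped on every write lets a reader detect a change even when two writes
land on the same timestamp:

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Toy inode: only the fields needed for the illustration. */
struct toy_inode {
	uint64_t version;	/* stands in for i_version */
	time_t mtime;		/* stands in for the modification time */
	uint32_t data_sum;	/* stands in for the file contents */
};

/* Models the described behaviour: mtime may stay the same, version never does. */
static void toy_file_update_time(struct toy_inode *inode, time_t now)
{
	if (inode->mtime != now)
		inode->mtime = now;
	inode->version++;
}

/* Every data write goes through toy_file_update_time(). */
static void toy_write(struct toy_inode *inode, uint32_t delta, time_t now)
{
	inode->data_sum += delta;
	toy_file_update_time(inode, now);
}

/* A measurement taken at 'sampled' is only valid if nothing changed since. */
static bool toy_sample_still_valid(const struct toy_inode *inode,
				   uint64_t sampled)
{
	return inode->version == sampled;
}
```

A caller would sample the version, do its measurement, and then retry or
fail if toy_sample_still_valid() returns false.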

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH 3/3] fs: detect that the i_rwsem has already been taken exclusively

2017-10-01 Thread Eric W. Biederman
Mimi Zohar  writes:

> On Mon, 2017-10-02 at 09:34 +1100, Dave Chinner wrote:
>> On Sun, Oct 01, 2017 at 11:41:48AM -0700, Linus Torvalds wrote:
>> > On Sun, Oct 1, 2017 at 5:08 AM, Mimi Zohar  
>> > wrote:
>> > >
>> > > Right, re-introducing the iint->mutex and a new i_generation field in
>> > > the iint struct with a separate set of locks should work.  It will be
>> > > reset if the file metadata changes (eg. setxattr, chown, chmod).
>> > 
>> > Note that the "inner lock" could possibly be omitted if the
>> > invalidation can be just a single atomic instruction.
>> > 
>> > So particularly if invalidation could be just an atomic_inc() on the
>> > generation count, there might not need to be any inner lock at all.
>> > 
>> > You'd have to serialize the actual measurement with the "read
>> > generation count", but that should be as simple as just doing a
>> > smp_rmb() between the "read generation count" and "do measurement on
>> > file contents".
>> 
>> We already have a change counter on the inode, which is modified on
>> any data or metadata write (i_version) under filesystem locks.  The
>> i_version counter has well defined semantics - it's required by
>> NFSv4 to increment on any metadata or data change - so we should be
>> able to rely on its behaviour to implement IMA as well. Filesystems
>> that support i_version are marked with [SB|MS]_I_VERSION in the
>> superblock (IS_I_VERSION(inode)) so it should be easy to tell if IMA
>> can be supported on a specific filesystem (btrfs, ext4, fuse and xfs
>> ATM).
>
> Recently I received a patch to replace i_version with mtime/atime.
>  Now, even more recently, I received a patch that claims that
> i_version is just a performance improvement.  For file systems that
> don't support i_version, assume that the file has changed.
>
> For file systems that don't support i_version, instead of assuming
> that the file has changed, we can at least use i_generation.
>
> With Linus' suggested changes, I think this will work nicely.
>
>> The IMA code should be able to sample that at measurement time and
>> either fail or be retried if i_version changes during measurement.
>> We can then simply make the IMA xattr write conditional on the
>> i_version value being unchanged from the sample the IMA code passes
>> into the filesystem once the filesystem holds all the locks it needs
>> to write the xattr...
>
>> I note that IMA already grabs the i_version in
>> ima_collect_measurement(), so this shouldn't be too hard to do.
>> Perhaps we don't need any new locks or counters at all, maybe just
>> the ability to feed a version cookie to the set_xattr method?
>
> The security.ima xattr is normally written out in
> ima_check_last_writer(), not in ima_collect_measurement().
>  ima_collect_measurement() calculates the file hash for storing in the
> measurement list (IMA-measurement), verifying the hash/signature (IMA-
> appraisal) already stored in the xattr, and auditing (IMA-audit).
>
> The only time that ima_collect_measurement() writes the file xattr is
> in "fix" mode.  Writing the xattr will need to be deferred until after
> the iint->mutex is released.
>
> There should be no open writers in ima_check_last_writer(), so the
> file shouldn't be changing.

This is slightly tangential but I think important to consider.
What do you do about distributed filesystems (fuse, nfs, etc.) that
can change the data behind the kernel's back?

Do you not support such systems or do you have a sufficient way to
detect changes?

Eric


Re: [PATCH] powernv: Add OCC driver to mmap sensor area

2017-10-01 Thread Stewart Smith
Shilpasri G Bhat  writes:
> This driver provides interface to mmap the OCC sensor area
> to userspace to parse and read OCC inband sensors.

Why?

Is this for debug? If so, the existing exports interface should be used.

If there are actual sensors, we already have two ways of exposing sensors
to Linux: the OPAL_SENSOR API and the IMC API.

Why this method rather than the existing ones?

-- 
Stewart Smith
OPAL Architect, IBM.



Re: [RFC][PATCH] KEYS: Replace uid/gid/perm permissions checking with ACL

2017-10-01 Thread Eric Biggers
Hi David,

On Wed, Sep 27, 2017 at 12:41:41PM +0100, David Howells wrote:
> 
> Replace the uid/gid/perm permissions checking on a key with an ACL to allow
> the SETATTR permission to be split.  The problem is that SETATTR covers a
> slew of things, not all of which should be grouped together.  This
> includes:
> 
>  (1) Changing the key ownership.
> 
>  (2) Changing the security information.
> 
>  (3) Keyring restriction.
> 
>  (4) Expiry time.
> 
>  (5) Revocation.
> 
> and it has also been proposed to add:
> 
>  (6) Invalidation.
> 
> The above can be divided into three groups: Controlling access (1), (2) and
> (3), managing the content at construction time (4) and managing the key (5)
> and (6).

This is interesting work, though it adds complexity and makes a lot of subtle
(and potentially breaking) changes to which permissions are required for various
things.  First I think you need to start out with a better statement of the
problems you are trying to solve.  The patch does much more than simply split up
the SETATTR permission --- for example, it also adds the ability to assign
permissions to specific uids, gids, and capabilities.  Who is planning to use
those features and why?

> The KEYCTL_SETATTR function is then deprecated.  If called, it will

KEYCTL_SETPERM

> construct an ACL to reflect the mask it is given, using possessor, owner,
> group and other ACE's as appropriate if any of those elements are granted
> any permissions.  SETATTR permission turns on all of INVAL, REVOKE and
> SET_SECURITY.  WRITE permission turns on WRITE, REVOKE and, if a keyring,
> CLEAR.  JOIN is turned on if a keyring is being altered.

The proposed changes to keyctl_setperm_key() actually never enable INVAL at all,
which doesn't match the description here.  Also, all breaking changes need to be
justified.  If keyctl_setperm(key, KEY_*_SEARCH) is no longer going to allow the
key to be invalidated (as I had proposed earlier), that is really its own change
which needs its own justification; it shouldn't be hidden in a larger patch.

> will return an error if SETACL has been called on a key.

That is simplest, but it doesn't match the behavior of POSIX ACLs, for example.
With POSIX ACLs you can still chmod() a file that has an ACL.

> The KEYCTL_DESCRIBE function then creates a permissions mask to return
> depending on possessor, owner, group and other ACEs, indicating SETATTR if
> any of INVAL, REVOKE and SET_SECURITY are set and indicating WRITE if any
> of WRITE, REVOKE or CLEAR are set.

Ignoring ACEs for specific users, groups, and capabilities may be problematic
because the returned mask will under-estimate rather than over-estimate the
permissions that have been granted.  With POSIX ACLs, for example, the union of
all permissions that have been granted to any subjects other than the regular
ones is reflected in the group entry.  I believe that's generally considered
better from a security perspective, because then no permissions are "hidden"
from a listing of the regular (non-ACL) permissions only.
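
To make the suggestion concrete, here is a minimal sketch of a "describe"
helper that over-estimates by folding every ACE for a specific user, group or
capability into the group field. All type names, field names and permission
bits are hypothetical and do not come from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subject classes; SUBJ_NAMED covers specific uids/gids/caps. */
enum subject_type {
	SUBJ_POSSESSOR, SUBJ_OWNER, SUBJ_GROUP, SUBJ_OTHER, SUBJ_NAMED
};

struct toy_ace {
	enum subject_type type;
	uint8_t perm;		/* bitmask of granted permissions */
};

struct legacy_mask {
	uint8_t possessor, owner, group, other;
};

/* Build the legacy mask so that no granted permission is hidden from it. */
static struct legacy_mask describe_acl(const struct toy_ace *aces, size_t n)
{
	struct legacy_mask m = { 0, 0, 0, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		switch (aces[i].type) {
		case SUBJ_POSSESSOR:
			m.possessor |= aces[i].perm;
			break;
		case SUBJ_OWNER:
			m.owner |= aces[i].perm;
			break;
		case SUBJ_OTHER:
			m.other |= aces[i].perm;
			break;
		case SUBJ_GROUP:
		case SUBJ_NAMED:
			/* Named subjects are reported through the group field. */
			m.group |= aces[i].perm;
			break;
		}
	}
	return m;
}
```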

> Note that the value subsequently returned by KEYCTL_DESCRIBE may not match
> the value set with KEYCTL_SETATTR - but this is already true because keys
> that lack ->read() can't have READ set and keys that lack ->write() can't
> have WRITE set.

Not true; you *can* set READ on a key that lacks ->read() and WRITE on a key
that lacks ->update().  They are only omitted from the default permissions.

> The KEYCTL_SET_TIMEOUT function then is permitted if WRITE or SETSEC is
> set, or if the caller has a valid instantiation auth token.

This doesn't match the code, which asks for WRITE permission only.  It's also a
breaking change which needs to be justified on its own.  Also I'm not sure that
WRITE permission actually makes sense, given that KEYCTL_SET_TIMEOUT doesn't
modify the payload of the key.

> +static struct key_acl blacklist_key_acl = {
> + .usage  = REFCOUNT_INIT(1),
> + .nr_ace = 2,
> + .aces[0] = {
> + .mask = KEY_ACE_SPECIAL | (KEY_ACE_SEARCH | KEY_ACE_READ),
> + .special_id = KEY_ACE_POSSESSOR,
> + },
> + .aces[1] = {
> + .mask = KEY_ACE_SPECIAL | (KEY_ACE_VIEW | KEY_ACE_SEARCH),
> + .special_id = KEY_ACE_OWNER,
> + },
> +};

Designators into flexible arrays are a gcc extension which doesn't work with
clang.  Use this instead:

	.aces = {
		{
			.mask = KEY_ACE_SPECIAL | (KEY_ACE_SEARCH | KEY_ACE_READ),
			.special_id = KEY_ACE_POSSESSOR,
		},
		{
			.mask = KEY_ACE_SPECIAL | (KEY_ACE_VIEW | KEY_ACE_SEARCH),
			.special_id = KEY_ACE_OWNER,
		},
	},
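
For reference, a self-contained version of that form with stand-in types.
The toy_* and TOY_* names and the bit values are invented here and are not
the patch's definitions. Note that statically initializing a flexible array
member at all is a GNU extension; to my knowledge this nested-brace form is
accepted by both gcc and clang.

```c
/* Illustrative stand-ins only: names and bit values are invented. */
#define TOY_ACE_VIEW	0x01u
#define TOY_ACE_READ	0x02u
#define TOY_ACE_SEARCH	0x08u
#define TOY_ACE_SPECIAL	0x10000000u

enum toy_special_id { TOY_ACE_POSSESSOR = 1, TOY_ACE_OWNER = 2 };

struct toy_key_ace {
	unsigned int mask;
	unsigned int special_id;
};

struct toy_key_acl {
	unsigned int nr_ace;
	struct toy_key_ace aces[];	/* flexible array member */
};

/* One designator for the whole array, then plain nested braces per entry. */
static struct toy_key_acl toy_blacklist_key_acl = {
	.nr_ace = 2,
	.aces = {
		{
			.mask = TOY_ACE_SPECIAL | (TOY_ACE_SEARCH | TOY_ACE_READ),
			.special_id = TOY_ACE_POSSESSOR,
		},
		{
			.mask = TOY_ACE_SPECIAL | (TOY_ACE_VIEW | TOY_ACE_SEARCH),
			.special_id = TOY_ACE_OWNER,
		},
	},
};
```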

It's also difficult to read these lists of ACEs.  An ACE should read as 

Re: [RFC][PATCH] KEYS: Replace uid/gid/perm permissions checking with ACL

2017-10-01 Thread Eric Biggers
Hi David,

On Wed, Sep 27, 2017 at 12:41:41PM +0100, David Howells wrote:
> 
> Replace the uid/gid/perm permissions checking on a key with an ACL to 
> allow
> the SETATTR permission to be split.  The problem is that SETATTR covers a
> slew of things, not all of which should be grouped together.  This
> includes:
> 
>  (1) Changing the key ownership.
> 
>  (2) Changing the security information.
> 
>  (3) Keyring restriction.
> 
>  (4) Expiry time.
> 
>  (5) Revocation.
> 
> and it has also been proposed to add:
> 
>  (6) Invalidation.
> 
> The above can be divided into three groups: Controlling access (1), (2) 
> and
> (3), managing the content at construction time (4) and managing the key 
> (5)
> and (6).

This is interesting work, though it adds complexity and makes a lot of subtle
(and potentially breaking) changes to which permissions are required for various
things.  First I think you need to start out with a better statement of the
problems you are trying to solve.  The patch does much more than simply split up
the SETATTR permission --- for example, it also adds the ability to assign
permissions to specific uids, gids, and capabilities.  Who is planning to use
those features and why?

> The KEYCTL_SETATTR function is then deprecated.  If called, it will

KEYCTL_SETPERM

> construct an ACL to reflect the mask it is given, using possessor, owner,
> group and other ACE's as appropriate if any of those elements are granted
> any permissions.  SETATTR permission turns on all of INVAL, REVOKE and
> SET_SECURITY.  WRITE permission turns on WRITE, REVOKE and, if a keyring,
> CLEAR.  JOIN is turned on if a keyring is being altered.

The proposed changes to keyctl_setperm_key() actually never enable INVAL at all,
which doesn't match the description here.  Also, all breaking changes need to be
justified.  If keyctl_setperm(key, KEY_*_SEARCH) is no longer going to allow the
key to be invalidated (as I had proposed earlier), that is really its own change
which needs its own justification; it shouldn't be hidden in a larger patch.

> will return an error if SETACL has been called on a key.

That is simplest, but it doesn't match the behavior of POSIX ACLs, for example.
With POSIX ACLs you can still chmod() a file that has an ACL.

> The KEYCTL_DESCRIBE function then creates a permissions mask to return
> depending on possessor, owner, group and other ACEs, indicating SETATTR if
> any of INVAL, REVOKE and SET_SECURITY are set and indicating WRITE if any
> of WRITE, REVOKE or CLEAR are set.

Ignoring ACEs for specific users, groups, and capabilities may be problematic
because the returned mask will under-estimate rather than over-estimate the
permissions that have been granted.  With POSIX ACLs, for example, the union of
all permissions that have been granted to any subjects other than the regular
ones is reflected in the group entry.  I believe that's generally considered
better from a security perspective, because then no permissions are "hidden"
from a listing of the regular (non-ACL) permissions only.

> Note that the value subsequently returned by KEYCTL_DESCRIBE may not match
> the value set with KEYCTL_SETATTR - but this is already true because keys
> that lack ->read() can't have READ set and keys that lack ->write() can't
> have WRITE set.

Not true; you *can* set READ on a key that lacks ->read() and WRITE on a key
that lacks ->update().  They are only omitted from the default permissions.

> The KEYCTL_SET_TIMEOUT function then is permitted if WRITE or SETSEC is
> set, or if the caller has a valid instantiation auth token.

This doesn't match the code, which asks for WRITE permission only.  It's also a
breaking change which needs to be justified on its own.  Also I'm not sure that
WRITE permission actually makes sense, given that KEYCTL_SET_TIMEOUT doesn't
modify the payload of the key.

> +static struct key_acl blacklist_key_acl = {
> + .usage  = REFCOUNT_INIT(1),
> + .nr_ace = 2,
> + .aces[0] = {
> + .mask = KEY_ACE_SPECIAL | (KEY_ACE_SEARCH | KEY_ACE_READ),
> + .special_id = KEY_ACE_POSSESSOR,
> + },
> + .aces[1] = {
> + .mask = KEY_ACE_SPECIAL | (KEY_ACE_VIEW | KEY_ACE_SEARCH),
> + .special_id = KEY_ACE_OWNER,
> + },
> +};

Designators into flexible arrays are a gcc extension which doesn't work with
clang.  Use this instead:

	.aces = {
		{
			.mask = KEY_ACE_SPECIAL | (KEY_ACE_SEARCH | KEY_ACE_READ),
			.special_id = KEY_ACE_POSSESSOR,
		},
		{
			.mask = KEY_ACE_SPECIAL | (KEY_ACE_VIEW | KEY_ACE_SEARCH),
			.special_id = KEY_ACE_OWNER,
		},
	},
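
For reference, a self-contained sketch of the nested-brace form. The constant
values and the cut-down `struct key_ace` / `struct key_acl` below are
stand-ins invented for this example, not the definitions from the patch.
Statically initialising a flexible array member is itself a GNU extension;
to my knowledge both gcc and clang accept it for objects with static storage
duration, but treat that as an assumption to check against your toolchain.

```c
/* Hypothetical values, for illustration only. */
#define KEY_ACE_VIEW     0x01u
#define KEY_ACE_READ     0x02u
#define KEY_ACE_SEARCH   0x08u
#define KEY_ACE_SPECIAL  0x10000000u

enum { KEY_ACE_OWNER = 1, KEY_ACE_POSSESSOR = 4 };

struct key_ace {
	unsigned int mask;
	unsigned int special_id;
};

struct key_acl {
	unsigned int nr_ace;
	struct key_ace aces[];	/* flexible array member */
};

/* Whole-array initializer: no designators index into aces[]. */
static struct key_acl demo_acl = {
	.nr_ace = 2,
	.aces = {
		{
			.mask = KEY_ACE_SPECIAL | (KEY_ACE_SEARCH | KEY_ACE_READ),
			.special_id = KEY_ACE_POSSESSOR,
		},
		{
			.mask = KEY_ACE_SPECIAL | (KEY_ACE_VIEW | KEY_ACE_SEARCH),
			.special_id = KEY_ACE_OWNER,
		},
	},
};
```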

It's also difficult to read these lists of ACEs.  An ACE should read as 

[Patch v4 14/22] CIFS: SMBD: Implement function to send data via RDMA send

2017-10-01 Thread Long Li
From: Long Li 

The transport doesn't maintain send buffers or send queue for transferring
payload via RDMA send. There is no data copy in the transport on send.

Signed-off-by: Long Li 
---
 fs/cifs/smbdirect.c | 248 
 fs/cifs/smbdirect.h |   4 +
 2 files changed, 252 insertions(+)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index b9be9d6..90e2c94 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -42,6 +42,12 @@ static int smbd_post_recv(
struct smbd_response *response);
 
 static int smbd_post_send_empty(struct smbd_connection *info);
+static int smbd_post_send_data(
+   struct smbd_connection *info,
+   struct kvec *iov, int n_vec, int remaining_data_length);
+static int smbd_post_send_page(struct smbd_connection *info,
+   struct page *page, unsigned long offset,
+   size_t size, int remaining_data_length);
 
 /* SMBD version number */
 #define SMBD_V1 0x0100
@@ -198,6 +204,10 @@ static void smbd_destroy_rdma_work(struct work_struct *work)
log_rdma_event(INFO, "cancelling send immediate work\n");
cancel_delayed_work_sync(&info->send_immediate_work);
 
+   log_rdma_event(INFO, "wait for all send to finish\n");
+   wait_event(info->wait_smbd_send_pending,
+   info->smbd_send_pending == 0);
+
log_rdma_event(INFO, "wait for all recv to finish\n");
wake_up_interruptible(&info->wait_reassembly_queue);
wait_event(info->wait_smbd_recv_pending,
@@ -1103,6 +1113,24 @@ static int smbd_post_send_sgl(struct smbd_connection *info,
 }
 
 /*
+ * Send a page
+ * page: the page to send
+ * offset: offset in the page to send
+ * size: length in the page to send
+ * remaining_data_length: remaining data to send in this payload
+ */
+static int smbd_post_send_page(struct smbd_connection *info, struct page *page,
+   unsigned long offset, size_t size, int remaining_data_length)
+{
+   struct scatterlist sgl;
+
+   sg_init_table(&sgl, 1);
+   sg_set_page(&sgl, page, size, offset);
+
+   return smbd_post_send_sgl(info, &sgl, size, remaining_data_length);
+}
+
+/*
  * Send an empty message
  * Empty message is used to extend credits to peer to for keep live
  * while there is no upper layer payload to send at the time
@@ -1114,6 +1142,35 @@ static int smbd_post_send_empty(struct smbd_connection *info)
 }
 
 /*
+ * Send a data buffer
+ * iov: the iov array describing the data buffers
+ * n_vec: number of iov array
+ * remaining_data_length: remaining data to send following this packet
+ * in segmented SMBD packet
+ */
+static int smbd_post_send_data(
+   struct smbd_connection *info, struct kvec *iov, int n_vec,
+   int remaining_data_length)
+{
+   int i;
+   u32 data_length = 0;
+   struct scatterlist sgl[SMBDIRECT_MAX_SGE];
+
+   if (n_vec > SMBDIRECT_MAX_SGE) {
+   cifs_dbg(VFS, "Can't fit data to SGL, n_vec=%d\n", n_vec);
+   return -ENOMEM;
+   }
+
+   sg_init_table(sgl, n_vec);
+   for (i = 0; i < n_vec; i++) {
+   data_length += iov[i].iov_len;
+   sg_set_buf(&sgl[i], iov[i].iov_base, iov[i].iov_len);
+   }
+
+   return smbd_post_send_sgl(info, sgl, data_length, remaining_data_length);
+}
+
+/*
  * Post a receive request to the transport
  * The remote peer can only send data when a receive request is posted
  * The interaction is controlled by send/receive credit system
@@ -1680,6 +1737,9 @@ struct smbd_connection *_smbd_get_connection(
queue_delayed_work(info->workqueue, &info->idle_timer_work,
info->keep_alive_interval*HZ);
 
+   init_waitqueue_head(&info->wait_smbd_send_pending);
+   info->smbd_send_pending = 0;
+
init_waitqueue_head(&info->wait_smbd_recv_pending);
info->smbd_recv_pending = 0;
 
@@ -1973,3 +2033,191 @@ int smbd_recv(struct smbd_connection *info, struct msghdr *msg)
msg->msg_iter.count = 0;
return rc;
 }
+
+/*
+ * Send data to transport
+ * Each rqst is transported as a SMBDirect payload
+ * rqst: the data to write
+ * return value: 0 if successfully write, otherwise error code
+ */
+int smbd_send(struct smbd_connection *info, struct smb_rqst *rqst)
+{
+   struct kvec vec;
+   int nvecs;
+   int size;
+   int buflen = 0, remaining_data_length;
+   int start, i, j;
+   int max_iov_size =
+   info->max_send_size - sizeof(struct smbd_data_transfer);
+   struct kvec iov[SMBDIRECT_MAX_SGE];
+   int rc;
+   unsigned long long t1 = rdtsc();
+
+   info->smbd_send_pending++;
+   if (info->transport_status != SMBD_CONNECTED) {
+   rc = -ENODEV;
+   goto done;
+   }
+
+   /*
+* This usually means a configuration error
+* We use RDMA read/write for packet size > rdma_readwrite_threshold
+* as long as it's properly configured we should never get into 

[Patch v4 02/22] CIFS: SMBD: Establish SMBDirect connection

2017-10-01 Thread Long Li
From: Long Li 

Add code to implement the core functions to establish a SMBDirect connection.

1. Establish an RDMA connection to SMB server.
2. Negotiate and setup SMBDirect protocol.
3. Implement idle connection timer and credit management.

Add to Makefile.

Signed-off-by: Long Li 
---
 fs/cifs/Makefile|2 +-
 fs/cifs/smbdirect.c | 1600 +++
 fs/cifs/smbdirect.h |  229 
 3 files changed, 1830 insertions(+), 1 deletion(-)

diff --git a/fs/cifs/Makefile b/fs/cifs/Makefile
index 5e853a3..bb54662b 100644
--- a/fs/cifs/Makefile
+++ b/fs/cifs/Makefile
@@ -8,7 +8,7 @@ cifs-y := cifsfs.o cifssmb.o cifs_debug.o connect.o dir.o file.o inode.o \
  cifs_unicode.o nterr.o cifsencrypt.o \
  readdir.o ioctl.o sess.o export.o smb1ops.o winucase.o \
  smb2ops.o smb2maperror.o smb2transport.o \
- smb2misc.o smb2pdu.o smb2inode.o smb2file.o
+ smb2misc.o smb2pdu.o smb2inode.o smb2file.o smbdirect.o
 
 cifs-$(CONFIG_CIFS_XATTR) += xattr.o
 cifs-$(CONFIG_CIFS_ACL) += cifsacl.o
diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index d3c16f8..e8f976f 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -13,7 +13,35 @@
  *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See
  *   the GNU General Public License for more details.
  */
+#include 
 #include "smbdirect.h"
+#include "cifs_debug.h"
+#include 
+
+static struct smbd_response *get_empty_queue_buffer(
+   struct smbd_connection *info);
+static struct smbd_response *get_receive_buffer(
+   struct smbd_connection *info);
+static void put_receive_buffer(
+   struct smbd_connection *info,
+   struct smbd_response *response,
+   bool lock);
+static int allocate_receive_buffers(struct smbd_connection *info, int num_buf);
+static void destroy_receive_buffers(struct smbd_connection *info);
+
+static void put_empty_packet(
+   struct smbd_connection *info, struct smbd_response *response);
+static void enqueue_reassembly(
+   struct smbd_connection *info,
+   struct smbd_response *response, int data_length);
+static struct smbd_response *_get_first_reassembly(
+   struct smbd_connection *info);
+
+static int smbd_post_recv(
+   struct smbd_connection *info,
+   struct smbd_response *response);
+
+static int smbd_post_send_empty(struct smbd_connection *info);
 
 /* SMBD version number */
 #define SMBD_V1 0x0100
@@ -75,3 +103,1575 @@ int smbd_max_frmr_depth = 2048;
 
 /* If payload is less than this byte, use RDMA send/recv not read/write */
 int rdma_readwrite_threshold = 4096;
+
+/* Transport logging functions
+ * Logging are defined as classes. They can be OR'ed to define the actual
+ * logging level via module parameter smbd_logging_class
+ * e.g. cifs.smbd_logging_class=0xa0 will log all log_rdma_recv() and
+ * log_rdma_event()
+ */
+#define LOG_OUTGOING   0x1
+#define LOG_INCOMING   0x2
+#define LOG_READ   0x4
+#define LOG_WRITE  0x8
+#define LOG_RDMA_SEND  0x10
+#define LOG_RDMA_RECV  0x20
+#define LOG_KEEP_ALIVE 0x40
+#define LOG_RDMA_EVENT 0x80
+#define LOG_RDMA_MR    0x100
+static unsigned int smbd_logging_class = 0;
+module_param(smbd_logging_class, uint, 0644);
+MODULE_PARM_DESC(smbd_logging_class,
+   "Logging class for SMBD transport 0x0 to 0x100");
+
+#define ERR    0x0
+#define INFO   0x1
+static unsigned int smbd_logging_level = ERR;
+module_param(smbd_logging_level, uint, 0644);
+MODULE_PARM_DESC(smbd_logging_level,
+   "Logging level for SMBD transport, 0 (default): error, 1: info");
+
+#define log_rdma(level, class, fmt, args...)   \
+do {   \
+   if (level <= smbd_logging_level || class & smbd_logging_class)  \
+   cifs_dbg(VFS, "%s:%d " fmt, __func__, __LINE__, ##args);\
+} while (0)
+
+#define log_outgoing(level, fmt, args...) \
+   log_rdma(level, LOG_OUTGOING, fmt, ##args)
+#define log_incoming(level, fmt, args...) \
+   log_rdma(level, LOG_INCOMING, fmt, ##args)
+#define log_read(level, fmt, args...)  log_rdma(level, LOG_READ, fmt, ##args)
+#define log_write(level, fmt, args...) log_rdma(level, LOG_WRITE, fmt, ##args)
+#define log_rdma_send(level, fmt, args...) \
+   log_rdma(level, LOG_RDMA_SEND, fmt, ##args)
+#define log_rdma_recv(level, fmt, args...) \
+   log_rdma(level, LOG_RDMA_RECV, fmt, ##args)
+#define log_keep_alive(level, fmt, args...) \
+   log_rdma(level, LOG_KEEP_ALIVE, fmt, ##args)
+#define log_rdma_event(level, fmt, args...) \
+   log_rdma(level, LOG_RDMA_EVENT, fmt, ##args)
+#define log_rdma_mr(level, fmt, args...) \
+   

[Patch v4 01/22] CIFS: SMBD: Add SMBDirect protocol initial values and constants

2017-10-01 Thread Long Li
From: Long Li 

To prepare for protocol implementation, add constants and user-configurable
values in the SMBDirect protocol.

Signed-off-by: Long Li 
---
 fs/cifs/smbdirect.c | 77 +
 fs/cifs/smbdirect.h | 21 +++
 2 files changed, 98 insertions(+)
 create mode 100644 fs/cifs/smbdirect.c
 create mode 100644 fs/cifs/smbdirect.h

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
new file mode 100644
index 000..d3c16f8
--- /dev/null
+++ b/fs/cifs/smbdirect.c
@@ -0,0 +1,77 @@
+/*
+ *   Copyright (C) 2017, Microsoft Corporation.
+ *
+ *   Author(s): Long Li 
+ *
+ *   This program is free software;  you can redistribute it and/or modify
+ *   it under the terms of the GNU General Public License as published by
+ *   the Free Software Foundation; either version 2 of the License, or
+ *   (at your option) any later version.
+ *
+ *   This program is distributed in the hope that it will be useful,
+ *   but WITHOUT ANY WARRANTY;  without even the implied warranty of
+ *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See
+ *   the GNU General Public License for more details.
+ */
+#include "smbdirect.h"
+
+/* SMBD version number */
+#define SMBD_V1 0x0100
+
+/* Port numbers for SMBD transport */
+#define SMB_PORT   445
+#define SMBD_PORT  5445
+
+/* Address lookup and resolve timeout in ms */
+#define RDMA_RESOLVE_TIMEOUT   5000
+
+/* SMBD negotiation timeout in seconds */
+#define SMBD_NEGOTIATE_TIMEOUT 120
+
+/* SMBD minimum receive size and fragmented sized defined in [MS-SMBD] */
+#define SMBD_MIN_RECEIVE_SIZE  128
+#define SMBD_MIN_FRAGMENTED_SIZE   131072
+
+/*
+ * Default maximum number of RDMA read/write outstanding on this connection
+ * This value is possibly decreased during QP creation on hardware limit
+ */
+#define SMBD_CM_RESPONDER_RESOURCES 32
+
+/* Maximum number of retries on data transfer operations */
+#define SMBD_CM_RETRY  6
+/* No need to retry on Receiver Not Ready since SMBD manages credits */
+#define SMBD_CM_RNR_RETRY  0
+
+/*
+ * User configurable initial values per SMBD transport connection
+ * as defined in [MS-SMBD] 3.1.1.1
+ * Those may change after a SMBD negotiation
+ */
+/* The local peer's maximum number of credits to grant to the peer */
+int smbd_receive_credit_max = 255;
+
+/* The remote peer's credit request of local peer */
+int smbd_send_credit_target = 255;
+
+/* The maximum single message size can be sent to remote peer */
+int smbd_max_send_size = 1364;
+
+/*  The maximum fragmented upper-layer payload receive size supported */
+int smbd_max_fragmented_recv_size = 1024 * 1024;
+
+/*  The maximum single-message size which can be received */
+int smbd_max_receive_size = 8192;
+
+/* The timeout to initiate send of a keepalive message on idle */
+int smbd_keep_alive_interval = 120;
+
+/*
+ * User configurable initial values for RDMA transport
+ * The actual values used may be lower and are limited to hardware capabilities
+ */
+/* Default maximum number of SGEs in a RDMA write/read */
+int smbd_max_frmr_depth = 2048;
+
+/* If payload is less than this byte, use RDMA send/recv not read/write */
+int rdma_readwrite_threshold = 4096;
diff --git a/fs/cifs/smbdirect.h b/fs/cifs/smbdirect.h
new file mode 100644
index 000..c55f28b
--- /dev/null
+++ b/fs/cifs/smbdirect.h
@@ -0,0 +1,21 @@
+/*
+ *   Copyright (C) 2017, Microsoft Corporation.
+ *
+ *   Author(s): Long Li 
+ *
+ *   This program is free software;  you can redistribute it and/or modify
+ *   it under the terms of the GNU General Public License as published by
+ *   the Free Software Foundation; either version 2 of the License, or
+ *   (at your option) any later version.
+ *
+ *   This program is distributed in the hope that it will be useful,
+ *   but WITHOUT ANY WARRANTY;  without even the implied warranty of
+ *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See
+ *   the GNU General Public License for more details.
+ */
+#ifndef _SMBDIRECT_H
+#define _SMBDIRECT_H
+
+/* Default maximum number of SGEs in a RDMA send/recv */
+#define SMBDIRECT_MAX_SGE  16
+#endif
-- 
2.7.4
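
As a reading aid for the last constant in this patch, the sketch below shows
how a threshold like `rdma_readwrite_threshold` could drive the choice between
RDMA send/recv and RDMA read/write. The helper `use_rdma_send_recv()` is a
hypothetical name, not something the series defines, and treating a payload of
exactly 4096 bytes as read/write is an assumption drawn from the "less than"
wording of the comment.

```c
#include <stdbool.h>
#include <stddef.h>

/* Initial value as given in the patch. */
static int rdma_readwrite_threshold = 4096;

/*
 * Hypothetical helper: returns true when a payload of len bytes would be
 * sent via RDMA send/recv, false when RDMA read/write would be used.
 */
static bool use_rdma_send_recv(size_t len)
{
	return len < (size_t)rdma_readwrite_threshold;
}
```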



[Patch v4 06/22] CIFS: SMBD: Upper layer connects to SMBDirect session

2017-10-01 Thread Long Li
From: Long Li 

When "rdma" is specified in the mount option, CIFS attempts to connect to
SMBDirect instead of TCP socket.

Signed-off-by: Long Li 
---
 fs/cifs/connect.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index b5a575f..94b6357 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -45,6 +45,7 @@
 #include 
 #include 
 
+#include "smbdirect.h"
 #include "cifspdu.h"
 #include "cifsglob.h"
 #include "cifsproto.h"
@@ -2280,12 +2281,26 @@ cifs_get_tcp_session(struct smb_vol *volume_info)
else
tcp_ses->echo_interval = SMB_ECHO_INTERVAL_DEFAULT * HZ;
 
+   if (tcp_ses->rdma) {
+   tcp_ses->smbd_conn = smbd_get_connection(
+   tcp_ses, (struct sockaddr *)&volume_info->dstaddr);
+   if (tcp_ses->smbd_conn) {
+   cifs_dbg(VFS, "RDMA transport established\n");
+   rc = 0;
+   goto connected;
+   } else {
+   rc = -ENOENT;
+   goto out_err_crypto_release;
+   }
+   }
+
rc = ip_connect(tcp_ses);
if (rc < 0) {
cifs_dbg(VFS, "Error connecting to socket. Aborting operation.\n");
goto out_err_crypto_release;
}
 
+connected:
/*
 * since we're in a cifs function already, we know that
 * this will succeed. No need for try_module_get().
-- 
2.7.4
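
The control flow added by this hunk can be summarised with a small user-space
sketch. Everything here is hypothetical scaffolding (`struct demo_session`,
`demo_pick_transport()`, and the `-ECONNREFUSED` used for a failed TCP
connect); the one point it is meant to show is that, as the diff reads, a
failed RDMA connection returns -ENOENT rather than falling back to TCP.

```c
#include <errno.h>
#include <stdbool.h>

enum demo_transport { DEMO_NONE, DEMO_TCP, DEMO_RDMA };

struct demo_session {
	bool rdma;			/* "rdma" mount option given */
	enum demo_transport active;
};

/*
 * rdma_ok / tcp_ok stand in for the outcome of the real connect calls.
 * The TCP error code is a placeholder; the real code returns whatever
 * its connect routine reports.
 */
static int demo_pick_transport(struct demo_session *s, bool rdma_ok, bool tcp_ok)
{
	if (s->rdma) {
		if (!rdma_ok)
			return -ENOENT;	/* no fallback to TCP */
		s->active = DEMO_RDMA;
		return 0;
	}

	if (!tcp_ok)
		return -ECONNREFUSED;
	s->active = DEMO_TCP;
	return 0;
}
```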



[Patch v4 07/22] CIFS: SMBD: Implement function to reconnect to a SMBDirect transport

2017-10-01 Thread Long Li
From: Long Li 

Add function to implement a reconnect to SMBDirect. This involves tearing down
the current connection and establishing/negotiating a new connection.

Signed-off-by: Long Li 
---
 fs/cifs/smbdirect.c | 36 
 fs/cifs/smbdirect.h |  3 +++
 2 files changed, 39 insertions(+)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index 34f73e2..1f0f33c 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -1416,6 +1416,42 @@ static void idle_connection_timer(struct work_struct *work)
info->keep_alive_interval*HZ);
 }
 
+/*
+ * Reconnect this SMBD connection, called from upper layer
+ * return value: 0 on success, or actual error code
+ */
+int smbd_reconnect(struct TCP_Server_Info *server)
+{
+   log_rdma_event(INFO, "reconnecting rdma session\n");
+
+   if (!server->smbd_conn) {
+   log_rdma_event(ERR, "rdma session already destroyed\n");
+   return -EINVAL;
+   }
+
+   /*
+* This is possible if transport is disconnected and we haven't received
+* notification from RDMA, but upper layer has detected timeout
+*/
+   if (server->smbd_conn->transport_status == SMBD_CONNECTED) {
+   log_rdma_event(INFO, "disconnecting transport\n");
+   smbd_disconnect_rdma_connection(server->smbd_conn);
+   }
+
+   /* wait until the transport is destroyed */
+   wait_event(server->smbd_conn->wait_destroy,
+   server->smbd_conn->transport_status == SMBD_DESTROYED);
+
+   destroy_workqueue(server->smbd_conn->workqueue);
+   kfree(server->smbd_conn);
+
+   log_rdma_event(INFO, "creating rdma session\n");
+   server->smbd_conn = smbd_get_connection(
+   server, (struct sockaddr *) &server->dstaddr);
+
+   return server->smbd_conn ? 0 : -ENOENT;
+}
+
 static void destroy_caches_and_workqueue(struct smbd_connection *info)
 {
destroy_receive_buffers(info);
diff --git a/fs/cifs/smbdirect.h b/fs/cifs/smbdirect.h
index 42a9338..9818852 100644
--- a/fs/cifs/smbdirect.h
+++ b/fs/cifs/smbdirect.h
@@ -249,6 +249,9 @@ struct smbd_response {
 struct smbd_connection *smbd_get_connection(
struct TCP_Server_Info *server, struct sockaddr *dstaddr);
 
+/* Reconnect SMBDirect session */
+int smbd_reconnect(struct TCP_Server_Info *server);
+
 void profiling_display_histogram(
struct seq_file *m, unsigned long long array[]);
 #endif
-- 
2.7.4
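
A user-space sketch of the teardown-then-recreate sequence may help when
reviewing the error paths. The types and helpers (`struct demo_conn`,
`demo_conn_create()`, `demo_reconnect()`) are hypothetical, and the
disconnect-and-wait step is collapsed into a single status assignment. It
mirrors one property of the function as posted: if re-creation fails, the
connection pointer is left NULL, so a later reconnect attempt returns -EINVAL.

```c
#include <errno.h>
#include <stdlib.h>

enum demo_status { DEMO_CONNECTED, DEMO_DESTROYED };

struct demo_conn {
	enum demo_status status;
};

struct demo_server {
	struct demo_conn *conn;
};

/* Stand-in for establishing and negotiating a new connection. */
static struct demo_conn *demo_conn_create(void)
{
	struct demo_conn *c = malloc(sizeof(*c));

	if (c)
		c->status = DEMO_CONNECTED;
	return c;
}

static int demo_reconnect(struct demo_server *srv)
{
	if (!srv->conn)
		return -EINVAL;		/* session already destroyed */

	/* Stands in for "disconnect, then wait until destroyed". */
	if (srv->conn->status == DEMO_CONNECTED)
		srv->conn->status = DEMO_DESTROYED;

	free(srv->conn);
	srv->conn = demo_conn_create();

	return srv->conn ? 0 : -ENOENT;
}
```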



[Patch v4 11/22] CIFS: SMBD: Set SMBDirect maximum read or write size for I/O

2017-10-01 Thread Long Li
From: Long Li 

When connecting over SMBDirect, the transport negotiates its maximum I/O sizes
with the server and determines how to choose to do RDMA send/recv vs
read/write. Expose these maximum I/O sizes to upper layer so we will get the
correct sized payloads.

Signed-off-by: Long Li 
---
 fs/cifs/smb2ops.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/cifs/smb2ops.c b/fs/cifs/smb2ops.c
index fb2934b..7ad35d6 100644
--- a/fs/cifs/smb2ops.c
+++ b/fs/cifs/smb2ops.c
@@ -32,6 +32,7 @@
 #include "smb2status.h"
 #include "smb2glob.h"
 #include "cifs_ioctl.h"
+#include "smbdirect.h"
 
 static int
 change_conf(struct TCP_Server_Info *server)
@@ -249,7 +250,11 @@ smb2_negotiate_wsize(struct cifs_tcon *tcon, struct smb_vol *volume_info)
 
/* start with specified wsize, or default */
wsize = volume_info->wsize ? volume_info->wsize : CIFS_DEFAULT_IOSIZE;
-   wsize = min_t(unsigned int, wsize, server->max_write);
+   if (server->rdma)
+   wsize = min_t(unsigned int,
+   wsize, server->smbd_conn->max_readwrite_size);
+   else
+   wsize = min_t(unsigned int, wsize, server->max_write);
 
if (!(server->capabilities & SMB2_GLOBAL_CAP_LARGE_MTU))
wsize = min_t(unsigned int, wsize, SMB2_MAX_BUFFER_SIZE);
@@ -265,7 +270,11 @@ smb2_negotiate_rsize(struct cifs_tcon *tcon, struct smb_vol *volume_info)
 
/* start with specified rsize, or default */
rsize = volume_info->rsize ? volume_info->rsize : CIFS_DEFAULT_IOSIZE;
-   rsize = min_t(unsigned int, rsize, server->max_read);
+   if (server->rdma)
+   rsize = min_t(unsigned int,
+   rsize, server->smbd_conn->max_readwrite_size);
+   else
+   rsize = min_t(unsigned int, rsize, server->max_read);
 
if (!(server->capabilities & SMB2_GLOBAL_CAP_LARGE_MTU))
rsize = min_t(unsigned int, rsize, SMB2_MAX_BUFFER_SIZE);
-- 
2.7.4
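
For readers following along, the clamping this patch adds can be sketched in user space roughly as below. This is only an illustration: clamp_io_size() and its parameter names are hypothetical, the limits are passed in as plain integers rather than read from the server and connection structures, and the later SMB2_GLOBAL_CAP_LARGE_MTU clamp is left out.

```c
#include <stdbool.h>

/* Return the smaller of two unsigned values (stand-in for min_t()). */
static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical helper mirroring the patched negotiate logic:
 * start from the requested size (or a default when none was given),
 * then clamp to the SMBDirect limit when RDMA is in use, otherwise
 * to the limit reported by the server.
 */
static unsigned int clamp_io_size(bool rdma, unsigned int requested,
				  unsigned int default_size,
				  unsigned int server_max,
				  unsigned int smbd_max)
{
	unsigned int size = requested ? requested : default_size;

	if (rdma)
		return min_uint(size, smbd_max);
	return min_uint(size, server_max);
}
```

The same helper would serve both the rsize and the wsize case, since the patch applies the single max_readwrite_size limit to both directions when RDMA is used.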



[Patch v4 08/22] CIFS: SMBD: Upper layer reconnects to SMBDirect session

2017-10-01 Thread Long Li
From: Long Li 

Do a reconnect on SMBDirect when it is used as the connection. Reconnect can
happen for many reasons and it's mostly the decision of upper layer SMB2.

Signed-off-by: Long Li 
---
 fs/cifs/connect.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 94b6357..26ad706 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -405,7 +405,11 @@ cifs_reconnect(struct TCP_Server_Info *server)
 
/* we should try only the port we connected to before */
mutex_lock(&server->srv_mutex);
-   rc = generic_ip_connect(server);
+   if (server->rdma)
+   rc = smbd_reconnect(server);
+   else
+   rc = generic_ip_connect(server);
+
if (rc) {
cifs_dbg(FYI, "reconnect error %d\n", rc);
mutex_unlock(&server->srv_mutex);
-- 
2.7.4



[Patch v4 21/22] CIFS: SMBD: Upper layer performs SMB read via RDMA write through memory registration

2017-10-01 Thread Long Li
From: Long Li 

If I/O size is larger than rdma_readwrite_threshold, use RDMA write for
SMB read by specifying channel SMB2_CHANNEL_RDMA_V1 or
SMB2_CHANNEL_RDMA_V1_INVALIDATE in the SMB packet, depending on SMB dialect
used. Append a smbd_buffer_descriptor_v1 to the end of the SMB packet and fill
in other values to indicate this SMB read uses RDMA write.

There is no need to read from the transport for incoming payload. By the time
the SMB read response comes back, the data has already been transferred and
placed in the pages by the RDMA hardware.

When the SMB read is finished, deregister the memory regions if RDMA write was
used for this SMB read. smbd_deregister_mr may need to do local invalidation
and sleep if server remote invalidation is not used.

In some situations the MID may not be created because of an I/O failure; in
that case the memory region is deregistered when the read data context is
released.

Signed-off-by: Long Li 
---
 fs/cifs/cifsglob.h |  1 +
 fs/cifs/file.c | 10 ++
 fs/cifs/smb2pdu.c  | 43 +++
 3 files changed, 54 insertions(+)

diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index f851b50..30b99a5 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -1152,6 +1152,7 @@ struct cifs_readdata {
struct cifs_readdata *rdata,
struct iov_iter *iter);
struct kvec iov[2];
+   struct smbd_mr  *mr;
unsigned intpagesz;
unsigned inttailsz;
unsigned intcredits;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0786f19..8396f1e 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -42,6 +42,7 @@
 #include "cifs_debug.h"
 #include "cifs_fs_sb.h"
 #include "fscache.h"
+#include "smbdirect.h"
 
 
 static inline int cifs_convert_flags(unsigned int flags)
@@ -2909,6 +2910,11 @@ cifs_readdata_release(struct kref *refcount)
struct cifs_readdata *rdata = container_of(refcount,
struct cifs_readdata, refcount);
 
+   if (rdata->mr) {
+   smbd_deregister_mr(rdata->mr);
+   rdata->mr = NULL;
+   }
+
if (rdata->cfile)
cifsFileInfo_put(rdata->cfile);
 
@@ -3037,6 +3043,8 @@ uncached_fill_pages(struct TCP_Server_Info *server,
}
if (iter)
result = copy_page_from_iter(page, 0, n, iter);
+   else if (rdata->mr)
+   result = n;
else
result = cifs_read_page_from_socket(server, page, n);
if (result < 0)
@@ -3606,6 +3614,8 @@ readpages_fill_pages(struct TCP_Server_Info *server,
 
if (iter)
result = copy_page_from_iter(page, 0, n, iter);
+   else if (rdata->mr)
+   result = n;
else
result = cifs_read_page_from_socket(server, page, n);
if (result < 0)
diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index 7053db9..31dcee0 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -2380,6 +2380,39 @@ smb2_new_read_req(void **buf, unsigned int *total_len,
req->Length = cpu_to_le32(io_parms->length);
req->Offset = cpu_to_le64(io_parms->offset);
 
+   /*
+* If we want to do a RDMA write, fill in and append
+* smbd_buffer_descriptor_v1 to the end of read request
+*/
+   if (server->rdma && rdata &&
+   rdata->bytes >= server->smbd_conn->rdma_readwrite_threshold) {
+
+   struct smbd_buffer_descriptor_v1 *v1;
+   bool need_invalidate =
+   io_parms->tcon->ses->server->dialect == SMB30_PROT_ID;
+
+   rdata->mr = smbd_register_mr(
+   server->smbd_conn, rdata->pages,
+   rdata->nr_pages, rdata->tailsz,
+   true, need_invalidate);
+   if (!rdata->mr)
+   return -ENOBUFS;
+
+   req->Channel = SMB2_CHANNEL_RDMA_V1_INVALIDATE;
+   if (need_invalidate)
+   req->Channel = SMB2_CHANNEL_RDMA_V1;
+   req->ReadChannelInfoOffset =
+   offsetof(struct smb2_read_plain_req, Buffer);
+   req->ReadChannelInfoLength =
+   sizeof(struct smbd_buffer_descriptor_v1);
+   v1 = (struct smbd_buffer_descriptor_v1 *) &req->Buffer[0];
+   v1->offset = rdata->mr->mr->iova;
+   v1->token = rdata->mr->mr->rkey;
+   v1->length = rdata->mr->mr->length;
+
+   *total_len += sizeof(*v1) - 1;
+   }
+
if (request_type & CHAINED_REQUEST) {
if (!(request_type & END_OF_CHAIN)) {
/* next 8-byte aligned request */
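
The channel selection in the hunk above can be summarized with a small user-space sketch. The enum and the function pick_read_channel() are hypothetical names used only for illustration; they are not the wire values or the kernel helpers. Note that the code compares with >=, so an I/O exactly at the threshold already takes the RDMA path, although the commit message says "larger than".

```c
#include <stdbool.h>

#define SMB30_PROT_ID 0x0300

/* Illustrative labels only; not the on-the-wire Channel values. */
enum read_channel {
	CHANNEL_NONE,
	CHANNEL_RDMA_V1,
	CHANNEL_RDMA_V1_INVALIDATE,
};

/*
 * Decide how an SMB read is carried:
 * - no RDMA, or a read below the threshold: payload comes back inline;
 * - dialect is exactly SMB 3.0: the patch sets need_invalidate and
 *   requests plain RDMA_V1, so the client invalidates locally;
 * - otherwise: RDMA_V1_INVALIDATE is requested.
 */
static enum read_channel pick_read_channel(bool rdma, unsigned int bytes,
					   unsigned int threshold,
					   unsigned short dialect)
{
	if (!rdma || bytes < threshold)
		return CHANNEL_NONE;
	if (dialect == SMB30_PROT_ID)
		return CHANNEL_RDMA_V1;
	return CHANNEL_RDMA_V1_INVALIDATE;
}
```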

[Patch v4 12/22] CIFS: SMBD: Implement function to receive data via RDMA receive

2017-10-01 Thread Long Li
From: Long Li 

On the receive path, the transport maintains receive buffers and a reassembly
queue for transferring payload via RDMA recv. There is a data copy in the
transport on recv when it copies the payload to the upper layer.

The transport recognizes the RFC1002 header length used in the SMB
upper layer payloads in CIFS. Because this length is mainly used for TCP and
is not applicable to RDMA, it is handled as out-of-band information and is
never sent over the wire, and the transport behaves like TCP to the upper layer
by processing and exposing the length correctly on data payloads.

Signed-off-by: Long Li 
---
 fs/cifs/smbdirect.c | 229 
 fs/cifs/smbdirect.h |   6 ++
 2 files changed, 235 insertions(+)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index cb129c2..b9be9d6 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -200,6 +200,8 @@ static void smbd_destroy_rdma_work(struct work_struct *work)
 
log_rdma_event(INFO, "wait for all recv to finish\n");
wake_up_interruptible(&info->wait_reassembly_queue);
+   wait_event(info->wait_smbd_recv_pending,
+   info->smbd_recv_pending == 0);
 
log_rdma_event(INFO, "wait for all send posted to IB to finish\n");
wait_event(info->wait_send_pending,
@@ -1678,6 +1680,9 @@ struct smbd_connection *_smbd_get_connection(
queue_delayed_work(info->workqueue, &info->idle_timer_work,
info->keep_alive_interval*HZ);
 
+   init_waitqueue_head(&info->wait_smbd_recv_pending);
+   info->smbd_recv_pending = 0;
+
init_waitqueue_head(&info->wait_send_pending);
atomic_set(&info->send_pending, 0);
 
@@ -1744,3 +1749,227 @@ struct smbd_connection *smbd_get_connection(
}
return ret;
 }
+
+/*
+ * Receive data from receive reassembly queue
+ * All the incoming data packets are placed in reassembly queue
+ * buf: the buffer to read data into
+ * size: the length of data to read
+ * return value: actual data read
+ * Note: this implementation copies the data from reassembly queue to receive
+ * buffers used by upper layer. This is not the optimal code path. A better way
+ * to do it is to not have upper layer allocate its receive buffers but rather
+ * borrow the buffer from reassembly queue, and return it after data is
+ * consumed. But this will require more changes to upper layer code, and also
+ * need to consider packet boundaries while they are still being reassembled.
+ */
+int smbd_recv_buf(struct smbd_connection *info, char *buf, unsigned int size)
+{
+   struct smbd_response *response;
+   struct smbd_data_transfer *data_transfer;
+   int to_copy, to_read, data_read, offset;
+   u32 data_length, remaining_data_length, data_offset;
+   int rc;
+   unsigned long flags;
+
+again:
+   if (info->transport_status != SMBD_CONNECTED) {
+   log_read(ERR, "disconnected\n");
+   return -ENODEV;
+   }
+
+   /*
+* No need to hold the reassembly queue lock all the time as we are
+* the only one reading from the front of the queue. The transport
+* may add more entries to the back of the queue at the same time
+*/
+   log_read(INFO, "size=%d info->reassembly_data_length=%d\n", size,
+   info->reassembly_data_length);
+   if (info->reassembly_data_length >= size) {
+   unsigned long long t1 = rdtsc();
+   int queue_length;
+   int queue_removed = 0;
+
+   /*
+* Need to make sure reassembly_data_length is read before
+* reading reassembly_queue_length and calling
+* _get_first_reassembly. This call is lock free
+* as we never read at the end of the queue which are being
+* updated in SOFTIRQ as more data is received
+*/
+   virt_rmb();
+   queue_length = info->reassembly_queue_length;
+   data_read = 0;
+   to_read = size;
+   offset = info->first_entry_offset;
+   while (data_read < size) {
+   response = _get_first_reassembly(info);
+   data_transfer = smbd_response_payload(response);
+   data_length = le32_to_cpu(data_transfer->data_length);
+   remaining_data_length =
+   le32_to_cpu(
+   data_transfer->remaining_data_length);
+   data_offset = le32_to_cpu(data_transfer->data_offset);
+
+   /*
+* The upper layer expects RFC1002 length at the
+* beginning of the payload. Return it to indicate
+* the total length of the packet. This minimize the
+* change to upper layer packet processing logic. This
+* will be eventually remove when
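
Since the archived diff is cut off here, the copy loop described in the commit message can be pictured with the following user-space sketch. It is an assumption-laden illustration, not the patch's code: struct segment, struct reasm_queue and queue_read() are hypothetical, and the sketch leaves out locking, memory barriers, the RFC1002 length handling and the waiting that smbd_recv_buf() does when too little data is queued (here the function simply returns 0).

```c
#include <stddef.h>
#include <string.h>

/* One received payload sitting in the (hypothetical) reassembly queue. */
struct segment {
	const char *data;
	size_t len;
};

struct reasm_queue {
	struct segment *seg;        /* queued segments, oldest first */
	size_t head;                /* index of the first unconsumed segment */
	size_t first_entry_offset;  /* bytes already consumed from that segment */
	size_t data_length;         /* total unconsumed bytes in the queue */
};

/*
 * Copy exactly `size` bytes from the front of the queue into `buf`.
 * Returns `size` on success, or 0 if fewer than `size` bytes are queued.
 */
static size_t queue_read(struct reasm_queue *q, char *buf, size_t size)
{
	size_t done = 0;

	if (q->data_length < size)
		return 0;

	while (done < size) {
		struct segment *s = &q->seg[q->head];
		size_t avail = s->len - q->first_entry_offset;
		size_t n = size - done < avail ? size - done : avail;

		memcpy(buf + done, s->data + q->first_entry_offset, n);
		done += n;

		if (n == avail) {
			/* Segment fully consumed: move to the next one. */
			q->head++;
			q->first_entry_offset = 0;
		} else {
			/* Partially consumed: remember where to resume. */
			q->first_entry_offset += n;
		}
	}

	q->data_length -= size;
	return size;
}
```

Tracking first_entry_offset is what lets a read stop in the middle of one received packet and resume there on the next call.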

[Patch v4 10/22] CIFS: SMBD: Upper layer destroys SMBDirect session on shutdown or umount

2017-10-01 Thread Long Li
From: Long Li 

When CIFS wants to umount, call shutdown on transport when SMBDirect is used.

Signed-off-by: Long Li 
---
 fs/cifs/connect.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 26ad706..1a9f22f 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -704,6 +704,11 @@ static void clean_demultiplex_info(struct TCP_Server_Info *server)
/* give those requests time to exit */
msleep(125);
 
+   if (server->smbd_conn) {
+   smbd_destroy(server->smbd_conn);
+   server->smbd_conn = NULL;
+   }
+
if (server->ssocket) {
sock_release(server->ssocket);
server->ssocket = NULL;
-- 
2.7.4



[Patch v4 04/22] CIFS: SMBD: Add rdma mount option

2017-10-01 Thread Long Li
From: Long Li 

Add "rdma" to CIFS mount options to connect to SMB Direct.
Add checks to validate this is used on SMB 3.X dialects.

To connect to SMBDirect, use "mount.cifs -o rdma,vers=3.x".
At the time of this patch, 3.x can be 3.0, 3.02 or 3.1.1.

Signed-off-by: Long Li 
---
 fs/cifs/cifs_debug.c |  2 ++
 fs/cifs/cifsfs.c |  2 ++
 fs/cifs/cifsglob.h   |  5 +
 fs/cifs/connect.c| 15 ++-
 4 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/fs/cifs/cifs_debug.c b/fs/cifs/cifs_debug.c
index bdc2f38..9738026 100644
--- a/fs/cifs/cifs_debug.c
+++ b/fs/cifs/cifs_debug.c
@@ -171,6 +171,8 @@ static int cifs_debug_data_proc_show(struct seq_file *m, void *v)
ses->ses_count, ses->serverOS, ses->serverNOS,
ses->capabilities, ses->status);
}
+   if (server->rdma)
+   seq_printf(m, "RDMA\n\t");
seq_printf(m, "TCP status: %d\n\tLocal Users To "
   "Server: %d SecMode: 0x%x Req On Wire: %d",
   server->tcpStatus, server->srv_count,
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 180b335..e15fbf1 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -327,6 +327,8 @@ cifs_show_address(struct seq_file *s, struct TCP_Server_Info *server)
default:
seq_puts(s, "(unknown)");
}
+   if (server->rdma)
+   seq_puts(s, ",rdma");
 }
 
 static void
diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index 808486c..5585516 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -530,6 +530,7 @@ struct smb_vol {
bool nopersistent:1;
bool resilient:1; /* noresilient not required since not fored for CA */
bool domainauto:1;
+   bool rdma:1;
unsigned int rsize;
unsigned int wsize;
bool sockopt_tcp_nodelay:1;
@@ -646,6 +647,10 @@ struct TCP_Server_Info {
boolsec_kerberos;   /* supports plain Kerberos */
boolsec_mskerberos; /* supports legacy MS Kerberos */
boollarge_buf;  /* is current buffer large? */
+   /* use SMBD connection instead of socket */
+   boolrdma;
+   /* point to the SMBD connection if RDMA is used instead of socket */
+   struct smbd_connection *smbd_conn;
struct delayed_work echo; /* echo ping workqueue job */
char*smallbuf;  /* pointer to current "small" buffer */
char*bigbuf;/* pointer to current "big" buffer */
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 59647eb..b5a575f 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -92,7 +92,7 @@ enum {
Opt_multiuser, Opt_sloppy, Opt_nosharesock,
Opt_persistent, Opt_nopersistent,
Opt_resilient, Opt_noresilient,
-   Opt_domainauto,
+   Opt_domainauto, Opt_rdma,
 
/* Mount options which take numeric value */
Opt_backupuid, Opt_backupgid, Opt_uid,
@@ -183,6 +183,7 @@ static const match_table_t cifs_mount_option_tokens = {
{ Opt_resilient, "resilienthandles"},
{ Opt_noresilient, "noresilienthandles"},
{ Opt_domainauto, "domainauto"},
+   { Opt_rdma, "rdma"},
 
{ Opt_backupuid, "backupuid=%s" },
{ Opt_backupgid, "backupgid=%s" },
@@ -1538,6 +1539,9 @@ cifs_parse_mount_options(const char *mountdata, const char *devname,
case Opt_domainauto:
vol->domainauto = true;
break;
+   case Opt_rdma:
+   vol->rdma = true;
+   break;
 
/* Numeric Values */
case Opt_backupuid:
@@ -1928,6 +1932,11 @@ cifs_parse_mount_options(const char *mountdata, const char *devname,
goto cifs_parse_mount_err;
}
 
+   if (vol->rdma && vol->vals->protocol_id < SMB30_PROT_ID) {
+   cifs_dbg(VFS, "SMB Direct requires Version >=3.0\n");
+   goto cifs_parse_mount_err;
+   }
+
 #ifndef CONFIG_KEYS
/* Muliuser mounts require CONFIG_KEYS support */
if (vol->multiuser) {
@@ -2131,6 +2140,9 @@ static int match_server(struct TCP_Server_Info *server, struct smb_vol *vol)
if (server->echo_interval != vol->echo_interval * HZ)
return 0;
 
+   if (server->rdma != vol->rdma)
+   return 0;
+
return 1;
 }
 
@@ -2229,6 +2241,7 @@ cifs_get_tcp_session(struct smb_vol *volume_info)
tcp_ses->noblocksnd = volume_info->noblocksnd;
tcp_ses->noautotune = volume_info->noautotune;
tcp_ses->tcp_nodelay = volume_info->sockopt_tcp_nodelay;
+   tcp_ses->rdma = volume_info->rdma;
tcp_ses->in_flight = 0;
tcp_ses->credits = 1;
init_waitqueue_head(&tcp_ses->response_q);
-- 
2.7.4
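
As a rough illustration of the dialect check this patch adds to mount option parsing, here is a user-space sketch. validate_rdma_option() is a hypothetical helper; the dialect constants follow the SMB2/3 dialect revision numbers, and the real code reports the error through cifs_dbg() and jumps to cifs_parse_mount_err instead of returning -1.

```c
#include <stdbool.h>

/* SMB2/3 dialect revision numbers (subset, for illustration). */
#define SMB21_PROT_ID  0x0210
#define SMB30_PROT_ID  0x0300
#define SMB302_PROT_ID 0x0302
#define SMB311_PROT_ID 0x0311

/*
 * Hypothetical check: "rdma" is only accepted together with an
 * SMB 3.x dialect. Returns 0 if the combination is acceptable,
 * -1 if the mount should be rejected.
 */
static int validate_rdma_option(bool rdma, unsigned short protocol_id)
{
	if (rdma && protocol_id < SMB30_PROT_ID)
		return -1;
	return 0;
}
```

With this check, "-o rdma,vers=2.1" would be refused while "-o rdma,vers=3.0" and later dialects pass; mounts without "rdma" are unaffected.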



[Patch v4 21/22] CIFS: SMBD: Upper layer performs SMB read via RDMA write through memory registration

2017-10-01 Thread Long Li
From: Long Li 

If I/O size is larger than rdma_readwrite_threshold, use RDMA write for
SMB read by specifying channel SMB2_CHANNEL_RDMA_V1 or
SMB2_CHANNEL_RDMA_V1_INVALIDATE in the SMB packet, depending on SMB dialect
used. Append a smbd_buffer_descriptor_v1 to the end of the SMB packet and fill
in other values to indicate this SMB read uses RDMA write.

There is no need to read from the transport for incoming payload. At the time
SMB read response comes back, the data is already transfered and placed in the
pages by RDMA hardware.

When SMB read is finished, deregister the memory regions if RDMA write is used
for this SMB read. smbd_deregister_mr may need to do local invalidation and
sleep, if server remote invalidation is not used.

There are situations where the MID may not be created on I/O failure, under
which memory region is deregistered when read data context is released.

Signed-off-by: Long Li 
---
 fs/cifs/cifsglob.h |  1 +
 fs/cifs/file.c | 10 ++
 fs/cifs/smb2pdu.c  | 43 +++
 3 files changed, 54 insertions(+)

diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index f851b50..30b99a5 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -1152,6 +1152,7 @@ struct cifs_readdata {
struct cifs_readdata *rdata,
struct iov_iter *iter);
struct kvec iov[2];
+   struct smbd_mr  *mr;
unsigned intpagesz;
unsigned inttailsz;
unsigned intcredits;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0786f19..8396f1e 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -42,6 +42,7 @@
 #include "cifs_debug.h"
 #include "cifs_fs_sb.h"
 #include "fscache.h"
+#include "smbdirect.h"
 
 
 static inline int cifs_convert_flags(unsigned int flags)
@@ -2909,6 +2910,11 @@ cifs_readdata_release(struct kref *refcount)
struct cifs_readdata *rdata = container_of(refcount,
struct cifs_readdata, refcount);
 
+   if (rdata->mr) {
+   smbd_deregister_mr(rdata->mr);
+   rdata->mr = NULL;
+   }
+
if (rdata->cfile)
cifsFileInfo_put(rdata->cfile);
 
@@ -3037,6 +3043,8 @@ uncached_fill_pages(struct TCP_Server_Info *server,
}
if (iter)
result = copy_page_from_iter(page, 0, n, iter);
+   else if (rdata->mr)
+   result = n;
else
result = cifs_read_page_from_socket(server, page, n);
if (result < 0)
@@ -3606,6 +3614,8 @@ readpages_fill_pages(struct TCP_Server_Info *server,
 
if (iter)
result = copy_page_from_iter(page, 0, n, iter);
+   else if (rdata->mr)
+   result = n;
else
result = cifs_read_page_from_socket(server, page, n);
if (result < 0)
diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index 7053db9..31dcee0 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -2380,6 +2380,39 @@ smb2_new_read_req(void **buf, unsigned int *total_len,
req->Length = cpu_to_le32(io_parms->length);
req->Offset = cpu_to_le64(io_parms->offset);
 
+   /*
+* If we want to do a RDMA write, fill in and append
+* smbd_buffer_descriptor_v1 to the end of read request
+*/
+   if (server->rdma && rdata &&
+   rdata->bytes >= server->smbd_conn->rdma_readwrite_threshold) {
+
+   struct smbd_buffer_descriptor_v1 *v1;
+   bool need_invalidate =
+   io_parms->tcon->ses->server->dialect == SMB30_PROT_ID;
+
+   rdata->mr = smbd_register_mr(
+   server->smbd_conn, rdata->pages,
+   rdata->nr_pages, rdata->tailsz,
+   true, need_invalidate);
+   if (!rdata->mr)
+   return -ENOBUFS;
+
+   req->Channel = SMB2_CHANNEL_RDMA_V1_INVALIDATE;
+   if (need_invalidate)
+   req->Channel = SMB2_CHANNEL_RDMA_V1;
+   req->ReadChannelInfoOffset =
+   offsetof(struct smb2_read_plain_req, Buffer);
+   req->ReadChannelInfoLength =
+   sizeof(struct smbd_buffer_descriptor_v1);
+   v1 = (struct smbd_buffer_descriptor_v1 *) >Buffer[0];
+   v1->offset = rdata->mr->mr->iova;
+   v1->token = rdata->mr->mr->rkey;
+   v1->length = rdata->mr->mr->length;
+
+   *total_len += sizeof(*v1) - 1;
+   }
+
if (request_type & CHAINED_REQUEST) {
if (!(request_type & END_OF_CHAIN)) {
/* next 8-byte aligned request */
@@ -2459,6 +2492,16 @@ 

[Patch v4 12/22] CIFS: SMBD: Implement function to receive data via RDMA receive

2017-10-01 Thread Long Li
From: Long Li 

On the receive path, the transport maintains receive buffers and a reassembly
queue for transferring payload via RDMA recv. There is data copy in the
transport on recv when it copies the payload to upper layer.

The transport recognizes the RFC1002 header length use in the SMB
upper layer payloads in CIFS. Because this length is mainly used for TCP and
not applicable to RDMA, it is handled as a out-of-band information and is
never sent over the wire, and the trasnport behaves like TCP to upper layer
by processing and exposing the length correctly on data payloads.

Signed-off-by: Long Li 
---
 fs/cifs/smbdirect.c | 229 
 fs/cifs/smbdirect.h |   6 ++
 2 files changed, 235 insertions(+)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index cb129c2..b9be9d6 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -200,6 +200,8 @@ static void smbd_destroy_rdma_work(struct work_struct *work)
 
log_rdma_event(INFO, "wait for all recv to finish\n");
wake_up_interruptible(>wait_reassembly_queue);
+   wait_event(info->wait_smbd_recv_pending,
+   info->smbd_recv_pending == 0);
 
log_rdma_event(INFO, "wait for all send posted to IB to finish\n");
wait_event(info->wait_send_pending,
@@ -1678,6 +1680,9 @@ struct smbd_connection *_smbd_get_connection(
queue_delayed_work(info->workqueue, >idle_timer_work,
info->keep_alive_interval*HZ);
 
+   init_waitqueue_head(>wait_smbd_recv_pending);
+   info->smbd_recv_pending = 0;
+
init_waitqueue_head(>wait_send_pending);
atomic_set(>send_pending, 0);
 
@@ -1744,3 +1749,227 @@ struct smbd_connection *smbd_get_connection(
}
return ret;
 }
+
+/*
+ * Receive data from receive reassembly queue
+ * All the incoming data packets are placed in reassembly queue
+ * buf: the buffer to read data into
+ * size: the length of data to read
+ * return value: actual data read
+ * Note: this implementation copies the data from reassebmly queue to receive
+ * buffers used by upper layer. This is not the optimal code path. A better way
+ * to do it is to not have upper layer allocate its receive buffers but rather
+ * borrow the buffer from reassembly queue, and return it after data is
+ * consumed. But this will require more changes to upper layer code, and also
+ * need to consider packet boundaries while they still being reassembled.
+ */
+int smbd_recv_buf(struct smbd_connection *info, char *buf, unsigned int size)
+{
+   struct smbd_response *response;
+   struct smbd_data_transfer *data_transfer;
+   int to_copy, to_read, data_read, offset;
+   u32 data_length, remaining_data_length, data_offset;
+   int rc;
+   unsigned long flags;
+
+again:
+   if (info->transport_status != SMBD_CONNECTED) {
+   log_read(ERR, "disconnected\n");
+   return -ENODEV;
+   }
+
+   /*
+* No need to hold the reassembly queue lock all the time as we are
+* the only one reading from the front of the queue. The transport
+* may add more entries to the back of the queeu at the same time
+*/
+   log_read(INFO, "size=%d info->reassembly_data_length=%d\n", size,
+   info->reassembly_data_length);
+   if (info->reassembly_data_length >= size) {
+   unsigned long long t1 = rdtsc();
+   int queue_length;
+   int queue_removed = 0;
+
+   /*
+* Need to make sure reassembly_data_length is read before
+* reading reassembly_queue_length and calling
+* _get_first_reassembly. This call is lock free
+* as we never read at the end of the queue which are being
+* updated in SOFTIRQ as more data is received
+*/
+   virt_rmb();
+   queue_length = info->reassembly_queue_length;
+   data_read = 0;
+   to_read = size;
+   offset = info->first_entry_offset;
+   while (data_read < size) {
+   response = _get_first_reassembly(info);
+   data_transfer = smbd_response_payload(response);
+   data_length = le32_to_cpu(data_transfer->data_length);
+   remaining_data_length =
+   le32_to_cpu(
+   data_transfer->remaining_data_length);
+   data_offset = le32_to_cpu(data_transfer->data_offset);
+
+   /*
+* The upper layer expects RFC1002 length at the
+* beginning of the payload. Return it to indicate
+* the total length of the packet. This minimize the
+* change to upper layer packet processing logic. This
+* will be eventually remove when 

[Patch v4 10/22] CIFS: SMBD: Upper layer destroys SMBDirect session on shutdown or umount

2017-10-01 Thread Long Li
From: Long Li 

When CIFS wants to umount, call shutdown on transport when SMBDirect is used.

Signed-off-by: Long Li 
---
 fs/cifs/connect.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 26ad706..1a9f22f 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -704,6 +704,11 @@ static void clean_demultiplex_info(struct TCP_Server_Info 
*server)
/* give those requests time to exit */
msleep(125);
 
+   if (server->smbd_conn) {
+   smbd_destroy(server->smbd_conn);
+   server->smbd_conn = NULL;
+   }
+
if (server->ssocket) {
sock_release(server->ssocket);
server->ssocket = NULL;
-- 
2.7.4



[Patch v4 04/22] CIFS: SMBD: Add rdma mount option

2017-10-01 Thread Long Li
From: Long Li 

Add "rdma" to CIFS mount options to connect to SMB Direct.
Add checks to validate this is used on SMB 3.X dialects.

To connect to SMBDirect, use "mount.cifs -o rdma,vers=3.x".
At the time of this patch, 3.x can be 3.0, 3.02 or 3.1.1.

Signed-off-by: Long Li 
---
 fs/cifs/cifs_debug.c |  2 ++
 fs/cifs/cifsfs.c |  2 ++
 fs/cifs/cifsglob.h   |  5 +
 fs/cifs/connect.c| 15 ++-
 4 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/fs/cifs/cifs_debug.c b/fs/cifs/cifs_debug.c
index bdc2f38..9738026 100644
--- a/fs/cifs/cifs_debug.c
+++ b/fs/cifs/cifs_debug.c
@@ -171,6 +171,8 @@ static int cifs_debug_data_proc_show(struct seq_file *m, 
void *v)
ses->ses_count, ses->serverOS, ses->serverNOS,
ses->capabilities, ses->status);
}
+   if (server->rdma)
+   seq_printf(m, "RDMA\n\t");
seq_printf(m, "TCP status: %d\n\tLocal Users To "
   "Server: %d SecMode: 0x%x Req On Wire: %d",
   server->tcpStatus, server->srv_count,
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 180b335..e15fbf1 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -327,6 +327,8 @@ cifs_show_address(struct seq_file *s, struct 
TCP_Server_Info *server)
default:
seq_puts(s, "(unknown)");
}
+   if (server->rdma)
+   seq_puts(s, ",rdma");
 }
 
 static void
diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index 808486c..5585516 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -530,6 +530,7 @@ struct smb_vol {
bool nopersistent:1;
bool resilient:1; /* noresilient not required since not fored for CA */
bool domainauto:1;
+   bool rdma:1;
unsigned int rsize;
unsigned int wsize;
bool sockopt_tcp_nodelay:1;
@@ -646,6 +647,10 @@ struct TCP_Server_Info {
boolsec_kerberos;   /* supports plain Kerberos */
boolsec_mskerberos; /* supports legacy MS Kerberos */
boollarge_buf;  /* is current buffer large? */
+   /* use SMBD connection instead of socket */
+   boolrdma;
+   /* point to the SMBD connection if RDMA is used instead of socket */
+   struct smbd_connection *smbd_conn;
struct delayed_work echo; /* echo ping workqueue job */
char*smallbuf;  /* pointer to current "small" buffer */
char*bigbuf;/* pointer to current "big" buffer */
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 59647eb..b5a575f 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -92,7 +92,7 @@ enum {
Opt_multiuser, Opt_sloppy, Opt_nosharesock,
Opt_persistent, Opt_nopersistent,
Opt_resilient, Opt_noresilient,
-   Opt_domainauto,
+   Opt_domainauto, Opt_rdma,
 
/* Mount options which take numeric value */
Opt_backupuid, Opt_backupgid, Opt_uid,
@@ -183,6 +183,7 @@ static const match_table_t cifs_mount_option_tokens = {
{ Opt_resilient, "resilienthandles"},
{ Opt_noresilient, "noresilienthandles"},
{ Opt_domainauto, "domainauto"},
+   { Opt_rdma, "rdma"},
 
{ Opt_backupuid, "backupuid=%s" },
{ Opt_backupgid, "backupgid=%s" },
@@ -1538,6 +1539,9 @@ cifs_parse_mount_options(const char *mountdata, const char *devname,
case Opt_domainauto:
vol->domainauto = true;
break;
+   case Opt_rdma:
+   vol->rdma = true;
+   break;
 
/* Numeric Values */
case Opt_backupuid:
@@ -1928,6 +1932,11 @@ cifs_parse_mount_options(const char *mountdata, const char *devname,
goto cifs_parse_mount_err;
}
 
+   if (vol->rdma && vol->vals->protocol_id < SMB30_PROT_ID) {
+   cifs_dbg(VFS, "SMB Direct requires Version >=3.0\n");
+   goto cifs_parse_mount_err;
+   }
+
 #ifndef CONFIG_KEYS
/* Muliuser mounts require CONFIG_KEYS support */
if (vol->multiuser) {
@@ -2131,6 +2140,9 @@ static int match_server(struct TCP_Server_Info *server, struct smb_vol *vol)
if (server->echo_interval != vol->echo_interval * HZ)
return 0;
 
+   if (server->rdma != vol->rdma)
+   return 0;
+
return 1;
 }
 
@@ -2229,6 +2241,7 @@ cifs_get_tcp_session(struct smb_vol *volume_info)
tcp_ses->noblocksnd = volume_info->noblocksnd;
tcp_ses->noautotune = volume_info->noautotune;
tcp_ses->tcp_nodelay = volume_info->sockopt_tcp_nodelay;
+   tcp_ses->rdma = volume_info->rdma;
tcp_ses->in_flight = 0;
tcp_ses->credits = 1;
init_waitqueue_head(&tcp_ses->response_q);
-- 
2.7.4
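
The version check this patch adds to cifs_parse_mount_options() can be
sketched in user space as follows. This is only an illustration under stated
assumptions, not the kernel code: rdma_mount_allowed() is a hypothetical
helper, and the dialect IDs are assumed to mirror the *_PROT_ID constants
used in fs/cifs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Dialect IDs, assumed to mirror the fs/cifs *_PROT_ID definitions. */
#define SMB21_PROT_ID  0x0210
#define SMB30_PROT_ID  0x0300
#define SMB302_PROT_ID 0x0302
#define SMB311_PROT_ID 0x0311

/*
 * Hypothetical helper: returns true when the option combination is valid.
 * A non-RDMA mount is always accepted; an RDMA mount needs dialect 3.0
 * or later, which is the condition the patch rejects with an error.
 */
static bool rdma_mount_allowed(bool rdma, uint16_t protocol_id)
{
	if (rdma && protocol_id < SMB30_PROT_ID)
		return false;
	return true;
}
```

With these assumed IDs, "-o rdma,vers=2.1" would be refused while
"-o rdma,vers=3.0" and later would pass the check.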



[Patch v4 05/22] CIFS: SMBD: Implement function to create a SMBDirect connection

2017-10-01 Thread Long Li
From: Long Li 

The upper layer calls this function to connect to a peer through SMBDirect.
Each SMBDirect connection is based on an RC Queue Pair.

Signed-off-by: Long Li 
---
 fs/cifs/smbdirect.c | 17 +
 fs/cifs/smbdirect.h |  4 
 2 files changed, 21 insertions(+)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index e8f976f..34f73e2 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -1675,3 +1675,20 @@ struct smbd_connection *_smbd_get_connection(
kfree(info);
return NULL;
 }
+
+struct smbd_connection *smbd_get_connection(
+   struct TCP_Server_Info *server, struct sockaddr *dstaddr)
+{
+   struct smbd_connection *ret;
+   int port = SMBD_PORT;
+
+try_again:
+   ret = _smbd_get_connection(server, dstaddr, port);
+
+   /* Try SMB_PORT if SMBD_PORT doesn't work */
+   if (!ret && port == SMBD_PORT) {
+   port = SMB_PORT;
+   goto try_again;
+   }
+   return ret;
+}
diff --git a/fs/cifs/smbdirect.h b/fs/cifs/smbdirect.h
index ca60700..42a9338 100644
--- a/fs/cifs/smbdirect.h
+++ b/fs/cifs/smbdirect.h
@@ -245,6 +245,10 @@ struct smbd_response {
u8 packet[];
 };
 
+/* Create a SMBDirect session */
+struct smbd_connection *smbd_get_connection(
+   struct TCP_Server_Info *server, struct sockaddr *dstaddr);
+
 void profiling_display_histogram(
struct seq_file *m, unsigned long long array[]);
 #endif
-- 
2.7.4
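
The port fallback in smbd_get_connection() can also be written as a loop over
a short list of ports. The sketch below is a user-space illustration only:
get_connection(), the connect_fn callback and both stubs are hypothetical,
and the two port numbers are assumed values for SMBD_PORT and SMB_PORT.

```c
#include <stddef.h>

#define SMBD_PORT 5445	/* assumed value of the SMBDirect port */
#define SMB_PORT  445	/* assumed value of the regular SMB port */

/* Hypothetical connect callback: returns a handle, or NULL on failure. */
typedef void *(*connect_fn)(int port);

/* Try SMBD_PORT first, then fall back to SMB_PORT exactly once. */
static void *get_connection(connect_fn try_connect)
{
	static const int ports[] = { SMBD_PORT, SMB_PORT };
	size_t i;

	for (i = 0; i < sizeof(ports) / sizeof(ports[0]); i++) {
		void *conn = try_connect(ports[i]);

		if (conn)
			return conn;
	}
	return NULL;
}

/* Demonstration stubs: a peer that only answers on 445, and one that never answers. */
static int dummy_connection;

static void *connect_only_445(int port)
{
	return port == SMB_PORT ? &dummy_connection : NULL;
}

static void *connect_never(int port)
{
	(void)port;
	return NULL;
}
```

The loop form makes the "at most one retry" property visible from the array
size, whereas the goto version relies on the port comparison to stop.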



[Patch v4 16/22] CIFS: SMBD: Fix the definition for SMB2_CHANNEL_RDMA_V1_INVALIDATE

2017-10-01 Thread Long Li
From: Long Li 

The channel value that asks the server to remotely invalidate the local
memory registration should be 0x0002.

Signed-off-by: Long Li 
---
 fs/cifs/smb2pdu.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/cifs/smb2pdu.h b/fs/cifs/smb2pdu.h
index 393ed5f..f783a08 100644
--- a/fs/cifs/smb2pdu.h
+++ b/fs/cifs/smb2pdu.h
@@ -832,7 +832,7 @@ struct smb2_flush_rsp {
 /* Channel field for read and write: exactly one of following flags can be set*/
 #define SMB2_CHANNEL_NONE  0x0000
 #define SMB2_CHANNEL_RDMA_V1   0x0001 /* SMB3 or later */
-#define SMB2_CHANNEL_RDMA_V1_INVALIDATE 0x0001 /* SMB3.02 or later */
+#define SMB2_CHANNEL_RDMA_V1_INVALIDATE 0x0002 /* SMB3.02 or later */
 
 /* SMB2 read request without RFC1001 length at the beginning */
 struct smb2_read_plain_req {
-- 
2.7.4
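
Why the duplicated value matters can be seen with a small user-space sketch.
The helper channel_wants_remote_invalidate() is hypothetical and exists only
for illustration; the constants are copied from the patched header in the
abbreviated form shown in this mail.

```c
#include <stdbool.h>
#include <stdint.h>

/* Channel values as they read after the fix. */
#define SMB2_CHANNEL_NONE               0x0000
#define SMB2_CHANNEL_RDMA_V1            0x0001 /* SMB3 or later */
#define SMB2_CHANNEL_RDMA_V1_INVALIDATE 0x0002 /* SMB3.02 or later */

/*
 * Hypothetical helper: does this channel value ask the server to
 * remotely invalidate the memory registration?
 */
static bool channel_wants_remote_invalidate(uint32_t channel)
{
	return channel == SMB2_CHANNEL_RDMA_V1_INVALIDATE;
}
```

Before the fix both RDMA constants expanded to 0x0001, so a comparison like
the one above could not tell a plain RDMA_V1 request from one asking for
remote invalidation.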



[Patch v4 03/22] CIFS: SMBD: export protocol initial values

2017-10-01 Thread Long Li
From: Long Li 

These values can be configured by the user. Export them to /proc/fs/cifs.

Signed-off-by: Long Li 
---
 fs/cifs/cifs_debug.c | 70 
 1 file changed, 70 insertions(+)

diff --git a/fs/cifs/cifs_debug.c b/fs/cifs/cifs_debug.c
index 9727e1d..bdc2f38 100644
--- a/fs/cifs/cifs_debug.c
+++ b/fs/cifs/cifs_debug.c
@@ -369,6 +369,52 @@ static const struct file_operations cifs_stats_proc_fops = {
 };
 #endif /* STATS */
 
+#define PROC_FILE_DEFINE(name) \
+static ssize_t name##_write(struct file *file, const char __user *buffer, \
+   size_t count, loff_t *ppos) \
+{ \
+   int rc; \
+   rc = kstrtoint_from_user(buffer, count, 10, & name ); \
+   if (rc) \
+   return rc; \
+   return count; \
+} \
+static int name##_proc_show(struct seq_file *m, void *v) \
+{ \
+   seq_printf(m, "%d\n", name ); \
+   return 0; \
+} \
+static int name##_open(struct inode *inode, struct file *file) \
+{ \
+   return single_open(file, name##_proc_show, NULL); \
+} \
+\
+static const struct file_operations cifs_##name##_proc_fops = { \
+   .open   = name##_open, \
+   .read   = seq_read, \
+   .llseek = seq_lseek, \
+   .release= single_release, \
+   .write  = name##_write, \
+}
+
+extern int rdma_readwrite_threshold;
+extern int smbd_max_frmr_depth;
+extern int smbd_keep_alive_interval;
+extern int smbd_max_receive_size;
+extern int smbd_max_fragmented_recv_size;
+extern int smbd_max_send_size;
+extern int smbd_send_credit_target;
+extern int smbd_receive_credit_max;
+
+PROC_FILE_DEFINE(rdma_readwrite_threshold);
+PROC_FILE_DEFINE(smbd_max_frmr_depth);
+PROC_FILE_DEFINE(smbd_keep_alive_interval);
+PROC_FILE_DEFINE(smbd_max_receive_size);
+PROC_FILE_DEFINE(smbd_max_fragmented_recv_size);
+PROC_FILE_DEFINE(smbd_max_send_size);
+PROC_FILE_DEFINE(smbd_send_credit_target);
+PROC_FILE_DEFINE(smbd_receive_credit_max);
+
 static struct proc_dir_entry *proc_fs_cifs;
 static const struct file_operations cifsFYI_proc_fops;
 static const struct file_operations cifs_lookup_cache_proc_fops;
@@ -396,6 +442,22 @@ cifs_proc_init(void)
&cifs_security_flags_proc_fops);
proc_create("LookupCacheEnabled", 0, proc_fs_cifs,
&cifs_lookup_cache_proc_fops);
+   proc_create("rdma_readwrite_threshold", 0, proc_fs_cifs,
+   &cifs_rdma_readwrite_threshold_proc_fops);
+   proc_create("smbd_max_frmr_depth", 0, proc_fs_cifs,
+   &cifs_smbd_max_frmr_depth_proc_fops);
+   proc_create("smbd_keep_alive_interval", 0, proc_fs_cifs,
+   &cifs_smbd_keep_alive_interval_proc_fops);
+   proc_create("smbd_max_receive_size", 0, proc_fs_cifs,
+   &cifs_smbd_max_receive_size_proc_fops);
+   proc_create("smbd_max_fragmented_recv_size", 0, proc_fs_cifs,
+   &cifs_smbd_max_fragmented_recv_size_proc_fops);
+   proc_create("smbd_max_send_size", 0, proc_fs_cifs,
+   &cifs_smbd_max_send_size_proc_fops);
+   proc_create("smbd_send_credit_target", 0, proc_fs_cifs,
+   &cifs_smbd_send_credit_target_proc_fops);
+   proc_create("smbd_receive_credit_max", 0, proc_fs_cifs,
+   &cifs_smbd_receive_credit_max_proc_fops);
 }
 
 void
@@ -413,6 +475,14 @@ cifs_proc_clean(void)
remove_proc_entry("SecurityFlags", proc_fs_cifs);
remove_proc_entry("LinuxExtensionsEnabled", proc_fs_cifs);
remove_proc_entry("LookupCacheEnabled", proc_fs_cifs);
+   remove_proc_entry("rdma_readwrite_threshold", proc_fs_cifs);
+   remove_proc_entry("smbd_max_frmr_depth", proc_fs_cifs);
+   remove_proc_entry("smbd_keep_alive_interval", proc_fs_cifs);
+   remove_proc_entry("smbd_max_receive_size", proc_fs_cifs);
+   remove_proc_entry("smbd_max_fragmented_recv_size", proc_fs_cifs);
+   remove_proc_entry("smbd_max_send_size", proc_fs_cifs);
+   remove_proc_entry("smbd_send_credit_target", proc_fs_cifs);
+   remove_proc_entry("smbd_receive_credit_max", proc_fs_cifs);
remove_proc_entry("fs/cifs", NULL);
 }
 
-- 
2.7.4
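
The PROC_FILE_DEFINE() macro relies on token pasting to stamp out one set of
handlers per tunable. A user-space analogue of that pattern might look like
the sketch below. TUNABLE_DEFINE() and the generated *_store()/*_show()
helpers are hypothetical names, strtol() stands in for kstrtoint_from_user(),
and the starting value is arbitrary.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * For a tunable called `name`, generate name##_store(), which parses a
 * decimal string (one trailing newline allowed) into the variable, and
 * name##_show(), which returns its current value.
 */
#define TUNABLE_DEFINE(name) \
static int name##_store(const char *buf) \
{ \
	char *end; \
	long val; \
	errno = 0; \
	val = strtol(buf, &end, 10); \
	if (end == buf) \
		return -EINVAL; \
	if (*end == '\n') \
		end++; \
	if (*end != '\0') \
		return -EINVAL; \
	if (errno == ERANGE || val < INT_MIN || val > INT_MAX) \
		return -ERANGE; \
	name = (int)val; \
	return 0; \
} \
static int name##_show(void) \
{ \
	return name; \
}

static int smbd_max_send_size = 1364;	/* arbitrary starting value */

TUNABLE_DEFINE(smbd_max_send_size)
```

As in the patch, a failed parse leaves the variable untouched, so writing
garbage to the tunable does not change its value.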



[Patch v4 22/22] CIFS: SMBD: Add SMBDirect debug counters

2017-10-01 Thread Long Li
From: Long Li 

Export SMBDirect debug counters to /proc/fs/cifs/DebugData.

These counters are used for debugging, troubleshooting and profiling.

Signed-off-by: Long Li 
---
 fs/cifs/cifs_debug.c | 87 
 1 file changed, 87 insertions(+)

diff --git a/fs/cifs/cifs_debug.c b/fs/cifs/cifs_debug.c
index 9738026..1ea78d5 100644
--- a/fs/cifs/cifs_debug.c
+++ b/fs/cifs/cifs_debug.c
@@ -30,6 +30,7 @@
 #include "cifsproto.h"
 #include "cifs_debug.h"
 #include "cifsfs.h"
+#include "smbdirect.h"
 
 void
 cifs_dump_mem(char *label, void *data, int length)
@@ -152,6 +153,92 @@ static int cifs_debug_data_proc_show(struct seq_file *m, void *v)
list_for_each(tmp1, &cifs_tcp_ses_list) {
server = list_entry(tmp1, struct TCP_Server_Info,
tcp_ses_list);
+
+   if (!server->rdma)
+   goto skip_rdma;
+
+   seq_printf(m, "\nSMBDirect (in hex) protocol version: %x "
+   "transport status: %x",
+   server->smbd_conn->protocol,
+   server->smbd_conn->transport_status);
+   seq_printf(m, "\nConn receive_credit_max: %x "
+   "send_credit_target: %x max_send_size: %x",
+   server->smbd_conn->receive_credit_max,
+   server->smbd_conn->send_credit_target,
+   server->smbd_conn->max_send_size);
+   seq_printf(m, "\nConn max_fragmented_recv_size: %x "
+   "max_fragmented_send_size: %x max_receive_size:%x",
+   server->smbd_conn->max_fragmented_recv_size,
+   server->smbd_conn->max_fragmented_send_size,
+   server->smbd_conn->max_receive_size);
+   seq_printf(m, "\nConn keep_alive_interval: %x "
+   "max_readwrite_size: %x rdma_readwrite_threshold: %x",
+   server->smbd_conn->keep_alive_interval,
+   server->smbd_conn->max_readwrite_size,
+   server->smbd_conn->rdma_readwrite_threshold);
+   seq_printf(m, "\nDebug count_get_receive_buffer: %x "
+   "count_put_receive_buffer: %x count_send_empty: %x",
+   server->smbd_conn->count_get_receive_buffer,
+   server->smbd_conn->count_put_receive_buffer,
+   server->smbd_conn->count_send_empty);
+   seq_printf(m, "\nRead Queue count_reassembly_queue: %x "
+   "count_enqueue_reassembly_queue: %x "
+   "count_dequeue_reassembly_queue: %x "
+   "fragment_reassembly_remaining: %x "
+   "reassembly_data_length: %x "
+   "reassembly_queue_length: %x",
+   server->smbd_conn->count_reassembly_queue,
+   server->smbd_conn->count_enqueue_reassembly_queue,
+   server->smbd_conn->count_dequeue_reassembly_queue,
+   server->smbd_conn->fragment_reassembly_remaining,
+   server->smbd_conn->reassembly_data_length,
+   server->smbd_conn->reassembly_queue_length);
+   seq_printf(m, "\nCurrent Credits send_credits: %x "
+   "receive_credits: %x receive_credit_target: %x",
+   atomic_read(&server->smbd_conn->send_credits),
+   atomic_read(&server->smbd_conn->receive_credits),
+   server->smbd_conn->receive_credit_target);
+   seq_printf(m, "\nPending send_pending: %x send_payload_pending:"
+   " %x smbd_send_pending: %x smbd_recv_pending: %x",
+   atomic_read(&server->smbd_conn->send_pending),
+   atomic_read(&server->smbd_conn->send_payload_pending),
+   server->smbd_conn->smbd_send_pending,
+   server->smbd_conn->smbd_recv_pending);
+   seq_printf(m, "\nReceive buffers count_receive_queue: %x "
+   "count_empty_packet_queue: %x",
+   server->smbd_conn->count_receive_queue,
+   server->smbd_conn->count_empty_packet_queue);
+   seq_printf(m, "\nMR responder_resources: %x "
+   "max_frmr_depth: %x mr_type: %x",
+   server->smbd_conn->responder_resources,
+   server->smbd_conn->max_frmr_depth,
+   server->smbd_conn->mr_type);
+   seq_printf(m, "\nMR mr_ready_count: %x mr_used_count: %x",
+   atomic_read(&server->smbd_conn->mr_ready_count),
+   atomic_read(&server->smbd_conn->mr_used_count));
+
+   seq_printf(m, "\nTSC cycle histogram in I/O path: "
+   "(the number of most significant bits)");

[Patch v4 09/22] CIFS: SMBD: Implement function to destroy a SMBDirect connection

2017-10-01 Thread Long Li
From: Long Li 

Add a function to tear down an SMBDirect connection. The upper layer uses it
to free all SMBDirect connection and transport resources.

Signed-off-by: Long Li 
---
 fs/cifs/smbdirect.c | 16 
 fs/cifs/smbdirect.h |  3 +++
 2 files changed, 19 insertions(+)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index 1f0f33c..cb129c2 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -1416,6 +1416,22 @@ static void idle_connection_timer(struct work_struct *work)
info->keep_alive_interval*HZ);
 }
 
+/* Destroy this SMBD connection, called from upper layer */
+void smbd_destroy(struct smbd_connection *info)
+{
+   log_rdma_event(INFO, "destroying rdma session\n");
+
+   /* Kick off the disconnection process */
+   smbd_disconnect_rdma_connection(info);
+
+   log_rdma_event(INFO, "wait for transport being destroyed\n");
+   wait_event(info->wait_destroy,
+   info->transport_status == SMBD_DESTROYED);
+
+   destroy_workqueue(info->workqueue);
+   kfree(info);
+}
+
 /*
  * Reconnect this SMBD connection, called from upper layer
  * return value: 0 on success, or actual error code
diff --git a/fs/cifs/smbdirect.h b/fs/cifs/smbdirect.h
index 9818852..d14a484 100644
--- a/fs/cifs/smbdirect.h
+++ b/fs/cifs/smbdirect.h
@@ -252,6 +252,9 @@ struct smbd_connection *smbd_get_connection(
 /* Reconnect SMBDirect session */
 int smbd_reconnect(struct TCP_Server_Info *server);
 
+/* Destroy SMBDirect session */
+void smbd_destroy(struct smbd_connection *info);
+
 void profiling_display_histogram(
struct seq_file *m, unsigned long long array[]);
 #endif
-- 
2.7.4



[Patch v4 19/22] CIFS: SMBD: Add parameter rdata to smb2_new_read_req

2017-10-01 Thread Long Li
From: Long Li 

This patch prepares the upper layer for doing SMB reads via RDMA write.

When we assemble the SMB read packet header, we need to know the I/O layout
if this request is to use an RDMA write. rdata has all the information we need
for memory registration. Add rdata to smb2_new_read_req.

Signed-off-by: Long Li 
---
 fs/cifs/smb2pdu.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index 6089957..7053db9 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -2351,18 +2351,21 @@ SMB2_flush(const unsigned int xid, struct cifs_tcon *tcon, u64 persistent_fid,
  */
 static int
 smb2_new_read_req(void **buf, unsigned int *total_len,
- struct cifs_io_parms *io_parms, unsigned int remaining_bytes,
- int request_type)
+   struct cifs_io_parms *io_parms, struct cifs_readdata *rdata,
+   unsigned int remaining_bytes, int request_type)
 {
int rc = -EACCES;
struct smb2_read_plain_req *req = NULL;
struct smb2_sync_hdr *shdr;
+   struct TCP_Server_Info *server;
 
rc = smb2_plain_req_init(SMB2_READ, io_parms->tcon, (void **) &req,
 total_len);
if (rc)
return rc;
-   if (io_parms->tcon->ses->server == NULL)
+
+   server = io_parms->tcon->ses->server;
+   if (server == NULL)
return -ECONNABORTED;
 
shdr = &req->sync_hdr;
@@ -2490,7 +2493,8 @@ smb2_async_readv(struct cifs_readdata *rdata)
 
server = io_parms.tcon->ses->server;
 
-   rc = smb2_new_read_req((void **) &buf, &total_len, &io_parms, 0, 0);
+   rc = smb2_new_read_req(
+   (void **) &buf, &total_len, &io_parms, rdata, 0, 0);
if (rc) {
if (rc == -EAGAIN && rdata->credits) {
/* credits was reset by reconnect */
@@ -2558,7 +2562,7 @@ SMB2_read(const unsigned int xid, struct cifs_io_parms *io_parms,
struct cifs_ses *ses = io_parms->tcon->ses;
 
*nbytes = 0;
-   rc = smb2_new_read_req((void **)&req, &total_len, io_parms, 0, 0);
+   rc = smb2_new_read_req((void **)&req, &total_len, io_parms, NULL, 0, 0);
if (rc)
return rc;
 
-- 
2.7.4



[Patch v4 00/22] CIFS: Implement SMBDirect

2017-10-01 Thread Long Li
From: Long Li 

Starting with SMB2 dialect 3.0, Microsoft introduced the SMBDirect transport
protocol for transferring upper layer (SMB2) payload over RDMA via InfiniBand,
RoCE or iWARP. The protocol is published in [MS-SMBD]
(https://msdn.microsoft.com/en-us/library/hh536346.aspx).

Patch v2 added RDMA read/write via memory registration, and addressed
feedback on v1.

Patch v3 improved performance by introducing an additional queue for handling
empty packets and reducing lock contention on the IRQ path. It also added
lightweight profiling by reading the TSC and addressed feedback on v2.

Patch v4 fixed connectivity issues with iWARP devices and addressed comments.

Long Li (22):
  CIFS: SMBD: Add SMBDirect protocol initial values and constants
  CIFS: SMBD: Establish SMBDirect connection
  CIFS: SMBD: export protocol initial values
  CIFS: SMBD: Add rdma mount option
  CIFS: SMBD: Implement function to create a SMBDirect connection
  CIFS: SMBD: Upper layer connects to SMBDirect session
  CIFS: SMBD: Implement function to reconnect to a SMBDirect transport
  CIFS: SMBD: Upper layer reconnects to SMBDirect session
  CIFS: SMBD: Implement function to destroy a SMBDirect connection
  CIFS: SMBD: Upper layer destroys SMBDirect session on shutdown or
umount
  CIFS: SMBD: Set SMBDirect maximum read or write size for I/O
  CIFS: SMBD: Implement function to receive data via RDMA receive
  CIFS: SMBD: Upper layer receives data via RDMA receive
  CIFS: SMBD: Implement function to send data via RDMA send
  CIFS: SMBD: Upper layer sends data via RDMA send
  CIFS: SMBD: Fix the definition for SMB2_CHANNEL_RDMA_V1_INVALIDATE
  CIFS: SMBD: Implement RDMA memory registration
  CIFS: SMBD: Upper layer performs SMB write via RDMA read through
memory registration
  CIFS: SMBD: Add parameter rdata to smb2_new_read_req
  CIFS: SMBD: Read correct returned data length for RDMA write (SMB
read) I/O
  CIFS: SMBD: Upper layer performs SMB read via RDMA write through
memory registration
  CIFS: SMBD: Add SMBDirect debug counters

 fs/cifs/Makefile |2 +-
 fs/cifs/cifs_debug.c |  159 +++
 fs/cifs/cifsfs.c |2 +
 fs/cifs/cifsglob.h   |   17 +-
 fs/cifs/cifssmb.c|   10 +-
 fs/cifs/connect.c|   46 +-
 fs/cifs/file.c   |   10 +
 fs/cifs/smb1ops.c|2 +-
 fs/cifs/smb2ops.c|   21 +-
 fs/cifs/smb2pdu.c|  114 ++-
 fs/cifs/smb2pdu.h|2 +-
 fs/cifs/smbdirect.c  | 2651 ++
 fs/cifs/smbdirect.h  |  325 +++
 fs/cifs/transport.c  |7 +
 14 files changed, 3348 insertions(+), 20 deletions(-)
 create mode 100644 fs/cifs/smbdirect.c
 create mode 100644 fs/cifs/smbdirect.h

-- 
2.7.4



[Patch v4 13/22] CIFS: SMBD: Upper layer receives data via RDMA receive

2017-10-01 Thread Long Li
From: Long Li 

With SMBDirect connected, use it for receiving data via RDMA receive.

Signed-off-by: Long Li 
---
 fs/cifs/connect.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 1a9f22f..8026682 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -542,7 +542,10 @@ cifs_readv_from_socket(struct TCP_Server_Info *server, struct msghdr *smb_msg)
if (server_unresponsive(server))
return -ECONNABORTED;
 
-   length = sock_recvmsg(server->ssocket, smb_msg, 0);
+   if (server->smbd_conn)
+   length = smbd_recv(server->smbd_conn, smb_msg);
+   else
+   length = sock_recvmsg(server->ssocket, smb_msg, 0);
 
if (server->tcpStatus == CifsExiting)
return -ESHUTDOWN;
-- 
2.7.4
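
The dispatch added by this hunk can be sketched in userspace as follows. This
is only an illustration under stated assumptions: demo_server, demo_conn,
demo_rdma_recv() and demo_sock_recv() are hypothetical stand-ins for
TCP_Server_Info, smbd_connection, smbd_recv() and sock_recvmsg(), not the
kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct demo_conn {
	int unused;
};

struct demo_server {
	struct demo_conn *smbd_conn; /* non-NULL once SMBDirect is connected */
	int socket_fd;               /* stand-in for server->ssocket */
};

/* Stub transports: each returns a distinct value so the chosen path is visible. */
static int demo_rdma_recv(struct demo_conn *conn)
{
	assert(conn != NULL);
	return 1;
}

static int demo_sock_recv(int fd)
{
	(void)fd;
	return 2;
}

/* Same shape as the patched branch: prefer RDMA when a connection exists. */
static int demo_recv(struct demo_server *server)
{
	if (server->smbd_conn)
		return demo_rdma_recv(server->smbd_conn);
	return demo_sock_recv(server->socket_fd);
}
```

With smbd_conn left NULL the sketch falls back to the socket stub, which
mirrors how a plain TCP mount keeps using the existing receive path.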



[Patch v4 17/22] CIFS: SMBD: Implement RDMA memory registration

2017-10-01 Thread Long Li
From: Long Li 

Memory registration is used for transferring payload via RDMA read or write.
After I/O is done, memory registrations are recovered and reused. This
process can be time consuming and is done in a work queue.

Signed-off-by: Long Li 
---
 fs/cifs/smbdirect.c | 428 
 fs/cifs/smbdirect.h |  55 +++
 2 files changed, 483 insertions(+)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index 90e2c94..3f2de48 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -49,6 +49,9 @@ static int smbd_post_send_page(struct smbd_connection *info,
struct page *page, unsigned long offset,
size_t size, int remaining_data_length);
 
+static void destroy_mr_list(struct smbd_connection *info);
+static int allocate_mr_list(struct smbd_connection *info);
+
 /* SMBD version number */
 #define SMBD_V1	0x0100
 
@@ -219,6 +222,12 @@ static void smbd_destroy_rdma_work(struct work_struct *work)
wait_event(info->wait_send_payload_pending,
atomic_read(&info->send_payload_pending) == 0);
 
+   log_rdma_event(INFO, "freeing mr list\n");
+   wake_up_interruptible_all(&info->wait_mr);
+   wait_event(info->wait_for_mr_cleanup,
+   atomic_read(&info->mr_used_count) == 0);
+   destroy_mr_list(info);
+
/* It's not posssible for upper layer to get to reassembly */
log_rdma_event(INFO, "drain the reassembly queue\n");
do {
@@ -475,6 +484,16 @@ static bool process_negotiation_response(
}
info->max_fragmented_send_size =
le32_to_cpu(packet->max_fragmented_size);
+   info->rdma_readwrite_threshold =
+   rdma_readwrite_threshold > info->max_fragmented_send_size ?
+   info->max_fragmented_send_size :
+   rdma_readwrite_threshold;
+
+
+   info->max_readwrite_size = min_t(u32,
+   le32_to_cpu(packet->max_readwrite_size),
+   info->max_frmr_depth * PAGE_SIZE);
+   info->max_frmr_depth = info->max_readwrite_size / PAGE_SIZE;
 
return true;
 }
@@ -773,6 +792,12 @@ static int smbd_ia_open(
rc = -EPROTONOSUPPORT;
goto out2;
}
+   info->max_frmr_depth = min_t(int,
+   smbd_max_frmr_depth,
+   info->id->device->attrs.max_fast_reg_page_list_len);
+   info->mr_type = IB_MR_TYPE_MEM_REG;
+   if (info->id->device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG)
+   info->mr_type = IB_MR_TYPE_SG_GAPS;
 
info->pd = ib_alloc_pd(info->id->device, 0);
if (IS_ERR(info->pd)) {
@@ -1610,6 +1635,8 @@ struct smbd_connection *_smbd_get_connection(
struct rdma_conn_param conn_param;
struct ib_qp_init_attr qp_attr;
struct sockaddr_in *addr_in = (struct sockaddr_in *) dstaddr;
+   struct ib_port_immutable port_immutable;
+   u32 ird_ord_hdr[2];
 
info = kzalloc(sizeof(struct smbd_connection), GFP_KERNEL);
if (!info)
@@ -1698,6 +1725,28 @@ struct smbd_connection *_smbd_get_connection(
memset(&conn_param, 0, sizeof(conn_param));
conn_param.initiator_depth = 0;
 
+   conn_param.responder_resources =
+   info->id->device->attrs.max_qp_rd_atom
+   < SMBD_CM_RESPONDER_RESOURCES ?
+   info->id->device->attrs.max_qp_rd_atom :
+   SMBD_CM_RESPONDER_RESOURCES;
+   info->responder_resources = conn_param.responder_resources;
+   log_rdma_mr(INFO, "responder_resources=%d\n",
+   info->responder_resources);
+
+   /* Need to send IRD/ORD in private data for iWARP */
+   info->id->device->get_port_immutable(
+   info->id->device, info->id->port_num, &port_immutable);
+   if (port_immutable.core_cap_flags & RDMA_CORE_PORT_IWARP) {
+   ird_ord_hdr[0] = info->responder_resources;
+   ird_ord_hdr[1] = 1;
+   conn_param.private_data = ird_ord_hdr;
+   conn_param.private_data_len = sizeof(ird_ord_hdr);
+   } else {
+   conn_param.private_data = NULL;
+   conn_param.private_data_len = 0;
+   }
+
conn_param.retry_count = SMBD_CM_RETRY;
conn_param.rnr_retry_count = SMBD_CM_RNR_RETRY;
conn_param.flow_control = 0;
@@ -1762,8 +1811,19 @@ struct smbd_connection *_smbd_get_connection(
goto negotiation_failed;
}
 
+   rc = allocate_mr_list(info);
+   if (rc) {
+   log_rdma_mr(ERR, "memory registration allocation failed\n");
+   goto allocate_mr_failed;
+   }
+
return info;
 
+allocate_mr_failed:
+   /* At this point, need to a full transport shutdown */
+   smbd_destroy(info);
+   return NULL;
+
 negotiation_failed:
cancel_delayed_work_sync(&info->idle_timer_work);
destroy_caches_and_workqueue(info);
@@ -2221,3 +2281,371 @@ int smbd_send(struct 
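
The size clamping done in process_negotiation_response() above can be sketched
in userspace like this. It is an illustration only: demo_limits,
demo_negotiate() and DEMO_PAGE_SIZE are hypothetical names, a 4 KiB page size
is assumed, and the values are taken in host byte order rather than via
le32_to_cpu().

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/* Hypothetical container for the three negotiated values. */
struct demo_limits {
	uint32_t rdma_readwrite_threshold;
	uint32_t max_readwrite_size;
	uint32_t max_frmr_depth;
};

static uint32_t demo_min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/*
 * device_frmr_depth is the depth already limited by the device, as done in
 * smbd_ia_open(); the other arguments come from the module parameter and
 * from the peer's negotiation response.
 */
static struct demo_limits demo_negotiate(uint32_t wanted_threshold,
					 uint32_t max_fragmented_send_size,
					 uint32_t peer_max_readwrite_size,
					 uint32_t device_frmr_depth)
{
	struct demo_limits l;

	/* Guard the multiplication below against 32-bit overflow. */
	assert(device_frmr_depth <= UINT32_MAX / DEMO_PAGE_SIZE);

	/* The threshold may not exceed what a fragmented send can carry. */
	l.rdma_readwrite_threshold =
		demo_min_u32(wanted_threshold, max_fragmented_send_size);

	/* One RDMA read/write is bounded by the peer and by the MR depth. */
	l.max_readwrite_size =
		demo_min_u32(peer_max_readwrite_size,
			     device_frmr_depth * DEMO_PAGE_SIZE);

	/* Shrink the depth again so MRs are no larger than needed. */
	l.max_frmr_depth = l.max_readwrite_size / DEMO_PAGE_SIZE;
	return l;
}
```

For example, a peer limit of 1 MiB with a device depth of 2048 pages yields
max_readwrite_size = 1 MiB and max_frmr_depth = 256 under these assumptions.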

[Patch v4 15/22] CIFS: SMBD: Upper layer sends data via RDMA send

2017-10-01 Thread Long Li
From: Long Li 

With SMBDirect connected, use it for sending data via RDMA send.

Signed-off-by: Long Li 
---
 fs/cifs/transport.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/cifs/transport.c b/fs/cifs/transport.c
index 7efbab0..3a9b5a0 100644
--- a/fs/cifs/transport.c
+++ b/fs/cifs/transport.c
@@ -37,6 +37,7 @@
 #include "cifsglob.h"
 #include "cifsproto.h"
 #include "cifs_debug.h"
+#include "smbdirect.h"
 
 void
 cifs_wake_up_task(struct mid_q_entry *mid)
@@ -230,6 +231,11 @@ __smb_send_rqst(struct TCP_Server_Info *server, struct smb_rqst *rqst)
struct msghdr smb_msg;
int val = 1;
 
+   if (server->smbd_conn) {
+   rc = smbd_send(server->smbd_conn, rqst);
+   goto done;
+   }
+
if (ssocket == NULL)
return -ENOTSOCK;
 
@@ -299,6 +305,7 @@ __smb_send_rqst(struct TCP_Server_Info *server, struct smb_rqst *rqst)
server->tcpStatus = CifsNeedReconnect;
}
 
+done:
if (rc < 0 && rc != -EINTR)
cifs_dbg(VFS, "Error %d sending data on socket to server\n",
 rc);
-- 
2.7.4
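
The control flow of the two hunks above (early RDMA send, then a jump to the
shared error report) can be sketched as below. This is a userspace
illustration under assumptions: demo_server and its fields are hypothetical,
the stub results replace smbd_send() and the socket send loop, and the final
return of rc is part of the sketch rather than something shown in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the server state used by the send path. */
struct demo_server {
	int use_rdma;      /* stand-in for server->smbd_conn != NULL */
	int have_socket;   /* stand-in for ssocket != NULL */
	int rdma_rc;       /* result the stub RDMA send reports */
	int sock_rc;       /* result the stub socket send reports */
	int errors_logged; /* counts uses of the shared error report */
};

static int demo_send(struct demo_server *server)
{
	int rc;

	assert(server != NULL);

	if (server->use_rdma) {
		rc = server->rdma_rc; /* stands in for smbd_send() */
		goto done;
	}

	if (!server->have_socket)
		return -ENOTSOCK;

	rc = server->sock_rc; /* stands in for the socket send loop */

done:
	/* Shared tail: both transports report failures the same way. */
	if (rc < 0 && rc != -EINTR)
		server->errors_logged++;
	return rc;
}
```

Because the RDMA branch runs before the socket check, an SMBDirect mount
never reaches the -ENOTSOCK return in this sketch.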



[Patch v4 20/22] CIFS: SMBD: Read correct returned data length for RDMA write (SMB read) I/O

2017-10-01 Thread Long Li
From: Long Li 

This patch prepares the upper layer for doing SMB read via RDMA write.

When RDMA write is used for SMB read, the returned data length is in
DataRemaining in the response packet. Read it properly by adding a
parameter that specifies where the returned data length is.

Signed-off-by: Long Li 
---
 fs/cifs/cifsglob.h | 10 ++++++++--
 fs/cifs/cifssmb.c  |  4 ++--
 fs/cifs/smb1ops.c  |  2 +-
 fs/cifs/smb2ops.c  |  8 ++++++--
 4 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index bcb6df1..f851b50 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -228,8 +228,14 @@ struct smb_version_operations {
__u64 (*get_next_mid)(struct TCP_Server_Info *);
/* data offset from read response message */
unsigned int (*read_data_offset)(char *);
-   /* data length from read response message */
-   unsigned int (*read_data_length)(char *);
+   /*
+* Data length from read response message
+* When in_remaining is true, the returned data length is in
+* message field DataRemaining for out-of-band data read (e.g through
+* Memory Registration RDMA write in SMBD).
+* Otherwise, the returned data length is in message field DataLength.
+*/
+   unsigned int (*read_data_length)(char *, bool in_remaining);
/* map smb to linux error */
int (*map_error)(char *, bool);
/* find mid corresponding to the response message */
diff --git a/fs/cifs/cifssmb.c b/fs/cifs/cifssmb.c
index 0e29ecf..b9410e1 100644
--- a/fs/cifs/cifssmb.c
+++ b/fs/cifs/cifssmb.c
@@ -1531,8 +1531,8 @@ cifs_readv_receive(struct TCP_Server_Info *server, struct mid_q_entry *mid)
 rdata->iov[0].iov_base, server->total_read);
 
/* how much data is in the response? */
-   data_len = server->ops->read_data_length(buf);
-   if (data_offset + data_len > buflen) {
+   data_len = server->ops->read_data_length(buf, rdata->mr);
+   if (!rdata->mr && (data_offset + data_len > buflen)) {
/* data_len is corrupt -- discard frame */
rdata->result = -EIO;
return cifs_readv_discard(server, mid);
diff --git a/fs/cifs/smb1ops.c b/fs/cifs/smb1ops.c
index a723df3..27a8280 100644
--- a/fs/cifs/smb1ops.c
+++ b/fs/cifs/smb1ops.c
@@ -87,7 +87,7 @@ cifs_read_data_offset(char *buf)
 }
 
 static unsigned int
-cifs_read_data_length(char *buf)
+cifs_read_data_length(char *buf, bool in_remaining)
 {
READ_RSP *rsp = (READ_RSP *)buf;
return (le16_to_cpu(rsp->DataLengthHigh) << 16) +
diff --git a/fs/cifs/smb2ops.c b/fs/cifs/smb2ops.c
index 7ad35d6..a765877 100644
--- a/fs/cifs/smb2ops.c
+++ b/fs/cifs/smb2ops.c
@@ -935,9 +935,13 @@ smb2_read_data_offset(char *buf)
 }
 
 static unsigned int
-smb2_read_data_length(char *buf)
+smb2_read_data_length(char *buf, bool in_remaining)
 {
struct smb2_read_rsp *rsp = (struct smb2_read_rsp *)buf;
+
+   if (in_remaining)
+   return le32_to_cpu(rsp->DataRemaining);
+
return le32_to_cpu(rsp->DataLength);
 }
 
@@ -2446,7 +2450,7 @@ handle_read_data(struct TCP_Server_Info *server, struct mid_q_entry *mid,
}
 
data_offset = server->ops->read_data_offset(buf) + 4;
-   data_len = server->ops->read_data_length(buf);
+   data_len = server->ops->read_data_length(buf, rdata->mr);
 
if (data_offset < server->vals->read_rsp_size) {
/*
-- 
2.7.4
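
The effect of the new in_remaining parameter, and of skipping the bounds check
when a memory registration was used, can be sketched as follows. This is an
illustration only: demo_read_rsp is a hypothetical host-endian stand-in for
the read response, and both helper names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in holding only the two length fields of a response. */
struct demo_read_rsp {
	uint32_t data_length;    /* stand-in for DataLength */
	uint32_t data_remaining; /* stand-in for DataRemaining */
};

/* Pick the field that carries the returned length for this kind of read. */
static uint32_t demo_read_data_length(const struct demo_read_rsp *rsp,
				      bool in_remaining)
{
	assert(rsp != NULL);
	return in_remaining ? rsp->data_remaining : rsp->data_length;
}

/*
 * When the payload arrived out of band via RDMA write, it is not in the
 * receive buffer, so comparing it against buflen would be meaningless.
 */
static bool demo_length_is_valid(uint32_t data_offset, uint32_t data_len,
				 uint32_t buflen, bool used_mr)
{
	if (used_mr)
		return true;
	return (uint64_t)data_offset + data_len <= buflen;
}
```

In the patch the flag is passed as rdata->mr, so a non-NULL memory
registration selects DataRemaining.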



[Patch v4 18/22] CIFS: SMBD: Upper layer performs SMB write via RDMA read through memory registration

2017-10-01 Thread Long Li
From: Long Li 

When sending I/O, if the size is at least rdma_readwrite_threshold, we prepare
to send an SMB write packet for an RDMA read via memory registration. The actual
I/O is done by the remote peer through local RDMA hardware. Modify the relevant
fields in the packet accordingly, and append a smbd_buffer_descriptor_v1 to
the end of the SMB write packet.

When write I/O finishes, deregister the memory region if it was used for an
RDMA read. If remote invalidation is not used, the call to smbd_deregister_mr
will do local invalidation and possibly wait. The memory region is normally
deregistered in the MID callback as soon as it has been used. In some
situations the MID may not be created on I/O failure; in that case the memory
region is deregistered when the write data context is released.

Signed-off-by: Long Li 
---
 fs/cifs/cifsglob.h |  1 +
 fs/cifs/cifssmb.c  |  6 ++
 fs/cifs/smb2pdu.c  | 57 +-
 3 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index 5585516..bcb6df1 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -1168,6 +1168,7 @@ struct cifs_writedata {
pid_t   pid;
unsigned int    bytes;
int result;
+   struct smbd_mr  *mr;
unsigned int    pagesz;
unsigned int    tailsz;
unsigned int    credits;
diff --git a/fs/cifs/cifssmb.c b/fs/cifs/cifssmb.c
index 5857009..0e29ecf 100644
--- a/fs/cifs/cifssmb.c
+++ b/fs/cifs/cifssmb.c
@@ -43,6 +43,7 @@
 #include "cifs_unicode.h"
 #include "cifs_debug.h"
 #include "fscache.h"
+#include "smbdirect.h"
 
 #ifdef CONFIG_CIFS_POSIX
 static struct {
@@ -1912,6 +1913,11 @@ cifs_writedata_release(struct kref *refcount)
struct cifs_writedata *wdata = container_of(refcount,
struct cifs_writedata, refcount);
 
+   if (wdata->mr) {
+   smbd_deregister_mr(wdata->mr);
+   wdata->mr = NULL;
+   }
+
if (wdata->cfile)
cifsFileInfo_put(wdata->cfile);
 
diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index bab3da6..6089957 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -48,6 +48,7 @@
 #include "smb2glob.h"
 #include "cifspdu.h"
 #include "cifs_spnego.h"
+#include "smbdirect.h"
 
 /*
  *  The following table defines the expected "StructureSize" of SMB2 requests
@@ -2653,6 +2654,18 @@ smb2_writev_callback(struct mid_q_entry *mid)
break;
}
 
+   /*
+* If this wdata has a memory registered, the MR can be freed
+* The number of MRs available is limited, it's important to recover
+* used MR as soon as I/O is finished. Hold MR longer in the later
+* I/O process can possibly result in I/O deadlock due to lack of MR
+* to send request on I/O retry
+*/
+   if (wdata->mr) {
+   smbd_deregister_mr(wdata->mr);
+   wdata->mr = NULL;
+   }
+
if (wdata->result)
cifs_stats_fail_inc(tcon, SMB2_WRITE_HE);
 
@@ -2704,6 +2717,41 @@ smb2_async_writev(struct cifs_writedata *wdata,
offsetof(struct smb2_write_req, Buffer) - 4);
req->RemainingBytes = 0;
 
+   /*
+* If we want to do a server RDMA read, fill in and append
+* smbd_buffer_descriptor_v1 to the end of write request
+*/
+   if (server->rdma && wdata->bytes >=
+   server->smbd_conn->rdma_readwrite_threshold) {
+
+   struct smbd_buffer_descriptor_v1 *v1;
+   bool need_invalidate = server->dialect == SMB30_PROT_ID;
+
+   wdata->mr = smbd_register_mr(
+   server->smbd_conn, wdata->pages,
+   wdata->nr_pages, wdata->tailsz,
+   false, need_invalidate);
+   if (!wdata->mr) {
+   rc = -ENOBUFS;
+   goto async_writev_out;
+   }
+   req->Length = 0;
+   req->DataOffset = 0;
+   req->RemainingBytes =
+   (wdata->nr_pages-1)*PAGE_SIZE + wdata->tailsz;
+   req->Channel = SMB2_CHANNEL_RDMA_V1_INVALIDATE;
+   if (need_invalidate)
+   req->Channel = SMB2_CHANNEL_RDMA_V1;
+   req->WriteChannelInfoOffset =
+   offsetof(struct smb2_write_req, Buffer) - 4;
+   req->WriteChannelInfoLength =
+   sizeof(struct smbd_buffer_descriptor_v1);
+   v1 = (struct smbd_buffer_descriptor_v1 *) &req->Buffer[0];
+   v1->offset = wdata->mr->mr->iova;
+   v1->token = wdata->mr->mr->rkey;
+   v1->length = wdata->mr->mr->length;
+   }
+
/* 4 for rfc1002 length field and 1 for Buffer */
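
The decision and the RemainingBytes arithmetic in the smb2_async_writev() hunk
above can be sketched like this. It is an illustration under assumptions:
demo_use_rdma_read(), demo_remaining_bytes() and DEMO_PAGE_SIZE are
hypothetical names, and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/* A write goes through server RDMA read only at or above the threshold. */
static bool demo_use_rdma_read(bool rdma_enabled, size_t bytes,
			       size_t threshold)
{
	return rdma_enabled && bytes >= threshold;
}

/* All pages except the last are full; the last one holds tailsz bytes. */
static size_t demo_remaining_bytes(size_t nr_pages, size_t tailsz)
{
	assert(nr_pages > 0);
	assert(tailsz > 0 && tailsz <= DEMO_PAGE_SIZE);
	return (nr_pages - 1) * DEMO_PAGE_SIZE + tailsz;
}
```

For three pages with a 100-byte tail this gives 2 * 4096 + 100 = 8292 bytes,
which is the value the request would advertise in RemainingBytes while Length
and DataOffset are set to 0.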
  

[PATCH v3 07/10] arm: dts: mt7623: add iommu and jpecdec nodes

2017-10-01 Thread Ryder Lee
This patch adds iommu and jpegdec nodes for MT7623.

Signed-off-by: Ryder Lee 
---
 arch/arm/boot/dts/mt7623.dtsi | 74 +++
 1 file changed, 74 insertions(+)

diff --git a/arch/arm/boot/dts/mt7623.dtsi b/arch/arm/boot/dts/mt7623.dtsi
index a877f9a..b257715 100644
--- a/arch/arm/boot/dts/mt7623.dtsi
+++ b/arch/arm/boot/dts/mt7623.dtsi
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include "skeleton64.dtsi"
@@ -273,6 +274,17 @@
clock-names = "system-clk", "rtc-clk";
};
 
+   smi_common: smi@1000c000 {
+   compatible = "mediatek,mt7623-smi-common",
+"mediatek,mt2701-smi-common";
+   reg = <0 0x1000c000 0 0x1000>;
+   clocks = < CLK_INFRA_SMI>,
+< CLK_MM_SMI_COMMON>,
+< CLK_INFRA_SMI>;
+   clock-names = "apb", "smi", "async";
+   power-domains = < MT2701_POWER_DOMAIN_DISP>;
+   };
+
pwrap: pwrap@1000d000 {
compatible = "mediatek,mt7623-pwrap",
 "mediatek,mt2701-pwrap";
@@ -304,6 +316,17 @@
reg = <0 0x10200100 0 0x1c>;
};
 
+   iommu: mmsys_iommu@10205000 {
+   compatible = "mediatek,mt7623-m4u",
+"mediatek,mt2701-m4u";
+   reg = <0 0x10205000 0 0x1000>;
+   interrupts = ;
+   clocks = < CLK_INFRA_M4U>;
+   clock-names = "bclk";
+   mediatek,larbs = <  >;
+   #iommu-cells = <1>;
+   };
+
efuse: efuse@10206000 {
compatible = "mediatek,mt7623-efuse",
 "mediatek,mt8173-efuse";
@@ -669,6 +692,18 @@
#clock-cells = <1>;
};
 
+   larb0: larb@14010000 {
+   compatible = "mediatek,mt7623-smi-larb",
+"mediatek,mt2701-smi-larb";
+   reg = <0 0x14010000 0 0x1000>;
+   mediatek,smi = <&smi_common>;
+   mediatek,larb-id = <0>;
+   clocks = < CLK_MM_SMI_LARB0>,
+< CLK_MM_SMI_LARB0>;
+   clock-names = "apb", "smi";
+   power-domains = < MT2701_POWER_DOMAIN_DISP>;
+   };
+
imgsys: syscon@15000000 {
compatible = "mediatek,mt7623-imgsys",
 "mediatek,mt2701-imgsys",
@@ -677,6 +712,33 @@
#clock-cells = <1>;
};
 
+   larb2: larb@15001000 {
+   compatible = "mediatek,mt7623-smi-larb",
+"mediatek,mt2701-smi-larb";
+   reg = <0 0x15001000 0 0x1000>;
+   mediatek,smi = <&smi_common>;
+   mediatek,larb-id = <2>;
+   clocks = < CLK_IMG_SMI_COMM>,
+< CLK_IMG_SMI_COMM>;
+   clock-names = "apb", "smi";
+   power-domains = < MT2701_POWER_DOMAIN_ISP>;
+   };
+
+   jpegdec: jpegdec@15004000 {
+   compatible = "mediatek,mt7623-jpgdec",
+"mediatek,mt2701-jpgdec";
+   reg = <0 0x15004000 0 0x1000>;
+   interrupts = ;
+   clocks =  < CLK_IMG_JPGDEC_SMI>,
+ < CLK_IMG_JPGDEC>;
+   clock-names = "jpgdec-smi",
+ "jpgdec";
+   power-domains = < MT2701_POWER_DOMAIN_ISP>;
+   mediatek,larb = <>;
+   iommus = < MT2701_M4U_PORT_JPGDEC_WDMA>,
+< MT2701_M4U_PORT_JPGDEC_BSDMA>;
+   };
+
vdecsys: syscon@1600 {
compatible = "mediatek,mt7623-vdecsys",
 "mediatek,mt2701-vdecsys",
@@ -685,6 +747,18 @@
#clock-cells = <1>;
};
 
+   larb1: larb@1601 {
+   compatible = "mediatek,mt7623-smi-larb",
+"mediatek,mt2701-smi-larb";
+   reg = <0 0x1601 0 0x1000>;
+   mediatek,smi = <_common>;
+   mediatek,larb-id = <1>;
+   clocks = < CLK_VDEC_CKGEN>,
+< CLK_VDEC_LARB>;
+   clock-names = "apb", "smi";
+   power-domains = < MT2701_POWER_DOMAIN_VDEC>;
+   };
+
hifsys: syscon@1a00 {
compatible = "mediatek,mt7623-hifsys",
 "mediatek,mt2701-hifsys",
-- 
1.9.1
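
For readers following the series: each new node lists an MT7623-specific compatible string followed by the MT2701 one. A rough sketch of why that ordering is useful is below. It simply picks the first entry in a node's compatible list that a driver claims to support. The function `match_compatible` and the driver table are hypothetical names made up for illustration, and real kernel matching is more involved than this.

```python
from typing import Optional, Sequence, Set


def match_compatible(node_compatible: Sequence[str],
                     driver_supported: Set[str]) -> Optional[str]:
    """Return the first (most specific) compatible string the driver supports."""
    for entry in node_compatible:
        if entry in driver_supported:
            return entry
    return None


# Compatible list copied from the larb nodes added in this patch.
larb_compatible = ["mediatek,mt7623-smi-larb", "mediatek,mt2701-smi-larb"]

# Hypothetical driver table that only knows the MT2701 string.
mt2701_only_driver = {"mediatek,mt2701-smi-larb"}

matched = match_compatible(larb_compatible, mt2701_only_driver)
print(matched)
```

With the fallback string present, a driver that only knows the MT2701 name can still bind, while a later MT7623-aware driver would match the more specific first entry.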



[PATCH v3 06/10] arm: dts: mt7623: add subsystem clock controller nodes

2017-10-01 Thread Ryder Lee
This patch adds the missing subsystem clock controller nodes for MT7623
(e.g., mmsys, imgsys, vdecsys and bdpsys).

Signed-off-by: Ryder Lee 
---
 arch/arm/boot/dts/mt7623.dtsi | 32 
 1 file changed, 32 insertions(+)

diff --git a/arch/arm/boot/dts/mt7623.dtsi b/arch/arm/boot/dts/mt7623.dtsi
index 0640fb7..a877f9a 100644
--- a/arch/arm/boot/dts/mt7623.dtsi
+++ b/arch/arm/boot/dts/mt7623.dtsi
@@ -661,6 +661,30 @@
status = "disabled";
};
 
+   mmsys: syscon@1400 {
+   compatible = "mediatek,mt7623-mmsys",
+"mediatek,mt2701-mmsys",
+"syscon";
+   reg = <0 0x1400 0 0x1000>;
+   #clock-cells = <1>;
+   };
+
+   imgsys: syscon@1500 {
+   compatible = "mediatek,mt7623-imgsys",
+"mediatek,mt2701-imgsys",
+"syscon";
+   reg = <0 0x1500 0 0x1000>;
+   #clock-cells = <1>;
+   };
+
+   vdecsys: syscon@1600 {
+   compatible = "mediatek,mt7623-vdecsys",
+"mediatek,mt2701-vdecsys",
+"syscon";
+   reg = <0 0x1600 0 0x1000>;
+   #clock-cells = <1>;
+   };
+
hifsys: syscon@1a00 {
compatible = "mediatek,mt7623-hifsys",
 "mediatek,mt2701-hifsys",
@@ -799,4 +823,12 @@
power-domains = < MT2701_POWER_DOMAIN_ETH>;
status = "disabled";
};
+
+   bdpsys: syscon@1c00 {
+   compatible = "mediatek,mt7623-bdpsys",
+"mediatek,mt2701-bdpsys",
+"syscon";
+   reg = <0 0x1c00 0 0x1000>;
+   #clock-cells = <1>;
+   };
 };
-- 
1.9.1



[PATCH v3 05/10] arm: dts: mt7623: update pio, usb and crypto nodes

2017-10-01 Thread Ryder Lee
This patch updates the pio, usb and crypto nodes to make them consistent
with the binding documents.

Signed-off-by: Ryder Lee 
---
 arch/arm/boot/dts/mt7623.dtsi | 26 ++
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/arch/arm/boot/dts/mt7623.dtsi b/arch/arm/boot/dts/mt7623.dtsi
index 381843e..0640fb7 100644
--- a/arch/arm/boot/dts/mt7623.dtsi
+++ b/arch/arm/boot/dts/mt7623.dtsi
@@ -227,8 +227,7 @@
};
 
pio: pinctrl@10005000 {
-   compatible = "mediatek,mt7623-pinctrl",
-"mediatek,mt2701-pinctrl";
+   compatible = "mediatek,mt7623-pinctrl";
reg = <0 0x1000b000 0 0x1000>;
mediatek,pctl-regmap = <_pctl_a>;
pins-are-numbered;
@@ -680,7 +679,7 @@
interrupts = ;
clocks = < CLK_HIFSYS_USB0PHY>,
 < CLK_TOP_ETHIF_SEL>;
-   clock-names = "sys_ck", "free_ck";
+   clock-names = "sys_ck", "ref_ck";
power-domains = < MT2701_POWER_DOMAIN_HIF>;
phys = < PHY_TYPE_USB2>, < PHY_TYPE_USB3>;
status = "disabled";
@@ -690,8 +689,6 @@
compatible = "mediatek,mt7623-u3phy",
 "mediatek,mt2701-u3phy";
reg = <0 0x1a1c4000 0 0x0700>;
-   clocks = <>;
-   clock-names = "u3phya_ref";
#address-cells = <2>;
#size-cells = <2>;
ranges;
@@ -699,12 +696,16 @@
 
u2port0: usb-phy@1a1c4800 {
reg = <0 0x1a1c4800 0 0x0100>;
+   clocks = < CLK_TOP_USB_PHY48M>;
+   clock-names = "ref";
#phy-cells = <1>;
status = "okay";
};
 
u3port0: usb-phy@1a1c4900 {
reg = <0 0x1a1c4900 0 0x0700>;
+   clocks = <>;
+   clock-names = "ref";
#phy-cells = <1>;
status = "okay";
};
@@ -719,7 +720,7 @@
interrupts = ;
clocks = < CLK_HIFSYS_USB1PHY>,
 < CLK_TOP_ETHIF_SEL>;
-   clock-names = "sys_ck", "free_ck";
+   clock-names = "sys_ck", "ref_ck";
power-domains = < MT2701_POWER_DOMAIN_HIF>;
phys = < PHY_TYPE_USB2>, < PHY_TYPE_USB3>;
status = "disabled";
@@ -729,8 +730,6 @@
compatible = "mediatek,mt7623-u3phy",
 "mediatek,mt2701-u3phy";
reg = <0 0x1a244000 0 0x0700>;
-   clocks = <>;
-   clock-names = "u3phya_ref";
#address-cells = <2>;
#size-cells = <2>;
ranges;
@@ -738,12 +737,16 @@
 
u2port1: usb-phy@1a244800 {
reg = <0 0x1a244800 0 0x0100>;
+   clocks = < CLK_TOP_USB_PHY48M>;
+   clock-names = "ref";
#phy-cells = <1>;
status = "okay";
};
 
u3port1: usb-phy@1a244900 {
reg = <0 0x1a244900 0 0x0700>;
+   clocks = <>;
+   clock-names = "ref";
#phy-cells = <1>;
status = "okay";
};
@@ -784,16 +787,15 @@
};
 
crypto: crypto@1b24 {
-   compatible = "mediatek,mt7623-crypto";
+   compatible = "mediatek,eip97-crypto";
reg = <0 0x1b24 0 0x2>;
interrupts = ,
 ,
 ,
 ,
 ;
-   clocks = < CLK_TOP_ETHIF_SEL>,
-< CLK_ETHSYS_CRYPTO>;
-   clock-names = "ethif","cryp";
+   clocks = < CLK_ETHSYS_CRYPTO>;
+   clock-names = "cryp";
power-domains = < MT2701_POWER_DOMAIN_ETH>;
status = "disabled";
};
-- 
1.9.1



[PATCH v3 02/10] arm: dts: mt2701: enable display pwm backlight

2017-10-01 Thread Ryder Lee
From: Weiqing Kong 

This patch adds the board-related configuration for the MT2701 pwm backlight.

Signed-off-by: Weiqing Kong 
Signed-off-by: Erin Lo 
Signed-off-by: Ryder Lee 
---
 arch/arm/boot/dts/mt2701-evb.dts | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/arch/arm/boot/dts/mt2701-evb.dts b/arch/arm/boot/dts/mt2701-evb.dts
index f484973..63af4b1 100644
--- a/arch/arm/boot/dts/mt2701-evb.dts
+++ b/arch/arm/boot/dts/mt2701-evb.dts
@@ -56,12 +56,29 @@
bt_sco_codec:bt_sco_codec {
compatible = "linux,bt-sco";
};
+
+   backlight_lcd: backlight_lcd {
+   compatible = "pwm-backlight";
+   pwms = < 0 10>;
+   brightness-levels = <
+ 0  16  32  48  64  80  96 112
+   128 144 160 176 192 208 224 240
+   255
+   >;
+   default-brightness-level = <9>;
+   };
 };
 
  {
status = "okay";
 };
 
+ {
+   status = "okay";
+   pinctrl-names = "default";
+   pinctrl-0 = <_bls_gpio>;
+};
+
  {
pinctrl-names = "default";
pinctrl-0 = <_pins_a>;
@@ -111,6 +128,12 @@
};
};
 
+   pwm_bls_gpio: pwm_bls_gpio {
+   pins_cmd_dat {
+   pinmux = ;
+   };
+   };
+
spi_pins_a: spi0@0 {
pins_spi {
pinmux = ,
-- 
1.9.1
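
As a small illustration of the backlight_lcd node above: in the pwm-backlight binding, default-brightness-level is an index into brightness-levels, not a raw level. The sketch below looks that index up. The helper names are hypothetical, and the linear duty-cycle mapping is an assumption for illustration only, not a description of the driver.

```python
# Levels copied from the brightness-levels property in this patch.
BRIGHTNESS_LEVELS = [
    0, 16, 32, 48, 64, 80, 96, 112,
    128, 144, 160, 176, 192, 208, 224, 240,
    255,
]

# default-brightness-level from this patch; an index into the table above.
DEFAULT_BRIGHTNESS_INDEX = 9


def level_at(index: int) -> int:
    """Return the brightness level stored at the given table index."""
    if not 0 <= index < len(BRIGHTNESS_LEVELS):
        raise IndexError(f"brightness index {index} out of range")
    return BRIGHTNESS_LEVELS[index]


def duty_fraction(index: int) -> float:
    """Assumed linear mapping: level divided by the largest level."""
    return level_at(index) / max(BRIGHTNESS_LEVELS)


default_level = level_at(DEFAULT_BRIGHTNESS_INDEX)
print(default_level, round(duty_fraction(DEFAULT_BRIGHTNESS_INDEX), 3))
```

Index 9 selects the level 144 from the 17-entry table, so the board comes up at a bit more than half of the maximum level.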



[PATCH v3 04/10] arm: dts: mediatek: update audio node for mt2701 and mt7623

2017-10-01 Thread Ryder Lee
This patch adds the interrupt-names property to the audio node so that
the binding can be agnostic of the IRQ order.

Signed-off-by: Ryder Lee 
---
 arch/arm/boot/dts/mt2701.dtsi | 4 +++-
 arch/arm/boot/dts/mt7623.dtsi | 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/arm/boot/dts/mt2701.dtsi b/arch/arm/boot/dts/mt2701.dtsi
index 8c9fbe5..ecd388a 100644
--- a/arch/arm/boot/dts/mt2701.dtsi
+++ b/arch/arm/boot/dts/mt2701.dtsi
@@ -445,7 +445,9 @@
compatible = "mediatek,mt2701-audio";
reg = <0 0x1122 0 0x2000>,
  <0 0x112a 0 0x2>;
-   interrupts = ;
+   interrupts =  ,
+ ;
+   interrupt-names = "afe", "asys";
power-domains = < MT2701_POWER_DOMAIN_IFR_MSC>;
 
clocks = < CLK_INFRA_AUDIO>,
diff --git a/arch/arm/boot/dts/mt7623.dtsi b/arch/arm/boot/dts/mt7623.dtsi
index ec8a074..381843e 100644
--- a/arch/arm/boot/dts/mt7623.dtsi
+++ b/arch/arm/boot/dts/mt7623.dtsi
@@ -544,7 +544,9 @@
 "mediatek,mt2701-audio";
reg = <0 0x1122 0 0x2000>,
  <0 0x112a 0 0x2>;
-   interrupts = ;
+   interrupts =  ,
+ ;
+   interrupt-names = "afe", "asys";
power-domains = < MT2701_POWER_DOMAIN_IFR_MSC>;
 
clocks = < CLK_INFRA_AUDIO>,
-- 
1.9.1
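
To make the "agnostic of the IRQ order" point concrete, here is a minimal sketch of a name-based lookup. The names "afe" and "asys" come from the patch; the interrupt numbers and the helper `irqs_by_name` are hypothetical, since the real interrupt specifiers are not reproduced here. In Linux a driver would typically get the same effect with platform_get_irq_byname() rather than a positional index.

```python
from typing import Dict, Sequence


def irqs_by_name(interrupts: Sequence[int],
                 names: Sequence[str]) -> Dict[str, int]:
    """Pair each interrupt with the name at the same position."""
    if len(interrupts) != len(names):
        raise ValueError("interrupts and interrupt-names differ in length")
    return dict(zip(names, interrupts))


# Hypothetical interrupt numbers, used only for illustration.
AFE_IRQ, ASYS_IRQ = 100, 101

# The same two interrupts listed in both possible orders.
original_order = irqs_by_name([AFE_IRQ, ASYS_IRQ], ["afe", "asys"])
swapped_order = irqs_by_name([ASYS_IRQ, AFE_IRQ], ["asys", "afe"])

# A consumer asking for "afe" by name gets the same IRQ either way.
print(original_order["afe"] == swapped_order["afe"])
```

A positional lookup would return different interrupts for the two orderings, which is the dependency the names remove.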



[PATCH v3 03/10] arm: dts: mt2701: add display subsystem related nodes

2017-10-01 Thread Ryder Lee
From: YT Shen 

This patch adds the device nodes for MT2701 DISP function blocks.

Signed-off-by: YT Shen 
Signed-off-by: Erin Lo 
Signed-off-by: Ryder Lee 
---
 arch/arm/boot/dts/mt2701.dtsi | 75 +++
 1 file changed, 75 insertions(+)

diff --git a/arch/arm/boot/dts/mt2701.dtsi b/arch/arm/boot/dts/mt2701.dtsi
index 3c85879..8c9fbe5 100644
--- a/arch/arm/boot/dts/mt2701.dtsi
+++ b/arch/arm/boot/dts/mt2701.dtsi
@@ -26,6 +26,11 @@
compatible = "mediatek,mt2701";
interrupt-parent = <>;
 
+   aliases {
+   rdma0 = 
+   rdma1 = 
+   };
+
cpus {
#address-cells = <1>;
#size-cells = <0>;
@@ -203,6 +208,16 @@
power-domains = < MT2701_POWER_DOMAIN_DISP>;
};
 
+   mipi_tx0: mipi-dphy@1001 {
+   compatible = "mediatek,mt2701-mipi-tx";
+   reg = <0 0x1001 0 0x90>;
+   clocks = <>;
+   clock-output-names = "mipi_tx0_pll";
+   #clock-cells = <0>;
+   #phy-cells = <0>;
+   status = "disabled";
+   };
+
sysirq: interrupt-controller@10200100 {
compatible = "mediatek,mt2701-sysirq",
 "mediatek,mt6577-sysirq";
@@ -530,6 +545,30 @@
#clock-cells = <1>;
};
 
+   display_components: dispsys@1400 {
+   compatible = "mediatek,mt2701-mmsys";
+   reg = <0 0x1400 0 0x1000>;
+   power-domains = < MT2701_POWER_DOMAIN_DISP>;
+   };
+
+   ovl: ovl@14007000 {
+   compatible = "mediatek,mt2701-disp-ovl";
+   reg = <0 0x14007000 0 0x1000>;
+   interrupts = ;
+   clocks = < CLK_MM_DISP_OVL>;
+   iommus = < MT2701_M4U_PORT_DISP_OVL_0>;
+   mediatek,larb = <>;
+   };
+
+   rdma0: rdma@14008000 {
+   compatible = "mediatek,mt2701-disp-rdma";
+   reg = <0 0x14008000 0 0x1000>;
+   interrupts = ;
+   clocks = < CLK_MM_DISP_RDMA>;
+   iommus = < MT2701_M4U_PORT_DISP_RDMA>;
+   mediatek,larb = <>;
+   };
+
bls: pwm@1400a000 {
compatible = "mediatek,mt2701-disp-pwm";
reg = <0 0x1400a000 0 0x1000>;
@@ -539,6 +578,33 @@
status = "disabled";
};
 
+   color: color@1400b000 {
+   compatible = "mediatek,mt2701-disp-color";
+   reg = <0 0x1400b000 0 0x1000>;
+   interrupts = ;
+   clocks = < CLK_MM_DISP_COLOR>;
+   };
+
+   dsi: dsi@1400c000 {
+   compatible = "mediatek,mt2701-dsi";
+   reg = <0 0x1400c000 0 0x1000>;
+   interrupts = ;
+   clocks = < CLK_MM_DSI_ENGINE>,
+< CLK_MM_DSI_DIG>,
+<_tx0>;
+   clock-names = "engine", "digital", "hs";
+   phys = <_tx0>;
+   phy-names = "dphy";
+   status = "disabled";
+   };
+
+   mutex: mutex@1400e000 {
+   compatible = "mediatek,mt2701-disp-mutex";
+   reg = <0 0x1400e000 0 0x1000>;
+   interrupts = ;
+   clocks = < CLK_MM_MUTEX_32K>;
+   };
+
larb0: larb@1401 {
compatible = "mediatek,mt2701-smi-larb";
reg = <0 0x1401 0 0x1000>;
@@ -550,6 +616,15 @@
power-domains = < MT2701_POWER_DOMAIN_DISP>;
};
 
+   rdma1: rdma@14012000 {
+   compatible = "mediatek,mt2701-disp-rdma";
+   reg = <0 0x14012000 0 0x1000>;
+   interrupts = ;
+   clocks = < CLK_MM_DISP_RDMA1>;
+   iommus = < MT2701_M4U_PORT_DISP_RDMA1>;
+   mediatek,larb = <>;
+   };
+
imgsys: syscon@1500 {
compatible = "mediatek,mt2701-imgsys", "syscon";
reg = <0 0x1500 0 0x1000>;
-- 
1.9.1
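
For anyone reading the four-cell reg values in these nodes, the sketch below shows how such a property splits into an address and a size. It assumes the parent bus uses #address-cells = <2> and #size-cells = <2>, which is not visible in this hunk but is consistent with the four cells per entry; `decode_reg` is a hypothetical helper, not a kernel function.

```python
from typing import List, Sequence, Tuple


def combine_cells(cells: Sequence[int]) -> int:
    """Join 32-bit cells, most significant first, into one integer."""
    value = 0
    for cell in cells:
        value = (value << 32) | (cell & 0xFFFFFFFF)
    return value


def decode_reg(cells: Sequence[int], address_cells: int = 2,
               size_cells: int = 2) -> List[Tuple[int, int]]:
    """Split a flat reg cell list into (address, size) entries."""
    stride = address_cells + size_cells
    if stride == 0 or len(cells) % stride != 0:
        raise ValueError("reg length does not match the cell counts")
    entries = []
    for start in range(0, len(cells), stride):
        address = combine_cells(cells[start:start + address_cells])
        size = combine_cells(cells[start + address_cells:start + stride])
        entries.append((address, size))
    return entries


# reg value of the ovl node added in this patch.
ovl_reg = decode_reg([0, 0x14007000, 0, 0x1000])
print([(hex(address), hex(size)) for address, size in ovl_reg])
```

Under that assumption the ovl node describes a single 0x1000-byte register window starting at 0x14007000.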



[PATCH v3 00/10] update MT7623 and MT2701 dts

2017-10-01 Thread Ryder Lee
Hi Matthias,

This patch series adds/corrects some device nodes for both MT7623 and MT2701.

changes since v3:
- revert PIO register space.

changes since v2:
- move non-common part and non-display related nodes to different patches.
- remove unused wdma node.
- add display related nodes for MT2701.

changes since v1:
- rebase to v4.14.
- sort nodes in alphabetical order

Ryder Lee (7):
  arm: dts: mediatek: update audio node for mt2701 and mt7623
  arm: dts: mt7623: update pio, usb and crypto nodes
  arm: dts: mt7623: add subsystem clock controller nodes
  arm: dts: mt7623: add iommu and jpecdec nodes
  arm: dts: mt7623: add display subsystem related nodes
  arm: dts: mt7623: enable bananapi-r2 display function
  arm: dts: mt7623: add PCIe related nodes

Weiqing Kong (2):
  arm: dts: mt2701: add pwm backlight device node
  arm: dts: mt2701: enable display pwm backlight

YT Shen (1):
  arm: dts: mt2701: add display subsystem related nodes

 arch/arm/boot/dts/mt2701-evb.dts  |  23 ++
 arch/arm/boot/dts/mt2701.dtsi |  88 ++-
 arch/arm/boot/dts/mt7623.dtsi | 338 +-
 arch/arm/boot/dts/mt7623n-bananapi-bpi-r2.dts |  71 +-
 include/dt-bindings/pinctrl/mt7623-pinfunc.h  |  12 +
 5 files changed, 516 insertions(+), 16 deletions(-)

-- 
1.9.1



[PATCH v3 08/10] arm: dts: mt7623: add display subsystem related nodes

2017-10-01 Thread Ryder Lee
This patch adds the device nodes for the display function blocks.

Signed-off-by: Ryder Lee 
---
 arch/arm/boot/dts/mt7623.dtsi | 94 +++
 1 file changed, 94 insertions(+)

diff --git a/arch/arm/boot/dts/mt7623.dtsi b/arch/arm/boot/dts/mt7623.dtsi
index b257715..b19aa9f 100644
--- a/arch/arm/boot/dts/mt7623.dtsi
+++ b/arch/arm/boot/dts/mt7623.dtsi
@@ -29,6 +29,11 @@
compatible = "mediatek,mt7623";
interrupt-parent = <>;
 
+   aliases {
+   rdma0 = 
+   rdma1 = 
+   };
+
cpu_opp_table: opp_table {
compatible = "operating-points-v2";
opp-shared;
@@ -298,6 +303,17 @@
clock-names = "spi", "wrap";
};
 
+   mipi_tx0: mipi-dphy@1001 {
+   compatible = "mediatek,mt7623-mipi-tx",
+"mediatek,mt2701-mipi-tx";
+   reg = <0 0x1001 0 0x90>;
+   clocks = <>;
+   clock-output-names = "mipi_tx0_pll";
+   #clock-cells = <0>;
+   #phy-cells = <0>;
+   status = "disabled";
+   };
+
cir: cir@10013000 {
compatible = "mediatek,mt7623-cir";
reg = <0 0x10013000 0 0x1000>;
@@ -692,6 +708,74 @@
#clock-cells = <1>;
};
 
+   display_components: dispsys@1400 {
+   compatible = "mediatek,mt7623-mmsys",
+"mediatek,mt2701-mmsys";
+   reg = <0 0x1400 0 0x1000>;
+   power-domains = < MT2701_POWER_DOMAIN_DISP>;
+   };
+
+   ovl: ovl@14007000 {
+   compatible = "mediatek,mt7623-disp-ovl",
+"mediatek,mt2701-disp-ovl";
+   reg = <0 0x14007000 0 0x1000>;
+   interrupts = ;
+   clocks = < CLK_MM_DISP_OVL>;
+   iommus = < MT2701_M4U_PORT_DISP_OVL_0>;
+   mediatek,larb = <>;
+   };
+
+   rdma0: rdma@14008000 {
+   compatible = "mediatek,mt7623-disp-rdma",
+"mediatek,mt2701-disp-rdma";
+   reg = <0 0x14008000 0 0x1000>;
+   interrupts = ;
+   clocks = < CLK_MM_DISP_RDMA>;
+   iommus = < MT2701_M4U_PORT_DISP_RDMA>;
+   mediatek,larb = <>;
+   };
+
+   bls: pwm@1400a000 {
+   compatible = "mediatek,mt7623-disp-pwm",
+"mediatek,mt2701-disp-pwm";
+   reg = <0 0x1400a000 0 0x1000>;
+   #pwm-cells = <2>;
+   clocks = < CLK_MM_MDP_BLS_26M>,
+< CLK_MM_DISP_BLS>;
+   clock-names = "main", "mm";
+   status = "disabled";
+   };
+
+   color: color@1400b000 {
+   compatible = "mediatek,mt7623-disp-color",
+"mediatek,mt2701-disp-color";
+   reg = <0 0x1400b000 0 0x1000>;
+   interrupts = ;
+   clocks = < CLK_MM_DISP_COLOR>;
+   };
+
+   dsi: dsi@1400c000 {
+   compatible = "mediatek,mt7623-dsi",
+"mediatek,mt2701-dsi";
+   reg = <0 0x1400c000 0 0x1000>;
+   interrupts = ;
+   clocks = < CLK_MM_DSI_ENGINE>,
+< CLK_MM_DSI_DIG>,
+<_tx0>;
+   clock-names = "engine", "digital", "hs";
+   phys = <_tx0>;
+   phy-names = "dphy";
+   status = "disabled";
+   };
+
+   mutex: mutex@1400e000 {
+   compatible = "mediatek,mt7623-disp-mutex",
+"mediatek,mt2701-disp-mutex";
+   reg = <0 0x1400e000 0 0x1000>;
+   interrupts = ;
+   clocks = < CLK_MM_MUTEX_32K>;
+   };
+
larb0: larb@1401 {
compatible = "mediatek,mt7623-smi-larb",
 "mediatek,mt2701-smi-larb";
@@ -704,6 +788,16 @@
power-domains = < MT2701_POWER_DOMAIN_DISP>;
};
 
+   rdma1: rdma@14012000 {
+   compatible = "mediatek,mt7623-disp-rdma",
+"mediatek,mt2701-disp-rdma";
+   reg = <0 0x14012000 0 0x1000>;
+   interrupts = ;
+   clocks = < CLK_MM_DISP_RDMA1>;
+   iommus = < MT2701_M4U_PORT_DISP_RDMA1>;
+   mediatek,larb = <>;
+   };
+
imgsys: syscon@1500 {
compatible = "mediatek,mt7623-imgsys",
 "mediatek,mt2701-imgsys",
-- 
1.9.1



[PATCH v3 01/10] arm: dts: mt2701: add pwm backlight device node

2017-10-01 Thread Ryder Lee
From: Weiqing Kong 

This patch adds the device node for MT2701 pwm backlight.

Signed-off-by: Weiqing Kong 
Signed-off-by: Erin Lo 
Signed-off-by: Ryder Lee 
---
 arch/arm/boot/dts/mt2701.dtsi | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/arm/boot/dts/mt2701.dtsi b/arch/arm/boot/dts/mt2701.dtsi
index afe12e5..3c85879 100644
--- a/arch/arm/boot/dts/mt2701.dtsi
+++ b/arch/arm/boot/dts/mt2701.dtsi
@@ -530,6 +530,15 @@
#clock-cells = <1>;
};
 
+   bls: pwm@1400a000 {
+   compatible = "mediatek,mt2701-disp-pwm";
+   reg = <0 0x1400a000 0 0x1000>;
+   #pwm-cells = <2>;
+   clocks = < CLK_MM_MDP_BLS_26M>, < CLK_MM_DISP_BLS>;
+   clock-names = "main", "mm";
+   status = "disabled";
+   };
+
larb0: larb@1401 {
compatible = "mediatek,mt2701-smi-larb";
reg = <0 0x1401 0 0x1000>;
-- 
1.9.1



[PATCH v3 10/10] arm: dts: mt7623: add PCIe related nodes

2017-10-01 Thread Ryder Lee
This patch adds device nodes and updates the pinmux setting for the PCIe
function block. Note that the PCIe port2 PHY is shared with the U3 port.

Signed-off-by: Ryder Lee 
---
 arch/arm/boot/dts/mt7623.dtsi | 108 ++
 arch/arm/boot/dts/mt7623n-bananapi-bpi-r2.dts |  30 +++
 2 files changed, 138 insertions(+)

diff --git a/arch/arm/boot/dts/mt7623.dtsi b/arch/arm/boot/dts/mt7623.dtsi
index b19aa9f..32d454e 100644
--- a/arch/arm/boot/dts/mt7623.dtsi
+++ b/arch/arm/boot/dts/mt7623.dtsi
@@ -862,6 +862,114 @@
#reset-cells = <1>;
};
 
+   pcie: pcie-controller@1a14 {
+   compatible = "mediatek,mt7623-pcie";
+   device_type = "pci";
+   reg = <0 0x1a14 0 0x1000>, /* PCIe shared registers */
+ <0 0x1a142000 0 0x1000>, /* Port0 registers */
+ <0 0x1a143000 0 0x1000>, /* Port1 registers */
+ <0 0x1a144000 0 0x1000>; /* Port2 registers */
+   reg-names = "subsys", "port0", "port1", "port2";
+   #address-cells = <3>;
+   #size-cells = <2>;
+   #interrupt-cells = <1>;
+   interrupt-map-mask = <0xf800 0 0 0>;
+   interrupt-map = <0x 0 0 0  GIC_SPI 193 IRQ_TYPE_LEVEL_LOW>,
+   <0x0800 0 0 0  GIC_SPI 194 IRQ_TYPE_LEVEL_LOW>,
+   <0x1000 0 0 0  GIC_SPI 195 IRQ_TYPE_LEVEL_LOW>;
+   clocks = < CLK_TOP_ETHIF_SEL>,
+< CLK_HIFSYS_PCIE0>,
+< CLK_HIFSYS_PCIE1>,
+< CLK_HIFSYS_PCIE2>;
+   clock-names = "free_ck", "sys_ck0", "sys_ck1", "sys_ck2";
+   resets = < MT2701_HIFSYS_PCIE0_RST>,
+< MT2701_HIFSYS_PCIE1_RST>,
+< MT2701_HIFSYS_PCIE2_RST>;
+   reset-names = "pcie-rst0", "pcie-rst1", "pcie-rst2";
+   phys = <_port PHY_TYPE_PCIE>,
+  <_port PHY_TYPE_PCIE>,
+  < PHY_TYPE_PCIE>;
+   phy-names = "pcie-phy0", "pcie-phy1", "pcie-phy2";
+   power-domains = < MT2701_POWER_DOMAIN_HIF>;
+   bus-range = <0x00 0xff>;
+   status = "disabled";
+   ranges = <0x8100 0 0x1a16 0 0x1a16 0 0x0001
+ 0x8300 0 0x6000 0 0x6000 0 0x1000>;
+
+   pcie@0,0 {
+   device_type = "pci";
+   reg = <0x 0 0 0 0>;
+   #address-cells = <3>;
+   #size-cells = <2>;
+   #interrupt-cells = <1>;
+   interrupt-map-mask = <0 0 0 0>;
+   interrupt-map = <0 0 0 0  GIC_SPI 193 IRQ_TYPE_LEVEL_LOW>;
+   ranges;
+   num-lanes = <1>;
+   status = "disabled";
+   };
+
+   pcie@1,0 {
+   device_type = "pci";
+   reg = <0x0800 0 0 0 0>;
+   #address-cells = <3>;
+   #size-cells = <2>;
+   #interrupt-cells = <1>;
+   interrupt-map-mask = <0 0 0 0>;
+   interrupt-map = <0 0 0 0  GIC_SPI 194 IRQ_TYPE_LEVEL_LOW>;
+   ranges;
+   num-lanes = <1>;
+   status = "disabled";
+   };
+
+   pcie@2,0 {
+   device_type = "pci";
+   reg = <0x1000 0 0 0 0>;
+   #address-cells = <3>;
+   #size-cells = <2>;
+   #interrupt-cells = <1>;
+   interrupt-map-mask = <0 0 0 0>;
+   interrupt-map = <0 0 0 0  GIC_SPI 195 IRQ_TYPE_LEVEL_LOW>;
+   ranges;
+   num-lanes = <1>;
+   status = "disabled";
+   };
+   };
+
+   pcie0_phy: pcie-phy@1a149000 {
+   compatible = "mediatek,generic-tphy-v1";
+   reg = <0 0x1a149000 0 0x0700>;
+   #address-cells = <2>;
+   #size-cells = <2>;
+   ranges;
+   status = "disabled";
+
+   pcie0_port: pcie-phy@1a149900 {
+   reg = <0 0x1a149900 0 0x0700>;
+   clocks = <>;
+   clock-names = "ref";
+   #phy-cells = <1>;
+   status = "okay";
+   };
+   };
+
+   pcie1_phy: pcie-phy@1a14a000 {
+   compatible = "mediatek,generic-tphy-v1";
+   reg = <0 0x1a14a000 0 0x0700>;
+   #address-cells = <2>;
+   #size-cells = <2>;
+   ranges;
+   status = "disabled";
+
+   pcie1_port: 
