Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask

2018-08-17 Thread Sagi Grimberg

Hi Jason,


The new patchworks doesn't grab patches inlined in messages, so you
will need to resend it.


Yes, I just wanted to add Steve's Tested-by as it's going to
lists that did not follow this thread.


Also, can someone remind me what the outcome is here? Does it
supersede Leon's patch:

https://patchwork.kernel.org/patch/10526167/


Leon's patch exposes the breakage, so I think it would be
wise to have it go in after this lands in mainline.


Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask

2018-08-17 Thread Sagi Grimberg




Hey Sagi,

The patch works, allowing connections for the various affinity mappings below:

One comp_vector per core across all cores, starting with numa-local cores:


Thanks Steve, is this your "Tested-by:" tag?


Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask

2018-08-16 Thread Sagi Grimberg




Let me know if you want me to try this or any particular fix.


Steve, can you test this one?
--
[PATCH rfc] block: fix rdma queue mapping

nvme-rdma attempts to map queues based on irq vector affinity.
However, for some devices, completion vector irq affinity is
configurable by the user which can break the existing assumption
that irq vectors are optimally arranged over the host cpu cores.

So we map queues in two stages:
First, map queues according to the completion vector IRQ affinity,
taking the first cpu in the vector's affinity map. If the current irq
affinity is arranged such that a vector is not assigned to any distinct
cpu, we map it to a cpu that is on the same node. If numa affinity
cannot be satisfied, we map it to any unmapped cpu we can find. Then,
map the remaining cpus in the possible cpumask naively.

Signed-off-by: Sagi Grimberg 
---
Steve, can you test out this patch?
 block/blk-mq-cpumap.c  | 39 +---
 block/blk-mq-rdma.c    | 80 +++---
 include/linux/blk-mq.h |  1 +
 3 files changed, 93 insertions(+), 27 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 3eb169f15842..34811db8cba9 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -30,30 +30,35 @@ static int get_first_sibling(unsigned int cpu)
return cpu;
 }

-int blk_mq_map_queues(struct blk_mq_tag_set *set)
+void blk_mq_map_queue_cpu(struct blk_mq_tag_set *set, unsigned int cpu)
 {
unsigned int *map = set->mq_map;
unsigned int nr_queues = set->nr_hw_queues;
-   unsigned int cpu, first_sibling;
+   unsigned int first_sibling;

-   for_each_possible_cpu(cpu) {
-   /*
-* First do sequential mapping between CPUs and queues.
-* In case we still have CPUs to map, and we have some number of
-* threads per cores then map sibling threads to the same queue for
-* performace optimizations.
-*/
-   if (cpu < nr_queues) {
+   /*
+* First do sequential mapping between CPUs and queues.
+* In case we still have CPUs to map, and we have some number of
+* threads per cores then map sibling threads to the same queue for
+* performace optimizations.
+*/
+   if (cpu < nr_queues) {
+   map[cpu] = cpu_to_queue_index(nr_queues, cpu);
+   } else {
+   first_sibling = get_first_sibling(cpu);
+   if (first_sibling == cpu)
map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-   } else {
-   first_sibling = get_first_sibling(cpu);
-   if (first_sibling == cpu)
-   map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-   else
-   map[cpu] = map[first_sibling];
-   }
+   else
+   map[cpu] = map[first_sibling];
}
+}
+
+int blk_mq_map_queues(struct blk_mq_tag_set *set)
+{
+   unsigned int cpu;

+   for_each_possible_cpu(cpu)
+   blk_mq_map_queue_cpu(set, cpu);
return 0;
 }
 EXPORT_SYMBOL_GPL(blk_mq_map_queues);
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f1de18..d04cbb1925f5 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -14,6 +14,61 @@
 #include 
 #include 

+static int blk_mq_rdma_map_queue(struct blk_mq_tag_set *set,
+   struct ib_device *dev, int first_vec, unsigned int queue)
+{
+   const struct cpumask *mask;
+   unsigned int cpu;
+   bool mapped = false;
+
+   mask = ib_get_vector_affinity(dev, first_vec + queue);
+   if (!mask)
+   return -ENOTSUPP;
+
+   /* map with an unmapped cpu according to affinity mask */
+   for_each_cpu(cpu, mask) {
+   if (set->mq_map[cpu] == UINT_MAX) {
+   set->mq_map[cpu] = queue;
+   mapped = true;
+   break;
+   }
+   }
+
+   if (!mapped) {
+   int n;
+
+   /* map with an unmapped cpu in the same numa node */
+   for_each_node(n) {
+   const struct cpumask *node_cpumask = cpumask_of_node(n);
+
+   if (!cpumask_intersects(mask, node_cpumask))
+   continue;
+
+   for_each_cpu(cpu, node_cpumask) {
+   if (set->mq_map[cpu] == UINT_MAX) {
+   set->mq_map[cpu] = queue;
+   mapped = true;
+   break;
+   }
+   }
+   }
+   }
+
+   if (!mapped) {
+   /* map with any unmapped cpu we can find */
+   f
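As a rough, untested sketch of the per-queue fallback described in the
changelog above (not the patch itself): assuming mq_map entries are
pre-initialized to UINT_MAX, and with any leftover CPUs later filled in
naively via blk_mq_map_queue_cpu() as in the blk-mq-cpumap.c hunk above,
the mapping for a single queue could look roughly like this:
--
#include <linux/blk-mq.h>
#include <linux/topology.h>
#include <rdma/ib_verbs.h>

static int example_rdma_map_queue(struct blk_mq_tag_set *set,
		struct ib_device *dev, int first_vec, unsigned int queue)
{
	const struct cpumask *mask;
	unsigned int cpu;
	int n;

	mask = ib_get_vector_affinity(dev, first_vec + queue);
	if (!mask)
		return -ENOTSUPP;

	/* 1) an unmapped cpu from the vector's affinity mask */
	for_each_cpu(cpu, mask)
		if (set->mq_map[cpu] == UINT_MAX)
			goto map;

	/* 2) an unmapped cpu on a numa node the mask intersects */
	for_each_node(n) {
		if (!cpumask_intersects(mask, cpumask_of_node(n)))
			continue;
		for_each_cpu(cpu, cpumask_of_node(n))
			if (set->mq_map[cpu] == UINT_MAX)
				goto map;
	}

	/* 3) any unmapped cpu we can find */
	for_each_possible_cpu(cpu)
		if (set->mq_map[cpu] == UINT_MAX)
			goto map;

	/* everything already mapped; nothing left to do for this queue */
	return 0;
map:
	set->mq_map[cpu] = queue;
	return 0;
}
--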

Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask

2018-07-31 Thread Sagi Grimberg

Hi Max,


Yes, since nvmf is the only user of this function.
Still waiting for comments on the suggested patch :)



Sorry for the late response (but I'm on vacation so I have
an excuse ;))

I'm thinking that we should avoid trying to find an assignment
when stuff like irqbalance daemon is running and changing
the affinitization.

This extension was made to apply optimal affinity assignment
when the device irq affinity is lined up in a vector per
core.

I'm thinking that when we identify this is not the case, we immediately
fall back to the default mapping:

1. When we get a mask, if its weight != 1, we fall back.
2. If a queue was left unmapped, we fall back.

Maybe something like the following:
--
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f1de18..1ada6211c55e 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -35,17 +35,26 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
const struct cpumask *mask;
unsigned int queue, cpu;

+   /* reset all CPUs mapping */
+   for_each_possible_cpu(cpu)
+   set->mq_map[cpu] = UINT_MAX;
+
for (queue = 0; queue < set->nr_hw_queues; queue++) {
mask = ib_get_vector_affinity(dev, first_vec + queue);
if (!mask)
goto fallback;

-   for_each_cpu(cpu, mask)
-   set->mq_map[cpu] = queue;
+   if (cpumask_weight(mask) != 1)
+   goto fallback;
+
+   cpu = cpumask_first(mask);
+   if (set->mq_map[cpu] != UINT_MAX)
+   goto fallback;
+
+   set->mq_map[cpu] = queue;
}

return 0;
-
 fallback:
return blk_mq_map_queues(set);
 }
--


Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask

2018-07-18 Thread Sagi Grimberg




IMO we must fulfill the user's wish to connect to N queues and not reduce
it because of affinity overlaps. So in order to push Leon's patch we
must also fix blk_mq_rdma_map_queues to do a best-effort mapping
according to the affinity and map the rest in a naive way (that way we
will *always* map all the queues).


That is what I would expect also.   For example, on my node, where there are
16 cpus and 2 numa nodes, I observe much better nvmf IOPS performance by
setting up my 16 driver completion event queues such that each is bound to a
node-local cpu.  So I end up with each node-local cpu having 2 queues bound
to it.   W/O adding support in iw_cxgb4 for ib_get_vector_affinity(), this
works fine.   I assumed adding ib_get_vector_affinity() would allow this to
all "just work" by default, but I'm running into this connection failure
issue.

I don't understand exactly what the blk_mq layer is trying to do, but I
assume it has ingress event queues and processing that it's trying to align
with the driver's ingress cq event handling, so everybody stays on the same
cpu (or at least node).   But something else is going on.  Is there
documentation on how this works somewhere?


Does this (untested) patch help?
--
diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 3eb169f15842..dbe962cb537d 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -30,29 +30,34 @@ static int get_first_sibling(unsigned int cpu)
return cpu;
 }

-int blk_mq_map_queues(struct blk_mq_tag_set *set)
+void blk_mq_map_queue(struct blk_mq_tag_set *set, unsigned int cpu)
 {
unsigned int *map = set->mq_map;
unsigned int nr_queues = set->nr_hw_queues;
-   unsigned int cpu, first_sibling;
+   unsigned int first_sibling;

-   for_each_possible_cpu(cpu) {
-   /*
-* First do sequential mapping between CPUs and queues.
-* In case we still have CPUs to map, and we have some number of
-* threads per cores then map sibling threads to the same queue for
-* performace optimizations.
-*/
-   if (cpu < nr_queues) {
+   /*
+* First do sequential mapping between CPUs and queues.
+* In case we still have CPUs to map, and we have some number of
+* threads per cores then map sibling threads to the same queue for
+* performace optimizations.
+*/
+   if (cpu < nr_queues) {
+   map[cpu] = cpu_to_queue_index(nr_queues, cpu);
+   } else {
+   first_sibling = get_first_sibling(cpu);
+   if (first_sibling == cpu)
map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-   } else {
-   first_sibling = get_first_sibling(cpu);
-   if (first_sibling == cpu)
-   map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-   else
-   map[cpu] = map[first_sibling];
-   }
+   else
+   map[cpu] = map[first_sibling];
}
+}
+EXPORT_SYMBOL_GPL(blk_mq_map_queue);
+
+int blk_mq_map_queues(struct blk_mq_tag_set *set)
+{
+   for_each_possible_cpu(cpu)
+   blk_mq_map_queue(set, cpu);

return 0;
 }
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f1de18..5e91789bea5b 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -35,6 +35,10 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
const struct cpumask *mask;
unsigned int queue, cpu;

+   /* reset all to  */
+   for_each_possible_cpu(cpu)
+   set->mq_map[cpu] = UINT_MAX;
+
for (queue = 0; queue < set->nr_hw_queues; queue++) {
mask = ib_get_vector_affinity(dev, first_vec + queue);
if (!mask)
@@ -44,6 +48,11 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
set->mq_map[cpu] = queue;
}

+   for_each_possible_cpu(cpu) {
+   if (set->mq_map[cpu] == UINT_MAX)
+   blk_mq_map_queue(set, cpu);
+   }
+
return 0;

 fallback:
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index e3147eb74222..7a9848a82475 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -283,6 +283,7 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 unsigned long timeout);

 int blk_mq_map_queues(struct blk_mq_tag_set *set);
+void blk_mq_map_queue(struct blk_mq_tag_set *set, unsigned int cpu);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);


 void blk_mq_quiesce_queue_nowait(struct request_queue *q);
--

It really is still a best effort thing...


Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask

2018-07-16 Thread Sagi Grimberg




Hi,
I've tested this patch and it seems problematic at this moment.


Problematic how? what are you seeing?

maybe this is because of the bug that Steve mentioned in the NVMe 
mailing list. Sagi mentioned that we should fix it in the NVMe/RDMA 
initiator and I'll run his suggestion as well.


Is your device irq affinity linear?


BTW, when I run the blk_mq_map_queues it works for every irq affinity.


But it's probably not aligned to the device vector affinity.


Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask

2018-07-16 Thread Sagi Grimberg

Leon, I'd like to see a tested-by tag for this (at least
until I get some time to test it).

The patch itself looks fine to me.


Re: [net 02/11] net/mlx5: Fix get vector affinity helper function

2018-01-14 Thread Sagi Grimberg

Hi Saeed,


mlx5_get_vector_affinity used to call pci_irq_get_affinity and after
reverting the patch that sets the device affinity via PCI_IRQ_AFFINITY
API, calling pci_irq_get_affinity becomes useless and it breaks RDMA
mlx5 users.  To fix this, this patch provides an alternative way to
retrieve IRQ vector affinity using legacy IRQ API, following
smp_affinity read procfs implementation.

Fixes: 231243c82793 ("Revert mlx5: move affinity hints assignments to generic code")
Fixes: a435393acafb ("mlx5: move affinity hints assignments to generic code")
Cc: Sagi Grimberg <s...@grimberg.me>
Cc: Qing Huang <qing.hu...@oracle.com>
Signed-off-by: Saeed Mahameed <sae...@mellanox.com>

I didn't have time to test it out thus far; I assume you tested it
though, so it looks good to me,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>
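For reference, a minimal sketch of the approach the quoted commit message
describes (read the vector's affinity from the irq descriptor, the same data
/proc/irq/<N>/smp_affinity reports), assuming mlx5's existing
mlx5_vector2eqn() helper; this is illustrative only, not the actual patch:
--
#include <linux/irq.h>
#include <linux/mlx5/driver.h>

static const struct cpumask *
example_mlx5_vector_affinity(struct mlx5_core_dev *dev, int vector)
{
	unsigned int irq;
	int eqn;

	/* translate the completion vector to its Linux irq number */
	if (mlx5_vector2eqn(dev, vector, &eqn, &irq))
		return NULL;

	/* affinity stored in the irq descriptor (what smp_affinity shows) */
	return irq_get_affinity_mask(irq);
}
--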


Re: [net 02/14] Revert "mlx5: move affinity hints assignments to generic code"

2017-12-26 Thread Sagi Grimberg



Are you sure it won't get populated at all? Even if you manually set
IRQ affinity via sysfs?


Yes, the msi_desc affinity is not initialized without the affinity
descriptor passed (which is what PCI_IRQ_AFFINITY is for).


Anyway we can implement this driver helper function to return the IRQ
affinity hint stored in the driver:
  "cpumask_first(mdev->priv.irq_info[vector].mask);"


minus the cpumask_first, but yea. Please send a new patch so we can
test it out.
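In other words, something along these lines (an illustrative sketch based on
the expression quoted above, not the patch that was later posted):
--
static inline const struct cpumask *
mlx5_get_vector_affinity(struct mlx5_core_dev *mdev, int vector)
{
	/* the affinity hint mask the driver stored for this vector */
	return mdev->priv.irq_info[vector].mask;
}
--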


Re: [net 02/14] Revert "mlx5: move affinity hints assignments to generic code"

2017-12-25 Thread Sagi Grimberg



Before the offending commit, mlx5 core did the IRQ affinity itself,
and it seems that the new generic code has some drawbacks, one
of them being the lack of user ability to modify irq affinity after
the initial affinity values got assigned.

The issue is still being discussed and a solution in the new generic code
is required, until then we need to revert this patch.

This fixes the following issue:
echo  > /proc/irq//smp_affinity
fails with  -EIO

This reverts commit a435393acafbf0ecff4deb3e3cb554b34f0d0664.
Note: kept mlx5_get_vector_affinity in include/linux/mlx5/driver.h since
it is used in mlx5_ib driver.

This won't work for sure because the msi_desc affinity cpumask won't
ever be populated. You need to re-implement it in mlx5 if you don't want
to break rdma ULPs that rely on it.


Re: [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]

2017-11-13 Thread Sagi Grimberg



Can you explain what you mean by "subsystem"? I thought that the
subsystem would be the irq subsystem (which means you are the one to provide
the needed input :) ) and the driver would pass in something
like msi_irq_ops to pci_alloc_irq_vectors() if it supports the driver
requirements that you listed and NULL to tell the core to leave it alone
and do what it sees fit (or pass msi_irq_ops with flag that means that).

ops structure is a very common way for drivers to communicate with a
subsystem core.


So if you look at the above pseudo code then the subsys_*_move_callbacks
are probably subsystem specific, i.e. block or networking.

Those subsystem callbacks might either handle it at the subsystem level
directly or call into the particular driver.


Personally I do not think that integrating this into the networking/block
stacks at that level is going to work. Drivers don't communicate any
information on what they do with msi(x) vectors (and I'm not sure they
should).

I think that a driver that uses the managed facilities is up to its own
discretion; it interacts with the irq subsystem to allocate the vectors,
so it makes sense to me that it should pass in the ops directly and
handle the callouts.


That's certainly out of the scope what the generic interrupt code can do :)


Don't get me wrong, I'm all for adding useful helpers in net/ and block/
if drivers need some services from the core subsystem or if drivers end
up sharing lots of logic.

For example, drivers already take care of draining queues when device
hot unplug is triggered, so they must be able to get it right for
IRQ vector migration (at least I assume).


Re: [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]

2017-11-13 Thread Sagi Grimberg



Do you know if any exist? Would it make sense to have a survey to
understand if anyone relies on it?

 From what I've seen so far, drivers that were converted simply worked
with the non-managed facility and didn't have any special code for it.
Perhaps Christoph can comment as he converted most of them.

But if there aren't any drivers that absolutely rely on it, maybe it's
not a bad idea to allow it by default?


Sure, I was just cautious and I have to admit that I have no insight into
the driver side details.


Christoph, feel free to chime in :)

Should I construct an email list of the driver maintainers of the
converted drivers?


* When and how is the driver informed about the change?

   When:

 #1 Before the core tries to move the interrupt so it can veto the
  move if it cannot allocate new resources or whatever is required
  to operate after the move.


What would the core do if a driver vetoes a move?


Return the error code from write_affinity() as it does with any other error
which fails to set the affinity.


OK, so this would mean that the driver queue no longer has a vector,
correct? So are the semantics that it needs to clean up its resources, or
should it expect another callout for that?


I'm wondering under what conditions a driver will be unable to allocate
resources for a move to cpu X but able to allocate for a move to cpu Y.


Node affine memory allocation is the only thing which comes to my mind, or
some decision not to have a gazillion of queues on a single CPU.


Yea, makes sense.


This looks like it can work to me, but I'm probably not familiar enough
to see the full picture here.


On the interrupt core side this is workable, I just need the input from the
driver^Wsubsystem side if this can be implemented sanely.


Can you explain what you mean by "subsystem"? I thought that the
subsystem would be the irq subsystem (which means you are the one to 
provide the needed input :) ) and the driver would pass in something

like msi_irq_ops to pci_alloc_irq_vectors() if it supports the driver
requirements that you listed and NULL to tell the core to leave it alone
and do what it sees fit (or pass msi_irq_ops with flag that means that).

ops structure is a very common way for drivers to communicate with a
subsystem core.
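Purely as an illustration of the ops-structure idea: no msi_irq_ops API
exists today, and the names and signatures below are made up for the sake of
this discussion only.
--
struct msi_irq_ops {
	/* called before the core moves a vector; may veto with an errno */
	int  (*prep_move)(unsigned int irq, const struct cpumask *new_mask,
			  void *priv);
	/* called after the move took effect (re-enable/drain queues, etc.) */
	void (*post_move)(unsigned int irq, void *priv);
};
--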


Re: [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]

2017-11-13 Thread Sagi Grimberg

Hi Thomas,


What can be done with some time to work on?

The managed mechanism consists of 3 pieces:

  1) Vector spreading

  2) Managed vector allocation, which becomes a guaranteed reservation in
 4.15 due to the big rework of the vector management code.

 Non-managed interrupts get a best effort reservation to handle the CPU
 unplug vector pressure problem in a sane way.

  3) CPU hotplug management

 If the last CPU in the affinity set goes offline, then the interrupt is
 shut down and restarted when the first CPU in the affinity set comes
 online again. The driver code needs to ensure that the queue associated
 to that interrupt is drained before shutdown and nothing is queued
 there after this point.

So we have options:

1) Initial vector spreading

  Let the driver use the initial vector spreading. That does only the
  initial affinity setup, but otherwise the interrupts are handled like any
  other non managed interrupt, i.e. best effort reservation, affinity
  settings enabled and CPU unplug breaks affinity and moves them to some
  random other online CPU.

  The simplest solution of all.

2) Allowing a driver supplied mask

  Certainly simple to do, but as you said it's not really a solution. I'm
  not sure whether we want to go there as this is going to be replaced fast
  enough and then create another breakage/frustration level.


3) Affinity override in managed mode

  Doable, but there are a couple of things to think about:


I think that it will be good to shoot for (3). Given that there are
driver requirements, I'd say that the driver will expose up front whether it
can handle it, and if not we fall back to (1).


   * How is this enabled?

 - Opt-in by driver

 - Extra sysfs/procfs knob

 We definitely should not enable it per default because that would
 surprise users/drivers which work with the current managed devices and
 rely on the affinity files to be non writeable in managed mode.


Do you know if any exist? Would it make sense to have a survey to
understand if anyone relies on it?

From what I've seen so far, drivers that were converted simply worked
with the non-managed facility and didn't have any special code for it.
Perhaps Christoph can comment as he converted most of them.

But if there aren't any drivers that absolutely rely on it, maybe it's
not a bad idea to allow it by default?



   * When and how is the driver informed about the change?

  When:

#1 Before the core tries to move the interrupt so it can veto the
  move if it cannot allocate new resources or whatever is required
  to operate after the move.


What would the core do if a driver vetoes a move? I'm wondering under what
conditions a driver will be unable to allocate resources for a move to cpu
X but able to allocate for a move to cpu Y.



#2 After the core made the move effective because:

   - The interrupt might be moved from an offline set to an online
 set and needs to be started up, so the related queue must be
 enabled as well.

   - The interrupt might be moved from an online set to an offline
 set, so the queue needs to be drained and disabled.

  - Resources which have been allocated in the first step must be
 made effective and old resources freed.

  How:

The existing affinity notification mechanism does not work for this
and it's a horrible piece of crap which should go away sooner than
later.

So we need some sensible way to provide callbacks. Emphasis on
callbacks, as one multiplexing callback is not a good idea.

   * How can the change made effective?

 When the preliminaries (vector reservation on the new set and
 possibly resource allocation in the subsystem) have been done, then the
 actual move can be made.

 But, there is a caveat. x86 is not good at reassociating interrupts on
 the fly except when it sits behind an interrupt remapping unit, but we
 cannot rely on that.

 So the change flow which works for everything would be:

 if (reserve_vectors() < 0)
return FAIL;

 if (subsys_prep_callback() < 0) {
release_vectors();
return FAIL;
 }

 shutdown(irq);

 if (!online(newset))
return SUCCESS;

 startup(irq);

 subsys_post_callback();
 return SUCCESS;

 subsys_prep_callback() must basically work the same way as the CPU
 offline mechanism and drain the queue and prevent queueing before the
 irq is restarted. If the move results in keeping it shutdown because
 the new set is offline, then the irq will be restarted via the CPU
 hotplug code and the subsystem will be informed about that via the
 hotplug mechanism as well.

 subsys_post_callback() is more or less the same as the hotplug callback
 and restarts the queue. The only difference to the hotplug code as of
 today is 

Re: mlx5 broken affinity

2017-11-09 Thread Sagi Grimberg



The early discussion of the managed facility came to the conclusion that it
will manage this stuff completely to allow fixed association of 'queue /
interrupt / corresponding memory' to a single CPU or a set of CPUs. That
removes a lot of 'affinity' handling magic from the driver and utilizes the
hardware in a sensible way. That was not my decision, really. It surely
made sense to me and I helped Christoph to implement it.

The real question is whether you want to have the fixed 'queue / interrupt/
corresponding memory association' and get rid of all the 'affinity' dance
in your driver or not.

If you answer that question with 'yes' then the consequence is that there
is no knob.

If you answer that question with 'no' then you should not use
the managed facility in the first place and if you need parts of that
functionality then this needs to be added to the core code _before_ a
driver gets converted and not afterwards.


point taken.


It's not my problem if people decide, to use this and then trip over the
limitations after the changes hit the tree. This could have been figured
out before even a single patch was posted.


That's correct, I could have known that, but I didn't, and from your
reply, I understand there is really only a single way forward...


Now you try to blame the people who implemented the managed affinity stuff
for the wreckage, which was created by people who changed drivers to use
it. Nice try.


I'm not trying to blame anyone, really. I was just trying to understand
how to move forward with making users happy and still enjoy subsystem
services instead of doing lots of similar things inside mlx5 driver.


Re: mlx5 broken affinity

2017-11-09 Thread Sagi Grimberg



Again, I think Jes or others can provide more information.


Sagi, I believe Jes is not trying to argue about what initial affinity
values you give to the driver. We have a very critical regression that
is afflicting live systems today and common tools that already exist
in various distros, such as irqbalance, which solely depends on the
smp_affinity sysfs entry which is now not writable due to this
regression. Please see
https://github.com/Irqbalance/irqbalance/blob/master/activate.c

Some users would like to have their network traffic handled on some
cores and free up other cores for other purposes; you just can't take
that away from them.

If reverting the mlx5 patches would solve the issue, I am afraid that is
the solution I am going to go with, until the regression is fixed.


Well, I completely agree with you, Saeed. If this is causing a regression
for mlx5 users and we can't find a way for the managed interface to expose
this knob to userspace, I don't see any other choice either.


Re: mlx5 broken affinity

2017-11-09 Thread Sagi Grimberg

Thomas,


Because the user sometimes knows better based on statically assigned
loads, or the user wants consistency across kernels. It's great that the
system is better at allocating this now, but we also need to allow for a
user to change it. Like anything on Linux, a user wanting to blow off
his/her own foot, should be allowed to do so.


That's fine, but that's not what the managed affinity facility provides. If
you want to leverage the spread mechanism, but avoid the managed part, then
this is a different story and we need to figure out how to provide that
without breaking the managed side of it.

As I said it's possible, but I vehemently disagree, that this is a bug in
the core code, as it was claimed several times in this thread.

The real issue is that the driver was converted to something which was
expected to behave differently. That's hardly a bug in the core code, at
most it's a documentation problem.


I disagree here, this is not a device-specific discussion. The question
of exposing an IRQ affinity assignment knob to user-space holds for every
device (NICs, HBAs and alike). The same issue could have been raised on
any other device using managed affinitization (and we have quite a few
of those now). I can't see any reason why it's OK for device X to use it
while it's not OK for device Y to use it.

If the resolution is "YES, we must expose a knob to user-space" then the
managed facility is unusable in its current form for *any* device. If
the answer is "NO, user-space can't handle all the stuff the kernel can"
then we should document it. This is really device-independent.


Re: mlx5 broken affinity

2017-11-07 Thread Sagi Grimberg



Depending on the machine and the number of queues this might even result in
completely losing the ability to suspend/hibernate because the number of
available vectors on CPU0 is not sufficient to accommodate all queue
interrupts.


Would it be possible to keep the managed facility until a user overrides
an affinity assignment? This way if the user didn't touch it, we keep
all the perks, and in case the user overrides it, we log the implication
so the user is aware?


A lot of things are possible, the question is whether it makes sense. The
whole point is to have resources (queues, interrupts etc.) per CPU and have
them strictly associated.


Not arguing here.


Why would you give the user a knob to destroy what you carefully optimized?


Well, it looks like someone relies on this knob; the question is whether he is
doing something better for his workload. I don't know, it's really up to
the user to say.


Just because we can and just because users love those knobs or is there any
real technical reason?


Again, I think Jes or others can provide more information.


Re: mlx5 broken affinity

2017-11-05 Thread Sagi Grimberg



This wasn't to start a debate about which allocation method is the
perfect solution. I am perfectly happy with the new default, the part
that is broken is to take away the user's option to reassign the
affinity. That is a bug and it needs to be fixed!


Well,

I would really want to wait for Thomas/Christoph to reply, but this
simple change fixed it for me:
--
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 573dc52b0806..eccd06be5e44 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -146,8 +146,7 @@ bool irq_can_set_affinity_usr(unsigned int irq)
  {
 struct irq_desc *desc = irq_to_desc(irq);

-   return __irq_can_set_affinity(desc) &&
-   !irqd_affinity_is_managed(&desc->irq_data);
+   return __irq_can_set_affinity(desc);


Which defeats the whole purpose of the managed facility, which is _not_ to
break the affinities on cpu offline and bring the interrupt back on the CPU
when it comes online again.

What I can do is to have a separate flag, which only uses the initial
distribution mechanism, but I really want to have Christophs opinion on
that.


I do agree that the user would lose better cpu online/offline behavior,
but it seems that users want to still have some control over the IRQ
affinity assignments even if they lose this functionality.

Would it be possible to keep the managed facility until a user overrides
an affinity assignment? This way if the user didn't touch it, we keep
all the perks, and in case the user overrides it, we log the implication
so the user is aware?


Re: mlx5 broken affinity

2017-11-02 Thread Sagi Grimberg



This wasn't to start a debate about which allocation method is the
perfect solution. I am perfectly happy with the new default, the part
that is broken is to take away the user's option to reassign the
affinity. That is a bug and it needs to be fixed!


Well,

I would really want to wait for Thomas/Christoph to reply, but this
simple change fixed it for me:
--
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 573dc52b0806..eccd06be5e44 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -146,8 +146,7 @@ bool irq_can_set_affinity_usr(unsigned int irq)
 {
struct irq_desc *desc = irq_to_desc(irq);

-   return __irq_can_set_affinity(desc) &&
-   !irqd_affinity_is_managed(&desc->irq_data);
+   return __irq_can_set_affinity(desc);
 }

 /**
--


Re: mlx5 broken affinity

2017-11-02 Thread Sagi Grimberg



I vaguely remember Nacking Sagi's patch as we knew it would break
mlx5e netdev affinity assumptions.

I remember that argument. Still the series found its way in.


Of course it made its way in; it was acked by three different
maintainers, and I addressed all of Saeed's comments.


That series moves affinity decisions to the kernel's responsibility.
AFAI see, what the kernel does is assign IRQs to the NUMA nodes one by one
in increasing index order (starting with cores of NUMA node #0), no matter
which NUMA node is closer to the NIC.


Well, as we said before, if there is a good argument to do the home node
first we can change the generic code (as it should be given that this is
absolutely not device specific).

This means that if your NIC is on NUMA #1, and you reduce the number of 
channels, you might end up working only with the cores on the far NUMA. 
Not good!

We deliberated on this before, and concluded that application affinity
and device affinity are equally important. If you have a real use case
that shows otherwise, it's perfectly doable to start from the device home
node.


And I agree here that user should be able to read
/proc/irq/x/smp_affinity and even modify it if required.

Totally agree. We should fix that ASAP.
User must have write access.


I'll let Thomas reply here; I do not fully understand why
pci_alloc_irq_vectors() makes the affinity assignments immutable..


Re: mlx5 broken affinity

2017-11-02 Thread Sagi Grimberg

Jes,


I am all in favor of making the automatic setup better, but assuming an
automatic setup is always right seems problematic. There could be
workloads where you may want to assign affinity explicitly.


Adding Thomas to the thread.

My understanding is that the thought is to prevent user-space from
messing up the nice linear assignment, but maybe this is too
conservative?

The patch that prevented user-space from touching managed irq affinity
assignments:

--
commit 45ddcecbfa947f1dd8e8019bad9e90d6c9f2665c
Author: Thomas Gleixner 
Date:   Mon Jul 4 17:39:25 2016 +0900

genirq: Use affinity hint in irqdesc allocation

Use the affinity hint in the irqdesc allocator. The hint is used to determine
the node for the allocation and to set the affinity of the interrupt.

If multiple interrupts are allocated (multi-MSI) then the allocator iterates
over the cpumask and for each set cpu it allocates on their node and sets the
initial affinity to that cpu.

If a single interrupt is allocated (MSI-X) then the allocator uses the first
cpu in the mask to compute the allocation node and uses the mask for the
initial affinity setting.

Interrupts set up this way are marked with the AFFINITY_MANAGED flag to
prevent userspace from messing with their affinity settings.

Signed-off-by: Thomas Gleixner 
Cc: Christoph Hellwig 
Cc: linux-bl...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-n...@lists.infradead.org
Cc: ax...@fb.com
Cc: agord...@redhat.com
Link: http://lkml.kernel.org/r/1467621574-8277-5-git-send-email-...@lst.de

Signed-off-by: Thomas Gleixner 
--


Re: mlx5 broken affinity

2017-11-01 Thread Sagi Grimberg

Hi,


Hi Jes,


The below patch seems to have broken PCI IRQ affinity assignments for mlx5.


I wouldn't call it breaking IRQ affinity assignments. It just makes
them automatic.


Prior to this patch I could echo a value to /proc/irq//smp_affinity
and it would get assigned. With this patch applied I get -EIO


Adding Christoph,

Ideally the affinity assignments would be better left alone, but
I wasn't aware that they are now immutable. Christoph?


The actual affinity assignments seem to have changed too, but I assume
this is a result of the generic allocation?


The affinity assignments should be giving you perfect linear assignment
of the rx/tx rings completion vectors to CPU cores:
first  -> 0
second -> 1
third  -> 2
...

Before, mlx5 spread affinity starting from the local numa node, as it
relied on that when constructing the RSS indirection table only over the
local numa node (which, as a side effect, hurt applications running on the
far node, as traffic was guaranteed to arrive at rx rings located on the
"wrong" node).

Now the RSS indirection table is linear across both numa nodes, just like
the irq affinity settings. Another thing that was improved is the
pre/post vectors, which blacklist any non rx/tx completion vectors from
the affinity assignments (like port event completion vectors, for
example).
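As an aside, a hedged debugging sketch (not from any posted patch) that dumps
the affinity pci_alloc_irq_vectors_affinity() handed to each completion
vector, to verify the linear spreading described above; mdev and
MLX5_EQ_VEC_COMP_BASE are the mlx5 names used elsewhere in this thread:
--
#include <linux/pci.h>
#include <linux/mlx5/driver.h>

static void example_dump_comp_vector_affinity(struct mlx5_core_dev *mdev,
					      int num_comp_vectors)
{
	int i;

	for (i = 0; i < num_comp_vectors; i++) {
		const struct cpumask *mask =
			pci_irq_get_affinity(mdev->pdev,
					     MLX5_EQ_VEC_COMP_BASE + i);

		/* expect: vector 0 -> cpu 0, vector 1 -> cpu 1, ... */
		if (mask)
			pr_info("comp vector %d -> CPUs %*pbl\n",
				i, cpumask_pr_args(mask));
	}
}
--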


With this applied I get:
[root@testbox /proc/irq]# cat 50/smp_affinity
00,0010
[root@textbox /proc/irq]# cat 100/smp_affinity
10,

Without I get:
[root@testbox /proc/irq]# cat 50/smp_affinity
00,0200
[root@testbox /proc/irq]# cat 100/smp_affinity
000100,

I am not wildly familiar with the affinity assignment code. Is there
something obvious I am missing here?


The affinity assignment is indeed changed; it should be better, though.
If you do not see a linear (diagonal) affinity assignment then
it's a bug that needs to be fixed (probably in the msi.c code).


Re: [PATCH v2 08/15] nvmet: make config_item_type const

2017-10-17 Thread Sagi Grimberg

Acked-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH] net/smc: mark as BROKEN due to remote memory exposure

2017-05-15 Thread Sagi Grimberg

Hi Dave,


this patch has not been superseded by anything, can you explain why
it has been marked as such in patchwork?


I think you're being overbearing by requiring this to be marked BROKEN
and I would like you to explore other ways with the authors to fix
whatever perceived problems you think SMC has.


Well, it's not one's opinion, it's a real problem. To be fair, this
security breach existed in other RDMA based storage protocols for
years in the past, but we cleaned it up completely.

Our assumption is that *if* the user willingly chooses to expose their
entire physical address space to remote access to get some performance
boost, that's fine, but to have the kernel expose it by default, without
even letting the user control it, is plain irresponsible. IMHO we
should *not* repeat the mistakes of the past and should set a higher bar for
RDMA protocol implementations.


You claim that this is somehow "urgent" is false.  You can ask
distributions to disable SMC or whatever in the short term if it
reallly, truly, bothers you.


I doubt that is sufficient given that not all implementations
out there rely on distros. I'm afraid one time is too much in
this case.


Re: net/smc and the RDMA core

2017-05-04 Thread Sagi Grimberg



if you can point out specific issues, we will be happy to work with you
to get them addressed!


Hello Ursula,

My list of issues that I would like to see addressed can be found below. Doug,
Christoph and others may have additional inputs. The issues that have not yet
been mentioned in other e-mails are:
- The SMC driver only supports one RDMA transport type (RoCE v1) but
  none of the other RDMA transport types (RoCE v2, IB and iWARP). New
  RDMA drivers should support all RDMA transport types transparently.
  The traditional approach to support multiple RDMA transport types is
  by using the RDMA/CM to establish connections.
- The implementation of the SMC driver only supports RoCEv1. This is
  a very unfortunate choice because:
  - RoCEv1 is not routable and hence is limited to a single Ethernet
broadcast domain.
  - RoCEv1 packets escape a whole bunch of mechanisms that only work
for IP packets. Firewalls pass all RoCEv1 packets and switches
do not restrict RoCEv1 packets to a single VLAN. This means that
if the network configuration is changed after an SMC connection
has been set up such that IP communication between the endpoints
of an SMC connection is blocked that the SMC RoCEv1 packets will
not be blocked by the network equipment of which the configuration
has just been changed.
- As already mentioned by Christoph, the SMC implementation uses RDMA
  calls that probably will be deprecated soon (ib_create_cq()) and
  should use the RDMA R/W API instead of building sge lists itself.


I would also suggest that you stop exposing the DMA MR for remote
access (at least by default) and use proper reg_mr operations with a
limited lifetime on a properly sized buffer.

Also, the cq polling code looks completely wrong; you should really
use the RDMA CQ API.
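For reference, the "RDMA CQ API" referred to here is the
ib_alloc_cq()/struct ib_cqe completion-handler style; a minimal hedged
sketch of what that looks like (illustrative only, not SMC code):
--
#include <rdma/ib_verbs.h>

struct example_tx_ctx {
	struct ib_cqe cqe;		/* embedded completion entry */
	/* ... per-request state ... */
};

static void example_send_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct example_tx_ctx *ctx =
		container_of(wc->wr_cqe, struct example_tx_ctx, cqe);

	if (wc->status != IB_WC_SUCCESS)
		pr_err("send failed: %s\n", ib_wc_status_msg(wc->status));
	/* complete the request tracked by ctx */
}

/* CQ allocated once with:
 *	cq = ib_alloc_cq(ibdev, NULL, nr_cqe, comp_vector, IB_POLL_SOFTIRQ);
 * so the core owns the polling, instead of open-coded ib_poll_cq() loops.
 */
static int example_post_send(struct ib_qp *qp, struct example_tx_ctx *ctx)
{
	struct ib_send_wr wr = {}, *bad_wr;

	/* point the work request at its completion handler */
	ctx->cqe.done = example_send_done;
	wr.wr_cqe = &ctx->cqe;
	wr.opcode = IB_WR_SEND;
	/* sg list setup omitted */

	return ib_post_send(qp, &wr, &bad_wr);
}
--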



Re: [PATCH block-tree] net: off by one in inet6_pton()

2017-04-20 Thread Sagi Grimberg

Thanks Dan,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH rfc 5/6] block: Add rdma affinity based queue mapping helper

2017-04-06 Thread Sagi Grimberg

shouldn't you include   and  like in
commit 8ec2ef2b66ea2f that fixes blk-mq-pci.c ?


Not really. We can lose these from blk-mq-pci.c as well.


+#include 
+#include 
+#include 
+#include 
+#include "blk-mq.h"


Is this include needed ?


You're right, I can just keep:

+#include 
+#include 
+#include 


Re: [PATCH rfc 0/6] Automatic affinity settings for nvme over rdma

2017-04-06 Thread Sagi Grimberg

Hi Sagi,


Hey Max,


the patchset looks good and of course we can add support for more
drivers in the future.
Have you run some performance testing with the nvmf initiator?


I'm limited by the target machine in terms of IOPs, but the host shows
~10% cpu usage decrease, and latency improves slightly as well,
which is more apparent depending on which cpu I'm running my IO
thread on (due to the mismatch in comp_vectors and queue mappings,
some queues have irq vectors mapped to a core on a different numa
node).


Re: [PATCH rfc 6/6] nvme-rdma: use intelligent affinity based queue mappings

2017-04-06 Thread Sagi Grimberg



Use the geneic block layer affinity mapping helper. Also,


  generic


nr_io_queues = min(opts->nr_io_queues, num_online_cpus());
+   nr_io_queues = min_t(unsigned int, nr_io_queues,
+   ibdev->num_comp_vectors);
+


Add a comment here?


Will do


Re: [PATCH rfc 2/6] mlx5: move affinity hints assignments to generic code

2017-04-06 Thread Sagi Grimberg

 static int mlx5e_get_cpu(struct mlx5e_priv *priv, int ix)
 {
-   return cpumask_first(priv->mdev->priv.irq_info[ix].mask);
+   return cpumask_first(pci_irq_get_affinity(priv->mdev->pdev,
+   MLX5_EQ_VEC_COMP_BASE + ix));


This looks ok for now, but if we look at the callers we'd probably
want to make direct use of pci_irq_get_node and pci_irq_get_affinity for
the uses directly in mlx5e_open_channel as well as the stored away
->cpu field.  But maybe that should be left for another patch after
this one.


It's small enough to fold it in.


+   struct irq_affinity irqdesc = { .pre_vectors = MLX5_EQ_VEC_COMP_BASE, };


I usually move assignments inside structures onto a separate line to make
it more readable, e.g.

struct irq_affinity irqdesc = {
.pre_vectors = MLX5_EQ_VEC_COMP_BASE,
};


Will do.


[PATCH rfc 3/6] RDMA/core: expose affinity mappings per completion vector

2017-04-02 Thread Sagi Grimberg
This will allow ULPs to intelligently locate threads based
on completion vector cpu affinity mappings. In case the
driver does not expose a get_vector_affinity callout, return
NULL so the caller can maintain a fallback logic.

Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
 include/rdma/ib_verbs.h | 24 
 1 file changed, 24 insertions(+)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 0f1813c13687..d44b62791c64 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2150,6 +2150,8 @@ struct ib_device {
 */
	int (*get_port_immutable)(struct ib_device *, u8, struct ib_port_immutable *);
void (*get_dev_fw_str)(struct ib_device *, char *str, size_t str_len);
+   const struct cpumask *(*get_vector_affinity)(struct ib_device *ibdev,
+int comp_vector);
 };
 
 struct ib_client {
@@ -3377,4 +3379,26 @@ void ib_drain_qp(struct ib_qp *qp);
 
 int ib_resolve_eth_dmac(struct ib_device *device,
struct ib_ah_attr *ah_attr);
+
+/**
+ * ib_get_vector_affinity - Get the affinity mappings of a given completion
+ *   vector
+ * @device: the rdma device
+ * @comp_vector:index of completion vector
+ *
+ * Returns NULL on failure, otherwise a corresponding cpu map of the
+ * completion vector (returns all-cpus map if the device driver doesn't
+ * implement get_vector_affinity).
+ */
+static inline const struct cpumask *
+ib_get_vector_affinity(struct ib_device *device, int comp_vector)
+{
+   if (comp_vector > device->num_comp_vectors ||
+   !device->get_vector_affinity)
+   return NULL;
+
+   return device->get_vector_affinity(device, comp_vector);
+
+}
+
 #endif /* IB_VERBS_H */
-- 
2.7.4



[PATCH rfc 6/6] nvme-rdma: use intelligent affinity based queue mappings

2017-04-02 Thread Sagi Grimberg
Use the geneic block layer affinity mapping helper. Also,
limit nr_hw_queues to the rdma device number of irq vectors
as we don't really need more.

Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
 drivers/nvme/host/rdma.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 4aae363943e3..81ee5b1207c8 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -645,10 +646,14 @@ static int nvme_rdma_connect_io_queues(struct nvme_rdma_ctrl *ctrl)
 static int nvme_rdma_init_io_queues(struct nvme_rdma_ctrl *ctrl)
 {
struct nvmf_ctrl_options *opts = ctrl->ctrl.opts;
+   struct ib_device *ibdev = ctrl->device->dev;
unsigned int nr_io_queues;
int i, ret;
 
nr_io_queues = min(opts->nr_io_queues, num_online_cpus());
+   nr_io_queues = min_t(unsigned int, nr_io_queues,
+   ibdev->num_comp_vectors);
+
	ret = nvme_set_queue_count(&ctrl->ctrl, &nr_io_queues);
if (ret)
return ret;
@@ -1523,6 +1528,13 @@ static void nvme_rdma_complete_rq(struct request *rq)
nvme_complete_rq(rq);
 }
 
+static int nvme_rdma_map_queues(struct blk_mq_tag_set *set)
+{
+   struct nvme_rdma_ctrl *ctrl = set->driver_data;
+
+   return blk_mq_rdma_map_queues(set, ctrl->device->dev, 0);
+}
+
 static const struct blk_mq_ops nvme_rdma_mq_ops = {
.queue_rq   = nvme_rdma_queue_rq,
.complete   = nvme_rdma_complete_rq,
@@ -1532,6 +1544,7 @@ static const struct blk_mq_ops nvme_rdma_mq_ops = {
.init_hctx  = nvme_rdma_init_hctx,
.poll   = nvme_rdma_poll,
.timeout= nvme_rdma_timeout,
+   .map_queues = nvme_rdma_map_queues,
 };
 
 static const struct blk_mq_ops nvme_rdma_admin_mq_ops = {
-- 
2.7.4



[PATCH rfc 5/6] block: Add rdma affinity based queue mapping helper

2017-04-02 Thread Sagi Grimberg
Like pci and virtio, we add an rdma helper for affinity
spreading. This achieves optimal mq affinity assignments
according to the underlying rdma device affinity maps.

Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
 block/Kconfig   |  5 
 block/Makefile  |  1 +
 block/blk-mq-rdma.c | 56 +
 include/linux/blk-mq-rdma.h | 10 
 4 files changed, 72 insertions(+)
 create mode 100644 block/blk-mq-rdma.c
 create mode 100644 include/linux/blk-mq-rdma.h

diff --git a/block/Kconfig b/block/Kconfig
index 89cd28f8d051..3ab42bbb06d5 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -206,4 +206,9 @@ config BLK_MQ_VIRTIO
depends on BLOCK && VIRTIO
default y
 
+config BLK_MQ_RDMA
+   bool
+   depends on BLOCK && INFINIBAND
+   default y
+
 source block/Kconfig.iosched
diff --git a/block/Makefile b/block/Makefile
index 081bb680789b..4498603dbc83 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_BLK_CMDLINE_PARSER)  += cmdline-parser.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o t10-pi.o
 obj-$(CONFIG_BLK_MQ_PCI)   += blk-mq-pci.o
 obj-$(CONFIG_BLK_MQ_VIRTIO)+= blk-mq-virtio.o
+obj-$(CONFIG_BLK_MQ_RDMA)  += blk-mq-rdma.o
 obj-$(CONFIG_BLK_DEV_ZONED)+= blk-zoned.o
 obj-$(CONFIG_BLK_WBT)  += blk-wbt.o
 obj-$(CONFIG_BLK_DEBUG_FS) += blk-mq-debugfs.o
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
new file mode 100644
index ..d402f7c93528
--- /dev/null
+++ b/block/blk-mq-rdma.c
@@ -0,0 +1,56 @@
+/*
+ * Copyright (c) 2017 Sagi Grimberg.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#include 
+#include 
+#include 
+#include 
+#include "blk-mq.h"
+
+/**
+ * blk_mq_rdma_map_queues - provide a default queue mapping for rdma device
+ * @set:   tagset to provide the mapping for
+ * @dev:   rdma device associated with @set.
+ * @first_vec: first interrupt vectors to use for queues (usually 0)
+ *
+ * This function assumes the rdma device @dev has at least as many available
+ * interrupt vectors as @set has queues.  It will then query its affinity mask
+ * and build a queue mapping that maps a queue to the CPUs that have irq affinity
+ * for the corresponding vector.
+ *
+ * In case either the driver passed a @dev with less vectors than
+ * @set->nr_hw_queues, or @dev does not provide an affinity mask for a
+ * vector, we fallback to the naive mapping.
+ */
+int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
+   struct ib_device *dev, int first_vec)
+{
+   const struct cpumask *mask;
+   unsigned int queue, cpu;
+
+   if (set->nr_hw_queues > dev->num_comp_vectors)
+   goto fallback;
+
+   for (queue = 0; queue < set->nr_hw_queues; queue++) {
+   mask = ib_get_vector_affinity(dev, first_vec + queue);
+   if (!mask)
+   goto fallback;
+
+   for_each_cpu(cpu, mask)
+   set->mq_map[cpu] = queue;
+   }
+
+   return 0;
+fallback:
+   return blk_mq_map_queues(set);
+}
+EXPORT_SYMBOL_GPL(blk_mq_rdma_map_queues);
diff --git a/include/linux/blk-mq-rdma.h b/include/linux/blk-mq-rdma.h
new file mode 100644
index ..b4ade198007d
--- /dev/null
+++ b/include/linux/blk-mq-rdma.h
@@ -0,0 +1,10 @@
+#ifndef _LINUX_BLK_MQ_RDMA_H
+#define _LINUX_BLK_MQ_RDMA_H
+
+struct blk_mq_tag_set;
+struct ib_device;
+
+int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
+   struct ib_device *dev, int first_vec);
+
+#endif /* _LINUX_BLK_MQ_RDMA_H */
-- 
2.7.4



[PATCH rfc 2/6] mlx5: move affinity hints assignments to generic code

2017-04-02 Thread Sagi Grimberg
The generic api takes care of spreading affinity similarly to
what mlx5 open-coded (and even handles asymmetric
configurations better). Ask the generic API to spread affinity
for us, and feed it pre_vectors that do not participate
in affinity settings (which is an improvement over what we
had before).

The affinity assignments should match what mlx5 tried to
do earlier, but now we do not set affinity on the dedicated
async, cmd and pages vectors.

Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  3 +-
 drivers/net/ethernet/mellanox/mlx5/core/main.c| 81 ++-
 include/linux/mlx5/driver.h   |  1 -
 3 files changed, 6 insertions(+), 79 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index eec0d172761e..2bab0e1ceb94 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1375,7 +1375,8 @@ static void mlx5e_close_cq(struct mlx5e_cq *cq)
 
 static int mlx5e_get_cpu(struct mlx5e_priv *priv, int ix)
 {
-   return cpumask_first(priv->mdev->priv.irq_info[ix].mask);
+   return cpumask_first(pci_irq_get_affinity(priv->mdev->pdev,
+   MLX5_EQ_VEC_COMP_BASE + ix));
 }
 
 static int mlx5e_open_tx_cqs(struct mlx5e_channel *c,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 7c8672cbb369..8624a7451064 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -312,6 +312,7 @@ static int mlx5_alloc_irq_vectors(struct mlx5_core_dev *dev)
 {
	struct mlx5_priv *priv = &dev->priv;
	struct mlx5_eq_table *table = &priv->eq_table;
+   struct irq_affinity irqdesc = { .pre_vectors = MLX5_EQ_VEC_COMP_BASE, };
int num_eqs = 1 << MLX5_CAP_GEN(dev, log_max_eq);
int nvec;
 
@@ -325,9 +326,10 @@ static int mlx5_alloc_irq_vectors(struct mlx5_core_dev *dev)
if (!priv->irq_info)
goto err_free_msix;
 
-   nvec = pci_alloc_irq_vectors(dev->pdev,
+   nvec = pci_alloc_irq_vectors_affinity(dev->pdev,
MLX5_EQ_VEC_COMP_BASE + 1, nvec,
-   PCI_IRQ_MSIX);
+   PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
+   &irqdesc);
if (nvec < 0)
return nvec;
 
@@ -600,71 +602,6 @@ u64 mlx5_read_internal_timer(struct mlx5_core_dev *dev)
return (u64)timer_l | (u64)timer_h1 << 32;
 }
 
-static int mlx5_irq_set_affinity_hint(struct mlx5_core_dev *mdev, int i)
-{
-   struct mlx5_priv *priv  = &mdev->priv;
-   int irq = pci_irq_vector(mdev->pdev, MLX5_EQ_VEC_COMP_BASE + i);
-   int err;
-
-   if (!zalloc_cpumask_var(&priv->irq_info[i].mask, GFP_KERNEL)) {
-   mlx5_core_warn(mdev, "zalloc_cpumask_var failed");
-   return -ENOMEM;
-   }
-
-   cpumask_set_cpu(cpumask_local_spread(i, priv->numa_node),
-   priv->irq_info[i].mask);
-
-   err = irq_set_affinity_hint(irq, priv->irq_info[i].mask);
-   if (err) {
-   mlx5_core_warn(mdev, "irq_set_affinity_hint failed,irq 0x%.4x",
-  irq);
-   goto err_clear_mask;
-   }
-
-   return 0;
-
-err_clear_mask:
-   free_cpumask_var(priv->irq_info[i].mask);
-   return err;
-}
-
-static void mlx5_irq_clear_affinity_hint(struct mlx5_core_dev *mdev, int i)
-{
-   struct mlx5_priv *priv  = &mdev->priv;
-   int irq = pci_irq_vector(mdev->pdev, MLX5_EQ_VEC_COMP_BASE + i);
-
-   irq_set_affinity_hint(irq, NULL);
-   free_cpumask_var(priv->irq_info[i].mask);
-}
-
-static int mlx5_irq_set_affinity_hints(struct mlx5_core_dev *mdev)
-{
-   int err;
-   int i;
-
-   for (i = 0; i < mdev->priv.eq_table.num_comp_vectors; i++) {
-   err = mlx5_irq_set_affinity_hint(mdev, i);
-   if (err)
-   goto err_out;
-   }
-
-   return 0;
-
-err_out:
-   for (i--; i >= 0; i--)
-   mlx5_irq_clear_affinity_hint(mdev, i);
-
-   return err;
-}
-
-static void mlx5_irq_clear_affinity_hints(struct mlx5_core_dev *mdev)
-{
-   int i;
-
-   for (i = 0; i < mdev->priv.eq_table.num_comp_vectors; i++)
-   mlx5_irq_clear_affinity_hint(mdev, i);
-}
-
 int mlx5_vector2eqn(struct mlx5_core_dev *dev, int vector, int *eqn,
unsigned int *irqn)
 {
@@ -1116,12 +1053,6 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
goto err_stop_eqs;
}
 
-   err = mlx5_irq_set_affinity_hints(dev);
-   if (err) {
-   dev_err(&pdev->dev, "Failed to alloc affinity hint cpumask\n");
-   goto err_affinity_hints;

[PATCH rfc 4/6] mlx5: support ->get_vector_affinity

2017-04-02 Thread Sagi Grimberg
Simply refer to the generic affinity mask helper.

Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
 drivers/infiniband/hw/mlx5/main.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 4dc0a8785fe0..b12bc2294895 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -3319,6 +3319,15 @@ static int mlx5_ib_get_hw_stats(struct ib_device *ibdev,
return port->q_cnts.num_counters;
 }
 
+const struct cpumask *mlx5_ib_get_vector_affinity(struct ib_device *ibdev,
+   int comp_vector)
+{
+   struct mlx5_ib_dev *dev = to_mdev(ibdev);
+
+   return pci_irq_get_affinity(dev->mdev->pdev,
+   MLX5_EQ_VEC_COMP_BASE + comp_vector);
+}
+
 static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
 {
struct mlx5_ib_dev *dev;
@@ -3449,6 +3458,7 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
dev->ib_dev.check_mr_status = mlx5_ib_check_mr_status;
dev->ib_dev.get_port_immutable  = mlx5_port_immutable;
dev->ib_dev.get_dev_fw_str  = get_dev_fw_str;
+   dev->ib_dev.get_vector_affinity = mlx5_ib_get_vector_affinity;
if (mlx5_core_is_pf(mdev)) {
dev->ib_dev.get_vf_config   = mlx5_ib_get_vf_config;
dev->ib_dev.set_vf_link_state   = mlx5_ib_set_vf_link_state;
-- 
2.7.4



[PATCH rfc 1/6] mlx5: convert to generic pci_alloc_irq_vectors

2017-04-02 Thread Sagi Grimberg
Now that we have generic code to allocate an array
of irq vectors and even correctly spread their affinity,
correctly handle cpu hotplug events and more, we're much
better off using it.

Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c   |  9 ++
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/health.c   |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/main.c | 33 --
 .../net/ethernet/mellanox/mlx5/core/mlx5_core.h|  1 -
 include/linux/mlx5/driver.h|  1 -
 7 files changed, 17 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8ef64c4db2c2..eec0d172761e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -389,7 +389,7 @@ static void mlx5e_enable_async_events(struct mlx5e_priv *priv)
 static void mlx5e_disable_async_events(struct mlx5e_priv *priv)
 {
	clear_bit(MLX5E_STATE_ASYNC_EVENTS_ENABLED, &priv->state);
-   synchronize_irq(mlx5_get_msix_vec(priv->mdev, MLX5_EQ_VEC_ASYNC));
+   synchronize_irq(pci_irq_vector(priv->mdev->pdev, MLX5_EQ_VEC_ASYNC));
 }
 
 static inline int mlx5e_get_wqe_mtt_sz(void)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index ea5d8d37a75c..e2c33c493b89 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -575,7 +575,7 @@ int mlx5_create_map_eq(struct mlx5_core_dev *dev, struct mlx5_eq *eq, u8 vecidx,
 name, pci_name(dev->pdev));
 
eq->eqn = MLX5_GET(create_eq_out, out, eq_number);
-   eq->irqn = priv->msix_arr[vecidx].vector;
+   eq->irqn = pci_irq_vector(dev->pdev, vecidx);
eq->dev = dev;
eq->doorbell = priv->uar->map + MLX5_EQ_DOORBEL_OFFSET;
err = request_irq(eq->irqn, handler, 0,
@@ -610,7 +610,7 @@ int mlx5_create_map_eq(struct mlx5_core_dev *dev, struct 
mlx5_eq *eq, u8 vecidx,
return 0;
 
 err_irq:
-   free_irq(priv->msix_arr[vecidx].vector, eq);
+   free_irq(eq->irqn, eq);
 
 err_eq:
mlx5_cmd_destroy_eq(dev, eq->eqn);
@@ -651,11 +651,6 @@ int mlx5_destroy_unmap_eq(struct mlx5_core_dev *dev, 
struct mlx5_eq *eq)
 }
 EXPORT_SYMBOL_GPL(mlx5_destroy_unmap_eq);
 
-u32 mlx5_get_msix_vec(struct mlx5_core_dev *dev, int vecidx)
-{
-   return dev->priv.msix_arr[MLX5_EQ_VEC_ASYNC].vector;
-}
-
 int mlx5_eq_init(struct mlx5_core_dev *dev)
 {
int err;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index fcd5bc7e31db..6bf5d70b4117 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1596,7 +1596,7 @@ static void esw_disable_vport(struct mlx5_eswitch *esw, 
int vport_num)
/* Mark this vport as disabled to discard new events */
vport->enabled = false;
 
-   synchronize_irq(mlx5_get_msix_vec(esw->dev, MLX5_EQ_VEC_ASYNC));
+   synchronize_irq(pci_irq_vector(esw->dev->pdev, MLX5_EQ_VEC_ASYNC));
/* Wait for current already scheduled events to complete */
flush_workqueue(esw->work_queue);
/* Disable events from this vport */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/health.c 
b/drivers/net/ethernet/mellanox/mlx5/core/health.c
index d0515391d33b..8b38d5cfd4c5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/health.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/health.c
@@ -80,7 +80,7 @@ static void trigger_cmd_completions(struct mlx5_core_dev *dev)
u64 vector;
 
/* wait for pending handlers to complete */
-   synchronize_irq(dev->priv.msix_arr[MLX5_EQ_VEC_CMD].vector);
+   synchronize_irq(pci_irq_vector(dev->pdev, MLX5_EQ_VEC_CMD));
spin_lock_irqsave(&dev->cmd.alloc_lock, flags);
vector = ~dev->cmd.bitmask & ((1ul << (1 << dev->cmd.log_sz)) - 1);
if (!vector)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index e2bd600d19de..7c8672cbb369 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -308,13 +308,12 @@ static void release_bar(struct pci_dev *pdev)
pci_release_regions(pdev);
 }
 
-static int mlx5_enable_msix(struct mlx5_core_dev *dev)
+static int mlx5_alloc_irq_vectors(struct mlx5_core_dev *dev)
 {
struct mlx5_priv *priv = &dev->priv;
struct mlx5_eq_table *table = &priv->eq_table;
int num_eqs = 1 << MLX5_CAP_GEN(dev, log_max_eq);
int nvec;
-   int i;
 
nvec = 

[PATCH rfc 0/6] Automatic affinity settings for nvme over rdma

2017-04-02 Thread Sagi Grimberg
This patch set is aiming to automatically find the optimal
queue <-> irq multi-queue assignments in storage ULPs (demonstrated
on nvme-rdma) based on the underlying rdma device irq affinity
settings.

First two patches modify mlx5 core driver to use generic API
to allocate array of irq vectors with automatic affinity
settings instead of open-coding exactly what it does (and
slightly worse).

Then, in order to obtain an affinity map for a given completion
vector, we expose a new RDMA core API, and implement it in mlx5.

The third part is addition of a rdma-based queue mapping helper
to blk-mq that maps the tagset hctx's according to the device
affinity mappings.
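
The helper itself is small; a simplified sketch of the mapping logic
(per hardware queue, ask the device for the matching completion vector's
affinity mask and point those CPUs at that queue, falling back to the
naive mapping when the driver does not expose affinity):

#include <linux/blk-mq.h>
#include <rdma/ib_verbs.h>

int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
		struct ib_device *dev, int first_vec)
{
	const struct cpumask *mask;
	unsigned int queue, cpu;

	for (queue = 0; queue < set->nr_hw_queues; queue++) {
		mask = ib_get_vector_affinity(dev, first_vec + queue);
		if (!mask)
			goto fallback;

		for_each_cpu(cpu, mask)
			set->mq_map[cpu] = queue;
	}

	return 0;

fallback:
	return blk_mq_map_queues(set);
}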

I'd happily convert some more drivers, but I'll need volunteers
to test as I don't have access to any other devices.

I cc'd @netdev (and Saeed + Or) as this is the place that most of
mlx5 core action takes place, so Saeed, would love to hear your
feedback.

Any feedback is welcome.

Sagi Grimberg (6):
  mlx5: convert to generic pci_alloc_irq_vectors
  mlx5: move affinity hints assignments to generic code
  RDMA/core: expose affinity mappings per completion vector
  mlx5: support ->get_vector_affinity
  block: Add rdma affinity based queue mapping helper
  nvme-rdma: use intelligent affinity based queue mappings

 block/Kconfig  |   5 +
 block/Makefile |   1 +
 block/blk-mq-rdma.c|  56 +++
 drivers/infiniband/hw/mlx5/main.c  |  10 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   5 +-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c   |   9 +-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/health.c   |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/main.c | 106 +++--
 .../net/ethernet/mellanox/mlx5/core/mlx5_core.h|   1 -
 drivers/nvme/host/rdma.c   |  13 +++
 include/linux/blk-mq-rdma.h|  10 ++
 include/linux/mlx5/driver.h|   2 -
 include/rdma/ib_verbs.h|  24 +
 14 files changed, 138 insertions(+), 108 deletions(-)
 create mode 100644 block/blk-mq-rdma.c
 create mode 100644 include/linux/blk-mq-rdma.h

-- 
2.7.4



[PATCH v3] net/utils: generic inet_pton_with_scope helper

2017-03-29 Thread Sagi Grimberg
Several locations in the stack need to handle ipv4/ipv6
(with scope) and port strings conversion to sockaddr.
Add a helper that takes either AF_INET, AF_INET6 or
AF_UNSPEC (for wildcard) to centralize this handling.

Suggested-by: Christoph Hellwig <h...@lst.de>
Reviewed-by: Christoph Hellwig <h...@lst.de>
Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
This patch is posted standalone for review, it pairs with
a couple of call-sites which can be found at:

https://www.mail-archive.com/netdev@vger.kernel.org/msg157094.html

 include/linux/inet.h |   6 +++
 net/core/utils.c | 103 +++
 2 files changed, 109 insertions(+)

diff --git a/include/linux/inet.h b/include/linux/inet.h
index 4cca05c9678e..636ebe87e6f8 100644
--- a/include/linux/inet.h
+++ b/include/linux/inet.h
@@ -43,6 +43,8 @@
 #define _LINUX_INET_H
 
 #include 
+#include 
+#include 
 
 /*
  * These mimic similar macros defined in user-space for inet_ntop(3).
@@ -54,4 +56,8 @@
 extern __be32 in_aton(const char *str);
 extern int in4_pton(const char *src, int srclen, u8 *dst, int delim, const 
char **end);
 extern int in6_pton(const char *src, int srclen, u8 *dst, int delim, const 
char **end);
+
+extern int inet_pton_with_scope(struct net *net, unsigned short af,
+   const char *src, const char *port, struct sockaddr_storage 
*addr);
+
 #endif /* _LINUX_INET_H */
diff --git a/net/core/utils.c b/net/core/utils.c
index 6592d7bbed39..f96cf527bb8f 100644
--- a/net/core/utils.c
+++ b/net/core/utils.c
@@ -26,9 +26,11 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -300,6 +302,107 @@ int in6_pton(const char *src, int srclen,
 }
 EXPORT_SYMBOL(in6_pton);
 
+static int inet4_pton(const char *src, u16 port_num,
+   struct sockaddr_storage *addr)
+{
+   struct sockaddr_in *addr4 = (struct sockaddr_in *)addr;
+   int srclen = strlen(src);
+
+   if (srclen > INET_ADDRSTRLEN)
+   return -EINVAL;
+
+   if (in4_pton(src, srclen, (u8 *)&addr4->sin_addr.s_addr,
+		'\n', NULL) == 0)
+   return -EINVAL;
+
+   addr4->sin_family = AF_INET;
+   addr4->sin_port = htons(port_num);
+
+   return 0;
+}
+
+static int inet6_pton(struct net *net, const char *src, u16 port_num,
+   struct sockaddr_storage *addr)
+{
+   struct sockaddr_in6 *addr6 = (struct sockaddr_in6 *)addr;
+   const char *scope_delim;
+   int srclen = strlen(src);
+
+   if (srclen > INET6_ADDRSTRLEN)
+   return -EINVAL;
+
+   if (in6_pton(src, srclen, (u8 *)&addr6->sin6_addr.s6_addr,
+		'%', &scope_delim) == 0)
+   return -EINVAL;
+
+   if (ipv6_addr_type(&addr6->sin6_addr) & IPV6_ADDR_LINKLOCAL &&
+   src + srclen != scope_delim && *scope_delim == '%') {
+   struct net_device *dev;
+   char scope_id[16];
+   size_t scope_len = min_t(size_t, sizeof(scope_id),
+src + srclen - scope_delim - 1);
+
+   memcpy(scope_id, scope_delim + 1, scope_len);
+   scope_id[scope_len] = '\0';
+
+   dev = dev_get_by_name(net, scope_id);
+   if (dev) {
+   addr6->sin6_scope_id = dev->ifindex;
+   dev_put(dev);
+   } else if (kstrtouint(scope_id, 0, &addr6->sin6_scope_id)) {
+   return -EINVAL;
+   }
+   }
+
+   addr6->sin6_family = AF_INET6;
+   addr6->sin6_port = htons(port_num);
+
+   return 0;
+}
+
+/**
+ * inet_pton_with_scope - convert an IPv4/IPv6 and port to socket address
+ * @net: net namespace (used for scope handling)
+ * @af: address family, AF_INET, AF_INET6 or AF_UNSPEC for either
+ * @src: the start of the address string
+ * @port: the start of the port string (or NULL for none)
+ * @addr: output socket address
+ *
+ * Return zero on success, return errno when any error occurs.
+ */
+int inet_pton_with_scope(struct net *net, __kernel_sa_family_t af,
+   const char *src, const char *port, struct sockaddr_storage 
*addr)
+{
+   u16 port_num;
+   int ret = -EINVAL;
+
+   if (port) {
+   if (kstrtou16(port, 0, &port_num))
+   return -EINVAL;
+   } else {
+   port_num = 0;
+   }
+
+   switch (af) {
+   case AF_INET:
+   ret = inet4_pton(src, port_num, addr);
+   break;
+   case AF_INET6:
+   ret = inet6_pton(net, src, port_num, addr);
+   break;
+   case AF_UNSPEC:
+   ret = inet4_pton(src, port_num, addr);
+   if (ret)
+   ret = inet6_pton(net, src, port_num, addr);
+   break;
+   default:
+   pr_err("unexpected address family %d\n", af);
+   };
+
+   return ret;
+}
+EXPORT_SYMBOL(inet_pton_with_scope);
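
For reference, a minimal call-site sketch of the helper above (the address
and port values are hypothetical; the real users are added in the companion
nvme/iscsi patches):

	struct sockaddr_storage ss = { };
	int ret;

	/* AF_UNSPEC: try IPv4 first, then IPv6 (with an optional %scope) */
	ret = inet_pton_with_scope(&init_net, AF_UNSPEC,
				   "fe80::1%eth0", "4420", &ss);
	if (ret)
		pr_err("malformed address\n");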

Re: [PATCH v3 0/4] Introduce a new helper for parsing ipv[4|6]:port to socket address

2017-03-29 Thread Sagi Grimberg

Hey Dave, thanks for replying.


Please repost, the problem was you mixed it in a series that was
not targetting the net tree, so it was confusing what was going on.


I see, I posted the entire series because adding a helper without
seeing its call sites might be too detached.


Just submit the networking change, by itself, for review.


I will.


Re: [PATCH v3 0/4] Introduce a new helper for parsing ipv[4|6]:port to socket address

2017-03-29 Thread Sagi Grimberg



Hi Dave,

This set got review tags from two kernel maintainers but this is your
realm so I wouldn't be comfortable moving forward with this without
your feedback.


Would it be possible to get a review from one of the netdev folks
on patch 1/4?


Hi everyone,

Still no feedback...

This patchset is targeted to 4.12 as it adds IPv6 support to
nvme over fabrics drivers which is desired by some linux users.

Given that I'm at v3 by now and have accumulated reviews from two kernel
maintainers, unless someone is strongly against it I'll be queuing this
through the nvme tree for 4.12.


Cheers,
Sagi.


Re: [PATCH v3 0/4] Introduce a new helper for parsing ipv[4|6]:port to socket address

2017-03-19 Thread Sagi Grimberg



Changes from v1:
- rebased to 4.11-rc1
- improved changelogs

Changes from v0:
- rebased on 4.10
- split inet_pton_with_scope to be a bit saner (from Christoph)
- converted nvme-rdma host_traddr to use a generic helper

We have some places in the stack that support ipv4 and ipv6. In
some cases the user configuration does not reveal which
address family is given and needs to be parsed from the input string.

Given that the user-input varies between subsystems, some processing
is required from the call-site to separate address and port strings.

As a side-effect, this set adds ipv6 support for nvme over fabrics.


Hi Dave,

This set got review tags from two kernel maintainers but this is your
realm so I wouldn't be comfortable moving forward with this without
your feedback.


Would it be possible to get a review from one of the netdev folks
on patch 1/4?


Re: [PATCH v3 0/4] Introduce a new helper for parsing ipv[4|6]:port to socket address

2017-03-13 Thread Sagi Grimberg



Changes from v1:
- rebased to 4.11-rc1
- improved changelogs

Changes from v0:
- rebased on 4.10
- split inet_pton_with_scope to be a bit saner (from Christoph)
- converted nvme-rdma host_traddr to use a generic helper

We have some places in the stack that support ipv4 and ipv6. In
some cases the user configuration does not reveal which
address family is given and needs to be parsed from the input string.

Given that the user-input varies between subsystems, some processing
is required from the call-site to separate address and port strings.

As a side-effect, this set adds ipv6 support for nvme over fabrics.


Hi Dave,

This set got review tags from two kernel maintainers but this is your
realm so I wouldn't be comfortable moving forward with this without
your feedback.


[PATCH v3 2/4] nvmet-rdma: use generic inet_pton_with_scope

2017-03-09 Thread Sagi Grimberg
Instead of parsing address strings, use a generic
helper. This also adds ipv6 (with address scopes)
support.

Reviewed-by: Christoph Hellwig <h...@lst.de>
Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
 drivers/nvme/target/rdma.c | 42 +-
 1 file changed, 29 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 9aa1da3778b3..973b674ab55b 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -1429,12 +1429,16 @@ static void nvmet_rdma_delete_ctrl(struct nvmet_ctrl 
*ctrl)
 static int nvmet_rdma_add_port(struct nvmet_port *port)
 {
struct rdma_cm_id *cm_id;
-   struct sockaddr_in addr_in;
-   u16 port_in;
+   struct sockaddr_storage addr = { };
+   __kernel_sa_family_t af;
int ret;
 
switch (port->disc_addr.adrfam) {
case NVMF_ADDR_FAMILY_IP4:
+   af = AF_INET;
+   break;
+   case NVMF_ADDR_FAMILY_IP6:
+   af = AF_INET6;
break;
default:
pr_err("address family %d not supported\n",
@@ -1442,13 +1446,13 @@ static int nvmet_rdma_add_port(struct nvmet_port *port)
return -EINVAL;
}
 
-   ret = kstrtou16(port->disc_addr.trsvcid, 0, &port_in);
-   if (ret)
+   ret = inet_pton_with_scope(&init_net, af, port->disc_addr.traddr,
+   port->disc_addr.trsvcid, &addr);
+   if (ret) {
+   pr_err("malformed ip/port passed: %s:%s\n",
+   port->disc_addr.traddr, port->disc_addr.trsvcid);
return ret;
-
-   addr_in.sin_family = AF_INET;
-   addr_in.sin_addr.s_addr = in_aton(port->disc_addr.traddr);
-   addr_in.sin_port = htons(port_in);
+   }
 
cm_id = rdma_create_id(&init_net, nvmet_rdma_cm_handler, port,
RDMA_PS_TCP, IB_QPT_RC);
@@ -1457,20 +1461,32 @@ static int nvmet_rdma_add_port(struct nvmet_port *port)
return PTR_ERR(cm_id);
}
 
-   ret = rdma_bind_addr(cm_id, (struct sockaddr *)&addr_in);
+   /*
+* Allow both IPv4 and IPv6 sockets to bind a single port
+* at the same time.
+*/
+   ret = rdma_set_afonly(cm_id, 1);
+   if (ret) {
+   pr_err("rdma_set_afonly failed (%d)\n", ret);
+   goto out_destroy_id;
+   }
+
+   ret = rdma_bind_addr(cm_id, (struct sockaddr *)&addr);
if (ret) {
-   pr_err("binding CM ID to %pISpc failed (%d)\n", &addr_in, ret);
+   pr_err("binding CM ID to %pISpcs failed (%d)\n",
+   (struct sockaddr *)&addr, ret);
goto out_destroy_id;
}
 
ret = rdma_listen(cm_id, 128);
if (ret) {
-   pr_err("listening to %pISpc failed (%d)\n", &addr_in, ret);
+   pr_err("listening to %pISpcs failed (%d)\n",
+   (struct sockaddr *)&addr, ret);
goto out_destroy_id;
}
 
-   pr_info("enabling port %d (%pISpc)\n",
-   le16_to_cpu(port->disc_addr.portid), &addr_in);
+   pr_info("enabling port %d (%pISpcs)\n",
+   le16_to_cpu(port->disc_addr.portid), (struct sockaddr *)&addr);
port->priv = cm_id;
return 0;
 
-- 
2.7.4



[PATCH v3 1/4] net/utils: generic inet_pton_with_scope helper

2017-03-09 Thread Sagi Grimberg
Several locations in the stack need to handle ipv4/ipv6
(with scope) and port strings conversion to sockaddr.
Add a helper that takes either AF_INET, AF_INET6 or
AF_UNSPEC (for wildcard) to centralize this handling.

Suggested-by: Christoph Hellwig <h...@lst.de>
Reviewed-by: Christoph Hellwig <h...@lst.de>
Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
 include/linux/inet.h |   6 +++
 net/core/utils.c | 103 +++
 2 files changed, 109 insertions(+)

diff --git a/include/linux/inet.h b/include/linux/inet.h
index 4cca05c9678e..636ebe87e6f8 100644
--- a/include/linux/inet.h
+++ b/include/linux/inet.h
@@ -43,6 +43,8 @@
 #define _LINUX_INET_H
 
 #include 
+#include 
+#include 
 
 /*
  * These mimic similar macros defined in user-space for inet_ntop(3).
@@ -54,4 +56,8 @@
 extern __be32 in_aton(const char *str);
 extern int in4_pton(const char *src, int srclen, u8 *dst, int delim, const 
char **end);
 extern int in6_pton(const char *src, int srclen, u8 *dst, int delim, const 
char **end);
+
+extern int inet_pton_with_scope(struct net *net, unsigned short af,
+   const char *src, const char *port, struct sockaddr_storage 
*addr);
+
 #endif /* _LINUX_INET_H */
diff --git a/net/core/utils.c b/net/core/utils.c
index 6592d7bbed39..f96cf527bb8f 100644
--- a/net/core/utils.c
+++ b/net/core/utils.c
@@ -26,9 +26,11 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -300,6 +302,107 @@ int in6_pton(const char *src, int srclen,
 }
 EXPORT_SYMBOL(in6_pton);
 
+static int inet4_pton(const char *src, u16 port_num,
+   struct sockaddr_storage *addr)
+{
+   struct sockaddr_in *addr4 = (struct sockaddr_in *)addr;
+   int srclen = strlen(src);
+
+   if (srclen > INET_ADDRSTRLEN)
+   return -EINVAL;
+
+   if (in4_pton(src, srclen, (u8 *)&addr4->sin_addr.s_addr,
+		'\n', NULL) == 0)
+   return -EINVAL;
+
+   addr4->sin_family = AF_INET;
+   addr4->sin_port = htons(port_num);
+
+   return 0;
+}
+
+static int inet6_pton(struct net *net, const char *src, u16 port_num,
+   struct sockaddr_storage *addr)
+{
+   struct sockaddr_in6 *addr6 = (struct sockaddr_in6 *)addr;
+   const char *scope_delim;
+   int srclen = strlen(src);
+
+   if (srclen > INET6_ADDRSTRLEN)
+   return -EINVAL;
+
+   if (in6_pton(src, srclen, (u8 *)&addr6->sin6_addr.s6_addr,
+		'%', &scope_delim) == 0)
+   return -EINVAL;
+
+   if (ipv6_addr_type(&addr6->sin6_addr) & IPV6_ADDR_LINKLOCAL &&
+   src + srclen != scope_delim && *scope_delim == '%') {
+   struct net_device *dev;
+   char scope_id[16];
+   size_t scope_len = min_t(size_t, sizeof(scope_id),
+src + srclen - scope_delim - 1);
+
+   memcpy(scope_id, scope_delim + 1, scope_len);
+   scope_id[scope_len] = '\0';
+
+   dev = dev_get_by_name(net, scope_id);
+   if (dev) {
+   addr6->sin6_scope_id = dev->ifindex;
+   dev_put(dev);
+   } else if (kstrtouint(scope_id, 0, &addr6->sin6_scope_id)) {
+   return -EINVAL;
+   }
+   }
+
+   addr6->sin6_family = AF_INET6;
+   addr6->sin6_port = htons(port_num);
+
+   return 0;
+}
+
+/**
+ * inet_pton_with_scope - convert an IPv4/IPv6 and port to socket address
+ * @net: net namespace (used for scope handling)
+ * @af: address family, AF_INET, AF_INET6 or AF_UNSPEC for either
+ * @src: the start of the address string
+ * @port: the start of the port string (or NULL for none)
+ * @addr: output socket address
+ *
+ * Return zero on success, return errno when any error occurs.
+ */
+int inet_pton_with_scope(struct net *net, __kernel_sa_family_t af,
+   const char *src, const char *port, struct sockaddr_storage 
*addr)
+{
+   u16 port_num;
+   int ret = -EINVAL;
+
+   if (port) {
+   if (kstrtou16(port, 0, &port_num))
+   return -EINVAL;
+   } else {
+   port_num = 0;
+   }
+
+   switch (af) {
+   case AF_INET:
+   ret = inet4_pton(src, port_num, addr);
+   break;
+   case AF_INET6:
+   ret = inet6_pton(net, src, port_num, addr);
+   break;
+   case AF_UNSPEC:
+   ret = inet4_pton(src, port_num, addr);
+   if (ret)
+   ret = inet6_pton(net, src, port_num, addr);
+   break;
+   default:
+   pr_err("unexpected address family %d\n", af);
+   };
+
+   return ret;
+}
+EXPORT_SYMBOL(inet_pton_with_scope);
+
 void inet_proto_csum_replace4(__sum16 *sum, struct sk_buff *skb,
  __be32 from, __be32 to, bool pseudohdr)
 {
-- 
2.7.4



[PATCH v3 3/4] nvme-rdma: use inet_pton_with_scope helper

2017-03-09 Thread Sagi Grimberg
Both the destination and the host addresses are now
parsed using inet_pton_with_scope helper. We also
get ipv6 (with address scopes support).

Reviewed-by: Christoph Hellwig <h...@lst.de>
Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
 drivers/nvme/host/rdma.c | 63 +++-
 1 file changed, 19 insertions(+), 44 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 779f516e7a4e..7bad791a7fe9 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -129,14 +129,8 @@ struct nvme_rdma_ctrl {
u64 cap;
u32 max_fr_pages;
 
-   union {
-   struct sockaddr addr;
-   struct sockaddr_in addr_in;
-   };
-   union {
-   struct sockaddr src_addr;
-   struct sockaddr_in src_addr_in;
-   };
+   struct sockaddr_storage addr;
+   struct sockaddr_storage src_addr;
 
struct nvme_ctrlctrl;
 };
@@ -571,11 +565,12 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl 
*ctrl,
return PTR_ERR(queue->cm_id);
}
 
-   queue->cm_error = -ETIMEDOUT;
if (ctrl->ctrl.opts->mask & NVMF_OPT_HOST_TRADDR)
-   src_addr = &ctrl->src_addr;
+   src_addr = (struct sockaddr *)&ctrl->src_addr;
 
-   ret = rdma_resolve_addr(queue->cm_id, src_addr, &ctrl->addr,
+   queue->cm_error = -ETIMEDOUT;
+   ret = rdma_resolve_addr(queue->cm_id, src_addr,
+   (struct sockaddr *)&ctrl->addr,
NVME_RDMA_CONNECT_TIMEOUT_MS);
if (ret) {
dev_info(ctrl->ctrl.device,
@@ -1857,27 +1852,13 @@ static int nvme_rdma_create_io_queues(struct 
nvme_rdma_ctrl *ctrl)
return ret;
 }
 
-static int nvme_rdma_parse_ipaddr(struct sockaddr_in *in_addr, char *p)
-{
-   u8 *addr = (u8 *)&in_addr->sin_addr.s_addr;
-   size_t buflen = strlen(p);
-
-   /* XXX: handle IPv6 addresses */
-
-   if (buflen > INET_ADDRSTRLEN)
-   return -EINVAL;
-   if (in4_pton(p, buflen, addr, '\0', NULL) == 0)
-   return -EINVAL;
-   in_addr->sin_family = AF_INET;
-   return 0;
-}
-
 static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
struct nvmf_ctrl_options *opts)
 {
struct nvme_rdma_ctrl *ctrl;
int ret;
bool changed;
+   char *port;
 
ctrl = kzalloc(sizeof(*ctrl), GFP_KERNEL);
if (!ctrl)
@@ -1885,34 +1866,28 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct 
device *dev,
ctrl->ctrl.opts = opts;
INIT_LIST_HEAD(>list);
 
-   ret = nvme_rdma_parse_ipaddr(&ctrl->addr_in, opts->traddr);
+   if (opts->mask & NVMF_OPT_TRSVCID)
+   port = opts->trsvcid;
+   else
+   port = __stringify(NVME_RDMA_IP_PORT);
+
+   ret = inet_pton_with_scope(&init_net, AF_UNSPEC,
+   opts->traddr, port, &ctrl->addr);
if (ret) {
-   pr_err("malformed IP address passed: %s\n", opts->traddr);
+   pr_err("malformed address passed: %s:%s\n", opts->traddr, port);
goto out_free_ctrl;
}
 
if (opts->mask & NVMF_OPT_HOST_TRADDR) {
-   ret = nvme_rdma_parse_ipaddr(&ctrl->src_addr_in,
-   opts->host_traddr);
+   ret = inet_pton_with_scope(&init_net, AF_UNSPEC,
+   opts->host_traddr, NULL, &ctrl->src_addr);
if (ret) {
-   pr_err("malformed src IP address passed: %s\n",
+   pr_err("malformed src address passed: %s\n",
   opts->host_traddr);
goto out_free_ctrl;
}
}
 
-   if (opts->mask & NVMF_OPT_TRSVCID) {
-   u16 port;
-
-   ret = kstrtou16(opts->trsvcid, 0, &port);
-   if (ret)
-   goto out_free_ctrl;
-
-   ctrl->addr_in.sin_port = cpu_to_be16(port);
-   } else {
-   ctrl->addr_in.sin_port = cpu_to_be16(NVME_RDMA_IP_PORT);
-   }
-
ret = nvme_init_ctrl(&ctrl->ctrl, dev, &nvme_rdma_ctrl_ops,
0 /* no quirks, we're perfect! */);
if (ret)
@@ -1977,7 +1952,7 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct 
device *dev,
changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
WARN_ON_ONCE(!changed);
 
-   dev_info(ctrl->ctrl.device, "new ctrl: NQN \"%s\", addr %pISp\n",
+   dev_info(ctrl->ctrl.device, "new ctrl: NQN \"%s\", addr %pISpcs\n",
ctrl->ctrl.opts->subsysnqn, &ctrl->addr);
 
kref_get(&ctrl->ctrl.kref);
-- 
2.7.4



[PATCH v3 4/4] iscsi-target: use generic inet_pton_with_scope

2017-03-09 Thread Sagi Grimberg
Instead of parsing address strings, use a generic
helper.

Acked-by: Nicholas Bellinger <n...@linux-iscsi.org>
Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
 drivers/target/iscsi/iscsi_target_configfs.c | 46 
 1 file changed, 12 insertions(+), 34 deletions(-)

diff --git a/drivers/target/iscsi/iscsi_target_configfs.c 
b/drivers/target/iscsi/iscsi_target_configfs.c
index bf40f03755dd..f30c27b83c5e 100644
--- a/drivers/target/iscsi/iscsi_target_configfs.c
+++ b/drivers/target/iscsi/iscsi_target_configfs.c
@@ -167,10 +167,7 @@ static struct se_tpg_np *lio_target_call_addnptotpg(
struct iscsi_portal_group *tpg;
struct iscsi_tpg_np *tpg_np;
char *str, *str2, *ip_str, *port_str;
-   struct sockaddr_storage sockaddr;
-   struct sockaddr_in *sock_in;
-   struct sockaddr_in6 *sock_in6;
-   unsigned long port;
+   struct sockaddr_storage sockaddr = { };
int ret;
char buf[MAX_PORTAL_LEN + 1];
 
@@ -182,21 +179,19 @@ static struct se_tpg_np *lio_target_call_addnptotpg(
memset(buf, 0, MAX_PORTAL_LEN + 1);
snprintf(buf, MAX_PORTAL_LEN + 1, "%s", name);
 
-   memset(&sockaddr, 0, sizeof(struct sockaddr_storage));
-
str = strstr(buf, "[");
if (str) {
-   const char *end;
-
str2 = strstr(str, "]");
if (!str2) {
pr_err("Unable to locate trailing \"]\""
" in IPv6 iSCSI network portal address\n");
return ERR_PTR(-EINVAL);
}
-   str++; /* Skip over leading "[" */
+
+   ip_str = str + 1; /* Skip over leading "[" */
*str2 = '\0'; /* Terminate the unbracketed IPv6 address */
str2++; /* Skip over the \0 */
+
port_str = strstr(str2, ":");
if (!port_str) {
pr_err("Unable to locate \":port\""
@@ -205,23 +200,8 @@ static struct se_tpg_np *lio_target_call_addnptotpg(
}
*port_str = '\0'; /* Terminate string for IP */
port_str++; /* Skip over ":" */
-
-   ret = kstrtoul(port_str, 0, &port);
-   if (ret < 0) {
-   pr_err("kstrtoul() failed for port_str: %d\n", ret);
-   return ERR_PTR(ret);
-   }
-   sock_in6 = (struct sockaddr_in6 *)&sockaddr;
-   sock_in6->sin6_family = AF_INET6;
-   sock_in6->sin6_port = htons((unsigned short)port);
-   ret = in6_pton(str, -1,
-   (void *)&sock_in6->sin6_addr.in6_u, -1, &end);
-   if (ret <= 0) {
-   pr_err("in6_pton returned: %d\n", ret);
-   return ERR_PTR(-EINVAL);
-   }
} else {
-   str = ip_str = &buf[0];
+   ip_str = &buf[0];
port_str = strstr(ip_str, ":");
if (!port_str) {
pr_err("Unable to locate \":port\""
@@ -230,17 +210,15 @@ static struct se_tpg_np *lio_target_call_addnptotpg(
}
*port_str = '\0'; /* Terminate string for IP */
port_str++; /* Skip over ":" */
+   }
 
-   ret = kstrtoul(port_str, 0, &port);
-   if (ret < 0) {
-   pr_err("kstrtoul() failed for port_str: %d\n", ret);
-   return ERR_PTR(ret);
-   }
-   sock_in = (struct sockaddr_in *)&sockaddr;
-   sock_in->sin_family = AF_INET;
-   sock_in->sin_port = htons((unsigned short)port);
-   sock_in->sin_addr.s_addr = in_aton(ip_str);
+   ret = inet_pton_with_scope(&init_net, AF_UNSPEC, ip_str,
+   port_str, );
+   if (ret) {
+   pr_err("malformed ip/port passed: %s\n", name);
+   return ERR_PTR(ret);
}
+
tpg = container_of(se_tpg, struct iscsi_portal_group, tpg_se_tpg);
ret = iscsit_get_tpg(tpg);
if (ret < 0)
-- 
2.7.4
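
As a worked example of the two portal formats handled above (values are
hypothetical), both forms end up in the same helper call:

	/*
	 *   "[fe80::1%eth0]:3260" -> ip_str = "fe80::1%eth0", port_str = "3260"
	 *   "10.0.0.1:3260"       -> ip_str = "10.0.0.1",     port_str = "3260"
	 */
	ret = inet_pton_with_scope(&init_net, AF_UNSPEC, ip_str,
			port_str, &sockaddr);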



[PATCH v3 0/4] Introduce a new helper for parsing ipv[4|6]:port to socket address

2017-03-09 Thread Sagi Grimberg
Changes from v1:
- rebased to 4.11-rc1
- improved changelogs

Changes from v0:
- rebased on 4.10
- split inet_pton_with_scope to be a bit saner (from Christoph)
- converted nvme-rdma host_traddr to use a generic helper

We have some places in the stack that support ipv4 and ipv6. In
some cases the user configuration does not reveal which
address family is given and needs to be parsed from the input string.

Given that the user-input varies between subsystems, some processing
is required from the call-site to separate address and port strings.

As a side-effect, this set adds ipv6 support for nvme over fabrics.

Sagi Grimberg (4):
  net/utils: generic inet_pton_with_scope helper
  nvmet-rdma: use generic inet_pton_with_scope
  nvme-rdma: use inet_pton_with_scope helper
  iscsi-target: use generic inet_pton_with_scope

 drivers/nvme/host/rdma.c |  63 +---
 drivers/nvme/target/rdma.c   |  42 +++
 drivers/target/iscsi/iscsi_target_configfs.c |  46 
 include/linux/inet.h |   6 ++
 net/core/utils.c | 103 +++
 5 files changed, 169 insertions(+), 91 deletions(-)

-- 
2.7.4



Re: [PATCH v1 2/4] nvmet-rdma: use generic inet_pton_with_scope

2017-02-28 Thread Sagi Grimberg

Please add a changelog and mention that this adds IPv6 support.


Will do, thanks!


Re: [PATCH v1 3/4] nvme-rdma: use inet_pton_with_scope helper

2017-02-28 Thread Sagi Grimberg

Could use a proper changelog.


Will do, thanks!


[PATCH v1 4/4] iscsi-target: use generic inet_pton_with_scope

2017-02-27 Thread Sagi Grimberg
Acked-by: Nicholas Bellinger <n...@linux-iscsi.org>
Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
 drivers/target/iscsi/iscsi_target_configfs.c | 46 
 1 file changed, 12 insertions(+), 34 deletions(-)

diff --git a/drivers/target/iscsi/iscsi_target_configfs.c 
b/drivers/target/iscsi/iscsi_target_configfs.c
index bf40f03755dd..f30c27b83c5e 100644
--- a/drivers/target/iscsi/iscsi_target_configfs.c
+++ b/drivers/target/iscsi/iscsi_target_configfs.c
@@ -167,10 +167,7 @@ static struct se_tpg_np *lio_target_call_addnptotpg(
struct iscsi_portal_group *tpg;
struct iscsi_tpg_np *tpg_np;
char *str, *str2, *ip_str, *port_str;
-   struct sockaddr_storage sockaddr;
-   struct sockaddr_in *sock_in;
-   struct sockaddr_in6 *sock_in6;
-   unsigned long port;
+   struct sockaddr_storage sockaddr = { };
int ret;
char buf[MAX_PORTAL_LEN + 1];
 
@@ -182,21 +179,19 @@ static struct se_tpg_np *lio_target_call_addnptotpg(
memset(buf, 0, MAX_PORTAL_LEN + 1);
snprintf(buf, MAX_PORTAL_LEN + 1, "%s", name);
 
-   memset(, 0, sizeof(struct sockaddr_storage));
-
str = strstr(buf, "[");
if (str) {
-   const char *end;
-
str2 = strstr(str, "]");
if (!str2) {
pr_err("Unable to locate trailing \"]\""
" in IPv6 iSCSI network portal address\n");
return ERR_PTR(-EINVAL);
}
-   str++; /* Skip over leading "[" */
+
+   ip_str = str + 1; /* Skip over leading "[" */
*str2 = '\0'; /* Terminate the unbracketed IPv6 address */
str2++; /* Skip over the \0 */
+
port_str = strstr(str2, ":");
if (!port_str) {
pr_err("Unable to locate \":port\""
@@ -205,23 +200,8 @@ static struct se_tpg_np *lio_target_call_addnptotpg(
}
*port_str = '\0'; /* Terminate string for IP */
port_str++; /* Skip over ":" */
-
-   ret = kstrtoul(port_str, 0, );
-   if (ret < 0) {
-   pr_err("kstrtoul() failed for port_str: %d\n", ret);
-   return ERR_PTR(ret);
-   }
-   sock_in6 = (struct sockaddr_in6 *)
-   sock_in6->sin6_family = AF_INET6;
-   sock_in6->sin6_port = htons((unsigned short)port);
-   ret = in6_pton(str, -1,
-   (void *)_in6->sin6_addr.in6_u, -1, );
-   if (ret <= 0) {
-   pr_err("in6_pton returned: %d\n", ret);
-   return ERR_PTR(-EINVAL);
-   }
} else {
-   str = ip_str = [0];
+   ip_str = [0];
port_str = strstr(ip_str, ":");
if (!port_str) {
pr_err("Unable to locate \":port\""
@@ -230,17 +210,15 @@ static struct se_tpg_np *lio_target_call_addnptotpg(
}
*port_str = '\0'; /* Terminate string for IP */
port_str++; /* Skip over ":" */
+   }
 
-   ret = kstrtoul(port_str, 0, );
-   if (ret < 0) {
-   pr_err("kstrtoul() failed for port_str: %d\n", ret);
-   return ERR_PTR(ret);
-   }
-   sock_in = (struct sockaddr_in *)
-   sock_in->sin_family = AF_INET;
-   sock_in->sin_port = htons((unsigned short)port);
-   sock_in->sin_addr.s_addr = in_aton(ip_str);
+   ret = inet_pton_with_scope(_net, AF_UNSPEC, ip_str,
+   port_str, );
+   if (ret) {
+   pr_err("malformed ip/port passed: %s\n", name);
+   return ERR_PTR(ret);
}
+
tpg = container_of(se_tpg, struct iscsi_portal_group, tpg_se_tpg);
ret = iscsit_get_tpg(tpg);
if (ret < 0)
-- 
2.7.4



[PATCH v1 1/4] net/utils: generic inet_pton_with_scope helper

2017-02-27 Thread Sagi Grimberg
Several locations in the stack need to handle ipv4/ipv6
(with scope) and port strings conversion to sockaddr.
Add a helper that takes either AF_INET, AF_INET6 or
AF_UNSPEC (for wildcard) to centralize this handling.

Suggested-by: Christoph Hellwig <h...@lst.de>
Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
 include/linux/inet.h |   6 +++
 net/core/utils.c | 103 +++
 2 files changed, 109 insertions(+)

diff --git a/include/linux/inet.h b/include/linux/inet.h
index 4cca05c9678e..636ebe87e6f8 100644
--- a/include/linux/inet.h
+++ b/include/linux/inet.h
@@ -43,6 +43,8 @@
 #define _LINUX_INET_H
 
 #include 
+#include 
+#include 
 
 /*
  * These mimic similar macros defined in user-space for inet_ntop(3).
@@ -54,4 +56,8 @@
 extern __be32 in_aton(const char *str);
 extern int in4_pton(const char *src, int srclen, u8 *dst, int delim, const 
char **end);
 extern int in6_pton(const char *src, int srclen, u8 *dst, int delim, const 
char **end);
+
+extern int inet_pton_with_scope(struct net *net, unsigned short af,
+   const char *src, const char *port, struct sockaddr_storage 
*addr);
+
 #endif /* _LINUX_INET_H */
diff --git a/net/core/utils.c b/net/core/utils.c
index 6592d7bbed39..f96cf527bb8f 100644
--- a/net/core/utils.c
+++ b/net/core/utils.c
@@ -26,9 +26,11 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -300,6 +302,107 @@ int in6_pton(const char *src, int srclen,
 }
 EXPORT_SYMBOL(in6_pton);
 
+static int inet4_pton(const char *src, u16 port_num,
+   struct sockaddr_storage *addr)
+{
+   struct sockaddr_in *addr4 = (struct sockaddr_in *)addr;
+   int srclen = strlen(src);
+
+   if (srclen > INET_ADDRSTRLEN)
+   return -EINVAL;
+
+   if (in4_pton(src, srclen, (u8 *)&addr4->sin_addr.s_addr,
+		'\n', NULL) == 0)
+   return -EINVAL;
+
+   addr4->sin_family = AF_INET;
+   addr4->sin_port = htons(port_num);
+
+   return 0;
+}
+
+static int inet6_pton(struct net *net, const char *src, u16 port_num,
+   struct sockaddr_storage *addr)
+{
+   struct sockaddr_in6 *addr6 = (struct sockaddr_in6 *)addr;
+   const char *scope_delim;
+   int srclen = strlen(src);
+
+   if (srclen > INET6_ADDRSTRLEN)
+   return -EINVAL;
+
+   if (in6_pton(src, srclen, (u8 *)&addr6->sin6_addr.s6_addr,
+		'%', &scope_delim) == 0)
+   return -EINVAL;
+
+   if (ipv6_addr_type(&addr6->sin6_addr) & IPV6_ADDR_LINKLOCAL &&
+   src + srclen != scope_delim && *scope_delim == '%') {
+   struct net_device *dev;
+   char scope_id[16];
+   size_t scope_len = min_t(size_t, sizeof(scope_id),
+src + srclen - scope_delim - 1);
+
+   memcpy(scope_id, scope_delim + 1, scope_len);
+   scope_id[scope_len] = '\0';
+
+   dev = dev_get_by_name(net, scope_id);
+   if (dev) {
+   addr6->sin6_scope_id = dev->ifindex;
+   dev_put(dev);
+   } else if (kstrtouint(scope_id, 0, &addr6->sin6_scope_id)) {
+   return -EINVAL;
+   }
+   }
+
+   addr6->sin6_family = AF_INET6;
+   addr6->sin6_port = htons(port_num);
+
+   return 0;
+}
+
+/**
+ * inet_pton_with_scope - convert an IPv4/IPv6 and port to socket address
+ * @net: net namespace (used for scope handling)
+ * @af: address family, AF_INET, AF_INET6 or AF_UNSPEC for either
+ * @src: the start of the address string
+ * @port: the start of the port string (or NULL for none)
+ * @addr: output socket address
+ *
+ * Return zero on success, return errno when any error occurs.
+ */
+int inet_pton_with_scope(struct net *net, __kernel_sa_family_t af,
+   const char *src, const char *port, struct sockaddr_storage 
*addr)
+{
+   u16 port_num;
+   int ret = -EINVAL;
+
+   if (port) {
+   if (kstrtou16(port, 0, &port_num))
+   return -EINVAL;
+   } else {
+   port_num = 0;
+   }
+
+   switch (af) {
+   case AF_INET:
+   ret = inet4_pton(src, port_num, addr);
+   break;
+   case AF_INET6:
+   ret = inet6_pton(net, src, port_num, addr);
+   break;
+   case AF_UNSPEC:
+   ret = inet4_pton(src, port_num, addr);
+   if (ret)
+   ret = inet6_pton(net, src, port_num, addr);
+   break;
+   default:
+   pr_err("unexpected address family %d\n", af);
+   };
+
+   return ret;
+}
+EXPORT_SYMBOL(inet_pton_with_scope);
+
 void inet_proto_csum_replace4(__sum16 *sum, struct sk_buff *skb,
  __be32 from, __be32 to, bool pseudohdr)
 {
-- 
2.7.4



[PATCH v1 2/4] nvmet-rdma: use generic inet_pton_with_scope

2017-02-27 Thread Sagi Grimberg
Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
 drivers/nvme/target/rdma.c | 42 +-
 1 file changed, 29 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 9aa1da3778b3..973b674ab55b 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -1429,12 +1429,16 @@ static void nvmet_rdma_delete_ctrl(struct nvmet_ctrl 
*ctrl)
 static int nvmet_rdma_add_port(struct nvmet_port *port)
 {
struct rdma_cm_id *cm_id;
-   struct sockaddr_in addr_in;
-   u16 port_in;
+   struct sockaddr_storage addr = { };
+   __kernel_sa_family_t af;
int ret;
 
switch (port->disc_addr.adrfam) {
case NVMF_ADDR_FAMILY_IP4:
+   af = AF_INET;
+   break;
+   case NVMF_ADDR_FAMILY_IP6:
+   af = AF_INET6;
break;
default:
pr_err("address family %d not supported\n",
@@ -1442,13 +1446,13 @@ static int nvmet_rdma_add_port(struct nvmet_port *port)
return -EINVAL;
}
 
-   ret = kstrtou16(port->disc_addr.trsvcid, 0, _in);
-   if (ret)
+   ret = inet_pton_with_scope(_net, af, port->disc_addr.traddr,
+   port->disc_addr.trsvcid, );
+   if (ret) {
+   pr_err("malformed ip/port passed: %s:%s\n",
+   port->disc_addr.traddr, port->disc_addr.trsvcid);
return ret;
-
-   addr_in.sin_family = AF_INET;
-   addr_in.sin_addr.s_addr = in_aton(port->disc_addr.traddr);
-   addr_in.sin_port = htons(port_in);
+   }
 
cm_id = rdma_create_id(_net, nvmet_rdma_cm_handler, port,
RDMA_PS_TCP, IB_QPT_RC);
@@ -1457,20 +1461,32 @@ static int nvmet_rdma_add_port(struct nvmet_port *port)
return PTR_ERR(cm_id);
}
 
-   ret = rdma_bind_addr(cm_id, (struct sockaddr *)_in);
+   /*
+* Allow both IPv4 and IPv6 sockets to bind a single port
+* at the same time.
+*/
+   ret = rdma_set_afonly(cm_id, 1);
+   if (ret) {
+   pr_err("rdma_set_afonly failed (%d)\n", ret);
+   goto out_destroy_id;
+   }
+
+   ret = rdma_bind_addr(cm_id, (struct sockaddr *));
if (ret) {
-   pr_err("binding CM ID to %pISpc failed (%d)\n", _in, ret);
+   pr_err("binding CM ID to %pISpcs failed (%d)\n",
+   (struct sockaddr *), ret);
goto out_destroy_id;
}
 
ret = rdma_listen(cm_id, 128);
if (ret) {
-   pr_err("listening to %pISpc failed (%d)\n", _in, ret);
+   pr_err("listening to %pISpcs failed (%d)\n",
+   (struct sockaddr *), ret);
goto out_destroy_id;
}
 
-   pr_info("enabling port %d (%pISpc)\n",
-   le16_to_cpu(port->disc_addr.portid), _in);
+   pr_info("enabling port %d (%pISpcs)\n",
+   le16_to_cpu(port->disc_addr.portid), (struct sockaddr *));
port->priv = cm_id;
return 0;
 
-- 
2.7.4



[PATCH v1 3/4] nvme-rdma: use inet_pton_with_scope helper

2017-02-27 Thread Sagi Grimberg
Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
 drivers/nvme/host/rdma.c | 63 +++-
 1 file changed, 19 insertions(+), 44 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 49b2121af689..3f4c49969f55 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -129,14 +129,8 @@ struct nvme_rdma_ctrl {
u64 cap;
u32 max_fr_pages;
 
-   union {
-   struct sockaddr addr;
-   struct sockaddr_in addr_in;
-   };
-   union {
-   struct sockaddr src_addr;
-   struct sockaddr_in src_addr_in;
-   };
+   struct sockaddr_storage addr;
+   struct sockaddr_storage src_addr;
 
struct nvme_ctrlctrl;
 };
@@ -571,11 +565,12 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl 
*ctrl,
return PTR_ERR(queue->cm_id);
}
 
-   queue->cm_error = -ETIMEDOUT;
if (ctrl->ctrl.opts->mask & NVMF_OPT_HOST_TRADDR)
-   src_addr = >src_addr;
+   src_addr = (struct sockaddr *)>src_addr;
 
-   ret = rdma_resolve_addr(queue->cm_id, src_addr, >addr,
+   queue->cm_error = -ETIMEDOUT;
+   ret = rdma_resolve_addr(queue->cm_id, src_addr,
+   (struct sockaddr *)>addr,
NVME_RDMA_CONNECT_TIMEOUT_MS);
if (ret) {
dev_info(ctrl->ctrl.device,
@@ -1857,27 +1852,13 @@ static int nvme_rdma_create_io_queues(struct 
nvme_rdma_ctrl *ctrl)
return ret;
 }
 
-static int nvme_rdma_parse_ipaddr(struct sockaddr_in *in_addr, char *p)
-{
-   u8 *addr = (u8 *)_addr->sin_addr.s_addr;
-   size_t buflen = strlen(p);
-
-   /* XXX: handle IPv6 addresses */
-
-   if (buflen > INET_ADDRSTRLEN)
-   return -EINVAL;
-   if (in4_pton(p, buflen, addr, '\0', NULL) == 0)
-   return -EINVAL;
-   in_addr->sin_family = AF_INET;
-   return 0;
-}
-
 static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
struct nvmf_ctrl_options *opts)
 {
struct nvme_rdma_ctrl *ctrl;
int ret;
bool changed;
+   char *port;
 
ctrl = kzalloc(sizeof(*ctrl), GFP_KERNEL);
if (!ctrl)
@@ -1885,34 +1866,28 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct 
device *dev,
ctrl->ctrl.opts = opts;
INIT_LIST_HEAD(>list);
 
-   ret = nvme_rdma_parse_ipaddr(>addr_in, opts->traddr);
+   if (opts->mask & NVMF_OPT_TRSVCID)
+   port = opts->trsvcid;
+   else
+   port = __stringify(NVME_RDMA_IP_PORT);
+
+   ret = inet_pton_with_scope(_net, AF_UNSPEC,
+   opts->traddr, port, >addr);
if (ret) {
-   pr_err("malformed IP address passed: %s\n", opts->traddr);
+   pr_err("malformed address passed: %s:%s\n", opts->traddr, port);
goto out_free_ctrl;
}
 
if (opts->mask & NVMF_OPT_HOST_TRADDR) {
-   ret = nvme_rdma_parse_ipaddr(>src_addr_in,
-   opts->host_traddr);
+   ret = inet_pton_with_scope(_net, AF_UNSPEC,
+   opts->host_traddr, NULL, >src_addr);
if (ret) {
-   pr_err("malformed src IP address passed: %s\n",
+   pr_err("malformed src address passed: %s\n",
   opts->host_traddr);
goto out_free_ctrl;
}
}
 
-   if (opts->mask & NVMF_OPT_TRSVCID) {
-   u16 port;
-
-   ret = kstrtou16(opts->trsvcid, 0, );
-   if (ret)
-   goto out_free_ctrl;
-
-   ctrl->addr_in.sin_port = cpu_to_be16(port);
-   } else {
-   ctrl->addr_in.sin_port = cpu_to_be16(NVME_RDMA_IP_PORT);
-   }
-
ret = nvme_init_ctrl(>ctrl, dev, _rdma_ctrl_ops,
0 /* no quirks, we're perfect! */);
if (ret)
@@ -1977,7 +1952,7 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct 
device *dev,
changed = nvme_change_ctrl_state(>ctrl, NVME_CTRL_LIVE);
WARN_ON_ONCE(!changed);
 
-   dev_info(ctrl->ctrl.device, "new ctrl: NQN \"%s\", addr %pISp\n",
+   dev_info(ctrl->ctrl.device, "new ctrl: NQN \"%s\", addr %pISpcs\n",
ctrl->ctrl.opts->subsysnqn, >addr);
 
kref_get(>ctrl.kref);
-- 
2.7.4



[PATCH v1 0/4] Introduce a new helper for parsing ipv[4|6]:port to socket address

2017-02-27 Thread Sagi Grimberg
Changes from v0:
- rebased on 4.10
- split inet_pton_with_scope to be a bit saner (from Christoph)
- converted nvme-rdma host_traddr to use a generic helper

We have some places in the stack that support ipv4 and ipv6. In
some cases the user configuration does not reveal which
address family is given and needs to be parsed from the input string.

Given that the user-input varies between subsystems, some processing
is required from the call-site to separate address and port strings.

As a side-effect, this set adds ipv6 support for nvme over fabrics.

Sagi Grimberg (4):
  net/utils: generic inet_pton_with_scope helper
  nvmet-rdma: use generic inet_pton_with_scope
  nvme-rdma: use inet_pton_with_scope helper
  iscsi-target: use generic inet_pton_with_scope

 drivers/nvme/host/rdma.c |  63 +---
 drivers/nvme/target/rdma.c   |  42 +++
 drivers/target/iscsi/iscsi_target_configfs.c |  46 
 include/linux/inet.h |   6 ++
 net/core/utils.c | 103 +++
 5 files changed, 169 insertions(+), 91 deletions(-)

-- 
2.7.4



Re: [PATCH rfc 1/4] net/utils: generic inet_pton_with_scope helper

2017-02-21 Thread Sagi Grimberg



On 19/02/17 19:15, Christoph Hellwig wrote:

On Thu, Feb 16, 2017 at 07:43:34PM +0200, Sagi Grimberg wrote:

Several locations in the stack need to handle ipv4/ipv6
(with scope) and port strings conversion to sockaddr.
Add a helper that takes either AF_INET, AF_INET6 or
AF_UNSPEC (for wildcard) to centralize this handling.

Suggested-by: Christoph Hellwig <h...@lst.de>
Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
 include/linux/inet.h |  6 
 net/core/utils.c | 91 
 2 files changed, 97 insertions(+)

diff --git a/include/linux/inet.h b/include/linux/inet.h
index 4cca05c9678e..636ebe87e6f8 100644
--- a/include/linux/inet.h
+++ b/include/linux/inet.h
@@ -43,6 +43,8 @@
 #define _LINUX_INET_H

 #include 
+#include 
+#include 

 /*
  * These mimic similar macros defined in user-space for inet_ntop(3).
@@ -54,4 +56,8 @@
 extern __be32 in_aton(const char *str);
 extern int in4_pton(const char *src, int srclen, u8 *dst, int delim, const 
char **end);
 extern int in6_pton(const char *src, int srclen, u8 *dst, int delim, const 
char **end);
+
+extern int inet_pton_with_scope(struct net *net, unsigned short af,
+   const char *src, const char *port, struct sockaddr_storage 
*addr);
+
 #endif /* _LINUX_INET_H */
diff --git a/net/core/utils.c b/net/core/utils.c
index 6592d7bbed39..8f15d016c64a 100644
--- a/net/core/utils.c
+++ b/net/core/utils.c
@@ -26,9 +26,11 @@
 #include 
 #include 
 #include 
+#include 

 #include 
 #include 
+#include 

 #include 
 #include 
@@ -300,6 +302,95 @@ int in6_pton(const char *src, int srclen,
 }
 EXPORT_SYMBOL(in6_pton);

+/**
+ * inet_pton_with_scope - convert an IPv4/IPv6 and port to socket address
+ * @net: net namespace (used for scope handling)
+ * @af: address family, AF_INET, AF_INET6 or AF_UNSPEC for either
+ * @src: the start of the address string
+ * @port: the start of the port string (or NULL for none)
+ * @addr: output socket address
+ *
+ * Return zero on success, return errno when any error occurs.
+ */
+int inet_pton_with_scope(struct net *net, __kernel_sa_family_t af,
+   const char *src, const char *port, struct sockaddr_storage 
*addr)
+{
+   struct sockaddr_in *addr4;
+   struct sockaddr_in6 *addr6;
+   const char *scope_delim;
+   bool unspec = false;
+   int srclen = strlen(src);
+   u16 port_num;
+
+   if (port) {
+   if (kstrtou16(port, 0, _num)) {
+   pr_err("failed port_num %s\n", port);
+   return -EINVAL;
+   }
+   } else {
+   port_num = 0;
+   }
+
+   switch (af) {
+   case AF_UNSPEC:
+   unspec = true;
+   /* FALLTHRU */
+   case AF_INET:
+   if (srclen <= INET_ADDRSTRLEN) {
+   addr4 = (struct sockaddr_in *)addr;
+   if (in4_pton(src, srclen, (u8 *) 
>sin_addr.s_addr,
+'\n', NULL) > 0) {
+   addr4->sin_family = AF_INET;
+   addr4->sin_port = htons(port_num);
+   return 0;
+   }
+   pr_err("failed in4_pton %s\n", src);
+   }


I would probably factor this and the IPv6 equivalent below into helpers
to keep the code a little more self-contained so that we could avoid the
fallthrough magic.


I can factor out ipv4/ipv6 to helpers, but still I need to try either
ipv4, ipv6, or both for unspec.

I can maybe do something like:

if (af == AF_INET || af == AF_UNSPEC) {
ret = inet4_pton();
if (!ret)
return 0;
else if (af != AF_UNSPEC)
return ret;
}

if (af == AF_INET6 || af == AF_UNSPEC) {
ret = inet6_pton();
if (!ret)
return 0;
else if (af != AF_UNSPEC)
return ret;
}

return -EINVAL;

better?
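
For completeness, the factored form that ended up in v1 of the series looks
roughly like this (per-family helpers plus a plain switch, no fallthrough):

	switch (af) {
	case AF_INET:
		ret = inet4_pton(src, port_num, addr);
		break;
	case AF_INET6:
		ret = inet6_pton(net, src, port_num, addr);
		break;
	case AF_UNSPEC:
		ret = inet4_pton(src, port_num, addr);
		if (ret)
			ret = inet6_pton(net, src, port_num, addr);
		break;
	default:
		pr_err("unexpected address family %d\n", af);
	}

	return ret;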


Re: [PATCH rfc 0/4] Introduce a new helper for parsing ipv[4|6]:port to socket address

2017-02-16 Thread Sagi Grimberg



Yea, I wanted to get all the rpc_pton/rpc_ntop (which handle both
ipv4 and ipv6 with scopes) converted to use a common helper.


RPC is dealing with universal addresses, which are not quite the
same as IP presentation addresses. The port number is expressed
as:

  hi8.lo8


I noticed that, that's why there will be some processing in
sunrpc to construct a port string to pass to a generic helper (if it chooses to
use it).
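
(For readers unfamiliar with universal addresses: the port is appended as
two decimal octets, hi8.lo8, so e.g. port 2049 becomes the suffix "8.1".
A hypothetical conversion helper, not part of this series, would be:)

	/* hypothetical: format the hi8.lo8 suffix of an RPC universal address */
	static void uaddr_port_suffix(u16 port, char *buf, size_t len)
	{
		snprintf(buf, len, "%u.%u", port >> 8, port & 0xff);
	}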


And universal IPv6 addresses don't have a mechanism for conveying
scope information.


I do see scope handling in rpc_pton6 code if provided...


Re: [PATCH rfc 0/4] Introduce a new helper for parsing ipv[4|6]:port to socket address

2017-02-16 Thread Sagi Grimberg

Hey Chuck,


As a side-effect, this set adds ipv6 support for nvme over fabrics.
I also converted iscsi target and plan to convert nfs/cifs next but
wanted to get some feedback before doing that.


At least NFS/RDMA already supports IPv6. You might not need many
changes there, but I'll hold off on further comments.


Yea, I wanted to get all the rpc_pton/rpc_ntop (which handle both
ipv4 and ipv6 with scopes) converted to use a common helper.


[PATCH rfc 4/4] iscsi-target: use generic inet_pton_with_scope

2017-02-16 Thread Sagi Grimberg
Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
 drivers/target/iscsi/iscsi_target_configfs.c | 46 
 1 file changed, 12 insertions(+), 34 deletions(-)

diff --git a/drivers/target/iscsi/iscsi_target_configfs.c 
b/drivers/target/iscsi/iscsi_target_configfs.c
index bf40f03755dd..f30c27b83c5e 100644
--- a/drivers/target/iscsi/iscsi_target_configfs.c
+++ b/drivers/target/iscsi/iscsi_target_configfs.c
@@ -167,10 +167,7 @@ static struct se_tpg_np *lio_target_call_addnptotpg(
struct iscsi_portal_group *tpg;
struct iscsi_tpg_np *tpg_np;
char *str, *str2, *ip_str, *port_str;
-   struct sockaddr_storage sockaddr;
-   struct sockaddr_in *sock_in;
-   struct sockaddr_in6 *sock_in6;
-   unsigned long port;
+   struct sockaddr_storage sockaddr = { };
int ret;
char buf[MAX_PORTAL_LEN + 1];
 
@@ -182,21 +179,19 @@ static struct se_tpg_np *lio_target_call_addnptotpg(
memset(buf, 0, MAX_PORTAL_LEN + 1);
snprintf(buf, MAX_PORTAL_LEN + 1, "%s", name);
 
-   memset(, 0, sizeof(struct sockaddr_storage));
-
str = strstr(buf, "[");
if (str) {
-   const char *end;
-
str2 = strstr(str, "]");
if (!str2) {
pr_err("Unable to locate trailing \"]\""
" in IPv6 iSCSI network portal address\n");
return ERR_PTR(-EINVAL);
}
-   str++; /* Skip over leading "[" */
+
+   ip_str = str + 1; /* Skip over leading "[" */
*str2 = '\0'; /* Terminate the unbracketed IPv6 address */
str2++; /* Skip over the \0 */
+
port_str = strstr(str2, ":");
if (!port_str) {
pr_err("Unable to locate \":port\""
@@ -205,23 +200,8 @@ static struct se_tpg_np *lio_target_call_addnptotpg(
}
*port_str = '\0'; /* Terminate string for IP */
port_str++; /* Skip over ":" */
-
-   ret = kstrtoul(port_str, 0, );
-   if (ret < 0) {
-   pr_err("kstrtoul() failed for port_str: %d\n", ret);
-   return ERR_PTR(ret);
-   }
-   sock_in6 = (struct sockaddr_in6 *)
-   sock_in6->sin6_family = AF_INET6;
-   sock_in6->sin6_port = htons((unsigned short)port);
-   ret = in6_pton(str, -1,
-   (void *)_in6->sin6_addr.in6_u, -1, );
-   if (ret <= 0) {
-   pr_err("in6_pton returned: %d\n", ret);
-   return ERR_PTR(-EINVAL);
-   }
} else {
-   str = ip_str = [0];
+   ip_str = [0];
port_str = strstr(ip_str, ":");
if (!port_str) {
pr_err("Unable to locate \":port\""
@@ -230,17 +210,15 @@ static struct se_tpg_np *lio_target_call_addnptotpg(
}
*port_str = '\0'; /* Terminate string for IP */
port_str++; /* Skip over ":" */
+   }
 
-   ret = kstrtoul(port_str, 0, );
-   if (ret < 0) {
-   pr_err("kstrtoul() failed for port_str: %d\n", ret);
-   return ERR_PTR(ret);
-   }
-   sock_in = (struct sockaddr_in *)
-   sock_in->sin_family = AF_INET;
-   sock_in->sin_port = htons((unsigned short)port);
-   sock_in->sin_addr.s_addr = in_aton(ip_str);
+   ret = inet_pton_with_scope(_net, AF_UNSPEC, ip_str,
+   port_str, );
+   if (ret) {
+   pr_err("malformed ip/port passed: %s\n", name);
+   return ERR_PTR(ret);
}
+
tpg = container_of(se_tpg, struct iscsi_portal_group, tpg_se_tpg);
ret = iscsit_get_tpg(tpg);
if (ret < 0)
-- 
2.7.4



[PATCH rfc 3/4] nvme-rdma: use inet_pton_with_scope helper

2017-02-16 Thread Sagi Grimberg
Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
 drivers/nvme/host/rdma.c | 48 +---
 1 file changed, 13 insertions(+), 35 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 890e7e4b1c15..bf02acbb4af8 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -151,10 +151,7 @@ struct nvme_rdma_ctrl {
u64 cap;
u32 max_fr_pages;
 
-   union {
-   struct sockaddr addr;
-   struct sockaddr_in addr_in;
-   };
+   struct sockaddr_storage addr;
 
struct nvme_ctrlctrl;
 };
@@ -589,7 +586,8 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,
}
 
queue->cm_error = -ETIMEDOUT;
-   ret = rdma_resolve_addr(queue->cm_id, NULL, >addr,
+   ret = rdma_resolve_addr(queue->cm_id, NULL,
+   (struct sockaddr *)>addr,
NVME_RDMA_CONNECT_TIMEOUT_MS);
if (ret) {
dev_info(ctrl->ctrl.device,
@@ -1874,27 +1872,13 @@ static int nvme_rdma_create_io_queues(struct 
nvme_rdma_ctrl *ctrl)
return ret;
 }
 
-static int nvme_rdma_parse_ipaddr(struct sockaddr_in *in_addr, char *p)
-{
-   u8 *addr = (u8 *)_addr->sin_addr.s_addr;
-   size_t buflen = strlen(p);
-
-   /* XXX: handle IPv6 addresses */
-
-   if (buflen > INET_ADDRSTRLEN)
-   return -EINVAL;
-   if (in4_pton(p, buflen, addr, '\0', NULL) == 0)
-   return -EINVAL;
-   in_addr->sin_family = AF_INET;
-   return 0;
-}
-
 static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
struct nvmf_ctrl_options *opts)
 {
struct nvme_rdma_ctrl *ctrl;
int ret;
bool changed;
+   char *port;
 
ctrl = kzalloc(sizeof(*ctrl), GFP_KERNEL);
if (!ctrl)
@@ -1902,24 +1886,18 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct 
device *dev,
ctrl->ctrl.opts = opts;
INIT_LIST_HEAD(>list);
 
-   ret = nvme_rdma_parse_ipaddr(>addr_in, opts->traddr);
+   if (opts->mask & NVMF_OPT_TRSVCID)
+   port = opts->trsvcid;
+   else
+   port = __stringify(NVME_RDMA_IP_PORT);
+
+   ret = inet_pton_with_scope(_net, AF_UNSPEC, opts->traddr,
+   port, >addr);
if (ret) {
-   pr_err("malformed IP address passed: %s\n", opts->traddr);
+   pr_err("malformed ip/port passed: %s:%s\n", opts->traddr, port);
goto out_free_ctrl;
}
 
-   if (opts->mask & NVMF_OPT_TRSVCID) {
-   u16 port;
-
-   ret = kstrtou16(opts->trsvcid, 0, );
-   if (ret)
-   goto out_free_ctrl;
-
-   ctrl->addr_in.sin_port = cpu_to_be16(port);
-   } else {
-   ctrl->addr_in.sin_port = cpu_to_be16(NVME_RDMA_IP_PORT);
-   }
-
ret = nvme_init_ctrl(>ctrl, dev, _rdma_ctrl_ops,
0 /* no quirks, we're perfect! */);
if (ret)
@@ -1984,7 +1962,7 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct 
device *dev,
changed = nvme_change_ctrl_state(>ctrl, NVME_CTRL_LIVE);
WARN_ON_ONCE(!changed);
 
-   dev_info(ctrl->ctrl.device, "new ctrl: NQN \"%s\", addr %pISp\n",
+   dev_info(ctrl->ctrl.device, "new ctrl: NQN \"%s\", addr %pISpcs\n",
ctrl->ctrl.opts->subsysnqn, >addr);
 
kref_get(>ctrl.kref);
-- 
2.7.4



[PATCH rfc 0/4] Introduce a new helper for parsing ipv[4|6]:port to socket address

2017-02-16 Thread Sagi Grimberg
We have some places in the stack that support ipv4 and ipv6. In
some cases the user configuration does not reveal which
address family is given and needs to be parsed from the input string.

Given that the user-input varies between subsystems, some processing
is required from the call-site to separate address and port strings.

As a side-effect, this set adds ipv6 support for nvme over fabrics.
I also converted iscsi target and plan to convert nfs/cifs next but
wanted to get some feedback before doing that.

patch #1 is based off roland's patch:
http://lists.infradead.org/pipermail/linux-nvme/2016-August/005600.html

Sagi Grimberg (4):
  net/utils: generic inet_pton_with_scope helper
  nvmet-rdma: use generic inet_pton_with_scope
  nvme-rdma: use inet_pton_with_scope helper
  iscsi-target: use generic inet_pton_with_scope

 drivers/nvme/host/rdma.c | 48 ---
 drivers/nvme/target/rdma.c   | 42 +
 drivers/target/iscsi/iscsi_target_configfs.c | 46 --
 include/linux/inet.h |  6 ++
 net/core/utils.c | 91 
 5 files changed, 151 insertions(+), 82 deletions(-)

-- 
2.7.4



[PATCH rfc 1/4] net/utils: generic inet_pton_with_scope helper

2017-02-16 Thread Sagi Grimberg
Several locations in the stack need to handle ipv4/ipv6
(with scope) and port strings conversion to sockaddr.
Add a helper that takes either AF_INET, AF_INET6 or
AF_UNSPEC (for wildcard) to centralize this handling.

Suggested-by: Christoph Hellwig <h...@lst.de>
Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
 include/linux/inet.h |  6 
 net/core/utils.c | 91 
 2 files changed, 97 insertions(+)

diff --git a/include/linux/inet.h b/include/linux/inet.h
index 4cca05c9678e..636ebe87e6f8 100644
--- a/include/linux/inet.h
+++ b/include/linux/inet.h
@@ -43,6 +43,8 @@
 #define _LINUX_INET_H
 
 #include 
+#include 
+#include 
 
 /*
  * These mimic similar macros defined in user-space for inet_ntop(3).
@@ -54,4 +56,8 @@
 extern __be32 in_aton(const char *str);
 extern int in4_pton(const char *src, int srclen, u8 *dst, int delim, const 
char **end);
 extern int in6_pton(const char *src, int srclen, u8 *dst, int delim, const 
char **end);
+
+extern int inet_pton_with_scope(struct net *net, unsigned short af,
+   const char *src, const char *port, struct sockaddr_storage 
*addr);
+
 #endif /* _LINUX_INET_H */
diff --git a/net/core/utils.c b/net/core/utils.c
index 6592d7bbed39..8f15d016c64a 100644
--- a/net/core/utils.c
+++ b/net/core/utils.c
@@ -26,9 +26,11 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -300,6 +302,95 @@ int in6_pton(const char *src, int srclen,
 }
 EXPORT_SYMBOL(in6_pton);
 
+/**
+ * inet_pton_with_scope - convert an IPv4/IPv6 and port to socket address
+ * @net: net namespace (used for scope handling)
+ * @af: address family, AF_INET, AF_INET6 or AF_UNSPEC for either
+ * @src: the start of the address string
+ * @port: the start of the port string (or NULL for none)
+ * @addr: output socket address
+ *
+ * Return zero on success, return errno when any error occurs.
+ */
+int inet_pton_with_scope(struct net *net, __kernel_sa_family_t af,
+   const char *src, const char *port, struct sockaddr_storage 
*addr)
+{
+   struct sockaddr_in *addr4;
+   struct sockaddr_in6 *addr6;
+   const char *scope_delim;
+   bool unspec = false;
+   int srclen = strlen(src);
+   u16 port_num;
+
+   if (port) {
+   if (kstrtou16(port, 0, &port_num)) {
+   pr_err("failed port_num %s\n", port);
+   return -EINVAL;
+   }
+   } else {
+   port_num = 0;
+   }
+
+   switch (af) {
+   case AF_UNSPEC:
+   unspec = true;
+   /* FALLTHRU */
+   case AF_INET:
+   if (srclen <= INET_ADDRSTRLEN) {
+   addr4 = (struct sockaddr_in *)addr;
+   if (in4_pton(src, srclen, (u8 *)&addr4->sin_addr.s_addr,
+'\n', NULL) > 0) {
+   addr4->sin_family = AF_INET;
+   addr4->sin_port = htons(port_num);
+   return 0;
+   }
+   pr_err("failed in4_pton %s\n", src);
+   }
+   if (!unspec)
+   break;
+   case AF_INET6:
+   if (srclen <= INET6_ADDRSTRLEN) {
+   addr6 = (struct sockaddr_in6 *)addr;
+   if (in6_pton(src, srclen, (u8 *)&addr6->sin6_addr.s6_addr,
+'%', &scope_delim) == 0) {
+   pr_err("failed in6_pton %s\n", src);
+   return -EINVAL;
+   }
+
+   if (ipv6_addr_type(&addr6->sin6_addr) & IPV6_ADDR_LINKLOCAL &&
+   src + srclen != scope_delim && *scope_delim == '%') {
+   struct net_device *dev;
+   char scope_id[16];
+   size_t scope_len = min_t(size_t, sizeof scope_id,
+src + srclen - scope_delim - 1);
+
+   memcpy(scope_id, scope_delim + 1, scope_len);
+   scope_id[scope_len] = '\0';
+
+   dev = dev_get_by_name(net, scope_id);
+   if (dev) {
+   addr6->sin6_scope_id = dev->ifindex;
+   dev_put(dev);
+   } else if (kstrtouint(scope_id, 0, &addr6->sin6_scope_id)) {
+   pr_err("failed dev_get_by_name %s\n", scope_id);
+   return -EINVAL;
+   }
+   }
+
+   addr6->sin6_family = AF_INET6;
+   addr6->sin6_port 

[PATCH rfc 2/4] nvmet-rdma: use generic inet_pton_with_scope

2017-02-16 Thread Sagi Grimberg
Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
 drivers/nvme/target/rdma.c | 42 +-
 1 file changed, 29 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 8c3760a78ac0..298db68f2db9 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -1411,12 +1411,16 @@ static int nvmet_rdma_add_port(struct nvmet_port *port)
 static int nvmet_rdma_add_port(struct nvmet_port *port)
 {
struct rdma_cm_id *cm_id;
-   struct sockaddr_in addr_in;
-   u16 port_in;
+   struct sockaddr_storage addr = { };
+   __kernel_sa_family_t af;
int ret;
 
switch (port->disc_addr.adrfam) {
case NVMF_ADDR_FAMILY_IP4:
+   af = AF_INET;
+   break;
+   case NVMF_ADDR_FAMILY_IP6:
+   af = AF_INET6;
break;
default:
pr_err("address family %d not supported\n",
@@ -1424,13 +1428,13 @@ static int nvmet_rdma_add_port(struct nvmet_port *port)
return -EINVAL;
}
 
-   ret = kstrtou16(port->disc_addr.trsvcid, 0, &port_in);
-   if (ret)
+   ret = inet_pton_with_scope(&init_net, af, port->disc_addr.traddr,
+   port->disc_addr.trsvcid, &addr);
+   if (ret) {
+   pr_err("malformed ip/port passed: %s:%s\n",
+   port->disc_addr.traddr, port->disc_addr.trsvcid);
return ret;
-
-   addr_in.sin_family = AF_INET;
-   addr_in.sin_addr.s_addr = in_aton(port->disc_addr.traddr);
-   addr_in.sin_port = htons(port_in);
+   }
 
cm_id = rdma_create_id(&init_net, nvmet_rdma_cm_handler, port,
RDMA_PS_TCP, IB_QPT_RC);
@@ -1439,20 +1443,32 @@ static int nvmet_rdma_add_port(struct nvmet_port *port)
return PTR_ERR(cm_id);
}
 
-   ret = rdma_bind_addr(cm_id, (struct sockaddr *)&addr_in);
+   /*
+* Allow both IPv4 and IPv6 sockets to bind a single port
+* at the same time.
+*/
+   ret = rdma_set_afonly(cm_id, 1);
+   if (ret) {
+   pr_err("rdma_set_afonly failed (%d)\n", ret);
+   goto out_destroy_id;
+   }
+
+   ret = rdma_bind_addr(cm_id, (struct sockaddr *)&addr);
if (ret) {
-   pr_err("binding CM ID to %pISpc failed (%d)\n", &addr_in, ret);
+   pr_err("binding CM ID to %pISpcs failed (%d)\n",
+   (struct sockaddr *)&addr, ret);
goto out_destroy_id;
}
 
ret = rdma_listen(cm_id, 128);
if (ret) {
-   pr_err("listening to %pISpc failed (%d)\n", _in, ret);
+   pr_err("listening to %pISpcs failed (%d)\n",
+   (struct sockaddr *), ret);
goto out_destroy_id;
}
 
-   pr_info("enabling port %d (%pISpc)\n",
-   le16_to_cpu(port->disc_addr.portid), &addr_in);
+   pr_info("enabling port %d (%pISpcs)\n",
+   le16_to_cpu(port->disc_addr.portid), (struct sockaddr *)&addr);
port->priv = cm_id;
return 0;
 
-- 
2.7.4



Re: [PATCH 9/9] treewide: Inline ib_dma_map_*() functions

2017-01-12 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [RFC v2 07/12] qedr: Add support for memory registeration verbs

2016-09-21 Thread Sagi Grimberg



+static int qedr_set_page(struct ib_mr *ibmr, u64 addr)
+{
+   struct qedr_mr *mr = get_qedr_mr(ibmr);
+   struct qedr_pbl *pbl_table;
+   struct regpair *pbe;
+   u32 pbes_in_page;
+
+   if (unlikely(mr->npages == mr->info.pbl_info.num_pbes)) {
+   DP_ERR(mr->dev, "qedr_set_page failes when %d\n", mr->npages);
+   return -ENOMEM;
+   }
+
+   DP_VERBOSE(mr->dev, QEDR_MSG_MR, "qedr_set_page pages[%d] = 0x%llx\n",
+  mr->npages, addr);
+
+   pbes_in_page = mr->info.pbl_info.pbl_size / sizeof(u64);
+   pbl_table = mr->info.pbl_table + (mr->npages / pbes_in_page);
+   pbe = (struct regpair *)pbl_table->va;
+   pbe +=  mr->npages % pbes_in_page;
+   pbe->lo = cpu_to_le32((u32)addr);
+   pbe->hi = cpu_to_le32((u32)upper_32_bits(addr));
+
+   mr->npages++;
+
+   return 0;
+}


Looks better.

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [RFC 07/11] Add support for memory registeration verbs

2016-09-14 Thread Sagi Grimberg



+struct qedr_mr *__qedr_alloc_mr(struct ib_pd *ibpd, int max_page_list_len)
+{
+   struct qedr_pd *pd = get_qedr_pd(ibpd);
+   struct qedr_dev *dev = get_qedr_dev(ibpd->device);
+   struct qedr_mr *mr;
+   int rc = -ENOMEM;
+
+   DP_VERBOSE(dev, QEDR_MSG_MR,
+  "qedr_alloc_frmr pd = %d max_page_list_len= %d\n", pd-
pd_id,
+  max_page_list_len);
+
+   mr = kzalloc(sizeof(*mr), GFP_KERNEL);
+   if (!mr)
+   return ERR_PTR(rc);
+
+   mr->dev = dev;
+   mr->type = QEDR_MR_FRMR;
+
+   rc = init_mr_info(dev, &mr->info, max_page_list_len, 1);
+   if (rc)
+   goto err0;
+
+   rc = dev->ops->rdma_alloc_tid(dev->rdma_ctx, &mr->hw_mr.itid);
+   if (rc) {
+   DP_ERR(dev, "roce alloc tid returned an error %d\n", rc);
+   goto err0;
+   }
+
+   /* Index only, 18 bit long, lkey = itid << 8 | key */
+   mr->hw_mr.tid_type = QED_RDMA_TID_FMR;
+   mr->hw_mr.key = 0;
+   mr->hw_mr.pd = pd->pd_id;


Do you have a real MR<->PD association in HW? If so, can you point me to the
code that binds it? If not, any reason not to expose the local_dma_lkey?


Yes, we send the pd id to the FW in function qed_rdma_register_tid.


Right, thanks.


In any case, if we didn't have the association in HW,
wouldn't the local_dma_lkey be relevant only to a dma_mr? (the code snippet
above refers to FMR)


I was just sticking to a location in the code where you associate
MR<->PD...

The local_dma_lkey is a special key that spans the entire memory space
and, unlike the notorious dma_mr, it's not associated with a PD.

See the code in ib_alloc_pd(), if the device does not support a single
device local dma lkey, the core allocates a dma mr associated with
the pd. If your device has such a key, you can save a dma mr allocation
for each pd in the system.
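
Roughly, the core side does something like this (a simplified sketch of
ib_alloc_pd() from memory, not the verbatim upstream code):

#include <rdma/ib_verbs.h>

/* simplified sketch of the ib_alloc_pd() behavior described above */
static struct ib_pd *example_alloc_pd(struct ib_device *device)
{
	struct ib_pd *pd;

	pd = device->alloc_pd(device, NULL, NULL);
	if (IS_ERR(pd))
		return pd;
	pd->device = device;

	if (device->attrs.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY) {
		/* one device-wide lkey, no dma mr needed per PD */
		pd->local_dma_lkey = device->local_dma_lkey;
	} else {
		/* no such key: allocate a dma mr for this PD */
		struct ib_mr *mr;

		mr = device->get_dma_mr(pd, IB_ACCESS_LOCAL_WRITE);
		if (IS_ERR(mr)) {
			device->dealloc_pd(pd);
			return ERR_CAST(mr);
		}
		pd->__internal_mr = mr;
		pd->local_dma_lkey = mr->lkey;
	}
	return pd;
}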


Re: [RFC 00/11] QLogic RDMA Driver (qedr) RFC

2016-09-14 Thread Sagi Grimberg

> SRQ is not part of the RFC (but we do have it and NVMF was tested
> with it).


Nice, I have plans to make SRQs better usable for our ULPs so it'd be
good to have it.


That’s good to know. Are there plans on implementing XRC?

Right now it looks like none of the ULPs make use of it.


The problem with XRC is that it needs to be reflected on the
wire protocol for it to actually be useful for our ULPs. Not
sure if its worth the effort...


Re: [RFC 07/11] Add support for memory registeration verbs

2016-09-13 Thread Sagi Grimberg



+struct qedr_mr *__qedr_alloc_mr(struct ib_pd *ibpd, int max_page_list_len)
+{
+   struct qedr_pd *pd = get_qedr_pd(ibpd);
+   struct qedr_dev *dev = get_qedr_dev(ibpd->device);
+   struct qedr_mr *mr;
+   int rc = -ENOMEM;
+
+   DP_VERBOSE(dev, QEDR_MSG_MR,
+  "qedr_alloc_frmr pd = %d max_page_list_len= %d\n", pd->pd_id,
+  max_page_list_len);
+
+   mr = kzalloc(sizeof(*mr), GFP_KERNEL);
+   if (!mr)
+   return ERR_PTR(rc);
+
+   mr->dev = dev;
+   mr->type = QEDR_MR_FRMR;
+
+   rc = init_mr_info(dev, &mr->info, max_page_list_len, 1);
+   if (rc)
+   goto err0;
+
+   rc = dev->ops->rdma_alloc_tid(dev->rdma_ctx, &mr->hw_mr.itid);
+   if (rc) {
+   DP_ERR(dev, "roce alloc tid returned an error %d\n", rc);
+   goto err0;
+   }
+
+   /* Index only, 18 bit long, lkey = itid << 8 | key */
+   mr->hw_mr.tid_type = QED_RDMA_TID_FMR;
+   mr->hw_mr.key = 0;
+   mr->hw_mr.pd = pd->pd_id;


Do you have a real MR<->PD association in HW? If so, can you point
me to the code that binds it? If not, any reason not to expose
the local_dma_lkey?


+struct ib_mr *qedr_get_dma_mr(struct ib_pd *ibpd, int acc)
+{
+   struct qedr_dev *dev = get_qedr_dev(ibpd->device);
+   struct qedr_pd *pd = get_qedr_pd(ibpd);
+   struct qedr_mr *mr;
+   int rc;
+
+   if (acc & IB_ACCESS_MW_BIND) {
+   DP_ERR(dev, "Unsupported access flags received for dma mr\n");
+   return ERR_PTR(-EINVAL);
+   }


This check looks like it really belongs in the core, it would help
everyone if you move it...

Although I know Christoph is trying to get rid of this API altogether...


Re: [RFC 08/11] Add support for data path

2016-09-13 Thread Sagi Grimberg




+   pbe = (struct regpair *)pbl_table->va;
+   num_pbes = 0;
+
+   for (i = 0; i < mr->npages &&
+(total_num_pbes != mr->info.pbl_info.num_pbes); i++) {
+   u64 buf_addr = mr->pages[i];
+
+   pbe->lo = cpu_to_le32((u32)buf_addr);
+   pbe->hi = cpu_to_le32((u32)upper_32_bits(buf_addr));


That's a shame... you could have easily set the buf_addr correctly
in qedr_set_page...

I think you could have also set the pbe directly from set_page if you
have access to pbl_table from your mr context
(and if I understand correctly I think you do, mr->info.pbl_table)...


Re: [RFC 03/11] Add support for RoCE HW init

2016-09-13 Thread Sagi Grimberg



+   dev->max_sge = min_t(u32, RDMA_MAX_SGE_PER_SQ_WQE,
+RDMA_MAX_SGE_PER_RQ_WQE);


Our kernel target mode consumers sort of rely on max_sge_rd, you need
to make sure to set it too.
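
Something along these lines wherever qedr fills in struct ib_device_attr
(hand-waved; assuming the same limit applies to read SGEs):

	/* kernel target mode ULPs (iSER/SRP/NVMe-oF) size RDMA READs by this */
	attr->max_sge_rd = dev->max_sge;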


Re: [RFC 07/11] Add support for memory registeration verbs

2016-09-13 Thread Sagi Grimberg



+static inline struct qedr_ah *get_qedr_ah(struct ib_ah *ibah)
+{
+   return container_of(ibah, struct qedr_ah, ibah);
+}


Little surprising to find that here... how is the ah related
to this patch?


Re: [RFC 00/11] QLogic RDMA Driver (qedr) RFC

2016-09-13 Thread Sagi Grimberg

Hey Ram and Co,


This series introduces RoCE RDMA driver for the 579xx RDMA products by Qlogic.
The RDMA support is added as an additional loadable module (qedr) over the 
Ethernet qede driver.
The qedr module will support both RoCE and iWarp, although this series only 
adds RoCE support.
The qed and qede drivers are enhanced with functionality required for RDMA 
support.

Any review/comment is appreciated.


Was this driver tested with any of our kernel consumers (nfs/iser/nvmf)?

If so, are there known issues?

Thanks,
Sagi.


Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-10 Thread Sagi Grimberg



This is also very interesting for storage targets, which face the same
issue.  SCST has a mode where it caches some fully constructed SGLs,
which is probably very similar to what NICs want to do.


I think a cached allocator for page sets + the scatterlists that
describe these page sets would not only be useful for SCSI target
implementations but also for the Linux SCSI initiator. Today the scsi-mq
code reserves space in each scsi_cmnd for a scatterlist of
SCSI_MAX_SG_SEGMENTS. If scatterlists would be cached together with page
sets less memory would be needed per scsi_cmnd.


If we go down this road how about also attaching some driver opaques
to the page sets?

I know of some drivers that can make good use of those ;)


Re: [PATCH] net/9p: convert to new CQ API

2016-02-28 Thread Sagi Grimberg



Trivial conversion to the new RDMA CQ API.


Looks nice and simple :)

But I think that the fact that CQ processing is now
done in soft-IRQ (which is an improvement!) needs to
be documented.

Other than that, looks great

Reviewed-by: Sagi Grimberg <sa...@mellanox.com>

P.S.
I was also confused in the past about 9p and if anyone
actually uses it...


Re: [PATCH 02/25] IB/mthca, net/mlx4: remove counting semaphores

2015-10-28 Thread Sagi Grimberg

Hi Arnd,


Since we want to make counting semaphores go away,


Why do we want to make counting semaphores go away? completely?
or just for binary use cases?

I have a use case in iser target code where a counting semaphore is the
best suited synchronizing mechanism.

I have a single thread handling connect requests (one at a time) while
connect requests are event driven and come asynchronously. This is
why I use a queue and a counting semaphore to handle this situation.
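
Roughly, the pattern is (a simplified sketch, not the actual isert code;
handle_connect() is a made-up stand-in for the real accept work):

#include <linux/kthread.h>
#include <linux/list.h>
#include <linux/semaphore.h>
#include <linux/spinlock.h>
#include <rdma/rdma_cm.h>

struct conn_req {
	struct list_head node;
	struct rdma_cm_id *cm_id;
};

static LIST_HEAD(pending_conns);
static DEFINE_SPINLOCK(pending_lock);
static struct semaphore pending_sem = __SEMAPHORE_INITIALIZER(pending_sem, 0);

static void handle_connect(struct conn_req *req);	/* hypothetical */

/* cm event handler context: queue the request and kick the accept thread */
static void queue_connect_req(struct conn_req *req)
{
	spin_lock(&pending_lock);
	list_add_tail(&req->node, &pending_conns);
	spin_unlock(&pending_lock);
	up(&pending_sem);		/* one up() per queued connect request */
}

/* single thread: handle one connect request at a time */
static int accept_thread(void *arg)
{
	struct conn_req *req;

	while (!kthread_should_stop()) {
		if (down_interruptible(&pending_sem))
			continue;

		spin_lock(&pending_lock);
		req = list_first_entry_or_null(&pending_conns,
					       struct conn_req, node);
		if (req)
			list_del(&req->node);
		spin_unlock(&pending_lock);

		if (req)
			handle_connect(req);
	}
	return 0;
}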

I'd need to come up with a new strategy to handle this without counting
semaphores, and I'm not entirely sure it would be simpler.


this patch replaces the semaphore counting the event-driven
commands with an open-coded wait-queue, which should
be an equivalent transformation of the code, although
it does not make it any nicer.

As far as I can tell, there is a preexisting race condition
regarding the cmd->use_events flag, which is not protected
by any lock. When this flag is toggled while another command
is being started, that command gets stuck until the mode is
toggled back.

A better solution that would solve the race condition and
at the same time improve the code readability would create
a new locking primitive that replaces both semaphores, like

static int __mlx4_use_events(struct mlx4_cmd *cmd)
{
int ret = -EAGAIN;
spin_lock(&cmd->lock);
if (cmd->use_events && cmd->commands < cmd->max_commands) {
cmd->commands++;
ret = 1;
} else if (!cmd->use_events && cmd->commands == 0) {
cmd->commands = 1;
ret = 0;
}
spin_unlock(&cmd->lock);
return ret;
}

static bool mlx4_use_events(struct mlx4_cmd *cmd)
{
int ret;
wait_event(cmd->events_wq, (ret = __mlx4_use_events(cmd)) >= 0);
return ret;
}

Cc: Roland Dreier 
Cc: Eli Cohen 
Cc: Yevgeny Petrilin 
Cc: netdev@vger.kernel.org
Cc: linux-r...@vger.kernel.org
Signed-off-by: Arnd Bergmann 

Conflicts:

drivers/net/mlx4/cmd.c
drivers/net/mlx4/mlx4.h
---
  drivers/infiniband/hw/mthca/mthca_cmd.c   | 12 
  drivers/infiniband/hw/mthca/mthca_dev.h   |  3 ++-
  drivers/net/ethernet/mellanox/mlx4/cmd.c  | 12 
  drivers/net/ethernet/mellanox/mlx4/mlx4.h |  3 ++-
  4 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c
index 9d3e5c1ac60e..aad1852e8e10 100644
--- a/drivers/infiniband/hw/mthca/mthca_cmd.c
+++ b/drivers/infiniband/hw/mthca/mthca_cmd.c
@@ -417,7 +417,8 @@ static int mthca_cmd_wait(struct mthca_dev *dev,
int err = 0;
struct mthca_cmd_context *context;

-   down(&dev->cmd.event_sem);
+   wait_event(dev->cmd.event_wait,
+  atomic_add_unless(&dev->cmd.commands, -1, 0));

spin_lock(>cmd.context_lock);
BUG_ON(dev->cmd.free_head < 0);
@@ -459,7 +460,8 @@ out:
dev->cmd.free_head = context - dev->cmd.context;
spin_unlock(>cmd.context_lock);

-   up(&dev->cmd.event_sem);
+   atomic_inc(&dev->cmd.commands);
+   wake_up(&dev->cmd.event_wait);
return err;
  }

@@ -571,7 +573,8 @@ int mthca_cmd_use_events(struct mthca_dev *dev)
dev->cmd.context[dev->cmd.max_cmds - 1].next = -1;
dev->cmd.free_head = 0;

-   sema_init(&dev->cmd.event_sem, dev->cmd.max_cmds);
+   init_waitqueue_head(&dev->cmd.event_wait);
+   atomic_set(&dev->cmd.commands, dev->cmd.max_cmds);
spin_lock_init(>cmd.context_lock);

for (dev->cmd.token_mask = 1;
@@ -597,7 +600,8 @@ void mthca_cmd_use_polling(struct mthca_dev *dev)
dev->cmd.flags &= ~MTHCA_CMD_USE_EVENTS;

for (i = 0; i < dev->cmd.max_cmds; ++i)
-   down(&dev->cmd.event_sem);
+   wait_event(dev->cmd.event_wait,
+  atomic_add_unless(&dev->cmd.commands, -1, 0));

kfree(dev->cmd.context);

diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h
index 7e6a6d64ad4e..3055f5c12ac8 100644
--- a/drivers/infiniband/hw/mthca/mthca_dev.h
+++ b/drivers/infiniband/hw/mthca/mthca_dev.h
@@ -121,7 +121,8 @@ struct mthca_cmd {
struct pci_pool  *pool;
struct mutex  hcr_mutex;
struct semaphore  poll_sem;
-   struct semaphore  event_sem;
+   wait_queue_head_t event_wait;
+   atomic_t  commands;
int   max_cmds;
spinlock_tcontext_lock;
int   free_head;
diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index 78f5a1a0b8c8..60134a4245ef 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -273,7 +273,8 @@ static int mlx4_cmd_wait(struct mlx4_dev *dev, u64 in_param, u64 *out_param,
struct 

Re: [PATCH 00/15] RDS: connection scalability and performance improvements

2015-09-20 Thread Sagi Grimberg

On 9/20/2015 2:04 AM, Santosh Shilimkar wrote:

This series addresses RDS connection bottlenecks on massive workloads and
improve the RDMA performance almost by 3X. RDS TCP also gets a small gain
of about 12%.

RDS is being used in massive systems with high scalability where several
hundred thousand end points and tens of thousands of local processes
are operating in tens of thousand sockets. Being RC(reliable connection),
socket bind and release happens very often and any inefficiencies in
bind hash look ups hurts the overall system performance. The RDS bind hash-table
uses a global spin-lock which is the biggest bottleneck. To make matters worse,
it uses rcu inside global lock for hash buckets.
This is being addressed by simply using per bucket rw lock which makes the
locking simple and very efficient. The hash table size is also scaled up
accordingly.

For RDS RDMA improvement, the completion handling is revamped so that we
can do batch completions. Both send and receive completion handlers are
split logically to achieve the same. RDS 8K messages being one of the
key usecase, mr pool is adapted to have the 8K mrs along with default 1M
mrs. And while doing this, few fixes and couple of bottlenecks seen with
rds_sendmsg() are addressed.


Hi Santosh,

I think you can get a more effective code review if you CC the
linux-rdma mailing list.

Sagi.