Re: wake_wide mechanism clarification

2017-08-02 Thread Michael Wang
Hi, Joel

On 07/29/2017 10:13 AM, Joel Fernandes wrote:
> +Michael Wang on his current email address (old one bounced). (my
> reply was to Mike Galbraith but I also meant to CC Michael Wang for
> the discussion). Thanks

Just back from vacation and saw this long long discussion...

I think guys explained well on the idea, wake_wide() just try to filter
out the cases that may congest the waker's domain, as much as possible
without side effect (ideally).

The factor at very beginning is just a static number which picked by
enormous times of practice testing, Peter Zijlstr suggest we add the
domain size and make it a flexible factor, which by practice not that bad.

So the simple answer is we use the llc_size since no better option at
that time :-)

But things changing very fast and new feature can introduce new cases,
I'm pretty sure if we redo the testing the results will be very different,
however, the idea itself still make sense to me, at least on theory.

Recently I'm also thinking about the scheduler issue, cfs try to find out
general solution for all these cases and the best answer is obviously, all
the cases will suffer some damage and scheduler itself bloated to achieve
the goal 'taking care of all'.

So in order to achieve the maximum performance of particular workload, some
user defined scheduler would be an interesting idea :-P

Regards,
Michael Wang



> 
> On Sat, Jul 29, 2017 at 1:01 AM, Joel Fernandes <joe...@google.com> wrote:
>> Hi Mike,
>>
>> I have take spent some time understanding the email thread and
>> previous discussions. Unfortunately the second condition we are
>> checking for in the wake_wide still didn't make sense to me (mentioned
>> below) :-(
>>
>> On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith
>> <umgwanakikb...@gmail.com> wrote:
>>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
>>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>>>>
>>>>> That makes sense that we multiply slave's flips by a factor because
>>>>> its low, but I still didn't get why the factor is chosen to be
>>>>> llc_size instead of something else for the multiplication with slave
>>>>> (slave * factor).
>>>
>>>> Yeah I don't know why llc_size was chosen...
>>>
>>> static void update_top_cache_domain(int cpu)
>>> {
>>> struct sched_domain_shared *sds = NULL;
>>> struct sched_domain *sd;
>>> int id = cpu;
>>> int size = 1;
>>>
>>> sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>>> if (sd) {
>>> id = cpumask_first(sched_domain_span(sd));
>>> size = cpumask_weight(sched_domain_span(sd));
>>> sds = sd->shared;
>>> }
>>>
>>> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>>> per_cpu(sd_llc_size, cpu) = size;
>>>
>>> The goal of wake wide was to approximate when pulling would be a futile
>>> consolidation effort and counterproductive to scaling.  'course with
>>> ever increasing socket size, any 1:N waker is ever more likely to run
>>> out of CPU for its one and only self (slamming into scaling wall)
>>> before it needing to turn its minions loose to conquer the world.
>>
>> Actually the original question was why do we have the second condition
>> as "master < slave * factor", instead of "master < factor". that's
>> what didn't make sense to me. Why don't we return 0 from wake_wide if
>> master < factor ?
>>
>> Infact, as the factor is set to the llc_size, I think the condition
>> that makes sense to me is:
>>
>> if ((master + slave) < llc_size)
>>   return 0;
>>
>> In other words, if the master flips and the slave flips are totally
>> higher than the llc_size, then we are most likely waking up too many
>> tasks as affine and should then switch to wide to prevent overloading.
>>
>> Digging further into the original patch from Michael Wang (I also CC'd
>> him), this was the code (before you had changed it to master/slave):
>>
>> wakee->nr_wakee_switch > factor &&
>> waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)
>>
>> To explain the second condition above, Michael Wang said the following in [1]
>>
>> "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple
>> tasks rely on it, then waker's higher latency will damage all of them, pull
>> wakee seems to be a bad deal."
>>
>> Again I didn't f

Re: wake_wide mechanism clarification

2017-08-02 Thread Michael Wang
Hi, Joel

On 07/29/2017 10:13 AM, Joel Fernandes wrote:
> +Michael Wang on his current email address (old one bounced). (my
> reply was to Mike Galbraith but I also meant to CC Michael Wang for
> the discussion). Thanks

Just back from vacation and saw this long long discussion...

I think guys explained well on the idea, wake_wide() just try to filter
out the cases that may congest the waker's domain, as much as possible
without side effect (ideally).

The factor at very beginning is just a static number which picked by
enormous times of practice testing, Peter Zijlstr suggest we add the
domain size and make it a flexible factor, which by practice not that bad.

So the simple answer is we use the llc_size since no better option at
that time :-)

But things changing very fast and new feature can introduce new cases,
I'm pretty sure if we redo the testing the results will be very different,
however, the idea itself still make sense to me, at least on theory.

Recently I'm also thinking about the scheduler issue, cfs try to find out
general solution for all these cases and the best answer is obviously, all
the cases will suffer some damage and scheduler itself bloated to achieve
the goal 'taking care of all'.

So in order to achieve the maximum performance of particular workload, some
user defined scheduler would be an interesting idea :-P

Regards,
Michael Wang



> 
> On Sat, Jul 29, 2017 at 1:01 AM, Joel Fernandes  wrote:
>> Hi Mike,
>>
>> I have take spent some time understanding the email thread and
>> previous discussions. Unfortunately the second condition we are
>> checking for in the wake_wide still didn't make sense to me (mentioned
>> below) :-(
>>
>> On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith
>>  wrote:
>>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
>>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>>>>
>>>>> That makes sense that we multiply slave's flips by a factor because
>>>>> its low, but I still didn't get why the factor is chosen to be
>>>>> llc_size instead of something else for the multiplication with slave
>>>>> (slave * factor).
>>>
>>>> Yeah I don't know why llc_size was chosen...
>>>
>>> static void update_top_cache_domain(int cpu)
>>> {
>>> struct sched_domain_shared *sds = NULL;
>>> struct sched_domain *sd;
>>> int id = cpu;
>>> int size = 1;
>>>
>>> sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>>> if (sd) {
>>> id = cpumask_first(sched_domain_span(sd));
>>> size = cpumask_weight(sched_domain_span(sd));
>>> sds = sd->shared;
>>> }
>>>
>>> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>>> per_cpu(sd_llc_size, cpu) = size;
>>>
>>> The goal of wake wide was to approximate when pulling would be a futile
>>> consolidation effort and counterproductive to scaling.  'course with
>>> ever increasing socket size, any 1:N waker is ever more likely to run
>>> out of CPU for its one and only self (slamming into scaling wall)
>>> before it needing to turn its minions loose to conquer the world.
>>
>> Actually the original question was why do we have the second condition
>> as "master < slave * factor", instead of "master < factor". that's
>> what didn't make sense to me. Why don't we return 0 from wake_wide if
>> master < factor ?
>>
>> Infact, as the factor is set to the llc_size, I think the condition
>> that makes sense to me is:
>>
>> if ((master + slave) < llc_size)
>>   return 0;
>>
>> In other words, if the master flips and the slave flips are totally
>> higher than the llc_size, then we are most likely waking up too many
>> tasks as affine and should then switch to wide to prevent overloading.
>>
>> Digging further into the original patch from Michael Wang (I also CC'd
>> him), this was the code (before you had changed it to master/slave):
>>
>> wakee->nr_wakee_switch > factor &&
>> waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)
>>
>> To explain the second condition above, Michael Wang said the following in [1]
>>
>> "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple
>> tasks rely on it, then waker's higher latency will damage all of them, pull
>> wakee seems to be a bad deal."
>>
>> Again I didn't follow why the second condition couldn't just be:
>> waker-&

[ Linux 4.4 stable ] missing 'printk: set may_schedule for some of console_trylock() callers'

2017-07-25 Thread Michael Wang
Hi, greg k-h

During our testing with 4.4.73 we got soft lockup like:

 NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [systemd-udevd:856]
 ...
 Call Trace:
  [] vprintk_emit+0x319/0x4a0
  [] printk_emit+0x33/0x3b
  [] ? simple_strtoull+0x2c/0x50 
  [] devkmsg_write+0xaa/0x100
  [] ? vprintk+0x30/0x30
  [] do_readv_writev+0x1c2/0x270 
  [] ? kmem_cache_free+0x7d/0x1a0
  [] vfs_writev+0x39/0x50
  [] SyS_writev+0x4a/0xd0
  [] entry_SYSCALL_64_fastpath+0x12/0x6a

Currently in 4.4 the console_unlock() called by vprintk_emit() is with
preemption disabled, so the cond_resched is not working, and soft lockup
appear if it take too much time on writing data into every console.

We found the upstream patch:
  commit 6b97a20d3a79 printk: set may_schedule for some of console_trylock() 
callers

which should have addressed this issue, but not included in the latest 4.4.78 
stable
yet, is there any plan on backport it in future?

Regards,
Michael Wang


[ Linux 4.4 stable ] missing 'printk: set may_schedule for some of console_trylock() callers'

2017-07-25 Thread Michael Wang
Hi, greg k-h

During our testing with 4.4.73 we got soft lockup like:

 NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [systemd-udevd:856]
 ...
 Call Trace:
  [] vprintk_emit+0x319/0x4a0
  [] printk_emit+0x33/0x3b
  [] ? simple_strtoull+0x2c/0x50 
  [] devkmsg_write+0xaa/0x100
  [] ? vprintk+0x30/0x30
  [] do_readv_writev+0x1c2/0x270 
  [] ? kmem_cache_free+0x7d/0x1a0
  [] vfs_writev+0x39/0x50
  [] SyS_writev+0x4a/0xd0
  [] entry_SYSCALL_64_fastpath+0x12/0x6a

Currently in 4.4 the console_unlock() called by vprintk_emit() is with
preemption disabled, so the cond_resched is not working, and soft lockup
appear if it take too much time on writing data into every console.

We found the upstream patch:
  commit 6b97a20d3a79 printk: set may_schedule for some of console_trylock() 
callers

which should have addressed this issue, but not included in the latest 4.4.78 
stable
yet, is there any plan on backport it in future?

Regards,
Michael Wang


Re: [PATCH] md: return 0 instead of error in rdev_attr_show()

2017-04-11 Thread Michael Wang
We found the upstream fix, sorry for the noise...

Regards,
Michael Wang

On 04/11/2017 12:14 PM, Michael Wang wrote:
> 
> sysfs_kf_read() expect the show() callback return the dumped
> length, while rdev_attr_show() can return the error which lead
> into overflow:
> 
>  BUG: unable to handle kernel paging request at 88040b084000
>  IP: [] __memmove+0x24/0x1a0
>  PGD 1edb067 PUD 1ede067 PMD 406b9a063 PTE 80040b084161
>  Oops: 0003 [#1] SMP 
>  [snip]
>  Call Trace:
>   [] ? sysfs_kf_read+0x80/0xb0
>   [] kernfs_fop_read+0xab/0x160
>   [] __vfs_read+0x28/0xd0
>   [] vfs_read+0x86/0x130
>   [] SyS_read+0x46/0xa0
>   [] entry_SYSCALL_64_fastpath+0x12/0x6a
> 
> Simply return 0 in case of error solved the problem.
> 
> Signed-off-by: Michael Wang <yun.w...@profitbricks.com>
> ---
>  drivers/md/md.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 1db88d7..d46d714 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -3271,10 +3271,9 @@ rdev_attr_show(struct kobject *kobj, struct attribute 
> *attr, char *page)
>   struct rdev_sysfs_entry *entry = container_of(attr, struct 
> rdev_sysfs_entry, attr);
>   struct md_rdev *rdev = container_of(kobj, struct md_rdev, kobj);
>  
> - if (!entry->show)
> - return -EIO;
> - if (!rdev->mddev)
> - return -EBUSY;
> + if (!entry->show || !rdev->mddev)
> + return 0;
> +
>   return entry->show(rdev, page);
>  }
>  
> 


Re: [PATCH] md: return 0 instead of error in rdev_attr_show()

2017-04-11 Thread Michael Wang
We found the upstream fix, sorry for the noise...

Regards,
Michael Wang

On 04/11/2017 12:14 PM, Michael Wang wrote:
> 
> sysfs_kf_read() expect the show() callback return the dumped
> length, while rdev_attr_show() can return the error which lead
> into overflow:
> 
>  BUG: unable to handle kernel paging request at 88040b084000
>  IP: [] __memmove+0x24/0x1a0
>  PGD 1edb067 PUD 1ede067 PMD 406b9a063 PTE 80040b084161
>  Oops: 0003 [#1] SMP 
>  [snip]
>  Call Trace:
>   [] ? sysfs_kf_read+0x80/0xb0
>   [] kernfs_fop_read+0xab/0x160
>   [] __vfs_read+0x28/0xd0
>   [] vfs_read+0x86/0x130
>   [] SyS_read+0x46/0xa0
>   [] entry_SYSCALL_64_fastpath+0x12/0x6a
> 
> Simply return 0 in case of error solved the problem.
> 
> Signed-off-by: Michael Wang 
> ---
>  drivers/md/md.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 1db88d7..d46d714 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -3271,10 +3271,9 @@ rdev_attr_show(struct kobject *kobj, struct attribute 
> *attr, char *page)
>   struct rdev_sysfs_entry *entry = container_of(attr, struct 
> rdev_sysfs_entry, attr);
>   struct md_rdev *rdev = container_of(kobj, struct md_rdev, kobj);
>  
> - if (!entry->show)
> - return -EIO;
> - if (!rdev->mddev)
> - return -EBUSY;
> + if (!entry->show || !rdev->mddev)
> + return 0;
> +
>   return entry->show(rdev, page);
>  }
>  
> 


[PATCH] md: return 0 instead of error in rdev_attr_show()

2017-04-11 Thread Michael Wang

sysfs_kf_read() expect the show() callback return the dumped
length, while rdev_attr_show() can return the error which lead
into overflow:

 BUG: unable to handle kernel paging request at 88040b084000
 IP: [] __memmove+0x24/0x1a0
 PGD 1edb067 PUD 1ede067 PMD 406b9a063 PTE 80040b084161
 Oops: 0003 [#1] SMP 
 [snip]
 Call Trace:
  [] ? sysfs_kf_read+0x80/0xb0
  [] kernfs_fop_read+0xab/0x160
  [] __vfs_read+0x28/0xd0
  [] vfs_read+0x86/0x130
  [] SyS_read+0x46/0xa0
  [] entry_SYSCALL_64_fastpath+0x12/0x6a

Simply return 0 in case of error solved the problem.

Signed-off-by: Michael Wang <yun.w...@profitbricks.com>
---
 drivers/md/md.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1db88d7..d46d714 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -3271,10 +3271,9 @@ rdev_attr_show(struct kobject *kobj, struct attribute 
*attr, char *page)
struct rdev_sysfs_entry *entry = container_of(attr, struct 
rdev_sysfs_entry, attr);
struct md_rdev *rdev = container_of(kobj, struct md_rdev, kobj);
 
-   if (!entry->show)
-   return -EIO;
-   if (!rdev->mddev)
-   return -EBUSY;
+   if (!entry->show || !rdev->mddev)
+   return 0;
+
return entry->show(rdev, page);
 }
 
-- 
2.5.0


[PATCH] md: return 0 instead of error in rdev_attr_show()

2017-04-11 Thread Michael Wang

sysfs_kf_read() expect the show() callback return the dumped
length, while rdev_attr_show() can return the error which lead
into overflow:

 BUG: unable to handle kernel paging request at 88040b084000
 IP: [] __memmove+0x24/0x1a0
 PGD 1edb067 PUD 1ede067 PMD 406b9a063 PTE 80040b084161
 Oops: 0003 [#1] SMP 
 [snip]
 Call Trace:
  [] ? sysfs_kf_read+0x80/0xb0
  [] kernfs_fop_read+0xab/0x160
  [] __vfs_read+0x28/0xd0
  [] vfs_read+0x86/0x130
  [] SyS_read+0x46/0xa0
  [] entry_SYSCALL_64_fastpath+0x12/0x6a

Simply return 0 in case of error solved the problem.

Signed-off-by: Michael Wang 
---
 drivers/md/md.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1db88d7..d46d714 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -3271,10 +3271,9 @@ rdev_attr_show(struct kobject *kobj, struct attribute 
*attr, char *page)
struct rdev_sysfs_entry *entry = container_of(attr, struct 
rdev_sysfs_entry, attr);
struct md_rdev *rdev = container_of(kobj, struct md_rdev, kobj);
 
-   if (!entry->show)
-   return -EIO;
-   if (!rdev->mddev)
-   return -EBUSY;
+   if (!entry->show || !rdev->mddev)
+   return 0;
+
return entry->show(rdev, page);
 }
 
-- 
2.5.0


Re: [RFC PATCH] raid1: reset 'bi_next' before reuse the bio

2017-04-05 Thread Michael Wang


On 04/05/2017 12:17 AM, NeilBrown wrote:
[snip]
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 7d67235..0554110 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>>  /* Don't try recovering from here - just fail it
>>   * ... unless it is the last working device of course */
>>  md_error(mddev, rdev);
>> -if (test_bit(Faulty, >flags))
>> +if (test_bit(Faulty, >flags)) {
>>  /* Don't try to read from here, but make sure
>>   * put_buf does it's thing
>>   */
>>  bio->bi_end_io = end_sync_write;
>> +bio->bi_next = NULL;
>> +}
>>  }
>>  
>>  while(sectors) {
> 
> 
> Ah - I see what is happening now.  I was looking at the vanilla 4.4
> code, which doesn't have the failfast changes.

My bad to forgot mention... yes our md stuff is very much close to the
upstream.

> 
> I don't think your patch is correct though.  We really shouldn't be
> re-using that bio, and setting bi_next to NULL just hides the bug.  It
> doesn't fix it.
> As the rdev is now Faulty, it doesn't make sense for
> sync_request_write() to submit a write request to it.

Make sense, while still have concerns regarding the design:
  * in this case since the read_disk already abandoned, is it fine to
keep r1_bio->read_disk recording the faulty device index?
  * we assign the 'end_sync_write' to the original read bio in this
case, but when is this supposed to be called?

> 
> Can you confirm that this works please.

Yes, it works.

Tested-by: Michael Wang <yun.w...@profitbricks.com>

Regards,
Michael Wang

> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index d2d8b8a5bd56..219f1e1f1d1d 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2180,6 +2180,8 @@ static void sync_request_write(struct mddev *mddev, 
> struct r1bio *r1_bio)
>(i == r1_bio->read_disk ||
> !test_bit(MD_RECOVERY_SYNC, >recovery
>   continue;
> + if (test_bit(Faulty, >mirrors[i].rdev->flags))
> + continue;
>  
>   bio_set_op_attrs(wbio, REQ_OP_WRITE, 0);
>   if (test_bit(FailFast, >mirrors[i].rdev->flags))
> 
> 
> Thanks,
> NeilBrown
> 


Re: [RFC PATCH] raid1: reset 'bi_next' before reuse the bio

2017-04-05 Thread Michael Wang


On 04/05/2017 12:17 AM, NeilBrown wrote:
[snip]
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 7d67235..0554110 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>>  /* Don't try recovering from here - just fail it
>>   * ... unless it is the last working device of course */
>>  md_error(mddev, rdev);
>> -if (test_bit(Faulty, >flags))
>> +if (test_bit(Faulty, >flags)) {
>>  /* Don't try to read from here, but make sure
>>   * put_buf does it's thing
>>   */
>>  bio->bi_end_io = end_sync_write;
>> +bio->bi_next = NULL;
>> +}
>>  }
>>  
>>  while(sectors) {
> 
> 
> Ah - I see what is happening now.  I was looking at the vanilla 4.4
> code, which doesn't have the failfast changes.

My bad to forgot mention... yes our md stuff is very much close to the
upstream.

> 
> I don't think your patch is correct though.  We really shouldn't be
> re-using that bio, and setting bi_next to NULL just hides the bug.  It
> doesn't fix it.
> As the rdev is now Faulty, it doesn't make sense for
> sync_request_write() to submit a write request to it.

Make sense, while still have concerns regarding the design:
  * in this case since the read_disk already abandoned, is it fine to
keep r1_bio->read_disk recording the faulty device index?
  * we assign the 'end_sync_write' to the original read bio in this
    case, but when is this supposed to be called?

> 
> Can you confirm that this works please.

Yes, it works.

Tested-by: Michael Wang 

Regards,
Michael Wang

> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index d2d8b8a5bd56..219f1e1f1d1d 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2180,6 +2180,8 @@ static void sync_request_write(struct mddev *mddev, 
> struct r1bio *r1_bio)
>(i == r1_bio->read_disk ||
> !test_bit(MD_RECOVERY_SYNC, >recovery
>   continue;
> + if (test_bit(Faulty, >mirrors[i].rdev->flags))
> + continue;
>  
>   bio_set_op_attrs(wbio, REQ_OP_WRITE, 0);
>   if (test_bit(FailFast, >mirrors[i].rdev->flags))
> 
> 
> Thanks,
> NeilBrown
> 


[RFC PATCH] raid1: reset 'bi_next' before reuse the bio

2017-04-04 Thread Michael Wang

During the testing we found the sync read bio can go through
path:

  md_do_sync()
sync_request()
  generic_make_request()
blk_queue_bio()
  blk_attempt_plug_merge()
bio->bi_next CHAINED HERE

  ...

  raid1d()
sync_request_write()
  fix_sync_read_error()
if FailFast && Faulty
  bio->bi_end_io = end_sync_write
  generic_make_request()
BUG_ON(bio->bi_next)

This need to meet the conditions:
  * bio once merged
  * read disk have FailFast enabled
  * read disk is Faulty

And since the block layer won't reset the 'bi_next' after bio
is done inside request, we hit the BUG like that.

This patch simply reset the bi_next before we reuse it.

Signed-off-by: Michael Wang <yun.w...@profitbricks.com>
---
 drivers/md/raid1.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7d67235..0554110 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
/* Don't try recovering from here - just fail it
 * ... unless it is the last working device of course */
md_error(mddev, rdev);
-   if (test_bit(Faulty, >flags))
+   if (test_bit(Faulty, >flags)) {
/* Don't try to read from here, but make sure
 * put_buf does it's thing
 */
bio->bi_end_io = end_sync_write;
+   bio->bi_next = NULL;
+   }
}
 
while(sectors) {
-- 
2.5.0


[RFC PATCH] raid1: reset 'bi_next' before reuse the bio

2017-04-04 Thread Michael Wang

During the testing we found the sync read bio can go through
path:

  md_do_sync()
sync_request()
  generic_make_request()
blk_queue_bio()
  blk_attempt_plug_merge()
bio->bi_next CHAINED HERE

  ...

  raid1d()
sync_request_write()
  fix_sync_read_error()
if FailFast && Faulty
  bio->bi_end_io = end_sync_write
  generic_make_request()
BUG_ON(bio->bi_next)

This need to meet the conditions:
  * bio once merged
  * read disk have FailFast enabled
  * read disk is Faulty

And since the block layer won't reset the 'bi_next' after bio
is done inside request, we hit the BUG like that.

This patch simply reset the bi_next before we reuse it.

Signed-off-by: Michael Wang 
---
 drivers/md/raid1.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7d67235..0554110 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
/* Don't try recovering from here - just fail it
 * ... unless it is the last working device of course */
md_error(mddev, rdev);
-   if (test_bit(Faulty, >flags))
+   if (test_bit(Faulty, >flags)) {
/* Don't try to read from here, but make sure
 * put_buf does it's thing
 */
bio->bi_end_io = end_sync_write;
+   bio->bi_next = NULL;
+   }
}
 
while(sectors) {
-- 
2.5.0


Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request

2017-04-04 Thread Michael Wang


On 04/04/2017 02:24 PM, Michael Wang wrote:
> On 04/04/2017 12:23 PM, Michael Wang wrote:
> [snip]
>>> add something like
>>>   if (wbio->bi_next)
>>>  printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n",
>>>   i, r1_bio->read_disk, wbio->bi_end_io);
>>>
>>> that might help narrow down what is happening.
>>
>> Just triggered again in 4.4, dmesg like:
>>
>> [  399.240230] md: super_written gets error=-5
>> [  399.240286] md: super_written gets error=-5
>> [  399.240286] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
>> 204160
>> [  399.240300] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
>> 204160
>> [  399.240312] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
>> 204160
>> [  399.240323] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
>> 204160
>> [  399.240334] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
>> 204160
>> [  399.240341] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
>> 204160
>> [  399.240349] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
>> 204160
>> [  399.240352] bi_next!= NULL i=0 read_disk=0 bi_end_io=end_sync_write 
>> [raid1]
> 
> Is it possible that the fail fast who changed the 'bi_end_io' inside
> fix_sync_read_error() help the used bio pass the check?

Hi, NeilBrown, below patch fixed the issue in our testing, I'll post a md
RFC patch so we can continue the discussion there.

Regards,
Michael Wang

> 
> I'm not sure but if the read bio was supposed to be reused as write
> for fail fast, maybe we should reset it like this?
> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7d67235..0554110 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
> /* Don't try recovering from here - just fail it
>  * ... unless it is the last working device of course */
> md_error(mddev, rdev);
> -   if (test_bit(Faulty, >flags))
> +   if (test_bit(Faulty, >flags)) {
> /* Don't try to read from here, but make sure
>      * put_buf does it's thing
>  */
> bio->bi_end_io = end_sync_write;
> +   bio->bi_next = NULL;
> +   }
> }
>  
> while(sectors) {
> 
> Regards,
> Michael Wang
> 
> 
>> [  399.240363] [ cut here ]
>> [  399.240364] kernel BUG at block/blk-core.c:2147!
>> [  399.240365] invalid opcode:  [#1] SMP 
>> [  399.240378] Modules linked in: ib_srp scsi_transport_srp raid1 md_mod 
>> ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel 
>> udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink iTCO_wdt 
>> iTCO_vendor_support dcdbas dell_smm_hwmon acpi_cpufreq x86_pkg_temp_thermal 
>> tpm_tis coretemp evdev tpm i2c_i801 crct10dif_pclmul serio_raw crc32_pclmul 
>> battery processor acpi_pad button kvm_intel kvm dm_round_robin irqbypass 
>> dm_multipath autofs4 sg sd_mod crc32c_intel ahci libahci psmouse libata 
>> mlx4_core scsi_mod xhci_pci xhci_hcd mlx_compat fan thermal [last unloaded: 
>> scsi_transport_srp]
>> [  399.240380] CPU: 1 PID: 2052 Comm: md0_raid1 Not tainted 
>> 4.4.50-1-pserver+ #26
>> [  399.240381] Hardware name: Dell Inc. Precision Tower 3620/09WH54, BIOS 
>> 1.3.6 05/26/2016
>> [  399.240381] task: 8804031b6200 ti: 8800d72b4000 task.ti: 
>> 8800d72b4000
>> [  399.240385] RIP: 0010:[]  [] 
>> generic_make_request+0x29e/0x2a0
>> [  399.240385] RSP: 0018:8800d72b7d10  EFLAGS: 00010286
>> [  399.240386] RAX: 8804031b6200 RBX: 8800d2577e00 RCX: 
>> 3fff
>> [  399.240387] RDX: c001 RSI: 0001 RDI: 
>> 8800d5e8c1e0
>> [  399.240387] RBP: 8800d72b7d50 R08:  R09: 
>> 003f
>> [  399.240388] R10: 0004 R11: 001db9ac R12: 
>> 
>> [  399.240388] R13: 8800d2748e00 R14: 88040a016400 R15: 
>> 8800d2748e40
>> [  399.240389] FS:  () GS:88041dc4() 
>> knlGS:
>> [  399.240390] CS:  0010 DS:  ES:  CR0: 80050033
>> [  399.240390] CR2: 7fb49246a000 CR3: 00040215c000 CR4: 
>> 003406e0
>> [  399.240391] DR0:  DR1:  DR2: 
>> 
&

Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request

2017-04-04 Thread Michael Wang


On 04/04/2017 02:24 PM, Michael Wang wrote:
> On 04/04/2017 12:23 PM, Michael Wang wrote:
> [snip]
>>> add something like
>>>   if (wbio->bi_next)
>>>  printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n",
>>>   i, r1_bio->read_disk, wbio->bi_end_io);
>>>
>>> that might help narrow down what is happening.
>>
>> Just triggered again in 4.4, dmesg like:
>>
>> [  399.240230] md: super_written gets error=-5
>> [  399.240286] md: super_written gets error=-5
>> [  399.240286] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
>> 204160
>> [  399.240300] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
>> 204160
>> [  399.240312] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
>> 204160
>> [  399.240323] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
>> 204160
>> [  399.240334] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
>> 204160
>> [  399.240341] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
>> 204160
>> [  399.240349] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
>> 204160
>> [  399.240352] bi_next!= NULL i=0 read_disk=0 bi_end_io=end_sync_write 
>> [raid1]
> 
> Is it possible that the fail fast who changed the 'bi_end_io' inside
> fix_sync_read_error() help the used bio pass the check?

Hi, NeilBrown, below patch fixed the issue in our testing, I'll post a md
RFC patch so we can continue the discussion there.

Regards,
Michael Wang

> 
> I'm not sure but if the read bio was supposed to be reused as write
> for fail fast, maybe we should reset it like this?
> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7d67235..0554110 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
> /* Don't try recovering from here - just fail it
>  * ... unless it is the last working device of course */
> md_error(mddev, rdev);
> -   if (test_bit(Faulty, >flags))
> +   if (test_bit(Faulty, >flags)) {
> /* Don't try to read from here, but make sure
>      * put_buf does it's thing
>  */
> bio->bi_end_io = end_sync_write;
> +   bio->bi_next = NULL;
> +   }
> }
>  
> while(sectors) {
> 
> Regards,
> Michael Wang
> 
> 
>> [  399.240363] [ cut here ]
>> [  399.240364] kernel BUG at block/blk-core.c:2147!
>> [  399.240365] invalid opcode:  [#1] SMP 
>> [  399.240378] Modules linked in: ib_srp scsi_transport_srp raid1 md_mod 
>> ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel 
>> udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink iTCO_wdt 
>> iTCO_vendor_support dcdbas dell_smm_hwmon acpi_cpufreq x86_pkg_temp_thermal 
>> tpm_tis coretemp evdev tpm i2c_i801 crct10dif_pclmul serio_raw crc32_pclmul 
>> battery processor acpi_pad button kvm_intel kvm dm_round_robin irqbypass 
>> dm_multipath autofs4 sg sd_mod crc32c_intel ahci libahci psmouse libata 
>> mlx4_core scsi_mod xhci_pci xhci_hcd mlx_compat fan thermal [last unloaded: 
>> scsi_transport_srp]
>> [  399.240380] CPU: 1 PID: 2052 Comm: md0_raid1 Not tainted 
>> 4.4.50-1-pserver+ #26
>> [  399.240381] Hardware name: Dell Inc. Precision Tower 3620/09WH54, BIOS 
>> 1.3.6 05/26/2016
>> [  399.240381] task: 8804031b6200 ti: 8800d72b4000 task.ti: 
>> 8800d72b4000
>> [  399.240385] RIP: 0010:[]  [] 
>> generic_make_request+0x29e/0x2a0
>> [  399.240385] RSP: 0018:8800d72b7d10  EFLAGS: 00010286
>> [  399.240386] RAX: 8804031b6200 RBX: 8800d2577e00 RCX: 
>> 3fff
>> [  399.240387] RDX: c001 RSI: 0001 RDI: 
>> 8800d5e8c1e0
>> [  399.240387] RBP: 8800d72b7d50 R08:  R09: 
>> 003f
>> [  399.240388] R10: 0004 R11: 001db9ac R12: 
>> 
>> [  399.240388] R13: 8800d2748e00 R14: 88040a016400 R15: 
>> 8800d2748e40
>> [  399.240389] FS:  () GS:88041dc4() 
>> knlGS:
>> [  399.240390] CS:  0010 DS:  ES:  CR0: 80050033
>> [  399.240390] CR2: 7fb49246a000 CR3: 00040215c000 CR4: 
>> 003406e0
>> [  399.240391] DR0:  DR1:  DR2: 
>> 
&

Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request

2017-04-04 Thread Michael Wang
On 04/04/2017 12:23 PM, Michael Wang wrote:
[snip]
>> add something like
>>   if (wbio->bi_next)
>>  printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n",
>>   i, r1_bio->read_disk, wbio->bi_end_io);
>>
>> that might help narrow down what is happening.
> 
> Just triggered again in 4.4, dmesg like:
> 
> [  399.240230] md: super_written gets error=-5
> [  399.240286] md: super_written gets error=-5
> [  399.240286] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
> 204160
> [  399.240300] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
> 204160
> [  399.240312] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
> 204160
> [  399.240323] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
> 204160
> [  399.240334] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
> 204160
> [  399.240341] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
> 204160
> [  399.240349] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
> 204160
> [  399.240352] bi_next!= NULL i=0 read_disk=0 bi_end_io=end_sync_write [raid1]

Is it possible that the fail fast who changed the 'bi_end_io' inside
fix_sync_read_error() help the used bio pass the check?

I'm not sure but if the read bio was supposed to be reused as write
for fail fast, maybe we should reset it like this?

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7d67235..0554110 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
/* Don't try recovering from here - just fail it
 * ... unless it is the last working device of course */
md_error(mddev, rdev);
-   if (test_bit(Faulty, >flags))
+   if (test_bit(Faulty, >flags)) {
/* Don't try to read from here, but make sure
 * put_buf does it's thing
 */
bio->bi_end_io = end_sync_write;
+   bio->bi_next = NULL;
+   }
}
 
while(sectors) {

Regards,
Michael Wang


> [  399.240363] [ cut here ]
> [  399.240364] kernel BUG at block/blk-core.c:2147!
> [  399.240365] invalid opcode:  [#1] SMP 
> [  399.240378] Modules linked in: ib_srp scsi_transport_srp raid1 md_mod 
> ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel 
> udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink iTCO_wdt 
> iTCO_vendor_support dcdbas dell_smm_hwmon acpi_cpufreq x86_pkg_temp_thermal 
> tpm_tis coretemp evdev tpm i2c_i801 crct10dif_pclmul serio_raw crc32_pclmul 
> battery processor acpi_pad button kvm_intel kvm dm_round_robin irqbypass 
> dm_multipath autofs4 sg sd_mod crc32c_intel ahci libahci psmouse libata 
> mlx4_core scsi_mod xhci_pci xhci_hcd mlx_compat fan thermal [last unloaded: 
> scsi_transport_srp]
> [  399.240380] CPU: 1 PID: 2052 Comm: md0_raid1 Not tainted 4.4.50-1-pserver+ 
> #26
> [  399.240381] Hardware name: Dell Inc. Precision Tower 3620/09WH54, BIOS 
> 1.3.6 05/26/2016
> [  399.240381] task: 8804031b6200 ti: 8800d72b4000 task.ti: 
> 8800d72b4000
> [  399.240385] RIP: 0010:[]  [] 
> generic_make_request+0x29e/0x2a0
> [  399.240385] RSP: 0018:8800d72b7d10  EFLAGS: 00010286
> [  399.240386] RAX: 8804031b6200 RBX: 8800d2577e00 RCX: 
> 3fff
> [  399.240387] RDX: c001 RSI: 0001 RDI: 
> 8800d5e8c1e0
> [  399.240387] RBP: 8800d72b7d50 R08:  R09: 
> 003f
> [  399.240388] R10: 0004 R11: 001db9ac R12: 
> 
> [  399.240388] R13: 8800d2748e00 R14: 88040a016400 R15: 
> 8800d2748e40
> [  399.240389] FS:  () GS:88041dc4() 
> knlGS:
> [  399.240390] CS:  0010 DS:  ES:  CR0: 80050033
> [  399.240390] CR2: 7fb49246a000 CR3: 00040215c000 CR4: 
> 003406e0
> [  399.240391] DR0:  DR1:  DR2: 
> 
> [  399.240391] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [  399.240392] Stack:
> [  399.240393]  8800d72b7d18 8800d72b7d30  
> 
> [  399.240394]  a079c290 8800d2577e00  
> 8800d2748e00
> [  399.240395]  8800d72b7e58 a079e74c 88040b661c00 
> 8800d2577e00
> [  399.240396] Call Trace:
> [  399.240398]  [] ? sync_request+0xb20/0xb20 [raid1]
> [  399.240400]  [] raid1d+0x65c/0x1060 [raid1]
> [  399.240403]  [] ? 
> trace_raw_output_itimer_expire+0x80/0x80
> [  39

Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request

2017-04-04 Thread Michael Wang
On 04/04/2017 12:23 PM, Michael Wang wrote:
[snip]
>> add something like
>>   if (wbio->bi_next)
>>  printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n",
>>   i, r1_bio->read_disk, wbio->bi_end_io);
>>
>> that might help narrow down what is happening.
> 
> Just triggered again in 4.4, dmesg like:
> 
> [  399.240230] md: super_written gets error=-5
> [  399.240286] md: super_written gets error=-5
> [  399.240286] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
> 204160
> [  399.240300] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
> 204160
> [  399.240312] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
> 204160
> [  399.240323] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
> 204160
> [  399.240334] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
> 204160
> [  399.240341] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
> 204160
> [  399.240349] md/raid1:md0: dm-0: unrecoverable I/O read error for block 
> 204160
> [  399.240352] bi_next!= NULL i=0 read_disk=0 bi_end_io=end_sync_write [raid1]

Is it possible that the fail fast who changed the 'bi_end_io' inside
fix_sync_read_error() help the used bio pass the check?

I'm not sure but if the read bio was supposed to be reused as write
for fail fast, maybe we should reset it like this?

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7d67235..0554110 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
/* Don't try recovering from here - just fail it
 * ... unless it is the last working device of course */
md_error(mddev, rdev);
-   if (test_bit(Faulty, >flags))
+   if (test_bit(Faulty, >flags)) {
/* Don't try to read from here, but make sure
 * put_buf does it's thing
 */
bio->bi_end_io = end_sync_write;
+   bio->bi_next = NULL;
+   }
}
 
while(sectors) {

Regards,
Michael Wang


> [  399.240363] [ cut here ]
> [  399.240364] kernel BUG at block/blk-core.c:2147!
> [  399.240365] invalid opcode:  [#1] SMP 
> [  399.240378] Modules linked in: ib_srp scsi_transport_srp raid1 md_mod 
> ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel 
> udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink iTCO_wdt 
> iTCO_vendor_support dcdbas dell_smm_hwmon acpi_cpufreq x86_pkg_temp_thermal 
> tpm_tis coretemp evdev tpm i2c_i801 crct10dif_pclmul serio_raw crc32_pclmul 
> battery processor acpi_pad button kvm_intel kvm dm_round_robin irqbypass 
> dm_multipath autofs4 sg sd_mod crc32c_intel ahci libahci psmouse libata 
> mlx4_core scsi_mod xhci_pci xhci_hcd mlx_compat fan thermal [last unloaded: 
> scsi_transport_srp]
> [  399.240380] CPU: 1 PID: 2052 Comm: md0_raid1 Not tainted 4.4.50-1-pserver+ 
> #26
> [  399.240381] Hardware name: Dell Inc. Precision Tower 3620/09WH54, BIOS 
> 1.3.6 05/26/2016
> [  399.240381] task: 8804031b6200 ti: 8800d72b4000 task.ti: 
> 8800d72b4000
> [  399.240385] RIP: 0010:[]  [] 
> generic_make_request+0x29e/0x2a0
> [  399.240385] RSP: 0018:8800d72b7d10  EFLAGS: 00010286
> [  399.240386] RAX: 8804031b6200 RBX: 8800d2577e00 RCX: 
> 3fff
> [  399.240387] RDX: c001 RSI: 0001 RDI: 
> 8800d5e8c1e0
> [  399.240387] RBP: 8800d72b7d50 R08:  R09: 
> 003f
> [  399.240388] R10: 0004 R11: 001db9ac R12: 
> 
> [  399.240388] R13: 8800d2748e00 R14: 88040a016400 R15: 
> 8800d2748e40
> [  399.240389] FS:  () GS:88041dc4() 
> knlGS:
> [  399.240390] CS:  0010 DS:  ES:  CR0: 80050033
> [  399.240390] CR2: 7fb49246a000 CR3: 00040215c000 CR4: 
> 003406e0
> [  399.240391] DR0:  DR1:  DR2: 
> 
> [  399.240391] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [  399.240392] Stack:
> [  399.240393]  8800d72b7d18 8800d72b7d30  
> 
> [  399.240394]  a079c290 8800d2577e00  
> 8800d2748e00
> [  399.240395]  8800d72b7e58 a079e74c 88040b661c00 
> 8800d2577e00
> [  399.240396] Call Trace:
> [  399.240398]  [] ? sync_request+0xb20/0xb20 [raid1]
> [  399.240400]  [] raid1d+0x65c/0x1060 [raid1]
> [  399.240403]  [] ? 
> trace_raw_output_itimer_expire+0x80/0x80
> [  39

Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request

2017-04-04 Thread Michael Wang

On 04/04/2017 11:37 AM, NeilBrown wrote:
> On Tue, Apr 04 2017, Michael Wang wrote:
[snip]
>>>
>>> If sync_request_write() is using a bio that has already been used, it
>>> should call bio_reset() and fill in the details again.
>>> However I don't see how that would happen.
>>> Can you give specific details on the situation that triggers the bug?
>>
>> We have storage side mapping lv through scst to server, on server side
>> we assemble them into multipath device, and then assemble these dm into
>> two raid1.
>>
>> The test is firstly do mkfs.ext4 on raid1 then start fio on it, on storage
>> side we unmap all the lv (could during mkfs or fio), then on server side
>> we hit the BUG (reproducible).
> 
> So I assume the initial resync is still happening at this point?
> And you unmap *all* the lv's so you expect IO to fail?
> I can see that the code would behave strangely if you have a
> bad-block-list configured (which is the default).
> Do you have a bbl?  If you create the array without the bbl, does it
> still crash?

The resync is at least happen concurrently in this case, we try
to simulate the case that all the connections dropped, the IO do
failed, also bunch of kernel log like:

md: super_written gets error=-5
blk_update_request: I/O error, dev dm-3, sector 64184
md/raid1:md1: dm-2: unrecoverable I/O read error for block 46848

we expect that to happen, but server should not crash on BUG.

And we haven't done any thing special regarding bbl, the bad_blocks
in sysfs are all empty.

> 
>>
>> The path of bio was confirmed by add tracing, it is reused in 
>> sync_request_write()
>> with 'bi_next' once chained inside blk_attempt_plug_merge().
> 
> I still don't see why it is re-used.
> I assume you didn't explicitly ask for a check/repair (i.e. didn't write
> to .../md/sync_action at all?).  In that case MD_RECOVERY_REQUESTED is
> not set.

Just unmap lv on storage side, no operation on server side.

> So sync_request() sends only one bio to generic_make_request():
>r1_bio->bios[r1_bio->read_disk];
> 
> then sync_request_write() *doesn't* send that bio again, but does send
> all the others.
> 
> So where does it reuse a bio?

If that's the design then it would be strange... the log do showing the path
of that bio go through sync_request(), will do more investigation.

> 
>>
>> We also tried to reset the bi_next inside sync_request_write() before
>> generic_make_request() which also works.
>>
>> The testing was done with 4.4, but we found upstream also left bi_next
>> chained after done in request, thus we post this RFC.
>>
>> Regarding raid1, we haven't found the place on path where the bio was
>> reset... where does it supposed to be?
> 
> I'm not sure what you mean.
> We only reset bios when they are being reused.
> One place is in process_checks() where bio_reset() is called before
> filling in all the details.
> 
> 
> Maybe, in sync_request_write(), before
> 
>   wbio->bi_rw = WRITE;
> 
> add something like
>   if (wbio->bi_next)
>  printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n",
>   i, r1_bio->read_disk, wbio->bi_end_io);
> 
> that might help narrow down what is happening.

Just triggered again in 4.4, dmesg like:

[  399.240230] md: super_written gets error=-5
[  399.240286] md: super_written gets error=-5
[  399.240286] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240300] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240312] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240323] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240334] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240341] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240349] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240352] bi_next!= NULL i=0 read_disk=0 bi_end_io=end_sync_write [raid1]
[  399.240363] [ cut here ]
[  399.240364] kernel BUG at block/blk-core.c:2147!
[  399.240365] invalid opcode:  [#1] SMP 
[  399.240378] Modules linked in: ib_srp scsi_transport_srp raid1 md_mod 
ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel 
udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink iTCO_wdt 
iTCO_vendor_support dcdbas dell_smm_hwmon acpi_cpufreq x86_pkg_temp_thermal 
tpm_tis coretemp evdev tpm i2c_i801 crct10dif_pclmul serio_raw crc32_pclmul 
battery processor acpi_pad button kvm_intel kvm dm_round_robin irqbypass 
dm_multipath autofs4 sg sd_mod crc32c_intel ahci libahci psmouse libata 
mlx4_core scsi_mod xhci_pci xhci_hcd mlx_compat fan thermal 

Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request

2017-04-04 Thread Michael Wang

On 04/04/2017 11:37 AM, NeilBrown wrote:
> On Tue, Apr 04 2017, Michael Wang wrote:
[snip]
>>>
>>> If sync_request_write() is using a bio that has already been used, it
>>> should call bio_reset() and fill in the details again.
>>> However I don't see how that would happen.
>>> Can you give specific details on the situation that triggers the bug?
>>
>> We have storage side mapping lv through scst to server, on server side
>> we assemble them into multipath device, and then assemble these dm into
>> two raid1.
>>
>> The test is firstly do mkfs.ext4 on raid1 then start fio on it, on storage
>> side we unmap all the lv (could during mkfs or fio), then on server side
>> we hit the BUG (reproducible).
> 
> So I assume the initial resync is still happening at this point?
> And you unmap *all* the lv's so you expect IO to fail?
> I can see that the code would behave strangely if you have a
> bad-block-list configured (which is the default).
> Do you have a bbl?  If you create the array without the bbl, does it
> still crash?

The resync is at least happen concurrently in this case, we try
to simulate the case that all the connections dropped, the IO do
failed, also bunch of kernel log like:

md: super_written gets error=-5
blk_update_request: I/O error, dev dm-3, sector 64184
md/raid1:md1: dm-2: unrecoverable I/O read error for block 46848

we expect that to happen, but server should not crash on BUG.

And we haven't done any thing special regarding bbl, the bad_blocks
in sysfs are all empty.

> 
>>
>> The path of bio was confirmed by add tracing, it is reused in 
>> sync_request_write()
>> with 'bi_next' once chained inside blk_attempt_plug_merge().
> 
> I still don't see why it is re-used.
> I assume you didn't explicitly ask for a check/repair (i.e. didn't write
> to .../md/sync_action at all?).  In that case MD_RECOVERY_REQUESTED is
> not set.

Just unmap lv on storage side, no operation on server side.

> So sync_request() sends only one bio to generic_make_request():
>r1_bio->bios[r1_bio->read_disk];
> 
> then sync_request_write() *doesn't* send that bio again, but does send
> all the others.
> 
> So where does it reuse a bio?

If that's the design then it would be strange... the log do showing the path
of that bio go through sync_request(), will do more investigation.

> 
>>
>> We also tried to reset the bi_next inside sync_request_write() before
>> generic_make_request() which also works.
>>
>> The testing was done with 4.4, but we found upstream also left bi_next
>> chained after done in request, thus we post this RFC.
>>
>> Regarding raid1, we haven't found the place on path where the bio was
>> reset... where does it supposed to be?
> 
> I'm not sure what you mean.
> We only reset bios when they are being reused.
> One place is in process_checks() where bio_reset() is called before
> filling in all the details.
> 
> 
> Maybe, in sync_request_write(), before
> 
>   wbio->bi_rw = WRITE;
> 
> add something like
>   if (wbio->bi_next)
>  printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n",
>   i, r1_bio->read_disk, wbio->bi_end_io);
> 
> that might help narrow down what is happening.

Just triggered again in 4.4, dmesg like:

[  399.240230] md: super_written gets error=-5
[  399.240286] md: super_written gets error=-5
[  399.240286] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240300] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240312] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240323] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240334] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240341] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240349] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240352] bi_next!= NULL i=0 read_disk=0 bi_end_io=end_sync_write [raid1]
[  399.240363] [ cut here ]
[  399.240364] kernel BUG at block/blk-core.c:2147!
[  399.240365] invalid opcode:  [#1] SMP 
[  399.240378] Modules linked in: ib_srp scsi_transport_srp raid1 md_mod 
ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel 
udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink iTCO_wdt 
iTCO_vendor_support dcdbas dell_smm_hwmon acpi_cpufreq x86_pkg_temp_thermal 
tpm_tis coretemp evdev tpm i2c_i801 crct10dif_pclmul serio_raw crc32_pclmul 
battery processor acpi_pad button kvm_intel kvm dm_round_robin irqbypass 
dm_multipath autofs4 sg sd_mod crc32c_intel ahci libahci psmouse libata 
mlx4_core scsi_mod xhci_pci xhci_hcd mlx_compat fan thermal 

Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request

2017-04-04 Thread Michael Wang
Hi, Neil

On 04/03/2017 11:25 PM, NeilBrown wrote:
> On Mon, Apr 03 2017, Michael Wang wrote:
> 
>> blk_attempt_plug_merge() try to merge bio into request and chain them
>> by 'bi_next', while after the bio is done inside request, we forgot to
>> reset the 'bi_next'.
>>
>> This lead into BUG while removing all the underlying devices from md-raid1,
>> the bio once go through:
>>
>>   md_do_sync()
>> sync_request()
>>   generic_make_request()
> 
> This is a read request from the "first" device.
> 
>> blk_queue_bio()
>>   blk_attempt_plug_merge()
>> CHAINED HERE
>>
>> will keep chained and reused by:
>>
>>   raid1d()
>> sync_request_write()
>>   generic_make_request()
> 
> This is a write request to some other device, isn't it?
> 
> If sync_request_write() is using a bio that has already been used, it
> should call bio_reset() and fill in the details again.
> However I don't see how that would happen.
> Can you give specific details on the situation that triggers the bug?

We have storage side mapping lv through scst to server, on server side
we assemble them into multipath device, and then assemble these dm into
two raid1.

The test is firstly do mkfs.ext4 on raid1 then start fio on it, on storage
side we unmap all the lv (could during mkfs or fio), then on server side
we hit the BUG (reproducible).

The path of bio was confirmed by add tracing, it is reused in 
sync_request_write()
with 'bi_next' once chained inside blk_attempt_plug_merge().

We also tried to reset the bi_next inside sync_request_write() before
generic_make_request() which also works.

The testing was done with 4.4, but we found upstream also left bi_next
chained after done in request, thus we post this RFC.

Regarding raid1, we haven't found the place on path where the bio was
reset... where does it supposed to be?

BTW the fix_sync_read_error() also invoked and succeed before trigger
the BUG.

Regards,
Michael Wang

> 
> Thanks,
> NeilBrown
> 
> 
>> BUG_ON(bio->bi_next)
>>
>> After reset the 'bi_next' this can no longer happen.
>>
>> Signed-off-by: Michael Wang <yun.w...@profitbricks.com>
>> ---
>>  block/blk-core.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 43b7d06..91223b2 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2619,8 +2619,10 @@ bool blk_update_request(struct request *req, int 
>> error, unsigned int nr_bytes)
>> struct bio *bio = req->bio;
>> unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);
>>
>> -   if (bio_bytes == bio->bi_iter.bi_size)
>> +   if (bio_bytes == bio->bi_iter.bi_size) {
>> req->bio = bio->bi_next;
>> +   bio->bi_next = NULL;
>> +   }
>>
>> req_bio_endio(req, bio, bio_bytes, error);
>>
>> -- 
>> 2.5.0


Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request

2017-04-04 Thread Michael Wang
Hi, Neil

On 04/03/2017 11:25 PM, NeilBrown wrote:
> On Mon, Apr 03 2017, Michael Wang wrote:
> 
>> blk_attempt_plug_merge() try to merge bio into request and chain them
>> by 'bi_next', while after the bio is done inside request, we forgot to
>> reset the 'bi_next'.
>>
>> This lead into BUG while removing all the underlying devices from md-raid1,
>> the bio once go through:
>>
>>   md_do_sync()
>> sync_request()
>>   generic_make_request()
> 
> This is a read request from the "first" device.
> 
>> blk_queue_bio()
>>   blk_attempt_plug_merge()
>> CHAINED HERE
>>
>> will keep chained and reused by:
>>
>>   raid1d()
>> sync_request_write()
>>   generic_make_request()
> 
> This is a write request to some other device, isn't it?
> 
> If sync_request_write() is using a bio that has already been used, it
> should call bio_reset() and fill in the details again.
> However I don't see how that would happen.
> Can you give specific details on the situation that triggers the bug?

We have storage side mapping lv through scst to server, on server side
we assemble them into multipath device, and then assemble these dm into
two raid1.

The test is firstly do mkfs.ext4 on raid1 then start fio on it, on storage
side we unmap all the lv (could during mkfs or fio), then on server side
we hit the BUG (reproducible).

The path of bio was confirmed by add tracing, it is reused in 
sync_request_write()
with 'bi_next' once chained inside blk_attempt_plug_merge().

We also tried to reset the bi_next inside sync_request_write() before
generic_make_request() which also works.

The testing was done with 4.4, but we found upstream also left bi_next
chained after done in request, thus we post this RFC.

Regarding raid1, we haven't found the place on path where the bio was
reset... where does it supposed to be?

BTW the fix_sync_read_error() also invoked and succeed before trigger
the BUG.

Regards,
Michael Wang

> 
> Thanks,
> NeilBrown
> 
> 
>> BUG_ON(bio->bi_next)
>>
>> After reset the 'bi_next' this can no longer happen.
>>
>> Signed-off-by: Michael Wang 
>> ---
>>  block/blk-core.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 43b7d06..91223b2 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2619,8 +2619,10 @@ bool blk_update_request(struct request *req, int 
>> error, unsigned int nr_bytes)
>> struct bio *bio = req->bio;
>> unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);
>>
>> -   if (bio_bytes == bio->bi_iter.bi_size)
>> +   if (bio_bytes == bio->bi_iter.bi_size) {
>> req->bio = bio->bi_next;
>> +   bio->bi_next = NULL;
>> +   }
>>
>> req_bio_endio(req, bio, bio_bytes, error);
>>
>> -- 
>> 2.5.0


[RFC PATCH] blk: reset 'bi_next' when bio is done inside request

2017-04-03 Thread Michael Wang

blk_attempt_plug_merge() try to merge bio into request and chain them
by 'bi_next', while after the bio is done inside request, we forgot to
reset the 'bi_next'.

This lead into BUG while removing all the underlying devices from md-raid1,
the bio once go through:

  md_do_sync()
sync_request()
  generic_make_request()
blk_queue_bio()
  blk_attempt_plug_merge()
CHAINED HERE

will keep chained and reused by:

  raid1d()
sync_request_write()
  generic_make_request()
BUG_ON(bio->bi_next)

After reset the 'bi_next' this can no longer happen.

Signed-off-by: Michael Wang <yun.w...@profitbricks.com>
---
 block/blk-core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 43b7d06..91223b2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2619,8 +2619,10 @@ bool blk_update_request(struct request *req, int error, 
unsigned int nr_bytes)
struct bio *bio = req->bio;
unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);

-   if (bio_bytes == bio->bi_iter.bi_size)
+   if (bio_bytes == bio->bi_iter.bi_size) {
req->bio = bio->bi_next;
+   bio->bi_next = NULL;
+   }

req_bio_endio(req, bio, bio_bytes, error);

-- 
2.5.0


[RFC PATCH] blk: reset 'bi_next' when bio is done inside request

2017-04-03 Thread Michael Wang

blk_attempt_plug_merge() try to merge bio into request and chain them
by 'bi_next', while after the bio is done inside request, we forgot to
reset the 'bi_next'.

This lead into BUG while removing all the underlying devices from md-raid1,
the bio once go through:

  md_do_sync()
sync_request()
  generic_make_request()
blk_queue_bio()
  blk_attempt_plug_merge()
CHAINED HERE

will keep chained and reused by:

  raid1d()
sync_request_write()
  generic_make_request()
BUG_ON(bio->bi_next)

After reset the 'bi_next' this can no longer happen.

Signed-off-by: Michael Wang 
---
 block/blk-core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 43b7d06..91223b2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2619,8 +2619,10 @@ bool blk_update_request(struct request *req, int error, 
unsigned int nr_bytes)
struct bio *bio = req->bio;
unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);

-   if (bio_bytes == bio->bi_iter.bi_size)
+   if (bio_bytes == bio->bi_iter.bi_size) {
req->bio = bio->bi_next;
+   bio->bi_next = NULL;
+   }

req_bio_endio(req, bio, bio_bytes, error);

-- 
2.5.0


Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion

2016-12-23 Thread Michael Wang
Dear Maintainers

I'd like to ask for the status of this patch since we hit the
issue too during our testing on md raid1.

Split remainder bio_A was queued ahead, following by bio_B for
lower device, at this moment raid start freezing, the loop take
out bio_A firstly and deliver it, which will hung since raid is
freezing, while the freezing never end since it waiting for
bio_B to finish, and bio_B is still on the queue, waiting for
bio_A to finish...

We're looking for a good solution and we found this patch
already progressed a lot, but we can't find it on linux-next,
so we'd like to ask are we still planning to have this fix
in upstream?

Regards,
Michael Wang


On 07/11/2016 04:10 PM, Lars Ellenberg wrote:
> For a long time, generic_make_request() converts recursion into
> iteration by queuing recursive arguments on current->bio_list.
> 
> This is convenient for stacking drivers,
> the top-most driver would take the originally submitted bio,
> and re-submit a re-mapped version of it, or one or more clones,
> or one or more new allocated bios to its backend(s). Which
> are then simply processed in turn, and each can again queue
> more "backend-bios" until we reach the bottom of the driver stack,
> and actually dispatch to the real backend device.
> 
> Any stacking driver ->make_request_fn() could expect that,
> once it returns, any backend-bios it submitted via recursive calls
> to generic_make_request() would now be processed and dispatched, before
> the current task would call into this driver again.
> 
> This is changed by commit
>   54efd50 block: make generic_make_request handle arbitrarily sized bios
> 
> Drivers may call blk_queue_split() inside their ->make_request_fn(),
> which may split the current bio into a front-part to be dealt with
> immediately, and a remainder-part, which may need to be split even
> further. That remainder-part will simply also be pushed to
> current->bio_list, and would end up being head-of-queue, in front
> of any backend-bios the current make_request_fn() might submit during
> processing of the fron-part.
> 
> Which means the current task would immediately end up back in the same
> make_request_fn() of the same driver again, before any of its backend
> bios have even been processed.
> 
> This can lead to resource starvation deadlock.
> Drivers could avoid this by learning to not need blk_queue_split(),
> or by submitting their backend bios in a different context (dedicated
> kernel thread, work_queue context, ...). Or by playing funny re-ordering
> games with entries on current->bio_list.
> 
> Instead, I suggest to distinguish between recursive calls to
> generic_make_request(), and pushing back the remainder part in
> blk_queue_split(), by pointing current->bio_lists to a
>   struct recursion_to_iteration_bio_lists {
>   struct bio_list recursion;
>   struct bio_list queue;
>   }
> 
> By providing each q->make_request_fn() with an empty "recursion"
> bio_list, then merging any recursively submitted bios to the
> head of the "queue" list, we can make the recursion-to-iteration
> logic in generic_make_request() process deepest level bios first,
> and "sibling" bios of the same level in "natural" order.
> 
> Signed-off-by: Lars Ellenberg <lars.ellenb...@linbit.com>
> Signed-off-by: Roland Kammerer <roland.kamme...@linbit.com>
> ---
>  block/bio.c   | 20 +++
>  block/blk-core.c  | 49 
> +--
>  block/blk-merge.c |  5 -
>  drivers/md/bcache/btree.c | 12 ++--
>  drivers/md/dm-bufio.c |  2 +-
>  drivers/md/raid1.c|  5 ++---
>  drivers/md/raid10.c   |  5 ++---
>  include/linux/bio.h   | 25 
>  include/linux/sched.h |  4 ++--
>  9 files changed, 80 insertions(+), 47 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 848cd35..c2606fd 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -366,12 +366,16 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
>*/
>  
>   bio_list_init();
> - bio_list_init();
>  
> - while ((bio = bio_list_pop(current->bio_list)))
> + bio_list_init();
> + while ((bio = bio_list_pop(>bio_lists->recursion)))
>   bio_list_add(bio->bi_pool == bs ?  : , bio);
> + current->bio_lists->recursion = nopunt;
>  
> - *current->bio_list = nopunt;
> + bio_list_init();
> + while ((bio = bio_list_pop(>bio_lists->queue)))
> + bio_list_add(bio->bi_pool == bs ?  : , bio);
> + current->bio_lists->queue = nopunt;
>  
>

Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion

2016-12-23 Thread Michael Wang
Dear Maintainers

I'd like to ask for the status of this patch since we hit the
issue too during our testing on md raid1.

Split remainder bio_A was queued ahead, following by bio_B for
lower device, at this moment raid start freezing, the loop take
out bio_A firstly and deliver it, which will hung since raid is
freezing, while the freezing never end since it waiting for
bio_B to finish, and bio_B is still on the queue, waiting for
bio_A to finish...

We're looking for a good solution and we found this patch
already progressed a lot, but we can't find it on linux-next,
so we'd like to ask are we still planning to have this fix
in upstream?

Regards,
Michael Wang


On 07/11/2016 04:10 PM, Lars Ellenberg wrote:
> For a long time, generic_make_request() converts recursion into
> iteration by queuing recursive arguments on current->bio_list.
> 
> This is convenient for stacking drivers,
> the top-most driver would take the originally submitted bio,
> and re-submit a re-mapped version of it, or one or more clones,
> or one or more new allocated bios to its backend(s). Which
> are then simply processed in turn, and each can again queue
> more "backend-bios" until we reach the bottom of the driver stack,
> and actually dispatch to the real backend device.
> 
> Any stacking driver ->make_request_fn() could expect that,
> once it returns, any backend-bios it submitted via recursive calls
> to generic_make_request() would now be processed and dispatched, before
> the current task would call into this driver again.
> 
> This is changed by commit
>   54efd50 block: make generic_make_request handle arbitrarily sized bios
> 
> Drivers may call blk_queue_split() inside their ->make_request_fn(),
> which may split the current bio into a front-part to be dealt with
> immediately, and a remainder-part, which may need to be split even
> further. That remainder-part will simply also be pushed to
> current->bio_list, and would end up being head-of-queue, in front
> of any backend-bios the current make_request_fn() might submit during
> processing of the fron-part.
> 
> Which means the current task would immediately end up back in the same
> make_request_fn() of the same driver again, before any of its backend
> bios have even been processed.
> 
> This can lead to resource starvation deadlock.
> Drivers could avoid this by learning to not need blk_queue_split(),
> or by submitting their backend bios in a different context (dedicated
> kernel thread, work_queue context, ...). Or by playing funny re-ordering
> games with entries on current->bio_list.
> 
> Instead, I suggest to distinguish between recursive calls to
> generic_make_request(), and pushing back the remainder part in
> blk_queue_split(), by pointing current->bio_lists to a
>   struct recursion_to_iteration_bio_lists {
>   struct bio_list recursion;
>   struct bio_list queue;
>   }
> 
> By providing each q->make_request_fn() with an empty "recursion"
> bio_list, then merging any recursively submitted bios to the
> head of the "queue" list, we can make the recursion-to-iteration
> logic in generic_make_request() process deepest level bios first,
> and "sibling" bios of the same level in "natural" order.
> 
> Signed-off-by: Lars Ellenberg 
> Signed-off-by: Roland Kammerer 
> ---
>  block/bio.c   | 20 +++
>  block/blk-core.c  | 49 
> +--
>  block/blk-merge.c |  5 -
>  drivers/md/bcache/btree.c | 12 ++--
>  drivers/md/dm-bufio.c |  2 +-
>  drivers/md/raid1.c|  5 ++---
>  drivers/md/raid10.c   |  5 ++---
>  include/linux/bio.h   | 25 
>  include/linux/sched.h |  4 ++--
>  9 files changed, 80 insertions(+), 47 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 848cd35..c2606fd 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -366,12 +366,16 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
>*/
>  
>   bio_list_init();
> - bio_list_init();
>  
> - while ((bio = bio_list_pop(current->bio_list)))
> + bio_list_init();
> + while ((bio = bio_list_pop(>bio_lists->recursion)))
>   bio_list_add(bio->bi_pool == bs ?  : , bio);
> + current->bio_lists->recursion = nopunt;
>  
> - *current->bio_list = nopunt;
> + bio_list_init();
> + while ((bio = bio_list_pop(>bio_lists->queue)))
> + bio_list_add(bio->bi_pool == bs ?  : , bio);
> + current->bio_lists->queue = nopunt;
>  
>   spin_lock(>rescue_lock);
>   bio_list_

[BUG] block: bdi_register_owner() failure cause NULL pointer dereference

2016-09-29 Thread Michael Wang
Hi, Folks

We observed the hard lockup while trying raid assemble with sas3ircu,
it was start with the failure inside bdi_register_owner() with duplicated
kobj path, and later comeup the NULL pointer dereference, after that system
hang and we saw hard lockup on screen.

The duplicated issue could be with the scsi controller driver and we are
going to upgrade it anyway, but my question is why we don't do some error
handling like:

diff --git a/block/genhd.c b/block/genhd.c
index a178c8e..318bc63 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -614,7 +614,15 @@ void device_add_disk(struct device *parent, struct gendisk 
*disk)
 
/* Register BDI before referencing it from bdev */
bdi = >queue->backing_dev_info;
-   bdi_register_owner(bdi, disk_to_dev(disk));
+   if (bdi_register_owner(bdi, disk_to_dev(disk))) {
+   disk_release_events(disk);
+   blk_free_devt(devt);
+   disk->ev = NULL;
+   disk->first_minor = 0;
+   disk->major = 0;
+   WARN_ON(1);
+   return;
+   }
 
blk_register_region(disk_devt(disk), disk->minors, NULL,
exact_match, exact_lock, disk);

to prevent the following NULL pointer dereference and hard lockup?

Regards,
Michael Wang
Sep 29 09:53:28 st401b-3 systemd[1]: Starting Update UTMP about System Runlevel 
Changes...
Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen and drop on 1 v6wildcard :: UDP 123
Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen normally on 2 lo 127.0.0.1 UDP 123
Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen normally on 3 eth0 10.41.12.3 UDP 
123
Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen normally on 4 lo ::1 UDP 123
Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen normally on 5 eth0 
fe80::ec4:7aff:feab:6b0 UDP 123
Sep 29 09:53:28 st401b-3 ntpd[4970]: peers refreshed
Sep 29 09:53:28 st401b-3 ntpd[4970]: Listening on routing socket on fd #22 for 
interface updates
Sep 29 09:53:28 st401b-3 systemd[1]: Started Update UTMP about System Runlevel 
Changes.
Sep 29 09:53:28 st401b-3 systemd[1]: Startup finished in 18.720s (kernel) + 
39.513s (userspace) = 58.233s.
Sep 29 09:55:01 st401b-3 CRON[5433]: (root) CMD (command -v debian-sa1 > 
/dev/null && debian-sa1 1 1)
Sep 29 09:55:01 st401b-3 CRON[5434]: (root) CMD (test -x 
/opt/profitbricks/bin/check_memory && /opt/profitbricks/bin/check_memory)
Sep 29 09:55:17 st401b-3 kernel: [  167.693658] scsi 0:1:0:0: Direct-Access 
LSI  Logical Volume   3000 PQ: 0 ANSI: 6
Sep 29 09:55:17 st401b-3 kernel: [  167.693795] scsi 0:1:0:0: RAID1: 
handle(0x0143), wwid(0x02f7ec7949091b05), pd_count(2), type(SSP)
Sep 29 09:55:17 st401b-3 kernel: [  167.694042] sd 0:1:0:0: [sdam] 5859373056 
512-byte logical blocks: (3.00 TB/2.73 TiB)
Sep 29 09:55:17 st401b-3 kernel: [  167.694044] sd 0:1:0:0: [sdam] 4096-byte 
physical blocks
Sep 29 09:55:17 st401b-3 kernel: [  167.694057] sd 0:1:0:0: Attached scsi 
generic sg40 type 0
Sep 29 09:55:17 st401b-3 kernel: [  167.694129] sd 0:1:0:0: [sdam] Write 
Protect is off
Sep 29 09:55:17 st401b-3 kernel: [  167.694131] sd 0:1:0:0: [sdam] Mode Sense: 
03 00 00 08
Sep 29 09:55:17 st401b-3 kernel: [  167.694166] sd 0:1:0:0: [sdam] No Caching 
mode page found
Sep 29 09:55:17 st401b-3 kernel: [  167.694282] sd 0:0:4:0: hidding raid 
component
Sep 29 09:55:17 st401b-3 kernel: [  167.694589] sd 0:1:0:0: [sdam] Assuming 
drive cache: write through
Sep 29 09:55:17 st401b-3 kernel: [  167.703346] sd 0:1:0:0: [sdam] Attached 
SCSI disk
Sep 29 09:55:17 st401b-3 kernel: [  167.703653] sd 0:0:5:0: hidding raid 
component
Sep 29 09:55:39 st401b-3 check_backup_lvm_push: critical: local git command 
rev-parse HEAD failed, retval: 0, fatal: Not a git repository: 
'/var/lib/backup-lvm/.git/'
Sep 29 09:56:03 st401b-3 kernel: [  213.684812] scsi 0:1:1:0: Direct-Access 
LSI  Logical Volume   3000 PQ: 0 ANSI: 6
Sep 29 09:56:03 st401b-3 kernel: [  213.684946] scsi 0:1:1:0: RAID1: 
handle(0x0142), wwid(0x0ab3eca651cd1b58), pd_count(2), type(SSP)
Sep 29 09:56:03 st401b-3 kernel: [  213.685189] sd 0:1:1:0: [sde] 5859373056 
512-byte logical blocks: (3.00 TB/2.73 TiB)
Sep 29 09:56:03 st401b-3 kernel: [  213.685192] sd 0:1:1:0: [sde] 4096-byte 
physical blocks
Sep 29 09:56:03 st401b-3 kernel: [  213.685204] sd 0:1:1:0: Attached scsi 
generic sg41 type 0
Sep 29 09:56:03 st401b-3 kernel: [  213.685275] sd 0:1:1:0: [sde] Write Protect 
is off
Sep 29 09:56:03 st401b-3 kernel: [  213.685277] sd 0:1:1:0: [sde] Mode Sense: 
03 00 00 08
Sep 29 09:56:03 st401b-3 kernel: [  213.685307] sd 0:1:1:0: [sde] No Caching 
mode page found
Sep 29 09:56:03 st401b-3 kernel: [  213.685423] sd 0:0:6:0: hidding raid 
component
Sep 29 09:56:03 st401b-3 kernel: [  213.685698] sd 0:1:1:0: [sde] Assuming 
drive cache: write through
Sep 29 09:56:03 st401b-3 kernel: [  213.686226] [ cut here 
]
Sep 29 09:56:03 st401b-3 kernel: [  213.686232] WARNING: C

[BUG] block: bdi_register_owner() failure cause NULL pointer dereference

2016-09-29 Thread Michael Wang
Hi, Folks

We observed the hard lockup while trying raid assemble with sas3ircu,
it was start with the failure inside bdi_register_owner() with duplicated
kobj path, and later comeup the NULL pointer dereference, after that system
hang and we saw hard lockup on screen.

The duplicated issue could be with the scsi controller driver and we are
going to upgrade it anyway, but my question is why we don't do some error
handling like:

diff --git a/block/genhd.c b/block/genhd.c
index a178c8e..318bc63 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -614,7 +614,15 @@ void device_add_disk(struct device *parent, struct gendisk 
*disk)
 
/* Register BDI before referencing it from bdev */
bdi = >queue->backing_dev_info;
-   bdi_register_owner(bdi, disk_to_dev(disk));
+   if (bdi_register_owner(bdi, disk_to_dev(disk))) {
+   disk_release_events(disk);
+   blk_free_devt(devt);
+   disk->ev = NULL;
+   disk->first_minor = 0;
+   disk->major = 0;
+   WARN_ON(1);
+   return;
+   }
 
blk_register_region(disk_devt(disk), disk->minors, NULL,
exact_match, exact_lock, disk);

to prevent the following NULL pointer dereference and hard lockup?

Regards,
Michael Wang
Sep 29 09:53:28 st401b-3 systemd[1]: Starting Update UTMP about System Runlevel 
Changes...
Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen and drop on 1 v6wildcard :: UDP 123
Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen normally on 2 lo 127.0.0.1 UDP 123
Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen normally on 3 eth0 10.41.12.3 UDP 
123
Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen normally on 4 lo ::1 UDP 123
Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen normally on 5 eth0 
fe80::ec4:7aff:feab:6b0 UDP 123
Sep 29 09:53:28 st401b-3 ntpd[4970]: peers refreshed
Sep 29 09:53:28 st401b-3 ntpd[4970]: Listening on routing socket on fd #22 for 
interface updates
Sep 29 09:53:28 st401b-3 systemd[1]: Started Update UTMP about System Runlevel 
Changes.
Sep 29 09:53:28 st401b-3 systemd[1]: Startup finished in 18.720s (kernel) + 
39.513s (userspace) = 58.233s.
Sep 29 09:55:01 st401b-3 CRON[5433]: (root) CMD (command -v debian-sa1 > 
/dev/null && debian-sa1 1 1)
Sep 29 09:55:01 st401b-3 CRON[5434]: (root) CMD (test -x 
/opt/profitbricks/bin/check_memory && /opt/profitbricks/bin/check_memory)
Sep 29 09:55:17 st401b-3 kernel: [  167.693658] scsi 0:1:0:0: Direct-Access 
LSI  Logical Volume   3000 PQ: 0 ANSI: 6
Sep 29 09:55:17 st401b-3 kernel: [  167.693795] scsi 0:1:0:0: RAID1: 
handle(0x0143), wwid(0x02f7ec7949091b05), pd_count(2), type(SSP)
Sep 29 09:55:17 st401b-3 kernel: [  167.694042] sd 0:1:0:0: [sdam] 5859373056 
512-byte logical blocks: (3.00 TB/2.73 TiB)
Sep 29 09:55:17 st401b-3 kernel: [  167.694044] sd 0:1:0:0: [sdam] 4096-byte 
physical blocks
Sep 29 09:55:17 st401b-3 kernel: [  167.694057] sd 0:1:0:0: Attached scsi 
generic sg40 type 0
Sep 29 09:55:17 st401b-3 kernel: [  167.694129] sd 0:1:0:0: [sdam] Write 
Protect is off
Sep 29 09:55:17 st401b-3 kernel: [  167.694131] sd 0:1:0:0: [sdam] Mode Sense: 
03 00 00 08
Sep 29 09:55:17 st401b-3 kernel: [  167.694166] sd 0:1:0:0: [sdam] No Caching 
mode page found
Sep 29 09:55:17 st401b-3 kernel: [  167.694282] sd 0:0:4:0: hidding raid 
component
Sep 29 09:55:17 st401b-3 kernel: [  167.694589] sd 0:1:0:0: [sdam] Assuming 
drive cache: write through
Sep 29 09:55:17 st401b-3 kernel: [  167.703346] sd 0:1:0:0: [sdam] Attached 
SCSI disk
Sep 29 09:55:17 st401b-3 kernel: [  167.703653] sd 0:0:5:0: hidding raid 
component
Sep 29 09:55:39 st401b-3 check_backup_lvm_push: critical: local git command 
rev-parse HEAD failed, retval: 0, fatal: Not a git repository: 
'/var/lib/backup-lvm/.git/'
Sep 29 09:56:03 st401b-3 kernel: [  213.684812] scsi 0:1:1:0: Direct-Access 
LSI  Logical Volume   3000 PQ: 0 ANSI: 6
Sep 29 09:56:03 st401b-3 kernel: [  213.684946] scsi 0:1:1:0: RAID1: 
handle(0x0142), wwid(0x0ab3eca651cd1b58), pd_count(2), type(SSP)
Sep 29 09:56:03 st401b-3 kernel: [  213.685189] sd 0:1:1:0: [sde] 5859373056 
512-byte logical blocks: (3.00 TB/2.73 TiB)
Sep 29 09:56:03 st401b-3 kernel: [  213.685192] sd 0:1:1:0: [sde] 4096-byte 
physical blocks
Sep 29 09:56:03 st401b-3 kernel: [  213.685204] sd 0:1:1:0: Attached scsi 
generic sg41 type 0
Sep 29 09:56:03 st401b-3 kernel: [  213.685275] sd 0:1:1:0: [sde] Write Protect 
is off
Sep 29 09:56:03 st401b-3 kernel: [  213.685277] sd 0:1:1:0: [sde] Mode Sense: 
03 00 00 08
Sep 29 09:56:03 st401b-3 kernel: [  213.685307] sd 0:1:1:0: [sde] No Caching 
mode page found
Sep 29 09:56:03 st401b-3 kernel: [  213.685423] sd 0:0:6:0: hidding raid 
component
Sep 29 09:56:03 st401b-3 kernel: [  213.685698] sd 0:1:1:0: [sde] Assuming 
drive cache: write through
Sep 29 09:56:03 st401b-3 kernel: [  213.686226] [ cut here 
]
Sep 29 09:56:03 st401b-3 kernel: [  213.686232] WARNING: C

Re: [PATCH RESEND] infiniband:core:Add needed error path in cm_init_av_by_path

2015-12-17 Thread Michael Wang


On 12/16/2015 07:16 PM, Jason Gunthorpe wrote:
> On Wed, Dec 16, 2015 at 11:26:39AM +0100, Michael Wang wrote:
[snip]
>>
>> I've rechecked the ib_init_ah_from_path() again, and found it
>> still set IB_AH_GRH when the GID cache missing, but with:
> 
> How do you mean?
> 
> ah_attr->ah_flags = IB_AH_GRH;
> ah_attr->grh.dgid = rec->dgid;
> 
> ret = ib_find_cached_gid(device, >sgid, ndev, _num,
>  _index);
> if (ret) {
> if (ndev)
> dev_put(ndev);
> return ret;
> }
> 
> If find_cached_gid fails then ib_init_ah_from_path also fails.
> 
> Is there a case where ib_find_cached_gid can succeed but not return
> good data?

Just for the GRH header, ib_find_cached_gid() will failed but the flag
and dgid will be correct, and others are all 0 including hop limit, but
may be just coincidence...

As long as hop_limit > 1 means the pkg do have to pass through a router
to other subnet, the fix make sense :-)

Regards,
Michael Wang

> 
> I agree it would read nicer if the ah_flags and gr.dgid was moved
> after the ib_find_cached_gid
> 
>> BTW, cma_sidr_rep_handler() also call ib_init_ah_from_path() with out
>> a check on return.
> 
> That sounds like a problem.
> 
> Jason
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RESEND] infiniband:core:Add needed error path in cm_init_av_by_path

2015-12-17 Thread Michael Wang


On 12/16/2015 07:16 PM, Jason Gunthorpe wrote:
> On Wed, Dec 16, 2015 at 11:26:39AM +0100, Michael Wang wrote:
[snip]
>>
>> I've rechecked the ib_init_ah_from_path() again, and found it
>> still set IB_AH_GRH when the GID cache missing, but with:
> 
> How do you mean?
> 
> ah_attr->ah_flags = IB_AH_GRH;
> ah_attr->grh.dgid = rec->dgid;
> 
> ret = ib_find_cached_gid(device, >sgid, ndev, _num,
>  _index);
> if (ret) {
> if (ndev)
> dev_put(ndev);
> return ret;
> }
> 
> If find_cached_gid fails then ib_init_ah_from_path also fails.
> 
> Is there a case where ib_find_cached_gid can succeed but not return
> good data?

Just for the GRH header, ib_find_cached_gid() will failed but the flag
and dgid will be correct, and others are all 0 including hop limit, but
may be just coincidence...

As long as hop_limit > 1 means the pkg do have to pass through a router
to other subnet, the fix make sense :-)

Regards,
Michael Wang

> 
> I agree it would read nicer if the ah_flags and gr.dgid was moved
> after the ib_find_cached_gid
> 
>> BTW, cma_sidr_rep_handler() also call ib_init_ah_from_path() with out
>> a check on return.
> 
> That sounds like a problem.
> 
> Jason
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RESEND] infiniband:core:Add needed error path in cm_init_av_by_path

2015-12-16 Thread Michael Wang

On 12/15/2015 06:30 PM, Jason Gunthorpe wrote:
> On Tue, Dec 15, 2015 at 05:38:34PM +0100, Michael Wang wrote:
>> The hop_limit is only suggest that the package allowed to be
>> routed, not have to, correct?
> 
> If the hop limit is >= 2 (?) then the GRH is mandatory. The
> SM will return this information in the PathRecord if it determines a
> GRH is required. The whole stack follows this protocol.
> 
> The GRH is optional for in-subnet communications.

Thanks for the explain :-)

I've rechecked the ib_init_ah_from_path() again, and found it
still set IB_AH_GRH when the GID cache missing, but with:

  grh.sgid_index = 0
  grh.flow_label = 0
  grh.hop_limit  = 0
  grh.traffic_class = 0

Not sure if it's just coincidence, hop_limit is 0, so router
will discard the pkg and GRH won't be used, the transaction in
subnet still works.

Could this by designed as an optimization for the case like when
SM reassigning the GID?

BTW, cma_sidr_rep_handler() also call ib_init_ah_from_path() with out
a check on return.

Regards,
Michael Wang

> 
> Jason
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RESEND] infiniband:core:Add needed error path in cm_init_av_by_path

2015-12-16 Thread Michael Wang

On 12/15/2015 06:30 PM, Jason Gunthorpe wrote:
> On Tue, Dec 15, 2015 at 05:38:34PM +0100, Michael Wang wrote:
>> The hop_limit is only suggest that the package allowed to be
>> routed, not have to, correct?
> 
> If the hop limit is >= 2 (?) then the GRH is mandatory. The
> SM will return this information in the PathRecord if it determines a
> GRH is required. The whole stack follows this protocol.
> 
> The GRH is optional for in-subnet communications.

Thanks for the explain :-)

I've rechecked the ib_init_ah_from_path() again, and found it
still set IB_AH_GRH when the GID cache missing, but with:

  grh.sgid_index = 0
  grh.flow_label = 0
  grh.hop_limit  = 0
  grh.traffic_class = 0

Not sure if it's just coincidence, hop_limit is 0, so router
will discard the pkg and GRH won't be used, the transaction in
subnet still works.

Could this by designed as an optimization for the case like when
SM reassigning the GID?

BTW, cma_sidr_rep_handler() also call ib_init_ah_from_path() with out
a check on return.

Regards,
Michael Wang

> 
> Jason
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RESEND] infiniband:core:Add needed error path in cm_init_av_by_path

2015-12-15 Thread Michael Wang


On 12/15/2015 04:52 PM, Nicholas Krause wrote:
> This adds a needed error path in the function, cm_init_av_by_path
> after the call to ib_init_ah_from_path in order to avoid incorrectly
> accessing the path pointer of structure type ib_sa_path_rec if this
> function call fails to complete its intended work successfully by
> returning a error code.
> 
> Signed-off-by: Nicholas Krause 
> ---
>  drivers/infiniband/core/cm.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
> index 0a26dd6..e9b36ea 100644
> --- a/drivers/infiniband/core/cm.c
> +++ b/drivers/infiniband/core/cm.c
> @@ -383,8 +383,11 @@ static int cm_init_av_by_path(struct ib_sa_path_rec 
> *path, struct cm_av *av)
>   return ret;
>  
>   av->port = port;
> - ib_init_ah_from_path(cm_dev->ib_device, port->port_num, path,
> -  >ah_attr);
> + ret = ib_init_ah_from_path(cm_dev->ib_device, port->port_num, path,
> +>ah_attr);

..Just wondering what if the transport don't require GRH?
eg inside the same subnet?

The hop_limit is only suggest that the package allowed to be
routed, not have to, correct?

Regards,
Michael Wang

> + if (ret)
> + return ret;
> +
>   av->timeout = path->packet_life_time + 1;
>  
>   return 0;
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RESEND] infiniband:core:Add needed error path in cm_init_av_by_path

2015-12-15 Thread Michael Wang


On 12/15/2015 04:52 PM, Nicholas Krause wrote:
> This adds a needed error path in the function, cm_init_av_by_path
> after the call to ib_init_ah_from_path in order to avoid incorrectly
> accessing the path pointer of structure type ib_sa_path_rec if this
> function call fails to complete its intended work successfully by
> returning a error code.
> 
> Signed-off-by: Nicholas Krause <xerofo...@gmail.com>
> ---
>  drivers/infiniband/core/cm.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
> index 0a26dd6..e9b36ea 100644
> --- a/drivers/infiniband/core/cm.c
> +++ b/drivers/infiniband/core/cm.c
> @@ -383,8 +383,11 @@ static int cm_init_av_by_path(struct ib_sa_path_rec 
> *path, struct cm_av *av)
>   return ret;
>  
>   av->port = port;
> - ib_init_ah_from_path(cm_dev->ib_device, port->port_num, path,
> -  >ah_attr);
> + ret = ib_init_ah_from_path(cm_dev->ib_device, port->port_num, path,
> +>ah_attr);

..Just wondering what if the transport don't require GRH?
eg inside the same subnet?

The hop_limit is only suggest that the package allowed to be
routed, not have to, correct?

Regards,
Michael Wang

> + if (ret)
> + return ret;
> +
>   av->timeout = path->packet_life_time + 1;
>  
>   return 0;
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] infiniband:core:Add needed error path in cm_init_av_by_path

2015-12-07 Thread Michael Wang


On 12/07/2015 08:57 AM, Haggai Eran wrote:
> On Friday, December 4, 2015 8:02 PM, Nicholas Krause  
> wrote:
>> To: dledf...@redhat.com
>> Cc: sean.he...@intel.com; hal.rosenst...@gmail.com; Haggai Eran; 
>> jguntho...@obsidianresearch.com; Matan Barak; yun.w...@profitbricks.com; 
>> ted.h@oracle.com; Doron Tsur; Erez Shitrit; david.ah...@oracle.com; 
>> linux-r...@vger.kernel.org; linux-kernel@vger.kernel.org
>> Subject: [PATCH] infiniband:core:Add needed error path in cm_init_av_by_path
>>
>> This adds a needed error path in the function, cm_init_av_by_path
>> after the call to ib_init_ah_from_path in order to avoid incorrectly
>> accessing the path pointer of structure type ib_sa_path_rec if this
>> function call fails to complete its intended work successfully by
>> returning a error code.
>>
>> Signed-off-by: Nicholas Krause 
> 
> The subject doesn't seem to match the convention but apart from that,
> 
> Reviewed-by: Haggai Eran 
> 
> I wonder if this should go to stable. If I understand correctly, this will 
> fail only when the SGID isn't found in the GID table, but such connections 
> would fail later on when creating a QP, right?

Me too think this need a reconsider, to me the current logical don't
really care the missing gid in cache when initializing AV, I'm not
sure if it's necessary to fail all the following path for such cache
missing...

Regards,
Michael Wang

> 
> Haggai
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] infiniband:core:Add needed error path in cm_init_av_by_path

2015-12-07 Thread Michael Wang


On 12/07/2015 08:57 AM, Haggai Eran wrote:
> On Friday, December 4, 2015 8:02 PM, Nicholas Krause <xerofo...@gmail.com> 
> wrote:
>> To: dledf...@redhat.com
>> Cc: sean.he...@intel.com; hal.rosenst...@gmail.com; Haggai Eran; 
>> jguntho...@obsidianresearch.com; Matan Barak; yun.w...@profitbricks.com; 
>> ted.h@oracle.com; Doron Tsur; Erez Shitrit; david.ah...@oracle.com; 
>> linux-r...@vger.kernel.org; linux-kernel@vger.kernel.org
>> Subject: [PATCH] infiniband:core:Add needed error path in cm_init_av_by_path
>>
>> This adds a needed error path in the function, cm_init_av_by_path
>> after the call to ib_init_ah_from_path in order to avoid incorrectly
>> accessing the path pointer of structure type ib_sa_path_rec if this
>> function call fails to complete its intended work successfully by
>> returning a error code.
>>
>> Signed-off-by: Nicholas Krause <xerofo...@gmail.com>
> 
> The subject doesn't seem to match the convention but apart from that,
> 
> Reviewed-by: Haggai Eran <hagg...@mellanox.com>
> 
> I wonder if this should go to stable. If I understand correctly, this will 
> fail only when the SGID isn't found in the GID table, but such connections 
> would fail later on when creating a QP, right?

Me too think this need a reconsider, to me the current logical don't
really care the missing gid in cache when initializing AV, I'm not
sure if it's necessary to fail all the following path for such cache
missing...

Regards,
Michael Wang

> 
> Haggai
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-12-03 Thread Michael Wang


On 12/02/2015 06:36 PM, Catalin Marinas wrote:
> On 2 December 2015 at 13:59, Borislav Petkov  wrote:
[snip]
> 
> 1. The sl?b allocators themselves use page allocations, so kmemleak
> could end up detecting the same pointer twice, hiding a potential leak
> 
> 2. Most page allocations do not contain data/pointers relevant to
> kmemleak (e.g. page cache pages), however the randomness of such data
> greatly diminishes kmemleak's ability to detect real leaks
> 
> Arguably, kmemleak could be made to detect both cases above by a
> combination of page flags, additional annotations or specific page
> alloc API. However, this has its own drawbacks in terms of code
> complexity (potentially outside mm/kmemleak.c) and overhead.

Thanks for the very nice explain :-) I used to thought overhead is
the only concern, missing the point regarding allocator it self.

Regards,
Michael Wang

> 
> Regarding a kmemleak_alloc() annotation like in the patch I suggested,
> that's the second one I've seen needed outside alloc APIs (the first
> one is commit f75782e4e067 - "block: kmemleak: Track the page
> allocations for struct request"). If the number of such explicit
> annotations stays small, it's better to keep it this way.
> 
> There are other explicit annotations like kmemleak_not_leak() or
> kmemleak_ignore() but these are for objects kmemleak knows about and
> incorrectly reports them as leaks. Most of the time is because the
> pointers to such objects are stored in a different form (e.g. physical
> address).
> 
> Anyway, kmemleak is not the only tool requiring annotations (take
> spin_lock_nested() for example). If needed, we could do with an
> additional page alloc/free API which informs kmemleak in the process
> but I don't think it's worth it.
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-12-03 Thread Michael Wang


On 12/02/2015 06:36 PM, Catalin Marinas wrote:
> On 2 December 2015 at 13:59, Borislav Petkov <b...@alien8.de> wrote:
[snip]
> 
> 1. The sl?b allocators themselves use page allocations, so kmemleak
> could end up detecting the same pointer twice, hiding a potential leak
> 
> 2. Most page allocations do not contain data/pointers relevant to
> kmemleak (e.g. page cache pages), however the randomness of such data
> greatly diminishes kmemleak's ability to detect real leaks
> 
> Arguably, kmemleak could be made to detect both cases above by a
> combination of page flags, additional annotations or specific page
> alloc API. However, this has its own drawbacks in terms of code
> complexity (potentially outside mm/kmemleak.c) and overhead.

Thanks for the very nice explain :-) I used to thought overhead is
the only concern, missing the point regarding allocator it self.

Regards,
Michael Wang

> 
> Regarding a kmemleak_alloc() annotation like in the patch I suggested,
> that's the second one I've seen needed outside alloc APIs (the first
> one is commit f75782e4e067 - "block: kmemleak: Track the page
> allocations for struct request"). If the number of such explicit
> annotations stays small, it's better to keep it this way.
> 
> There are other explicit annotations like kmemleak_not_leak() or
> kmemleak_ignore() but these are for objects kmemleak knows about and
> incorrectly reports them as leaks. Most of the time is because the
> pointers to such objects are stored in a different form (e.g. physical
> address).
> 
> Anyway, kmemleak is not the only tool requiring annotations (take
> spin_lock_nested() for example). If needed, we could do with an
> additional page alloc/free API which informs kmemleak in the process
> but I don't think it's worth it.
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-12-02 Thread Michael Wang


On 12/02/2015 03:13 PM, Borislav Petkov wrote:
> On Wed, Dec 02, 2015 at 03:09:18PM +0100, Michael Wang wrote:
>> This tool will help improve the kernel, AFAIK it's already made it's
>> best, if you got any idea on how to make it even better that would be
>> great, but at this moment, it still need few of care :-P
> 
> I think you're replying to my emails without even reading what I said.
> So I'm going to do the same and stop wasting my time.

First of all, I respect all your reply, and reply regarding your point.

Frankly speaking, I think you know all these already, it's not a big
deal but you refuse to obey the rules setup by others, although it do
help make things better, and benefit yourself too.

If you refuse to make things better I'm totally fine, time is valuable
for all of us :-)

Regards,
Michael Wang

> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-12-02 Thread Michael Wang


On 12/02/2015 02:59 PM, Borislav Petkov wrote:
> On Wed, Dec 02, 2015 at 02:48:47PM +0100, Michael Wang wrote:
>> I'm not sure why amd-iommu use get_page not kmalloc to initialize the
>> pointer table, but if it's necessary then the conflict will be there,
>> it's not the fault of driver or kmemleak, but the design require them
>> to cooperate with each other.
> 
> So, according to you, we should go and "fix" all callers of
> __get_free_pages() to make kmemleak happy. Then when the next new tool
> comes along, we should "fix" another kernel API just so that the tools
> are happy.

That's the way we have to detect leak, no driver could get rid of
the possibility of memory leaking, so it should respect the rule to
help others locating the problem, if a driver full of false report then
most likely folks will gradually lost interests on help fix leaking
problem for it.

> 
> Bzzt. Wrong!
> 
> The tools should work without sprinkling their code everywhere. Driver
> etc developers don't need to care about what tool they make happy or
> not. Tools' hooks should be hidden in macro magic so that developers
> don't care.

This tool will help improve the kernel, AFAIK it's already made it's
best, if you got any idea on how to make it even better that would be
great, but at this moment, it still need few of care :-P

Regards,
Michael Wang

> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-12-02 Thread Michael Wang


On 12/02/2015 02:40 PM, Borislav Petkov wrote:
> On Wed, Dec 02, 2015 at 02:18:44PM +0100, Michael Wang wrote:
[snip]
> 
>> Yeah, but it would be better to solve it, otherwise whoever saw this
>> report will need to go into the amd-iommu, make sure it's not a real
>> leak, then change their testing script...
> 
> No, you don't need to go into the iommu - you need to fix kmemleak.
> 
> And frankly, I'm getting sick and tired of all those tools needing
> special handling and us adding code just so that they're happy. If the
> tools can't figure out something, they shouldn't warn just in case but
> shut up instead.

If you mean the design of kmemleak, IMHO it's not that bad.

The problem is regarding performance, think about if kmemleak go into
every page to find out pointers, I guess the whole system will stuck.

I'm not sure why amd-iommu use get_page not kmalloc to initialize the pointer
table, but if it's necessary then the conflict will be there, it's not the fault
of driver or kmemleak, but the design require them to cooperate with each
other.

Regards,
Michael Wang

> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-12-02 Thread Michael Wang
Hi, Borislav

On 12/02/2015 02:13 PM, Borislav Petkov wrote:
> On Wed, Dec 02, 2015 at 02:01:55PM +0100, Michael Wang wrote:
>> Yeah.. it's a little complicated since we have our own kernel tree and this
>> won't be a problem for us, but we really prefer to help fix it in mainline
>> too, as long as this is really a defect, so others could save time on 
>> research
>> in future.
> 
> Well, to keep it realistic and if it were me, I wouldn't even take such
> a fix as it is apparently kmemleak's problem.

Do you mean this could be a real kmemleak? Could you please provide more 
details?

> 
> So you could fix your testing instead to ignore that error message now
> that you know it is a false-positive. That should be easiest.
> 

Yeah, but it would be better to solve it, otherwise whoever saw this report
will need to go into the amd-iommu, make sure it's not a real leak, then
change their testing script...

Regards,
Michael Wang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-12-02 Thread Michael Wang

On 12/02/2015 01:56 PM, Joerg Roedel wrote:
> On Wed, Dec 02, 2015 at 01:31:38PM +0100, Michael Wang wrote:
>> It's not my work or your work... it's a defect in the module and maintainer
>> should take responsibility on fixing it, correct?
> 
> No, its a false positive from an in-kernel checking tool, the iommu
> driver is correct. You just sent a patch to silence the false positive
> report.

Yeah, but caused by the driver :-P and have to be fixed in there too.

> 
>> We're very willing to help, but as I mentioned we are out of resource for
>> testing at this moment, but we can send you a new patch without testing,
>> will that works for you?
> 
> This should be testable on any AMD IOMMU system with working interrupt
> remapping. I will probably have no time to test this, if you really
> can't test yourself, try to get a Tested-by from someone else.

Good point, anyone willing to help test the fix? Or better provide the new
patch to be the author.

Regards,
Michael Wang

> 
> 
> 
>   Joerg
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-12-02 Thread Michael Wang


On 12/02/2015 01:53 PM, Borislav Petkov wrote:
> On Wed, Dec 02, 2015 at 01:31:38PM +0100, Michael Wang wrote:
>> It's not my work or your work... it's a defect in the module and maintainer
>> should take responsibility on fixing it, correct?
> 
> Well, you said "actually we just want to get rid of this annoying report
> on obj won't leak..."
> 
> It sounds to me like you want to have something fixed. So you do the
> patch properly, add to the commit message why exactly you're doing it
> and test it. Like everyone else.

Yeah.. it's a little complicated since we have our own kernel tree and this
won't be a problem for us, but we really prefer to help fix it in mainline
too, as long as this is really a defect, so others could save time on research
in future.

But seems like we can only wait for another chance to confirm the another
solution, frankly speaking I think we both will forgot this soon... fortunately
it's not that critical :-P

Regards,
Michael Wang

> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-12-02 Thread Michael Wang


On 12/02/2015 12:51 PM, Joerg Roedel wrote:
> On Wed, Dec 02, 2015 at 12:38:03PM +0100, Michael Wang wrote:
>> Joerg, this is really a tiny fix, would you mind to merge it into some
>> of your cleanup patch and testing them together? we are not in hurry,
>> just want to make sure the issue will get solved.
> 
> I am not doing your work. You sent a patch, received feedback, and now
> you can send a new patch based on it. Thats the process. If it addresses
> the feedback I will merge it, but I will not scissor your patch
> together.

It's not my work or your work... it's a defect in the module and maintainer
should take responsibility on fixing it, correct?

We're very willing to help, but as I mentioned we are out of resource for
testing at this moment, but we can send you a new patch without testing,
will that works for you?

Regards,
Michael Wang

> 
> 
>   Joerg
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-12-02 Thread Michael Wang

On 12/02/2015 12:31 PM, Catalin Marinas wrote:
> On 2 December 2015 at 10:56, Michael Wang  wrote:
[snip]
> 
> I could copy your description but I don't currently have a way (nor
> time) to test the patch. If you plan to test it anyway, please feel
> free to include my diff (which I guess was badly re-formatted by
> gmail), I don't really mind which author it is (I found it easier to
> show a diff than explain in plain English ;)).

Unfortunately that's the same on my side... we already close the ticket
and I don't have resources to testing it again.

Joerg, this is really a tiny fix, would you mind to merge it into some
of your cleanup patch and testing them together? we are not in hurry,
just want to make sure the issue will get solved.

Regards,
Michael Wang

> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-12-02 Thread Michael Wang

On 12/02/2015 11:52 AM, Catalin Marinas wrote:
[snip]
>>
>> Is there any more concern? actually we just want to get rid of this
>> annoying report on obj won't leak, if you're going to create obj for
>> 'irq_lookup_table' that's also fine for us, or will you pick this patch?
> 
> My preference (from a kmemleak perspective) is to tell kmemleak about
> the irq_lookup_table. Untested:

I'm fine with both solution, will leave the decision to maintainer :-)

BTW, could you please send a formal patch with descriptions?

Regards,
Michael Wang

> 
> diff --git a/drivers/iommu/amd_iommu_init.c b/drivers/iommu/amd_iommu_init.c
> index 013bdfff2d4d..c41609f71cbe 100644
> --- a/drivers/iommu/amd_iommu_init.c
> +++ b/drivers/iommu/amd_iommu_init.c
> @@ -27,6 +27,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1692,6 +1693,7 @@ static struct syscore_ops amd_iommu_syscore_ops = {
> 
>  static void __init free_on_init_error(void)
>  {
> + kmemleak_free(irq_lookup_table);
>   free_pages((unsigned long)irq_lookup_table,
> get_order(rlookup_table_size));
> 
> @@ -1906,6 +1908,7 @@ static int __init early_amd_iommu_init(void)
>   irq_lookup_table = (void *)__get_free_pages(
>   GFP_KERNEL | __GFP_ZERO,
>   get_order(rlookup_table_size));
> + kmemleak_alloc(irq_lookup_table, rlookup_table_size, 1, GFP_KERNEL);
>   if (!irq_lookup_table)
>   goto out;
>   }
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-12-02 Thread Michael Wang
Hi, Joerg

On 11/25/2015 04:14 PM, Michael Wang wrote:
> On 11/25/2015 04:08 PM, Joerg Roedel wrote:
[snip]
>>> This is caused by the 'irq_lookup_table' was allocated with
>>> __get_free_pages() which won't create kmemleak object, thus it's
>>> pointers won't be count as referencing 'irq_remap_table' in
>>> kmemleak scan.
>>
>> Isn't it better to allocate the kmemleak object manually instead of
>> ignoring all irq-table pointers? With this patch we might not notice any
>> real leak of irq-tables.
> 
> We've considered that too, but found that the irq-tables is not
> dynamically alloc/free, they won't be freed once initialized, so there
> is no leaking for such object :-)

Is there any more concern? actually we just want to get rid of this
annoying report on obj won't leak, if you're going to create obj for
'irq_lookup_table' that's also fine for us, or will you pick this patch?

Regards,
Michael Wang

> 
> Regards,
> Michael Wang
> 
>>
>>
>>
>>  Joerg
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-12-02 Thread Michael Wang


On 12/02/2015 03:13 PM, Borislav Petkov wrote:
> On Wed, Dec 02, 2015 at 03:09:18PM +0100, Michael Wang wrote:
>> This tool will help improve the kernel, AFAIK it's already made it's
>> best, if you got any idea on how to make it even better that would be
>> great, but at this moment, it still need few of care :-P
> 
> I think you're replying to my emails without even reading what I said.
> So I'm going to do the same and stop wasting my time.

First of all, I respect all your reply, and reply regarding your point.

Frankly speaking, I think you know all these already, it's not a big
deal but you refuse to obey the rules setup by others, although it do
help make things better, and benefit yourself too.

If you refuse to make things better I'm totally fine, time is valuable
for all of us :-)

Regards,
Michael Wang

> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-12-02 Thread Michael Wang
Hi, Joerg

On 11/25/2015 04:14 PM, Michael Wang wrote:
> On 11/25/2015 04:08 PM, Joerg Roedel wrote:
[snip]
>>> This is caused by the 'irq_lookup_table' was allocated with
>>> __get_free_pages() which won't create kmemleak object, thus it's
>>> pointers won't be count as referencing 'irq_remap_table' in
>>> kmemleak scan.
>>
>> Isn't it better to allocate the kmemleak object manually instead of
>> ignoring all irq-table pointers? With this patch we might not notice any
>> real leak of irq-tables.
> 
> We've considered that too, but found that the irq-tables is not
> dynamically alloc/free, they won't be freed once initialized, so there
> is no leaking for such object :-)

Is there any more concern? actually we just want to get rid of this
annoying report on obj won't leak, if you're going to create obj for
'irq_lookup_table' that's also fine for us, or will you pick this patch?

Regards,
Michael Wang

> 
> Regards,
> Michael Wang
> 
>>
>>
>>
>>  Joerg
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-12-02 Thread Michael Wang

On 12/02/2015 11:52 AM, Catalin Marinas wrote:
[snip]
>>
>> Is there any more concern? actually we just want to get rid of this
>> annoying report on obj won't leak, if you're going to create obj for
>> 'irq_lookup_table' that's also fine for us, or will you pick this patch?
> 
> My preference (from a kmemleak perspective) is to tell kmemleak about
> the irq_lookup_table. Untested:

I'm fine with both solution, will leave the decision to maintainer :-)

BTW, could you please send a formal patch with descriptions?

Regards,
Michael Wang

> 
> diff --git a/drivers/iommu/amd_iommu_init.c b/drivers/iommu/amd_iommu_init.c
> index 013bdfff2d4d..c41609f71cbe 100644
> --- a/drivers/iommu/amd_iommu_init.c
> +++ b/drivers/iommu/amd_iommu_init.c
> @@ -27,6 +27,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1692,6 +1693,7 @@ static struct syscore_ops amd_iommu_syscore_ops = {
> 
>  static void __init free_on_init_error(void)
>  {
> + kmemleak_free(irq_lookup_table);
>   free_pages((unsigned long)irq_lookup_table,
> get_order(rlookup_table_size));
> 
> @@ -1906,6 +1908,7 @@ static int __init early_amd_iommu_init(void)
>   irq_lookup_table = (void *)__get_free_pages(
>   GFP_KERNEL | __GFP_ZERO,
>   get_order(rlookup_table_size));
> + kmemleak_alloc(irq_lookup_table, rlookup_table_size, 1, GFP_KERNEL);
>   if (!irq_lookup_table)
>   goto out;
>   }
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-12-02 Thread Michael Wang

On 12/02/2015 12:31 PM, Catalin Marinas wrote:
> On 2 December 2015 at 10:56, Michael Wang <yun.w...@profitbricks.com> wrote:
[snip]
> 
> I could copy your description but I don't currently have a way (nor
> time) to test the patch. If you plan to test it anyway, please feel
> free to include my diff (which I guess was badly re-formatted by
> gmail), I don't really mind which author it is (I found it easier to
> show a diff than explain in plain English ;)).

Unfortunately that's the same on my side... we already close the ticket
and I don't have resources to testing it again.

Joerg, this is really a tiny fix, would you mind to merge it into some
of your cleanup patch and testing them together? we are not in hurry,
just want to make sure the issue will get solved.

Regards,
Michael Wang

> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-12-02 Thread Michael Wang


On 12/02/2015 12:51 PM, Joerg Roedel wrote:
> On Wed, Dec 02, 2015 at 12:38:03PM +0100, Michael Wang wrote:
>> Joerg, this is really a tiny fix, would you mind to merge it into some
>> of your cleanup patch and testing them together? we are not in hurry,
>> just want to make sure the issue will get solved.
> 
> I am not doing your work. You sent a patch, received feedback, and now
> you can send a new patch based on it. Thats the process. If it addresses
> the feedback I will merge it, but I will not scissor your patch
> together.

It's not my work or your work... it's a defect in the module and maintainer
should take responsibility on fixing it, correct?

We're very willing to help, but as I mentioned we are out of resource for
testing at this moment, but we can send you a new patch without testing,
will that works for you?

Regards,
Michael Wang

> 
> 
>   Joerg
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-12-02 Thread Michael Wang


On 12/02/2015 01:53 PM, Borislav Petkov wrote:
> On Wed, Dec 02, 2015 at 01:31:38PM +0100, Michael Wang wrote:
>> It's not my work or your work... it's a defect in the module and maintainer
>> should take responsibility on fixing it, correct?
> 
> Well, you said "actually we just want to get rid of this annoying report
> on obj won't leak..."
> 
> It sounds to me like you want to have something fixed. So you do the
> patch properly, add to the commit message why exactly you're doing it
> and test it. Like everyone else.

Yeah.. it's a little complicated since we have our own kernel tree and this
won't be a problem for us, but we really prefer to help fix it in mainline
too, as long as this is really a defect, so others could save time on research
in future.

But seems like we can only wait for another chance to confirm the another
solution, frankly speaking I think we both will forgot this soon... fortunately
it's not that critical :-P

Regards,
Michael Wang

> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-12-02 Thread Michael Wang

On 12/02/2015 01:56 PM, Joerg Roedel wrote:
> On Wed, Dec 02, 2015 at 01:31:38PM +0100, Michael Wang wrote:
>> It's not my work or your work... it's a defect in the module and maintainer
>> should take responsibility on fixing it, correct?
> 
> No, its a false positive from an in-kernel checking tool, the iommu
> driver is correct. You just sent a patch to silence the false positive
> report.

Yeah, but caused by the driver :-P and have to be fixed in there too.

> 
>> We're very willing to help, but as I mentioned we are out of resource for
>> testing at this moment, but we can send you a new patch without testing,
>> will that works for you?
> 
> This should be testable on any AMD IOMMU system with working interrupt
> remapping. I will probably have no time to test this, if you really
> can't test yourself, try to get a Tested-by from someone else.

Good point, anyone willing to help test the fix? Or better provide the new
patch to be the author.

Regards,
Michael Wang

> 
> 
> 
>   Joerg
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-12-02 Thread Michael Wang
Hi, Borislav

On 12/02/2015 02:13 PM, Borislav Petkov wrote:
> On Wed, Dec 02, 2015 at 02:01:55PM +0100, Michael Wang wrote:
>> Yeah.. it's a little complicated since we have our own kernel tree and this
>> won't be a problem for us, but we really prefer to help fix it in mainline
>> too, as long as this is really a defect, so others could save time on 
>> research
>> in future.
> 
> Well, to keep it realistic and if it were me, I wouldn't even take such
> a fix as it is apparently kmemleak's problem.

Do you mean this could be a real kmemleak? Could you please provide more 
details?

> 
> So you could fix your testing instead to ignore that error message now
> that you know it is a false-positive. That should be easiest.
> 

Yeah, but it would be better to solve it, otherwise whoever saw this report
will need to go into the amd-iommu, make sure it's not a real leak, then
change their testing script...

Regards,
Michael Wang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-12-02 Thread Michael Wang


On 12/02/2015 02:40 PM, Borislav Petkov wrote:
> On Wed, Dec 02, 2015 at 02:18:44PM +0100, Michael Wang wrote:
[snip]
> 
>> Yeah, but it would be better to solve it, otherwise whoever saw this
>> report will need to go into the amd-iommu, make sure it's not a real
>> leak, then change their testing script...
> 
> No, you don't need to go into the iommu - you need to fix kmemleak.
> 
> And frankly, I'm getting sick and tired of all those tools needing
> special handling and us adding code just so that they're happy. If the
> tools can't figure out something, they shouldn't warn just in case but
> shut up instead.

If you mean the design of kmemleak, IMHO it's not that bad.

The problem is regarding performance, think about if kmemleak go into
every page to find out pointers, I guess the whole system will stuck.

I'm not sure why amd-iommu use get_page not kmalloc to initialize the pointer
table, but if it's necessary then the conflict will be there, it's not the fault
of driver or kmemleak, but the design require them to cooperate with each
other.

Regards,
Michael Wang

> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-12-02 Thread Michael Wang


On 12/02/2015 02:59 PM, Borislav Petkov wrote:
> On Wed, Dec 02, 2015 at 02:48:47PM +0100, Michael Wang wrote:
>> I'm not sure why amd-iommu use get_page not kmalloc to initialize the
>> pointer table, but if it's necessary then the conflict will be there,
>> it's not the fault of driver or kmemleak, but the design require them
>> to cooperate with each other.
> 
> So, according to you, we should go and "fix" all callers of
> __get_free_pages() to make kmemleak happy. Then when the next new tool
> comes along, we should "fix" another kernel API just so that the tools
> are happy.

That's the way we have to detect leak, no driver could get rid of
the possibility of memory leaking, so it should respect the rule to
help others locating the problem, if a driver full of false report then
most likely folks will gradually lost interests on help fix leaking
problem for it.

> 
> Bzzt. Wrong!
> 
> The tools should work without sprinkling their code everywhere. Driver
> etc developers don't need to care about what tool they make happy or
> not. Tools' hooks should be hidden in macro magic so that developers
> don't care.

This tool will help improve the kernel, AFAIK it's already made it's
best, if you got any idea on how to make it even better that would be
great, but at this moment, it still need few of care :-P

Regards,
Michael Wang

> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-11-25 Thread Michael Wang
On 11/25/2015 04:08 PM, Joerg Roedel wrote:
> On Fri, Nov 20, 2015 at 12:33:50PM +0100, Michael Wang wrote:
>> The kmemleak testing on 3.18.24 show:
>>
>> unreferenced object 0x880233ff9010 (size 16):
>>   comm "swapper/0", pid 1, jiffies 4294937440 (age 2010.490s)
>>   hex dump (first 16 bytes):
>> 0a 0a 00 00 20 00 00 00 00 44 fb 33 02 88 ff ff   D.3
>>   backtrace:
>> [] create_object+0x10d/0x2d0
>> [] kmemleak_alloc+0x5b/0xc0
>> [] kmem_cache_alloc_trace+0xb9/0x160
>> [] get_irq_table+0x151/0x380
>>
>> This is caused by the 'irq_lookup_table' was allocated with
>> __get_free_pages() which won't create kmemleak object, thus it's
>> pointers won't be count as referencing 'irq_remap_table' in
>> kmemleak scan.
> 
> Isn't it better to allocate the kmemleak object manually instead of
> ignoring all irq-table pointers? With this patch we might not notice any
> real leak of irq-tables.

We've considered that too, but found that the irq-tables is not
dynamically alloc/free, they won't be freed once initialized, so there
is no leaking for such object :-)

Regards,
Michael Wang

> 
> 
> 
>   Joerg
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-11-25 Thread Michael Wang
Hi, Joery

On 11/20/2015 12:33 PM, Michael Wang wrote:
> The kmemleak testing on 3.18.24 show:
> 
> unreferenced object 0x880233ff9010 (size 16):
>   comm "swapper/0", pid 1, jiffies 4294937440 (age 2010.490s)
>   hex dump (first 16 bytes):
> 0a 0a 00 00 20 00 00 00 00 44 fb 33 02 88 ff ff   D.3
>   backtrace:
> [] create_object+0x10d/0x2d0
> [] kmemleak_alloc+0x5b/0xc0
> [] kmem_cache_alloc_trace+0xb9/0x160
> [] get_irq_table+0x151/0x380
> 
> This is caused by the 'irq_lookup_table' was allocated with
> __get_free_pages() which won't create kmemleak object, thus it's
> pointers won't be count as referencing 'irq_remap_table' in
> kmemleak scan.
> 
> The 'irq_remap_table' won't be freed after initialized, doesn't
> make sense to check it's leaking.
> 
> This patch mark the 'irq_remap_table' object as 'gray' to stop
> the 'false positives' report.

Any comments on this one?

Regards,
Michael Wang

> 
> Signed-off-by: Michael Wang 
> ---
> v2:
>   Use kmemleak_not_leak() instead of kmemleak_ignore() since
>   the 'irq_remap_table' itself also contain pointer.
> 
>  drivers/iommu/amd_iommu.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 8b2be1e..87a1a88 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -3603,6 +3603,7 @@ static struct irq_remap_table *get_irq_table(u16 devid, 
> bool ioapic)
>   }
>  
>   irq_lookup_table[devid] = table;
> + kmemleak_not_leak(table);
>   set_dte_irq_entry(devid, table);
>   iommu_flush_dte(iommu, devid);
>   if (devid != alias) {
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-11-25 Thread Michael Wang
On 11/25/2015 04:08 PM, Joerg Roedel wrote:
> On Fri, Nov 20, 2015 at 12:33:50PM +0100, Michael Wang wrote:
>> The kmemleak testing on 3.18.24 show:
>>
>> unreferenced object 0x880233ff9010 (size 16):
>>   comm "swapper/0", pid 1, jiffies 4294937440 (age 2010.490s)
>>   hex dump (first 16 bytes):
>> 0a 0a 00 00 20 00 00 00 00 44 fb 33 02 88 ff ff   D.3
>>   backtrace:
>> [] create_object+0x10d/0x2d0
>> [] kmemleak_alloc+0x5b/0xc0
>> [] kmem_cache_alloc_trace+0xb9/0x160
>> [] get_irq_table+0x151/0x380
>>
>> This is caused by the 'irq_lookup_table' was allocated with
>> __get_free_pages() which won't create kmemleak object, thus it's
>> pointers won't be count as referencing 'irq_remap_table' in
>> kmemleak scan.
> 
> Isn't it better to allocate the kmemleak object manually instead of
> ignoring all irq-table pointers? With this patch we might not notice any
> real leak of irq-tables.

We've considered that too, but found that the irq-tables is not
dynamically alloc/free, they won't be freed once initialized, so there
is no leaking for such object :-)

Regards,
Michael Wang

> 
> 
> 
>   Joerg
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-11-25 Thread Michael Wang
Hi, Joery

On 11/20/2015 12:33 PM, Michael Wang wrote:
> The kmemleak testing on 3.18.24 show:
> 
> unreferenced object 0x880233ff9010 (size 16):
>   comm "swapper/0", pid 1, jiffies 4294937440 (age 2010.490s)
>   hex dump (first 16 bytes):
> 0a 0a 00 00 20 00 00 00 00 44 fb 33 02 88 ff ff   D.3
>   backtrace:
> [] create_object+0x10d/0x2d0
> [] kmemleak_alloc+0x5b/0xc0
> [] kmem_cache_alloc_trace+0xb9/0x160
> [] get_irq_table+0x151/0x380
> 
> This is caused by the 'irq_lookup_table' was allocated with
> __get_free_pages() which won't create kmemleak object, thus it's
> pointers won't be count as referencing 'irq_remap_table' in
> kmemleak scan.
> 
> The 'irq_remap_table' won't be freed after initialized, doesn't
> make sense to check it's leaking.
> 
> This patch mark the 'irq_remap_table' object as 'gray' to stop
> the 'false positives' report.

Any comments on this one?

Regards,
Michael Wang

> 
> Signed-off-by: Michael Wang <yun.w...@profitbricks.com>
> ---
> v2:
>   Use kmemleak_not_leak() instead of kmemleak_ignore() since
>   the 'irq_remap_table' itself also contain pointer.
> 
>  drivers/iommu/amd_iommu.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 8b2be1e..87a1a88 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -3603,6 +3603,7 @@ static struct irq_remap_table *get_irq_table(u16 devid, 
> bool ioapic)
>   }
>  
>   irq_lookup_table[devid] = table;
> + kmemleak_not_leak(table);
>   set_dte_irq_entry(devid, table);
>   iommu_flush_dte(iommu, devid);
>   if (devid != alias) {
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: nfnetlink warnings

2015-11-23 Thread Michael Wang


On 11/23/2015 11:32 AM, Borislav Petkov wrote:
> On Mon, Nov 23, 2015 at 11:20:18AM +0100, Michael Wang wrote:
>> Who want to do that would take responsibility to make an else branch at
>> that time, but reserve the branch at this moment sounds unnecessary, and
>> not that pretty frankly speaking.
> 
> Actually, I was looking for the better idea which doesn't uglify the
> code. And here it is:
> 
> https://lkml.kernel.org/r/5585663.OcpAQiytKY@wuerfel

Looks even better :-)

Regards,
Michael Wang

> 
> :-)
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: nfnetlink warnings

2015-11-23 Thread Michael Wang


On 11/23/2015 10:54 AM, Borislav Petkov wrote:
> Hi Michael,
> 
> On Mon, Nov 23, 2015 at 10:49:34AM +0100, Michael Wang wrote:
>> Why not just initialized it as NULL, or mark it as uninitialized_var()?
> 
> because I'd like us to save us the redundant NULL initialization in the
> if-case.

Well, I would vote initialized with NULL, rather than use another else
branch to do the same thing.

> 
> I'm not saying any of the approaches are good visually, though. Who
> knows, someone might have a better idea like, maybe "Oh, I wanted to
> rewrite that code and this handlong is going to be different anyway ..."
> or so. Or something to that effect.

Who want to do that would take responsibility to make an else branch at
that time, but reserve the branch at this moment sounds unnecessary, and
not that pretty frankly speaking.

> 
> Btw, please do not top-post.

Enjoy ;-)

Regards,
Michael Wang

> 
> Thanks.
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: nfnetlink warnings

2015-11-23 Thread Michael Wang
Hi, Borislav

Why not just initialized it as NULL, or mark it as uninitialized_var()?

Regards,
Michael Wang

On 11/23/2015 10:36 AM, Borislav Petkov wrote:
> Hey,
> 
> so I keep getting those since recently:
> 
> net/netfilter/nfnetlink_queue.c:519:19: warning: ‘nfnl_ct’ may be used 
> uninitialized in this function [-Wmaybe-uninitialized]
>   if (ct && nfnl_ct->build(skb, ct, ctinfo, NFQA_CT, NFQA_CT_INFO) < 0)
>^
> net/netfilter/nfnetlink_queue.c:316:23: note: ‘nfnl_ct’ was declared here
>   struct nfnl_ct_hook *nfnl_ct;
>^
> net/netfilter/nfnetlink_queue.c: In function ‘nfqnl_recv_verdict’:
> net/netfilter/nfnetlink_queue.c:1083:11: warning: ‘nfnl_ct’ may be used 
> uninitialized in this function [-Wmaybe-uninitialized]
> nfnl_ct->seq_adjust(entry->skb, ct, ctinfo, diff);
>^
> 
> and was thinking can we shut them up like this? I know, it is ugly :-\
> 
> I mean, it is obvious in both cases that nfnl_ct won't be used if ct is
> not set but apparently gcc can't see that far...
> 
> ---
> diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
> index 7d81d280cb4f..cd61b0b5c413 100644
> --- a/net/netfilter/nfnetlink_queue.c
> +++ b/net/netfilter/nfnetlink_queue.c
> @@ -372,6 +372,8 @@ nfqnl_build_packet_message(struct net *net, struct 
> nfqnl_instance *queue,
>   if (ct != NULL)
>   size += nfnl_ct->build_size(ct);
>   }
> + } else {
> + nfnl_ct = NULL;
>   }
>  
>   if (queue->flags & NFQA_CFG_F_UID_GID) {
> @@ -1069,6 +1071,8 @@ nfqnl_recv_verdict(struct sock *ctnl, struct sk_buff 
> *skb,
>   nfnl_ct = rcu_dereference(nfnl_ct_hook);
>   if (nfnl_ct != NULL)
>   ct = nfqnl_ct_parse(nfnl_ct, nlh, nfqa, entry, );
> + } else {
> + nfnl_ct = NULL;
>   }
>  
>   if (nfqa[NFQA_PAYLOAD]) {
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: nfnetlink warnings

2015-11-23 Thread Michael Wang


On 11/23/2015 11:32 AM, Borislav Petkov wrote:
> On Mon, Nov 23, 2015 at 11:20:18AM +0100, Michael Wang wrote:
>> Who want to do that would take responsibility to make an else branch at
>> that time, but reserve the branch at this moment sounds unnecessary, and
>> not that pretty frankly speaking.
> 
> Actually, I was looking for the better idea which doesn't uglify the
> code. And here it is:
> 
> https://lkml.kernel.org/r/5585663.OcpAQiytKY@wuerfel

Looks even better :-)

Regards,
Michael Wang

> 
> :-)
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: nfnetlink warnings

2015-11-23 Thread Michael Wang


On 11/23/2015 10:54 AM, Borislav Petkov wrote:
> Hi Michael,
> 
> On Mon, Nov 23, 2015 at 10:49:34AM +0100, Michael Wang wrote:
>> Why not just initialized it as NULL, or mark it as uninitialized_var()?
> 
> because I'd like us to save us the redundant NULL initialization in the
> if-case.

Well, I would vote initialized with NULL, rather than use another else
branch to do the same thing.

> 
> I'm not saying any of the approaches are good visually, though. Who
> knows, someone might have a better idea like, maybe "Oh, I wanted to
> rewrite that code and this handlong is going to be different anyway ..."
> or so. Or something to that effect.

Who want to do that would take responsibility to make an else branch at
that time, but reserve the branch at this moment sounds unnecessary, and
not that pretty frankly speaking.

> 
> Btw, please do not top-post.

Enjoy ;-)

Regards,
Michael Wang

> 
> Thanks.
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: nfnetlink warnings

2015-11-23 Thread Michael Wang
Hi, Borislav

Why not just initialized it as NULL, or mark it as uninitialized_var()?

Regards,
Michael Wang

On 11/23/2015 10:36 AM, Borislav Petkov wrote:
> Hey,
> 
> so I keep getting those since recently:
> 
> net/netfilter/nfnetlink_queue.c:519:19: warning: ‘nfnl_ct’ may be used 
> uninitialized in this function [-Wmaybe-uninitialized]
>   if (ct && nfnl_ct->build(skb, ct, ctinfo, NFQA_CT, NFQA_CT_INFO) < 0)
>^
> net/netfilter/nfnetlink_queue.c:316:23: note: ‘nfnl_ct’ was declared here
>   struct nfnl_ct_hook *nfnl_ct;
>^
> net/netfilter/nfnetlink_queue.c: In function ‘nfqnl_recv_verdict’:
> net/netfilter/nfnetlink_queue.c:1083:11: warning: ‘nfnl_ct’ may be used 
> uninitialized in this function [-Wmaybe-uninitialized]
> nfnl_ct->seq_adjust(entry->skb, ct, ctinfo, diff);
>^
> 
> and was thinking can we shut them up like this? I know, it is ugly :-\
> 
> I mean, it is obvious in both cases that nfnl_ct won't be used if ct is
> not set but apparently gcc can't see that far...
> 
> ---
> diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
> index 7d81d280cb4f..cd61b0b5c413 100644
> --- a/net/netfilter/nfnetlink_queue.c
> +++ b/net/netfilter/nfnetlink_queue.c
> @@ -372,6 +372,8 @@ nfqnl_build_packet_message(struct net *net, struct 
> nfqnl_instance *queue,
>   if (ct != NULL)
>   size += nfnl_ct->build_size(ct);
>   }
> + } else {
> + nfnl_ct = NULL;
>   }
>  
>   if (queue->flags & NFQA_CFG_F_UID_GID) {
> @@ -1069,6 +1071,8 @@ nfqnl_recv_verdict(struct sock *ctnl, struct sk_buff 
> *skb,
>   nfnl_ct = rcu_dereference(nfnl_ct_hook);
>   if (nfnl_ct != NULL)
>   ct = nfqnl_ct_parse(nfnl_ct, nlh, nfqa, entry, );
> + } else {
> + nfnl_ct = NULL;
>   }
>  
>   if (nfqa[NFQA_PAYLOAD]) {
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-11-20 Thread Michael Wang


On 11/20/2015 12:33 PM, Michael Wang wrote:
> The kmemleak testing on 3.18.24 show:
> 
> unreferenced object 0x880233ff9010 (size 16):
>   comm "swapper/0", pid 1, jiffies 4294937440 (age 2010.490s)
>   hex dump (first 16 bytes):
> 0a 0a 00 00 20 00 00 00 00 44 fb 33 02 88 ff ff   D.3
>   backtrace:
> [] create_object+0x10d/0x2d0
> [] kmemleak_alloc+0x5b/0xc0
> [] kmem_cache_alloc_trace+0xb9/0x160
> [] get_irq_table+0x151/0x380
> 
> This is caused by the 'irq_lookup_table' was allocated with
> __get_free_pages() which won't create kmemleak object, thus it's
> pointers won't be count as referencing 'irq_remap_table' in
> kmemleak scan.
> 
> The 'irq_remap_table' won't be freed after initialized, doesn't
> make sense to check it's leaking.
> 
> This patch mark the 'irq_remap_table' object as 'gray' to stop
> the 'false positives' report.
> 
> Signed-off-by: Michael Wang 

Reported-by: Danil Kipnis 

Regards,
Michael Wang

> ---
> v2:
>   Use kmemleak_not_leak() instead of kmemleak_ignore() since
>   the 'irq_remap_table' itself also contain pointer.
> 
>  drivers/iommu/amd_iommu.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 8b2be1e..87a1a88 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -3603,6 +3603,7 @@ static struct irq_remap_table *get_irq_table(u16 devid, 
> bool ioapic)
>   }
>  
>   irq_lookup_table[devid] = table;
> + kmemleak_not_leak(table);
>   set_dte_irq_entry(devid, table);
>   iommu_flush_dte(iommu, devid);
>   if (devid != alias) {
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-11-20 Thread Michael Wang
The kmemleak testing on 3.18.24 show:

unreferenced object 0x880233ff9010 (size 16):
  comm "swapper/0", pid 1, jiffies 4294937440 (age 2010.490s)
  hex dump (first 16 bytes):
0a 0a 00 00 20 00 00 00 00 44 fb 33 02 88 ff ff   D.3
  backtrace:
[] create_object+0x10d/0x2d0
[] kmemleak_alloc+0x5b/0xc0
[] kmem_cache_alloc_trace+0xb9/0x160
[] get_irq_table+0x151/0x380

This is caused by the 'irq_lookup_table' was allocated with
__get_free_pages() which won't create kmemleak object, thus it's
pointers won't be count as referencing 'irq_remap_table' in
kmemleak scan.

The 'irq_remap_table' won't be freed after initialized, doesn't
make sense to check it's leaking.

This patch mark the 'irq_remap_table' object as 'gray' to stop
the 'false positives' report.

Signed-off-by: Michael Wang 
---
v2:
  Use kmemleak_not_leak() instead of kmemleak_ignore() since
  the 'irq_remap_table' itself also contain pointer.

 drivers/iommu/amd_iommu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 8b2be1e..87a1a88 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -3603,6 +3603,7 @@ static struct irq_remap_table *get_irq_table(u16 devid, 
bool ioapic)
}
 
irq_lookup_table[devid] = table;
+   kmemleak_not_leak(table);
set_dte_irq_entry(devid, table);
iommu_flush_dte(iommu, devid);
if (devid != alias) {
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH] iommu/amd: make kmemleak ignore the 'irq_remap_table' object

2015-11-20 Thread Michael Wang
The kmemleak testing on 3.18.24 show:

unreferenced object 0x880233ff9010 (size 16):
  comm "swapper/0", pid 1, jiffies 4294937440 (age 2010.490s)
  hex dump (first 16 bytes):
0a 0a 00 00 20 00 00 00 00 44 fb 33 02 88 ff ff   D.3
  backtrace:
[] create_object+0x10d/0x2d0
[] kmemleak_alloc+0x5b/0xc0
[] kmem_cache_alloc_trace+0xb9/0x160
[] get_irq_table+0x151/0x380

This is caused by the 'irq_lookup_table' was allocated with
__get_free_pages() which won't create kmemleak object, thus it's
pointers won't be count as referencing in kmemleak scanning.

The 'irq_remap_table' allocated won't be freed after initialized,
doesn't make sense to let kmemleak scan it.

This patch mark the 'irq_remap_table' object as 'ignored' to
stop the 'false positives' report.

Signed-off-by: Michael Wang 
---
 drivers/iommu/amd_iommu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 8b2be1e..87a1a88 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -3603,6 +3603,7 @@ static struct irq_remap_table *get_irq_table(u16 devid, 
bool ioapic)
}
 
irq_lookup_table[devid] = table;
+   kmemleak_ignore(table);
set_dte_irq_entry(devid, table);
iommu_flush_dte(iommu, devid);
if (devid != alias) {
-- 
2.1.4
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-11-20 Thread Michael Wang


On 11/20/2015 12:33 PM, Michael Wang wrote:
> The kmemleak testing on 3.18.24 show:
> 
> unreferenced object 0x880233ff9010 (size 16):
>   comm "swapper/0", pid 1, jiffies 4294937440 (age 2010.490s)
>   hex dump (first 16 bytes):
> 0a 0a 00 00 20 00 00 00 00 44 fb 33 02 88 ff ff   D.3
>   backtrace:
> [] create_object+0x10d/0x2d0
> [] kmemleak_alloc+0x5b/0xc0
> [] kmem_cache_alloc_trace+0xb9/0x160
> [] get_irq_table+0x151/0x380
> 
> This is caused by the 'irq_lookup_table' was allocated with
> __get_free_pages() which won't create kmemleak object, thus it's
> pointers won't be count as referencing 'irq_remap_table' in
> kmemleak scan.
> 
> The 'irq_remap_table' won't be freed after initialized, doesn't
> make sense to check it's leaking.
> 
> This patch mark the 'irq_remap_table' object as 'gray' to stop
> the 'false positives' report.
> 
> Signed-off-by: Michael Wang <yun.w...@profitbricks.com>

Reported-by: Danil Kipnis <danil.kip...@profitbricks.com>

Regards,
Michael Wang

> ---
> v2:
>   Use kmemleak_not_leak() instead of kmemleak_ignore() since
>   the 'irq_remap_table' itself also contain pointer.
> 
>  drivers/iommu/amd_iommu.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 8b2be1e..87a1a88 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -3603,6 +3603,7 @@ static struct irq_remap_table *get_irq_table(u16 devid, 
> bool ioapic)
>   }
>  
>   irq_lookup_table[devid] = table;
> + kmemleak_not_leak(table);
>   set_dte_irq_entry(devid, table);
>   iommu_flush_dte(iommu, devid);
>   if (devid != alias) {
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH] iommu/amd: make kmemleak ignore the 'irq_remap_table' object

2015-11-20 Thread Michael Wang
The kmemleak testing on 3.18.24 show:

unreferenced object 0x880233ff9010 (size 16):
  comm "swapper/0", pid 1, jiffies 4294937440 (age 2010.490s)
  hex dump (first 16 bytes):
0a 0a 00 00 20 00 00 00 00 44 fb 33 02 88 ff ff   D.3
  backtrace:
[] create_object+0x10d/0x2d0
[] kmemleak_alloc+0x5b/0xc0
[] kmem_cache_alloc_trace+0xb9/0x160
[] get_irq_table+0x151/0x380

This is caused by the 'irq_lookup_table' was allocated with
__get_free_pages() which won't create kmemleak object, thus it's
pointers won't be count as referencing in kmemleak scanning.

The 'irq_remap_table' allocated won't be freed after initialized,
doesn't make sense to let kmemleak scan it.

This patch mark the 'irq_remap_table' object as 'ignored' to
stop the 'false positives' report.

Signed-off-by: Michael Wang <yun.w...@profitbricks.com>
---
 drivers/iommu/amd_iommu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 8b2be1e..87a1a88 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -3603,6 +3603,7 @@ static struct irq_remap_table *get_irq_table(u16 devid, 
bool ioapic)
}
 
irq_lookup_table[devid] = table;
+   kmemleak_ignore(table);
set_dte_irq_entry(devid, table);
iommu_flush_dte(iommu, devid);
if (devid != alias) {
-- 
2.1.4
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak

2015-11-20 Thread Michael Wang
The kmemleak testing on 3.18.24 show:

unreferenced object 0x880233ff9010 (size 16):
  comm "swapper/0", pid 1, jiffies 4294937440 (age 2010.490s)
  hex dump (first 16 bytes):
0a 0a 00 00 20 00 00 00 00 44 fb 33 02 88 ff ff   D.3
  backtrace:
[] create_object+0x10d/0x2d0
[] kmemleak_alloc+0x5b/0xc0
[] kmem_cache_alloc_trace+0xb9/0x160
[] get_irq_table+0x151/0x380

This is caused by the 'irq_lookup_table' was allocated with
__get_free_pages() which won't create kmemleak object, thus it's
pointers won't be count as referencing 'irq_remap_table' in
kmemleak scan.

The 'irq_remap_table' won't be freed after initialized, doesn't
make sense to check it's leaking.

This patch mark the 'irq_remap_table' object as 'gray' to stop
the 'false positives' report.

Signed-off-by: Michael Wang <yun.w...@profitbricks.com>
---
v2:
  Use kmemleak_not_leak() instead of kmemleak_ignore() since
  the 'irq_remap_table' itself also contain pointer.

 drivers/iommu/amd_iommu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 8b2be1e..87a1a88 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -3603,6 +3603,7 @@ static struct irq_remap_table *get_irq_table(u16 devid, 
bool ioapic)
}
 
irq_lookup_table[devid] = table;
+   kmemleak_not_leak(table);
set_dte_irq_entry(devid, table);
iommu_flush_dte(iommu, devid);
if (devid != alias) {
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC v2] Documentation/infiniband: Add docs for rdma-helpers

2015-05-19 Thread Michael Wang


On 05/18/2015 06:58 PM, Doug Ledford wrote:
[snip]
>> I see :-) I've not work with the kdoc yet, not sure if there is any
>> guidelines on how to write the header of inline func for kdoc?
> 
> It's an automated tool thing.  Any comment section that starts with /**
> is automatically included as a kdoc.  Then there is an expected format
> after that.  See Documentation/kernel-doc-nano-HOWTO.txt.

Got it :-)

> 
>>>
>>> Just because I want to move this along versus waiting for another
>>> respin, I'm going to copy and paste these into those locations and clean
>>> up the changelog when I integrate this patch.
>>
>> Got it, if there is anything I could help, please let me know ;-)
> 
> I'm sending the patch for review, please let me know if you are OK with
> how I handled the attribution.

The definition is far more detailed and accurate, it's already good enough
according to my understanding, should benefit the developer a lot ;-)

Regards,
Michael Wang

> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC v2] Documentation/infiniband: Add docs for rdma-helpers

2015-05-19 Thread Michael Wang


On 05/18/2015 06:58 PM, Doug Ledford wrote:
[snip]
 I see :-) I've not work with the kdoc yet, not sure if there is any
 guidelines on how to write the header of inline func for kdoc?
 
 It's an automated tool thing.  Any comment section that starts with /**
 is automatically included as a kdoc.  Then there is an expected format
 after that.  See Documentation/kernel-doc-nano-HOWTO.txt.

Got it :-)

 

 Just because I want to move this along versus waiting for another
 respin, I'm going to copy and paste these into those locations and clean
 up the changelog when I integrate this patch.

 Got it, if there is anything I could help, please let me know ;-)
 
 I'm sending the patch for review, please let me know if you are OK with
 how I handled the attribution.

The definition is far more detailed and accurate, it's already good enough
according to my understanding, should benefit the developer a lot ;-)

Regards,
Michael Wang

 
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC v2] Documentation/infiniband: Add docs for rdma-helpers

2015-05-18 Thread Michael Wang


On 05/18/2015 05:21 PM, Doug Ledford wrote:
[snip]
>>
>> I'll put the highlights and changelog under '---' in next version, is it
>> looks like this?
> 
> We're still missing Jason's feedback request though.  Specifically, he
> pointed out that kdocs are usually not done in Documentation/*, they are
> done in the .c files where the function is (or the .h file if the
> function is an inline, which these all are).  So, you included some
> limited documentation for each of these items in your original patches
> that added them.  His request was that you put this expanded information
> not in Documentation/infiniband where someone has to go looking for it,
> but as part of the kdoc header for each of the various helpers in
> ib_verbs.h itself.

I see :-) I've not work with the kdoc yet, not sure if there is any
guidelines on how to write the header of inline func for kdoc?

> 
> Just because I want to move this along versus waiting for another
> respin, I'm going to copy and paste these into those locations and clean
> up the changelog when I integrate this patch.

Got it, if there is anything I could help, please let me know ;-)

Regards,
Michael Wang

> 
>>
>> Subject: [PATCH RFC v3] Documentation/infiniband: Add docs for rdma-helpers
>>
>> This is the following patch for:
>>   https://lkml.org/lkml/2015/5/5/417
>> which try to document the settled rdma_cap_XX().
>>
>> Signed-off-by: Michael Wang 
>> ---
>> Highlights:
>>   There could be many missing/mistakes/misunderstanding, please don't
>>   be hesitate to point out the issues, any suggestions to improve or
>>   complete the description are very welcomed ;-)
>>
>> v2:
>>   * Merge the descriptions from Doug:
>> http://www.spinics.net/lists/linux-rdma/msg25172.html
>>
>> v3:
>>   ...
>>
>>  Documentation/infiniband/rdma_helpers.txt | 79 
>> +++
>>  1 file changed, 79 insertions(+)
>>  create mode 100644 Documentation/infiniband/rdma_helpers.txt
>>
>> diff --git a/Documentation/infiniband/rdma_helpers.txt 
>> b/Documentation/infiniband/rdma_helpers.txt
>> new file mode 100644
>> index 000..be9416d
>> --- /dev/null
>> +++ b/Documentation/infiniband/rdma_helpers.txt
>>
>> Regards,
>> Michael Wang
>>
>>>
>>>>
>>>> Signed-off-by: Michael Wang 
>>>> ---
>>>>  Documentation/infiniband/rdma_helpers.txt | 79 
>>>> +++
>>>>  1 file changed, 79 insertions(+)
>>>>  create mode 100644 Documentation/infiniband/rdma_helpers.txt
>>>>
>>>> diff --git a/Documentation/infiniband/rdma_helpers.txt 
>>>> b/Documentation/infiniband/rdma_helpers.txt
>>>> new file mode 100644
>>>> index 000..be9416d
>>>> --- /dev/null
>>>> +++ b/Documentation/infiniband/rdma_helpers.txt
>>>> @@ -0,0 +1,79 @@
>>>> +RDMA HELPERS
>>>> +
>>>> +  The following helpers are used to check the specific capabilities of a
>>>> +  particular port before utilizing those capabilities.
>>>> +
>>>> +rdma_cap_ib_mad- Infiniband Management Datagrams.
>>>> +rdma_cap_ib_smi- Infiniband Subnet Management Interface.
>>>> +rdma_cap_ib_cm - Infiniband Communication Manager.
>>>> +rdma_cap_iw_cm - IWARP Communication Manager.
>>>> +rdma_cap_ib_sa - Infiniband Subnet Administration.
>>>> +rdma_cap_ib_mcast  - Infiniband Multicast Join/Leave Protocol.
>>>> +rdma_cap_read_multi_sge- RDMA Read Work Request Support Multiple 
>>>> SGE.
>>>> +rdma_cap_af_ib - Native Infiniband Address.
>>>> +rdma_cap_eth_ah- InfiniBand Transport With Ethernet 
>>>> Address.
>>>> +
>>>> +USAGE
>>>> +
>>>> +  if (rdma_cap_XX(device, i)) {
>>>> +   /* The port i of device support XX */
>>>> +   ...
>>>> +  } else {
>>>> +   /* The port i of device don't support XX */
>>>> +   ...
>>>> +  }
>>>> +
>>>> +  rdma_cap_ib_mad
>>>> +  ---
>>>> +Management Datagrams (MAD) are a required part of the InfiniBand
>>>> +specification and are supported on all InfiniBand devices.  A slightly
>>>> +extended version are also supported on OPA interfaces.
>>>> +
>>>&

Re: [PATCH RFC v2] Documentation/infiniband: Add docs for rdma-helpers

2015-05-18 Thread Michael Wang
Hi, Or

On 05/18/2015 11:47 AM, Or Gerlitz wrote:
[snip]
>> Highlights:
>>   There could be many missing/mistakes/misunderstanding, please don't
>>   be hesitate to point out the issues, any suggestions to improve or
>>   complete the description are very welcomed ;-)
> 
> Michael, none of what you wrote above belongs to the change-log which
> is going to stay for-ever in the upstream git repo. You should put it
> all belong the --- line after your S.O.B and add proper change log
> telling what this patch is about

Thanks for point out this for me :-)

I'll put the highlights and changelog under '---' in next version, is it
looks like this?


Subject: [PATCH RFC v3] Documentation/infiniband: Add docs for rdma-helpers

This is the following patch for:
  https://lkml.org/lkml/2015/5/5/417
which try to document the settled rdma_cap_XX().

Signed-off-by: Michael Wang 
---
Highlights:
  There could be many missing/mistakes/misunderstanding, please don't
  be hesitate to point out the issues, any suggestions to improve or
  complete the description are very welcomed ;-)

v2:
  * Merge the descriptions from Doug:
http://www.spinics.net/lists/linux-rdma/msg25172.html

v3:
  ...

 Documentation/infiniband/rdma_helpers.txt | 79 +++
 1 file changed, 79 insertions(+)
 create mode 100644 Documentation/infiniband/rdma_helpers.txt

diff --git a/Documentation/infiniband/rdma_helpers.txt 
b/Documentation/infiniband/rdma_helpers.txt
new file mode 100644
index 000..be9416d
--- /dev/null
+++ b/Documentation/infiniband/rdma_helpers.txt

Regards,
Michael Wang

> 
>>
>> Signed-off-by: Michael Wang 
>> ---
>>  Documentation/infiniband/rdma_helpers.txt | 79 
>> +++
>>  1 file changed, 79 insertions(+)
>>  create mode 100644 Documentation/infiniband/rdma_helpers.txt
>>
>> diff --git a/Documentation/infiniband/rdma_helpers.txt 
>> b/Documentation/infiniband/rdma_helpers.txt
>> new file mode 100644
>> index 000..be9416d
>> --- /dev/null
>> +++ b/Documentation/infiniband/rdma_helpers.txt
>> @@ -0,0 +1,79 @@
>> +RDMA HELPERS
>> +
>> +  The following helpers are used to check the specific capabilities of a
>> +  particular port before utilizing those capabilities.
>> +
>> +rdma_cap_ib_mad- Infiniband Management Datagrams.
>> +rdma_cap_ib_smi- Infiniband Subnet Management Interface.
>> +rdma_cap_ib_cm - Infiniband Communication Manager.
>> +rdma_cap_iw_cm - IWARP Communication Manager.
>> +rdma_cap_ib_sa - Infiniband Subnet Administration.
>> +rdma_cap_ib_mcast  - Infiniband Multicast Join/Leave Protocol.
>> +rdma_cap_read_multi_sge- RDMA Read Work Request Support Multiple 
>> SGE.
>> +rdma_cap_af_ib - Native Infiniband Address.
>> +rdma_cap_eth_ah- InfiniBand Transport With Ethernet Address.
>> +
>> +USAGE
>> +
>> +  if (rdma_cap_XX(device, i)) {
>> +   /* The port i of device support XX */
>> +   ...
>> +  } else {
>> +   /* The port i of device don't support XX */
>> +   ...
>> +  }
>> +
>> +  rdma_cap_ib_mad
>> +  ---
>> +Management Datagrams (MAD) are a required part of the InfiniBand
>> +specification and are supported on all InfiniBand devices.  A slightly
>> +extended version are also supported on OPA interfaces.
>> +
>> +  rdma_cap_ib_smi
>> +  ---
>> +Subnet Management Interface (SMI) will handle SMP packet from SM
>> +in an infiniband fabric.
>> +
>> +  rdma_cap_ib_cm
>> +  ---
>> +Communication Manager (CM) service, used to ease the process of 
>> connecting
>> +to a remote host.  The IB-CM can be used to connect to remote hosts 
>> using
>> +either InfiniBand or RoCE connections, iWARP has its own CM.
>> +
>> +  rdma_cap_iw_cm
>> +  ---
>> +iWARP Communication Manager (CM), Similar to the IB-CM, but only used on
>> +iWARP devices.
>> +
>> +  rdma_cap_ib_sa
>> +  ---
>> +Subnet Administration (SA) is the database built by SM in an
>> +infiniband fabric.
>> +
>> +  rdma_cap_ib_mcast
>> +  ---
>> +InfiniBand (and OPA) use a different multicast mechanism rather than
>> +traditional IP multicast found on Ethernet devices.  If this is true, 
>> then
>> +traditional IPv4/IPv6 multicast is handled by the IPoIB layer and direct
>> +multicast joins and leaves are handled per the InfiniBan

[PATCH RFC v2] Documentation/infiniband: Add docs for rdma-helpers

2015-05-18 Thread Michael Wang
Since v1:
  * Merge the descriptions from Doug:
http://www.spinics.net/lists/linux-rdma/msg25172.html

This is the following patch for:
  https://lkml.org/lkml/2015/5/5/417
which try to document the settled rdma_cap_XX().

Highlights:
  There could be many missing/mistakes/misunderstanding, please don't
  be hesitate to point out the issues, any suggestions to improve or
  complete the description are very welcomed ;-)

Signed-off-by: Michael Wang 
---
 Documentation/infiniband/rdma_helpers.txt | 79 +++
 1 file changed, 79 insertions(+)
 create mode 100644 Documentation/infiniband/rdma_helpers.txt

diff --git a/Documentation/infiniband/rdma_helpers.txt 
b/Documentation/infiniband/rdma_helpers.txt
new file mode 100644
index 000..be9416d
--- /dev/null
+++ b/Documentation/infiniband/rdma_helpers.txt
@@ -0,0 +1,79 @@
+RDMA HELPERS
+
+  The following helpers are used to check the specific capabilities of a
+  particular port before utilizing those capabilities.
+
+rdma_cap_ib_mad- Infiniband Management Datagrams.
+rdma_cap_ib_smi- Infiniband Subnet Management Interface.
+rdma_cap_ib_cm - Infiniband Communication Manager.
+rdma_cap_iw_cm - IWARP Communication Manager.
+rdma_cap_ib_sa - Infiniband Subnet Administration.
+rdma_cap_ib_mcast  - Infiniband Multicast Join/Leave Protocol.
+rdma_cap_read_multi_sge- RDMA Read Work Request Support Multiple SGE.
+rdma_cap_af_ib - Native Infiniband Address.
+rdma_cap_eth_ah- InfiniBand Transport With Ethernet Address.
+
+USAGE
+
+  if (rdma_cap_XX(device, i)) {
+   /* The port i of device support XX */
+   ...
+  } else {
+   /* The port i of device don't support XX */
+   ...
+  }
+
+  rdma_cap_ib_mad
+  ---
+Management Datagrams (MAD) are a required part of the InfiniBand
+specification and are supported on all InfiniBand devices.  A slightly
+extended version are also supported on OPA interfaces.
+
+  rdma_cap_ib_smi
+  ---
+Subnet Management Interface (SMI) will handle SMP packet from SM
+in an infiniband fabric.
+
+  rdma_cap_ib_cm
+  ---
+Communication Manager (CM) service, used to ease the process of connecting
+to a remote host.  The IB-CM can be used to connect to remote hosts using
+either InfiniBand or RoCE connections, iWARP has its own CM.
+
+  rdma_cap_iw_cm
+  ---
+iWARP Communication Manager (CM), Similar to the IB-CM, but only used on
+iWARP devices.
+
+  rdma_cap_ib_sa
+  ---
+Subnet Administration (SA) is the database built by SM in an
+infiniband fabric.
+
+  rdma_cap_ib_mcast
+  ---
+InfiniBand (and OPA) use a different multicast mechanism rather than
+traditional IP multicast found on Ethernet devices.  If this is true, then
+traditional IPv4/IPv6 multicast is handled by the IPoIB layer and direct
+multicast joins and leaves are handled per the InfiniBand specifications.
+
+  rdma_cap_read_multi_sge
+  ---
+Certain devices (iWARP in particular) have restrictions on the number of
+scatter gather elements that can be present in an RDMA READ work request,
+this is true if the device does not have that restriction.
+
+  rdma_cap_af_ib
+  ---
+Many code paths for traditional InfiniBand and RoCE links are the same,
+but need minor differences to accommodate the different addresses on the
+two types of connections.  This helper is true when the address of the
+specific connection is of the InfiniBand native variety.
+
+  rdma_cap_eth_ah
+  ---
+Queue Pair is InfiniBand transport, but uses Ethernet address instead
+of native InfiniBand address (aka, this is a RoCE QP, and that means
+ethertype 0x8915 + GRH for RoCEv1 and IP/UDP to well known UDP port for
+RoCEv2), this is true when the address family of the specific queue pair
+is of the Ethernet (RoCE) variety.
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC v2] Documentation/infiniband: Add docs for rdma-helpers

2015-05-18 Thread Michael Wang


On 05/18/2015 05:21 PM, Doug Ledford wrote:
[snip]

 I'll put the highlights and changelog under '---' in next version, is it
 looks like this?
 
 We're still missing Jason's feedback request though.  Specifically, he
 pointed out that kdocs are usually not done in Documentation/*, they are
 done in the .c files where the function is (or the .h file if the
 function is an inline, which these all are).  So, you included some
 limited documentation for each of these items in your original patches
 that added them.  His request was that you put this expanded information
 not in Documentation/infiniband where someone has to go looking for it,
 but as part of the kdoc header for each of the various helpers in
 ib_verbs.h itself.

I see :-) I've not work with the kdoc yet, not sure if there is any
guidelines on how to write the header of inline func for kdoc?

 
 Just because I want to move this along versus waiting for another
 respin, I'm going to copy and paste these into those locations and clean
 up the changelog when I integrate this patch.

Got it, if there is anything I could help, please let me know ;-)

Regards,
Michael Wang

 

 Subject: [PATCH RFC v3] Documentation/infiniband: Add docs for rdma-helpers

 This is the following patch for:
   https://lkml.org/lkml/2015/5/5/417
 which try to document the settled rdma_cap_XX().

 Signed-off-by: Michael Wang yun.w...@profitbricks.com
 ---
 Highlights:
   There could be many missing/mistakes/misunderstanding, please don't
   be hesitate to point out the issues, any suggestions to improve or
   complete the description are very welcomed ;-)

 v2:
   * Merge the descriptions from Doug:
 http://www.spinics.net/lists/linux-rdma/msg25172.html

 v3:
   ...

  Documentation/infiniband/rdma_helpers.txt | 79 
 +++
  1 file changed, 79 insertions(+)
  create mode 100644 Documentation/infiniband/rdma_helpers.txt

 diff --git a/Documentation/infiniband/rdma_helpers.txt 
 b/Documentation/infiniband/rdma_helpers.txt
 new file mode 100644
 index 000..be9416d
 --- /dev/null
 +++ b/Documentation/infiniband/rdma_helpers.txt

 Regards,
 Michael Wang



 Signed-off-by: Michael Wang yun.w...@profitbricks.com
 ---
  Documentation/infiniband/rdma_helpers.txt | 79 
 +++
  1 file changed, 79 insertions(+)
  create mode 100644 Documentation/infiniband/rdma_helpers.txt

 diff --git a/Documentation/infiniband/rdma_helpers.txt 
 b/Documentation/infiniband/rdma_helpers.txt
 new file mode 100644
 index 000..be9416d
 --- /dev/null
 +++ b/Documentation/infiniband/rdma_helpers.txt
 @@ -0,0 +1,79 @@
 +RDMA HELPERS
 +
 +  The following helpers are used to check the specific capabilities of a
 +  particular port before utilizing those capabilities.
 +
 +rdma_cap_ib_mad- Infiniband Management Datagrams.
 +rdma_cap_ib_smi- Infiniband Subnet Management Interface.
 +rdma_cap_ib_cm - Infiniband Communication Manager.
 +rdma_cap_iw_cm - IWARP Communication Manager.
 +rdma_cap_ib_sa - Infiniband Subnet Administration.
 +rdma_cap_ib_mcast  - Infiniband Multicast Join/Leave Protocol.
 +rdma_cap_read_multi_sge- RDMA Read Work Request Support Multiple 
 SGE.
 +rdma_cap_af_ib - Native Infiniband Address.
 +rdma_cap_eth_ah- InfiniBand Transport With Ethernet 
 Address.
 +
 +USAGE
 +
 +  if (rdma_cap_XX(device, i)) {
 +   /* The port i of device support XX */
 +   ...
 +  } else {
 +   /* The port i of device don't support XX */
 +   ...
 +  }
 +
 +  rdma_cap_ib_mad
 +  ---
 +Management Datagrams (MAD) are a required part of the InfiniBand
 +specification and are supported on all InfiniBand devices.  A slightly
 +extended version are also supported on OPA interfaces.
 +
 +  rdma_cap_ib_smi
 +  ---
 +Subnet Management Interface (SMI) will handle SMP packet from SM
 +in an infiniband fabric.
 +
 +  rdma_cap_ib_cm
 +  ---
 +Communication Manager (CM) service, used to ease the process of 
 connecting
 +to a remote host.  The IB-CM can be used to connect to remote hosts 
 using
 +either InfiniBand or RoCE connections, iWARP has its own CM.
 +
 +  rdma_cap_iw_cm
 +  ---
 +iWARP Communication Manager (CM), Similar to the IB-CM, but only used 
 on
 +iWARP devices.
 +
 +  rdma_cap_ib_sa
 +  ---
 +Subnet Administration (SA) is the database built by SM in an
 +infiniband fabric.
 +
 +  rdma_cap_ib_mcast
 +  ---
 +InfiniBand (and OPA) use a different multicast mechanism rather than
 +traditional IP multicast found on Ethernet devices.  If this is true, 
 then
 +traditional IPv4/IPv6 multicast is handled by the IPoIB layer and 
 direct
 +multicast joins and leaves are handled per the InfiniBand 
 specifications.
 +
 +  rdma_cap_read_multi_sge

[PATCH RFC v2] Documentation/infiniband: Add docs for rdma-helpers

2015-05-18 Thread Michael Wang
Since v1:
  * Merge the descriptions from Doug:
http://www.spinics.net/lists/linux-rdma/msg25172.html

This is the following patch for:
  https://lkml.org/lkml/2015/5/5/417
which try to document the settled rdma_cap_XX().

Highlights:
  There could be many missing/mistakes/misunderstanding, please don't
  be hesitate to point out the issues, any suggestions to improve or
  complete the description are very welcomed ;-)

Signed-off-by: Michael Wang yun.w...@profitbricks.com
---
 Documentation/infiniband/rdma_helpers.txt | 79 +++
 1 file changed, 79 insertions(+)
 create mode 100644 Documentation/infiniband/rdma_helpers.txt

diff --git a/Documentation/infiniband/rdma_helpers.txt 
b/Documentation/infiniband/rdma_helpers.txt
new file mode 100644
index 000..be9416d
--- /dev/null
+++ b/Documentation/infiniband/rdma_helpers.txt
@@ -0,0 +1,79 @@
+RDMA HELPERS
+
+  The following helpers are used to check the specific capabilities of a
+  particular port before utilizing those capabilities.
+
+rdma_cap_ib_mad- Infiniband Management Datagrams.
+rdma_cap_ib_smi- Infiniband Subnet Management Interface.
+rdma_cap_ib_cm - Infiniband Communication Manager.
+rdma_cap_iw_cm - IWARP Communication Manager.
+rdma_cap_ib_sa - Infiniband Subnet Administration.
+rdma_cap_ib_mcast  - Infiniband Multicast Join/Leave Protocol.
+rdma_cap_read_multi_sge- RDMA Read Work Request Support Multiple SGE.
+rdma_cap_af_ib - Native Infiniband Address.
+rdma_cap_eth_ah- InfiniBand Transport With Ethernet Address.
+
+USAGE
+
+  if (rdma_cap_XX(device, i)) {
+   /* The port i of device support XX */
+   ...
+  } else {
+   /* The port i of device don't support XX */
+   ...
+  }
+
+  rdma_cap_ib_mad
+  ---
+Management Datagrams (MAD) are a required part of the InfiniBand
+specification and are supported on all InfiniBand devices.  A slightly
+extended version are also supported on OPA interfaces.
+
+  rdma_cap_ib_smi
+  ---
+Subnet Management Interface (SMI) will handle SMP packet from SM
+in an infiniband fabric.
+
+  rdma_cap_ib_cm
+  ---
+Communication Manager (CM) service, used to ease the process of connecting
+to a remote host.  The IB-CM can be used to connect to remote hosts using
+either InfiniBand or RoCE connections, iWARP has its own CM.
+
+  rdma_cap_iw_cm
+  ---
+iWARP Communication Manager (CM), Similar to the IB-CM, but only used on
+iWARP devices.
+
+  rdma_cap_ib_sa
+  ---
+Subnet Administration (SA) is the database built by SM in an
+infiniband fabric.
+
+  rdma_cap_ib_mcast
+  ---
+InfiniBand (and OPA) use a different multicast mechanism rather than
+traditional IP multicast found on Ethernet devices.  If this is true, then
+traditional IPv4/IPv6 multicast is handled by the IPoIB layer and direct
+multicast joins and leaves are handled per the InfiniBand specifications.
+
+  rdma_cap_read_multi_sge
+  ---
+Certain devices (iWARP in particular) have restrictions on the number of
+scatter gather elements that can be present in an RDMA READ work request,
+this is true if the device does not have that restriction.
+
+  rdma_cap_af_ib
+  ---
+Many code paths for traditional InfiniBand and RoCE links are the same,
+but need minor differences to accommodate the different addresses on the
+two types of connections.  This helper is true when the address of the
+specific connection is of the InfiniBand native variety.
+
+  rdma_cap_eth_ah
+  ---
+Queue Pair is InfiniBand transport, but uses Ethernet address instead
+of native InfiniBand address (aka, this is a RoCE QP, and that means
+ethertype 0x8915 + GRH for RoCEv1 and IP/UDP to well known UDP port for
+RoCEv2), this is true when the address family of the specific queue pair
+is of the Ethernet (RoCE) variety.
-- 
2.1.0

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC v2] Documentation/infiniband: Add docs for rdma-helpers

2015-05-18 Thread Michael Wang
Hi, Or

On 05/18/2015 11:47 AM, Or Gerlitz wrote:
[snip]
 Highlights:
   There could be many missing/mistakes/misunderstanding, please don't
   be hesitate to point out the issues, any suggestions to improve or
   complete the description are very welcomed ;-)
 
 Michael, none of what you wrote above belongs to the change-log which
 is going to stay for-ever in the upstream git repo. You should put it
 all belong the --- line after your S.O.B and add proper change log
 telling what this patch is about

Thanks for point out this for me :-)

I'll put the highlights and changelog under '---' in next version, is it
looks like this?


Subject: [PATCH RFC v3] Documentation/infiniband: Add docs for rdma-helpers

This is the following patch for:
  https://lkml.org/lkml/2015/5/5/417
which try to document the settled rdma_cap_XX().

Signed-off-by: Michael Wang yun.w...@profitbricks.com
---
Highlights:
  There could be many missing/mistakes/misunderstanding, please don't
  be hesitate to point out the issues, any suggestions to improve or
  complete the description are very welcomed ;-)

v2:
  * Merge the descriptions from Doug:
http://www.spinics.net/lists/linux-rdma/msg25172.html

v3:
  ...

 Documentation/infiniband/rdma_helpers.txt | 79 +++
 1 file changed, 79 insertions(+)
 create mode 100644 Documentation/infiniband/rdma_helpers.txt

diff --git a/Documentation/infiniband/rdma_helpers.txt 
b/Documentation/infiniband/rdma_helpers.txt
new file mode 100644
index 000..be9416d
--- /dev/null
+++ b/Documentation/infiniband/rdma_helpers.txt

Regards,
Michael Wang

 

 Signed-off-by: Michael Wang yun.w...@profitbricks.com
 ---
  Documentation/infiniband/rdma_helpers.txt | 79 
 +++
  1 file changed, 79 insertions(+)
  create mode 100644 Documentation/infiniband/rdma_helpers.txt

 diff --git a/Documentation/infiniband/rdma_helpers.txt 
 b/Documentation/infiniband/rdma_helpers.txt
 new file mode 100644
 index 000..be9416d
 --- /dev/null
 +++ b/Documentation/infiniband/rdma_helpers.txt
 @@ -0,0 +1,79 @@
 +RDMA HELPERS
 +
 +  The following helpers are used to check the specific capabilities of a
 +  particular port before utilizing those capabilities.
 +
 +rdma_cap_ib_mad- Infiniband Management Datagrams.
 +rdma_cap_ib_smi- Infiniband Subnet Management Interface.
 +rdma_cap_ib_cm - Infiniband Communication Manager.
 +rdma_cap_iw_cm - IWARP Communication Manager.
 +rdma_cap_ib_sa - Infiniband Subnet Administration.
 +rdma_cap_ib_mcast  - Infiniband Multicast Join/Leave Protocol.
 +rdma_cap_read_multi_sge- RDMA Read Work Request Support Multiple 
 SGE.
 +rdma_cap_af_ib - Native Infiniband Address.
 +rdma_cap_eth_ah- InfiniBand Transport With Ethernet Address.
 +
 +USAGE
 +
 +  if (rdma_cap_XX(device, i)) {
 +   /* The port i of device support XX */
 +   ...
 +  } else {
 +   /* The port i of device don't support XX */
 +   ...
 +  }
 +
 +  rdma_cap_ib_mad
 +  ---
 +Management Datagrams (MAD) are a required part of the InfiniBand
 +specification and are supported on all InfiniBand devices.  A slightly
 +extended version are also supported on OPA interfaces.
 +
 +  rdma_cap_ib_smi
 +  ---
 +Subnet Management Interface (SMI) will handle SMP packet from SM
 +in an infiniband fabric.
 +
 +  rdma_cap_ib_cm
 +  ---
 +Communication Manager (CM) service, used to ease the process of 
 connecting
 +to a remote host.  The IB-CM can be used to connect to remote hosts 
 using
 +either InfiniBand or RoCE connections, iWARP has its own CM.
 +
 +  rdma_cap_iw_cm
 +  ---
 +iWARP Communication Manager (CM), Similar to the IB-CM, but only used on
 +iWARP devices.
 +
 +  rdma_cap_ib_sa
 +  ---
 +Subnet Administration (SA) is the database built by SM in an
 +infiniband fabric.
 +
 +  rdma_cap_ib_mcast
 +  ---
 +InfiniBand (and OPA) use a different multicast mechanism rather than
 +traditional IP multicast found on Ethernet devices.  If this is true, 
 then
 +traditional IPv4/IPv6 multicast is handled by the IPoIB layer and direct
 +multicast joins and leaves are handled per the InfiniBand 
 specifications.
 +
 +  rdma_cap_read_multi_sge
 +  ---
 +Certain devices (iWARP in particular) have restrictions on the number of
 +scatter gather elements that can be present in an RDMA READ work 
 request,
 +this is true if the device does not have that restriction.
 +
 +  rdma_cap_af_ib
 +  ---
 +Many code paths for traditional InfiniBand and RoCE links are the same,
 +but need minor differences to accommodate the different addresses on the
 +two types of connections.  This helper is true when the address of the
 +specific connection is of the InfiniBand native

Re: [PATCH RFC] Documentation/infiniband: Add docs for rdma-helpers

2015-05-15 Thread Michael Wang


On 05/15/2015 04:40 PM, Doug Ledford wrote:
[snip]
> 
> The test itself doesn't mean that.  It means we need a RoCE address
> (it's true when transport is IB and link layer is Ethernet).  That we
> *use* it during connectionless communication because we have to generate
> our own address vector for the packet while during connected queue pair
> use the address vector is created by the card using the queue pair
> information is just the circumstance of its use.  And even though a
> disconnected queue pair isn't solidly connected to a remote endpoint, it
> is solidly bound to an adapter that requires either an IB or Ethernet
> address family.  Maybe this to resolve your issue with the wording:

Thanks for the explain :-) The term 'connectionless' still sounds a little
strange to me when it's just means no HW support on creating address vector,
but I can understand the concept.

> 
> This helper is true when the address family of this queue pair is of the
> Ethernet (RoCE) variety.

Sounds good, will be merged in next version :-)

Regards,
Michael Wang

> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] Documentation/infiniband: Add docs for rdma-helpers

2015-05-15 Thread Michael Wang


On 05/15/2015 04:27 PM, Doug Ledford wrote:
[snip]
>>
>> Me too used to think it's 'connection', while I found some docs explain
>> this as 'communication'... but anyway, 'connection' sounds
>> more close to what it did in kernel :-)
> 
> That's kind of what I thought.  Anyway, it's communication management
> (which to me is a gross abuse of the english language for which the IBTA
> should be appropriately chastised), but that doesn't mean that lower
> down in the more descriptive area of text that we can't call out that
> this is really for establishing a connection and that once your
> connection is established and you *truly* want to communicate, this does
> nothing.

I see :-) we can reserve the communication management as the definition of CM,
to obey the standard, meanwhile give some description related to connection
below in the long description.

> 
[snip]
>> Shall we put this long description into USAGE? Here maybe list
>> all the helpers to give some quick overview with a brief
>> description, what's your opinion?
> 
> Given how we have a more complete description of this below, it need not
> have such a lengthy description here.

Got it :-)

Regards,
Michael Wang

>>
>>>> +
>>>> +USAGE
>>>> +
>>>> +  if (rdma_cap_XX(device, i)) {
>>>> +  /* The port i of device support XX */
>>>> +  ...
>>>> +  } else {
>>>> +  /* The port i of device don't support XX */
>>>> +  ...
>>>> +  }
>>>> +
>>>> +  rdma_cap_ib_mad
>>>> +  ---
>>>> +Management Datagrams (MAD) is the prototype of management packet
>>>> +to be used by all the kinds of infiniband managers, use the helper
>>>> +to verify the port before utilize related features.
>>> Management Datagrams (MAD) are a required part of the InfiniBand
>>> specification and are supported on all InfiniBand devices.  A slightly
>>> extended version are also supported on OPA interfaces.
>>>
>>> I would drop all instances of "use the helper to verify..." as that's
>>> redundant.  This whole doc is about using the helpers to verify things.
>>
>> Agree, will be dropped in next version.
>>
>> And all the comments below make sense, will be merged ;-)
>>
>> Regards,
>> Michael Wang
>>
>>>
>>>> +
>>>> +  rdma_cap_ib_smi
>>>> +  ---
>>>> +Subnet Management Interface (SMI) will handle SMP packet from SM
>>>> +in an infiniband fabric, use the helper to verify the port before
>>>> +utilize related features.
>>>> +
>>>> +  rdma_cap_ib_cm
>>>> +  ---
>>>> +Communication Manager (CM) will handle the connections between
>>>^Connection Manager (CM) service, used to ease the process of
>>> connecting to a remote host.  The IB CM can be used to connect to remote
>>> hosts using either InfiniBand or RoCE connections.  iWARP has its own
>>> connection manager, see below.
>>>> +adaptors, currently there are two different implementation,
>>>> +IB or IWARP, use the helper to verify whether the port using
>>>> +IB-CM or not
>>>> +
>>>> +  rdma_cap_iw_cm
>>>> +  ---
>>>> +IWARP has it's own implemented CM which is different from infiniband,
>>>   iWARP connection manager.  Similar to the IB Connection Manager,
>>> but only used on iWARP devices.
>>>> +use the helper to check whether the port using IWARP-CM or not.
>>>> +
>>>> +  rdma_cap_ib_sa
>>>> +  ---
>>>> +Subnet Administration (SA) is the database built by SM in an
>>>> +infiniband fabric, use the helper to verify the port before
>>>> +utilize related features.
>>>> +
>>>> +  rdma_cap_ib_mcast
>>>> +  ---
>>>> +Multicast is the feature for one QP to send messages to multiple
>>>> +QP in an infiniband fabric, use the helper to verify the port before
>>>> +utilize related features.
>>>
>>> InfiniBand (and OPA) use a different multicast mechanism than
>>> traditional IP multicast found on Ethernet devices.  If this capability
>>> is true, then traditional IPv4/IPv6 multicast is handled by the IPoIB
>>> layer and direct multicast joins and leaves are handled per the
>>> InfiniBand specifications.
>>>
>>>> +
>&g

Re: [PATCH RFC] Documentation/infiniband: Add docs for rdma-helpers

2015-05-15 Thread Michael Wang


On 05/13/2015 06:42 PM, Hefty, Sean wrote:
>>> +  rdma_cap_ib_cm
>>> +  ---
>>> +Communication Manager (CM) will handle the connections between
>>^Connection Manager (CM) service, used to ease the process of
> 
> In IB terms, this is communication manager.  It also handles transport level 
> address resolution for UD QPs.

I could find both 'connection' and 'communication' in different docs,
while 'connection' is more related to verbs, 'communication' is more
close to specification.

IMHO 'connection' make more sense, after all, all the transport between
adaptors could named as communication, while connection management is exactly
what CM did in kernel.

Doug, what's your opinion?

> 
>>> +  rdma_cap_eth_ah
>>> +  ---
>>> +Infiniband address handler format is special in ethernet fabric,
>> use
>>> +the helper to verify whether the port is using ethernet format or
>> not.
>>
>> This helper is true when the address of the specific connection is of
>> the Ethernet (RoCE) variety.
> 
> This is used for connectionless communication.

Could you please give more details on this?

Regards,
Michael Wang

> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] Documentation/infiniband: Add docs for rdma-helpers

2015-05-15 Thread Michael Wang


On 05/13/2015 05:11 PM, Doug Ledford wrote:
[snip]
>> +
>> +  For core layer, below helpers are used to check if a paticular capability
>> +  is supported by the port.
> 
> The following helpers are used to check the specific capabilities of a
> particular port before utilizing those capabilities.

Will be in next version :-)

> 
>> +
>> +rdma_cap_ib_mad - Infiniband Management Datagrams.
>> +rdma_cap_ib_smi - Infiniband Subnet Management Interface.
>> +rdma_cap_ib_cm  - Infiniband Communication Manager.
> InfiniBand Connection Management

Me too used to think it's 'connection', while I found some docs explain
this as 'communication'... but anyway, 'connection' sounds
more close to what it did in kernel :-)

>> +rdma_cap_iw_cm  - IWARP Communication Manager.
> iWARP Connection Management
>> +rdma_cap_ib_sa  - Infiniband Subnet Administration.
>> +rdma_cap_ib_mcast   - Infiniband Multicast.
> InfiniBand Multicast join/leave protocol
>> +rdma_cap_read_multi_sge - RDMA Read Multiple Scatter-Gather Entries.
> RDMA Read verb supports more than 1 sge in the work request

Will be in next version :-)

>> +rdma_cap_af_ib  - Native Infiniband Address.
>> +rdma_cap_eth_ah - Ethernet Address Handler.
> Queue Pair is InfiniBand transport, but uses Ethernet address instead of
> native InfiniBand address (aka, this is a RoCE QP, and that means
> ethertype 0x8915 + GRH for RoCEv1 and IP/UDP to well known UDP port for
> RoCEv2)

Shall we put this long description into USAGE? Here maybe list
all the helpers to give some quick overview with a brief
description, what's your opinion?

>> +
>> +USAGE
>> +
>> +  if (rdma_cap_XX(device, i)) {
>> +/* The port i of device support XX */
>> +...
>> +  } else {
>> +/* The port i of device don't support XX */
>> +...
>> +  }
>> +
>> +  rdma_cap_ib_mad
>> +  ---
>> +Management Datagrams (MAD) is the prototype of management packet
>> +to be used by all the kinds of infiniband managers, use the helper
>> +to verify the port before utilize related features.
> Management Datagrams (MAD) are a required part of the InfiniBand
> specification and are supported on all InfiniBand devices.  A slightly
> extended version are also supported on OPA interfaces.
> 
> I would drop all instances of "use the helper to verify..." as that's
> redundant.  This whole doc is about using the helpers to verify things.

Agree, will be dropped in next version.

And all the comments below make sense, will be merged ;-)

Regards,
Michael Wang

> 
>> +
>> +  rdma_cap_ib_smi
>> +  ---
>> +Subnet Management Interface (SMI) will handle SMP packet from SM
>> +in an infiniband fabric, use the helper to verify the port before
>> +utilize related features.
>> +
>> +  rdma_cap_ib_cm
>> +  ---
>> +Communication Manager (CM) will handle the connections between
>^Connection Manager (CM) service, used to ease the process of
> connecting to a remote host.  The IB CM can be used to connect to remote
> hosts using either InfiniBand or RoCE connections.  iWARP has its own
> connection manager, see below.
>> +adaptors, currently there are two different implementation,
>> +IB or IWARP, use the helper to verify whether the port using
>> +IB-CM or not
>> +
>> +  rdma_cap_iw_cm
>> +  ---
>> +IWARP has it's own implemented CM which is different from infiniband,
>   iWARP connection manager.  Similar to the IB Connection Manager,
> but only used on iWARP devices.
>> +use the helper to check whether the port using IWARP-CM or not.
>> +
>> +  rdma_cap_ib_sa
>> +  ---
>> +Subnet Administration (SA) is the database built by SM in an
>> +infiniband fabric, use the helper to verify the port before
>> +utilize related features.
>> +
>> +  rdma_cap_ib_mcast
>> +  ---
>> +Multicast is the feature for one QP to send messages to multiple
>> +QP in an infiniband fabric, use the helper to verify the port before
>> +utilize related features.
> 
> InfiniBand (and OPA) use a different multicast mechanism than
> traditional IP multicast found on Ethernet devices.  If this capability
> is true, then traditional IPv4/IPv6 multicast is handled by the IPoIB
> layer and direct multicast joins and leaves are handled per the
> InfiniBand specifications.
> 
>> +
>> +  rdma_cap_read_multi_sge
>> +  ---
>&g

Re: [PATCH RFC] Documentation/infiniband: Add docs for rdma-helpers

2015-05-15 Thread Michael Wang


On 05/13/2015 05:59 PM, Jason Gunthorpe wrote:
> On Wed, May 13, 2015 at 03:24:32PM +0200, Michael Wang wrote:
>> This is the following patch for:
>>   https://lkml.org/lkml/2015/5/5/417
>> which try to document the settled rdma_cap_XX().
>>
>> Highlights:
>>   There could be many missing/mistakes/misunderstanding, please don't
>>   be hesitate to point out the issues, any suggestions to improve or
>>   complete the description are very welcomed ;-)
> 
> I'd rather see this in the kdoc for each function.

I used to thought you mean the kernel documentation like
this... my misunderstanding but this is the usual way to
document kernel stuff, isn't it?

BTW, could you give more details on the kdoc?

Regards,
Michael Wang

> 
> Thanks,
> Jason
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] Documentation/infiniband: Add docs for rdma-helpers

2015-05-15 Thread Michael Wang


On 05/13/2015 06:42 PM, Hefty, Sean wrote:
 +  rdma_cap_ib_cm
 +  ---
 +Communication Manager (CM) will handle the connections between
^Connection Manager (CM) service, used to ease the process of
 
 In IB terms, this is communication manager.  It also handles transport level 
 address resolution for UD QPs.

I could find both 'connection' and 'communication' in different docs,
while 'connection' is more related to verbs, 'communication' is more
close to specification.

IMHO 'connection' make more sense, after all, all the transport between
adaptors could named as communication, while connection management is exactly
what CM did in kernel.

Doug, what's your opinion?

 
 +  rdma_cap_eth_ah
 +  ---
 +Infiniband address handler format is special in ethernet fabric,
 use
 +the helper to verify whether the port is using ethernet format or
 not.

 This helper is true when the address of the specific connection is of
 the Ethernet (RoCE) variety.
 
 This is used for connectionless communication.

Could you please give more details on this?

Regards,
Michael Wang

 
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] Documentation/infiniband: Add docs for rdma-helpers

2015-05-15 Thread Michael Wang


On 05/13/2015 05:59 PM, Jason Gunthorpe wrote:
 On Wed, May 13, 2015 at 03:24:32PM +0200, Michael Wang wrote:
 This is the following patch for:
   https://lkml.org/lkml/2015/5/5/417
 which try to document the settled rdma_cap_XX().

 Highlights:
   There could be many missing/mistakes/misunderstanding, please don't
   be hesitate to point out the issues, any suggestions to improve or
   complete the description are very welcomed ;-)
 
 I'd rather see this in the kdoc for each function.

I used to thought you mean the kernel documentation like
this... my misunderstanding but this is the usual way to
document kernel stuff, isn't it?

BTW, could you give more details on the kdoc?

Regards,
Michael Wang

 
 Thanks,
 Jason
 
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] Documentation/infiniband: Add docs for rdma-helpers

2015-05-15 Thread Michael Wang


On 05/13/2015 05:11 PM, Doug Ledford wrote:
[snip]
 +
 +  For core layer, below helpers are used to check if a paticular capability
 +  is supported by the port.
 
 The following helpers are used to check the specific capabilities of a
 particular port before utilizing those capabilities.

Will be in next version :-)

 
 +
 +rdma_cap_ib_mad - Infiniband Management Datagrams.
 +rdma_cap_ib_smi - Infiniband Subnet Management Interface.
 +rdma_cap_ib_cm  - Infiniband Communication Manager.
 InfiniBand Connection Management

Me too used to think it's 'connection', while I found some docs explain
this as 'communication'... but anyway, 'connection' sounds
more close to what it did in kernel :-)

 +rdma_cap_iw_cm  - IWARP Communication Manager.
 iWARP Connection Management
 +rdma_cap_ib_sa  - Infiniband Subnet Administration.
 +rdma_cap_ib_mcast   - Infiniband Multicast.
 InfiniBand Multicast join/leave protocol
 +rdma_cap_read_multi_sge - RDMA Read Multiple Scatter-Gather Entries.
 RDMA Read verb supports more than 1 sge in the work request

Will be in next version :-)

 +rdma_cap_af_ib  - Native Infiniband Address.
 +rdma_cap_eth_ah - Ethernet Address Handler.
 Queue Pair is InfiniBand transport, but uses Ethernet address instead of
 native InfiniBand address (aka, this is a RoCE QP, and that means
 ethertype 0x8915 + GRH for RoCEv1 and IP/UDP to well known UDP port for
 RoCEv2)

Shall we put this long description into USAGE? Here maybe list
all the helpers to give some quick overview with a brief
description, what's your opinion?

 +
 +USAGE
 +
 +  if (rdma_cap_XX(device, i)) {
 +/* The port i of device support XX */
 +...
 +  } else {
 +/* The port i of device don't support XX */
 +...
 +  }
 +
 +  rdma_cap_ib_mad
 +  ---
 +Management Datagrams (MAD) is the prototype of management packet
 +to be used by all the kinds of infiniband managers, use the helper
 +to verify the port before utilize related features.
 Management Datagrams (MAD) are a required part of the InfiniBand
 specification and are supported on all InfiniBand devices.  A slightly
 extended version are also supported on OPA interfaces.
 
 I would drop all instances of use the helper to verify... as that's
 redundant.  This whole doc is about using the helpers to verify things.

Agree, will be dropped in next version.

And all the comments below make sense, will be merged ;-)

Regards,
Michael Wang

 
 +
 +  rdma_cap_ib_smi
 +  ---
 +Subnet Management Interface (SMI) will handle SMP packet from SM
 +in an infiniband fabric, use the helper to verify the port before
 +utilize related features.
 +
 +  rdma_cap_ib_cm
 +  ---
 +Communication Manager (CM) will handle the connections between
^Connection Manager (CM) service, used to ease the process of
 connecting to a remote host.  The IB CM can be used to connect to remote
 hosts using either InfiniBand or RoCE connections.  iWARP has its own
 connection manager, see below.
 +adaptors, currently there are two different implementation,
 +IB or IWARP, use the helper to verify whether the port using
 +IB-CM or not
 +
 +  rdma_cap_iw_cm
 +  ---
 +IWARP has it's own implemented CM which is different from infiniband,
   iWARP connection manager.  Similar to the IB Connection Manager,
 but only used on iWARP devices.
 +use the helper to check whether the port using IWARP-CM or not.
 +
 +  rdma_cap_ib_sa
 +  ---
 +Subnet Administration (SA) is the database built by SM in an
 +infiniband fabric, use the helper to verify the port before
 +utilize related features.
 +
 +  rdma_cap_ib_mcast
 +  ---
 +Multicast is the feature for one QP to send messages to multiple
 +QP in an infiniband fabric, use the helper to verify the port before
 +utilize related features.
 
 InfiniBand (and OPA) use a different multicast mechanism than
 traditional IP multicast found on Ethernet devices.  If this capability
 is true, then traditional IPv4/IPv6 multicast is handled by the IPoIB
 layer and direct multicast joins and leaves are handled per the
 InfiniBand specifications.
 
 +
 +  rdma_cap_read_multi_sge
 +  ---
 +RDMA read operation could support multiple scatter-gather entries,
 +use the helper to verify wthether the port support this feature
 +or not.
 
 Certain devices (iWARP in particular) have restrictions on the number of
 scatter gather elements that can be present in an RDMA READ work
 request.  This is true if the device does not have that restriction.
 
 +  rdma_cap_af_ib
 +  ---
 +RDMA address format could be ethernet or infiniband, use the helper
 +to verify whether the port support infiniband format or not.
 
 Many code paths for traditional InfiniBand and RoCE links are the same,
 but need minor

Re: [PATCH RFC] Documentation/infiniband: Add docs for rdma-helpers

2015-05-15 Thread Michael Wang


On 05/15/2015 04:27 PM, Doug Ledford wrote:
[snip]

 Me too used to think it's 'connection', while I found some docs explain
 this as 'communication'... but anyway, 'connection' sounds
 more close to what it did in kernel :-)
 
 That's kind of what I thought.  Anyway, it's communication management
 (which to me is a gross abuse of the english language for which the IBTA
 should be appropriately chastised), but that doesn't mean that lower
 down in the more descriptive area of text that we can't call out that
 this is really for establishing a connection and that once your
 connection is established and you *truly* want to communicate, this does
 nothing.

I see :-) we can reserve the communication management as the definition of CM,
to obey the standard, meanwhile give some description related to connection
below in the long description.

 
[snip]
 Shall we put this long description into USAGE? Here maybe list
 all the helpers to give some quick overview with a brief
 description, what's your opinion?
 
 Given how we have a more complete description of this below, it need not
 have such a lengthy description here.

Got it :-)

Regards,
Michael Wang


 +
 +USAGE
 +
 +  if (rdma_cap_XX(device, i)) {
 +  /* The port i of device support XX */
 +  ...
 +  } else {
 +  /* The port i of device don't support XX */
 +  ...
 +  }
 +
 +  rdma_cap_ib_mad
 +  ---
 +Management Datagrams (MAD) is the prototype of management packet
 +to be used by all the kinds of infiniband managers, use the helper
 +to verify the port before utilize related features.
 Management Datagrams (MAD) are a required part of the InfiniBand
 specification and are supported on all InfiniBand devices.  A slightly
 extended version are also supported on OPA interfaces.

 I would drop all instances of use the helper to verify... as that's
 redundant.  This whole doc is about using the helpers to verify things.

 Agree, will be dropped in next version.

 And all the comments below make sense, will be merged ;-)

 Regards,
 Michael Wang


 +
 +  rdma_cap_ib_smi
 +  ---
 +Subnet Management Interface (SMI) will handle SMP packet from SM
 +in an infiniband fabric, use the helper to verify the port before
 +utilize related features.
 +
 +  rdma_cap_ib_cm
 +  ---
 +Communication Manager (CM) will handle the connections between
^Connection Manager (CM) service, used to ease the process of
 connecting to a remote host.  The IB CM can be used to connect to remote
 hosts using either InfiniBand or RoCE connections.  iWARP has its own
 connection manager, see below.
 +adaptors, currently there are two different implementation,
 +IB or IWARP, use the helper to verify whether the port using
 +IB-CM or not
 +
 +  rdma_cap_iw_cm
 +  ---
 +IWARP has it's own implemented CM which is different from infiniband,
   iWARP connection manager.  Similar to the IB Connection Manager,
 but only used on iWARP devices.
 +use the helper to check whether the port using IWARP-CM or not.
 +
 +  rdma_cap_ib_sa
 +  ---
 +Subnet Administration (SA) is the database built by SM in an
 +infiniband fabric, use the helper to verify the port before
 +utilize related features.
 +
 +  rdma_cap_ib_mcast
 +  ---
 +Multicast is the feature for one QP to send messages to multiple
 +QP in an infiniband fabric, use the helper to verify the port before
 +utilize related features.

 InfiniBand (and OPA) use a different multicast mechanism than
 traditional IP multicast found on Ethernet devices.  If this capability
 is true, then traditional IPv4/IPv6 multicast is handled by the IPoIB
 layer and direct multicast joins and leaves are handled per the
 InfiniBand specifications.

 +
 +  rdma_cap_read_multi_sge
 +  ---
 +RDMA read operation could support multiple scatter-gather entries,
 +use the helper to verify wthether the port support this feature
 +or not.

 Certain devices (iWARP in particular) have restrictions on the number of
 scatter gather elements that can be present in an RDMA READ work
 request.  This is true if the device does not have that restriction.

 +  rdma_cap_af_ib
 +  ---
 +RDMA address format could be ethernet or infiniband, use the helper
 +to verify whether the port support infiniband format or not.

 Many code paths for traditional InfiniBand and RoCE links are the same,
 but need minor differences to accommodate the different addresses on the
 two types of connections.  This helper is true when the address of the
 specific connection is of the InfiniBand native variety.

 +
 +  rdma_cap_eth_ah
 +  ---
 +Infiniband address handler format is special in ethernet fabric, use
 +the helper to verify whether the port is using ethernet format or not.

 This helper is true when the address of the specific connection is of
 the Ethernet (RoCE) variety

Re: [PATCH RFC] Documentation/infiniband: Add docs for rdma-helpers

2015-05-15 Thread Michael Wang


On 05/15/2015 04:40 PM, Doug Ledford wrote:
[snip]
 
 The test itself doesn't mean that.  It means we need a RoCE address
 (it's true when transport is IB and link layer is Ethernet).  That we
 *use* it during connectionless communication because we have to generate
 our own address vector for the packet while during connected queue pair
 use the address vector is created by the card using the queue pair
 information is just the circumstance of its use.  And even though a
 disconnected queue pair isn't solidly connected to a remote endpoint, it
 is solidly bound to an adapter that requires either an IB or Ethernet
 address family.  Maybe this to resolve your issue with the wording:

Thanks for the explain :-) The term 'connectionless' still sounds a little
strange to me when it's just means no HW support on creating address vector,
but I can understand the concept.

 
 This helper is true when the address family of this queue pair is of the
 Ethernet (RoCE) variety.

Sounds good, will be merged in next version :-)

Regards,
Michael Wang

 
 
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH RFC] Documentation/infiniband: Add docs for rdma-helpers

2015-05-13 Thread Michael Wang
This is the following patch for:
  https://lkml.org/lkml/2015/5/5/417
which try to document the settled rdma_cap_XX().

Highlights:
  There could be many missing/mistakes/misunderstanding, please don't
  be hesitate to point out the issues, any suggestions to improve or
  complete the description are very welcomed ;-)

Signed-off-by: Michael Wang 
---
 Documentation/infiniband/rdma_helpers.txt | 76 +++
 1 file changed, 76 insertions(+)
 create mode 100644 Documentation/infiniband/rdma_helpers.txt

diff --git a/Documentation/infiniband/rdma_helpers.txt 
b/Documentation/infiniband/rdma_helpers.txt
new file mode 100644
index 000..abc75ec
--- /dev/null
+++ b/Documentation/infiniband/rdma_helpers.txt
@@ -0,0 +1,76 @@
+RDMA HELPERS
+
+  For core layer, below helpers are used to check if a paticular capability
+  is supported by the port.
+
+rdma_cap_ib_mad- Infiniband Management Datagrams.
+rdma_cap_ib_smi- Infiniband Subnet Management Interface.
+rdma_cap_ib_cm - Infiniband Communication Manager.
+rdma_cap_iw_cm - IWARP Communication Manager.
+rdma_cap_ib_sa - Infiniband Subnet Administration.
+rdma_cap_ib_mcast  - Infiniband Multicast.
+rdma_cap_read_multi_sge- RDMA Read Multiple Scatter-Gather Entries.
+rdma_cap_af_ib - Native Infiniband Address.
+rdma_cap_eth_ah- Ethernet Address Handler.
+
+USAGE
+
+  if (rdma_cap_XX(device, i)) {
+   /* The port i of device support XX */
+   ...
+  } else {
+   /* The port i of device don't support XX */
+   ...
+  }
+
+  rdma_cap_ib_mad
+  ---
+Management Datagrams (MAD) is the prototype of management packet
+to be used by all the kinds of infiniband managers, use the helper
+to verify the port before utilize related features.
+
+  rdma_cap_ib_smi
+  ---
+Subnet Management Interface (SMI) will handle SMP packet from SM
+in an infiniband fabric, use the helper to verify the port before
+utilize related features.
+
+  rdma_cap_ib_cm
+  ---
+Communication Manager (CM) will handle the connections between
+adaptors, currently there are two different implementation,
+IB or IWARP, use the helper to verify whether the port using
+IB-CM or not
+
+  rdma_cap_iw_cm
+  ---
+IWARP has it's own implemented CM which is different from infiniband,
+use the helper to check whether the port using IWARP-CM or not.
+
+  rdma_cap_ib_sa
+  ---
+Subnet Administration (SA) is the database built by SM in an
+infiniband fabric, use the helper to verify the port before
+utilize related features.
+
+  rdma_cap_ib_mcast
+  ---
+Multicast is the feature for one QP to send messages to multiple
+QP in an infiniband fabric, use the helper to verify the port before
+utilize related features.
+
+  rdma_cap_read_multi_sge
+  ---
+RDMA read operation could support multiple scatter-gather entries,
+use the helper to verify wthether the port support this feature
+or not.
+
+  rdma_cap_af_ib
+  ---
+RDMA address format could be ethernet or infiniband, use the helper
+to verify whether the port support infiniband format or not.
+
+  rdma_cap_eth_ah
+  ---
+Infiniband address handler format is special in ethernet fabric, use
+the helper to verify whether the port is using ethernet format or not.
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v8 00/23] IB/Verbs: IB Management Helpers

2015-05-13 Thread Michael Wang


On 05/12/2015 10:09 PM, Doug Ledford wrote:
[snip]
>>
>> I had asked for better kdocs for the new helpers so new people can
>> understand when and where to use them.
>>
>> I've not looked at the series at all for the past few postings.
> 
> Michael, please work up an incremental patch to address the kdocs issue.
> I've picked up the v8 patchset, and there is no need to respin it, but I
> would like to have that kdoc patch before the 4.2 merge window opens.

Sure, now these helpers are settled down, it's time for document,
I'll send out the RFC ASAP :-)

Regards,
Michael Wang

> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH RFC] Documentation/infiniband: Add docs for rdma-helpers

2015-05-13 Thread Michael Wang
This is the following patch for:
  https://lkml.org/lkml/2015/5/5/417
which try to document the settled rdma_cap_XX().

Highlights:
  There could be many missing/mistakes/misunderstanding, please don't
  be hesitate to point out the issues, any suggestions to improve or
  complete the description are very welcomed ;-)

Signed-off-by: Michael Wang yun.w...@profitbricks.com
---
 Documentation/infiniband/rdma_helpers.txt | 76 +++
 1 file changed, 76 insertions(+)
 create mode 100644 Documentation/infiniband/rdma_helpers.txt

diff --git a/Documentation/infiniband/rdma_helpers.txt 
b/Documentation/infiniband/rdma_helpers.txt
new file mode 100644
index 000..abc75ec
--- /dev/null
+++ b/Documentation/infiniband/rdma_helpers.txt
@@ -0,0 +1,76 @@
+RDMA HELPERS
+
+  For core layer, below helpers are used to check if a paticular capability
+  is supported by the port.
+
+rdma_cap_ib_mad- Infiniband Management Datagrams.
+rdma_cap_ib_smi- Infiniband Subnet Management Interface.
+rdma_cap_ib_cm - Infiniband Communication Manager.
+rdma_cap_iw_cm - IWARP Communication Manager.
+rdma_cap_ib_sa - Infiniband Subnet Administration.
+rdma_cap_ib_mcast  - Infiniband Multicast.
+rdma_cap_read_multi_sge- RDMA Read Multiple Scatter-Gather Entries.
+rdma_cap_af_ib - Native Infiniband Address.
+rdma_cap_eth_ah- Ethernet Address Handler.
+
+USAGE
+
+  if (rdma_cap_XX(device, i)) {
+   /* The port i of device support XX */
+   ...
+  } else {
+   /* The port i of device don't support XX */
+   ...
+  }
+
+  rdma_cap_ib_mad
+  ---
+Management Datagrams (MAD) is the prototype of management packet
+to be used by all the kinds of infiniband managers, use the helper
+to verify the port before utilize related features.
+
+  rdma_cap_ib_smi
+  ---
+Subnet Management Interface (SMI) will handle SMP packet from SM
+in an infiniband fabric, use the helper to verify the port before
+utilize related features.
+
+  rdma_cap_ib_cm
+  ---
+Communication Manager (CM) will handle the connections between
+adaptors, currently there are two different implementation,
+IB or IWARP, use the helper to verify whether the port using
+IB-CM or not
+
+  rdma_cap_iw_cm
+  ---
+IWARP has it's own implemented CM which is different from infiniband,
+use the helper to check whether the port using IWARP-CM or not.
+
+  rdma_cap_ib_sa
+  ---
+Subnet Administration (SA) is the database built by SM in an
+infiniband fabric, use the helper to verify the port before
+utilize related features.
+
+  rdma_cap_ib_mcast
+  ---
+Multicast is the feature for one QP to send messages to multiple
+QP in an infiniband fabric, use the helper to verify the port before
+utilize related features.
+
+  rdma_cap_read_multi_sge
+  ---
+RDMA read operation could support multiple scatter-gather entries,
+use the helper to verify wthether the port support this feature
+or not.
+
+  rdma_cap_af_ib
+  ---
+RDMA address format could be ethernet or infiniband, use the helper
+to verify whether the port support infiniband format or not.
+
+  rdma_cap_eth_ah
+  ---
+Infiniband address handler format is special in ethernet fabric, use
+the helper to verify whether the port is using ethernet format or not.
-- 
2.1.0

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v8 00/23] IB/Verbs: IB Management Helpers

2015-05-13 Thread Michael Wang


On 05/12/2015 10:09 PM, Doug Ledford wrote:
[snip]

 I had asked for better kdocs for the new helpers so new people can
 understand when and where to use them.

 I've not looked at the series at all for the past few postings.
 
 Michael, please work up an incremental patch to address the kdocs issue.
 I've picked up the v8 patchset, and there is no need to respin it, but I
 would like to have that kdoc patch before the 4.2 merge window opens.

Sure, now these helpers are settled down, it's time for document,
I'll send out the RFC ASAP :-)

Regards,
Michael Wang

 
 
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v8 00/23] IB/Verbs: IB Management Helpers

2015-05-12 Thread Michael Wang


On 05/12/2015 04:24 PM, Doug Ledford wrote:
[snip]
>>
>> AFAIK Or was asking to merge the #15~23, and want to reserve the changelog
>> meanwhile reply the cover of prev version (I'm still confused on that...),
>> I've replied but get no respond yet.
>>
>> I can make a v9 to merge the #15~#23 if that could benefit the 
>> maintainability,
>> please let me know your opinion :-)
> 
> I don't think it would make a significant difference.  I've pulled the
> v8 patchset out of patchworks and I'll throw a new branch with it
> included up to my github repo sometime today.

Got it :-)

Regards,
Michael Wang

> 
>> About the Bug, if it was not introduced in this series, maybe including the
>> fix in next series would be better?
>>
>> Regards,
>> Michael Wang
>>
>>>
>>>>   Frankly
>>>> I vote for the former because as it stands this series does not break 
>>>> directly.
>>>> It was only after I changed the implementation of rdma_cap_ib_mad that it
>>>> broke.
>>>>
>>>>
>>>> For the rest of the series.
>>>>
>>>> Reviewed-by: Ira Weiny 
>>>> Tested-by: Ira Weiny 
>>>>-- Limited to mlx4, qib, and OPA (with additional patches.)
>>>>
>>>>
>>>> On Tue, May 05, 2015 at 02:50:17PM +0200, Michael Wang wrote:
>>>>> Since v7:
>>>>>   * Thanks to Doug, Ira, Devesh for the testing :-)
>>>>>   * Thanks for the comments from or, Doug, Ira, Jason :-)
>>>>> Please remind me if anything missed :-P
>>>>>   * Use rdma_cap_XX() instead of cap_XX() for readability
>>>>>   * Remove CC list in git log for maintainability
>>>>>   * Use bool as return value
>>>>>   * Updated github repository to v8
>>>>>
>>>>> There are plenty of lengthy code to check the transport type of IB device,
>>>>> or the link layer type of it's port, but actually we are just speculating
>>>>> whether a particular management/feature is supported by the device/port.
>>>>>
>>>>> Thus instead of inferring, we should have our own mechanism for IB 
>>>>> management
>>>>> capability/protocol/feature checking, several proposals below.
>>>>>
>>>>> This patch set will introduce query_protocol() to check management 
>>>>> requirement
>>>>> instead of inferring from transport and link layer respectively, along 
>>>>> with
>>>>> the new enum on protocol type.
>>>>>
>>>>> Mapping List:
>>>>>   node-type   link-layer  transport   protocol
>>>>> nes   RNICETH IWARP   IWARP
>>>>> amso1100  RNICETH IWARP   IWARP
>>>>> cxgb3 RNICETH IWARP   IWARP
>>>>> cxgb4 RNICETH IWARP   IWARP
>>>>> usnic USNIC_UDP   ETH USNIC_UDP   USNIC_UDP
>>>>> ocrdmaIB_CA   ETH IB  IBOE
>>>>> mlx4  IB_CA   IB/ETH  IB  IB/IBOE
>>>>> mlx5  IB_CA   IB  IB  IB
>>>>> ehca  IB_CA   IB  IB  IB
>>>>> ipath IB_CA   IB  IB  IB
>>>>> mthca IB_CA   IB  IB  IB
>>>>> qib   IB_CA   IB  IB  IB
>>>>>
>>>>> For example:
>>>>>   if (transport == IB) && (link-layer == ETH)
>>>>> will now become:
>>>>>   if (query_protocol() == IBOE)
>>>>>
>>>>> Thus we will be able to get rid of the respective transport and link-layer
>>>>> checking, and it will help us to add new protocol/Technology (like OPA) 
>>>>> more
>>>>> easier, also with the introduced management helpers, IB management logical
>>>>> will be more clear and easier for extending.
>>>>>
>>>>> Highlights:
>>>>> The long CC list in each patches was complained consider about the
>>>>> maintainability, it was suggested folks to provide their reviewed-by 
>>>>> or
>>>>> Acked-by instead, so for those who used to be on the CC list, please

Re: [PATCH v8 00/23] IB/Verbs: IB Management Helpers

2015-05-12 Thread Michael Wang
On 05/12/2015 01:49 AM, ira.weiny wrote:
> I have run with this series and the only issue I have found is not with this
> patch set directly.
> 
> This patch:
> 
>>   IB/Verbs: Use management helper rdma_cap_ib_mad()
> 
> causes an error when you actually use the port passed from the ib_umad module.
> I have a patch to fix that which I found while trying to build on this series
> for the use of a bit mask.
> 
> Doug, I don't know what you would like to do for this fix.  I am submitting it
> shortly with a new version of the core capability bit patches.  If you want to
> just add it after this series or force Michael to respin with the fix?  
> Frankly
> I vote for the former because as it stands this series does not break 
> directly.
> It was only after I changed the implementation of rdma_cap_ib_mad that it
> broke.

Agree, it sounds more reasonable to include the fix in the series introduced
it :-P

> 
> 
> For the rest of the series.
> 
> Reviewed-by: Ira Weiny 
> Tested-by: Ira Weiny 
>   -- Limited to mlx4, qib, and OPA (with additional patches.)

Thanks for the review and testing :-)

Regards,
Michael Wang

> 
> 
> On Tue, May 05, 2015 at 02:50:17PM +0200, Michael Wang wrote:
>> Since v7:
>>   * Thanks to Doug, Ira, Devesh for the testing :-)
>>   * Thanks for the comments from or, Doug, Ira, Jason :-)
>> Please remind me if anything missed :-P
>>   * Use rdma_cap_XX() instead of cap_XX() for readability
>>   * Remove CC list in git log for maintainability
>>   * Use bool as return value
>>   * Updated github repository to v8
>>
>> There are plenty of lengthy code to check the transport type of IB device,
>> or the link layer type of it's port, but actually we are just speculating
>> whether a particular management/feature is supported by the device/port.
>>
>> Thus instead of inferring, we should have our own mechanism for IB management
>> capability/protocol/feature checking, several proposals below.
>>
>> This patch set will introduce query_protocol() to check management 
>> requirement
>> instead of inferring from transport and link layer respectively, along with
>> the new enum on protocol type.
>>
>> Mapping List:
>>  node-type   link-layer  transport   protocol
>> nes  RNICETH IWARP   IWARP
>> amso1100 RNICETH IWARP   IWARP
>> cxgb3RNICETH IWARP   IWARP
>> cxgb4RNICETH IWARP   IWARP
>> usnicUSNIC_UDP   ETH USNIC_UDP   USNIC_UDP
>> ocrdma   IB_CA   ETH IB  IBOE
>> mlx4 IB_CA   IB/ETH  IB  IB/IBOE
>> mlx5 IB_CA   IB  IB  IB
>> ehca IB_CA   IB  IB  IB
>> ipathIB_CA   IB  IB  IB
>> mthcaIB_CA   IB  IB  IB
>> qib  IB_CA   IB  IB  IB
>>
>> For example:
>>  if (transport == IB) && (link-layer == ETH)
>> will now become:
>>  if (query_protocol() == IBOE)
>>
>> Thus we will be able to get rid of the respective transport and link-layer
>> checking, and it will help us to add new protocol/Technology (like OPA) more
>> easier, also with the introduced management helpers, IB management logical
>> will be more clear and easier for extending.
>>
>> Highlights:
>> The long CC list in each patches was complained consider about the
>> maintainability, it was suggested folks to provide their reviewed-by or
>> Acked-by instead, so for those who used to be on the CC list, please
>> provide your signature voluntarily :-)
>>
>> The 'mgmt-helpers' branch of 'g...@github.com:ywang-pb/infiniband-wy.git'
>> contain this series based on the latest 'infiniband/for-next'
>>
>> Patch 1#~14# included all the logical reform, 15#~23# introduced the
>> management helpers.
>>
>> Doug suggested the bitmask mechanism:
>>  https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23765.html
>> which could be the plan for future reforming, we prefer that to be 
>> another
>> series which focus on semantic and performance.
>>
>> This patch-set is somewhat 'bloated' now and it may be a good timing for
>> staging, I'd like to suggest we focus on improving existed helpers and 
>> push
>> a

Re: [PATCH v8 00/23] IB/Verbs: IB Management Helpers

2015-05-12 Thread Michael Wang


On 05/12/2015 02:27 AM, Doug Ledford wrote:
> On Mon, 2015-05-11 at 19:49 -0400, ira.weiny wrote:
>> I have run with this series and the only issue I have found is not with this
>> patch set directly.
>>
>> This patch:
>>
>>>   IB/Verbs: Use management helper rdma_cap_ib_mad()
>>
>> causes an error when you actually use the port passed from the ib_umad 
>> module.
>> I have a patch to fix that which I found while trying to build on this series
>> for the use of a bit mask.
>>
>> Doug, I don't know what you would like to do for this fix.  I am submitting 
>> it
>> shortly with a new version of the core capability bit patches.  If you want 
>> to
>> just add it after this series or force Michael to respin with the fix?
> 
> As I recall, there was a comment from Or requesting to squash some of
> the individual patches down, but I no longer have that email in my Inbox
> to double check.  And it seemed like there was one other review comment
> not yet addressed.  Do I have that right Michael?  And if so, are you
> working on a v9?

AFAIK Or was asking to merge the #15~23, and want to reserve the changelog
meanwhile reply the cover of prev version (I'm still confused on that...),
I've replied but get no respond yet.

I can make a v9 to merge the #15~#23 if that could benefit the maintainability,
please let me know your opinion :-)

About the Bug, if it was not introduced in this series, maybe including the
fix in next series would be better?

Regards,
Michael Wang

> 
>>   Frankly
>> I vote for the former because as it stands this series does not break 
>> directly.
>> It was only after I changed the implementation of rdma_cap_ib_mad that it
>> broke.
>>
>>
>> For the rest of the series.
>>
>> Reviewed-by: Ira Weiny 
>> Tested-by: Ira Weiny 
>>  -- Limited to mlx4, qib, and OPA (with additional patches.)
>>
>>
>> On Tue, May 05, 2015 at 02:50:17PM +0200, Michael Wang wrote:
>>> Since v7:
>>>   * Thanks to Doug, Ira, Devesh for the testing :-)
>>>   * Thanks for the comments from or, Doug, Ira, Jason :-)
>>> Please remind me if anything missed :-P
>>>   * Use rdma_cap_XX() instead of cap_XX() for readability
>>>   * Remove CC list in git log for maintainability
>>>   * Use bool as return value
>>>   * Updated github repository to v8
>>>
>>> There are plenty of lengthy code to check the transport type of IB device,
>>> or the link layer type of it's port, but actually we are just speculating
>>> whether a particular management/feature is supported by the device/port.
>>>
>>> Thus instead of inferring, we should have our own mechanism for IB 
>>> management
>>> capability/protocol/feature checking, several proposals below.
>>>
>>> This patch set will introduce query_protocol() to check management 
>>> requirement
>>> instead of inferring from transport and link layer respectively, along with
>>> the new enum on protocol type.
>>>
>>> Mapping List:
>>> node-type   link-layer  transport   protocol
>>> nes RNICETH IWARP   IWARP
>>> amso1100RNICETH IWARP   IWARP
>>> cxgb3   RNICETH IWARP   IWARP
>>> cxgb4   RNICETH IWARP   IWARP
>>> usnic   USNIC_UDP   ETH USNIC_UDP   USNIC_UDP
>>> ocrdma  IB_CA   ETH IB  IBOE
>>> mlx4IB_CA   IB/ETH  IB  IB/IBOE
>>> mlx5IB_CA   IB  IB  IB
>>> ehcaIB_CA   IB  IB  IB
>>> ipath   IB_CA   IB  IB  IB
>>> mthca   IB_CA   IB  IB  IB
>>> qib IB_CA   IB  IB  IB
>>>
>>> For example:
>>> if (transport == IB) && (link-layer == ETH)
>>> will now become:
>>> if (query_protocol() == IBOE)
>>>
>>> Thus we will be able to get rid of the respective transport and link-layer
>>> checking, and it will help us to add new protocol/Technology (like OPA) more
>>> easier, also with the introduced management helpers, IB management logical
>>> will be more clear and easier for extending.
>>>
>>> Highlights:
>>> The long CC list in each patches was complained consider about t

Re: [PATCH v8 00/23] IB/Verbs: IB Management Helpers

2015-05-12 Thread Michael Wang


On 05/12/2015 02:27 AM, Doug Ledford wrote:
 On Mon, 2015-05-11 at 19:49 -0400, ira.weiny wrote:
 I have run with this series and the only issue I have found is not with this
 patch set directly.

 This patch:

   IB/Verbs: Use management helper rdma_cap_ib_mad()

 causes an error when you actually use the port passed from the ib_umad 
 module.
 I have a patch to fix that which I found while trying to build on this series
 for the use of a bit mask.

 Doug, I don't know what you would like to do for this fix.  I am submitting 
 it
 shortly with a new version of the core capability bit patches.  If you want 
 to
 just add it after this series or force Michael to respin with the fix?
 
 As I recall, there was a comment from Or requesting to squash some of
 the individual patches down, but I no longer have that email in my Inbox
 to double check.  And it seemed like there was one other review comment
 not yet addressed.  Do I have that right Michael?  And if so, are you
 working on a v9?

AFAIK Or was asking to merge the #15~23, and want to reserve the changelog
meanwhile reply the cover of prev version (I'm still confused on that...),
I've replied but get no respond yet.

I can make a v9 to merge the #15~#23 if that could benefit the maintainability,
please let me know your opinion :-)

About the Bug, if it was not introduced in this series, maybe including the
fix in next series would be better?

Regards,
Michael Wang

 
   Frankly
 I vote for the former because as it stands this series does not break 
 directly.
 It was only after I changed the implementation of rdma_cap_ib_mad that it
 broke.


 For the rest of the series.

 Reviewed-by: Ira Weiny ira.we...@intel.com
 Tested-by: Ira Weiny ira.we...@intel.com
  -- Limited to mlx4, qib, and OPA (with additional patches.)


 On Tue, May 05, 2015 at 02:50:17PM +0200, Michael Wang wrote:
 Since v7:
   * Thanks to Doug, Ira, Devesh for the testing :-)
   * Thanks for the comments from or, Doug, Ira, Jason :-)
 Please remind me if anything missed :-P
   * Use rdma_cap_XX() instead of cap_XX() for readability
   * Remove CC list in git log for maintainability
   * Use bool as return value
   * Updated github repository to v8

 There are plenty of lengthy code to check the transport type of IB device,
 or the link layer type of it's port, but actually we are just speculating
 whether a particular management/feature is supported by the device/port.

 Thus instead of inferring, we should have our own mechanism for IB 
 management
 capability/protocol/feature checking, several proposals below.

 This patch set will introduce query_protocol() to check management 
 requirement
 instead of inferring from transport and link layer respectively, along with
 the new enum on protocol type.

 Mapping List:
 node-type   link-layer  transport   protocol
 nes RNICETH IWARP   IWARP
 amso1100RNICETH IWARP   IWARP
 cxgb3   RNICETH IWARP   IWARP
 cxgb4   RNICETH IWARP   IWARP
 usnic   USNIC_UDP   ETH USNIC_UDP   USNIC_UDP
 ocrdma  IB_CA   ETH IB  IBOE
 mlx4IB_CA   IB/ETH  IB  IB/IBOE
 mlx5IB_CA   IB  IB  IB
 ehcaIB_CA   IB  IB  IB
 ipath   IB_CA   IB  IB  IB
 mthca   IB_CA   IB  IB  IB
 qib IB_CA   IB  IB  IB

 For example:
 if (transport == IB)  (link-layer == ETH)
 will now become:
 if (query_protocol() == IBOE)

 Thus we will be able to get rid of the respective transport and link-layer
 checking, and it will help us to add new protocol/Technology (like OPA) more
 easier, also with the introduced management helpers, IB management logical
 will be more clear and easier for extending.

 Highlights:
 The long CC list in each patches was complained consider about the
 maintainability, it was suggested folks to provide their reviewed-by or
 Acked-by instead, so for those who used to be on the CC list, please
 provide your signature voluntarily :-)

 The 'mgmt-helpers' branch of 
 'g...@github.com:ywang-pb/infiniband-wy.git'
 contain this series based on the latest 'infiniband/for-next'

 Patch 1#~14# included all the logical reform, 15#~23# introduced the
 management helpers.

 Doug suggested the bitmask mechanism:
 https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23765.html
 which could be the plan for future reforming, we prefer that to be 
 another
 series which focus on semantic and performance.

 This patch-set is somewhat 'bloated' now and it may be a good timing for
 staging, I'd like to suggest we focus on improving existed helpers and 
 push

Re: [PATCH v8 00/23] IB/Verbs: IB Management Helpers

2015-05-12 Thread Michael Wang
On 05/12/2015 01:49 AM, ira.weiny wrote:
 I have run with this series and the only issue I have found is not with this
 patch set directly.
 
 This patch:
 
   IB/Verbs: Use management helper rdma_cap_ib_mad()
 
 causes an error when you actually use the port passed from the ib_umad module.
 I have a patch to fix that which I found while trying to build on this series
 for the use of a bit mask.
 
 Doug, I don't know what you would like to do for this fix.  I am submitting it
 shortly with a new version of the core capability bit patches.  If you want to
 just add it after this series or force Michael to respin with the fix?  
 Frankly
 I vote for the former because as it stands this series does not break 
 directly.
 It was only after I changed the implementation of rdma_cap_ib_mad that it
 broke.

Agree, it sounds more reasonable to include the fix in the series introduced
it :-P

 
 
 For the rest of the series.
 
 Reviewed-by: Ira Weiny ira.we...@intel.com
 Tested-by: Ira Weiny ira.we...@intel.com
   -- Limited to mlx4, qib, and OPA (with additional patches.)

Thanks for the review and testing :-)

Regards,
Michael Wang

 
 
 On Tue, May 05, 2015 at 02:50:17PM +0200, Michael Wang wrote:
 Since v7:
   * Thanks to Doug, Ira, Devesh for the testing :-)
   * Thanks for the comments from or, Doug, Ira, Jason :-)
 Please remind me if anything missed :-P
   * Use rdma_cap_XX() instead of cap_XX() for readability
   * Remove CC list in git log for maintainability
   * Use bool as return value
   * Updated github repository to v8

 There are plenty of lengthy code to check the transport type of IB device,
 or the link layer type of it's port, but actually we are just speculating
 whether a particular management/feature is supported by the device/port.

 Thus instead of inferring, we should have our own mechanism for IB management
 capability/protocol/feature checking, several proposals below.

 This patch set will introduce query_protocol() to check management 
 requirement
 instead of inferring from transport and link layer respectively, along with
 the new enum on protocol type.

 Mapping List:
  node-type   link-layer  transport   protocol
 nes  RNICETH IWARP   IWARP
 amso1100 RNICETH IWARP   IWARP
 cxgb3RNICETH IWARP   IWARP
 cxgb4RNICETH IWARP   IWARP
 usnicUSNIC_UDP   ETH USNIC_UDP   USNIC_UDP
 ocrdma   IB_CA   ETH IB  IBOE
 mlx4 IB_CA   IB/ETH  IB  IB/IBOE
 mlx5 IB_CA   IB  IB  IB
 ehca IB_CA   IB  IB  IB
 ipathIB_CA   IB  IB  IB
 mthcaIB_CA   IB  IB  IB
 qib  IB_CA   IB  IB  IB

 For example:
  if (transport == IB)  (link-layer == ETH)
 will now become:
  if (query_protocol() == IBOE)

 Thus we will be able to get rid of the respective transport and link-layer
 checking, and it will help us to add new protocol/Technology (like OPA) more
 easier, also with the introduced management helpers, IB management logical
 will be more clear and easier for extending.

 Highlights:
 The long CC list in each patches was complained consider about the
 maintainability, it was suggested folks to provide their reviewed-by or
 Acked-by instead, so for those who used to be on the CC list, please
 provide your signature voluntarily :-)

 The 'mgmt-helpers' branch of 'g...@github.com:ywang-pb/infiniband-wy.git'
 contain this series based on the latest 'infiniband/for-next'

 Patch 1#~14# included all the logical reform, 15#~23# introduced the
 management helpers.

 Doug suggested the bitmask mechanism:
  https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23765.html
 which could be the plan for future reforming, we prefer that to be 
 another
 series which focus on semantic and performance.

 This patch-set is somewhat 'bloated' now and it may be a good timing for
 staging, I'd like to suggest we focus on improving existed helpers and 
 push
 all the further reforms into next series ;-)

 Proposals:
 Sean:
  https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23339.html
 Doug:
  https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23418.html
  https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23765.html
 Jason:
  https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23425.html

 Michael Wang (23):
   IB/Verbs: Implement new callback query_protocol()
   IB/Verbs: Implement raw management helpers
   IB/Verbs: Reform IB-core mad/agent/user_mad
   IB/Verbs: Reform IB-core cm
   IB/Verbs: Reform IB-core sa_query
   IB

Re: [PATCH v8 00/23] IB/Verbs: IB Management Helpers

2015-05-12 Thread Michael Wang


On 05/12/2015 04:24 PM, Doug Ledford wrote:
[snip]

 AFAIK Or was asking to merge the #15~23, and want to reserve the changelog
 meanwhile reply the cover of prev version (I'm still confused on that...),
 I've replied but get no respond yet.

 I can make a v9 to merge the #15~#23 if that could benefit the 
 maintainability,
 please let me know your opinion :-)
 
 I don't think it would make a significant difference.  I've pulled the
 v8 patchset out of patchworks and I'll throw a new branch with it
 included up to my github repo sometime today.

Got it :-)

Regards,
Michael Wang

 
 About the Bug, if it was not introduced in this series, maybe including the
 fix in next series would be better?

 Regards,
 Michael Wang


   Frankly
 I vote for the former because as it stands this series does not break 
 directly.
 It was only after I changed the implementation of rdma_cap_ib_mad that it
 broke.


 For the rest of the series.

 Reviewed-by: Ira Weiny ira.we...@intel.com
 Tested-by: Ira Weiny ira.we...@intel.com
-- Limited to mlx4, qib, and OPA (with additional patches.)


 On Tue, May 05, 2015 at 02:50:17PM +0200, Michael Wang wrote:
 Since v7:
   * Thanks to Doug, Ira, Devesh for the testing :-)
   * Thanks for the comments from or, Doug, Ira, Jason :-)
 Please remind me if anything missed :-P
   * Use rdma_cap_XX() instead of cap_XX() for readability
   * Remove CC list in git log for maintainability
   * Use bool as return value
   * Updated github repository to v8

 There are plenty of lengthy code to check the transport type of IB device,
 or the link layer type of it's port, but actually we are just speculating
 whether a particular management/feature is supported by the device/port.

 Thus instead of inferring, we should have our own mechanism for IB 
 management
 capability/protocol/feature checking, several proposals below.

 This patch set will introduce query_protocol() to check management 
 requirement
 instead of inferring from transport and link layer respectively, along 
 with
 the new enum on protocol type.

 Mapping List:
   node-type   link-layer  transport   protocol
 nes   RNICETH IWARP   IWARP
 amso1100  RNICETH IWARP   IWARP
 cxgb3 RNICETH IWARP   IWARP
 cxgb4 RNICETH IWARP   IWARP
 usnic USNIC_UDP   ETH USNIC_UDP   USNIC_UDP
 ocrdmaIB_CA   ETH IB  IBOE
 mlx4  IB_CA   IB/ETH  IB  IB/IBOE
 mlx5  IB_CA   IB  IB  IB
 ehca  IB_CA   IB  IB  IB
 ipath IB_CA   IB  IB  IB
 mthca IB_CA   IB  IB  IB
 qib   IB_CA   IB  IB  IB

 For example:
   if (transport == IB)  (link-layer == ETH)
 will now become:
   if (query_protocol() == IBOE)

 Thus we will be able to get rid of the respective transport and link-layer
 checking, and it will help us to add new protocol/Technology (like OPA) 
 more
 easier, also with the introduced management helpers, IB management logical
 will be more clear and easier for extending.

 Highlights:
 The long CC list in each patches was complained consider about the
 maintainability, it was suggested folks to provide their reviewed-by 
 or
 Acked-by instead, so for those who used to be on the CC list, please
 provide your signature voluntarily :-)

 The 'mgmt-helpers' branch of 
 'g...@github.com:ywang-pb/infiniband-wy.git'
 contain this series based on the latest 'infiniband/for-next'

 Patch 1#~14# included all the logical reform, 15#~23# introduced the
 management helpers.

 Doug suggested the bitmask mechanism:
   https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23765.html
 which could be the plan for future reforming, we prefer that to be 
 another
 series which focus on semantic and performance.

 This patch-set is somewhat 'bloated' now and it may be a good timing 
 for
 staging, I'd like to suggest we focus on improving existed helpers 
 and push
 all the further reforms into next series ;-)

 Proposals:
 Sean:
   https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23339.html
 Doug:
   https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23418.html
   https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23765.html
 Jason:
   https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23425.html

 Michael Wang (23):
   IB/Verbs: Implement new callback query_protocol()
   IB/Verbs: Implement raw management helpers
   IB/Verbs: Reform IB-core mad/agent/user_mad
   IB/Verbs: Reform IB-core cm
   IB/Verbs: Reform IB-core sa_query
   IB/Verbs: Reform IB-core multicast
   IB/Verbs: Reform IB-ulp ipoib
   IB/Verbs: Reform IB-ulp xprtrdma

  1   2   3   4   5   6   7   8   9   10   >