Re: wake_wide mechanism clarification
Hi, Joel On 07/29/2017 10:13 AM, Joel Fernandes wrote: > +Michael Wang on his current email address (old one bounced). (my > reply was to Mike Galbraith but I also meant to CC Michael Wang for > the discussion). Thanks Just back from vacation and saw this long long discussion... I think guys explained well on the idea, wake_wide() just try to filter out the cases that may congest the waker's domain, as much as possible without side effect (ideally). The factor at very beginning is just a static number which picked by enormous times of practice testing, Peter Zijlstr suggest we add the domain size and make it a flexible factor, which by practice not that bad. So the simple answer is we use the llc_size since no better option at that time :-) But things changing very fast and new feature can introduce new cases, I'm pretty sure if we redo the testing the results will be very different, however, the idea itself still make sense to me, at least on theory. Recently I'm also thinking about the scheduler issue, cfs try to find out general solution for all these cases and the best answer is obviously, all the cases will suffer some damage and scheduler itself bloated to achieve the goal 'taking care of all'. So in order to achieve the maximum performance of particular workload, some user defined scheduler would be an interesting idea :-P Regards, Michael Wang > > On Sat, Jul 29, 2017 at 1:01 AM, Joel Fernandes <joe...@google.com> wrote: >> Hi Mike, >> >> I have take spent some time understanding the email thread and >> previous discussions. Unfortunately the second condition we are >> checking for in the wake_wide still didn't make sense to me (mentioned >> below) :-( >> >> On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith >> <umgwanakikb...@gmail.com> wrote: >>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote: >>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote: >>>> >>>>> That makes sense that we multiply slave's flips by a factor because >>>>> its low, but I still didn't get why the factor is chosen to be >>>>> llc_size instead of something else for the multiplication with slave >>>>> (slave * factor). >>> >>>> Yeah I don't know why llc_size was chosen... >>> >>> static void update_top_cache_domain(int cpu) >>> { >>> struct sched_domain_shared *sds = NULL; >>> struct sched_domain *sd; >>> int id = cpu; >>> int size = 1; >>> >>> sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES); >>> if (sd) { >>> id = cpumask_first(sched_domain_span(sd)); >>> size = cpumask_weight(sched_domain_span(sd)); >>> sds = sd->shared; >>> } >>> >>> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd); >>> per_cpu(sd_llc_size, cpu) = size; >>> >>> The goal of wake wide was to approximate when pulling would be a futile >>> consolidation effort and counterproductive to scaling. 'course with >>> ever increasing socket size, any 1:N waker is ever more likely to run >>> out of CPU for its one and only self (slamming into scaling wall) >>> before it needing to turn its minions loose to conquer the world. >> >> Actually the original question was why do we have the second condition >> as "master < slave * factor", instead of "master < factor". that's >> what didn't make sense to me. Why don't we return 0 from wake_wide if >> master < factor ? >> >> Infact, as the factor is set to the llc_size, I think the condition >> that makes sense to me is: >> >> if ((master + slave) < llc_size) >> return 0; >> >> In other words, if the master flips and the slave flips are totally >> higher than the llc_size, then we are most likely waking up too many >> tasks as affine and should then switch to wide to prevent overloading. >> >> Digging further into the original patch from Michael Wang (I also CC'd >> him), this was the code (before you had changed it to master/slave): >> >> wakee->nr_wakee_switch > factor && >> waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch) >> >> To explain the second condition above, Michael Wang said the following in [1] >> >> "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple >> tasks rely on it, then waker's higher latency will damage all of them, pull >> wakee seems to be a bad deal." >> >> Again I didn't f
Re: wake_wide mechanism clarification
Hi, Joel On 07/29/2017 10:13 AM, Joel Fernandes wrote: > +Michael Wang on his current email address (old one bounced). (my > reply was to Mike Galbraith but I also meant to CC Michael Wang for > the discussion). Thanks Just back from vacation and saw this long long discussion... I think guys explained well on the idea, wake_wide() just try to filter out the cases that may congest the waker's domain, as much as possible without side effect (ideally). The factor at very beginning is just a static number which picked by enormous times of practice testing, Peter Zijlstr suggest we add the domain size and make it a flexible factor, which by practice not that bad. So the simple answer is we use the llc_size since no better option at that time :-) But things changing very fast and new feature can introduce new cases, I'm pretty sure if we redo the testing the results will be very different, however, the idea itself still make sense to me, at least on theory. Recently I'm also thinking about the scheduler issue, cfs try to find out general solution for all these cases and the best answer is obviously, all the cases will suffer some damage and scheduler itself bloated to achieve the goal 'taking care of all'. So in order to achieve the maximum performance of particular workload, some user defined scheduler would be an interesting idea :-P Regards, Michael Wang > > On Sat, Jul 29, 2017 at 1:01 AM, Joel Fernandes wrote: >> Hi Mike, >> >> I have take spent some time understanding the email thread and >> previous discussions. Unfortunately the second condition we are >> checking for in the wake_wide still didn't make sense to me (mentioned >> below) :-( >> >> On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith >> wrote: >>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote: >>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote: >>>> >>>>> That makes sense that we multiply slave's flips by a factor because >>>>> its low, but I still didn't get why the factor is chosen to be >>>>> llc_size instead of something else for the multiplication with slave >>>>> (slave * factor). >>> >>>> Yeah I don't know why llc_size was chosen... >>> >>> static void update_top_cache_domain(int cpu) >>> { >>> struct sched_domain_shared *sds = NULL; >>> struct sched_domain *sd; >>> int id = cpu; >>> int size = 1; >>> >>> sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES); >>> if (sd) { >>> id = cpumask_first(sched_domain_span(sd)); >>> size = cpumask_weight(sched_domain_span(sd)); >>> sds = sd->shared; >>> } >>> >>> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd); >>> per_cpu(sd_llc_size, cpu) = size; >>> >>> The goal of wake wide was to approximate when pulling would be a futile >>> consolidation effort and counterproductive to scaling. 'course with >>> ever increasing socket size, any 1:N waker is ever more likely to run >>> out of CPU for its one and only self (slamming into scaling wall) >>> before it needing to turn its minions loose to conquer the world. >> >> Actually the original question was why do we have the second condition >> as "master < slave * factor", instead of "master < factor". that's >> what didn't make sense to me. Why don't we return 0 from wake_wide if >> master < factor ? >> >> Infact, as the factor is set to the llc_size, I think the condition >> that makes sense to me is: >> >> if ((master + slave) < llc_size) >> return 0; >> >> In other words, if the master flips and the slave flips are totally >> higher than the llc_size, then we are most likely waking up too many >> tasks as affine and should then switch to wide to prevent overloading. >> >> Digging further into the original patch from Michael Wang (I also CC'd >> him), this was the code (before you had changed it to master/slave): >> >> wakee->nr_wakee_switch > factor && >> waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch) >> >> To explain the second condition above, Michael Wang said the following in [1] >> >> "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple >> tasks rely on it, then waker's higher latency will damage all of them, pull >> wakee seems to be a bad deal." >> >> Again I didn't follow why the second condition couldn't just be: >> waker-&
[ Linux 4.4 stable ] missing 'printk: set may_schedule for some of console_trylock() callers'
Hi, greg k-h During our testing with 4.4.73 we got soft lockup like: NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [systemd-udevd:856] ... Call Trace: [] vprintk_emit+0x319/0x4a0 [] printk_emit+0x33/0x3b [] ? simple_strtoull+0x2c/0x50 [] devkmsg_write+0xaa/0x100 [] ? vprintk+0x30/0x30 [] do_readv_writev+0x1c2/0x270 [] ? kmem_cache_free+0x7d/0x1a0 [] vfs_writev+0x39/0x50 [] SyS_writev+0x4a/0xd0 [] entry_SYSCALL_64_fastpath+0x12/0x6a Currently in 4.4 the console_unlock() called by vprintk_emit() is with preemption disabled, so the cond_resched is not working, and soft lockup appear if it take too much time on writing data into every console. We found the upstream patch: commit 6b97a20d3a79 printk: set may_schedule for some of console_trylock() callers which should have addressed this issue, but not included in the latest 4.4.78 stable yet, is there any plan on backport it in future? Regards, Michael Wang
[ Linux 4.4 stable ] missing 'printk: set may_schedule for some of console_trylock() callers'
Hi, greg k-h During our testing with 4.4.73 we got soft lockup like: NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [systemd-udevd:856] ... Call Trace: [] vprintk_emit+0x319/0x4a0 [] printk_emit+0x33/0x3b [] ? simple_strtoull+0x2c/0x50 [] devkmsg_write+0xaa/0x100 [] ? vprintk+0x30/0x30 [] do_readv_writev+0x1c2/0x270 [] ? kmem_cache_free+0x7d/0x1a0 [] vfs_writev+0x39/0x50 [] SyS_writev+0x4a/0xd0 [] entry_SYSCALL_64_fastpath+0x12/0x6a Currently in 4.4 the console_unlock() called by vprintk_emit() is with preemption disabled, so the cond_resched is not working, and soft lockup appear if it take too much time on writing data into every console. We found the upstream patch: commit 6b97a20d3a79 printk: set may_schedule for some of console_trylock() callers which should have addressed this issue, but not included in the latest 4.4.78 stable yet, is there any plan on backport it in future? Regards, Michael Wang
Re: [PATCH] md: return 0 instead of error in rdev_attr_show()
We found the upstream fix, sorry for the noise... Regards, Michael Wang On 04/11/2017 12:14 PM, Michael Wang wrote: > > sysfs_kf_read() expect the show() callback return the dumped > length, while rdev_attr_show() can return the error which lead > into overflow: > > BUG: unable to handle kernel paging request at 88040b084000 > IP: [] __memmove+0x24/0x1a0 > PGD 1edb067 PUD 1ede067 PMD 406b9a063 PTE 80040b084161 > Oops: 0003 [#1] SMP > [snip] > Call Trace: > [] ? sysfs_kf_read+0x80/0xb0 > [] kernfs_fop_read+0xab/0x160 > [] __vfs_read+0x28/0xd0 > [] vfs_read+0x86/0x130 > [] SyS_read+0x46/0xa0 > [] entry_SYSCALL_64_fastpath+0x12/0x6a > > Simply return 0 in case of error solved the problem. > > Signed-off-by: Michael Wang <yun.w...@profitbricks.com> > --- > drivers/md/md.c | 7 +++ > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/drivers/md/md.c b/drivers/md/md.c > index 1db88d7..d46d714 100644 > --- a/drivers/md/md.c > +++ b/drivers/md/md.c > @@ -3271,10 +3271,9 @@ rdev_attr_show(struct kobject *kobj, struct attribute > *attr, char *page) > struct rdev_sysfs_entry *entry = container_of(attr, struct > rdev_sysfs_entry, attr); > struct md_rdev *rdev = container_of(kobj, struct md_rdev, kobj); > > - if (!entry->show) > - return -EIO; > - if (!rdev->mddev) > - return -EBUSY; > + if (!entry->show || !rdev->mddev) > + return 0; > + > return entry->show(rdev, page); > } > >
Re: [PATCH] md: return 0 instead of error in rdev_attr_show()
We found the upstream fix, sorry for the noise... Regards, Michael Wang On 04/11/2017 12:14 PM, Michael Wang wrote: > > sysfs_kf_read() expect the show() callback return the dumped > length, while rdev_attr_show() can return the error which lead > into overflow: > > BUG: unable to handle kernel paging request at 88040b084000 > IP: [] __memmove+0x24/0x1a0 > PGD 1edb067 PUD 1ede067 PMD 406b9a063 PTE 80040b084161 > Oops: 0003 [#1] SMP > [snip] > Call Trace: > [] ? sysfs_kf_read+0x80/0xb0 > [] kernfs_fop_read+0xab/0x160 > [] __vfs_read+0x28/0xd0 > [] vfs_read+0x86/0x130 > [] SyS_read+0x46/0xa0 > [] entry_SYSCALL_64_fastpath+0x12/0x6a > > Simply return 0 in case of error solved the problem. > > Signed-off-by: Michael Wang > --- > drivers/md/md.c | 7 +++ > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/drivers/md/md.c b/drivers/md/md.c > index 1db88d7..d46d714 100644 > --- a/drivers/md/md.c > +++ b/drivers/md/md.c > @@ -3271,10 +3271,9 @@ rdev_attr_show(struct kobject *kobj, struct attribute > *attr, char *page) > struct rdev_sysfs_entry *entry = container_of(attr, struct > rdev_sysfs_entry, attr); > struct md_rdev *rdev = container_of(kobj, struct md_rdev, kobj); > > - if (!entry->show) > - return -EIO; > - if (!rdev->mddev) > - return -EBUSY; > + if (!entry->show || !rdev->mddev) > + return 0; > + > return entry->show(rdev, page); > } > >
[PATCH] md: return 0 instead of error in rdev_attr_show()
sysfs_kf_read() expect the show() callback return the dumped length, while rdev_attr_show() can return the error which lead into overflow: BUG: unable to handle kernel paging request at 88040b084000 IP: [] __memmove+0x24/0x1a0 PGD 1edb067 PUD 1ede067 PMD 406b9a063 PTE 80040b084161 Oops: 0003 [#1] SMP [snip] Call Trace: [] ? sysfs_kf_read+0x80/0xb0 [] kernfs_fop_read+0xab/0x160 [] __vfs_read+0x28/0xd0 [] vfs_read+0x86/0x130 [] SyS_read+0x46/0xa0 [] entry_SYSCALL_64_fastpath+0x12/0x6a Simply return 0 in case of error solved the problem. Signed-off-by: Michael Wang <yun.w...@profitbricks.com> --- drivers/md/md.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 1db88d7..d46d714 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -3271,10 +3271,9 @@ rdev_attr_show(struct kobject *kobj, struct attribute *attr, char *page) struct rdev_sysfs_entry *entry = container_of(attr, struct rdev_sysfs_entry, attr); struct md_rdev *rdev = container_of(kobj, struct md_rdev, kobj); - if (!entry->show) - return -EIO; - if (!rdev->mddev) - return -EBUSY; + if (!entry->show || !rdev->mddev) + return 0; + return entry->show(rdev, page); } -- 2.5.0
[PATCH] md: return 0 instead of error in rdev_attr_show()
sysfs_kf_read() expect the show() callback return the dumped length, while rdev_attr_show() can return the error which lead into overflow: BUG: unable to handle kernel paging request at 88040b084000 IP: [] __memmove+0x24/0x1a0 PGD 1edb067 PUD 1ede067 PMD 406b9a063 PTE 80040b084161 Oops: 0003 [#1] SMP [snip] Call Trace: [] ? sysfs_kf_read+0x80/0xb0 [] kernfs_fop_read+0xab/0x160 [] __vfs_read+0x28/0xd0 [] vfs_read+0x86/0x130 [] SyS_read+0x46/0xa0 [] entry_SYSCALL_64_fastpath+0x12/0x6a Simply return 0 in case of error solved the problem. Signed-off-by: Michael Wang --- drivers/md/md.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 1db88d7..d46d714 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -3271,10 +3271,9 @@ rdev_attr_show(struct kobject *kobj, struct attribute *attr, char *page) struct rdev_sysfs_entry *entry = container_of(attr, struct rdev_sysfs_entry, attr); struct md_rdev *rdev = container_of(kobj, struct md_rdev, kobj); - if (!entry->show) - return -EIO; - if (!rdev->mddev) - return -EBUSY; + if (!entry->show || !rdev->mddev) + return 0; + return entry->show(rdev, page); } -- 2.5.0
Re: [RFC PATCH] raid1: reset 'bi_next' before reuse the bio
On 04/05/2017 12:17 AM, NeilBrown wrote: [snip] >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c >> index 7d67235..0554110 100644 >> --- a/drivers/md/raid1.c >> +++ b/drivers/md/raid1.c >> @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio) >> /* Don't try recovering from here - just fail it >> * ... unless it is the last working device of course */ >> md_error(mddev, rdev); >> -if (test_bit(Faulty, >flags)) >> +if (test_bit(Faulty, >flags)) { >> /* Don't try to read from here, but make sure >> * put_buf does it's thing >> */ >> bio->bi_end_io = end_sync_write; >> +bio->bi_next = NULL; >> +} >> } >> >> while(sectors) { > > > Ah - I see what is happening now. I was looking at the vanilla 4.4 > code, which doesn't have the failfast changes. My bad to forgot mention... yes our md stuff is very much close to the upstream. > > I don't think your patch is correct though. We really shouldn't be > re-using that bio, and setting bi_next to NULL just hides the bug. It > doesn't fix it. > As the rdev is now Faulty, it doesn't make sense for > sync_request_write() to submit a write request to it. Make sense, while still have concerns regarding the design: * in this case since the read_disk already abandoned, is it fine to keep r1_bio->read_disk recording the faulty device index? * we assign the 'end_sync_write' to the original read bio in this case, but when is this supposed to be called? > > Can you confirm that this works please. Yes, it works. Tested-by: Michael Wang <yun.w...@profitbricks.com> Regards, Michael Wang > > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c > index d2d8b8a5bd56..219f1e1f1d1d 100644 > --- a/drivers/md/raid1.c > +++ b/drivers/md/raid1.c > @@ -2180,6 +2180,8 @@ static void sync_request_write(struct mddev *mddev, > struct r1bio *r1_bio) >(i == r1_bio->read_disk || > !test_bit(MD_RECOVERY_SYNC, >recovery > continue; > + if (test_bit(Faulty, >mirrors[i].rdev->flags)) > + continue; > > bio_set_op_attrs(wbio, REQ_OP_WRITE, 0); > if (test_bit(FailFast, >mirrors[i].rdev->flags)) > > > Thanks, > NeilBrown >
Re: [RFC PATCH] raid1: reset 'bi_next' before reuse the bio
On 04/05/2017 12:17 AM, NeilBrown wrote: [snip] >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c >> index 7d67235..0554110 100644 >> --- a/drivers/md/raid1.c >> +++ b/drivers/md/raid1.c >> @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio) >> /* Don't try recovering from here - just fail it >> * ... unless it is the last working device of course */ >> md_error(mddev, rdev); >> -if (test_bit(Faulty, >flags)) >> +if (test_bit(Faulty, >flags)) { >> /* Don't try to read from here, but make sure >> * put_buf does it's thing >> */ >> bio->bi_end_io = end_sync_write; >> +bio->bi_next = NULL; >> +} >> } >> >> while(sectors) { > > > Ah - I see what is happening now. I was looking at the vanilla 4.4 > code, which doesn't have the failfast changes. My bad to forgot mention... yes our md stuff is very much close to the upstream. > > I don't think your patch is correct though. We really shouldn't be > re-using that bio, and setting bi_next to NULL just hides the bug. It > doesn't fix it. > As the rdev is now Faulty, it doesn't make sense for > sync_request_write() to submit a write request to it. Make sense, while still have concerns regarding the design: * in this case since the read_disk already abandoned, is it fine to keep r1_bio->read_disk recording the faulty device index? * we assign the 'end_sync_write' to the original read bio in this case, but when is this supposed to be called? > > Can you confirm that this works please. Yes, it works. Tested-by: Michael Wang Regards, Michael Wang > > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c > index d2d8b8a5bd56..219f1e1f1d1d 100644 > --- a/drivers/md/raid1.c > +++ b/drivers/md/raid1.c > @@ -2180,6 +2180,8 @@ static void sync_request_write(struct mddev *mddev, > struct r1bio *r1_bio) >(i == r1_bio->read_disk || > !test_bit(MD_RECOVERY_SYNC, >recovery > continue; > + if (test_bit(Faulty, >mirrors[i].rdev->flags)) > + continue; > > bio_set_op_attrs(wbio, REQ_OP_WRITE, 0); > if (test_bit(FailFast, >mirrors[i].rdev->flags)) > > > Thanks, > NeilBrown >
[RFC PATCH] raid1: reset 'bi_next' before reuse the bio
During the testing we found the sync read bio can go through path: md_do_sync() sync_request() generic_make_request() blk_queue_bio() blk_attempt_plug_merge() bio->bi_next CHAINED HERE ... raid1d() sync_request_write() fix_sync_read_error() if FailFast && Faulty bio->bi_end_io = end_sync_write generic_make_request() BUG_ON(bio->bi_next) This need to meet the conditions: * bio once merged * read disk have FailFast enabled * read disk is Faulty And since the block layer won't reset the 'bi_next' after bio is done inside request, we hit the BUG like that. This patch simply reset the bi_next before we reuse it. Signed-off-by: Michael Wang <yun.w...@profitbricks.com> --- drivers/md/raid1.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 7d67235..0554110 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio) /* Don't try recovering from here - just fail it * ... unless it is the last working device of course */ md_error(mddev, rdev); - if (test_bit(Faulty, >flags)) + if (test_bit(Faulty, >flags)) { /* Don't try to read from here, but make sure * put_buf does it's thing */ bio->bi_end_io = end_sync_write; + bio->bi_next = NULL; + } } while(sectors) { -- 2.5.0
[RFC PATCH] raid1: reset 'bi_next' before reuse the bio
During the testing we found the sync read bio can go through path: md_do_sync() sync_request() generic_make_request() blk_queue_bio() blk_attempt_plug_merge() bio->bi_next CHAINED HERE ... raid1d() sync_request_write() fix_sync_read_error() if FailFast && Faulty bio->bi_end_io = end_sync_write generic_make_request() BUG_ON(bio->bi_next) This need to meet the conditions: * bio once merged * read disk have FailFast enabled * read disk is Faulty And since the block layer won't reset the 'bi_next' after bio is done inside request, we hit the BUG like that. This patch simply reset the bi_next before we reuse it. Signed-off-by: Michael Wang --- drivers/md/raid1.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 7d67235..0554110 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio) /* Don't try recovering from here - just fail it * ... unless it is the last working device of course */ md_error(mddev, rdev); - if (test_bit(Faulty, >flags)) + if (test_bit(Faulty, >flags)) { /* Don't try to read from here, but make sure * put_buf does it's thing */ bio->bi_end_io = end_sync_write; + bio->bi_next = NULL; + } } while(sectors) { -- 2.5.0
Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
On 04/04/2017 02:24 PM, Michael Wang wrote: > On 04/04/2017 12:23 PM, Michael Wang wrote: > [snip] >>> add something like >>> if (wbio->bi_next) >>> printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n", >>> i, r1_bio->read_disk, wbio->bi_end_io); >>> >>> that might help narrow down what is happening. >> >> Just triggered again in 4.4, dmesg like: >> >> [ 399.240230] md: super_written gets error=-5 >> [ 399.240286] md: super_written gets error=-5 >> [ 399.240286] md/raid1:md0: dm-0: unrecoverable I/O read error for block >> 204160 >> [ 399.240300] md/raid1:md0: dm-0: unrecoverable I/O read error for block >> 204160 >> [ 399.240312] md/raid1:md0: dm-0: unrecoverable I/O read error for block >> 204160 >> [ 399.240323] md/raid1:md0: dm-0: unrecoverable I/O read error for block >> 204160 >> [ 399.240334] md/raid1:md0: dm-0: unrecoverable I/O read error for block >> 204160 >> [ 399.240341] md/raid1:md0: dm-0: unrecoverable I/O read error for block >> 204160 >> [ 399.240349] md/raid1:md0: dm-0: unrecoverable I/O read error for block >> 204160 >> [ 399.240352] bi_next!= NULL i=0 read_disk=0 bi_end_io=end_sync_write >> [raid1] > > Is it possible that the fail fast who changed the 'bi_end_io' inside > fix_sync_read_error() help the used bio pass the check? Hi, NeilBrown, below patch fixed the issue in our testing, I'll post a md RFC patch so we can continue the discussion there. Regards, Michael Wang > > I'm not sure but if the read bio was supposed to be reused as write > for fail fast, maybe we should reset it like this? > > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c > index 7d67235..0554110 100644 > --- a/drivers/md/raid1.c > +++ b/drivers/md/raid1.c > @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio) > /* Don't try recovering from here - just fail it > * ... unless it is the last working device of course */ > md_error(mddev, rdev); > - if (test_bit(Faulty, >flags)) > + if (test_bit(Faulty, >flags)) { > /* Don't try to read from here, but make sure > * put_buf does it's thing > */ > bio->bi_end_io = end_sync_write; > + bio->bi_next = NULL; > + } > } > > while(sectors) { > > Regards, > Michael Wang > > >> [ 399.240363] [ cut here ] >> [ 399.240364] kernel BUG at block/blk-core.c:2147! >> [ 399.240365] invalid opcode: [#1] SMP >> [ 399.240378] Modules linked in: ib_srp scsi_transport_srp raid1 md_mod >> ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel >> udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink iTCO_wdt >> iTCO_vendor_support dcdbas dell_smm_hwmon acpi_cpufreq x86_pkg_temp_thermal >> tpm_tis coretemp evdev tpm i2c_i801 crct10dif_pclmul serio_raw crc32_pclmul >> battery processor acpi_pad button kvm_intel kvm dm_round_robin irqbypass >> dm_multipath autofs4 sg sd_mod crc32c_intel ahci libahci psmouse libata >> mlx4_core scsi_mod xhci_pci xhci_hcd mlx_compat fan thermal [last unloaded: >> scsi_transport_srp] >> [ 399.240380] CPU: 1 PID: 2052 Comm: md0_raid1 Not tainted >> 4.4.50-1-pserver+ #26 >> [ 399.240381] Hardware name: Dell Inc. Precision Tower 3620/09WH54, BIOS >> 1.3.6 05/26/2016 >> [ 399.240381] task: 8804031b6200 ti: 8800d72b4000 task.ti: >> 8800d72b4000 >> [ 399.240385] RIP: 0010:[] [] >> generic_make_request+0x29e/0x2a0 >> [ 399.240385] RSP: 0018:8800d72b7d10 EFLAGS: 00010286 >> [ 399.240386] RAX: 8804031b6200 RBX: 8800d2577e00 RCX: >> 3fff >> [ 399.240387] RDX: c001 RSI: 0001 RDI: >> 8800d5e8c1e0 >> [ 399.240387] RBP: 8800d72b7d50 R08: R09: >> 003f >> [ 399.240388] R10: 0004 R11: 001db9ac R12: >> >> [ 399.240388] R13: 8800d2748e00 R14: 88040a016400 R15: >> 8800d2748e40 >> [ 399.240389] FS: () GS:88041dc4() >> knlGS: >> [ 399.240390] CS: 0010 DS: ES: CR0: 80050033 >> [ 399.240390] CR2: 7fb49246a000 CR3: 00040215c000 CR4: >> 003406e0 >> [ 399.240391] DR0: DR1: DR2: >> &
Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
On 04/04/2017 02:24 PM, Michael Wang wrote: > On 04/04/2017 12:23 PM, Michael Wang wrote: > [snip] >>> add something like >>> if (wbio->bi_next) >>> printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n", >>> i, r1_bio->read_disk, wbio->bi_end_io); >>> >>> that might help narrow down what is happening. >> >> Just triggered again in 4.4, dmesg like: >> >> [ 399.240230] md: super_written gets error=-5 >> [ 399.240286] md: super_written gets error=-5 >> [ 399.240286] md/raid1:md0: dm-0: unrecoverable I/O read error for block >> 204160 >> [ 399.240300] md/raid1:md0: dm-0: unrecoverable I/O read error for block >> 204160 >> [ 399.240312] md/raid1:md0: dm-0: unrecoverable I/O read error for block >> 204160 >> [ 399.240323] md/raid1:md0: dm-0: unrecoverable I/O read error for block >> 204160 >> [ 399.240334] md/raid1:md0: dm-0: unrecoverable I/O read error for block >> 204160 >> [ 399.240341] md/raid1:md0: dm-0: unrecoverable I/O read error for block >> 204160 >> [ 399.240349] md/raid1:md0: dm-0: unrecoverable I/O read error for block >> 204160 >> [ 399.240352] bi_next!= NULL i=0 read_disk=0 bi_end_io=end_sync_write >> [raid1] > > Is it possible that the fail fast who changed the 'bi_end_io' inside > fix_sync_read_error() help the used bio pass the check? Hi, NeilBrown, below patch fixed the issue in our testing, I'll post a md RFC patch so we can continue the discussion there. Regards, Michael Wang > > I'm not sure but if the read bio was supposed to be reused as write > for fail fast, maybe we should reset it like this? > > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c > index 7d67235..0554110 100644 > --- a/drivers/md/raid1.c > +++ b/drivers/md/raid1.c > @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio) > /* Don't try recovering from here - just fail it > * ... unless it is the last working device of course */ > md_error(mddev, rdev); > - if (test_bit(Faulty, >flags)) > + if (test_bit(Faulty, >flags)) { > /* Don't try to read from here, but make sure > * put_buf does it's thing > */ > bio->bi_end_io = end_sync_write; > + bio->bi_next = NULL; > + } > } > > while(sectors) { > > Regards, > Michael Wang > > >> [ 399.240363] [ cut here ] >> [ 399.240364] kernel BUG at block/blk-core.c:2147! >> [ 399.240365] invalid opcode: [#1] SMP >> [ 399.240378] Modules linked in: ib_srp scsi_transport_srp raid1 md_mod >> ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel >> udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink iTCO_wdt >> iTCO_vendor_support dcdbas dell_smm_hwmon acpi_cpufreq x86_pkg_temp_thermal >> tpm_tis coretemp evdev tpm i2c_i801 crct10dif_pclmul serio_raw crc32_pclmul >> battery processor acpi_pad button kvm_intel kvm dm_round_robin irqbypass >> dm_multipath autofs4 sg sd_mod crc32c_intel ahci libahci psmouse libata >> mlx4_core scsi_mod xhci_pci xhci_hcd mlx_compat fan thermal [last unloaded: >> scsi_transport_srp] >> [ 399.240380] CPU: 1 PID: 2052 Comm: md0_raid1 Not tainted >> 4.4.50-1-pserver+ #26 >> [ 399.240381] Hardware name: Dell Inc. Precision Tower 3620/09WH54, BIOS >> 1.3.6 05/26/2016 >> [ 399.240381] task: 8804031b6200 ti: 8800d72b4000 task.ti: >> 8800d72b4000 >> [ 399.240385] RIP: 0010:[] [] >> generic_make_request+0x29e/0x2a0 >> [ 399.240385] RSP: 0018:8800d72b7d10 EFLAGS: 00010286 >> [ 399.240386] RAX: 8804031b6200 RBX: 8800d2577e00 RCX: >> 3fff >> [ 399.240387] RDX: c001 RSI: 0001 RDI: >> 8800d5e8c1e0 >> [ 399.240387] RBP: 8800d72b7d50 R08: R09: >> 003f >> [ 399.240388] R10: 0004 R11: 001db9ac R12: >> >> [ 399.240388] R13: 8800d2748e00 R14: 88040a016400 R15: >> 8800d2748e40 >> [ 399.240389] FS: () GS:88041dc4() >> knlGS: >> [ 399.240390] CS: 0010 DS: ES: CR0: 80050033 >> [ 399.240390] CR2: 7fb49246a000 CR3: 00040215c000 CR4: >> 003406e0 >> [ 399.240391] DR0: DR1: DR2: >> &
Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
On 04/04/2017 12:23 PM, Michael Wang wrote: [snip] >> add something like >> if (wbio->bi_next) >> printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n", >> i, r1_bio->read_disk, wbio->bi_end_io); >> >> that might help narrow down what is happening. > > Just triggered again in 4.4, dmesg like: > > [ 399.240230] md: super_written gets error=-5 > [ 399.240286] md: super_written gets error=-5 > [ 399.240286] md/raid1:md0: dm-0: unrecoverable I/O read error for block > 204160 > [ 399.240300] md/raid1:md0: dm-0: unrecoverable I/O read error for block > 204160 > [ 399.240312] md/raid1:md0: dm-0: unrecoverable I/O read error for block > 204160 > [ 399.240323] md/raid1:md0: dm-0: unrecoverable I/O read error for block > 204160 > [ 399.240334] md/raid1:md0: dm-0: unrecoverable I/O read error for block > 204160 > [ 399.240341] md/raid1:md0: dm-0: unrecoverable I/O read error for block > 204160 > [ 399.240349] md/raid1:md0: dm-0: unrecoverable I/O read error for block > 204160 > [ 399.240352] bi_next!= NULL i=0 read_disk=0 bi_end_io=end_sync_write [raid1] Is it possible that the fail fast who changed the 'bi_end_io' inside fix_sync_read_error() help the used bio pass the check? I'm not sure but if the read bio was supposed to be reused as write for fail fast, maybe we should reset it like this? diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 7d67235..0554110 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio) /* Don't try recovering from here - just fail it * ... unless it is the last working device of course */ md_error(mddev, rdev); - if (test_bit(Faulty, >flags)) + if (test_bit(Faulty, >flags)) { /* Don't try to read from here, but make sure * put_buf does it's thing */ bio->bi_end_io = end_sync_write; + bio->bi_next = NULL; + } } while(sectors) { Regards, Michael Wang > [ 399.240363] [ cut here ] > [ 399.240364] kernel BUG at block/blk-core.c:2147! > [ 399.240365] invalid opcode: [#1] SMP > [ 399.240378] Modules linked in: ib_srp scsi_transport_srp raid1 md_mod > ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel > udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink iTCO_wdt > iTCO_vendor_support dcdbas dell_smm_hwmon acpi_cpufreq x86_pkg_temp_thermal > tpm_tis coretemp evdev tpm i2c_i801 crct10dif_pclmul serio_raw crc32_pclmul > battery processor acpi_pad button kvm_intel kvm dm_round_robin irqbypass > dm_multipath autofs4 sg sd_mod crc32c_intel ahci libahci psmouse libata > mlx4_core scsi_mod xhci_pci xhci_hcd mlx_compat fan thermal [last unloaded: > scsi_transport_srp] > [ 399.240380] CPU: 1 PID: 2052 Comm: md0_raid1 Not tainted 4.4.50-1-pserver+ > #26 > [ 399.240381] Hardware name: Dell Inc. Precision Tower 3620/09WH54, BIOS > 1.3.6 05/26/2016 > [ 399.240381] task: 8804031b6200 ti: 8800d72b4000 task.ti: > 8800d72b4000 > [ 399.240385] RIP: 0010:[] [] > generic_make_request+0x29e/0x2a0 > [ 399.240385] RSP: 0018:8800d72b7d10 EFLAGS: 00010286 > [ 399.240386] RAX: 8804031b6200 RBX: 8800d2577e00 RCX: > 3fff > [ 399.240387] RDX: c001 RSI: 0001 RDI: > 8800d5e8c1e0 > [ 399.240387] RBP: 8800d72b7d50 R08: R09: > 003f > [ 399.240388] R10: 0004 R11: 001db9ac R12: > > [ 399.240388] R13: 8800d2748e00 R14: 88040a016400 R15: > 8800d2748e40 > [ 399.240389] FS: () GS:88041dc4() > knlGS: > [ 399.240390] CS: 0010 DS: ES: CR0: 80050033 > [ 399.240390] CR2: 7fb49246a000 CR3: 00040215c000 CR4: > 003406e0 > [ 399.240391] DR0: DR1: DR2: > > [ 399.240391] DR3: DR6: fffe0ff0 DR7: > 0400 > [ 399.240392] Stack: > [ 399.240393] 8800d72b7d18 8800d72b7d30 > > [ 399.240394] a079c290 8800d2577e00 > 8800d2748e00 > [ 399.240395] 8800d72b7e58 a079e74c 88040b661c00 > 8800d2577e00 > [ 399.240396] Call Trace: > [ 399.240398] [] ? sync_request+0xb20/0xb20 [raid1] > [ 399.240400] [] raid1d+0x65c/0x1060 [raid1] > [ 399.240403] [] ? > trace_raw_output_itimer_expire+0x80/0x80 > [ 39
Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
On 04/04/2017 12:23 PM, Michael Wang wrote: [snip] >> add something like >> if (wbio->bi_next) >> printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n", >> i, r1_bio->read_disk, wbio->bi_end_io); >> >> that might help narrow down what is happening. > > Just triggered again in 4.4, dmesg like: > > [ 399.240230] md: super_written gets error=-5 > [ 399.240286] md: super_written gets error=-5 > [ 399.240286] md/raid1:md0: dm-0: unrecoverable I/O read error for block > 204160 > [ 399.240300] md/raid1:md0: dm-0: unrecoverable I/O read error for block > 204160 > [ 399.240312] md/raid1:md0: dm-0: unrecoverable I/O read error for block > 204160 > [ 399.240323] md/raid1:md0: dm-0: unrecoverable I/O read error for block > 204160 > [ 399.240334] md/raid1:md0: dm-0: unrecoverable I/O read error for block > 204160 > [ 399.240341] md/raid1:md0: dm-0: unrecoverable I/O read error for block > 204160 > [ 399.240349] md/raid1:md0: dm-0: unrecoverable I/O read error for block > 204160 > [ 399.240352] bi_next!= NULL i=0 read_disk=0 bi_end_io=end_sync_write [raid1] Is it possible that the fail fast who changed the 'bi_end_io' inside fix_sync_read_error() help the used bio pass the check? I'm not sure but if the read bio was supposed to be reused as write for fail fast, maybe we should reset it like this? diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 7d67235..0554110 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio) /* Don't try recovering from here - just fail it * ... unless it is the last working device of course */ md_error(mddev, rdev); - if (test_bit(Faulty, >flags)) + if (test_bit(Faulty, >flags)) { /* Don't try to read from here, but make sure * put_buf does it's thing */ bio->bi_end_io = end_sync_write; + bio->bi_next = NULL; + } } while(sectors) { Regards, Michael Wang > [ 399.240363] [ cut here ] > [ 399.240364] kernel BUG at block/blk-core.c:2147! > [ 399.240365] invalid opcode: [#1] SMP > [ 399.240378] Modules linked in: ib_srp scsi_transport_srp raid1 md_mod > ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel > udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink iTCO_wdt > iTCO_vendor_support dcdbas dell_smm_hwmon acpi_cpufreq x86_pkg_temp_thermal > tpm_tis coretemp evdev tpm i2c_i801 crct10dif_pclmul serio_raw crc32_pclmul > battery processor acpi_pad button kvm_intel kvm dm_round_robin irqbypass > dm_multipath autofs4 sg sd_mod crc32c_intel ahci libahci psmouse libata > mlx4_core scsi_mod xhci_pci xhci_hcd mlx_compat fan thermal [last unloaded: > scsi_transport_srp] > [ 399.240380] CPU: 1 PID: 2052 Comm: md0_raid1 Not tainted 4.4.50-1-pserver+ > #26 > [ 399.240381] Hardware name: Dell Inc. Precision Tower 3620/09WH54, BIOS > 1.3.6 05/26/2016 > [ 399.240381] task: 8804031b6200 ti: 8800d72b4000 task.ti: > 8800d72b4000 > [ 399.240385] RIP: 0010:[] [] > generic_make_request+0x29e/0x2a0 > [ 399.240385] RSP: 0018:8800d72b7d10 EFLAGS: 00010286 > [ 399.240386] RAX: 8804031b6200 RBX: 8800d2577e00 RCX: > 3fff > [ 399.240387] RDX: c001 RSI: 0001 RDI: > 8800d5e8c1e0 > [ 399.240387] RBP: 8800d72b7d50 R08: R09: > 003f > [ 399.240388] R10: 0004 R11: 001db9ac R12: > > [ 399.240388] R13: 8800d2748e00 R14: 88040a016400 R15: > 8800d2748e40 > [ 399.240389] FS: () GS:88041dc4() > knlGS: > [ 399.240390] CS: 0010 DS: ES: CR0: 80050033 > [ 399.240390] CR2: 7fb49246a000 CR3: 00040215c000 CR4: > 003406e0 > [ 399.240391] DR0: DR1: DR2: > > [ 399.240391] DR3: DR6: fffe0ff0 DR7: > 0400 > [ 399.240392] Stack: > [ 399.240393] 8800d72b7d18 8800d72b7d30 > > [ 399.240394] a079c290 8800d2577e00 > 8800d2748e00 > [ 399.240395] 8800d72b7e58 a079e74c 88040b661c00 > 8800d2577e00 > [ 399.240396] Call Trace: > [ 399.240398] [] ? sync_request+0xb20/0xb20 [raid1] > [ 399.240400] [] raid1d+0x65c/0x1060 [raid1] > [ 399.240403] [] ? > trace_raw_output_itimer_expire+0x80/0x80 > [ 39
Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
On 04/04/2017 11:37 AM, NeilBrown wrote: > On Tue, Apr 04 2017, Michael Wang wrote: [snip] >>> >>> If sync_request_write() is using a bio that has already been used, it >>> should call bio_reset() and fill in the details again. >>> However I don't see how that would happen. >>> Can you give specific details on the situation that triggers the bug? >> >> We have storage side mapping lv through scst to server, on server side >> we assemble them into multipath device, and then assemble these dm into >> two raid1. >> >> The test is firstly do mkfs.ext4 on raid1 then start fio on it, on storage >> side we unmap all the lv (could during mkfs or fio), then on server side >> we hit the BUG (reproducible). > > So I assume the initial resync is still happening at this point? > And you unmap *all* the lv's so you expect IO to fail? > I can see that the code would behave strangely if you have a > bad-block-list configured (which is the default). > Do you have a bbl? If you create the array without the bbl, does it > still crash? The resync is at least happen concurrently in this case, we try to simulate the case that all the connections dropped, the IO do failed, also bunch of kernel log like: md: super_written gets error=-5 blk_update_request: I/O error, dev dm-3, sector 64184 md/raid1:md1: dm-2: unrecoverable I/O read error for block 46848 we expect that to happen, but server should not crash on BUG. And we haven't done any thing special regarding bbl, the bad_blocks in sysfs are all empty. > >> >> The path of bio was confirmed by add tracing, it is reused in >> sync_request_write() >> with 'bi_next' once chained inside blk_attempt_plug_merge(). > > I still don't see why it is re-used. > I assume you didn't explicitly ask for a check/repair (i.e. didn't write > to .../md/sync_action at all?). In that case MD_RECOVERY_REQUESTED is > not set. Just unmap lv on storage side, no operation on server side. > So sync_request() sends only one bio to generic_make_request(): >r1_bio->bios[r1_bio->read_disk]; > > then sync_request_write() *doesn't* send that bio again, but does send > all the others. > > So where does it reuse a bio? If that's the design then it would be strange... the log do showing the path of that bio go through sync_request(), will do more investigation. > >> >> We also tried to reset the bi_next inside sync_request_write() before >> generic_make_request() which also works. >> >> The testing was done with 4.4, but we found upstream also left bi_next >> chained after done in request, thus we post this RFC. >> >> Regarding raid1, we haven't found the place on path where the bio was >> reset... where does it supposed to be? > > I'm not sure what you mean. > We only reset bios when they are being reused. > One place is in process_checks() where bio_reset() is called before > filling in all the details. > > > Maybe, in sync_request_write(), before > > wbio->bi_rw = WRITE; > > add something like > if (wbio->bi_next) > printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n", > i, r1_bio->read_disk, wbio->bi_end_io); > > that might help narrow down what is happening. Just triggered again in 4.4, dmesg like: [ 399.240230] md: super_written gets error=-5 [ 399.240286] md: super_written gets error=-5 [ 399.240286] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160 [ 399.240300] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160 [ 399.240312] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160 [ 399.240323] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160 [ 399.240334] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160 [ 399.240341] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160 [ 399.240349] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160 [ 399.240352] bi_next!= NULL i=0 read_disk=0 bi_end_io=end_sync_write [raid1] [ 399.240363] [ cut here ] [ 399.240364] kernel BUG at block/blk-core.c:2147! [ 399.240365] invalid opcode: [#1] SMP [ 399.240378] Modules linked in: ib_srp scsi_transport_srp raid1 md_mod ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink iTCO_wdt iTCO_vendor_support dcdbas dell_smm_hwmon acpi_cpufreq x86_pkg_temp_thermal tpm_tis coretemp evdev tpm i2c_i801 crct10dif_pclmul serio_raw crc32_pclmul battery processor acpi_pad button kvm_intel kvm dm_round_robin irqbypass dm_multipath autofs4 sg sd_mod crc32c_intel ahci libahci psmouse libata mlx4_core scsi_mod xhci_pci xhci_hcd mlx_compat fan thermal
Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
On 04/04/2017 11:37 AM, NeilBrown wrote: > On Tue, Apr 04 2017, Michael Wang wrote: [snip] >>> >>> If sync_request_write() is using a bio that has already been used, it >>> should call bio_reset() and fill in the details again. >>> However I don't see how that would happen. >>> Can you give specific details on the situation that triggers the bug? >> >> We have storage side mapping lv through scst to server, on server side >> we assemble them into multipath device, and then assemble these dm into >> two raid1. >> >> The test is firstly do mkfs.ext4 on raid1 then start fio on it, on storage >> side we unmap all the lv (could during mkfs or fio), then on server side >> we hit the BUG (reproducible). > > So I assume the initial resync is still happening at this point? > And you unmap *all* the lv's so you expect IO to fail? > I can see that the code would behave strangely if you have a > bad-block-list configured (which is the default). > Do you have a bbl? If you create the array without the bbl, does it > still crash? The resync is at least happen concurrently in this case, we try to simulate the case that all the connections dropped, the IO do failed, also bunch of kernel log like: md: super_written gets error=-5 blk_update_request: I/O error, dev dm-3, sector 64184 md/raid1:md1: dm-2: unrecoverable I/O read error for block 46848 we expect that to happen, but server should not crash on BUG. And we haven't done any thing special regarding bbl, the bad_blocks in sysfs are all empty. > >> >> The path of bio was confirmed by add tracing, it is reused in >> sync_request_write() >> with 'bi_next' once chained inside blk_attempt_plug_merge(). > > I still don't see why it is re-used. > I assume you didn't explicitly ask for a check/repair (i.e. didn't write > to .../md/sync_action at all?). In that case MD_RECOVERY_REQUESTED is > not set. Just unmap lv on storage side, no operation on server side. > So sync_request() sends only one bio to generic_make_request(): >r1_bio->bios[r1_bio->read_disk]; > > then sync_request_write() *doesn't* send that bio again, but does send > all the others. > > So where does it reuse a bio? If that's the design then it would be strange... the log do showing the path of that bio go through sync_request(), will do more investigation. > >> >> We also tried to reset the bi_next inside sync_request_write() before >> generic_make_request() which also works. >> >> The testing was done with 4.4, but we found upstream also left bi_next >> chained after done in request, thus we post this RFC. >> >> Regarding raid1, we haven't found the place on path where the bio was >> reset... where does it supposed to be? > > I'm not sure what you mean. > We only reset bios when they are being reused. > One place is in process_checks() where bio_reset() is called before > filling in all the details. > > > Maybe, in sync_request_write(), before > > wbio->bi_rw = WRITE; > > add something like > if (wbio->bi_next) > printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n", > i, r1_bio->read_disk, wbio->bi_end_io); > > that might help narrow down what is happening. Just triggered again in 4.4, dmesg like: [ 399.240230] md: super_written gets error=-5 [ 399.240286] md: super_written gets error=-5 [ 399.240286] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160 [ 399.240300] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160 [ 399.240312] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160 [ 399.240323] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160 [ 399.240334] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160 [ 399.240341] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160 [ 399.240349] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160 [ 399.240352] bi_next!= NULL i=0 read_disk=0 bi_end_io=end_sync_write [raid1] [ 399.240363] [ cut here ] [ 399.240364] kernel BUG at block/blk-core.c:2147! [ 399.240365] invalid opcode: [#1] SMP [ 399.240378] Modules linked in: ib_srp scsi_transport_srp raid1 md_mod ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink iTCO_wdt iTCO_vendor_support dcdbas dell_smm_hwmon acpi_cpufreq x86_pkg_temp_thermal tpm_tis coretemp evdev tpm i2c_i801 crct10dif_pclmul serio_raw crc32_pclmul battery processor acpi_pad button kvm_intel kvm dm_round_robin irqbypass dm_multipath autofs4 sg sd_mod crc32c_intel ahci libahci psmouse libata mlx4_core scsi_mod xhci_pci xhci_hcd mlx_compat fan thermal
Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
Hi, Neil On 04/03/2017 11:25 PM, NeilBrown wrote: > On Mon, Apr 03 2017, Michael Wang wrote: > >> blk_attempt_plug_merge() try to merge bio into request and chain them >> by 'bi_next', while after the bio is done inside request, we forgot to >> reset the 'bi_next'. >> >> This lead into BUG while removing all the underlying devices from md-raid1, >> the bio once go through: >> >> md_do_sync() >> sync_request() >> generic_make_request() > > This is a read request from the "first" device. > >> blk_queue_bio() >> blk_attempt_plug_merge() >> CHAINED HERE >> >> will keep chained and reused by: >> >> raid1d() >> sync_request_write() >> generic_make_request() > > This is a write request to some other device, isn't it? > > If sync_request_write() is using a bio that has already been used, it > should call bio_reset() and fill in the details again. > However I don't see how that would happen. > Can you give specific details on the situation that triggers the bug? We have storage side mapping lv through scst to server, on server side we assemble them into multipath device, and then assemble these dm into two raid1. The test is firstly do mkfs.ext4 on raid1 then start fio on it, on storage side we unmap all the lv (could during mkfs or fio), then on server side we hit the BUG (reproducible). The path of bio was confirmed by add tracing, it is reused in sync_request_write() with 'bi_next' once chained inside blk_attempt_plug_merge(). We also tried to reset the bi_next inside sync_request_write() before generic_make_request() which also works. The testing was done with 4.4, but we found upstream also left bi_next chained after done in request, thus we post this RFC. Regarding raid1, we haven't found the place on path where the bio was reset... where does it supposed to be? BTW the fix_sync_read_error() also invoked and succeed before trigger the BUG. Regards, Michael Wang > > Thanks, > NeilBrown > > >> BUG_ON(bio->bi_next) >> >> After reset the 'bi_next' this can no longer happen. >> >> Signed-off-by: Michael Wang <yun.w...@profitbricks.com> >> --- >> block/blk-core.c | 4 +++- >> 1 file changed, 3 insertions(+), 1 deletion(-) >> >> diff --git a/block/blk-core.c b/block/blk-core.c >> index 43b7d06..91223b2 100644 >> --- a/block/blk-core.c >> +++ b/block/blk-core.c >> @@ -2619,8 +2619,10 @@ bool blk_update_request(struct request *req, int >> error, unsigned int nr_bytes) >> struct bio *bio = req->bio; >> unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes); >> >> - if (bio_bytes == bio->bi_iter.bi_size) >> + if (bio_bytes == bio->bi_iter.bi_size) { >> req->bio = bio->bi_next; >> + bio->bi_next = NULL; >> + } >> >> req_bio_endio(req, bio, bio_bytes, error); >> >> -- >> 2.5.0
Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
Hi, Neil On 04/03/2017 11:25 PM, NeilBrown wrote: > On Mon, Apr 03 2017, Michael Wang wrote: > >> blk_attempt_plug_merge() try to merge bio into request and chain them >> by 'bi_next', while after the bio is done inside request, we forgot to >> reset the 'bi_next'. >> >> This lead into BUG while removing all the underlying devices from md-raid1, >> the bio once go through: >> >> md_do_sync() >> sync_request() >> generic_make_request() > > This is a read request from the "first" device. > >> blk_queue_bio() >> blk_attempt_plug_merge() >> CHAINED HERE >> >> will keep chained and reused by: >> >> raid1d() >> sync_request_write() >> generic_make_request() > > This is a write request to some other device, isn't it? > > If sync_request_write() is using a bio that has already been used, it > should call bio_reset() and fill in the details again. > However I don't see how that would happen. > Can you give specific details on the situation that triggers the bug? We have storage side mapping lv through scst to server, on server side we assemble them into multipath device, and then assemble these dm into two raid1. The test is firstly do mkfs.ext4 on raid1 then start fio on it, on storage side we unmap all the lv (could during mkfs or fio), then on server side we hit the BUG (reproducible). The path of bio was confirmed by add tracing, it is reused in sync_request_write() with 'bi_next' once chained inside blk_attempt_plug_merge(). We also tried to reset the bi_next inside sync_request_write() before generic_make_request() which also works. The testing was done with 4.4, but we found upstream also left bi_next chained after done in request, thus we post this RFC. Regarding raid1, we haven't found the place on path where the bio was reset... where does it supposed to be? BTW the fix_sync_read_error() also invoked and succeed before trigger the BUG. Regards, Michael Wang > > Thanks, > NeilBrown > > >> BUG_ON(bio->bi_next) >> >> After reset the 'bi_next' this can no longer happen. >> >> Signed-off-by: Michael Wang >> --- >> block/blk-core.c | 4 +++- >> 1 file changed, 3 insertions(+), 1 deletion(-) >> >> diff --git a/block/blk-core.c b/block/blk-core.c >> index 43b7d06..91223b2 100644 >> --- a/block/blk-core.c >> +++ b/block/blk-core.c >> @@ -2619,8 +2619,10 @@ bool blk_update_request(struct request *req, int >> error, unsigned int nr_bytes) >> struct bio *bio = req->bio; >> unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes); >> >> - if (bio_bytes == bio->bi_iter.bi_size) >> + if (bio_bytes == bio->bi_iter.bi_size) { >> req->bio = bio->bi_next; >> + bio->bi_next = NULL; >> + } >> >> req_bio_endio(req, bio, bio_bytes, error); >> >> -- >> 2.5.0
[RFC PATCH] blk: reset 'bi_next' when bio is done inside request
blk_attempt_plug_merge() try to merge bio into request and chain them by 'bi_next', while after the bio is done inside request, we forgot to reset the 'bi_next'. This lead into BUG while removing all the underlying devices from md-raid1, the bio once go through: md_do_sync() sync_request() generic_make_request() blk_queue_bio() blk_attempt_plug_merge() CHAINED HERE will keep chained and reused by: raid1d() sync_request_write() generic_make_request() BUG_ON(bio->bi_next) After reset the 'bi_next' this can no longer happen. Signed-off-by: Michael Wang <yun.w...@profitbricks.com> --- block/blk-core.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/block/blk-core.c b/block/blk-core.c index 43b7d06..91223b2 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2619,8 +2619,10 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes) struct bio *bio = req->bio; unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes); - if (bio_bytes == bio->bi_iter.bi_size) + if (bio_bytes == bio->bi_iter.bi_size) { req->bio = bio->bi_next; + bio->bi_next = NULL; + } req_bio_endio(req, bio, bio_bytes, error); -- 2.5.0
[RFC PATCH] blk: reset 'bi_next' when bio is done inside request
blk_attempt_plug_merge() try to merge bio into request and chain them by 'bi_next', while after the bio is done inside request, we forgot to reset the 'bi_next'. This lead into BUG while removing all the underlying devices from md-raid1, the bio once go through: md_do_sync() sync_request() generic_make_request() blk_queue_bio() blk_attempt_plug_merge() CHAINED HERE will keep chained and reused by: raid1d() sync_request_write() generic_make_request() BUG_ON(bio->bi_next) After reset the 'bi_next' this can no longer happen. Signed-off-by: Michael Wang --- block/blk-core.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/block/blk-core.c b/block/blk-core.c index 43b7d06..91223b2 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2619,8 +2619,10 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes) struct bio *bio = req->bio; unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes); - if (bio_bytes == bio->bi_iter.bi_size) + if (bio_bytes == bio->bi_iter.bi_size) { req->bio = bio->bi_next; + bio->bi_next = NULL; + } req_bio_endio(req, bio, bio_bytes, error); -- 2.5.0
Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
Dear Maintainers I'd like to ask for the status of this patch since we hit the issue too during our testing on md raid1. Split remainder bio_A was queued ahead, following by bio_B for lower device, at this moment raid start freezing, the loop take out bio_A firstly and deliver it, which will hung since raid is freezing, while the freezing never end since it waiting for bio_B to finish, and bio_B is still on the queue, waiting for bio_A to finish... We're looking for a good solution and we found this patch already progressed a lot, but we can't find it on linux-next, so we'd like to ask are we still planning to have this fix in upstream? Regards, Michael Wang On 07/11/2016 04:10 PM, Lars Ellenberg wrote: > For a long time, generic_make_request() converts recursion into > iteration by queuing recursive arguments on current->bio_list. > > This is convenient for stacking drivers, > the top-most driver would take the originally submitted bio, > and re-submit a re-mapped version of it, or one or more clones, > or one or more new allocated bios to its backend(s). Which > are then simply processed in turn, and each can again queue > more "backend-bios" until we reach the bottom of the driver stack, > and actually dispatch to the real backend device. > > Any stacking driver ->make_request_fn() could expect that, > once it returns, any backend-bios it submitted via recursive calls > to generic_make_request() would now be processed and dispatched, before > the current task would call into this driver again. > > This is changed by commit > 54efd50 block: make generic_make_request handle arbitrarily sized bios > > Drivers may call blk_queue_split() inside their ->make_request_fn(), > which may split the current bio into a front-part to be dealt with > immediately, and a remainder-part, which may need to be split even > further. That remainder-part will simply also be pushed to > current->bio_list, and would end up being head-of-queue, in front > of any backend-bios the current make_request_fn() might submit during > processing of the fron-part. > > Which means the current task would immediately end up back in the same > make_request_fn() of the same driver again, before any of its backend > bios have even been processed. > > This can lead to resource starvation deadlock. > Drivers could avoid this by learning to not need blk_queue_split(), > or by submitting their backend bios in a different context (dedicated > kernel thread, work_queue context, ...). Or by playing funny re-ordering > games with entries on current->bio_list. > > Instead, I suggest to distinguish between recursive calls to > generic_make_request(), and pushing back the remainder part in > blk_queue_split(), by pointing current->bio_lists to a > struct recursion_to_iteration_bio_lists { > struct bio_list recursion; > struct bio_list queue; > } > > By providing each q->make_request_fn() with an empty "recursion" > bio_list, then merging any recursively submitted bios to the > head of the "queue" list, we can make the recursion-to-iteration > logic in generic_make_request() process deepest level bios first, > and "sibling" bios of the same level in "natural" order. > > Signed-off-by: Lars Ellenberg <lars.ellenb...@linbit.com> > Signed-off-by: Roland Kammerer <roland.kamme...@linbit.com> > --- > block/bio.c | 20 +++ > block/blk-core.c | 49 > +-- > block/blk-merge.c | 5 - > drivers/md/bcache/btree.c | 12 ++-- > drivers/md/dm-bufio.c | 2 +- > drivers/md/raid1.c| 5 ++--- > drivers/md/raid10.c | 5 ++--- > include/linux/bio.h | 25 > include/linux/sched.h | 4 ++-- > 9 files changed, 80 insertions(+), 47 deletions(-) > > diff --git a/block/bio.c b/block/bio.c > index 848cd35..c2606fd 100644 > --- a/block/bio.c > +++ b/block/bio.c > @@ -366,12 +366,16 @@ static void punt_bios_to_rescuer(struct bio_set *bs) >*/ > > bio_list_init(); > - bio_list_init(); > > - while ((bio = bio_list_pop(current->bio_list))) > + bio_list_init(); > + while ((bio = bio_list_pop(>bio_lists->recursion))) > bio_list_add(bio->bi_pool == bs ? : , bio); > + current->bio_lists->recursion = nopunt; > > - *current->bio_list = nopunt; > + bio_list_init(); > + while ((bio = bio_list_pop(>bio_lists->queue))) > + bio_list_add(bio->bi_pool == bs ? : , bio); > + current->bio_lists->queue = nopunt; > >
Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
Dear Maintainers I'd like to ask for the status of this patch since we hit the issue too during our testing on md raid1. Split remainder bio_A was queued ahead, following by bio_B for lower device, at this moment raid start freezing, the loop take out bio_A firstly and deliver it, which will hung since raid is freezing, while the freezing never end since it waiting for bio_B to finish, and bio_B is still on the queue, waiting for bio_A to finish... We're looking for a good solution and we found this patch already progressed a lot, but we can't find it on linux-next, so we'd like to ask are we still planning to have this fix in upstream? Regards, Michael Wang On 07/11/2016 04:10 PM, Lars Ellenberg wrote: > For a long time, generic_make_request() converts recursion into > iteration by queuing recursive arguments on current->bio_list. > > This is convenient for stacking drivers, > the top-most driver would take the originally submitted bio, > and re-submit a re-mapped version of it, or one or more clones, > or one or more new allocated bios to its backend(s). Which > are then simply processed in turn, and each can again queue > more "backend-bios" until we reach the bottom of the driver stack, > and actually dispatch to the real backend device. > > Any stacking driver ->make_request_fn() could expect that, > once it returns, any backend-bios it submitted via recursive calls > to generic_make_request() would now be processed and dispatched, before > the current task would call into this driver again. > > This is changed by commit > 54efd50 block: make generic_make_request handle arbitrarily sized bios > > Drivers may call blk_queue_split() inside their ->make_request_fn(), > which may split the current bio into a front-part to be dealt with > immediately, and a remainder-part, which may need to be split even > further. That remainder-part will simply also be pushed to > current->bio_list, and would end up being head-of-queue, in front > of any backend-bios the current make_request_fn() might submit during > processing of the fron-part. > > Which means the current task would immediately end up back in the same > make_request_fn() of the same driver again, before any of its backend > bios have even been processed. > > This can lead to resource starvation deadlock. > Drivers could avoid this by learning to not need blk_queue_split(), > or by submitting their backend bios in a different context (dedicated > kernel thread, work_queue context, ...). Or by playing funny re-ordering > games with entries on current->bio_list. > > Instead, I suggest to distinguish between recursive calls to > generic_make_request(), and pushing back the remainder part in > blk_queue_split(), by pointing current->bio_lists to a > struct recursion_to_iteration_bio_lists { > struct bio_list recursion; > struct bio_list queue; > } > > By providing each q->make_request_fn() with an empty "recursion" > bio_list, then merging any recursively submitted bios to the > head of the "queue" list, we can make the recursion-to-iteration > logic in generic_make_request() process deepest level bios first, > and "sibling" bios of the same level in "natural" order. > > Signed-off-by: Lars Ellenberg > Signed-off-by: Roland Kammerer > --- > block/bio.c | 20 +++ > block/blk-core.c | 49 > +-- > block/blk-merge.c | 5 - > drivers/md/bcache/btree.c | 12 ++-- > drivers/md/dm-bufio.c | 2 +- > drivers/md/raid1.c| 5 ++--- > drivers/md/raid10.c | 5 ++--- > include/linux/bio.h | 25 > include/linux/sched.h | 4 ++-- > 9 files changed, 80 insertions(+), 47 deletions(-) > > diff --git a/block/bio.c b/block/bio.c > index 848cd35..c2606fd 100644 > --- a/block/bio.c > +++ b/block/bio.c > @@ -366,12 +366,16 @@ static void punt_bios_to_rescuer(struct bio_set *bs) >*/ > > bio_list_init(); > - bio_list_init(); > > - while ((bio = bio_list_pop(current->bio_list))) > + bio_list_init(); > + while ((bio = bio_list_pop(>bio_lists->recursion))) > bio_list_add(bio->bi_pool == bs ? : , bio); > + current->bio_lists->recursion = nopunt; > > - *current->bio_list = nopunt; > + bio_list_init(); > + while ((bio = bio_list_pop(>bio_lists->queue))) > + bio_list_add(bio->bi_pool == bs ? : , bio); > + current->bio_lists->queue = nopunt; > > spin_lock(>rescue_lock); > bio_list_
[BUG] block: bdi_register_owner() failure cause NULL pointer dereference
Hi, Folks We observed the hard lockup while trying raid assemble with sas3ircu, it was start with the failure inside bdi_register_owner() with duplicated kobj path, and later comeup the NULL pointer dereference, after that system hang and we saw hard lockup on screen. The duplicated issue could be with the scsi controller driver and we are going to upgrade it anyway, but my question is why we don't do some error handling like: diff --git a/block/genhd.c b/block/genhd.c index a178c8e..318bc63 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -614,7 +614,15 @@ void device_add_disk(struct device *parent, struct gendisk *disk) /* Register BDI before referencing it from bdev */ bdi = >queue->backing_dev_info; - bdi_register_owner(bdi, disk_to_dev(disk)); + if (bdi_register_owner(bdi, disk_to_dev(disk))) { + disk_release_events(disk); + blk_free_devt(devt); + disk->ev = NULL; + disk->first_minor = 0; + disk->major = 0; + WARN_ON(1); + return; + } blk_register_region(disk_devt(disk), disk->minors, NULL, exact_match, exact_lock, disk); to prevent the following NULL pointer dereference and hard lockup? Regards, Michael Wang Sep 29 09:53:28 st401b-3 systemd[1]: Starting Update UTMP about System Runlevel Changes... Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen and drop on 1 v6wildcard :: UDP 123 Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen normally on 2 lo 127.0.0.1 UDP 123 Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen normally on 3 eth0 10.41.12.3 UDP 123 Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen normally on 4 lo ::1 UDP 123 Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen normally on 5 eth0 fe80::ec4:7aff:feab:6b0 UDP 123 Sep 29 09:53:28 st401b-3 ntpd[4970]: peers refreshed Sep 29 09:53:28 st401b-3 ntpd[4970]: Listening on routing socket on fd #22 for interface updates Sep 29 09:53:28 st401b-3 systemd[1]: Started Update UTMP about System Runlevel Changes. Sep 29 09:53:28 st401b-3 systemd[1]: Startup finished in 18.720s (kernel) + 39.513s (userspace) = 58.233s. Sep 29 09:55:01 st401b-3 CRON[5433]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1) Sep 29 09:55:01 st401b-3 CRON[5434]: (root) CMD (test -x /opt/profitbricks/bin/check_memory && /opt/profitbricks/bin/check_memory) Sep 29 09:55:17 st401b-3 kernel: [ 167.693658] scsi 0:1:0:0: Direct-Access LSI Logical Volume 3000 PQ: 0 ANSI: 6 Sep 29 09:55:17 st401b-3 kernel: [ 167.693795] scsi 0:1:0:0: RAID1: handle(0x0143), wwid(0x02f7ec7949091b05), pd_count(2), type(SSP) Sep 29 09:55:17 st401b-3 kernel: [ 167.694042] sd 0:1:0:0: [sdam] 5859373056 512-byte logical blocks: (3.00 TB/2.73 TiB) Sep 29 09:55:17 st401b-3 kernel: [ 167.694044] sd 0:1:0:0: [sdam] 4096-byte physical blocks Sep 29 09:55:17 st401b-3 kernel: [ 167.694057] sd 0:1:0:0: Attached scsi generic sg40 type 0 Sep 29 09:55:17 st401b-3 kernel: [ 167.694129] sd 0:1:0:0: [sdam] Write Protect is off Sep 29 09:55:17 st401b-3 kernel: [ 167.694131] sd 0:1:0:0: [sdam] Mode Sense: 03 00 00 08 Sep 29 09:55:17 st401b-3 kernel: [ 167.694166] sd 0:1:0:0: [sdam] No Caching mode page found Sep 29 09:55:17 st401b-3 kernel: [ 167.694282] sd 0:0:4:0: hidding raid component Sep 29 09:55:17 st401b-3 kernel: [ 167.694589] sd 0:1:0:0: [sdam] Assuming drive cache: write through Sep 29 09:55:17 st401b-3 kernel: [ 167.703346] sd 0:1:0:0: [sdam] Attached SCSI disk Sep 29 09:55:17 st401b-3 kernel: [ 167.703653] sd 0:0:5:0: hidding raid component Sep 29 09:55:39 st401b-3 check_backup_lvm_push: critical: local git command rev-parse HEAD failed, retval: 0, fatal: Not a git repository: '/var/lib/backup-lvm/.git/' Sep 29 09:56:03 st401b-3 kernel: [ 213.684812] scsi 0:1:1:0: Direct-Access LSI Logical Volume 3000 PQ: 0 ANSI: 6 Sep 29 09:56:03 st401b-3 kernel: [ 213.684946] scsi 0:1:1:0: RAID1: handle(0x0142), wwid(0x0ab3eca651cd1b58), pd_count(2), type(SSP) Sep 29 09:56:03 st401b-3 kernel: [ 213.685189] sd 0:1:1:0: [sde] 5859373056 512-byte logical blocks: (3.00 TB/2.73 TiB) Sep 29 09:56:03 st401b-3 kernel: [ 213.685192] sd 0:1:1:0: [sde] 4096-byte physical blocks Sep 29 09:56:03 st401b-3 kernel: [ 213.685204] sd 0:1:1:0: Attached scsi generic sg41 type 0 Sep 29 09:56:03 st401b-3 kernel: [ 213.685275] sd 0:1:1:0: [sde] Write Protect is off Sep 29 09:56:03 st401b-3 kernel: [ 213.685277] sd 0:1:1:0: [sde] Mode Sense: 03 00 00 08 Sep 29 09:56:03 st401b-3 kernel: [ 213.685307] sd 0:1:1:0: [sde] No Caching mode page found Sep 29 09:56:03 st401b-3 kernel: [ 213.685423] sd 0:0:6:0: hidding raid component Sep 29 09:56:03 st401b-3 kernel: [ 213.685698] sd 0:1:1:0: [sde] Assuming drive cache: write through Sep 29 09:56:03 st401b-3 kernel: [ 213.686226] [ cut here ] Sep 29 09:56:03 st401b-3 kernel: [ 213.686232] WARNING: C
[BUG] block: bdi_register_owner() failure cause NULL pointer dereference
Hi, Folks We observed the hard lockup while trying raid assemble with sas3ircu, it was start with the failure inside bdi_register_owner() with duplicated kobj path, and later comeup the NULL pointer dereference, after that system hang and we saw hard lockup on screen. The duplicated issue could be with the scsi controller driver and we are going to upgrade it anyway, but my question is why we don't do some error handling like: diff --git a/block/genhd.c b/block/genhd.c index a178c8e..318bc63 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -614,7 +614,15 @@ void device_add_disk(struct device *parent, struct gendisk *disk) /* Register BDI before referencing it from bdev */ bdi = >queue->backing_dev_info; - bdi_register_owner(bdi, disk_to_dev(disk)); + if (bdi_register_owner(bdi, disk_to_dev(disk))) { + disk_release_events(disk); + blk_free_devt(devt); + disk->ev = NULL; + disk->first_minor = 0; + disk->major = 0; + WARN_ON(1); + return; + } blk_register_region(disk_devt(disk), disk->minors, NULL, exact_match, exact_lock, disk); to prevent the following NULL pointer dereference and hard lockup? Regards, Michael Wang Sep 29 09:53:28 st401b-3 systemd[1]: Starting Update UTMP about System Runlevel Changes... Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen and drop on 1 v6wildcard :: UDP 123 Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen normally on 2 lo 127.0.0.1 UDP 123 Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen normally on 3 eth0 10.41.12.3 UDP 123 Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen normally on 4 lo ::1 UDP 123 Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen normally on 5 eth0 fe80::ec4:7aff:feab:6b0 UDP 123 Sep 29 09:53:28 st401b-3 ntpd[4970]: peers refreshed Sep 29 09:53:28 st401b-3 ntpd[4970]: Listening on routing socket on fd #22 for interface updates Sep 29 09:53:28 st401b-3 systemd[1]: Started Update UTMP about System Runlevel Changes. Sep 29 09:53:28 st401b-3 systemd[1]: Startup finished in 18.720s (kernel) + 39.513s (userspace) = 58.233s. Sep 29 09:55:01 st401b-3 CRON[5433]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1) Sep 29 09:55:01 st401b-3 CRON[5434]: (root) CMD (test -x /opt/profitbricks/bin/check_memory && /opt/profitbricks/bin/check_memory) Sep 29 09:55:17 st401b-3 kernel: [ 167.693658] scsi 0:1:0:0: Direct-Access LSI Logical Volume 3000 PQ: 0 ANSI: 6 Sep 29 09:55:17 st401b-3 kernel: [ 167.693795] scsi 0:1:0:0: RAID1: handle(0x0143), wwid(0x02f7ec7949091b05), pd_count(2), type(SSP) Sep 29 09:55:17 st401b-3 kernel: [ 167.694042] sd 0:1:0:0: [sdam] 5859373056 512-byte logical blocks: (3.00 TB/2.73 TiB) Sep 29 09:55:17 st401b-3 kernel: [ 167.694044] sd 0:1:0:0: [sdam] 4096-byte physical blocks Sep 29 09:55:17 st401b-3 kernel: [ 167.694057] sd 0:1:0:0: Attached scsi generic sg40 type 0 Sep 29 09:55:17 st401b-3 kernel: [ 167.694129] sd 0:1:0:0: [sdam] Write Protect is off Sep 29 09:55:17 st401b-3 kernel: [ 167.694131] sd 0:1:0:0: [sdam] Mode Sense: 03 00 00 08 Sep 29 09:55:17 st401b-3 kernel: [ 167.694166] sd 0:1:0:0: [sdam] No Caching mode page found Sep 29 09:55:17 st401b-3 kernel: [ 167.694282] sd 0:0:4:0: hidding raid component Sep 29 09:55:17 st401b-3 kernel: [ 167.694589] sd 0:1:0:0: [sdam] Assuming drive cache: write through Sep 29 09:55:17 st401b-3 kernel: [ 167.703346] sd 0:1:0:0: [sdam] Attached SCSI disk Sep 29 09:55:17 st401b-3 kernel: [ 167.703653] sd 0:0:5:0: hidding raid component Sep 29 09:55:39 st401b-3 check_backup_lvm_push: critical: local git command rev-parse HEAD failed, retval: 0, fatal: Not a git repository: '/var/lib/backup-lvm/.git/' Sep 29 09:56:03 st401b-3 kernel: [ 213.684812] scsi 0:1:1:0: Direct-Access LSI Logical Volume 3000 PQ: 0 ANSI: 6 Sep 29 09:56:03 st401b-3 kernel: [ 213.684946] scsi 0:1:1:0: RAID1: handle(0x0142), wwid(0x0ab3eca651cd1b58), pd_count(2), type(SSP) Sep 29 09:56:03 st401b-3 kernel: [ 213.685189] sd 0:1:1:0: [sde] 5859373056 512-byte logical blocks: (3.00 TB/2.73 TiB) Sep 29 09:56:03 st401b-3 kernel: [ 213.685192] sd 0:1:1:0: [sde] 4096-byte physical blocks Sep 29 09:56:03 st401b-3 kernel: [ 213.685204] sd 0:1:1:0: Attached scsi generic sg41 type 0 Sep 29 09:56:03 st401b-3 kernel: [ 213.685275] sd 0:1:1:0: [sde] Write Protect is off Sep 29 09:56:03 st401b-3 kernel: [ 213.685277] sd 0:1:1:0: [sde] Mode Sense: 03 00 00 08 Sep 29 09:56:03 st401b-3 kernel: [ 213.685307] sd 0:1:1:0: [sde] No Caching mode page found Sep 29 09:56:03 st401b-3 kernel: [ 213.685423] sd 0:0:6:0: hidding raid component Sep 29 09:56:03 st401b-3 kernel: [ 213.685698] sd 0:1:1:0: [sde] Assuming drive cache: write through Sep 29 09:56:03 st401b-3 kernel: [ 213.686226] [ cut here ] Sep 29 09:56:03 st401b-3 kernel: [ 213.686232] WARNING: C
Re: [PATCH RESEND] infiniband:core:Add needed error path in cm_init_av_by_path
On 12/16/2015 07:16 PM, Jason Gunthorpe wrote: > On Wed, Dec 16, 2015 at 11:26:39AM +0100, Michael Wang wrote: [snip] >> >> I've rechecked the ib_init_ah_from_path() again, and found it >> still set IB_AH_GRH when the GID cache missing, but with: > > How do you mean? > > ah_attr->ah_flags = IB_AH_GRH; > ah_attr->grh.dgid = rec->dgid; > > ret = ib_find_cached_gid(device, >sgid, ndev, _num, > _index); > if (ret) { > if (ndev) > dev_put(ndev); > return ret; > } > > If find_cached_gid fails then ib_init_ah_from_path also fails. > > Is there a case where ib_find_cached_gid can succeed but not return > good data? Just for the GRH header, ib_find_cached_gid() will failed but the flag and dgid will be correct, and others are all 0 including hop limit, but may be just coincidence... As long as hop_limit > 1 means the pkg do have to pass through a router to other subnet, the fix make sense :-) Regards, Michael Wang > > I agree it would read nicer if the ah_flags and gr.dgid was moved > after the ib_find_cached_gid > >> BTW, cma_sidr_rep_handler() also call ib_init_ah_from_path() with out >> a check on return. > > That sounds like a problem. > > Jason > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RESEND] infiniband:core:Add needed error path in cm_init_av_by_path
On 12/16/2015 07:16 PM, Jason Gunthorpe wrote: > On Wed, Dec 16, 2015 at 11:26:39AM +0100, Michael Wang wrote: [snip] >> >> I've rechecked the ib_init_ah_from_path() again, and found it >> still set IB_AH_GRH when the GID cache missing, but with: > > How do you mean? > > ah_attr->ah_flags = IB_AH_GRH; > ah_attr->grh.dgid = rec->dgid; > > ret = ib_find_cached_gid(device, >sgid, ndev, _num, > _index); > if (ret) { > if (ndev) > dev_put(ndev); > return ret; > } > > If find_cached_gid fails then ib_init_ah_from_path also fails. > > Is there a case where ib_find_cached_gid can succeed but not return > good data? Just for the GRH header, ib_find_cached_gid() will failed but the flag and dgid will be correct, and others are all 0 including hop limit, but may be just coincidence... As long as hop_limit > 1 means the pkg do have to pass through a router to other subnet, the fix make sense :-) Regards, Michael Wang > > I agree it would read nicer if the ah_flags and gr.dgid was moved > after the ib_find_cached_gid > >> BTW, cma_sidr_rep_handler() also call ib_init_ah_from_path() with out >> a check on return. > > That sounds like a problem. > > Jason > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RESEND] infiniband:core:Add needed error path in cm_init_av_by_path
On 12/15/2015 06:30 PM, Jason Gunthorpe wrote: > On Tue, Dec 15, 2015 at 05:38:34PM +0100, Michael Wang wrote: >> The hop_limit is only suggest that the package allowed to be >> routed, not have to, correct? > > If the hop limit is >= 2 (?) then the GRH is mandatory. The > SM will return this information in the PathRecord if it determines a > GRH is required. The whole stack follows this protocol. > > The GRH is optional for in-subnet communications. Thanks for the explain :-) I've rechecked the ib_init_ah_from_path() again, and found it still set IB_AH_GRH when the GID cache missing, but with: grh.sgid_index = 0 grh.flow_label = 0 grh.hop_limit = 0 grh.traffic_class = 0 Not sure if it's just coincidence, hop_limit is 0, so router will discard the pkg and GRH won't be used, the transaction in subnet still works. Could this by designed as an optimization for the case like when SM reassigning the GID? BTW, cma_sidr_rep_handler() also call ib_init_ah_from_path() with out a check on return. Regards, Michael Wang > > Jason > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RESEND] infiniband:core:Add needed error path in cm_init_av_by_path
On 12/15/2015 06:30 PM, Jason Gunthorpe wrote: > On Tue, Dec 15, 2015 at 05:38:34PM +0100, Michael Wang wrote: >> The hop_limit is only suggest that the package allowed to be >> routed, not have to, correct? > > If the hop limit is >= 2 (?) then the GRH is mandatory. The > SM will return this information in the PathRecord if it determines a > GRH is required. The whole stack follows this protocol. > > The GRH is optional for in-subnet communications. Thanks for the explain :-) I've rechecked the ib_init_ah_from_path() again, and found it still set IB_AH_GRH when the GID cache missing, but with: grh.sgid_index = 0 grh.flow_label = 0 grh.hop_limit = 0 grh.traffic_class = 0 Not sure if it's just coincidence, hop_limit is 0, so router will discard the pkg and GRH won't be used, the transaction in subnet still works. Could this by designed as an optimization for the case like when SM reassigning the GID? BTW, cma_sidr_rep_handler() also call ib_init_ah_from_path() with out a check on return. Regards, Michael Wang > > Jason > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RESEND] infiniband:core:Add needed error path in cm_init_av_by_path
On 12/15/2015 04:52 PM, Nicholas Krause wrote: > This adds a needed error path in the function, cm_init_av_by_path > after the call to ib_init_ah_from_path in order to avoid incorrectly > accessing the path pointer of structure type ib_sa_path_rec if this > function call fails to complete its intended work successfully by > returning a error code. > > Signed-off-by: Nicholas Krause > --- > drivers/infiniband/core/cm.c | 7 +-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c > index 0a26dd6..e9b36ea 100644 > --- a/drivers/infiniband/core/cm.c > +++ b/drivers/infiniband/core/cm.c > @@ -383,8 +383,11 @@ static int cm_init_av_by_path(struct ib_sa_path_rec > *path, struct cm_av *av) > return ret; > > av->port = port; > - ib_init_ah_from_path(cm_dev->ib_device, port->port_num, path, > - >ah_attr); > + ret = ib_init_ah_from_path(cm_dev->ib_device, port->port_num, path, > +>ah_attr); ..Just wondering what if the transport don't require GRH? eg inside the same subnet? The hop_limit is only suggest that the package allowed to be routed, not have to, correct? Regards, Michael Wang > + if (ret) > + return ret; > + > av->timeout = path->packet_life_time + 1; > > return 0; > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RESEND] infiniband:core:Add needed error path in cm_init_av_by_path
On 12/15/2015 04:52 PM, Nicholas Krause wrote: > This adds a needed error path in the function, cm_init_av_by_path > after the call to ib_init_ah_from_path in order to avoid incorrectly > accessing the path pointer of structure type ib_sa_path_rec if this > function call fails to complete its intended work successfully by > returning a error code. > > Signed-off-by: Nicholas Krause <xerofo...@gmail.com> > --- > drivers/infiniband/core/cm.c | 7 +-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c > index 0a26dd6..e9b36ea 100644 > --- a/drivers/infiniband/core/cm.c > +++ b/drivers/infiniband/core/cm.c > @@ -383,8 +383,11 @@ static int cm_init_av_by_path(struct ib_sa_path_rec > *path, struct cm_av *av) > return ret; > > av->port = port; > - ib_init_ah_from_path(cm_dev->ib_device, port->port_num, path, > - >ah_attr); > + ret = ib_init_ah_from_path(cm_dev->ib_device, port->port_num, path, > +>ah_attr); ..Just wondering what if the transport don't require GRH? eg inside the same subnet? The hop_limit is only suggest that the package allowed to be routed, not have to, correct? Regards, Michael Wang > + if (ret) > + return ret; > + > av->timeout = path->packet_life_time + 1; > > return 0; > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] infiniband:core:Add needed error path in cm_init_av_by_path
On 12/07/2015 08:57 AM, Haggai Eran wrote: > On Friday, December 4, 2015 8:02 PM, Nicholas Krause > wrote: >> To: dledf...@redhat.com >> Cc: sean.he...@intel.com; hal.rosenst...@gmail.com; Haggai Eran; >> jguntho...@obsidianresearch.com; Matan Barak; yun.w...@profitbricks.com; >> ted.h@oracle.com; Doron Tsur; Erez Shitrit; david.ah...@oracle.com; >> linux-r...@vger.kernel.org; linux-kernel@vger.kernel.org >> Subject: [PATCH] infiniband:core:Add needed error path in cm_init_av_by_path >> >> This adds a needed error path in the function, cm_init_av_by_path >> after the call to ib_init_ah_from_path in order to avoid incorrectly >> accessing the path pointer of structure type ib_sa_path_rec if this >> function call fails to complete its intended work successfully by >> returning a error code. >> >> Signed-off-by: Nicholas Krause > > The subject doesn't seem to match the convention but apart from that, > > Reviewed-by: Haggai Eran > > I wonder if this should go to stable. If I understand correctly, this will > fail only when the SGID isn't found in the GID table, but such connections > would fail later on when creating a QP, right? Me too think this need a reconsider, to me the current logical don't really care the missing gid in cache when initializing AV, I'm not sure if it's necessary to fail all the following path for such cache missing... Regards, Michael Wang > > Haggai > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] infiniband:core:Add needed error path in cm_init_av_by_path
On 12/07/2015 08:57 AM, Haggai Eran wrote: > On Friday, December 4, 2015 8:02 PM, Nicholas Krause <xerofo...@gmail.com> > wrote: >> To: dledf...@redhat.com >> Cc: sean.he...@intel.com; hal.rosenst...@gmail.com; Haggai Eran; >> jguntho...@obsidianresearch.com; Matan Barak; yun.w...@profitbricks.com; >> ted.h@oracle.com; Doron Tsur; Erez Shitrit; david.ah...@oracle.com; >> linux-r...@vger.kernel.org; linux-kernel@vger.kernel.org >> Subject: [PATCH] infiniband:core:Add needed error path in cm_init_av_by_path >> >> This adds a needed error path in the function, cm_init_av_by_path >> after the call to ib_init_ah_from_path in order to avoid incorrectly >> accessing the path pointer of structure type ib_sa_path_rec if this >> function call fails to complete its intended work successfully by >> returning a error code. >> >> Signed-off-by: Nicholas Krause <xerofo...@gmail.com> > > The subject doesn't seem to match the convention but apart from that, > > Reviewed-by: Haggai Eran <hagg...@mellanox.com> > > I wonder if this should go to stable. If I understand correctly, this will > fail only when the SGID isn't found in the GID table, but such connections > would fail later on when creating a QP, right? Me too think this need a reconsider, to me the current logical don't really care the missing gid in cache when initializing AV, I'm not sure if it's necessary to fail all the following path for such cache missing... Regards, Michael Wang > > Haggai > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
On 12/02/2015 06:36 PM, Catalin Marinas wrote: > On 2 December 2015 at 13:59, Borislav Petkov wrote: [snip] > > 1. The sl?b allocators themselves use page allocations, so kmemleak > could end up detecting the same pointer twice, hiding a potential leak > > 2. Most page allocations do not contain data/pointers relevant to > kmemleak (e.g. page cache pages), however the randomness of such data > greatly diminishes kmemleak's ability to detect real leaks > > Arguably, kmemleak could be made to detect both cases above by a > combination of page flags, additional annotations or specific page > alloc API. However, this has its own drawbacks in terms of code > complexity (potentially outside mm/kmemleak.c) and overhead. Thanks for the very nice explain :-) I used to thought overhead is the only concern, missing the point regarding allocator it self. Regards, Michael Wang > > Regarding a kmemleak_alloc() annotation like in the patch I suggested, > that's the second one I've seen needed outside alloc APIs (the first > one is commit f75782e4e067 - "block: kmemleak: Track the page > allocations for struct request"). If the number of such explicit > annotations stays small, it's better to keep it this way. > > There are other explicit annotations like kmemleak_not_leak() or > kmemleak_ignore() but these are for objects kmemleak knows about and > incorrectly reports them as leaks. Most of the time is because the > pointers to such objects are stored in a different form (e.g. physical > address). > > Anyway, kmemleak is not the only tool requiring annotations (take > spin_lock_nested() for example). If needed, we could do with an > additional page alloc/free API which informs kmemleak in the process > but I don't think it's worth it. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
On 12/02/2015 06:36 PM, Catalin Marinas wrote: > On 2 December 2015 at 13:59, Borislav Petkov <b...@alien8.de> wrote: [snip] > > 1. The sl?b allocators themselves use page allocations, so kmemleak > could end up detecting the same pointer twice, hiding a potential leak > > 2. Most page allocations do not contain data/pointers relevant to > kmemleak (e.g. page cache pages), however the randomness of such data > greatly diminishes kmemleak's ability to detect real leaks > > Arguably, kmemleak could be made to detect both cases above by a > combination of page flags, additional annotations or specific page > alloc API. However, this has its own drawbacks in terms of code > complexity (potentially outside mm/kmemleak.c) and overhead. Thanks for the very nice explain :-) I used to thought overhead is the only concern, missing the point regarding allocator it self. Regards, Michael Wang > > Regarding a kmemleak_alloc() annotation like in the patch I suggested, > that's the second one I've seen needed outside alloc APIs (the first > one is commit f75782e4e067 - "block: kmemleak: Track the page > allocations for struct request"). If the number of such explicit > annotations stays small, it's better to keep it this way. > > There are other explicit annotations like kmemleak_not_leak() or > kmemleak_ignore() but these are for objects kmemleak knows about and > incorrectly reports them as leaks. Most of the time is because the > pointers to such objects are stored in a different form (e.g. physical > address). > > Anyway, kmemleak is not the only tool requiring annotations (take > spin_lock_nested() for example). If needed, we could do with an > additional page alloc/free API which informs kmemleak in the process > but I don't think it's worth it. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
On 12/02/2015 03:13 PM, Borislav Petkov wrote: > On Wed, Dec 02, 2015 at 03:09:18PM +0100, Michael Wang wrote: >> This tool will help improve the kernel, AFAIK it's already made it's >> best, if you got any idea on how to make it even better that would be >> great, but at this moment, it still need few of care :-P > > I think you're replying to my emails without even reading what I said. > So I'm going to do the same and stop wasting my time. First of all, I respect all your reply, and reply regarding your point. Frankly speaking, I think you know all these already, it's not a big deal but you refuse to obey the rules setup by others, although it do help make things better, and benefit yourself too. If you refuse to make things better I'm totally fine, time is valuable for all of us :-) Regards, Michael Wang > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
On 12/02/2015 02:59 PM, Borislav Petkov wrote: > On Wed, Dec 02, 2015 at 02:48:47PM +0100, Michael Wang wrote: >> I'm not sure why amd-iommu use get_page not kmalloc to initialize the >> pointer table, but if it's necessary then the conflict will be there, >> it's not the fault of driver or kmemleak, but the design require them >> to cooperate with each other. > > So, according to you, we should go and "fix" all callers of > __get_free_pages() to make kmemleak happy. Then when the next new tool > comes along, we should "fix" another kernel API just so that the tools > are happy. That's the way we have to detect leak, no driver could get rid of the possibility of memory leaking, so it should respect the rule to help others locating the problem, if a driver full of false report then most likely folks will gradually lost interests on help fix leaking problem for it. > > Bzzt. Wrong! > > The tools should work without sprinkling their code everywhere. Driver > etc developers don't need to care about what tool they make happy or > not. Tools' hooks should be hidden in macro magic so that developers > don't care. This tool will help improve the kernel, AFAIK it's already made it's best, if you got any idea on how to make it even better that would be great, but at this moment, it still need few of care :-P Regards, Michael Wang > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
On 12/02/2015 02:40 PM, Borislav Petkov wrote: > On Wed, Dec 02, 2015 at 02:18:44PM +0100, Michael Wang wrote: [snip] > >> Yeah, but it would be better to solve it, otherwise whoever saw this >> report will need to go into the amd-iommu, make sure it's not a real >> leak, then change their testing script... > > No, you don't need to go into the iommu - you need to fix kmemleak. > > And frankly, I'm getting sick and tired of all those tools needing > special handling and us adding code just so that they're happy. If the > tools can't figure out something, they shouldn't warn just in case but > shut up instead. If you mean the design of kmemleak, IMHO it's not that bad. The problem is regarding performance, think about if kmemleak go into every page to find out pointers, I guess the whole system will stuck. I'm not sure why amd-iommu use get_page not kmalloc to initialize the pointer table, but if it's necessary then the conflict will be there, it's not the fault of driver or kmemleak, but the design require them to cooperate with each other. Regards, Michael Wang > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
Hi, Borislav On 12/02/2015 02:13 PM, Borislav Petkov wrote: > On Wed, Dec 02, 2015 at 02:01:55PM +0100, Michael Wang wrote: >> Yeah.. it's a little complicated since we have our own kernel tree and this >> won't be a problem for us, but we really prefer to help fix it in mainline >> too, as long as this is really a defect, so others could save time on >> research >> in future. > > Well, to keep it realistic and if it were me, I wouldn't even take such > a fix as it is apparently kmemleak's problem. Do you mean this could be a real kmemleak? Could you please provide more details? > > So you could fix your testing instead to ignore that error message now > that you know it is a false-positive. That should be easiest. > Yeah, but it would be better to solve it, otherwise whoever saw this report will need to go into the amd-iommu, make sure it's not a real leak, then change their testing script... Regards, Michael Wang -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
On 12/02/2015 01:56 PM, Joerg Roedel wrote: > On Wed, Dec 02, 2015 at 01:31:38PM +0100, Michael Wang wrote: >> It's not my work or your work... it's a defect in the module and maintainer >> should take responsibility on fixing it, correct? > > No, its a false positive from an in-kernel checking tool, the iommu > driver is correct. You just sent a patch to silence the false positive > report. Yeah, but caused by the driver :-P and have to be fixed in there too. > >> We're very willing to help, but as I mentioned we are out of resource for >> testing at this moment, but we can send you a new patch without testing, >> will that works for you? > > This should be testable on any AMD IOMMU system with working interrupt > remapping. I will probably have no time to test this, if you really > can't test yourself, try to get a Tested-by from someone else. Good point, anyone willing to help test the fix? Or better provide the new patch to be the author. Regards, Michael Wang > > > > Joerg > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
On 12/02/2015 01:53 PM, Borislav Petkov wrote: > On Wed, Dec 02, 2015 at 01:31:38PM +0100, Michael Wang wrote: >> It's not my work or your work... it's a defect in the module and maintainer >> should take responsibility on fixing it, correct? > > Well, you said "actually we just want to get rid of this annoying report > on obj won't leak..." > > It sounds to me like you want to have something fixed. So you do the > patch properly, add to the commit message why exactly you're doing it > and test it. Like everyone else. Yeah.. it's a little complicated since we have our own kernel tree and this won't be a problem for us, but we really prefer to help fix it in mainline too, as long as this is really a defect, so others could save time on research in future. But seems like we can only wait for another chance to confirm the another solution, frankly speaking I think we both will forgot this soon... fortunately it's not that critical :-P Regards, Michael Wang > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
On 12/02/2015 12:51 PM, Joerg Roedel wrote: > On Wed, Dec 02, 2015 at 12:38:03PM +0100, Michael Wang wrote: >> Joerg, this is really a tiny fix, would you mind to merge it into some >> of your cleanup patch and testing them together? we are not in hurry, >> just want to make sure the issue will get solved. > > I am not doing your work. You sent a patch, received feedback, and now > you can send a new patch based on it. Thats the process. If it addresses > the feedback I will merge it, but I will not scissor your patch > together. It's not my work or your work... it's a defect in the module and maintainer should take responsibility on fixing it, correct? We're very willing to help, but as I mentioned we are out of resource for testing at this moment, but we can send you a new patch without testing, will that works for you? Regards, Michael Wang > > > Joerg > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
On 12/02/2015 12:31 PM, Catalin Marinas wrote: > On 2 December 2015 at 10:56, Michael Wang wrote: [snip] > > I could copy your description but I don't currently have a way (nor > time) to test the patch. If you plan to test it anyway, please feel > free to include my diff (which I guess was badly re-formatted by > gmail), I don't really mind which author it is (I found it easier to > show a diff than explain in plain English ;)). Unfortunately that's the same on my side... we already close the ticket and I don't have resources to testing it again. Joerg, this is really a tiny fix, would you mind to merge it into some of your cleanup patch and testing them together? we are not in hurry, just want to make sure the issue will get solved. Regards, Michael Wang > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
On 12/02/2015 11:52 AM, Catalin Marinas wrote: [snip] >> >> Is there any more concern? actually we just want to get rid of this >> annoying report on obj won't leak, if you're going to create obj for >> 'irq_lookup_table' that's also fine for us, or will you pick this patch? > > My preference (from a kmemleak perspective) is to tell kmemleak about > the irq_lookup_table. Untested: I'm fine with both solution, will leave the decision to maintainer :-) BTW, could you please send a formal patch with descriptions? Regards, Michael Wang > > diff --git a/drivers/iommu/amd_iommu_init.c b/drivers/iommu/amd_iommu_init.c > index 013bdfff2d4d..c41609f71cbe 100644 > --- a/drivers/iommu/amd_iommu_init.c > +++ b/drivers/iommu/amd_iommu_init.c > @@ -27,6 +27,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -1692,6 +1693,7 @@ static struct syscore_ops amd_iommu_syscore_ops = { > > static void __init free_on_init_error(void) > { > + kmemleak_free(irq_lookup_table); > free_pages((unsigned long)irq_lookup_table, > get_order(rlookup_table_size)); > > @@ -1906,6 +1908,7 @@ static int __init early_amd_iommu_init(void) > irq_lookup_table = (void *)__get_free_pages( > GFP_KERNEL | __GFP_ZERO, > get_order(rlookup_table_size)); > + kmemleak_alloc(irq_lookup_table, rlookup_table_size, 1, GFP_KERNEL); > if (!irq_lookup_table) > goto out; > } > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
Hi, Joerg On 11/25/2015 04:14 PM, Michael Wang wrote: > On 11/25/2015 04:08 PM, Joerg Roedel wrote: [snip] >>> This is caused by the 'irq_lookup_table' was allocated with >>> __get_free_pages() which won't create kmemleak object, thus it's >>> pointers won't be count as referencing 'irq_remap_table' in >>> kmemleak scan. >> >> Isn't it better to allocate the kmemleak object manually instead of >> ignoring all irq-table pointers? With this patch we might not notice any >> real leak of irq-tables. > > We've considered that too, but found that the irq-tables is not > dynamically alloc/free, they won't be freed once initialized, so there > is no leaking for such object :-) Is there any more concern? actually we just want to get rid of this annoying report on obj won't leak, if you're going to create obj for 'irq_lookup_table' that's also fine for us, or will you pick this patch? Regards, Michael Wang > > Regards, > Michael Wang > >> >> >> >> Joerg >> -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
On 12/02/2015 03:13 PM, Borislav Petkov wrote: > On Wed, Dec 02, 2015 at 03:09:18PM +0100, Michael Wang wrote: >> This tool will help improve the kernel, AFAIK it's already made it's >> best, if you got any idea on how to make it even better that would be >> great, but at this moment, it still need few of care :-P > > I think you're replying to my emails without even reading what I said. > So I'm going to do the same and stop wasting my time. First of all, I respect all your reply, and reply regarding your point. Frankly speaking, I think you know all these already, it's not a big deal but you refuse to obey the rules setup by others, although it do help make things better, and benefit yourself too. If you refuse to make things better I'm totally fine, time is valuable for all of us :-) Regards, Michael Wang > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
Hi, Joerg On 11/25/2015 04:14 PM, Michael Wang wrote: > On 11/25/2015 04:08 PM, Joerg Roedel wrote: [snip] >>> This is caused by the 'irq_lookup_table' was allocated with >>> __get_free_pages() which won't create kmemleak object, thus it's >>> pointers won't be count as referencing 'irq_remap_table' in >>> kmemleak scan. >> >> Isn't it better to allocate the kmemleak object manually instead of >> ignoring all irq-table pointers? With this patch we might not notice any >> real leak of irq-tables. > > We've considered that too, but found that the irq-tables is not > dynamically alloc/free, they won't be freed once initialized, so there > is no leaking for such object :-) Is there any more concern? actually we just want to get rid of this annoying report on obj won't leak, if you're going to create obj for 'irq_lookup_table' that's also fine for us, or will you pick this patch? Regards, Michael Wang > > Regards, > Michael Wang > >> >> >> >> Joerg >> -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
On 12/02/2015 11:52 AM, Catalin Marinas wrote: [snip] >> >> Is there any more concern? actually we just want to get rid of this >> annoying report on obj won't leak, if you're going to create obj for >> 'irq_lookup_table' that's also fine for us, or will you pick this patch? > > My preference (from a kmemleak perspective) is to tell kmemleak about > the irq_lookup_table. Untested: I'm fine with both solution, will leave the decision to maintainer :-) BTW, could you please send a formal patch with descriptions? Regards, Michael Wang > > diff --git a/drivers/iommu/amd_iommu_init.c b/drivers/iommu/amd_iommu_init.c > index 013bdfff2d4d..c41609f71cbe 100644 > --- a/drivers/iommu/amd_iommu_init.c > +++ b/drivers/iommu/amd_iommu_init.c > @@ -27,6 +27,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -1692,6 +1693,7 @@ static struct syscore_ops amd_iommu_syscore_ops = { > > static void __init free_on_init_error(void) > { > + kmemleak_free(irq_lookup_table); > free_pages((unsigned long)irq_lookup_table, > get_order(rlookup_table_size)); > > @@ -1906,6 +1908,7 @@ static int __init early_amd_iommu_init(void) > irq_lookup_table = (void *)__get_free_pages( > GFP_KERNEL | __GFP_ZERO, > get_order(rlookup_table_size)); > + kmemleak_alloc(irq_lookup_table, rlookup_table_size, 1, GFP_KERNEL); > if (!irq_lookup_table) > goto out; > } > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
On 12/02/2015 12:31 PM, Catalin Marinas wrote: > On 2 December 2015 at 10:56, Michael Wang <yun.w...@profitbricks.com> wrote: [snip] > > I could copy your description but I don't currently have a way (nor > time) to test the patch. If you plan to test it anyway, please feel > free to include my diff (which I guess was badly re-formatted by > gmail), I don't really mind which author it is (I found it easier to > show a diff than explain in plain English ;)). Unfortunately that's the same on my side... we already close the ticket and I don't have resources to testing it again. Joerg, this is really a tiny fix, would you mind to merge it into some of your cleanup patch and testing them together? we are not in hurry, just want to make sure the issue will get solved. Regards, Michael Wang > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
On 12/02/2015 12:51 PM, Joerg Roedel wrote: > On Wed, Dec 02, 2015 at 12:38:03PM +0100, Michael Wang wrote: >> Joerg, this is really a tiny fix, would you mind to merge it into some >> of your cleanup patch and testing them together? we are not in hurry, >> just want to make sure the issue will get solved. > > I am not doing your work. You sent a patch, received feedback, and now > you can send a new patch based on it. Thats the process. If it addresses > the feedback I will merge it, but I will not scissor your patch > together. It's not my work or your work... it's a defect in the module and maintainer should take responsibility on fixing it, correct? We're very willing to help, but as I mentioned we are out of resource for testing at this moment, but we can send you a new patch without testing, will that works for you? Regards, Michael Wang > > > Joerg > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
On 12/02/2015 01:53 PM, Borislav Petkov wrote: > On Wed, Dec 02, 2015 at 01:31:38PM +0100, Michael Wang wrote: >> It's not my work or your work... it's a defect in the module and maintainer >> should take responsibility on fixing it, correct? > > Well, you said "actually we just want to get rid of this annoying report > on obj won't leak..." > > It sounds to me like you want to have something fixed. So you do the > patch properly, add to the commit message why exactly you're doing it > and test it. Like everyone else. Yeah.. it's a little complicated since we have our own kernel tree and this won't be a problem for us, but we really prefer to help fix it in mainline too, as long as this is really a defect, so others could save time on research in future. But seems like we can only wait for another chance to confirm the another solution, frankly speaking I think we both will forgot this soon... fortunately it's not that critical :-P Regards, Michael Wang > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
On 12/02/2015 01:56 PM, Joerg Roedel wrote: > On Wed, Dec 02, 2015 at 01:31:38PM +0100, Michael Wang wrote: >> It's not my work or your work... it's a defect in the module and maintainer >> should take responsibility on fixing it, correct? > > No, its a false positive from an in-kernel checking tool, the iommu > driver is correct. You just sent a patch to silence the false positive > report. Yeah, but caused by the driver :-P and have to be fixed in there too. > >> We're very willing to help, but as I mentioned we are out of resource for >> testing at this moment, but we can send you a new patch without testing, >> will that works for you? > > This should be testable on any AMD IOMMU system with working interrupt > remapping. I will probably have no time to test this, if you really > can't test yourself, try to get a Tested-by from someone else. Good point, anyone willing to help test the fix? Or better provide the new patch to be the author. Regards, Michael Wang > > > > Joerg > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
Hi, Borislav On 12/02/2015 02:13 PM, Borislav Petkov wrote: > On Wed, Dec 02, 2015 at 02:01:55PM +0100, Michael Wang wrote: >> Yeah.. it's a little complicated since we have our own kernel tree and this >> won't be a problem for us, but we really prefer to help fix it in mainline >> too, as long as this is really a defect, so others could save time on >> research >> in future. > > Well, to keep it realistic and if it were me, I wouldn't even take such > a fix as it is apparently kmemleak's problem. Do you mean this could be a real kmemleak? Could you please provide more details? > > So you could fix your testing instead to ignore that error message now > that you know it is a false-positive. That should be easiest. > Yeah, but it would be better to solve it, otherwise whoever saw this report will need to go into the amd-iommu, make sure it's not a real leak, then change their testing script... Regards, Michael Wang -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
On 12/02/2015 02:40 PM, Borislav Petkov wrote: > On Wed, Dec 02, 2015 at 02:18:44PM +0100, Michael Wang wrote: [snip] > >> Yeah, but it would be better to solve it, otherwise whoever saw this >> report will need to go into the amd-iommu, make sure it's not a real >> leak, then change their testing script... > > No, you don't need to go into the iommu - you need to fix kmemleak. > > And frankly, I'm getting sick and tired of all those tools needing > special handling and us adding code just so that they're happy. If the > tools can't figure out something, they shouldn't warn just in case but > shut up instead. If you mean the design of kmemleak, IMHO it's not that bad. The problem is regarding performance, think about if kmemleak go into every page to find out pointers, I guess the whole system will stuck. I'm not sure why amd-iommu use get_page not kmalloc to initialize the pointer table, but if it's necessary then the conflict will be there, it's not the fault of driver or kmemleak, but the design require them to cooperate with each other. Regards, Michael Wang > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
On 12/02/2015 02:59 PM, Borislav Petkov wrote: > On Wed, Dec 02, 2015 at 02:48:47PM +0100, Michael Wang wrote: >> I'm not sure why amd-iommu use get_page not kmalloc to initialize the >> pointer table, but if it's necessary then the conflict will be there, >> it's not the fault of driver or kmemleak, but the design require them >> to cooperate with each other. > > So, according to you, we should go and "fix" all callers of > __get_free_pages() to make kmemleak happy. Then when the next new tool > comes along, we should "fix" another kernel API just so that the tools > are happy. That's the way we have to detect leak, no driver could get rid of the possibility of memory leaking, so it should respect the rule to help others locating the problem, if a driver full of false report then most likely folks will gradually lost interests on help fix leaking problem for it. > > Bzzt. Wrong! > > The tools should work without sprinkling their code everywhere. Driver > etc developers don't need to care about what tool they make happy or > not. Tools' hooks should be hidden in macro magic so that developers > don't care. This tool will help improve the kernel, AFAIK it's already made it's best, if you got any idea on how to make it even better that would be great, but at this moment, it still need few of care :-P Regards, Michael Wang > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
On 11/25/2015 04:08 PM, Joerg Roedel wrote: > On Fri, Nov 20, 2015 at 12:33:50PM +0100, Michael Wang wrote: >> The kmemleak testing on 3.18.24 show: >> >> unreferenced object 0x880233ff9010 (size 16): >> comm "swapper/0", pid 1, jiffies 4294937440 (age 2010.490s) >> hex dump (first 16 bytes): >> 0a 0a 00 00 20 00 00 00 00 44 fb 33 02 88 ff ff D.3 >> backtrace: >> [] create_object+0x10d/0x2d0 >> [] kmemleak_alloc+0x5b/0xc0 >> [] kmem_cache_alloc_trace+0xb9/0x160 >> [] get_irq_table+0x151/0x380 >> >> This is caused by the 'irq_lookup_table' was allocated with >> __get_free_pages() which won't create kmemleak object, thus it's >> pointers won't be count as referencing 'irq_remap_table' in >> kmemleak scan. > > Isn't it better to allocate the kmemleak object manually instead of > ignoring all irq-table pointers? With this patch we might not notice any > real leak of irq-tables. We've considered that too, but found that the irq-tables is not dynamically alloc/free, they won't be freed once initialized, so there is no leaking for such object :-) Regards, Michael Wang > > > > Joerg > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
Hi, Joery On 11/20/2015 12:33 PM, Michael Wang wrote: > The kmemleak testing on 3.18.24 show: > > unreferenced object 0x880233ff9010 (size 16): > comm "swapper/0", pid 1, jiffies 4294937440 (age 2010.490s) > hex dump (first 16 bytes): > 0a 0a 00 00 20 00 00 00 00 44 fb 33 02 88 ff ff D.3 > backtrace: > [] create_object+0x10d/0x2d0 > [] kmemleak_alloc+0x5b/0xc0 > [] kmem_cache_alloc_trace+0xb9/0x160 > [] get_irq_table+0x151/0x380 > > This is caused by the 'irq_lookup_table' was allocated with > __get_free_pages() which won't create kmemleak object, thus it's > pointers won't be count as referencing 'irq_remap_table' in > kmemleak scan. > > The 'irq_remap_table' won't be freed after initialized, doesn't > make sense to check it's leaking. > > This patch mark the 'irq_remap_table' object as 'gray' to stop > the 'false positives' report. Any comments on this one? Regards, Michael Wang > > Signed-off-by: Michael Wang > --- > v2: > Use kmemleak_not_leak() instead of kmemleak_ignore() since > the 'irq_remap_table' itself also contain pointer. > > drivers/iommu/amd_iommu.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c > index 8b2be1e..87a1a88 100644 > --- a/drivers/iommu/amd_iommu.c > +++ b/drivers/iommu/amd_iommu.c > @@ -3603,6 +3603,7 @@ static struct irq_remap_table *get_irq_table(u16 devid, > bool ioapic) > } > > irq_lookup_table[devid] = table; > + kmemleak_not_leak(table); > set_dte_irq_entry(devid, table); > iommu_flush_dte(iommu, devid); > if (devid != alias) { > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
On 11/25/2015 04:08 PM, Joerg Roedel wrote: > On Fri, Nov 20, 2015 at 12:33:50PM +0100, Michael Wang wrote: >> The kmemleak testing on 3.18.24 show: >> >> unreferenced object 0x880233ff9010 (size 16): >> comm "swapper/0", pid 1, jiffies 4294937440 (age 2010.490s) >> hex dump (first 16 bytes): >> 0a 0a 00 00 20 00 00 00 00 44 fb 33 02 88 ff ff D.3 >> backtrace: >> [] create_object+0x10d/0x2d0 >> [] kmemleak_alloc+0x5b/0xc0 >> [] kmem_cache_alloc_trace+0xb9/0x160 >> [] get_irq_table+0x151/0x380 >> >> This is caused by the 'irq_lookup_table' was allocated with >> __get_free_pages() which won't create kmemleak object, thus it's >> pointers won't be count as referencing 'irq_remap_table' in >> kmemleak scan. > > Isn't it better to allocate the kmemleak object manually instead of > ignoring all irq-table pointers? With this patch we might not notice any > real leak of irq-tables. We've considered that too, but found that the irq-tables is not dynamically alloc/free, they won't be freed once initialized, so there is no leaking for such object :-) Regards, Michael Wang > > > > Joerg > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
Hi, Joery On 11/20/2015 12:33 PM, Michael Wang wrote: > The kmemleak testing on 3.18.24 show: > > unreferenced object 0x880233ff9010 (size 16): > comm "swapper/0", pid 1, jiffies 4294937440 (age 2010.490s) > hex dump (first 16 bytes): > 0a 0a 00 00 20 00 00 00 00 44 fb 33 02 88 ff ff D.3 > backtrace: > [] create_object+0x10d/0x2d0 > [] kmemleak_alloc+0x5b/0xc0 > [] kmem_cache_alloc_trace+0xb9/0x160 > [] get_irq_table+0x151/0x380 > > This is caused by the 'irq_lookup_table' was allocated with > __get_free_pages() which won't create kmemleak object, thus it's > pointers won't be count as referencing 'irq_remap_table' in > kmemleak scan. > > The 'irq_remap_table' won't be freed after initialized, doesn't > make sense to check it's leaking. > > This patch mark the 'irq_remap_table' object as 'gray' to stop > the 'false positives' report. Any comments on this one? Regards, Michael Wang > > Signed-off-by: Michael Wang <yun.w...@profitbricks.com> > --- > v2: > Use kmemleak_not_leak() instead of kmemleak_ignore() since > the 'irq_remap_table' itself also contain pointer. > > drivers/iommu/amd_iommu.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c > index 8b2be1e..87a1a88 100644 > --- a/drivers/iommu/amd_iommu.c > +++ b/drivers/iommu/amd_iommu.c > @@ -3603,6 +3603,7 @@ static struct irq_remap_table *get_irq_table(u16 devid, > bool ioapic) > } > > irq_lookup_table[devid] = table; > + kmemleak_not_leak(table); > set_dte_irq_entry(devid, table); > iommu_flush_dte(iommu, devid); > if (devid != alias) { > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: nfnetlink warnings
On 11/23/2015 11:32 AM, Borislav Petkov wrote: > On Mon, Nov 23, 2015 at 11:20:18AM +0100, Michael Wang wrote: >> Who want to do that would take responsibility to make an else branch at >> that time, but reserve the branch at this moment sounds unnecessary, and >> not that pretty frankly speaking. > > Actually, I was looking for the better idea which doesn't uglify the > code. And here it is: > > https://lkml.kernel.org/r/5585663.OcpAQiytKY@wuerfel Looks even better :-) Regards, Michael Wang > > :-) > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: nfnetlink warnings
On 11/23/2015 10:54 AM, Borislav Petkov wrote: > Hi Michael, > > On Mon, Nov 23, 2015 at 10:49:34AM +0100, Michael Wang wrote: >> Why not just initialized it as NULL, or mark it as uninitialized_var()? > > because I'd like us to save us the redundant NULL initialization in the > if-case. Well, I would vote initialized with NULL, rather than use another else branch to do the same thing. > > I'm not saying any of the approaches are good visually, though. Who > knows, someone might have a better idea like, maybe "Oh, I wanted to > rewrite that code and this handlong is going to be different anyway ..." > or so. Or something to that effect. Who want to do that would take responsibility to make an else branch at that time, but reserve the branch at this moment sounds unnecessary, and not that pretty frankly speaking. > > Btw, please do not top-post. Enjoy ;-) Regards, Michael Wang > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: nfnetlink warnings
Hi, Borislav Why not just initialized it as NULL, or mark it as uninitialized_var()? Regards, Michael Wang On 11/23/2015 10:36 AM, Borislav Petkov wrote: > Hey, > > so I keep getting those since recently: > > net/netfilter/nfnetlink_queue.c:519:19: warning: ‘nfnl_ct’ may be used > uninitialized in this function [-Wmaybe-uninitialized] > if (ct && nfnl_ct->build(skb, ct, ctinfo, NFQA_CT, NFQA_CT_INFO) < 0) >^ > net/netfilter/nfnetlink_queue.c:316:23: note: ‘nfnl_ct’ was declared here > struct nfnl_ct_hook *nfnl_ct; >^ > net/netfilter/nfnetlink_queue.c: In function ‘nfqnl_recv_verdict’: > net/netfilter/nfnetlink_queue.c:1083:11: warning: ‘nfnl_ct’ may be used > uninitialized in this function [-Wmaybe-uninitialized] > nfnl_ct->seq_adjust(entry->skb, ct, ctinfo, diff); >^ > > and was thinking can we shut them up like this? I know, it is ugly :-\ > > I mean, it is obvious in both cases that nfnl_ct won't be used if ct is > not set but apparently gcc can't see that far... > > --- > diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c > index 7d81d280cb4f..cd61b0b5c413 100644 > --- a/net/netfilter/nfnetlink_queue.c > +++ b/net/netfilter/nfnetlink_queue.c > @@ -372,6 +372,8 @@ nfqnl_build_packet_message(struct net *net, struct > nfqnl_instance *queue, > if (ct != NULL) > size += nfnl_ct->build_size(ct); > } > + } else { > + nfnl_ct = NULL; > } > > if (queue->flags & NFQA_CFG_F_UID_GID) { > @@ -1069,6 +1071,8 @@ nfqnl_recv_verdict(struct sock *ctnl, struct sk_buff > *skb, > nfnl_ct = rcu_dereference(nfnl_ct_hook); > if (nfnl_ct != NULL) > ct = nfqnl_ct_parse(nfnl_ct, nlh, nfqa, entry, ); > + } else { > + nfnl_ct = NULL; > } > > if (nfqa[NFQA_PAYLOAD]) { > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: nfnetlink warnings
On 11/23/2015 11:32 AM, Borislav Petkov wrote: > On Mon, Nov 23, 2015 at 11:20:18AM +0100, Michael Wang wrote: >> Who want to do that would take responsibility to make an else branch at >> that time, but reserve the branch at this moment sounds unnecessary, and >> not that pretty frankly speaking. > > Actually, I was looking for the better idea which doesn't uglify the > code. And here it is: > > https://lkml.kernel.org/r/5585663.OcpAQiytKY@wuerfel Looks even better :-) Regards, Michael Wang > > :-) > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: nfnetlink warnings
On 11/23/2015 10:54 AM, Borislav Petkov wrote: > Hi Michael, > > On Mon, Nov 23, 2015 at 10:49:34AM +0100, Michael Wang wrote: >> Why not just initialized it as NULL, or mark it as uninitialized_var()? > > because I'd like us to save us the redundant NULL initialization in the > if-case. Well, I would vote initialized with NULL, rather than use another else branch to do the same thing. > > I'm not saying any of the approaches are good visually, though. Who > knows, someone might have a better idea like, maybe "Oh, I wanted to > rewrite that code and this handlong is going to be different anyway ..." > or so. Or something to that effect. Who want to do that would take responsibility to make an else branch at that time, but reserve the branch at this moment sounds unnecessary, and not that pretty frankly speaking. > > Btw, please do not top-post. Enjoy ;-) Regards, Michael Wang > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: nfnetlink warnings
Hi, Borislav Why not just initialized it as NULL, or mark it as uninitialized_var()? Regards, Michael Wang On 11/23/2015 10:36 AM, Borislav Petkov wrote: > Hey, > > so I keep getting those since recently: > > net/netfilter/nfnetlink_queue.c:519:19: warning: ‘nfnl_ct’ may be used > uninitialized in this function [-Wmaybe-uninitialized] > if (ct && nfnl_ct->build(skb, ct, ctinfo, NFQA_CT, NFQA_CT_INFO) < 0) >^ > net/netfilter/nfnetlink_queue.c:316:23: note: ‘nfnl_ct’ was declared here > struct nfnl_ct_hook *nfnl_ct; >^ > net/netfilter/nfnetlink_queue.c: In function ‘nfqnl_recv_verdict’: > net/netfilter/nfnetlink_queue.c:1083:11: warning: ‘nfnl_ct’ may be used > uninitialized in this function [-Wmaybe-uninitialized] > nfnl_ct->seq_adjust(entry->skb, ct, ctinfo, diff); >^ > > and was thinking can we shut them up like this? I know, it is ugly :-\ > > I mean, it is obvious in both cases that nfnl_ct won't be used if ct is > not set but apparently gcc can't see that far... > > --- > diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c > index 7d81d280cb4f..cd61b0b5c413 100644 > --- a/net/netfilter/nfnetlink_queue.c > +++ b/net/netfilter/nfnetlink_queue.c > @@ -372,6 +372,8 @@ nfqnl_build_packet_message(struct net *net, struct > nfqnl_instance *queue, > if (ct != NULL) > size += nfnl_ct->build_size(ct); > } > + } else { > + nfnl_ct = NULL; > } > > if (queue->flags & NFQA_CFG_F_UID_GID) { > @@ -1069,6 +1071,8 @@ nfqnl_recv_verdict(struct sock *ctnl, struct sk_buff > *skb, > nfnl_ct = rcu_dereference(nfnl_ct_hook); > if (nfnl_ct != NULL) > ct = nfqnl_ct_parse(nfnl_ct, nlh, nfqa, entry, ); > + } else { > + nfnl_ct = NULL; > } > > if (nfqa[NFQA_PAYLOAD]) { > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
On 11/20/2015 12:33 PM, Michael Wang wrote: > The kmemleak testing on 3.18.24 show: > > unreferenced object 0x880233ff9010 (size 16): > comm "swapper/0", pid 1, jiffies 4294937440 (age 2010.490s) > hex dump (first 16 bytes): > 0a 0a 00 00 20 00 00 00 00 44 fb 33 02 88 ff ff D.3 > backtrace: > [] create_object+0x10d/0x2d0 > [] kmemleak_alloc+0x5b/0xc0 > [] kmem_cache_alloc_trace+0xb9/0x160 > [] get_irq_table+0x151/0x380 > > This is caused by the 'irq_lookup_table' was allocated with > __get_free_pages() which won't create kmemleak object, thus it's > pointers won't be count as referencing 'irq_remap_table' in > kmemleak scan. > > The 'irq_remap_table' won't be freed after initialized, doesn't > make sense to check it's leaking. > > This patch mark the 'irq_remap_table' object as 'gray' to stop > the 'false positives' report. > > Signed-off-by: Michael Wang Reported-by: Danil Kipnis Regards, Michael Wang > --- > v2: > Use kmemleak_not_leak() instead of kmemleak_ignore() since > the 'irq_remap_table' itself also contain pointer. > > drivers/iommu/amd_iommu.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c > index 8b2be1e..87a1a88 100644 > --- a/drivers/iommu/amd_iommu.c > +++ b/drivers/iommu/amd_iommu.c > @@ -3603,6 +3603,7 @@ static struct irq_remap_table *get_irq_table(u16 devid, > bool ioapic) > } > > irq_lookup_table[devid] = table; > + kmemleak_not_leak(table); > set_dte_irq_entry(devid, table); > iommu_flush_dte(iommu, devid); > if (devid != alias) { > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
The kmemleak testing on 3.18.24 show: unreferenced object 0x880233ff9010 (size 16): comm "swapper/0", pid 1, jiffies 4294937440 (age 2010.490s) hex dump (first 16 bytes): 0a 0a 00 00 20 00 00 00 00 44 fb 33 02 88 ff ff D.3 backtrace: [] create_object+0x10d/0x2d0 [] kmemleak_alloc+0x5b/0xc0 [] kmem_cache_alloc_trace+0xb9/0x160 [] get_irq_table+0x151/0x380 This is caused by the 'irq_lookup_table' was allocated with __get_free_pages() which won't create kmemleak object, thus it's pointers won't be count as referencing 'irq_remap_table' in kmemleak scan. The 'irq_remap_table' won't be freed after initialized, doesn't make sense to check it's leaking. This patch mark the 'irq_remap_table' object as 'gray' to stop the 'false positives' report. Signed-off-by: Michael Wang --- v2: Use kmemleak_not_leak() instead of kmemleak_ignore() since the 'irq_remap_table' itself also contain pointer. drivers/iommu/amd_iommu.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c index 8b2be1e..87a1a88 100644 --- a/drivers/iommu/amd_iommu.c +++ b/drivers/iommu/amd_iommu.c @@ -3603,6 +3603,7 @@ static struct irq_remap_table *get_irq_table(u16 devid, bool ioapic) } irq_lookup_table[devid] = table; + kmemleak_not_leak(table); set_dte_irq_entry(devid, table); iommu_flush_dte(iommu, devid); if (devid != alias) { -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH] iommu/amd: make kmemleak ignore the 'irq_remap_table' object
The kmemleak testing on 3.18.24 show: unreferenced object 0x880233ff9010 (size 16): comm "swapper/0", pid 1, jiffies 4294937440 (age 2010.490s) hex dump (first 16 bytes): 0a 0a 00 00 20 00 00 00 00 44 fb 33 02 88 ff ff D.3 backtrace: [] create_object+0x10d/0x2d0 [] kmemleak_alloc+0x5b/0xc0 [] kmem_cache_alloc_trace+0xb9/0x160 [] get_irq_table+0x151/0x380 This is caused by the 'irq_lookup_table' was allocated with __get_free_pages() which won't create kmemleak object, thus it's pointers won't be count as referencing in kmemleak scanning. The 'irq_remap_table' allocated won't be freed after initialized, doesn't make sense to let kmemleak scan it. This patch mark the 'irq_remap_table' object as 'ignored' to stop the 'false positives' report. Signed-off-by: Michael Wang --- drivers/iommu/amd_iommu.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c index 8b2be1e..87a1a88 100644 --- a/drivers/iommu/amd_iommu.c +++ b/drivers/iommu/amd_iommu.c @@ -3603,6 +3603,7 @@ static struct irq_remap_table *get_irq_table(u16 devid, bool ioapic) } irq_lookup_table[devid] = table; + kmemleak_ignore(table); set_dte_irq_entry(devid, table); iommu_flush_dte(iommu, devid); if (devid != alias) { -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
On 11/20/2015 12:33 PM, Michael Wang wrote: > The kmemleak testing on 3.18.24 show: > > unreferenced object 0x880233ff9010 (size 16): > comm "swapper/0", pid 1, jiffies 4294937440 (age 2010.490s) > hex dump (first 16 bytes): > 0a 0a 00 00 20 00 00 00 00 44 fb 33 02 88 ff ff D.3 > backtrace: > [] create_object+0x10d/0x2d0 > [] kmemleak_alloc+0x5b/0xc0 > [] kmem_cache_alloc_trace+0xb9/0x160 > [] get_irq_table+0x151/0x380 > > This is caused by the 'irq_lookup_table' was allocated with > __get_free_pages() which won't create kmemleak object, thus it's > pointers won't be count as referencing 'irq_remap_table' in > kmemleak scan. > > The 'irq_remap_table' won't be freed after initialized, doesn't > make sense to check it's leaking. > > This patch mark the 'irq_remap_table' object as 'gray' to stop > the 'false positives' report. > > Signed-off-by: Michael Wang <yun.w...@profitbricks.com> Reported-by: Danil Kipnis <danil.kip...@profitbricks.com> Regards, Michael Wang > --- > v2: > Use kmemleak_not_leak() instead of kmemleak_ignore() since > the 'irq_remap_table' itself also contain pointer. > > drivers/iommu/amd_iommu.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c > index 8b2be1e..87a1a88 100644 > --- a/drivers/iommu/amd_iommu.c > +++ b/drivers/iommu/amd_iommu.c > @@ -3603,6 +3603,7 @@ static struct irq_remap_table *get_irq_table(u16 devid, > bool ioapic) > } > > irq_lookup_table[devid] = table; > + kmemleak_not_leak(table); > set_dte_irq_entry(devid, table); > iommu_flush_dte(iommu, devid); > if (devid != alias) { > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH] iommu/amd: make kmemleak ignore the 'irq_remap_table' object
The kmemleak testing on 3.18.24 show: unreferenced object 0x880233ff9010 (size 16): comm "swapper/0", pid 1, jiffies 4294937440 (age 2010.490s) hex dump (first 16 bytes): 0a 0a 00 00 20 00 00 00 00 44 fb 33 02 88 ff ff D.3 backtrace: [] create_object+0x10d/0x2d0 [] kmemleak_alloc+0x5b/0xc0 [] kmem_cache_alloc_trace+0xb9/0x160 [] get_irq_table+0x151/0x380 This is caused by the 'irq_lookup_table' was allocated with __get_free_pages() which won't create kmemleak object, thus it's pointers won't be count as referencing in kmemleak scanning. The 'irq_remap_table' allocated won't be freed after initialized, doesn't make sense to let kmemleak scan it. This patch mark the 'irq_remap_table' object as 'ignored' to stop the 'false positives' report. Signed-off-by: Michael Wang <yun.w...@profitbricks.com> --- drivers/iommu/amd_iommu.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c index 8b2be1e..87a1a88 100644 --- a/drivers/iommu/amd_iommu.c +++ b/drivers/iommu/amd_iommu.c @@ -3603,6 +3603,7 @@ static struct irq_remap_table *get_irq_table(u16 devid, bool ioapic) } irq_lookup_table[devid] = table; + kmemleak_ignore(table); set_dte_irq_entry(devid, table); iommu_flush_dte(iommu, devid); if (devid != alias) { -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v2] iommu/amd: gray the 'irq_remap_table' object for kmemleak
The kmemleak testing on 3.18.24 show: unreferenced object 0x880233ff9010 (size 16): comm "swapper/0", pid 1, jiffies 4294937440 (age 2010.490s) hex dump (first 16 bytes): 0a 0a 00 00 20 00 00 00 00 44 fb 33 02 88 ff ff D.3 backtrace: [] create_object+0x10d/0x2d0 [] kmemleak_alloc+0x5b/0xc0 [] kmem_cache_alloc_trace+0xb9/0x160 [] get_irq_table+0x151/0x380 This is caused by the 'irq_lookup_table' was allocated with __get_free_pages() which won't create kmemleak object, thus it's pointers won't be count as referencing 'irq_remap_table' in kmemleak scan. The 'irq_remap_table' won't be freed after initialized, doesn't make sense to check it's leaking. This patch mark the 'irq_remap_table' object as 'gray' to stop the 'false positives' report. Signed-off-by: Michael Wang <yun.w...@profitbricks.com> --- v2: Use kmemleak_not_leak() instead of kmemleak_ignore() since the 'irq_remap_table' itself also contain pointer. drivers/iommu/amd_iommu.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c index 8b2be1e..87a1a88 100644 --- a/drivers/iommu/amd_iommu.c +++ b/drivers/iommu/amd_iommu.c @@ -3603,6 +3603,7 @@ static struct irq_remap_table *get_irq_table(u16 devid, bool ioapic) } irq_lookup_table[devid] = table; + kmemleak_not_leak(table); set_dte_irq_entry(devid, table); iommu_flush_dte(iommu, devid); if (devid != alias) { -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC v2] Documentation/infiniband: Add docs for rdma-helpers
On 05/18/2015 06:58 PM, Doug Ledford wrote: [snip] >> I see :-) I've not work with the kdoc yet, not sure if there is any >> guidelines on how to write the header of inline func for kdoc? > > It's an automated tool thing. Any comment section that starts with /** > is automatically included as a kdoc. Then there is an expected format > after that. See Documentation/kernel-doc-nano-HOWTO.txt. Got it :-) > >>> >>> Just because I want to move this along versus waiting for another >>> respin, I'm going to copy and paste these into those locations and clean >>> up the changelog when I integrate this patch. >> >> Got it, if there is anything I could help, please let me know ;-) > > I'm sending the patch for review, please let me know if you are OK with > how I handled the attribution. The definition is far more detailed and accurate, it's already good enough according to my understanding, should benefit the developer a lot ;-) Regards, Michael Wang > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC v2] Documentation/infiniband: Add docs for rdma-helpers
On 05/18/2015 06:58 PM, Doug Ledford wrote: [snip] I see :-) I've not work with the kdoc yet, not sure if there is any guidelines on how to write the header of inline func for kdoc? It's an automated tool thing. Any comment section that starts with /** is automatically included as a kdoc. Then there is an expected format after that. See Documentation/kernel-doc-nano-HOWTO.txt. Got it :-) Just because I want to move this along versus waiting for another respin, I'm going to copy and paste these into those locations and clean up the changelog when I integrate this patch. Got it, if there is anything I could help, please let me know ;-) I'm sending the patch for review, please let me know if you are OK with how I handled the attribution. The definition is far more detailed and accurate, it's already good enough according to my understanding, should benefit the developer a lot ;-) Regards, Michael Wang -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC v2] Documentation/infiniband: Add docs for rdma-helpers
On 05/18/2015 05:21 PM, Doug Ledford wrote: [snip] >> >> I'll put the highlights and changelog under '---' in next version, is it >> looks like this? > > We're still missing Jason's feedback request though. Specifically, he > pointed out that kdocs are usually not done in Documentation/*, they are > done in the .c files where the function is (or the .h file if the > function is an inline, which these all are). So, you included some > limited documentation for each of these items in your original patches > that added them. His request was that you put this expanded information > not in Documentation/infiniband where someone has to go looking for it, > but as part of the kdoc header for each of the various helpers in > ib_verbs.h itself. I see :-) I've not work with the kdoc yet, not sure if there is any guidelines on how to write the header of inline func for kdoc? > > Just because I want to move this along versus waiting for another > respin, I'm going to copy and paste these into those locations and clean > up the changelog when I integrate this patch. Got it, if there is anything I could help, please let me know ;-) Regards, Michael Wang > >> >> Subject: [PATCH RFC v3] Documentation/infiniband: Add docs for rdma-helpers >> >> This is the following patch for: >> https://lkml.org/lkml/2015/5/5/417 >> which try to document the settled rdma_cap_XX(). >> >> Signed-off-by: Michael Wang >> --- >> Highlights: >> There could be many missing/mistakes/misunderstanding, please don't >> be hesitate to point out the issues, any suggestions to improve or >> complete the description are very welcomed ;-) >> >> v2: >> * Merge the descriptions from Doug: >> http://www.spinics.net/lists/linux-rdma/msg25172.html >> >> v3: >> ... >> >> Documentation/infiniband/rdma_helpers.txt | 79 >> +++ >> 1 file changed, 79 insertions(+) >> create mode 100644 Documentation/infiniband/rdma_helpers.txt >> >> diff --git a/Documentation/infiniband/rdma_helpers.txt >> b/Documentation/infiniband/rdma_helpers.txt >> new file mode 100644 >> index 000..be9416d >> --- /dev/null >> +++ b/Documentation/infiniband/rdma_helpers.txt >> >> Regards, >> Michael Wang >> >>> >>>> >>>> Signed-off-by: Michael Wang >>>> --- >>>> Documentation/infiniband/rdma_helpers.txt | 79 >>>> +++ >>>> 1 file changed, 79 insertions(+) >>>> create mode 100644 Documentation/infiniband/rdma_helpers.txt >>>> >>>> diff --git a/Documentation/infiniband/rdma_helpers.txt >>>> b/Documentation/infiniband/rdma_helpers.txt >>>> new file mode 100644 >>>> index 000..be9416d >>>> --- /dev/null >>>> +++ b/Documentation/infiniband/rdma_helpers.txt >>>> @@ -0,0 +1,79 @@ >>>> +RDMA HELPERS >>>> + >>>> + The following helpers are used to check the specific capabilities of a >>>> + particular port before utilizing those capabilities. >>>> + >>>> +rdma_cap_ib_mad- Infiniband Management Datagrams. >>>> +rdma_cap_ib_smi- Infiniband Subnet Management Interface. >>>> +rdma_cap_ib_cm - Infiniband Communication Manager. >>>> +rdma_cap_iw_cm - IWARP Communication Manager. >>>> +rdma_cap_ib_sa - Infiniband Subnet Administration. >>>> +rdma_cap_ib_mcast - Infiniband Multicast Join/Leave Protocol. >>>> +rdma_cap_read_multi_sge- RDMA Read Work Request Support Multiple >>>> SGE. >>>> +rdma_cap_af_ib - Native Infiniband Address. >>>> +rdma_cap_eth_ah- InfiniBand Transport With Ethernet >>>> Address. >>>> + >>>> +USAGE >>>> + >>>> + if (rdma_cap_XX(device, i)) { >>>> + /* The port i of device support XX */ >>>> + ... >>>> + } else { >>>> + /* The port i of device don't support XX */ >>>> + ... >>>> + } >>>> + >>>> + rdma_cap_ib_mad >>>> + --- >>>> +Management Datagrams (MAD) are a required part of the InfiniBand >>>> +specification and are supported on all InfiniBand devices. A slightly >>>> +extended version are also supported on OPA interfaces. >>>> + >>>&
Re: [PATCH RFC v2] Documentation/infiniband: Add docs for rdma-helpers
Hi, Or On 05/18/2015 11:47 AM, Or Gerlitz wrote: [snip] >> Highlights: >> There could be many missing/mistakes/misunderstanding, please don't >> be hesitate to point out the issues, any suggestions to improve or >> complete the description are very welcomed ;-) > > Michael, none of what you wrote above belongs to the change-log which > is going to stay for-ever in the upstream git repo. You should put it > all belong the --- line after your S.O.B and add proper change log > telling what this patch is about Thanks for point out this for me :-) I'll put the highlights and changelog under '---' in next version, is it looks like this? Subject: [PATCH RFC v3] Documentation/infiniband: Add docs for rdma-helpers This is the following patch for: https://lkml.org/lkml/2015/5/5/417 which try to document the settled rdma_cap_XX(). Signed-off-by: Michael Wang --- Highlights: There could be many missing/mistakes/misunderstanding, please don't be hesitate to point out the issues, any suggestions to improve or complete the description are very welcomed ;-) v2: * Merge the descriptions from Doug: http://www.spinics.net/lists/linux-rdma/msg25172.html v3: ... Documentation/infiniband/rdma_helpers.txt | 79 +++ 1 file changed, 79 insertions(+) create mode 100644 Documentation/infiniband/rdma_helpers.txt diff --git a/Documentation/infiniband/rdma_helpers.txt b/Documentation/infiniband/rdma_helpers.txt new file mode 100644 index 000..be9416d --- /dev/null +++ b/Documentation/infiniband/rdma_helpers.txt Regards, Michael Wang > >> >> Signed-off-by: Michael Wang >> --- >> Documentation/infiniband/rdma_helpers.txt | 79 >> +++ >> 1 file changed, 79 insertions(+) >> create mode 100644 Documentation/infiniband/rdma_helpers.txt >> >> diff --git a/Documentation/infiniband/rdma_helpers.txt >> b/Documentation/infiniband/rdma_helpers.txt >> new file mode 100644 >> index 000..be9416d >> --- /dev/null >> +++ b/Documentation/infiniband/rdma_helpers.txt >> @@ -0,0 +1,79 @@ >> +RDMA HELPERS >> + >> + The following helpers are used to check the specific capabilities of a >> + particular port before utilizing those capabilities. >> + >> +rdma_cap_ib_mad- Infiniband Management Datagrams. >> +rdma_cap_ib_smi- Infiniband Subnet Management Interface. >> +rdma_cap_ib_cm - Infiniband Communication Manager. >> +rdma_cap_iw_cm - IWARP Communication Manager. >> +rdma_cap_ib_sa - Infiniband Subnet Administration. >> +rdma_cap_ib_mcast - Infiniband Multicast Join/Leave Protocol. >> +rdma_cap_read_multi_sge- RDMA Read Work Request Support Multiple >> SGE. >> +rdma_cap_af_ib - Native Infiniband Address. >> +rdma_cap_eth_ah- InfiniBand Transport With Ethernet Address. >> + >> +USAGE >> + >> + if (rdma_cap_XX(device, i)) { >> + /* The port i of device support XX */ >> + ... >> + } else { >> + /* The port i of device don't support XX */ >> + ... >> + } >> + >> + rdma_cap_ib_mad >> + --- >> +Management Datagrams (MAD) are a required part of the InfiniBand >> +specification and are supported on all InfiniBand devices. A slightly >> +extended version are also supported on OPA interfaces. >> + >> + rdma_cap_ib_smi >> + --- >> +Subnet Management Interface (SMI) will handle SMP packet from SM >> +in an infiniband fabric. >> + >> + rdma_cap_ib_cm >> + --- >> +Communication Manager (CM) service, used to ease the process of >> connecting >> +to a remote host. The IB-CM can be used to connect to remote hosts >> using >> +either InfiniBand or RoCE connections, iWARP has its own CM. >> + >> + rdma_cap_iw_cm >> + --- >> +iWARP Communication Manager (CM), Similar to the IB-CM, but only used on >> +iWARP devices. >> + >> + rdma_cap_ib_sa >> + --- >> +Subnet Administration (SA) is the database built by SM in an >> +infiniband fabric. >> + >> + rdma_cap_ib_mcast >> + --- >> +InfiniBand (and OPA) use a different multicast mechanism rather than >> +traditional IP multicast found on Ethernet devices. If this is true, >> then >> +traditional IPv4/IPv6 multicast is handled by the IPoIB layer and direct >> +multicast joins and leaves are handled per the InfiniBan
[PATCH RFC v2] Documentation/infiniband: Add docs for rdma-helpers
Since v1: * Merge the descriptions from Doug: http://www.spinics.net/lists/linux-rdma/msg25172.html This is the following patch for: https://lkml.org/lkml/2015/5/5/417 which try to document the settled rdma_cap_XX(). Highlights: There could be many missing/mistakes/misunderstanding, please don't be hesitate to point out the issues, any suggestions to improve or complete the description are very welcomed ;-) Signed-off-by: Michael Wang --- Documentation/infiniband/rdma_helpers.txt | 79 +++ 1 file changed, 79 insertions(+) create mode 100644 Documentation/infiniband/rdma_helpers.txt diff --git a/Documentation/infiniband/rdma_helpers.txt b/Documentation/infiniband/rdma_helpers.txt new file mode 100644 index 000..be9416d --- /dev/null +++ b/Documentation/infiniband/rdma_helpers.txt @@ -0,0 +1,79 @@ +RDMA HELPERS + + The following helpers are used to check the specific capabilities of a + particular port before utilizing those capabilities. + +rdma_cap_ib_mad- Infiniband Management Datagrams. +rdma_cap_ib_smi- Infiniband Subnet Management Interface. +rdma_cap_ib_cm - Infiniband Communication Manager. +rdma_cap_iw_cm - IWARP Communication Manager. +rdma_cap_ib_sa - Infiniband Subnet Administration. +rdma_cap_ib_mcast - Infiniband Multicast Join/Leave Protocol. +rdma_cap_read_multi_sge- RDMA Read Work Request Support Multiple SGE. +rdma_cap_af_ib - Native Infiniband Address. +rdma_cap_eth_ah- InfiniBand Transport With Ethernet Address. + +USAGE + + if (rdma_cap_XX(device, i)) { + /* The port i of device support XX */ + ... + } else { + /* The port i of device don't support XX */ + ... + } + + rdma_cap_ib_mad + --- +Management Datagrams (MAD) are a required part of the InfiniBand +specification and are supported on all InfiniBand devices. A slightly +extended version are also supported on OPA interfaces. + + rdma_cap_ib_smi + --- +Subnet Management Interface (SMI) will handle SMP packet from SM +in an infiniband fabric. + + rdma_cap_ib_cm + --- +Communication Manager (CM) service, used to ease the process of connecting +to a remote host. The IB-CM can be used to connect to remote hosts using +either InfiniBand or RoCE connections, iWARP has its own CM. + + rdma_cap_iw_cm + --- +iWARP Communication Manager (CM), Similar to the IB-CM, but only used on +iWARP devices. + + rdma_cap_ib_sa + --- +Subnet Administration (SA) is the database built by SM in an +infiniband fabric. + + rdma_cap_ib_mcast + --- +InfiniBand (and OPA) use a different multicast mechanism rather than +traditional IP multicast found on Ethernet devices. If this is true, then +traditional IPv4/IPv6 multicast is handled by the IPoIB layer and direct +multicast joins and leaves are handled per the InfiniBand specifications. + + rdma_cap_read_multi_sge + --- +Certain devices (iWARP in particular) have restrictions on the number of +scatter gather elements that can be present in an RDMA READ work request, +this is true if the device does not have that restriction. + + rdma_cap_af_ib + --- +Many code paths for traditional InfiniBand and RoCE links are the same, +but need minor differences to accommodate the different addresses on the +two types of connections. This helper is true when the address of the +specific connection is of the InfiniBand native variety. + + rdma_cap_eth_ah + --- +Queue Pair is InfiniBand transport, but uses Ethernet address instead +of native InfiniBand address (aka, this is a RoCE QP, and that means +ethertype 0x8915 + GRH for RoCEv1 and IP/UDP to well known UDP port for +RoCEv2), this is true when the address family of the specific queue pair +is of the Ethernet (RoCE) variety. -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC v2] Documentation/infiniband: Add docs for rdma-helpers
On 05/18/2015 05:21 PM, Doug Ledford wrote: [snip] I'll put the highlights and changelog under '---' in next version, is it looks like this? We're still missing Jason's feedback request though. Specifically, he pointed out that kdocs are usually not done in Documentation/*, they are done in the .c files where the function is (or the .h file if the function is an inline, which these all are). So, you included some limited documentation for each of these items in your original patches that added them. His request was that you put this expanded information not in Documentation/infiniband where someone has to go looking for it, but as part of the kdoc header for each of the various helpers in ib_verbs.h itself. I see :-) I've not work with the kdoc yet, not sure if there is any guidelines on how to write the header of inline func for kdoc? Just because I want to move this along versus waiting for another respin, I'm going to copy and paste these into those locations and clean up the changelog when I integrate this patch. Got it, if there is anything I could help, please let me know ;-) Regards, Michael Wang Subject: [PATCH RFC v3] Documentation/infiniband: Add docs for rdma-helpers This is the following patch for: https://lkml.org/lkml/2015/5/5/417 which try to document the settled rdma_cap_XX(). Signed-off-by: Michael Wang yun.w...@profitbricks.com --- Highlights: There could be many missing/mistakes/misunderstanding, please don't be hesitate to point out the issues, any suggestions to improve or complete the description are very welcomed ;-) v2: * Merge the descriptions from Doug: http://www.spinics.net/lists/linux-rdma/msg25172.html v3: ... Documentation/infiniband/rdma_helpers.txt | 79 +++ 1 file changed, 79 insertions(+) create mode 100644 Documentation/infiniband/rdma_helpers.txt diff --git a/Documentation/infiniband/rdma_helpers.txt b/Documentation/infiniband/rdma_helpers.txt new file mode 100644 index 000..be9416d --- /dev/null +++ b/Documentation/infiniband/rdma_helpers.txt Regards, Michael Wang Signed-off-by: Michael Wang yun.w...@profitbricks.com --- Documentation/infiniband/rdma_helpers.txt | 79 +++ 1 file changed, 79 insertions(+) create mode 100644 Documentation/infiniband/rdma_helpers.txt diff --git a/Documentation/infiniband/rdma_helpers.txt b/Documentation/infiniband/rdma_helpers.txt new file mode 100644 index 000..be9416d --- /dev/null +++ b/Documentation/infiniband/rdma_helpers.txt @@ -0,0 +1,79 @@ +RDMA HELPERS + + The following helpers are used to check the specific capabilities of a + particular port before utilizing those capabilities. + +rdma_cap_ib_mad- Infiniband Management Datagrams. +rdma_cap_ib_smi- Infiniband Subnet Management Interface. +rdma_cap_ib_cm - Infiniband Communication Manager. +rdma_cap_iw_cm - IWARP Communication Manager. +rdma_cap_ib_sa - Infiniband Subnet Administration. +rdma_cap_ib_mcast - Infiniband Multicast Join/Leave Protocol. +rdma_cap_read_multi_sge- RDMA Read Work Request Support Multiple SGE. +rdma_cap_af_ib - Native Infiniband Address. +rdma_cap_eth_ah- InfiniBand Transport With Ethernet Address. + +USAGE + + if (rdma_cap_XX(device, i)) { + /* The port i of device support XX */ + ... + } else { + /* The port i of device don't support XX */ + ... + } + + rdma_cap_ib_mad + --- +Management Datagrams (MAD) are a required part of the InfiniBand +specification and are supported on all InfiniBand devices. A slightly +extended version are also supported on OPA interfaces. + + rdma_cap_ib_smi + --- +Subnet Management Interface (SMI) will handle SMP packet from SM +in an infiniband fabric. + + rdma_cap_ib_cm + --- +Communication Manager (CM) service, used to ease the process of connecting +to a remote host. The IB-CM can be used to connect to remote hosts using +either InfiniBand or RoCE connections, iWARP has its own CM. + + rdma_cap_iw_cm + --- +iWARP Communication Manager (CM), Similar to the IB-CM, but only used on +iWARP devices. + + rdma_cap_ib_sa + --- +Subnet Administration (SA) is the database built by SM in an +infiniband fabric. + + rdma_cap_ib_mcast + --- +InfiniBand (and OPA) use a different multicast mechanism rather than +traditional IP multicast found on Ethernet devices. If this is true, then +traditional IPv4/IPv6 multicast is handled by the IPoIB layer and direct +multicast joins and leaves are handled per the InfiniBand specifications. + + rdma_cap_read_multi_sge
[PATCH RFC v2] Documentation/infiniband: Add docs for rdma-helpers
Since v1: * Merge the descriptions from Doug: http://www.spinics.net/lists/linux-rdma/msg25172.html This is the following patch for: https://lkml.org/lkml/2015/5/5/417 which try to document the settled rdma_cap_XX(). Highlights: There could be many missing/mistakes/misunderstanding, please don't be hesitate to point out the issues, any suggestions to improve or complete the description are very welcomed ;-) Signed-off-by: Michael Wang yun.w...@profitbricks.com --- Documentation/infiniband/rdma_helpers.txt | 79 +++ 1 file changed, 79 insertions(+) create mode 100644 Documentation/infiniband/rdma_helpers.txt diff --git a/Documentation/infiniband/rdma_helpers.txt b/Documentation/infiniband/rdma_helpers.txt new file mode 100644 index 000..be9416d --- /dev/null +++ b/Documentation/infiniband/rdma_helpers.txt @@ -0,0 +1,79 @@ +RDMA HELPERS + + The following helpers are used to check the specific capabilities of a + particular port before utilizing those capabilities. + +rdma_cap_ib_mad- Infiniband Management Datagrams. +rdma_cap_ib_smi- Infiniband Subnet Management Interface. +rdma_cap_ib_cm - Infiniband Communication Manager. +rdma_cap_iw_cm - IWARP Communication Manager. +rdma_cap_ib_sa - Infiniband Subnet Administration. +rdma_cap_ib_mcast - Infiniband Multicast Join/Leave Protocol. +rdma_cap_read_multi_sge- RDMA Read Work Request Support Multiple SGE. +rdma_cap_af_ib - Native Infiniband Address. +rdma_cap_eth_ah- InfiniBand Transport With Ethernet Address. + +USAGE + + if (rdma_cap_XX(device, i)) { + /* The port i of device support XX */ + ... + } else { + /* The port i of device don't support XX */ + ... + } + + rdma_cap_ib_mad + --- +Management Datagrams (MAD) are a required part of the InfiniBand +specification and are supported on all InfiniBand devices. A slightly +extended version are also supported on OPA interfaces. + + rdma_cap_ib_smi + --- +Subnet Management Interface (SMI) will handle SMP packet from SM +in an infiniband fabric. + + rdma_cap_ib_cm + --- +Communication Manager (CM) service, used to ease the process of connecting +to a remote host. The IB-CM can be used to connect to remote hosts using +either InfiniBand or RoCE connections, iWARP has its own CM. + + rdma_cap_iw_cm + --- +iWARP Communication Manager (CM), Similar to the IB-CM, but only used on +iWARP devices. + + rdma_cap_ib_sa + --- +Subnet Administration (SA) is the database built by SM in an +infiniband fabric. + + rdma_cap_ib_mcast + --- +InfiniBand (and OPA) use a different multicast mechanism rather than +traditional IP multicast found on Ethernet devices. If this is true, then +traditional IPv4/IPv6 multicast is handled by the IPoIB layer and direct +multicast joins and leaves are handled per the InfiniBand specifications. + + rdma_cap_read_multi_sge + --- +Certain devices (iWARP in particular) have restrictions on the number of +scatter gather elements that can be present in an RDMA READ work request, +this is true if the device does not have that restriction. + + rdma_cap_af_ib + --- +Many code paths for traditional InfiniBand and RoCE links are the same, +but need minor differences to accommodate the different addresses on the +two types of connections. This helper is true when the address of the +specific connection is of the InfiniBand native variety. + + rdma_cap_eth_ah + --- +Queue Pair is InfiniBand transport, but uses Ethernet address instead +of native InfiniBand address (aka, this is a RoCE QP, and that means +ethertype 0x8915 + GRH for RoCEv1 and IP/UDP to well known UDP port for +RoCEv2), this is true when the address family of the specific queue pair +is of the Ethernet (RoCE) variety. -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC v2] Documentation/infiniband: Add docs for rdma-helpers
Hi, Or On 05/18/2015 11:47 AM, Or Gerlitz wrote: [snip] Highlights: There could be many missing/mistakes/misunderstanding, please don't be hesitate to point out the issues, any suggestions to improve or complete the description are very welcomed ;-) Michael, none of what you wrote above belongs to the change-log which is going to stay for-ever in the upstream git repo. You should put it all belong the --- line after your S.O.B and add proper change log telling what this patch is about Thanks for point out this for me :-) I'll put the highlights and changelog under '---' in next version, is it looks like this? Subject: [PATCH RFC v3] Documentation/infiniband: Add docs for rdma-helpers This is the following patch for: https://lkml.org/lkml/2015/5/5/417 which try to document the settled rdma_cap_XX(). Signed-off-by: Michael Wang yun.w...@profitbricks.com --- Highlights: There could be many missing/mistakes/misunderstanding, please don't be hesitate to point out the issues, any suggestions to improve or complete the description are very welcomed ;-) v2: * Merge the descriptions from Doug: http://www.spinics.net/lists/linux-rdma/msg25172.html v3: ... Documentation/infiniband/rdma_helpers.txt | 79 +++ 1 file changed, 79 insertions(+) create mode 100644 Documentation/infiniband/rdma_helpers.txt diff --git a/Documentation/infiniband/rdma_helpers.txt b/Documentation/infiniband/rdma_helpers.txt new file mode 100644 index 000..be9416d --- /dev/null +++ b/Documentation/infiniband/rdma_helpers.txt Regards, Michael Wang Signed-off-by: Michael Wang yun.w...@profitbricks.com --- Documentation/infiniband/rdma_helpers.txt | 79 +++ 1 file changed, 79 insertions(+) create mode 100644 Documentation/infiniband/rdma_helpers.txt diff --git a/Documentation/infiniband/rdma_helpers.txt b/Documentation/infiniband/rdma_helpers.txt new file mode 100644 index 000..be9416d --- /dev/null +++ b/Documentation/infiniband/rdma_helpers.txt @@ -0,0 +1,79 @@ +RDMA HELPERS + + The following helpers are used to check the specific capabilities of a + particular port before utilizing those capabilities. + +rdma_cap_ib_mad- Infiniband Management Datagrams. +rdma_cap_ib_smi- Infiniband Subnet Management Interface. +rdma_cap_ib_cm - Infiniband Communication Manager. +rdma_cap_iw_cm - IWARP Communication Manager. +rdma_cap_ib_sa - Infiniband Subnet Administration. +rdma_cap_ib_mcast - Infiniband Multicast Join/Leave Protocol. +rdma_cap_read_multi_sge- RDMA Read Work Request Support Multiple SGE. +rdma_cap_af_ib - Native Infiniband Address. +rdma_cap_eth_ah- InfiniBand Transport With Ethernet Address. + +USAGE + + if (rdma_cap_XX(device, i)) { + /* The port i of device support XX */ + ... + } else { + /* The port i of device don't support XX */ + ... + } + + rdma_cap_ib_mad + --- +Management Datagrams (MAD) are a required part of the InfiniBand +specification and are supported on all InfiniBand devices. A slightly +extended version are also supported on OPA interfaces. + + rdma_cap_ib_smi + --- +Subnet Management Interface (SMI) will handle SMP packet from SM +in an infiniband fabric. + + rdma_cap_ib_cm + --- +Communication Manager (CM) service, used to ease the process of connecting +to a remote host. The IB-CM can be used to connect to remote hosts using +either InfiniBand or RoCE connections, iWARP has its own CM. + + rdma_cap_iw_cm + --- +iWARP Communication Manager (CM), Similar to the IB-CM, but only used on +iWARP devices. + + rdma_cap_ib_sa + --- +Subnet Administration (SA) is the database built by SM in an +infiniband fabric. + + rdma_cap_ib_mcast + --- +InfiniBand (and OPA) use a different multicast mechanism rather than +traditional IP multicast found on Ethernet devices. If this is true, then +traditional IPv4/IPv6 multicast is handled by the IPoIB layer and direct +multicast joins and leaves are handled per the InfiniBand specifications. + + rdma_cap_read_multi_sge + --- +Certain devices (iWARP in particular) have restrictions on the number of +scatter gather elements that can be present in an RDMA READ work request, +this is true if the device does not have that restriction. + + rdma_cap_af_ib + --- +Many code paths for traditional InfiniBand and RoCE links are the same, +but need minor differences to accommodate the different addresses on the +two types of connections. This helper is true when the address of the +specific connection is of the InfiniBand native
Re: [PATCH RFC] Documentation/infiniband: Add docs for rdma-helpers
On 05/15/2015 04:40 PM, Doug Ledford wrote: [snip] > > The test itself doesn't mean that. It means we need a RoCE address > (it's true when transport is IB and link layer is Ethernet). That we > *use* it during connectionless communication because we have to generate > our own address vector for the packet while during connected queue pair > use the address vector is created by the card using the queue pair > information is just the circumstance of its use. And even though a > disconnected queue pair isn't solidly connected to a remote endpoint, it > is solidly bound to an adapter that requires either an IB or Ethernet > address family. Maybe this to resolve your issue with the wording: Thanks for the explain :-) The term 'connectionless' still sounds a little strange to me when it's just means no HW support on creating address vector, but I can understand the concept. > > This helper is true when the address family of this queue pair is of the > Ethernet (RoCE) variety. Sounds good, will be merged in next version :-) Regards, Michael Wang > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] Documentation/infiniband: Add docs for rdma-helpers
On 05/15/2015 04:27 PM, Doug Ledford wrote: [snip] >> >> Me too used to think it's 'connection', while I found some docs explain >> this as 'communication'... but anyway, 'connection' sounds >> more close to what it did in kernel :-) > > That's kind of what I thought. Anyway, it's communication management > (which to me is a gross abuse of the english language for which the IBTA > should be appropriately chastised), but that doesn't mean that lower > down in the more descriptive area of text that we can't call out that > this is really for establishing a connection and that once your > connection is established and you *truly* want to communicate, this does > nothing. I see :-) we can reserve the communication management as the definition of CM, to obey the standard, meanwhile give some description related to connection below in the long description. > [snip] >> Shall we put this long description into USAGE? Here maybe list >> all the helpers to give some quick overview with a brief >> description, what's your opinion? > > Given how we have a more complete description of this below, it need not > have such a lengthy description here. Got it :-) Regards, Michael Wang >> >>>> + >>>> +USAGE >>>> + >>>> + if (rdma_cap_XX(device, i)) { >>>> + /* The port i of device support XX */ >>>> + ... >>>> + } else { >>>> + /* The port i of device don't support XX */ >>>> + ... >>>> + } >>>> + >>>> + rdma_cap_ib_mad >>>> + --- >>>> +Management Datagrams (MAD) is the prototype of management packet >>>> +to be used by all the kinds of infiniband managers, use the helper >>>> +to verify the port before utilize related features. >>> Management Datagrams (MAD) are a required part of the InfiniBand >>> specification and are supported on all InfiniBand devices. A slightly >>> extended version are also supported on OPA interfaces. >>> >>> I would drop all instances of "use the helper to verify..." as that's >>> redundant. This whole doc is about using the helpers to verify things. >> >> Agree, will be dropped in next version. >> >> And all the comments below make sense, will be merged ;-) >> >> Regards, >> Michael Wang >> >>> >>>> + >>>> + rdma_cap_ib_smi >>>> + --- >>>> +Subnet Management Interface (SMI) will handle SMP packet from SM >>>> +in an infiniband fabric, use the helper to verify the port before >>>> +utilize related features. >>>> + >>>> + rdma_cap_ib_cm >>>> + --- >>>> +Communication Manager (CM) will handle the connections between >>>^Connection Manager (CM) service, used to ease the process of >>> connecting to a remote host. The IB CM can be used to connect to remote >>> hosts using either InfiniBand or RoCE connections. iWARP has its own >>> connection manager, see below. >>>> +adaptors, currently there are two different implementation, >>>> +IB or IWARP, use the helper to verify whether the port using >>>> +IB-CM or not >>>> + >>>> + rdma_cap_iw_cm >>>> + --- >>>> +IWARP has it's own implemented CM which is different from infiniband, >>> iWARP connection manager. Similar to the IB Connection Manager, >>> but only used on iWARP devices. >>>> +use the helper to check whether the port using IWARP-CM or not. >>>> + >>>> + rdma_cap_ib_sa >>>> + --- >>>> +Subnet Administration (SA) is the database built by SM in an >>>> +infiniband fabric, use the helper to verify the port before >>>> +utilize related features. >>>> + >>>> + rdma_cap_ib_mcast >>>> + --- >>>> +Multicast is the feature for one QP to send messages to multiple >>>> +QP in an infiniband fabric, use the helper to verify the port before >>>> +utilize related features. >>> >>> InfiniBand (and OPA) use a different multicast mechanism than >>> traditional IP multicast found on Ethernet devices. If this capability >>> is true, then traditional IPv4/IPv6 multicast is handled by the IPoIB >>> layer and direct multicast joins and leaves are handled per the >>> InfiniBand specifications. >>> >>>> + >&g
Re: [PATCH RFC] Documentation/infiniband: Add docs for rdma-helpers
On 05/13/2015 06:42 PM, Hefty, Sean wrote: >>> + rdma_cap_ib_cm >>> + --- >>> +Communication Manager (CM) will handle the connections between >>^Connection Manager (CM) service, used to ease the process of > > In IB terms, this is communication manager. It also handles transport level > address resolution for UD QPs. I could find both 'connection' and 'communication' in different docs, while 'connection' is more related to verbs, 'communication' is more close to specification. IMHO 'connection' make more sense, after all, all the transport between adaptors could named as communication, while connection management is exactly what CM did in kernel. Doug, what's your opinion? > >>> + rdma_cap_eth_ah >>> + --- >>> +Infiniband address handler format is special in ethernet fabric, >> use >>> +the helper to verify whether the port is using ethernet format or >> not. >> >> This helper is true when the address of the specific connection is of >> the Ethernet (RoCE) variety. > > This is used for connectionless communication. Could you please give more details on this? Regards, Michael Wang > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] Documentation/infiniband: Add docs for rdma-helpers
On 05/13/2015 05:11 PM, Doug Ledford wrote: [snip] >> + >> + For core layer, below helpers are used to check if a paticular capability >> + is supported by the port. > > The following helpers are used to check the specific capabilities of a > particular port before utilizing those capabilities. Will be in next version :-) > >> + >> +rdma_cap_ib_mad - Infiniband Management Datagrams. >> +rdma_cap_ib_smi - Infiniband Subnet Management Interface. >> +rdma_cap_ib_cm - Infiniband Communication Manager. > InfiniBand Connection Management Me too used to think it's 'connection', while I found some docs explain this as 'communication'... but anyway, 'connection' sounds more close to what it did in kernel :-) >> +rdma_cap_iw_cm - IWARP Communication Manager. > iWARP Connection Management >> +rdma_cap_ib_sa - Infiniband Subnet Administration. >> +rdma_cap_ib_mcast - Infiniband Multicast. > InfiniBand Multicast join/leave protocol >> +rdma_cap_read_multi_sge - RDMA Read Multiple Scatter-Gather Entries. > RDMA Read verb supports more than 1 sge in the work request Will be in next version :-) >> +rdma_cap_af_ib - Native Infiniband Address. >> +rdma_cap_eth_ah - Ethernet Address Handler. > Queue Pair is InfiniBand transport, but uses Ethernet address instead of > native InfiniBand address (aka, this is a RoCE QP, and that means > ethertype 0x8915 + GRH for RoCEv1 and IP/UDP to well known UDP port for > RoCEv2) Shall we put this long description into USAGE? Here maybe list all the helpers to give some quick overview with a brief description, what's your opinion? >> + >> +USAGE >> + >> + if (rdma_cap_XX(device, i)) { >> +/* The port i of device support XX */ >> +... >> + } else { >> +/* The port i of device don't support XX */ >> +... >> + } >> + >> + rdma_cap_ib_mad >> + --- >> +Management Datagrams (MAD) is the prototype of management packet >> +to be used by all the kinds of infiniband managers, use the helper >> +to verify the port before utilize related features. > Management Datagrams (MAD) are a required part of the InfiniBand > specification and are supported on all InfiniBand devices. A slightly > extended version are also supported on OPA interfaces. > > I would drop all instances of "use the helper to verify..." as that's > redundant. This whole doc is about using the helpers to verify things. Agree, will be dropped in next version. And all the comments below make sense, will be merged ;-) Regards, Michael Wang > >> + >> + rdma_cap_ib_smi >> + --- >> +Subnet Management Interface (SMI) will handle SMP packet from SM >> +in an infiniband fabric, use the helper to verify the port before >> +utilize related features. >> + >> + rdma_cap_ib_cm >> + --- >> +Communication Manager (CM) will handle the connections between >^Connection Manager (CM) service, used to ease the process of > connecting to a remote host. The IB CM can be used to connect to remote > hosts using either InfiniBand or RoCE connections. iWARP has its own > connection manager, see below. >> +adaptors, currently there are two different implementation, >> +IB or IWARP, use the helper to verify whether the port using >> +IB-CM or not >> + >> + rdma_cap_iw_cm >> + --- >> +IWARP has it's own implemented CM which is different from infiniband, > iWARP connection manager. Similar to the IB Connection Manager, > but only used on iWARP devices. >> +use the helper to check whether the port using IWARP-CM or not. >> + >> + rdma_cap_ib_sa >> + --- >> +Subnet Administration (SA) is the database built by SM in an >> +infiniband fabric, use the helper to verify the port before >> +utilize related features. >> + >> + rdma_cap_ib_mcast >> + --- >> +Multicast is the feature for one QP to send messages to multiple >> +QP in an infiniband fabric, use the helper to verify the port before >> +utilize related features. > > InfiniBand (and OPA) use a different multicast mechanism than > traditional IP multicast found on Ethernet devices. If this capability > is true, then traditional IPv4/IPv6 multicast is handled by the IPoIB > layer and direct multicast joins and leaves are handled per the > InfiniBand specifications. > >> + >> + rdma_cap_read_multi_sge >> + --- >&g
Re: [PATCH RFC] Documentation/infiniband: Add docs for rdma-helpers
On 05/13/2015 05:59 PM, Jason Gunthorpe wrote: > On Wed, May 13, 2015 at 03:24:32PM +0200, Michael Wang wrote: >> This is the following patch for: >> https://lkml.org/lkml/2015/5/5/417 >> which try to document the settled rdma_cap_XX(). >> >> Highlights: >> There could be many missing/mistakes/misunderstanding, please don't >> be hesitate to point out the issues, any suggestions to improve or >> complete the description are very welcomed ;-) > > I'd rather see this in the kdoc for each function. I used to thought you mean the kernel documentation like this... my misunderstanding but this is the usual way to document kernel stuff, isn't it? BTW, could you give more details on the kdoc? Regards, Michael Wang > > Thanks, > Jason > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] Documentation/infiniband: Add docs for rdma-helpers
On 05/13/2015 06:42 PM, Hefty, Sean wrote: + rdma_cap_ib_cm + --- +Communication Manager (CM) will handle the connections between ^Connection Manager (CM) service, used to ease the process of In IB terms, this is communication manager. It also handles transport level address resolution for UD QPs. I could find both 'connection' and 'communication' in different docs, while 'connection' is more related to verbs, 'communication' is more close to specification. IMHO 'connection' make more sense, after all, all the transport between adaptors could named as communication, while connection management is exactly what CM did in kernel. Doug, what's your opinion? + rdma_cap_eth_ah + --- +Infiniband address handler format is special in ethernet fabric, use +the helper to verify whether the port is using ethernet format or not. This helper is true when the address of the specific connection is of the Ethernet (RoCE) variety. This is used for connectionless communication. Could you please give more details on this? Regards, Michael Wang -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] Documentation/infiniband: Add docs for rdma-helpers
On 05/13/2015 05:59 PM, Jason Gunthorpe wrote: On Wed, May 13, 2015 at 03:24:32PM +0200, Michael Wang wrote: This is the following patch for: https://lkml.org/lkml/2015/5/5/417 which try to document the settled rdma_cap_XX(). Highlights: There could be many missing/mistakes/misunderstanding, please don't be hesitate to point out the issues, any suggestions to improve or complete the description are very welcomed ;-) I'd rather see this in the kdoc for each function. I used to thought you mean the kernel documentation like this... my misunderstanding but this is the usual way to document kernel stuff, isn't it? BTW, could you give more details on the kdoc? Regards, Michael Wang Thanks, Jason -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] Documentation/infiniband: Add docs for rdma-helpers
On 05/13/2015 05:11 PM, Doug Ledford wrote: [snip] + + For core layer, below helpers are used to check if a paticular capability + is supported by the port. The following helpers are used to check the specific capabilities of a particular port before utilizing those capabilities. Will be in next version :-) + +rdma_cap_ib_mad - Infiniband Management Datagrams. +rdma_cap_ib_smi - Infiniband Subnet Management Interface. +rdma_cap_ib_cm - Infiniband Communication Manager. InfiniBand Connection Management Me too used to think it's 'connection', while I found some docs explain this as 'communication'... but anyway, 'connection' sounds more close to what it did in kernel :-) +rdma_cap_iw_cm - IWARP Communication Manager. iWARP Connection Management +rdma_cap_ib_sa - Infiniband Subnet Administration. +rdma_cap_ib_mcast - Infiniband Multicast. InfiniBand Multicast join/leave protocol +rdma_cap_read_multi_sge - RDMA Read Multiple Scatter-Gather Entries. RDMA Read verb supports more than 1 sge in the work request Will be in next version :-) +rdma_cap_af_ib - Native Infiniband Address. +rdma_cap_eth_ah - Ethernet Address Handler. Queue Pair is InfiniBand transport, but uses Ethernet address instead of native InfiniBand address (aka, this is a RoCE QP, and that means ethertype 0x8915 + GRH for RoCEv1 and IP/UDP to well known UDP port for RoCEv2) Shall we put this long description into USAGE? Here maybe list all the helpers to give some quick overview with a brief description, what's your opinion? + +USAGE + + if (rdma_cap_XX(device, i)) { +/* The port i of device support XX */ +... + } else { +/* The port i of device don't support XX */ +... + } + + rdma_cap_ib_mad + --- +Management Datagrams (MAD) is the prototype of management packet +to be used by all the kinds of infiniband managers, use the helper +to verify the port before utilize related features. Management Datagrams (MAD) are a required part of the InfiniBand specification and are supported on all InfiniBand devices. A slightly extended version are also supported on OPA interfaces. I would drop all instances of use the helper to verify... as that's redundant. This whole doc is about using the helpers to verify things. Agree, will be dropped in next version. And all the comments below make sense, will be merged ;-) Regards, Michael Wang + + rdma_cap_ib_smi + --- +Subnet Management Interface (SMI) will handle SMP packet from SM +in an infiniband fabric, use the helper to verify the port before +utilize related features. + + rdma_cap_ib_cm + --- +Communication Manager (CM) will handle the connections between ^Connection Manager (CM) service, used to ease the process of connecting to a remote host. The IB CM can be used to connect to remote hosts using either InfiniBand or RoCE connections. iWARP has its own connection manager, see below. +adaptors, currently there are two different implementation, +IB or IWARP, use the helper to verify whether the port using +IB-CM or not + + rdma_cap_iw_cm + --- +IWARP has it's own implemented CM which is different from infiniband, iWARP connection manager. Similar to the IB Connection Manager, but only used on iWARP devices. +use the helper to check whether the port using IWARP-CM or not. + + rdma_cap_ib_sa + --- +Subnet Administration (SA) is the database built by SM in an +infiniband fabric, use the helper to verify the port before +utilize related features. + + rdma_cap_ib_mcast + --- +Multicast is the feature for one QP to send messages to multiple +QP in an infiniband fabric, use the helper to verify the port before +utilize related features. InfiniBand (and OPA) use a different multicast mechanism than traditional IP multicast found on Ethernet devices. If this capability is true, then traditional IPv4/IPv6 multicast is handled by the IPoIB layer and direct multicast joins and leaves are handled per the InfiniBand specifications. + + rdma_cap_read_multi_sge + --- +RDMA read operation could support multiple scatter-gather entries, +use the helper to verify wthether the port support this feature +or not. Certain devices (iWARP in particular) have restrictions on the number of scatter gather elements that can be present in an RDMA READ work request. This is true if the device does not have that restriction. + rdma_cap_af_ib + --- +RDMA address format could be ethernet or infiniband, use the helper +to verify whether the port support infiniband format or not. Many code paths for traditional InfiniBand and RoCE links are the same, but need minor
Re: [PATCH RFC] Documentation/infiniband: Add docs for rdma-helpers
On 05/15/2015 04:27 PM, Doug Ledford wrote: [snip] Me too used to think it's 'connection', while I found some docs explain this as 'communication'... but anyway, 'connection' sounds more close to what it did in kernel :-) That's kind of what I thought. Anyway, it's communication management (which to me is a gross abuse of the english language for which the IBTA should be appropriately chastised), but that doesn't mean that lower down in the more descriptive area of text that we can't call out that this is really for establishing a connection and that once your connection is established and you *truly* want to communicate, this does nothing. I see :-) we can reserve the communication management as the definition of CM, to obey the standard, meanwhile give some description related to connection below in the long description. [snip] Shall we put this long description into USAGE? Here maybe list all the helpers to give some quick overview with a brief description, what's your opinion? Given how we have a more complete description of this below, it need not have such a lengthy description here. Got it :-) Regards, Michael Wang + +USAGE + + if (rdma_cap_XX(device, i)) { + /* The port i of device support XX */ + ... + } else { + /* The port i of device don't support XX */ + ... + } + + rdma_cap_ib_mad + --- +Management Datagrams (MAD) is the prototype of management packet +to be used by all the kinds of infiniband managers, use the helper +to verify the port before utilize related features. Management Datagrams (MAD) are a required part of the InfiniBand specification and are supported on all InfiniBand devices. A slightly extended version are also supported on OPA interfaces. I would drop all instances of use the helper to verify... as that's redundant. This whole doc is about using the helpers to verify things. Agree, will be dropped in next version. And all the comments below make sense, will be merged ;-) Regards, Michael Wang + + rdma_cap_ib_smi + --- +Subnet Management Interface (SMI) will handle SMP packet from SM +in an infiniband fabric, use the helper to verify the port before +utilize related features. + + rdma_cap_ib_cm + --- +Communication Manager (CM) will handle the connections between ^Connection Manager (CM) service, used to ease the process of connecting to a remote host. The IB CM can be used to connect to remote hosts using either InfiniBand or RoCE connections. iWARP has its own connection manager, see below. +adaptors, currently there are two different implementation, +IB or IWARP, use the helper to verify whether the port using +IB-CM or not + + rdma_cap_iw_cm + --- +IWARP has it's own implemented CM which is different from infiniband, iWARP connection manager. Similar to the IB Connection Manager, but only used on iWARP devices. +use the helper to check whether the port using IWARP-CM or not. + + rdma_cap_ib_sa + --- +Subnet Administration (SA) is the database built by SM in an +infiniband fabric, use the helper to verify the port before +utilize related features. + + rdma_cap_ib_mcast + --- +Multicast is the feature for one QP to send messages to multiple +QP in an infiniband fabric, use the helper to verify the port before +utilize related features. InfiniBand (and OPA) use a different multicast mechanism than traditional IP multicast found on Ethernet devices. If this capability is true, then traditional IPv4/IPv6 multicast is handled by the IPoIB layer and direct multicast joins and leaves are handled per the InfiniBand specifications. + + rdma_cap_read_multi_sge + --- +RDMA read operation could support multiple scatter-gather entries, +use the helper to verify wthether the port support this feature +or not. Certain devices (iWARP in particular) have restrictions on the number of scatter gather elements that can be present in an RDMA READ work request. This is true if the device does not have that restriction. + rdma_cap_af_ib + --- +RDMA address format could be ethernet or infiniband, use the helper +to verify whether the port support infiniband format or not. Many code paths for traditional InfiniBand and RoCE links are the same, but need minor differences to accommodate the different addresses on the two types of connections. This helper is true when the address of the specific connection is of the InfiniBand native variety. + + rdma_cap_eth_ah + --- +Infiniband address handler format is special in ethernet fabric, use +the helper to verify whether the port is using ethernet format or not. This helper is true when the address of the specific connection is of the Ethernet (RoCE) variety
Re: [PATCH RFC] Documentation/infiniband: Add docs for rdma-helpers
On 05/15/2015 04:40 PM, Doug Ledford wrote: [snip] The test itself doesn't mean that. It means we need a RoCE address (it's true when transport is IB and link layer is Ethernet). That we *use* it during connectionless communication because we have to generate our own address vector for the packet while during connected queue pair use the address vector is created by the card using the queue pair information is just the circumstance of its use. And even though a disconnected queue pair isn't solidly connected to a remote endpoint, it is solidly bound to an adapter that requires either an IB or Ethernet address family. Maybe this to resolve your issue with the wording: Thanks for the explain :-) The term 'connectionless' still sounds a little strange to me when it's just means no HW support on creating address vector, but I can understand the concept. This helper is true when the address family of this queue pair is of the Ethernet (RoCE) variety. Sounds good, will be merged in next version :-) Regards, Michael Wang -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC] Documentation/infiniband: Add docs for rdma-helpers
This is the following patch for: https://lkml.org/lkml/2015/5/5/417 which try to document the settled rdma_cap_XX(). Highlights: There could be many missing/mistakes/misunderstanding, please don't be hesitate to point out the issues, any suggestions to improve or complete the description are very welcomed ;-) Signed-off-by: Michael Wang --- Documentation/infiniband/rdma_helpers.txt | 76 +++ 1 file changed, 76 insertions(+) create mode 100644 Documentation/infiniband/rdma_helpers.txt diff --git a/Documentation/infiniband/rdma_helpers.txt b/Documentation/infiniband/rdma_helpers.txt new file mode 100644 index 000..abc75ec --- /dev/null +++ b/Documentation/infiniband/rdma_helpers.txt @@ -0,0 +1,76 @@ +RDMA HELPERS + + For core layer, below helpers are used to check if a paticular capability + is supported by the port. + +rdma_cap_ib_mad- Infiniband Management Datagrams. +rdma_cap_ib_smi- Infiniband Subnet Management Interface. +rdma_cap_ib_cm - Infiniband Communication Manager. +rdma_cap_iw_cm - IWARP Communication Manager. +rdma_cap_ib_sa - Infiniband Subnet Administration. +rdma_cap_ib_mcast - Infiniband Multicast. +rdma_cap_read_multi_sge- RDMA Read Multiple Scatter-Gather Entries. +rdma_cap_af_ib - Native Infiniband Address. +rdma_cap_eth_ah- Ethernet Address Handler. + +USAGE + + if (rdma_cap_XX(device, i)) { + /* The port i of device support XX */ + ... + } else { + /* The port i of device don't support XX */ + ... + } + + rdma_cap_ib_mad + --- +Management Datagrams (MAD) is the prototype of management packet +to be used by all the kinds of infiniband managers, use the helper +to verify the port before utilize related features. + + rdma_cap_ib_smi + --- +Subnet Management Interface (SMI) will handle SMP packet from SM +in an infiniband fabric, use the helper to verify the port before +utilize related features. + + rdma_cap_ib_cm + --- +Communication Manager (CM) will handle the connections between +adaptors, currently there are two different implementation, +IB or IWARP, use the helper to verify whether the port using +IB-CM or not + + rdma_cap_iw_cm + --- +IWARP has it's own implemented CM which is different from infiniband, +use the helper to check whether the port using IWARP-CM or not. + + rdma_cap_ib_sa + --- +Subnet Administration (SA) is the database built by SM in an +infiniband fabric, use the helper to verify the port before +utilize related features. + + rdma_cap_ib_mcast + --- +Multicast is the feature for one QP to send messages to multiple +QP in an infiniband fabric, use the helper to verify the port before +utilize related features. + + rdma_cap_read_multi_sge + --- +RDMA read operation could support multiple scatter-gather entries, +use the helper to verify wthether the port support this feature +or not. + + rdma_cap_af_ib + --- +RDMA address format could be ethernet or infiniband, use the helper +to verify whether the port support infiniband format or not. + + rdma_cap_eth_ah + --- +Infiniband address handler format is special in ethernet fabric, use +the helper to verify whether the port is using ethernet format or not. -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v8 00/23] IB/Verbs: IB Management Helpers
On 05/12/2015 10:09 PM, Doug Ledford wrote: [snip] >> >> I had asked for better kdocs for the new helpers so new people can >> understand when and where to use them. >> >> I've not looked at the series at all for the past few postings. > > Michael, please work up an incremental patch to address the kdocs issue. > I've picked up the v8 patchset, and there is no need to respin it, but I > would like to have that kdoc patch before the 4.2 merge window opens. Sure, now these helpers are settled down, it's time for document, I'll send out the RFC ASAP :-) Regards, Michael Wang > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC] Documentation/infiniband: Add docs for rdma-helpers
This is the following patch for: https://lkml.org/lkml/2015/5/5/417 which try to document the settled rdma_cap_XX(). Highlights: There could be many missing/mistakes/misunderstanding, please don't be hesitate to point out the issues, any suggestions to improve or complete the description are very welcomed ;-) Signed-off-by: Michael Wang yun.w...@profitbricks.com --- Documentation/infiniband/rdma_helpers.txt | 76 +++ 1 file changed, 76 insertions(+) create mode 100644 Documentation/infiniband/rdma_helpers.txt diff --git a/Documentation/infiniband/rdma_helpers.txt b/Documentation/infiniband/rdma_helpers.txt new file mode 100644 index 000..abc75ec --- /dev/null +++ b/Documentation/infiniband/rdma_helpers.txt @@ -0,0 +1,76 @@ +RDMA HELPERS + + For core layer, below helpers are used to check if a paticular capability + is supported by the port. + +rdma_cap_ib_mad- Infiniband Management Datagrams. +rdma_cap_ib_smi- Infiniband Subnet Management Interface. +rdma_cap_ib_cm - Infiniband Communication Manager. +rdma_cap_iw_cm - IWARP Communication Manager. +rdma_cap_ib_sa - Infiniband Subnet Administration. +rdma_cap_ib_mcast - Infiniband Multicast. +rdma_cap_read_multi_sge- RDMA Read Multiple Scatter-Gather Entries. +rdma_cap_af_ib - Native Infiniband Address. +rdma_cap_eth_ah- Ethernet Address Handler. + +USAGE + + if (rdma_cap_XX(device, i)) { + /* The port i of device support XX */ + ... + } else { + /* The port i of device don't support XX */ + ... + } + + rdma_cap_ib_mad + --- +Management Datagrams (MAD) is the prototype of management packet +to be used by all the kinds of infiniband managers, use the helper +to verify the port before utilize related features. + + rdma_cap_ib_smi + --- +Subnet Management Interface (SMI) will handle SMP packet from SM +in an infiniband fabric, use the helper to verify the port before +utilize related features. + + rdma_cap_ib_cm + --- +Communication Manager (CM) will handle the connections between +adaptors, currently there are two different implementation, +IB or IWARP, use the helper to verify whether the port using +IB-CM or not + + rdma_cap_iw_cm + --- +IWARP has it's own implemented CM which is different from infiniband, +use the helper to check whether the port using IWARP-CM or not. + + rdma_cap_ib_sa + --- +Subnet Administration (SA) is the database built by SM in an +infiniband fabric, use the helper to verify the port before +utilize related features. + + rdma_cap_ib_mcast + --- +Multicast is the feature for one QP to send messages to multiple +QP in an infiniband fabric, use the helper to verify the port before +utilize related features. + + rdma_cap_read_multi_sge + --- +RDMA read operation could support multiple scatter-gather entries, +use the helper to verify wthether the port support this feature +or not. + + rdma_cap_af_ib + --- +RDMA address format could be ethernet or infiniband, use the helper +to verify whether the port support infiniband format or not. + + rdma_cap_eth_ah + --- +Infiniband address handler format is special in ethernet fabric, use +the helper to verify whether the port is using ethernet format or not. -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v8 00/23] IB/Verbs: IB Management Helpers
On 05/12/2015 10:09 PM, Doug Ledford wrote: [snip] I had asked for better kdocs for the new helpers so new people can understand when and where to use them. I've not looked at the series at all for the past few postings. Michael, please work up an incremental patch to address the kdocs issue. I've picked up the v8 patchset, and there is no need to respin it, but I would like to have that kdoc patch before the 4.2 merge window opens. Sure, now these helpers are settled down, it's time for document, I'll send out the RFC ASAP :-) Regards, Michael Wang -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v8 00/23] IB/Verbs: IB Management Helpers
On 05/12/2015 04:24 PM, Doug Ledford wrote: [snip] >> >> AFAIK Or was asking to merge the #15~23, and want to reserve the changelog >> meanwhile reply the cover of prev version (I'm still confused on that...), >> I've replied but get no respond yet. >> >> I can make a v9 to merge the #15~#23 if that could benefit the >> maintainability, >> please let me know your opinion :-) > > I don't think it would make a significant difference. I've pulled the > v8 patchset out of patchworks and I'll throw a new branch with it > included up to my github repo sometime today. Got it :-) Regards, Michael Wang > >> About the Bug, if it was not introduced in this series, maybe including the >> fix in next series would be better? >> >> Regards, >> Michael Wang >> >>> >>>> Frankly >>>> I vote for the former because as it stands this series does not break >>>> directly. >>>> It was only after I changed the implementation of rdma_cap_ib_mad that it >>>> broke. >>>> >>>> >>>> For the rest of the series. >>>> >>>> Reviewed-by: Ira Weiny >>>> Tested-by: Ira Weiny >>>>-- Limited to mlx4, qib, and OPA (with additional patches.) >>>> >>>> >>>> On Tue, May 05, 2015 at 02:50:17PM +0200, Michael Wang wrote: >>>>> Since v7: >>>>> * Thanks to Doug, Ira, Devesh for the testing :-) >>>>> * Thanks for the comments from or, Doug, Ira, Jason :-) >>>>> Please remind me if anything missed :-P >>>>> * Use rdma_cap_XX() instead of cap_XX() for readability >>>>> * Remove CC list in git log for maintainability >>>>> * Use bool as return value >>>>> * Updated github repository to v8 >>>>> >>>>> There are plenty of lengthy code to check the transport type of IB device, >>>>> or the link layer type of it's port, but actually we are just speculating >>>>> whether a particular management/feature is supported by the device/port. >>>>> >>>>> Thus instead of inferring, we should have our own mechanism for IB >>>>> management >>>>> capability/protocol/feature checking, several proposals below. >>>>> >>>>> This patch set will introduce query_protocol() to check management >>>>> requirement >>>>> instead of inferring from transport and link layer respectively, along >>>>> with >>>>> the new enum on protocol type. >>>>> >>>>> Mapping List: >>>>> node-type link-layer transport protocol >>>>> nes RNICETH IWARP IWARP >>>>> amso1100 RNICETH IWARP IWARP >>>>> cxgb3 RNICETH IWARP IWARP >>>>> cxgb4 RNICETH IWARP IWARP >>>>> usnic USNIC_UDP ETH USNIC_UDP USNIC_UDP >>>>> ocrdmaIB_CA ETH IB IBOE >>>>> mlx4 IB_CA IB/ETH IB IB/IBOE >>>>> mlx5 IB_CA IB IB IB >>>>> ehca IB_CA IB IB IB >>>>> ipath IB_CA IB IB IB >>>>> mthca IB_CA IB IB IB >>>>> qib IB_CA IB IB IB >>>>> >>>>> For example: >>>>> if (transport == IB) && (link-layer == ETH) >>>>> will now become: >>>>> if (query_protocol() == IBOE) >>>>> >>>>> Thus we will be able to get rid of the respective transport and link-layer >>>>> checking, and it will help us to add new protocol/Technology (like OPA) >>>>> more >>>>> easier, also with the introduced management helpers, IB management logical >>>>> will be more clear and easier for extending. >>>>> >>>>> Highlights: >>>>> The long CC list in each patches was complained consider about the >>>>> maintainability, it was suggested folks to provide their reviewed-by >>>>> or >>>>> Acked-by instead, so for those who used to be on the CC list, please
Re: [PATCH v8 00/23] IB/Verbs: IB Management Helpers
On 05/12/2015 01:49 AM, ira.weiny wrote: > I have run with this series and the only issue I have found is not with this > patch set directly. > > This patch: > >> IB/Verbs: Use management helper rdma_cap_ib_mad() > > causes an error when you actually use the port passed from the ib_umad module. > I have a patch to fix that which I found while trying to build on this series > for the use of a bit mask. > > Doug, I don't know what you would like to do for this fix. I am submitting it > shortly with a new version of the core capability bit patches. If you want to > just add it after this series or force Michael to respin with the fix? > Frankly > I vote for the former because as it stands this series does not break > directly. > It was only after I changed the implementation of rdma_cap_ib_mad that it > broke. Agree, it sounds more reasonable to include the fix in the series introduced it :-P > > > For the rest of the series. > > Reviewed-by: Ira Weiny > Tested-by: Ira Weiny > -- Limited to mlx4, qib, and OPA (with additional patches.) Thanks for the review and testing :-) Regards, Michael Wang > > > On Tue, May 05, 2015 at 02:50:17PM +0200, Michael Wang wrote: >> Since v7: >> * Thanks to Doug, Ira, Devesh for the testing :-) >> * Thanks for the comments from or, Doug, Ira, Jason :-) >> Please remind me if anything missed :-P >> * Use rdma_cap_XX() instead of cap_XX() for readability >> * Remove CC list in git log for maintainability >> * Use bool as return value >> * Updated github repository to v8 >> >> There are plenty of lengthy code to check the transport type of IB device, >> or the link layer type of it's port, but actually we are just speculating >> whether a particular management/feature is supported by the device/port. >> >> Thus instead of inferring, we should have our own mechanism for IB management >> capability/protocol/feature checking, several proposals below. >> >> This patch set will introduce query_protocol() to check management >> requirement >> instead of inferring from transport and link layer respectively, along with >> the new enum on protocol type. >> >> Mapping List: >> node-type link-layer transport protocol >> nes RNICETH IWARP IWARP >> amso1100 RNICETH IWARP IWARP >> cxgb3RNICETH IWARP IWARP >> cxgb4RNICETH IWARP IWARP >> usnicUSNIC_UDP ETH USNIC_UDP USNIC_UDP >> ocrdma IB_CA ETH IB IBOE >> mlx4 IB_CA IB/ETH IB IB/IBOE >> mlx5 IB_CA IB IB IB >> ehca IB_CA IB IB IB >> ipathIB_CA IB IB IB >> mthcaIB_CA IB IB IB >> qib IB_CA IB IB IB >> >> For example: >> if (transport == IB) && (link-layer == ETH) >> will now become: >> if (query_protocol() == IBOE) >> >> Thus we will be able to get rid of the respective transport and link-layer >> checking, and it will help us to add new protocol/Technology (like OPA) more >> easier, also with the introduced management helpers, IB management logical >> will be more clear and easier for extending. >> >> Highlights: >> The long CC list in each patches was complained consider about the >> maintainability, it was suggested folks to provide their reviewed-by or >> Acked-by instead, so for those who used to be on the CC list, please >> provide your signature voluntarily :-) >> >> The 'mgmt-helpers' branch of 'g...@github.com:ywang-pb/infiniband-wy.git' >> contain this series based on the latest 'infiniband/for-next' >> >> Patch 1#~14# included all the logical reform, 15#~23# introduced the >> management helpers. >> >> Doug suggested the bitmask mechanism: >> https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23765.html >> which could be the plan for future reforming, we prefer that to be >> another >> series which focus on semantic and performance. >> >> This patch-set is somewhat 'bloated' now and it may be a good timing for >> staging, I'd like to suggest we focus on improving existed helpers and >> push >> a
Re: [PATCH v8 00/23] IB/Verbs: IB Management Helpers
On 05/12/2015 02:27 AM, Doug Ledford wrote: > On Mon, 2015-05-11 at 19:49 -0400, ira.weiny wrote: >> I have run with this series and the only issue I have found is not with this >> patch set directly. >> >> This patch: >> >>> IB/Verbs: Use management helper rdma_cap_ib_mad() >> >> causes an error when you actually use the port passed from the ib_umad >> module. >> I have a patch to fix that which I found while trying to build on this series >> for the use of a bit mask. >> >> Doug, I don't know what you would like to do for this fix. I am submitting >> it >> shortly with a new version of the core capability bit patches. If you want >> to >> just add it after this series or force Michael to respin with the fix? > > As I recall, there was a comment from Or requesting to squash some of > the individual patches down, but I no longer have that email in my Inbox > to double check. And it seemed like there was one other review comment > not yet addressed. Do I have that right Michael? And if so, are you > working on a v9? AFAIK Or was asking to merge the #15~23, and want to reserve the changelog meanwhile reply the cover of prev version (I'm still confused on that...), I've replied but get no respond yet. I can make a v9 to merge the #15~#23 if that could benefit the maintainability, please let me know your opinion :-) About the Bug, if it was not introduced in this series, maybe including the fix in next series would be better? Regards, Michael Wang > >> Frankly >> I vote for the former because as it stands this series does not break >> directly. >> It was only after I changed the implementation of rdma_cap_ib_mad that it >> broke. >> >> >> For the rest of the series. >> >> Reviewed-by: Ira Weiny >> Tested-by: Ira Weiny >> -- Limited to mlx4, qib, and OPA (with additional patches.) >> >> >> On Tue, May 05, 2015 at 02:50:17PM +0200, Michael Wang wrote: >>> Since v7: >>> * Thanks to Doug, Ira, Devesh for the testing :-) >>> * Thanks for the comments from or, Doug, Ira, Jason :-) >>> Please remind me if anything missed :-P >>> * Use rdma_cap_XX() instead of cap_XX() for readability >>> * Remove CC list in git log for maintainability >>> * Use bool as return value >>> * Updated github repository to v8 >>> >>> There are plenty of lengthy code to check the transport type of IB device, >>> or the link layer type of it's port, but actually we are just speculating >>> whether a particular management/feature is supported by the device/port. >>> >>> Thus instead of inferring, we should have our own mechanism for IB >>> management >>> capability/protocol/feature checking, several proposals below. >>> >>> This patch set will introduce query_protocol() to check management >>> requirement >>> instead of inferring from transport and link layer respectively, along with >>> the new enum on protocol type. >>> >>> Mapping List: >>> node-type link-layer transport protocol >>> nes RNICETH IWARP IWARP >>> amso1100RNICETH IWARP IWARP >>> cxgb3 RNICETH IWARP IWARP >>> cxgb4 RNICETH IWARP IWARP >>> usnic USNIC_UDP ETH USNIC_UDP USNIC_UDP >>> ocrdma IB_CA ETH IB IBOE >>> mlx4IB_CA IB/ETH IB IB/IBOE >>> mlx5IB_CA IB IB IB >>> ehcaIB_CA IB IB IB >>> ipath IB_CA IB IB IB >>> mthca IB_CA IB IB IB >>> qib IB_CA IB IB IB >>> >>> For example: >>> if (transport == IB) && (link-layer == ETH) >>> will now become: >>> if (query_protocol() == IBOE) >>> >>> Thus we will be able to get rid of the respective transport and link-layer >>> checking, and it will help us to add new protocol/Technology (like OPA) more >>> easier, also with the introduced management helpers, IB management logical >>> will be more clear and easier for extending. >>> >>> Highlights: >>> The long CC list in each patches was complained consider about t
Re: [PATCH v8 00/23] IB/Verbs: IB Management Helpers
On 05/12/2015 02:27 AM, Doug Ledford wrote: On Mon, 2015-05-11 at 19:49 -0400, ira.weiny wrote: I have run with this series and the only issue I have found is not with this patch set directly. This patch: IB/Verbs: Use management helper rdma_cap_ib_mad() causes an error when you actually use the port passed from the ib_umad module. I have a patch to fix that which I found while trying to build on this series for the use of a bit mask. Doug, I don't know what you would like to do for this fix. I am submitting it shortly with a new version of the core capability bit patches. If you want to just add it after this series or force Michael to respin with the fix? As I recall, there was a comment from Or requesting to squash some of the individual patches down, but I no longer have that email in my Inbox to double check. And it seemed like there was one other review comment not yet addressed. Do I have that right Michael? And if so, are you working on a v9? AFAIK Or was asking to merge the #15~23, and want to reserve the changelog meanwhile reply the cover of prev version (I'm still confused on that...), I've replied but get no respond yet. I can make a v9 to merge the #15~#23 if that could benefit the maintainability, please let me know your opinion :-) About the Bug, if it was not introduced in this series, maybe including the fix in next series would be better? Regards, Michael Wang Frankly I vote for the former because as it stands this series does not break directly. It was only after I changed the implementation of rdma_cap_ib_mad that it broke. For the rest of the series. Reviewed-by: Ira Weiny ira.we...@intel.com Tested-by: Ira Weiny ira.we...@intel.com -- Limited to mlx4, qib, and OPA (with additional patches.) On Tue, May 05, 2015 at 02:50:17PM +0200, Michael Wang wrote: Since v7: * Thanks to Doug, Ira, Devesh for the testing :-) * Thanks for the comments from or, Doug, Ira, Jason :-) Please remind me if anything missed :-P * Use rdma_cap_XX() instead of cap_XX() for readability * Remove CC list in git log for maintainability * Use bool as return value * Updated github repository to v8 There are plenty of lengthy code to check the transport type of IB device, or the link layer type of it's port, but actually we are just speculating whether a particular management/feature is supported by the device/port. Thus instead of inferring, we should have our own mechanism for IB management capability/protocol/feature checking, several proposals below. This patch set will introduce query_protocol() to check management requirement instead of inferring from transport and link layer respectively, along with the new enum on protocol type. Mapping List: node-type link-layer transport protocol nes RNICETH IWARP IWARP amso1100RNICETH IWARP IWARP cxgb3 RNICETH IWARP IWARP cxgb4 RNICETH IWARP IWARP usnic USNIC_UDP ETH USNIC_UDP USNIC_UDP ocrdma IB_CA ETH IB IBOE mlx4IB_CA IB/ETH IB IB/IBOE mlx5IB_CA IB IB IB ehcaIB_CA IB IB IB ipath IB_CA IB IB IB mthca IB_CA IB IB IB qib IB_CA IB IB IB For example: if (transport == IB) (link-layer == ETH) will now become: if (query_protocol() == IBOE) Thus we will be able to get rid of the respective transport and link-layer checking, and it will help us to add new protocol/Technology (like OPA) more easier, also with the introduced management helpers, IB management logical will be more clear and easier for extending. Highlights: The long CC list in each patches was complained consider about the maintainability, it was suggested folks to provide their reviewed-by or Acked-by instead, so for those who used to be on the CC list, please provide your signature voluntarily :-) The 'mgmt-helpers' branch of 'g...@github.com:ywang-pb/infiniband-wy.git' contain this series based on the latest 'infiniband/for-next' Patch 1#~14# included all the logical reform, 15#~23# introduced the management helpers. Doug suggested the bitmask mechanism: https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23765.html which could be the plan for future reforming, we prefer that to be another series which focus on semantic and performance. This patch-set is somewhat 'bloated' now and it may be a good timing for staging, I'd like to suggest we focus on improving existed helpers and push
Re: [PATCH v8 00/23] IB/Verbs: IB Management Helpers
On 05/12/2015 01:49 AM, ira.weiny wrote: I have run with this series and the only issue I have found is not with this patch set directly. This patch: IB/Verbs: Use management helper rdma_cap_ib_mad() causes an error when you actually use the port passed from the ib_umad module. I have a patch to fix that which I found while trying to build on this series for the use of a bit mask. Doug, I don't know what you would like to do for this fix. I am submitting it shortly with a new version of the core capability bit patches. If you want to just add it after this series or force Michael to respin with the fix? Frankly I vote for the former because as it stands this series does not break directly. It was only after I changed the implementation of rdma_cap_ib_mad that it broke. Agree, it sounds more reasonable to include the fix in the series introduced it :-P For the rest of the series. Reviewed-by: Ira Weiny ira.we...@intel.com Tested-by: Ira Weiny ira.we...@intel.com -- Limited to mlx4, qib, and OPA (with additional patches.) Thanks for the review and testing :-) Regards, Michael Wang On Tue, May 05, 2015 at 02:50:17PM +0200, Michael Wang wrote: Since v7: * Thanks to Doug, Ira, Devesh for the testing :-) * Thanks for the comments from or, Doug, Ira, Jason :-) Please remind me if anything missed :-P * Use rdma_cap_XX() instead of cap_XX() for readability * Remove CC list in git log for maintainability * Use bool as return value * Updated github repository to v8 There are plenty of lengthy code to check the transport type of IB device, or the link layer type of it's port, but actually we are just speculating whether a particular management/feature is supported by the device/port. Thus instead of inferring, we should have our own mechanism for IB management capability/protocol/feature checking, several proposals below. This patch set will introduce query_protocol() to check management requirement instead of inferring from transport and link layer respectively, along with the new enum on protocol type. Mapping List: node-type link-layer transport protocol nes RNICETH IWARP IWARP amso1100 RNICETH IWARP IWARP cxgb3RNICETH IWARP IWARP cxgb4RNICETH IWARP IWARP usnicUSNIC_UDP ETH USNIC_UDP USNIC_UDP ocrdma IB_CA ETH IB IBOE mlx4 IB_CA IB/ETH IB IB/IBOE mlx5 IB_CA IB IB IB ehca IB_CA IB IB IB ipathIB_CA IB IB IB mthcaIB_CA IB IB IB qib IB_CA IB IB IB For example: if (transport == IB) (link-layer == ETH) will now become: if (query_protocol() == IBOE) Thus we will be able to get rid of the respective transport and link-layer checking, and it will help us to add new protocol/Technology (like OPA) more easier, also with the introduced management helpers, IB management logical will be more clear and easier for extending. Highlights: The long CC list in each patches was complained consider about the maintainability, it was suggested folks to provide their reviewed-by or Acked-by instead, so for those who used to be on the CC list, please provide your signature voluntarily :-) The 'mgmt-helpers' branch of 'g...@github.com:ywang-pb/infiniband-wy.git' contain this series based on the latest 'infiniband/for-next' Patch 1#~14# included all the logical reform, 15#~23# introduced the management helpers. Doug suggested the bitmask mechanism: https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23765.html which could be the plan for future reforming, we prefer that to be another series which focus on semantic and performance. This patch-set is somewhat 'bloated' now and it may be a good timing for staging, I'd like to suggest we focus on improving existed helpers and push all the further reforms into next series ;-) Proposals: Sean: https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23339.html Doug: https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23418.html https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23765.html Jason: https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23425.html Michael Wang (23): IB/Verbs: Implement new callback query_protocol() IB/Verbs: Implement raw management helpers IB/Verbs: Reform IB-core mad/agent/user_mad IB/Verbs: Reform IB-core cm IB/Verbs: Reform IB-core sa_query IB
Re: [PATCH v8 00/23] IB/Verbs: IB Management Helpers
On 05/12/2015 04:24 PM, Doug Ledford wrote: [snip] AFAIK Or was asking to merge the #15~23, and want to reserve the changelog meanwhile reply the cover of prev version (I'm still confused on that...), I've replied but get no respond yet. I can make a v9 to merge the #15~#23 if that could benefit the maintainability, please let me know your opinion :-) I don't think it would make a significant difference. I've pulled the v8 patchset out of patchworks and I'll throw a new branch with it included up to my github repo sometime today. Got it :-) Regards, Michael Wang About the Bug, if it was not introduced in this series, maybe including the fix in next series would be better? Regards, Michael Wang Frankly I vote for the former because as it stands this series does not break directly. It was only after I changed the implementation of rdma_cap_ib_mad that it broke. For the rest of the series. Reviewed-by: Ira Weiny ira.we...@intel.com Tested-by: Ira Weiny ira.we...@intel.com -- Limited to mlx4, qib, and OPA (with additional patches.) On Tue, May 05, 2015 at 02:50:17PM +0200, Michael Wang wrote: Since v7: * Thanks to Doug, Ira, Devesh for the testing :-) * Thanks for the comments from or, Doug, Ira, Jason :-) Please remind me if anything missed :-P * Use rdma_cap_XX() instead of cap_XX() for readability * Remove CC list in git log for maintainability * Use bool as return value * Updated github repository to v8 There are plenty of lengthy code to check the transport type of IB device, or the link layer type of it's port, but actually we are just speculating whether a particular management/feature is supported by the device/port. Thus instead of inferring, we should have our own mechanism for IB management capability/protocol/feature checking, several proposals below. This patch set will introduce query_protocol() to check management requirement instead of inferring from transport and link layer respectively, along with the new enum on protocol type. Mapping List: node-type link-layer transport protocol nes RNICETH IWARP IWARP amso1100 RNICETH IWARP IWARP cxgb3 RNICETH IWARP IWARP cxgb4 RNICETH IWARP IWARP usnic USNIC_UDP ETH USNIC_UDP USNIC_UDP ocrdmaIB_CA ETH IB IBOE mlx4 IB_CA IB/ETH IB IB/IBOE mlx5 IB_CA IB IB IB ehca IB_CA IB IB IB ipath IB_CA IB IB IB mthca IB_CA IB IB IB qib IB_CA IB IB IB For example: if (transport == IB) (link-layer == ETH) will now become: if (query_protocol() == IBOE) Thus we will be able to get rid of the respective transport and link-layer checking, and it will help us to add new protocol/Technology (like OPA) more easier, also with the introduced management helpers, IB management logical will be more clear and easier for extending. Highlights: The long CC list in each patches was complained consider about the maintainability, it was suggested folks to provide their reviewed-by or Acked-by instead, so for those who used to be on the CC list, please provide your signature voluntarily :-) The 'mgmt-helpers' branch of 'g...@github.com:ywang-pb/infiniband-wy.git' contain this series based on the latest 'infiniband/for-next' Patch 1#~14# included all the logical reform, 15#~23# introduced the management helpers. Doug suggested the bitmask mechanism: https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23765.html which could be the plan for future reforming, we prefer that to be another series which focus on semantic and performance. This patch-set is somewhat 'bloated' now and it may be a good timing for staging, I'd like to suggest we focus on improving existed helpers and push all the further reforms into next series ;-) Proposals: Sean: https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23339.html Doug: https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23418.html https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23765.html Jason: https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23425.html Michael Wang (23): IB/Verbs: Implement new callback query_protocol() IB/Verbs: Implement raw management helpers IB/Verbs: Reform IB-core mad/agent/user_mad IB/Verbs: Reform IB-core cm IB/Verbs: Reform IB-core sa_query IB/Verbs: Reform IB-core multicast IB/Verbs: Reform IB-ulp ipoib IB/Verbs: Reform IB-ulp xprtrdma