Re: [PATCH] locking/lockdep: Remove the cross-release locking checks

2017-12-14 Thread Byungchul Park
On Fri, Dec 15, 2017 at 3:24 PM, Theodore Ts'o  wrote:
> On Fri, Dec 15, 2017 at 01:05:43PM +0900, Byungchul Park wrote:
>> For example, in the case of fs issues, for now we can
>> invalidate wait_for_completion() in submit_bio_wait()
>
> And this will spawn a checkpatch.pl ERROR:
>
> ERROR("LOCKDEP",
> "lockdep_no_validate class is reserved for 
> device->mutex.\n" . $herecurr);
>
> This mention in checkpatch.pl is the only documentation I've been able
> to find about lockdep_set_novalidate_class(), by the way.
>
>> ... and re-validate it later. Of course, I really want to find a more
>> fundamental solution though.
>
> Oh, and I was finally able to find the quote that the *only* people
> who are likely to be able to deal with lock annotations are the

Right. I should not have used the word "only", and I apologize
for that.

I meant that they are simply the people who can solve and classify
locks most quickly, assuming all kernel developers are familiar with
lockdep annotations. But even that statement is bad, since, as you
pointed out, no good documentation for that exists. I agree.

> subsystem maintainers.  But if the ways of dealing with lock
> annotations are not documented, such that subsystem maintainers are
> going to have a very hard time figuring out *how* to deal with it, it

Right. I agree with this; as you pointed out, the problem is that it
is not documented well.

> seems that lock classification as a solution to cross-release false
> positives is unlikely:
>
>    From: Byungchul Park 
>    Date: Fri, 8 Dec 2017 18:27:45 +0900
>    Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
>
>    1) Firstly, it's hard to assign lock classes *properly*. By
>    default, it relies on the caller site of lockdep_init_map(),
>    but we need to assign another class manually, where ordering
>    rules are complicated so cannot rely on the caller site. That
>    *only* can be done by experts of the subsystem.
>
> - Ted



-- 
Thanks,
Byungchul


Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2017-12-14 Thread Peter Zijlstra
On Fri, Dec 15, 2017 at 10:12:50AM +0800, jianchao.wang wrote:
> > That only makes it a little better:
> > 
> > Task-A  Worker
> > 
> > write_seqcount_begin()
> > blk_mq_rw_update_state(rq, IN_FLIGHT)
> > blk_add_timer(rq)
> > 
> > schedule_work()
> > 
> > 
> > read_seqcount_begin()
> > while(seq & 1)
> > cpu_relax();
> > 
> Hi Peter
> 
> The current seqcount read side is as below:
>   do {
>   start = read_seqcount_begin(&rq->gstate_seq);


static inline unsigned read_seqcount_begin(const seqcount_t *s)
{
	seqcount_lockdep_reader_access(s);
	return raw_read_seqcount_begin(s);
}

static inline unsigned raw_read_seqcount_begin(const seqcount_t *s)
{
	unsigned ret = __read_seqcount_begin(s);
	smp_rmb();
	return ret;
}

static inline unsigned __read_seqcount_begin(const seqcount_t *s)
{
	unsigned ret;

repeat:
	ret = READ_ONCE(s->sequence);
	if (unlikely(ret & 1)) {	/* odd: a write is in progress ... */
		cpu_relax();		/* ... so read_seqcount_begin() itself spins */
		goto repeat;
	}
	return ret;
}


Re: [PATCH] locking/lockdep: Remove the cross-release locking checks

2017-12-14 Thread Theodore Ts'o
On Fri, Dec 15, 2017 at 01:05:43PM +0900, Byungchul Park wrote:
> For example, in the case of fs issues, for now we can
> invalidate wait_for_completion() in submit_bio_wait()

And this will spawn a checkpatch.pl ERROR:

ERROR("LOCKDEP",
"lockdep_no_validate class is reserved for 
device->mutex.\n" . $herecurr);

This mention in checkpatch.pl is the only documentation I've been able
to find about lockdep_set_novalidate_class(), by the way. 
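
For reference, the annotation itself is a one-liner. A hypothetical
sketch (the lock name here is made up, and checkpatch reserves the real
thing for device->mutex, as the ERROR above says):

#include <linux/mutex.h>
#include <linux/lockdep.h>

static DEFINE_MUTEX(my_mutex);	/* hypothetical driver-private lock */

static void my_driver_init(void)
{
	/* Opt this lock out of lockdep validation entirely. */
	lockdep_set_novalidate_class(&my_mutex);
}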

> ... and re-validate it later. Of course, I really want to find a more
> fundamental solution though.

Oh, and I was finally able to find the quote that the *only* people
who are likely to be able to deal with lock annotations are the
subsystem maintainers.  But if the ways of dealing with lock
annotations are not documented, such that subsystem maintainers are
going to have a very hard time figuring out *how* to deal with it, it
seems that lock classification as a solution to cross-release false
positives is unlikely:
   
   From: Byungchul Park 
   Date: Fri, 8 Dec 2017 18:27:45 +0900
   Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

   1) Firstly, it's hard to assign lock classes *properly*. By
   default, it relies on the caller site of lockdep_init_map(),
   but we need to assign another class manually, where ordering
   rules are complicated so cannot rely on the caller site. That
   *only* can be done by experts of the subsystem.

- Ted


[PATCH IMPROVEMENT] block, bfq: consider recent past to reduce soft-real-time false positives

2017-12-14 Thread Paolo Valente
Hi,
this patch reduces false positives in the detection of bfq_queues
associated with soft real-time applications. As such, this patch
reduces the damage caused by these false positives to other
applications. The patch proved to be very effective, as detailed in
the commit message. A sort of counterpart of this patch should follow
soon, this time for false positives in the detection of cooperating
processes (processes doing I/O close to each other).

I guess it is evidently too late for 4.15, so, Jens, please consider
this patch for 4.16, if, of course, the patch turns out to be OK.

Thanks,
Paolo

Paolo Valente (1):
  block, bfq: consider also past I/O in soft real-time detection

 block/bfq-iosched.c | 115 
 1 file changed, 81 insertions(+), 34 deletions(-)

--
2.10.0


[PATCH IMPROVEMENT] block, bfq: consider also past I/O in soft real-time detection

2017-12-14 Thread Paolo Valente
BFQ privileges the I/O of soft real-time applications, such as video
players, to guarantee these applications a high bandwidth and a low
latency. In this respect, it is not easy to correctly detect when an
application is soft real-time. A particularly nasty false positive is
that of an I/O-bound application that occasionally happens to meet all
the requirements to be deemed soft real-time. After being detected as
soft real-time, such an application monopolizes the device. Fortunately,
BFQ soon realizes that the application is actually not soft real-time
and suspends its every privilege. Yet, the application may happen to be
wrongly detected as soft real-time again, and so on.

As highlighted by our tests, this problem causes BFQ to occasionally
fail to guarantee a high responsiveness, in the presence of heavy
background I/O workloads. The reason is that the background workload
happens to be detected as soft real-time, more or less frequently,
during the execution of the interactive task under test. To give an
idea, because of this problem, Libreoffice Writer occasionally takes 8
seconds, instead of 3, to start up, if there are sequential reads and
writes in the background, on a Kingston SSDNow V300.

This commit addresses this issue by leveraging the following facts.

The reason why some applications are detected as soft real-time, despite
all the BFQ checks meant to avoid false positives, is simply that, during
periods of high CPU or storage-device load, I/O-bound applications may
happen to do I/O slowly enough to meet all the soft real-time
requirements and pass all the extra BFQ checks. Yet, this happens only
for limited time periods: slow-speed time intervals are usually
interspersed between other time intervals during which these
applications do I/O at a very high speed. To exploit these facts, this
commit introduces a small change in the detection of soft real-time
behavior: the recent past is now systematically taken into account, and
the higher the speed was in the recent past, the later the next I/O must
arrive for the application to be considered soft real-time. At the
beginning of a slow-speed interval, the minimum arrival time allowed for
the next I/O usually happens to still be so high as to fall *after* the
end of the slow-speed period itself. As a consequence, the application
does not risk being deemed soft real-time during the slow-speed
interval. Then, during the next high-speed interval, the application
evidently cannot be deemed soft real-time (exactly because of its
speed), and so on.

This extra filtering proved to be rather effective: in the above test,
the frequency of false positives became so low that the start-up time
was 3 seconds in all iterations (apart from occasional outliers, caused
by page-cache-management issues, which are outside the scope of this
commit and cannot be solved by an I/O scheduler).
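
To make the rule concrete, here is a minimal, self-contained userspace
sketch of the idea described above (illustrative only, not the actual
bfq code: the function name, the HZ value and the rate threshold are
all assumptions):

#include <stdio.h>

#define HZ               250	/* jiffies per second (assumed) */
#define MAX_SOFTRT_RATE 7000	/* sectors/sec: above this, not soft-rt (assumed) */

/*
 * Earliest instant (in jiffies) at which a queue may be deemed soft
 * real-time: the more sectors served while backlogged in the recent
 * past, the further this instant is pushed into the future.
 */
static unsigned long softrt_next_start(unsigned long now,
				       unsigned long sectors_served,
				       unsigned long backlogged_since)
{
	unsigned long min_start = backlogged_since +
		HZ * sectors_served / MAX_SOFTRT_RATE;

	return min_start > now ? min_start : now;
}

int main(void)
{
	/* Heavy recent I/O: 140000 sectors since jiffy 1000 pushes the
	 * threshold to jiffy 6000, well past "now" (2000). */
	printf("%lu\n", softrt_next_start(2000, 140000, 1000));
	/* Light recent I/O barely moves it: threshold stays at "now". */
	printf("%lu\n", softrt_next_start(2000, 700, 1000));
	return 0;
}

An application then qualifies as soft real-time only if its next I/O
arrives after the returned instant, which, at the start of a slow-speed
interval, typically falls after the interval itself has ended.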

Signed-off-by: Paolo Valente 
Signed-off-by: Angelo Ruocco 
---
 block/bfq-iosched.c | 115 
 1 file changed, 81 insertions(+), 34 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index bcb6d21..a9e06217 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2917,45 +2917,87 @@ static bool bfq_bfqq_is_slow(struct bfq_data *bfqd, struct bfq_queue *bfqq,
  * whereas soft_rt_next_start is set to infinity for applications that do
  * not.
  *
- * Unfortunately, even a greedy application may happen to behave in an
- * isochronous way if the CPU load is high. In fact, the application may
- * stop issuing requests while the CPUs are busy serving other processes,
- * then restart, then stop again for a while, and so on. In addition, if
- * the disk achieves a low enough throughput with the request pattern
- * issued by the application (e.g., because the request pattern is random
- * and/or the device is slow), then the application may meet the above
- * bandwidth requirement too. To prevent such a greedy application to be
- * deemed as soft real-time, a further rule is used in the computation of
- * soft_rt_next_start: soft_rt_next_start must be higher than the current
- * time plus the maximum time for which the arrival of a request is waited
- * for when a sync queue becomes idle, namely bfqd->bfq_slice_idle.
- * This filters out greedy applications, as the latter issue instead their
- * next request as soon as possible after the last one has been completed
- * (in contrast, when a batch of requests is completed, a soft real-time
- * application spends some time processing data).
+ * Unfortunately, even a greedy (i.e., I/O-bound) application may
+ * happen to meet, occasionally or systematically, both the above
+ * bandwidth and isochrony requirements. This may happen at least in
+ * the following circumstances. First, if the CPU load is high. The
+ * application may stop issuing requests while the CPUs are busy
+ * serving other processes, then restart, then stop again for a while,
+ * 

How to Persistent nvme Storage?

2017-12-14 Thread Tony Yang
Dear All:

  I download the new nvme multipath version (git clone
git.infradead.org/nvme.git),I want to know how to use the
  multipath, because I want use udev to Persistent Storage,But
when use scsi_id command not found any messages.
  What method can nvme achieve multi-path? Can you provide an
example? Thank you

  [root@udev]# lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sdb             8:16   0   1.8T  0 disk
dfa           252:0    0   2.9T  0 disk
├─dfa5        252:5    0 734.8G  0 part
├─dfa3        252:3    0 745.1G  0 part
├─dfa1        252:1    0   9.3G  0 part
├─dfa4        252:4    0 745.1G  0 part
└─dfa2        252:2    0 745.1G  0 part
nvme1n1       259:10   0   2.9T  0 disk
├─nvme1n1p5   259:15   0 734.8G  0 part
├─nvme1n1p3   259:13   0 745.1G  0 part
├─nvme1n1p1   259:11   0   9.3G  0 part
├─nvme1n1p4   259:14   0 745.1G  0 part
└─nvme1n1p2   259:12   0 745.1G  0 part
loop0           7:0    0   3.8G  0 loop /mnt/cdrom
sdc             8:32   0 557.9G  0 disk
├─sdc2          8:34   0   512M  0 part /boot
├─sdc3          8:35   0 557.3G  0 part
│ ├─rhel-swap 253:1    0     4G  0 lvm  [SWAP]
│ └─rhel-root 253:0    0 553.3G  0 lvm  /
└─sdc1          8:33   0   100M  0 part /boot/efi
sda             8:0    0   1.8T  0 disk
nvme0n1       259:0    0   1.8T  0 disk
├─nvme0n1p5   259:5    0   100G  0 part
├─nvme0n1p3   259:3    0     8G  0 part
├─nvme0n1p1   259:1    0     8G  0 part
├─nvme0n1p8   259:8    0   100G  0 part
├─nvme0n1p6   259:6    0   100G  0 part
├─nvme0n1p4   259:4    0     1K  0 part
├─nvme0n1p2   259:2    0     8G  0 part
├─nvme0n1p9   259:9    0   100G  0 part
└─nvme0n1p7   259:7    0   100G  0 part
[root@udev]# /usr/lib/udev/scsi_id -g -u -d /dev/sdb
3518f2920201113ff
[root@udev]# /usr/lib/udev/scsi_id -g -u -d /dev/dfa
[root@udev]# /usr/lib/udev/scsi_id -g -u -d /dev/dfa5
[root@udev]# /usr/lib/udev/scsi_id -g -u -d /dev/nvme1n1p5


Re: [PATCH] locking/lockdep: Remove the cross-release locking checks

2017-12-14 Thread Byungchul Park
On Thu, Dec 14, 2017 at 2:01 PM, Byungchul Park
 wrote:
> On Wed, Dec 13, 2017 at 7:46 PM, Ingo Molnar  wrote:
>>
>> * Byungchul Park  wrote:
>>
>>> Lockdep works based on the following:
>>>
>>>(1) Classifying locks properly
>>>(2) Checking relationship between the classes
>>>
>>> If (1) is not good or (2) is not good, then we
>>> might get false positives.
>>>
>>> For (1), we don't have to classify locks 100%
>>> properly, but well enough that lockdep works.
>>>
>>> For (2), we should have a mechanism w/o
>>> logical defects.
>>>
>>> Cross-release added an additional capability to
>>> (2) and requires (1) to be classified more precisely.
>>>
>>> Since the current classification level is too low for
>>> cross-release to work, false positives are being
>>> reported frequently when cross-release is enabled.
>>> Yes. It's an obvious problem. It needs to be off by
>>> default until the classification is done to the level
>>> that cross-release requires.
>>>
>>> But, the logic (2) is valid and logically true. Please
>>> keep the code, mechanism, and logic.
>>
>> Just to give a full context to everyone: the patch that removes the
>> cross-release locking checks was Cc:-ed to lkml; I've attached the
>> patch below again.
>>
>> In general, as described in the changelog, the cross-release checks
>> were historically just too painful (first they were too slow, and they
>> also had a lot of false positives), and today, 4 months after its
>> introduction, the cross-release checks *still* produce numerous false
>> positives, especially in the filesystem space, but the
>> continuous-integration testing folks were still having trouble with
>> kthread locking patterns causing false positives:
>
> I admit that false positives are the main problem, and that they
> should be solved.
>
> I am willing to try my best to solve that. However, as you may know
> from the introduction of lockdep itself, it's not something that I can
> do easily and quickly on my own. It takes time to annotate things
> properly to avoid false positives.
>
>>   https://bugs.freedesktop.org/show_bug.cgi?id=103950
>>
>> which were resulting in two bad reactions:
>>
>>  - turning off lockdep
>>
>>  - writing patches that uglified unrelated subsystems
>
> I can't give you a solution at the moment, but it's something we
> should think about more, so that we classify locks properly rather
> than uglify them.
>
> Even without cross-release, once we someday start to add lock_acquire()
> in submit_bio_wait() in the ugly way, to cover wait_for_completion(),
> we will face this problem again. It's not an easy problem; however,
> it's worth trying.
>
>> So while I appreciate the fixes that resulted from running
>> cross-release, there's still numerous false positives, months after
>> its introduction, which is unacceptable. For us to have this feature
>> it has to have roughly similar qualities as compiler warnings:
>>
>>  - there's a "zero false positive warnings" policy
>
> It's almost impossible... but it needs time. I wonder if lockdep did
> any better at the beginning. If I can, I want to fix as many false
> positives as possible by myself, but unluckily they do not happen on
> my machine. I want to be informed of them by others, which means
> keeping it in the mainline tree.
>
>>  - plus any widespread changes to avoid warnings have to improve the
>>    code, not make it uglier.
>
> Agree.
>
>> Lockdep itself is following that policy: the default state is that it
>> produces no warnings upstream, and any annotations added to unrelated
>> code document the locking hierarchies.
>>
>> While technically we could keep the cross-release checking code
>> upstream and turn it off by default via the Kconfig switch, I'm not a
>> big believer in such a policy for complex debugging code:
>>
>>  - We already did that for v4.14, two months ago:
>>
>>  b483cf3bc249: locking/lockdep: Disable cross-release features for now
>
> The main reason for disabling it was the performance regression.
>
>>
>> ... and re-enabled it for v4.15 - but the false positives are still
>> not fixed.
>
> Right. But all the false positives cannot be fixed in a short period.
> We need time to annotate them one by one.
>
>>  - either the cross-release checking code can be fixed and then having it 
>> off by
>
> It's not a problem in the cross-release checking code itself. The way
> to fix it is to add additional annotations or to change the way lock
> classes are assigned.
>
>>    default is just wrong, because we can apply the fixed code again
>>    once it's fixed.
>>
>>  - or it cannot be fixed (or we don't have the manpower/interest to fix
>>    it), in which case having it off is only delaying the inevitable.
>
> The more precisely lock classes are assigned, the more false positives
> get fixed. It's not something that gets worse as time goes on.
>
>> In any case, for v4.15 it's clear that the false positives are too numerous.
>>
>> Thanks,
>>
>> 

Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2017-12-14 Thread Mike Galbraith
On Thu, 2017-12-14 at 22:54 +0100, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 09:42:48PM +, Bart Van Assche wrote:
> 
> > Some time ago the block layer was changed to handle timeouts in thread
> > context instead of interrupt context. See also commit 287922eb0b18
> > ("block: defer timeouts to a workqueue").
> 
> That only makes it a little better:
> 
>   Task-A  Worker
> 
>   write_seqcount_begin()
>   blk_mq_rw_update_state(rq, IN_FLIGHT)
>   blk_add_timer(rq)
>   
>   schedule_work()
>   
>   
>   read_seqcount_begin()
>   while(seq & 1)
>   cpu_relax();
> 
> 
> Now normally this isn't fatal because Worker will simply spin its entire
> time slice away and we'll eventually schedule our Task-A back in, which
> will complete the seqcount and things will work.
> 
> But if, for some reason, our Worker was to have RT priority higher than
> our Task-A we'd be up some creek without no paddles.

Most kthreads, including kworkers, are very frequently SCHED_FIFO here.

-Mike


Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2017-12-14 Thread jianchao.wang


On 12/15/2017 05:54 AM, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 09:42:48PM +, Bart Van Assche wrote:
>> On Thu, 2017-12-14 at 21:20 +0100, Peter Zijlstra wrote:
>>> On Thu, Dec 14, 2017 at 06:51:11PM +, Bart Van Assche wrote:
>>>> On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
>>>>> +	write_seqcount_begin(&rq->gstate_seq);
>>>>> +	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
>>>>> +	blk_add_timer(rq);
>>>>> +	write_seqcount_end(&rq->gstate_seq);
>>>>
>>>> My understanding is that both write_seqcount_begin() and
>>>> write_seqcount_end() trigger a write memory barrier. Is a seqcount
>>>> really faster than a spinlock?
>>>
>>> Yes lots, no atomic operations and no waiting.
>>>
>>> The only constraint for write_seqlock is that there must not be any
>>> concurrency.
>>>
>>> But now that I look at this again, TJ, why can't the below happen?
>>>
>>> write_seqlock_begin();
>>> blk_mq_rq_update_state(rq, IN_FLIGHT);
>>> blk_add_timer(rq);
>>> 
>>> read_seqcount_begin()
>>> while (seq & 1)
>>> cpurelax();
>>> // life-lock
>>> 
>>> write_seqlock_end();
>>
>> Hello Peter,
>>
>> Some time ago the block layer was changed to handle timeouts in thread
>> context instead of interrupt context. See also commit 287922eb0b18
>> ("block: defer timeouts to a workqueue").
> 
> That only makes it a little better:
> 
>   Task-A  Worker
> 
>   write_seqcount_begin()
>   blk_mq_rw_update_state(rq, IN_FLIGHT)
>   blk_add_timer(rq)
>   
>   schedule_work()
>   
>   
>   read_seqcount_begin()
>   while(seq & 1)
>   cpu_relax();
> 
Hi Peter

The current seqcount read side is as below:

	do {
		start = read_seqcount_begin(&rq->gstate_seq);
		gstate = READ_ONCE(rq->gstate);
		deadline = rq->deadline;
	} while (read_seqcount_retry(&rq->gstate_seq, start));

read_seqcount_retry() doesn't check bit 0; it only checks whether the
value saved by read_seqcount_begin() is equal to the current value of
the seqcount. Please refer to:

static inline int __read_seqcount_retry(const seqcount_t *s, unsigned start)
{
	return unlikely(s->sequence != start);
}

Thanks
Jianchao
> 
> Now normally this isn't fatal because Worker will simply spin its entire
> time slice away and we'll eventually schedule our Task-A back in, which
> will complete the seqcount and things will work.
> 
> But if, for some reason, our Worker was to have RT priority higher than
> our Task-A we'd be up some creek without no paddles.
> 
> We don't happen to have preemption of IRQs off here? That would fix
> things nicely.
> 


Re: Regression with a0747a859ef6 ("bdi: add error handle for bdi_debug_register")

2017-12-14 Thread Bruno Wolff III

On Fri, Dec 15, 2017 at 10:04:32 +0800,
 weiping zhang  wrote:


so I want to see the WARN_ON output as you pasted before; my DEBUG log
will also help to find which step fails.


The previous time I also used journalctl for the output, but maybe I
used slightly different options. I'll look and see if it is in the
journal for the last bad boot. I can do that from home.


Re: Regression with a0747a859ef6 ("bdi: add error handle for bdi_debug_register")

2017-12-14 Thread weiping zhang
2017-12-15 9:44 GMT+08:00 Bruno Wolff III :
> On Fri, Dec 15, 2017 at 09:22:21 +0800,
>  weiping zhang  wrote:
>>
>>
>> Thanks for your testing, but I can't find the WARN_ON in
>> device_add_disk in this boot1.log. Could you help reproduce that
>> issue? And can this issue be triggered at every bootup?
>
>
> I don't know what you need for the first question. When I am physically at
> the machine I can do test reboots. If you have something specific you want
> me to try I should be able to.
>
> Every time I boot with the problem commit, the boot never completes.
> However it does seem to get pretty far. I get multiple register dumps
> every time. After a while (a few minutes) I reboot to a working kernel.
>
> The output I included is from: journalctl -k -b -1
> If you think it would be better to see more than dmesg output let me know.
I just want to know what the WARN_ON in device_add_disk reports.
If bdi_register_owner returns an error code, it may have failed at any
of the following steps:

bdi_debug_root is NULL
bdi->debug_dir is NULL
bdi->debug_stats is NULL

So I want to see the WARN_ON output as you pasted before; my DEBUG log
will also help to find which step fails.


Re: Regression with a0747a859ef6 ("bdi: add error handle for bdi_debug_register")

2017-12-14 Thread Bruno Wolff III

On Fri, Dec 15, 2017 at 09:22:21 +0800,
 weiping zhang  wrote:


Thanks for your testing, but I can't find the WARN_ON in device_add_disk
in this boot1.log. Could you help reproduce that issue? And can this
issue be triggered at every bootup?


I don't know what you need for the first question. When I am physically at 
the machine I can do test reboots. If you have something specific you want 
me to try I should be able to.


Every time I boot with the problem commit, the boot never completes. However 
it does seem to get pretty far. I get multiple register dumps every time. 
After a while (a few minutes) I reboot to a working kernel.


The output I included is from: journalctl -k -b -1
If you think it would be better to see more than dmesg output let me know.


Re: Regression with a0747a859ef6 ("bdi: add error handle for bdi_debug_register")

2017-12-14 Thread weiping zhang
2017-12-14 23:41 GMT+08:00 Bruno Wolff III :
> On Thu, Dec 14, 2017 at 18:09:27 +0800,
>  weiping zhang  wrote:
>>
>>
>> It seems something is wrong with the bdi debugfs registration. Could
>> you help test the following debug patch? I added some debug logging;
>> there is no functional change. Thanks.
>
>
> I applied your patch to d39a01eff9af1045f6e30ff9db40310517c4b45f and there
> were some new debug messages in the dmesg output. Hopefully this helps. I
> also added the patch and output to the Fedora bug for people following
> there.

Hi Bruno,

Thanks for your testing, but I can't find the WARN_ON in device_add_disk
in this boot1.log. Could you help reproduce that issue? And can this
issue be triggered at every bootup?

--
Thanks
weiping


Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2017-12-14 Thread Peter Zijlstra
On Thu, Dec 14, 2017 at 09:42:48PM +, Bart Van Assche wrote:
> On Thu, 2017-12-14 at 21:20 +0100, Peter Zijlstra wrote:
> > On Thu, Dec 14, 2017 at 06:51:11PM +, Bart Van Assche wrote:
> > > On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > > > +	write_seqcount_begin(&rq->gstate_seq);
> > > > +	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
> > > > +	blk_add_timer(rq);
> > > > +	write_seqcount_end(&rq->gstate_seq);
> > > 
> > > My understanding is that both write_seqcount_begin() and
> > > write_seqcount_end() trigger a write memory barrier. Is a seqcount
> > > really faster than a spinlock?
> > 
> > Yes lots, no atomic operations and no waiting.
> > 
> > The only constraint for write_seqlock is that there must not be any
> > concurrency.
> > 
> > But now that I look at this again, TJ, why can't the below happen?
> > 
> > write_seqlock_begin();
> > blk_mq_rq_update_state(rq, IN_FLIGHT);
> > blk_add_timer(rq);
> > 
> > read_seqcount_begin()
> > while (seq & 1)
> > cpurelax();
> > // life-lock
> > 
> > write_seqlock_end();
> 
> Hello Peter,
> 
> Some time ago the block layer was changed to handle timeouts in thread context
> instead of interrupt context. See also commit 287922eb0b18 ("block: defer
> timeouts to a workqueue").

That only makes it a little better:

Task-A  Worker

write_seqcount_begin()
blk_mq_rw_update_state(rq, IN_FLIGHT)
blk_add_timer(rq)

schedule_work()


read_seqcount_begin()
while(seq & 1)
cpu_relax();


Now normally this isn't fatal because Worker will simply spin its entire
time slice away and we'll eventually schedule our Task-A back in, which
will complete the seqcount and things will work.

But if, for some reason, our Worker was to have RT priority higher than
our Task-A we'd be up some creek without no paddles.

We don't happen to have preemption of IRQs off here? That would fix
things nicely.
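
For reference, the scenario above can be modeled in userspace with a
couple of C11 atomics standing in for the kernel's seqcount (a sketch
under those assumptions, not kernel code): if the writer is preempted
between begin and end, a reader that spins on an odd sequence makes no
progress, and a reader running at higher RT priority on the writer's
CPU would spin forever.

#include <stdatomic.h>
#include <stdio.h>

static atomic_uint seq;		/* stand-in for seqcount_t */

static void write_begin(void) { atomic_fetch_add(&seq, 1); }	/* seq becomes odd  */
static void write_end(void)   { atomic_fetch_add(&seq, 1); }	/* seq becomes even */

static unsigned read_begin(void)
{
	unsigned s;

	/* Like __read_seqcount_begin(): spin while a write is in progress. */
	while ((s = atomic_load(&seq)) & 1)
		;	/* cpu_relax() in the kernel */
	return s;
}

int main(void)
{
	write_begin();
	/* If the writer were preempted here, read_begin() would now spin;
	 * a higher-priority RT reader on this CPU would never let the
	 * writer run again: the life-lock above. */
	write_end();
	printf("seq = %u\n", read_begin());
	return 0;
}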


Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2017-12-14 Thread Bart Van Assche
On Thu, 2017-12-14 at 21:20 +0100, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 06:51:11PM +, Bart Van Assche wrote:
> > On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > > +	write_seqcount_begin(&rq->gstate_seq);
> > > +	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
> > > +	blk_add_timer(rq);
> > > +	write_seqcount_end(&rq->gstate_seq);
> > 
> > My understanding is that both write_seqcount_begin() and
> > write_seqcount_end() trigger a write memory barrier. Is a seqcount
> > really faster than a spinlock?
> 
> Yes lots, no atomic operations and no waiting.
> 
> The only constraint for write_seqlock is that there must not be any
> concurrency.
> 
> But now that I look at this again, TJ, why can't the below happen?
> 
>   write_seqlock_begin();
>   blk_mq_rq_update_state(rq, IN_FLIGHT);
>   blk_add_timer(rq);
>   
>   read_seqcount_begin()
>   while (seq & 1)
>   cpurelax();
>   // life-lock
>   
>   write_seqlock_end();

Hello Peter,

Some time ago the block layer was changed to handle timeouts in thread context
instead of interrupt context. See also commit 287922eb0b18 ("block: defer
timeouts to a workqueue").

Bart.

Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2017-12-14 Thread Bart Van Assche
On Thu, 2017-12-14 at 11:19 -0800, t...@kernel.org wrote:
> On Thu, Dec 14, 2017 at 06:51:11PM +, Bart Van Assche wrote:
> > On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > > --- a/block/blk-core.c
> > > +++ b/block/blk-core.c
> > > @@ -126,6 +126,8 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
> > >  	rq->start_time = jiffies;
> > >  	set_start_time_ns(rq);
> > >  	rq->part = NULL;
> > > +	seqcount_init(&rq->gstate_seq);
> > > +	u64_stats_init(&rq->aborted_gstate_sync);
> > >  }
> > >  EXPORT_SYMBOL(blk_rq_init);
> > 
> > Sorry but the above change looks ugly to me. My understanding is that 
> > blk_rq_init() is only used inside the block layer to initialize legacy block
> > layer requests while gstate_seq and aborted_gstate_sync are only relevant
> > for blk-mq requests. Wouldn't it be better to avoid that blk_rq_init() is
> > called for blk-mq requests such that the above change can be left out? The
> > only callers outside the block layer core of blk_rq_init() I know of are
> > ide_prep_sense() and scsi_ioctl_reset(). I can help with converting the SCSI
> > code if you want.
> 
> This is also used by flush path.  We probably should clean that up,
> but let's worry about that later cuz flush handling has enough of its
> own complications.

We may have a different opinion about this but I think it is more than a
detail. This patch needs gstate_seq and aborted_gstate_sync to be
preserved across request state transitions from MQ_RQ_IN_FLIGHT to
MQ_RQ_IDLE. blk_mq_init_request() is called at request allocation time
so it's the right context to initialize gstate_seq and
aborted_gstate_sync. blk_rq_init() however is called before every use of
a request. Sorry but I'm not enthusiastic about code in blk_rq_init()
that reinitializes state information that should survive request reuse.

Bart.

Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2017-12-14 Thread Peter Zijlstra
On Thu, Dec 14, 2017 at 06:51:11PM +, Bart Van Assche wrote:
> On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > +	write_seqcount_begin(&rq->gstate_seq);
> > +	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
> > +	blk_add_timer(rq);
> > +	write_seqcount_end(&rq->gstate_seq);
> 
> My understanding is that both write_seqcount_begin() and write_seqcount_end()
> trigger a write memory barrier. Is a seqcount really faster than a spinlock?

Yes lots, no atomic operations and no waiting.

The only constraint for write_seqlock is that there must not be any
concurrency.

But now that I look at this again, TJ, why can't the below happen?

write_seqlock_begin();
blk_mq_rq_update_state(rq, IN_FLIGHT);
blk_add_timer(rq);

read_seqcount_begin()
while (seq & 1)
cpurelax();
// life-lock

write_seqlock_end();


Re: [PATCH 4/6] blk-mq: make blk_abort_request() trigger timeout path

2017-12-14 Thread t...@kernel.org
Hello,

On Thu, Dec 14, 2017 at 06:56:55PM +, Bart Van Assche wrote:
> On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> >  void blk_abort_request(struct request *req)
> >  {
> > -	if (blk_mark_rq_complete(req))
> > -		return;
> >
> >  	if (req->q->mq_ops) {
> > -		blk_mq_rq_timed_out(req, false);
> > +		req->deadline = jiffies;
> > +		mod_timer(&req->q->timeout, 0);
> >  	} else {
> > +		if (blk_mark_rq_complete(req))
> > +			return;
> >  		blk_delete_timer(req);
> >  		blk_rq_timed_out(req);
> >  	}
> 
> This patch makes blk_abort_request() asynchronous for blk-mq. Have all callers
> been audited to verify whether this change is safe?

I *think* so.  For all the ata related parts, I know they are.  For
mtip32xx and dasd_ioctl, it seems safe, but I can't tell for sure.
Will cc the respective maintainers on the next posting.

Thanks.

-- 
tejun


Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2017-12-14 Thread t...@kernel.org
Hello,

On Thu, Dec 14, 2017 at 06:51:11PM +, Bart Van Assche wrote:
> On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > rules.  Unfortunatley, it contains quite a few holes.
>           ^
>           Unfortunately?
> 
> > While this change makes REQ_ATOM_COMPLETE synchornization unnecessary
>                                             ^^^
>                                             synchronization?

lol, believe it or not, my english spelling is a lot better than my
korean.  Will fix them.

> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -126,6 +126,8 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
> >  	rq->start_time = jiffies;
> >  	set_start_time_ns(rq);
> >  	rq->part = NULL;
> > +	seqcount_init(&rq->gstate_seq);
> > +	u64_stats_init(&rq->aborted_gstate_sync);
> >  }
> >  EXPORT_SYMBOL(blk_rq_init);
> 
> Sorry but the above change looks ugly to me. My understanding is that 
> blk_rq_init() is only used inside the block layer to initialize legacy block
> layer requests while gstate_seq and aborted_gstate_sync are only relevant
> for blk-mq requests. Wouldn't it be better to avoid that blk_rq_init() is
> called for blk-mq requests such that the above change can be left out? The
> only callers outside the block layer core of blk_rq_init() I know of are
> ide_prep_sense() and scsi_ioctl_reset(). I can help with converting the SCSI
> code if you want.

This is also used by flush path.  We probably should clean that up,
but let's worry about that later cuz flush handling has enough of its
own complications.

> > +	write_seqcount_begin(&rq->gstate_seq);
> > +	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
> > +	blk_add_timer(rq);
> > +	write_seqcount_end(&rq->gstate_seq);
> 
> My understanding is that both write_seqcount_begin() and write_seqcount_end()
> trigger a write memory barrier. Is a seqcount really faster than a spinlock?

A write memory barrier has no cost on x86 and usually pretty low cost
elsewhere too, and these fields are likely in the same cacheline as
other rq fields.

> > @@ -792,6 +811,14 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
> >  		__blk_mq_complete_request(req);
> >  		break;
> >  	case BLK_EH_RESET_TIMER:
> > +		/*
> > +		 * As nothing prevents from completion happening while
> > +		 * ->aborted_gstate is set, this may lead to ignored
> > +		 * completions and further spurious timeouts.
> > +		 */
> > +		u64_stats_update_begin(&req->aborted_gstate_sync);
> > +		req->aborted_gstate = 0;
> > +		u64_stats_update_end(&req->aborted_gstate_sync);
> 
> If a blk-mq request is resubmitted 2**62 times, can that result in the
> above code setting aborted_gstate to the same value as gstate? Isn't
> that a bug? If so, how about setting aborted_gstate in the above code
> to e.g. gstate ^ (2**63)?

A request gets aborted only if the state is in-flight, and 0 isn't an
in-flight value.  Also, how many years would 2^62 reuses take?

> > +   struct u64_stats_sync aborted_gstate_sync;
> > +   u64 aborted_gstate;
> > +
> > unsigned long deadline;
> > struct list_head timeout_list;
> 
> Why are gstate and aborted_gstate 64-bit variables? What makes you think that
> 32 bits would not be enough?

Because 32 bits puts it in the range where a false hit is still
theoretically possible in a reasonable amount of time.
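
For scale, a quick back-of-the-envelope check (a userspace sketch; the
reuse rate is a deliberately generous assumption):

#include <stdio.h>

int main(void)
{
	const double reuses_per_sec = 1e6;	/* assumed, generous */
	const double secs_per_year  = 3.15e7;

	/* 64-bit gstate: ~2^62 generations before a stale value can match. */
	printf("2^62 reuses: %.0f years\n",
	       4611686018427387904.0 / reuses_per_sec / secs_per_year);

	/* 32-bit gstate minus a few state bits: ~2^30 generations. */
	printf("2^30 reuses: %.1f minutes\n",
	       1073741824.0 / reuses_per_sec / 60.0);
	return 0;
}

So a 32-bit generation could wrap within the lifetime of a long-lived
request, while a 64-bit one cannot in any realistic timeframe.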

Thanks.

-- 
tejun


Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2017-12-14 Thread Bart Van Assche
On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> rules.  Unfortunatley, it contains quite a few holes.
          ^
          Unfortunately?

> While this change makes REQ_ATOM_COMPLETE synchornization unnecessary
                                            ^^^
                                            synchronization?

> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -126,6 +126,8 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
>  	rq->start_time = jiffies;
>  	set_start_time_ns(rq);
>  	rq->part = NULL;
> +	seqcount_init(&rq->gstate_seq);
> +	u64_stats_init(&rq->aborted_gstate_sync);
>  }
>  EXPORT_SYMBOL(blk_rq_init);

Sorry but the above change looks ugly to me. My understanding is that 
blk_rq_init() is only used inside the block layer to initialize legacy block
layer requests while gstate_seq and aborted_gstate_sync are only relevant
for blk-mq requests. Wouldn't it be better to avoid that blk_rq_init() is
called for blk-mq requests such that the above change can be left out? The
only callers outside the block layer core of blk_rq_init() I know of are
ide_prep_sense() and scsi_ioctl_reset(). I can help with converting the SCSI
code if you want.

> +	write_seqcount_begin(&rq->gstate_seq);
> +	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
> +	blk_add_timer(rq);
> +	write_seqcount_end(&rq->gstate_seq);

My understanding is that both write_seqcount_begin() and write_seqcount_end()
trigger a write memory barrier. Is a seqcount really faster than a spinlock?

> 
> @@ -792,6 +811,14 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  		__blk_mq_complete_request(req);
>  		break;
>  	case BLK_EH_RESET_TIMER:
> +		/*
> +		 * As nothing prevents from completion happening while
> +		 * ->aborted_gstate is set, this may lead to ignored
> +		 * completions and further spurious timeouts.
> +		 */
> +		u64_stats_update_begin(&req->aborted_gstate_sync);
> +		req->aborted_gstate = 0;
> +		u64_stats_update_end(&req->aborted_gstate_sync);

If a blk-mq request is resubmitted 2**62 times, can that result in the
above code setting aborted_gstate to the same value as gstate? Isn't
that a bug? If so, how about setting aborted_gstate in the above code to
e.g. gstate ^ (2**63)?

> @@ -228,6 +230,27 @@ struct request {
>  
>   unsigned short write_hint;
>  
> + /*
> +  * On blk-mq, the lower bits of ->gstate carry the MQ_RQ_* state
> +  * value and the upper bits the generation number which is
> +  * monotonically incremented and used to distinguish the reuse
> +  * instances.
> +  *
> +  * ->gstate_seq allows updates to ->gstate and other fields
> +  * (currently ->deadline) during request start to be read
> +  * atomically from the timeout path, so that it can operate on a
> +  * coherent set of information.
> +  */
> + seqcount_t gstate_seq;
> + u64 gstate;
> +
> + /*
> +  * ->aborted_gstate is used by the timeout to claim a specific
> +  * recycle instance of this request.  See blk_mq_timeout_work().
> +  */
> + struct u64_stats_sync aborted_gstate_sync;
> + u64 aborted_gstate;
> +
>   unsigned long deadline;
>   struct list_head timeout_list;

Why are gstate and aborted_gstate 64-bit variables? What makes you think that
32 bits would not be enough?

Thanks,

Bart.

Re: [PATCH 1/6] blk-mq: protect completion path with RCU

2017-12-14 Thread t...@kernel.org
On Thu, Dec 14, 2017 at 05:01:06PM +, Bart Van Assche wrote:
> On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > +	} else {
> > +		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
> > +		if (!blk_mark_rq_complete(rq))
> > +			__blk_mq_complete_request(rq);
> > +		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
> 
> Hello Tejun,
> 
> The name queue_rq_srcu was chosen to reflect the original use of that
> structure, namely to protect .queue_rq() calls. Your patch series
> broadens the use of that

Yeah, will add a patch to rename it.

> srcu structure so I would appreciate it if it would be renamed, e.g.
> into "srcu". See also commit 6a83e74d214a ("blk-mq: Introduce
> blk_mq_quiesce_queue()").

Ah yeah, it'd be nice to have the [s]rcu synchronize calls factored
out.

Thanks.

-- 
tejun


Re: [PATCH 1/6] blk-mq: protect completion path with RCU

2017-12-14 Thread Bart Van Assche
On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> +	} else {
> +		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
> +		if (!blk_mark_rq_complete(rq))
> +			__blk_mq_complete_request(rq);
> +		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);

Hello Tejun,

The name queue_rq_srcu was chosen to reflect the original use of that
structure, namely to protect .queue_rq() calls. Your patch series
broadens the use of that srcu structure so I would appreciate it if it
would be renamed, e.g. into "srcu". See also commit 6a83e74d214a
("blk-mq: Introduce blk_mq_quiesce_queue()").

Thanks,

Bart.

Re: Regression with a0747a859ef6 ("bdi: add error handle for bdi_debug_register")

2017-12-14 Thread Bruno Wolff III

On Thu, Dec 14, 2017 at 18:09:27 +0800,
 weiping zhang  wrote:


It seems something is wrong with the bdi debugfs registration. Could you
help test the following debug patch? I added some debug logging; there
is no functional change. Thanks.


I applied your patch to d39a01eff9af1045f6e30ff9db40310517c4b45f and there 
were some new debug messages in the dmesg output. Hopefully this helps. I 
also added the patch and output to the Fedora bug for people following there.
-- Logs begin at Thu 2017-09-28 16:17:29 CDT, end at Thu 2017-12-14 09:36:50 
CST. --
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: microcode: microcode updated early 
to revision 0x3a, date = 2017-01-30
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: Linux version 4.15.0-rc3+ 
(br...@cerberus.csd.uwm.edu) (gcc version 7.2.1 20170915 (Red Hat 7.2.1-4) 
(GCC)) #15 SMP Thu Dec 14 09:07:46 CST 2017
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: Command line: 
BOOT_IMAGE=/vmlinuz-4.15.0-rc3+ 
root=/dev/mapper/luks-f5e2d09b-f8a3-487d-9517-abe4fb0eada3 ro 
rd.md.uuid=7f4fcca0:13b1445f:a91ff455:6bb1ab48 
rd.luks.uuid=luks-cc6ee93c-e729-4f78-9baf-0cc5cc8a9ff1 
rd.md.uuid=ef18531c:760102fb:7797cbdb:5cf9516f 
rd.md.uuid=42efe386:0c315f28:f7c61920:ea098f81 
rd.luks.uuid=luks-f5e2d09b-f8a3-487d-9517-abe4fb0eada3 LANG=en_US.UTF-8
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: x86/fpu: Supporting XSAVE feature 
0x001: 'x87 floating point registers'
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: x86/fpu: Supporting XSAVE feature 
0x002: 'SSE registers'
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: x86/fpu: Supporting XSAVE feature 
0x004: 'AVX registers'
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: x86/fpu: xstate_offset[2]:  576, 
xstate_sizes[2]:  256
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: x86/fpu: Enabled xstate features 
0x7, context size is 832 bytes, using 'standard' format.
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: e820: BIOS-provided physical RAM 
map:
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: BIOS-e820: [mem 
0x-0x0009e7ff] usable
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: BIOS-e820: [mem 
0x0009e800-0x0009] reserved
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: BIOS-e820: [mem 
0x000e-0x000f] reserved
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: BIOS-e820: [mem 
0x0010-0x998f1fff] usable
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: BIOS-e820: [mem 
0x998f2000-0x9a29dfff] reserved
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: BIOS-e820: [mem 
0x9a29e000-0x9a2e6fff] ACPI data
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: BIOS-e820: [mem 
0x9a2e7000-0x9af43fff] ACPI NVS
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: BIOS-e820: [mem 
0x9af44000-0x9b40afff] reserved
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: BIOS-e820: [mem 
0x9b40b000-0x9b40bfff] usable
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: BIOS-e820: [mem 
0x9b40c000-0x9b419fff] reserved
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: BIOS-e820: [mem 
0x9b41a000-0x9cff] usable
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: BIOS-e820: [mem 
0xa000-0xafff] reserved
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: BIOS-e820: [mem 
0xfed1c000-0xfed1] reserved
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: BIOS-e820: [mem 
0xff00-0x] reserved
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: BIOS-e820: [mem 
0x0001-0x00085fff] usable
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: NX (Execute Disable) protection: 
active
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: random: fast init done
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: SMBIOS 2.8 present.
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: DMI: Dell Inc. Precision Tower 
5810/0WR1RF, BIOS A07 04/14/2015
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: e820: update [mem 
0x-0x0fff] usable ==> reserved
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: e820: remove [mem 
0x000a-0x000f] usable
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: e820: last_pfn = 0x86 
max_arch_pfn = 0x4
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: MTRR default type: write-back
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: MTRR fixed ranges enabled:
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel:   0-9 write-back
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel:   A-B uncachable
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel:   C-E3FFF write-through
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel:   E4000-F write-protect
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel: MTRR variable ranges enabled:
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel:   0 base C000 mask 
3FFFC000 uncachable
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel:   1 base A000 mask 
3FFFE000 uncachable
Dec 14 09:17:43 cerberus.csd.uwm.edu kernel:   2 base 0300 mask 

Re: About the try to remove cross-release feature entirely by Ingo

2017-12-14 Thread Byungchul Park
On Thu, Dec 14, 2017 at 8:18 PM, Peter Zijlstra  wrote:
> On Wed, Dec 13, 2017 at 10:07:11PM -0500, Theodore Ts'o wrote:
>> interpreted this as the lockdep maintainers saying, "hey, not my
>> fault, it's the subsystem maintainer's fault for not properly
>> classifying the locks" --- and thus dumping the responsibility in the
>> subsystem maintainers' laps.
>
> Let me clarify that I (as lockdep maintainer) disagree with that
> sentiment. I have spent a lot of time over the years staring at random
> code trying to fix lockdep splats. It's awesome if corresponding
> subsystem maintainers help out, and many have, but I very much do not
> agree that it's their problem and their problem alone.

I apologize to all of you. That's really not what I intended to say.

I said that other folks can annotate locks for their subsystem better
than the lockdep developers can, so I suggested invalidating the locks
causing trouble, since I wanted to avoid annotating them at the moment,
and validating them again later, when necessary, with additional
annotations.

It's my fault. I'm not sure how I should have expressed what I wanted
to say, but I didn't intend to push the responsibility onto other
folks.

Ideally, I think it's best to solve this by working together. I
should've been more careful about how I said that.

Again, I apologize for that, to the lockdep and fs maintainers.

Of course, for cross-release, I am willing to annotate things or find
a better way to avoid the false positives. And I think I have to.

-- 
Thanks,
Byungchul


Re: About the try to remove cross-release feature entirely by Ingo

2017-12-14 Thread Peter Zijlstra
On Wed, Dec 13, 2017 at 10:07:11PM -0500, Theodore Ts'o wrote:
> interpreted this as the lockdep maintainers saying, "hey, not my
> fault, it's the subsystem maintainer's fault for not properly
> classifying the locks" --- and thus dumping the responsibility in the
> subsystem maintainers' laps.

Let me clarify that I (as lockdep maintainer) disagree with that
sentiment. I have spent a lot of time over the years staring at random
code trying to fix lockdep splats. It's awesome if corresponding
subsystem maintainers help out, and many have, but I very much do not
agree that it's their problem and their problem alone.

This attitude is one of the biggest issues I have with the crossrelease
stuff and why I don't disagree with Ingo taking it out (for now).


Re: Regression with a0747a859ef6 ("bdi: add error handle for bdi_debug_register")

2017-12-14 Thread Bruno Wolff III

On Thu, Dec 14, 2017 at 18:09:27 +0800,
 weiping zhang  wrote:

On Thu, Dec 14, 2017 at 02:24:52AM -0600, Bruno Wolff III wrote:

On Wed, Dec 13, 2017 at 16:54:17 -0800,
 Laura Abbott  wrote:
>Hi,
>
>Fedora got a bug report https://bugzilla.redhat.com/show_bug.cgi?id=1520982
>of a boot failure/bug on Linus' master (full bootlog at the bugzilla)

I'm available for testing. The problem happens on my x86_64 Dell
Workstation, but not on an old i386 server or an x86_64 Mac hardware
based laptop.


Hi,

It seems something is wrong with the bdi debugfs registration. Could
you help test the following debug patch? I added some debug logging;
there is no functional change. Thanks.


I'll test it this morning. I'll probably have results in about 7 hrs from now.


Re: Regression with a0747a859ef6 ("bdi: add error handle for bdi_debug_register")

2017-12-14 Thread weiping zhang
On Thu, Dec 14, 2017 at 02:24:52AM -0600, Bruno Wolff III wrote:
> On Wed, Dec 13, 2017 at 16:54:17 -0800,
>  Laura Abbott  wrote:
> >Hi,
> >
> >Fedora got a bug report https://bugzilla.redhat.com/show_bug.cgi?id=1520982
> >of a boot failure/bug on Linus' master (full bootlog at the bugzilla)
> 
> I'm available for testing. The problem happens on my x86_64 Dell
> Workstation, but not on an old i386 server or an x86_64 Mac hardware
> based laptop.

Hi,

It seems something is wrong with the bdi debugfs registration. Could you
help test the following debug patch? I added some debug logging; there
is no functional change. Thanks.


From d2728c07589e8b83115a51e0c629451bff7308db Mon Sep 17 00:00:00 2001
From: weiping zhang 
Date: Thu, 14 Dec 2017 17:56:22 +0800
Subject: [PATCH] bdi debugfs

Signed-off-by: weiping zhang 
---
 mm/backing-dev.c | 28 +---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 84b2dc7..fbbb9a6 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -39,6 +39,10 @@ static struct dentry *bdi_debug_root;
 static void bdi_debug_init(void)
 {
bdi_debug_root = debugfs_create_dir("bdi", NULL);
+   if (!bdi_debug_root)
+   pr_err("DEBUG:bdi_debug_root fail\n");
+   else
+   pr_err("DEBUG:bdi_debug_root success\n");
 }
 
 static int bdi_debug_stats_show(struct seq_file *m, void *v)
@@ -115,18 +119,29 @@ static const struct file_operations bdi_debug_stats_fops = {
 
 static int bdi_debug_register(struct backing_dev_info *bdi, const char *name)
 {
-   if (!bdi_debug_root)
+   if (!bdi_debug_root) {
+   pr_err("DEBUG:dev:%s, bdi_debug_root fail\n", name);
return -ENOMEM;
+   } else {
+   pr_err("DEBUG:dev:%s, bdi_debug_root success\n", name);
+   }
 
bdi->debug_dir = debugfs_create_dir(name, bdi_debug_root);
-   if (!bdi->debug_dir)
+   if (!bdi->debug_dir) {
+   pr_err("DEBUG:dev:%s, debug_dir fail\n", name);
return -ENOMEM;
+   } else {
+   pr_err("DEBUG:dev:%s, debug_dir success\n", name);
+   }
 
bdi->debug_stats = debugfs_create_file("stats", 0444, bdi->debug_dir,
					   bdi, &bdi_debug_stats_fops);
if (!bdi->debug_stats) {
debugfs_remove(bdi->debug_dir);
+   pr_err("DEBUG:dev:%s, debug_stats fail\n", name);
return -ENOMEM;
+   } else {
+   pr_err("DEBUG:dev:%s, debug_stats success\n", name);
}
 
return 0;
@@ -879,13 +894,20 @@ int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args)
return 0;
 
dev = device_create_vargs(bdi_class, NULL, MKDEV(0, 0), bdi, fmt, args);
-   if (IS_ERR(dev))
+   if (IS_ERR(dev)) {
+   pr_err("DEBUG: bdi device_create_vargs fail\n");
return PTR_ERR(dev);
+   }
+   pr_err("DEBUG: bdi(0x%p) device_create_vargs success\n", bdi);
 
if (bdi_debug_register(bdi, dev_name(dev))) {
+   pr_err("DEBUG: dev:%s, bdi(0x%p) bdi_debug_register fail\n",
+   dev_name(dev), bdi);
device_destroy(bdi_class, dev->devt);
return -ENOMEM;
}
+   pr_err("DEBUG: dev:%s, bdi(0x%p) bdi_debug_register success\n",
+   dev_name(dev), bdi);
cgwb_bdi_register(bdi);
bdi->dev = dev;
 
-- 
2.9.4



Re: Regression with a0747a859ef6 ("bdi: add error handle for bdi_debug_register")

2017-12-14 Thread Bruno Wolff III

On Wed, Dec 13, 2017 at 16:54:17 -0800,
 Laura Abbott  wrote:

Hi,

Fedora got a bug report https://bugzilla.redhat.com/show_bug.cgi?id=1520982
of a boot failure/bug on Linus' master (full bootlog at the bugzilla)


I'm available for testing. The problem happens on my x86_64 Dell
Workstation, but not on an old i386 server or an x86_64 Mac hardware
based laptop.