Re: [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI

2015-10-16 Thread Jan Kara
On Fri 16-10-15 11:08:46, Nikolay Borisov wrote:
> On 10/13/2015 04:14 PM, Jan Kara wrote:
> >> Turns out in my original email the bh->b_size was shown:
> >> b_size = 0x400 == 1k. So the filesystem is not 4k but 1k.
> > 
> > OK, then I have a theory. We can manipulate bh->b_state in a non-atomic
> > manner in _ext4_get_block(). If we happen to do that on the first buffer in
> > a page while IO completes on another buffer in the same page, we could in
> > theory mess up and miss clearing of BH_Uptodate_Lock flag. Can you try
> > whether the attached patch fixes your problem?
> 
> I just want to make sure I have fully understood the problem. The way I
> understand what you have said is that since the blocksize is 1k, we
> potentially need up to 4 buffer heads to map everything in a page. But
> as you mentioned, the locking in ext4_finish_bio is done only on the
> first buffer_head in this set of bhs. At the same time, in
> _ext4_get_block, bh->b_state is modified non-atomically as you said, and
> this can happen on one of those 4 bhs used to map the whole page?
> Correct?
> 
> If my reasoning is correct then I see no reason why this corruption
> couldn't happen even with blocksize == pagesize since ext4_get_block's
> non-atomic modification of the bh->b_state could still race with
> ext4_finish_bio's atomic modification (even though ext4_finish_bio would
> just work on a single buffer head)?

Well, the reason why this race is not possible with 4k blocks is that we
map blocks in a page (i.e., call _ext4_get_block()) only under the page
lock and only when the buffer isn't already mapped. In that case IO
completion cannot happen (for IO to run, the buffer must be mapped). So
it's mostly pure luck, but the race is not possible...
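
For illustration, a minimal sketch of the lost-update race described
above (a paraphrase, not the actual kernel code; map_flags stands in
for the map.m_flags value that _ext4_get_block() merges in). Assume a
1k-block filesystem, so four buffer_heads share the page and the
page-level BH_Uptodate_Lock bit lives in the first (head) buffer's
b_state word:

        /* CPU A: _ext4_get_block() on the head buffer, under page lock */
        unsigned long old = head->b_state;          /* plain load    */
        old = (old & ~EXT4_MAP_FLAGS) | map_flags;  /* modify a copy */
        head->b_state = old;                        /* plain store   */

        /* CPU B: ext4_finish_bio() completing IO on a *different*,
         * already mapped buffer of the same page */
        bit_spin_lock(BH_Uptodate_Lock, &head->b_state);   /* atomic */
        /* ... walk all buffers of the page ... */
        bit_spin_unlock(BH_Uptodate_Lock, &head->b_state); /* atomic */

If CPU B's lock/unlock executes between CPU A's load and store, the
plain store rewrites b_state with BH_Uptodate_Lock still set. Nobody
owns the bit anymore, so the next bit_spin_lock() on that page spins
forever - the hard lockup observed in ext4_finish_bio().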

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI

2015-10-13 Thread Nikolay Borisov


On 10/13/2015 11:15 AM, Jan Kara wrote:
> On Mon 12-10-15 17:51:07, Nikolay Borisov wrote:
>> Hello and thanks for the reply,
>>
>> On 10/12/2015 04:40 PM, Jan Kara wrote:
>>> On Fri 09-10-15 11:03:30, Nikolay Borisov wrote:
 On 10/09/2015 10:37 AM, Hillf Danton wrote:
 @@ -109,8 +109,8 @@ static void ext4_finish_bio(struct bio *bio)
                          if (bio->bi_error)
                                  buffer_io_error(bh);
                  } while ((bh = bh->b_this_page) != head);
 -                bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
                  local_irq_restore(flags);
>>>
>>> What if it takes 100ms to unlock after IRQ restored?
>>
>> I'm not sure I understand in what direction you are going? Care to
>> elaborate?
>>
> Your change introduces an extra time cost the lock waiter has to pay in
> the case that an irq happens before the lock is released.

 [CC filesystem and mm people. For reference the thread starts here:
  http://thread.gmane.org/gmane.linux.kernel/2056996 ]

 Right, I see what you mean and it's a good point but when doing the
 patches I was striving for correctness and starting a discussion, hence
 the RFC. In any case I'd personally choose correctness over performance
 always ;).

 As I'm not an fs/ext4 expert and have added the relevant parties (please
 use reply-all from now on so that the thread is not being cut in the
 middle) who will be able to say whether the impact is going to be that
 big. I guess in this particular code path worrying about this is prudent
 as writeback sounds like a heavily used path.

 Maybe the problem should be approached from a different angle e.g.
 drain_all_pages and its reliance on the fact that the IPI will always be
 delivered in some finite amount of time? But what if a cpu with disabled
 interrupts is waiting on the task issuing the IPI?
>>>
>>> So I have looked through your patch and also original report (thread starts
>>> here: https://lkml.org/lkml/2015/10/8/341) and IMHO one question hasn't
>>> been properly answered yet: Who is holding BH_Uptodate_Lock we are spinning
>>> on? You have suggested in https://lkml.org/lkml/2015/10/8/464 that it was
>>> __block_write_full_page_endio() call but that cannot really be the case.
>>> BH_Uptodate_Lock is used only in IO completion handlers -
>>> end_buffer_async_read, end_buffer_async_write, ext4_finish_bio. So there
>>> really should be some end_io function running on some other CPU which holds
>>> BH_Uptodate_Lock for that buffer.
>>
>> I did check all the call traces of the current processes on the machine
>> at the time of the hard lockup and none of the 3 functions you mentioned
>> were in any of the call chains. But while I was looking at the code of
>> end_buffer_async_write, I saw it mentioned in the comments that
>> those completion handlers were called from __block_write_full_page_endio,
>> so that's what drew my attention to that function. But you are right
>> that it doesn't take the BH lock.
>>
>> Furthermore the fact that the BH_Async_Write flag is set points me in
>> the direction that end_buffer_async_write should have been executing but
>> as I said issuing "bt" for all the tasks didn't show this function.
> 
> Actually ext4_bio_write_page() also sets BH_Async_Write so that seems like
> a more likely place where that flag got set since ext4_finish_bio() was
> then handling IO completion.
> 
>> I'm beginning to wonder if it's possible that a single bit memory error
>> has crept up, but this still seems like a long shot...
> 
> Yup. Possible but a long shot. Is the problem reproducible in any way?

Okay, I rule out a hardware issue, since a different server experienced
the same hard lockup today. One thing which looks suspicious to me is
the repetition of bio_endio/clone_endio:

Oct 13 03:16:54 10.80.5.48 Call Trace:
Oct 13 03:16:54 10.80.5.48 <NMI>
Oct 13 03:16:54 10.80.5.48 [] dump_stack+0x58/0x7f
Oct 13 03:16:54 10.80.5.48 [] warn_slowpath_common+0x8c/0xc0
Oct 13 03:16:54 10.80.5.48 [] warn_slowpath_fmt+0x46/0x50
Oct 13 03:16:54 10.80.5.48 [] 
watchdog_overflow_callback+0x98/0xc0
Oct 13 03:16:54 10.80.5.48 [] __perf_event_overflow+0x9c/0x250
Oct 13 03:16:54 10.80.5.48 [] perf_event_overflow+0x14/0x20
Oct 13 03:16:54 10.80.5.48 [] intel_pmu_handle_irq+0x1d6/0x3e0
Oct 13 03:16:54 10.80.5.48 [] perf_event_nmi_handler+0x34/0x60
Oct 13 03:16:54 10.80.5.48 [] nmi_handle+0xa2/0x1a0
Oct 13 03:16:54 10.80.5.48 [] do_nmi+0x164/0x430
Oct 13 03:16:54 10.80.5.48 [] end_repeat_nmi+0x1a/0x1e
Oct 13 03:16:54 10.80.5.48 [] ? ext4_finish_bio+0x279/0x2a0
Oct 13 03:16:54 10.80.5.48 [] ? ext4_finish_bio+0x279/0x2a0
Oct 13 03:16:54 10.80.5.48 [] ? ext4_finish_bio+0x279/0x2a0
Oct 13 03:16:54 10.80.5.48 <<EOE>>
Oct 13 03:16:54 10.80.5.48 <IRQ>
Oct 13 03:16:54 10.80.5.48 [] ext4_end_bio+0xc8/0x120
Oct 13 03:16:54 10.80.5.48 [] bio_endio+0x1d/0x40
Oct 13 

Re: [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI

2015-10-13 Thread Jan Kara
On Mon 12-10-15 17:51:07, Nikolay Borisov wrote:
> Hello and thanks for the reply,
> 
> On 10/12/2015 04:40 PM, Jan Kara wrote:
> > On Fri 09-10-15 11:03:30, Nikolay Borisov wrote:
> >> On 10/09/2015 10:37 AM, Hillf Danton wrote:
> >> @@ -109,8 +109,8 @@ static void ext4_finish_bio(struct bio *bio)
> >>                  if (bio->bi_error)
> >>                          buffer_io_error(bh);
> >>          } while ((bh = bh->b_this_page) != head);
> >> -        bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
> >>          local_irq_restore(flags);
> >
> > What if it takes 100ms to unlock after IRQ restored?
> 
>  I'm not sure I understand in what direction you are going? Care to
>  elaborate?
> 
> >>> Your change introduces an extra time cost the lock waiter has to pay in
> >>> the case that an irq happens before the lock is released.
> >>
> >> [CC filesystem and mm people. For reference the thread starts here:
> >>  http://thread.gmane.org/gmane.linux.kernel/2056996 ]
> >>
> >> Right, I see what you mean and it's a good point but when doing the
> >> patches I was striving for correctness and starting a discussion, hence
> >> the RFC. In any case I'd personally choose correctness over performance
> >> always ;).
> >>
> >> As I'm not an fs/ext4 expert and have added the relevant parties (please
> >> use reply-all from now on so that the thread is not being cut in the
> >> middle) who will be able to say whether the impact is going to be that
> >> big. I guess in this particular code path worrying about this is prudent
> >> as writeback sounds like a heavily used path.
> >>
> >> Maybe the problem should be approached from a different angle e.g.
> >> drain_all_pages and its reliance on the fact that the IPI will always be
> >> delivered in some finite amount of time? But what if a cpu with disabled
> >> interrupts is waiting on the task issuing the IPI?
> > 
> > So I have looked through your patch and also original report (thread starts
> > here: https://lkml.org/lkml/2015/10/8/341) and IMHO one question hasn't
> > been properly answered yet: Who is holding BH_Uptodate_Lock we are spinning
> > on? You have suggested in https://lkml.org/lkml/2015/10/8/464 that it was
> > __block_write_full_page_endio() call but that cannot really be the case.
> > BH_Uptodate_Lock is used only in IO completion handlers -
> > end_buffer_async_read, end_buffer_async_write, ext4_finish_bio. So there
> > really should be some end_io function running on some other CPU which holds
> > BH_Uptodate_Lock for that buffer.
> 
> I did check all the call traces of the current processes on the machine
> at the time of the hard lockup and none of the 3 functions you mentioned
> were in any of the call chains. But while I was looking at the code of
> end_buffer_async_write, I saw it mentioned in the comments that
> those completion handlers were called from __block_write_full_page_endio,
> so that's what drew my attention to that function. But you are right
> that it doesn't take the BH lock.
> 
> Furthermore the fact that the BH_Async_Write flag is set points me in
> the direction that end_buffer_async_write should have been executing but
> as I said issuing "bt" for all the tasks didn't show this function.

Actually ext4_bio_write_page() also sets BH_Async_Write so that seems like
a more likely place where that flag got set since ext4_finish_bio() was
then handling IO completion.
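
For reference, the tagging in ext4_bio_write_page() looks roughly like
this (paraphrased from fs/ext4/page-io.c of that era and trimmed, so
treat it as a sketch rather than an exact quote):

        bh = head = page_buffers(page);
        do {
                if (!buffer_dirty(bh) || !buffer_mapped(bh)) {
                        if (buffer_dirty(bh))
                                redirty_page_for_writepage(wbc, page);
                        continue;       /* skipped buffers stay untagged */
                }
                set_buffer_async_write(bh);     /* the BH_Async_Write flag
                                                 * seen in the crashdump */
        } while ((bh = bh->b_this_page) != head);

ext4_end_bio() -> ext4_finish_bio() is then the completion path that
clears the flag again, which fits the flag being set while
end_buffer_async_write() never shows up in the backtraces.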

> I'm beginning to wonder if it's possible that a single bit memory error
> has crept up, but this still seems like a long shot...

Yup. Possible but a long shot. Is the problem reproducible in any way?

> Btw I think in any case the spin_lock patch is wrong as this code can be
> called from within softirq context and we do want to be interrupt safe
> at that point.

Agreed, that patch is definitely wrong.

> > BTW: I suppose the filesystem uses 4k blocksize, doesn't it?
> 
> Unfortunately I cannot tell you with 100% certainty, since on this
> server there are multiple block devices with blocksize either 1k or 4k.
> So it is one of these. If you know a way to extract this information
> from a vmcore file I'd be happy to do it.

Well, if you have a crashdump, then bh->b_size is the block size. So just
check that for the bh we are spinning on.
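
For example, with the crash(8) utility (the buffer_head address below
is made up - take the one the spinning CPU has on its stack):

        crash> struct buffer_head.b_size ffff8800214b3a98
          b_size = 0x400

b_size = 0x400 (1024) would mean a 1k filesystem, 0x1000 (4096) a 4k
one.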

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI

2015-10-12 Thread Nikolay Borisov
Hello and thanks for the reply,

On 10/12/2015 04:40 PM, Jan Kara wrote:
> On Fri 09-10-15 11:03:30, Nikolay Borisov wrote:
>> On 10/09/2015 10:37 AM, Hillf Danton wrote:
>> @@ -109,8 +109,8 @@ static void ext4_finish_bio(struct bio *bio)
>>                  if (bio->bi_error)
>>                          buffer_io_error(bh);
>>          } while ((bh = bh->b_this_page) != head);
>> -        bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
>>          local_irq_restore(flags);
>
> What if it takes 100ms to unlock after IRQ restored?

 I'm not sure I understand in what direction you are going? Care to
 elaborate?

>>> Your change introduces an extra time cost the lock waiter has to pay in
>>> the case that an irq happens before the lock is released.
>>
>> [CC filesystem and mm people. For reference the thread starts here:
>>  http://thread.gmane.org/gmane.linux.kernel/2056996 ]
>>
>> Right, I see what you mean and it's a good point but when doing the
>> patches I was striving for correctness and starting a discussion, hence
>> the RFC. In any case I'd personally choose correctness over performance
>> always ;).
>>
>> As I'm not an fs/ext4 expert and have added the relevant parties (please
>> use reply-all from now on so that the thread is not being cut in the
>> middle) who will be able to say whether the impact is going to be that
>> big. I guess in this particular code path worrying about this is prudent
>> as writeback sounds like a heavily used path.
>>
>> Maybe the problem should be approached from a different angle e.g.
>> drain_all_pages and its reliance on the fact that the IPI will always be
>> delivered in some finite amount of time? But what if a cpu with disabled
>> interrupts is waiting on the task issuing the IPI?
> 
> So I have looked through your patch and also original report (thread starts
> here: https://lkml.org/lkml/2015/10/8/341) and IMHO one question hasn't
> been properly answered yet: Who is holding BH_Uptodate_Lock we are spinning
> on? You have suggested in https://lkml.org/lkml/2015/10/8/464 that it was
> __block_write_full_page_endio() call but that cannot really be the case.
> BH_Uptodate_Lock is used only in IO completion handlers -
> end_buffer_async_read, end_buffer_async_write, ext4_finish_bio. So there
> really should be some end_io function running on some other CPU which holds
> BH_Uptodate_Lock for that buffer.

I did check all the call traces of the current processes on the machine
at the time of the hard lockup and none of the 3 functions you mentioned
were in any of the call chains. But while I was looking at the code of
end_buffer_async_write, I saw it mentioned in the comments that
those completion handlers were called from __block_write_full_page_endio,
so that's what drew my attention to that function. But you are right
that it doesn't take the BH lock.

Furthermore the fact that the BH_Async_Write flag is set points me in
the direction that end_buffer_async_write should have been executing but
as I said issuing "bt" for all the tasks didn't show this function.

I'm beginning to wonder if it's possible that a single bit memory error
has crept up, but this still seems like a long shot...

Btw I think the spin_lock patch is in any case wrong, as this code can
be called from within softirq context and we do want to be
interrupt-safe at that point.
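
To illustrate (a generic sketch of the usual locking rule with a
made-up lock name, not ext4 code): if the lock were taken in task
context with interrupts enabled, the completion softirq could deadlock
against it on the same CPU:

        /* task context, CPU0 */
        spin_lock(&uptodate_lock);      /* lock held, IRQs still on */
        /* ... block-IO interrupt arrives on CPU0; its softirq runs
         * the completion handler: */
        spin_lock(&uptodate_lock);      /* same CPU, same lock:
                                         * spins forever */

which is why the existing code does local_irq_save() before taking the
bit spinlock.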

> 
> BTW: I suppose the filesystem uses 4k blocksize, doesn't it?

Unfortunately I cannot tell you with 100% certainty, since on this
server there are multiple block devices with blocksize either 1k or 4k.
So it is one of these. If you know a way to extract this information
from a vmcore file I'd be happy to do it.

> 
>   Honza
> 
>> +        bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
>>          if (!under_io) {
>>  #ifdef CONFIG_EXT4_FS_ENCRYPTION
>>                  if (ctx)
>> --
>> 2.5.0
>
>>>


Re: [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI

2015-10-12 Thread Jan Kara
On Fri 09-10-15 11:03:30, Nikolay Borisov wrote:
> On 10/09/2015 10:37 AM, Hillf Danton wrote:
>  @@ -109,8 +109,8 @@ static void ext4_finish_bio(struct bio *bio)
>                   if (bio->bi_error)
>                           buffer_io_error(bh);
>           } while ((bh = bh->b_this_page) != head);
>  -        bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
>           local_irq_restore(flags);
> >>>
> >>> What if it takes 100ms to unlock after IRQ restored?
> >>
> >> I'm not sure I understand in what direction you are going? Care to
> >> elaborate?
> >>
> > Your change introduces an extra time cost that the lock waiter has to
> > pay in the case that an irq happens before the lock is released.
> 
> [CC filesystem and mm people. For reference the thread starts here:
>  http://thread.gmane.org/gmane.linux.kernel/2056996 ]
> 
> Right, I see what you mean and it's a good point but when doing the
> patches I was striving for correctness and starting a discussion, hence
> the RFC. In any case I'd personally choose correctness over performance
> always ;).
> 
> As I'm not an fs/ext4 expert and have added the relevant parties (please
> use reply-all from now on so that the thread is not being cut in the
> middle) who will be able to say whether its impact is going to be that
> big. I guess in this particular code path worrying about this is prudent
> as writeback sounds like a heavily used path.
> 
> Maybe the problem should be approached from a different angle e.g.
> drain_all_pages and its reliance on the fact that the IPI will always be
> delivered in some finite amount of time? But what if a cpu with disabled
> interrupts is waiting on the task issuing the IPI?

So I have looked through your patch and also the original report (the
thread starts here: https://lkml.org/lkml/2015/10/8/341) and IMHO one
question hasn't been properly answered yet: who is holding the
BH_Uptodate_Lock we are spinning on? You have suggested in
https://lkml.org/lkml/2015/10/8/464 that it was the
__block_write_full_page_endio() call, but that cannot really be the case.
BH_Uptodate_Lock is used only in IO completion handlers -
end_buffer_async_read, end_buffer_async_write, ext4_finish_bio - so there
really should be some end_io function running on some other CPU which
holds BH_Uptodate_Lock for that buffer.
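
For reference, the write-side completion handler looks roughly like this
(an abridged sketch of fs/buffer.c from this era; error handling omitted):

static void end_buffer_async_write(struct buffer_head *bh, int uptodate)
{
	struct buffer_head *first, *tmp;
	struct page *page = bh->b_page;
	unsigned long flags;

	first = page_buffers(page);
	local_irq_save(flags);
	bit_spin_lock(BH_Uptodate_Lock, &first->b_state);

	clear_buffer_async_write(bh);
	unlock_buffer(bh);

	/* Writeback on the page only ends once no buffer in it is
	 * still under async write. */
	tmp = bh->b_this_page;
	while (tmp != bh) {
		if (buffer_async_write(tmp))
			goto still_busy;
		tmp = tmp->b_this_page;
	}

	bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
	local_irq_restore(flags);
	end_page_writeback(page);
	return;

still_busy:
	bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
	local_irq_restore(flags);
}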

BTW: I suppose the filesystem uses 4k blocksize, doesn't it?

Honza

>  +bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
>   if (!under_io) {
>   #ifdef CONFIG_EXT4_FS_ENCRYPTION
>   if (ctx)
>  --
>  2.5.0
> >>>
> > 
-- 
Jan Kara 
SUSE Labs, CR




Re: [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI

2015-10-09 Thread Nikolay Borisov


On 10/09/2015 11:41 AM, Gilad Ben-Yossef wrote:
> On Oct 8, 2015 18:31, "Nikolay Borisov"  wrote:
>>
>> Currently when bios are being finished in ext4_finish_bio this is done by
>> first disabling interrupts and then acquiring a bit_spin_lock.
> ...
>>
>> To fix the situation this patch changes the order in which the
>> bit_spin_lock and interrupt disabling occur. The expected
>> effect is that even if a core is spinning on the bitlock it will
>> have its interrupts enabled, thus being able to respond to IPIs.
> 
> Are you sure this spin lock is never taken from interrupt context?

Good point.

I think this fix is indeed wrong, as this lock is actually taken from
softirq context, as is evident from the thread which prompted me to write
the patch in the first place:
http://permalink.gmane.org/gmane.linux.kernel/2056730


> If it does, an interrupt occurring after the lock is taken and before
> interrupts are disabled can deadlock.
> 
> Gilad
> 
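
To spell out the scenario described above under the patched ordering,
where the bit lock is taken before interrupts are disabled (a sketch, not
the actual patched code):

	/* process context on one CPU, patched ordering */
	bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
	/*
	 * An interrupt arrives here; on irq exit the block softirq
	 * runs an end_io handler for another bio on the same page,
	 * which spins on the very same bit lock.  The lock holder
	 * cannot continue until the softirq returns, so the CPU
	 * deadlocks against itself.
	 */
	local_irq_save(flags);	/* never reached */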


Re: [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI

2015-10-09 Thread Nikolay Borisov
On 10/09/2015 10:37 AM, Hillf Danton wrote:
 @@ -109,8 +109,8 @@ static void ext4_finish_bio(struct bio *bio)
if (bio->bi_error)
buffer_io_error(bh);
} while ((bh = bh->b_this_page) != head);
 -  bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
local_irq_restore(flags);
>>>
>>> What if it takes 100ms to unlock after IRQ restored?
>>
>> I'm not sure I understand in what direction you are going? Care to
>> elaborate?
>>
> Your change introduces an extra time cost that the lock waiter has to
> pay in the case that an irq happens before the lock is released.

[CC filesystem and mm people. For reference the thread starts here:
 http://thread.gmane.org/gmane.linux.kernel/2056996 ]

Right, I see what you mean and it's a good point, but when doing the
patches I was striving for correctness and starting a discussion, hence
the RFC. In any case I'd personally choose correctness over performance
always ;).

As I'm not an fs/ext4 expert, I have added the relevant parties (please
use reply-all from now on so that the thread is not being cut in the
middle) who will be able to say whether its impact is going to be that
big. I guess in this particular code path worrying about this is prudent,
as writeback sounds like a heavily used path.

Maybe the problem should be approached from a different angle e.g.
drain_all_pages and its reliance on the fact that the IPI will always be
delivered in some finite amount of time? But what if a cpu with disabled
interrupts is waiting on the task issuing the IPI?
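
For reference, this is roughly the call that issues those IPIs (an
abridged sketch of mm/page_alloc.c from this era; the cpumask
construction is omitted):

void drain_all_pages(void)
{
	static cpumask_t cpus_with_pcps;

	/* ... cpus_with_pcps is filled with every CPU that has
	 * per-cpu pages to drain ... */
	on_each_cpu_mask(&cpus_with_pcps,
			 (smp_call_func_t)drain_local_pages, NULL, 1);
	/* The final 1 means "wait": this does not return until
	 * drain_local_pages has run on every CPU in the mask, which
	 * never happens if one of them spins with interrupts off. */
}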

> 
 +  bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
if (!under_io) {
  #ifdef CONFIG_EXT4_FS_ENCRYPTION
if (ctx)
 --
 2.5.0
>>>
> 


Re: [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI

2015-10-09 Thread Hillf Danton
> >> @@ -109,8 +109,8 @@ static void ext4_finish_bio(struct bio *bio)
> >>if (bio->bi_error)
> >>buffer_io_error(bh);
> >>} while ((bh = bh->b_this_page) != head);
> >> -  bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
> >>local_irq_restore(flags);
> >
> > What if it takes 100ms to unlock after IRQ restored?
> 
> I'm not sure I understand in what direction you are going? Care to
> elaborate?
> 
Your change introduces an extra time cost that the lock waiter has to pay
in the case that an irq happens before the lock is released.

> >> +  bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
> >>if (!under_io) {
> >>  #ifdef CONFIG_EXT4_FS_ENCRYPTION
> >>if (ctx)
> >> --
> >> 2.5.0
> >



Re: [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI

2015-10-09 Thread Nikolay Borisov


On 10/09/2015 10:19 AM, Hillf Danton wrote:
>> @@ -109,8 +109,8 @@ static void ext4_finish_bio(struct bio *bio)
>>  if (bio->bi_error)
>>  buffer_io_error(bh);
>>  } while ((bh = bh->b_this_page) != head);
>> -bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
>>  local_irq_restore(flags);
> 
> What if it takes 100ms to unlock after IRQ restored?

I'm not sure I understand in what direction you are going? Care to
elaborate?

>> +bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
>>  if (!under_io) {
>>  #ifdef CONFIG_EXT4_FS_ENCRYPTION
>>  if (ctx)
>> --
>> 2.5.0
> 


Re: [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI

2015-10-09 Thread Hillf Danton
> @@ -109,8 +109,8 @@ static void ext4_finish_bio(struct bio *bio)
>   if (bio->bi_error)
>   buffer_io_error(bh);
>   } while ((bh = bh->b_this_page) != head);
> - bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
>   local_irq_restore(flags);

What if it takes 100ms to unlock after IRQ restored?

> + bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
>   if (!under_io) {
>  #ifdef CONFIG_EXT4_FS_ENCRYPTION
>   if (ctx)
> --
> 2.5.0



[RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI

2015-10-08 Thread Nikolay Borisov
Currently when bios are being finished in ext4_finish_bio this is done by
first disabling interrupts and then acquiring a bit_spin_lock.
However, those buffer heads might be under async write, and as such the
wait on the bit_spin_lock might cause the CPU to spin with interrupts
disabled for an arbitrary period of time. If in the meantime there is
demand for memory that cannot otherwise be satisfied, the allocator's
code might have to resort to draining the per-cpu lists, like so:

PID: 3  TASK: 881cbb2fb870  CPU: 2   COMMAND: "kworker/u96:0"
 #0 [881fffa46dc0] crash_nmi_callback at 8106f24e
 #1 [881fffa46de0] nmi_handle at 8104c152
 #2 [881fffa46e70] do_nmi at 8104c3b4
 #3 [881fffa46ef0] end_repeat_nmi at 81656e2e
[exception RIP: smp_call_function_many+577]
RIP: 810e7f81  RSP: 880d35b815c8  RFLAGS: 0202
RAX: 0017  RBX: 81142690  RCX: 0017
RDX: 883fff375478  RSI: 0040  RDI: 0040
RBP: 880d35b81628   R8: 881fffa51ec8   R9: 
R10:   R11: 812943f3  R12: 
R13: 881fffa51ec0  R14: 881fffa51ec8  R15: 00011f00
ORIG_RAX:   CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [880d35b815c8] smp_call_function_many at 810e7f81
 #5 [880d35b81630] on_each_cpu_mask at 810e801c
 #6 [880d35b81660] drain_all_pages at 81140178
 #7 [880d35b81690] __alloc_pages_nodemask at 8114310b
 #8 [880d35b81810] alloc_pages_current at 81181c5e
 #9 [880d35b81860] new_slab at 81188305

However, this will never return, since on_each_cpu_mask is being called
with its last argument set to 1, i.e. wait until the IPI handler has been
invoked on every cpu. Additionally, if there is another thread on whose
completion ext4_finish_bio depends, e.g.:

PID: 34220  TASK: 883937660810  CPU: 44  COMMAND: "kworker/u98:39"
 #0 [88209d5b10b8] __schedule at 81653d5a
 #1 [88209d5b1150] schedule at 816542f9
 #2 [88209d5b1160] schedule_preempt_disabled at 81654686
 #3 [88209d5b1180] __mutex_lock_slowpath at 816521eb
 #4 [88209d5b1200] mutex_lock at 816522d1
 #5 [88209d5b1220] new_read at a0152a7e [dm_bufio]
 #6 [88209d5b1280] dm_bufio_get at a0152ba6 [dm_bufio]
 #7 [88209d5b1290] dm_bm_read_try_lock at a015c878 [dm_persistent_data]
 #8 [88209d5b12e0] dm_tm_read_lock at a015f7ad [dm_persistent_data]
 #9 [88209d5b12f0] bn_read_lock at a016281b [dm_persistent_data]

And if this second thread is in turn dependent on the original allocation
succeeding, a hard lockup occurs, since ext4_finish_bio would be waiting
for block_write_full_page to complete, which in turn depends on the
original memory allocation succeeding, which in turn depends on the IPI
executing on each core. For completeness, here is how the call stack of
the hung ext4_finish_bio would look:

[427160.405277] NMI backtrace for cpu 23
[427160.405279] CPU: 23 PID: 4611 Comm: kworker/u98:7 Tainted: GW 3.12.47-clouder1 #1
[427160.405281] Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015
[427160.405285] Workqueue: writeback bdi_writeback_workfn (flush-252:148)
[427160.405286] task: 8825aa819830 ti: 882b1918 task.ti: 882b1918
[427160.405290] RIP: 0010:[]  [] ext4_finish_bio+0x273/0x2a0
[427160.405291] RSP: :883fff3639b0  EFLAGS: 0002
[427160.405292] RAX: 882b1918 RBX: 883f67480a80 RCX: 0110
[427160.405292] RDX: 882b1918 RSI:  RDI: 883f67480a80
[427160.405293] RBP: 883fff363a70 R08: 00014b80 R09: 881fff454f00
[427160.405294] R10: ea00473214c0 R11: 8113bfd7 R12: 880826272138
[427160.405295] R13:  R14:  R15: ea00aeaea400
[427160.405296] FS:  () GS:883fff36() knlGS:
[427160.405296] CS:  0010 DS:  ES:  CR0: 80050033
[427160.405297] CR2: 003c5b009c24 CR3: 01c0b000 CR4: 001407e0
[427160.405297] Stack:
[427160.405305]   8203f230 883fff363a00 882b1918
[427160.405312]  882b1918 882b1918 0400018e0af8 882b1918
[427160.405319]  883f67480a80  0202 d219e720
[427160.405320] Call Trace:
[427160.405324]  <IRQ>
[427160.405327]  [] ext4_end_bio+0xc8/0x120
[427160.405335]  [] bio_endio+0x1d/0x40
[427160.405341]  [] dec_pending+0x1c1/0x360
[427160.405345]  [] clone_endio+0x76/0xa0
[427160.405350]  [] bio_endio+0x1d/0x40
[427160.405353]  [] dec_pending+0x1c1/0x360
[427160.405358]  [] clone_endio+0x76/0xa0
[427160.405362]  [] bio_endio+0x1d/0x40
[427160.405365]  [] dec_pending+0x1c1/0x360
[427160.405369]  [] clone_endio+0x76/0xa0
[427160.405373]  []
