Re: [PATCH] Btrfs: fix compressed write corruption on enospc

2014-08-12 Thread Liu Bo
On Tue, Aug 12, 2014 at 10:55:51AM +0800, Miao Xie wrote:
 On Sun, 10 Aug 2014 22:55:44 +0800, Liu Bo wrote:
  This part of the trace is relatively new because Liu Bo's patch made us
  redirty the pages, making it more likely that we'd try to write them
  during commit.
 
  But, at the end of the day we have a fundamental deadlock with
  committing a transaction while holding a locked page from an ordered file.
 
  For now, I'm ripping out the strict ordered file and going back to a
  best-effort filemap_flush like ext4 is using.
 
  I think I've figured the deadlock out; this is obviously a race, and really
  hard to reproduce :-(
 
  So it turns out to be related to workqueues -- now a kthread can process
  work_structs queued on different workqueues, so we can explain the deadlock
  as follows:
 
  (1)
  The btrfs-delalloc workqueue gets a compressed extent to process with all of
  its pages locked during this; it runs into reading the free space cache
  inode, and then waits on lock_page().
 
  (2)
  Reading that free space cache inode comes to the submit part, and we have an
  indirect, two-stage endio path for it: on the first endio we come to
  end_workqueue_bio() and queue a work on the btrfs-endio-meta workqueue, and
  that queued work will run the real endio() for us, but ONLY when it's
  processed.
 
  So the problem is that a kthread can serve several workqueues, which means
  works from the btrfs-endio-meta workqueue and works from the
  btrfs-flush_delalloc workqueue can be on the same processing list of a
  kthread.  When btrfs-flush_delalloc waits for the compressed page and
  btrfs-endio-meta comes after it, it hangs.
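 
  For reference, the two-stage endio path in (2) looks roughly like this,
  condensed from the 3.16-era fs/btrfs/disk-io.c (helper names approximated,
  write and raid paths omitted):
 
  /* Stage 1: the bio's endio does not run the real completion work itself,
   * it only queues a work item on the matching endio workqueue. */
  static void end_workqueue_bio(struct bio *bio, int err)
  {
          struct end_io_wq *end_io_wq = bio->bi_private;
          struct btrfs_fs_info *fs_info = end_io_wq->info;
 
          end_io_wq->error = err;
          btrfs_init_work(&end_io_wq->work, end_workqueue_fn, NULL, NULL);
          /* a metadata read lands on the btrfs-endio-meta workqueue */
          btrfs_queue_work(fs_info->endio_meta_workers, &end_io_wq->work);
  }
 
  /* Stage 2: the real endio only runs once a worker processes the queued
   * item; if no worker ever gets to it, the reader in (1) waits forever. */
  static void end_workqueue_fn(struct btrfs_work *work)
  {
          struct end_io_wq *end_io_wq = container_of(work, struct end_io_wq,
                                                     work);
 
          bio_endio(end_io_wq->bio, end_io_wq->error);
          kfree(end_io_wq);
  }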
 
  I don't think it is right. All the btrfs workqueues have the RECLAIM flag,
  which means each btrfs workqueue has its own rescuer worker. So the problem
  you describe should not happen.
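 
  The guarantee behind that flag, as a minimal sketch against the generic
  workqueue API (the queue name here is hypothetical): a workqueue created
  with WQ_MEM_RECLAIM gets a dedicated rescuer thread, so its work items can
  make forward progress even when no new kworker can be spawned.
 
  #include <linux/module.h>
  #include <linux/workqueue.h>
 
  static struct workqueue_struct *example_wq;
 
  static int __init example_init(void)
  {
          /* WQ_MEM_RECLAIM pre-allocates a rescuer thread for this queue */
          example_wq = alloc_workqueue("example-reclaim", WQ_MEM_RECLAIM, 0);
          if (!example_wq)
                  return -ENOMEM;
          return 0;
  }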
 
  Right, I traded some emails with Tejun about this and spent a few days
  trying to prove the workqueues were doing the wrong thing.   It will end
  up spawning another worker thread for the new work, and it won't get
  queued up behind the existing thread.
 
  If both work items went to the same workqueue, you'd definitely be right.
 
  I've got a patch to change the flush-delalloc code so we don't do the
  file writes during commit.  It seems like the only choice right now.
  
  Not the only choice any more ;)
  
  It turns out to be related to async_cow's ordered list. Say we have two
  async_cow works on the wq->ordered_list; the first work (named A) finishes
  its ->ordered_func() and ->ordered_free(), and then the second work (B)
  starts B's ->ordered_func(), which gets to read the free space cache inode,
  where it queues a work on @endio_meta_workers -- but this work happens to
  have the same address as A's work.
  
  So now the situation is,
  
  (1) in the kthread's looping worker_thread(), work A is actually running its
  job,
  (2) however, work A has freed its memory while the kthread still wants to
  use that memory address, which means worker->current_work is still A's
  address,
  (3) B's readahead for the free space cache inode happens to queue a work
  whose memory address is exactly the previous address of A's work, which
  means another worker's ->current_work is also A's address,
  (4) as in btrfs we use the same function, normal_work_helper(), everywhere,
  worker->current_func is fixed here,
  (5) worker_thread()
        ->process_one_work()
          ->find_worker_executing_work()
            (finds a collision; the work is treated as already running)
  
  Then we saw the hang.
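  
  The collision test in step (5) makes this concrete. Condensed from
  kernel/workqueue.c of that era (locking context omitted), a work item is
  treated as "already running on this worker" when both its address and its
  work function match:
  
  static struct worker *find_worker_executing_work(struct worker_pool *pool,
                                                   struct work_struct *work)
  {
          struct worker *worker;
  
          hash_for_each_possible(pool->busy_hash, worker, hentry,
                                 (unsigned long)work)
                  if (worker->current_work == work &&
                      worker->current_func == work->func)
                          return worker;
  
          return NULL;
  }
  
  Because the freed work's memory was reused for the new endio work, and every
  btrfs work item shares normal_work_helper() as its function, both tests
  pass: the new work is deferred behind a worker that is in fact blocked
  waiting for that very work to complete.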
 
 Here is my understanding of what you said:
 The same worker dealt with work A and work B, and the 3rd work -- which was
 introduced by work B and has the same virtual memory address as work A -- was
 also inserted into the work list of that worker. But work B was waiting for
 the 3rd work at that time, so a deadlock happened.
 Am I right?
 
 If I'm right, I think what you said is impossible. Before we dealt with work
 B, we should already have invoked spin_unlock_irq(&pool->lock), which implies
 a memory barrier: all changes that happen before the unlock should complete
 before the unlock. That is, the address in current_work should be the address
 of work B, so when we inserted the 3rd work, which was introduced by work B,
 we should not find the address of work A in the current_work of work B's
 worker.
 
 I cannot reproduce the problem on my machine, so I haven't verified whether
 what I said is right or not. Please correct me if I am wrong.
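 
 To spell out the ordering I am relying on, heavily abridged from
 process_one_work() in kernel/workqueue.c of that era (bookkeeping and error
 paths dropped):
 
 static void process_one_work(struct worker *worker, struct work_struct *work)
 {
         /* called with pool->lock held */
         worker->current_work = work;
         worker->current_func = work->func;
         hash_add(worker->pool->busy_hash, &worker->hentry,
                  (unsigned long)work);
 
         spin_unlock_irq(&worker->pool->lock);  /* release barrier */
 
         worker->current_func(work);            /* B's work runs here */
 
         spin_lock_irq(&worker->pool->lock);
         hash_del(&worker->hentry);
         worker->current_work = NULL;
         worker->current_func = NULL;
 }
 
 By the time B's work function runs and queues the 3rd work, this worker's
 current_work has already been published as B's address, not A's.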

Hmm, my fault... the problem is complex, so I didn't describe it 100%
precisely in the above reply.

I've sent a patch about this which has the truth you may want :-)

-liubo


Re: [PATCH] Btrfs: fix compressed write corruption on enospc

2014-08-11 Thread Chris Mason
On 08/10/2014 10:55 AM, Liu Bo wrote:
 On Thu, Aug 07, 2014 at 10:02:15AM -0400, Chris Mason wrote:


 On 08/07/2014 04:20 AM, Miao Xie wrote:
 On Thu, 7 Aug 2014 15:50:30 +0800, Liu Bo wrote:
 [90496.156016] kworker/u8:14   D 880044e38540 0 21050  2 0x
 [90496.157683] Workqueue: btrfs-delalloc normal_work_helper [btrfs]
 [90496.159320]  88022880f990 0002 880407f649b0 88022880ffd8
 [90496.160997]  880044e38000 00013040 880044e38000 7fff
 [90496.162686]  880301383aa0 0002 814705d0 880301383a98
 [90496.164360] Call Trace:
 [90496.166028]  [814705d0] ? michael_mic.part.6+0x21/0x21
 [90496.167854]  [81470fd0] schedule+0x64/0x66
 [90496.169574]  [814705ff] schedule_timeout+0x2f/0x114
 [90496.171221]  [8106479a] ? wake_up_process+0x2f/0x32
 [90496.172867]  [81062c3b] ? get_parent_ip+0xd/0x3c
 [90496.174472]  [81062ce5] ? preempt_count_add+0x7b/0x8e
 [90496.176053]  [814717f3] __wait_for_common+0x11e/0x163
 [90496.177619]  [814717f3] ? __wait_for_common+0x11e/0x163
 [90496.179173]  [810647aa] ? wake_up_state+0xd/0xd
 [90496.180728]  [81471857] wait_for_completion+0x1f/0x21
 [90496.182285]  [c044e3b3] btrfs_async_run_delayed_refs+0xbf/0xd9 [btrfs]
 [90496.183833]  [c04624e1] __btrfs_end_transaction+0x2b6/0x2ec [btrfs]
 [90496.185380]  [c0462522] btrfs_end_transaction+0xb/0xd [btrfs]
 [90496.186940]  [c0451742] find_free_extent+0x8a9/0x976 [btrfs]
 [90496.189464]  [c0451990] btrfs_reserve_extent+0x6f/0x119 [btrfs]
 [90496.191326]  [c0466b45] cow_file_range+0x1a6/0x377 [btrfs]
 [90496.193080]  [c047adc4] ? extent_write_locked_range+0x10c/0x11e [btrfs]
 [90496.194659]  [c04677e4] submit_compressed_extents+0x100/0x412 [btrfs]
 [90496.196225]  [8120e344] ? debug_smp_processor_id+0x17/0x19
 [90496.197776]  [c0467b78] async_cow_submit+0x82/0x87 [btrfs]
 [90496.199383]  [c048644b] normal_work_helper+0x153/0x224 [btrfs]
 [90496.200944]  [81052d8c] process_one_work+0x16f/0x2b8
 [90496.202483]  [81053636] worker_thread+0x27b/0x32e
 [90496.204000]  [810533bb] ? cancel_delayed_work_sync+0x10/0x10
 [90496.205514]  [81058012] kthread+0xb2/0xba
 [90496.207040]  [8147] ? ap_handle_dropped_data+0xf/0xc8
 [90496.208565]  [81057f60] ? __kthread_parkme+0x62/0x62
 [90496.210096]  [81473f6c] ret_from_fork+0x7c/0xb0
 [90496.211618]  [81057f60] ? __kthread_parkme+0x62/0x62


 Ok, this should explain the hang.  submit_compressed_extents is calling
 cow_file_range with a locked page.

 cow_file_range is trying to find a free extent and in the process is
 calling btrfs_end_transaction, which is running the async delayed refs,
 which is trying to write dirty pages, which is waiting for your locked
 page.

 I should be able to reproduce this ;)
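 
 The shape of it is a classic self-deadlock: a thread that holds a page lock
 synchronously waits on work that needs the same page lock. A minimal
 userspace model of that shape (pthreads, not kernel code; build with
 cc -pthread):
 
 #include <pthread.h>
 #include <stdio.h>
 
 static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;
 
 /* stands in for the delayed-ref flush that writes dirty pages */
 static void *flush_dirty_pages(void *arg)
 {
         pthread_mutex_lock(&page_lock); /* blocks: the submitter holds it */
         puts("flushed");                /* never reached */
         pthread_mutex_unlock(&page_lock);
         return NULL;
 }
 
 int main(void)
 {
         pthread_t flusher;
 
         /* submit_compressed_extents(): page is locked */
         pthread_mutex_lock(&page_lock);
         pthread_create(&flusher, NULL, flush_dirty_pages, NULL);
         /* btrfs_async_run_delayed_refs(): waits for the flush forever */
         pthread_join(flusher, NULL);
         pthread_mutex_unlock(&page_lock);
         return 0;
 }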

 This part of the trace is relatively new because Liu Bo's patch made us
 redirty the pages, making it more likely that we'd try to write them
 during commit.

 But, at the end of the day we have a fundamental deadlock with
 committing a transaction while holding a locked page from an ordered file.

 For now, I'm ripping out the strict ordered file and going back to a
 best-effort filemap_flush like ext4 is using.

 I think I've figured the deadlock out; this is obviously a race, and really
 hard to reproduce :-(
 
 So it turns out to be related to workqueues -- now a kthread can process
 work_structs queued on different workqueues, so we can explain the deadlock
 as follows:
 
 (1)
 The btrfs-delalloc workqueue gets a compressed extent to process with all of
 its pages locked during this; it runs into reading the free space cache
 inode, and then waits on lock_page().
 
 (2)
 Reading that free space cache inode comes to the submit part, and we have an
 indirect, two-stage endio path for it: on the first endio we come to
 end_workqueue_bio() and queue a work on the btrfs-endio-meta workqueue, and
 that queued work will run the real endio() for us, but ONLY when it's
 processed.
 
 So the problem is that a kthread can serve several workqueues, which means
 works from the btrfs-endio-meta workqueue and works from the
 btrfs-flush_delalloc workqueue can be on the same processing list of a
 kthread.  When btrfs-flush_delalloc waits for the compressed page and
 btrfs-endio-meta comes after it, it hangs.

 I don't think it is right. All the btrfs workqueues have the RECLAIM flag,
 which means each btrfs workqueue has its own rescuer worker. So the problem
 you describe should not happen.

 Right, I traded some emails with Tejun about this and spent a few days
 trying to prove the workqueues were doing the wrong thing.   It will end
 up spawning another worker thread for the new work, and it won't get
 queued up behind the existing thread.

 If both work items went to the same workqueue, you'd definitely be right.

Re: [PATCH] Btrfs: fix compressed write corruption on enospc

2014-08-11 Thread Miao Xie
On Sun, 10 Aug 2014 22:55:44 +0800, Liu Bo wrote:
 This part of the trace is relatively new because Liu Bo's patch made us
 redirty the pages, making it more likely that we'd try to write them
 during commit.

 But, at the end of the day we have a fundamental deadlock with
 committing a transaction while holding a locked page from an ordered file.

 For now, I'm ripping out the strict ordered file and going back to a
 best-effort filemap_flush like ext4 is using.

 I think I've figured the deadlock out; this is obviously a race, and really
 hard to reproduce :-(
 
 So it turns out to be related to workqueues -- now a kthread can process
 work_structs queued on different workqueues, so we can explain the deadlock
 as follows:
 
 (1)
 The btrfs-delalloc workqueue gets a compressed extent to process with all of
 its pages locked during this; it runs into reading the free space cache
 inode, and then waits on lock_page().
 
 (2)
 Reading that free space cache inode comes to the submit part, and we have an
 indirect, two-stage endio path for it: on the first endio we come to
 end_workqueue_bio() and queue a work on the btrfs-endio-meta workqueue, and
 that queued work will run the real endio() for us, but ONLY when it's
 processed.
 
 So the problem is that a kthread can serve several workqueues, which means
 works from the btrfs-endio-meta workqueue and works from the
 btrfs-flush_delalloc workqueue can be on the same processing list of a
 kthread.  When btrfs-flush_delalloc waits for the compressed page and
 btrfs-endio-meta comes after it, it hangs.

 I don't think it is right. All the btrfs workqueues have the RECLAIM flag,
 which means each btrfs workqueue has its own rescuer worker. So the problem
 you describe should not happen.

 Right, I traded some emails with Tejun about this and spent a few days
 trying to prove the workqueues were doing the wrong thing.   It will end
 up spawning another worker thread for the new work, and it won't get
 queued up behind the existing thread.

 If both work items went to the same workqueue, you'd definitely be right.

 I've got a patch to change the flush-delalloc code so we don't do the
 file writes during commit.  It seems like the only choice right now.
 
 Not the only choice any more ;)
 
 It turns out to be related to async_cow's ordered list. Say we have two
 async_cow works on the wq->ordered_list; the first work (named A) finishes
 its ->ordered_func() and ->ordered_free(), and then the second work (B)
 starts B's ->ordered_func(), which gets to read the free space cache inode,
 where it queues a work on @endio_meta_workers -- but this work happens to
 have the same address as A's work.
 
 So now the situation is,
 
 (1) in the kthread's looping worker_thread(), work A is actually running its
 job,
 (2) however, work A has freed its memory while the kthread still wants to use
 that memory address, which means worker->current_work is still A's address,
 (3) B's readahead for the free space cache inode happens to queue a work
 whose memory address is exactly the previous address of A's work, which means
 another worker's ->current_work is also A's address,
 (4) as in btrfs we use the same function, normal_work_helper(), everywhere,
 worker->current_func is fixed here,
 (5) worker_thread()
       ->process_one_work()
         ->find_worker_executing_work()
           (finds a collision; the work is treated as already running)
 
 Then we saw the hang.

Here is my understanding of what you said:
The same worker dealt with work A and work B, and the 3rd work -- which was
introduced by work B and has the same virtual memory address as work A -- was
also inserted into the work list of that worker. But work B was waiting for
the 3rd work at that time, so a deadlock happened.
Am I right?

If I'm right, I think what you said is impossible. Before we dealt with work
B, we should already have invoked spin_unlock_irq(&pool->lock), which implies
a memory barrier: all changes that happen before the unlock should complete
before the unlock. That is, the address in current_work should be the address
of work B, so when we inserted the 3rd work, which was introduced by work B,
we should not find the address of work A in the current_work of work B's
worker.

I cannot reproduce the problem on my machine, so I haven't verified whether
what I said is right or not. Please correct me if I am wrong.

Thanks
Miao


Re: [PATCH] Btrfs: fix compressed write corruption on enospc

2014-08-10 Thread Liu Bo
On Thu, Aug 07, 2014 at 10:02:15AM -0400, Chris Mason wrote:
 
 
 On 08/07/2014 04:20 AM, Miao Xie wrote:
  On Thu, 7 Aug 2014 15:50:30 +0800, Liu Bo wrote:
  [90496.156016] kworker/u8:14   D 880044e38540 0 21050  2 0x
  [90496.157683] Workqueue: btrfs-delalloc normal_work_helper [btrfs]
  [90496.159320]  88022880f990 0002 880407f649b0 88022880ffd8
  [90496.160997]  880044e38000 00013040 880044e38000 7fff
  [90496.162686]  880301383aa0 0002 814705d0 880301383a98
  [90496.164360] Call Trace:
  [90496.166028]  [814705d0] ? michael_mic.part.6+0x21/0x21
  [90496.167854]  [81470fd0] schedule+0x64/0x66
  [90496.169574]  [814705ff] schedule_timeout+0x2f/0x114
  [90496.171221]  [8106479a] ? wake_up_process+0x2f/0x32
  [90496.172867]  [81062c3b] ? get_parent_ip+0xd/0x3c
  [90496.174472]  [81062ce5] ? preempt_count_add+0x7b/0x8e
  [90496.176053]  [814717f3] __wait_for_common+0x11e/0x163
  [90496.177619]  [814717f3] ? __wait_for_common+0x11e/0x163
  [90496.179173]  [810647aa] ? wake_up_state+0xd/0xd
  [90496.180728]  [81471857] wait_for_completion+0x1f/0x21
  [90496.182285]  [c044e3b3] btrfs_async_run_delayed_refs+0xbf/0xd9 [btrfs]
  [90496.183833]  [c04624e1] __btrfs_end_transaction+0x2b6/0x2ec [btrfs]
  [90496.185380]  [c0462522] btrfs_end_transaction+0xb/0xd [btrfs]
  [90496.186940]  [c0451742] find_free_extent+0x8a9/0x976 [btrfs]
  [90496.189464]  [c0451990] btrfs_reserve_extent+0x6f/0x119 [btrfs]
  [90496.191326]  [c0466b45] cow_file_range+0x1a6/0x377 [btrfs]
  [90496.193080]  [c047adc4] ? extent_write_locked_range+0x10c/0x11e [btrfs]
  [90496.194659]  [c04677e4] submit_compressed_extents+0x100/0x412 [btrfs]
  [90496.196225]  [8120e344] ? debug_smp_processor_id+0x17/0x19
  [90496.197776]  [c0467b78] async_cow_submit+0x82/0x87 [btrfs]
  [90496.199383]  [c048644b] normal_work_helper+0x153/0x224 [btrfs]
  [90496.200944]  [81052d8c] process_one_work+0x16f/0x2b8
  [90496.202483]  [81053636] worker_thread+0x27b/0x32e
  [90496.204000]  [810533bb] ? cancel_delayed_work_sync+0x10/0x10
  [90496.205514]  [81058012] kthread+0xb2/0xba
  [90496.207040]  [8147] ? ap_handle_dropped_data+0xf/0xc8
  [90496.208565]  [81057f60] ? __kthread_parkme+0x62/0x62
  [90496.210096]  [81473f6c] ret_from_fork+0x7c/0xb0
  [90496.211618]  [81057f60] ? __kthread_parkme+0x62/0x62
 
 
  Ok, this should explain the hang.  submit_compressed_extents is calling
  cow_file_range with a locked page.
 
  cow_file_range is trying to find a free extent and in the process is
  calling btrfs_end_transaction, which is running the async delayed refs,
  which is trying to write dirty pages, which is waiting for your locked
  page.
 
  I should be able to reproduce this ;)
 
  This part of the trace is relatively new because Liu Bo's patch made us
  redirty the pages, making it more likely that we'd try to write them
  during commit.
 
  But, at the end of the day we have a fundamental deadlock with
  committing a transaction while holding a locked page from an ordered file.
 
  For now, I'm ripping out the strict ordered file and going back to a
  best-effort filemap_flush like ext4 is using.
 
  I think I've figured the deadlock out; this is obviously a race, and really
  hard to reproduce :-(
 
  So it turns out to be related to workqueues -- now a kthread can process
  work_structs queued on different workqueues, so we can explain the deadlock
  as follows:
 
  (1)
  The btrfs-delalloc workqueue gets a compressed extent to process with all of
  its pages locked during this; it runs into reading the free space cache
  inode, and then waits on lock_page().
 
  (2)
  Reading that free space cache inode comes to the submit part, and we have an
  indirect, two-stage endio path for it: on the first endio we come to
  end_workqueue_bio() and queue a work on the btrfs-endio-meta workqueue, and
  that queued work will run the real endio() for us, but ONLY when it's
  processed.
 
  So the problem is that a kthread can serve several workqueues, which means
  works from the btrfs-endio-meta workqueue and works from the
  btrfs-flush_delalloc workqueue can be on the same processing list of a
  kthread.  When btrfs-flush_delalloc waits for the compressed page and
  btrfs-endio-meta comes after it, it hangs.
  
  I don't think it is right. All the btrfs workqueues have the RECLAIM flag,
  which means each btrfs workqueue has its own rescuer worker. So the problem
  you describe should not happen.
 
 Right, I traded some emails with Tejun about this and spent a few days
 trying to prove the workqueues were doing the wrong thing.   It will end
 up spawning another worker thread for the new work, and it won't get
 queued up behind the existing thread.
 
 If 

Re: [PATCH] Btrfs: fix compressed write corruption on enospc

2014-08-07 Thread Miao Xie
On Thu, 7 Aug 2014 15:50:30 +0800, Liu Bo wrote:
 [90496.156016] kworker/u8:14   D 880044e38540 0 21050  2 0x
 [90496.157683] Workqueue: btrfs-delalloc normal_work_helper [btrfs]
 [90496.159320]  88022880f990 0002 880407f649b0 88022880ffd8
 [90496.160997]  880044e38000 00013040 880044e38000 7fff
 [90496.162686]  880301383aa0 0002 814705d0 880301383a98
 [90496.164360] Call Trace:
 [90496.166028]  [814705d0] ? michael_mic.part.6+0x21/0x21
 [90496.167854]  [81470fd0] schedule+0x64/0x66
 [90496.169574]  [814705ff] schedule_timeout+0x2f/0x114
 [90496.171221]  [8106479a] ? wake_up_process+0x2f/0x32
 [90496.172867]  [81062c3b] ? get_parent_ip+0xd/0x3c
 [90496.174472]  [81062ce5] ? preempt_count_add+0x7b/0x8e
 [90496.176053]  [814717f3] __wait_for_common+0x11e/0x163
 [90496.177619]  [814717f3] ? __wait_for_common+0x11e/0x163
 [90496.179173]  [810647aa] ? wake_up_state+0xd/0xd
 [90496.180728]  [81471857] wait_for_completion+0x1f/0x21
 [90496.182285]  [c044e3b3] btrfs_async_run_delayed_refs+0xbf/0xd9 [btrfs]
 [90496.183833]  [c04624e1] __btrfs_end_transaction+0x2b6/0x2ec [btrfs]
 [90496.185380]  [c0462522] btrfs_end_transaction+0xb/0xd [btrfs]
 [90496.186940]  [c0451742] find_free_extent+0x8a9/0x976 [btrfs]
 [90496.189464]  [c0451990] btrfs_reserve_extent+0x6f/0x119 [btrfs]
 [90496.191326]  [c0466b45] cow_file_range+0x1a6/0x377 [btrfs]
 [90496.193080]  [c047adc4] ? extent_write_locked_range+0x10c/0x11e [btrfs]
 [90496.194659]  [c04677e4] submit_compressed_extents+0x100/0x412 [btrfs]
 [90496.196225]  [8120e344] ? debug_smp_processor_id+0x17/0x19
 [90496.197776]  [c0467b78] async_cow_submit+0x82/0x87 [btrfs]
 [90496.199383]  [c048644b] normal_work_helper+0x153/0x224 [btrfs]
 [90496.200944]  [81052d8c] process_one_work+0x16f/0x2b8
 [90496.202483]  [81053636] worker_thread+0x27b/0x32e
 [90496.204000]  [810533bb] ? cancel_delayed_work_sync+0x10/0x10
 [90496.205514]  [81058012] kthread+0xb2/0xba
 [90496.207040]  [8147] ? ap_handle_dropped_data+0xf/0xc8
 [90496.208565]  [81057f60] ? __kthread_parkme+0x62/0x62
 [90496.210096]  [81473f6c] ret_from_fork+0x7c/0xb0
 [90496.211618]  [81057f60] ? __kthread_parkme+0x62/0x62


 Ok, this should explain the hang.  submit_compressed_extents is calling
 cow_file_range with a locked page.

 cow_file_range is trying to find a free extent and in the process is
 calling btrfs_end_transaction, which is running the async delayed refs,
 which is trying to write dirty pages, which is waiting for your locked
 page.

 I should be able to reproduce this ;)

 This part of the trace is relatively new because Liu Bo's patch made us
 redirty the pages, making it more likely that we'd try to write them
 during commit.

 But, at the end of the day we have a fundamental deadlock with
 committing a transaction while holding a locked page from an ordered file.

 For now, I'm ripping out the strict ordered file and going back to a
 best-effort filemap_flush like ext4 is using.
 
 I think I've figured the deadlock out; this is obviously a race, and really
 hard to reproduce :-(
 
 So it turns out to be related to workqueues -- now a kthread can process
 work_structs queued on different workqueues, so we can explain the deadlock
 as follows:
 
 (1)
 The btrfs-delalloc workqueue gets a compressed extent to process with all of
 its pages locked during this; it runs into reading the free space cache
 inode, and then waits on lock_page().
 
 (2)
 Reading that free space cache inode comes to the submit part, and we have an
 indirect, two-stage endio path for it: on the first endio we come to
 end_workqueue_bio() and queue a work on the btrfs-endio-meta workqueue, and
 that queued work will run the real endio() for us, but ONLY when it's
 processed.
 
 So the problem is that a kthread can serve several workqueues, which means
 works from the btrfs-endio-meta workqueue and works from the
 btrfs-flush_delalloc workqueue can be on the same processing list of a
 kthread.  When btrfs-flush_delalloc waits for the compressed page and
 btrfs-endio-meta comes after it, it hangs.

I don't think it is right. All the btrfs workqueues have the RECLAIM flag,
which means each btrfs workqueue has its own rescuer worker. So the problem
you describe should not happen.

Thanks
Miao

 
 For now, making a btrfs-endio-meta-high workqueue can make it pass, but I'm
 still testing it.
 
 PS:
 However, I still have a question on this: a kernel workqueue kthread can wake
 up another kthread in the same worker pool when it's about to sleep (in
 __schedule()), so in our hang case, does it happen that there is no such
 'another kthread'?
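 
 The mechanism I mean, abridged from __schedule() in kernel/sched/core.c of
 that era: when a workqueue worker blocks, the scheduler asks the worker pool
 whether an idle worker should take over, so the pool keeps making progress.
 
         if (prev->flags & PF_WQ_WORKER) {
                 /* worker is about to sleep; maybe wake an idle co-worker */
                 struct task_struct *to_wakeup;
 
                 to_wakeup = wq_worker_sleeping(prev, cpu);
                 if (to_wakeup)
                         try_to_wake_up_local(to_wakeup);
         }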
 
 PPS:
 Looking at the above new trace, I think we'd better come up with something
 new; making a high priority workqueue seems 

Re: [PATCH] Btrfs: fix compressed write corruption on enospc

2014-08-07 Thread Chris Mason


On 08/07/2014 04:20 AM, Miao Xie wrote:
 On Thu, 7 Aug 2014 15:50:30 +0800, Liu Bo wrote:
 [90496.156016] kworker/u8:14   D 880044e38540 0 21050  2 0x
 [90496.157683] Workqueue: btrfs-delalloc normal_work_helper [btrfs]
 [90496.159320]  88022880f990 0002 880407f649b0 88022880ffd8
 [90496.160997]  880044e38000 00013040 880044e38000 7fff
 [90496.162686]  880301383aa0 0002 814705d0 880301383a98
 [90496.164360] Call Trace:
 [90496.166028]  [814705d0] ? michael_mic.part.6+0x21/0x21
 [90496.167854]  [81470fd0] schedule+0x64/0x66
 [90496.169574]  [814705ff] schedule_timeout+0x2f/0x114
 [90496.171221]  [8106479a] ? wake_up_process+0x2f/0x32
 [90496.172867]  [81062c3b] ? get_parent_ip+0xd/0x3c
 [90496.174472]  [81062ce5] ? preempt_count_add+0x7b/0x8e
 [90496.176053]  [814717f3] __wait_for_common+0x11e/0x163
 [90496.177619]  [814717f3] ? __wait_for_common+0x11e/0x163
 [90496.179173]  [810647aa] ? wake_up_state+0xd/0xd
 [90496.180728]  [81471857] wait_for_completion+0x1f/0x21
 [90496.182285]  [c044e3b3] btrfs_async_run_delayed_refs+0xbf/0xd9 [btrfs]
 [90496.183833]  [c04624e1] __btrfs_end_transaction+0x2b6/0x2ec [btrfs]
 [90496.185380]  [c0462522] btrfs_end_transaction+0xb/0xd [btrfs]
 [90496.186940]  [c0451742] find_free_extent+0x8a9/0x976 [btrfs]
 [90496.189464]  [c0451990] btrfs_reserve_extent+0x6f/0x119 [btrfs]
 [90496.191326]  [c0466b45] cow_file_range+0x1a6/0x377 [btrfs]
 [90496.193080]  [c047adc4] ? extent_write_locked_range+0x10c/0x11e [btrfs]
 [90496.194659]  [c04677e4] submit_compressed_extents+0x100/0x412 [btrfs]
 [90496.196225]  [8120e344] ? debug_smp_processor_id+0x17/0x19
 [90496.197776]  [c0467b78] async_cow_submit+0x82/0x87 [btrfs]
 [90496.199383]  [c048644b] normal_work_helper+0x153/0x224 [btrfs]
 [90496.200944]  [81052d8c] process_one_work+0x16f/0x2b8
 [90496.202483]  [81053636] worker_thread+0x27b/0x32e
 [90496.204000]  [810533bb] ? cancel_delayed_work_sync+0x10/0x10
 [90496.205514]  [81058012] kthread+0xb2/0xba
 [90496.207040]  [8147] ? ap_handle_dropped_data+0xf/0xc8
 [90496.208565]  [81057f60] ? __kthread_parkme+0x62/0x62
 [90496.210096]  [81473f6c] ret_from_fork+0x7c/0xb0
 [90496.211618]  [81057f60] ? __kthread_parkme+0x62/0x62


 Ok, this should explain the hang.  submit_compressed_extents is calling
 cow_file_range with a locked page.

 cow_file_range is trying to find a free extent and in the process is
 calling btrfs_end_transaction, which is running the async delayed refs,
 which is trying to write dirty pages, which is waiting for your locked
 page.

 I should be able to reproduce this ;)

 This part of the trace is relatively new because Liu Bo's patch made us
 redirty the pages, making it more likely that we'd try to write them
 during commit.

 But, at the end of the day we have a fundamental deadlock with
 committing a transaction while holding a locked page from an ordered file.

 For now, I'm ripping out the strict ordered file and going back to a
 best-effort filemap_flush like ext4 is using.

 I think I've figured the deadlock out; this is obviously a race, and really
 hard to reproduce :-(
 
 So it turns out to be related to workqueues -- now a kthread can process
 work_structs queued on different workqueues, so we can explain the deadlock
 as follows:
 
 (1)
 The btrfs-delalloc workqueue gets a compressed extent to process with all of
 its pages locked during this; it runs into reading the free space cache
 inode, and then waits on lock_page().
 
 (2)
 Reading that free space cache inode comes to the submit part, and we have an
 indirect, two-stage endio path for it: on the first endio we come to
 end_workqueue_bio() and queue a work on the btrfs-endio-meta workqueue, and
 that queued work will run the real endio() for us, but ONLY when it's
 processed.
 
 So the problem is that a kthread can serve several workqueues, which means
 works from the btrfs-endio-meta workqueue and works from the
 btrfs-flush_delalloc workqueue can be on the same processing list of a
 kthread.  When btrfs-flush_delalloc waits for the compressed page and
 btrfs-endio-meta comes after it, it hangs.
 
 I don't think it is right. All the btrfs workqueues have the RECLAIM flag,
 which means each btrfs workqueue has its own rescuer worker. So the problem
 you describe should not happen.

Right, I traded some emails with Tejun about this and spent a few days
trying to prove the workqueues were doing the wrong thing.   It will end
up spawning another worker thread for the new work, and it won't get
queued up behind the existing thread.

If both work items went to the same workqueue, you'd definitely be right.

I've got a patch to change the flush-delalloc code so we don't do the
file writes during commit.  It seems like 

Re: [PATCH] Btrfs: fix compressed write corruption on enospc

2014-08-06 Thread Martin Steigerwald
On Monday, 4 August 2014 at 14:50:29, Martin Steigerwald wrote:
 On Friday, 25 July 2014 at 11:54:37, Martin Steigerwald wrote:
  On Thursday, 24 July 2014 at 22:48:05, you wrote:
   When failing to allocate space for the whole compressed extent, we'll
   fall back to uncompressed IO, but we've forgotten to redirty the pages
   which belong to this compressed extent, and these 'clean' pages will
   simply skip the 'submit' part and go to endio directly; in the end we get
   data corruption as we write nothing.
   
   Signed-off-by: Liu Bo bo.li@oracle.com
   ---
    fs/btrfs/inode.c | 12 ++++++++++++
    1 file changed, 12 insertions(+)
   
   diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
   index 3668048..8ea7610 100644
   --- a/fs/btrfs/inode.c
   +++ b/fs/btrfs/inode.c
   @@ -709,6 +709,18 @@ retry:
    			unlock_extent(io_tree, async_extent->start,
    				      async_extent->start +
    				      async_extent->ram_size - 1);
   +
   +			/*
   +			 * we need to redirty the pages if we decide to
   +			 * fallback to uncompressed IO, otherwise we
   +			 * will not submit these pages down to lower
   +			 * layers.
   +			 */
   +			extent_range_redirty_for_io(inode,
   +					async_extent->start,
   +					async_extent->start +
   +					async_extent->ram_size - 1);
   +
    			goto retry;
    		}
    		goto out_free;
  
  I am testing this currently. So far no lockup. Let's see. It still has not
  filled the block device with trees completely after I balanced them:
  
  Label: 'home'  uuid: […]
  
  Total devices 2 FS bytes used 125.57GiB
  devid1 size 160.00GiB used 153.00GiB path /dev/dm-0
  devid2 size 160.00GiB used 153.00GiB path /dev/mapper/sata-home
  
  I believe the lockups happen more easily if the trees occupy all of the disk
  space. Well, I will do some compiling of some KDE components, which may let
  BTRFS fill all space again.
  
  Does this patch mean that when it can't make enough free space in the
  (fragmented) tree, it will write uncompressed?
  
  That would mean one would have to defragment trees regularly to allow
  writes to happen compressed at all times.
  
  Well… of course still better than a lockup or corruption.
 
 So no lockups so far.
 
 Tested with 3.16-rc5, 3.16-rc7, now running 3.16 final.
 
 The /home BTRFS only today got filled completely tree-wise:
 
 merkaba:~ btrfs fi sh /home
 Label: 'home'  uuid: […]
 Total devices 2 FS bytes used 127.35GiB
 devid1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
 devid2 size 160.00GiB used 160.00GiB path /dev/dm-3
 
 But I had KDE and kernel compiles running full throttle at that time and it
 was still good.
 
 Tested-By: Martin Steigerwald mar...@lichtvoll.de
 
 
 I think this should go to stable. Thanks, Liu.

Unfortunately this fix does not seem to fix all lockups.

I just had a hard lockup again while the Java-based CrashPlanPROe app was
backing up company data, which is stored on BTRFS via ecryptfs, to a central
backup server.

It basically happened on about the first heavy write I/O occasion after
the BTRFS trees filled the complete device:

I am now balancing the trees down to lower sizes manually with

btrfs balance start -dusage=10 /home

btrfs balance start -musage=10 /home

and raising the values. BTW, I ran out of space when trying both at the same
time:

merkaba:~#1 btrfs balance start -dusage=10 -musage=10 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail

merkaba:~#1 btrfs fi sh /home
Label: 'home'  uuid: […]
Total devices 2 FS bytes used 128.76GiB
devid1 size 160.00GiB used 146.00GiB path /dev/dm-0
devid2 size 160.00GiB used 146.00GiB path /dev/mapper/sata-home

So meanwhile I am pretty sure that the hangs can best be triggered *if* the
BTRFS trees fill the complete device.

I will try to keep tree sizes down as a workaround for now, even if it means
additional write access to the SSD devices.

And I will make sure tree sizes stay down on my first server BTRFS as well,
although it uses the Debian backport kernel 3.14 and thus may not be affected.

Are there any other fixes to try out? I'd really like to see this resolved.
It's in two stable kernel releases already: 3.15 and 3.16. And that means
that, if not fixed, the next Debian stable (Jessie) will be affected by it.


Some kern.log (have stored the complete file)

Aug  6 12:01:16 merkaba kernel: [90496.262084] INFO: task java:21301 blocked for more than 120 seconds.
Aug  6 

Re: [PATCH] Btrfs: fix compressed write corruption on enospc

2014-08-06 Thread Hugo Mills
On Wed, Aug 06, 2014 at 12:21:59PM +0200, Martin Steigerwald wrote:
 It basically happened on about the first heavy write I/O occasion after
 the BTRFS trees filled the complete device:
 
 I am now balancing the trees down to lower sizes manually with
 
 btrfs balance start -dusage=10 /home
 
 btrfs balance start -musage=10 /home

   Note that balance has nothing to do with balancing the metadata
trees. The tree structures are automatically balanced as part of their
normal operation. A btrfs balance start is a much higher-level
operation. It's called balance because the overall effect is to
balance the data usage evenly across multiple devices. (Actually, to
balance the available space evenly).

   Also note that the data part isn't tree-structured, so referring to
balancing the trees with a -d flag is doubly misleading. :)

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- You know... I'm sure this code would seem a lot better if I ---   
 never tried running it. 




Re: [PATCH] Btrfs: fix compressed write corruption on enospc

2014-08-06 Thread Martin Steigerwald
On Wednesday, 6 August 2014 at 11:29:19, Hugo Mills wrote:
 On Wed, Aug 06, 2014 at 12:21:59PM +0200, Martin Steigerwald wrote:
  It basically happened on about the first heavy write I/O occasion after
  the BTRFS trees filled the complete device:
  
  I am now balancing the trees down to lower sizes manually with
  
  btrfs balance start -dusage=10 /home
  
  btrfs balance start -musage=10 /home
 
Note that balance has nothing to do with balancing the metadata
 trees. The tree structures are automatically balanced as part of their
 normal operation. A btrfs balance start is a much higher-level
 operation. It's called balance because the overall effect is to
 balance the data usage evenly across multiple devices. (Actually, to
 balance the available space evenly).
 
Also note that the data part isn't tree-structured, so referring to
 balancing the trees with a -d flag is doubly misleading. :)

Hmm, it changes the used size in

merkaba:~ btrfs fi sh /home
Label: 'home'  uuid: […]
Total devices 2 FS bytes used 129.12GiB
devid1 size 160.00GiB used 142.03GiB path /dev/dm-0
devid2 size 160.00GiB used 142.03GiB path /dev/mapper/sata-home

and I thought this is the size used by the trees BTRFS creates.

So you are saying it does not balance shortest versus longest path, as the
tree algorithm does this automatically, but just the *data* in the tree?

In any case: I should not be required to do this kind of manual maintenance
in order to prevent BTRFS from locking up hard on write accesses.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7



Re: [PATCH] Btrfs: fix compressed write corruption on enospc

2014-08-06 Thread Chris Mason
On 08/06/2014 06:21 AM, Martin Steigerwald wrote:

 I think this should go to stable. Thanks, Liu.

I'm definitely tagging this for stable.

 
 Unfortunately this fix does not seem to fix all lockups.

The traces below are a little different, could you please send the whole
file?

-chris

 
 Just had a hard lockup again while the Java-based CrashPlanPROe app was
 backing up company data, which is stored on BTRFS via ecryptfs, to a central
 backup server.
 
 It basically happened on about the first heavy write I/O occasion after
 the BTRFS trees filled the complete device:
 
 I am now balancing the trees down to lower sizes manually with
 
 btrfs balance start -dusage=10 /home
 
 btrfs balance start -musage=10 /home
 
 and raising the values. BTW, I ran out of space when trying both at the same
 time:
 
 merkaba:~#1 btrfs balance start -dusage=10 -musage=10 /home
 ERROR: error during balancing '/home' - No space left on device
 There may be more info in syslog - try dmesg | tail
 
 merkaba:~#1 btrfs fi sh /home
 Label: 'home'  uuid: […]
 Total devices 2 FS bytes used 128.76GiB
 devid1 size 160.00GiB used 146.00GiB path /dev/dm-0
 devid2 size 160.00GiB used 146.00GiB path /dev/mapper/sata-home
 
 So meanwhile I am pretty sure that the hangs can best be triggered *if* the
 BTRFS trees fill the complete device.
 
 I will try to keep tree sizes down as a workaround for now, even if it means
 additional write access to the SSD devices.
 
 And I will make sure tree sizes stay down on my first server BTRFS as well,
 although it uses the Debian backport kernel 3.14 and thus may not be
 affected.
 
 Are there any other fixes to try out? I'd really like to see this resolved.
 It's in two stable kernel releases already: 3.15 and 3.16. And that means
 that, if not fixed, the next Debian stable (Jessie) will be affected by it.
 
 
 Some kern.log (have stored the complete file)
 
 Aug  6 12:01:16 merkaba kernel: [90496.262084] INFO: task java:21301 blocked for more than 120 seconds.
 Aug  6 12:01:16 merkaba kernel: [90496.263626]   Tainted: G   O  3.16.0-tp520-fixcompwrite+ #3
 Aug  6 12:01:16 merkaba kernel: [90496.265159] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
 Aug  6 12:01:16 merkaba kernel: [90496.266756] java    D 880044e3cef0 0 21301  1 0x
 Aug  6 12:01:16 merkaba kernel: [90496.268353]  8801960e3bd8 0002 880407f649b0 8801960e3fd8
 Aug  6 12:01:16 merkaba kernel: [90496.269980]  880044e3c9b0 00013040 880044e3c9b0 88041e293040
 Aug  6 12:01:16 merkaba kernel: [90496.271766]  88041e5c6868 8801960e3c70 0002 810db1d9
 Aug  6 12:01:16 merkaba kernel: [90496.273383] Call Trace:
 Aug  6 12:01:16 merkaba kernel: [90496.275017]  [810db1d9] ? wait_on_page_read+0x37/0x37
 Aug  6 12:01:16 merkaba kernel: [90496.276630]  [81470fd0] schedule+0x64/0x66
 Aug  6 12:01:16 merkaba kernel: [90496.278209]  [81471157] io_schedule+0x57/0x76
 Aug  6 12:01:16 merkaba kernel: [90496.279817]  [810db1e2] sleep_on_page+0x9/0xd
 Aug  6 12:01:16 merkaba kernel: [90496.281403]  [8147152d] __wait_on_bit_lock+0x41/0x85
 Aug  6 12:01:16 merkaba kernel: [90496.282991]  [810db29f] __lock_page+0x70/0x7c
 Aug  6 12:01:16 merkaba kernel: [90496.284550]  [81070f3a] ? autoremove_wake_function+0x2f/0x2f
 Aug  6 12:01:16 merkaba kernel: [90496.286156]  [c0476617] lock_page+0x1e/0x21 [btrfs]
 Aug  6 12:01:16 merkaba kernel: [90496.287742]  [c0476617] ? lock_page+0x1e/0x21 [btrfs]
 Aug  6 12:01:16 merkaba kernel: [90496.289344]  [c047a8f4] extent_write_cache_pages.isra.21.constprop.42+0x1a7/0x2d9 [btrfs]
 Aug  6 12:01:16 merkaba kernel: [90496.290955]  [810dba82] ? find_get_pages_tag+0xfc/0x123
 Aug  6 12:01:16 merkaba kernel: [90496.292574]  [c047ae1c] extent_writepages+0x46/0x57 [btrfs]
 Aug  6 12:01:16 merkaba kernel: [90496.294154]  [c04650b2] ? btrfs_submit_direct+0x3ef/0x3ef [btrfs]
 Aug  6 12:01:16 merkaba kernel: [90496.295760]  [c046361f] btrfs_writepages+0x23/0x25 [btrfs]
 Aug  6 12:01:16 merkaba kernel: [90496.297492]  [810e556b] do_writepages+0x1b/0x24
 Aug  6 12:01:16 merkaba kernel: [90496.299035]  [810dc88a] __filemap_fdatawrite_range+0x50/0x52
 Aug  6 12:01:16 merkaba kernel: [90496.300561]  [810dc8f1] filemap_fdatawrite_range+0xe/0x10
 Aug  6 12:01:16 merkaba kernel: [90496.302118]  [c046ea30] btrfs_sync_file+0x67/0x2bd [btrfs]
 Aug  6 12:01:16 merkaba kernel: [90496.303630]  [810dc88a] ? __filemap_fdatawrite_range+0x50/0x52
 Aug  6 12:01:16 merkaba kernel: [90496.305158]  [81156289] vfs_fsync_range+0x1c/0x1e
 Aug  6 12:01:16 merkaba kernel: [90496.306669]  [811562a2] vfs_fsync+0x17/0x19
 Aug  6 12:01:16 merkaba kernel: [90496.308197]  [c04f732a] ecryptfs_fsync+0x2f/0x34 [ecryptfs]

Re: [PATCH] Btrfs: fix compressed write corruption on enospc

2014-08-06 Thread Martin Steigerwald
On Wednesday, 6 August 2014 at 09:35:51, Chris Mason wrote:
 On 08/06/2014 06:21 AM, Martin Steigerwald wrote:
  I think this should go to stable. Thanks, Liu.
 
 I'm definitely tagging this for stable.
 
  Unfortunately this fix does not seem to fix all lockups.
 
 The traces below are a little different, could you please send the whole
 file?

Will paste it at the end.
 
  Just had a hard lockup again while the Java-based CrashPlanPROe app was
  backing up company data, which is stored on BTRFS via ecryptfs, to a
  central backup server.
  
  It basically happened on about the first heavy write I/O occasion after
  the BTRFS trees filled the complete device:
  
  I am now balancing the trees down to lower sizes manually with
  
  btrfs balance start -dusage=10 /home
  
  btrfs balance start -musage=10 /home
  
  and raising the values. BTW, I ran out of space when trying both at the
  same time:
  
  merkaba:~#1 btrfs balance start -dusage=10 -musage=10 /home
  ERROR: error during balancing '/home' - No space left on device
  There may be more info in syslog - try dmesg | tail
  
  merkaba:~#1 btrfs fi sh /home
  Label: 'home'  uuid: […]
  
  Total devices 2 FS bytes used 128.76GiB
  devid1 size 160.00GiB used 146.00GiB path /dev/dm-0
  devid2 size 160.00GiB used 146.00GiB path /dev/mapper/sata-home
  
  So meanwhile I am pretty sure that the hangs can best be triggered *if* the
  BTRFS trees fill the complete device.
  
  I will try to keep tree sizes down as a workaround for now, even if it
  means additional write access to the SSD devices.
  
  And I will make sure tree sizes stay down on my first server BTRFS as well,
  although it uses the Debian backport kernel 3.14 and thus may not be
  affected.
  
  Are there any other fixes to try out? I'd really like to see this resolved.
  It's in two stable kernel releases already: 3.15 and 3.16. And that means
  that, if not fixed, the next Debian stable (Jessie) will be affected by it.
  
  
  Some kern.log (have stored the complete file)
[…]

Attached, xz compressed, since 180 KB of plaintext in a mail is a bit much.

It is complete from resuming from hibernation this morning, through the BTRFS
hang, the reboot, and the first balancing runs to make BTRFS more stable
again.

Interestingly, it was ecryptfs reporting issues before the BTRFS hang got
reported. There are several slightly different looking traces in there.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

kern.log.xz
Description: application/xz


Re: [PATCH] Btrfs: fix compressed write corruption on enospc

2014-08-06 Thread Chris Mason
On 08/06/2014 11:18 AM, Chris Mason wrote:
 On 08/06/2014 10:43 AM, Martin Steigerwald wrote:
 Am Mittwoch, 6. August 2014, 09:35:51 schrieb Chris Mason:
 On 08/06/2014 06:21 AM, Martin Steigerwald wrote:
 I think this should go to stable. Thanks, Liu.

 I'm definitely tagging this for stable.

 Unfortunately this fix does not seem to fix all lockups.

 The traces below are a little different, could you please send the whole
 file?

 Will paste it at the end.
 
 [90496.156016] kworker/u8:14   D 880044e38540 0 21050  2 0x
 [90496.157683] Workqueue: btrfs-delalloc normal_work_helper [btrfs]
 [90496.159320]  88022880f990 0002 880407f649b0 88022880ffd8
 [90496.160997]  880044e38000 00013040 880044e38000 7fff
 [90496.162686]  880301383aa0 0002 814705d0 880301383a98
 [90496.164360] Call Trace:
 [90496.166028]  [814705d0] ? michael_mic.part.6+0x21/0x21
 [90496.167854]  [81470fd0] schedule+0x64/0x66
 [90496.169574]  [814705ff] schedule_timeout+0x2f/0x114
 [90496.171221]  [8106479a] ? wake_up_process+0x2f/0x32
 [90496.172867]  [81062c3b] ? get_parent_ip+0xd/0x3c
 [90496.174472]  [81062ce5] ? preempt_count_add+0x7b/0x8e
 [90496.176053]  [814717f3] __wait_for_common+0x11e/0x163
 [90496.177619]  [814717f3] ? __wait_for_common+0x11e/0x163
 [90496.179173]  [810647aa] ? wake_up_state+0xd/0xd
 [90496.180728]  [81471857] wait_for_completion+0x1f/0x21
 [90496.182285]  [c044e3b3] btrfs_async_run_delayed_refs+0xbf/0xd9 [btrfs]
 [90496.183833]  [c04624e1] __btrfs_end_transaction+0x2b6/0x2ec [btrfs]
 [90496.185380]  [c0462522] btrfs_end_transaction+0xb/0xd [btrfs]
 [90496.186940]  [c0451742] find_free_extent+0x8a9/0x976 [btrfs]
 [90496.189464]  [c0451990] btrfs_reserve_extent+0x6f/0x119 [btrfs]
 [90496.191326]  [c0466b45] cow_file_range+0x1a6/0x377 [btrfs]
 [90496.193080]  [c047adc4] ? extent_write_locked_range+0x10c/0x11e [btrfs]
 [90496.194659]  [c04677e4] submit_compressed_extents+0x100/0x412 [btrfs]
 [90496.196225]  [8120e344] ? debug_smp_processor_id+0x17/0x19
 [90496.197776]  [c0467b78] async_cow_submit+0x82/0x87 [btrfs]
 [90496.199383]  [c048644b] normal_work_helper+0x153/0x224 [btrfs]
 [90496.200944]  [81052d8c] process_one_work+0x16f/0x2b8
 [90496.202483]  [81053636] worker_thread+0x27b/0x32e
 [90496.204000]  [810533bb] ? cancel_delayed_work_sync+0x10/0x10
 [90496.205514]  [81058012] kthread+0xb2/0xba
 [90496.207040]  [8147] ? ap_handle_dropped_data+0xf/0xc8
 [90496.208565]  [81057f60] ? __kthread_parkme+0x62/0x62
 [90496.210096]  [81473f6c] ret_from_fork+0x7c/0xb0
 [90496.211618]  [81057f60] ? __kthread_parkme+0x62/0x62
 
 
 Ok, this should explain the hang.  submit_compressed_extents is calling
 cow_file_range with a locked page.
 
 cow_file_range is trying to find a free extent and in the process is
 calling btrfs_end_transaction, which is running the async delayed refs,
 which is trying to write dirty pages, which is waiting for your locked
 page.
 
 I should be able to reproduce this ;)

This part of the trace is relatively new because Liu Bo's patch made us
redirty the pages, making it more likely that we'd try to write them
during commit.

But, at the end of the day we have a fundamental deadlock with
committing a transaction while holding a locked page from an ordered file.

For now, I'm ripping out the strict ordered file and going back to a
best-effort filemap_flush like ext4 is using.
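 
The best-effort flavor, as a minimal sketch (the helper name is hypothetical;
only filemap_flush() is the real API): filemap_flush() starts WB_SYNC_NONE
writeback and returns without waiting for it to finish, so the commit path no
longer has to block on pages it cannot safely write right now.

static int start_delalloc_best_effort(struct inode *inode)
{
	/* kick writeback for this inode's dirty pages, don't wait */
	return filemap_flush(inode->i_mapping);
}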

-chris



Re: [PATCH] Btrfs: fix compressed write corruption on enospc

2014-08-04 Thread Martin Steigerwald
On Friday, 25 July 2014 at 11:54:37, Martin Steigerwald wrote:
 On Thursday, 24 July 2014 at 22:48:05, you wrote:
  When failing to allocate space for the whole compressed extent, we'll
  fall back to uncompressed IO, but we've forgotten to redirty the pages
  which belong to this compressed extent, and these 'clean' pages will
  simply skip the 'submit' part and go to endio directly; in the end we get
  data corruption as we write nothing.
  
  Signed-off-by: Liu Bo bo.li@oracle.com
  ---
   fs/btrfs/inode.c | 12 ++++++++++++
   1 file changed, 12 insertions(+)
  
  diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
  index 3668048..8ea7610 100644
  --- a/fs/btrfs/inode.c
  +++ b/fs/btrfs/inode.c
  @@ -709,6 +709,18 @@ retry:
   			unlock_extent(io_tree, async_extent->start,
   				      async_extent->start +
   				      async_extent->ram_size - 1);
  +
  +			/*
  +			 * we need to redirty the pages if we decide to
  +			 * fallback to uncompressed IO, otherwise we
  +			 * will not submit these pages down to lower
  +			 * layers.
  +			 */
  +			extent_range_redirty_for_io(inode,
  +					async_extent->start,
  +					async_extent->start +
  +					async_extent->ram_size - 1);
  +
   			goto retry;
   		}
   		goto out_free;
 
 I am testing this currently. So far no lockup. Let's see. It still has not
 filled the block device with trees completely after I balanced them:
 
 Label: 'home'  uuid: […]
 Total devices 2 FS bytes used 125.57GiB
 devid1 size 160.00GiB used 153.00GiB path /dev/dm-0
 devid2 size 160.00GiB used 153.00GiB path /dev/mapper/sata-home
 
 I believe the lockups happen more easily if the trees occupy all of the disk
 space. Well, I will do some compiling of some KDE components, which may let
 BTRFS fill all space again.
 
 Does this patch mean that when it can't make enough free space in the
 (fragmented) tree, it will write uncompressed?
 
 That would mean one would have to defragment trees regularly to allow
 writes to happen compressed at all times.
 
 Well… of course still better than a lockup or corruption.

So no lockups so far.

Tested with 3.16-rc5, 3.16-rc7, now running 3.16 final.

The /home BTRFS only today got filled completely tree-wise:

merkaba:~ btrfs fi sh /home
Label: 'home'  uuid: […]
Total devices 2 FS bytes used 127.35GiB
devid1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
devid2 size 160.00GiB used 160.00GiB path /dev/dm-3

But I had KDE and kernel compiles running full throttle at that time and it
was still good.

Tested-By: Martin Steigerwald mar...@lichtvoll.de


I think this should go to stable. Thanks, Liu.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


Re: [PATCH] Btrfs: fix compressed write corruption on enospc

2014-08-04 Thread Martin Steigerwald
On Monday, 4 August 2014 at 14:50:29, Martin Steigerwald wrote:
 On Friday, 25 July 2014 at 11:54:37, Martin Steigerwald wrote:
  On Thursday, 24 July 2014 at 22:48:05, you wrote:
   When failing to allocate space for the whole compressed extent, we'll
   fall back to uncompressed IO, but we've forgotten to redirty the pages
   which belong to this compressed extent, and these 'clean' pages will
   simply skip the 'submit' part and go to endio directly; in the end we get
   data corruption as we write nothing.
   
   Signed-off-by: Liu Bo bo.li@oracle.com
   ---
    fs/btrfs/inode.c | 12 ++++++++++++
    1 file changed, 12 insertions(+)
   
   diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
   index 3668048..8ea7610 100644
   --- a/fs/btrfs/inode.c
   +++ b/fs/btrfs/inode.c
   @@ -709,6 +709,18 @@ retry:
    			unlock_extent(io_tree, async_extent->start,
    				      async_extent->start +
    				      async_extent->ram_size - 1);
   +
   +			/*
   +			 * we need to redirty the pages if we decide to
   +			 * fallback to uncompressed IO, otherwise we
   +			 * will not submit these pages down to lower
   +			 * layers.
   +			 */
   +			extent_range_redirty_for_io(inode,
   +					async_extent->start,
   +					async_extent->start +
   +					async_extent->ram_size - 1);
   +
    			goto retry;
    		}
    		goto out_free;
  
  I am testing this currently. So far no lockup. Let's see. It still has not
  filled the block device with trees completely after I balanced them:
  
  Label: 'home'  uuid: […]
  Total devices 2 FS bytes used 125.57GiB
  devid1 size 160.00GiB used 153.00GiB path /dev/dm-0
  devid2 size 160.00GiB used 153.00GiB path /dev/mapper/sata-home
  
  I believe the lockups happen more easily if the trees occupy all of the
  disk space. Well, I will do some compiling of some KDE components, which
  may let BTRFS fill all space again.
  
  Does this patch mean that when it can't make enough free space in the
  (fragmented) tree, it will write uncompressed?
  
  That would mean one would have to defragment trees regularly to allow
  writes to happen compressed at all times.
  
  Well… of course still better than a lockup or corruption.
 
 So no lockups so far.

No lockups, of course.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


Re: [PATCH] Btrfs: fix compressed write corruption on enospc

2014-07-25 Thread Martin Steigerwald
On Thursday, 24 July 2014 at 22:48:05, you wrote:
 When failing to allocate space for the whole compressed extent, we'll
 fall back to uncompressed IO, but we've forgotten to redirty the pages
 which belong to this compressed extent, and these 'clean' pages will
 simply skip the 'submit' part and go to endio directly; in the end we get
 data corruption as we write nothing.
 
 Signed-off-by: Liu Bo bo.li@oracle.com
 ---
  fs/btrfs/inode.c | 12 ++++++++++++
  1 file changed, 12 insertions(+)
 
 diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
 index 3668048..8ea7610 100644
 --- a/fs/btrfs/inode.c
 +++ b/fs/btrfs/inode.c
 @@ -709,6 +709,18 @@ retry:
  			unlock_extent(io_tree, async_extent->start,
  				      async_extent->start +
  				      async_extent->ram_size - 1);
 +
 +			/*
 +			 * we need to redirty the pages if we decide to
 +			 * fallback to uncompressed IO, otherwise we
 +			 * will not submit these pages down to lower
 +			 * layers.
 +			 */
 +			extent_range_redirty_for_io(inode,
 +					async_extent->start,
 +					async_extent->start +
 +					async_extent->ram_size - 1);
 +
  			goto retry;
  		}
  		goto out_free;

I am testing this currently. So far no lockup. Let's see. It still has not
filled the block device with trees completely after I balanced them:

Label: 'home'  uuid: […]
Total devices 2 FS bytes used 125.57GiB
devid1 size 160.00GiB used 153.00GiB path /dev/dm-0
devid2 size 160.00GiB used 153.00GiB path /dev/mapper/sata-home

I believe the lockups happen more easily if the trees occupy all of the disk
space. Well, I will do some compiling of some KDE components, which may let
BTRFS fill all space again.

Does this patch mean that when it can't make enough free space in the
(fragmented) tree, it will write uncompressed?

That would mean one would have to defragment trees regularly to allow
writes to happen compressed at all times.

Well… of course still better than a lockup or corruption.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


Re: [PATCH] Btrfs: fix compressed write corruption on enospc

2014-07-25 Thread Martin Steigerwald
On Friday, 25 July 2014 at 09:00:19, Liu Bo wrote:
 On Thu, Jul 24, 2014 at 10:55:47AM -0400, Chris Mason wrote:
  On 07/24/2014 10:48 AM, Liu Bo wrote:
   When failing to allocate space for the whole compressed extent, we'll
   fallback to uncompressed IO, but we've forgotten to redirty the pages
   which belong to this compressed extent, and these 'clean' pages will
   simply skip 'submit' part and go to endio directly, at last we got data
   corruption as we write nothing.
  
  This fallback code was my #1 suspect for the hangs people have been
  seeing since 3.15.  I changed things around to trigger the fallback
  randomly and wasn't able to trigger problems, but I was looking for
  hangs and not corruptions.
 
 So now you're able to trigger the hang without changing the fallback code?
 
 I tried raid1 and raid0 with fsmark and rsync in different ways but still
 failed to reproduce the hang :-(
 
 The weirdest thing is who the hell holds the free space inode's page -- is
 it possible to share pages with another inode? (My answer is NO, but I'm
 not sure now...)

Can you try doing this on a BTRFS filesystem that already has all of the block
device's free space filled with its trees? My suspicion is that this highly
contributes to the likelihood of this hang happening (see my other mail).

This really seems hard to hit in a test scenario, while it happens regularly
on some production filesystems.

Making a kernel package with make-kpkg (from the kernel-package package in
Debian) triggers the lockup quite reliably here.

That said, with 3.16.0-rc6-tp520-fixcompwrite+ there has been no lockup so
far, but after balancing, the trees do not yet fill the whole device again.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


Re: [PATCH] Btrfs: fix compressed write corruption on enospc

2014-07-24 Thread Chris Mason
On 07/24/2014 10:48 AM, Liu Bo wrote:
 When failing to allocate space for the whole compressed extent, we'll
 fall back to uncompressed IO, but we've forgotten to redirty the pages
 which belong to this compressed extent, and these 'clean' pages will
 simply skip the 'submit' part and go to endio directly; in the end we get
 data corruption as we write nothing.

This fallback code was my #1 suspect for the hangs people have been
seeing since 3.15.  I changed things around to trigger the fallback
randomly and wasn't able to trigger problems, but I was looking for
hangs and not corruptions.

-chris




Re: [PATCH] Btrfs: fix compressed write corruption on enospc

2014-07-24 Thread Liu Bo
On Thu, Jul 24, 2014 at 10:55:47AM -0400, Chris Mason wrote:
 On 07/24/2014 10:48 AM, Liu Bo wrote:
  When failing to allocate space for the whole compressed extent, we'll
  fall back to uncompressed IO, but we've forgotten to redirty the pages
  which belong to this compressed extent, and these 'clean' pages will
  simply skip the 'submit' part and go to endio directly; in the end we get
  data corruption as we write nothing.
 
 This fallback code was my #1 suspect for the hangs people have been
 seeing since 3.15.  I changed things around to trigger the fallback
 randomly and wasn't able to trigger problems, but I was looking for
 hangs and not corruptions.
 

So now you're able to trigger the hang without changing the fallback code?

I tried raid1 and raid0 with fsmark and rsync in different ways but still
failed to reproduce the hang :-(

The weirdest thing is who the hell holds the free space inode's page -- is it
possible to share pages with another inode? (My answer is NO, but I'm not sure
now...)

thanks,
-liubo

 -chris
 
 


Re: [PATCH] Btrfs: fix compressed write corruption on enospc

2014-07-24 Thread Wang Shilong

On 07/24/2014 10:48 PM, Liu Bo wrote:

When failing to allocate space for the whole compressed extent, we'll
fall back to uncompressed IO, but we've forgotten to redirty the pages
which belong to this compressed extent, and these 'clean' pages will
simply skip the 'submit' part and go to endio directly; in the end we get
data corruption as we write nothing.

Signed-off-by: Liu Bo bo.li@oracle.com
---
  fs/btrfs/inode.c | 12 ++++++++++++
  1 file changed, 12 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3668048..8ea7610 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -709,6 +709,18 @@ retry:
 			unlock_extent(io_tree, async_extent->start,
 				      async_extent->start +
 				      async_extent->ram_size - 1);
+
+			/*
+			 * we need to redirty the pages if we decide to
+			 * fallback to uncompressed IO, otherwise we
+			 * will not submit these pages down to lower
+			 * layers.
+			 */
+			extent_range_redirty_for_io(inode,
+					async_extent->start,
+					async_extent->start +
+					async_extent->ram_size - 1);
+
 			goto retry;

BTW, if such an ENOSPC happens, it means we could not reserve space for the
compressed extent. So when we retry with the no-compression code, it will try
to reserve even more space. Any reason we do such things?

Thanks,
Wang

 		}
 		goto out_free;




Re: [PATCH] Btrfs: fix compressed write corruption on enospc

2014-07-24 Thread Wang Shilong

On 07/25/2014 10:08 AM, Liu Bo wrote:

On Fri, Jul 25, 2014 at 09:53:43AM +0800, Wang Shilong wrote:

On 07/24/2014 10:48 PM, Liu Bo wrote:

When failing to allocate space for the whole compressed extent, we'll
fall back to uncompressed IO, but we've forgotten to redirty the pages
which belong to this compressed extent, and these 'clean' pages will
simply skip the 'submit' part and go to endio directly; in the end we get
data corruption as we write nothing.

Signed-off-by: Liu Bo bo.li@oracle.com
---
  fs/btrfs/inode.c | 12 ++++++++++++
  1 file changed, 12 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3668048..8ea7610 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -709,6 +709,18 @@ retry:
 			unlock_extent(io_tree, async_extent->start,
 				      async_extent->start +
 				      async_extent->ram_size - 1);
+
+			/*
+			 * we need to redirty the pages if we decide to
+			 * fallback to uncompressed IO, otherwise we
+			 * will not submit these pages down to lower
+			 * layers.
+			 */
+			extent_range_redirty_for_io(inode,
+					async_extent->start,
+					async_extent->start +
+					async_extent->ram_size - 1);
+
 			goto retry;

BTW, if such an ENOSPC happens, it means we could not reserve space for the
compressed extent. So when we retry with the no-compression code, it will try
to reserve even more space. Any reason we do such things?

Compressed extents need contiguous space, while uncompressed extents have
more choices.

Yeah, that is reasonable. Thanks for your answer.^_^
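
To make that concrete, a toy model of the two allocation rules (userspace C,
purely illustrative; the free-space layout is made up): a compressed extent
must land in one contiguous free run of its full size, while an uncompressed
range can be satisfied piecewise from several smaller runs.

#include <stdbool.h>
#include <stdio.h>

/* hypothetical free-space runs, in KiB */
static const int free_runs[] = { 64, 32, 48 };
static const int nruns = sizeof(free_runs) / sizeof(free_runs[0]);

static bool alloc_compressed(int need)
{
	/* must fit entirely inside a single contiguous run */
	for (int i = 0; i < nruns; i++)
		if (free_runs[i] >= need)
			return true;
	return false;
}

static bool alloc_uncompressed(int need)
{
	/* may be split across several extents */
	int total = 0;
	for (int i = 0; i < nruns; i++)
		total += free_runs[i];
	return total >= need;
}

int main(void)
{
	int need = 128;	/* a 128 KiB extent */
	printf("compressed:   %s\n", alloc_compressed(need) ? "ok" : "ENOSPC");
	printf("uncompressed: %s\n", alloc_uncompressed(need) ? "ok" : "ENOSPC");
	return 0;
}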



thanks,
-liubo

Thanks,
Wang


