Re: [PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind

2013-07-02 Thread Jan Schmidt
On Sun, June 30, 2013 at 15:55 (+0200), Josef Bacik wrote:
 On Sun, Jun 30, 2013 at 10:25:05AM +0200, Jan Schmidt wrote:
 On 30.06.2013 05:17, Josef Bacik wrote:
 We need to hold the tree mod log lock in __tree_mod_log_rewind since we walk
 forward in the tree mod entries, otherwise we'll end up with random entries 
 and
 trip the BUG_ON() at the front of __tree_mod_log_rewind.  This fixes the 
 panics
 people were seeing when running

 find /whatever -type f -exec btrfs fi defrag {} \;

 This patch cannot help to solve the problem, as far as I've understood
 what is going on. It does change timing, though, which presumably makes
 it pass the current reproducer we're having.

 On rewinding, iteration through the tree mod log rb-tree goes backwards
 in time, which means that once we've found our staring point we cannot
 be trapped by later additions. The old items we're rewinding towards
 cannot be freed, because we've allocated a blocker element within the
 tree and rewinding never goes beyond the allocated blocker. The blocker
 element is allocated by btrfs_get_tree_mod_seq and mostly referred to as
 time_seq within the other tree mod log functions in ctree.c. To sum up,
 the added lock is not required.

 The debug output I've analyzed so far shows that after we've rewinded
 all REMOVE_WHILE_FREEING operations on a buffer, ordered consecutively
 as expected, there comes another REMOVE_WHILE_FREEING with a sequence
 number much further in the past for the same buffer (but that sequence
 number is still higher than out time_seq rewind barrier at that point).
 This must be a logical problem I've not completely understood so far,
 but locking doesn't seem to be the right track.

 
 Finally reproduced it, this is my output
 
  btrfs-endio-wri-23110 [000] ...2  9556.882103: __tree_mod_log_rewind: 
 rewinding 15450537984
  btrfs-endio-wri-23110 [000] ...2  9556.882104: __tree_mod_log_rewind: 
 15450537984: processing 880246590a40, op 3, seq 68719476829, slot 0
  btrfs-endio-wri-23110 [000] ...2  9556.882106: __tree_mod_log_rewind: 
 15450537984: processing 880246590ac0, op 3, seq 68719476828, slot 1
  btrfs-endio-wri-23110 [000] ...2  9556.882108: __tree_mod_log_rewind: 
 15450537984: processing 880246590a40, op 3, seq 68719476829, slot 0
  btrfs-endio-wri-23110 [000] ...2  9556.882110: __tree_mod_log_rewind: 
 15450537984: this tm is failing, 880246590a40, seq 68719476829, slot 0
 
 so I'm inclined to beleive I've got it right.  Thanks,

Looking at the code I agree we should have a read lock around rb_next,
protecting it against reorganization during insertions. Fits to that kind of
debug output.

How about just getting the lock for the rb_next call? There can be quite a lot
of operations to rewind and I'd rather not have every other fs tree modification
block on that.

Thanks,
-Jan

 Josef
 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind

2013-07-01 Thread Liu Bo
On Sun, Jun 30, 2013 at 02:01:17PM -0400, Josef Bacik wrote:
 On Sun, Jun 30, 2013 at 11:02:10PM +0800, Liu Bo wrote:
  On Sun, Jun 30, 2013 at 07:22:00AM -0400, Josef Bacik wrote:
   On Sun, Jun 30, 2013 at 10:25:05AM +0200, Jan Schmidt wrote:
On 30.06.2013 05:17, Josef Bacik wrote:
 We need to hold the tree mod log lock in __tree_mod_log_rewind since 
 we walk
 forward in the tree mod entries, otherwise we'll end up with random 
 entries and
 trip the BUG_ON() at the front of __tree_mod_log_rewind.  This fixes 
 the panics
 people were seeing when running
 
 find /whatever -type f -exec btrfs fi defrag {} \;

This patch cannot help to solve the problem, as far as I've understood
what is going on. It does change timing, though, which presumably makes
it pass the current reproducer we're having.

On rewinding, iteration through the tree mod log rb-tree goes backwards
in time, which means that once we've found our staring point we cannot
be trapped by later additions. The old items we're rewinding towards
cannot be freed, because we've allocated a blocker element within the
tree and rewinding never goes beyond the allocated blocker. The blocker
element is allocated by btrfs_get_tree_mod_seq and mostly referred to as
time_seq within the other tree mod log functions in ctree.c. To sum up,
the added lock is not required.

The debug output I've analyzed so far shows that after we've rewinded
all REMOVE_WHILE_FREEING operations on a buffer, ordered consecutively
as expected, there comes another REMOVE_WHILE_FREEING with a sequence
number much further in the past for the same buffer (but that sequence
number is still higher than out time_seq rewind barrier at that point).
This must be a logical problem I've not completely understood so far,
but locking doesn't seem to be the right track.

   
   So I've seen 2 scenarios with my debug output, once with my printk() which
   seemed to be unreliable and then again with trace_printk() which was much 
   more
   reliable.
   
   With the printk() output I saw what you described, however I was also 
   printing
   out every mod we added and I never saw the sequence number allocated that 
   we had
   suddenly tripped over, so I assumed that the printk() just got lost 
   between the
   boxes (I was using netconsole to capture the output).  However if it was 
   never
   actually allocated then it would make sense that we just pointed into some
   random peice of memory.
   
   The second bit I saw which is when I was convinced this was the problem 
   had us
   processing the keys, and suddenly we'd try to process the key again.  So 
   I saw
   something like this (I lost the output because I was capturing the output 
   into
   the same file again when I tried to reproduce with this patch so I don't 
   have
   the exact output)
   
   processing op 3, seq 10 slot 0
   processing op 3, seq 9 slot 1
   processing op 3, seq 8 slot 2
   processing op 3, seq 7 slot 3
   processing op 3, seq 6 slot 4
   processing op 3, seq 5 slot 5
   processing op 3, seq 8 slot 2
   
   so we got an entry we had already processed, which is completely in line 
   with
   the rb_tree pointing to something random when we do rb_next.
   
   Now I had another theory early on, but I haven't seen output to confirm 
   this.
   We can re-allocate a block if we've allocated it in the same transaction, 
   and it
   appears this would completely screw up the tree mod log.  So say we 
   allocate
   block 5, we add keys to it, and then we free it, so we get this
   MOVE_WHILE_FREEING thing done on the entire block, but then we 
   re-allocate the
   same block and start doing normal operations to it.  Then we go to replay 
   back
   and we will replay all the normal operations, but then we will walk into 
   the
   REMOVE_WHILE_FREEING entries from when the block was free'd last and this 
   will
   trigger the BUG_ON().  Now I hadn't actually had this happen so either it 
   can't
   happen some how and I've missed the reasoning for it, or what I saw was 
   just the
   rb_tree messup because we aren't locking it and what you are seeing is 
   this
   particular case.  Either way I'm going to back my patch out and re-run a 
   few
   more times with my trace_printk()'s so I can have several runs to look at 
   and
   verify what I'm seeing.  Thanks,
  
  I doubt the theory here, if we free a block, it's marked as pinned and
  will then be marked as free only when we commit the transaction, isn't it?
  
 
 Not sure if my other reply made it through, my phone only responds with HTML
 which is kind of a pain.
 
 If you look at btrfs_free_tree_block we will just add the space back if the
 buffer was allocated in this transaction and hasn't been written yet, so this
 case can happen.  I think we should probably add another mod log operation 
 that
 gets added when this 

Re: [PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind

2013-06-30 Thread Jan Schmidt
On 30.06.2013 05:17, Josef Bacik wrote:
 We need to hold the tree mod log lock in __tree_mod_log_rewind since we walk
 forward in the tree mod entries, otherwise we'll end up with random entries 
 and
 trip the BUG_ON() at the front of __tree_mod_log_rewind.  This fixes the 
 panics
 people were seeing when running
 
 find /whatever -type f -exec btrfs fi defrag {} \;

This patch cannot help to solve the problem, as far as I've understood
what is going on. It does change timing, though, which presumably makes
it pass the current reproducer we're having.

On rewinding, iteration through the tree mod log rb-tree goes backwards
in time, which means that once we've found our staring point we cannot
be trapped by later additions. The old items we're rewinding towards
cannot be freed, because we've allocated a blocker element within the
tree and rewinding never goes beyond the allocated blocker. The blocker
element is allocated by btrfs_get_tree_mod_seq and mostly referred to as
time_seq within the other tree mod log functions in ctree.c. To sum up,
the added lock is not required.

The debug output I've analyzed so far shows that after we've rewinded
all REMOVE_WHILE_FREEING operations on a buffer, ordered consecutively
as expected, there comes another REMOVE_WHILE_FREEING with a sequence
number much further in the past for the same buffer (but that sequence
number is still higher than out time_seq rewind barrier at that point).
This must be a logical problem I've not completely understood so far,
but locking doesn't seem to be the right track.

Thanks,
-Jan


 Thansk,
 
 Signed-off-by: Josef Bacik jba...@fusionio.com
 ---
  fs/btrfs/ctree.c |   10 ++
  1 files changed, 6 insertions(+), 4 deletions(-)
 
 diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
 index c32d03d..7921e1d 100644
 --- a/fs/btrfs/ctree.c
 +++ b/fs/btrfs/ctree.c
 @@ -1161,8 +1161,8 @@ __tree_mod_log_oldest_root(struct btrfs_fs_info 
 *fs_info,
   * time_seq).
   */
  static void
 -__tree_mod_log_rewind(struct extent_buffer *eb, u64 time_seq,
 -   struct tree_mod_elem *first_tm)
 +__tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct extent_buffer 
 *eb,
 +   u64 time_seq, struct tree_mod_elem *first_tm)
  {
   u32 n;
   struct rb_node *next;
 @@ -1172,6 +1172,7 @@ __tree_mod_log_rewind(struct extent_buffer *eb, u64 
 time_seq,
   unsigned long p_size = sizeof(struct btrfs_key_ptr);
  
   n = btrfs_header_nritems(eb);
 + tree_mod_log_read_lock(fs_info);
   while (tm  tm-seq = time_seq) {
   /*
* all the operations are recorded with the operator used for
 @@ -1226,6 +1227,7 @@ __tree_mod_log_rewind(struct extent_buffer *eb, u64 
 time_seq,
   if (tm-index != first_tm-index)
   break;
   }
 + tree_mod_log_read_unlock(fs_info);
   btrfs_set_header_nritems(eb, n);
  }
  
 @@ -1274,7 +1276,7 @@ tree_mod_log_rewind(struct btrfs_fs_info *fs_info, 
 struct extent_buffer *eb,
  
   extent_buffer_get(eb_rewin);
   btrfs_tree_read_lock(eb_rewin);
 - __tree_mod_log_rewind(eb_rewin, time_seq, tm);
 + __tree_mod_log_rewind(fs_info, eb_rewin, time_seq, tm);
   WARN_ON(btrfs_header_nritems(eb_rewin) 
   BTRFS_NODEPTRS_PER_BLOCK(fs_info-tree_root));
  
 @@ -1350,7 +1352,7 @@ get_old_root(struct btrfs_root *root, u64 time_seq)
   btrfs_set_header_generation(eb, old_generation);
   }
   if (tm)
 - __tree_mod_log_rewind(eb, time_seq, tm);
 + __tree_mod_log_rewind(root-fs_info, eb, time_seq, tm);
   else
   WARN_ON(btrfs_header_level(eb) != 0);
   WARN_ON(btrfs_header_nritems(eb)  BTRFS_NODEPTRS_PER_BLOCK(root));
 
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind

2013-06-30 Thread Josef Bacik
On Sun, Jun 30, 2013 at 10:25:05AM +0200, Jan Schmidt wrote:
 On 30.06.2013 05:17, Josef Bacik wrote:
  We need to hold the tree mod log lock in __tree_mod_log_rewind since we walk
  forward in the tree mod entries, otherwise we'll end up with random entries 
  and
  trip the BUG_ON() at the front of __tree_mod_log_rewind.  This fixes the 
  panics
  people were seeing when running
  
  find /whatever -type f -exec btrfs fi defrag {} \;
 
 This patch cannot help to solve the problem, as far as I've understood
 what is going on. It does change timing, though, which presumably makes
 it pass the current reproducer we're having.
 
 On rewinding, iteration through the tree mod log rb-tree goes backwards
 in time, which means that once we've found our staring point we cannot
 be trapped by later additions. The old items we're rewinding towards
 cannot be freed, because we've allocated a blocker element within the
 tree and rewinding never goes beyond the allocated blocker. The blocker
 element is allocated by btrfs_get_tree_mod_seq and mostly referred to as
 time_seq within the other tree mod log functions in ctree.c. To sum up,
 the added lock is not required.
 
 The debug output I've analyzed so far shows that after we've rewinded
 all REMOVE_WHILE_FREEING operations on a buffer, ordered consecutively
 as expected, there comes another REMOVE_WHILE_FREEING with a sequence
 number much further in the past for the same buffer (but that sequence
 number is still higher than out time_seq rewind barrier at that point).
 This must be a logical problem I've not completely understood so far,
 but locking doesn't seem to be the right track.
 

So I've seen 2 scenarios with my debug output, once with my printk() which
seemed to be unreliable and then again with trace_printk() which was much more
reliable.

With the printk() output I saw what you described, however I was also printing
out every mod we added and I never saw the sequence number allocated that we had
suddenly tripped over, so I assumed that the printk() just got lost between the
boxes (I was using netconsole to capture the output).  However if it was never
actually allocated then it would make sense that we just pointed into some
random peice of memory.

The second bit I saw which is when I was convinced this was the problem had us
processing the keys, and suddenly we'd try to process the key again.  So I saw
something like this (I lost the output because I was capturing the output into
the same file again when I tried to reproduce with this patch so I don't have
the exact output)

processing op 3, seq 10 slot 0
processing op 3, seq 9 slot 1
processing op 3, seq 8 slot 2
processing op 3, seq 7 slot 3
processing op 3, seq 6 slot 4
processing op 3, seq 5 slot 5
processing op 3, seq 8 slot 2

so we got an entry we had already processed, which is completely in line with
the rb_tree pointing to something random when we do rb_next.

Now I had another theory early on, but I haven't seen output to confirm this.
We can re-allocate a block if we've allocated it in the same transaction, and it
appears this would completely screw up the tree mod log.  So say we allocate
block 5, we add keys to it, and then we free it, so we get this
MOVE_WHILE_FREEING thing done on the entire block, but then we re-allocate the
same block and start doing normal operations to it.  Then we go to replay back
and we will replay all the normal operations, but then we will walk into the
REMOVE_WHILE_FREEING entries from when the block was free'd last and this will
trigger the BUG_ON().  Now I hadn't actually had this happen so either it can't
happen some how and I've missed the reasoning for it, or what I saw was just the
rb_tree messup because we aren't locking it and what you are seeing is this
particular case.  Either way I'm going to back my patch out and re-run a few
more times with my trace_printk()'s so I can have several runs to look at and
verify what I'm seeing.  Thanks,

Josef
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind

2013-06-30 Thread Josef Bacik
On Sun, Jun 30, 2013 at 10:25:05AM +0200, Jan Schmidt wrote:
 On 30.06.2013 05:17, Josef Bacik wrote:
  We need to hold the tree mod log lock in __tree_mod_log_rewind since we walk
  forward in the tree mod entries, otherwise we'll end up with random entries 
  and
  trip the BUG_ON() at the front of __tree_mod_log_rewind.  This fixes the 
  panics
  people were seeing when running
  
  find /whatever -type f -exec btrfs fi defrag {} \;
 
 This patch cannot help to solve the problem, as far as I've understood
 what is going on. It does change timing, though, which presumably makes
 it pass the current reproducer we're having.
 
 On rewinding, iteration through the tree mod log rb-tree goes backwards
 in time, which means that once we've found our staring point we cannot
 be trapped by later additions. The old items we're rewinding towards
 cannot be freed, because we've allocated a blocker element within the
 tree and rewinding never goes beyond the allocated blocker. The blocker
 element is allocated by btrfs_get_tree_mod_seq and mostly referred to as
 time_seq within the other tree mod log functions in ctree.c. To sum up,
 the added lock is not required.
 
 The debug output I've analyzed so far shows that after we've rewinded
 all REMOVE_WHILE_FREEING operations on a buffer, ordered consecutively
 as expected, there comes another REMOVE_WHILE_FREEING with a sequence
 number much further in the past for the same buffer (but that sequence
 number is still higher than out time_seq rewind barrier at that point).
 This must be a logical problem I've not completely understood so far,
 but locking doesn't seem to be the right track.
 

Finally reproduced it, this is my output

 btrfs-endio-wri-23110 [000] ...2  9556.882103: __tree_mod_log_rewind: 
rewinding 15450537984
 btrfs-endio-wri-23110 [000] ...2  9556.882104: __tree_mod_log_rewind: 
15450537984: processing 880246590a40, op 3, seq 68719476829, slot 0
 btrfs-endio-wri-23110 [000] ...2  9556.882106: __tree_mod_log_rewind: 
15450537984: processing 880246590ac0, op 3, seq 68719476828, slot 1
 btrfs-endio-wri-23110 [000] ...2  9556.882108: __tree_mod_log_rewind: 
15450537984: processing 880246590a40, op 3, seq 68719476829, slot 0
 btrfs-endio-wri-23110 [000] ...2  9556.882110: __tree_mod_log_rewind: 
15450537984: this tm is failing, 880246590a40, seq 68719476829, slot 0

so I'm inclined to beleive I've got it right.  Thanks,

Josef
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind

2013-06-30 Thread Josef Bacik
On Sun, Jun 30, 2013 at 11:02:10PM +0800, Liu Bo wrote:
 On Sun, Jun 30, 2013 at 07:22:00AM -0400, Josef Bacik wrote:
  On Sun, Jun 30, 2013 at 10:25:05AM +0200, Jan Schmidt wrote:
   On 30.06.2013 05:17, Josef Bacik wrote:
We need to hold the tree mod log lock in __tree_mod_log_rewind since we 
walk
forward in the tree mod entries, otherwise we'll end up with random 
entries and
trip the BUG_ON() at the front of __tree_mod_log_rewind.  This fixes 
the panics
people were seeing when running

find /whatever -type f -exec btrfs fi defrag {} \;
   
   This patch cannot help to solve the problem, as far as I've understood
   what is going on. It does change timing, though, which presumably makes
   it pass the current reproducer we're having.
   
   On rewinding, iteration through the tree mod log rb-tree goes backwards
   in time, which means that once we've found our staring point we cannot
   be trapped by later additions. The old items we're rewinding towards
   cannot be freed, because we've allocated a blocker element within the
   tree and rewinding never goes beyond the allocated blocker. The blocker
   element is allocated by btrfs_get_tree_mod_seq and mostly referred to as
   time_seq within the other tree mod log functions in ctree.c. To sum up,
   the added lock is not required.
   
   The debug output I've analyzed so far shows that after we've rewinded
   all REMOVE_WHILE_FREEING operations on a buffer, ordered consecutively
   as expected, there comes another REMOVE_WHILE_FREEING with a sequence
   number much further in the past for the same buffer (but that sequence
   number is still higher than out time_seq rewind barrier at that point).
   This must be a logical problem I've not completely understood so far,
   but locking doesn't seem to be the right track.
   
  
  So I've seen 2 scenarios with my debug output, once with my printk() which
  seemed to be unreliable and then again with trace_printk() which was much 
  more
  reliable.
  
  With the printk() output I saw what you described, however I was also 
  printing
  out every mod we added and I never saw the sequence number allocated that 
  we had
  suddenly tripped over, so I assumed that the printk() just got lost between 
  the
  boxes (I was using netconsole to capture the output).  However if it was 
  never
  actually allocated then it would make sense that we just pointed into some
  random peice of memory.
  
  The second bit I saw which is when I was convinced this was the problem had 
  us
  processing the keys, and suddenly we'd try to process the key again.  So I 
  saw
  something like this (I lost the output because I was capturing the output 
  into
  the same file again when I tried to reproduce with this patch so I don't 
  have
  the exact output)
  
  processing op 3, seq 10 slot 0
  processing op 3, seq 9 slot 1
  processing op 3, seq 8 slot 2
  processing op 3, seq 7 slot 3
  processing op 3, seq 6 slot 4
  processing op 3, seq 5 slot 5
  processing op 3, seq 8 slot 2
  
  so we got an entry we had already processed, which is completely in line 
  with
  the rb_tree pointing to something random when we do rb_next.
  
  Now I had another theory early on, but I haven't seen output to confirm 
  this.
  We can re-allocate a block if we've allocated it in the same transaction, 
  and it
  appears this would completely screw up the tree mod log.  So say we allocate
  block 5, we add keys to it, and then we free it, so we get this
  MOVE_WHILE_FREEING thing done on the entire block, but then we re-allocate 
  the
  same block and start doing normal operations to it.  Then we go to replay 
  back
  and we will replay all the normal operations, but then we will walk into the
  REMOVE_WHILE_FREEING entries from when the block was free'd last and this 
  will
  trigger the BUG_ON().  Now I hadn't actually had this happen so either it 
  can't
  happen some how and I've missed the reasoning for it, or what I saw was 
  just the
  rb_tree messup because we aren't locking it and what you are seeing is this
  particular case.  Either way I'm going to back my patch out and re-run a few
  more times with my trace_printk()'s so I can have several runs to look at 
  and
  verify what I'm seeing.  Thanks,
 
 I doubt the theory here, if we free a block, it's marked as pinned and
 will then be marked as free only when we commit the transaction, isn't it?
 

Not sure if my other reply made it through, my phone only responds with HTML
which is kind of a pain.

If you look at btrfs_free_tree_block we will just add the space back if the
buffer was allocated in this transaction and hasn't been written yet, so this
case can happen.  I think we should probably add another mod log operation that
gets added when this happens so that if we do re-allocate it and try to rewind
back to when it belonged to another tree we can do that without panicing.
Thanks,

Josef
--
To unsubscribe from this list: send the