trying to balance, filesystem keeps going read-only.

2015-11-01 Thread Ken Long
I have a file system of four 5TB drives. Well, one drive is 8TB with a
5TB partition; the rest are 5TB drives.  I created the initial btrfs
file system on one drive and rsync'd data to it, added another drive and
rsync'd data, then added a third drive and rsync'd data. After adding a
fourth drive I am trying to balance, but the file system gets an error
and I have to reboot to get the file system out of read-only.

I don't think it is a hardware issue.. but it could be... or it could be
some kind of bug in btrfs?
To recover, I've been commenting out the btrfs entry in fstab, shutting
down, powering off, verifying the cable connections, powering on,
verifying all the devices are present, and remounting.

This kind of balance was successful:
btrfs balance start -dusage=55 /mnt/magenta/

but this gets the error below and goes into Read-only:
btrfs filesystem balance /mnt/magenta/

my kernel now is:

Linux ubuntu 4.2.0-17-lowlatency #21-Ubuntu SMP PREEMPT Fri Oct 23
20:40:07 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

~# btrfs filesystem  show
Label: 'uv000'  uuid: f974efbd-82cb-4812-a3b3-eb12ae470f2c
Total devices 4 FS bytes used 11.17TiB
devid    1 size 4.77TiB used 4.26TiB path /dev/sdf1
devid    2 size 4.55TiB used 4.09TiB path /dev/sde
devid    3 size 4.55TiB used 2.74TiB path /dev/sdb
devid    4 size 4.55TiB used 96.03GiB path /dev/sdg

Label: 'LinuxSampler'  uuid: 55aa3bf2-f319-42c7-8f1d-20ee35a91f9a
Total devices 1 FS bytes used 555.80GiB
devid    1 size 2.51TiB used 558.04GiB path /dev/sdf2

# btrfs filesystem df /mnt/btrfs/
Data, single: total=11.16TiB, used=11.16TiB
System, RAID1: total=32.00MiB, used=1.20MiB
Metadata, RAID1: total=10.00GiB, used=8.29GiB
Metadata, DUP: total=4.50GiB, used=4.21GiB
GlobalReserve, single: total=512.00MiB, used=0.00B



# btrfs filesystem usage /mnt/btrfs/
Overall:
Device size:  18.41TiB
Device allocated:  11.19TiB
Device unallocated:   7.22TiB
Device missing: 0.00B
Used:  11.18TiB
Free (estimated):   7.22TiB (min: 3.61TiB)
Data ratio:  1.00
Metadata ratio:  2.00
Global reserve: 512.00MiB (used: 0.00B)

Data,single: Size:11.16TiB, Used:11.16TiB
   /dev/sdb   2.73TiB
   /dev/sde   4.09TiB
   /dev/sdf1   4.25TiB
   /dev/sdg  93.00GiB

Metadata,RAID1: Size:10.00GiB, Used:8.29GiB
   /dev/sdb   5.00GiB
   /dev/sde   6.00GiB
   /dev/sdf1   6.00GiB
   /dev/sdg   3.00GiB

Metadata,DUP: Size:4.50GiB, Used:4.21GiB
   /dev/sdf1   9.00GiB

System,RAID1: Size:32.00MiB, Used:1.20MiB
   /dev/sdb  32.00MiB
   /dev/sdg  32.00MiB

Unallocated:
   /dev/sdb   1.81TiB
   /dev/sde 465.53GiB
   /dev/sdf1 515.80GiB
   /dev/sdg   4.45TiB



and the tail of dmesg -

[50363.836660] ata10: EH complete
[64894.794201] BTRFS info (device sdb): relocating block group
12704106414080 flags 1
[64899.944105] BTRFS info (device sdb): found 124 extents
[64906.622024] BTRFS info (device sdb): found 124 extents
[64906.915197] BTRFS info (device sdb): relocating block group
12703032672256 flags 1
[64915.409311] BTRFS info (device sdb): found 813 extents
[64947.160961] ata10.00: exception Emask 0x0 SAct 0x7fff SErr 0x0
action 0x6 frozen
[64947.160966] ata10.00: failed command: WRITE FPDMA QUEUED
[64947.160970] ata10.00: cmd 61/c0:00:38:8a:1d/0f:00:0c:00:00/40 tag 0
ncq 2064384 out
res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask
0x4 (timeout)
[64947.160975] ata10.00: status: { DRDY }
[64947.160977] ata10.00: failed command: WRITE FPDMA QUEUED
[64947.160980] ata10.00: cmd 61/80:08:f8:99:1d/0a:00:0c:00:00/40 tag 1
ncq 1376256 out
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask
0x4 (timeout)
[64947.160981] ata10.00: status: { DRDY }
[64947.160982] ata10.00: failed command: WRITE FPDMA QUEUED
[64947.160985] ata10.00: cmd 61/c0:10:78:a4:1d/0f:00:0c:00:00/40 tag 2
ncq 2064384 out
res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask
0x4 (timeout)
[64947.160987] ata10.00: status: { DRDY }
[64947.160988] ata10.00: failed command: WRITE FPDMA QUEUED
[64947.160991] ata10.00: cmd 61/80:18:38:b4:1d/0a:00:0c:00:00/40 tag 3
ncq 1376256 out
res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask
0x4 (timeout)
[64947.160992] ata10.00: status: { DRDY }
[64947.160994] ata10.00: failed command: WRITE FPDMA QUEUED
[64947.160996] ata10.00: cmd 61/c0:20:b8:be:1d/0f:00:0c:00:00/40 tag 4
ncq 2064384 out
res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask
0x4 (timeout)
[64947.160998] ata10.00: status: { DRDY }
[64947.160999] ata10.00: failed command: WRITE FPDMA QUEUED
[64947.161002] ata10.00: cmd 61/40:28:78:ce:1d/1a:00:0c:00:00/40 tag 5
ncq 3440640 out
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask
0x4 (timeout)
[64947.161003] ata10.00: status: { DRDY }
[64947.161005] ata10.00: failed command: WRITE FPDMA QUEUED
[64947.161007] ata10.00: cmd 61/40:30:b8:e8:1d/1a:00:0c:00:00/40 tag 6
ncq 3440640 out
res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask
0x4 (timeout)
[64947.161009] ata10.00: status: { DRDY }
[64947.161010] 

Re: trying to balance, filesystem keeps going read-only.

2015-11-01 Thread Hugo Mills
On Sun, Nov 01, 2015 at 06:24:53AM -0500, Ken Long wrote:
> I have a file system of four 5TB drives. Well, one drive is 8TB with a
> 5TB partition.. the rest are 5TB drives.  I created the initial btrfs
> file system on on drive. rsync'd data to it. added another drive.
> rsync'd data. added a third drive, rsync'd data. Added a four drive,
> trying to balance. The file system gets an error and I have to reboot
> to get the file system out of read only.
> 
> I dont think it is hardware issue..but It could be...  or it could be
> some kind bug in btrfs?

   Looks very much like a hardware error to me. This stuff:

> [64947.160961] ata10.00: exception Emask 0x0 SAct 0x7fff SErr 0x0
> action 0x6 frozen
> [64947.160966] ata10.00: failed command: WRITE FPDMA QUEUED
> [64947.160970] ata10.00: cmd 61/c0:00:38:8a:1d/0f:00:0c:00:00/40 tag 0
> ncq 2064384 out
> res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask
> 0x4 (timeout)

is coming from the ATA layer, a couple of layers below btrfs, and
would definitely indicate some kind of issue with the hardware.

> [66025.199406] ata10: softreset failed (1st FIS failed)
> [66025.199417] ata10: hard resetting link
> [66030.407703] ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
> [66030.407713] ata10.00: link online but device misclassified
> [66030.407746] ata10: EH complete
> [66030.408360] sd 9:0:0:0: [sdg] tag#16 FAILED Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> [66030.408363] sd 9:0:0:0: [sdg] tag#16 CDB: Write(16) 8a 00 00 00 00
> 00 09 a4 bf 80 00 00 49 80 00 00
> [66030.408365] blk_update_request: I/O error, dev sdg, sector 161791872
> [66030.408369] BTRFS: bdev /dev/sdg errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
> [66030.408439] BTRFS: bdev /dev/sdg errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
> [66030.408537] BTRFS: bdev /dev/sdg errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
> [66030.408643] BTRFS: bdev /dev/sdg errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
> [66030.408768] BTRFS: bdev /dev/sdg errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
> [66030.408880] BTRFS: bdev /dev/sdg errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
> [66030.408985] BTRFS: bdev /dev/sdg errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
> [66030.409082] BTRFS: bdev /dev/sdg errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
> [66030.409180] BTRFS: bdev /dev/sdg errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
> [66030.409284] BTRFS: bdev /dev/sdg errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
> [66030.409847] sd 9:0:0:0: [sdg] tag#17 FAILED Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> [66030.409850] sd 9:0:0:0: [sdg] tag#17 CDB: Write(16) 8a 00 00 00 00
> 00 09 a5 09 00 00 00 44 40 00 00
> [66030.409851] blk_update_request: I/O error, dev sdg, sector 161810688
> [66030.411235] sd 9:0:0:0: [sdg] tag#18 FAILED Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> [66030.411238] sd 9:0:0:0: [sdg] tag#18 CDB: Write(16) 8a 00 00 00 00
> 00 09 a5 4d 40 00 00 49 80 00 00
> [66030.411239] blk_update_request: I/O error, dev sdg, sector 161828160
> [66030.412695] sd 9:0:0:0: [sdg] tag#19 FAILED Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> [66030.412697] sd 9:0:0:0: [sdg] tag#19 CDB: Write(16) 8a 00 00 00 00
> 00 09 a5 96 c0 00 00 49 80 00 00
> [66030.412699] blk_update_request: I/O error, dev sdg, sector 161846976
> [66030.414113] sd 9:0:0:0: [sdg] tag#20 FAILED Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> [66030.414115] sd 9:0:0:0: [sdg] tag#20 CDB: Write(16) 8a 00 00 00 00
> 00 09 a5 e0 40 00 00 1f 80 00 00
> [66030.414117] blk_update_request: I/O error, dev sdg, sector 161865792
> [66030.414755] sd 9:0:0:0: [sdg] tag#21 FAILED Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> [66030.414758] sd 9:0:0:0: [sdg] tag#21 CDB: Write(16) 8a 00 00 00 00
> 00 09 a5 ff c0 00 00 15 00 00 00
> [66030.414759] blk_update_request: I/O error, dev sdg, sector 161873856
> [66030.415205] sd 9:0:0:0: [sdg] tag#22 FAILED Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> [66030.415207] sd 9:0:0:0: [sdg] tag#22 CDB: Write(16) 8a 00 00 00 00
> 00 09 a6 14 c0 00 00 44 40 00 00
> [66030.415208] blk_update_request: I/O error, dev sdg, sector 161879232
> [66030.416562] sd 9:0:0:0: [sdg] tag#23 FAILED Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> [66030.416564] sd 9:0:0:0: [sdg] tag#23 CDB: Write(16) 8a 00 00 00 00
> 00 09 a6 59 00 00 00 44 40 00 00
> [66030.416572] blk_update_request: I/O error, dev sdg, sector 161896704
> [66030.417922] sd 9:0:0:0: [sdg] tag#24 FAILED Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> [66030.417924] sd 9:0:0:0: [sdg] tag#24 CDB: Write(16) 8a 00 00 00 00
> 00 09 a6 9d 40 00 00 49 80 00 00
> [66030.417926] blk_update_request: I/O error, dev sdg, sector 161914176
> [66030.419365] sd 9:0:0:0: [sdg] tag#25 FAILED Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> [66030.419368] sd 9:0:0:0: [sdg] tag#25 CDB: Write(16) 8a 00 00 00 00
> 00 09 a6 e6 c0 00 00 49 80 00 00

   Here, we've got 

Re: Crash during mount -o degraded, kernel BUG at fs/btrfs/extent_io.c:2044

2015-11-01 Thread Anand Jain




This is misleading: these error messages might make one think that the
4th drive is bad and has to be replaced, which would reduce the
redundancy to the minimum, because it's the second drive that's actually
bad.


The following RFC will solve the misleading part of the problem:

[RFC PATCH] Btrfs: fix fs logging for multi device

Thanks, Anand


Re: trying to balance, filesystem keeps going read-only.

2015-11-01 Thread Ken Long
I get a similar read-only status when I try to remove the drive from the array..

Too bad the utility's operation cannot be slowed down.. to avoid
triggering this error?

I had some success putting data *onto* the drive by croning sync every
two seconds in a different terminal.

Doesn't seem to be fixed yet..
https://bugzilla.kernel.org/show_bug.cgi?id=93581


On Sun, Nov 1, 2015 at 9:17 AM, Roman Mamedov  wrote:
> On Sun, 1 Nov 2015 09:07:08 -0500
> Ken Long  wrote:
>
>> Yes, the one drive is that Seagate 8TB drive..
>>
>> Smart tools doesn't show anything outrageous or obvious in hardware.
>>
>> Is there any other info I can provide to isolate, troubleshoot further?
>>
>> I'm not sure how to correlate the dmesg message to a specific drive,
>> SATA cable etc..
>
> See this discussion: http://www.spinics.net/lists/linux-btrfs/msg48054.html
>
> My guess is these drives need to do a lot of housekeeping internally,
> especially during heavy write load or random writes, and do not reply to the
> host machine in time, which translates into those "frozen [...] failed
> command: WRITE FPDMA QUEUED" failures.
>
> I did not follow the issue closely enough to know if there's a solution yet, 
> or
> even if this is specific to Btrfs or to GNU/Linux in general. Maybe your best
> bet would be to avoid using that drive in your Btrfs array altogether for the
> time being.
>
> --
> With respect,
> Roman


Re: Crash during mount -o degraded, kernel BUG at fs/btrfs/extent_io.c:2044

2015-11-01 Thread Philip Seeger


On 11/01/2015 04:22 AM, Duncan wrote:

So what btrfs is logging to dmesg on mount here, are the historical error
counts, in this case expected as they were deliberate during your test,
nearly 200K of them, not one or more new errors.

To have btrfs report these at the CLI, use btrfs device stats.  To zero


Thanks for clarifying. I forgot to check btrfs dev stats. That explains it.

The 2-drive failure scenario still caused data corruption with 4.3-rc7 
though.



Philip


Re: trying to balance, filesystem keeps going read-only.

2015-11-01 Thread Roman Mamedov
On Sun, 1 Nov 2015 06:24:53 -0500
Ken Long  wrote:

> Well, one drive is 8TB with a 5TB partition.

Is this by any chance a Seagate "SMR" drive? From what I remember seeing on
the list, those do not work well with Btrfs currently, with symptoms very
similar to what you're seeing.

-- 
With respect,
Roman




Re: trying to balance, filesystem keeps going read-only.

2015-11-01 Thread Ken Long
Yes, the one drive is that Seagate 8TB drive..

SMART tools don't show anything outrageous or obvious in the hardware.

Is there any other info I can provide to isolate and troubleshoot further?

I'm not sure how to correlate the dmesg messages to a specific drive,
SATA cable, etc.



On Sun, Nov 1, 2015 at 8:48 AM, Roman Mamedov  wrote:
> On Sun, 1 Nov 2015 06:24:53 -0500
> Ken Long  wrote:
>
>> Well, one drive is 8TB with a 5TB partition.
>
> Is this by any chance a Seagate "SMR" drive? From what I remember seeing on
> the list, those do not work well with Btrfs currently, with symptoms very
> similar to what you're seeing.
>
> --
> With respect,
> Roman


Re: trying to balance, filesystem keeps going read-only.

2015-11-01 Thread Roman Mamedov
On Sun, 1 Nov 2015 09:07:08 -0500
Ken Long  wrote:

> Yes, the one drive is that Seagate 8TB drive..
> 
> Smart tools doesn't show anything outrageous or obvious in hardware.
> 
> Is there any other info I can provide to isolate, troubleshoot further?
> 
> I'm not sure how to correlate the dmesg message to a specific drive,
> SATA cable etc..

See this discussion: http://www.spinics.net/lists/linux-btrfs/msg48054.html

My guess is these drives need to do a lot of housekeeping internally,
especially during heavy write load or random writes, and do not reply to the
host machine in time, which translates into those "frozen [...] failed
command: WRITE FPDMA QUEUED" failures.

I did not follow the issue closely enough to know if there's a solution yet, or
even if this is specific to Btrfs or to GNU/Linux in general. Maybe your best
bet would be to avoid using that drive in your Btrfs array altogether for the
time being.

-- 
With respect,
Roman




Re: BTRFS raid 5/6 status

2015-11-01 Thread audio muze
I've looked into snap-raid and it seems well suited to my needs as
most of the data is static.  I'm planning on using it in conjunction
with mhddfs so all drives are seen as a single storage pool. Is there
then any benefit in using Btrfs as the underlying filesystem on each
of the drives?


RE: [PATCH 5/6] btrfs-progs: free comparer_set in cmd_qgroup_show

2015-11-01 Thread Zhao Lei
Hi, David Sterba

> -Original Message-
> From: David Sterba [mailto:dste...@suse.cz]
> Sent: Friday, October 30, 2015 9:36 PM
> To: Zhao Lei 
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: [PATCH 5/6] btrfs-progs: free comparer_set in cmd_qgroup_show
> 
> On Thu, Oct 29, 2015 at 05:31:47PM +0800, Zhao Lei wrote:
> > comparer_set, which was allocated by malloc(), should be free before
> > function return.
> >
> > Signed-off-by: Zhao Lei 
> > ---
> >  cmds-qgroup.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/cmds-qgroup.c b/cmds-qgroup.c index a64b716..f069d32
> > 100644
> > --- a/cmds-qgroup.c
> > +++ b/cmds-qgroup.c
> > @@ -290,7 +290,7 @@ static int cmd_qgroup_show(int argc, char **argv)
> > int filter_flag = 0;
> > unsigned unit_mode;
> >
> > -   struct btrfs_qgroup_comparer_set *comparer_set;
> > +   struct btrfs_qgroup_comparer_set *comparer_set = NULL;
> > struct btrfs_qgroup_filter_set *filter_set;
> > filter_set = btrfs_qgroup_alloc_filter_set();
> > comparer_set = btrfs_qgroup_alloc_comparer_set();
> > @@ -372,6 +372,8 @@ static int cmd_qgroup_show(int argc, char **argv)
> > fprintf(stderr, "ERROR: can't list qgroups: %s\n",
> > strerror(e));
> >
> > +   free(comparer_set);
> 
> Doh, coverity correctly found that comparer_set is freed inside
> btrfs_show_qgroups() a few lines above. Patch dropped.
> 
My bad.

This problem was found on my node by valgrind memcheck; maybe it is not
freed in some case, or it is a valgrind misreport.
I'll check it more deeply.

Thanks
Zhaolei

> > +



Re: [PATCH 4/4] btrfs: qgroup: account shared subtree during snapshot delete

2015-11-01 Thread Qu Wenruo



Mark Fasheh wrote on 2015/09/22 13:15 -0700:

Commit 0ed4792 ('btrfs: qgroup: Switch to new extent-oriented qgroup
mechanism.') removed our qgroup accounting during
btrfs_drop_snapshot(). Predictably, this results in qgroup numbers
going bad shortly after a snapshot is removed.

Fix this by adding a dirty extent record when we encounter extents during
our shared subtree walk. This effectively restores the functionality we had
with the original shared subtree walking code in 1152651 (btrfs: qgroup:
account shared subtrees during snapshot delete).

The idea with the original patch (and this one) is that shared subtrees can
get skipped during drop_snapshot. The shared subtree walk then allows us a
chance to visit those extents and add them to the qgroup work for later
processing. This ultimately makes the accounting for drop snapshot work.

The new qgroup code nicely handles all the other extents during the tree
walk via the ref dec/inc functions so we don't have to add actions beyond
what we had originally.

Signed-off-by: Mark Fasheh 


Hi Mark,

Despite the performance regression reported by Stefan Priebe,
there is another problem; I'll comment inline below.


---
  fs/btrfs/extent-tree.c | 41 ++---
  1 file changed, 34 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3a70e6c..89be620 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7757,17 +7757,37 @@ reada:
  }

  /*
- * TODO: Modify related function to add related node/leaf to dirty_extent_root,
- * for later qgroup accounting.
- *
- * Current, this function does nothing.
+ * These may not be seen by the usual inc/dec ref code so we have to
+ * add them here.
   */
+static int record_one_subtree_extent(struct btrfs_trans_handle *trans,
+struct btrfs_root *root, u64 bytenr,
+u64 num_bytes)
+{
+   struct btrfs_qgroup_extent_record *qrecord;
+   struct btrfs_delayed_ref_root *delayed_refs;
+
+   qrecord = kmalloc(sizeof(*qrecord), GFP_NOFS);
+   if (!qrecord)
+   return -ENOMEM;
+
+   qrecord->bytenr = bytenr;
+   qrecord->num_bytes = num_bytes;
+   qrecord->old_roots = NULL;
+
+   delayed_refs = &trans->transaction->delayed_refs;
+   if (btrfs_qgroup_insert_dirty_extent(delayed_refs, qrecord))
+   kfree(qrecord);


1) Unprotected dirty_extent_root.

Unfortunately, btrfs_qgroup_insert_dirty_extent() is not protected by
any lock/mutex, and I'm sorry for not adding a comment about that.

In fact, btrfs_qgroup_insert_dirty_extent() should always be called with
delayed_refs->lock held, just like add_delayed_ref_head(), where every
caller of add_delayed_ref_head() holds delayed_refs->lock.

So here you will need to hold delayed_refs->lock.

2) Performance regression (reported by Stefan Priebe).

The performance regression is not caused by your code, at least not
completely.

It's my fault for not adding enough comments to the insert_dirty_extent()
function. (As with point 1, I must admit I'm a bad reviewer until there
is a bug report.)

I was only expecting it to be called inside add_delayed_ref_head(), and
every caller of add_delayed_ref_head() checks whether qgroups are enabled
before calling add_delayed_ref_head().

So in the qgroup-disabled case, insert_dirty_extent() is never called.

As a result, if you want to call btrfs_qgroup_insert_dirty_extent()
outside of add_delayed_ref_head(), you will need to handle
delayed_refs->lock and check whether qgroups are enabled.
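
Something along these lines is what I mean (just a rough sketch of the two
points above written against your patch, not compile-tested; the
quota_enabled check below is the usual fs_info flag, use whatever helper
you prefer there):

static int record_one_subtree_extent(struct btrfs_trans_handle *trans,
				     struct btrfs_root *root, u64 bytenr,
				     u64 num_bytes)
{
	struct btrfs_qgroup_extent_record *qrecord;
	struct btrfs_delayed_ref_root *delayed_refs;

	/* Callers outside add_delayed_ref_head() must check this themselves. */
	if (!root->fs_info->quota_enabled)
		return 0;

	qrecord = kmalloc(sizeof(*qrecord), GFP_NOFS);
	if (!qrecord)
		return -ENOMEM;

	qrecord->bytenr = bytenr;
	qrecord->num_bytes = num_bytes;
	qrecord->old_roots = NULL;

	/* The dirty extent tree is only ever touched under delayed_refs->lock. */
	delayed_refs = &trans->transaction->delayed_refs;
	spin_lock(&delayed_refs->lock);
	if (btrfs_qgroup_insert_dirty_extent(delayed_refs, qrecord))
		kfree(qrecord);
	spin_unlock(&delayed_refs->lock);

	return 0;
}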


BTW, if it's OK with you, you can also further improve qgroup
performance by using a kmem_cache for struct btrfs_qgroup_extent_record.

I assume the kmalloc() may be a performance hot spot, considering how
often it is called in the qgroup-enabled case.
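
For example (the cache and init/exit function names here are made up, just
to illustrate the idea):

/*
 * Sketch only: a dedicated slab cache for struct btrfs_qgroup_extent_record,
 * so the hot path can use kmem_cache_alloc()/kmem_cache_free() instead of
 * kmalloc()/kfree().
 */
static struct kmem_cache *btrfs_qgroup_extent_record_cachep;

int __init btrfs_qgroup_extent_record_init(void)
{
	btrfs_qgroup_extent_record_cachep = kmem_cache_create(
			"btrfs_qgroup_extent_record",
			sizeof(struct btrfs_qgroup_extent_record), 0,
			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
	if (!btrfs_qgroup_extent_record_cachep)
		return -ENOMEM;
	return 0;
}

void btrfs_qgroup_extent_record_exit(void)
{
	kmem_cache_destroy(btrfs_qgroup_extent_record_cachep);
}

Then record_one_subtree_extent() (and the delayed ref head path) would
allocate with kmem_cache_alloc(btrfs_qgroup_extent_record_cachep, GFP_NOFS)
and free with kmem_cache_free() instead of kmalloc()/kfree().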


Thanks,
Qu


+
+   return 0;
+}
+
  static int account_leaf_items(struct btrfs_trans_handle *trans,
  struct btrfs_root *root,
  struct extent_buffer *eb)
  {
int nr = btrfs_header_nritems(eb);
-   int i, extent_type;
+   int i, extent_type, ret;
struct btrfs_key key;
struct btrfs_file_extent_item *fi;
u64 bytenr, num_bytes;
@@ -7790,6 +7810,10 @@ static int account_leaf_items(struct btrfs_trans_handle 
*trans,
continue;

num_bytes = btrfs_file_extent_disk_num_bytes(eb, fi);
+
+   ret = record_one_subtree_extent(trans, root, bytenr, num_bytes);
+   if (ret)
+   return ret;
}
return 0;
  }
@@ -7858,8 +7882,6 @@ static int adjust_slots_upwards(struct btrfs_root *root,

  /*
   * root_eb is the subtree root and is locked before this function is called.
- * TODO: Modify this function to mark all (including complete shared node)
- * to dirty_extent_root to allow it get accounted in qgroup.
   */
  static int 

Re: Regression in: [PATCH 4/4] btrfs: qgroup: account shared subtree during snapshot delete

2015-11-01 Thread Qu Wenruo



Stefan Priebe wrote on 2015/11/01 21:49 +0100:

Hi,

this one: http://www.spinics.net/lists/linux-btrfs/msg47377.html

adds a regression to my test systems with very large disks (30tb and 50tb).

btrfs balance is super slow afterwards while heavily making use of cp
--reflink=always on big files (200gb - 500gb).

Sorry didn't know how to correctly reply to that "old" message.

Greets,
Stefan


Thanks for the testing.

Are you using qgroup or just doing normal balance with qgroup disabled?

For the latter case, that should be optimized so that the dirty extent
insert is skipped when qgroups are disabled.
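
Roughly, it would be a check like this at the top of the subtree accounting
entry points (just a sketch, using the fs_info->quota_enabled flag; the body
of the function is unchanged from the patch):

static int account_leaf_items(struct btrfs_trans_handle *trans,
			      struct btrfs_root *root,
			      struct extent_buffer *eb)
{
	/*
	 * Sketch: with qgroups disabled there is nothing to account, so bail
	 * out before allocating or inserting any dirty extent records.
	 */
	if (!root->fs_info->quota_enabled)
		return 0;

	/* ... the existing per-item accounting from the patch ... */
	return 0;
}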


For the qgroup-enabled case, I'm afraid that's the design: relocation
will drop a subtree in order to relocate it, and to keep qgroups
consistent we must walk down all the tree blocks and mark them dirty
for later qgroup accounting.

But there is still some hope for optimization. For example, if all the
subtree blocks are already relocated, we can skip the tree walk-down
routine.

Anyway, for your case of huge files, as the tree level grows rapidly,
any workload involving tree iteration will be very time consuming,
like snapshot deletion and relocation.

BTW, thanks for your regression report; I also found another problem
with the patch.

I'll reply to the author to improve the patchset.

Thanks,
Qu


Re: Regression in: [PATCH 4/4] btrfs: qgroup: account shared subtree during snapshot delete

2015-11-01 Thread Duncan
Stefan Priebe posted on Sun, 01 Nov 2015 21:49:44 +0100 as excerpted:

> this one: http://www.spinics.net/lists/linux-btrfs/msg47377.html
> 
> adds a regression to my test systems with very large disks (30tb and
> 50tb).
> 
> btrfs balance is super slow afterwards while heavily making use of cp
> --reflink=always on big files (200gb - 500gb).
> 
> Sorry didn't know how to correctly reply to that "old" message.

Just on the message-reply bit...

Gmane.org carries this list (among many), archiving the posts with both 
nntp/news and http/web interfaces.  Both the web and news interfaces 
normally allow replies to both old and current messages via the gmane 
gateway forwarding to the list, tho the first time you reply to a list 
via gmane, it'll respond with a confirmation to the email address you 
used, requiring you to reply to that before forwarding the mail on to the 
list.  If you don't reply within a week, the message is dropped.  
However, at least for the news interface (not sure about the web 
interface), you only have to confirm for a particular list/newsgroup 
once, after that, it forwards to the list without further confirmations.

That's how I follow all my lists, reading and replying to them as 
newsgroups via the gmane list2news interface.

http://gmane.org for more info.

The one caveat is that while on a lot of lists replies to the list only 
is the norm, on the Linux kernel and vger.kernel.org hosted lists 
(including this one), replying to all, list and previous posters, is the 
norm, and I'm not sure if the web interface allows that.  On the news 
interface it of course depends on your news client -- mine is more 
adapted to news than mail, and while it allows forwarding to your normal 
mail client for the mail side, normal followups are to news only, and 
it's not easy to reply to all, so I generally reply to list (as 
newsgroup) only, unless a poster specifically requests to be CCed on 
replies.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Regression in: [PATCH 4/4] btrfs: qgroup: account shared subtree during snapshot delete

2015-11-01 Thread Stefan Priebe

Hi,

this one: http://www.spinics.net/lists/linux-btrfs/msg47377.html

adds a regression to my test systems with very large disks (30tb and 50tb).

btrfs balance is super slow afterwards while heavily making use of cp 
--reflink=always on big files (200gb - 500gb).


Sorry, I didn't know how to correctly reply to that "old" message.

Greets,
Stefan


Re: Regression in: [PATCH 4/4] btrfs: qgroup: account shared subtree during snapshot delete

2015-11-01 Thread Stefan Priebe

On 02.11.2015 at 02:34, Qu Wenruo wrote:



Stefan Priebe wrote on 2015/11/01 21:49 +0100:

Hi,

this one: http://www.spinics.net/lists/linux-btrfs/msg47377.html

adds a regression to my test systems with very large disks (30tb and
50tb).

btrfs balance is super slow afterwards while heavily making use of cp
--reflink=always on big files (200gb - 500gb).

Sorry didn't know how to correctly reply to that "old" message.

Greets,
Stefan


Thanks for the testing.

Are you using qgroup or just doing normal balance with qgroup disabled?


just doing normal balance with qgroup disabled.


For the latter case, that's should be optimized to skip the dirty extent
insert in qgroup disabled case.

For qgroup enabled case, I'm afraid that's the design.
As relocation will drop a subtree to relocate, and to ensure qgroup
consistent, we must walk down all the tree blocks and mark them dirty
for later qgroup accounting.

But there should be some hope left for optimization.
For example, if all subtree blocks are already relocated, we can skip
the tree down walk routine.

Anyway, for your case of huge files, as tree level grows rapidly, any
workload involving tree iteration will be very time consuming.
Like snapshot deletion and relocation.

BTW, thanks for you regression report, I also found another problem of
the patch.
I'll reply to the author to improve the patchset.


Thanks,
Stefan



Thanks,
Qu
