Re: [PATCH] Btrfs: fix unexpected result when dio reading corrupted blocks

2017-09-24 Thread David Sterba
On Fri, Sep 15, 2017 at 03:06:51PM -0600, Liu Bo wrote:
> commit 4246a0b63bd8 ("block: add a bi_error field to struct bio")
> changed the logic of how dio read endio reports errors.
> 
> For single stripe dio read, %bio->bi_status reflects the error before
> verifying the checksum, and now we're updating it when the data block
> matches its checksum, while in the mismatching case %bio->bi_status is
> not updated to reflect that.
> 
> When some blocks in a file have been corrupted on disk, reading such a
> file ends up with
> 
> 1) checksum errors are reported in the kernel log
> 2) read(2) returns successfully with some content being 0x01.
> 
> In order to fix it, we need to report its checksum mismatch error to
> the upper layer (dio layer in this case) as well.
> 
> Signed-off-by: Liu Bo 
> Reported-by: Goffredo Baroncelli 
> cc: Goffredo Baroncelli 

Reviewed-by: David Sterba 


Re: [PATCH] Btrfs: fix memory leak in raid56

2017-09-24 Thread David Sterba
On Fri, Sep 22, 2017 at 12:11:18PM -0600, Liu Bo wrote:
> The local bio_list may have pending bios when doing cleanup; it can
> end up with a memory leak if they don't get freed.

I was wondering if we could make a common helper that would call
rbio_orig_end_io() and then bio_put() the pending bios in a loop, but
__raid56_parity_recover() does not fit the pattern. It still might be
worthwhile, but the patch is ok as is.
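
A minimal sketch of such a helper, assuming the kernel-internal names from
this thread (rbio_orig_end_io(), bio_list_pop(), bio_put()); the helper name
and exact signature are made up for illustration and are not the merged
btrfs code:

/*
 * Hypothetical helper: end the original bio with the given status and
 * drop any bios still pending on a local bio_list so they don't leak.
 */
static void rbio_end_io_and_drain(struct btrfs_raid_bio *rbio,
				  struct bio_list *pending,
				  blk_status_t err)
{
	struct bio *bio;

	rbio_orig_end_io(rbio, err);

	/* free whatever was queued locally but never submitted */
	while ((bio = bio_list_pop(pending)))
		bio_put(bio);
}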

Reviewed-by: David Sterba 


Re: [PATCH] btrfs-progs: check: check invalid extent_inline_ref type in lowmem

2017-09-24 Thread David Sterba
On Sun, Sep 24, 2017 at 05:17:19PM +0300, Nikolay Borisov wrote:
> >>> However, such whack-a-mole fixes will eventually become a nightmare to maintain.
> >>>
> >>> What about integrating all of such validation checkers into one place?
> >>> So fsck part will only need to check their cross reference without
> >>> bothering such corruption.
> >>>
> >> I was confused about how to fix the bug (new checker or small changes
> >> in this patch).
> >> The reason I fixed it this way is that most callers do
> >> check the type before calling btrfs_extent_inline_ref_size().
> >>
> >> Since you prefer the former, I'm going to do it.
> > 
> > Current version looks good enough as a fix.
> > Just saying we'd better use an integrated solution later.
> 
> I agree with you that we should have an integrated solution, but frankly
> I'd rather see it sooner rather than later, because history has shown
> that if something is not done when it's first talked about, it usually
> gets bogged down by other work.  With this in mind, Su, might I ask you
> to repost the patch, but with the centralised idea of handling the
> validation lifted from Qu's kernel-side patches?

Agreed.


Re: Btrfs performance with small blocksize on SSD

2017-09-24 Thread Kai Krakow
On Sun, 24 Sep 2017 19:43:05 +0300,
Andrei Borzenkov wrote:

> On 24.09.2017 16:53, Fuhrmann, Carsten wrote:
> > Hello,
> > 
> > 1)
> > I used direct write (no page cache) but I didn't disable the Disk
> > cache of the HDD/SSD itself. In all tests I wrote 1GB and looked
> > for the runtime of that write process.  
> 
> So "latency" on your diagram means total time to write 1GiB file? That
> is highly unusual meaning for "latency" which normally means time to
> perform single IO. If so, you should better rename Y-axis to something
> like "total run time".

If you look closely it says "Laufzeit" which visually looks similar to
"latency" but really means "run time". ;-)


-- 
Regards,
Kai

Replies to list-only preferred.




Re: btrfs_extref_hash 64-bit vs. btrfs_crc32c 32-bit

2017-09-24 Thread Hans van Kranenburg
On 09/23/2017 12:35 PM, Hans van Kranenburg wrote:
> Hi,
> 
> When looking around in the kernel code, I ran into this (hash.h):
> 
> u32 btrfs_crc32c(u32 crc, const void *address, unsigned int length);
> 
> [...]
> 
> static inline u64 btrfs_extref_hash(u64 parent_objectid, const char *name,
> int len)
> {
> return (u64) btrfs_crc32c(parent_objectid, name, len);
> }
> 
> [...]
> 
> What is the "official" behaviour of just stuffing a 64-bit
> (parent_objectid) value into a 32-bit function argument (crc)? Does it
> get truncated? Does this compile without a warning?
> 
> I would expect that the caller should do the housekeeping of deciding
> how to transform the 64-bit parent_objectid into some 32-bit value.

In the meantime I can of course answer this myself...

#include <stdio.h>

typedef unsigned int __u32;
typedef unsigned long __u64;

void i_eat_32_for_breakfast(__u32 food) {
    printf("%u\n", food);
}

int main(void) {
    __u64 eat_this = 4294968320;
    i_eat_32_for_breakfast(eat_this);
    return 0;
}

$ ./breakfast
1024

-- 
Hans van Kranenburg


Re: AW: Btrfs performance with small blocksize on SSD

2017-09-24 Thread Andrei Borzenkov
On 24.09.2017 16:53, Fuhrmann, Carsten wrote:
> Hello,
> 
> 1)
> I used direct write (no page cache) but I didn't disable the Disk cache of 
> the HDD/SSD itself. In all tests I wrote 1GB and looked for the runtime of 
> that write process.

So "latency" on your diagram means total time to write 1GiB file? That
is highly unusual meaning for "latency" which normally means time to
perform single IO. If so, you should better rename Y-axis to something
like "total run time".


AW: AW: Btrfs performance with small blocksize on SSD

2017-09-24 Thread Fuhrmann, Carsten

1)
Every test has its own file, so the 2k blocksize test writes to a different
file than the 4k blocksize test. In the end there are 5 files on the disk
(2k, 8k, ...)

2)
Well, I think it is 2) as well, since for 4k and higher the performance is
much better.
I'm going to test with the -o max_inline=0 option and with 4k blocksize as well.

Thank you for your help

Carsten

-----Original Message-----
From: Qu Wenruo [mailto:quwenruo.bt...@gmx.com]
Sent: Sunday, 24 September 2017 16:11
To: Fuhrmann, Carsten ; linux-btrfs@vger.kernel.org
Subject: Re: AW: Btrfs performance with small blocksize on SSD



On 2017年09月24日 21:53, Fuhrmann, Carsten wrote:
> Hello,
> 
> 1)
> I used direct write (no page cache) but I didn't disable the Disk cache of 
> the HDD/SSD itself. In all tests I wrote 1GB and looked for the runtime of 
> that write process.

Are you writing all the 1G into one file?
Or into different files?

> I run every test 5 times with different Blocksizes (2k, 8k, 32k, 128k, 512k). 
> Those values are on the x-axis. On the Y-Axis is the runtime for the test.

Good to know that.

Then there may be 2 factors impacting performance:

1) Convert between inlined and regular data extent
1st 2K write will be inlined and then 2nd 2K write will convert it
back to regular data extent.
The overhead can be quite high.

Retest with "-o max_inline=0" will disable such behavior so all write
will only cause regular data extent.

2) Unaligned data size causing extra rewrite/CoW
 Btrfs stores its data in units of sectorsize, and in your case it is
4K.
Writing with 2K will cause btrfs to read out the half written data
and then CoW it to somewhere else.
The overhead can be quite huge.

And I assume 2) is the main overhead.

Retest with 4K blocksize to see if it's related.

 Please note that 4K blocksize and 2K blocksize go through
 different routines (the 4K block routine has no extra CoW overhead, so
 it should be near the 8K blocksize result).

Thanks,
Qu

> 
> 2)
> Yes every test is on RAID1 for data and metadata
> 
> 3)
> Everything default
> mkfs.btrfs -d raid1 -m raid1 /dev/sda /dev/sdb /dev/sdc /dev/sdd
> 
> 
> best regards
> 
> Carsten
> 
> -----Original Message-----
> From: Qu Wenruo [mailto:quwenruo.bt...@gmx.com]
> Sent: Sunday, 24 September 2017 15:41
> To: Fuhrmann, Carsten ; linux-btrfs@vger.kernel.org
> Subject: Re: Btrfs performance with small blocksize on SSD
> 
> 
> 
> On 2017年09月24日 21:24, Fuhrmann, Carsten wrote:
>> Hello,
>>
>> i run a few performance tests comparing mdadm, hardware raid and the btrfs 
>> raid. I noticed that the performance for small blocksizes (2k) is very bad 
>> on SSD in general and on HDD for sequential writing.
> 
> 2K is smaller than the minimal btrfs sectorsize (4K for x86 family).
> 
> It's common that unaligned access will impact performance, but we need more 
> info about your test cases, including:
> 1) How write is done?
>  Buffered? DIO? O_SYNC? fdatasync?
>  I can't read Germany so I'm not sure what the result means. (Although
>  I can guess Y axle is latency, but I don't know the meaning of X axle.
>  And how many files are involved, how large of these files and etc.
> 
> 2) Data/meta/sys profiles
>  All RADI1?
> 
> 3) Mkfs profile
>  Like nodesize if not default, and any incompat features enabled.
> 
>> I wonder about that result, because you say on the wiki that btrfs is very 
>> effective for small files.
> 
> It can be space effective or performance effective.
> 
> If *ignoring* meta profile, btrfs is space-effectient since it inline the 
> data into metadata, avoiding padding it to sectorsize so it can save some 
> space.
> 
> And such behavior can also be somewhat performance effective, by avoiding 
> extra seeking for data, since when reading out the metadata we have already 
> read out the inlined data.
> 
> But such efficiency come with cost.
> 
> One obvious one is when we need to convert inline data into regular one.
> It may cause extra tree balancing and increase latency.
> 
> Would you please try retest with "-o max_inline=0" mount option to disable 
> inline data (which makes btrfs behavior like ext*/xfs) to see if it's related?
> 
> Thanks,
> Qu
> 
>>
>> I attached my results from raid 1 random write HDD (rH1), SSD (rS1) 
>> and from sequential write HDD (sH1), SSD (sS1)
>>
>> Hopefully you have an explanation for that.
>>
>> raid@raid-PowerEdge-T630:~$ uname -a
>> Linux raid-PowerEdge-T630 4.10.0-33-generic #37~16.04.1-Ubuntu SMP 
>> Fri Aug 11 14:07:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux 
>> raid@raid-PowerEdge-T630:~$ btrfs --version btrfs-progs v4.4
>>
>>
>> best regards
>>
>> Carsten
>>


Re: [PATCH] btrfs-progs: check: check invalid extent_inline_ref type in lowmem

2017-09-24 Thread Nikolay Borisov


On 19.09.2017 13:00, Qu Wenruo wrote:
> 
> 
> On 2017年09月19日 17:48, Su Yue wrote:
>>
>>
>> On 09/19/2017 04:48 PM, Qu Wenruo wrote:
>>>
>>>
>>> On 2017年09月19日 16:32, Su Yue wrote:
 Lowmem check does not skip invalid type in extent_inline_ref then
 calls btrfs_extent_inline_ref_size(type) which causes crash.

 Example:
 $ ./btrfs-corrupt-block -e -l 20971520 /tmp/data_small
 corrupting extent record: key 20971520 169 0
 $ btrfs check --mode=lowmem  /tmp/data_small
 Checking filesystem on /tmp/data_small
 UUID: ee205d69-8724-4aa2-a4f5-bc8558a62169
 checking extents
 ERROR: extent[20971520 16384] backref type mismatch, missing bit: 2
 ERROR: extent[20971520 16384] backref generation mismatch,
 wanted: 7, have: 0
 ERROR: extent[20971520 16384] is referred by other roots than 3
 ctree.h:1754: btrfs_extent_inline_ref_size: BUG_ON `1` triggered,
 value 1
 btrfs(+0x543db)[0x55fabc2ab3db]
 btrfs(+0x587f7)[0x55fabc2af7f7]
 btrfs(+0x5fa44)[0x55fabc2b6a44]
 btrfs(cmd_check+0x194a)[0x55fabc2bd717]
 btrfs(main+0x88)[0x55fabc2682e0]
 /usr/lib/libc.so.6(__libc_start_main+0xea)[0x7f021c3824ca]
 btrfs(_start+0x2a)[0x55fabc267e7a]
 [1]    5188 abort (core dumped)  btrfs check --mode=lowmem
 /tmp/data_small

 Fix it by checking type before obtaining inline_ref size.

 Signed-off-by: Su Yue 
>>>
>>> Code itself looks good.
>>> And test case please.
>>>
>> Thanks for review. I'll add test case in next version.
>>
>>> Reviewed-by: Qu Wenruo 
>>>
>>> However, such whack-a-mole fixes will eventually become a nightmare to maintain.
>>>
>>> What about integrating all of such validation checkers into one place?
>>> So fsck part will only need to check their cross reference without
>>> bothering such corruption.
>>>
>> I was confused about how to fix the bug (new checker or small changes
>> in this patch).
>> The reason I fixed it this way is that most callers do
>> check the type before calling btrfs_extent_inline_ref_size().
>>
>> Since you prefer the former, I'm going to do it.
> 
> Current version looks good enough as a fix.
> Just saying we'd better use an integrated solution later.

I agree with you that we should have an integrated solution, but frankly
I'd rather see it sooner rather than later, because history has shown
that if something is not done when it's first talked about, it usually
gets bogged down by other work.  With this in mind, Su, might I ask you
to repost the patch, but with the centralised idea of handling the
validation lifted from Qu's kernel-side patches?

> 
> Thanks,
> Qu
>>
>> Thanks
>> Su
>>
>>> Just like the kernel patch I submitted some times ago.
>>> https://www.spinics.net/lists/linux-btrfs/msg68498.html
>>>
>>> Thanks,
>>> Qu
>>>
 ---
   cmds-check.c | 9 +
   1 file changed, 9 insertions(+)

 diff --git a/cmds-check.c b/cmds-check.c
 index 4e0b0fe4..74c10c75 100644
 --- a/cmds-check.c
 +++ b/cmds-check.c
 @@ -11565,6 +11565,10 @@ static int check_tree_block_ref(struct
 btrfs_root *root,
   offset, level + 1, owner,
   NULL);
   }
 +    } else {
 +    error("extent[%llu %u %llu] has unknown ref type: %d",
 +  key.objectid, key.type, key.offset, type);
 +    break;
   }
   if (found_ref)
 @@ -11831,6 +11835,11 @@ static int check_extent_data_item(struct
 btrfs_root *root,
   found_dbackref = !check_tree_block_ref(root, NULL,
   btrfs_extent_inline_ref_offset(leaf, iref),
   0, owner, NULL);
 +    } else {
 +    error("extent[%llu %u %llu] has unknown ref type: %d",
 +  disk_bytenr, BTRFS_EXTENT_DATA_KEY,
 +  disk_num_bytes, type);
 +    break;
   }
   if (found_dbackref)

>>>
>>>
>>
>>


Re: AW: Btrfs performance with small blocksize on SSD

2017-09-24 Thread Qu Wenruo



On 2017年09月24日 21:53, Fuhrmann, Carsten wrote:

Hello,

1)
I used direct write (no page cache) but I didn't disable the Disk cache of the 
HDD/SSD itself. In all tests I wrote 1GB and looked for the runtime of that 
write process.


Are you writing all the 1G into one file?
Or into different files?


I run every test 5 times with different Blocksizes (2k, 8k, 32k, 128k, 512k). 
Those values are on the x-axis. On the Y-Axis is the runtime for the test.


Good to know that.

Then there may be 2 factors impacting performance:

1) Conversion between inlined and regular data extents
   The 1st 2K write will be inlined and then the 2nd 2K write will convert it
   back to a regular data extent.
   The overhead can be quite high.

   Retesting with "-o max_inline=0" will disable such behavior so all writes
   will only create regular data extents.

2) Unaligned data size causing extra rewrite/CoW
   Btrfs stores its data in units of sectorsize, and in your case it is
   4K.
   Writing with 2K will cause btrfs to read out the half-written data
   and then CoW it to somewhere else.
   The overhead can be quite huge.

   And I assume 2) is the main overhead.

   Retest with 4K blocksize to see if it's related.

   Please note that 4K blocksize and 2K blocksize go through
   different routines (the 4K block routine has no extra CoW overhead, so
   it should be near the 8K blocksize result).
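
For anyone who wants to reproduce the comparison, below is a minimal
direct-I/O writer; it is an illustrative sketch only (the target path, the
1 GiB total size and the default block size are assumptions, not the
benchmark tool used for the graphs). Run it with block sizes 2048 and 4096,
with and without the "-o max_inline=0" mount option:

/* dio_write.c - write 1 GiB to a file with O_DIRECT using a given block size.
 * Usage: ./dio_write <path> <blocksize>   (e.g. 2048 or 4096)
 * Note: 2K direct writes need a device with 512-byte logical sectors. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/mnt/test/file";
	size_t bs = argc > 2 ? strtoul(argv[2], NULL, 0) : 2048;
	size_t total = 1UL << 30;	/* 1 GiB, as in the reported test */
	void *buf;
	int fd;

	/* O_DIRECT requires an aligned buffer */
	if (posix_memalign(&buf, 4096, bs))
		return 1;
	memset(buf, 0xab, bs);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (size_t done = 0; done < total; done += bs) {
		if (write(fd, buf, bs) != (ssize_t)bs) {
			perror("write");
			return 1;
		}
	}
	fsync(fd);
	close(fd);
	free(buf);
	return 0;
}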

Thanks,
Qu



2)
Yes every test is on RAID1 for data and metadata

3)
Everything default
mkfs.btrfs -d raid1 -m raid1 /dev/sda /dev/sdb /dev/sdc /dev/sdd


best regards

Carsten

-----Original Message-----
From: Qu Wenruo [mailto:quwenruo.bt...@gmx.com]
Sent: Sunday, 24 September 2017 15:41
To: Fuhrmann, Carsten ; linux-btrfs@vger.kernel.org
Subject: Re: Btrfs performance with small blocksize on SSD



On 2017年09月24日 21:24, Fuhrmann, Carsten wrote:

Hello,

I ran a few performance tests comparing mdadm, hardware raid and the btrfs
raid. I noticed that the performance for small blocksizes (2k) is very bad on 
SSD in general and on HDD for sequential writing.


2K is smaller than the minimal btrfs sectorsize (4K for x86 family).

It's common that unaligned access will impact performance, but we need more 
info about your test cases, including:
1) How is the write done?
 Buffered? DIO? O_SYNC? fdatasync?
 I can't read German so I'm not sure what the result means. (Although
 I can guess the Y axis is latency, I don't know the meaning of the X axis.)
 And how many files are involved, how large these files are, etc.

2) Data/meta/sys profiles
 All RAID1?

3) Mkfs profile
 Like nodesize if not default, and any incompat features enabled.


I wonder about that result, because you say on the wiki that btrfs is very 
effective for small files.


It can be space effective or performance effective.

If *ignoring* the meta profile, btrfs is space-efficient since it inlines the data
into metadata, avoiding padding it to sectorsize so it can save some space.

And such behavior can also be somewhat performance effective, by avoiding extra 
seeking for data, since when reading out the metadata we have already read out 
the inlined data.

But such efficiency comes with a cost.

One obvious one is when we need to convert inline data into regular one.
It may cause extra tree balancing and increase latency.

Would you please try retest with "-o max_inline=0" mount option to disable 
inline data (which makes btrfs behavior like ext*/xfs) to see if it's related?

Thanks,
Qu



I attached my results from raid 1 random write HDD (rH1), SSD (rS1)
and from sequential write HDD (sH1), SSD (sS1)

Hopefully you have an explanation for that.

raid@raid-PowerEdge-T630:~$ uname -a
Linux raid-PowerEdge-T630 4.10.0-33-generic #37~16.04.1-Ubuntu SMP Fri
Aug 11 14:07:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
raid@raid-PowerEdge-T630:~$ btrfs --version btrfs-progs v4.4


best regards

Carsten






Re: [PATCH v3 07/14] btrfs-progs: Doc/mkfs: Add extra condition for rootdir option

2017-09-24 Thread Goffredo Baroncelli
On 09/24/2017 12:10 PM, Anand Jain wrote:
> 
> 
>> All my points are clear for this patchset:
>> I know I removed one function, and my reason is:
>> 1) No or little usage
>>     And it's anti intuition.
>> 2) Dead code (not tested nor well documented)
>> 3) Possible workaround
>>
>> I can add several extra reasons as I stated before, but number of reasons 
>> won't validate anything anyway.
> 
>  End user convenience wins over the developer's technical difficulties.

Sorry, but I have to agree with Qu; there is not a big demand (other than
Austin) for this functionality; even you stated that "...at some point it
may..."; until now the UI is quite unfriendly (we have to use a big enough
device, and then cut it by hand on the basis of the output of mkfs.btrfs)...

I fear that this is another feature which increases the complexity of btrfs
(and tools) with little or no usage.

The work of Qu is a nice cleanup; I hope that this will be the direction which
BTRFS will take: removing "strange/unused" features and improving the
reliability of the others.

BR
G.Baroncelli



> 
>  Pls don't remove this feature, it needs fix such as #2 (above) to improve on 
> #1 (above) as in your list.
> 
> Thanks, Anand


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


AW: Btrfs performance with small blocksize on SSD

2017-09-24 Thread Fuhrmann, Carsten
Hello,

1)
I used direct writes (no page cache) but I didn't disable the disk cache of the
HDD/SSD itself. In all tests I wrote 1GB and measured the runtime of that
write process.
I ran every test 5 times with different blocksizes (2k, 8k, 32k, 128k, 512k).
Those values are on the x-axis. On the y-axis is the runtime for the test.

2)
Yes every test is on RAID1 for data and metadata

3)
Everything default
mkfs.btrfs -d raid1 -m raid1 /dev/sda /dev/sdb /dev/sdc /dev/sdd


best regards

Carsten

-----Original Message-----
From: Qu Wenruo [mailto:quwenruo.bt...@gmx.com]
Sent: Sunday, 24 September 2017 15:41
To: Fuhrmann, Carsten ; linux-btrfs@vger.kernel.org
Subject: Re: Btrfs performance with small blocksize on SSD



On 2017年09月24日 21:24, Fuhrmann, Carsten wrote:
> Hello,
> 
> i run a few performance tests comparing mdadm, hardware raid and the btrfs 
> raid. I noticed that the performance for small blocksizes (2k) is very bad on 
> SSD in general and on HDD for sequential writing.

2K is smaller than the minimal btrfs sectorsize (4K for x86 family).

It's common that unaligned access will impact performance, but we need more 
info about your test cases, including:
1) How is the write done?
Buffered? DIO? O_SYNC? fdatasync?
I can't read German so I'm not sure what the result means. (Although
I can guess the Y axis is latency, I don't know the meaning of the X axis.)
And how many files are involved, how large these files are, etc.

2) Data/meta/sys profiles
All RAID1?

3) Mkfs profile
Like nodesize if not default, and any incompat features enabled.

> I wonder about that result, because you say on the wiki that btrfs is very 
> effective for small files.

It can be space effective or performance effective.

If *ignoring* the meta profile, btrfs is space-efficient since it inlines the data
into metadata, avoiding padding it to sectorsize so it can save some space.

And such behavior can also be somewhat performance effective, by avoiding extra 
seeking for data, since when reading out the metadata we have already read out 
the inlined data.

But such efficiency comes with a cost.

One obvious one is when we need to convert inline data into regular one.
It may cause extra tree balancing and increase latency.

Would you please try retest with "-o max_inline=0" mount option to disable 
inline data (which makes btrfs behavior like ext*/xfs) to see if it's related?

Thanks,
Qu

> 
> I attached my results from raid 1 random write HDD (rH1), SSD (rS1) 
> and from sequential write HDD (sH1), SSD (sS1)
> 
> Hopefully you have an explanation for that.
> 
> raid@raid-PowerEdge-T630:~$ uname -a
> Linux raid-PowerEdge-T630 4.10.0-33-generic #37~16.04.1-Ubuntu SMP Fri 
> Aug 11 14:07:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux 
> raid@raid-PowerEdge-T630:~$ btrfs --version btrfs-progs v4.4
> 
> 
> best regards
> 
> Carsten
> 


Re: [PATCH 2/2] Btrfs: skip checksum when reading compressed data if some IO have failed

2017-09-24 Thread David Sterba
On Wed, Sep 20, 2017 at 05:50:19PM -0600, Liu Bo wrote:
> Currently even if the underlying disk reports failure on IO,
> compressed read endio still gets to verify checksum and reports it as
> a checksum error.
> 
> In fact, if some IO has failed during reading a compressed data
> extent, there's no way the checksum could match; therefore, we can
> skip it in order to return the error quickly to the upper layer.
> 
> Please note that we need to do this after recording the failed mirror
> index so that read-repair in the upper layer's endio can work
> properly.
> 
> Signed-off-by: Liu Bo 

Reviewed-by: David Sterba 


Re: [PATCH 1/2] Btrfs: fix kernel oops while reading compressed data

2017-09-24 Thread David Sterba
On Wed, Sep 20, 2017 at 05:50:18PM -0600, Liu Bo wrote:
> The kernel oops happens at
> 
> kernel BUG at fs/btrfs/extent_io.c:2104!
> ...
> RIP: clean_io_failure+0x263/0x2a0 [btrfs]
> 
> It's showing that read-repair code is using an improper mirror index.
> This is due to the fact that compression read's endio hasn't recorded
> the failed mirror index in %cb->orig_bio.
> 
> With this, btrfs's read-repair can work properly on reading compressed
> data.
> 
> Signed-off-by: Liu Bo 
> Reported-by: Paul Jones 

Reviewed-by: David Sterba 


Re: Btrfs performance with small blocksize on SSD

2017-09-24 Thread Qu Wenruo



On 2017年09月24日 21:24, Fuhrmann, Carsten wrote:

Hello,

I ran a few performance tests comparing mdadm, hardware raid and the btrfs
raid. I noticed that the performance for small blocksizes (2k) is very bad on 
SSD in general and on HDD for sequential writing.


2K is smaller than the minimal btrfs sectorsize (4K for x86 family).

It's common that unaligned access will impact performance, but we need 
more info about your test cases, including:

1) How is the write done?
   Buffered? DIO? O_SYNC? fdatasync?
   I can't read German so I'm not sure what the result means. (Although
   I can guess the Y axis is latency, I don't know the meaning of the X axis.)
   And how many files are involved, how large these files are, etc.

2) Data/meta/sys profiles
   All RAID1?

3) Mkfs profile
   Like nodesize if not default, and any incompat features enabled.


I wonder about that result, because you say on the wiki that btrfs is very 
effective for small files.


It can be space efficient or performance efficient.

If we *ignore* the metadata profile, btrfs is space efficient since it inlines
the data into metadata, avoiding padding it to sectorsize, so it can save
some space.

And such behavior can also be somewhat performance efficient, by
avoiding extra seeking for data, since when reading out the metadata we
have already read out the inlined data.

But such efficiency comes with a cost.

One obvious one is when we need to convert inline data into a regular extent.
It may cause extra tree balancing and increase latency.

Would you please try retesting with the "-o max_inline=0" mount option to
disable inline data (which makes btrfs behave like ext*/xfs) to see if
it's related?
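
As a side note, whether a small file actually ended up inline in the metadata
can be checked from userspace via FIEMAP ("filefrag -v" shows similar
information). A minimal sketch, with the file path as a placeholder:

/* check_inline.c - print whether each extent of a file is inline or regular. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/mnt/test/smallfile";
	struct fiemap *fm;
	unsigned int i;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* room for up to 32 extents is plenty for a small test file */
	fm = calloc(1, sizeof(*fm) + 32 * sizeof(struct fiemap_extent));
	if (!fm)
		return 1;
	fm->fm_length = ~0ULL;
	fm->fm_flags = FIEMAP_FLAG_SYNC;
	fm->fm_extent_count = 32;
	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("FS_IOC_FIEMAP");
		return 1;
	}
	for (i = 0; i < fm->fm_mapped_extents; i++) {
		struct fiemap_extent *fe = &fm->fm_extents[i];

		printf("extent %u: logical %llu, length %llu, %s\n", i,
		       (unsigned long long)fe->fe_logical,
		       (unsigned long long)fe->fe_length,
		       (fe->fe_flags & FIEMAP_EXTENT_DATA_INLINE) ?
				"inline" : "regular");
	}
	free(fm);
	close(fd);
	return 0;
}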


Thanks,
Qu



I attached my results from raid 1 random write HDD (rH1), SSD (rS1) and from 
sequential write HDD (sH1), SSD (sS1)

Hopefully you have an explanation for that.

raid@raid-PowerEdge-T630:~$ uname -a
Linux raid-PowerEdge-T630 4.10.0-33-generic #37~16.04.1-Ubuntu SMP Fri Aug 11 
14:07:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
raid@raid-PowerEdge-T630:~$ btrfs --version
btrfs-progs v4.4


best regards

Carsten




Re: [PATCH] Btrfs: use self-explaining variable

2017-09-24 Thread David Sterba
On Sat, Sep 23, 2017 at 09:09:24AM +0800, Qu Wenruo wrote:
> 
> 
> On 2017年09月23日 08:48, Liu Bo wrote:
> > On Sat, Sep 23, 2017 at 08:46:55AM +0800, Qu Wenruo wrote:
> >>
> >>
> >> On 2017年09月23日 07:36, Liu Bo wrote:
> >>> This uses a bool 'do_backup' to help understand this piece of code.
> >>>
> >>> Signed-off-by: Liu Bo 
> >>> ---
> >>> This is based on a patch "Btrfs: do not backup tree roots when fsync".
> >>>
> >>>fs/btrfs/disk-io.c | 3 ++-
> >>>1 file changed, 2 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> >>> index cdb7043..9811b9d 100644
> >>> --- a/fs/btrfs/disk-io.c
> >>> +++ b/fs/btrfs/disk-io.c
> >>> @@ -3691,6 +3691,7 @@ int write_all_supers(struct btrfs_fs_info *fs_info, 
> >>> int max_mirrors)
> >>>   int max_errors;
> >>>   int total_errors = 0;
> >>>   u64 flags;
> >>> + bool do_backup = (max_mirrors == 0);
> >>
> >> Why not replace @max_mirrors with @do_backup as the parameter?
> > 
> > If I read the code correctly, max_mirrors is not just for deciding
> > backup.
> 
> That's strange.
> 
> write_all_supers() only uses @max_mirrors by passing it to 
> write_dev_supers() and wait_dev_supers().
> 
> Both of the write/wait_dev_supers() will replace @max_mirrors to 
> BTRFS_SUPER_MIRROR_MAX if it's zero.
> 
> Furthermore, all write_all_supers() callers just pass @max_mirrors as a
> bool (either 0 or 1).
> 
> So I don't see any point in not replacing the parameter with a bool.

Agreed, the idea was to replace the parameter by the bool.


Btrfs performance with small blocksize on SSD

2017-09-24 Thread Fuhrmann, Carsten
Hello,

I ran a few performance tests comparing mdadm, hardware RAID and the btrfs
raid. I noticed that the performance for small blocksizes (2k) is very bad on
SSD in general and on HDD for sequential writing.
I wonder about that result, because you say on the wiki that btrfs is very
effective for small files.

I attached my results from raid 1 random write HDD (rH1), SSD (rS1) and from 
sequential write HDD (sH1), SSD (sS1)

Hopefully you have an explanation for that.

raid@raid-PowerEdge-T630:~$ uname -a
Linux raid-PowerEdge-T630 4.10.0-33-generic #37~16.04.1-Ubuntu SMP Fri Aug 11 
14:07:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
raid@raid-PowerEdge-T630:~$ btrfs --version
btrfs-progs v4.4


best regards

Carsten



Re: [PATCH v2 0/3] btrfs: cleanup mount path

2017-09-24 Thread David Sterba
On Fri, Sep 22, 2017 at 02:56:49PM +0900, Misono, Tomohiro wrote:
> Summary:
> Cleanup mount path by avoiding calling btrfs_mount() twice.

That would be great to get rid of it, but please do it in smaller steps,
each of them being bisectable and preserving the existing functionality.

Patch 1/3 does not compile on any branch (master, v4.13, current devel
queue) and removes the ability to mount via subvolume. I don't understand
(yet) why the other filesystem type is introduced, as it is exactly the
same as the one we have now.

The description of the overall mount logic in the cover letter
should be added as a comment before btrfs_mount.


Re: [PATCH 1/2] Btrfs: fix kernel oops while reading compressed data

2017-09-24 Thread David Sterba
On Wed, Sep 20, 2017 at 05:50:18PM -0600, Liu Bo wrote:
> The kernel oops happens at
> 
> kernel BUG at fs/btrfs/extent_io.c:2104!
> ...
> RIP: clean_io_failure+0x263/0x2a0 [btrfs]
> 
> It's showing that read-repair code is using an improper mirror index.
> This is due to the fact that compression read's endio hasn't recorded
> the failed mirror index in %cb->orig_bio.

This sounds like the corner case of raid1 repair that was reported in the
past and is actually the only thing because of which we are not
comfortable marking raid1 'completely ok' on the Status page.


Re: [PATCH v3 07/14] btrfs-progs: Doc/mkfs: Add extra condition for rootdir option

2017-09-24 Thread Anand Jain




All my points are clear for this patchset:
I know I removed one function, and my reason is:
1) No or little usage
    And it's counter-intuitive.
2) Dead code (not tested nor well documented)
3) Possible workaround

I can add several extra reasons as I stated before, but number of 
reasons won't validate anything anyway.


 End user convenience wins over the developer's technical difficulties.

 Pls don't remove this feature; it needs a fix such as #2 (above) to
improve on #1 (above) as in your list.


Thanks, Anand


Re: [PATCH v2] fstests: btrfs/150 regression test for reading compressed data

2017-09-24 Thread Lu Fengqi
On Fri, Sep 22, 2017 at 05:21:27PM -0600, Liu Bo wrote:
>We had a bug in btrfs compression code which could end up with a
>kernel panic.
>
>This is adding a regression test for the bug and I've also sent a
>kernel patch to fix the bug.
>
>The patch is "Btrfs: fix kernel oops while reading compressed data".
>
>Signed-off-by: Liu Bo 
>---
>v2: - Fix ambiguous copyright.
>- Use /proc/$pid/make-it-fail to specify IO failure

- /sys/kernel/debug/fail*/task-filter:

Format: { 'Y' | 'N' }
A value of 'N' disables filtering by process (default).
Any positive value limits failures to only processes indicated by
/proc/<pid>/make-it-fail==1.

-- 
Thanks,
Lu

>- Use bash -c to run test only when pid is odd.
>- Add test to dangerous group.
>
> tests/btrfs/150 | 103 
> tests/btrfs/150.out |   3 ++
> tests/btrfs/group   |   1 +
> 3 files changed, 107 insertions(+)
> create mode 100755 tests/btrfs/150
> create mode 100644 tests/btrfs/150.out
>
>diff --git a/tests/btrfs/150 b/tests/btrfs/150
>new file mode 100755
>index 000..8891c38
>--- /dev/null
>+++ b/tests/btrfs/150
>@@ -0,0 +1,103 @@
>+#! /bin/bash
>+# FS QA Test btrfs/150
>+#
>+# This is a regression test which ends up with a kernel oops in btrfs.
>+# It occurs when btrfs's read repair happens while reading a compressed
>+# extent.
>+# The patch to fix it is
>+# Btrfs: fix kernel oops while reading compressed data
>+#
>+#---
>+# Copyright (c) 2017 Oracle.  All Rights Reserved.
>+#
>+# This program is free software; you can redistribute it and/or
>+# modify it under the terms of the GNU General Public License as
>+# published by the Free Software Foundation.
>+#
>+# This program is distributed in the hope that it would be useful,
>+# but WITHOUT ANY WARRANTY; without even the implied warranty of
>+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>+# GNU General Public License for more details.
>+#
>+# You should have received a copy of the GNU General Public License
>+# along with this program; if not, write the Free Software Foundation,
>+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
>+#---
>+#
>+
>+seq=`basename $0`
>+seqres=$RESULT_DIR/$seq
>+echo "QA output created by $seq"
>+
>+here=`pwd`
>+tmp=/tmp/$$
>+status=1  # failure is the default!
>+trap "_cleanup; exit \$status" 0 1 2 3 15
>+
>+_cleanup()
>+{
>+  cd /
>+  rm -f $tmp.*
>+}
>+
>+# get standard environment, filters and checks
>+. ./common/rc
>+. ./common/filter
>+
>+# remove previous $seqres.full before test
>+rm -f $seqres.full
>+
>+# real QA test starts here
>+
>+# Modify as appropriate.
>+_supported_fs btrfs
>+_supported_os Linux
>+_require_scratch
>+_require_fail_make_request
>+_require_scratch_dev_pool 2 
>+
>+SYSFS_BDEV=`_sysfs_dev $SCRATCH_DEV`
>+enable_io_failure()
>+{
>+echo 100 > $DEBUGFS_MNT/fail_make_request/probability
>+echo 1000 > $DEBUGFS_MNT/fail_make_request/times
>+echo 0 > $DEBUGFS_MNT/fail_make_request/verbose
>+echo 1 > $SYSFS_BDEV/make-it-fail
>+}
>+
>+disable_io_failure()
>+{
>+echo 0 > $SYSFS_BDEV/make-it-fail
>+echo 0 > $DEBUGFS_MNT/fail_make_request/probability
>+echo 0 > $DEBUGFS_MNT/fail_make_request/times
>+}
>+
>+_scratch_pool_mkfs "-d raid1 -b 1G" >> $seqres.full 2>&1
>+
>+# It doesn't matter which compression algorithm we use.
>+_scratch_mount -ocompress
>+
>+# Create a file with all data being compressed
>+$XFS_IO_PROG -f -c "pwrite -W 0 8K" $SCRATCH_MNT/foobar | _filter_xfs_io
>+
>+# Raid1 consists of two copies and btrfs decides which copy to read by 
>reader's
>+# %pid.  Now we inject errors to copy #1 and copy #0 is good.  We want to read
>+# the bad copy to trigger read-repair.
>+while [[ -z $result ]]; do
>+  # invalidate the page cache
>+  $XFS_IO_PROG -f -c "fadvise -d 0 8K" $SCRATCH_MNT/foobar
>+
>+  enable_io_failure
>+
>+  result=$(bash -c "
>+  if [ \$((\$\$ % 2)) == 1 ]; then
>+  echo 1 > /proc/\$\$/make-it-fail
>+  exec $XFS_IO_PROG -c \"pread 0 8K\" \$SCRATCH_MNT/foobar
>+  fi")
>+
>+  disable_io_failure
>+done
>+
>+# success, all done
>+status=0
>+exit
>diff --git a/tests/btrfs/150.out b/tests/btrfs/150.out
>new file mode 100644
>index 000..c492c24
>--- /dev/null
>+++ b/tests/btrfs/150.out
>@@ -0,0 +1,3 @@
>+QA output created by 150
>+wrote 8192/8192 bytes at offset 0
>+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>diff --git a/tests/btrfs/group b/tests/btrfs/group
>index 70c3f05..e73bb1b 100644
>--- a/tests/btrfs/group
>+++ b/tests/btrfs/group
>@@ -152,3 +152,4 @@
> 147 auto quick send
> 148 auto quick rw
> 149 auto quick send compress
>+150 auto quick dangerous
>-- 
>2.5.0
>