Re: BTRFS RAID filesystem unmountable

2018-12-06 Thread Qu Wenruo


On 2018/12/7 上午7:15, Michael Wade wrote:
> Hi Qu,
> 
> Me again! Having formatted the drives and rebuilt the RAID array, I
> seem to be having the same problem as before (no power cut this
> time [I bought a UPS]).

But strangely, your superblock shows it has a log tree, which means
you either hit a kernel panic/transaction abort or an unexpected power
loss.
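
For reference, a quick way to see this yourself is to check the
log_root fields in the superblock dump (a sketch only; /dev/md127 is
assumed from your earlier mails, adjust for your device):

# btrfs inspect dump-super -f /dev/md127 | grep -i log_root

A non-zero log_root is what indicates a leftover log tree from an
interrupted commit rather than a clean unmount.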

> The btrfs volume is broken on my ReadyNAS.
> 
> I have attached the results of some of the commands you asked me to
> run last time, and I am hoping you might be able to help me out.

This time the problem is more serious: some chunk tree blocks are not
even inside the system chunk range, so it's no wonder it fails to mount.
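
If you want to see the range the superblock actually maps, the system
chunk array is printed by dump-super (illustrative command; device name
assumed as above):

# btrfs inspect dump-super -f /dev/md127 | grep -A 8 sys_chunk_array

Any chunk tree block whose logical address falls outside the ranges
listed there cannot be mapped at mount time.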

To confirm it, you could run "btrfs ins dump-tree -b 17725903077376
" and paste the output.

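For example (assuming the filesystem is still on /dev/md127 as in your
earlier mails; the output file name is only a suggestion):

# btrfs inspect dump-tree -b 17725903077376 /dev/md127 > /tmp/block-17725903077376.txt 2>&1

Redirecting stderr as well captures any bytenr-mismatch or read errors
together with the dump.
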
Beyond that I don't have a definite clue. My guess is either a kernel
problem related to new chunk allocation, or that the chunk root node
itself is already seriously corrupted.

Considering how old your kernel is (4.4), it's not recommended to use
btrfs on such an old kernel unless it carries well-backported btrfs
fixes.

Thanks,
Qu

> 
> Kind regards
> Michael
> On Sat, 19 May 2018 at 12:43, Michael Wade  wrote:
>>
>> I have let the find-root command run for 14+ days; it's produced a
>> pretty huge log file (1.6 GB) but still hasn't completed. I think I will
>> start the process of reformatting my drives and starting over.
>>
>> Thanks for your help anyway.
>>
>> Kind regards
>> Michael
>>
>> On 5 May 2018 at 01:43, Qu Wenruo  wrote:
>>>
>>>
>>> On 2018年05月05日 00:18, Michael Wade wrote:
 Hi Qu,

 The tool is still running and the log file is now ~300mb. I guess it
 shouldn't normally take this long.. Is there anything else worth
 trying?
>>>
>>> I'm afraid not much.
>>>
>>> Although there is a possibility to modify btrfs-find-root to do much
>>> faster but limited search.
>>>
>>> But from the result, it looks like underlying device corruption, and not
>>> much we can do right now.
>>>
>>> Thanks,
>>> Qu
>>>

 Kind regards
 Michael

 On 2 May 2018 at 06:29, Michael Wade  wrote:
> Thanks Qu,
>
> I actually aborted the run with the old btrfs tools once I saw its
> output. The new btrfs tools is still running and has produced a log
> file of ~85mb filled with that content so far.
>
> Kind regards
> Michael
>
> On 2 May 2018 at 02:31, Qu Wenruo  wrote:
>>
>>
>> On 2018年05月01日 23:50, Michael Wade wrote:
>>> Hi Qu,
>>>
>>> Oh dear that is not good news!
>>>
>>> I have been running the find root command since yesterday but it only
>>> seems to be only be outputting the following message:
>>>
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>
>> It's mostly fine, as find-root will go through all tree blocks and try
>> to read them as tree blocks.
>> Although btrfs-find-root will suppress csum error output, but such basic
>> tree validation check is not suppressed, thus you get such message.
>>
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>
>>> I tried with the latest btrfs tools compiled from source and the ones
>>> I have installed with the same result. Is there a CLI utility I could
>>> use to determine if the log contains any other content?
>>
>> Did it report any useful info at the end?
>>
>> Thanks,
>> Qu
>>
>>>
>>> Kind regards
>>> Michael
>>>
>>>
>>> On 30 April 2018 at 04:02, Qu Wenruo  wrote:


 On 2018年04月29日 22:08, Michael Wade wrote:
> Hi Qu,
>
> Got this error message:
>
> ./btrfs inspect dump-tree -b 20800943685632 /dev/md127
> btrfs-progs v4.16.1
> bytenr mismatch, want=20800943685632, have=3118598835113619663
> ERROR: cannot read chunk root
> ERROR: unable to open /dev/md127
>
> I have attached the dumps for:
>
> dd if=/dev/md127 of=/tmp/chunk_root.copy1 bs=1 count=32K 
> skip=266325721088
> dd if=/dev/md127 of=/tmp/chunk_root.copy2 bs=1 count=32K 
> skip=266359275520

Unfortunately, both dumps are corrupted and contain mostly garbage.
I think the underlying stack (mdraid) has something wrong or failed
to recover its data.

This means your last chance will be btrfs-find-root.

Re: BTRFS RAID filesystem unmountable

2018-05-19 Thread Michael Wade
I have let the find-root command run for 14+ days; it's produced a
pretty huge log file (1.6 GB) but still hasn't completed. I think I will
start the process of reformatting my drives and starting over.

Thanks for your help anyway.

Kind regards
Michael

On 5 May 2018 at 01:43, Qu Wenruo  wrote:
>
>
> On 2018年05月05日 00:18, Michael Wade wrote:
>> Hi Qu,
>>
>> The tool is still running and the log file is now ~300mb. I guess it
>> shouldn't normally take this long.. Is there anything else worth
>> trying?
>
> I'm afraid not much.
>
> Although there is a possibility to modify btrfs-find-root to do much
> faster but limited search.
>
> But from the result, it looks like underlying device corruption, and not
> much we can do right now.
>
> Thanks,
> Qu
>
>>
>> Kind regards
>> Michael
>>
>> On 2 May 2018 at 06:29, Michael Wade  wrote:
>>> Thanks Qu,
>>>
>>> I actually aborted the run with the old btrfs tools once I saw its
>>> output. The new btrfs tools is still running and has produced a log
>>> file of ~85mb filled with that content so far.
>>>
>>> Kind regards
>>> Michael
>>>
>>> On 2 May 2018 at 02:31, Qu Wenruo  wrote:


 On 2018年05月01日 23:50, Michael Wade wrote:
> Hi Qu,
>
> Oh dear that is not good news!
>
> I have been running the find root command since yesterday but it only
> seems to be only be outputting the following message:
>
> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096

 It's mostly fine, as find-root will go through all tree blocks and try
 to read them as tree blocks.
 Although btrfs-find-root will suppress csum error output, but such basic
 tree validation check is not suppressed, thus you get such message.

> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>
> I tried with the latest btrfs tools compiled from source and the ones
> I have installed with the same result. Is there a CLI utility I could
> use to determine if the log contains any other content?

 Did it report any useful info at the end?

 Thanks,
 Qu

>
> Kind regards
> Michael
>
>
> On 30 April 2018 at 04:02, Qu Wenruo  wrote:
>>
>>
>> On 2018年04月29日 22:08, Michael Wade wrote:
>>> Hi Qu,
>>>
>>> Got this error message:
>>>
>>> ./btrfs inspect dump-tree -b 20800943685632 /dev/md127
>>> btrfs-progs v4.16.1
>>> bytenr mismatch, want=20800943685632, have=3118598835113619663
>>> ERROR: cannot read chunk root
>>> ERROR: unable to open /dev/md127
>>>
>>> I have attached the dumps for:
>>>
>>> dd if=/dev/md127 of=/tmp/chunk_root.copy1 bs=1 count=32K 
>>> skip=266325721088
>>> dd if=/dev/md127 of=/tmp/chunk_root.copy2 bs=1 count=32K 
>>> skip=266359275520
>>
>> Unfortunately, both dumps are corrupted and contain mostly garbage.
>> I think it's the underlying stack (mdraid) has something wrong or failed
>> to recover its data.
>>
>> This means your last chance will be btrfs-find-root.
>>
>> Please try:
>> # btrfs-find-root -o 3 
>>
>> And provide all the output.
>>
>> But please keep in mind, chunk root is a critical tree, and so far it's
>> already heavily damaged.
>> Although I could still continue try to recover, there is pretty low
>> chance now.
>>
>> Thanks,
>> Qu
>>>
>>> Kind regards
>>> Michael
>>>
>>>
>>> On 29 April 2018 at 10:33, Qu Wenruo  wrote:


 On 2018年04月29日 16:59, Michael Wade wrote:
> Ok, will it be possible for me to install the new version of the tools
> on my current kernel without overriding the existing install? Hesitant
> to update kernel/btrfs as it might break the ReadyNAS interface /
> future firmware upgrades.
>
> Perhaps I could grab this:
> https://github.com/kdave/btrfs-progs/releases/tag/v4.16.1 and
> hopefully build from source and then run the binaries directly?

 Of course, that's how most of us test btrfs-progs builds.

 Thanks,
 Qu

>
> Kind regards
>
> On 29 April 2018 at 09:33, Qu Wenruo  wrote:
>>
>>

Re: BTRFS RAID filesystem unmountable

2018-05-04 Thread Qu Wenruo


On 2018年05月05日 00:18, Michael Wade wrote:
> Hi Qu,
> 
> The tool is still running and the log file is now ~300mb. I guess it
> shouldn't normally take this long.. Is there anything else worth
> trying?

I'm afraid not much.

There is a possibility of modifying btrfs-find-root to do a much
faster but more limited search.

But from the result, it looks like underlying device corruption, and
there is not much we can do right now.

Thanks,
Qu

> 
> Kind regards
> Michael
> 
> On 2 May 2018 at 06:29, Michael Wade  wrote:
>> Thanks Qu,
>>
>> I actually aborted the run with the old btrfs tools once I saw its
>> output. The new btrfs tools is still running and has produced a log
>> file of ~85mb filled with that content so far.
>>
>> Kind regards
>> Michael
>>
>> On 2 May 2018 at 02:31, Qu Wenruo  wrote:
>>>
>>>
>>> On 2018年05月01日 23:50, Michael Wade wrote:
 Hi Qu,

 Oh dear that is not good news!

 I have been running the find root command since yesterday but it only
 seems to be only be outputting the following message:

 ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>
>>> It's mostly fine, as find-root will go through all tree blocks and try
>>> to read them as tree blocks.
>>> Although btrfs-find-root will suppress csum error output, but such basic
>>> tree validation check is not suppressed, thus you get such message.
>>>
 ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
 ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
 ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
 ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
 ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
 ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
 ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
 ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
 ERROR: tree block bytenr 0 is not aligned to sectorsize 4096

 I tried with the latest btrfs tools compiled from source and the ones
 I have installed with the same result. Is there a CLI utility I could
 use to determine if the log contains any other content?
>>>
>>> Did it report any useful info at the end?
>>>
>>> Thanks,
>>> Qu
>>>

 Kind regards
 Michael


 On 30 April 2018 at 04:02, Qu Wenruo  wrote:
>
>
> On 2018年04月29日 22:08, Michael Wade wrote:
>> Hi Qu,
>>
>> Got this error message:
>>
>> ./btrfs inspect dump-tree -b 20800943685632 /dev/md127
>> btrfs-progs v4.16.1
>> bytenr mismatch, want=20800943685632, have=3118598835113619663
>> ERROR: cannot read chunk root
>> ERROR: unable to open /dev/md127
>>
>> I have attached the dumps for:
>>
>> dd if=/dev/md127 of=/tmp/chunk_root.copy1 bs=1 count=32K 
>> skip=266325721088
>> dd if=/dev/md127 of=/tmp/chunk_root.copy2 bs=1 count=32K 
>> skip=266359275520
>
> Unfortunately, both dumps are corrupted and contain mostly garbage.
> I think it's the underlying stack (mdraid) has something wrong or failed
> to recover its data.
>
> This means your last chance will be btrfs-find-root.
>
> Please try:
> # btrfs-find-root -o 3 
>
> And provide all the output.
>
> But please keep in mind, chunk root is a critical tree, and so far it's
> already heavily damaged.
> Although I could still continue try to recover, there is pretty low
> chance now.
>
> Thanks,
> Qu
>>
>> Kind regards
>> Michael
>>
>>
>> On 29 April 2018 at 10:33, Qu Wenruo  wrote:
>>>
>>>
>>> On 2018年04月29日 16:59, Michael Wade wrote:
 Ok, will it be possible for me to install the new version of the tools
 on my current kernel without overriding the existing install? Hesitant
 to update kernel/btrfs as it might break the ReadyNAS interface /
 future firmware upgrades.

 Perhaps I could grab this:
 https://github.com/kdave/btrfs-progs/releases/tag/v4.16.1 and
 hopefully build from source and then run the binaries directly?
>>>
>>> Of course, that's how most of us test btrfs-progs builds.
>>>
>>> Thanks,
>>> Qu
>>>

 Kind regards

 On 29 April 2018 at 09:33, Qu Wenruo  wrote:
>
>
> On 2018年04月29日 16:11, Michael Wade wrote:
>> Thanks Qu,
>>
>> Please find attached the log file for the chunk recover command.
>
> Strangely, btrfs chunk recovery found no extra chunk beyond current
> system chunk range.
>
> Which means, it's chunk tree corrupted.
>
> Please dump the chunk tree with latest btrfs-progs (which provides the
> new --follow option).
>
> # 

Re: BTRFS RAID filesystem unmountable

2018-05-04 Thread Michael Wade
Hi Qu,

The tool is still running and the log file is now ~300 MB. I guess it
shouldn't normally take this long. Is there anything else worth
trying?

Kind regards
Michael

On 2 May 2018 at 06:29, Michael Wade  wrote:
> Thanks Qu,
>
> I actually aborted the run with the old btrfs tools once I saw its
> output. The new btrfs tools is still running and has produced a log
> file of ~85mb filled with that content so far.
>
> Kind regards
> Michael
>
> On 2 May 2018 at 02:31, Qu Wenruo  wrote:
>>
>>
>> On 2018年05月01日 23:50, Michael Wade wrote:
>>> Hi Qu,
>>>
>>> Oh dear that is not good news!
>>>
>>> I have been running the find root command since yesterday but it only
>>> seems to be only be outputting the following message:
>>>
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>
>> It's mostly fine, as find-root will go through all tree blocks and try
>> to read them as tree blocks.
>> Although btrfs-find-root will suppress csum error output, but such basic
>> tree validation check is not suppressed, thus you get such message.
>>
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>
>>> I tried with the latest btrfs tools compiled from source and the ones
>>> I have installed with the same result. Is there a CLI utility I could
>>> use to determine if the log contains any other content?
>>
>> Did it report any useful info at the end?
>>
>> Thanks,
>> Qu
>>
>>>
>>> Kind regards
>>> Michael
>>>
>>>
>>> On 30 April 2018 at 04:02, Qu Wenruo  wrote:


 On 2018年04月29日 22:08, Michael Wade wrote:
> Hi Qu,
>
> Got this error message:
>
> ./btrfs inspect dump-tree -b 20800943685632 /dev/md127
> btrfs-progs v4.16.1
> bytenr mismatch, want=20800943685632, have=3118598835113619663
> ERROR: cannot read chunk root
> ERROR: unable to open /dev/md127
>
> I have attached the dumps for:
>
> dd if=/dev/md127 of=/tmp/chunk_root.copy1 bs=1 count=32K skip=266325721088
> dd if=/dev/md127 of=/tmp/chunk_root.copy2 bs=1 count=32K skip=266359275520

 Unfortunately, both dumps are corrupted and contain mostly garbage.
 I think it's the underlying stack (mdraid) has something wrong or failed
 to recover its data.

 This means your last chance will be btrfs-find-root.

 Please try:
 # btrfs-find-root -o 3 

 And provide all the output.

 But please keep in mind, chunk root is a critical tree, and so far it's
 already heavily damaged.
 Although I could still continue try to recover, there is pretty low
 chance now.

 Thanks,
 Qu
>
> Kind regards
> Michael
>
>
> On 29 April 2018 at 10:33, Qu Wenruo  wrote:
>>
>>
>> On 2018年04月29日 16:59, Michael Wade wrote:
>>> Ok, will it be possible for me to install the new version of the tools
>>> on my current kernel without overriding the existing install? Hesitant
>>> to update kernel/btrfs as it might break the ReadyNAS interface /
>>> future firmware upgrades.
>>>
>>> Perhaps I could grab this:
>>> https://github.com/kdave/btrfs-progs/releases/tag/v4.16.1 and
>>> hopefully build from source and then run the binaries directly?
>>
>> Of course, that's how most of us test btrfs-progs builds.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Kind regards
>>>
>>> On 29 April 2018 at 09:33, Qu Wenruo  wrote:


 On 2018年04月29日 16:11, Michael Wade wrote:
> Thanks Qu,
>
> Please find attached the log file for the chunk recover command.

 Strangely, btrfs chunk recovery found no extra chunk beyond current
 system chunk range.

 Which means, it's chunk tree corrupted.

 Please dump the chunk tree with latest btrfs-progs (which provides the
 new --follow option).

 # btrfs inspect dump-tree -b 20800943685632 

 If it doesn't work, please provide the following binary dump:

 # dd if= of=/tmp/chunk_root.copy1 bs=1 count=32K skip=266325721088
 # dd if= of=/tmp/chunk_root.copy2 bs=1 count=32K skip=266359275520
 (And will need to repeat similar dump for several times according to
 above dump)

 Thanks,
 Qu



Re: BTRFS RAID filesystem unmountable

2018-05-01 Thread Michael Wade
Thanks Qu,

I actually aborted the run with the old btrfs tools once I saw its
output. The new btrfs tools are still running and have produced a log
file of ~85 MB filled with that content so far.

Kind regards
Michael

On 2 May 2018 at 02:31, Qu Wenruo  wrote:
>
>
> On 2018年05月01日 23:50, Michael Wade wrote:
>> Hi Qu,
>>
>> Oh dear that is not good news!
>>
>> I have been running the find root command since yesterday but it only
>> seems to be only be outputting the following message:
>>
>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>
> It's mostly fine, as find-root will go through all tree blocks and try
> to read them as tree blocks.
> Although btrfs-find-root will suppress csum error output, but such basic
> tree validation check is not suppressed, thus you get such message.
>
>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>
>> I tried with the latest btrfs tools compiled from source and the ones
>> I have installed with the same result. Is there a CLI utility I could
>> use to determine if the log contains any other content?
>
> Did it report any useful info at the end?
>
> Thanks,
> Qu
>
>>
>> Kind regards
>> Michael
>>
>>
>> On 30 April 2018 at 04:02, Qu Wenruo  wrote:
>>>
>>>
>>> On 2018年04月29日 22:08, Michael Wade wrote:
 Hi Qu,

 Got this error message:

 ./btrfs inspect dump-tree -b 20800943685632 /dev/md127
 btrfs-progs v4.16.1
 bytenr mismatch, want=20800943685632, have=3118598835113619663
 ERROR: cannot read chunk root
 ERROR: unable to open /dev/md127

 I have attached the dumps for:

 dd if=/dev/md127 of=/tmp/chunk_root.copy1 bs=1 count=32K skip=266325721088
 dd if=/dev/md127 of=/tmp/chunk_root.copy2 bs=1 count=32K skip=266359275520
>>>
>>> Unfortunately, both dumps are corrupted and contain mostly garbage.
>>> I think it's the underlying stack (mdraid) has something wrong or failed
>>> to recover its data.
>>>
>>> This means your last chance will be btrfs-find-root.
>>>
>>> Please try:
>>> # btrfs-find-root -o 3 
>>>
>>> And provide all the output.
>>>
>>> But please keep in mind, chunk root is a critical tree, and so far it's
>>> already heavily damaged.
>>> Although I could still continue try to recover, there is pretty low
>>> chance now.
>>>
>>> Thanks,
>>> Qu

 Kind regards
 Michael


 On 29 April 2018 at 10:33, Qu Wenruo  wrote:
>
>
> On 2018年04月29日 16:59, Michael Wade wrote:
>> Ok, will it be possible for me to install the new version of the tools
>> on my current kernel without overriding the existing install? Hesitant
>> to update kernel/btrfs as it might break the ReadyNAS interface /
>> future firmware upgrades.
>>
>> Perhaps I could grab this:
>> https://github.com/kdave/btrfs-progs/releases/tag/v4.16.1 and
>> hopefully build from source and then run the binaries directly?
>
> Of course, that's how most of us test btrfs-progs builds.
>
> Thanks,
> Qu
>
>>
>> Kind regards
>>
>> On 29 April 2018 at 09:33, Qu Wenruo  wrote:
>>>
>>>
>>> On 2018年04月29日 16:11, Michael Wade wrote:
 Thanks Qu,

 Please find attached the log file for the chunk recover command.
>>>
>>> Strangely, btrfs chunk recovery found no extra chunk beyond current
>>> system chunk range.
>>>
>>> Which means, it's chunk tree corrupted.
>>>
>>> Please dump the chunk tree with latest btrfs-progs (which provides the
>>> new --follow option).
>>>
>>> # btrfs inspect dump-tree -b 20800943685632 
>>>
>>> If it doesn't work, please provide the following binary dump:
>>>
>>> # dd if= of=/tmp/chunk_root.copy1 bs=1 count=32K skip=266325721088
>>> # dd if= of=/tmp/chunk_root.copy2 bs=1 count=32K skip=266359275520
>>> (And will need to repeat similar dump for several times according to
>>> above dump)
>>>
>>> Thanks,
>>> Qu
>>>
>>>

 Kind regards
 Michael

 On 28 April 2018 at 12:38, Qu Wenruo  wrote:
>
>
> On 2018年04月28日 17:37, Michael Wade wrote:
>> Hi Qu,
>>
>> Thanks for your reply. I will investigate upgrading the kernel,
>> however I worry that future ReadyNAS firmware upgrades would fail on 

Re: BTRFS RAID filesystem unmountable

2018-05-01 Thread Qu Wenruo


On 2018年05月01日 23:50, Michael Wade wrote:
> Hi Qu,
> 
> Oh dear that is not good news!
> 
> I have been running the find root command since yesterday but it only
> seems to be only be outputting the following message:
> 
> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096

It's mostly fine: find-root goes through all tree blocks and tries to
read them as tree blocks.
Although btrfs-find-root suppresses csum error output, such basic tree
validation checks are not suppressed, which is why you get these messages.
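
If the log is dominated by those alignment messages, a simple filter is
usually enough to see whether anything else was recorded (a sketch; the
log file name is only an example):

# grep -v 'not aligned to sectorsize' /path/to/find-root.log | less

Any candidate roots reported by btrfs-find-root would be among the
remaining lines.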

> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
> 
> I tried with the latest btrfs tools compiled from source and the ones
> I have installed with the same result. Is there a CLI utility I could
> use to determine if the log contains any other content?

Did it report any useful info at the end?

Thanks,
Qu

> 
> Kind regards
> Michael
> 
> 
> On 30 April 2018 at 04:02, Qu Wenruo  wrote:
>>
>>
>> On 2018年04月29日 22:08, Michael Wade wrote:
>>> Hi Qu,
>>>
>>> Got this error message:
>>>
>>> ./btrfs inspect dump-tree -b 20800943685632 /dev/md127
>>> btrfs-progs v4.16.1
>>> bytenr mismatch, want=20800943685632, have=3118598835113619663
>>> ERROR: cannot read chunk root
>>> ERROR: unable to open /dev/md127
>>>
>>> I have attached the dumps for:
>>>
>>> dd if=/dev/md127 of=/tmp/chunk_root.copy1 bs=1 count=32K skip=266325721088
>>> dd if=/dev/md127 of=/tmp/chunk_root.copy2 bs=1 count=32K skip=266359275520
>>
>> Unfortunately, both dumps are corrupted and contain mostly garbage.
>> I think it's the underlying stack (mdraid) has something wrong or failed
>> to recover its data.
>>
>> This means your last chance will be btrfs-find-root.
>>
>> Please try:
>> # btrfs-find-root -o 3 
>>
>> And provide all the output.
>>
>> But please keep in mind, chunk root is a critical tree, and so far it's
>> already heavily damaged.
>> Although I could still continue try to recover, there is pretty low
>> chance now.
>>
>> Thanks,
>> Qu
>>>
>>> Kind regards
>>> Michael
>>>
>>>
>>> On 29 April 2018 at 10:33, Qu Wenruo  wrote:


 On 2018年04月29日 16:59, Michael Wade wrote:
> Ok, will it be possible for me to install the new version of the tools
> on my current kernel without overriding the existing install? Hesitant
> to update kernel/btrfs as it might break the ReadyNAS interface /
> future firmware upgrades.
>
> Perhaps I could grab this:
> https://github.com/kdave/btrfs-progs/releases/tag/v4.16.1 and
> hopefully build from source and then run the binaries directly?

 Of course, that's how most of us test btrfs-progs builds.

 Thanks,
 Qu

>
> Kind regards
>
> On 29 April 2018 at 09:33, Qu Wenruo  wrote:
>>
>>
>> On 2018年04月29日 16:11, Michael Wade wrote:
>>> Thanks Qu,
>>>
>>> Please find attached the log file for the chunk recover command.
>>
>> Strangely, btrfs chunk recovery found no extra chunk beyond current
>> system chunk range.
>>
>> Which means, it's chunk tree corrupted.
>>
>> Please dump the chunk tree with latest btrfs-progs (which provides the
>> new --follow option).
>>
>> # btrfs inspect dump-tree -b 20800943685632 
>>
>> If it doesn't work, please provide the following binary dump:
>>
>> # dd if= of=/tmp/chunk_root.copy1 bs=1 count=32K skip=266325721088
>> # dd if= of=/tmp/chunk_root.copy2 bs=1 count=32K skip=266359275520
>> (And will need to repeat similar dump for several times according to
>> above dump)
>>
>> Thanks,
>> Qu
>>
>>
>>>
>>> Kind regards
>>> Michael
>>>
>>> On 28 April 2018 at 12:38, Qu Wenruo  wrote:


 On 2018年04月28日 17:37, Michael Wade wrote:
> Hi Qu,
>
> Thanks for your reply. I will investigate upgrading the kernel,
> however I worry that future ReadyNAS firmware upgrades would fail on a
> newer kernel version (I don't have much linux experience so maybe my
> concerns are unfounded!?).
>
> I have attached the output of the dump super command.
>
> I did actually run chunk recover before, without the verbose option,
> it took around 24 hours to finish but did not resolve my issue. Happy
> to start that again if you need its output.


Re: BTRFS RAID filesystem unmountable

2018-05-01 Thread Michael Wade
Hi Qu,

Oh dear that is not good news!

I have been running the find-root command since yesterday, but it only
seems to be outputting the following message:

ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
ERROR: tree block bytenr 0 is not aligned to sectorsize 4096

I tried with the latest btrfs tools compiled from source and with the
ones I have installed, with the same result. Is there a CLI utility I
could use to determine whether the log contains any other content?

Kind regards
Michael


On 30 April 2018 at 04:02, Qu Wenruo  wrote:
>
>
> On 2018年04月29日 22:08, Michael Wade wrote:
>> Hi Qu,
>>
>> Got this error message:
>>
>> ./btrfs inspect dump-tree -b 20800943685632 /dev/md127
>> btrfs-progs v4.16.1
>> bytenr mismatch, want=20800943685632, have=3118598835113619663
>> ERROR: cannot read chunk root
>> ERROR: unable to open /dev/md127
>>
>> I have attached the dumps for:
>>
>> dd if=/dev/md127 of=/tmp/chunk_root.copy1 bs=1 count=32K skip=266325721088
>> dd if=/dev/md127 of=/tmp/chunk_root.copy2 bs=1 count=32K skip=266359275520
>
> Unfortunately, both dumps are corrupted and contain mostly garbage.
> I think it's the underlying stack (mdraid) has something wrong or failed
> to recover its data.
>
> This means your last chance will be btrfs-find-root.
>
> Please try:
> # btrfs-find-root -o 3 
>
> And provide all the output.
>
> But please keep in mind, chunk root is a critical tree, and so far it's
> already heavily damaged.
> Although I could still continue try to recover, there is pretty low
> chance now.
>
> Thanks,
> Qu
>>
>> Kind regards
>> Michael
>>
>>
>> On 29 April 2018 at 10:33, Qu Wenruo  wrote:
>>>
>>>
>>> On 2018年04月29日 16:59, Michael Wade wrote:
 Ok, will it be possible for me to install the new version of the tools
 on my current kernel without overriding the existing install? Hesitant
 to update kernel/btrfs as it might break the ReadyNAS interface /
 future firmware upgrades.

 Perhaps I could grab this:
 https://github.com/kdave/btrfs-progs/releases/tag/v4.16.1 and
 hopefully build from source and then run the binaries directly?
>>>
>>> Of course, that's how most of us test btrfs-progs builds.
>>>
>>> Thanks,
>>> Qu
>>>

 Kind regards

 On 29 April 2018 at 09:33, Qu Wenruo  wrote:
>
>
> On 2018年04月29日 16:11, Michael Wade wrote:
>> Thanks Qu,
>>
>> Please find attached the log file for the chunk recover command.
>
> Strangely, btrfs chunk recovery found no extra chunk beyond current
> system chunk range.
>
> Which means, it's chunk tree corrupted.
>
> Please dump the chunk tree with latest btrfs-progs (which provides the
> new --follow option).
>
> # btrfs inspect dump-tree -b 20800943685632 
>
> If it doesn't work, please provide the following binary dump:
>
> # dd if= of=/tmp/chunk_root.copy1 bs=1 count=32K skip=266325721088
> # dd if= of=/tmp/chunk_root.copy2 bs=1 count=32K skip=266359275520
> (And will need to repeat similar dump for several times according to
> above dump)
>
> Thanks,
> Qu
>
>
>>
>> Kind regards
>> Michael
>>
>> On 28 April 2018 at 12:38, Qu Wenruo  wrote:
>>>
>>>
>>> On 2018年04月28日 17:37, Michael Wade wrote:
 Hi Qu,

 Thanks for your reply. I will investigate upgrading the kernel,
 however I worry that future ReadyNAS firmware upgrades would fail on a
 newer kernel version (I don't have much linux experience so maybe my
 concerns are unfounded!?).

 I have attached the output of the dump super command.

 I did actually run chunk recover before, without the verbose option,
 it took around 24 hours to finish but did not resolve my issue. Happy
 to start that again if you need its output.
>>>
>>> The system chunk only contains the following chunks:
>>> [0, 4194304]:   Initial temporary chunk, not used at all
>>> [20971520, 29360128]:   System chunk created by mkfs, should be full
>>> used up
>>> [20800943685632, 20800977240064]:
>>> The newly created large system chunk.
>>>
>>> The chunk root is still in 2nd chunk thus valid, but some of its leaf is
>>> out of the range.
>>>
>>> If you can't wait 24h 

Re: BTRFS RAID filesystem unmountable

2018-04-29 Thread Qu Wenruo


On 2018年04月29日 22:08, Michael Wade wrote:
> Hi Qu,
> 
> Got this error message:
> 
> ./btrfs inspect dump-tree -b 20800943685632 /dev/md127
> btrfs-progs v4.16.1
> bytenr mismatch, want=20800943685632, have=3118598835113619663
> ERROR: cannot read chunk root
> ERROR: unable to open /dev/md127
> 
> I have attached the dumps for:
> 
> dd if=/dev/md127 of=/tmp/chunk_root.copy1 bs=1 count=32K skip=266325721088
> dd if=/dev/md127 of=/tmp/chunk_root.copy2 bs=1 count=32K skip=266359275520

Unfortunately, both dumps are corrupted and contain mostly garbage.
I think the underlying stack (mdraid) has something wrong or failed
to recover its data.

This means your last chance will be btrfs-find-root.

Please try:
# btrfs-find-root -o 3 

And provide all the output.
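
For reference, on this setup that would presumably be (device and log
path are assumptions, adjust as needed):

# btrfs-find-root -o 3 /dev/md127 2>&1 | tee /tmp/find-root.log

Objectid 3 is the chunk tree, so this restricts the search to chunk
tree blocks, and tee keeps a copy of the full output.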

But please keep in mind that the chunk root is a critical tree, and so
far it's already heavily damaged.
Although I could still continue trying to recover it, the chances are
pretty low now.

Thanks,
Qu
> 
> Kind regards
> Michael
> 
> 
> On 29 April 2018 at 10:33, Qu Wenruo  wrote:
>>
>>
>> On 2018年04月29日 16:59, Michael Wade wrote:
>>> Ok, will it be possible for me to install the new version of the tools
>>> on my current kernel without overriding the existing install? Hesitant
>>> to update kernel/btrfs as it might break the ReadyNAS interface /
>>> future firmware upgrades.
>>>
>>> Perhaps I could grab this:
>>> https://github.com/kdave/btrfs-progs/releases/tag/v4.16.1 and
>>> hopefully build from source and then run the binaries directly?
>>
>> Of course, that's how most of us test btrfs-progs builds.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Kind regards
>>>
>>> On 29 April 2018 at 09:33, Qu Wenruo  wrote:


 On 2018年04月29日 16:11, Michael Wade wrote:
> Thanks Qu,
>
> Please find attached the log file for the chunk recover command.

 Strangely, btrfs chunk recovery found no extra chunk beyond current
 system chunk range.

 Which means, it's chunk tree corrupted.

 Please dump the chunk tree with latest btrfs-progs (which provides the
 new --follow option).

 # btrfs inspect dump-tree -b 20800943685632 

 If it doesn't work, please provide the following binary dump:

 # dd if= of=/tmp/chunk_root.copy1 bs=1 count=32K skip=266325721088
 # dd if= of=/tmp/chunk_root.copy2 bs=1 count=32K skip=266359275520
 (And will need to repeat similar dump for several times according to
 above dump)

 Thanks,
 Qu


>
> Kind regards
> Michael
>
> On 28 April 2018 at 12:38, Qu Wenruo  wrote:
>>
>>
>> On 2018年04月28日 17:37, Michael Wade wrote:
>>> Hi Qu,
>>>
>>> Thanks for your reply. I will investigate upgrading the kernel,
>>> however I worry that future ReadyNAS firmware upgrades would fail on a
>>> newer kernel version (I don't have much linux experience so maybe my
>>> concerns are unfounded!?).
>>>
>>> I have attached the output of the dump super command.
>>>
>>> I did actually run chunk recover before, without the verbose option,
>>> it took around 24 hours to finish but did not resolve my issue. Happy
>>> to start that again if you need its output.
>>
>> The system chunk only contains the following chunks:
>> [0, 4194304]:   Initial temporary chunk, not used at all
>> [20971520, 29360128]:   System chunk created by mkfs, should be full
>> used up
>> [20800943685632, 20800977240064]:
>> The newly created large system chunk.
>>
>> The chunk root is still in 2nd chunk thus valid, but some of its leaf is
>> out of the range.
>>
>> If you can't wait 24h for chunk recovery to run, my advice would be move
>> the disk to some other computer, and use latest btrfs-progs to execute
>> the following command:
>>
>> # btrfs inpsect dump-tree -b 20800943685632 --follow
>>
>> If we're lucky enough, we may read out the tree leaf containing the new
>> system chunk and save a day.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thanks so much for your help.
>>>
>>> Kind regards
>>> Michael
>>>
>>> On 28 April 2018 at 09:45, Qu Wenruo  wrote:


 On 2018年04月28日 16:30, Michael Wade wrote:
> Hi all,
>
> I was hoping that someone would be able to help me resolve the issues
> I am having with my ReadyNAS BTRFS volume. Basically my trouble
> started after a power cut, subsequently the volume would not mount.
> Here are the details of my setup as it is at the moment:
>
> uname -a
> Linux QAI 4.4.116.alpine.1 #1 SMP Mon Feb 19 21:58:38 PST 2018 armv7l 
> GNU/Linux

 The kernel is pretty old for btrfs.
 Strongly recommended to upgrade.


Re: BTRFS RAID filesystem unmountable

2018-04-29 Thread Qu Wenruo


On 2018年04月29日 22:08, Michael Wade wrote:
> Hi Qu,
> 
> Got this error message:
> 
> ./btrfs inspect dump-tree -b 20800943685632 /dev/md127
> btrfs-progs v4.16.1
> bytenr mismatch, want=20800943685632, have=3118598835113619663
> ERROR: cannot read chunk root
> ERROR: unable to open /dev/md127
> 
> I have attached the dumps for:
> 
> dd if=/dev/md127 of=/tmp/chunk_root.copy1 bs=1 count=32K skip=266325721088
> dd if=/dev/md127 of=/tmp/chunk_root.copy2 bs=1 count=32K skip=266359275520

A little strange: the two copies mismatch with each other.

I'll double-check the difference between them; maybe that's the reason.
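
A quick way to quantify the mismatch, if you still have both files
(paths taken from the dd commands above):

# md5sum /tmp/chunk_root.copy1 /tmp/chunk_root.copy2
# cmp -l /tmp/chunk_root.copy1 /tmp/chunk_root.copy2 | wc -l

The second command counts how many bytes differ between the two 32K
dumps.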

Thanks,
Qu

> 
> Kind regards
> Michael
> 
> 
> On 29 April 2018 at 10:33, Qu Wenruo  wrote:
>>
>>
>> On 2018年04月29日 16:59, Michael Wade wrote:
>>> Ok, will it be possible for me to install the new version of the tools
>>> on my current kernel without overriding the existing install? Hesitant
>>> to update kernel/btrfs as it might break the ReadyNAS interface /
>>> future firmware upgrades.
>>>
>>> Perhaps I could grab this:
>>> https://github.com/kdave/btrfs-progs/releases/tag/v4.16.1 and
>>> hopefully build from source and then run the binaries directly?
>>
>> Of course, that's how most of us test btrfs-progs builds.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Kind regards
>>>
>>> On 29 April 2018 at 09:33, Qu Wenruo  wrote:


 On 2018年04月29日 16:11, Michael Wade wrote:
> Thanks Qu,
>
> Please find attached the log file for the chunk recover command.

 Strangely, btrfs chunk recovery found no extra chunk beyond current
 system chunk range.

 Which means, it's chunk tree corrupted.

 Please dump the chunk tree with latest btrfs-progs (which provides the
 new --follow option).

 # btrfs inspect dump-tree -b 20800943685632 

 If it doesn't work, please provide the following binary dump:

 # dd if= of=/tmp/chunk_root.copy1 bs=1 count=32K skip=266325721088
 # dd if= of=/tmp/chunk_root.copy2 bs=1 count=32K skip=266359275520
 (And will need to repeat similar dump for several times according to
 above dump)

 Thanks,
 Qu


>
> Kind regards
> Michael
>
> On 28 April 2018 at 12:38, Qu Wenruo  wrote:
>>
>>
>> On 2018年04月28日 17:37, Michael Wade wrote:
>>> Hi Qu,
>>>
>>> Thanks for your reply. I will investigate upgrading the kernel,
>>> however I worry that future ReadyNAS firmware upgrades would fail on a
>>> newer kernel version (I don't have much linux experience so maybe my
>>> concerns are unfounded!?).
>>>
>>> I have attached the output of the dump super command.
>>>
>>> I did actually run chunk recover before, without the verbose option,
>>> it took around 24 hours to finish but did not resolve my issue. Happy
>>> to start that again if you need its output.
>>
>> The system chunk only contains the following chunks:
>> [0, 4194304]:   Initial temporary chunk, not used at all
>> [20971520, 29360128]:   System chunk created by mkfs, should be full
>> used up
>> [20800943685632, 20800977240064]:
>> The newly created large system chunk.
>>
>> The chunk root is still in 2nd chunk thus valid, but some of its leaf is
>> out of the range.
>>
>> If you can't wait 24h for chunk recovery to run, my advice would be move
>> the disk to some other computer, and use latest btrfs-progs to execute
>> the following command:
>>
>> # btrfs inpsect dump-tree -b 20800943685632 --follow
>>
>> If we're lucky enough, we may read out the tree leaf containing the new
>> system chunk and save a day.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thanks so much for your help.
>>>
>>> Kind regards
>>> Michael
>>>
>>> On 28 April 2018 at 09:45, Qu Wenruo  wrote:


 On 2018年04月28日 16:30, Michael Wade wrote:
> Hi all,
>
> I was hoping that someone would be able to help me resolve the issues
> I am having with my ReadyNAS BTRFS volume. Basically my trouble
> started after a power cut, subsequently the volume would not mount.
> Here are the details of my setup as it is at the moment:
>
> uname -a
> Linux QAI 4.4.116.alpine.1 #1 SMP Mon Feb 19 21:58:38 PST 2018 armv7l 
> GNU/Linux

 The kernel is pretty old for btrfs.
 Strongly recommended to upgrade.

>
> btrfs --version
> btrfs-progs v4.12

 So is the user tools.

 Although I think it won't be a big problem, as needed tool should be 
 there.

>
> btrfs fi show
> Label: '11baed92:data'  uuid: 20628cda-d98f-4f85-955c-932a367f8821

Re: BTRFS RAID filesystem unmountable

2018-04-29 Thread Qu Wenruo


On 2018年04月29日 16:59, Michael Wade wrote:
> Ok, will it be possible for me to install the new version of the tools
> on my current kernel without overriding the existing install? Hesitant
> to update kernel/btrfs as it might break the ReadyNAS interface /
> future firmware upgrades.
> 
> Perhaps I could grab this:
> https://github.com/kdave/btrfs-progs/releases/tag/v4.16.1 and
> hopefully build from source and then run the binaries directly?

Of course, that's how most of us test btrfs-progs builds.
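
A minimal build sketch (assuming a Debian-based system with the usual
build dependencies such as the libuuid, libblkid, zlib and lzo
development packages installed):

$ git clone https://github.com/kdave/btrfs-progs.git
$ cd btrfs-progs
$ git checkout v4.16.1
$ ./autogen.sh && ./configure --disable-documentation && make
$ ./btrfs --version

Running ./btrfs straight from the build directory leaves the
distribution-installed binaries untouched.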

Thanks,
Qu

> 
> Kind regards
> 
> On 29 April 2018 at 09:33, Qu Wenruo  wrote:
>>
>>
>> On 2018年04月29日 16:11, Michael Wade wrote:
>>> Thanks Qu,
>>>
>>> Please find attached the log file for the chunk recover command.
>>
>> Strangely, btrfs chunk recovery found no extra chunk beyond current
>> system chunk range.
>>
>> Which means, it's chunk tree corrupted.
>>
>> Please dump the chunk tree with latest btrfs-progs (which provides the
>> new --follow option).
>>
>> # btrfs inspect dump-tree -b 20800943685632 
>>
>> If it doesn't work, please provide the following binary dump:
>>
>> # dd if= of=/tmp/chunk_root.copy1 bs=1 count=32K skip=266325721088
>> # dd if= of=/tmp/chunk_root.copy2 bs=1 count=32K skip=266359275520
>> (And will need to repeat similar dump for several times according to
>> above dump)
>>
>> Thanks,
>> Qu
>>
>>
>>>
>>> Kind regards
>>> Michael
>>>
>>> On 28 April 2018 at 12:38, Qu Wenruo  wrote:


 On 2018年04月28日 17:37, Michael Wade wrote:
> Hi Qu,
>
> Thanks for your reply. I will investigate upgrading the kernel,
> however I worry that future ReadyNAS firmware upgrades would fail on a
> newer kernel version (I don't have much linux experience so maybe my
> concerns are unfounded!?).
>
> I have attached the output of the dump super command.
>
> I did actually run chunk recover before, without the verbose option,
> it took around 24 hours to finish but did not resolve my issue. Happy
> to start that again if you need its output.

 The system chunk only contains the following chunks:
 [0, 4194304]:   Initial temporary chunk, not used at all
 [20971520, 29360128]:   System chunk created by mkfs, should be full
 used up
 [20800943685632, 20800977240064]:
 The newly created large system chunk.

 The chunk root is still in 2nd chunk thus valid, but some of its leaf is
 out of the range.

 If you can't wait 24h for chunk recovery to run, my advice would be move
 the disk to some other computer, and use latest btrfs-progs to execute
 the following command:

 # btrfs inpsect dump-tree -b 20800943685632 --follow

 If we're lucky enough, we may read out the tree leaf containing the new
 system chunk and save a day.

 Thanks,
 Qu

>
> Thanks so much for your help.
>
> Kind regards
> Michael
>
> On 28 April 2018 at 09:45, Qu Wenruo  wrote:
>>
>>
>> On 2018年04月28日 16:30, Michael Wade wrote:
>>> Hi all,
>>>
>>> I was hoping that someone would be able to help me resolve the issues
>>> I am having with my ReadyNAS BTRFS volume. Basically my trouble
>>> started after a power cut, subsequently the volume would not mount.
>>> Here are the details of my setup as it is at the moment:
>>>
>>> uname -a
>>> Linux QAI 4.4.116.alpine.1 #1 SMP Mon Feb 19 21:58:38 PST 2018 armv7l 
>>> GNU/Linux
>>
>> The kernel is pretty old for btrfs.
>> Strongly recommended to upgrade.
>>
>>>
>>> btrfs --version
>>> btrfs-progs v4.12
>>
>> So is the user tools.
>>
>> Although I think it won't be a big problem, as needed tool should be 
>> there.
>>
>>>
>>> btrfs fi show
>>> Label: '11baed92:data'  uuid: 20628cda-d98f-4f85-955c-932a367f8821
>>> Total devices 1 FS bytes used 5.12TiB
>>> devid1 size 7.27TiB used 6.24TiB path /dev/md127
>>
>> So, it's btrfs on mdraid.
>> It would normally make things harder to debug, so I could only provide
>> advice from the respect of btrfs.
>> For mdraid part, I can't ensure anything.
>>
>>>
>>> Here are the relevant dmesg logs for the current state of the device:
>>>
>>> [   19.119391] md: md127 stopped.
>>> [   19.120841] md: bind
>>> [   19.121120] md: bind
>>> [   19.121380] md: bind
>>> [   19.125535] md/raid:md127: device sda3 operational as raid disk 0
>>> [   19.125547] md/raid:md127: device sdc3 operational as raid disk 2
>>> [   19.125554] md/raid:md127: device sdb3 operational as raid disk 1
>>> [   19.126712] md/raid:md127: allocated 3240kB
>>> [   19.126778] md/raid:md127: raid level 5 active with 3 out of 3
>>> devices, algorithm 2
>>> [   19.126784] RAID conf printout:

Re: BTRFS RAID filesystem unmountable

2018-04-29 Thread Michael Wade
OK, will it be possible for me to install the new version of the tools
on my current kernel without overwriting the existing install? I'm
hesitant to update the kernel/btrfs as it might break the ReadyNAS
interface / future firmware upgrades.

Perhaps I could grab this:
https://github.com/kdave/btrfs-progs/releases/tag/v4.16.1 and
hopefully build from source and then run the binaries directly?

Kind regards

On 29 April 2018 at 09:33, Qu Wenruo  wrote:
>
>
> On 2018年04月29日 16:11, Michael Wade wrote:
>> Thanks Qu,
>>
>> Please find attached the log file for the chunk recover command.
>
> Strangely, btrfs chunk recovery found no extra chunk beyond current
> system chunk range.
>
> Which means, it's chunk tree corrupted.
>
> Please dump the chunk tree with latest btrfs-progs (which provides the
> new --follow option).
>
> # btrfs inspect dump-tree -b 20800943685632 
>
> If it doesn't work, please provide the following binary dump:
>
> # dd if= of=/tmp/chunk_root.copy1 bs=1 count=32K skip=266325721088
> # dd if= of=/tmp/chunk_root.copy2 bs=1 count=32K skip=266359275520
> (And will need to repeat similar dump for several times according to
> above dump)
>
> Thanks,
> Qu
>
>
>>
>> Kind regards
>> Michael
>>
>> On 28 April 2018 at 12:38, Qu Wenruo  wrote:
>>>
>>>
>>> On 2018年04月28日 17:37, Michael Wade wrote:
 Hi Qu,

 Thanks for your reply. I will investigate upgrading the kernel,
 however I worry that future ReadyNAS firmware upgrades would fail on a
 newer kernel version (I don't have much linux experience so maybe my
 concerns are unfounded!?).

 I have attached the output of the dump super command.

 I did actually run chunk recover before, without the verbose option,
 it took around 24 hours to finish but did not resolve my issue. Happy
 to start that again if you need its output.
>>>
>>> The system chunk only contains the following chunks:
>>> [0, 4194304]:   Initial temporary chunk, not used at all
>>> [20971520, 29360128]:   System chunk created by mkfs, should be full
>>> used up
>>> [20800943685632, 20800977240064]:
>>> The newly created large system chunk.
>>>
>>> The chunk root is still in 2nd chunk thus valid, but some of its leaf is
>>> out of the range.
>>>
>>> If you can't wait 24h for chunk recovery to run, my advice would be move
>>> the disk to some other computer, and use latest btrfs-progs to execute
>>> the following command:
>>>
>>> # btrfs inpsect dump-tree -b 20800943685632 --follow
>>>
>>> If we're lucky enough, we may read out the tree leaf containing the new
>>> system chunk and save a day.
>>>
>>> Thanks,
>>> Qu
>>>

 Thanks so much for your help.

 Kind regards
 Michael

 On 28 April 2018 at 09:45, Qu Wenruo  wrote:
>
>
> On 2018年04月28日 16:30, Michael Wade wrote:
>> Hi all,
>>
>> I was hoping that someone would be able to help me resolve the issues
>> I am having with my ReadyNAS BTRFS volume. Basically my trouble
>> started after a power cut, subsequently the volume would not mount.
>> Here are the details of my setup as it is at the moment:
>>
>> uname -a
>> Linux QAI 4.4.116.alpine.1 #1 SMP Mon Feb 19 21:58:38 PST 2018 armv7l 
>> GNU/Linux
>
> The kernel is pretty old for btrfs.
> Strongly recommended to upgrade.
>
>>
>> btrfs --version
>> btrfs-progs v4.12
>
> So is the user tools.
>
> Although I think it won't be a big problem, as needed tool should be 
> there.
>
>>
>> btrfs fi show
>> Label: '11baed92:data'  uuid: 20628cda-d98f-4f85-955c-932a367f8821
>> Total devices 1 FS bytes used 5.12TiB
>> devid1 size 7.27TiB used 6.24TiB path /dev/md127
>
> So, it's btrfs on mdraid.
> It would normally make things harder to debug, so I could only provide
> advice from the respect of btrfs.
> For mdraid part, I can't ensure anything.
>
>>
>> Here are the relevant dmesg logs for the current state of the device:
>>
>> [   19.119391] md: md127 stopped.
>> [   19.120841] md: bind
>> [   19.121120] md: bind
>> [   19.121380] md: bind
>> [   19.125535] md/raid:md127: device sda3 operational as raid disk 0
>> [   19.125547] md/raid:md127: device sdc3 operational as raid disk 2
>> [   19.125554] md/raid:md127: device sdb3 operational as raid disk 1
>> [   19.126712] md/raid:md127: allocated 3240kB
>> [   19.126778] md/raid:md127: raid level 5 active with 3 out of 3
>> devices, algorithm 2
>> [   19.126784] RAID conf printout:
>> [   19.126789]  --- level:5 rd:3 wd:3
>> [   19.126794]  disk 0, o:1, dev:sda3
>> [   19.126799]  disk 1, o:1, dev:sdb3
>> [   19.126804]  disk 2, o:1, dev:sdc3
>> [   19.128118] md127: detected capacity change from 0 to 7991637573632

Re: BTRFS RAID filesystem unmountable

2018-04-29 Thread Qu Wenruo


On 2018年04月29日 16:11, Michael Wade wrote:
> Thanks Qu,
> 
> Please find attached the log file for the chunk recover command.

Strangely, btrfs chunk recovery found no extra chunks beyond the current
system chunk range.

Which means the chunk tree itself is corrupted.

Please dump the chunk tree with the latest btrfs-progs (which provides
the new --follow option).

# btrfs inspect dump-tree -b 20800943685632 

If it doesn't work, please provide the following binary dump:

# dd if= of=/tmp/chunk_root.copy1 bs=1 count=32K skip=266325721088
# dd if= of=/tmp/chunk_root.copy2 bs=1 count=32K skip=266359275520
(And we will need to repeat similar dumps several times, according to
the above dump.)
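
A side note: bs=1 works but is very slow. Since both offsets are
4096-aligned (266325721088 / 4096 = 65020928 and 266359275520 / 4096 =
65029120), an equivalent and much faster form would be (device assumed
to be /dev/md127 as before):

# dd if=/dev/md127 of=/tmp/chunk_root.copy1 bs=4096 skip=65020928 count=8
# dd if=/dev/md127 of=/tmp/chunk_root.copy2 bs=4096 skip=65029120 count=8

where count=8 gives the same 32K.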

Thanks,
Qu


> 
> Kind regards
> Michael
> 
> On 28 April 2018 at 12:38, Qu Wenruo  wrote:
>>
>>
>> On 2018年04月28日 17:37, Michael Wade wrote:
>>> Hi Qu,
>>>
>>> Thanks for your reply. I will investigate upgrading the kernel,
>>> however I worry that future ReadyNAS firmware upgrades would fail on a
>>> newer kernel version (I don't have much linux experience so maybe my
>>> concerns are unfounded!?).
>>>
>>> I have attached the output of the dump super command.
>>>
>>> I did actually run chunk recover before, without the verbose option,
>>> it took around 24 hours to finish but did not resolve my issue. Happy
>>> to start that again if you need its output.
>>
>> The system chunk only contains the following chunks:
>> [0, 4194304]:   Initial temporary chunk, not used at all
>> [20971520, 29360128]:   System chunk created by mkfs, should be full
>> used up
>> [20800943685632, 20800977240064]:
>> The newly created large system chunk.
>>
>> The chunk root is still in 2nd chunk thus valid, but some of its leaf is
>> out of the range.
>>
>> If you can't wait 24h for chunk recovery to run, my advice would be move
>> the disk to some other computer, and use latest btrfs-progs to execute
>> the following command:
>>
>> # btrfs inpsect dump-tree -b 20800943685632 --follow
>>
>> If we're lucky enough, we may read out the tree leaf containing the new
>> system chunk and save a day.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thanks so much for your help.
>>>
>>> Kind regards
>>> Michael
>>>
>>> On 28 April 2018 at 09:45, Qu Wenruo  wrote:


 On 2018年04月28日 16:30, Michael Wade wrote:
> Hi all,
>
> I was hoping that someone would be able to help me resolve the issues
> I am having with my ReadyNAS BTRFS volume. Basically my trouble
> started after a power cut, subsequently the volume would not mount.
> Here are the details of my setup as it is at the moment:
>
> uname -a
> Linux QAI 4.4.116.alpine.1 #1 SMP Mon Feb 19 21:58:38 PST 2018 armv7l 
> GNU/Linux

 The kernel is pretty old for btrfs.
 Strongly recommended to upgrade.

>
> btrfs --version
> btrfs-progs v4.12

 So is the user tools.

 Although I think it won't be a big problem, as needed tool should be there.

>
> btrfs fi show
> Label: '11baed92:data'  uuid: 20628cda-d98f-4f85-955c-932a367f8821
> Total devices 1 FS bytes used 5.12TiB
> devid1 size 7.27TiB used 6.24TiB path /dev/md127

 So, it's btrfs on mdraid.
 It would normally make things harder to debug, so I could only provide
 advice from the respect of btrfs.
 For mdraid part, I can't ensure anything.

>
> Here are the relevant dmesg logs for the current state of the device:
>
> [   19.119391] md: md127 stopped.
> [   19.120841] md: bind
> [   19.121120] md: bind
> [   19.121380] md: bind
> [   19.125535] md/raid:md127: device sda3 operational as raid disk 0
> [   19.125547] md/raid:md127: device sdc3 operational as raid disk 2
> [   19.125554] md/raid:md127: device sdb3 operational as raid disk 1
> [   19.126712] md/raid:md127: allocated 3240kB
> [   19.126778] md/raid:md127: raid level 5 active with 3 out of 3
> devices, algorithm 2
> [   19.126784] RAID conf printout:
> [   19.126789]  --- level:5 rd:3 wd:3
> [   19.126794]  disk 0, o:1, dev:sda3
> [   19.126799]  disk 1, o:1, dev:sdb3
> [   19.126804]  disk 2, o:1, dev:sdc3
> [   19.128118] md127: detected capacity change from 0 to 7991637573632
> [   19.395112] Adding 523708k swap on /dev/md1.  Priority:-1 extents:1
> across:523708k
> [   19.434956] BTRFS: device label 11baed92:data devid 1 transid
> 151800 /dev/md127
> [   19.739276] BTRFS info (device md127): setting nodatasum
> [   19.740440] BTRFS critical (device md127): unable to find logical
> 3208757641216 len 4096
> [   19.740450] BTRFS critical (device md127): unable to find logical
> 3208757641216 len 4096
> [   19.740498] BTRFS critical (device md127): unable to find logical
> 3208757641216 len 4096
> [   19.740512] BTRFS critical (device md127): unable to find logical

Re: BTRFS RAID filesystem unmountable

2018-04-28 Thread Qu Wenruo


On 2018年04月28日 17:37, Michael Wade wrote:
> Hi Qu,
> 
> Thanks for your reply. I will investigate upgrading the kernel,
> however I worry that future ReadyNAS firmware upgrades would fail on a
> newer kernel version (I don't have much linux experience so maybe my
> concerns are unfounded!?).
> 
> I have attached the output of the dump super command.
> 
> I did actually run chunk recover before, without the verbose option,
> it took around 24 hours to finish but did not resolve my issue. Happy
> to start that again if you need its output.

The system chunk only contains the following chunks:
[0, 4194304]:   Initial temporary chunk, not used at all
[20971520, 29360128]:   System chunk created by mkfs, should be fully
used up
[20800943685632, 20800977240064]:
The newly created large system chunk.

The chunk root is still in the 2nd chunk and thus valid, but some of its
leaves are out of that range.

If you can't wait 24h for chunk recovery to run, my advice would be to
move the disk to some other computer and use the latest btrfs-progs to
execute the following command:

# btrfs inspect dump-tree -b 20800943685632 --follow

If we're lucky enough, we may read out the tree leaf containing the new
system chunk and save the day.
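
For example, run against the md device and captured to a file (names
are only suggestions):

# btrfs inspect dump-tree -b 20800943685632 --follow /dev/md127 > /tmp/chunk-tree-dump.txt 2>&1

--follow makes dump-tree descend into the children of that block, so
with luck the whole surviving chunk tree ends up in the file.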

Thanks,
Qu

> 
> Thanks so much for your help.
> 
> Kind regards
> Michael
> 
> On 28 April 2018 at 09:45, Qu Wenruo  wrote:
>>
>>
>> On 2018年04月28日 16:30, Michael Wade wrote:
>>> Hi all,
>>>
>>> I was hoping that someone would be able to help me resolve the issues
>>> I am having with my ReadyNAS BTRFS volume. Basically my trouble
>>> started after a power cut, subsequently the volume would not mount.
>>> Here are the details of my setup as it is at the moment:
>>>
>>> uname -a
>>> Linux QAI 4.4.116.alpine.1 #1 SMP Mon Feb 19 21:58:38 PST 2018 armv7l 
>>> GNU/Linux
>>
>> The kernel is pretty old for btrfs.
>> Strongly recommended to upgrade.
>>
>>>
>>> btrfs --version
>>> btrfs-progs v4.12
>>
>> So is the user tools.
>>
>> Although I think it won't be a big problem, as needed tool should be there.
>>
>>>
>>> btrfs fi show
>>> Label: '11baed92:data'  uuid: 20628cda-d98f-4f85-955c-932a367f8821
>>> Total devices 1 FS bytes used 5.12TiB
>>> devid1 size 7.27TiB used 6.24TiB path /dev/md127
>>
>> So, it's btrfs on mdraid.
>> It would normally make things harder to debug, so I could only provide
>> advice from the respect of btrfs.
>> For mdraid part, I can't ensure anything.
>>
>>>
>>> Here are the relevant dmesg logs for the current state of the device:
>>>
>>> [   19.119391] md: md127 stopped.
>>> [   19.120841] md: bind
>>> [   19.121120] md: bind
>>> [   19.121380] md: bind
>>> [   19.125535] md/raid:md127: device sda3 operational as raid disk 0
>>> [   19.125547] md/raid:md127: device sdc3 operational as raid disk 2
>>> [   19.125554] md/raid:md127: device sdb3 operational as raid disk 1
>>> [   19.126712] md/raid:md127: allocated 3240kB
>>> [   19.126778] md/raid:md127: raid level 5 active with 3 out of 3
>>> devices, algorithm 2
>>> [   19.126784] RAID conf printout:
>>> [   19.126789]  --- level:5 rd:3 wd:3
>>> [   19.126794]  disk 0, o:1, dev:sda3
>>> [   19.126799]  disk 1, o:1, dev:sdb3
>>> [   19.126804]  disk 2, o:1, dev:sdc3
>>> [   19.128118] md127: detected capacity change from 0 to 7991637573632
>>> [   19.395112] Adding 523708k swap on /dev/md1.  Priority:-1 extents:1
>>> across:523708k
>>> [   19.434956] BTRFS: device label 11baed92:data devid 1 transid
>>> 151800 /dev/md127
>>> [   19.739276] BTRFS info (device md127): setting nodatasum
>>> [   19.740440] BTRFS critical (device md127): unable to find logical
>>> 3208757641216 len 4096
>>> [   19.740450] BTRFS critical (device md127): unable to find logical
>>> 3208757641216 len 4096
>>> [   19.740498] BTRFS critical (device md127): unable to find logical
>>> 3208757641216 len 4096
>>> [   19.740512] BTRFS critical (device md127): unable to find logical
>>> 3208757641216 len 4096
>>> [   19.740552] BTRFS critical (device md127): unable to find logical
>>> 3208757641216 len 4096
>>> [   19.740560] BTRFS critical (device md127): unable to find logical
>>> 3208757641216 len 4096
>>> [   19.740576] BTRFS error (device md127): failed to read chunk root
>>
>> This shows it pretty clearly: btrfs fails to read the chunk root.
>> And according to the "len 4096" above, it's a pretty old fs, as it's still
>> using a 4K nodesize rather than the 16K nodesize.
>>
>> According to the above output, your superblock somehow lacks the
>> needed system chunk mapping, which is used to initialize the chunk mapping.
>>
>> Please provide the following command output:
>>
>> # btrfs inspect dump-super -fFa /dev/md127
>>
>> Also, please consider running the following command and saving all its output:
>>
>> # btrfs rescue chunk-recover -v /dev/md127
>>
>> Please note that the above command can take a long time to finish, and if
>> it works without problems, it may solve your problem.
>> But if it doesn't 

Re: BTRFS RAID filesystem unmountable

2018-04-28 Thread Michael Wade
Hi Qu,

Thanks for your reply. I will investigate upgrading the kernel,
however I worry that future ReadyNAS firmware upgrades would fail on a
newer kernel version (I don't have much linux experience so maybe my
concerns are unfounded!?).

I have attached the output of the dump super command.

I did actually run chunk recover before, without the verbose option,
it took around 24 hours to finish but did not resolve my issue. Happy
to start that again if you need its output.

Thanks so much for your help.

Kind regards
Michael

On 28 April 2018 at 09:45, Qu Wenruo  wrote:
>
>
> On 2018年04月28日 16:30, Michael Wade wrote:
>> Hi all,
>>
>> I was hoping that someone would be able to help me resolve the issues
>> I am having with my ReadyNAS BTRFS volume. Basically my trouble
>> started after a power cut, subsequently the volume would not mount.
>> Here are the details of my setup as it is at the moment:
>>
>> uname -a
>> Linux QAI 4.4.116.alpine.1 #1 SMP Mon Feb 19 21:58:38 PST 2018 armv7l 
>> GNU/Linux
>
> The kernel is pretty old for btrfs.
> Strongly recommended to upgrade.
>
>>
>> btrfs --version
>> btrfs-progs v4.12
>
> So is the user tools.
>
> Although I think it won't be a big problem, as needed tool should be there.
>
>>
>> btrfs fi show
>> Label: '11baed92:data'  uuid: 20628cda-d98f-4f85-955c-932a367f8821
>> Total devices 1 FS bytes used 5.12TiB
>> devid1 size 7.27TiB used 6.24TiB path /dev/md127
>
> So, it's btrfs on mdraid.
> It would normally make things harder to debug, so I could only provide
> advice from the respect of btrfs.
> For mdraid part, I can't ensure anything.
>
>>
>> Here are the relevant dmesg logs for the current state of the device:
>>
>> [   19.119391] md: md127 stopped.
>> [   19.120841] md: bind
>> [   19.121120] md: bind
>> [   19.121380] md: bind
>> [   19.125535] md/raid:md127: device sda3 operational as raid disk 0
>> [   19.125547] md/raid:md127: device sdc3 operational as raid disk 2
>> [   19.125554] md/raid:md127: device sdb3 operational as raid disk 1
>> [   19.126712] md/raid:md127: allocated 3240kB
>> [   19.126778] md/raid:md127: raid level 5 active with 3 out of 3
>> devices, algorithm 2
>> [   19.126784] RAID conf printout:
>> [   19.126789]  --- level:5 rd:3 wd:3
>> [   19.126794]  disk 0, o:1, dev:sda3
>> [   19.126799]  disk 1, o:1, dev:sdb3
>> [   19.126804]  disk 2, o:1, dev:sdc3
>> [   19.128118] md127: detected capacity change from 0 to 7991637573632
>> [   19.395112] Adding 523708k swap on /dev/md1.  Priority:-1 extents:1
>> across:523708k
>> [   19.434956] BTRFS: device label 11baed92:data devid 1 transid
>> 151800 /dev/md127
>> [   19.739276] BTRFS info (device md127): setting nodatasum
>> [   19.740440] BTRFS critical (device md127): unable to find logical
>> 3208757641216 len 4096
>> [   19.740450] BTRFS critical (device md127): unable to find logical
>> 3208757641216 len 4096
>> [   19.740498] BTRFS critical (device md127): unable to find logical
>> 3208757641216 len 4096
>> [   19.740512] BTRFS critical (device md127): unable to find logical
>> 3208757641216 len 4096
>> [   19.740552] BTRFS critical (device md127): unable to find logical
>> 3208757641216 len 4096
>> [   19.740560] BTRFS critical (device md127): unable to find logical
>> 3208757641216 len 4096
>> [   19.740576] BTRFS error (device md127): failed to read chunk root
>
> This shows it pretty clearly: btrfs fails to read the chunk root.
> And according to the "len 4096" above, it's a pretty old fs, as it's still
> using a 4K nodesize rather than the 16K nodesize.
>
> According to the above output, your superblock somehow lacks the
> needed system chunk mapping, which is used to initialize the chunk mapping.
>
> Please provide the following command output:
>
> # btrfs inspect dump-super -fFa /dev/md127
>
> Also, please consider running the following command and saving all its output:
>
> # btrfs rescue chunk-recover -v /dev/md127
>
> Please note that the above command can take a long time to finish, and if
> it works without problems, it may solve your problem.
> But if it doesn't work, the output could help me manually craft a fix
> for your super block.
>
> Thanks,
> Qu
>
>
>> [   19.783975] BTRFS error (device md127): open_ctree failed
>>
>> In an attempt to recover the volume myself I run a few BTRFS commands
>> mostly using advice from here:
>> https://lists.opensuse.org/opensuse/2017-02/msg00930.html. However
>> that actually seems to have made things worse as I can no longer mount
>> the file system, not even in readonly mode.
>>
>> So starting from the beginning here is a list of things I have done so
>> far (hopefully I remembered the order in which I ran them!)
>>
>> 1. Noticed that my backups to the NAS were not running (didn't get
>> notified that the volume had basically "died")
>> 2. ReadyNAS UI indicated that the volume was inactive.
>> 3. SSHed onto the box and found that the first drive was not marked as
>> operational (log showed I/O errors / UNKOWN (0x2003))  so I 

Re: BTRFS RAID filesystem unmountable

2018-04-28 Thread Qu Wenruo


On 2018年04月28日 16:30, Michael Wade wrote:
> Hi all,
> 
> I was hoping that someone would be able to help me resolve the issues
> I am having with my ReadyNAS BTRFS volume. Basically my trouble
> started after a power cut, subsequently the volume would not mount.
> Here are the details of my setup as it is at the moment:
> 
> uname -a
> Linux QAI 4.4.116.alpine.1 #1 SMP Mon Feb 19 21:58:38 PST 2018 armv7l 
> GNU/Linux

The kernel is pretty old for btrfs.
Strongly recommended to upgrade.

> 
> btrfs --version
> btrfs-progs v4.12

So are the user tools.

Although I think it won't be a big problem, as the needed tools should be there.

> 
> btrfs fi show
> Label: '11baed92:data'  uuid: 20628cda-d98f-4f85-955c-932a367f8821
> Total devices 1 FS bytes used 5.12TiB
> devid1 size 7.27TiB used 6.24TiB path /dev/md127

So, it's btrfs on mdraid.
That normally makes things harder to debug, so I can only provide
advice from the btrfs side.
For the mdraid part, I can't promise anything.

> 
> Here are the relevant dmesg logs for the current state of the device:
> 
> [   19.119391] md: md127 stopped.
> [   19.120841] md: bind
> [   19.121120] md: bind
> [   19.121380] md: bind
> [   19.125535] md/raid:md127: device sda3 operational as raid disk 0
> [   19.125547] md/raid:md127: device sdc3 operational as raid disk 2
> [   19.125554] md/raid:md127: device sdb3 operational as raid disk 1
> [   19.126712] md/raid:md127: allocated 3240kB
> [   19.126778] md/raid:md127: raid level 5 active with 3 out of 3
> devices, algorithm 2
> [   19.126784] RAID conf printout:
> [   19.126789]  --- level:5 rd:3 wd:3
> [   19.126794]  disk 0, o:1, dev:sda3
> [   19.126799]  disk 1, o:1, dev:sdb3
> [   19.126804]  disk 2, o:1, dev:sdc3
> [   19.128118] md127: detected capacity change from 0 to 7991637573632
> [   19.395112] Adding 523708k swap on /dev/md1.  Priority:-1 extents:1
> across:523708k
> [   19.434956] BTRFS: device label 11baed92:data devid 1 transid
> 151800 /dev/md127
> [   19.739276] BTRFS info (device md127): setting nodatasum
> [   19.740440] BTRFS critical (device md127): unable to find logical
> 3208757641216 len 4096
> [   19.740450] BTRFS critical (device md127): unable to find logical
> 3208757641216 len 4096
> [   19.740498] BTRFS critical (device md127): unable to find logical
> 3208757641216 len 4096
> [   19.740512] BTRFS critical (device md127): unable to find logical
> 3208757641216 len 4096
> [   19.740552] BTRFS critical (device md127): unable to find logical
> 3208757641216 len 4096
> [   19.740560] BTRFS critical (device md127): unable to find logical
> 3208757641216 len 4096
> [   19.740576] BTRFS error (device md127): failed to read chunk root

This shows it pretty clearly: btrfs fails to read the chunk root.
And according to the "len 4096" above, it's a pretty old fs, as it's still
using a 4K nodesize rather than the 16K nodesize.

According to the above output, your superblock somehow lacks the
needed system chunk mapping, which is used to initialize the chunk mapping.

Please provide the following command output:

# btrfs inspect dump-super -fFa /dev/md127

Also, please consider running the following command and saving all its output:

# btrfs rescue chunk-recover -v /dev/md127

Please note that the above command can take a long time to finish, and if
it works without problems, it may solve your problem.
But if it doesn't work, the output could help me manually craft a fix
for your super block.

Thanks,
Qu


> [   19.783975] BTRFS error (device md127): open_ctree failed
> 
> In an attempt to recover the volume myself I ran a few BTRFS commands
> mostly using advice from here:
> https://lists.opensuse.org/opensuse/2017-02/msg00930.html. However
> that actually seems to have made things worse as I can no longer mount
> the file system, not even in readonly mode.
> 
> So starting from the beginning here is a list of things I have done so
> far (hopefully I remembered the order in which I ran them!)
> 
> 1. Noticed that my backups to the NAS were not running (didn't get
> notified that the volume had basically "died")
> 2. ReadyNAS UI indicated that the volume was inactive.
> 3. SSHed onto the box and found that the first drive was not marked as
> operational (log showed I/O errors / UNKNOWN (0x2003))  so I replaced
> the disk and let the array resync.
> 4. After the resync the volume was still inaccessible so I looked at the
> logs once more and saw something like the following which seemed to
> indicate that the replay log had been corrupted when the power went
> out:
> 
> BTRFS critical (device md127): corrupt leaf, non-root leaf's nritems
> is 0: block=232292352, root=7, slot=0
> BTRFS critical (device md127): corrupt leaf, non-root leaf's nritems
> is 0: block=232292352, root=7, slot=0
> BTRFS: error (device md127) in btrfs_replay_log:2524: errno=-5 IO
> failure (Failed to recover log tree)
> BTRFS error (device md127): pending csums is 155648
> BTRFS error (device md127): cleaner transaction attach returned -30
> BTRFS 

Re: BTRFS RAID 1 not mountable: open_ctree failed, super_num_devices 3 mismatch with num_devices 2 found here

2017-08-24 Thread Dmitrii Tcvetkov
>  I rebooted with HWE K4.11
> 
> and took a pic of the error message (see attachment).
> 
> It seems btrfs still sees the removed NVME. 
> There is a mismatch from super_num_devices (3) to num_devices (2)
> which indicates something strange is going on here, imho.
> 
> Then i returned and booted K4.4, which boots fine.
> 
> root@vHost1:~# btrfs dev stat /
> [/dev/nvme0n1p1].write_io_errs   0
> [/dev/nvme0n1p1].read_io_errs0
> [/dev/nvme0n1p1].flush_io_errs   0
> [/dev/nvme0n1p1].corruption_errs 0
> [/dev/nvme0n1p1].generation_errs 0
> [/dev/sda1].write_io_errs   0
> [/dev/sda1].read_io_errs0
> [/dev/sda1].flush_io_errs   0
> [/dev/sda1].corruption_errs 0
> [/dev/sda1].generation_errs 0
> 
> Btw i edited the subject to match the correct error.
> 
> 
> Sash

That's very odd; if super_num_devices in the superblocks doesn't match the real
number of devices, then the 4.4 kernel shouldn't mount the filesystem either.

We probably need help from one of the btrfs developers since I'm not one, I'm
just a btrfs user.
Can you provide the outputs of:
btrfs inspect-internal dump-super -f /dev/sda1
btrfs inspect-internal dump-super -f /dev/nvme0n1p1

Depending on the version of btrfs-progs you may need to use btrfs-dump-super
instead of "btrfs inspect-internal dump-super".

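For a quick comparison, something like this should pull out just the device
count and devid fields from each superblock (a rough sketch; adjust the
device names to your setup):

# btrfs inspect-internal dump-super -f /dev/sda1 | grep -E 'num_devices|devid'
# btrfs inspect-internal dump-super -f /dev/nvme0n1p1 | grep -E 'num_devices|devid'
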
>3rd i saw https://patchwork.kernel.org/patch/9419189/ from Roman. Did
>he receive any comments on his patch? This one could help on this
>problem, too. 

I don't know about this patch from Roman per se, but there is a patchset[1]
which is aimed to be merged in the 4.14 merge window, AFAIK.

[1] https://www.spinics.net/lists/linux-btrfs/msg66891.html


Re: btrfs raid assurance

2017-07-26 Thread Hugo Mills
On Wed, Jul 26, 2017 at 08:36:54AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-07-26 08:27, Hugo Mills wrote:
> >On Wed, Jul 26, 2017 at 08:12:19AM -0400, Austin S. Hemmelgarn wrote:
> >>On 2017-07-25 17:45, Hugo Mills wrote:
> >>>On Tue, Jul 25, 2017 at 11:29:13PM +0200, waxhead wrote:
> 
> 
> Hugo Mills wrote:
> >
> >>>You can see about the disk usage in different scenarios with the
> >>>online tool at:
> >>>
> >>>http://carfax.org.uk/btrfs-usage/
> >>>
> >>>Hugo.
> >>>
> As a side note, have you ever considered making this online tool
> (that should never go away just for the record) part of btrfs-progs
> e.g. a proper tool? I use it quite often (at least several timers
> per. month) and I would love for this to be a visual tool
> 'btrfs-space-calculator' would be a great name for it I think.
> 
> Imagine how nice it would be to run
> 
> btrfs-space-calculator -mraid1 -draid10 /dev/sda1 /dev/sdb1
> /dev/sdc2 /dev/sdd2 /dev/sde3 for example and instantly get
> something similar to my example below (no accuracy intended)
> >>>
> >>>It's certainly a thought. I've already got the algorithm written
> >>>up. I'd have to resurrect my C skills, though, and it's a long way
> >>>down my list of things to do. :/
> >>>
> >>>Also on the subject of this tool, I'd like to make it so that the
> >>>parameters get set in the URL, so that people can copy-paste the URL
> >>>of the settings they've got into IRC for discussion. However, that
> >>>would involve doing more JavaScript, which is possibly even lower down
> >>>my list of things to do than starting doing C again...
> >
> >>Is the core logic posted somewhere?  Because if I have some time, I
> >>might write up a quick Python script to do this locally (it may not
> >>be as tightly integrated with the regular tools, but I can count on
> >>half a hand how many distros don't include Python by default).
> >
> >If it's going to be done in python, I might as well do it myself --
> >I can do python with my eyes closed. It's just C and JS I'm rusty with.
> Same here ironically :)
> >
> >There is a write-up of the usable-space algorithm somewhere. I
> >wrote it up in detail (with pseudocode) in a mail on this list. I've
> >also got several pages of LaTeX somewhere where I tried and failed to
> >prove the correctness of the formula. I'll see if I can dig them out
> >this evening.
> It looks like the Message-ID for the one on the mailing list is
> <20160311221703.gj17...@carfax.org.uk>
> I had forgotten that I'd archived that with the intent of actually
> doing something with it eventually...

   Here's the write-up of my attempted proof of the optimality of the
current allocator algorithm:

http://carfax.org.uk/files/temp/btrfs-allocator-draft.pdf

   Section 1 is a general (allocator-agnostic) description of the
process. Section 2 finds a bound on how well _any_ allocator can
do. That's the formula (eq 9) used in the online btrfs-usage
tool. Section 3 describes the current allocator. Section 4 is a failed
attempt at proving that the algorithm achieves the bound from section
2. I wasn't able to complete the proof.

   Hugo.

-- 
Hugo Mills | Great films about cricket: Interview with the Umpire
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: btrfs raid assurance

2017-07-26 Thread Austin S. Hemmelgarn

On 2017-07-26 08:27, Hugo Mills wrote:

On Wed, Jul 26, 2017 at 08:12:19AM -0400, Austin S. Hemmelgarn wrote:

On 2017-07-25 17:45, Hugo Mills wrote:

On Tue, Jul 25, 2017 at 11:29:13PM +0200, waxhead wrote:



Hugo Mills wrote:



You can see about the disk usage in different scenarios with the
online tool at:

http://carfax.org.uk/btrfs-usage/

Hugo.


As a side note, have you ever considered making this online tool
(that should never go away just for the record) part of btrfs-progs
e.g. a proper tool? I use it quite often (at least several timers
per. month) and I would love for this to be a visual tool
'btrfs-space-calculator' would be a great name for it I think.

Imagine how nice it would be to run

btrfs-space-calculator -mraid1 -draid10 /dev/sda1 /dev/sdb1
/dev/sdc2 /dev/sdd2 /dev/sde3 for example and instantly get
something similar to my example below (no accuracy intended)


It's certainly a thought. I've already got the algorithm written
up. I'd have to resurrect my C skills, though, and it's a long way
down my list of things to do. :/

Also on the subject of this tool, I'd like to make it so that the
parameters get set in the URL, so that people can copy-paste the URL
of the settings they've got into IRC for discussion. However, that
would involve doing more JavaScript, which is possibly even lower down
my list of things to do than starting doing C again...



Is the core logic posted somewhere?  Because if I have some time, I
might write up a quick Python script to do this locally (it may not
be as tightly integrated with the regular tools, but I can count on
half a hand how many distros don't include Python by default).


If it's going to be done in python, I might as well do it myself --
I can do python with my eyes closed. It's just C and JS I'm rusty with.

Same here ironically :)


There is a write-up of the usable-space algorithm somewhere. I
wrote it up in detail (with pseudocode) in a mail on this list. I've
also got several pages of LaTeX somewhere where I tried and failed to
prove the correctness of the formula. I'll see if I can dig them out
this evening.
It looks like the Message-ID for the one on the mailing list is 
<20160311221703.gj17...@carfax.org.uk>
I had forgotten that I'd archived that with the intent of actually doing 
something with it eventually...



Re: btrfs raid assurance

2017-07-26 Thread Hugo Mills
On Wed, Jul 26, 2017 at 12:27:20PM +, Hugo Mills wrote:
> On Wed, Jul 26, 2017 at 08:12:19AM -0400, Austin S. Hemmelgarn wrote:
> > On 2017-07-25 17:45, Hugo Mills wrote:
> > >On Tue, Jul 25, 2017 at 11:29:13PM +0200, waxhead wrote:
> > >>
> > >>
> > >>Hugo Mills wrote:
> > >>>
> > >You can see about the disk usage in different scenarios with the
> > >online tool at:
> > >
> > >http://carfax.org.uk/btrfs-usage/
> > >
> > >Hugo.
> > >
> > >>As a side note, have you ever considered making this online tool
> > >>(that should never go away just for the record) part of btrfs-progs
> > >>e.g. a proper tool? I use it quite often (at least several timers
> > >>per. month) and I would love for this to be a visual tool
> > >>'btrfs-space-calculator' would be a great name for it I think.
> > >>
> > >>Imagine how nice it would be to run
> > >>
> > >>btrfs-space-calculator -mraid1 -draid10 /dev/sda1 /dev/sdb1
> > >>/dev/sdc2 /dev/sdd2 /dev/sde3 for example and instantly get
> > >>something similar to my example below (no accuracy intended)
> > >
> > >It's certainly a thought. I've already got the algorithm written
> > >up. I'd have to resurrect my C skills, though, and it's a long way
> > >down my list of things to do. :/
> > >
> > >Also on the subject of this tool, I'd like to make it so that the
> > >parameters get set in the URL, so that people can copy-paste the URL
> > >of the settings they've got into IRC for discussion. However, that
> > >would involve doing more JavaScript, which is possibly even lower down
> > >my list of things to do than starting doing C again...
> 
> > Is the core logic posted somewhere?  Because if I have some time, I
> > might write up a quick Python script to do this locally (it may not
> > be as tightly integrated with the regular tools, but I can count on
> > half a hand how many distros don't include Python by default).
> 
>If it's going to be done in python, I might as well do it myself --
> I can do python with my eyes closed. It's just C and JS I'm rusty with.
> 
>There is a write-up of the usable-space algorithm somewhere. I
> wrote it up in detail (with pseudocode) in a mail on this list. I've
> also got several pages of LaTeX somewhere where I tried and failed to
> prove the correctness of the formula. I'll see if I can dig them out
> this evening.

   Oh, and of course there's the JS from the website... that's not
minified, and should be readable (if not particularly well-commented).

   Hugo.

-- 
Hugo Mills | How do you become King? You stand in the marketplace
hugo@... carfax.org.uk | and announce you're going to tax everyone. If you
http://carfax.org.uk/  | get out alive, you're King.
PGP: E2AB1DE4  |Harry Harrison




Re: btrfs raid assurance

2017-07-26 Thread Hugo Mills
On Wed, Jul 26, 2017 at 08:12:19AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-07-25 17:45, Hugo Mills wrote:
> >On Tue, Jul 25, 2017 at 11:29:13PM +0200, waxhead wrote:
> >>
> >>
> >>Hugo Mills wrote:
> >>>
> >You can see about the disk usage in different scenarios with the
> >online tool at:
> >
> >http://carfax.org.uk/btrfs-usage/
> >
> >Hugo.
> >
> >>As a side note, have you ever considered making this online tool
> >>(that should never go away just for the record) part of btrfs-progs
> >>e.g. a proper tool? I use it quite often (at least several timers
> >>per. month) and I would love for this to be a visual tool
> >>'btrfs-space-calculator' would be a great name for it I think.
> >>
> >>Imagine how nice it would be to run
> >>
> >>btrfs-space-calculator -mraid1 -draid10 /dev/sda1 /dev/sdb1
> >>/dev/sdc2 /dev/sdd2 /dev/sde3 for example and instantly get
> >>something similar to my example below (no accuracy intended)
> >
> >It's certainly a thought. I've already got the algorithm written
> >up. I'd have to resurrect my C skills, though, and it's a long way
> >down my list of things to do. :/
> >
> >Also on the subject of this tool, I'd like to make it so that the
> >parameters get set in the URL, so that people can copy-paste the URL
> >of the settings they've got into IRC for discussion. However, that
> >would involve doing more JavaScript, which is possibly even lower down
> >my list of things to do than starting doing C again...

> Is the core logic posted somewhere?  Because if I have some time, I
> might write up a quick Python script to do this locally (it may not
> be as tightly integrated with the regular tools, but I can count on
> half a hand how many distros don't include Python by default).

   If it's going to be done in python, I might as well do it myself --
I can do python with my eyes closed. It's just C and JS I'm rusty with.

   There is a write-up of the usable-space algorithm somewhere. I
wrote it up in detail (with pseudocode) in a mail on this list. I've
also got several pages of LaTeX somewhere where I tried and failed to
prove the correctness of the formula. I'll see if I can dig them out
this evening.

   Hugo.

-- 
Hugo Mills | How do you become King? You stand in the marketplace
hugo@... carfax.org.uk | and announce you're going to tax everyone. If you
http://carfax.org.uk/  | get out alive, you're King.
PGP: E2AB1DE4  |Harry Harrison




Re: btrfs raid assurance

2017-07-26 Thread Austin S. Hemmelgarn

On 2017-07-25 17:45, Hugo Mills wrote:

On Tue, Jul 25, 2017 at 11:29:13PM +0200, waxhead wrote:



Hugo Mills wrote:



You can see about the disk usage in different scenarios with the
online tool at:

http://carfax.org.uk/btrfs-usage/

Hugo.


As a side note, have you ever considered making this online tool
(that should never go away just for the record) part of btrfs-progs
e.g. a proper tool? I use it quite often (at least several timers
per. month) and I would love for this to be a visual tool
'btrfs-space-calculator' would be a great name for it I think.

Imagine how nice it would be to run

btrfs-space-calculator -mraid1 -draid10 /dev/sda1 /dev/sdb1
/dev/sdc2 /dev/sdd2 /dev/sde3 for example and instantly get
something similar to my example below (no accuracy intended)


It's certainly a thought. I've already got the algorithm written
up. I'd have to resurrect my C skills, though, and it's a long way
down my list of things to do. :/

Also on the subject of this tool, I'd like to make it so that the
parameters get set in the URL, so that people can copy-paste the URL
of the settings they've got into IRC for discussion. However, that
would involve doing more JavaScript, which is possibly even lower down
my list of things to do than starting doing C again...
Is the core logic posted somewhere?  Because if I have some time, I 
might write up a quick Python script to do this locally (it may not be 
as tightly integrated with the regular tools, but I can count on half a 
hand how many distros don't include Python by default).


Hugo.


d=data
m=metadata
.=unusable

{  500mb} [|d|] /dev/sda1
{ 3000mb} [|d|m|m|m|m|mm...|] /dev/sdb1
{ 3000mb} [|d|m|m|m|m|mmm..|] /dev/sdc2
{ 5000mb}
[|d|m|m|m|m|m|m|m|m|m|]
/dev/sdb1

{11500mb} Total space

usable for data (raid10): 1000mb / 2000mb
usable for metadata (raid1): 4500mb / 9000mb
unusable: 500mb

Of course this would have to change one (if ever) subvolumes can
have different raid levels etc, but I would have loved using
something like this instead of jumping around carfax abbey (!) at
night.


The core algorithm for the tool actually works pretty well for
dealing with different RAID levels, as long as you know how much of
each kind of data you're going to be using. (Although it's actually
path-dependent -- write 100 GB of RAID-0 then 100 GB of RAID-1 can
have different results than if you write them in the opposite order --
but that's a kind of edge effect).

Hugo.





Re: btrfs raid assurance

2017-07-25 Thread Hugo Mills
On Tue, Jul 25, 2017 at 11:29:13PM +0200, waxhead wrote:
> 
> 
> Hugo Mills wrote:
> >
> >>>You can see about the disk usage in different scenarios with the
> >>>online tool at:
> >>>
> >>>http://carfax.org.uk/btrfs-usage/
> >>>
> >>>Hugo.
> >>>
> As a side note, have you ever considered making this online tool
> (that should never go away just for the record) part of btrfs-progs
> e.g. a proper tool? I use it quite often (at least several timers
> per. month) and I would love for this to be a visual tool
> 'btrfs-space-calculator' would be a great name for it I think.
> 
> Imagine how nice it would be to run
> 
> btrfs-space-calculator -mraid1 -draid10 /dev/sda1 /dev/sdb1
> /dev/sdc2 /dev/sdd2 /dev/sde3 for example and instantly get
> something similar to my example below (no accuracy intended)

   It's certainly a thought. I've already got the algorithm written
up. I'd have to resurrect my C skills, though, and it's a long way
down my list of things to do. :/

   Also on the subject of this tool, I'd like to make it so that the
parameters get set in the URL, so that people can copy-paste the URL
of the settings they've got into IRC for discussion. However, that
would involve doing more JavaScript, which is possibly even lower down
my list of things to do than starting doing C again...

   Hugo.

> d=data
> m=metadata
> .=unusable
> 
> {  500mb} [|d|] /dev/sda1
> { 3000mb} [|d|m|m|m|m|mm...|] /dev/sdb1
> { 3000mb} [|d|m|m|m|m|mmm..|] /dev/sdc2
> { 5000mb}
> [|d|m|m|m|m|m|m|m|m|m|]
> /dev/sdb1
> 
> {11500mb} Total space
> 
> usable for data (raid10): 1000mb / 2000mb
> usable for metadata (raid1): 4500mb / 9000mb
> unusable: 500mb
> 
> Of course this would have to change one (if ever) subvolumes can
> have different raid levels etc, but I would have loved using
> something like this instead of jumping around carfax abbey (!) at
> night.

   The core algorithm for the tool actually works pretty well for
dealing with different RAID levels, as long as you know how much of
each kind of data you're going to be using. (Although it's actually
path-dependent -- write 100 GB of RAID-0 then 100 GB of RAID-1 can
have different results than if you write them in the opposite order --
but that's a kind of edge effect).

   Hugo.

-- 
Hugo Mills | Great oxymorons of the world, no. 4:
hugo@... carfax.org.uk | Future Perfect
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: btrfs raid assurance

2017-07-25 Thread waxhead



Hugo Mills wrote:



You can see about the disk usage in different scenarios with the
online tool at:

http://carfax.org.uk/btrfs-usage/

Hugo.

As a side note, have you ever considered making this online tool (which
should never go away, just for the record) part of btrfs-progs, e.g. a
proper tool? I use it quite often (at least several times per month)
and I would love for this to be a visual tool; 'btrfs-space-calculator'
would be a great name for it, I think.


Imagine how nice it would be to run

btrfs-space-calculator -mraid1 -draid10 /dev/sda1 /dev/sdb1 /dev/sdc2 
/dev/sdd2 /dev/sde3 for example and instantly get something similar to 
my example below (no accuracy intended)


d=data
m=metadata
.=unusable

{  500mb} [|d|] /dev/sda1
{ 3000mb} [|d|m|m|m|m|mm...|] /dev/sdb1
{ 3000mb} [|d|m|m|m|m|mmm..|] /dev/sdc2
{ 5000mb} 
[|d|m|m|m|m|m|m|m|m|m|] /dev/sdb1


{11500mb} Total space

usable for data (raid10): 1000mb / 2000mb
usable for metadata (raid1): 4500mb / 9000mb
unusable: 500mb

Of course this would have to change once (if ever) subvolumes can have
different raid levels etc, but I would have loved using something like
this instead of jumping around carfax abbey (!) at night.



Re: btrfs raid assurance

2017-07-25 Thread Hugo Mills
On Tue, Jul 25, 2017 at 10:55:18AM -0300, Hérikz Nawarro wrote:
> And btw, my current disk conf is a 1x 500GB, 2x3TB and a 5TB.

   OK, so by my mental arithmetic(*), you'd get:

 -  9.5  TB usable in RAID-0
 - 11.5  TB usable in single mode
 -  5.75 TB usable in RAID-1

   Hugo.

(*) Which may be a bit wobbly. :)
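
Roughly how those come out (a quick sketch of the usual rules of thumb, not
the exact formula behind the online tool): single uses everything, RAID-0
can't stripe whatever the largest disk has beyond the second-largest, and
two-copy RAID-1 gives about half the total as long as the largest disk is
smaller than the rest combined:

echo "0.5 3 3 5" | awk '{
    total = 0; max = 0; second = 0
    for (i = 1; i <= NF; i++) {
        total += $i
        if ($i > max) { second = max; max = $i }
        else if ($i > second) { second = $i }
    }
    raid0 = total - (max - second)            # space beyond the 2nd-largest disk is wasted
    rest  = total - max
    raid1 = (max > rest) ? rest : total / 2   # two copies of everything
    printf "single %.2f  raid0 %.2f  raid1 %.2f (TB)\n", total, raid0, raid1
}'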

> 2017-07-25 10:51 GMT-03:00 Hugo Mills :
> > On Tue, Jul 25, 2017 at 01:46:56PM +, Hugo Mills wrote:
> >> On Tue, Jul 25, 2017 at 09:55:37AM -0300, Hérikz Nawarro wrote:
> >> > Hello everyone,
> >> >
> >> > I'm migrating to btrfs and i would like to know, in a btrfs filesystem
> >> > with 4 disks (multiple sizes) with -d raid0 & -m raid1, how many
> >> > drives can i lost without losing the entire array?
> >
> >Oh, and one other thing -- if you have different-sized devices,
> > RAID-0 is probably the wrong thing to be using anyway, as you won't be
> > able to use the difference between the largest and second-largest
> > device. If you want to use all the space on the available devices,
> > then "single" mode is probably better (although you still lose a lot
> > of data if a device breaks), or RAID-1 (which will cope well with the
> > different sizes as long as the largest device is smaller than the rest
> > of them added together).
> >
> >You can see about the disk usage in different scenarios with the
> > online tool at:
> >
> > http://carfax.org.uk/btrfs-usage/
> >
> >Hugo.
> >

-- 
Hugo Mills | One of these days, I'll catch that man without a
hugo@... carfax.org.uk | quotation, and he'll look undressed.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Leto Atreides, Dune




Re: btrfs raid assurance

2017-07-25 Thread Hérikz Nawarro
And btw, my current disk conf is a 1x 500GB, 2x3TB and a 5TB.

2017-07-25 10:51 GMT-03:00 Hugo Mills :
> On Tue, Jul 25, 2017 at 01:46:56PM +, Hugo Mills wrote:
>> On Tue, Jul 25, 2017 at 09:55:37AM -0300, Hérikz Nawarro wrote:
>> > Hello everyone,
>> >
>> > I'm migrating to btrfs and i would like to know, in a btrfs filesystem
>> > with 4 disks (multiple sizes) with -d raid0 & -m raid1, how many
>> > drives can i lost without losing the entire array?
>
>Oh, and one other thing -- if you have different-sized devices,
> RAID-0 is probably the wrong thing to be using anyway, as you won't be
> able to use the difference between the largest and second-largest
> device. If you want to use all the space on the available devices,
> then "single" mode is probably better (although you still lose a lot
> of data if a device breaks), or RAID-1 (which will cope well with the
> different sizes as long as the largest device is smaller than the rest
> of them added together).
>
>You can see about the disk usage in different scenarios with the
> online tool at:
>
> http://carfax.org.uk/btrfs-usage/
>
>Hugo.
>
> --
> Hugo Mills | One of these days, I'll catch that man without a
> hugo@... carfax.org.uk | quotation, and he'll look undressed.
> http://carfax.org.uk/  |
> PGP: E2AB1DE4  |   Leto Atreides, Dune


Re: btrfs raid assurance

2017-07-25 Thread Hugo Mills
On Tue, Jul 25, 2017 at 01:46:56PM +, Hugo Mills wrote:
> On Tue, Jul 25, 2017 at 09:55:37AM -0300, Hérikz Nawarro wrote:
> > Hello everyone,
> > 
> > I'm migrating to btrfs and i would like to know, in a btrfs filesystem
> > with 4 disks (multiple sizes) with -d raid0 & -m raid1, how many
> > drives can i lost without losing the entire array?

   Oh, and one other thing -- if you have different-sized devices,
RAID-0 is probably the wrong thing to be using anyway, as you won't be
able to use the difference between the largest and second-largest
device. If you want to use all the space on the available devices,
then "single" mode is probably better (although you still lose a lot
of data if a device breaks), or RAID-1 (which will cope well with the
different sizes as long as the largest device is smaller than the rest
of them added together).

   You can see about the disk usage in different scenarios with the
online tool at:

http://carfax.org.uk/btrfs-usage/

   Hugo.

-- 
Hugo Mills | One of these days, I'll catch that man without a
hugo@... carfax.org.uk | quotation, and he'll look undressed.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Leto Atreides, Dune




Re: btrfs raid assurance

2017-07-25 Thread Hérikz Nawarro
Thanks everyone, I'll stick with raid 1.


Re: btrfs raid assurance

2017-07-25 Thread Hugo Mills
On Tue, Jul 25, 2017 at 09:55:37AM -0300, Hérikz Nawarro wrote:
> Hello everyone,
> 
> I'm migrating to btrfs and i would like to know, in a btrfs filesystem
> with 4 disks (multiple sizes) with -d raid0 & -m raid1, how many
> drives can i lost without losing the entire array?

   You can lose one device in the array, and the FS structure will be
OK -- it will still mount, and you'll be able to see all the filenames
and directory structures and so on.

   However, if you do lose one device, then you'll lose
(approximately) half of the bytes in all of your files, most likely in
alternating 64k slices in each file. Attempting to read the missing
parts will result in I/O errors being returned from the filesystem.

   So, while the FS is in theory still fine as a (probably read-only)
filesystem, it's actually going to be *completely* useless with a
missing device, because none of your file data will be usably intact.

   If you want the FS to behave well when you lose a device, you'll
need some kind of actual redundancy in the data storage part -- RAID-1
would be my recommendation (it stores two copies of each piece of
data, so you can lose up to one device and still be OK).
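
For what it's worth, that's just the normal profile options (the devices and
mount point here are only examples): a new filesystem would be

# mkfs.btrfs -m raid1 -d raid1 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

and an existing one can be converted in place with

# btrfs balance start -mconvert=raid1 -dconvert=raid1 /mnt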

   Hugo.

-- 
Hugo Mills | One of these days, I'll catch that man without a
hugo@... carfax.org.uk | quotation, and he'll look undressed.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Leto Atreides, Dune




Re: btrfs raid assurance

2017-07-25 Thread Austin S. Hemmelgarn

On 2017-07-25 08:55, Hérikz Nawarro wrote:

Hello everyone,

I'm migrating to btrfs and i would like to know, in a btrfs filesystem
with 4 disks (multiple sizes) with -d raid0 & -m raid1, how many
drives can i lost without losing the entire array?
Exactly one, but you will lose data if you lose one device.  Most of the 
BTRFS profiles are poorly named (single and dup being the exceptions), 
and do not behave consistently with the RAID levels they are named 
after.  BTRFS raid1 mode is functionally equivalent to MD or LVM RAID10 
mode, just with larger blocks.  It only gives you two copies of the 
data, so losing one device will make the array degraded, and two will 
effectively nuke it.
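
If you do end up one device short in raid1, the usual path is to mount
degraded and put a new device in; roughly (device names, devid and mount
point are just placeholders):

# mount -o degraded /dev/sdb1 /mnt
# btrfs replace start 1 /dev/sde1 /mnt    # 1 = devid of the missing device
(or "btrfs device add" a new disk first and then "btrfs device remove missing /mnt")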


If you care about data safety, I would advise against using raid0 for 
data.  If you lose _one_ device in raid0 mode, you will usually lose 
part of most of the files on the volume.  Single mode for data will 
still distribute things evenly and will not have that issue (unless you 
have files larger than 1GB, a file will either be all there or all gone, 
as opposed to having read errors part way through), and isn't much worse 
in terms of performance (BTRFS does not parallelize device access as 
well as it should).
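
Converting existing data chunks to single is just a balance with a convert
filter, e.g. (the mount point is only an example):

# btrfs balance start -dconvert=single /mnt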


If you care about both performance and data safety, and can tolerate 
having only half the usable space, I would actually suggest running 
BTRFS with both data and metadata in raid1 mode on top of two LVM or MD 
RAID0 volumes.  This should outperform the configuration you listed by a 
significant amount, will provide better data safety, and should also do 
a better job of distributing the load across devices.
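
A rough sketch of that layout, with placeholder device names:

# mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda1 /dev/sdb1
# mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdc1 /dev/sdd1
# mkfs.btrfs -m raid1 -d raid1 /dev/md0 /dev/md1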



Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-06 Thread Chris Murphy
On Wed, Jul 6, 2016 at 8:24 PM, Tomasz Kusmierz  wrote:

> you are throwing a lot of useful data, maybe diverting some of it into wiki ? 
> you know, us normal people might find it useful for making educated choice in 
> some future ? :)

There is a wiki, and it's difficult to keep up to date as it is.
There are just too many changes happening in Btrfs, and really only
the devs have a bird's eye view of what's going on and what will happen
sooner rather than later.



-- 
Chris Murphy


Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-06 Thread Tomasz Kusmierz

> On 7 Jul 2016, at 02:46, Chris Murphy  wrote:
> 

Chaps, I didn't want this to spring up as an argument about btrfs performance,

BUT 

you are throwing out a lot of useful data, maybe divert some of it into the wiki?
You know, us normal people might find it useful for making an educated choice
at some point in the future? :)

Interestingly on my RAID10 with 6 disks I only get:

dd if=/mnt/share/asdf of=/dev/zero bs=100M
113+1 records in
113+1 records out
11874643004 bytes (12 GB, 11 GiB) copied, 45.3123 s, 262 MB/s


filefrag -v
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..2471: 2101940598..2101943069:   2472:
   1: 2472..   12583: 1938312686..1938322797:  10112: 2101943070:
   2:12584..   12837: 1937534654..1937534907:254: 1938322798:
   3:12838..   12839: 1937534908..1937534909:  2:
   4:12840..   34109: 1902954063..1902975332:  21270: 1937534910:
   5:34110..   53671: 1900857931..1900877492:  19562: 1902975333:
   6:53672..   54055: 1900877493..1900877876:384:
   7:54056..   54063: 1900877877..1900877884:  8:
   8:54064..   98041: 1900877885..1900921862:  43978:
   9:98042..  117671: 1900921863..1900941492:  19630:
  10:   117672..  118055: 1900941493..1900941876:384:
  11:   118056..  161833: 1900941877..1900985654:  43778:
  12:   161834..  204013: 1900985655..1901027834:  42180:
  13:   204014..  214269: 1901027835..1901038090:  10256:
  14:   214270..  214401: 1901038091..1901038222:132:
  15:   214402..  214407: 1901038223..1901038228:  6:
  16:   214408..  258089: 1901038229..1901081910:  43682:
  17:   258090..  300139: 1901081911..1901123960:  42050:
  18:   300140..  310559: 1901123961..1901134380:  10420:
  19:   310560..  310695: 1901134381..1901134516:136:
  20:   310696..  354251: 1901134517..1901178072:  43556:
  21:   354252..  396389: 1901178073..1901220210:  42138:
  22:   396390..  406353: 1901220211..1901230174:   9964:
  23:   406354..  406515: 1901230175..1901230336:162:
  24:   406516..  406519: 1901230337..1901230340:  4:
  25:   406520..  450115: 1901230341..1901273936:  43596:
  26:   450116..  492161: 1901273937..1901315982:  42046:
  27:   492162..  524199: 1901315983..1901348020:  32038:
  28:   524200..  535355: 1901348021..1901359176:  11156:
  29:   535356..  535591: 1901359177..1901359412:236:
  30:   535592.. 1315369: 1899830240..1900610017: 779778: 1901359413:
  31:  1315370.. 1357435: 1901359413..1901401478:  42066: 1900610018:
  32:  1357436.. 1368091: 1928101070..1928111725:  10656: 1901401479:
  33:  1368092.. 1368231: 1928111726..1928111865:140:
  34:  1368232.. 2113959: 1899043808..1899789535: 745728: 1928111866:
  35:  2113960.. 2899082: 1898257376..1899042498: 785123: 1899789536: last,elf


If it were possible to read from 6 disks at once, maybe this performance
would be better for linear reads.

Anyway, this is a huge diversion from the original question, so maybe we
should end here?




Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-06 Thread Chris Murphy
On Wed, Jul 6, 2016 at 5:22 PM, Kai Krakow  wrote:

> The current implementation of RAID0 in btrfs is probably not very
> optimized. RAID0 is a special case anyways: Stripes have a defined
> width - I'm not sure what it is for btrfs, probably it's per chunk, so
> it's 1GB, maybe it's 64k **.

Stripe element (a.k.a. strip, a.k.a. md chunk) size in Btrfs is fixed at 64KiB.
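
As a rough illustration of what that means for a plain two-device raid0
chunk (assuming a simple round-robin layout, not the actual mapping code),
a read at some offset inside the chunk lands on:

strip_kib=64; ndev=2; offset_kib=200          # example numbers only
echo "strip $(( offset_kib / strip_kib )), device $(( (offset_kib / strip_kib) % ndev ))"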

>That means your data is usually not read
> from multiple disks in parallel anyways as long as requests are below
> stripe width (which is probably true for most access patterns except
> copying files) - there's no immediate performance benefit.

Most any write pattern benefits from raid0 due to less disk
contention, even if the typical file size is smaller than stripe size.
Parallelization is improved even if it's suboptimal. This is really no
different than md raid striping with a 64KiB chunk size.

On Btrfs, it might be that some workloads benefit from metadata
raid10, and others don't. I also think it's hard to estimate without
benchmarking an actual workload with metadata as raid1 vs raid10.



> So I guess, at this stage there's no big difference between RAID1 and
> RAID10 in btrfs (except maybe for large file copies), not for single
> process access patterns and neither for multi process access patterns.
> Btrfs can only benefit from RAID1 in multi process access patterns
> currently, as can btrfs RAID0 by design for usual small random access
> patterns (and maybe large sequential operations). But RAID1 with more
> than two disks and multi process access patterns is more or less equal
> to RAID10 because stripes are likely to be on different devices anyways.

I think that too would need to be benchmarked and I think it'd need to
be aged as well to see the effect of both file and block group free
space fragmentation. The devil will be in really minute details, all
you have to do is read a few weeks of XFS list stuff with people
talking about optimization or bad performance and almost always it's
not the fault of the file system. And when it is, it depends on the
kernel version as XFS has had substantial changes even over its long
career, including (somewhat) recent changes for metadata heavy
workloads.

> In conclusion: RAID1 is simpler than RAID10 and thus its less likely to
> contain flaws or bugs.

I don't know about that. I think it's about the same. All multiple
device support, except raid56, was introduced at the same time
practically from day 2. Btrfs raid1 and raid10 tolerate only exactly 1
device loss, *maybe* two if you're very lucky, so neither of them is
really scalable.


-- 
Chris Murphy


Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-06 Thread Kai Krakow
On Thu, 7 Jul 2016 00:51:16 +0100, Tomasz Kusmierz wrote:

> > On 7 Jul 2016, at 00:22, Kai Krakow  wrote:
> > 
> > Am Wed, 6 Jul 2016 13:20:15 +0100
> > schrieb Tomasz Kusmierz :
> >   
> >> When I think of it, I did move this folder first when filesystem
> >> was RAID 1 (or not even RAID at all) and then it was upgraded to
> >> RAID 1 then RAID 10. Was there a faulty balance around August
> >> 2014 ? Please remember that I’m using Ubuntu so it was probably
> >> kernel from Ubuntu 14.04 LTS
> >> 
> >> Also, I would like to hear it from horses mouth: dos & donts for a
> >> long term storage where you moderately care about the data: RAID10
> >> - flaky ? would RAID1 give similar performance ?  
> > 
> > The current implementation of RAID0 in btrfs is probably not very
> > optimized. RAID0 is a special case anyways: Stripes have a defined
> > width - I'm not sure what it is for btrfs, probably it's per chunk,
> > so it's 1GB, maybe it's 64k **. That means your data is usually not
> > read from multiple disks in parallel anyways as long as requests
> > are below stripe width (which is probably true for most access
> > patterns except copying files) - there's no immediate performance
> > benefit. This holds true for any RAID0 with read and write patterns
> > below the stripe size. Data is just more evenly distributed across
> > devices and your application will only benefit performance-wise if
> > accesses spread semi-random across the span of the whole file. And
> > at least last time I checked, it was stated that btrfs raid0 does
> > not submit IOs in parallel yet but first reads one stripe, then the
> > next - so it doesn't submit IOs to different devices in parallel.
> > 
> > Getting to RAID1, btrfs is even less optimized: Stripe decision is
> > based on process pids instead of device load, read accesses won't
> > distribute evenly to different stripes per single process, it's
> > only just reading from the same single device - always. Write
> > access isn't faster anyways: Both stripes need to be written -
> > writing RAID1 is single device performance only.
> > 
> > So I guess, at this stage there's no big difference between RAID1
> > and RAID10 in btrfs (except maybe for large file copies), not for
> > single process access patterns and neither for multi process access
> > patterns. Btrfs can only benefit from RAID1 in multi process access
> > patterns currently, as can btrfs RAID0 by design for usual small
> > random access patterns (and maybe large sequential operations). But
> > RAID1 with more than two disks and multi process access patterns is
> > more or less equal to RAID10 because stripes are likely to be on
> > different devices anyways.
> > 
> > In conclusion: RAID1 is simpler than RAID10 and thus its less
> > likely to contain flaws or bugs.
> > 
> > **: Please enlighten me, I couldn't find docs on this matter.  
> 
> :O 
> 
> It’s an eye opener - I think that this should end up on btrfs WIKI …
> seriously !
> 
> Anyway my use case for this is “storage” therefore I predominantly
> copy large files. 

Then RAID10 may be your best option - for local operations. When copying
large files, even a single modern SATA spindle can saturate a gigabit
link. So, if your use case is NAS, and you don't use server-side copies
(like modern versions of NFS and Samba support), you won't benefit from
RAID10 vs RAID1 - so just use the simpler implementation.

My personal recommendation: Add a small, high-quality SSD to your array
and configure btrfs on top of bcache, configured for write-around caching
to get the best lifetime and data safety. This should mostly cache
metadata access in your use case and improve performance much more than
RAID10 over RAID1 would. I can recommend the Crucial MX series from
personal experience; choose 250GB or higher, as the 120GB versions of the
Crucial MX have much lower durability for caching purposes. Adding
bcache to an existing btrfs array is a little painful but easily doable
if you have enough free space to temporarily sacrifice one disk.
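
For reference, the basic steps on empty devices look roughly like this
(device names and the UUID are placeholders; make-bcache wipes whatever is
on the partition, hence the shuffling with a spare disk):

# make-bcache -C /dev/sdX1                          # SSD becomes the cache device
# make-bcache -B /dev/sdY1                          # empty spindle becomes a backing device
# echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach
# echo writearound > /sys/block/bcache0/bcache/cache_mode
# mkfs.btrfs -m raid1 -d raid1 /dev/bcache0 /dev/bcache1   # then build btrfs on the bcache devices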

BTW: I'm using 3x 1TB btrfs mraid1/draid0 with a single 500GB bcache
SSD in write-back mode and local operation (it's my desktop machine).
The performance is great, bcache decouples some of the performance
downsides the current btrfs raid implementation has. I do daily
backups, so write-back caching is not a real problem (in case it
fails), and btrfs draid0 is also not a problem (mraid1 ensures meta
data integrity, so only file contents are at risk, and covered by
backups). With this setup I can easily saturate my 6Gb onboard SATA
controller, the system boots to usable desktop in 30 seconds from cold
start (including EFI firmware), including autologin to full-blown
KDE, autostart of Chrome and Steam, 2 virtual machine containers
(nspawn-based, one MySQL instance, one ElasticSearch instance), plus
local MySQL and ElasticSearch service (used for development and staging
purposes), and a 

Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-06 Thread Tomasz Kusmierz

> On 7 Jul 2016, at 00:22, Kai Krakow  wrote:
> 
> Am Wed, 6 Jul 2016 13:20:15 +0100
> schrieb Tomasz Kusmierz :
> 
>> When I think of it, I did move this folder first when filesystem was
>> RAID 1 (or not even RAID at all) and then it was upgraded to RAID 1
>> then RAID 10. Was there a faulty balance around August 2014 ? Please
>> remember that I’m using Ubuntu so it was probably kernel from Ubuntu
>> 14.04 LTS
>> 
>> Also, I would like to hear it from horses mouth: dos & donts for a
>> long term storage where you moderately care about the data: RAID10 -
>> flaky ? would RAID1 give similar performance ?
> 
> The current implementation of RAID0 in btrfs is probably not very
> optimized. RAID0 is a special case anyways: Stripes have a defined
> width - I'm not sure what it is for btrfs, probably it's per chunk, so
> it's 1GB, maybe it's 64k **. That means your data is usually not read
> from multiple disks in parallel anyways as long as requests are below
> stripe width (which is probably true for most access patterns except
> copying files) - there's no immediate performance benefit. This holds
> true for any RAID0 with read and write patterns below the stripe size.
> Data is just more evenly distributed across devices and your
> application will only benefit performance-wise if accesses spread
> semi-random across the span of the whole file. And at least last time I
> checked, it was stated that btrfs raid0 does not submit IOs in parallel
> yet but first reads one stripe, then the next - so it doesn't submit
> IOs to different devices in parallel.
> 
> Getting to RAID1, btrfs is even less optimized: Stripe decision is based
> on process pids instead of device load, read accesses won't distribute
> evenly to different stripes per single process, it's only just reading
> from the same single device - always. Write access isn't faster anyways:
> Both stripes need to be written - writing RAID1 is single device
> performance only.
> 
> So I guess, at this stage there's no big difference between RAID1 and
> RAID10 in btrfs (except maybe for large file copies), not for single
> process access patterns and neither for multi process access patterns.
> Btrfs can only benefit from RAID1 in multi process access patterns
> currently, as can btrfs RAID0 by design for usual small random access
> patterns (and maybe large sequential operations). But RAID1 with more
> than two disks and multi process access patterns is more or less equal
> to RAID10 because stripes are likely to be on different devices anyways.
> 
> In conclusion: RAID1 is simpler than RAID10 and thus its less likely to
> contain flaws or bugs.
> 
> **: Please enlighten me, I couldn't find docs on this matter.

:O 

It's an eye opener - I think this should end up on the btrfs wiki... seriously!

Anyway, my use case for this is "storage", therefore I predominantly copy large
files.


> -- 
> Regards,
> Kai
> 
> Replies to list-only preferred.
> 
> 



Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-06 Thread Kai Krakow
On Wed, 6 Jul 2016 13:20:15 +0100, Tomasz Kusmierz wrote:

> When I think of it, I did move this folder first when filesystem was
> RAID 1 (or not even RAID at all) and then it was upgraded to RAID 1
> then RAID 10. Was there a faulty balance around August 2014 ? Please
> remember that I’m using Ubuntu so it was probably kernel from Ubuntu
> 14.04 LTS
> 
> Also, I would like to hear it from horses mouth: dos & donts for a
> long term storage where you moderately care about the data: RAID10 -
> flaky ? would RAID1 give similar performance ?

The current implementation of RAID0 in btrfs is probably not very
optimized. RAID0 is a special case anyways: Stripes have a defined
width - I'm not sure what it is for btrfs, probably it's per chunk, so
it's 1GB, maybe it's 64k **. That means your data is usually not read
from multiple disks in parallel anyways as long as requests are below
stripe width (which is probably true for most access patterns except
copying files) - there's no immediate performance benefit. This holds
true for any RAID0 with read and write patterns below the stripe size.
Data is just more evenly distributed across devices and your
application will only benefit performance-wise if accesses spread
semi-random across the span of the whole file. And at least last time I
checked, it was stated that btrfs raid0 does not submit IOs in parallel
yet but first reads one stripe, then the next - so it doesn't submit
IOs to different devices in parallel.

Getting to RAID1, btrfs is even less optimized: Stripe decision is based
on process pids instead of device load, read accesses won't distribute
evenly to different stripes per single process, it's only just reading
from the same single device - always. Write access isn't faster anyways:
Both stripes need to be written - writing RAID1 is single device
performance only.
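
As a rough illustration only - this is not the btrfs source, just a tiny
user-space sketch of the even/odd PID policy described above - the mirror
choice boils down to something like:

/* pick_mirror.c - toy model of PID-parity read scheduling in RAID1.
 * Assumption: exactly two copies, and the reading process's PID parity
 * selects the copy, regardless of how busy either device is.
 * Build: cc -o pick_mirror pick_mirror.c */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

static int pick_mirror(pid_t pid, int num_copies)
{
        /* PID parity picks copy 0 or copy 1; device load is never consulted. */
        return (int)(pid % num_copies);
}

int main(void)
{
        pid_t pid = getpid();
        printf("pid %ld would read from mirror %d of 2\n",
               (long)pid, pick_mirror(pid, 2));
        return 0;
}

Run it a few times and the answer only flips with the PID, never with
which disk happens to be idle - which is exactly why one single-threaded
reader keeps hammering the same device.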

So I guess, at this stage there's no big difference between RAID1 and
RAID10 in btrfs (except maybe for large file copies), not for single
process access patterns and neither for multi process access patterns.
Btrfs can only benefit from RAID1 in multi process access patterns
currently, as can btrfs RAID0 by design for usual small random access
patterns (and maybe large sequential operations). But RAID1 with more
than two disks and multi process access patterns is more or less equal
to RAID10 because stripes are likely to be on different devices anyways.

In conclusion: RAID1 is simpler than RAID10 and thus it's less likely to
contain flaws or bugs.

**: Please enlighten me, I couldn't find docs on this matter.

-- 
Regards,
Kai

Replies to list-only preferred.




Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-06 Thread Tomasz Kusmierz

> On 6 Jul 2016, at 22:41, Henk Slager  wrote:
> 
> On Wed, Jul 6, 2016 at 2:20 PM, Tomasz Kusmierz  
> wrote:
>> 
>>> On 6 Jul 2016, at 02:25, Henk Slager  wrote:
>>> 
>>> On Wed, Jul 6, 2016 at 2:32 AM, Tomasz Kusmierz  
>>> wrote:
 
 On 6 Jul 2016, at 00:30, Henk Slager  wrote:
 
 On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz 
 wrote:
 
 I did consider that, but:
 - some files were NOT accessed by anything with 100% certainty (well if
 there is a rootkit on my system or something in that shape than maybe yes)
 - the only application that could access those files is totem (well
 Nautilius checks extension -> directs it to totem) so in that case we would
 hear about out break of totem killing people files.
 - if it was a kernel bug then other large files would be affected.
 
 Maybe I’m wrong and it’s actually related to the fact that all those files
 are located in single location on file system (single folder) that might
 have a historical bug in some structure somewhere ?
 
 
 I find it hard to imagine that this has something to do with the
 folderstructure, unless maybe the folder is a subvolume with
 non-default attributes or so. How the files in that folder are created
 (at full disktransferspeed or during a day or even a week) might give
 some hint. You could run filefrag and see if that rings a bell.
 
 files that are 4096 show:
 1 extent found
>>> 
>>> I actually meant filefrag for the files that are not (yet) truncated
>>> to 4k. For example for virtual machine imagefiles (CoW), one could see
>>> an MBR write.
>> 117 extents found
>> filesize 15468645003
>> 
>> good / bad ?
> 
> 117 extents for a 1.5G file is fine, with -v option you could see the
> fragmentation at the start, but this won't lead to any hint why you
> have the truncate issue.
> 
 I did forgot to add that file system was created a long time ago and it was
 created with leaf & node size = 16k.
 
 
 If this long time ago is >2 years then you have likely specifically
 set node size = 16k, otherwise with older tools it would have been 4K.
 
 You are right I used -l 16K -n 16K
 
 Have you created it as raid10 or has it undergone profile conversions?
 
 Due to lack of spare disks
 (it may sound odd for some but spending for more than 6 disks for home use
 seems like an overkill)
 and due to last I’ve had I had to migrate all data to new file system.
 This played that way that I’ve:
 1. from original FS I’ve removed 2 disks
 2. Created RAID1 on those 2 disks,
 3. shifted 2TB
 4. removed 2 disks from source FS and adde those to destination FS
 5 shifted 2 further TB
 6 destroyed original FS and adde 2 disks to destination FS
 7 converted destination FS to RAID10
 
 FYI, when I convert to raid 10 I use:
 btrfs balance start -mconvert=raid10 -dconvert=raid10 -sconvert=raid10 -f
 /path/to/FS
 
 this filesystem has 5 sub volumes. Files affected are located in separate
 folder within a “victim folder” that is within a one sub volume.
 
 
 It could also be that the ondisk format is somewhat corrupted (btrfs
 check should find that ) and that that causes the issue.
 
 
 root@noname_server:/mnt# btrfs check /dev/sdg1
 Checking filesystem on /dev/sdg1
 UUID: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
 checking extents
 checking free space cache
 checking fs roots
 checking csums
 checking root refs
 found 4424060642634 bytes used err is 0
 total csum bytes: 4315954936
 total tree bytes: 4522786816
 total fs tree bytes: 61702144
 total extent tree bytes: 41402368
 btree space waste bytes: 72430813
 file data blocks allocated: 4475917217792
 referenced 4420407603200
 
 No luck there :/
>>> 
>>> Indeed looks all normal.
>>> 
 In-lining on raid10 has caused me some trouble (I had 4k nodes) over
 time, it has happened over a year ago with kernels recent at that
 time, but the fs was converted from raid5
 
 Could you please elaborate on that ? you also ended up with files that got
 truncated to 4096 bytes ?
>>> 
>>> I did not have truncated to 4k files, but your case lets me think of
>>> small files inlining. Default max_inline mount option is 8k and that
>>> means that 0 to ~3k files end up in metadata. I had size corruptions
>>> for several of those small sized files that were updated quite
>>> frequent, also within commit time AFAIK. Btrfs check lists this as
>>> errors 400, although fs operation is not disturbed. I don't know what
>>> happens if those small files are being updated/rewritten and are just
>>> below or just above the max_inline limit.
>>> 
>>> The only thing I was 

Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-06 Thread Henk Slager
On Wed, Jul 6, 2016 at 2:20 PM, Tomasz Kusmierz  wrote:
>
>> On 6 Jul 2016, at 02:25, Henk Slager  wrote:
>>
>> On Wed, Jul 6, 2016 at 2:32 AM, Tomasz Kusmierz  
>> wrote:
>>>
>>> On 6 Jul 2016, at 00:30, Henk Slager  wrote:
>>>
>>> On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz 
>>> wrote:
>>>
>>> I did consider that, but:
>>> - some files were NOT accessed by anything with 100% certainty (well if
>>> there is a rootkit on my system or something in that shape than maybe yes)
>>> - the only application that could access those files is totem (well
>>> Nautilius checks extension -> directs it to totem) so in that case we would
>>> hear about out break of totem killing people files.
>>> - if it was a kernel bug then other large files would be affected.
>>>
>>> Maybe I’m wrong and it’s actually related to the fact that all those files
>>> are located in single location on file system (single folder) that might
>>> have a historical bug in some structure somewhere ?
>>>
>>>
>>> I find it hard to imagine that this has something to do with the
>>> folderstructure, unless maybe the folder is a subvolume with
>>> non-default attributes or so. How the files in that folder are created
>>> (at full disktransferspeed or during a day or even a week) might give
>>> some hint. You could run filefrag and see if that rings a bell.
>>>
>>> files that are 4096 show:
>>> 1 extent found
>>
>> I actually meant filefrag for the files that are not (yet) truncated
>> to 4k. For example for virtual machine imagefiles (CoW), one could see
>> an MBR write.
> 117 extents found
> filesize 15468645003
>
> good / bad ?

117 extents for a ~15G file is fine; with the -v option you could see the
fragmentation at the start, but this won't lead to any hint why you
have the truncate issue.
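
For reference, that is roughly (paths are placeholders):

filefrag /path/to/file.mkv                  # summary: "N extents found"
filefrag -v /path/to/file.mkv | head -20    # per-extent logical/physical offsets and flags

The -v listing would show whether the start of the file is a pile of tiny
extents (lots of small appends) or was written in large contiguous pieces.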

>>> I did forgot to add that file system was created a long time ago and it was
>>> created with leaf & node size = 16k.
>>>
>>>
>>> If this long time ago is >2 years then you have likely specifically
>>> set node size = 16k, otherwise with older tools it would have been 4K.
>>>
>>> You are right I used -l 16K -n 16K
>>>
>>> Have you created it as raid10 or has it undergone profile conversions?
>>>
>>> Due to lack of spare disks
>>> (it may sound odd for some but spending for more than 6 disks for home use
>>> seems like an overkill)
>>> and due to last I’ve had I had to migrate all data to new file system.
>>> This played that way that I’ve:
>>> 1. from original FS I’ve removed 2 disks
>>> 2. Created RAID1 on those 2 disks,
>>> 3. shifted 2TB
>>> 4. removed 2 disks from source FS and adde those to destination FS
>>> 5 shifted 2 further TB
>>> 6 destroyed original FS and adde 2 disks to destination FS
>>> 7 converted destination FS to RAID10
>>>
>>> FYI, when I convert to raid 10 I use:
>>> btrfs balance start -mconvert=raid10 -dconvert=raid10 -sconvert=raid10 -f
>>> /path/to/FS
>>>
>>> this filesystem has 5 sub volumes. Files affected are located in separate
>>> folder within a “victim folder” that is within a one sub volume.
>>>
>>>
>>> It could also be that the ondisk format is somewhat corrupted (btrfs
>>> check should find that ) and that that causes the issue.
>>>
>>>
>>> root@noname_server:/mnt# btrfs check /dev/sdg1
>>> Checking filesystem on /dev/sdg1
>>> UUID: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
>>> checking extents
>>> checking free space cache
>>> checking fs roots
>>> checking csums
>>> checking root refs
>>> found 4424060642634 bytes used err is 0
>>> total csum bytes: 4315954936
>>> total tree bytes: 4522786816
>>> total fs tree bytes: 61702144
>>> total extent tree bytes: 41402368
>>> btree space waste bytes: 72430813
>>> file data blocks allocated: 4475917217792
>>> referenced 4420407603200
>>>
>>> No luck there :/
>>
>> Indeed looks all normal.
>>
>>> In-lining on raid10 has caused me some trouble (I had 4k nodes) over
>>> time, it has happened over a year ago with kernels recent at that
>>> time, but the fs was converted from raid5
>>>
>>> Could you please elaborate on that ? you also ended up with files that got
>>> truncated to 4096 bytes ?
>>
>> I did not have truncated to 4k files, but your case lets me think of
>> small files inlining. Default max_inline mount option is 8k and that
>> means that 0 to ~3k files end up in metadata. I had size corruptions
>> for several of those small sized files that were updated quite
>> frequent, also within commit time AFAIK. Btrfs check lists this as
>> errors 400, although fs operation is not disturbed. I don't know what
>> happens if those small files are being updated/rewritten and are just
>> below or just above the max_inline limit.
>>
>> The only thing I was thinking of is that your files were started as
>> small, so inline, then extended to multi-GB. In the past, there were
>> 'bad extent/chunk type' issues and it was suggested that the fs would
>> have been an ext4-converted one (which 

Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-06 Thread Tomasz Kusmierz

> On 6 Jul 2016, at 02:25, Henk Slager  wrote:
> 
> On Wed, Jul 6, 2016 at 2:32 AM, Tomasz Kusmierz  
> wrote:
>> 
>> On 6 Jul 2016, at 00:30, Henk Slager  wrote:
>> 
>> On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz 
>> wrote:
>> 
>> I did consider that, but:
>> - some files were NOT accessed by anything with 100% certainty (well if
>> there is a rootkit on my system or something in that shape than maybe yes)
>> - the only application that could access those files is totem (well
>> Nautilius checks extension -> directs it to totem) so in that case we would
>> hear about out break of totem killing people files.
>> - if it was a kernel bug then other large files would be affected.
>> 
>> Maybe I’m wrong and it’s actually related to the fact that all those files
>> are located in single location on file system (single folder) that might
>> have a historical bug in some structure somewhere ?
>> 
>> 
>> I find it hard to imagine that this has something to do with the
>> folderstructure, unless maybe the folder is a subvolume with
>> non-default attributes or so. How the files in that folder are created
>> (at full disktransferspeed or during a day or even a week) might give
>> some hint. You could run filefrag and see if that rings a bell.
>> 
>> files that are 4096 show:
>> 1 extent found
> 
> I actually meant filefrag for the files that are not (yet) truncated
> to 4k. For example for virtual machine imagefiles (CoW), one could see
> an MBR write.
117 extents found
filesize 15468645003

good / bad ?  
> 
>> I did forgot to add that file system was created a long time ago and it was
>> created with leaf & node size = 16k.
>> 
>> 
>> If this long time ago is >2 years then you have likely specifically
>> set node size = 16k, otherwise with older tools it would have been 4K.
>> 
>> You are right I used -l 16K -n 16K
>> 
>> Have you created it as raid10 or has it undergone profile conversions?
>> 
>> Due to lack of spare disks
>> (it may sound odd for some but spending for more than 6 disks for home use
>> seems like an overkill)
>> and due to last I’ve had I had to migrate all data to new file system.
>> This played that way that I’ve:
>> 1. from original FS I’ve removed 2 disks
>> 2. Created RAID1 on those 2 disks,
>> 3. shifted 2TB
>> 4. removed 2 disks from source FS and adde those to destination FS
>> 5 shifted 2 further TB
>> 6 destroyed original FS and adde 2 disks to destination FS
>> 7 converted destination FS to RAID10
>> 
>> FYI, when I convert to raid 10 I use:
>> btrfs balance start -mconvert=raid10 -dconvert=raid10 -sconvert=raid10 -f
>> /path/to/FS
>> 
>> this filesystem has 5 sub volumes. Files affected are located in separate
>> folder within a “victim folder” that is within a one sub volume.
>> 
>> 
>> It could also be that the ondisk format is somewhat corrupted (btrfs
>> check should find that ) and that that causes the issue.
>> 
>> 
>> root@noname_server:/mnt# btrfs check /dev/sdg1
>> Checking filesystem on /dev/sdg1
>> UUID: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
>> checking extents
>> checking free space cache
>> checking fs roots
>> checking csums
>> checking root refs
>> found 4424060642634 bytes used err is 0
>> total csum bytes: 4315954936
>> total tree bytes: 4522786816
>> total fs tree bytes: 61702144
>> total extent tree bytes: 41402368
>> btree space waste bytes: 72430813
>> file data blocks allocated: 4475917217792
>> referenced 4420407603200
>> 
>> No luck there :/
> 
> Indeed looks all normal.
> 
>> In-lining on raid10 has caused me some trouble (I had 4k nodes) over
>> time, it has happened over a year ago with kernels recent at that
>> time, but the fs was converted from raid5
>> 
>> Could you please elaborate on that ? you also ended up with files that got
>> truncated to 4096 bytes ?
> 
> I did not have truncated to 4k files, but your case lets me think of
> small files inlining. Default max_inline mount option is 8k and that
> means that 0 to ~3k files end up in metadata. I had size corruptions
> for several of those small sized files that were updated quite
> frequent, also within commit time AFAIK. Btrfs check lists this as
> errors 400, although fs operation is not disturbed. I don't know what
> happens if those small files are being updated/rewritten and are just
> below or just above the max_inline limit.
> 
> The only thing I was thinking of is that your files were started as
> small, so inline, then extended to multi-GB. In the past, there were
> 'bad extent/chunk type' issues and it was suggested that the fs would
> have been an ext4-converted one (which had non-compliant mixed
> metadata and data) but for most it was not the case. So there was/is
> something unclear, but full balance or so fixed it as far as I
> remember. But it is guessing, I do not have any failure cases like the
> one you see.

When I think of it, I did move this folder first when filesystem was RAID 1 (or 
not 

Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-05 Thread Henk Slager
On Wed, Jul 6, 2016 at 2:32 AM, Tomasz Kusmierz  wrote:
>
> On 6 Jul 2016, at 00:30, Henk Slager  wrote:
>
> On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz 
> wrote:
>
> I did consider that, but:
> - some files were NOT accessed by anything with 100% certainty (well if
> there is a rootkit on my system or something in that shape than maybe yes)
> - the only application that could access those files is totem (well
> Nautilius checks extension -> directs it to totem) so in that case we would
> hear about out break of totem killing people files.
> - if it was a kernel bug then other large files would be affected.
>
> Maybe I’m wrong and it’s actually related to the fact that all those files
> are located in single location on file system (single folder) that might
> have a historical bug in some structure somewhere ?
>
>
> I find it hard to imagine that this has something to do with the
> folderstructure, unless maybe the folder is a subvolume with
> non-default attributes or so. How the files in that folder are created
> (at full disktransferspeed or during a day or even a week) might give
> some hint. You could run filefrag and see if that rings a bell.
>
> files that are 4096 show:
> 1 extent found

I actually meant filefrag for the files that are not (yet) truncated
to 4k. For example for virtual machine imagefiles (CoW), one could see
an MBR write.

> I did forgot to add that file system was created a long time ago and it was
> created with leaf & node size = 16k.
>
>
> If this long time ago is >2 years then you have likely specifically
> set node size = 16k, otherwise with older tools it would have been 4K.
>
> You are right I used -l 16K -n 16K
>
> Have you created it as raid10 or has it undergone profile conversions?
>
> Due to lack of spare disks
> (it may sound odd for some but spending for more than 6 disks for home use
> seems like an overkill)
> and due to last I’ve had I had to migrate all data to new file system.
> This played that way that I’ve:
> 1. from original FS I’ve removed 2 disks
> 2. Created RAID1 on those 2 disks,
> 3. shifted 2TB
> 4. removed 2 disks from source FS and adde those to destination FS
> 5 shifted 2 further TB
> 6 destroyed original FS and adde 2 disks to destination FS
> 7 converted destination FS to RAID10
>
> FYI, when I convert to raid 10 I use:
> btrfs balance start -mconvert=raid10 -dconvert=raid10 -sconvert=raid10 -f
> /path/to/FS
>
> this filesystem has 5 sub volumes. Files affected are located in separate
> folder within a “victim folder” that is within a one sub volume.
>
>
> It could also be that the ondisk format is somewhat corrupted (btrfs
> check should find that ) and that that causes the issue.
>
>
> root@noname_server:/mnt# btrfs check /dev/sdg1
> Checking filesystem on /dev/sdg1
> UUID: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
> checking extents
> checking free space cache
> checking fs roots
> checking csums
> checking root refs
> found 4424060642634 bytes used err is 0
> total csum bytes: 4315954936
> total tree bytes: 4522786816
> total fs tree bytes: 61702144
> total extent tree bytes: 41402368
> btree space waste bytes: 72430813
> file data blocks allocated: 4475917217792
>  referenced 4420407603200
>
> No luck there :/

Indeed looks all normal.

> In-lining on raid10 has caused me some trouble (I had 4k nodes) over
> time, it has happened over a year ago with kernels recent at that
> time, but the fs was converted from raid5
>
> Could you please elaborate on that ? you also ended up with files that got
> truncated to 4096 bytes ?

I did not have truncated to 4k files, but your case lets me think of
small files inlining. Default max_inline mount option is 8k and that
means that 0 to ~3k files end up in metadata. I had size corruptions
for several of those small sized files that were updated quite
frequent, also within commit time AFAIK. Btrfs check lists this as
errors 400, although fs operation is not disturbed. I don't know what
happens if those small files are being updated/rewritten and are just
below or just above the max_inline limit.
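
If you want to experiment with that limit, max_inline is just a mount
option; a hypothetical example (device and mountpoint are placeholders,
and the exact default has changed between kernel versions):

mount -o max_inline=0 /dev/sdg1 /mnt/share      # never inline small files into metadata
mount -o max_inline=2048 /dev/sdg1 /mnt/share   # only inline files up to 2048 bytes

As far as I know, whether a small file really ends up inline also depends
on the node size and the kernel version.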

The only thing I was thinking of is that your files were started as
small, so inline, then extended to multi-GB. In the past, there were
'bad extent/chunk type' issues and it was suggested that the fs would
have been an ext4-converted one (which had non-compliant mixed
metadata and data) but for most it was not the case. So there was/is
something unclear, but full balance or so fixed it as far as I
remember. But it is guessing, I do not have any failure cases like the
one you see.

> You might want to run the python scrips from here:
> https://github.com/knorrie/python-btrfs
>
> Will do.
>
> so that maybe you see how block-groups/chunks are filled etc.
>
> (ps. this email client on OS X is driving me up the wall … have to correct
> the corrections all the time :/)
>
> On 4 Jul 2016, at 22:13, Henk Slager  wrote:
>
> On Sun, Jul 

Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-05 Thread Tomasz Kusmierz
On 6 Jul 2016, at 00:30, Henk Slager wrote:
> 
> On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz wrote:
>> I did consider that, but:
>> - some files were NOT accessed by anything with 100% certainty (well if 
>> there is a rootkit on my system or something in that shape than maybe yes)
>> - the only application that could access those files is totem (well 
>> Nautilius checks extension -> directs it to totem) so in that case we would 
>> hear about out break of totem killing people files.
>> - if it was a kernel bug then other large files would be affected.
>> 
>> Maybe I’m wrong and it’s actually related to the fact that all those files 
>> are located in single location on file system (single folder) that might 
>> have a historical bug in some structure somewhere ?
> 
> I find it hard to imagine that this has something to do with the
> folderstructure, unless maybe the folder is a subvolume with
> non-default attributes or so. How the files in that folder are created
> (at full disktransferspeed or during a day or even a week) might give
> some hint. You could run filefrag and see if that rings a bell.
files that are 4096 show:
1 extent found
> 
>> I did forgot to add that file system was created a long time ago and it was 
>> created with leaf & node size = 16k.
> 
> If this long time ago is >2 years then you have likely specifically
> set node size = 16k, otherwise with older tools it would have been 4K.
You are right I used -l 16K -n 16K
> Have you created it as raid10 or has it undergone profile conversions?
Due to a lack of spare disks
(it may sound odd to some, but spending on more than 6 disks for home use
seems like overkill)
and due to the issues I've had, I had to migrate all data to a new file system.
It played out this way:
1. removed 2 disks from the original FS
2. created RAID1 on those 2 disks
3. shifted 2TB
4. removed 2 disks from the source FS and added those to the destination FS
5. shifted a further 2TB
6. destroyed the original FS and added its 2 disks to the destination FS
7. converted the destination FS to RAID10

FYI, when I convert to RAID10 I use:
btrfs balance start -mconvert=raid10 -dconvert=raid10 -sconvert=raid10 -f 
/path/to/FS

This filesystem has 5 subvolumes. The affected files are located in a separate
folder within a “victim folder” that is within one subvolume.
> 
> It could also be that the ondisk format is somewhat corrupted (btrfs
> check should find that ) and that that causes the issue.

root@noname_server:/mnt# btrfs check /dev/sdg1
Checking filesystem on /dev/sdg1
UUID: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
found 4424060642634 bytes used err is 0
total csum bytes: 4315954936
total tree bytes: 4522786816
total fs tree bytes: 61702144
total extent tree bytes: 41402368
btree space waste bytes: 72430813
file data blocks allocated: 4475917217792
 referenced 4420407603200

No luck there :/

> In-lining on raid10 has caused me some trouble (I had 4k nodes) over
> time, it has happened over a year ago with kernels recent at that
> time, but the fs was converted from raid5
Could you please elaborate on that? Did you also end up with files that got
truncated to 4096 bytes?

> You might want to run the python scrips from here:
> https://github.com/knorrie/python-btrfs 
> 
Will do. 

> so that maybe you see how block-groups/chunks are filled etc.
> 
>> (ps. this email client on OS X is driving me up the wall … have to correct 
>> the corrections all the time :/)
>> 
>>> On 4 Jul 2016, at 22:13, Henk Slager wrote:
>>> 
>>> On Sun, Jul 3, 2016 at 1:36 AM, Tomasz Kusmierz wrote:
 Hi,
 
 My setup is that I use one file system for / and /home (on SSD) and a
 larger raid 10 for /mnt/share (6 x 2TB).
 
 Today I've discovered that 14 of files that are supposed to be over
 2GB are in fact just 4096 bytes. I've checked the content of those 4KB
 and it seems that it does contain information that were at the
 beginnings of the files.
 
 I've experienced this problem in the past (3 - 4 years ago ?) but
 attributed it to different problem that I've spoke with you guys here
 about (corruption due to non ECC ram). At that time I did deleted
 files affected (56) and similar problem was discovered a year but not
 more than 2 years ago and I believe I've deleted the files.
 
 I periodically (once a month) run a scrub on my system to eliminate
 any errors sneaking in. I believe I did a balance a half a year ago ?
 to reclaim space after I deleted a large database.
 
 root@noname_server:/mnt/share# btrfs fi show
 Label: none  uuid: 060c2345-5d2f-4965-b0a2-47ed2d1a5ba2
   Total 

Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-05 Thread Henk Slager
On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz  wrote:
> I did consider that, but:
> - some files were NOT accessed by anything with 100% certainty (well if there 
> is a rootkit on my system or something in that shape than maybe yes)
> - the only application that could access those files is totem (well Nautilius 
> checks extension -> directs it to totem) so in that case we would hear about 
> out break of totem killing people files.
> - if it was a kernel bug then other large files would be affected.
>
> Maybe I’m wrong and it’s actually related to the fact that all those files 
> are located in single location on file system (single folder) that might have 
> a historical bug in some structure somewhere ?

I find it hard to imagine that this has something to do with the
folderstructure, unless maybe the folder is a subvolume with
non-default attributes or so. How the files in that folder are created
(at full disktransferspeed or during a day or even a week) might give
some hint. You could run filefrag and see if that rings a bell.

> I did forgot to add that file system was created a long time ago and it was 
> created with leaf & node size = 16k.

If this long time ago is >2 years then you have likely specifically
set node size = 16k, otherwise with older tools it would have been 4K.
Have you created it as raid10 or has it undergone profile conversions?

It could also be that the ondisk format is somewhat corrupted (btrfs
check should find that ) and that that causes the issue.

In-lining on raid10 has caused me some trouble (I had 4k nodes) over
time, it has happened over a year ago with kernels recent at that
time, but the fs was converted from raid5.

You might want to run the python scrips from here:
https://github.com/knorrie/python-btrfs

so that maybe you see how block-groups/chunks are filled etc.

> (ps. this email client on OS X is driving me up the wall … have to correct 
> the corrections all the time :/)
>
>> On 4 Jul 2016, at 22:13, Henk Slager  wrote:
>>
>> On Sun, Jul 3, 2016 at 1:36 AM, Tomasz Kusmierz  
>> wrote:
>>> Hi,
>>>
>>> My setup is that I use one file system for / and /home (on SSD) and a
>>> larger raid 10 for /mnt/share (6 x 2TB).
>>>
>>> Today I've discovered that 14 of files that are supposed to be over
>>> 2GB are in fact just 4096 bytes. I've checked the content of those 4KB
>>> and it seems that it does contain information that were at the
>>> beginnings of the files.
>>>
>>> I've experienced this problem in the past (3 - 4 years ago ?) but
>>> attributed it to different problem that I've spoke with you guys here
>>> about (corruption due to non ECC ram). At that time I did deleted
>>> files affected (56) and similar problem was discovered a year but not
>>> more than 2 years ago and I believe I've deleted the files.
>>>
>>> I periodically (once a month) run a scrub on my system to eliminate
>>> any errors sneaking in. I believe I did a balance a half a year ago ?
>>> to reclaim space after I deleted a large database.
>>>
>>> root@noname_server:/mnt/share# btrfs fi show
>>> Label: none  uuid: 060c2345-5d2f-4965-b0a2-47ed2d1a5ba2
>>>Total devices 1 FS bytes used 177.19GiB
>>>devid3 size 899.22GiB used 360.06GiB path /dev/sde2
>>>
>>> Label: none  uuid: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
>>>Total devices 6 FS bytes used 4.02TiB
>>>devid1 size 1.82TiB used 1.34TiB path /dev/sdg1
>>>devid2 size 1.82TiB used 1.34TiB path /dev/sdh1
>>>devid3 size 1.82TiB used 1.34TiB path /dev/sdi1
>>>devid4 size 1.82TiB used 1.34TiB path /dev/sdb1
>>>devid5 size 1.82TiB used 1.34TiB path /dev/sda1
>>>devid6 size 1.82TiB used 1.34TiB path /dev/sdf1
>>>
>>> root@noname_server:/mnt/share# uname -a
>>> Linux noname_server 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24
>>> 10:09:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>>> root@noname_server:/mnt/share# btrfs --version
>>> btrfs-progs v4.4
>>> root@noname_server:/mnt/share#
>>>
>>>
>>> Problem is that stuff on this filesystem moves so slowly that it's
>>> hard to remember historical events ... it's like AWS glacier. What I
>>> can state with 100% certainty is that:
>>> - files that are affected are 2GB and over (safe to assume 4GB and over)
>>> - files affected were just read (and some not even read) never written
>>> after putting into storage
>>> - In the past I've assumed that files affected are due to size, but I
>>> have quite few ISO files some backups of virtual machines ... no
>>> problems there - seems like problem originates in one folder & size >
>>> 2GB & extension .mkv
>>
>> In case some application is the root cause of the issue, I would say
>> try to keep some ro snapshots done by a tool like snapper for example,
>> but maybe you do that already. It sounds also like this is some kernel
>> bug, snaphots won't help that much then I think.
>

Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-04 Thread Duncan
Henk Slager posted on Mon, 04 Jul 2016 23:13:52 +0200 as excerpted:

> [Tomasz Kusmierz wrote...]

>> Problem is that stuff on this filesystem moves so slowly that it's hard
>> to remember historical events ... it's like AWS glacier. What I can
>> state with 100% certainty is that:
>> - files that are affected are 2GB and over (safe to assume 4GB and
>> over)
>> - files affected were just read (and some not even read) never written
>> after putting into storage
>> - In the past I've assumed that files
>> affected are due to size, but I have quite few ISO files some backups
>> of virtual machines ... no problems there - seems like problem
>> originates in one folder & size > 2GB & extension .mkv

This reads to me like a security-video use-case, very large media files 
that are mostly WORN (write-once read-never).  These files would be time-
based and likely mostly the same size, tho compression could cause them 
to vary in size somewhat.

I see a comment that I didn't quote, to the effect that you did a balance 
a half a year ago or so.

Btrfs data chunk size is nominally 1 GiB.  However, on large enough 
btrfs, I believe sometimes dependent on striped-raid as well (the exact 
conditions aren't clear to me), chunks are some multiple of that.  With a 
6-drive btrfs raid10, which we know does two copies and stripe as wide as 
possible, so 3-device-wide stripes here with two mirrors of the stripe, 
I'd guess it's 3 GiB chunks, 1 GiB * 3-device stripe width.
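
One way to check instead of guessing would be dumping the chunk tree,
something along these lines (device name is a placeholder; older progs
may only take the numeric tree id, -t 3, or need btrfs-debug-tree):

btrfs inspect-internal dump-tree -t chunk /dev/sdg1 | grep -E 'CHUNK_ITEM|num_stripes|stripe_len'

The chunk items report their length, num_stripes and stripe_len, which
would answer both how big a data chunk is on this filesystem and how wide
its stripes are.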

Is it possible that it's 3 GiB plus files only that are affected, and 
that the culprit was a buggy balance shifting around those big chunks 
half a year or whatever ago?

As to your VM images not being affected, their usage is far different, 
unless they're simply archived images, not actually in use.  If they're 
in-use not archived VM images, they're likely either highly fragmented, 
or you managed the fragmentation with the use of the NOCOW file 
attribute.  Either way, the way the filesystem treats them as opposed to 
very large write-once files that are likely using whole data chunks is 
very different, and it could well be that difference that explains why 
the video files were affected but the VM images not.

Given the evidence, a buggy balance would indeed be my first suspect, but 
I'm not a dev, and I haven't the foggiest what sort of balance bug might 
be the trigger here, or whether it has been fixed at all, let alone when, 
if so.

But of course that does suggest a potential partial proof and a test.  
The partial proof would be that none of the files created after the 
balance should be affected.

And the test, after backing up newer video files if they're likely to be 
needed, try another balance and see if it eats them too.  If it does...

If it doesn't with a new kernel and tools, you might try yet another 
balance with the same kernel and progs you were likely using half a year 
ago when you did that balance, just to nail down for sure whether it did 
eat the files back then, so we don't have to worry about some other 
problem.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-04 Thread Tomasz Kusmierz
I did consider that, but:
- some files were NOT accessed by anything, with 100% certainty (well, if there
is a rootkit on my system or something of that shape then maybe yes)
- the only application that could access those files is totem (well, Nautilus
checks the extension -> directs it to totem), so in that case we would hear
about an outbreak of totem killing people's files.
- if it was a kernel bug then other large files would be affected.

Maybe I’m wrong and it’s actually related to the fact that all those files are
located in a single location on the file system (single folder) that might have
a historical bug in some structure somewhere?

I forgot to add that the file system was created a long time ago and it was
created with leaf & node size = 16k.

(ps. this email client on OS X is driving me up the wall … have to correct the 
corrections all the time :/)

> On 4 Jul 2016, at 22:13, Henk Slager  wrote:
> 
> On Sun, Jul 3, 2016 at 1:36 AM, Tomasz Kusmierz  
> wrote:
>> Hi,
>> 
>> My setup is that I use one file system for / and /home (on SSD) and a
>> larger raid 10 for /mnt/share (6 x 2TB).
>> 
>> Today I've discovered that 14 of files that are supposed to be over
>> 2GB are in fact just 4096 bytes. I've checked the content of those 4KB
>> and it seems that it does contain information that were at the
>> beginnings of the files.
>> 
>> I've experienced this problem in the past (3 - 4 years ago ?) but
>> attributed it to different problem that I've spoke with you guys here
>> about (corruption due to non ECC ram). At that time I did deleted
>> files affected (56) and similar problem was discovered a year but not
>> more than 2 years ago and I believe I've deleted the files.
>> 
>> I periodically (once a month) run a scrub on my system to eliminate
>> any errors sneaking in. I believe I did a balance a half a year ago ?
>> to reclaim space after I deleted a large database.
>> 
>> root@noname_server:/mnt/share# btrfs fi show
>> Label: none  uuid: 060c2345-5d2f-4965-b0a2-47ed2d1a5ba2
>>Total devices 1 FS bytes used 177.19GiB
>>devid3 size 899.22GiB used 360.06GiB path /dev/sde2
>> 
>> Label: none  uuid: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
>>Total devices 6 FS bytes used 4.02TiB
>>devid1 size 1.82TiB used 1.34TiB path /dev/sdg1
>>devid2 size 1.82TiB used 1.34TiB path /dev/sdh1
>>devid3 size 1.82TiB used 1.34TiB path /dev/sdi1
>>devid4 size 1.82TiB used 1.34TiB path /dev/sdb1
>>devid5 size 1.82TiB used 1.34TiB path /dev/sda1
>>devid6 size 1.82TiB used 1.34TiB path /dev/sdf1
>> 
>> root@noname_server:/mnt/share# uname -a
>> Linux noname_server 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24
>> 10:09:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>> root@noname_server:/mnt/share# btrfs --version
>> btrfs-progs v4.4
>> root@noname_server:/mnt/share#
>> 
>> 
>> Problem is that stuff on this filesystem moves so slowly that it's
>> hard to remember historical events ... it's like AWS glacier. What I
>> can state with 100% certainty is that:
>> - files that are affected are 2GB and over (safe to assume 4GB and over)
>> - files affected were just read (and some not even read) never written
>> after putting into storage
>> - In the past I've assumed that files affected are due to size, but I
>> have quite few ISO files some backups of virtual machines ... no
>> problems there - seems like problem originates in one folder & size >
>> 2GB & extension .mkv
> 
> In case some application is the root cause of the issue, I would say
> try to keep some ro snapshots done by a tool like snapper for example,
> but maybe you do that already. It sounds also like this is some kernel
> bug, snaphots won't help that much then I think.



Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-04 Thread Henk Slager
On Sun, Jul 3, 2016 at 1:36 AM, Tomasz Kusmierz  wrote:
> Hi,
>
> My setup is that I use one file system for / and /home (on SSD) and a
> larger raid 10 for /mnt/share (6 x 2TB).
>
> Today I've discovered that 14 of files that are supposed to be over
> 2GB are in fact just 4096 bytes. I've checked the content of those 4KB
> and it seems that it does contain information that were at the
> beginnings of the files.
>
> I've experienced this problem in the past (3 - 4 years ago ?) but
> attributed it to different problem that I've spoke with you guys here
> about (corruption due to non ECC ram). At that time I did deleted
> files affected (56) and similar problem was discovered a year but not
> more than 2 years ago and I believe I've deleted the files.
>
> I periodically (once a month) run a scrub on my system to eliminate
> any errors sneaking in. I believe I did a balance a half a year ago ?
> to reclaim space after I deleted a large database.
>
> root@noname_server:/mnt/share# btrfs fi show
> Label: none  uuid: 060c2345-5d2f-4965-b0a2-47ed2d1a5ba2
> Total devices 1 FS bytes used 177.19GiB
> devid3 size 899.22GiB used 360.06GiB path /dev/sde2
>
> Label: none  uuid: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
> Total devices 6 FS bytes used 4.02TiB
> devid1 size 1.82TiB used 1.34TiB path /dev/sdg1
> devid2 size 1.82TiB used 1.34TiB path /dev/sdh1
> devid3 size 1.82TiB used 1.34TiB path /dev/sdi1
> devid4 size 1.82TiB used 1.34TiB path /dev/sdb1
> devid5 size 1.82TiB used 1.34TiB path /dev/sda1
> devid6 size 1.82TiB used 1.34TiB path /dev/sdf1
>
> root@noname_server:/mnt/share# uname -a
> Linux noname_server 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24
> 10:09:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
> root@noname_server:/mnt/share# btrfs --version
> btrfs-progs v4.4
> root@noname_server:/mnt/share#
>
>
> Problem is that stuff on this filesystem moves so slowly that it's
> hard to remember historical events ... it's like AWS glacier. What I
> can state with 100% certainty is that:
> - files that are affected are 2GB and over (safe to assume 4GB and over)
> - files affected were just read (and some not even read) never written
> after putting into storage
> - In the past I've assumed that files affected are due to size, but I
> have quite few ISO files some backups of virtual machines ... no
> problems there - seems like problem originates in one folder & size >
> 2GB & extension .mkv

In case some application is the root cause of the issue, I would say
try to keep some ro snapshots done by a tool like snapper for example,
but maybe you do that already. It also sounds like this is some kernel
bug; snapshots won't help that much then, I think.


Re: BTRFS RAID 1 broken: Mounted drive(s) basically empty after repair attempt

2016-05-18 Thread Duncan
Quanttek Jonas posted on Tue, 17 May 2016 10:00:41 -0400 as excerpted:

> So, the question is: How can I recover from this? How do I get my data
> back, after foolishly using "btrfsck --repair"?

First, let me note that I'm a list regular and btrfs user, not a dev, and 
that as such, much of your post was beyond my tech understanding level.  
Thus I snipped it above.  For a technical take, perhaps one of the devs 
will help, and other users and general (but not btrfs) devs will likely 
post their thoughts as well.

But here I'd probably declare the filesystem beyond full repair and focus 
on getting any files off it I could using the method described below, and 
restoring what I couldn't get from the damaged filesystem from backup.

It's worth pausing to note at this point the sysadmin's rule of 
backups, which in its simplest form states that if you don't have at 
least one level of backup, you are by choosing not to do that backup, 
defining your data as worth less than the trouble and resources necessary 
to do that backup.  Thus, by definition, you /always/ save what was of 
most importance to you, either the data, if you decided it was worth 
making that backup, or if by your actions you defined the time and 
resources that would otherwise be spent in making that backup as more 
valuable than the data, then you saved your valuable time and resource, 
even if you lost what you had defined to be of lower value, that being 
your data.

And that rule applies in normal conditions, using fully mature and long-
term stable filesystems such as ext3/4, xfs, or (the one I still use on 
my spinning rust, I only use btrfs on my ssds) reiserfs.  Btrfs, while 
stabilizing, is not yet fully stable and mature, definitely not to the 
level of the above filesystems, so the rule applies even more strongly 
there (a less simple form of the rule takes into account varying levels 
of risk and varying data value, along with multiple levels of backup, 100 
levels of backup with some offsite in other locations may not be enough 
for extremely high value data).

So I'll assume that much like me you keep backups where the data is 
valuable enough to warrant it, but you may not always have /current/ 
backups, because the value of the data in the delta between the last 
backup and current simply doesn't warrant the hassle of refreshing the 
backup, yet, given the limited risk of /future/ loss.  However, once the 
potential loss happens, the question changes.  Now it's a matter of 
whether the hassle of further recovery efforts is justified, vs. the 
known loss of the data in that delta between the last backup and the last 
"good" state before things started going bad.

As it happens, btrfs has this really useful tool called btrfs restore, 
that can often help you recover your data at very close to the last good 
state, or at least to a state beyond that of your last backup.  It has 
certainly helped me recover this from-last-backup-delta data a couple 
times here, allowing me to use it instead of having to fall back to the 
older and more stale backup.  One nice thing about btrfs restore is that 
it's read-only with respect to the damaged filesystem, so you can safely 
use it on a filesystem to restore what you can, before trying more 
dangerous things that might cause even more damage.  Since it's a purely 
read-only operation, it won't cause further damage. =:^)

There's a page on the wiki that describes this process in more detail, 
but be aware, once you get beyond where automatic mode can help and you 
have to try manual, it gets quite technical, and a lot of folks find they 
need some additional help from a human, beyond the wiki.

Before I link the wiki page, here's an introduction...

Btrfs restore works on the /unmounted/ filesystem, writing any files it 
recovers to some other filesystem, which of course means that you need 
enough space on that other filesystem to store whatever you wish to 
recover.  By default it will write them as root, using root's umask, with 
current timestamps, and will skip writing symlinks or restoring extended 
attributes, but there are options that will restore ownership/perms/
timestamps, extended attributes, and symlinks, if desired.
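
In its simplest form that is something like the following (device and
target path are placeholders; double-check the option letters against
btrfs-restore(8) for your progs version):

btrfs restore /dev/sdX /mnt/recovery            # defaults: root-owned, current timestamps
btrfs restore -m -S -x /dev/sdX /mnt/recovery   # also restore owner/perms/times, symlinks, xattrs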

Normally, btrfs restore will use a mechanism similar to the recovery 
mount option to try to find a copy of the root tree of the filesystem 
within a few commits (which are 30-seconds apart by default) of what the 
superblocks say is current.

If that works, great.  If not, you have to use a much more manual mode, 
telling btrfs restore what root to try, while using btrfs-find-root to 
find older roots (by generation, aka transid), then feeding the addresses 
found to btrfs restore -t, first with the -l option to list the other 
trees available from that root, then if it finds all the critical trees, 
using it with --dry-run to see if it seems to find most of the expected 
files, before trying the real restore if things look good.
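
Roughly, that manual sequence is (bytenr is whatever btrfs-find-root
reports for a promising generation; device and target are placeholders):

btrfs-find-root /dev/sdX                              # list candidate tree roots by generation
btrfs restore -t <bytenr> -l /dev/sdX                 # list the trees reachable from that root
btrfs restore -t <bytenr> -D /dev/sdX /mnt/recovery   # --dry-run: see what would be restored
btrfs restore -t <bytenr> /dev/sdX /mnt/recovery      # the real run, once it looks sane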

With that, here's the 

Re: btrfs RAID-1 vs md RAID-1?

2016-05-15 Thread Kai Krakow
On Sun, 15 May 2016 19:24:47 +0900,
Tomasz Chmielewski wrote:

> I'm trying to read two large files in parallel from a 2-disk RAID-1 
> btrfs setup (using kernel 4.5.3).
> 
> According to iostat, one of the disks is 100% saturated, while the
> other disk is around 0% busy.
> 
> Is it expected?
> 
> With two readers from the same disk, each file is being read with ~50 
> MB/s from disk (with just one reader from disk, the speed goes up to 
> around ~150 MB/s).
> 
> 
> In md RAID, with many readers, it will try to distribute the reads - 
> after md manual on http://linux.die.net/man/4/md:
> 
>  Raid1
>  (...)
>  Data is read from any one device. The driver attempts to
> distribute read requests across all devices
>  to maximise performance.
> 
>  Raid5
>  (...)
>  This also allows more parallelism when reading, as read requests
> are distributed over all the devices
>  in the array instead of all but one.
> 
> 
> Are there any plans to improve this in btrfs?
> 
> 
> Tomasz Chmielewski
> http://wpkg.org

Here is an idea that could need improvement:
http://permalink.gmane.org/gmane.comp.file-systems.btrfs/17985


-- 
Regards,
Kai

Replies to list-only preferred.



Re: btrfs RAID-1 vs md RAID-1?

2016-05-15 Thread Duncan
Tomasz Chmielewski posted on Sun, 15 May 2016 19:24:47 +0900 as excerpted:

> I'm trying to read two large files in parallel from a 2-disk RAID-1
> btrfs setup (using kernel 4.5.3).
> 
> According to iostat, one of the disks is 100% saturated, while the other
> disk is around 0% busy.
> 
> Is it expected?

Depends.  Btrfs redundancy-raid, raid1/10 has an unoptimized read 
algorithm at this time (and parity-raid, raid5/6, remains new and 
unstable in terms of parity-recovery and restriping after device loss, so 
isn't recommended except for testing).  See below.

> With two readers from the same disk, each file is being read with ~50
> MB/s from disk (with just one reader from disk, the speed goes up to
> around ~150 MB/s).
> 
> In md RAID, with many readers, it will try to distribute the reads -
> after md manual on http://linux.die.net/man/4/md:
> 
>  Raid1 (...)
>  Data is read from any one device. The driver attempts to distribute
>  read requests across all devices to maximize performance.

Btrfs' current redundancy-raid read-scheduling algorithm is a pretty 
basic unoptimized even/odd PID implementation at this point.  It's 
suitable for basic use and will parallelize over a large enough random 
set of read tasks as the PIDs distribute even/odd, and it's well suited 
to testing as it's simple, and easy enough to ensure use of either just 
one side or the other, or both, by simply arranging for all even/odd or 
mixed PIDs.  But as you discovered, it's not yet anything near as well 
optimized as md redundancy-raid.

Another difference between the two that favors mdraid1 is that the latter 
will make N redundant copies across N devices, while btrfs redundancy 
raid in all forms (raid1/10 and dup on single device) has exactly two 
copies, no matter the number of devices.  More devices simply gives you 
more capacity, not more copies, as there's still only two.

OTOH, for those concerned about data integrity, btrfs has one seriously 
killer feature that mdraid lacks -- btrfs checksums both data and 
metadata and verifies a checksum match on read-back, falling back to the 
second copy on redundancy-raid if the first copy fails checksum 
verification, rewriting the bad copy from the good one.  One of the 
things that distressed me about mdraid is that in all cases, redundancy 
and parity alike, it never actually cross-checks either redundant copies 
or parity in normal operation -- if you get a bad copy and the hardware/
firmware level doesn't detect it, you get a bad copy and mdraid is none 
the wiser.  Only during a scrub or device recovery does mdraid actually 
use the parity or redundant copies, and even then, for redundancy-scrub, 
it simply arbitrarily calls the first copy good and rewrites it to the 
others if they differ.
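
For comparison, these are the knobs involved (md device and mountpoint
are placeholders):

echo check  > /sys/block/md0/md/sync_action     # md: read everything, count mismatches
cat /sys/block/md0/md/mismatch_cnt              # md: how many mismatches the check found
echo repair > /sys/block/md0/md/sync_action     # md: rewrite, treating one copy as authoritative
btrfs scrub start -B /mnt/array                 # btrfs: verify checksums, fix from the good copy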

What I'm actually wanting myself, is this killer data integrity 
verification feature, in combination with N-way mirroring instead of just 
the two-way that current btrfs offers.  For me, N=3, three-way-mirroring, 
would be perfect, as with just two-way-mirroring, if one copy is found 
invalid, you better /hope/ the second one is good, while with three way, 
there's still two fallbacks if one is bad.  4+-way would of course be 
even better in that regard, but of course there's the practical side of 
actually buying and housing the things too, and 3-way simply happens to 
be my sweet-spot.

N-way-mirroring is on the roadmap for after parity-raid (the current 
raid56), as it'll use some of the same code.  However, parity-raid ended 
up being rather more complex to properly implement along with COW and 
other btrfs features than they expected, so it took way more time to 
complete than originally estimated and as mentioned above it's still not 
really stable as there remain a couple known bugs that affect restriping 
and recovery from lost device.  So N-way-mirroring could be awhile, and 
if it follows the pattern of parity-raid, it'll be awhile after that 
before it's reasonably stable.  So we're talking years...  But I'm still 
eagerly anticipating...

Obviously, once N-way-mirroring gets in they'll need to revisit the read-
scheduling algorithm anyway, because even/odd won't cut it when there's 
three-plus-way scheduling.  So that's when I'd expect some optimization 
to occur, effectively as part of N-way-mirroring.

Meanwhile, I've argued before that the unoptimized read-scheduling of 
btrfs raid1 remains a prime example-in-point of btrfs' overall stability 
status, particularly when mdraid has a much better algorithm already 
implemented in the same kernel.  Developers tend to be very aware of 
something called premature optimization, where optimization too early 
will either lock out otherwise viable extensions later, or force throwing 
away major sections of optimization code as the optimization is redone to 
account for the new extensions that don't work with the old optimization 
code.

That such prime examples as raid1 read-scheduling remain so under-
optimized 

Re: btrfs RAID-1 vs md RAID-1?

2016-05-15 Thread Anand Jain



On 05/15/2016 06:24 PM, Tomasz Chmielewski wrote:

> I'm trying to read two large files in parallel from a 2-disk RAID-1
> btrfs setup (using kernel 4.5.3).
> 
> According to iostat, one of the disks is 100% saturated, while the other
> disk is around 0% busy.
> 
> Is it expected?

No.

> Are there any plans to improve this in btrfs?

Yes.

Thanks, Anand

> Tomasz Chmielewski
> http://wpkg.org




Re: btrfs raid

2016-03-06 Thread Chris Murphy
On Sun, Mar 6, 2016 at 3:27 PM, Rich Freeman  wrote:
> On Sun, Mar 6, 2016 at 4:07 PM, Chris Murphy  wrote:
>> On Sun, Mar 6, 2016 at 5:01 AM, Rich Freeman  wrote:
>>
>>> I think it depends on how you define "old."  I think that 3.18.28
>>> would be fine as it is a supported longterm.
>>
>> For raid56? I disagree. There were substantial raid56 code changes in
>> 3.19 that were not backported to 3.18.
>
> Of course.  I was referring to raid1.

Oops, sorry. Yeah, it should be safe. But still, there are thousands of
bug fixes that don't get backported even to longterm releases. I
personally wouldn't risk it since there's another option. I guess it
is sort of weighing the bugs you know with the older one, versus the
bugs you don't know with the newer one.


> I wouldn't run raid56 without
> an expectation of occasionally losing everything on any version of
> linux.  :)  If I were just testing it or I could tolerate losing
> everything occasionally I'd probably track the current stable, if not
> mainline, depending on my goals.

Yeah exactly.


-- 
Chris Murphy


Re: btrfs raid

2016-03-06 Thread Rich Freeman
On Sun, Mar 6, 2016 at 4:07 PM, Chris Murphy  wrote:
> On Sun, Mar 6, 2016 at 5:01 AM, Rich Freeman  wrote:
>
>> I think it depends on how you define "old."  I think that 3.18.28
>> would be fine as it is a supported longterm.
>
> For raid56? I disagree. There were substantial raid56 code changes in
> 3.19 that were not backported to 3.18.

Of course.  I was referring to raid1.  I wouldn't run raid56 without
an expectation of occasionally losing everything on any version of
linux.  :)  If I were just testing it or I could tolerate losing
everything occasionally I'd probably track the current stable, if not
mainline, depending on my goals.

-- 
Rich


Re: btrfs raid

2016-03-06 Thread Chris Murphy
On Sun, Mar 6, 2016 at 5:01 AM, Rich Freeman  wrote:

> I think it depends on how you define "old."  I think that 3.18.28
> would be fine as it is a supported longterm.

For raid56? I disagree. There were substantial raid56 code changes in
3.19 that were not backported to 3.18.
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/diff/fs/btrfs/raid56.c?id=v3.19&id2=v3.18.28

There's just no way I'd recommend it, not worth it at all.

Anyone using Btrfs raid56 is still a test subject. I'd only use
current longterm, with the high expectation I'd use current stable or
even mainline if a problem arises before going to the list. If you
can't do any of this, then it's a waste of your time, and you're
better off looking at ZFS on Linux.

Look at the number of changes between 4.1.19 and 4.4.4, just for
extent-tree.c - there's over 1200 changes.
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/diff/fs/btrfs/extent-tree.c?id=v4.4.4=v4.1.19
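
If you happen to have a linux-stable checkout handy, something along these
lines gives a rough local count (exact numbers will differ a bit from the
cgit view):

  git diff --shortstat v4.1.19 v4.4.4 -- fs/btrfs/extent-tree.c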

And 4.1.19 doesn't have the dev replace code for raid56. While that's
more feature than bug fix, there are piles of bug fixes between even
the current 4.1.19 and 4.4.4 kernels.
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/diff/fs/btrfs/raid56.c?id=v4.4.4=v4.1.19

Things are certainly getting stable enough that if your chances of
hitting an edge case are remote, you can get away with using older
kernels. But that's the thing: most of the time you don't know in
advance whether you're going to hit an edge case. If it's just a file
server using Samba, that's a more docile use case than using it as the
root fs, let alone a root fs with snapshots being taken regularly.
Every feature of Btrfs being used adds to the unknown factor.


-- 
Chris Murphy


Re: btrfs raid

2016-03-06 Thread Duncan
Rich Freeman posted on Sun, 06 Mar 2016 07:01:00 -0500 as excerpted:

> On Tue, Mar 1, 2016 at 11:27 AM, Hugo Mills  wrote:
>>
>>Definitely don't use parity RAID on 3.19. It's not really something
>> I'd trust, personally, even on 4.4, except for testing purposes.
> 
> ++ - raid 5/6 are fairly unstable at this point.  Raid 1 should be just
> fine.
> 
>>TBH, I wouldn't really want to be running something as old as 3.19
>> either. The actual problems of running older kernels are, IME,
>> considerably worse than the perceived problems of upgrading.
> 
> I think it depends on how you define "old."  I think that 3.18.28 would
> be fine as it is a supported longterm.  I've just upgraded to the 4.1
> series which I plan to track until a new longterm has been out for a few
> months and things look quiet.
> 
> 3.19 is very problematic though, as it is no longer supported.  I'd
> sooner "downgrade" to 3.18.28 (which likely has more btrfs backports
> unless your distro handles them).  Or, upgrade to 4.1.19.
> 
> If you are using highly experimental features like raid5 support on
> btrfs then bleeding-edge is probably better, but I've found I've had the
> fewest issues sticking with the previous longterm.  I've been bitten by
> a few btrfs regressions over the years and I think 3.19 was actually
> around the time I got hit by one of them.  Since I've switched to just
> staying on a longterm once it hits the x.x.15 version or so I've found
> things to be much more reliable.

Agreed.

The two generally recommended kernel tracks are current and long-term 
stable.  If you choose current, staying within the last two releases is 
recommended.  That's 4.3 or 4.4 at this point, tho 4.3 is getting a bit 
long in the tooth and 4.5 is close to release (I upgraded to it between 
rc5 and rc6).

For some time, the LTS track was also the last couple releases, which 
with 4.4 being LTS, would be it or 4.1, but the previous LTS 3.18 has 
been fairly stable as well, and as long as you're not trying to run 
parity raid, there hasn't been a hugely pressing reason to recommend 
upgrading for those who prefer to play things conservative.

But older than 3.18 LTS is definitely not recommended, and 3.19 isn't LTS 
and is long out of standard support so isn't recommended either.  Neither 
are 4.0 and 4.2 as they too are out of support.

And parity raid is still unstable enough that you really need to be on 
the latest current kernel for it, for another couple of kernel series 
anyway.  Once it stabilizes, 4.4 will likely be the first LTS series 
considered stable for parity raid, and hopefully 4.6 will do it, but as 
of now there's at least one more bug that hasn't been traced, whereby 
restriping to cover a change in the number of devices takes about 10 
times longer than it should.  And if that change in the number of devices 
is due to a device failure, that means the window during which additional 
device failures aren't covered is also about 10 times longer than it 
should be.  Not good for the life of your data, for sure!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs raid

2016-03-06 Thread Rich Freeman
On Tue, Mar 1, 2016 at 11:27 AM, Hugo Mills  wrote:
>
>Definitely don't use parity RAID on 3.19. It's not really something
> I'd trust, personally, even on 4.4, except for testing purposes.

++ - raid 5/6 are fairly unstable at this point.  Raid 1 should be just fine.

>TBH, I wouldn't really want to be running something as old as 3.19
> either. The actual problems of running older kernels are, IME,
> considerably worse than the perceived problems of upgrading.

I think it depends on how you define "old."  I think that 3.18.28
would be fine as it is a supported longterm.  I've just upgraded to
the 4.1 series which I plan to track until a new longterm has been out
for a few months and things look quiet.

3.19 is very problematic though, as it is no longer supported.  I'd
sooner "downgrade" to 3.18.28 (which likely has more btrfs backports
unless your distro handles them).  Or, upgrade to 4.1.19.

If you are using highly experimental features like raid5 support on
btrfs then bleeding-edge is probably better, but I've found I've had
the fewest issues sticking with the previous longterm.  I've been
bitten by a few btrfs regressions over the years and I think 3.19 was
actually around the time I got hit by one of them.  Since I've
switched to just staying on a longterm once it hits the x.x.15 version
or so I've found things to be much more reliable.

-- 
Rich


Re: BTRFS Raid 6 corruption - please help with restore

2016-03-02 Thread Chris Murphy
On Wed, Mar 2, 2016 at 11:42 AM, Stuart Gittings  wrote:
> All devices are present.  Btrfs if show is listed below and shows they are 
> all there.  I'm afraid btrfs dev scan does not help


What do you get for 'btrfs check'  (do not use --repair yet)




-- 
Chris Murphy


Re: BTRFS Raid 6 corruption - please help with restore

2016-03-02 Thread Chris Murphy
On Wed, Mar 2, 2016 at 3:47 AM, Stuart Gittings  wrote:
> Hi - I have some corruption on a 12 drive Raid 6 volume.  Here's the
> basics - if someone could help with restore it would save me a ton of
> time (and some data loss - I have critical data backed up, but not
> all).
>
> stuart@debian:~$ uname -a
> Linux debian 4.3.0-0.bpo.1-amd64 #1 SMP Debian 4.3.3-7~bpo8+1
> (2016-01-19) x86_64 GNU/Linux
>
> stuart@debian:~$ sudo btrfs --version
> btrfs-progs v4.4
>
>  sudo btrfs fi sh
> Label: none  uuid: 7f994e11-e146-4dee-80f0-c16ac3073e91
> Total devices 12 FS bytes used 14.25TiB
> devid    1 size 2.73TiB used 167.14GiB path /dev/sdc
> devid    2 size 5.46TiB used 1.75TiB path /dev/sdd
> devid    3 size 5.46TiB used 1.75TiB path /dev/sde
> devid    4 size 2.73TiB used 167.14GiB path /dev/sdn
> devid    5 size 5.46TiB used 1.75TiB path /dev/sdf
> devid    6 size 2.73TiB used 1.75TiB path /dev/sdm
> devid    9 size 2.73TiB used 1.75TiB path /dev/sdj
> devid   10 size 2.73TiB used 1.75TiB path /dev/sdi
> devid   11 size 2.73TiB used 1.75TiB path /dev/sdg
> devid   13 size 2.73TiB used 1.75TiB path /dev/sdl
> devid   14 size 2.73TiB used 1.75TiB path /dev/sdk
> devid   15 size 2.73TiB used 1.75TiB path /dev/sdh
>
> sudo mount -t btrfs -oro,recover /dev/sdc /data
> mount: wrong fs type, bad option, bad superblock on /dev/sdc,
>missing codepage or helper program, or other error
>
>In some cases useful info is found in syslog - try
>dmesg | tail or so.
>
> dmesg:
>
> [ 5642.118303] BTRFS info (device sdc): enabling auto recovery
> [ 5642.118313] BTRFS info (device sdc): disk space caching is enabled
> [ 5642.118316] BTRFS: has skinny extents
> [ 5642.130145] btree_readpage_end_io_hook: 39 callbacks suppressed
> [ 5642.130148] BTRFS (device sdc): bad tree block start
> 13629298965300190098 47255853072384
> [ 5642.130759] BTRFS (device sdc): bad tree block start
> 10584834564968318131 47255853105152
> [ 5642.131289] BTRFS (device sdc): bad tree block start
> 2775635947161390306 47255853121536
> [ 5644.730012] BTRFS: bdev /dev/sdc errs: wr 1664846, rd 210656, flush
> 18054, corrupt 0, gen 0
> [ 5644.801291] BTRFS (device sdc): bad tree block start
> 8578409561856120450 47254279438336
> [ 5644.801304] BTRFS (device sdc): bad tree block start
> 18087369170870825197 47254279454720
> [ 5644.831199] BTRFS (device sdc): bad tree block start
> 9721403008164124267 47254277718016
> [ 5644.842763] BTRFS (device sdc): bad tree block start
> 18087369170870825197 47254279454720
> [ 5644.891992] BTRFS (device sdc): bad tree block start
> 17582844917171188859 47254194176000
> [ 5644.951366] BTRFS (device sdc): bad tree block start
> 3962496226683925584 47254278586368
> [ 5645.097168] BTRFS (device sdc): bad tree block start
> 17049293152820168762 47255619846144
> [ 5646.159819] BTRFS: Failed to read block groups: -5
> [ 5646.215905] BTRFS: open_ctree failed
> stuart@debian:~$
>
> Finally:
>  sudo btrfs restore /dev/sdc /backup
> checksum verify failed on 47255853072384 found 70F58CCA wanted AE18D5BC
> checksum verify failed on 47255853072384 found 70F58CCA wanted AE18D5BC
> checksum verify failed on 47255853072384 found 805B1FF7 wanted B76A652F
> checksum verify failed on 47255853072384 found 70F58CCA wanted AE18D5BC
> bytenr mismatch, want=47255853072384, have=13629298965300190098
> Couldn't read chunk tree
> Could not open root, trying backup super
> warning, device 3 is missing
> warning, device 2 is missing
> warning, device 5 is missing
> warning, device 4 is missing
> bytenr mismatch, want=47255851761664, have=47255851958272
> Couldn't read chunk root
> Could not open root, trying backup super
> warning, device 3 is missing
> warning, device 2 is missing
> warning, device 5 is missing
> warning, device 4 is missing
> bytenr mismatch, want=47255851761664, have=47255851958272
> Couldn't read chunk root
> Could not open root, trying backup super
>


Well there appear to be too many devices missing, I count four. What
does 'btrfs fi show' look like? If there are missing devices, try
'btrfs dev scan' and then 'btrfs fi show' again and see if it changes.
I don't think much can be done if there really are four missing
devices.

-- 
Chris Murphy


Re: btrfs raid

2016-03-01 Thread Hugo Mills
On Tue, Mar 01, 2016 at 11:19:34AM -0500, Carlos Ortega wrote:
> I'd like to confirm that btrfs raid actually works.  My filesystem
> looks like it's a simple concatenation judging from its size in df -k
> output.  btrfs filesystem df says it's a raid10, I just don't
> completely trust it.  Also I'm stuck at version 3.19.1, I can't go
> higher.

   Don't trust plain df. It's likely to be lying to some degree. btrfs
fi df will tell you how much data is under RAID, and btrfs fi show
will tell you how much space is not yet allocated. There's a couple of
FAQ entries on how to interpret that output correctly.
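
   For example (with /mnt standing in for wherever the filesystem is
mounted):

   btrfs filesystem df /mnt    # data/metadata split and the RAID profile in use
   btrfs filesystem show /mnt  # per-device size versus allocated space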

> Can someone confirm that raid works in this version?  If so which
> levels of raid?  I'd prefer raid5 or raid6.

   Definitely don't use parity RAID on 3.19. It's not really something
I'd trust, personally, even on 4.4, except for testing purposes.

   TBH, I wouldn't really want to be running something as old as 3.19
either. The actual problems of running older kernels are, IME,
considerably worse than the perceived problems of upgrading. If you're
restricted to that version by some upstream supplier, then they should
be giving you support and recommending what's usable in that kernel.

   Hugo.

-- 
Hugo Mills | There are three mistaikes in this sentance.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: BTRFS raid 5/6 status

2015-11-01 Thread audio muze
I've looked into snap-raid and it seems well suited to my needs as
most of the data is static.  I'm planning on using it in conjunction
with mhddfs so all drives are seen as a single storage pool. Is there
then any benefit in using Btrfs as the underlying filesystem on each
of the drives?


Re: BTRFS raid 5/6 status

2015-10-15 Thread Erkki Seppala
audio muze  writes:

> It seems to me that the simplest option at present is probably to use
> each disk separately, formatted btrfs, and backed up to other drives.
> The data to be stored on these drives is largely static - video and
> audio library.

In that case this might be applicable to your use case:

  http://www.snapraid.it/

It's a tool to generate (and recover files based on) redundancy data
from n filesystems to m filesystems.

I haven't personally tried it, but it sounds like it could fit
that particular use case extraordinarily well, limiting data loss even
in the most extreme situations to only the devices actually lost.

Maybe someone can then use unionfs on top of that.. :)

-- 
  _
 / __// /__   __   http://www.modeemi.fi/~flux/\   \
/ /_ / // // /\ \/ /\  /
   /_/  /_/ \___/ /_/\_\@modeemi.fi  \/



Re: BTRFS raid 5/6 status

2015-10-15 Thread Pasi Kärkkäinen
On Thu, Oct 15, 2015 at 07:39:20AM +0200, audio muze wrote:
> Thanks Chris
> 
> I should've browsed recent threads, my apologies.  Terribly
> frustrating though that the issues you refer to aren't documented in
> the btrfs wiki.  Reading the wiki one is led to believe that the only
> real issue is the write hole that can occur as a result of a power
> loss.  There I was thinking I've a UPS attached, no problem.
> Notification and handling of device failures is fundamental to any
> raid system, so it seems it's nowhere near ready.
>

Remember UPS does NOT protect against PSU failure, or a kernel crash..


-- Pasi
 
> On Thu, Oct 15, 2015 at 7:06 AM, Chris Murphy  wrote:
> > See the other recent thread on the list "RAID6 stable enough for 
> > production?"
> >
> > A lot of your questions have already been answered in recent previous 
> > threads.
> >
> > While there are advantages to Btrfs raid56, there are some missing
> > parts that make it incomplete and possibly unworkable for certain use
> > cases. For example there's no notification of device failures like md
> > and LVM raid. If there's a device failure, and then also any other
> > problem crops up, the whole file system can become unusable. So
> > whatever your backup strategy is going to be, it needs to be even more
> > bulletproof if you're going to depend on Btrfs for production.
> >
> >
> > --
> > Chris Murphy


Re: BTRFS raid 5/6 status

2015-10-14 Thread Gareth Pye
On Thu, Oct 15, 2015 at 3:11 PM, audio muze  wrote:
> Rebuilds and/or expanding the array should be pretty quick given only
> actual data blocks are written on rebuild or expansion as opposed to
> traditional raid systems that write out the entire array.


While that might be the intended final functionality I don't think
balances are anywhere near that optimised currently.

Starting with a relatively green format isn't a great option for a
file system you intend to use for ever.

-- 
Gareth Pye - blog.cerberos.id.au
Level 2 MTG Judge, Melbourne, Australia
"Dear God, I would like to file a bug report"


Re: BTRFS raid 5/6 status

2015-10-14 Thread Chris Murphy
See the other recent thread on the list "RAID6 stable enough for production?"

A lot of your questions have already been answered in recent previous threads.

While there are advantages to Btrfs raid56, there are some missing
parts that make it incomplete and possibly unworkable for certain use
cases. For example there's no notification of device failures like md
and LVM raid. If there's a device failure, and then also any other
problem crops up, the whole file system can become unusable. So
whatever your backup strategy is going to be, it needs to be even more
bulletproof if you're going to depend on Btrfs for production.


-- 
Chris Murphy


Re: BTRFS raid 5/6 status

2015-10-14 Thread Roman Mamedov
On Thu, 15 Oct 2015 06:11:49 +0200
audio muze  wrote:

> Before I go down this road I'd appreciate thoughts/ suggestions/
> alternatives?  Have I left anything out?  Most importantly is btrfs
> raid6 now stable enough to use in this fashion?

I would suggest going with Btrfs on top of mdadm RAID6, Btrfs still does have
a number of "you what"-grade showstoppers in its native multi-device modes,
but a single device mode Btrfs (on top of some other multi-device system such
as Linux software RAID and/or LVM) in my experience works very well.
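
For example, something along these lines (device names and mount point are
only placeholders):

  mdadm --create /dev/md0 --level=6 --raid-devices=6 /dev/sd[b-g]
  mkfs.btrfs -L pool /dev/md0
  mount /dev/md0 /mnt/pool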

-- 
With respect,
Roman




Re: BTRFS raid 5/6 status

2015-10-14 Thread audio muze
Thanks Chris

I should've browsed recent threads, my apologies.  Terribly
frustrating though that the issues you refer to aren't documented in
the btrfs wiki.  Reading the wiki one is led to believe that the only
real issue is the write hole that can occur as a result of a power
loss.  There I was thinking I've a UPS attached, no problem.
Notification and handling of device failures is fundamental to any
raid system, so it seems it's nowhere near ready.

On Thu, Oct 15, 2015 at 7:06 AM, Chris Murphy  wrote:
> See the other recent thread on the list "RAID6 stable enough for production?"
>
> A lot of your questions have already been answered in recent previous threads.
>
> While there are advantages to Btrfs raid56, there are some missing
> parts that make it incomplete and possibly unworkable for certain use
> cases. For example there's no notification of device failures like md
> and LVM raid. If there's a device failure, and then also any other
> problem crops up, the whole file system can become unusable. So
> whatever your backup strategy is going to be, it needs to be even more
> bulletproof if you're going to depend on Btrfs for production.
>
>
> --
> Chris Murphy


Re: BTRFS raid 5/6 status

2015-10-14 Thread audio muze
Thanks Roman, but I don't have the appetite to use mdadm and have the
array take forever to build, or to take on yet another set of risks
migrating from mdadm to btrfs once raid6 is stable.  It seems
to me that the simplest option at present is probably to use each disk
separately, formatted btrfs, and backed up to other drives.  The data
to be stored on these drives is largely static - video and audio
library.

On Thu, Oct 15, 2015 at 7:25 AM, Roman Mamedov  wrote:

> I would suggest going with Btrfs on top of mdadm RAID6, Btrfs still does have
> a number of "you what"-grade showstoppers in its native multi-device modes,
> but a single device mode Btrfs (on top of some other multi-device system such
> as Linux software RAID and/or LVM) in my experience works very well.
>
> --
> With respect,
> Roman


Re: Btrfs RAID 1 Very poor file re read cache

2015-03-25 Thread Duncan
Chris Severance posted on Tue, 24 Mar 2015 00:00:32 -0400 as excerpted:

 System:
 
 Thinkserver TS140 E3-1225, 32GB ECC RAM, LSI9211-8i (IT unraid), 2 WD xe
 SAS as mdraid-raid1-ext4, 2 WD xe SAS as btrfs-raid1
 
 Linux xyzzy 3.19.2-1-ARCH #1 SMP PREEMPT Wed Mar 18 16:21:02 CET 2015
 x86_64 GNU/Linux
 
 btrfs-progs v3.19
 
 btrfs fi: partition already removed, created with mkfs.btrfs -m raid1 -d
 raid1 -L sdmdata /dev/sdc /dev/sdd
 
 dmesg: (not a problem with crashing)
 
 Problem:
 
 Very poor file reread cache. The database I use organizes both data and
 keys in a single file as a btree. This means that each successive record
 is located randomly around the file. Reading from first to last
 generates a lot of seeks.
 
 On btrfs the speed is consistent throughout the whole file as it is on
 any system with too little memory for an effective cache. Every reread
 runs at the same slow and consistent speed.
 
 So I unmount btrfs, quick zero the drives, mkfs.ext4, mount, and unpack
 the same data and run the same test on the same drives.
 
 On ext4 (and xfs from other testing) the first time I read through the
 whole file it starts slow as it seeks around to uncached data and speeds
 up as more of the file is found in the cache. It is very fast by the
 end. Once in the cache I can read the file over and over super fast. The
 ext4 read cache is mitigating the time cost from the poor arrangement of
 the file.
 
 I'm the only user on this test system so nothing is clearing my 32GB.

[Caveat.  I'm not a dev, only a fellow btrfs using admin and list 
regular.  My understanding isn't perfect and I've been known to be wrong 
from time to time.]

Interesting.  

But AFAIK it's not filesystem-specific cache, but generic kernel vfs-
level cache, so filesystem shouldn't have much effect on whether it's 
cached or not.

And on my btrfs-raid1 based system with 16 gigs RAM, I definitely notice 
the effects of caching, tho there are some differences.  Among other 
things, I'm on reasonably fast SSD, so 0-ms seeks and cache isn't the big 
deal it was back on spinning rust.  But I have one particular app, as it 
happens the pan news client I'm replying to this post with (via 
gmane.org's list2news service), that loads the over a gig of small text-
message files I have in local cache (unexpiring list/group archive) from 
permanent storage at startup, in order to create a threading map in 
memory.  And even on ssd that takes some time at first load, but 
subsequent startups are essentially instantaneous as the files are all 
cached.

So caching on btrfs raid1 is definitely working, tho my use-case is 187k+ 
small files totaling about a gig and a quarter, on ssd, while yours is an 
apparently large single file on spinning rust.

But there are some additional factors that remain unknown, as you didn't 
mention them.  I'd guess #5 is the factor here, but if you plan on 
deploying on btrfs, you should be aware of the other factors as well.

1) Size of that single, apparently large, file.

2) How was the file originally created on btrfs?  Was it created by use, 
that is, effectively appended to and modified over time, or was it 
created as a single file copy of an existing database from other media.

Btrfs is copy-on-write and can fragment pretty heavily on essentially 
random rewrite-in-place operations.

3) Mount options

The autodefrag mount option comes to mind, and nodatacow.

4) Nocow file attribute applied at file creation?  Btrfs snapshotting?  
(Snapshotting can nullify nocow's anti-fragmentation effects.)


Tho all those probably wouldn't have much effect on a file that was 
effectively copied serially, all at once, without live rewriting going on, 
if that's what you were doing for testing.  OTOH, a database-level copy 
would likely not have been serial, and it already sounds like you were 
doing a database-level read, not a serial one, for the testing.

5) Does your database use DIO access, turning off thru-the-VFS caching?

I suspect that it does so, thus explaining why you didn't see any VFS 
caching effect.  That doesn't explain why ext4 and xfs get faster, which 
you attributed to caching, except that apps doing DIO access are 
effectively expected to manage their own caching, and your database's own 
caching may simply not work well with btrfs yet.  Additionally, btrfs has 
had some DIO issues in the past and you may be running into still 
existing bugs there.  You can be commended for testing with a current 
kernel, however, as so many doing database work are running hopelessly 
old kernels for a filesystem still under as intense bugfixing and 
development as btrfs is at this point.  And DIO is known to be an area 
that's likely to need further attention.  So any bugs you're experiencing 
are likely to be of interest to the devs, if you're interested in working 
with them to pin them down.
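
Coming back to points 3 and 4: if you do end up testing the autodefrag/nocow 
route, keep in mind that the nocow attribute only takes effect for files 
created after it is set, so the usual pattern is roughly this (paths purely 
illustrative):

  mount -o autodefrag /dev/sdX /mnt   # or add autodefrag to the fstab options
  mkdir /mnt/db
  chattr +C /mnt/db                   # new files created in here inherit nocow
  cp /path/to/database.file /mnt/db/  # copy the db in *after* setting +C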


-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

Re: btrfs raid-1 uuid-fstab

2015-02-17 Thread Duncan
Kai Krakow posted on Tue, 17 Feb 2015 00:15:50 +0100 as excerpted:

 Long story short: I managed to strip dracut down to
 too few modules and it lost its ability to mount anything and even could
 not spawn a shell. *gnarf

Ouch!

FWIW, that's why I use a kernel built-in initramfs.  If I upgrade dracut 
or change its config and it fails to work, just as if the new kernel the 
initramfs is appended to fails to work, I simply boot an older kernel... 
with a known-working dracut-created initramfs.

Tho I /did/ have trouble with an older dracut locking to a particular 
default-root UUID at one point, so it would boot any root= I pointed it 
at, but *ONLY* as long as that particular UUID continued to exist!

Which is pretty hard to test for, since until you actually mkfs the 
existing default-root, its UUID will continue to exist, and you'll never 
know that your boot to the backup root using root= is working now, but 
will fail as soon as the default-root ceases to exist, until you're 
actually in the situation and can't boot, using any kernel/dracut 
combination!

That did drop me to the dracut/initramfs shell, but I was new enough with 
dracut at the time that I didn't really know how to fix it from there, 
nor could I properly edit a file or even view an entire file (cat worked, 
but that only let me see the last N lines and I didn't have a pager in 
the initramfs), to try to read documentation and fix the issue.

What I finally did to get out of that hole was manually ln -s the /dev/
disk/by-uuid/* symlink that the dracut/initramfs scripts were looking for 
based on the error, pointing it at an existing /dev/sdXN.  It didn't have 
to point at the root device, it could point at any device-block file, as 
long as that device-block file actually existed.
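
In other words, from the initramfs emergency shell, something on the order 
of this (the device and UUID here are invented for illustration):

  mkdir -p /dev/disk/by-uuid
  ln -s /dev/sda3 /dev/disk/by-uuid/1234abcd-0000-0000-0000-feedfacecafe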

I didn't originally file a bug on that as the host-only option 
documentation warned about it being host-specific, so I figured it was 
/designed/ to do that.  Only later, when host-only was being discussed as 
the gentoo-recommended default on gentoo-dev and I explained that it 
wasn't always suitable as it broke if/when you blew away your default-
root and recreated it with a new UUID, and the gentoo dracut maintainer 
asked why I hadn't filed a bug, did I figure out it /was/ a bug, not a 
confusingly documented feature.  So I filed a bug and the gentoo 
maintainer filed one upstream as well, and it was apparently fixed.  But 
of course by then I had long since worked around the problem with more 
specific dracut-module include and exclude statements in the config, 
instead of using host-only, and that was working and continues to work, 
so I've never had reason to go back and test the more loosely specified 
host-only mode, and thus have never confirmed whether the bug was 
actually fixed or not, since I don't use that mode any more.

 And when that wasn't fun enough, my BIOS decided to no longer initialize
 USB so I could neither get into BIOS nor into Grub shell. I don't know
 when that problem happened. Probably been that for a while and I never
 noticed. Just that it went a lot slower through BIOS after I managed to
 convince it to initialize USB again (by opening the case and shorting
 the reset jumper).

Ouch.  FWIW my mobo has dual-bios, which is nice, but I've been down the 
bios-reset road before, several times.

I even had a BIOS update go bad once (due to bad RAM), screwed up the 
last-ditch bios-rescue it offered as I didn't know what I was doing, and 
had to use my netbook to setup a webmail account (didn't have the 
passwords to my normal email as I don't normally keep anything private on 
the netbook at all, in case I lose it, and couldn't access my other disks 
without a device to convert them to external/USB) and order a new BIOS 
shipped to me.

That is of course the big reason my new machine is dual-bios! =:^)  Tho 
it's not an absolute cure-all, as once it successfully boots from the 
main BIOS it auto-overwrites the second one, if different.  I'd actually 
rather make the auto-overwrite bit manual, so I could update it only when 
I was sufficiently sure it worked _reliably_, but oh, well, better than 
not having a backup BIOS at all, as I learned from experience!

 The next fun part was: My backup was incomplete in a special way: It had
 no directories dev, proc, run, sys and friends... Don't ask me how I
 solved that, probably by init=/bin/bash.

init=/bin/bash is indeed a very handy tool to have as a sysadmin. =:^)

I think I mentioned that setting that (via grub var) is actually one of 
my grub2 menu options, in the backup menu, FWIW.

 It happened because I used
 rsync with the option to exclude those dirs. But well: In the end my
 backup was tested bootable. :-)

 I fixed my dracut setup and in the same procedure also fixed a
 long-standing issue with btrfs check telling me nlink errors. Luckily,
 this newer version could tell me the paths and I just deleted those files
 in the chrome profile and var/lib/bluetooth directory. I wonder if those
 errors were causing my issues with chrome freezing the PC and with
 bluetooth sometimes stopping working.

Re: btrfs raid-1 uuid-fstab

2015-02-16 Thread Kai Krakow
Duncan 1i5t5.dun...@cox.net schrieb:

 It's probably just a flaw that
 btrfs device composition comes up later and the kernel tries to early to
 mount root. rootwait probably won't help here, too. But rootdelay
 may help that case tho I myself don't have the ambitions to experiment
 with it. My dracut initrd setup works fine and has some benefits like
 early debug shell to investigate problems without resorting to rescue
 systems or bootable USB sticks.
 
 FWIW, my root backup and rescue solution are one and the same, an
 occasional (every few kernel cycles) snapshot copy (not btrfs snapshot,
 a full copy) of my root filesystem, made when things seem reasonably
 stable and have been working for awhile, to an identically sized backup
 root filesystem located elsewhere.  That way, I have effectively a fully
 operational system snapshot copy, taken when the system was known to be
 operational, complete with everything I normally use, X, KDE, firefox,
 media players, games, everything, and of course tested to boot and run as
 normal.  No crippled semi-functional rescue media for me! =:^)

I accidentally forced myself into using my USB3 backup drive as my rootfs due 
to fiddling around with dracut build options without thinking about it too 
much while waiting for my btrfs device add/del disk jockeying to migrate to 
bcache. Long story short: I managed to strip dracut down to too few modules 
and it lost its ability to mount anything and even could not spawn a shell. 
*gnarf

And when that wasn't fun enough, my BIOS decided to no longer initialize USB 
so I could neither get into BIOS nor into Grub shell. I don't know when that 
problem happened. Probably been that for a while and I never noticed. Just 
that it went a lot slower through BIOS after I managed to convince it to 
initialize USB again (by opening the case and shorting the reset jumper).

The next fun part was: My backup was incomplete in a special way: It had no 
directories dev, proc, run, sys and friends... Don't ask me how I solved 
that, probably by init=/bin/bash. It happened because I used rsync with 
the option to exclude those dirs. But well: In the end my backup was tested 
bootable. :-)

I fixed my dracut setup and in the same procedure also fixed a long-standing 
issue with btrfs check telling me nlink errors. Luckily, this newer 
version could tell me the paths and I just deleted those files in the chrome 
profile and var/lib/bluetooth directory. I wonder if those errors were 
causing my issues with chrome freezing the PC and with bluetooth sometimes 
stopping working.

And BTW: bcache is pretty fast, booting to graphical.target within 3-8 
seconds (mostly around 5). Now I wonder why I need the resume swap 
which I created in the process: It takes longer to resume from swap than 
just booting to a complete KDE desktop. Well, without the benefit of having 
a fully running session, at least.

 Very flexible, this grub2 is! =:^)

I've been waiting long before doing the switch. But I had to use it when I 
migrated from legacy to UEFI boot mode. Although every configuration bit 
looked confusing and cumbersome, everything worked automatically out of the 
box. Very suprising it is. :-)

 If I lose all three devices at once, I figure it's quite likely I'm
 dealing with a rather larger disaster, say a fire or flood or the like,
 and will probably have my hands full just surviving for awhile.  When I
 do get back to worrying about the computer, likely after replacing what I
 lost in the disaster, it won't be that big a deal to start over
 downloading a live image and doing a new install from the stage-3
 starter.  After all, the *REAL* important backup is in my head, and if I
 lose that, I guess I won't be worrying much about computers any more,
 even if I'm still alive in some facility somewhere.  Tho I /do/ have
 some stuff backed up on USB thumb drive and the like as well.  But I
 don't put much priority in it, because I figure if I'm having to restore
 from that backup in the first place, I'm pretty much screwed in any case,
 and the /last/ thing I'm likely to be worried about is having to start
 over with a new computer install.

From my own experience, the head is not a very good backup. While there are 
things which you simply cannot remember well enough to rebuild, there are 
other things which, when rebuilt from scratch, probably end up better and 
more thoroughly thought out, but they are very frustrating to rebuild and 
thus never reach the same stage of completeness again. So, no: not a good 
backup. It's no fun, even when I had no other stuff to deal with...

But to get back to the multi-device btrfs booting issue: Thanks for 
recommending rootwait, I will try that. I had thought it would have no 
effect if booting from initrd. Let's see if dracut+systemd with rootwait 
will work for me, too.

-- 
Replies to list only preferred.


Re: btrfs raid-1 uuid-fstab

2015-02-15 Thread Kai Krakow
Duncan 1i5t5.dun...@cox.net schrieb:

 While in theory btrfs has the device= mount option, and the kernel has
 rootflags= to tell it what mount options to use, at least last I checked
 a few kernel cycles ago (I'd say last summer, so 3-5 kernel cycles ago),
 for some reason rootflags=device= doesn't appear to work correctly.  My
 theory is that the kernel commandline parser breaks at the second/last =
 instead of the first, so instead of seeing settings for the rootflags
 parameter, it sees settings for the rootflags=device parameter, which of
 course makes no sense to the kernel and is ignored.  But that's just my
 best theory.

Gentoo here, too. And I tried to fiddle around with the exact same issue 
some kernel versions back and didn't get it to work, so I went with dracut 
which works pretty well for me - combined with grub2, multi-device detection 
works pretty well tho you sometimes need rootdelay={1,2,3} to wait up to 
three seconds for btrfs to figure out its setup. Looks like btrfs devices are 
assembled with a delay by the kernel, and at the point you try to mount one 
of the compound devices, if done too early, the kernel code cannot yet find 
all the other devices of the set. Maybe rootwait would also do tho I 
didn't try that yet (it probably won't as the root device is initrd 
initially). It may be a side-effect of the kernel doing async SCSI device 
detection. It may be worth trying to turn that option off.
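
For reference, the kind of kernel command line I mean (device and subvolume 
names are just examples):

  root=/dev/sda2 rootfstype=btrfs rootflags=subvol=rootvol rootdelay=3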

But about your theory: I don't think the cmdline parser works incorrectly, 
because rootflags=subvol=something works. It's probably just a flaw that 
btrfs device composition comes up later and the kernel tries too early to 
mount root. rootwait probably won't help here either. But rootdelay may 
help that case tho I myself don't have the ambitions to experiment with it. 
My dracut initrd setup works fine and has some benefits like early debug 
shell to investigate problems without resorting to rescue systems or 
bootable USB sticks.

-- 
Replies to list only preferred.



Re: btrfs raid-1 uuid-fstab

2015-02-15 Thread Duncan
Kai Krakow posted on Sun, 15 Feb 2015 12:11:56 +0100 as excerpted:

 Duncan 1i5t5.dun...@cox.net schrieb:
 
 Gentoo here, too. And I tried to fiddle around with the exact same issue
 some kernel versions back and didn't get it to work, so I did go with
 dracut which works pretty well for me - combined with grub2,
 multi-device detection works pretty well tho you sometimes need
 rootdelay={1,2,3} to wait up to three seconds for btrfs figure out its
 setup. Looks like btrfs devices are assembled with a delay by the kernel
 and at the point you try to mount one of the compound devices, if done
 too early, the kernel code cannot yet find all the other devices of the
 set. Maybe rootwait would also do tho I didn't tried that yet (it
 probably won't as the root device is initrd initially). It may be a
 side-effect of the kernel doing async SCSI device detection. It may be
 worth trying to turn that option of.

Interesting.  I had forgotten I had rootwait set as a builtin kernel 
commandline-option, and was about to reply that I had SCSI_ASYNC_SCAN 
turned on and had never seen problems, but then I remembered having to 
turn on rootwait.

Actually, I had tried rootdelay=N some years ago, perhaps before rootwait 
actually became a kernel commandline option, certainly before I knew of 
it.  I used it with mdraid (initr*-less) too.  But eventually I got tired 
of having to play with rootdelay timeouts, and when I came across rootwait 
I decided to try it, and that solved my timeouts issue once and for all.

So I can confirm that rootwait seems to work for multi-device btrfs as 
well, which of course requires an initr*.  But that actually might be 
dracut reading the kernel commandline and applying the same option at the 
initr* level, and thus not work with other initr*-generators, if they 
don't do the same thing.  I'm actually not sure.

What I can say, however, is that after I set rootwait here, I've had no 
more block-device-detection-timing issues.  It has just worked in terms 
of timing.

And what's nice is that rootwait actually appears to go into a loop, 
checking for a mountable root, as well, and will continue immediately 
upon finding it.  So the delay is exactly as long as it needs to be, and 
no longer.  (I don't remember whether rootdelay=N could terminate the 
delay early if it found all necessary devices, or not, but certainly, 
rootwait does.)

 But about your theory: I don't think the cmdline parser works incorrect,
 becauce rootflags=subvol=something works.

Well, so much for /that/ theory, then.  I /thought/ the kernel devs were 
too smart to have let a bug that simple, especially where it was likely 
to be triggered by other = options as well, remain for as long as this 
has.  But that was what I came up with as a possible explanation.  I 
think your theory below makes more sense.

 It's probably just a flaw that
 btrfs device composition comes up later and the kernel tries to early to
 mount root. rootwait probably won't help here, too. But rootdelay
 may help that case tho I myself don't have the ambitions to experiment
 with it. My dracut initrd setup works fine and has some benefits like
 early debug shell to investigate problems without resorting to rescue
 systems or bootable USB sticks.

FWIW, my root backup and rescue solution are one and the same, an 
occasional (every few kernel cycles) snapshot copy (not btrfs snapshot, 
a full copy) of my root filesystem, made when things seem reasonably 
stable and have been working for awhile, to an identically sized backup 
root filesystem located elsewhere.  That way, I have effectively a fully 
operational system snapshot copy, taken when the system was known to be 
operational, complete with everything I normally use, X, KDE, firefox, 
media players, games, everything, and of course tested to boot and run as 
normal.  No crippled semi-functional rescue media for me! =:^)

With a root filesystem of 8 GiB, that's easy enough, and I keep several 
backup copies available: the first is another pair-device btrfs raid1 on a 
second pair of 8 GiB partitions on the same physical pair of SSDs, with a 
second and third 8 GiB root backup on reiserfs on spinning rust, in case the 
pair of SSD physical devices fail, or if btrfs itself gets majorly bugged 
out, such that booting to the first backup kills it just like it did the 
working copy.

And I have my grub2 menu setup with the root= boot option assigned a 
variable, and menu options to set that variable to point to any of the 
backups as necessary.  So to boot a particular backup, I just select the 
option to set the pointer variable appropriately, and then select boot.  
Similarly with other kernel commandline options, including the kernel 
choice and init=.  They're all loaded into pointer variables, and if I 
want to choose a different one, I simply select the menu option that sets 
the pointer variable appropriately, and then select boot.

Very flexible, this grub2 is! =:^)

Meanwhile, grub2 is setup on both 

Re: btrfs raid-1 uuid-fstab

2015-02-15 Thread Chris Murphy
On Sat, Feb 14, 2015 at 11:28 PM, Duncan 1i5t5.dun...@cox.net wrote:
 Chris Murphy posted on Sat, 14 Feb 2015 04:52:12 -0700 as excerpted:

 Also, there's a nasty
 little gotcha, there is no equivalent for mdadm bitmap. So once one
 member drive is mounted degraded+rw, it's changed, and there's no way to
 catch up the other drive - if you reconnect, it might seem things are
 OK but there's a good chance of corruption in such a case. You have to
 make sure you wipe the lost drive (the older version one). wipefs -a
 should be sufficient, then use 'device add' and 'device delete missing'
 to rebuild it.

 I caught this in my initial btrfs experimentation, before I set it up
 permanently.  It's worth repeating for emphasis, with a bit more
 information as well.

 *** If you break up a btrfs raid1 and attempt to recombine afterward, be
 *SURE* you *ONLY* mount the one side writable after that.  As long as
 ONLY one side is written to, that one side will consistently have a later
 generation than the device that was dropped out, and you can add the
 dropped device back in,

Right. I left out the distinguishing factor in whether or not it
corrupts. I'm uncertain how bad this corruption is, I've never tried
reproducing it.



-- 
Chris Murphy


Re: btrfs raid-1 uuid-fstab

2015-02-14 Thread Duncan
Chris Murphy posted on Sat, 14 Feb 2015 04:52:12 -0700 as excerpted:

 On Fri, Feb 13, 2015 at 7:31 PM, James wirel...@tampabay.rr.com wrote:

 What I want is if a drive fails,
 I can just replace it, or pull one drive out, replace it with a second
 blank, 2T new drive. Then move the removed drive into a second
 (identical) system to build a cloned workstation. From what I've read,
 UUID numbers are supposed to be used with fstab + btrfs; Partuuid is still
 flaky. But the UUID numbers do not appear unique (due to raid-1)? Do they
 only get listed once in fstab?
 
 Once is enough. Kernel code will find both devices.

[Preliminary note. FWIW, gentooer here too, running a btrfs raid1 root, 
altho I strongly prefer several smaller filesystems over a single large 
filesystem, so all my data eggs aren't in the same filesystem basket if 
the proverbial bottom drops out of it.  So /home is a separate 
filesystem, as is /var/log, as is my updates stuff (gentoo and other 
repos, including kernel, sources, binpkgs, ccache, everything I use to 
update the system on a single filesystem, kept unmounted unless I'm 
updating), as is my media partition, and of course /tmp, which is tmpfs.  
But of interest here is that I'm running a btrfs raid1 root.]

CM is correct. =:^)

But in addition, for a btrfs raid1 root (or any multi-device btrfs root, 
for that matter), you *WILL* need an initr*, because normally the kernel 
must run a userspace btrfs device scan (from the initr*, before mounting 
root) before it can actually assemble a multi-device btrfs properly.  As I 
don't believe 
Chris is a gentooer, I'm guessing he's used to an initr* and thus forgot 
about this requirement, which can be a big one for a gentooer, since we 
build our own kernels and often build in at least the modules required to 
mount root, thus in many cases making an initr* unnecessary.  
Unfortunately, for a multi-device btrfs root, it's necessary. =:^(
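
The critical bit the initr* has to do before switching to the real root 
boils down to roughly this (device, subvolume and target directory are 
placeholders):

  btrfs device scan
  mount -t btrfs -o subvol=rootvol /dev/sda2 /sysroot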

While in theory btrfs has the device= mount option, and the kernel has 
rootflags= to tell it what mount options to use, at least last I checked 
a few kernel cycles ago (I'd say last summer, so 3-5 kernel cycles ago), 
for some reason rootflags=device= doesn't appear to work correctly.  My 
theory is that the kernel commandline parser breaks at the second/last = 
instead of the first, so instead of seeing settings for the rootflags 
parameter, it sees settings for the rootflags=device parameter, which of 
course makes no sense to the kernel and is ignored.  But that's just my 
best theory.  All I know for sure is that the subject has come up a 
number of times here and has been acknowledged by the btrfs devs, I had 
to set up an initr* to get a raid1 btrfs root to mount when I originally 
set it up here, and some time later when I decided to try an initr*-less 
rootflags= boot again and see if the problem had been fixed, it still 
didn't work.

So for a multi-device btrfs root, plan on that initr*.  If you'd never 
really learned how to set one up, as was the case here, you will probably 
either have to learn, or skip the idea of a multi-device btrfs root until 
the problem is, eventually/hopefully, fixed.

FWIW, I use dracut to create my initr* here, and have the kernel options 
set such that the dracut-pre-created initr* is attached to each kernel I 
build as an initramfs, so I don't have to have an initr* setting in grub2 
-- each kernel image has its own, attached.

And FWIW, when I first setup the btrfs root (and dracut-based initr*), I 
was running openrc (and thus using sysv-init as my init).  I've since 
switched to systemd and activated the appropriate dracut systemd module.  
So I know from personal experience, a dracut-based initr* can be setup to 
boot either openrc/sysvinit, or systemd.  Both work. =:^)

 For degraded use, this gets tricky, you have to use boot param
 rootflags=degraded to get it to mount, otherwise mount fails and you'll
 be dropped to a pre-mount shell in the initramfs.

See, assumed initr*. =:^\

But while on the topic of rootflags=degraded: in my experimentation 
without an initr* (and thus without its pre-mount btrfs device scan), since 
it /was/ a two-device btrfs raid1 for both data and metadata, with copies of 
everything on each device, the only way to boot was to set 
rootflags=degraded, since the kernel would only know about the root= 
device in that case.

And that worked, so the kernel certainly could parse rootflags= and pass 
the mount options to btrfs as it should.  It simply broke when device= 
was passed in those rootflags.  Thus my theory about the parser breaking 
at the wrong =.

 Also, there's a nasty
 little gotcha, there is no equivalent for mdadm bitmap. So once one
 member drive is mounted degraded+rw, it's changed, and there's no way to
 catch up the other drive - if you reconnect, it might seem things are
 OK but there's a good chance of corruption in such a case. You have to
 make sure you wipe the lost drive (the older version one). wipefs -a
 should be sufficient, then use 'device add' and 'device delete missing'
 to rebuild it.

Re: btrfs raid-1 uuid-fstab

2015-02-14 Thread Chris Murphy
On Fri, Feb 13, 2015 at 7:31 PM, James wirel...@tampabay.rr.com wrote:

No swap for now (each system had 32G); if I need
 swap later, I can just set up a file and use swapon?

No. You should read the wiki.
https://btrfs.wiki.kernel.org/index.php/FAQ#Does_btrfs_support_swap_files.3F

 What I want is if a drive fails,
 I can just replace it, or pull one drive out, replace it with a second
 blank, 2T new drive. Then move the removed drive into a second (identical)
 system to build a cloned workstation. From what I've read, UUID numbers
 are supposed to be used with fstab + btrfs; Partuuid is still flaky. But the
 UUID numbers do not appear unique (due to raid-1)? Do they only get listed
 once in fstab?

Once is enough. Kernel code will find both devices.

For degraded use, this gets tricky, you have to use boot param
rootflags=degraded to get it to mount, otherwise mount fails and
you'll be dropped to a pre-mount shell in the initramfs. Also, there's
a nasty little gotcha, there is no equivalent for mdadm bitmap. So
once one member drive is mounted degraded+rw, it's changed, and
there's no way to catch up the other drive - if you reconnect, it
might seem things are OK but there's a good chance of corruption in
such a case. You have to make sure you wipe the lost drive (the
older version one). wipefs -a should be sufficient, then use 'device
add' and 'device delete missing' to rebuild it.
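
Roughly, with /dev/sdb as the stale member, /dev/sda as the surviving one
and /mnt as the mount point (all placeholders):

  wipefs -a /dev/sdb
  mount -o degraded /dev/sda /mnt
  btrfs device add /dev/sdb /mnt
  btrfs device delete missing /mnt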

This should not be formatted ext4, it's strictly for GRUB, it doesn't
get a file system. You should use wipefs -a on this.

This fstab has lots of problems. Based on your partition scheme it
should only have two entries total. A btrfs /boot UUID=d67a... and a
btrfs / UUID=b7753... There is no mountpoint for biosboot, it's used
by GRUB and is never formatted or mounted.
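
In other words, the whole fstab would end up looking something like this
(UUIDs abbreviated the same way as above):

  UUID=d67a...   /boot   btrfs   defaults   0 0
  UUID=b7753...  /       btrfs   defaults   0 0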

 First I notice the last partition (sdb1) seems to be missing the ext4 file
 system I guess when I exit the chroot I can just fix that to match sda1.

No the problem is sda1 is wrongly formatted ext4, you should use
wipefs -a on it.

 Any help or guidance would be keen,
 to help salvage the installation and get a few partitions installed
 with btrfs. Maybe I can somehow migrate to a raid-1 configuration
 under btrfs.

Good luck. Make backups often. Btrfs raid1 is not a backup. Btrfs
snapshots are not a backup. And use recent kernels. Recent on this
list means 3.18.3 or newer, which is listed as unstable here:
http://packages.gentoo.org/package/sys-kernel/gentoo-sources
Based on the kernel.org change log, you'd probably be fine running 3.14.31,
but if you have problems and ask about them on this list, there's a decent
chance the first question will be whether you can reproduce the problem on
a current kernel.

Anyway, I suggest reading the entire btrfs wiki.


Re: BTRFS RAID-1 leaf size change scenario

2015-02-12 Thread Chris Murphy
On Thu, Feb 12, 2015 at 6:26 AM, Swâmi Petaramesh sw...@petaramesh.org wrote:

 It also contains *lots* of subvols and snapshots.

About how many is lots?


 1/ Could I first pull a disk out of the current RAID-1 config, losing 
 redundancy
 without breaking anything else ?


 2/ Then reset the removed HD, and create onto it a new BTRFS FS with 16K leaf
 size ?

 3/ Then is there a ways I could btrfs send | btrfs receive a complete volume
 including its subvolumes and snapshots,or is this impossible (and would I
 rather have to create the receiving volumes structure manually, and use rsync)
 ?

 4/ Once the data are copied onto the new FS, could I reset the remaining old
 HD, import it into the new FS and get back to a RAID-1 config, rebuilding the
 RAID with a balance operation ?

You could do that, however at any point in the migration a read
error/checksum mismatch could occur, and with the raid1 degraded, the
entire point of having the btrfs raid1 in the first place is defeated.

Best practice suggests acquiring a 3rd drive to migrate the data to.
Only once that's successful and confirmed should you obliterate one
of the old raid1 mirrors, and put the other old mirror on a shelf JUST
IN CASE. You can always mount it ro,degraded later.

When wiping one of the old mirrors, I go with some overkill by using
btrfs-show-super -a to show all superblocks, and write 1MB of zeros to
each super. Then add it to the new volume, then btrfs balance
-dconvert=raid1 -mconvert=raid1.
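
Spelled out, that sequence looks roughly like this (/dev/sdX and /mnt/new
are placeholders; the offsets are the standard btrfs superblock locations):

  btrfs-show-super -a /dev/sdX            # lists every superblock and its offset
  for off in 65536 67108864 274877906944; do
      dd if=/dev/zero of=/dev/sdX bs=64K seek=$((off / 65536)) count=16
  done
  btrfs device add /dev/sdX /mnt/new
  btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/new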

The problems: All of your subvolumes and snapshots. To use btrfs
send/receive on them, they each have to have a read-only version. And
you have to have a naming convention that ensures you get either the
-p or -c correct, so that you aren't unnecessarily duplicating data
during the send/receive. If you don't get it right, you either miss
migrating important data, or you'll run out of space on the
destination. The same problem applies with rsync if you want to keep
most or all of these snapshots.
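
The basic per-subvolume pattern is something like this (paths and names
invented for illustration):

  btrfs subvolume snapshot -r /mnt/old/home /mnt/old/home.ro1
  btrfs send /mnt/old/home.ro1 | btrfs receive /mnt/new/
  # later, incrementally, against the previous read-only snapshot:
  btrfs subvolume snapshot -r /mnt/old/home /mnt/old/home.ro2
  btrfs send -p /mnt/old/home.ro1 /mnt/old/home.ro2 | btrfs receive /mnt/new/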

The other option, is to make the raid1 volume a seed device, add two
new drives, then delete the seed drive(s). I've only ever done this
with a single device as seed, not a raid1. I don't even know if it
will work because of this, but also there still may be seed device
bugs in even recent kernels. The huge plus of this method though, is
that you don't have to make a bunch of ro snapshots first, everything
is migrated as it is on the seed. It's much easier. If it works. But
since the seed is data and metadata raid1, so will be any added
devices. So I think there isn't a way to make a raid1 a seed, where
the added device is single profile. That'd be pretty nifty if it were
possible.
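
For completeness, the single-device seed procedure I've used looks roughly
like this; whether it works with a raid1 seed is exactly the open question
(device names are placeholders):

  btrfstune -S 1 /dev/old        # mark the old filesystem as a seed (unmounted)
  mount /dev/old /mnt            # mounts read-only because it's a seed
  btrfs device add /dev/new1 /dev/new2 /mnt
  mount -o remount,rw /mnt
  btrfs device delete /dev/old /mnt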



 (Machine's kernel is an Ubuntu 3.16.0-30 with btrfs-tools 3.14.1-1)

Unless this kernel contains, at a minimum, the btrfs fixes in 3.16.2,
I would stop using it. There are also a set of fixes in 3.16.7 that
ought to be used. Since 3.16 isn't even a listed longterm or stable
kernel anymore, I suggest using 3.17.8, 3.18.3 or newer.


 Many thanks for all help / lights about if this is feasible / how to do it
 without losing my data...

I think the strategy at this point necessitates a 3rd drive. And
you're going to need to thin out the herd of subvols and snapshots you
have to something that can be manageably migrated to the new volume.
Once that's done, break one of the old mirrors to make it into a new
mirror (conversion), and put the other old mirror on a shelf in case
this whole thing goes badly. It's the only safe way.

-- 
Chris Murphy


Re: BTRFS RAID-1 leaf size change scenario

2015-02-12 Thread Duncan
Swâmi Petaramesh posted on Thu, 12 Feb 2015 14:26:09 +0100 as excerpted:

 I have a BTRFS RAID-1 FS made from 2x 2TB SATA mechanical drives.
 
 It was created a while ago, with defaults by the time of 4K leaf sizes.
 
 It also contains *lots* of subvols and snapshots.
 
 It has become very slow over time, and I know that BTRFS performs better
 with the new 16K leaf sizes.

I agree with everything Chris Murphy said and, were it me, I would do the 
migration as he suggested.  Here I'll focus on another aspect, and 
reemphasize one he mentioned, as well.

1) Focus: Snapshots

Btrfs makes it deceptively easy to make snapshots, since due to COW they 
can be created at very close to zero cost.  Unfortunately, that ease of 
creation belies the much more complicated snapshot maintenance and 
deletion costs, and often people create and keep around far more 
snapshots than is healthy for an optimally functioning btrfs.

Basically, anything over say 500 snapshots on a btrfs is going to start 
bogging it down, and more than around 250-ish snapshots of any single 
subvolume should be entirely unnecessary.  Unfortunately, due to the 
deceptive ease of creation, some people even take per-minute snapshots 
and fail to thin them down well over time, thus ending up with thousands 
to hundreds of thousands of snapshots, particularly if they're 
snapshotting multiple subvolumes at that extreme per-minute frequency.  A 
filesystem in this condition is going to be a nightmare to do any 
reasonable maintenance (like a rebalance to add/remove/replace devices, 
or a defrag of more than a few files) on at all, and even regular 
operations will likely slow down due to fragmentation, etc.

Given your starred-emphasis *lots* of snapshots, I strongly suspect 
this to be one of the big reasons for your slowdowns, far more so than 
the 4k nodesize, tho that won't help.  OTOH, if your characterization of 
*lots* was actually less than this, snapshotting probably isn't such a 
big problem after all and you can skip to the reemphasis point, below.

Unfortunately, at this point it may not be reasonable to recover from the 
situation on the existing filesystem, as doing the necessary thinning 
down of those snapshots could take nigh eternity (well, days per 
snapshot, not reasonable if you're dealing with anything near the 
thousands of snapshots I suspect) due to all that overhead.

But regardless of whether you can fix the existing btrfs, at least once 
you start over with a new one, try to better manage your snapshotting 
practices and I suspect the filesystem won't slow down as fast as this 
one did, while if you don't, I strongly suspect the newer 16k nodesizes 
aren't going to make that much difference and you'll get the same sort of 
slowdowns over time as you're dealing with now.

Here's the base argument concerning snapshot thinning management.  
Suppose you're doing hourly snapshots, and not doing any thinning.  
Suppose that a year later, you find you need a version of a file from a 
year ago, and go to retrieve it from one of those snapshots.  So you go 
to mount a year-old snapshot and you have to pick one.  Is it *REALLY* 
going to matter, a year on, with no reason to access it since, what exact 
hour it was?  How are you even going to /know/ what exact hour to pick?

A year on, practicality suggests you'll simply pick one out of the 24 for 
the day and call it good.  But is even that level of precision 
necessary?  A year on, might a single snapshot for the week, or for the 
month, or even the quarter, be sufficient?  Chances are it will be, and 
if the one you pick is too new or too old, you can simply pick one newer 
or one older and be done with it.

Similarly, per-minute snapshots?  In the extreme case, maybe for half an 
hour or an hour.  Then thin them down to say 10-minute snapshots, and to 
half-hour snapshots after four or six hours (depending on whether you're 
basing on an 8-hour workday or a 24-hour day), then to hourly after a 
day, four- or six-hourly after three days, and daily after a week.

But in practice, per-minute snapshots are seldom necessary at all, and 
could be problems for maintenance if they end up taking more than a 
minute to delete.  Ten minute, possibly, more likely half-hour or hourly 
is fine.
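
As an illustration only (the paths and naming are invented), the whole 
create/thin cycle is just two commands run from a timer or cron job:

    # take a dated read-only snapshot of the subvolume
    btrfs subvolume snapshot -r /mnt/data \
        /mnt/snaps/data.$(date +%Y%m%d-%H%M)
    # thinning is then nothing more than deleting the snapshots the
    # schedule no longer keeps
    btrfs subvolume delete /mnt/snaps/data.20150201-0030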

So say we start with half-hour snapshots, 24-hours/day, but thin down to 
hourly after four hours and to four-hourly after a day, for a week.  
That's:

8 half-hourly + (24-4) = 20 hourly + (7-1)*6 = 36 four-hourly, i.e.
8+20+36 = 64 snapshots in a week.

Now, keep daily snapshots for three additional weeks = 64+21 =
85 snapshots in four weeks.

And keep weekly snapshots to fill out the half-year (26 weeks):
26-4 = 22 more weeks, 85+22 = 107 snapshots in half a year.

Now after half a year, if the data is of any value at all, it will have 
been backed up elsewhere.  If you like, to avoid having to dig up those 
backups, you can keep quarterly snapshots for... pretty much the life of 
the filesystem or hardware; at four a year they'll barely add to the total.

Re: BTRFS RAID-1 leaf size change scenario

2015-02-12 Thread Chris Murphy
I'm going to amend what I wrote earlier. The problem with the seed
device method is that it won't let you change the leafsize. That means
you'll need to go with a new volume with mkfs, and migrate data with
btrfs send/receive instead.

And to clarify, you don't need to thin out subvolumes on the existing
volume to start with. Do the thinning as part of the send/receive
strategy. Keep the old raid1 mirror as a read-only archive until
you're clear you don't need it as much as you need a new backup
volume. I wouldn't combine archive and backup in a single device.

Chris Murphy


Re: btrfs RAID with enterprise SATA or SAS drives

2014-07-10 Thread Martin Steigerwald
On Thursday, 10 July 2014, 12:10:46, Russell Coker wrote:
 On Wed, 9 Jul 2014 16:48:05 Martin Steigerwald wrote:
   - for someone using SAS or enterprise SATA drives with Linux, I
   understand btrfs gives the extra benefit of checksums, are there any
   other specific benefits over using mdadm or dmraid?
  
  I think I can answer this one.
  
  Most important advantage I think is BTRFS is aware of which blocks of
  the RAID are in use and need to be synced:
  
  - Instant initialization of RAID regardless of size (unless at some
  capacity mkfs.btrfs needs more time)
 
 From mdadm(8):
 
--assume-clean
       Tell mdadm that the array pre-existed and is known to be clean.
       It can be useful when trying to recover from a major failure as
       you can be sure that no data will be affected unless you
       actually write to the array.  It can also be used when creating
       a RAID1 or RAID10 if you want to avoid the initial resync,
       however this practice — while normally safe — is not
       recommended.  Use this only if you really know what you are
       doing.
 
       When the devices that will be part of a new array were filled
       with zeros before creation the operator knows the array is
       actually clean. If that is the case, such as after running
       badblocks, this argument can be used to tell mdadm the facts
       the operator knows.
 
 While it might be regarded as a hack, it is possible to do a fairly
 instant initialisation of a Linux software RAID-1.

It is not the same.

BTRFS doesn't care whether the data in the unused blocks differs.

The RAID is at the *filesystem* level, not the raw block level. The data on 
the two disks doesn't even have to be located in the same sectors.


  - Rebuild after disk failure or disk replace will only copy *used*
  blocks
 Have you done any benchmarks on this?  The down-side of copying used
 blocks is that you first need to discover which blocks are used.  Given
 that seek time is a major bottleneck at some portion of space used it
 will be faster to just copy the entire disk.

As BTRFS operates the RAID at the filesystem level, it already knows which 
blocks are in use. I haven't yet had a disk replace or a faulty disk in my 
two RAID-1 arrays, so I have no measurements. It may depend on free space 
fragmentation.

  Scrubbing can repair from good disk if RAID with redundancy, but
  SoftRAID should be able to do this as well. But also for scrubbing:
  BTRFS only check and repairs used blocks.
 
 When you scrub Linux Software RAID (and in fact pretty much every RAID)
 it will only correct errors that the disks flag.  If a disk returns bad
 data and says that it's good then the RAID scrub will happily copy the
 bad data over the good data (for a RAID-1) or generate new valid parity
 blocks for bad data (for RAID-5/6).
 
 http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html
 
 Page 12 of the above document says that nearline disks (IE the ones
 people like me can afford for home use) have a 0.466% incidence of
 returning bad data and claiming it's good in a year.  Currently I run
 about 20 such disks in a variety of servers, workstations, and laptops.
  Therefore the probability of having no such errors on all those disks
 would be .99534^20=.91081.  The probability of having no such errors
 over a period of 10 years would be (.99534^20)^10=.39290 which means
 that over 10 years I should expect to have such errors, which is why
 BTRFS RAID-1 and DUP metadata on single disks are necessary features.

Yeah, the checksums come in handy here.

(excuse the long signature, it's added by the server)

Ciao,

-- 
Martin Steigerwald
Consultant / Trainer

teamix GmbH
Südwestpark 43
90449 Nürnberg

fon:  +49 911 30999 55
fax:  +49 911 30999 99
mail: martin.steigerw...@teamix.de
web:  http://www.teamix.de
blog: http://blog.teamix.de

Amtsgericht Nürnberg, HRB 18320
Geschäftsführer: Oliver Kügow, Richard Müller




Re: btrfs RAID with enterprise SATA or SAS drives

2014-07-10 Thread Austin S Hemmelgarn
On 2014-07-09 22:10, Russell Coker wrote:
 On Wed, 9 Jul 2014 16:48:05 Martin Steigerwald wrote:
 - for someone using SAS or enterprise SATA drives with Linux, I
 understand btrfs gives the extra benefit of checksums, are there any
 other specific benefits over using mdadm or dmraid?

 I think I can answer this one.

 Most important advantage I think is BTRFS is aware of which blocks of the
 RAID are in use and need to be synced:

 - Instant initialization of RAID regardless of size (unless at some
 capacity mkfs.btrfs needs more time)
 
 From mdadm(8):
 
--assume-clean
   Tell mdadm that the array pre-existed and is known to be  clean.
   It  can be useful when trying to recover from a major failure as
   you can be sure that no data will be affected unless  you  actu‐
   ally  write  to  the array.  It can also be used when creating a
   RAID1 or RAID10 if you want to avoid the initial resync, however
   this  practice  — while normally safe — is not recommended.  Use
   this only if you really know what you are doing.
 
   When the devices that will be part of a new  array  were  filled
   with zeros before creation the operator knows the array is actu‐
   ally clean. If that is the case,  such  as  after  running  bad‐
   blocks,  this  argument  can be used to tell mdadm the facts the
   operator knows.
 
 While it might be regarded as a hack, it is possible to do a fairly instant 
 initialisation of a Linux software RAID-1.

This has the notable disadvantage, however, that the first scrub you run
will essentially perform a full resync if you didn't make sure that the
disks had identical data to begin with.
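
For reference, an md scrub is just a request through sysfs (md0 below is
a placeholder), and the mismatch count afterwards shows how far apart
the two halves actually were:

    echo check > /sys/block/md0/md/sync_action
    cat /sys/block/md0/md/mismatch_cnt   # non-zero means the copies differed
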
 - Rebuild after disk failure or disk replace will only copy *used* blocks
 
 Have you done any benchmarks on this?  The down-side of copying used blocks 
 is 
 that you first need to discover which blocks are used.  Given that seek time 
 is 
 a major bottleneck at some portion of space used it will be faster to just 
 copy the entire disk.
 
 I haven't done any tests on BTRFS in this regard, but I've seen a disk 
 replacement on ZFS run significantly slower than a dd of the block device 
 would.
 
First of all, this isn't really a good comparison for two reasons:
1. EVERYTHING on ZFS (or any filesystem that tries to do that much work)
is slower than a dd of the raw block device.
2. Even if the throughput is lower, this is only really an issue if the
disk is more than half full, because you don't copy the unused blocks.

Also, while it isn't really a recovery situation, I recently upgraded
from a 2 1TB disk BTRFS RAID1 setup to a 4 1TB disk BTRFS RAID10 setup,
and the performance of the re-balance really wasn't all that bad.  I
have maybe 100GB of actual data, so the array started out roughly 10%
full, and the re-balance only took about 2 minutes.  Of course, it
probably helps that I make a point to keep my filesystems de-fragmented,
scrub and balance regularly, and don't use a lot of sub-volumes or
snapshots, so the filesystem in question is not too different from what
it would have looked like if I had just wiped the FS and restored from a
backup.
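
In command terms that conversion was roughly the following (device names
and mount point are placeholders):

    btrfs device add /dev/sdc /dev/sdd /mnt
    btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt
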
 Scrubbing can repair from good disk if RAID with redundancy, but SoftRAID
 should be able to do this as well. But also for scrubbing: BTRFS only
 check and repairs used blocks.
 
 When you scrub Linux Software RAID (and in fact pretty much every RAID) it 
 will only correct errors that the disks flag.  If a disk returns bad data and 
 says that it's good then the RAID scrub will happily copy the bad data over 
 the good data (for a RAID-1) or generate new valid parity blocks for bad data 
 (for RAID-5/6).
 
 http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html
 
 Page 12 of the above document says that nearline disks (IE the ones people 
 like me can afford for home use) have a 0.466% incidence of returning bad 
 data 
 and claiming it's good in a year.  Currently I run about 20 such disks in a 
 variety of servers, workstations, and laptops.  Therefore the probability of 
 having no such errors on all those disks would be .99534^20=.91081.  The 
 probability of having no such errors over a period of 10 years would be 
 (.99534^20)^10=.39290 which means that over 10 years I should expect to have 
 such errors, which is why BTRFS RAID-1 and DUP metadata on single disks are 
 necessary features.
 






Re: btrfs RAID with enterprise SATA or SAS drives

2014-07-09 Thread Martin Steigerwald
On Wednesday, 9 May 2012, 22:01:49, Daniel Pocock wrote:
 There is various information about
 - enterprise-class drives (either SAS or just enterprise SATA)
 - the SCSI/SAS protocols themselves vs SATA
 having more advanced features (e.g. for dealing with error conditions)
 than the average block device
 
 For example, Adaptec recommends that such drives will work better with
 their hardware RAID cards:
[…]
 - for someone using SAS or enterprise SATA drives with Linux, I
 understand btrfs gives the extra benefit of checksums, are there any
 other specific benefits over using mdadm or dmraid?

I think I can answer this one.

The most important advantage, I think, is that BTRFS is aware of which blocks 
of the RAID are in use and need to be synced:

- Instant initialization of RAID regardless of size (unless at some 
capacity mkfs.btrfs needs more time)

- Rebuild after disk failure or disk replace will only copy *used* blocks


Scrubbing can repair from the good disk if the RAID has redundancy, but 
SoftRAID should be able to do this as well. But also for scrubbing: BTRFS 
only checks and repairs used blocks.
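
A btrfs scrub itself is a one-liner; the mount point below is just an
example:

    btrfs scrub start /mnt    # background scrub over all devices
    btrfs scrub status /mnt   # progress plus corrected/uncorrectable counts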


Another advantage in the future – not yet possible AFAIK:

- Different RAID levels on the same filesystem for different subvolumes; more 
flexibility, as subvolumes are dynamically allocated instead of statically 
sized

Ciao,
Martin

-- 
Martin Steigerwald
Consultant / Trainer

teamix GmbH
Südwestpark 43
90449 Nürnberg

fon:  +49 911 30999 55
fax:  +49 911 30999 99
mail: martin.steigerw...@teamix.de
web:  http://www.teamix.de
blog: http://blog.teamix.de

Amtsgericht Nürnberg, HRB 18320
Geschäftsführer: Oliver Kügow, Richard Müller




Re: btrfs RAID with enterprise SATA or SAS drives

2014-07-09 Thread Russell Coker
On Wed, 9 Jul 2014 16:48:05 Martin Steigerwald wrote:
  - for someone using SAS or enterprise SATA drives with Linux, I
  understand btrfs gives the extra benefit of checksums, are there any
  other specific benefits over using mdadm or dmraid?
 
 I think I can answer this one.
 
 Most important advantage I think is BTRFS is aware of which blocks of the
 RAID are in use and need to be synced:
 
 - Instant initialization of RAID regardless of size (unless at some
 capacity mkfs.btrfs needs more time)

From mdadm(8):

   --assume-clean
  Tell mdadm that the array pre-existed and is known to be  clean.
  It  can be useful when trying to recover from a major failure as
  you can be sure that no data will be affected unless  you  actu‐
  ally  write  to  the array.  It can also be used when creating a
  RAID1 or RAID10 if you want to avoid the initial resync, however
  this  practice  — while normally safe — is not recommended.  Use
  this only if you really know what you are doing.

  When the devices that will be part of a new  array  were  filled
  with zeros before creation the operator knows the array is actu‐
  ally clean. If that is the case,  such  as  after  running  bad‐
  blocks,  this  argument  can be used to tell mdadm the facts the
  operator knows.

While it might be regarded as a hack, it is possible to do a fairly instant 
initialisation of a Linux software RAID-1.
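
For example (the device names are placeholders, and the man page's
caveat above applies):

    mdadm --create /dev/md0 --level=1 --raid-devices=2 --assume-clean \
        /dev/sda1 /dev/sdb1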

 - Rebuild after disk failure or disk replace will only copy *used* blocks

Have you done any benchmarks on this?  The down-side of copying used blocks is 
that you first need to discover which blocks are used.  Given that seek time is 
a major bottleneck, beyond some proportion of space used it will be faster to 
just copy the entire disk.

I haven't done any tests on BTRFS in this regard, but I've seen a disk 
replacement on ZFS run significantly slower than a dd of the block device 
would.

 Scrubbing can repair from good disk if RAID with redundancy, but SoftRAID
 should be able to do this as well. But also for scrubbing: BTRFS only
 check and repairs used blocks.

When you scrub Linux Software RAID (and in fact pretty much every RAID) it 
will only correct errors that the disks flag.  If a disk returns bad data and 
says that it's good then the RAID scrub will happily copy the bad data over 
the good data (for a RAID-1) or generate new valid parity blocks for bad data 
(for RAID-5/6).

http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html

Page 12 of the above document says that nearline disks (IE the ones people 
like me can afford for home use) have a 0.466% incidence of returning bad data 
and claiming it's good in a year.  Currently I run about 20 such disks in a 
variety of servers, workstations, and laptops.  Therefore the probability of 
having no such errors on all those disks would be .99534^20=.91081.  The 
probability of having no such errors over a period of 10 years would be 
(.99534^20)^10=.39290 which means that over 10 years I should expect to have 
such errors, which is why BTRFS RAID-1 and DUP metadata on single disks are 
necessary features.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/



Re: Btrfs raid allocator

2014-05-06 Thread Hugo Mills
On Tue, May 06, 2014 at 12:41:38PM +0200, Hendrik Siedelmann wrote:
 Hello all!
 
 I would like to use btrfs (or anyting else actually) to maximize raid0
 performance. Basically I have a relatively constant stream of data that
 simply has to be written out to disk. So my question is, how is the block
 allocator deciding on which device to write, can this decision be dynamic
 and could it incorporate timing/troughput decisions? I'm willing to write
 code, I just have no clue as to how this works right now. I read somewhere
 that the decision is based on free space, is this still true?

   For (current) RAID-0 allocation, the block group allocator will use
as many chunks as there are devices with free space (down to a minimum
of 2). Data is then striped across those chunks in 64 KiB stripes.
Thus, the first block group will be N GiB of usable space, striped
across N devices.

   There's a second level of allocation (which I haven't looked at at
all), which is how the FS decides where to put data within the
allocated block groups. I think it will almost certainly be beneficial
in your case to use prealloc extents, which will turn your continuous
write into large contiguous sections of striping.
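
   A cheap way to get prealloc extents, if the application can't call
fallocate() itself, is to preallocate the target files up front -- the
file name and size here are only examples:

    fallocate -l 64G /mnt/stream/capture.raw
    # the writer then fills the file in place rather than allocating as
    # it goes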

   I would recommend thoroughly benchmarking your application with the
FS first though, just to see how it's going to behave for you.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Ceci n'est pas une pipe:  | ---   




Re: Btrfs raid allocator

2014-05-06 Thread Hugo Mills
On Tue, May 06, 2014 at 01:14:26PM +0200, Hendrik Siedelmann wrote:
 On 06.05.2014 12:59, Hugo Mills wrote:
 On Tue, May 06, 2014 at 12:41:38PM +0200, Hendrik Siedelmann wrote:
 Hello all!
 
 I would like to use btrfs (or anyting else actually) to maximize raid0
 performance. Basically I have a relatively constant stream of data that
 simply has to be written out to disk. So my question is, how is the block
 allocator deciding on which device to write, can this decision be dynamic
 and could it incorporate timing/troughput decisions? I'm willing to write
 code, I just have no clue as to how this works right now. I read somewhere
 that the decision is based on free space, is this still true?
 
 For (current) RAID-0 allocation, the block group allocator will use
 as many chunks as there are devices with free space (down to a minimum
 of 2). Data is then striped across those chunks in 64 KiB stripes.
 Thus, the first block group will be N GiB of usable space, striped
 across N devices.
 
 So do I understand this correctly that (assuming we have enough space) data
 will be spread equally between the disks independend of write speeds? So one
 slow device would slow down the whole raid?

   Yes. Exactly the same as it would be with DM RAID-0 on the same
configuration. There's not a lot we can do about that at this point.

 There's a second level of allocation (which I haven't looked at at
 all), which is how the FS decides where to put data within the
 allocated block groups. I think it will almost certainly be beneficial
 in your case to use prealloc extents, which will turn your continuous
 write into large contiguous sections of striping.
 
 Why does prealloc change anything? For me latency does not matter, only
 continuous troughput!

   It makes the extent allocation algorithm much simpler, because it
can then allocate in larger chunks and do more linear writes

 I would recommend thoroughly benchmarking your application with the
 FS first though, just to see how it's going to behave for you.
 
 Hugo.
 
 
 Of course - it's just that I do not yet have the hardware, but I plan to
 test with a small model - I just try to find out how it actually works
 first, so I know what look out for.

   Good luck. :)

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- I am the author. You are the audience. I outrank you! --- 




Re: Btrfs raid allocator

2014-05-06 Thread Hendrik Siedelmann

On 06.05.2014 13:19, Hugo Mills wrote:

On Tue, May 06, 2014 at 01:14:26PM +0200, Hendrik Siedelmann wrote:

On 06.05.2014 12:59, Hugo Mills wrote:

On Tue, May 06, 2014 at 12:41:38PM +0200, Hendrik Siedelmann wrote:

Hello all!

I would like to use btrfs (or anyting else actually) to maximize raid0
performance. Basically I have a relatively constant stream of data that
simply has to be written out to disk. So my question is, how is the block
allocator deciding on which device to write, can this decision be dynamic
and could it incorporate timing/troughput decisions? I'm willing to write
code, I just have no clue as to how this works right now. I read somewhere
that the decision is based on free space, is this still true?


For (current) RAID-0 allocation, the block group allocator will use
as many chunks as there are devices with free space (down to a minimum
of 2). Data is then striped across those chunks in 64 KiB stripes.
Thus, the first block group will be N GiB of usable space, striped
across N devices.


So do I understand this correctly that (assuming we have enough space) data
will be spread equally between the disks independend of write speeds? So one
slow device would slow down the whole raid?


Yes. Exactly the same as it would be with DM RAID-0 on the same
configuration. There's not a lot we can do about that at this point.


So striping is fixed but which disk takes part with a chunk is dynamic? 
But for large workloads slower disks could 'skip a chunk' as chunk 
allocation is dynamic, correct?



There's a second level of allocation (which I haven't looked at at
all), which is how the FS decides where to put data within the
allocated block groups. I think it will almost certainly be beneficial
in your case to use prealloc extents, which will turn your continuous
write into large contiguous sections of striping.


Why does prealloc change anything? For me latency does not matter, only
continuous troughput!


It makes the extent allocation algorithm much simpler, because it
can then allocate in larger chunks and do more linear writes


Is this still true if I do very large writes? Or do those get broken 
down by the kernel somewhere?



I would recommend thoroughly benchmarking your application with the
FS first though, just to see how it's going to behave for you.

Hugo.



Of course - it's just that I do not yet have the hardware, but I plan to
test with a small model - I just try to find out how it actually works
first, so I know what look out for.


Good luck. :)

Hugo.



Thanks!
Hendrik



Re: Btrfs raid allocator

2014-05-06 Thread Hugo Mills
On Tue, May 06, 2014 at 01:26:44PM +0200, Hendrik Siedelmann wrote:
 On 06.05.2014 13:19, Hugo Mills wrote:
 On Tue, May 06, 2014 at 01:14:26PM +0200, Hendrik Siedelmann wrote:
 On 06.05.2014 12:59, Hugo Mills wrote:
 On Tue, May 06, 2014 at 12:41:38PM +0200, Hendrik Siedelmann wrote:
 Hello all!
 
 I would like to use btrfs (or anyting else actually) to maximize raid0
 performance. Basically I have a relatively constant stream of data that
 simply has to be written out to disk. So my question is, how is the block
 allocator deciding on which device to write, can this decision be dynamic
 and could it incorporate timing/troughput decisions? I'm willing to write
 code, I just have no clue as to how this works right now. I read somewhere
 that the decision is based on free space, is this still true?
 
 For (current) RAID-0 allocation, the block group allocator will use
 as many chunks as there are devices with free space (down to a minimum
 of 2). Data is then striped across those chunks in 64 KiB stripes.
 Thus, the first block group will be N GiB of usable space, striped
 across N devices.
 
 So do I understand this correctly that (assuming we have enough space) data
 will be spread equally between the disks independend of write speeds? So one
 slow device would slow down the whole raid?
 
 Yes. Exactly the same as it would be with DM RAID-0 on the same
 configuration. There's not a lot we can do about that at this point.
 
 So striping is fixed but which disk takes part with a chunk is dynamic? But
 for large workloads slower disks could 'skip a chunk' as chunk allocation is
 dynamic, correct?

   You'd have to rewrite the chunk allocator to do this, _and_ provide
different RAID levels for different subvolumes. The chunk/block group
allocator right now uses only one rule for allocating data, and one
for allocating metadata. Now, both of these are planned, and _might_
between them possibly cover the use-case you're talking about, but I'm
not certain it's necessarily a sensible thing to do in this case.

   My question is, if you actually care about the performance of this
system, why are you buying some slow devices to drag the performance
of your fast devices down? It seems like a recipe for disaster...

 There's a second level of allocation (which I haven't looked at at
 all), which is how the FS decides where to put data within the
 allocated block groups. I think it will almost certainly be beneficial
 in your case to use prealloc extents, which will turn your continuous
 write into large contiguous sections of striping.
 
 Why does prealloc change anything? For me latency does not matter, only
 continuous troughput!
 
 It makes the extent allocation algorithm much simpler, because it
 can then allocate in larger chunks and do more linear writes
 
 Is this still true if I do very large writes? Or do those get broken down by
 the kernel somewhere?

   I guess it'll depend on the approach you use to do these very
large writes, and on the exact definition of very large. This is
not an area I know a huge amount about.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- I am the author. You are the audience. I outrank you! --- 




Re: Btrfs raid allocator

2014-05-06 Thread Hendrik Siedelmann

On 06.05.2014 13:46, Hugo Mills wrote:

On Tue, May 06, 2014 at 01:26:44PM +0200, Hendrik Siedelmann wrote:

On 06.05.2014 13:19, Hugo Mills wrote:

On Tue, May 06, 2014 at 01:14:26PM +0200, Hendrik Siedelmann wrote:

On 06.05.2014 12:59, Hugo Mills wrote:

On Tue, May 06, 2014 at 12:41:38PM +0200, Hendrik Siedelmann wrote:

Hello all!

I would like to use btrfs (or anyting else actually) to maximize raid0
performance. Basically I have a relatively constant stream of data that
simply has to be written out to disk. So my question is, how is the block
allocator deciding on which device to write, can this decision be dynamic
and could it incorporate timing/troughput decisions? I'm willing to write
code, I just have no clue as to how this works right now. I read somewhere
that the decision is based on free space, is this still true?


For (current) RAID-0 allocation, the block group allocator will use
as many chunks as there are devices with free space (down to a minimum
of 2). Data is then striped across those chunks in 64 KiB stripes.
Thus, the first block group will be N GiB of usable space, striped
across N devices.


So do I understand this correctly that (assuming we have enough space) data
will be spread equally between the disks independend of write speeds? So one
slow device would slow down the whole raid?


Yes. Exactly the same as it would be with DM RAID-0 on the same
configuration. There's not a lot we can do about that at this point.


So striping is fixed but which disk takes part with a chunk is dynamic? But
for large workloads slower disks could 'skip a chunk' as chunk allocation is
dynamic, correct?


You'd have to rewrite the chunk allocator to do this, _and_ provide
different RAID levels for different subvolumes. The chunk/block group
allocator right now uses only one rule for allocating data, and one
for allocating metadata. Now, both of these are planned, and _might_
between them possibly cover the use-case you're talking about, but I'm
not certain it's necessarily a sensible thing to do in this case.


But what does the allocator currently do when one disk runs out of 
space? I thought that disk simply stops getting new chunks while we can 
still write data. So the mechanism is already there; it just needs to be 
invoked when a drive is too busy instead of too full.



My question is, if you actually care about the performance of this
system, why are you buying some slow devices to drag the performance
of your fast devices down? It seems like a recipe for disaster...


Even the speed of a single hdd varies depending on where I write the 
data. So actually there is not much choice :-D.
I'm aware that this could be a case of overengineering. Actually my 
first thought was to write a simple fuse module which only handles data 
and puts metadata on a regular filesystem. But then I thought that it 
would be nice to have this in btrfs - and not just for raid0.



There's a second level of allocation (which I haven't looked at at
all), which is how the FS decides where to put data within the
allocated block groups. I think it will almost certainly be beneficial
in your case to use prealloc extents, which will turn your continuous
write into large contiguous sections of striping.


Why does prealloc change anything? For me latency does not matter, only
continuous troughput!


It makes the extent allocation algorithm much simpler, because it
can then allocate in larger chunks and do more linear writes


Is this still true if I do very large writes? Or do those get broken down by
the kernel somewhere?


I guess it'll depend on the approach you use to do these very
large writes, and on the exact definition of very large. This is
not an area I know a huge amount about.

Hugo.


Never mind I'll just try it out!

Hendrik



Re: Btrfs raid allocator

2014-05-06 Thread Duncan
Hendrik Siedelmann posted on Tue, 06 May 2014 12:41:38 +0200 as excerpted:

 I would like to use btrfs (or anyting else actually) to maximize raid0
 performance. Basically I have a relatively constant stream of data that
 simply has to be written out to disk.

If flexible parallelization is all you're worried about, not data 
integrity or the other things btrfs does, I'd suggest looking at md- or 
dm-raid.  They're more mature and less complex than btrfs, and if you're 
not using btrfs's other features anyway, they should simply work better 
for your use-case.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: Btrfs raid allocator

2014-05-06 Thread Chris Murphy

On May 6, 2014, at 4:41 AM, Hendrik Siedelmann 
hendrik.siedelm...@googlemail.com wrote:

 Hello all!
 
 I would like to use btrfs (or anyting else actually) to maximize raid0 
 performance. Basically I have a relatively constant stream of data that 
 simply has to be written out to disk. 

I think the only way to know what works best for your workload is to test 
configurations with the actual workload. For optimization of multiple device 
file systems, it's hard to beat XFS on raid0 or even linear/concat due to its 
parallelization, if you have more than one stream (or a stream that produces a 
lot of files that XFS can allocate into separate allocation groups). Also mdadm 
supports user-specified strip/chunk sizes, whereas currently on Btrfs this is 
fixed to 64KiB. Depending on the file size for your workload, it's possible a 
much larger strip will yield better performance.
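
Something like this, for instance -- the devices and the 512KiB chunk
size are arbitrary examples:

    mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=512 /dev/sd[b-e]
    mkfs.xfs /dev/md0    # mkfs.xfs picks up the md stripe geometry itself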

Another optimization is hardware RAID with a battery-backed write cache (the 
drives' write caches are disabled) and using the nobarrier mount option. If your 
workload supports linear/concat then it's fine to use md linear for this. What 
I'm not sure of is whether it's OK practice to disable barriers if the system is 
on a UPS (rather than a battery-backed hardware RAID cache). You should post 
the workload and hardware details on the XFS list to get suggestions about such 
things. They'll also likely recommend the deadline scheduler over cfq.
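
Both of those are runtime switches, e.g. (the device and mount point are
examples only):

    echo deadline > /sys/block/sdb/queue/scheduler
    mount -o remount,nobarrier /mnt/scratch   # only with a protected write path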

Unless you have a workload really familiar to the responder, they'll tell you 
any benchmarking you do needs to approximate the actual workflow. A benchmark 
mismatched to the workload will lead you to the wrong conclusions. Typically 
when you optimize for a particular workload, other workloads suffer.

Chris Murphy


Re: Btrfs raid allocator

2014-05-06 Thread Hendrik Siedelmann

On 06.05.2014 23:49, Chris Murphy wrote:


On May 6, 2014, at 4:41 AM, Hendrik Siedelmann
hendrik.siedelm...@googlemail.com wrote:


Hello all!

I would like to use btrfs (or anyting else actually) to maximize
raid0 performance. Basically I have a relatively constant stream of
data that simply has to be written out to disk.


I think the only way to know what works best for your workload is to
test configurations with the actual workload. For optimization of
multiple device file systems, it's hard to beat XFS on raid0 or even
linear/concat due to its parallelization, if you have more than one
stream (or a stream that produces a lot of files that XFS can
allocate into separate allocation groups). Also mdadm supports use
specified strip/chunk sizes, whereas currently on Btrfs this is fixed
to 64KiB. Depending on the file size for your workload, it's possible
a much larger strip will yield better performance.


Thanks, that's quite a few knobs I can try out - I just have a lot of 
data, at a rate of up to 450MB/s, that I want to write out in time, 
preferably without having to rely on too-expensive hardware.



Another optimization is hardware RAID with a battery backed write
cache (the drives' write cache are disabled) and using nobarrier
mount option. If your workload supports linear/concat then it's fine
to use md linear for this. What I'm not sure of is if it's an OK
practice to disable barriers if the system is on a UPS (rather than a
battery backed hardware RAID cache). You should post the workload and
hardware details on the XFS list to get suggestions about such
things. They'll also likely recommend the deadline scheduler over
cfq.


Actually data integrity does not matter for this workload. If everything 
is successful the result will be backed up - before that, full filesystem 
corruption is acceptable as a failure mode.



Unless you have a workload really familiar to the responder, they'll
tell you any benchmarking you do needs to approximate the actual
workflow. A mismatched benchmark to the workload will lead you to the
wrong conclusions. Typically when you optimize for a particular
workload, other workloads suffer.

Chris Murphy



Thanks again for all the infos! I'll get back if everything works fine - 
or if it doesn't ;-)


Cheers
Hendrik


Re: btrfs-RAID(3 or 5/6/etc) like btrfs-RAID1?

2014-02-13 Thread Hugo Mills
On Thu, Feb 13, 2014 at 11:32:03AM -0500, Jim Salter wrote:
 That is FANTASTIC news.  Thank you for wielding the LART gently. =)

   No LART necessary. :) Nobody knows everything, and it's not a
particularly heavily-documented or written-about feature at the moment
(mostly because it only exists in Chris's local git repo).

 I do a fair amount of public speaking and writing about next-gen
 filesystems (example: 
 http://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cows-inside-next-gen-filesystems/)
 and I will be VERY sure to talk about the upcoming divorce of stripe
 size from array size in future presentations.  This makes me
 positively giddy.
 
 FWIW, after writing the above article I got contacted by a
 proprietary storage vendor who wanted to tell me all about his
 midmarket/enterprise product, and he was pretty audibly flummoxed
 when I explained how btrfs-RAID1 distributes data and redundancy -
 his product does something similar (to be fair, his product also
 does a lot of other things btrfs doesn't inherently do, like
 clustered storage and synchronous dedup), and he had no idea that
 anything freely available did anything vaguely like it.

   That's quite entertaining for the bogglement factor. Although,
again, see my comment above...

   Hugo.

 I have a feeling the storage world - even the relatively
 well-informed part of it that's aware of ZFS - has little to no
 inclination how gigantic of a splash btrfs is going to make when it
 truly hits the mainstream.
 
 This could be a pretty powerful setup IMO - if you implemented
 something like this, you'd be able to arbitrarily define your
 storage efficiency (percentage of parity blocks / data blocks) and
 your fault-tolerance level (how many drives you can afford to lose
 before failure) WITHOUT tying it directly to your underlying disks,
 or necessarily needing to rebalance as you add more disks to the
 array.  This would be a heck of a lot more flexible than ZFS'
 approach of adding more immutable vdevs.
 
 Please feel free to tell me why I'm dumb for either 1. not realizing
 the obvious flaw in this idea or 2. not realizing it's already being
 worked on in exactly this fashion. =)
 The latter. :)
 

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- Nothing right in my left brain. Nothing left in --- 
 my right brain. 




Re: btrfs-RAID(3 or 5/6/etc) like btrfs-RAID1?

2014-02-13 Thread Goffredo Baroncelli
Hi Jim,
On 02/13/2014 05:13 PM, Jim Salter wrote:
 This might be a stupid question but...

There are no stupid questions, only stupid answers...

 
 Are there any plans to make parity RAID levels in btrfs similar to
 the current implementation of btrfs-raid1?
 
 It took me a while to realize how different and powerful btrfs-raid1
 is from traditional raid1.  The ability to string together virtually
 any combination of mutt hard drives together in arbitrary ways and
 yet maintain redundancy is POWERFUL, and is seriously going to be a
 killer feature advancing btrfs adoption in small environments.
 
 The one real drawback to btrfs-raid1 is that you're committed to n/2
 storage efficiency, since you're using pure redundancy rather than
 parity on the array.  I was thinking about that this morning, and
 suddenly it occurred to me that you ought to be able to create a
 striped parity array in much the same way as a btrfs-raid1 array.
 
 Let's say you have five disks, and you arbitrarily want to define a
 stripe length of four data blocks plus one parity block per stripe.

In what way is it different from a raid5 setup (which is supported by btrfs)?

 Right now, what you're looking at effectively amounts to a RAID3
 array, like FreeBSD used to use.  But, what if we add two more disks?
 Or three more disks? Or ten more?  Is there any reason we can't keep
 our stripe length of four blocks + one parity block, and just
 distribute them relatively ad-hoc in the same way btrfs-raid1
 distributes redundant data blocks across an ad-hoc array of disks?
 
 This could be a pretty powerful setup IMO - if you implemented
 something like this, you'd be able to arbitrarily define your storage
 efficiency (percentage of parity blocks / data blocks) and your
 fault-tolerance level (how many drives you can afford to lose before
 failure) WITHOUT tying it directly to your underlying disks

Maybe it is a good idea, but what would be the advantage of 
using fewer drives than the available ones for a RAID?

Regarding the fault tolerance level, a few weeks ago there was a 
posting about a kernel library which would provide a generic
RAID framework capable of several degrees of fault tolerance 
(raid 5, 6, 7...) [have a look at 
[RFC v4 2/3] fs: btrfs: Extends btrfs/raid56 to 
support up to six parities, 2014/1/25]. This would definitely be a
big leap forward.

BTW, the raid5/raid6 support in BTRFS is only for testing purposes. 
However, Chris Mason said a few weeks ago that he will work on these
issues.

[...]
 necessarily needing to rebalance as you add more disks to the array.
 This would be a heck of a lot more flexible than ZFS' approach of
 adding more immutable vdevs.

There is no need to re-balance if you add more drives. The next 
chunk allocation will span all the available drives anyway. A balance is only 
required when you want to spread the data already written across all the drives.
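
I.e. (purely illustrative device and mount point):

    btrfs device add /dev/sde /mnt    # new chunks now get allocated here too
    btrfs balance start /mnt          # only to respread what was already written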

Regards
Goffredo


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: btrfs-RAID(3 or 5/6/etc) like btrfs-RAID1?

2014-02-13 Thread Hugo Mills
On Thu, Feb 13, 2014 at 09:22:07PM +0100, Goffredo Baroncelli wrote:
 Hi Jim,
 On 02/13/2014 05:13 PM, Jim Salter wrote:
  Let's say you have five disks, and you arbitrarily want to define a
  stripe length of four data blocks plus one parity block per stripe.
 
 In what way is it different from a raid5 setup (which is supported by btrfs)?

   With what's above, yes, that's the current RAID-5 code.

  Right now, what you're looking at effectively amounts to a RAID3
  array, like FreeBSD used to use.  But, what if we add two more disks?
  Or three more disks? Or ten more?  Is there any reason we can't keep
  our stripe length of four blocks + one parity block, and just
  distribute them relatively ad-hoc in the same way btrfs-raid1
  distributes redundant data blocks across an ad-hoc array of disks?
  
  This could be a pretty powerful setup IMO - if you implemented
  something like this, you'd be able to arbitrarily define your storage
  efficiency (percentage of parity blocks / data blocks) and your
  fault-tolerance level (how many drives you can afford to lose before
  failure) WITHOUT tying it directly to your underlying disks
 
 May be that it is a good idea, but which would be the advantage to 
 use less drives that the available ones for a RAID ?

   Performance, plus the ability to handle different sized drives.
Hmm... maybe I should do an optimise option for the space planner...

 Regarding the fault tolerance level, few weeks ago there was a 
 posting about a kernel library which would provide a generic
 RAID framework capable of several degree of fault tolerance 
 (raid 5,6,7...) [give a look to 
 [RFC v4 2/3] fs: btrfs: Extends btrfs/raid56 to 
 support up to six parities 2014/1/25]. This definitely would be a
 big leap forward.
 
 BTW, the raid5/raid6 support in BTRFS is only for testing purpose. 
 However Chris Mason, told few week ago that he will work on these
 issues.
 
 [...]
  necessarily needing to rebalance as you add more disks to the array.
  This would be a heck of a lot more flexible than ZFS' approach of
  adding more immutable vdevs.
 
 There is no needing to re-balance if you add more drives. The next 
 chunk allocation will span all the available drives anyway. It is only 
 required when you want to spans all data already written on all the drives.

   The balance opens up more usable space, unless the new device is
(some nasty function of) the remaining free space on the other drives.
It's not necessarily about spanning the data, although that's an
effect, too.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- It used to take a lot of talent and a certain type of ---  
upbringing to be perfectly polite and have filthy manners
at the same time. Now all it needs is a computer.




Re: btrfs-RAID(3 or 5/6/etc) like btrfs-RAID1?

2014-02-13 Thread Hugo Mills
On Thu, Feb 13, 2014 at 11:13:58AM -0500, Jim Salter wrote:
 This might be a stupid question but...
 
 Are there any plans to make parity RAID levels in btrfs similar to
 the current implementation of btrfs-raid1?

   Yes.

 It took me a while to realize how different and powerful btrfs-raid1
 is from traditional raid1.  The ability to string together virtually
 any combination of mutt hard drives together in arbitrary ways and
 yet maintain redundancy is POWERFUL, and is seriously going to be a
 killer feature advancing btrfs adoption in small environments.
 
 The one real drawback to btrfs-raid1 is that you're committed to n/2
 storage efficiency, since you're using pure redundancy rather than
 parity on the array.  I was thinking about that this morning, and
 suddenly it occurred to me that you ought to be able to create a
 striped parity array in much the same way as a btrfs-raid1 array.
 
 Let's say you have five disks, and you arbitrarily want to define a
 stripe length of four data blocks plus one parity block per
 stripe.  Right now, what you're looking at effectively amounts to
 a RAID3 array, like FreeBSD used to use.  But, what if we add two
 more disks? Or three more disks? Or ten more?  Is there any reason
 we can't keep our stripe length of four blocks + one parity block,
 and just distribute them relatively ad-hoc in the same way
 btrfs-raid1 distributes redundant data blocks across an ad-hoc array
 of disks?

   None whatsoever.

 This could be a pretty powerful setup IMO - if you implemented
 something like this, you'd be able to arbitrarily define your
 storage efficiency (percentage of parity blocks / data blocks) and
 your fault-tolerance level (how many drives you can afford to lose
 before failure) WITHOUT tying it directly to your underlying disks,
 or necessarily needing to rebalance as you add more disks to the
 array.  This would be a heck of a lot more flexible than ZFS'
 approach of adding more immutable vdevs.
 
 Please feel free to tell me why I'm dumb for either 1. not realizing
 the obvious flaw in this idea or 2. not realizing it's already being
 worked on in exactly this fashion. =)

   The latter. :)

   One of the (many) existing problems with the parity RAID
implementation as it is is that with large numbers of devices, it
becomes quite inefficient to write data when you (may) need to modify
dozens of devices. It's been Chris's stated intention for a while now
to allow a bound to be placed on the maximum number of devices per
stripe, which allows a degree of control over the space-yield -
performance knob.

   It's one of the reasons that the usage tool [1] has a maximum
stripes knob on it -- so that you can see the behaviour of the FS
once that feature's in place.

   Hugo.

[1] http://carfax.org.uk/btrfs-usage/

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- Nothing right in my left brain. Nothing left in --- 
 my right brain. 




Re: btrfs-RAID(3 or 5/6/etc) like btrfs-RAID1?

2014-02-13 Thread Jim Salter

That is FANTASTIC news.  Thank you for wielding the LART gently. =)

I do a fair amount of public speaking and writing about next-gen 
filesystems (example: 
http://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cows-inside-next-gen-filesystems/) 
and I will be VERY sure to talk about the upcoming divorce of stripe 
size from array size in future presentations.  This makes me positively 
giddy.


FWIW, after writing the above article I got contacted by a proprietary 
storage vendor who wanted to tell me all about his midmarket/enterprise 
product, and he was pretty audibly flummoxed when I explained how 
btrfs-RAID1 distributes data and redundancy - his product does something 
similar (to be fair, his product also does a lot of other things btrfs 
doesn't inherently do, like clustered storage and synchronous dedup), 
and he had no idea that anything freely available did anything vaguely 
like it.


I have a feeling the storage world - even the relatively well-informed 
part of it that's aware of ZFS - has little to no inkling of how 
gigantic a splash btrfs is going to make when it truly hits the 
mainstream.



This could be a pretty powerful setup IMO - if you implemented
something like this, you'd be able to arbitrarily define your
storage efficiency (percentage of parity blocks / data blocks) and
your fault-tolerance level (how many drives you can afford to lose
before failure) WITHOUT tying it directly to your underlying disks,
or necessarily needing to rebalance as you add more disks to the
array.  This would be a heck of a lot more flexible than ZFS'
approach of adding more immutable vdevs.

Please feel free to tell me why I'm dumb for either 1. not realizing
the obvious flaw in this idea or 2. not realizing it's already being
worked on in exactly this fashion. =)

The latter. :)




Re: btrfs raid multiple devices IO utilisation

2013-12-13 Thread Duncan
Martin posted on Thu, 12 Dec 2013 13:39:00 + as excerpted:

 Some time back, I noticed that with a two HDD btrfs raid1, some tasks
 suffered ALL the IO getting choked onto just one HDD!
 
 That turned out to be a feature of the btrfs code whereby a device is
 chosen depending on the process ID. For some cases such as in a bash
 loop, the PID increments by two for each iteration and so only one HDD
 ever gets hit...

Unfortunately, yes...

 So... Running with a 3-disk btrfs raid1 and... I still see the same
 problem for such as compile tasks where only one of the three disks is
 maxed out for periods with the other two disks left nearly idle.

It's worth noting that unfortunately, btrfs raid1 mode isn't real raid1 
at this point, at least not in the N-way-mirrored sense.  It's only two-
way-mirrored, regardless of the number of devices you throw at it, tho N-
way mirroring is roadmapped for introduction after raid5/6 functionality 
is fully implemented.  (The current raid5/6 implementation is missing 
some bits and is not considered production usable, as it doesn't handle 
rebuilds and scrubs well yet.)

So I'm guessing your 3-device btrfs raid1 is still stuck on the same even/
odd PID problem as before.
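
It's easy to see from userspace: historically the raid1 read mirror is 
picked as pid modulo the number of copies, so whether a loop alternates 
devices depends only on the parity of the PIDs it spawns.  For example 
(illustrative only):

    for i in 1 2 3 4; do sh -c 'echo $$'; done
    # if the printed PIDs all share the same parity, every read in that
    # loop hits the same mirror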

 Perhaps?...
 
 Is there an easy fix in the code to allocate IO according to the
 following search order:
 
 Last used disk with an IO queue  2 items;
 
 Any disk with an IO queue  2 items;
 
 Whichever disk with least queued items.

That does sound reasonable here.

FWIW, however, based on previous discussion here, the devs are aware of 
the alternating PID problem, and I /believe/ someone's already working on 
an alternate implementation using something else.  I think the current 
PID-based selector code was simply a first implementation to get 
something out there; not really intended to be a final working solution. 
IDR whether anything I've read discussed what algorithm they're working 
on, but given the sense your idea seems to make, at least at first glance 
at my sysadmin level of understanding, I wouldn't be surprised if the new 
solution does look something like it.  Of course that's if the queue 
length is reasonably accessible to btrfs, as you already asked in the bit 
I snipped as out of my knowledgeable reply range.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: btrfs raid multiple devices IO utilisation

2013-12-13 Thread Chris Murphy

On Dec 12, 2013, at 6:39 AM, Martin m_bt...@ml1.co.uk wrote:

 Some time back, I noticed that with a two HDD btrfs raid1, some tasks
 suffered ALL the IO getting choked onto just one HDD!
 
 
 That turned out to be a feature of the btrfs code whereby a device is
 chosen depending on the process ID.

I wonder how much btrfs parallelism is affected by the default CFQ scheduler? 
According to the XFS FAQ, the default I/O scheduler, CFQ, will defeat much of 
the parallelization in XFS. There's also quite a bit of parallelization in Btrfs, 
so it too may be affected, in particular in multiple-device scenarios.

Chris Murphy



Re: Btrfs RAID space utilization and bitrot reconstruction

2012-07-02 Thread Martin Steigerwald
On Sunday, 1 July 2012, Waxhead wrote:
 As far as I understand btrfs stores all data in huge chunks that are 
 striped, mirrored or raid5/6'ed throughout all the disks added to
 the  filesystem/volume.

Not through all disks. At least not with the current RAID-1 
implementation. It stores two copies of a chunk, no matter how many drives 
you use.

For the rest, see Hugo's answer.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


Re: Btrfs RAID space utilization and bitrot reconstruction

2012-07-01 Thread Hugo Mills
On Sun, Jul 01, 2012 at 01:50:39PM +0200, Waxhead wrote:
 As far as I understand btrfs stores all data in huge chunks that are
 striped, mirrored or raid5/6'ed throughout all the disks added to
 the filesystem/volume.

   Well, RAID-5/6 hasn't landed yet, but yes.

 How does btrfs deal with different sized disks? let's say that you
 for example have 10 different disks that are
 100GB,200GB,300GB...1000GB and you create a btrfs filesystem with
 all the disks. How will the raid5 implementation distribute chunks
 in such a setup.

   We haven't seen the code for that bit yet.

 I assume the stripe+stripe+parity are separate chunks that are
 placed on separate disks but how does btrfs select the best disk to
 store a chunk on? In short will a slow disk slow down the entire
 array, parts of it or will btrfs attempt to use the fastest disks
 first?

   Chunks are allocated by ordering the devices by the amount of free
(=unallocated) space left on each, and picking the chunks from devices
in that order. For RAID-1 chunks are picked in pairs. For RAID-0, as
many as possible are picked, down to a minimum of 2 (I think). For
RAID-10, the largest even number possible is picked, down to a minimum
of 4. I _believe_ that RAID-5 and -6 will pick as many as possible,
down to some minimum -- but as I said, we haven't seen the code yet.
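
As a toy illustration of that ordering (not the real allocator, just the
sort-by-unallocated-space-and-take-the-first-N idea):

#include <stdio.h>
#include <stdlib.h>

struct dev { int id; long long unallocated; };

/* Sort devices by unallocated space, largest first. */
static int by_free_desc(const void *a, const void *b)
{
    const struct dev *x = a, *y = b;
    return (y->unallocated > x->unallocated) -
           (y->unallocated < x->unallocated);
}

/* Pick the 'want' devices with the most unallocated space, e.g.
 * want = 2 for a RAID-1 chunk. */
static void pick_chunk_devices(struct dev *devs, int ndevs, int want)
{
    qsort(devs, ndevs, sizeof(*devs), by_free_desc);
    for (int i = 0; i < want && i < ndevs; i++)
        printf("allocate stripe on dev %d (%lld free)\n",
               devs[i].id, devs[i].unallocated);
}

int main(void)
{
    struct dev devs[] = { {1, 100}, {2, 300}, {3, 200} };
    pick_chunk_devices(devs, 3, 2);   /* RAID-1: two copies */
    return 0;
}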

 Also since btrfs checksums both data and metadata I am thinking that
 at least the raid6 implementation perhaps can (try to) reconstruct
 corrupt data (and try to rewrite it) before reading an alternate
 copy. Can someone please fill me in on the details here?

   Yes, it should be possible to do that with RAID-5 as well. (Read
the data stripes, verify checksums, if one fails, read the parity,
verify that, and reconstruct the bad block from the known-good data).
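
The reconstruction step itself is just an XOR across the surviving blocks
plus parity; roughly like this (checksum handling omitted, block sizes
shrunk for readability):

#include <stdio.h>
#include <string.h>

#define STRIPE_LEN 8   /* tiny blocks so the demo stays readable */

/* Rebuild the block at index 'bad' by XOR-ing the parity block with
 * every surviving data block in the stripe. */
static void raid5_rebuild(unsigned char blocks[][STRIPE_LEN], int nblocks,
                          const unsigned char *parity, int bad)
{
    memcpy(blocks[bad], parity, STRIPE_LEN);
    for (int i = 0; i < nblocks; i++) {
        if (i == bad)
            continue;
        for (int j = 0; j < STRIPE_LEN; j++)
            blocks[bad][j] ^= blocks[i][j];
    }
}

int main(void)
{
    unsigned char blocks[2][STRIPE_LEN] = { "dataAAA", "dataBBB" };
    unsigned char parity[STRIPE_LEN];

    /* Parity is the XOR of the data blocks. */
    for (int j = 0; j < STRIPE_LEN; j++)
        parity[j] = blocks[0][j] ^ blocks[1][j];

    memset(blocks[1], 0, STRIPE_LEN);          /* "lose" block 1... */
    raid5_rebuild(blocks, 2, parity, 1);       /* ...and rebuild it */
    printf("recovered: %s\n", blocks[1]);      /* prints dataBBB */
    return 0;
}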

 Finally, how does btrfs deal with advanced format (4k sectors) drives
 when the entire drive (and not a partition) is used to build a btrfs
 filesystem? Is proper alignment achieved?

   I don't know about that. However, the native block size in btrfs is
4k, so I'd imagine that it's all good.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- You stay in the theatre because you're afraid of having no ---
 money? There's irony... 




Re: btrfs RAID with RAID cards (thread renamed)

2012-05-18 Thread Daniel Pocock

 - if a non-RAID SAS card is used, does it matter which card is chosen?
 Does btrfs work equally well with all of them?
 
 If you're using btrfs RAID, you need a HBA, not a RAID card. If the RAID 
 card can work as a HBA (usually labelled as JBOD mode) then you're good to 
 go.
 
 For example, HP CCISS controllers can't work in JBOD mode.

Would you know if they implement their own checksumming, similar to what
btrfs does?  Or, if someone uses SmartArray (CCISS) RAID1, do they
simply not get the full benefit of checksumming under any possible
configuration?

I've had a quick look at what is on the market, here are some observations:

- in many cases, IOPS (critical for SSDs) vary wildly: e.g.
  - SATA-3 SSDs advertise up to 85k IOPS, so RAID1 needs 170k IOPS
  - HP's standard HBAs don't support high IOPS
  - HP Gen8 SmartArray (e.g. P420) claims up to 200k IOPS
  - previous HP arrays (e.g. P212) support only 60k IOPS
  - many vendors don't advertise the IOPS prominently - I had to Google
the HP site to find those figures quoted in some PDFs, they don't quote
them in the quickspecs or product summary tables

- Adaptec now offers an SSD caching function in hardware; supposedly you
drop it in the machine and all disks respond faster
  - how would this interact with btrfs checksumming?  E.g. I'm guessing
it would be necessary to ensure that data from both spindles is not
cached on the same SSD?
  - I started thinking about the possibility that data is degraded on
the mechanical disk but btrfs gets a good checksum read from the SSD and
remains blissfully unaware that the real disk is failing. Then the other
disk goes completely offline one day, and for whatever reason the data is
not in the SSD cache and the sector can't be read reliably from the
remaining physical disk. Should such caching just be avoided, or can it
be managed from btrfs itself in a manner that is foolproof?

How about the combination of btrfs/root/boot filesystems and grub?  Can
they all play nicely together?  This seems to be one compelling factor
with hardware RAID, the cards have a BIOS that can boot from any drive
even if the other is offline.






Re: btrfs RAID with enterprise SATA or SAS drives

2012-05-14 Thread Duncan
Martin Steigerwald posted on Fri, 11 May 2012 18:58:05 +0200 as excerpted:

 On Friday, 11 May 2012, Duncan wrote:
 Daniel Pocock posted on Wed, 09 May 2012 22:01:49 + as excerpted:
  There is various information about - enterprise-class drives

 This isn't a direct answer to that, but expressing a bit of concern
 over  the implications of your question, that you're planning on using
 btrfs in an enterprise class installation.

 [In] mainline Linux kernel terms, btrfs remains very much an
 experimental filesystem

 On an experimental filesystem under as intense continued development as
 btrfs, by contrast, it's best to consider your btrfs copy an extra
 throwaway copy only intended for testing.  You still have your
 primary copy, along with all the usual backups, on something less
 experimental, since you never know when/where/ how your btrfs testing
 will screw up its copy.
 
  Duncan, did you actually test BTRFS? Theory can't replace real life
  experience.

I /had/ been waiting for the n-way-mirrored raid1 roadmapped for after
raid5/6 mode (which should hit 3.5, I believe), but hardware issues
intervened and I'm no longer using those older 4-way md/raid drives as
primary.

And now that I have it, present personal experience does not contradict
what I posted.  btrfs does indeed work reasonably well under reasonably
good, non-stressful, conditions.  But my experience so far aligns quite
well with the "consider the btrfs copy a throw-away copy, just in case"
recommendation.  Just because it's a throw-away copy doesn't mean you'll
have to resort to the good copy elsewhere, but it DOES hopefully
mean that you'll have both a good copy elsewhere, and a backup for that
supposedly good copy, just in case btrfs does go bad
and that supposedly good primary copy ends up not being good after all.

 From all of my personal BTRFS installations not one has gone corrupt -
 and I have at least four, while more of them are in use at my employer.
 Except maybe a scratch-data BTRFS RAID 0 over lots of SATA disks. But
 maybe it would have been fixable by btrfs-zero-log, which I didn't know
 of back then. Another one needed a btrfs-zero-log, but that was quite
 some time ago.
 
 Some of the installations are in use for more than a year AFAIR.
 
 While I would still be reluctant with deploying BTRFS for a customer for
 critical data

This was actually my point in this thread.  If someone's asking questions
about enterprise quality hardware, they're not likely to run into some of
the bugs I've been having recently that have been exposed by hardware
issues.  However, they're also far more likely to be considering btrfs for
a row-of-nines uptime application, which is, after all, where some of
btrfs' features are normally found.  Regardless of whether btrfs is past 
the "throw away data" experimental-class stage or not, I think we both 
agree it isn't ready for row-of-nines-uptime applications just yet.  If
he's just testing btrfs on such equipment for possible future
row-of-nines-uptime deployment a year or possibly two out, great.  If he's
looking at such a deployment two-months-out, no way, and it looks like you
agree.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: btrfs RAID with enterprise SATA or SAS drives

2012-05-10 Thread Hubert Kario
On Wednesday 09 of May 2012 22:01:49 Daniel Pocock wrote:
 There is various information about
 - enterprise-class drives (either SAS or just enterprise SATA)
 - the SCSI/SAS protocols themselves vs SATA
 having more advanced features (e.g. for dealing with error conditions)
 than the average block device
 
 For example, Adaptec recommends that such drives will work better with
 their hardware RAID cards:
 
 http://ask.adaptec.com/cgi-bin/adaptec_tic.cfg/php/enduser/std_adp.php?p_faqid=14596
 Desktop class disk drives have an error recovery feature that
 will result in a continuous retry of the drive (read or write) when an
 error is encountered, such as a bad sector. In a RAID array this can
 cause the RAID controller to time-out while waiting for the drive to
 respond.
 
 and this blog:
 http://www.adaptec.com/blog/?p=901
 major advantages to enterprise drives (TLER for one) ... opt for the
 enterprise drives in a RAID environment no matter what the cost of the
 drive over the desktop drive
 
 My question..
 
 - does btrfs RAID1 actively use the more advanced features of these
 drives, e.g. to work around errors without getting stuck on a bad block?

There are no (short) timeouts that I know of
 
 - if a non-RAID SAS card is used, does it matter which card is chosen?
 Does btrfs work equally well with all of them?

If you're using btrfs RAID, you need a HBA, not a RAID card. If the RAID 
card can work as a HBA (usually labelled as JBOD mode) then you're good to 
go.

For example, HP CCISS controllers can't work in JBOD mode.

If you're using the RAID feature of the card, then you need to look at 
general Linux support, btrfs doesn't do anything other FS don't do with the 
block devices.
 
 - ignoring the better MTBF and seek times of these drives, do any of the
 other features passively contribute to a better RAID experience when
 using btrfs?

whether they really have high MTBF values is debatable...

seek times do matter very much to btrfs; a fast CPU is also a good thing to 
have with btrfs, especially if you want to use data compression or high node 
or leaf sizes

 - for someone using SAS or enterprise SATA drives with Linux, I
 understand btrfs gives the extra benefit of checksums, are there any
 other specific benefits over using mdadm or dmraid?

Because btrfs knows when the drive is misbehaving (because of checksums) 
and is returning bad data, it can detect problems much faster than RAID 
(which doesn't use the redundancy for checking whether the data it's returning 
is actually correct). Both hardware and software RAID implementations depend on 
the drives to return IO errors. In effect, the data is safer on btrfs than 
regular RAID.
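
Conceptually the read path is "verify against the stored checksum, and
only fall back to the other copy on a mismatch"; a self-contained toy of
that pattern (the checksum here is a simple stand-in, not btrfs' actual
crc32c):

#include <stdio.h>
#include <string.h>

#define BLK 8

/* Toy stand-in for btrfs' crc32c: any checksum works to show the idea. */
static unsigned int toy_csum(const unsigned char *b, size_t n)
{
    unsigned int c = 0;
    for (size_t i = 0; i < n; i++)
        c = c * 31 + b[i];
    return c;
}

int main(void)
{
    unsigned char mirror[2][BLK] = { "GOOD---", "GOOD---" };
    unsigned int expected = toy_csum(mirror[0], BLK);   /* stored csum */

    mirror[0][0] = 'X';   /* silent corruption on the first copy */

    /* Read path: try each copy, trust only the one whose csum matches. */
    for (int m = 0; m < 2; m++) {
        if (toy_csum(mirror[m], BLK) == expected) {
            printf("served good data from mirror %d\n", m);
            return 0;
        }
        printf("checksum mismatch on mirror %d, trying next copy\n", m);
    }
    return 1;   /* both copies bad: would be reported as an I/O error */
}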

Besides that, there's online resize (both shrinking and extending) and the 
(currently not implemented) ability to set the redundancy level on a per-file 
basis. In other words, with btrfs you will be able to have a file with RAID6 
redundancy and a second one with RAID10-level redundancy in a single directory.

Regards,
-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl


Re: btrfs RAID with enterprise SATA or SAS drives

2012-05-10 Thread Duncan
Daniel Pocock posted on Wed, 09 May 2012 22:01:49 + as excerpted:

 There is various information about
 - enterprise-class drives (either SAS or just enterprise SATA)
 - the SCSI/SAS protocols themselves vs SATA having more advanced
 features (e.g. for dealing with error conditions)
 than the average block device

This isn't a direct answer to that, but expressing a bit of concern over 
the implications of your question, that you're planning on using btrfs in 
an enterprise class installation.

While various Enterprise Linux distributions do now officially support 
btrfs, it's worth checking out exactly what that means in practice.

Meanwhile, in mainline Linux kernel terms, btrfs remains very much an 
experimental filesystem, as expressed by the kernel config option that 
turns btrfs on.  It's still under very intensive development, with an 
error-fixing btrfsck only recently available and still coming with its 
own "may make the problems worse instead of fixing them" warning.  
Testers willing to risk the chance of data loss implied by that 
"experimental filesystem" label should be running the latest stable 
kernel at the oldest, and preferably the rcs by rc5 or so, as new kernels 
continue to fix problems in older btrfs code as well as introduce new 
features; if you're running an older kernel, that means you're running 
a kernel with known problems that are fixed in the latest kernel.

Experimental also has implications in terms of backups.  A good sysadmin 
always has backups, but normally, the working copy can be considered the 
primary copy, and there's backups of that.  On an experimental filesystem 
under as intense continued development as btrfs, by contrast, it's best 
to consider your btrfs copy an extra throwaway copy only intended for 
testing.  You still have your primary copy, along with all the usual 
backups, on something less experimental, since you never know when/where/
how your btrfs testing will screw up its copy.

That's not normally the kind of filesystem enterprise class users are 
looking for, unless of course they're doing longer term testing, with an 
intent to actually deploy perhaps a year out, if the testing proves it 
robust enough by then.

And while it's still experimental ATM, btrfs /is/ fast improving.  It 
/does/ now have a working fsck, even if it still comes with warnings, 
and  reasonable feature-set build-out should be within a few more kernels 
(raid5/6 mode is roadmapped for 3.5, and n-way-mirroring raid1/10 are 
roadmapped after that, current raid1 mode is only 2-way mirroring, 
regardless of the number of drives).  After that, the focus should turn 
toward full stabilization.  So while btrfs is currently intended for 
testers only, by around the end of the year or early next, it will likely 
be reasonably stable and ready for at least the more adventurous 
conventional users.  Still, enterprise class users tend to be a 
conservative bunch, and I'd be surprised if they really consider btrfs 
ready before mid-year next year, at the earliest.

So if you're looking to test btrfs on enterprise-class hardware, great!  
But do be aware of what you're getting into.  If you have an enterprise 
distro which supports it too, even greater, but know what that actually 
means.  Does it mean they support the same level of 9s uptime on it as 
they normally do, or just that they're ready to accept payment to try and 
recover things if something goes wrong?

If that hasn't scared you off, and you've not read the wiki yet, that's 
probably the next thing you should look at, as it answers a lot of 
questions you may have, as well as some you wouldn't think to ask.  Being 
a wiki, of course, your own contributions are welcome.  In particular, 
you may well be able to cover some of the enterprise-class viewpoint 
questions you're asking based on your own testing, once you get to that 
point.

https://btrfs.wiki.kernel.org/

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman


