Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2018-05-03 Thread Goffredo Baroncelli
On 08/02/2017 08:47 PM, Chris Mason wrote:
>> I agree, MD pretty much needs a separate device simply because they can't 
>> allocate arbitrary space on the other array members.  BTRFS can do that 
>> though, and I would actually think that that would be _easier_ to implement 
>> than having a separate device.
>>
>> That said, I do think that it would need to be a separate chunk type, 
>> because things could get really complicated if the metadata is itself using 
>> a parity raid profile.
> 
> Thanks for running with this Liu, I'm reading through all the patches. I do 
> agree that it's better to put the logging into a dedicated chunk type, that 
> way we can have it default to either double or triple mirroring.

Sorry for replying a bit late :-), however it should be sufficient to start the 
writes from a stripe boundary. For a filesystem this is complicated to guarantee, 
but for a journal it would be much simpler to do.
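
To make the stripe-boundary idea concrete, a minimal sketch (illustrative Python
only, not code from the patch set; numbers assume a 3-disk raid5 with 64K
per-disk stripes, i.e. a 128K full data stripe, and all names are made up):

STRIPE_LEN = 64 * 1024        # per-disk stripe length (assumed)
DATA_DISKS = 2                # 3-disk raid5 -> 2 data disks + 1 parity
FULL_STRIPE = STRIPE_LEN * DATA_DISKS

def next_stripe_boundary(offset):
    # round an offset up to the next full-stripe boundary
    return -(-offset // FULL_STRIPE) * FULL_STRIPE

def place_journal_record(write_ptr, length):
    # start every journal record on a stripe boundary and pad it to a whole
    # number of full stripes, so the log itself never needs an RMW cycle
    start = next_stripe_boundary(write_ptr)
    end = start + next_stripe_boundary(length)
    return start, end            # the next record starts at 'end'

print(place_journal_record(write_ptr=70 * 1024, length=4 * 1024))
# -> (131072, 262144): the 4K record occupies one padded full stripe

The cost is padding every record up to a full stripe, but in exchange every log
write is a full-stripe write and never triggers an RMW cycle.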

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-25 Thread Goffredo Baroncelli
On 08/23/2017 05:28 PM, Chris Murphy wrote:
> - dynamically sized stripes, so that writes can always be full stripe
> writes, no overwrites, and atomic
Consider also that a block could be deallocated (e.g. deleting a 4 KiB file).
This leads to holes that you cannot fill without an RMW cycle...

BR

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-23 Thread Austin S. Hemmelgarn

On 2017-08-23 11:28, Chris Murphy wrote:
> On Wed, Aug 2, 2017 at 2:27 PM, Liu Bo  wrote:
>> On Wed, Aug 02, 2017 at 10:41:30PM +0200, Goffredo Baroncelli wrote:
>>>
>>> What I want to understand, is if it is possible to log only the "partial 
>>> stripe"  RMW cycle.
>>>
>> I think your point is valid if all data is written with datacow.  In
>> case of nodatacow, btrfs does overwrite in place, so a full stripe
>> write may pollute on-disk data after unclean shutdown.  Checksum can
>> detect errors but repair thru raid5 may not recover the correct data.
>
> What's simpler? raid56 journal for everything (cow, nocow, data,
> metadata), or to apply some limitations to available layouts?
>
> -  if raid56, then cow only (no such thing as nodatacow)
This should obviously be something that will be contentious to certain 
individuals.

> -  permit raid56 for data bg only, system and metadata can be raid1, or raid10
>
> I'm hard pressed thinking of a use case where metadata raid56 is
> beneficial over raid10; a metadata heavy workload is not well suited
> for any parity raid. And if it isn't metadata heavy, then chances are
> you don't even need raid10 but raid1 is sufficient.
Until BTRFS gets n-way replication, raid6 remains the only way to 
configure a BTRFS volume to survive more than one device failure.

> Of the more complicated ways to solve it:
>
> - journal
> - dynamically sized stripes, so that writes can always be full stripe
> writes, no overwrites, and atomic
> - mixed block groups where only sequential full stripe writes use
> raid56 block group; random and smaller writes go in a raid 1 or 10
> block group.





Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-23 Thread Chris Murphy
On Wed, Aug 2, 2017 at 2:27 PM, Liu Bo  wrote:
> On Wed, Aug 02, 2017 at 10:41:30PM +0200, Goffredo Baroncelli wrote:

>> What I want to understand, is if it is possible to log only the "partial 
>> stripe"  RMW cycle.
>>
>
> I think your point is valid if all data is written with datacow.  In
> case of nodatacow, btrfs does overwrite in place, so a full stripe
> write may pollute on-disk data after unclean shutdown.  Checksum can
> detect errors but repair thru raid5 may not recover the correct data.

What's simpler? raid56 journal for everything (cow, nocow, data,
metadata), or to apply some limitations to available layouts?

-  if raid56, then cow only (no such thing as nodatacow)
-  permit raid56 for data bg only, system and metadata can be raid1, or raid10

I'm hard pressed thinking of a use case where metadata raid56 is
beneficial over raid10; a metadata heavy workload is not well suited
for any parity raid. And if it isn't metadata heavy, then chances are
you don't even need raid10 but raid1 is sufficient.

Of the more complicated ways to solve it:

- journal
- dynamically sized stripes, so that writes can always be full stripe
writes, no overwrites, and atomic
- mixed block groups where only sequential full stripe writes use
raid56 block group; random and smaller writes go in a raid 1 or 10
block group.
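
To make the last option concrete, a rough sketch of the routing decision
(illustrative Python, not btrfs code; the thresholds and names are made up):
route only full, stripe-aligned, sequential extents to the raid56 block group
and everything else to raid1/raid10.

FULL_STRIPE = 128 * 1024   # e.g. 3-disk raid5 with a 64K per-disk stripe

def pick_block_group(offset, length, prev_end):
    # full, stripe-aligned, sequential extents -> raid56 block group;
    # small or random writes -> raid1/raid10 block group
    aligned = offset % FULL_STRIPE == 0 and length % FULL_STRIPE == 0
    sequential = offset == prev_end
    if aligned and sequential and length >= FULL_STRIPE:
        return "raid56"
    return "raid1"

prev_end = 0
for off, length in [(0, 256 * 1024), (256 * 1024, 4096), (512 * 1024, 128 * 1024)]:
    print(off, length, pick_block_group(off, length, prev_end))
    prev_end = off + length
# -> raid56, raid1, raid1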

-- 
Chris Murphy


Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-02 Thread Goffredo Baroncelli
On 2017-08-03 06:02, Duncan wrote:
> Liu Bo posted on Wed, 02 Aug 2017 14:27:21 -0600 as excerpted:
> 
>>>>> It is correct reading this as: all data is written two times ?
> 
> If as is being discussed the log is mirrored by default that'd be three 
> times...

And for raid6 you need to do it 4 times... (!)

> Parity-raid is slow and of course normally has the infamous write hole 
> this patch set is trying to close.  Yes, closing the write hole is 
> possible, but for sure it's going to make the performance bite of parity-
> raid even worse. =:^(

This is the reason for looking for possible optimizations from the beginning: a 
full stripe write (datacow only) doesn't require logging at all. This could 
be a big optimization (if you need to write a lot of data, only the head and tail 
are NOT full stripes). However this requires knowing whether the data is [no]cow 
when it is logged, and I think that it is not so simple: possible, but not 
simple.
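
A rough sketch of this head/tail split (illustrative Python, assuming a 128K
full stripe; not from the patches): only the partial-stripe head and tail of a
large cow write would need to go through the journal, the full stripes in the
middle would not.

FULL_STRIPE = 128 * 1024      # assumed full-stripe size

def split_for_logging(offset, length):
    # split a cow write into partial-stripe ranges (must be journalled)
    # and full-stripe ranges (no journalling needed for cow data)
    end = offset + length
    first_full = -(-offset // FULL_STRIPE) * FULL_STRIPE   # round up
    last_full = (end // FULL_STRIPE) * FULL_STRIPE         # round down
    if first_full >= last_full:
        return [(offset, end)], []       # write never covers a whole stripe
    partial = []
    if offset < first_full:
        partial.append((offset, first_full))     # head
    if last_full < end:
        partial.append((last_full, end))         # tail
    return partial, [(first_full, last_full)]

print(split_for_logging(offset=100 * 1024, length=600 * 1024))
# head and tail go to the log, the 4 full stripes in the middle do not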

>>>>>
>>>>> Or are logged only the stripes involved by a RMW cycle (i.e. if a
>>>>> stripe is fully written, the log is bypassed )?
>>>>
>>>> For data, only data in bios from high level will be logged, while for
>>>> parity, the whole parity will be logged.
>>>>
>>>> Full stripe write still logs all data and parity, as full stripe
>>>> write may not survive from unclean shutdown.
>>>
>>> Does this matter ? Due to the COW nature of BTRFS if a transaction is
>>> interrupted (by an unclean shutdown) the transaction data are all lost.
>>> Am I missing something ?
>>>
>>> What I want to understand, is if it is possible to log only the
>>> "partial stripe"  RMW cycle.
>>>
>>>
>> I think your point is valid if all data is written with datacow.  In
>> case of nodatacow, btrfs does overwrite in place, so a full stripe write
>> may pollute on-disk data after unclean shutdown.  Checksum can detect
>> errors but repair thru raid5 may not recover the correct data.
> 
> But nodatacow doesn't have checksum...

True, but Liu is correct in stating that a "nocow" write is not protected by a 
transaction.
The funny part is that in the raid5 case we need to duplicate the written data 
for the nocow case, while for the cow case it would be possible to avoid it (in 
the full stripe case)!


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-02 Thread Duncan
Liu Bo posted on Wed, 02 Aug 2017 14:27:21 -0600 as excerpted:

>> >> It is correct reading this as: all data is written two times ?

If as is being discussed the log is mirrored by default that'd be three 
times...

Parity-raid is slow and of course normally has the infamous write hole 
this patch set is trying to close.  Yes, closing the write hole is 
possible, but for sure it's going to make the performance bite of parity-
raid even worse. =:^(

>> >> Or are logged only the stripes involved by a RMW cycle (i.e. if a
>> >> stripe is fully written, the log is bypassed )?
>> > 
>> > For data, only data in bios from high level will be logged, while for
>> > parity, the whole parity will be logged.
>> > 
>> > Full stripe write still logs all data and parity, as full stripe
>> > write may not survive from unclean shutdown.
>> 
>> Does this matter ? Due to the COW nature of BTRFS if a transaction is
>> interrupted (by an unclean shutdown) the transaction data are all lost.
>> Am I missing something ?
>> 
>> What I want to understand, is if it is possible to log only the
>> "partial stripe"  RMW cycle.
>>
>>
> I think your point is valid if all data is written with datacow.  In
> case of nodatacow, btrfs does overwrite in place, so a full stripe write
> may pollute on-disk data after unclean shutdown.  Checksum can detect
> errors but repair thru raid5 may not recover the correct data.

But nodatacow doesn't have checksum...

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-02 Thread Liu Bo
On Wed, Aug 02, 2017 at 10:41:30PM +0200, Goffredo Baroncelli wrote:
> Hi Liu,
> 
> thanks for your reply, below my comments
> On 2017-08-02 19:57, Liu Bo wrote:
> > On Wed, Aug 02, 2017 at 12:14:27AM +0200, Goffredo Baroncelli wrote:
> >> On 2017-08-01 19:24, Liu Bo wrote:
> >>> On Tue, Aug 01, 2017 at 07:42:14PM +0200, Goffredo Baroncelli wrote:
>  Hi Liu,
> 
>  On 2017-08-01 18:14, Liu Bo wrote:
> > This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> > separate disk as a journal (aka raid5/6 log), so that after unclean
> > shutdown we can make sure data and parity are consistent on the raid
> > array by replaying the journal.
> >
> 
>  it would be possible to have more information ?
>  - what is logged ? data, parity or data + parity ?
> >>>
> >>> Patch 5 has more details(sorry for not making it clear that in the
> >>> cover letter).
> >>>
> >>> So both data and parity are logged so that while replaying the journal
> >>> everything is written to whichever disk it should be written to.
> >>
> >> It is correct reading this as: all data is written two times ? Or are 
> >> logged only the stripes involved by a RMW cycle (i.e. if a stripe is fully 
> >> written, the log is bypassed )?
> > 
> > For data, only data in bios from high level will be logged, while for
> > parity, the whole parity will be logged.
> > 
> > Full stripe write still logs all data and parity, as full stripe write
> > may not survive from unclean shutdown.
> 
> Does this matter ? Due to the COW nature of BTRFS if a transaction is 
> interrupted (by an unclean shutdown) the transaction data are all lost. Am I 
> missing something ?
> 
> What I want to understand, is if it is possible to log only the "partial 
> stripe"  RMW cycle.
>

I think your point is valid if all data is written with datacow.  In
case of nodatacow, btrfs does overwrite in place, so a full stripe
write may pollute on-disk data after unclean shutdown.  Checksum can
detect errors but repair thru raid5 may not recover the correct data.

> > 
> > Taking a raid5 setup with 3 disks as an example, doing an overwrite
> > of 4k will log 4K(data) + 64K(parity).
> > 
> >>>
>  - in the past I thought that it would be sufficient to log only the 
>  stripe position involved by a RMW cycle, and then start a scrub on these 
>  stripes in case of an unclean shutdown: do you think that it is feasible 
>  ?
> >>>
> >>> An unclean shutdown causes inconsistence between data and parity, so
> >>> scrub won't help as it's not able to tell which one (data or parity)
> >>> is valid
> >> Scrub compares data against its checksum; so it knows if the data is 
> >> correct. If no disk is lost, a scrub process is sufficient/needed to 
> >> rebuild the parity/data.
> >>
> > 
> > If no disk is lost, it depends on whether the number of errors caused
> > by an unclean shutdown can be tolerated by the raid setup.
> 
> see below
> > 
> >> The problem born when after "an unclean shutdown" a disk failure happens. 
> >> But these  are *two* distinct failures. These together break the BTRFS 
> >> raid5 redundancy. But if you run a scrub process between these two 
> >> failures, the btrfs raid5 redundancy is still effective.
> >>
> > 
> > I wouldn't say that the redundancy is still effective after a scrub
> > process, but rather those data which match their checksum can still be
> > read out while the mismatched data are lost forever after unclean
> > shutdown.
> 
> 
> I think that this is the point where we are in disagreement: until now I 
> understood that in BTRFS
> a) a transaction is fully completed or fully not-completed. 
> b) a transaction is completed after both the data *and* the parity are 
> written.
> 
> With these assumption, due to the COW nature of BTRFS an unclean shutdown 
> might invalidate only data of the current transaction. Of course the unclean 
> shutdown prevent the transaction to be completed, and this means that all the 
> data of this transaction is lost in any case.
> 
> For the parity this is different, because it is possible a misalignment 
> between the parity and the data (which might be of different transactions).
> 
> Let me to explain with the help of your example:
> 
> > Taking a raid5 setup with 3 disks as an example, doing an overwrite
> > of 4k will log 4K(data) + 64K(parity).
> 
> If the transaction is aborted, 128k-4k = 124k are untouched, and these still 
> be valid. The last 4k might be wrong, but in any case this data is not 
> referenced because the transaction was never completed. 
> The parity need to be rebuild because we are not able to know if the 
> transaction was aborted before/after the data and/or parity writing
>

True, 4k data is not referenced, but again after rebuilding the
parity, the rest 124K and the 4k which has random data are not
consistent with the rebuilt parity.

The point is to keep parity and data consistent at any point of time
so that raid5 tolerance is 

Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-02 Thread Goffredo Baroncelli
Hi Liu,

thanks for your reply, below my comments
On 2017-08-02 19:57, Liu Bo wrote:
> On Wed, Aug 02, 2017 at 12:14:27AM +0200, Goffredo Baroncelli wrote:
>> On 2017-08-01 19:24, Liu Bo wrote:
>>> On Tue, Aug 01, 2017 at 07:42:14PM +0200, Goffredo Baroncelli wrote:
 Hi Liu,

 On 2017-08-01 18:14, Liu Bo wrote:
> This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> separate disk as a journal (aka raid5/6 log), so that after unclean
> shutdown we can make sure data and parity are consistent on the raid
> array by replaying the journal.
>

 it would be possible to have more information ?
 - what is logged ? data, parity or data + parity ?
>>>
>>> Patch 5 has more details(sorry for not making it clear that in the
>>> cover letter).
>>>
>>> So both data and parity are logged so that while replaying the journal
>>> everything is written to whichever disk it should be written to.
>>
>> It is correct reading this as: all data is written two times ? Or are logged 
>> only the stripes involved by a RMW cycle (i.e. if a stripe is fully written, 
>> the log is bypassed )?
> 
> For data, only data in bios from high level will be logged, while for
> parity, the whole parity will be logged.
> 
> Full stripe write still logs all data and parity, as full stripe write
> may not survive from unclean shutdown.

Does this matter ? Due to the COW nature of BTRFS, if a transaction is 
interrupted (by an unclean shutdown) the data of that transaction is all lost 
anyway. Am I missing something ?

What I want to understand is whether it is possible to log only the "partial 
stripe" RMW cycles.

> 
> Taking a raid5 setup with 3 disks as an example, doing an overwrite
> of 4k will log 4K(data) + 64K(parity).
> 
>>>
 - in the past I thought that it would be sufficient to log only the stripe 
 position involved by a RMW cycle, and then start a scrub on these stripes 
 in case of an unclean shutdown: do you think that it is feasible ?
>>>
>>> An unclean shutdown causes inconsistence between data and parity, so
>>> scrub won't help as it's not able to tell which one (data or parity)
>>> is valid
>> Scrub compares data against its checksum; so it knows if the data is 
>> correct. If no disk is lost, a scrub process is sufficient/needed to rebuild 
>> the parity/data.
>>
> 
> If no disk is lost, it depends on whether the number of errors caused
> by an unclean shutdown can be tolerated by the raid setup.

see below
> 
>> The problem born when after "an unclean shutdown" a disk failure happens. 
>> But these  are *two* distinct failures. These together break the BTRFS raid5 
>> redundancy. But if you run a scrub process between these two failures, the 
>> btrfs raid5 redundancy is still effective.
>>
> 
> I wouldn't say that the redundancy is still effective after a scrub
> process, but rather those data which match their checksum can still be
> read out while the mismatched data are lost forever after unclean
> shutdown.


I think that this is the point where we are in disagreement: until now I 
understood that in BTRFS
a) a transaction is either fully completed or fully not completed.
b) a transaction is completed only after both the data *and* the parity are written.

With these assumptions, due to the COW nature of BTRFS, an unclean shutdown might 
invalidate only data of the current transaction. Of course the unclean shutdown 
prevents the transaction from being completed, and this means that all the data of 
this transaction is lost in any case.

For the parity this is different, because a misalignment between the parity and 
the data (which might belong to different transactions) is possible.

Let me explain with the help of your example:

> Taking a raid5 setup with 3 disks as an example, doing an overwrite
> of 4k will log 4K(data) + 64K(parity).

If the transaction is aborted, 128K - 4K = 124K are untouched, and these are still 
valid. The last 4K might be wrong, but in any case this data is not referenced 
because the transaction was never completed. 
The parity needs to be rebuilt because we are not able to know whether the 
transaction was aborted before/after the data and/or parity writes.


> 
> Thanks,
> 
> -liubo
>>
>>>
>>> With nodatacow, we do overwrite, so RMW during unclean shutdown is not safe.
>>> With datacow, we don't do overwrite, but the following situation may happen,
>>> say we have a raid5 setup with 3 disks, the stripe length is 64k, so
>>>
>>> 1) write 64K  --> now the raid layout is
>>> [64K data + 64K random + 64K parity]
>>> 2) write another 64K --> now the raid layout after RMW is
>>> [64K 1)'s data + 64K 2)'s data + 64K new parity]
>>>
>>> If unclean shutdown occurs before 2) finishes, then parity may be
>>> corrupted and then 1)'s data may be recovered wrongly if the disk
>>> which holds 1)'s data is offline.
>>>
 - does this journal disk also host other btrfs log ?

>>>
>>> No, purely data/parity and some associated metadata.
>>>
>>> Thanks,
>>>
>>> -liubo
>>>

Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-02 Thread Liu Bo
On Wed, Aug 02, 2017 at 12:14:27AM +0200, Goffredo Baroncelli wrote:
> On 2017-08-01 19:24, Liu Bo wrote:
> > On Tue, Aug 01, 2017 at 07:42:14PM +0200, Goffredo Baroncelli wrote:
> >> Hi Liu,
> >>
> >> On 2017-08-01 18:14, Liu Bo wrote:
> >>> This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> >>> separate disk as a journal (aka raid5/6 log), so that after unclean
> >>> shutdown we can make sure data and parity are consistent on the raid
> >>> array by replaying the journal.
> >>>
> >>
> >> it would be possible to have more information ?
> >> - what is logged ? data, parity or data + parity ?
> > 
> > Patch 5 has more details(sorry for not making it clear that in the
> > cover letter).
> > 
> > So both data and parity are logged so that while replaying the journal
> > everything is written to whichever disk it should be written to.
> 
> It is correct reading this as: all data is written two times ? Or are logged 
> only the stripes involved by a RMW cycle (i.e. if a stripe is fully written, 
> the log is bypassed )?

For data, only data in bios from high level will be logged, while for
parity, the whole parity will be logged.

Full stripe write still logs all data and parity, as full stripe write
may not survive from unclean shutdown.

Taking a raid5 setup with 3 disks as an example, doing an overwrite
of 4k will log 4K(data) + 64K(parity).
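
As a worked illustration of that arithmetic (plain Python with the usual raid5
XOR parity; illustrative only, not taken from the patches): the RMW cycle
computes the new parity as old_parity ^ old_data ^ new_data, so even a 4K
overwrite produces a full 64K parity strip, and both pieces go into the log.

import os

STRIP = 64 * 1024
old_data   = bytearray(os.urandom(STRIP))    # data strip being partially overwritten
other_data = bytearray(os.urandom(STRIP))    # the untouched data strip
parity     = bytearray(a ^ b for a, b in zip(old_data, other_data))

new4k = os.urandom(4096)                     # the 4K overwrite
new_data = bytearray(old_data)
new_data[0:4096] = new4k

# RMW: new parity = old parity ^ old data ^ new data
new_parity = bytearray(p ^ o ^ n for p, o, n in zip(parity, old_data, new_data))
assert new_parity == bytearray(a ^ b for a, b in zip(new_data, other_data))

journal_record = {"data": bytes(new4k), "parity": bytes(new_parity)}
print(len(journal_record["data"]), len(journal_record["parity"]))   # 4096 65536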

> > 
> >> - in the past I thought that it would be sufficient to log only the stripe 
> >> position involved by a RMW cycle, and then start a scrub on these stripes 
> >> in case of an unclean shutdown: do you think that it is feasible ?
> > 
> > An unclean shutdown causes inconsistence between data and parity, so
> > scrub won't help as it's not able to tell which one (data or parity)
> > is valid
> Scrub compares data against its checksum; so it knows if the data is correct. 
> If no disk is lost, a scrub process is sufficient/needed to rebuild the 
> parity/data.
>

If no disk is lost, it depends on whether the number of errors caused
by an unclean shutdown can be tolerated by the raid setup.

> The problem born when after "an unclean shutdown" a disk failure happens. But 
> these  are *two* distinct failures. These together break the BTRFS raid5 
> redundancy. But if you run a scrub process between these two failures, the 
> btrfs raid5 redundancy is still effective.
>

I wouldn't say that the redundancy is still effective after a scrub
process, but rather those data which match their checksum can still be
read out while the mismatched data are lost forever after unclean
shutdown.

Thanks,

-liubo
> 
> > 
> > With nodatacow, we do overwrite, so RMW during unclean shutdown is not safe.
> > With datacow, we don't do overwrite, but the following situation may happen,
> > say we have a raid5 setup with 3 disks, the stripe length is 64k, so
> > 
> > 1) write 64K  --> now the raid layout is
> > [64K data + 64K random + 64K parity]
> > 2) write another 64K --> now the raid layout after RMW is
> > [64K 1)'s data + 64K 2)'s data + 64K new parity]
> > 
> > If unclean shutdown occurs before 2) finishes, then parity may be
> > corrupted and then 1)'s data may be recovered wrongly if the disk
> > which holds 1)'s data is offline.
> > 
> >> - does this journal disk also host other btrfs log ?
> >>
> > 
> > No, purely data/parity and some associated metadata.
> > 
> > Thanks,
> > 
> > -liubo
> > 
> >>> The idea and the code are similar to the write-through mode of md
> >>> raid5-cache, so ppl(partial parity log) is also feasible to implement.
> >>> (If you've been familiar with md, you may find this patch set is
> >>> boring to read...)
> >>>
> >>> Patch 1-3 are about adding a log disk, patch 5-8 are the main part of
> >>> the implementation, the rest patches are improvements and bugfixes,
> >>> eg. readahead for recovery, checksum.
> >>>
> >>> Two btrfs-progs patches are required to play with this patch set, one
> >>> is to enhance 'btrfs device add' to add a disk as raid5/6 log with the
> >>> option '-L', the other is to teach 'btrfs-show-super' to show
> >>> %journal_tail.
> >>>
> >>> This is currently based on 4.12-rc3.
> >>>
> >>> The patch set is tagged with RFC, and comments are always welcome,
> >>> thanks.
> >>>
> >>> Known limitations:
> >>> - Deleting a log device is not implemented yet.
> >>>
> >>>
> >>> Liu Bo (14):
> >>>   Btrfs: raid56: add raid56 log via add_dev v2 ioctl
> >>>   Btrfs: raid56: do not allocate chunk on raid56 log
> >>>   Btrfs: raid56: detect raid56 log on mount
> >>>   Btrfs: raid56: add verbose debug
> >>>   Btrfs: raid56: add stripe log for raid5/6
> >>>   Btrfs: raid56: add reclaim support
> >>>   Btrfs: raid56: load r5log
> >>>   Btrfs: raid56: log recovery
> >>>   Btrfs: raid56: add readahead for recovery
> >>>   Btrfs: raid56: use the readahead helper to get page
> >>>   Btrfs: raid56: add csum support
> >>>   Btrfs: raid56: fix error handling while adding a log device
> >>>   Btrfs: raid56: initialize raid5/6 log after 

Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-02 Thread Chris Mason



On 08/01/2017 01:39 PM, Austin S. Hemmelgarn wrote:
> On 2017-08-01 13:25, Roman Mamedov wrote:
>> On Tue,  1 Aug 2017 10:14:23 -0600
>> Liu Bo  wrote:
>>
>>> This aims to fix write hole issue on btrfs raid5/6 setup by adding a
>>> separate disk as a journal (aka raid5/6 log), so that after unclean
>>> shutdown we can make sure data and parity are consistent on the raid
>>> array by replaying the journal.
>>
>> Could it be possible to designate areas on the in-array devices to be used as
>> journal?
>>
>> While md doesn't have much spare room in its metadata for extraneous things
>> like this, Btrfs could use almost as much as it wants to, adding to size of the
>> FS metadata areas. Reliability-wise, the log could be stored as RAID1 chunks.
>>
>> It doesn't seem convenient to need having an additional storage device around
>> just for the log, and also needing to maintain its fault tolerance yourself (so
>> the log device would better be on a mirror, such as mdadm RAID1? more expense
>> and maintenance complexity).
>
> I agree, MD pretty much needs a separate device simply because they
> can't allocate arbitrary space on the other array members.  BTRFS can do
> that though, and I would actually think that that would be _easier_ to
> implement than having a separate device.
>
> That said, I do think that it would need to be a separate chunk type,
> because things could get really complicated if the metadata is itself
> using a parity raid profile.

Thanks for running with this Liu, I'm reading through all the patches. 
I do agree that it's better to put the logging into a dedicated chunk 
type, that way we can have it default to either double or triple mirroring.

-chris



Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-01 Thread Goffredo Baroncelli
On 2017-08-01 23:00, Christoph Anton Mitterer wrote:
> Hi.
> 
> Stupid question:
> Would the write hole be closed already, if parity was checksummed?

No.
The write hole problem is due to a combination of two things:
a) misalignment between parity and data (i.e. an unclean shutdown)
b) loss of a disk (i.e. a disk failure)

Note1: the write hole problem happens even if these two events are not 
consecutive.

After the disk failure, when you need to read data from the broken disk, you 
need the parity to compute the data. But if the parity is misaligned, wrong 
data is returned.

The data checksums are sufficient to detect that wrong data is returned; a 
parity checksum is not needed. In any case, neither can avoid the problem.
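
A tiny illustration of these two ingredients (illustrative Python, single-byte
"strips" for brevity): with a stale parity and a lost disk, reconstruction
returns wrong data, and a checksum can only detect the damage, not recover the
lost block.

d0, d1 = 0xAA, 0x55
parity = d0 ^ d1                  # 0xFF, consistent

d1_new = 0x0F                     # overwrite d1 ...
# ... but crash before the parity is updated: parity stays 0xFF (misaligned)

# later the disk holding d0 dies; rebuild d0 from d1_new and the stale parity
d0_rebuilt = parity ^ d1_new      # 0xF0, not the original 0xAA

print(hex(d0_rebuilt), d0_rebuilt == d0)
# the data checksum of d0 detects that 0xF0 is wrong, but the original 0xAA is
# gone; a checksum on the stale (yet internally intact) parity would not even
# flag anything, and neither checksum can recover the data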


> Cheers,
> Chris.
> 

BR
G.Baroncelli
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-01 Thread Goffredo Baroncelli
On 2017-08-01 19:24, Liu Bo wrote:
> On Tue, Aug 01, 2017 at 07:42:14PM +0200, Goffredo Baroncelli wrote:
>> Hi Liu,
>>
>> On 2017-08-01 18:14, Liu Bo wrote:
>>> This aims to fix write hole issue on btrfs raid5/6 setup by adding a
>>> separate disk as a journal (aka raid5/6 log), so that after unclean
>>> shutdown we can make sure data and parity are consistent on the raid
>>> array by replaying the journal.
>>>
>>
>> it would be possible to have more information ?
>> - what is logged ? data, parity or data + parity ?
> 
> Patch 5 has more details(sorry for not making it clear that in the
> cover letter).
> 
> So both data and parity are logged so that while replaying the journal
> everything is written to whichever disk it should be written to.

Is it correct to read this as: all data is written twice ? Or are only the 
stripes involved in an RMW cycle logged (i.e. if a stripe is fully written, 
the log is bypassed) ?
> 
>> - in the past I thought that it would be sufficient to log only the stripe 
>> position involved by a RMW cycle, and then start a scrub on these stripes in 
>> case of an unclean shutdown: do you think that it is feasible ?
> 
> An unclean shutdown causes inconsistence between data and parity, so
> scrub won't help as it's not able to tell which one (data or parity)
> is valid
Scrub compares data against its checksum, so it knows whether the data is correct. 
If no disk is lost, a scrub process is sufficient (and needed) to rebuild the 
parity/data.

The problem arises when a disk failure happens after "an unclean shutdown". But 
these are *two* distinct failures. Together they break the BTRFS raid5 
redundancy. But if you run a scrub process between these two failures, the 
btrfs raid5 redundancy is still effective.
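
A minimal sketch of this approach (illustrative Python, not the proposal's
actual on-disk format): persist only the numbers of the stripes touched by an
RMW cycle, and after an unclean shutdown recompute the parity of exactly those
stripes from the data that is on disk.

from functools import reduce

dirty_log = set()                     # stripe numbers with an RMW in flight

def rmw_begin(stripe_no):
    dirty_log.add(stripe_no)          # must be persisted before the RMW starts

def rmw_end(stripe_no):
    dirty_log.discard(stripe_no)      # cleared once data + parity hit disk

def scrub_after_crash(stripes):
    # rebuild the parity of every dirty stripe from its on-disk data blocks,
    # making parity and data consistent again
    for n in dirty_log:
        data = stripes[n]["data"]
        stripes[n]["parity"] = reduce(
            lambda a, b: bytes(x ^ y for x, y in zip(a, b)), data)

stripes = {7: {"data": [b"\xaa" * 8, b"\x0f" * 8], "parity": b"\xff" * 8}}
rmw_begin(7)                          # crash happens here, parity is stale
scrub_after_crash(stripes)
print(stripes[7]["parity"])           # b'\xa5' * 8, consistent with the data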


> 
> With nodatacow, we do overwrite, so RMW during unclean shutdown is not safe.
> With datacow, we don't do overwrite, but the following situation may happen,
> say we have a raid5 setup with 3 disks, the stripe length is 64k, so
> 
> 1) write 64K  --> now the raid layout is
> [64K data + 64K random + 64K parity]
> 2) write another 64K --> now the raid layout after RMW is
> [64K 1)'s data + 64K 2)'s data + 64K new parity]
> 
> If unclean shutdown occurs before 2) finishes, then parity may be
> corrupted and then 1)'s data may be recovered wrongly if the disk
> which holds 1)'s data is offline.
> 
>> - does this journal disk also host other btrfs log ?
>>
> 
> No, purely data/parity and some associated metadata.
> 
> Thanks,
> 
> -liubo
> 
>>> The idea and the code are similar to the write-through mode of md
>>> raid5-cache, so ppl(partial parity log) is also feasible to implement.
>>> (If you've been familiar with md, you may find this patch set is
>>> boring to read...)
>>>
>>> Patch 1-3 are about adding a log disk, patch 5-8 are the main part of
>>> the implementation, the rest patches are improvements and bugfixes,
>>> eg. readahead for recovery, checksum.
>>>
>>> Two btrfs-progs patches are required to play with this patch set, one
>>> is to enhance 'btrfs device add' to add a disk as raid5/6 log with the
>>> option '-L', the other is to teach 'btrfs-show-super' to show
>>> %journal_tail.
>>>
>>> This is currently based on 4.12-rc3.
>>>
>>> The patch set is tagged with RFC, and comments are always welcome,
>>> thanks.
>>>
>>> Known limitations:
>>> - Deleting a log device is not implemented yet.
>>>
>>>
>>> Liu Bo (14):
>>>   Btrfs: raid56: add raid56 log via add_dev v2 ioctl
>>>   Btrfs: raid56: do not allocate chunk on raid56 log
>>>   Btrfs: raid56: detect raid56 log on mount
>>>   Btrfs: raid56: add verbose debug
>>>   Btrfs: raid56: add stripe log for raid5/6
>>>   Btrfs: raid56: add reclaim support
>>>   Btrfs: raid56: load r5log
>>>   Btrfs: raid56: log recovery
>>>   Btrfs: raid56: add readahead for recovery
>>>   Btrfs: raid56: use the readahead helper to get page
>>>   Btrfs: raid56: add csum support
>>>   Btrfs: raid56: fix error handling while adding a log device
>>>   Btrfs: raid56: initialize raid5/6 log after adding it
>>>   Btrfs: raid56: maintain IO order on raid5/6 log
>>>
>>>  fs/btrfs/ctree.h|   16 +-
>>>  fs/btrfs/disk-io.c  |   16 +
>>>  fs/btrfs/ioctl.c|   48 +-
>>>  fs/btrfs/raid56.c   | 1429 
>>> ++-
>>>  fs/btrfs/raid56.h   |   82 +++
>>>  fs/btrfs/transaction.c  |2 +
>>>  fs/btrfs/volumes.c  |   56 +-
>>>  fs/btrfs/volumes.h  |7 +-
>>>  include/uapi/linux/btrfs.h  |3 +
>>>  include/uapi/linux/btrfs_tree.h |4 +
>>>  10 files changed, 1487 insertions(+), 176 deletions(-)
>>>
>>
>>
>> -- 
>> gpg @keyserver.linux.it: Goffredo Baroncelli 
>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-01 Thread Christoph Anton Mitterer
Hi.

Stupid question:
Would the write hole be closed already, if parity was checksummed?

Cheers,
Chris.



Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-01 Thread Liu Bo
On Tue, Aug 01, 2017 at 07:42:14PM +0200, Goffredo Baroncelli wrote:
> Hi Liu,
> 
> On 2017-08-01 18:14, Liu Bo wrote:
> > This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> > separate disk as a journal (aka raid5/6 log), so that after unclean
> > shutdown we can make sure data and parity are consistent on the raid
> > array by replaying the journal.
> > 
> 
> it would be possible to have more information ?
> - what is logged ? data, parity or data + parity ?

Patch 5 has more details(sorry for not making it clear that in the
cover letter).

So both data and parity are logged so that while replaying the journal
everything is written to whichever disk it should be written to.

> - in the past I thought that it would be sufficient to log only the stripe 
> position involved by a RMW cycle, and then start a scrub on these stripes in 
> case of an unclean shutdown: do you think that it is feasible ?

An unclean shutdown causes inconsistency between data and parity, so
scrub won't help as it's not able to tell which one (data or parity)
is valid.

With nodatacow, we do overwrite, so RMW during unclean shutdown is not safe.
With datacow, we don't do overwrite, but the following situation may happen,
say we have a raid5 setup with 3 disks, the stripe length is 64k, so

1) write 64K  --> now the raid layout is
[64K data + 64K random + 64K parity]
2) write another 64K --> now the raid layout after RMW is
[64K 1)'s data + 64K 2)'s data + 64K new parity]

If unclean shutdown occurs before 2) finishes, then parity may be
corrupted and then 1)'s data may be recovered wrongly if the disk
which holds 1)'s data is offline.
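
A small simulation of that sequence (illustrative Python with 4-byte "strips";
not kernel code): if the crash lands after 2)'s data reaches disk but before
the new parity does, and the disk holding 1)'s data is then lost, 1)'s data is
reconstructed wrongly even though it was never rewritten.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# 1) write 64K: [data1 | random | parity]
data1, junk = b"\x11" * 4, b"\x99" * 4
parity = xor(data1, junk)

# 2) write another 64K into the same stripe: data2 lands on disk ...
data2 = b"\x22" * 4
# ... but the crash happens before the new parity xor(data1, data2) is written
stale_parity = parity

# now the disk holding 1)'s data goes offline; rebuild it from data2 + parity
rebuilt_data1 = xor(stale_parity, data2)
print(rebuilt_data1 == data1)   # False: 1)'s data comes back corrupted; its
                                # checksum can detect this but not repair it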

> - does this journal disk also host other btrfs log ?
>

No, purely data/parity and some associated metadata.

Thanks,

-liubo

> > The idea and the code are similar to the write-through mode of md
> > raid5-cache, so ppl(partial parity log) is also feasible to implement.
> > (If you've been familiar with md, you may find this patch set is
> > boring to read...)
> > 
> > Patch 1-3 are about adding a log disk, patch 5-8 are the main part of
> > the implementation, the rest patches are improvements and bugfixes,
> > eg. readahead for recovery, checksum.
> > 
> > Two btrfs-progs patches are required to play with this patch set, one
> > is to enhance 'btrfs device add' to add a disk as raid5/6 log with the
> > option '-L', the other is to teach 'btrfs-show-super' to show
> > %journal_tail.
> > 
> > This is currently based on 4.12-rc3.
> > 
> > The patch set is tagged with RFC, and comments are always welcome,
> > thanks.
> > 
> > Known limitations:
> > - Deleting a log device is not implemented yet.
> > 
> > 
> > Liu Bo (14):
> >   Btrfs: raid56: add raid56 log via add_dev v2 ioctl
> >   Btrfs: raid56: do not allocate chunk on raid56 log
> >   Btrfs: raid56: detect raid56 log on mount
> >   Btrfs: raid56: add verbose debug
> >   Btrfs: raid56: add stripe log for raid5/6
> >   Btrfs: raid56: add reclaim support
> >   Btrfs: raid56: load r5log
> >   Btrfs: raid56: log recovery
> >   Btrfs: raid56: add readahead for recovery
> >   Btrfs: raid56: use the readahead helper to get page
> >   Btrfs: raid56: add csum support
> >   Btrfs: raid56: fix error handling while adding a log device
> >   Btrfs: raid56: initialize raid5/6 log after adding it
> >   Btrfs: raid56: maintain IO order on raid5/6 log
> > 
> >  fs/btrfs/ctree.h|   16 +-
> >  fs/btrfs/disk-io.c  |   16 +
> >  fs/btrfs/ioctl.c|   48 +-
> >  fs/btrfs/raid56.c   | 1429 
> > ++-
> >  fs/btrfs/raid56.h   |   82 +++
> >  fs/btrfs/transaction.c  |2 +
> >  fs/btrfs/volumes.c  |   56 +-
> >  fs/btrfs/volumes.h  |7 +-
> >  include/uapi/linux/btrfs.h  |3 +
> >  include/uapi/linux/btrfs_tree.h |4 +
> >  10 files changed, 1487 insertions(+), 176 deletions(-)
> > 
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli 
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-01 Thread Hugo Mills
On Tue, Aug 01, 2017 at 10:56:39AM -0600, Liu Bo wrote:
> On Tue, Aug 01, 2017 at 05:28:57PM +, Hugo Mills wrote:
> >Hi,
> > 
> >Great to see something addressing the write hole at last.
> > 
> > On Tue, Aug 01, 2017 at 10:14:23AM -0600, Liu Bo wrote:
> > > This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> > > separate disk as a journal (aka raid5/6 log), so that after unclean
> > > shutdown we can make sure data and parity are consistent on the raid
> > > array by replaying the journal.
> > 
> >What's the behaviour of the FS if the log device dies during use?
> >
> 
> Error handling on IOs is still under construction (belongs to known
> limitations).
> 
> If the log device dies suddenly, I think we could skip the writeback
> to the backend raid arrays and follow the rule in btrfs, flipping the FS to
> readonly as it may expose data loss.  What do you think?

   I think the key thing for me is that the overall behaviour of the
redundancy in the FS is not compromised by the logging solution. That
is, the same guarantees still hold: For RAID-5, you can lose up to one
device of the FS (*including* any log devices), and the FS will
continue to operate normally, but degraded. For RAID-6, you can lose
up to two devices without losing any capabilities of the FS. Dropping
to read-only if the (single) log device fails would break those
guarantees.

   I quite like the idea of embedding the log chunks into the
allocated structure of the FS -- although as pointed out, this is
probably going to need a new chunk type, and (to retain the guarantees
of the RAID-6 behaviour above) the ability to do 3-way RAID-1 on those
chunks. You'd also have to be able to balance the log structures while
in flight. It sounds like a lot more work for you, though.

   Hmm... if 3-way RAID-1 (3c) is available, then you could also have
RAID-1*3 on metadata, RAID-6 on data, and have 2-device redundancy
throughout. That's also a very attractive configuration in many
respects. (Analagous to RAID-1 metadata and RAID-5 data).

   Hugo.

-- 
Hugo Mills | That's not rain, that's a lake with slots in it.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-01 Thread Liu Bo
On Tue, Aug 01, 2017 at 01:39:59PM -0400, Austin S. Hemmelgarn wrote:
> On 2017-08-01 13:25, Roman Mamedov wrote:
> > On Tue,  1 Aug 2017 10:14:23 -0600
> > Liu Bo  wrote:
> > 
> > > This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> > > separate disk as a journal (aka raid5/6 log), so that after unclean
> > > shutdown we can make sure data and parity are consistent on the raid
> > > array by replaying the journal.
> > 
> > Could it be possible to designate areas on the in-array devices to be used 
> > as
> > journal?
> > 
> > While md doesn't have much spare room in its metadata for extraneous things
> > like this, Btrfs could use almost as much as it wants to, adding to size of 
> > the
> > FS metadata areas. Reliability-wise, the log could be stored as RAID1 
> > chunks.
> > 
> > It doesn't seem convenient to need having an additional storage device 
> > around
> > just for the log, and also needing to maintain its fault tolerance yourself 
> > (so
> > the log device would better be on a mirror, such as mdadm RAID1? more 
> > expense
> > and maintenance complexity).
> > 
> I agree, MD pretty much needs a separate device simply because they can't
> allocate arbitrary space on the other array members.  BTRFS can do that
> though, and I would actually think that that would be _easier_ to implement
> than having a separate device.
>

Yes and no, using chunks may need a new ioctl and diving into chunk
allocation/(auto)deletion maze.

> That said, I do think that it would need to be a separate chunk type,
> because things could get really complicated if the metadata is itself using
> a parity raid profile.

Exactly, esp. when balance comes into the picture.

Thanks,

-liubo


Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-01 Thread Liu Bo
On Tue, Aug 01, 2017 at 10:25:47PM +0500, Roman Mamedov wrote:
> On Tue,  1 Aug 2017 10:14:23 -0600
> Liu Bo  wrote:
> 
> > This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> > separate disk as a journal (aka raid5/6 log), so that after unclean
> > shutdown we can make sure data and parity are consistent on the raid
> > array by replaying the journal.
> 
> Could it be possible to designate areas on the in-array devices to be used as
> journal?
>
> While md doesn't have much spare room in its metadata for extraneous things
> like this, Btrfs could use almost as much as it wants to, adding to size of 
> the
> FS metadata areas. Reliability-wise, the log could be stored as RAID1 chunks.
> 

Yes, it makes sense, we could definitely do that, that was actually
the original idea.  I started with adding a new device for log as it
looks easier to me, but I could try that now.

> It doesn't seem convenient to need having an additional storage device around
> just for the log, and also needing to maintain its fault tolerance yourself 
> (so
> the log device would better be on a mirror, such as mdadm RAID1? more expense
> and maintenance complexity).
>

That's true.  Thanks for the suggestions.

Thanks,

-liubo


Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-01 Thread Liu Bo
On Tue, Aug 01, 2017 at 05:28:57PM +, Hugo Mills wrote:
>Hi,
> 
>Great to see something addressing the write hole at last.
> 
> On Tue, Aug 01, 2017 at 10:14:23AM -0600, Liu Bo wrote:
> > This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> > separate disk as a journal (aka raid5/6 log), so that after unclean
> > shutdown we can make sure data and parity are consistent on the raid
> > array by replaying the journal.
> 
>What's the behaviour of the FS if the log device dies during use?
>

Error handling on IOs is still under construction (belongs to known
limitations).

If the log device dies suddenly, I think we could skip the writeback
to the backend raid arrays and follow the rule in btrfs, flipping the FS to
readonly, as it may expose data loss.  What do you think?

Thanks,

-liubo

>Hugo.
> 
> > The idea and the code are similar to the write-through mode of md
> > raid5-cache, so ppl(partial parity log) is also feasible to implement.
> > (If you've been familiar with md, you may find this patch set is
> > boring to read...)
> > 
> > Patch 1-3 are about adding a log disk, patch 5-8 are the main part of
> > the implementation, the rest patches are improvements and bugfixes,
> > eg. readahead for recovery, checksum.
> > 
> > Two btrfs-progs patches are required to play with this patch set, one
> > is to enhance 'btrfs device add' to add a disk as raid5/6 log with the
> > option '-L', the other is to teach 'btrfs-show-super' to show
> > %journal_tail.
> > 
> > This is currently based on 4.12-rc3.
> > 
> > The patch set is tagged with RFC, and comments are always welcome,
> > thanks.
> > 
> > Known limitations:
> > - Deleting a log device is not implemented yet.
> > 
> > 
> > Liu Bo (14):
> >   Btrfs: raid56: add raid56 log via add_dev v2 ioctl
> >   Btrfs: raid56: do not allocate chunk on raid56 log
> >   Btrfs: raid56: detect raid56 log on mount
> >   Btrfs: raid56: add verbose debug
> >   Btrfs: raid56: add stripe log for raid5/6
> >   Btrfs: raid56: add reclaim support
> >   Btrfs: raid56: load r5log
> >   Btrfs: raid56: log recovery
> >   Btrfs: raid56: add readahead for recovery
> >   Btrfs: raid56: use the readahead helper to get page
> >   Btrfs: raid56: add csum support
> >   Btrfs: raid56: fix error handling while adding a log device
> >   Btrfs: raid56: initialize raid5/6 log after adding it
> >   Btrfs: raid56: maintain IO order on raid5/6 log
> > 
> >  fs/btrfs/ctree.h|   16 +-
> >  fs/btrfs/disk-io.c  |   16 +
> >  fs/btrfs/ioctl.c|   48 +-
> >  fs/btrfs/raid56.c   | 1429 
> > ++-
> >  fs/btrfs/raid56.h   |   82 +++
> >  fs/btrfs/transaction.c  |2 +
> >  fs/btrfs/volumes.c  |   56 +-
> >  fs/btrfs/volumes.h  |7 +-
> >  include/uapi/linux/btrfs.h  |3 +
> >  include/uapi/linux/btrfs_tree.h |4 +
> >  10 files changed, 1487 insertions(+), 176 deletions(-)
> > 
> 
> -- 
> Hugo Mills | Some days, it's just not worth gnawing through the
> hugo@... carfax.org.uk | straps
> http://carfax.org.uk/  |
> PGP: E2AB1DE4  |




Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-01 Thread Goffredo Baroncelli
Hi Liu,

On 2017-08-01 18:14, Liu Bo wrote:
> This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> separate disk as a journal (aka raid5/6 log), so that after unclean
> shutdown we can make sure data and parity are consistent on the raid
> array by replaying the journal.
> 

would it be possible to have more information ?
- what is logged ? data, parity, or data + parity ?
- in the past I thought that it would be sufficient to log only the stripe 
positions involved in an RMW cycle, and then start a scrub on those stripes in 
case of an unclean shutdown: do you think that is feasible ?
- does this journal disk also host other btrfs logs ?

> The idea and the code are similar to the write-through mode of md
> raid5-cache, so ppl(partial parity log) is also feasible to implement.
> (If you've been familiar with md, you may find this patch set is
> boring to read...)
> 
> Patch 1-3 are about adding a log disk, patch 5-8 are the main part of
> the implementation, the rest patches are improvements and bugfixes,
> eg. readahead for recovery, checksum.
> 
> Two btrfs-progs patches are required to play with this patch set, one
> is to enhance 'btrfs device add' to add a disk as raid5/6 log with the
> option '-L', the other is to teach 'btrfs-show-super' to show
> %journal_tail.
> 
> This is currently based on 4.12-rc3.
> 
> The patch set is tagged with RFC, and comments are always welcome,
> thanks.
> 
> Known limitations:
> - Deleting a log device is not implemented yet.
> 
> 
> Liu Bo (14):
>   Btrfs: raid56: add raid56 log via add_dev v2 ioctl
>   Btrfs: raid56: do not allocate chunk on raid56 log
>   Btrfs: raid56: detect raid56 log on mount
>   Btrfs: raid56: add verbose debug
>   Btrfs: raid56: add stripe log for raid5/6
>   Btrfs: raid56: add reclaim support
>   Btrfs: raid56: load r5log
>   Btrfs: raid56: log recovery
>   Btrfs: raid56: add readahead for recovery
>   Btrfs: raid56: use the readahead helper to get page
>   Btrfs: raid56: add csum support
>   Btrfs: raid56: fix error handling while adding a log device
>   Btrfs: raid56: initialize raid5/6 log after adding it
>   Btrfs: raid56: maintain IO order on raid5/6 log
> 
>  fs/btrfs/ctree.h|   16 +-
>  fs/btrfs/disk-io.c  |   16 +
>  fs/btrfs/ioctl.c|   48 +-
>  fs/btrfs/raid56.c   | 1429 
> ++-
>  fs/btrfs/raid56.h   |   82 +++
>  fs/btrfs/transaction.c  |2 +
>  fs/btrfs/volumes.c  |   56 +-
>  fs/btrfs/volumes.h  |7 +-
>  include/uapi/linux/btrfs.h  |3 +
>  include/uapi/linux/btrfs_tree.h |4 +
>  10 files changed, 1487 insertions(+), 176 deletions(-)
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-01 Thread Austin S. Hemmelgarn

On 2017-08-01 13:25, Roman Mamedov wrote:
> On Tue,  1 Aug 2017 10:14:23 -0600
> Liu Bo  wrote:
>
>> This aims to fix write hole issue on btrfs raid5/6 setup by adding a
>> separate disk as a journal (aka raid5/6 log), so that after unclean
>> shutdown we can make sure data and parity are consistent on the raid
>> array by replaying the journal.
>
> Could it be possible to designate areas on the in-array devices to be used as
> journal?
>
> While md doesn't have much spare room in its metadata for extraneous things
> like this, Btrfs could use almost as much as it wants to, adding to size of the
> FS metadata areas. Reliability-wise, the log could be stored as RAID1 chunks.
>
> It doesn't seem convenient to need having an additional storage device around
> just for the log, and also needing to maintain its fault tolerance yourself (so
> the log device would better be on a mirror, such as mdadm RAID1? more expense
> and maintenance complexity).

I agree, MD pretty much needs a separate device simply because they 
can't allocate arbitrary space on the other array members.  BTRFS can do 
that though, and I would actually think that that would be _easier_ to 
implement than having a separate device.

That said, I do think that it would need to be a separate chunk type, 
because things could get really complicated if the metadata is itself 
using a parity raid profile.



Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-01 Thread Hugo Mills
   Hi,

   Great to see something addressing the write hole at last.

On Tue, Aug 01, 2017 at 10:14:23AM -0600, Liu Bo wrote:
> This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> separate disk as a journal (aka raid5/6 log), so that after unclean
> shutdown we can make sure data and parity are consistent on the raid
> array by replaying the journal.

   What's the behaviour of the FS if the log device dies during use?

   Hugo.

> The idea and the code are similar to the write-through mode of md
> raid5-cache, so ppl(partial parity log) is also feasible to implement.
> (If you've been familiar with md, you may find this patch set is
> boring to read...)
> 
> Patch 1-3 are about adding a log disk, patch 5-8 are the main part of
> the implementation, the rest patches are improvements and bugfixes,
> eg. readahead for recovery, checksum.
> 
> Two btrfs-progs patches are required to play with this patch set, one
> is to enhance 'btrfs device add' to add a disk as raid5/6 log with the
> option '-L', the other is to teach 'btrfs-show-super' to show
> %journal_tail.
> 
> This is currently based on 4.12-rc3.
> 
> The patch set is tagged with RFC, and comments are always welcome,
> thanks.
> 
> Known limitations:
> - Deleting a log device is not implemented yet.
> 
> 
> Liu Bo (14):
>   Btrfs: raid56: add raid56 log via add_dev v2 ioctl
>   Btrfs: raid56: do not allocate chunk on raid56 log
>   Btrfs: raid56: detect raid56 log on mount
>   Btrfs: raid56: add verbose debug
>   Btrfs: raid56: add stripe log for raid5/6
>   Btrfs: raid56: add reclaim support
>   Btrfs: raid56: load r5log
>   Btrfs: raid56: log recovery
>   Btrfs: raid56: add readahead for recovery
>   Btrfs: raid56: use the readahead helper to get page
>   Btrfs: raid56: add csum support
>   Btrfs: raid56: fix error handling while adding a log device
>   Btrfs: raid56: initialize raid5/6 log after adding it
>   Btrfs: raid56: maintain IO order on raid5/6 log
> 
>  fs/btrfs/ctree.h|   16 +-
>  fs/btrfs/disk-io.c  |   16 +
>  fs/btrfs/ioctl.c|   48 +-
>  fs/btrfs/raid56.c   | 1429 
> ++-
>  fs/btrfs/raid56.h   |   82 +++
>  fs/btrfs/transaction.c  |2 +
>  fs/btrfs/volumes.c  |   56 +-
>  fs/btrfs/volumes.h  |7 +-
>  include/uapi/linux/btrfs.h  |3 +
>  include/uapi/linux/btrfs_tree.h |4 +
>  10 files changed, 1487 insertions(+), 176 deletions(-)
> 

-- 
Hugo Mills | Some days, it's just not worth gnawing through the
hugo@... carfax.org.uk | straps
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-01 Thread Roman Mamedov
On Tue,  1 Aug 2017 10:14:23 -0600
Liu Bo  wrote:

> This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> separate disk as a journal (aka raid5/6 log), so that after unclean
> shutdown we can make sure data and parity are consistent on the raid
> array by replaying the journal.

Could it be possible to designate areas on the in-array devices to be used as
journal?

While md doesn't have much spare room in its metadata for extraneous things
like this, Btrfs could use almost as much as it wants to, adding to size of the
FS metadata areas. Reliability-wise, the log could be stored as RAID1 chunks.

It doesn't seem convenient to need having an additional storage device around
just for the log, and also needing to maintain its fault tolerance yourself (so
the log device would better be on a mirror, such as mdadm RAID1? more expense
and maintenance complexity).

-- 
With respect,
Roman


[PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-01 Thread Liu Bo
This aims to fix write hole issue on btrfs raid5/6 setup by adding a
separate disk as a journal (aka raid5/6 log), so that after unclean
shutdown we can make sure data and parity are consistent on the raid
array by replaying the journal.

The idea and the code are similar to the write-through mode of md
raid5-cache, so ppl(partial parity log) is also feasible to implement.
(If you're familiar with md, you may find this patch set boring to
read...)

Patch 1-3 are about adding a log disk, patch 5-8 are the main part of
the implementation, the rest patches are improvements and bugfixes,
eg. readahead for recovery, checksum.

Two btrfs-progs patches are required to play with this patch set, one
is to enhance 'btrfs device add' to add a disk as raid5/6 log with the
option '-L', the other is to teach 'btrfs-show-super' to show
%journal_tail.

This is currently based on 4.12-rc3.

The patch set is tagged with RFC, and comments are always welcome,
thanks.

Known limitations:
- Deleting a log device is not implemented yet.


Liu Bo (14):
  Btrfs: raid56: add raid56 log via add_dev v2 ioctl
  Btrfs: raid56: do not allocate chunk on raid56 log
  Btrfs: raid56: detect raid56 log on mount
  Btrfs: raid56: add verbose debug
  Btrfs: raid56: add stripe log for raid5/6
  Btrfs: raid56: add reclaim support
  Btrfs: raid56: load r5log
  Btrfs: raid56: log recovery
  Btrfs: raid56: add readahead for recovery
  Btrfs: raid56: use the readahead helper to get page
  Btrfs: raid56: add csum support
  Btrfs: raid56: fix error handling while adding a log device
  Btrfs: raid56: initialize raid5/6 log after adding it
  Btrfs: raid56: maintain IO order on raid5/6 log

 fs/btrfs/ctree.h|   16 +-
 fs/btrfs/disk-io.c  |   16 +
 fs/btrfs/ioctl.c|   48 +-
 fs/btrfs/raid56.c   | 1429 ++-
 fs/btrfs/raid56.h   |   82 +++
 fs/btrfs/transaction.c  |2 +
 fs/btrfs/volumes.c  |   56 +-
 fs/btrfs/volumes.h  |7 +-
 include/uapi/linux/btrfs.h  |3 +
 include/uapi/linux/btrfs_tree.h |4 +
 10 files changed, 1487 insertions(+), 176 deletions(-)

-- 
2.9.4
