Re: speed up big btrfs volumes with ssds

2017-09-11 Thread Stefan Priebe - Profihost AG
Hello,

On 04.09.2017 at 20:32, Stefan Priebe - Profihost AG wrote:
> On 04.09.2017 at 15:28, Timofey Titovets wrote:
>> 2017-09-04 15:57 GMT+03:00 Stefan Priebe - Profihost AG:
>>> On 04.09.2017 at 12:53, Henk Slager wrote:
>>>> On Sun, Sep 3, 2017 at 8:32 PM, Stefan Priebe - Profihost AG wrote:
>>>>> Hello,
>>>>>
>>>>> I'm trying to speed up big btrfs volumes.
>>>>>
>>>>> Some facts:
>>>>> - Kernel will be 4.13-rc7
>>>>> - needed volume size is 60TB
>>>>>
>>>>> Currently without any SSDs I get the best speed with:
>>>>> - 4x HW RAID 5 with 1GB controller memory, built from 4TB 3.5" drives
>>>>>
>>>>> and using btrfs raid0 for data and metadata on top of those 4 RAID 5 arrays.
>>>>>
>>>>> I can live with a data loss every now and then ;-) so a raid0 on
>>>>> top of the 4x RAID 5 is acceptable for me.
>>>>>
>>>>> Currently the write speed is not as good as I would like - especially
>>>>> for random 8k-16k I/O.
>>>>>
>>>>> My current idea is to use a PCIe flash card with bcache on top of each
>>>>> RAID 5.

>>>> Whether it can speed things up depends quite a lot on what the use-case is; for
>>>> some not-so-parallel access it might work. So this 60TB is then
>>>> 20 4TB disks or so, and the 4x 1GB cache is simply not very helpful, I
>>>> think. The working set doesn't fit in it, I guess. If there are mostly
>>>> single or just a few users of the fs, a single PCIe-based bcache caching 4
>>>> devices can work, but for SATA SSDs I would use 1 SSD per HW RAID 5.
>>>
>>> Yes, that's roughly my idea as well, and yes, the workload is 4 users max
>>> writing data: 50% sequential, 50% random.
>>>
>>>> Then roughly make sure the complete set of metadata blocks fits in the
>>>> cache. For an fs of this size, let's estimate 150G. Then maybe the same
>>>> or double that for data, so an SSD of 500G would be a first try.
>>>
>>> I would use 1TB devices for each RAID, or a 4TB PCIe card.
>>>
>>>> You give the impression that reliability for this fs is not the
>>>> highest prio, so if you go full risk, put bcache in write-back
>>>> mode; then you will have your desired random 8k-16k I/O speedup after
>>>> the cache is warmed up. But any SW or HW failure will normally result in
>>>> total fs loss if SSD and HDD get out of sync somehow. Bcache
>>>> write-through might also be acceptable; you will need extensive
>>>> monitoring and tuning of all (bcache) parameters etc. to be sure of the
>>>> right choice of size and setup.
>>>
>>> Yes, I wanted to use write-back mode. Has anybody already done some
>>> tests or gained experience with a setup like this?
>>>
>>
>> Maybe you can make your RAID setup faster by:
>> 1. Use the single profile
> 
> I'm already using the raid0 profile - see below:
> 
> Data,RAID0: Size:22.57TiB, Used:21.08TiB
> Metadata,RAID0: Size:90.00GiB, Used:82.28GiB
> System,RAID0: Size:64.00MiB, Used:1.53MiB
> 
>> 2. Use a different stripe size for HW RAID 5:
>> I think 16kb will be optimal with 5 devices per RAID group.
>> That will give you a 64kb data stripe and 16kb parity.
>> Btrfs raid0 uses 64kb as its stripe size, so a mismatch can make data
>> access unaligned (or use the single profile for btrfs).
> 
> That sounds like an interesting idea, except for the unaligned writes.
> I will need to test this.
> 
>> 3. Use btrfs ssd_spread to decrease RMW cycles.
> Can you explain this?
> 
> Stefan

I was able to fix this issue with ssd_spread. Could it be that the
default allocators, nossd and ssd, are searching too hard for free space?
Even space_tree (the free space tree) did not help.
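
For reference, switching between the allocators mentioned above is just a
matter of mount options; a minimal sketch with placeholder names (/dev/sdX,
/mnt/big):

  # explicit ssd allocator
  mount -o ssd /dev/sdX /mnt/big
  # disable the ssd heuristics
  mount -o remount,nossd /mnt/big
  # the variant that resolved the issue here: prefer larger free areas
  mount -o remount,ssd_spread /mnt/big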

Greets,
Stefan


Re: speed up big btrfs volumes with ssds

2017-09-04 Thread Timofey Titovets
2017-09-04 21:32 GMT+03:00 Stefan Priebe - Profihost AG:
>> May be you can make work your raid setup faster by:
>> 1. Use Single Profile
>
> I'm already using the raid0 profile - see below:

If I understand correctly, you have a very big data set with random RW
access, so:
I'm suggesting the single profile because it keeps writes compact on one
device, which can make the WB cache more effective.
Because writes will not be spread over several devices, the chance that a
full stripe gets overwritten increases.
It will just work like raid0 with a very big stripe size.
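
A sketch of what converting the data profile to single would look like on an
existing filesystem (placeholder mountpoint; a full balance of ~21TiB of data
will take a long time):

  # convert data chunks from raid0 to single
  btrfs balance start -dconvert=single /mnt/big
  # verify the resulting profiles afterwards
  btrfs filesystem df /mnt/big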

> Data,RAID0: Size:22.57TiB, Used:21.08TiB
> Metadata,RAID0: Size:90.00GiB, Used:82.28GiB
> System,RAID0: Size:64.00MiB, Used:1.53MiB
>
>> 2. Use a different stripe size for HW RAID 5:
>> I think 16kb will be optimal with 5 devices per RAID group.
>> That will give you a 64kb data stripe and 16kb parity.
>> Btrfs raid0 uses 64kb as its stripe size, so a mismatch can make data
>> access unaligned (or use the single profile for btrfs).
>
> That sounds like an interesting idea, except for the unaligned writes.
> I will need to test this.

AFAIK btrfs also uses 64kb for metadata:
https://github.com/torvalds/linux/blob/e26f1bea3b833fb2c16fb5f0a949da1efa219de3/fs/btrfs/extent-tree.c#L6678

>> 3. Use btrfs ssd_spread to decrease RMW cycles.
> Can you explain this?

Long description:
https://www.spinics.net/lists/linux-btrfs/msg67515.html

Short:
that option changes the allocator logic.
The allocator will spread writes more aggressively and always try to write to
a new/empty area.
So in theory new data goes into new, empty chunks; if you have plenty of free
space, that gives some guarantee that old data is not touched, so no RMW is
done and, in theory, full stripe writes always happen.

But if you expect that your array will be nearly full and you don't want to
defragment it, that can easily get you an ENOSPC error.
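
Since ssd_spread allocates more aggressively, it is worth watching unallocated
space on a nearly full filesystem; for example (placeholder mountpoint):

  # allocated vs. used per profile, plus unallocated space per device
  btrfs filesystem usage /mnt/big
  # reclaim nearly empty data chunks if unallocated space runs low
  btrfs balance start -dusage=10 /mnt/big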

> Stefan

That's just my IMHO,
Thanks.
-- 
Have a nice day,
Timofey.


Re: speed up big btrfs volumes with ssds

2017-09-04 Thread Stefan Priebe - Profihost AG
On 04.09.2017 at 15:28, Timofey Titovets wrote:
> 2017-09-04 15:57 GMT+03:00 Stefan Priebe - Profihost AG:
>> On 04.09.2017 at 12:53, Henk Slager wrote:
>>> On Sun, Sep 3, 2017 at 8:32 PM, Stefan Priebe - Profihost AG wrote:
>>>> Hello,
>>>>
>>>> I'm trying to speed up big btrfs volumes.
>>>>
>>>> Some facts:
>>>> - Kernel will be 4.13-rc7
>>>> - needed volume size is 60TB
>>>>
>>>> Currently without any SSDs I get the best speed with:
>>>> - 4x HW RAID 5 with 1GB controller memory, built from 4TB 3.5" drives
>>>>
>>>> and using btrfs raid0 for data and metadata on top of those 4 RAID 5 arrays.
>>>>
>>>> I can live with a data loss every now and then ;-) so a raid0 on
>>>> top of the 4x RAID 5 is acceptable for me.
>>>>
>>>> Currently the write speed is not as good as I would like - especially
>>>> for random 8k-16k I/O.
>>>>
>>>> My current idea is to use a PCIe flash card with bcache on top of each
>>>> RAID 5.
>>>
>>> Whether it can speed things up depends quite a lot on what the use-case is; for
>>> some not-so-parallel access it might work. So this 60TB is then
>>> 20 4TB disks or so, and the 4x 1GB cache is simply not very helpful, I
>>> think. The working set doesn't fit in it, I guess. If there are mostly
>>> single or just a few users of the fs, a single PCIe-based bcache caching 4
>>> devices can work, but for SATA SSDs I would use 1 SSD per HW RAID 5.
>>
>> Yes, that's roughly my idea as well, and yes, the workload is 4 users max
>> writing data: 50% sequential, 50% random.
>>
>>> Then roughly make sure the complete set of metadata blocks fits in the
>>> cache. For an fs of this size, let's estimate 150G. Then maybe the same
>>> or double that for data, so an SSD of 500G would be a first try.
>>
>> I would use 1TB devices for each RAID, or a 4TB PCIe card.
>>
>>> You give the impression that reliability for this fs is not the
>>> highest prio, so if you go full risk, put bcache in write-back
>>> mode; then you will have your desired random 8k-16k I/O speedup after
>>> the cache is warmed up. But any SW or HW failure will normally result in
>>> total fs loss if SSD and HDD get out of sync somehow. Bcache
>>> write-through might also be acceptable; you will need extensive
>>> monitoring and tuning of all (bcache) parameters etc. to be sure of the
>>> right choice of size and setup.
>>
>> Yes, I wanted to use write-back mode. Has anybody already done some
>> tests or gained experience with a setup like this?
>>
> 
> Maybe you can make your RAID setup faster by:
> 1. Use the single profile

I'm already using the raid0 profile - see below:

Data,RAID0: Size:22.57TiB, Used:21.08TiB
Metadata,RAID0: Size:90.00GiB, Used:82.28GiB
System,RAID0: Size:64.00MiB, Used:1.53MiB

> 2. Use a different stripe size for HW RAID 5:
> I think 16kb will be optimal with 5 devices per RAID group.
> That will give you a 64kb data stripe and 16kb parity.
> Btrfs raid0 uses 64kb as its stripe size, so a mismatch can make data
> access unaligned (or use the single profile for btrfs).

That sounds like an interesting idea, except for the unaligned writes.
I will need to test this.

> 3. Use btrfs ssd_spread to decrease RMW cycles.
Can you explain this?

Stefan


Re: speed up big btrfs volumes with ssds

2017-09-04 Thread Peter Grandi
>>> [ ... ] Currently without any ssds i get the best speed with:
>>> - 4x HW Raid 5 with 1GB controller memory of 4TB 3,5" devices
>>> and using btrfs as raid 0 for data and metadata on top of
>>> those 4 raid 5. [ ... ] the write speed is not as good as i
>>> would like - especially for random 8k-16k I/O. [ ... ]

> [ ... ] 64kb data stripe and 16kb parity Btrfs raid0 use 64kb
> as stripe so that can make data access unaligned (or use single
> profile for btrfs) 3. Use btrfs ssd_spread to decrease RMW
> cycles.

This is not a "revolutionary" scientific discovery as the idea of
a working set of a small-size random-write workload, but it still
takes a lot of "optimism" to imagine that it is possible to
"decrease RMW cycles" for "random 8k-16k" writes on 64KiB+16KiB
RAID5 stripes, whether with 'ssd_spread' or not.

To "decrease RMW cycles" seems inded to me the better aim than
following the "radical" aim of caching the working set of a
random-small-write workload, but it may be less easy to achieve
than desirable :-). http://www.baarf.dk/


Re: speed up big btrfs volumes with ssds

2017-09-04 Thread Russell Coker
On Monday, 4 September 2017 2:57:18 PM AEST Stefan Priebe - Profihost AG 
wrote:
> > Then roughly make sure the complete set of metadata blocks fits in the
> > cache. For an fs of this size let's say/estimate 150G. Then maybe same
> > of double for data, so an SSD of 500G would be a first try.
> 
> I would use 1TB devices for each Raid or a 4TB PCIe card.

One thing I've considered is to create a filesystem with a RAID-1 of SSDs and 
then create lots of files with long names to use up a lot of space on the 
SSDs.  Then delete those files and add disks to the filesystem.  Then BTRFS 
should keep using the allocated metadata blocks on the SSD for all metadata 
and use disks for just data.
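
A rough, untested sketch of that idea, with placeholder device names:

  # metadata (and, for now, data) on the two SSDs
  mkfs.btrfs -m raid1 -d raid1 /dev/ssd1 /dev/ssd2
  mount /dev/ssd1 /mnt/big
  # create lots of long-named files so btrfs allocates plenty of metadata chunks
  mkdir /mnt/big/pad
  for i in $(seq 1 1000000); do
    touch "/mnt/big/pad/very_long_file_name_to_inflate_the_metadata_trees_$i"
  done
  rm -rf /mnt/big/pad
  # then add the big disks; new data chunks should be allocated there
  btrfs device add /dev/sdc /dev/sdd /mnt/big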

I haven't yet tried bcache, but would prefer something simpler with one less 
layer.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/



Re: speed up big btrfs volumes with ssds

2017-09-04 Thread Timofey Titovets
2017-09-04 15:57 GMT+03:00 Stefan Priebe - Profihost AG:
> On 04.09.2017 at 12:53, Henk Slager wrote:
>> On Sun, Sep 3, 2017 at 8:32 PM, Stefan Priebe - Profihost AG wrote:
>>> Hello,
>>>
>>> I'm trying to speed up big btrfs volumes.
>>>
>>> Some facts:
>>> - Kernel will be 4.13-rc7
>>> - needed volume size is 60TB
>>>
>>> Currently without any SSDs I get the best speed with:
>>> - 4x HW RAID 5 with 1GB controller memory, built from 4TB 3.5" drives
>>>
>>> and using btrfs raid0 for data and metadata on top of those 4 RAID 5 arrays.
>>>
>>> I can live with a data loss every now and then ;-) so a raid0 on
>>> top of the 4x RAID 5 is acceptable for me.
>>>
>>> Currently the write speed is not as good as I would like - especially
>>> for random 8k-16k I/O.
>>>
>>> My current idea is to use a PCIe flash card with bcache on top of each
>>> RAID 5.
>>
>> Whether it can speed things up depends quite a lot on what the use-case is; for
>> some not-so-parallel access it might work. So this 60TB is then
>> 20 4TB disks or so, and the 4x 1GB cache is simply not very helpful, I
>> think. The working set doesn't fit in it, I guess. If there are mostly
>> single or just a few users of the fs, a single PCIe-based bcache caching 4
>> devices can work, but for SATA SSDs I would use 1 SSD per HW RAID 5.
>
> Yes, that's roughly my idea as well, and yes, the workload is 4 users max
> writing data: 50% sequential, 50% random.
>
>> Then roughly make sure the complete set of metadata blocks fits in the
>> cache. For an fs of this size, let's estimate 150G. Then maybe the same
>> or double that for data, so an SSD of 500G would be a first try.
>
> I would use 1TB devices for each RAID, or a 4TB PCIe card.
>
>> You give the impression that reliability for this fs is not the
>> highest prio, so if you go full risk, put bcache in write-back
>> mode; then you will have your desired random 8k-16k I/O speedup after
>> the cache is warmed up. But any SW or HW failure will normally result in
>> total fs loss if SSD and HDD get out of sync somehow. Bcache
>> write-through might also be acceptable; you will need extensive
>> monitoring and tuning of all (bcache) parameters etc. to be sure of the
>> right choice of size and setup.
>
> Yes, I wanted to use write-back mode. Has anybody already done some
> tests or gained experience with a setup like this?
>
> Greets,
> Stefan

Maybe you can make your RAID setup faster by:
1. Use the single profile
2. Use a different stripe size for HW RAID 5:
I think 16kb will be optimal with 5 devices per RAID group.
That will give you a 64kb data stripe and 16kb parity.
Btrfs raid0 uses 64kb as its stripe size, so a mismatch can make data
access unaligned (or use the single profile for btrfs); see the sketch below.
3. Use btrfs ssd_spread to decrease RMW cycles.
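
For illustration only: the exact commands depend on the HW RAID vendor's CLI,
but the equivalent layout with Linux software RAID would be a 16 KiB chunk
across 5 devices (placeholder device names), i.e. 4 x 16 KiB = 64 KiB of data
plus 16 KiB of parity per full stripe:

  # 5-device RAID5 with a 16 KiB chunk; one full stripe carries 64 KiB of data
  mdadm --create /dev/md0 --level=5 --raid-devices=5 --chunk=16 /dev/sd[b-f]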

Thanks.


-- 
Have a nice day,
Timofey.


Re: speed up big btrfs volumes with ssds

2017-09-04 Thread Stefan Priebe - Profihost AG
On 04.09.2017 at 12:53, Henk Slager wrote:
> On Sun, Sep 3, 2017 at 8:32 PM, Stefan Priebe - Profihost AG wrote:
>> Hello,
>>
>> I'm trying to speed up big btrfs volumes.
>>
>> Some facts:
>> - Kernel will be 4.13-rc7
>> - needed volume size is 60TB
>>
>> Currently without any SSDs I get the best speed with:
>> - 4x HW RAID 5 with 1GB controller memory, built from 4TB 3.5" drives
>>
>> and using btrfs raid0 for data and metadata on top of those 4 RAID 5 arrays.
>>
>> I can live with a data loss every now and then ;-) so a raid0 on
>> top of the 4x RAID 5 is acceptable for me.
>>
>> Currently the write speed is not as good as I would like - especially
>> for random 8k-16k I/O.
>>
>> My current idea is to use a PCIe flash card with bcache on top of each
>> RAID 5.
> 
> Whether it can speed things up depends quite a lot on what the use-case is; for
> some not-so-parallel access it might work. So this 60TB is then
> 20 4TB disks or so, and the 4x 1GB cache is simply not very helpful, I
> think. The working set doesn't fit in it, I guess. If there are mostly
> single or just a few users of the fs, a single PCIe-based bcache caching 4
> devices can work, but for SATA SSDs I would use 1 SSD per HW RAID 5.

Yes, that's roughly my idea as well, and yes, the workload is 4 users max
writing data: 50% sequential, 50% random.

> Then roughly make sure the complete set of metadata blocks fits in the
> cache. For an fs of this size, let's estimate 150G. Then maybe the same
> or double that for data, so an SSD of 500G would be a first try.

I would use 1TB devices for each RAID, or a 4TB PCIe card.

> You give the impression that reliability for this fs is not the
> highest prio, so if you go full risk, put bcache in write-back
> mode; then you will have your desired random 8k-16k I/O speedup after
> the cache is warmed up. But any SW or HW failure will normally result in
> total fs loss if SSD and HDD get out of sync somehow. Bcache
> write-through might also be acceptable; you will need extensive
> monitoring and tuning of all (bcache) parameters etc. to be sure of the
> right choice of size and setup.

Yes, I wanted to use write-back mode. Has anybody already done some
tests or gained experience with a setup like this?

Greets,
Stefan


Re: speed up big btrfs volumes with ssds

2017-09-04 Thread Peter Grandi
>> [ ... ] Currently the write speed is not as good as i would
>> like - especially for random 8k-16k I/O. [ ... ]

> [ ... ] So this 60TB is then 20 4TB disks or so and the 4x 1GB
> cache is simply not very helpful I think. The working set
> doesn't fit in it I guess. If there is mostly single or a few
> users of the fs, a single pcie based bcacheing 4 devices can
> work, but for SATA SSD, I would use 1 SSD per HWraid5. [ ... ]

Probably the idea of the cacheable working set of a random small
write workload is a major new scientific discovery. :-)


Re: speed up big btrfs volumes with ssds

2017-09-04 Thread Henk Slager
On Sun, Sep 3, 2017 at 8:32 PM, Stefan Priebe - Profihost AG wrote:
> Hello,
>
> I'm trying to speed up big btrfs volumes.
>
> Some facts:
> - Kernel will be 4.13-rc7
> - needed volume size is 60TB
>
> Currently without any SSDs I get the best speed with:
> - 4x HW RAID 5 with 1GB controller memory, built from 4TB 3.5" drives
>
> and using btrfs raid0 for data and metadata on top of those 4 RAID 5 arrays.
>
> I can live with a data loss every now and then ;-) so a raid0 on
> top of the 4x RAID 5 is acceptable for me.
>
> Currently the write speed is not as good as I would like - especially
> for random 8k-16k I/O.
>
> My current idea is to use a PCIe flash card with bcache on top of each
> RAID 5.

Whether it can speed things up depends quite a lot on what the use-case is; for
some not-so-parallel access it might work. So this 60TB is then
20 4TB disks or so, and the 4x 1GB cache is simply not very helpful, I
think. The working set doesn't fit in it, I guess. If there are mostly
single or just a few users of the fs, a single PCIe-based bcache caching 4
devices can work, but for SATA SSDs I would use 1 SSD per HW RAID 5.

Then roughly make sure the complete set of metadata blocks fits in the
cache. For an fs of this size, let's estimate 150G. Then maybe the same
or double that for data, so an SSD of 500G would be a first try.
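
The actual metadata footprint is easy to check before sizing the SSD; for
example (placeholder mountpoint):

  # the Metadata line shows how much would ideally stay in the cache
  btrfs filesystem df /mnt/big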

You give the impression that reliability for this fs is not the
highest prio, so if you go full risk, put bcache in write-back
mode; then you will have your desired random 8k-16k I/O speedup after
the cache is warmed up. But any SW or HW failure will normally result in
total fs loss if SSD and HDD get out of sync somehow. Bcache
write-through might also be acceptable; you will need extensive
monitoring and tuning of all (bcache) parameters etc. to be sure of the
right choice of size and setup.
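
A minimal sketch of the layering and tunables being discussed, with
placeholder device names (one backing device per HW RAID 5, one flash cache
device, write-back mode); the exact values would need to be verified against
the bcache documentation for the kernel in use:

  # register the HW RAID 5 volume as a backing device and the flash as a cache
  make-bcache -B /dev/sda
  make-bcache -C /dev/nvme0n1
  # attach the backing device to the cache set (UUID from bcache-super-show)
  echo <cset-uuid> > /sys/block/bcache0/bcache/attach
  # enable write-back so small random writes land on the flash first
  echo writeback > /sys/block/bcache0/bcache/cache_mode
  # only cache random I/O; long sequential streams bypass the cache
  echo 4M > /sys/block/bcache0/bcache/sequential_cutoff
  # watch hit rates once the cache has warmed up
  cat /sys/block/bcache0/bcache/stats_total/cache_hit_ratio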


Re: speed up big btrfs volumes with ssds

2017-09-03 Thread Peter Grandi
> [ ... ] - needed volume size is 60TB

I wonder how long that takes to 'scrub', 'balance', 'check',
'subvolume delete', 'find', etc.

> [ ... ] 4x HW Raid 5 with 1GB controller memory of 4TB 3,5"
> devices and using btrfs as raid 0 for data and metadata on top
> of those 4 raid 5. [ ... ]  the write speed is not as good as
> i would like - especially for random 8k-16k I/O. [ ... ]

Also I noticed that the rain is wet and cold - especially if one
walks around for a few hours in a t-shirt, shorts and sandals.
:-)

> My current idea is to use a pcie flash card with bcache on top
> of each raid 5. Is this something which makes sense to speed
> up the write speed.

Well, 'bcache' in the role of write buffer allegedly helps turn
unaligned writes into aligned writes, so it might help, but I
wonder how effective that will be in this case; plus it won't
turn low random-IOPS-per-TB 4TB devices into high ones. Anyhow,
if they are battery-backed, the 1GB of HW HBA cache/buffer should
do exactly that, except that again in this case that is rather
optimistic.

But this reminds me of the common story: "Doctor, if I repeatedly
stab my hand with a fork it hurts a lot, how to fix that?"
"Don't do it".
:-)

PS Random writes of 8-16KiB over 60TB might seem like storing
small records/images in small files. That would be "brave".
On a 60TB RAID50 of 20x 4TB disk drives that might mean around
5-10MB/s of random small writes, including both data and
metadata (roughly 20 drives at ~100 random IOPS each, cut down by
the RAID5 read-modify-write penalty, at 8-16KiB per write).


speed up big btrfs volumes with ssds

2017-09-03 Thread Stefan Priebe - Profihost AG
Hello,

I'm trying to speed up big btrfs volumes.

Some facts:
- Kernel will be 4.13-rc7
- needed volume size is 60TB

Currently without any SSDs I get the best speed with:
- 4x HW RAID 5 with 1GB controller memory, built from 4TB 3.5" drives

and using btrfs raid0 for data and metadata on top of those 4 RAID 5 arrays.

I can live with a data loss every now and then ;-) so a raid0 on
top of the 4x RAID 5 is acceptable for me.

Currently the write speed is not as good as I would like - especially
for random 8k-16k I/O.

My current idea is to use a PCIe flash card with bcache on top of each
RAID 5.

Is this something that makes sense to speed up the write speed?

Greets,
Stefan