Re: Btrfs/SSD

2017-05-16 Thread Kai Krakow
On Tue, 16 May 2017 14:21:20 +0200, Tomasz Torcz wrote:

> On Tue, May 16, 2017 at 03:58:41AM +0200, Kai Krakow wrote:
> > Am Mon, 15 May 2017 22:05:05 +0200
> > schrieb Tomasz Torcz :
> >   
>  [...]  
> > > 
> > >   Let me add my 2 cents.  bcache-writearound does not cache writes
> > > on SSD, so there are less writes overall to flash.  It is said
> > > to prolong the life of the flash drive.
> > >   I've recently switched from bcache-writeback to
> > > bcache-writearound, because my SSD caching drive is at the edge
> > > of it's lifetime. I'm using bcache in following configuration:
> > > http://enotty.pipebreaker.pl/dżogstaff/2016.05.25-opcja2.svg My
> > > SSD is Samsung SSD 850 EVO 120GB, which I bought exactly 2 years
> > > ago.
> > > 
> > >   Now, according to
> > > http://www.samsung.com/semiconductor/minisite/ssd/product/consumer/850evo.html
> > > 120GB and 250GB warranty only covers 75 TBW (terabytes written).  
> > 
> > According to your chart, all your data is written twice to bcache.
> > It may have been better to buy two drives, one per mirror. I don't
> > think that SSD firmwares do deduplication - so data is really
> > written twice.  
> 
>   I'm aware of that, but 50 GB (I've got 100GB caching partition)
> is still plenty to cache my ~, some media files, two small VMs.
> On the other hand I don't want to overspend. This is just a home
> server.
>   Nb. I'm still waiting for btrfs native SSD caching, which was
> planned for 3.6 kernel 5 years ago :)
> ( 
> https://oss.oracle.com/~mason/presentation/btrfs-jls-12/btrfs.html#/planned-3.6
> )
> 
> > 
> >   
> > > My
> > > drive has  # smartctl -a /dev/sda  | grep LBA 241
> > > Total_LBAs_Written  0x0032   099   099   000Old_age
> > > Always   -   136025596053  
> > 
> > Doesn't say this "99%" remaining? The threshold is far from being
> > reached...
> > 
> > I'm curious, what is Wear_Leveling_Count reporting?  
> 
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       18227
>  12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       29
> 177 Wear_Leveling_Count     0x0013   001   001   000    Pre-fail  Always       -       4916
> 
>  Is this 001 mean 1%? If so, SMART contradicts datasheets. And I
> don't think I shoud see read errors for 1% wear.

It rather means 1% left, that is, 99% wear... Most of these are counters
that run from 100 down to zero, with THRESH being the threshold at or
below which the attribute is considered failed or failing.

Only a few values work the other way around (like temperature).

Be careful with interpreting raw values: they may be very manufacturer
specific and not normalized.

According to Total_LBAs_Written, the manufacturer thinks the drive
could still take 100x more (only 1% used). But your wear level is almost
100% (value = 001). I think that value isn't really designed around the
flash cell lifetime, but intermediate components like caches.

So you need to read most values "backwards": It's not a used counter,
but a "what's left" counter.

What does it tell you about reserved block usage? Note that it's sort
of a double negation here: a value of 100 means 100% unused, or 0%
used... ;-) Or just put a "minus" in front of those values and think
of them counting up to zero. So on a time axis it's at -100% of the
total lifetime scale and 0 is the fail point (or whatever "thresh"
says).
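
For illustration, a rough way to read that from smartctl output (a sketch
only; assumes smartmontools is installed, and the attribute name
"Wear_Leveling_Count" is Samsung-specific, so treat it as an example):

  # derive "percent of rated wear consumed" as 100 - normalized VALUE
  smartctl -A /dev/sda | awk '$2 == "Wear_Leveling_Count" {
      printf "normalized VALUE=%d -> roughly %d%% of rated wear consumed\n", $4, 100 - $4
  }'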


-- 
Regards,
Kai

Replies to list-only preferred.




Re: Btrfs/SSD

2017-05-16 Thread Austin S. Hemmelgarn

On 2017-05-16 08:21, Tomasz Torcz wrote:

On Tue, May 16, 2017 at 03:58:41AM +0200, Kai Krakow wrote:

On Mon, 15 May 2017 22:05:05 +0200, Tomasz Torcz wrote:

My drive has  # smartctl -a /dev/sda | grep LBA
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       136025596053


Doesn't say this "99%" remaining? The threshold is far from being
reached...

I'm curious, what is Wear_Leveling_Count reporting?


ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       18227
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       29
177 Wear_Leveling_Count     0x0013   001   001   000    Pre-fail  Always       -       4916

 Does this 001 mean 1%? If so, SMART contradicts the datasheets. And I
don't think I should see read errors at 1% wear.
The 'normalized' values shown in the VALUE, WORST, and THRESH columns 
usually count down to zero (with the notable exception of the thermal 
attributes, which usually match the raw value). They exist as a way of 
comparing things without having to know what vendor or model the device 
is, as the raw values are (again with limited exceptions) technically 
vendor-specific (the various *_Error_Rate counters on traditional HDD's 
are good examples of this).  VALUE is your current value, WORST is a 
peak-detector type thing that monitors the worst it's been, and THRESH 
is the point at which the device manufacturer considers that attribute 
failed (which will usually result in the 'Overall Health Assessment' 
failing as well), though I'm pretty sure that if THRESH is 000, the 
firmware doesn't base its assessment for that attribute on the 
normalized value at all.
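
As a rough illustration of that reading (a sketch, assuming smartmontools
and the usual -A column layout shown above): check the firmware's overall
verdict, and flag any attribute whose normalized VALUE has dropped to a
non-zero THRESH:

  smartctl -H /dev/sda    # the firmware's overall health assessment
  smartctl -A /dev/sda | awk '$1 ~ /^[0-9]+$/ && $6+0 > 0 && $4+0 <= $6+0 {
      print "attribute", $1, $2, "VALUE", $4, "is at or below THRESH", $6
  }'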



Re: Btrfs/SSD

2017-05-16 Thread Tomasz Torcz
On Tue, May 16, 2017 at 03:58:41AM +0200, Kai Krakow wrote:
> Am Mon, 15 May 2017 22:05:05 +0200
> schrieb Tomasz Torcz :
> 
> > > Yes, I considered that, too. And when I tried, there was almost no
> > > perceivable performance difference between bcache-writearound and
> > > bcache-writeback. But the latency of performance improvement was
> > > much longer in writearound mode, so I sticked to writeback mode.
> > > Also, writing random data is faster because bcache will defer it to
> > > background and do writeback in sector order. Sequential access is
> > > passed around bcache anyway, harddisks are already good at that.  
> > 
> >   Let me add my 2 cents.  bcache-writearound does not cache writes
> > on SSD, so there are less writes overall to flash.  It is said
> > to prolong the life of the flash drive.
> >   I've recently switched from bcache-writeback to bcache-writearound,
> > because my SSD caching drive is at the edge of it's lifetime. I'm
> > using bcache in following configuration:
> > http://enotty.pipebreaker.pl/dżogstaff/2016.05.25-opcja2.svg My SSD
> > is Samsung SSD 850 EVO 120GB, which I bought exactly 2 years ago.
> > 
> >   Now, according to
> > http://www.samsung.com/semiconductor/minisite/ssd/product/consumer/850evo.html
> > 120GB and 250GB warranty only covers 75 TBW (terabytes written).
> 
> According to your chart, all your data is written twice to bcache. It
> may have been better to buy two drives, one per mirror. I don't think
> that SSD firmwares do deduplication - so data is really written twice.

  I'm aware of that, but 50 GB (I've got a 100GB caching partition)
is still plenty to cache my ~, some media files, and two small VMs.
On the other hand, I don't want to overspend. This is just a home
server.
  Nb. I'm still waiting for btrfs native SSD caching, which was
planned for the 3.6 kernel 5 years ago :)
( 
https://oss.oracle.com/~mason/presentation/btrfs-jls-12/btrfs.html#/planned-3.6 
)

> 
> 
> > My
> > drive has  # smartctl -a /dev/sda  | grep LBA 241
> > Total_LBAs_Written  0x0032   099   099   000Old_age
> > Always   -   136025596053
> 
> Doesn't say this "99%" remaining? The threshold is far from being
> reached...
> 
> I'm curious, what is Wear_Leveling_Count reporting?

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       18227
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       29
177 Wear_Leveling_Count     0x0013   001   001   000    Pre-fail  Always       -       4916

 Does this 001 mean 1%? If so, SMART contradicts the datasheets. And I
don't think I should see read errors at 1% wear.
 

> > which multiplied by 512 bytes gives 69.6 TB. Close to 75TB? Well…
> > 

-- 
Tomasz Torcz   ,,(...) today's high-end is tomorrow's embedded processor.''
xmpp: zdzich...@chrome.pl  -- Mitchell Blank on LKML



Re: Btrfs/SSD

2017-05-16 Thread Austin S. Hemmelgarn

On 2017-05-15 15:49, Kai Krakow wrote:

On Mon, 15 May 2017 08:03:48 -0400, "Austin S. Hemmelgarn" wrote:


That's why I don't trust any of my data to them. But I still want
the benefit of their speed. So I use SSDs mostly as frontend caches
to HDDs. This gives me big storage with fast access. Indeed, I'm
using bcache successfully for this. A warm cache is almost as fast
as native SSD (at least it feels almost that fast, it will be
slower if you threw benchmarks at it).

That's to be expected though, most benchmarks don't replicate actual
usage patterns for client systems, and using SSD's for caching with
bcache or dm-cache for most server workloads except a file server
will usually get you a performance hit.


You mean "performance boost"? Almost every read-most server workload
should benefit... I file server may be the exact opposite...
In my experience, short of some types of file server and non-interactive 
websites, read-mostly server workloads are rare.


Also, I think dm-cache and bcache work very differently and are not
directly comparable. Their benefit depends much on the applied workload.
The low-level framework is different, and much of the internals are 
different, but based on most of the testing I've done, running them in 
the same mode (write-back/write-through/etc) will on average get you 
roughly the same performance.


If I remember right, dm-cache is more about keeping "hot data" in the
flash storage while bcache is more about reducing seeking. So dm-cache
optimizes for bigger throughput of SSDs while bcache optimizes for
almost-zero seek overhead of SSDs. Depending on your underlying
storage, one or the other may even give zero benefit or worsen
performance. Which is what I'd call a "performance hit"... I didn't
ever try dm-cache, tho. For reasons I don't remember exactly, I didn't
like something about how it's implemented, I think it was related to
crash recovery. I don't know if that still holds true with modern
kernels. It may have changed but I never looked back to revise that
decision.
dm-cache is a bit easier to convert to or from in-place and is in my 
experience a bit more flexible in data handling, but has the issue that 
you can still see the FS on the back-end storage (because it has no 
superblock or anything like that on the back-end storage), which means 
it's almost useless with BTRFS, and it requires a separate cache device 
for each back-end device (as well as an independent metadata device, but 
that's usually tiny since it's largely just used as a bitmap to track 
what blocks are clean in-cache).


bcache is more complicated to set up initially, and _requires_ a kernel 
with bcache support to access even if you aren't doing any caching, but 
it masks the back-end (so it's safe to use with BTRFS (recent versions 
of it are at least)), and it doesn't require a 1:1 mapping of cache 
devices to back-end storage.
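
For reference, a minimal bcache setup sketch (the device names are
placeholders; assumes bcache-tools and a bcache-enabled kernel). Several
backing devices can be attached to the same cache set, which is the
"no 1:1 mapping" point above:

  make-bcache -B /dev/sdb1      # format the backing device (HDD)
  make-bcache -C /dev/sdc1      # format the cache device (SSD)
  # attach the cache set to the backing device; get the UUID from 'bcache-super-show /dev/sdc1'
  echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach
  mkfs.btrfs /dev/bcache0       # the filesystem lives on the bcache device, masking the back-end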




It's worth noting also that on average, COW filesystems like BTRFS
(or log-structured filesystems) will not benefit as much as
traditional filesystems from SSD caching unless the caching is built
into the filesystem itself, since they don't do in-place rewrites (so
any new write by definition has to drop other data from the cache).


Yes, I considered that, too. And when I tried, there was almost no
perceivable performance difference between bcache-writearound and
bcache-writeback. But the latency of performance improvement was much
longer in writearound mode, so I stuck with writeback mode. Also,
writing random data is faster because bcache will defer it to the
background and do writeback in sector order. Sequential access is
passed around bcache anyway; hard disks are already good at that.

But of course, the COW nature of btrfs will lower the hit rate I can
get on writes. That's why I see no benefit in using bcache-writethrough
with btrfs.
Yeah, on average based on my own testing, write-through mode is 
worthless for COW filesystems, and write-back is only worthwhile if you 
have a large enough cache proportionate to your bandwidth requirements 
(4G should be more than enough for a desktop or workstation, but servers 
may need huge amounts of space), while write-around is only worthwhile 
for stuff that needs read performance but doesn't really care about latency.




Re: Btrfs/SSD

2017-05-15 Thread Duncan
Kai Krakow posted on Mon, 15 May 2017 21:12:06 +0200 as excerpted:

> Am Mon, 15 May 2017 14:09:20 +0100
> schrieb Tomasz Kusmierz :
>> 
>> Not true. When HDD uses 10% (10% is just for easy example) of space
>> as spare than aligment on disk is (US - used sector, SS - spare
>> sector, BS - bad sector)
>> 
>> US US US US US US US US US SS
>> US US US US US US US US US SS
>> US US US US US US US US US SS
>> US US US US US US US US US SS
>> US US US US US US US US US SS
>> US US US US US US US US US SS
>> US US US US US US US US US SS
>> 
>> if failure occurs - drive actually shifts sectors up:
>> 
>> US US US US US US US US US SS
>> US US US BS BS BS US US US US
>> US US US US US US US US US US
>> US US US US US US US US US US
>> US US US US US US US US US SS
>> US US US BS US US US US US US
>> US US US US US US US US US SS
>> US US US US US US US US US SS
> 
> This makes sense... Reserve area somehow implies it is continuous and
> as such located at one far end of the platter. But your image totally
> makes sense.

Thanks Tomasz.  It makes a lot of sense indeed, and had I thought about
it I think I already "knew" it, but I simply hadn't stopped to think about
it that hard, so you disabused me of the vague idea of spares all at one
end of the disk, too. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Btrfs/SSD

2017-05-15 Thread Kai Krakow
On Mon, 15 May 2017 22:05:05 +0200, Tomasz Torcz wrote:

> On Mon, May 15, 2017 at 09:49:38PM +0200, Kai Krakow wrote:
> >   
> > > It's worth noting also that on average, COW filesystems like BTRFS
> > > (or log-structured-filesystems will not benefit as much as
> > > traditional filesystems from SSD caching unless the caching is
> > > built into the filesystem itself, since they don't do in-place
> > > rewrites (so any new write by definition has to drop other data
> > > from the cache).  
> > 
> > Yes, I considered that, too. And when I tried, there was almost no
> > perceivable performance difference between bcache-writearound and
> > bcache-writeback. But the latency of performance improvement was
> > much longer in writearound mode, so I sticked to writeback mode.
> > Also, writing random data is faster because bcache will defer it to
> > background and do writeback in sector order. Sequential access is
> > passed around bcache anyway, harddisks are already good at that.  
> 
>   Let me add my 2 cents.  bcache-writearound does not cache writes
> on SSD, so there are less writes overall to flash.  It is said
> to prolong the life of the flash drive.
>   I've recently switched from bcache-writeback to bcache-writearound,
> because my SSD caching drive is at the edge of it's lifetime. I'm
> using bcache in following configuration:
> http://enotty.pipebreaker.pl/dżogstaff/2016.05.25-opcja2.svg My SSD
> is Samsung SSD 850 EVO 120GB, which I bought exactly 2 years ago.
> 
>   Now, according to
> http://www.samsung.com/semiconductor/minisite/ssd/product/consumer/850evo.html
> 120GB and 250GB warranty only covers 75 TBW (terabytes written).

According to your chart, all your data is written twice to bcache. It
may have been better to buy two drives, one per mirror. I don't think
that SSD firmware does deduplication - so data is really written twice.

They may do compression, but that won't be streaming compression but
per-block compression, so it won't help here as a deduplicator. Also,
due to the internal structure, compression would probably work similarly
to how zswap works: by combining compressed blocks into "buddy blocks",
so only compression above 2:1 will merge compressed blocks into single
blocks. For most of your data, this won't be the case. So effectively,
this has no overall effect. For this reason, I doubt that any firmware
bothers with compression; the gains are just too low versus the
management overhead and complexity it adds to the already complicated
FTL layer.


> My drive has  # smartctl -a /dev/sda | grep LBA
> 241 Total_LBAs_Written  0x0032   099   099   000    Old_age   Always   -   136025596053

Doesn't this say "99%" remaining? The threshold is far from being
reached...

I'm curious, what is Wear_Leveling_Count reporting?

> which multiplied by 512 bytes gives 69.6 TB. Close to 75TB? Well…
> 
> [35354.697513] sd 0:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [35354.697516] sd 0:0:0:0: [sda] tag#19 Sense Key : Medium Error [current]
> [35354.697518] sd 0:0:0:0: [sda] tag#19 Add. Sense: Unrecovered read error - auto reallocate failed
> [35354.697522] sd 0:0:0:0: [sda] tag#19 CDB: Read(10) 28 00 0c 30 82 9f 00 00 48 00
> [35354.697524] blk_update_request: I/O error, dev sda, sector 204505785
> 
> Above started appearing recently.  So, I was really suprised that:
> - this drive is only rated for 120 TBW
> - I went through this limit in only 2 years
> 
>   The workload is lightly utilised home server / media center.

I think bcache is a real SSD killer for drives around 120GB or
below... I saw similar life usage with my previous small SSD after just
one year. But I never had a sense error because I took it out of
service early. And I switched to writearound, too.

I think the write pattern of bcache cannot be handled well by the FTL.
It behaves like a log-structured file system, with new writes only
appended, and sometimes a garbage collection is done by freeing
complete erase blocks. Maybe it could work better if btrfs could pass
information about freed blocks down to bcache. Btrfs has a lot of these
due to its COW nature.

I wonder if this would already be supported by turning on discard in
btrfs? Does anyone know?
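
(For anyone who wants to experiment, a sketch of enabling discard on the
btrfs side; whether bcache actually forwards those discards down to the
SSD is exactly the open question here, and the mount point is a
placeholder:)

  mount -o discard /dev/bcache0 /mnt    # synchronous discard on delete
  fstrim -v /mnt                        # or batch discards periodically instead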


-- 
Regards,
Kai

Replies to list-only preferred.




Re: Btrfs/SSD

2017-05-15 Thread Tomasz Torcz
On Mon, May 15, 2017 at 09:49:38PM +0200, Kai Krakow wrote:
> 
> > It's worth noting also that on average, COW filesystems like BTRFS
> > (or log-structured-filesystems will not benefit as much as
> > traditional filesystems from SSD caching unless the caching is built
> > into the filesystem itself, since they don't do in-place rewrites (so
> > any new write by definition has to drop other data from the cache).
> 
> Yes, I considered that, too. And when I tried, there was almost no
> perceivable performance difference between bcache-writearound and
> bcache-writeback. But the latency of performance improvement was much
> longer in writearound mode, so I sticked to writeback mode. Also,
> writing random data is faster because bcache will defer it to
> background and do writeback in sector order. Sequential access is
> passed around bcache anyway, harddisks are already good at that.

  Let me add my 2 cents.  bcache-writearound does not cache writes
on SSD, so there are fewer writes overall to flash.  It is said
to prolong the life of the flash drive.
  I've recently switched from bcache-writeback to bcache-writearound,
because my SSD caching drive is at the edge of its lifetime. I'm
using bcache in the following configuration: 
https://enotty.pipebreaker.pl/dżogstaff/2016.05.25-opcja2.svg
My SSD is Samsung SSD 850 EVO 120GB, which I bought exactly 2 years ago.
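
(For reference, the bcache cache mode can be flipped at runtime through
sysfs; a sketch, assuming the device shows up as bcache0:)

  cat /sys/block/bcache0/bcache/cache_mode    # current mode is shown in brackets
  echo writearound > /sys/block/bcache0/bcache/cache_mode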

  Now, according to 
http://www.samsung.com/semiconductor/minisite/ssd/product/consumer/850evo.html
120GB and 250GB warranty only covers 75 TBW (terabytes written).
My drive has  # smartctl -a /dev/sda | grep LBA
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       136025596053

which multiplied by 512 bytes gives 69.6 TB. Close to 75TB? Well…
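
(The arithmetic, assuming 512-byte units for Total_LBAs_Written as above:)

  echo '136025596053 * 512' | bc                   # 69645105179136 bytes
  echo 'scale=1; 136025596053 * 512 / 10^12' | bc  # ~69.6 TB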

[35354.697513] sd 0:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[35354.697516] sd 0:0:0:0: [sda] tag#19 Sense Key : Medium Error [current]
[35354.697518] sd 0:0:0:0: [sda] tag#19 Add. Sense: Unrecovered read error - auto reallocate failed
[35354.697522] sd 0:0:0:0: [sda] tag#19 CDB: Read(10) 28 00 0c 30 82 9f 00 00 48 00
[35354.697524] blk_update_request: I/O error, dev sda, sector 204505785

The above started appearing recently.  So I was really surprised that:
- this drive is only rated for 120 TBW
- I went through this limit in only 2 years

  The workload is a lightly utilised home server / media center.

-- 
Tomasz TorczOnly gods can safely risk perfection,
xmpp: zdzich...@chrome.pl it's a dangerous thing for a man.  -- Alia



Re: Btrfs/SSD

2017-05-15 Thread Kai Krakow
On Mon, 15 May 2017 08:03:48 -0400, "Austin S. Hemmelgarn" wrote:

> > That's why I don't trust any of my data to them. But I still want
> > the benefit of their speed. So I use SSDs mostly as frontend caches
> > to HDDs. This gives me big storage with fast access. Indeed, I'm
> > using bcache successfully for this. A warm cache is almost as fast
> > as native SSD (at least it feels almost that fast, it will be
> > slower if you threw benchmarks at it).  
> That's to be expected though, most benchmarks don't replicate actual 
> usage patterns for client systems, and using SSD's for caching with 
> bcache or dm-cache for most server workloads except a file server
> will usually get you a performance hit.

You mean "performance boost"? Almost every read-most server workload
should benefit... I file server may be the exact opposite...

Also, I think dm-cache and bcache work very differently and are not
directly comparable. Their benefit depends much on the applied workload.

If I remember right, dm-cache is more about keeping "hot data" in the
flash storage while bcache is more about reducing seeking. So dm-cache
optimizes for bigger throughput of SSDs while bcache optimizes for
almost-zero seek overhead of SSDs. Depending on your underlying
storage, one or the other may even give zero benefit or worsen
performance. Which is what I'd call a "performance hit"... I didn't
ever try dm-cache, tho. For reasons I don't remember exactly, I didn't
like something about how it's implemented, I think it was related to
crash recovery. I don't know if that still holds true with modern
kernels. It may have changed but I never looked back to revise that
decision.


> It's worth noting also that on average, COW filesystems like BTRFS
> (or log-structured-filesystems will not benefit as much as
> traditional filesystems from SSD caching unless the caching is built
> into the filesystem itself, since they don't do in-place rewrites (so
> any new write by definition has to drop other data from the cache).

Yes, I considered that, too. And when I tried, there was almost no
perceivable performance difference between bcache-writearound and
bcache-writeback. But the latency of performance improvement was much
longer in writearound mode, so I stuck with writeback mode. Also,
writing random data is faster because bcache will defer it to the
background and do writeback in sector order. Sequential access is
passed around bcache anyway; hard disks are already good at that.

But of course, the COW nature of btrfs will lower the hit rate I can
get on writes. That's why I see no benefit in using bcache-writethrough
with btrfs.


-- 
Regards,
Kai

Replies to list-only preferred.



Re: Btrfs/SSD

2017-05-15 Thread Kai Krakow
On Mon, 15 May 2017 07:46:01 -0400, "Austin S. Hemmelgarn" wrote:

> On 2017-05-12 14:27, Kai Krakow wrote:
> > Am Tue, 18 Apr 2017 15:02:42 +0200
> > schrieb Imran Geriskovan :
> >  
> >> On 4/17/17, Austin S. Hemmelgarn  wrote:  
>  [...]  
> >>
> >> I'm trying to have a proper understanding of what "fragmentation"
> >> really means for an ssd and interrelation with wear-leveling.
> >>
> >> Before continuing lets remember:
> >> Pages cannot be erased individually, only whole blocks can be
> >> erased. The size of a NAND-flash page size can vary, and most
> >> drive have pages of size 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have
> >> blocks of 128 or 256 pages, which means that the size of a block
> >> can vary between 256 KB and 4 MB.
> >> codecapsule.com/.../coding-for-ssds-part-2-architecture-of-an-ssd-and-benchmarking/
> >>
> >> Lets continue:
> >> Since block sizes are between 256k-4MB, data smaller than this will
> >> "probably" will not be fragmented in a reasonably empty and trimmed
> >> drive. And for a brand new ssd we may speak of contiguous series
> >> of blocks.
> >>
> >> However, as drive is used more and more and as wear leveling
> >> kicking in (ie. blocks are remapped) the meaning of "contiguous
> >> blocks" will erode. So any file bigger than a block size will be
> >> written to blocks physically apart no matter what their block
> >> addresses says. But my guess is that accessing device blocks
> >> -contiguous or not- are constant time operations. So it would not
> >> contribute performance issues. Right? Comments?
> >>
> >> So your the feeling about fragmentation/performance is probably
> >> related with if the file is spread into less or more blocks. If #
> >> of blocks used is higher than necessary (ie. no empty blocks can be
> >> found. Instead lots of partially empty blocks have to be used
> >> increasing the total # of blocks involved) then we will notice
> >> performance loss.
> >>
> >> Additionally if the filesystem will gonna try something to reduce
> >> the fragmentation for the blocks, it should precisely know where
> >> those blocks are located. Then how about ssd block informations?
> >> Are they available and do filesystems use it?
> >>
> >> Anyway if you can provide some more details about your experiences
> >> on this we can probably have better view on the issue.  
> >
> > What you really want for SSD is not defragmented files but
> > defragmented free space. That increases life time.
> >
> > So, defragmentation on SSD makes sense if it cares more about free
> > space but not file data itself.
> >
> > But of course, over time, fragmentation of file data (be it meta
> > data or content data) may introduce overhead - and in btrfs it
> > probably really makes a difference if I scan through some of the
> > past posts.
> >
> > I don't think it is important for the file system to know where the
> > SSD FTL located a data block. It's just important to keep
> > everything nicely aligned with erase block sizes, reduce rewrite
> > patterns, and free up complete erase blocks as good as possible.
> >
> > Maybe such a process should be called "compaction" and not
> > "defragmentation". In the end, the more continuous blocks of free
> > space there are, the better the chance for proper wear leveling.  
> 
> There is one other thing to consider though.  From a practical 
> perspective, performance on an SSD is a function of the number of 
> requests and what else is happening in the background.  The second 
> aspect isn't easy to eliminate on most systems, but the first is
> pretty easy to mitigate by defragmenting data.
> 
> Reiterating the example I made elsewhere in the thread:
> Assume you have an SSD and storage controller that can use DMA to 
> transfer up to 16MB of data off of the disk in a single operation.
> If you need to load a 16MB file off of this disk and it's properly
> aligned (it usually will be with most modern filesystems if the
> partition is properly aligned) and defragmented, it will take exactly
> one operation (assuming that doesn't get interrupted).  By contrast,
> if you have 16 fragments of 1MB each, that will take at minimum 2
> operations, and more likely 15-16 (depends on where everything is
> on-disk, and how smart the driver is about minimizing the number of
> required operations).  Each request has some amount of overhead to
> set up and complete, so the first case (one single extent) will take
> less total time to transfer the data than the second one.
> 
> This particular effect actually impacts almost any data transfer, not 
> just pulling data off of an SSD (this is why jumbo frames are
> important for high-performance networking, and why a higher latency
> timer on the PCI bus will improve performance (but conversely
> increase latency)), even when fetching data from a traditional hard
> drive (but it's not very noticeable there unless your fragments are
> tightly grouped, 

Re: Btrfs/SSD

2017-05-15 Thread Kai Krakow
On Mon, 15 May 2017 14:09:20 +0100, Tomasz Kusmierz wrote:

> > Traditional hard drives usually do this too these days (they've
> > been under-provisioned since before SSD's existed), which is part
> > of why older disks tend to be noisier and slower (the reserved
> > space is usually at the far inside or outside of the platter, so
> > using sectors from there to replace stuff leads to long seeks).  
> 
> Not true. When HDD uses 10% (10% is just for easy example) of space
> as spare than aligment on disk is (US - used sector, SS - spare
> sector, BS - bad sector)
> 
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
> 
> if failure occurs - drive actually shifts sectors up:
> 
> US US US US US US US US US SS
> US US US BS BS BS US US US US
> US US US US US US US US US US
> US US US US US US US US US US
> US US US US US US US US US SS
> US US US BS US US US US US US
> US US US US US US US US US SS
> US US US US US US US US US SS

This makes sense... "Reserve area" somehow implies it is contiguous and
as such located at one far end of the platter. But your image totally
makes sense.


> that strategy is in place to actually mitigate the problem that
> you’ve described, actually it was in place since drives were using
> PATA :) so if your drive get’s nosier over time it’s either a broken
> bearing or demagnetised arm magnet causing it to not aim propperly -
> so drive have to readjust position multiple times before hitting a
> right track

I can confirm that such drives usually do not get noisier unless there's
something broken other than just a few sectors. And a faulty bearing in
notebook drives is the most common scenario I see. I always recommend
replacing such drives early because they will usually fail completely.
Such notebooks are good candidates for SSD replacements btw. ;-)

The demagnetised arm magnet is an interesting error scenario - didn't
think of it. Thanks for the pointer.

But still, there's one noise you can easily identify as bad sectors:
When the drive starts clicking for 30 or more seconds while trying to
read data, and usually also freezes the OS during that time. Such
drives can be "repaired" by rewriting the offending sectors (because it
will be moved to reserve area then). But I guess it's best to already
replace such a drive by that time.

Early, back in PATA times, I often had harddisks exposing seemingly bad
sectors when power was cut while the drive was writing data. I usually
used dd to rewrite such sectors and the drive was good as new again -
except I lost some file data maybe. Luckily, modern drives don't show
such behavior. And also SSDs learned to handle this...


-- 
Regards,
Kai

Replies to list-only preferred.




Re: Btrfs/SSD

2017-05-15 Thread Tomasz Kusmierz

> Traditional hard drives usually do this too these days (they've been 
> under-provisioned since before SSD's existed), which is part of why older 
> disks tend to be noisier and slower (the reserved space is usually at the far 
> inside or outside of the platter, so using sectors from there to replace 
> stuff leads to long seeks).

Not true. When an HDD uses 10% (10% is just an easy example) of space as spare, 
then the alignment on disk is (US - used sector, SS - spare sector, BS - bad sector):

US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS

if failure occurs - drive actually shifts sectors up:

US US US US US US US US US SS
US US US BS BS BS US US US US
US US US US US US US US US US
US US US US US US US US US US
US US US US US US US US US SS
US US US BS US US US US US US
US US US US US US US US US SS
US US US US US US US US US SS

that strategy is in place to actually mitigate the problem that you’ve 
described; actually it has been in place since drives were using PATA :) so if 
your drive gets noisier over time it’s either a broken bearing or a demagnetised 
arm magnet causing it to not aim properly - so the drive has to readjust position 
multiple times before hitting the right track


Re: Btrfs/SSD

2017-05-15 Thread Austin S. Hemmelgarn

On 2017-05-12 14:36, Kai Krakow wrote:

On Fri, 12 May 2017 15:02:20 +0200, Imran Geriskovan wrote:


On 5/12/17, Duncan <1i5t5.dun...@cox.net> wrote:

FWIW, I'm in the market for SSDs ATM, and remembered this from a
couple weeks ago so went back to find it.  Thanks. =:^)

(I'm currently still on quarter-TB generation ssds, plus spinning
rust for the larger media partition and backups, and want to be rid
of the spinning rust, so am looking at half-TB to TB, which seems
to be the pricing sweet spot these days anyway.)


Since you are taking ssds to mainstream based on your experience,
I guess your perception of data retension/reliability is better than
that of spinning rust. Right? Can you eloborate?

Or an other criteria might be physical constraints of spinning rust
on notebooks which dictates that you should handle the device
with care when running.

What was your primary motivation other than performance?


Personally, I don't really trust SSDs so much. They are much more
robust when it comes to physical damage because there are no physical
parts. That's absolutely not my concern. Regarding this, I trust SSDs
better than HDDs.

My concern is with fail scenarios of some SSDs which die unexpected and
horribly. I found some reports of older Samsung SSDs which failed
suddenly and unexpected, and in a way that the drive completely died:
No more data access, everything gone. HDDs start with bad sectors and
there's a good chance I can recover most of the data except a few
sectors.
Older is the key here.  Some early SSD's did indeed behave like that, 
but most modern ones do generally show signs that they will fail in the 
near future.  There's also the fact that traditional hard drives _do_ 
fail like that sometimes, even without rough treatment.


When SSD blocks die, they are probably huge compared to a sector (256kB
to 4MB usually because that's erase block sizes). If this happens, the
firmware may decide to either allow read-only access or completely deny
access. There's another situation where dying storage chips may
completely mess up the firmware and there's no longer any access to
data.
I've yet to see an SSD that blocks user access to an erase block. 
Almost every one I've seen will instead rewrite the block (possibly with 
the corrupted data intact (that is, without mangling it further)) to one 
of the reserve blocks, and then just update its internal mapping so 
that the old block doesn't get used and the new one points to the 
right place.  Some of the really good SSD's even use erasure coding in 
the FTL for data verification instead of CRC's, so they can actually 
reconstruct the missing bits when they do this.


Traditional hard drives usually do this too these days (they've been 
under-provisioned since before SSD's existed), which is part of why 
older disks tend to be noisier and slower (the reserved space is usually 
at the far inside or outside of the platter, so using sectors from there 
to replace stuff leads to long seeks).


That's why I don't trust any of my data to them. But I still want the
benefit of their speed. So I use SSDs mostly as frontend caches to
HDDs. This gives me big storage with fast access. Indeed, I'm using
bcache successfully for this. A warm cache is almost as fast as native
SSD (at least it feels almost that fast, it will be slower if you threw
benchmarks at it).
That's to be expected though, most benchmarks don't replicate actual 
usage patterns for client systems, and using SSD's for caching with 
bcache or dm-cache for most server workloads except a file server will 
usually get you a performance hit.


It's worth noting also that on average, COW filesystems like BTRFS (or 
log-structured filesystems) will not benefit as much as traditional 
filesystems from SSD caching unless the caching is built into the 
filesystem itself, since they don't do in-place rewrites (so any new 
write by definition has to drop other data from the cache).



Re: Btrfs/SSD

2017-05-15 Thread Austin S. Hemmelgarn

On 2017-05-12 14:27, Kai Krakow wrote:

On Tue, 18 Apr 2017 15:02:42 +0200, Imran Geriskovan wrote:


On 4/17/17, Austin S. Hemmelgarn  wrote:

Regarding BTRFS specifically:
* Given my recently newfound understanding of what the 'ssd' mount
option actually does, I'm inclined to recommend that people who are
using high-end SSD's _NOT_ use it as it will heavily increase
fragmentation and will likely have near zero impact on actual device
lifetime (but may _hurt_ performance).  It will still probably help
with mid and low-end SSD's.


I'm trying to have a proper understanding of what "fragmentation"
really means for an ssd and interrelation with wear-leveling.

Before continuing lets remember:
Pages cannot be erased individually, only whole blocks can be erased.
The size of a NAND-flash page size can vary, and most drive have pages
of size 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have blocks of 128 or 256
pages, which means that the size of a block can vary between 256 KB
and 4 MB.
codecapsule.com/.../coding-for-ssds-part-2-architecture-of-an-ssd-and-benchmarking/

Lets continue:
Since block sizes are between 256k-4MB, data smaller than this will
"probably" will not be fragmented in a reasonably empty and trimmed
drive. And for a brand new ssd we may speak of contiguous series
of blocks.

However, as drive is used more and more and as wear leveling kicking
in (ie. blocks are remapped) the meaning of "contiguous blocks" will
erode. So any file bigger than a block size will be written to blocks
physically apart no matter what their block addresses says. But my
guess is that accessing device blocks -contiguous or not- are
constant time operations. So it would not contribute performance
issues. Right? Comments?

So your feeling about fragmentation/performance is probably
related to whether the file is spread across fewer or more blocks. If the #
of blocks used is higher than necessary (i.e. no empty blocks can be
found; instead lots of partially empty blocks have to be used,
increasing the total # of blocks involved) then we will notice
performance loss.

Additionally, if the filesystem is going to try to reduce
the fragmentation of the blocks, it should know precisely where
those blocks are located. Then how about SSD block information?
Is it available, and do filesystems use it?

Anyway if you can provide some more details about your experiences
on this we can probably have better view on the issue.


What you really want for SSD is not defragmented files but defragmented
free space. That increases life time.

So, defragmentation on SSD makes sense if it cares more about free
space but not file data itself.

But of course, over time, fragmentation of file data (be it meta data
or content data) may introduce overhead - and in btrfs it probably
really makes a difference if I scan through some of the past posts.

I don't think it is important for the file system to know where the SSD
FTL located a data block. It's just important to keep everything nicely
aligned with erase block sizes, reduce rewrite patterns, and free up
complete erase blocks as good as possible.

Maybe such a process should be called "compaction" and not
"defragmentation". In the end, the more continuous blocks of free space
there are, the better the chance for proper wear leveling.


There is one other thing to consider though.  From a practical 
perspective, performance on an SSD is a function of the number of 
requests and what else is happening in the background.  The second 
aspect isn't easy to eliminate on most systems, but the first is pretty 
easy to mitigate by defragmenting data.


Reiterating the example I made elsewhere in the thread:
Assume you have an SSD and storage controller that can use DMA to 
transfer up to 16MB of data off of the disk in a single operation.  If 
you need to load a 16MB file off of this disk and it's properly aligned 
(it usually will be with most modern filesystems if the partition is 
properly aligned) and defragmented, it will take exactly one operation 
(assuming that doesn't get interrupted).  By contrast, if you have 16 
fragments of 1MB each, that will take at minimum 2 operations, and more 
likely 15-16 (depends on where everything is on-disk, and how smart the 
driver is about minimizing the number of required operations).  Each 
request has some amount of overhead to set up and complete, so the first 
case (one single extent) will take less total time to transfer the data 
than the second one.
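
(As an aside, the fragment count of an actual file can be checked with 
filefrag from e2fsprogs, and partition alignment with parted; the paths 
and device names below are just placeholders:)

  filefrag /path/to/file                   # prints "N extents found"; 1 extent is the ideal case above
  filefrag -v /path/to/file                # per-extent offsets and lengths
  parted /dev/sda align-check optimal 1    # confirm partition 1 is optimally aligned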


This particular effect actually impacts almost any data transfer, not 
just pulling data off of an SSD (this is why jumbo frames are important 
for high-performance networking, and why a higher latency timer on the 
PCI bus will improve performance (but conversely increase latency)), 
even when fetching data from a traditional hard drive (but it's not very 
noticeable there unless your fragments are tightly grouped, because seek 
latency dominates 

Re: Btrfs/SSD

2017-05-15 Thread Imran Geriskovan
On 5/15/17, Tomasz Kusmierz  wrote:
> Theoretically all sectors in over provision are erased - practically they
> are either erased or waiting to be erased or broken.
> Over provisioned area does have more uses than that. For example if you have
> a 1TB drive where you store 500GB of data that you never modify -> SSD will
> copy part of that data to over provisioned area -> free sectors that were
> unwritten for a while -> free sectors that were continuously hammered by
> writes and write a static data there. This mechanism is wear levelling - it
> means that SSD internals make sure that sectors on SSD have an equal use
> over time. Despite of some thinking that it’s pointless imagine situation
> where you’ve got a 1TB drive with 1GB free and you keep writing and
> modifying data in this 1GB free … those sectors will quickly die due to
> short flash life expectancy ( some as short as 1k erases ! ).

Thanks for the info. It can be understood that the drive
has a pool of erase blocks, of which some portion (say 90-95%)
is provided as usable. Trimmed blocks are candidates
for new allocations. If the drive is not trimmed, that allocatable
pool becomes smaller than it could be and new allocations
under the wear-levelling logic are done from a smaller group.
This will probably increase data traffic on that "small group"
of blocks, eating into their erase cycles.

However, this logic is valid if the drive does NOT move
data on trimmed blocks to trimmed/available ones.

Under some advanced wear leveling operations, the drive may
decide to swap two blocks (one occupied, one vacant) if the
cumulative erase cycles of the former are much lower than
those of the latter, to provide some balancing effect.

Theoretically, swapping may even occur when the flash tends
to lose charge (and thus data), based on the age of the
data and/or block health.

But in any case I understand that trimming will provide an
important degree of freedom and health to the drive.
Without trimming, the drive will continue to deal with worthless
blocks simply because it doesn't know they are worthless...
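
(A hedged aside: whether discards are passed down through the storage
stack at all can be checked from userspace; zero values mean that layer
does not forward them:)

  lsblk --discard /dev/sda    # DISC-GRAN / DISC-MAX of 0B means no discard support at that layer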


Re: Btrfs/SSD

2017-05-14 Thread Tomasz Kusmierz
Theoretically all sectors in over provision are erased - practically they are 
either erased or waiting to be erased or broken.

What you have to understand is that sectors on an SSD are not where you really 
think they are - they can swap places with sectors in the over-provisioning area, 
they can swap places with each other etc. … the stuff you see as a disk from 0 to MAX 
does not have to be arranged in sequence on the SSD (and mostly never is).

If you never trim - when your device is 100% full - you need to start overwriting 
data to keep writing - this is where over-provisioning shines: the SSD fakes that 
you write to a sector while really you write to a sector in the over-provisioning 
area, and those magically swap places without you knowing -> the sector that was 
occupied ends up in the over-provisioning pool and the SSD hardware performs a slow 
erase on it to make it free for the future. This mechanism is simple and 
transparent for users -> you don’t know that it happens and the SSD does all the 
heavy lifting. 

The over-provisioned area has more uses than that. For example, if you have a 
1TB drive where you store 500GB of data that you never modify -> the SSD will copy 
part of that data to the over-provisioned area -> free the sectors that were unwritten 
for a while -> free the sectors that were continuously hammered by writes and write 
the static data there. This mechanism is wear levelling - it means that the SSD 
internals make sure that sectors on the SSD have equal use over time. Despite 
some thinking that it’s pointless, imagine a situation where you’ve got a 1TB 
drive with 1GB free and you keep writing and modifying data in this 1GB of free space … 
those sectors will quickly die due to short flash life expectancy (some as 
short as 1k erases!).

So again, buy good quality drives (not hardcore enterprise drives, just 
good consumer ones) and leave stuff to the drive + use an OS that gives you trim and 
you should be golden 

> On 15 May 2017, at 00:01, Imran Geriskovan  wrote:
> 
> On 5/14/17, Tomasz Kusmierz  wrote:
>> In terms of over provisioning of SSD it’s a give and take relationship … on
>> good drive there is enough over provisioning to allow a normal operation on
>> systems without TRIM … now if you would use a 1TB drive daily without TRIM
>> and have only 30GB stored on it you will have fantastic performance but if
>> you will want to store 500GB at roughly 200GB you will hit a brick wall and
>> you writes will slow dow to megabytes / s … this is symptom of drive running
>> out of over provisioning space …
> 
> What exactly happens on a non-trimmed drive?
> Does it begin to forge certain erase-blocks? If so
> which are those? What happens when you never
> trim and continue dumping data on it?





Re: Btrfs/SSD

2017-05-14 Thread Imran Geriskovan
On 5/14/17, Tomasz Kusmierz  wrote:
> In terms of over provisioning of SSD it’s a give and take relationship … on
> good drive there is enough over provisioning to allow a normal operation on
> systems without TRIM … now if you would use a 1TB drive daily without TRIM
> and have only 30GB stored on it you will have fantastic performance but if
> you will want to store 500GB at roughly 200GB you will hit a brick wall and
> you writes will slow dow to megabytes / s … this is symptom of drive running
> out of over provisioning space …

What exactly happens on a non-trimmed drive?
Does it begin to forge certain erase-blocks? If so
which are those? What happens when you never
trim and continue dumping data on it?


Re: Btrfs/SSD (my -o ssd "summary")

2017-05-14 Thread Hans van Kranenburg
On 05/14/2017 08:01 PM, Tomasz Kusmierz wrote:
> All stuff that Chris wrote holds true, I just wanted to add flash
> specific information (from my experience of writing low level code
> for operating flash)

Thanks!

> [... erase ...]

> In terms of over provisioning of SSD it’s a give and take
> relationship … on good drive there is enough over provisioning to
> allow a normal operation on systems without TRIM … now if you would
> use a 1TB drive daily without TRIM and have only 30GB stored on it
> you will have fantastic performance but if you will want to store
> 500GB at roughly 200GB you will hit a brick wall and you writes will
> slow dow to megabytes / s … this is symptom of drive running out of
> over provisioning space … if you would run OS that issues trim, this
> problem would not exist since drive would know that whole 970GB of
> space is free and it would be pre-emptively erased days before.

== ssd_spread ==

The worst case behaviour is the btrfs ssd_spread mount option in
combination with not having discard enabled. It has a side effect of
minimizing the reuse of free space previously written in.

== ssd ==

[And, since I didn't write a "summary post" about this issue yet, here
is my version of it:]

The default mount options you get for an ssd ('ssd' mode enabled,
'discard' not enabled), in combination with writing and deleting many
files that are not too big also causes this pattern, ending up with the
physical address space fully allocated and written to.

My favourite videos about this: *)

ssd (write pattern is small increments in /var/log/mail.log, a mail
spool on /var/spool/postfix (lots of file adds and deletes), and mailman
archives with a lot of little files):

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-19-noautodefrag-ichiban.mp4

*) The picture uses Hilbert Curve ordering (see link below) and shows
the four last created DATA block groups appended together. (so a new
chunk allocation pushes the others back in the picture).
https://github.com/knorrie/btrfs-heatmap/blob/master/doc/curves.md

 * What the ssd mode does, is simply setting a lower boundary to the
size of free space fragments that are reused.
 * In combination with always trying to walk forward inside a block
group, not looking back at freed up space, it fills up with a shotgun
blast pattern when you do writes and deletes all the time.
 * When a write comes in that is bigger than any free space part left
behind, a new chunk gets allocated, and the bad pattern continues in there.
 * Because it keeps allocating more and more new chunks, and keeps
circling around in the latest one, until a big write is done, it leaves
mostly empty ones behind.
 * Without 'discard', the SSD will never learn that all the free space
left behind is actually free.
 * Eventually all raw disk space is allocated, and users run into
problems with ENOSPC and balance etc.

So, enabling this ssd mode actually means it starts choking itself to
death here.

When users see this effect, they start scheduling balance operations, to
compact free space to bring the amount of allocated but unused space
down a bit.
 * But, doing that is causing just more and more writes to the ssd.
 * Also, since balance takes a "usage" argument and not a "how badly
fragmented" argument, it's causing lots of unnecessary rewriting of data
(a sketch of such a balance command follows below this list).
 * And, with a decent amount (like a few thousand) subvolumes, all
having a few snapshots of their own, the ratio data:metadata written
during balance is skyrocketing, causing not only the data to be
rewritten, but also causing pushing out lots of metadata to the ssd.
(example: on my backup server rewriting 1GiB of data causes writing of
>40GiB of metadata, where probably 99.99% of those writes are some kind
of intermediary writes which are immediately invalidated during the next
btrfs transaction that is done).

All in all, this reminds me of the series "Breaking Bad", where every
step taken to try to fix things only made things worse. The same thing
is happening at every bullet point above.

== nossd ==

nossd mode (even still without discard) allows a pattern of overwriting
much more previously used space, causing many more implicit discards to
happen because of the overwrite information the ssd gets.

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4

> And last part - hard drive is not aware of filesystem and partitions
> … so you could have 400GB on this 1TB drive left unpartitioned and
> still you would be cooked. Technically speaking using as much as
> possible space on a SSD to a FS and OS that supports trim will give
> you best performance because drive will be notified of as much as
> possible disk space that is actually free …..
> 
> So, to summarize:

> - don’t try to outsmart built in mechanics of SSD (people that
> suggest that are just morons that want to have 5 minutes of fame

Re: Btrfs/SSD

2017-05-14 Thread Tomasz Kusmierz
All stuff that Chris wrote holds true, I just wanted to add flash specific 
information (from my experience of writing low level code for operating flash)

So with flash, to erase you have to erase a large allocation block; it usually
used to be 128kB (plus some CRC data and such it's a bit more than 128kB, but we
are talking functional data storage space), and on newer setups it can be megabytes
… device dependent really.
To erase a block you need to drive the whole 128k x 8 bits with a voltage higher
than is usually used for IO (it can be as much as 15V), so it requires an external
supply or a built-in charge pump to feed that voltage to the block erasure
circuitry. This process generates a lot of heat and requires a lot of energy,
so the consensus back in the day was that you could erase one block at a time and
this could take up to 200ms (0.2 seconds). After an erase you need to check
whether all bits are set to 1 (charged state) and then the sector is marked as
ready for storage.
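
Just to put a rough number on it (a back-of-the-envelope sketch using the
example figures above and the 1TB drive from the next paragraph, nothing more):

ERASE_BLOCK = 128 * 1024      # bytes per erase block, as in the example above
ERASE_TIME = 0.2              # worst-case seconds per block erase
DRIVE = 10**12                # a 1TB drive, as in the over-provisioning example

blocks = DRIVE // ERASE_BLOCK
hours = blocks * ERASE_TIME / 3600
print(f"{blocks} erase blocks, ~{hours:.0f} hours to erase them all one by one")
# roughly 7.6 million blocks and ~424 hours, i.e. over two weeks of nothing but
# erasing -- which is why pre-emptive background erasing and parallel erase
# groups matter so much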

Of course, flash memories are moving forward, and in more demanding environments
there are solutions where blocks are grouped into groups which have separate
eraser circuits, allowing erasure to be performed in parallel in multiple parts
of the flash module; still, you are bound to one erase at a time per group.

Another problem is that the erasure procedure locally increases temperature.
On flat flashes it's not that much of a problem, but on emerging solutions like
3D flash we might locally experience undesired temperature increases that
would either degrade the life span of the flash or simply erase neighbouring blocks.

In terms of over-provisioning of SSDs it’s a give-and-take relationship … on a
good drive there is enough over-provisioning to allow normal operation on
systems without TRIM … now if you used a 1TB drive daily without TRIM and
had only 30GB stored on it you would have fantastic performance, but if you
wanted to store 500GB, at roughly 200GB you would hit a brick wall and your
writes would slow down to megabytes/s … this is a symptom of the drive running
out of over-provisioning space … if you ran an OS that issues trim, this problem
would not exist since the drive would know that the whole 970GB of space is free
and it would have been pre-emptively erased days before.

And the last part - the drive is not aware of filesystems and partitions … so you
could have 400GB of this 1TB drive left unpartitioned and still you would be
cooked. Technically speaking, giving as much as possible of the space on an SSD
to a FS and OS that support trim will give you the best performance, because the
drive will be notified of as much as possible of the disk space that is actually free …..

So, to summarize:
- don’t try to outsmart the built-in mechanics of an SSD (people who suggest that are
just morons who want their 5 minutes of fame).
- don’t buy a crap SSD and expect it to behave like a good one as long as you stay
below a certain % of it … it’s stupid; buy a more reasonable but smaller SSD and store
slow data on spinning rust.
- read more books and Wikipedia; not jumping down on you, but the internet is filled
with people that provide false information, sometimes unknowingly, and swear by
it ( Dunning–Kruger effect :D ), and some of them are very good at making all their
theories sexy and stuff … you simply have to get used to it…
- if something is too good to be true, then it’s not
- promises of future performance gains are the domain of the “sleazy salesman”



> On 14 May 2017, at 17:21, Chris Murphy  wrote:
> 
> On Sat, May 13, 2017 at 3:39 AM, Duncan <1i5t5.dun...@cox.net> wrote:
> 
>> When I was doing my ssd research the first time around, the going
>> recommendation was to keep 20-33% of the total space on the ssd entirely
>> unallocated, allowing it to use that space as an FTL erase-block
>> management pool.
> 
> Any brand name SSD has its own reserve above its specified size to
> ensure that there's decent performance, even when there is no trim
> hinting supplied by the OS; and thereby the SSD can only depend on LBA
> "overwrites" to know what blocks are to be freed up.
> 
> 
>> Anyway, that 20-33% left entirely unallocated/unpartitioned
>> recommendation still holds, right?
> 
> Not that I'm aware of. I've never done this by literally walling off
> space that I won't use. A fairly large percentage of my partitions
> have free space so it does effectively happen as far as the SSD is
> concerned. And I use the fstrim timer. Most of the file systems support
> trim.
> 
> Anyway I've stuffed a Samsung 840 EVO to 98% full with an OS/file
> system that would not issue trim commands on this drive, and it was
> doing full performance writes through that point. Then deleted maybe
> 5% of the files, and then refill the drive to 98% again, and it was
> the same performance.  So it must have had enough in reserve to permit
> full performance "overwrites" which were in effect directed to reserve
> blocks as the freed up blocks were being erased. Thus the erasure
> happening on the fly 

Re: Btrfs/SSD

2017-05-14 Thread Chris Murphy
On Sat, May 13, 2017 at 3:39 AM, Duncan <1i5t5.dun...@cox.net> wrote:

> When I was doing my ssd research the first time around, the going
> recommendation was to keep 20-33% of the total space on the ssd entirely
> unallocated, allowing it to use that space as an FTL erase-block
> management pool.

Any brand name SSD has its own reserve above its specified size to
ensure that there's decent performance, even when there is no trim
hinting supplied by the OS; and thereby the SSD can only depend on LBA
"overwrites" to know what blocks are to be freed up.


> Anyway, that 20-33% left entirely unallocated/unpartitioned
> recommendation still holds, right?

Not that I'm aware of. I've never done this by literally walling off
space that I won't use. A fairly large percentage of my partitions
have free space so it does effectively happen as far as the SSD is
concerned. And I use the fstrim timer. Most of the file systems support
trim.

Anyway I've stuffed a Samsung 840 EVO to 98% full with an OS/file
system that would not issue trim commands on this drive, and it was
doing full performance writes through that point. Then I deleted maybe
5% of the files, refilled the drive to 98% again, and it was
the same performance.  So it must have had enough in reserve to permit
full performance "overwrites" which were in effect directed to reserve
blocks as the freed up blocks were being erased. Thus the erasure
happening on the fly was not inhibiting performance on this SSD. Now
had I gone to 99.9% full, then deleted say 1GiB, and then started
doing a bunch of heavy small file writes rather than sequential? I
don't know what would have happened; it might have choked, because it is
a lot more work for the SSD to deal with heavy IOPS and erasure.

It will invariably be something that's very model and even firmware
version specific.



>  Am I correct in asserting that if one
> is following that, the FTL already has plenty of erase-blocks available
> for management and the discussion about filesystem level trim and free
> space management becomes much less urgent, tho of course it's still worth
> considering if it's convenient to do so?

Most file systems don't direct writes to new areas; they're fairly
prone to overwriting. So the firmware is going to get notified fairly
quickly, with either trim or an overwrite, of which LBAs are stale. It's
probably more important with Btrfs, which has more variable behavior:
it can continue to direct new writes to recently allocated chunks
before it'll do overwrites in older chunks that have free space.


> And am I also correct in believing that while it's not really worth
> spending more to over-provision to the near 50% as I ended up doing, if
> things work out that way as they did with me because the difference in
> price between 30% overprovisioning and 50% overprovisioning ends up being
> trivial, there's really not much need to worry about active filesystem
> trim at all, because the FTL has effectively half the device left to play
> erase-block musical chairs with as it decides it needs to?


I don't think it's ever worth overprovisioning by default. Use all of
that space until you have a problem. If you have a 256G drive, you
paid to get the spec performance for 100% of those 256G. You did not
pay that company to second-guess things and cut it slack by
overprovisioning from the outset.

I don't know how long it takes for erasure to happen though, so I have
no idea how much overprovisioning is really needed at the write rate
of the drive, so that it can erase at the same rate as writes, in
order to avoid a slow down.

I guess an even worse test would be one that intentionally fragments
across erase block boundaries, forcing the firmware to be unable to do
erasures without first migrating partially full blocks in order to
make them empty, so they can then be erased, and now be used for new
writes. That sort of shuffling is what will separate the good from
average drives, and why the drives have multicore CPUs on them, as
well as most now having on the fly always on encryption.

Even completely empty, some of these drives have a short term higher
speed write which falls back to a lower speed as the fast flash gets
full. After some pause that fast write capability is restored for
future writes. I have no idea if this is a separate kind of flash on the
drive, or if it's just a difference in encoding data onto the flash
that's faster. Samsung has a drive that can "simulate" SLC NAND on 3D
VNAND. That sounds like an encoding method; it's fast but inefficient
and probably needs reencoding.

But that's the thing, the firmware is really complicated now.

I kinda wonder if f2fs could be chopped down to become a modular
allocator for the existing file systems; activate that allocation
method with "ssd" mount option rather than whatever overly smart thing
it does today that's based on assumptions that are now likely
outdated.

-- 
Chris Murphy

Re: Btrfs/SSD

2017-05-14 Thread Duncan
Imran Geriskovan posted on Fri, 12 May 2017 15:02:20 +0200 as excerpted:

> On 5/12/17, Duncan <1i5t5.dun...@cox.net> wrote:
>> FWIW, I'm in the market for SSDs ATM, and remembered this from a couple
>> weeks ago so went back to find it.  Thanks. =:^)
>>
>> (I'm currently still on quarter-TB generation ssds, plus spinning rust
>> for the larger media partition and backups, and want to be rid of the
>> spinning rust, so am looking at half-TB to TB, which seems to be the
>> pricing sweet spot these days anyway.)
> 
> Since you are taking ssds mainstream based on your experience,
> I guess your perception of data retention/reliability is better than
> that of spinning rust. Right? Can you elaborate?
> 
> Or an other criteria might be physical constraints of spinning rust on
> notebooks which dictates that you should handle the device with care
> when running.
> 
> What was your primary motivation other than performance?

Well, the /immediate/ motivation is that the spinning rust is starting to 
hint that it's time to start thinking about rotating it out of service...

It's my main workstation so wall powered, but because it's the media and 
secondary backups partitions, I don't have anything from it mounted most 
of the time and because it /is/ spinning rust, I allow it to spin down.  
It spins right back up if I mount it, and reads seem to be fine, but if I 
let it sit a bit after mount, possibly due to it spinning down again, 
sometimes I get write errors, SATA resets, etc.  Sometimes the write will 
then eventually appear to go thru, sometimes not, but once this happens, 
unmounting often times out, and upon a remount (which may or may not work 
until a clean reboot), the last writes may or may not still be there.

And the smart info, while not bad, does indicate it's starting to age, 
tho not extremely so.

Now even a year ago I'd have likely played with it, adjusting timeouts, 
spindowns, etc, attempting to get it working normally again.

But they say that ssd performance spoils you and you don't want to go 
back, and while it's a media drive and performance isn't normally an 
issue, those secondary backups to it as spinning rust sure take a lot 
longer than the primary backups to other partitions on the same pair of 
ssds that the working copies (of everything but media) are on.

Which means I don't like to do them... which means sometimes I put them 
off longer than I should.  Basically, it's another application of my 
"don't make it so big it takes so long to maintain you don't do it as you 
should" rule, only here, it's not the size but rather because I've been 
spoiled by the performance of the ssds.


So couple the aging spinning rust with the fact that I've really wanted 
to put media and the backups on ssd all along, only it couldn't be cost-
justified a few years ago when I bought the original ssds, and I now have 
my excuse to get the now cheaper ssds I really wanted all along. =:^)


As for reliability...  For archival usage I still think spinning rust is 
more reliable, and certainly more cost effective.

However, for me at least, with some real-world ssd experience under my 
belt now, including an early slow failure (more and more blocks going 
bad, I deliberately kept running it in btrfs raid1 mode with scrubs 
handling the bad blocks for quite some time, just to get the experience 
both with ssds and with btrfs) and replacement of one of the ssds with 
one I had originally bought for a different machine (my netbook, which 
went missing shortly thereafter), I now find ssds reliable enough for 
normal usage, certainly so if the data is valuable enough to have backups 
of it anyway, and if it's not valuable enough to be worth doing backups, 
then losing it is obviously not a big deal, because it's self-evidently 
worth less than the time, trouble and resources of doing that backup.

Particularly so if the speed of ssds helpfully encourages you to keep the 
backups more current than you would otherwise. =:^)

But spinning rust remains appropriate for long-term archival usage, like 
that third-level last-resort backup I like to make, then keep on the 
shelf, or store with a friend, or in a safe deposit box, or whatever, and 
basically never use, but like to have just in case.  IOW, that almost 
certainly write once, read-never, seldom update, last resort backup.  If 
three years down the line there's a fire/flood/whatever, and all I can 
find in the ashes/mud or retrieve from that friend is that three year old 
backup, I'll be glad to still have it.

Of course those who have multi-TB scale data needs may still find 
spinning rust useful as well, because while 4-TB ssds are available now, 
they're /horribly/ expensive.  But with 3D-NAND, even that use-case looks 
like it may go ssd in the next five years or so, leaving multi-year to 
decade-plus archiving, and perhaps say 50-TB-plus, but that's going to 
take long enough to actually write or otherwise do anything with it's 
effectively 

[OT] SSD performance patterns (was: Btrfs/SSD)

2017-05-13 Thread Kai Krakow
Am Sat, 13 May 2017 09:39:39 + (UTC)
schrieb Duncan <1i5t5.dun...@cox.net>:

> Kai Krakow posted on Fri, 12 May 2017 20:27:56 +0200 as excerpted:
> 
> > In the end, the more continuous blocks of free space there are, the
> > better the chance for proper wear leveling.  
> 
> Talking about which...
> 
> When I was doing my ssd research the first time around, the going 
> recommendation was to keep 20-33% of the total space on the ssd
> entirely unallocated, allowing it to use that space as an FTL
> erase-block management pool.
> 
> At the time, I added up all my "performance matters" data dirs and 
> allowing for reasonable in-filesystem free-space, decided I could fit
> it in 64 GB if I had to, tho 80 GB would be a more comfortable fit,
> so allowing for the above entirely unpartitioned/unused slackspace 
> recommendations, had a target of 120-128 GB, with a reasonable range 
> depending on actual availability of 100-160 GB.
> 
> It turned out, due to pricing and availability, I ended up spending 
> somewhat more and getting 256 GB (238.5 GiB).  Of course that allowed
> me much more flexibility than I had expected and I ended up with
> basically everything but the media partition on the ssds, PLUS I
> still left them at only just over 50% partitioned, (using the gdisk
> figures, 51%- partitioned, 49%+ free).

I put my ESP (for UEFI) onto the SSD and also played with putting swap
onto it dedicated to hibernation. But I discarded the hibernation idea
and removed the swap because it didn't work well: it wasn't much faster
than waking from HDD, and hibernation is not that reliable anyways.
Also, hybrid hibernation is not yet integrated into KDE, so I stick to
sleep mode currently.

The rest of my SSD (also 500GB) is dedicated to bcache. This fits my
complete working set of daily work, with hit ratios going up to 90% and
beyond. My filesystem boots and feels like an SSD, the HDDs are almost
silent, and still my file system is 3TB on 3x 1TB HDDs.


> Given that, I've not enabled btrfs trim/discard (which saved me from
> the bugs with it a few kernel cycles ago), and while I do have a
> weekly fstrim systemd timer setup, I've not had to be too concerned
> about btrfs bugs (also now fixed, I believe) when fstrim on btrfs was
> known not to be trimming everything it really should have been.

This is a good recommendation as TRIM is still a slow operation because
Queued TRIM is not used for most drives due to buggy firmware. So you
not only circumvent kernel and firmware bugs, but also get better
performance that way.


> Anyway, that 20-33% left entirely unallocated/unpartitioned 
> recommendation still holds, right?  Am I correct in asserting that if
> one is following that, the FTL already has plenty of erase-blocks
> available for management and the discussion about filesystem level
> trim and free space management becomes much less urgent, tho of
> course it's still worth considering if it's convenient to do so?
> 
> And am I also correct in believing that while it's not really worth 
> spending more to over-provision to the near 50% as I ended up doing,
> if things work out that way as they did with me because the
> difference in price between 30% overprovisioning and 50%
> overprovisioning ends up being trivial, there's really not much need
> to worry about active filesystem trim at all, because the FTL has
> effectively half the device left to play erase-block musical chairs
> with as it decides it needs to?

I think things may have changed since back then. See below. But it
certainly depends on which drive manufacturer you choose, I guess.

I can at least confirm that bigger drives wear their write cycles much
slower, even when filled up. My old 128GB Crucial drive was worn out after
only 1 year (I swapped it early; I kept an eye on the SMART numbers). My
500GB Samsung drive is around 1 year old now and I write a lot more
data to it, but according to SMART it should work for at least 5 to 7
more years. By that time, I will probably have swapped it for a bigger
drive anyway.

So I guess you should maybe look at your SMART numbers and calculate
the expected life time:

Power_on_Hours(RAW) * WLC(VALUE) / (100-WLC(VALUE))
with WLC = Wear_Leveling_Count

should get you the expected remaining power-on hours. My drive is
powered on 24/7 most of the time, but if you power your drive only 8
hours per day, you can easily get three times as many days of lifetime
out of it compared to me. ;-)
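
A minimal sketch of that calculation, with made-up example numbers rather
than my actual drive's values:

def remaining_power_on_hours(power_on_hours_raw, wlc_value):
    """Expected remaining hours: Power_on_Hours(RAW) * WLC(VALUE) / (100 - WLC(VALUE))."""
    if wlc_value >= 100:
        return float("inf")           # no measurable wear yet
    return power_on_hours_raw * wlc_value / (100 - wlc_value)

# example: 8760 powered-on hours (one year, 24/7) and a WLC VALUE of 88
hours_left = remaining_power_on_hours(8760, 88)
print(f"~{hours_left:.0f} hours left, ~{hours_left / (24 * 365):.1f} years at 24/7")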

There is also Total_LBAs_Written but that, at least for me, usually
gives much higher lifetime values so I'd stick with the pessimistic
ones.

Even when WLC goes to zero, the drive should still have reserved blocks
available. My drive sets the threshold to 0 for WLC which makes me
think that it is not fatal when it hits 0 because the drive still has
reserved blocks. And for reserved blocks, the threshold is 10%.

Now combine that with your planning of getting a new drive, and you can
optimize space efficiency vs. lifetime better.


> Of course the higher per-GiB cost 

Re: Btrfs/SSD

2017-05-13 Thread Janos Toth F.
> Anyway, that 20-33% left entirely unallocated/unpartitioned
> recommendation still holds, right?

I never liked that idea. And I really disliked how people considered
it to be (and even passed it down as) some magical, absolute
stupid-proof fail-safe thing (because it's not).

1: Unless you reliably trim the whole LBA space (and/or run
ata_secure_erase on the whole drive) before you (re-)partition the LBA
space, you have zero guarantee that the drive's controller/firmware
will treat the unallocated space as empty or will keep its content
around as useful data (even if it's full of zeros, because zero could
be very useful data unless it's specifically marked as "throwaway" by
trim/erase). On the other hand, a trim-compatible filesystem should
properly mark (trim) all (or at least most of) the free space as free
(= free to erase internally at the controller's discretion). And even
if trim isn't fail-proof either, those bugs should be temporary (and
it's not like a sane SSD will die in a few weeks due to these kinds of
issues under sane usage, and crazy drives will often fail under crazy
usage regardless of trim and spare space).

2: It's not some daemon-summoning, world-ending catastrophe if you
occasionally happen to fill your SSD to ~100%. It probably won't like
it (it will probably get slow by the end of the writes and the
internal write amplification might skyrocket at its peak) but nothing
extraordinary will happen and normal operation (high write speed,
normal internal write amplification, etc) should resume soon after you
make some room (for example, you delete your temporary files or move
some old content to an archive storage and you properly trim that
space). That space is there to be used, just don't leave it close to
100% all the time and try never leaving it close to 100% when you plan
to keep it busy with many small random writes.

3: Some drives have plenty of hidden internal spare space (especially
the expensive kinds offered for datacenters or "enthusiast" consumers
by big companies like Intel and such). Even some cheap drives might
have plenty of erased space at 100% LBA allocation if they use
compression internally (and you don't fill them up to 100% with
incompressible content).


Re: Btrfs/SSD

2017-05-13 Thread Kai Krakow
Am Sat, 13 May 2017 14:52:47 +0500
schrieb Roman Mamedov :

> On Fri, 12 May 2017 20:36:44 +0200
> Kai Krakow  wrote:
> 
> > My concern is with fail scenarios of some SSDs which die unexpected
> > and horribly. I found some reports of older Samsung SSDs which
> > failed suddenly and unexpected, and in a way that the drive
> > completely died: No more data access, everything gone. HDDs start
> > with bad sectors and there's a good chance I can recover most of
> > the data except a few sectors.  
> 
> Just have your backups up-to-date, doesn't matter if it's SSD, HDD or
> any sort of RAID.
> 
> In a way it's even better, that SSDs [are said to] fail abruptly and
> entirely. You can then just restore from backups and go on. Whereas a
> failing HDD can leave you puzzled on e.g. whether it's a cable or
> controller problem instead, and possibly can even cause some data
> corruption which you won't notice until too late.

My current backup strategy can handle this. I never back up a file from
the source again if it didn't change by timestamp. That way, silent data
corruption won't creep into the backup. Additionally, I keep a backlog
of 5 years of file history. Even if a corrupted file creeps into the
backup, there is enough time to get a good copy back. If it's older, it
probably doesn't hurt so much anyway.
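
A rough sketch of the "only copy again if the timestamp changed" rule
(paths and layout made up for illustration; a real setup also has to
handle new and removed files and keep the dated history):

import os
import shutil

def backup_if_newer(src, dst):
    """Copy src to dst only when dst is missing or src's mtime is newer."""
    if not os.path.exists(dst) or os.path.getmtime(src) > os.path.getmtime(dst):
        parent = os.path.dirname(dst)
        if parent:
            os.makedirs(parent, exist_ok=True)
        shutil.copy2(src, dst)        # copy2 keeps the timestamp on the copy
        return True
    return False                      # unchanged by timestamp: never read again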


-- 
Regards,
Kai

Replies to list-only preferred.



Re: Btrfs/SSD

2017-05-13 Thread Roman Mamedov
On Fri, 12 May 2017 20:36:44 +0200
Kai Krakow  wrote:

> My concern is with fail scenarios of some SSDs which die unexpected and
> horribly. I found some reports of older Samsung SSDs which failed
> suddenly and unexpected, and in a way that the drive completely died:
> No more data access, everything gone. HDDs start with bad sectors and
> there's a good chance I can recover most of the data except a few
> sectors.

Just have your backups up-to-date, doesn't matter if it's SSD, HDD or any sort
of RAID.

In a way it's even better, that SSDs [are said to] fail abruptly and entirely.
You can then just restore from backups and go on. Whereas a failing HDD can
leave you puzzled on e.g. whether it's a cable or controller problem instead,
and possibly can even cause some data corruption which you won't notice until
too late.

-- 
With respect,
Roman


Re: Btrfs/SSD

2017-05-13 Thread Duncan
Kai Krakow posted on Fri, 12 May 2017 20:27:56 +0200 as excerpted:

> In the end, the more continuous blocks of free space there are, the
> better the chance for proper wear leveling.

Talking about which...

When I was doing my ssd research the first time around, the going 
recommendation was to keep 20-33% of the total space on the ssd entirely 
unallocated, allowing it to use that space as an FTL erase-block 
management pool.

At the time, I added up all my "performance matters" data dirs and 
allowing for reasonable in-filesystem free-space, decided I could fit it 
in 64 GB if I had to, tho 80 GB would be a more comfortable fit, so 
allowing for the above entirely unpartitioned/unused slackspace 
recommendations, had a target of 120-128 GB, with a reasonable range 
depending on actual availability of 100-160 GB.

It turned out, due to pricing and availability, I ended up spending 
somewhat more and getting 256 GB (238.5 GiB).  Of course that allowed me 
much more flexibility than I had expected and I ended up with basically 
everything but the media partition on the ssds, PLUS I still left them at 
only just over 50% partitioned, (using the gdisk figures, 51%- 
partitioned, 49%+ free).

Given that, I've not enabled btrfs trim/discard (which saved me from the 
bugs with it a few kernel cycles ago), and while I do have a weekly fstrim 
systemd timer setup, I've not had to be too concerned about btrfs bugs 
(also now fixed, I believe) when fstrim on btrfs was known not to be 
trimming everything it really should have been.


Anyway, that 20-33% left entirely unallocated/unpartitioned 
recommendation still holds, right?  Am I correct in asserting that if one 
is following that, the FTL already has plenty of erase-blocks available 
for management and the discussion about filesystem level trim and free 
space management becomes much less urgent, tho of course it's still worth 
considering if it's convenient to do so?

And am I also correct in believing that while it's not really worth 
spending more to over-provision to the near 50% as I ended up doing, if 
things work out that way as they did with me because the difference in 
price between 30% overprovisioning and 50% overprovisioning ends up being 
trivial, there's really not much need to worry about active filesystem 
trim at all, because the FTL has effectively half the device left to play 
erase-block musical chairs with as it decides it needs to?


Of course the higher per-GiB cost of ssd as compared to spinning rust 
does mean that the above overprovisioning recommendation really does 
hurt, most of the time, driving per-usable-GB costs even higher, and as I 
recall that was definitely the case back then between 80 GiB and 160 GiB, 
and it was basically an accident of timing, that I was buying just as the 
manufacturers flooded the market with newly cost-effective 256 GB devices, 
that meant they were only trivially more expensive than the 128 or 160 
GB, AND unlike the smaller devices, actually /available/ in the 500-ish 
MB/sec performance range that (for SATA-based SSDs) is actually capped by 
SATA-600 bus speeds more than the chips themselves.  (There were lower 
cost 128 GB devices, but they were lower speed than I wanted, too.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Btrfs/SSD

2017-05-12 Thread Imran Geriskovan
On 5/12/17, Kai Krakow  wrote:
> I don't think it is important for the file system to know where the SSD
> FTL located a data block. It's just important to keep everything nicely
> aligned with erase block sizes, reduce rewrite patterns, and free up
> complete erase blocks as good as possible.

Yeah. "Tight packing" of data into erase blocks will reduce fragmentation
at flash level, but not necessarily the fragmentation at fs level. And
unless we are writing in continuous journaling style (as f2fs ?),
we still need to have some info about the erase blocks.

Of course, while all this is going on, there is also something like round-robin
mapping or some kind of journaling going on at the low-level flash for
wear leveling/bad block replacement, which is totally invisible to us.


> Maybe such a process should be called "compaction" and not
> "defragmentation". In the end, the more continuous blocks of free space
> there are, the better the chance for proper wear leveling.


Tight packing into erase blocks seems to be the dominant factor for ssd welfare.

However, fs fragmentation may still be a thing to consider because
increased fs fragmentation will probably increase the # of erase
blocks involved, affecting both read/write performance and wear.

Keeping an eye on both is a tough job. Worse, there are "two" uncoordinated
eyes, one watching the "fs" and the other watching the "flash", making the
whole process suboptimal.

I think the ultimate utopic combination would be an "absolutely dumb flash
controller" providing direct access to physical bytes and the ultimate
"Flash FS" making use of every possible performance and wear-leveling trick.

Clearly, we are far from it.


Re: Btrfs/SSD

2017-05-12 Thread Kai Krakow
Am Fri, 12 May 2017 15:02:20 +0200
schrieb Imran Geriskovan :

> On 5/12/17, Duncan <1i5t5.dun...@cox.net> wrote:
> > FWIW, I'm in the market for SSDs ATM, and remembered this from a
> > couple weeks ago so went back to find it.  Thanks. =:^)
> >
> > (I'm currently still on quarter-TB generation ssds, plus spinning
> > rust for the larger media partition and backups, and want to be rid
> > of the spinning rust, so am looking at half-TB to TB, which seems
> > to be the pricing sweet spot these days anyway.)  
> 
> Since you are taking ssds mainstream based on your experience,
> I guess your perception of data retention/reliability is better than
> that of spinning rust. Right? Can you elaborate?
> 
> Or an other criteria might be physical constraints of spinning rust
> on notebooks which dictates that you should handle the device
> with care when running.
> 
> What was your primary motivation other than performance?

Personally, I don't really trust SSDs that much. They are much more
robust when it comes to physical damage because there are no moving
parts. That's absolutely not my concern: in that regard, I trust SSDs
more than HDDs.

My concern is with the failure scenarios of some SSDs, which die unexpectedly
and horribly. I found some reports of older Samsung SSDs which failed
suddenly and unexpectedly, and in a way that the drive completely died:
no more data access, everything gone. HDDs start with bad sectors and
there's a good chance I can recover most of the data except a few
sectors.

When SSD blocks die, they are probably huge compared to a sector (256kB
to 4MB usually, because those are the erase block sizes). If this happens, the
firmware may decide to either allow read-only access or completely deny
access. There's another situation where dying storage chips may
completely mess up the firmware and there's no longer any access to the
data.

That's why I don't trust any of my data to them. But I still want the
benefit of their speed. So I use SSDs mostly as frontend caches to
HDDs. This gives me big storage with fast access. Indeed, I'm using
bcache successfully for this. A warm cache is almost as fast as native
SSD (at least it feels almost that fast; it would be slower if you threw
benchmarks at it).


-- 
Regards,
Kai

Replies to list-only preferred.



Re: Btrfs/SSD

2017-05-12 Thread Kai Krakow
Am Tue, 18 Apr 2017 15:02:42 +0200
schrieb Imran Geriskovan :

> On 4/17/17, Austin S. Hemmelgarn  wrote:
> > Regarding BTRFS specifically:
> > * Given my recently newfound understanding of what the 'ssd' mount
> > option actually does, I'm inclined to recommend that people who are
> > using high-end SSD's _NOT_ use it as it will heavily increase
> > fragmentation and will likely have near zero impact on actual device
> > lifetime (but may _hurt_ performance).  It will still probably help
> > with mid and low-end SSD's.  
> 
> I'm trying to have a proper understanding of what "fragmentation"
> really means for an ssd and interrelation with wear-leveling.
> 
> Before continuing lets remember:
> Pages cannot be erased individually, only whole blocks can be erased.
> The size of a NAND-flash page size can vary, and most drive have pages
> of size 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have blocks of 128 or 256
> pages, which means that the size of a block can vary between 256 KB
> and 4 MB.
> codecapsule.com/.../coding-for-ssds-part-2-architecture-of-an-ssd-and-benchmarking/
> 
> Lets continue:
> Since block sizes are between 256k-4MB, data smaller than this will
> "probably" will not be fragmented in a reasonably empty and trimmed
> drive. And for a brand new ssd we may speak of contiguous series
> of blocks.
> 
> However, as drive is used more and more and as wear leveling kicking
> in (ie. blocks are remapped) the meaning of "contiguous blocks" will
> erode. So any file bigger than a block size will be written to blocks
> physically apart no matter what their block addresses says. But my
> guess is that accessing device blocks -contiguous or not- are
> constant time operations. So it would not contribute performance
> issues. Right? Comments?
> 
> So your the feeling about fragmentation/performance is probably
> related with if the file is spread into less or more blocks. If # of
> blocks used is higher than necessary (ie. no empty blocks can be
> found. Instead lots of partially empty blocks have to be used
> increasing the total # of blocks involved) then we will notice
> performance loss.
> 
> Additionally if the filesystem will gonna try something to reduce
> the fragmentation for the blocks, it should precisely know where
> those blocks are located. Then how about ssd block informations?
> Are they available and do filesystems use it?
> 
> Anyway if you can provide some more details about your experiences
> on this we can probably have better view on the issue.

What you really want for SSD is not defragmented files but defragmented
free space. That increases life time.

So, defragmentation on SSD makes sense if it cares more about free
space but not file data itself.

But of course, over time, fragmentation of file data (be it meta data
or content data) may introduce overhead - and in btrfs it probably
really makes a difference if I scan through some of the past posts.

I don't think it is important for the file system to know where the SSD
FTL located a data block. It's just important to keep everything nicely
aligned with erase block sizes, reduce rewrite patterns, and free up
complete erase blocks as good as possible.

Maybe such a process should be called "compaction" and not
"defragmentation". In the end, the more continuous blocks of free space
there are, the better the chance for proper wear leveling.


-- 
Regards,
Kai

Replies to list-only preferred.



Re: Btrfs/SSD

2017-05-12 Thread Imran Geriskovan
On 5/12/17, Duncan <1i5t5.dun...@cox.net> wrote:
> FWIW, I'm in the market for SSDs ATM, and remembered this from a couple
> weeks ago so went back to find it.  Thanks. =:^)
>
> (I'm currently still on quarter-TB generation ssds, plus spinning rust
> for the larger media partition and backups, and want to be rid of the
> spinning rust, so am looking at half-TB to TB, which seems to be the
> pricing sweet spot these days anyway.)

Since you are taking ssds mainstream based on your experience,
I guess your perception of data retention/reliability is better than that
of spinning rust. Right? Can you elaborate?

Or an other criteria might be physical constraints of spinning rust
on notebooks which dictates that you should handle the device
with care when running.

What was your primary motivation other than performance?


Re: Btrfs/SSD

2017-05-11 Thread Duncan
Austin S. Hemmelgarn posted on Mon, 17 Apr 2017 07:53:04 -0400 as
excerpted:

> * In my personal experience, Intel, Samsung, and Crucial appear to be
> the best name brands (in relative order of quality).  I have personally
> had bad experiences with SanDisk and Kingston SSD's, but I don't have
> anything beyond circumstantial evidence indicating that it was anything
> but bad luck on both counts.

FWIW, I'm in the market for SSDs ATM, and remembered this from a couple 
weeks ago so went back to find it.  Thanks. =:^)

(I'm currently still on quarter-TB generation ssds, plus spinning rust 
for the larger media partition and backups, and want to be rid of the 
spinning rust, so am looking at half-TB to TB, which seems to be the 
pricing sweet spot these days anyway.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Btrfs/SSD

2017-04-19 Thread Chris Murphy
On Mon, Apr 17, 2017 at 4:55 PM, Hans van Kranenburg
<hans.van.kranenb...@mendix.com> wrote:
> On 04/17/2017 09:22 PM, Imran Geriskovan wrote:
>> [...]
>>
>> Going over the thread following questions come to my mind:
>>
>> - What exactly does btrfs ssd option does relative to plain mode?
>
> There's quite an amount of information in the the very recent threads:
> - "About free space fragmentation, metadata write amplification and (no)ssd"
> - "BTRFS as a GlusterFS storage back-end, and what I've learned from
> using it as such."
> - "btrfs filesystem keeps allocating new chunks for no apparent reason"
> - ... and a few more
>
> I suspect there will be some "summary" mails at some point, but for now,
> I'd recommend crawling through these threads first.
>
> And now for your instant satisfaction, a short visual guide to the
> difference, which shows actual btrfs behaviour instead of our guesswork
> around it (taken from the second mail thread just mentioned):
>
> -o ssd:
>
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-19-noautodefrag-ichiban.mp4
>
> -o nossd:
>
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4

I'm uncertain from these whether the option affects both metadata and data
writes, or just data. The latter makes some sense: if you think a
given data write event contains related files, it increases the
chance, when those files are deleted, of having a mostly freed up erase
block. That way wear leveling is doing less work. For metadata writes
it makes less sense to me, and is inconsistent with what I've seen
from metadata chunk allocation. Pretty much anything means dozens or
more 16K nodes are being COWd. e.g. a 2KiB write to systemd journal,
even preallocated, means adding an EXTENT DATA item, one of maybe 200
per node, which means that whole node must be COWd, and whatever its
parent is must be written (ROOT ITEM I think) and then tree root, and
then super block. I see generally 30 16K nodes modified in about 4
minutes with average logging. Even if it's 1 change per 4 minutes, and
all 30 nodes get written to one 2MB block, and then that block isn't
ever written to again, the metadata chunk would be growing and I don't
see that. For weeks or months I see a 512MB metadata chunk and it
doesn't ever get bigger than this.
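
Putting rough numbers on that observation (30 x 16K nodes per ~4 minutes),
just to see the steady-state metadata write rate it implies:

nodes, node_size, interval = 30, 16 * 1024, 4 * 60    # observed: 30 x 16 KiB per 4 min
rate = nodes * node_size / interval                   # bytes per second
per_day = rate * 86400
print(f"~{rate / 1024:.0f} KiB/s, ~{per_day / 2**30:.2f} GiB of metadata writes per day")
# ~2 KiB/s and ~0.16 GiB/day of node rewrites -- small enough that, COWed back
# into free space of the same 512MB chunk, the chunk would never need to grow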

Anyway, I think ssd mount option still sounds plausibly useful. What
I'm skeptical of on SSD is defragmenting without compression, and also
nocow.


-- 
Chris Murphy


Re: Btrfs/SSD

2017-04-18 Thread Austin S. Hemmelgarn

On 2017-04-18 09:02, Imran Geriskovan wrote:

On 4/17/17, Austin S. Hemmelgarn  wrote:

Regarding BTRFS specifically:
* Given my recently newfound understanding of what the 'ssd' mount
option actually does, I'm inclined to recommend that people who are
using high-end SSD's _NOT_ use it as it will heavily increase
fragmentation and will likely have near zero impact on actual device
lifetime (but may _hurt_ performance).  It will still probably help with
mid and low-end SSD's.


I'm trying to have a proper understanding of what "fragmentation" really
means for an ssd and interrelation with wear-leveling.

Before continuing lets remember:
Pages cannot be erased individually, only whole blocks can be erased.
The size of a NAND-flash page size can vary, and most drive have pages
of size 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have blocks of 128 or 256
pages, which means that the size of a block can vary between 256 KB
and 4 MB.
codecapsule.com/.../coding-for-ssds-part-2-architecture-of-an-ssd-and-benchmarking/

Lets continue:
Since block sizes are between 256k-4MB, data smaller than this
"probably" will not be fragmented in a reasonably empty and trimmed
drive. And for a brand new ssd we may speak of contiguous series
of blocks.
We're slightly talking past each other here.  I'm referring to 
fragmentation on the filesystem level.  This impacts performance on 
SSD's because it necessitates a larger number of IO operations to read 
the data off of the device (which is also the case on traditional HDD's, 
but it has near zero impact there compared to the seek latency).  You 
appear to be referring to fragmentation at the level of the 
flash-translation layer (FTL), which is present in almost any SSD, and 
should have near zero impact on performance if the device has good 
firmware and a decent controller.


However, as drive is used more and more and as wear leveling kicking in
(ie. blocks are remapped) the meaning of "contiguous blocks" will erode.
So any file bigger than a block size will be written to blocks physically apart
no matter what their block addresses says. But my guess is that accessing
device blocks -contiguous or not- are constant time operations. So it would
not contribute performance issues. Right? Comments?

Correct.


So your the feeling about fragmentation/performance is probably related
with if the file is spread into less or more blocks. If # of blocks used
is higher than necessary (ie. no empty blocks can be found. Instead
lots of partially empty blocks have to be used increasing the total
# of blocks involved) then we will notice performance loss.

Kind of.

As an example, consider a 16MB file on a device that can read up to 16MB 
of data in a single read operation (arbitrary numbers chosen to make the math 
easier).


If you copy that file onto the device while it's idle and has a block of 
free space 16MB in size, it will end up as one extent (in BTRFS at 
least, and probably also in most other  extent-based filesystems).  In 
that case, it will take 1 read operation to read the whole file into memory.


If instead that file gets created with multiple extents that aren't 
right next to each other on disk, you will need a number of read 
operation equal to the number of extents to read the file into memory.


The performance loss I'm referring to when talking about fragmentation 
is the result of the increased number of read operations required to 
read a file with a larger number of extents into memory.  It actually 
has nothing to do with whether or not the device is an SSD, a HDD, a 
DVD, NVRAM, SPI NOR flash, an SD card, or any other storage device; it 
just has relatively more impact on storage devices that have zero seek 
latency, because on devices that do seek, the seek latency usually far 
exceeds the overhead of the extra read operations.
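
A toy illustration of that point (the 16MB-per-request limit is just the 
arbitrary number from the example, not a property of any real device):

MAX_READ = 16 * 1024 * 1024     # the example device: up to 16MB per read request

def read_ops(extent_sizes):
    """Read requests needed for one file, given its extent sizes in bytes."""
    ops = 0
    for size in extent_sizes:
        ops += -(-size // MAX_READ)      # ceiling division: each extent needs >= 1 read
    return ops

print(read_ops([16 * 1024 * 1024]))      # one 16MB extent     -> 1 read
print(read_ops([1024 * 1024] * 16))      # sixteen 1MB extents -> 16 reads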


Additionally if the filesystem will gonna try something to reduce
the fragmentation for the blocks, it should precisely know where
those blocks are located. Then how about ssd block informations?
Are they available and do filesystems use it?

Anyway if you can provide some more details about your experiences
on this we can probably have better view on the issue.



* Files with NOCOW and filesystems with 'nodatacow' set will both hurt
performance for BTRFS on SSD's, and appear to reduce the lifetime of the
SSD.


This and other experiences tell us it is still possible to "forge some
blocks of ssd". How could this be possible if there is wear-leveling?

Two alternatives comes to mind:

- If there is no empty (trimmed) blocks left on the ssd, it will have no
chance other than forging the block. How about its reserve blocks?
Are they exhausted too? Or are they only used as bad block replacements?

- No proper wear-levelling is actually done by the drive.



Re: Btrfs/SSD

2017-04-18 Thread Austin S. Hemmelgarn

On 2017-04-17 15:22, Imran Geriskovan wrote:

On 4/17/17, Roman Mamedov <r...@romanrm.net> wrote:

"Austin S. Hemmelgarn" <ahferro...@gmail.com> wrote:



* Compression should help performance and device lifetime most of the
time, unless your CPU is fully utilized on a regular basis (in which
case it will hurt performance, but still improve device lifetimes).



Days are long gone since the end user had to ever think about device lifetimes
with SSDs. Refer to endurance studies such as
It has been demonstrated that all SSDs on the market tend to overshoot even
their rated TBW by several times, as a result it will take any user literally
dozens of years to wear out the flash no matter which filesystem or what
settings used. And most certainly it's not worth it changing anything
significant in your workflow (such as enabling compression if it's
otherwise inconvenient or not needed) just to save the SSD lifetime.


Going over the thread following questions come to my mind:

- What exactly does the btrfs ssd option do relative to plain mode?
Assuming I understand what it does correctly, it prioritizes writing 
into larger, 2MB aligned chunks of free-space, whereas normal mode goes 
for 64k alignment.
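
A tiny sketch of that difference (the 2MB and 64k figures are the assumed 
cluster sizes from the sentence above; the fragment sizes are invented):

def usable_fragments(free_fragment_sizes, cluster_size):
    """Free-space fragments the allocator would still consider at this cluster size."""
    return [size for size in free_fragment_sizes if size >= cluster_size]

free = [16 * 1024, 96 * 1024, 512 * 1024, 3 * 1024 * 1024]   # fragment sizes in bytes
print(usable_fragments(free, 64 * 1024))         # plain mode: three of the four qualify
print(usable_fragments(free, 2 * 1024 * 1024))   # 'ssd' mode: only the 3MB fragment does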


- Most (all?) SSDs employ wear leveling, don't they? That is, they are
constantly remapping their blocks under the hood. So isn't it
meaningless to speak of some kind of block forging/fragmentation/etc.
effect of any writing pattern?
Because making one big I/O request to fetch a file is faster than a 
bunch of small ones.  If your file is all in one extent in the 
filesystem, it takes less work to copy to memory than if you're pulling 
form a dozen places on the device.  This doesn't have much impact on 
light workloads, but when you're looking at heavy server workloads, it's 
big.


- If it is so, Doesn't it mean that there is no better ssd usage strategy
other than minimizing the total bytes written? That is whatever we do,
if it contributes to this fact it is good, otherwise bad. Are all other things
beyond any user control? Is there a recommended setting?
As a general strategy, yes, that appears to be the case.  On a specific 
SSD, it may not be.  For example, on the Crucial MX300's I have in most 
of my systems, the 'ssd' mount option actually makes things slower by 
anywhere from 2-10%.


- How about "data retension" experiences? It is known that
new ssds can hold data safely for longer period. As they age
that margin gets shorter. As an extreme case if I write into a new
ssd and shelve it, can i get back my data back after 5 years?
How about a file written 5 years ago and never touched again although
rest of the ssd is in active use during that period?

- Yes may be lifetimes getting irrelevant. However TBW has
still direct relation with data retension capability.
Knowing that writing more data to a ssd can reduce the
"life time of your data" is something strange.
Explaining this and your comment above requires a bit of understanding 
of how flash memory actually works.  The general structure of a single 
cell is that of a field-effect transistor (almost always a MOSFET) with 
a floating gate which consists of a bit of material electrically 
isolated from the rest of the transistor.  Data is stored by trapping 
electrons on this floating gate, but getting them there requires a 
strong enough current to break through the insulating layer that keeps 
it isolated from the rest of the transistor.  This process breaks down 
the insulating layer over time, making it easier for the electrons 
trapped in the floating gate to leak back into the rest of the 
transistor, thus losing data.


Aside from the write-based degradation of the insulating layer, there 
are other things that can cause it to break down or for the electrons to 
leak out, including very high temperatures (we're talking industrial 
temperatures here, not the type you're likely to see in most consumer 
electronics), strong electromagnetic fields (again, we're talking 
_really_ strong here, not stuff you're likely to see in most consumer 
electronics), cosmic background radiation, and even noise from other 
nearby cells being rewritten (known as a read disturb error, only an 
issue in NAND flash (but that's what all SSD's are these days)).


- But someone can come and say: Hey don't worry about
"data retension years". Because your ssd will already be dead
before data retension becomes a problem for you... Which is
relieving.. :)) Anyway what are your opinions?
On this in particular, my opinion is that that claim is bogus unless you 
have an SSD designed to brick itself after a fixed period of time.  That 
statement is about the same as saying that you don't need to worry about 
uncorrectable errors in ECC RAM because you'll lose entire chips before 
they ever happen.  In both cases, you should indeed be worrying more 
about catastrophic failure, but that's because it will have a bigger 
impa

Re: Btrfs/SSD

2017-04-18 Thread Hugo Mills
On Tue, Apr 18, 2017 at 07:31:34AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-04-17 15:39, Chris Murphy wrote:
> >On Mon, Apr 17, 2017 at 1:26 PM, Austin S. Hemmelgarn
> > wrote:
> >>On 2017-04-17 14:34, Chris Murphy wrote:
[...]
> >It's almost like we need these things to not fsync at all, and just
> >rely on the filesystem commit time...
> 
> 
> Essentially yes, but that causes all kinds of other problems.
> >>>
> >>>
> >>>Drat.
> >>>
> >>Admittedly most of the problems are use-case specific (you can't afford to
> >>lose transactions in a financial database  for example, so it functionally
> >>has to call fsync after each transaction), but most of it stems from the
> >>fact that BTRFS is doing a lot of the same stuff that much of the 'problem'
> >>software is doing itself internally.
> >>
> >
> >Seems like the old way of doing things, and the staleness of the
> >internet, have colluded to create a lot of nervousness and misuse of
> >fsync. The very fact Btrfs needs a log tree to deal with fsync's in a
> >semi-sane way...
> Except that BTRFS is somewhat unusual.  Prior to this, the only
> 'mainstream' filesystem that provided most of these features was
> ZFS, and that does a good enough job that this doesn't matter.
> 
> For something like a database though, where you need ACID
> guarantees, you pretty much have to have COW semantics internally,
> and you have to force things to stable storage after each
> transaction that actually modifies data.  Looking at it another way,
> most database storage formats are essentially record-oriented
> filesystems (as opposed to block-oriented filesystems that most
> people think of).  This is part of why you see such similar access
> patterns in databases and VM disk images (even if the VM isn't
> running database software), they are essentially doing the same
> things at a low level.

   I remember thinking, when I was learning about the internals of
btrfs, that it looked an awful lot like the high-level description of
the internals of Oracle which I'd just been learning about. Most of
the same pieces, doing mostly the same kinds of operations to achieve the
same effective results.

   Hugo.

-- 
Hugo Mills | Don't worry, he's not drunk. He's like that all the
hugo@... carfax.org.uk | time.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   A.H. Deakin




Re: Btrfs/SSD

2017-04-18 Thread Austin S. Hemmelgarn

On 2017-04-17 15:39, Chris Murphy wrote:

On Mon, Apr 17, 2017 at 1:26 PM, Austin S. Hemmelgarn
 wrote:

On 2017-04-17 14:34, Chris Murphy wrote:



Nope. The first paragraph applies to NVMe machine with ssd mount
option. Few fragments.

The second paragraph applies to SD Card machine with ssd_spread mount
option. Many fragments.


Ah, apologies for my misunderstanding.



These are different versions of systemd-journald so I can't completely
rule out a difference in write behavior.


There have only been a couple of changes in the write patterns that I know
of, but I would double check that the values for Seal and Compress in the
journald.conf file are the same, as I know for a fact that changing those
does change the write patterns (not much, but they do change).


Same, unchanged defaults on both systems.

#Storage=auto
#Compress=yes
#Seal=yes
#SplitMode=uid
#SyncIntervalSec=5m
#RateLimitIntervalSec=30s
#RateLimitBurst=1000


The sync interval sec is curious. 5 minutes? Umm, I'm seeing nearly
constant hits every 2-5 seconds on the journal file; using filefrag.
I'm sure there's a better way to trace a single file being
read/written to than this, but...
AIUI, the sync interval is like BTRFS's commit interval: the journal 
file is guaranteed to be 100% consistent at least once every 
SyncIntervalSec seconds.


As far as tracing, I think it's possible to do some kind of filtering 
with btrace so you just see a specific file, but I'm not certain.
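
For what it's worth, a rough sketch that gets close without btrace is 
inotify plus filefrag.  This assumes the inotify-tools package is 
installed, and the journal path below is only a guess at the usual 
persistent-journal location, so adjust it:

  # Watch one journal file for write activity and sample its extent count.
  JOURNAL=/var/log/journal/$(cat /etc/machine-id)/system.journal

  # Print a timestamped line for every modify event on the file...
  inotifywait -m -e modify --timefmt '%T' --format '%T %e %w' "$JOURNAL" &

  # ...while sampling fragmentation every five seconds in the foreground.
  while sleep 5; do filefrag "$JOURNAL"; done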




It's almost like we need these things to not fsync at all, and just
rely on the filesystem commit time...



Essentially yes, but that causes all kinds of other problems.



Drat.


Admittedly most of the problems are use-case specific (you can't afford to
lose transactions in a financial database  for example, so it functionally
has to call fsync after each transaction), but most of it stems from the
fact that BTRFS is doing a lot of the same stuff that much of the 'problem'
software is doing itself internally.



Seems like the old way of doing things, and the staleness of the
internet, have colluded to create a lot of nervousness and misuse of
fsync. The very fact Btrfs needs a log tree to deal with fsync's in a
semi-sane way...
Except that BTRFS is somewhat unusual.  Prior to this, the only 
'mainstream' filesystem that provided most of these features was ZFS, 
and that does a good enough job that this doesn't matter.


For something like a database though, where you need ACID guarantees, 
you pretty much have to have COW semantics internally, and you have to 
force things to stable storage after each transaction that actually 
modifies data.  Looking at it another way, most database storage formats 
are essentially record-oriented filesystems (as opposed to 
block-oriented filesystems that most people think of).  This is part of 
why you see such similar access patterns in databases and VM disk images 
(even if the VM isn't running database software), they are essentially 
doing the same things at a low level.



Re: Btrfs/SSD

2017-04-17 Thread Roman Mamedov
On Tue, 18 Apr 2017 03:23:13 + (UTC)
Duncan <1i5t5.dun...@cox.net> wrote:

> Without reading the links...
> 
> Are you /sure/ it's /all/ ssds currently on the market?  Or are you 
> thinking narrowly, those actually sold as ssds?
> 
> Because all I've read (and I admit I may not actually be current, but...) 
> on for instance sd cards, certainly ssds by definition, says they're 
> still very write-cycle sensitive -- very simple FTL with little FTL wear-
> leveling.
> 
> And AFAIK, USB thumb drives tend to be in the middle, moderately complex 
> FTL with some, somewhat simplistic, wear-leveling.
> 

If I have to clarify, yes, it's all about SATA and NVMe SSDs. SD cards may be
SSDs "by definition", but nobody will think of an SD card when you say "I
bought an SSD for my computer". And yes, SD cards and USB flash sticks are
commonly understood to be much simpler and more brittle devices than full
blown desktop (not to mention server) SSDs.

> While the stuff actually marketed as SSDs, generally SATA or direct PCIE/
> NVME connected, may indeed match your argument, no real end-user concern 
> necessary any more as the FTLs are advanced enough that user or 
> filesystem level write-cycle concerns simply aren't necessary these days.
> 
> 
> So does that claim that write-cycle concerns simply don't apply to modern 
> ssds, also apply to common thumb drives and sd cards?  Because these are 
> certainly ssds both technically and by btrfs standards.
> 


-- 
With respect,
Roman


Re: Btrfs/SSD

2017-04-17 Thread Duncan
Roman Mamedov posted on Mon, 17 Apr 2017 23:24:19 +0500 as excerpted:

> Days are long gone since the end user had to ever think about device
> lifetimes with SSDs. Refer to endurance studies such as
> http://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead
> http://ssdendurancetest.com/
> https://3dnews.ru/938764/
> It has been demonstrated that all SSDs on the market tend to overshoot
> even their rated TBW by several times, as a result it will take any user
> literally dozens of years to wear out the flash no matter which
> filesystem or what settings used

Without reading the links...

Are you /sure/ it's /all/ ssds currently on the market?  Or are you 
thinking narrowly, those actually sold as ssds?

Because all I've read (and I admit I may not actually be current, but...) 
on for instance sd cards, certainly ssds by definition, says they're 
still very write-cycle sensitive -- very simple FTL with little FTL wear-
leveling.

And AFAIK, USB thumb drives tend to be in the middle, moderately complex 
FTL with some, somewhat simplistic, wear-leveling.

While the stuff actually marketed as SSDs, generally SATA or direct PCIE/
NVME connected, may indeed match your argument: no real end-user concern is 
necessary any more, as the FTLs are advanced enough that user or 
filesystem level write-cycle concerns simply aren't needed these days.


So does that claim that write-cycle concerns simply don't apply to modern 
ssds, also apply to common thumb drives and sd cards?  Because these are 
certainly ssds both technically and by btrfs standards.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Btrfs/SSD

2017-04-17 Thread Hans van Kranenburg
On 04/17/2017 09:22 PM, Imran Geriskovan wrote:
> [...]
> 
> Going over the thread following questions come to my mind:
> 
> - What exactly does btrfs ssd option does relative to plain mode?

There's quite an amount of information in the very recent threads:
- "About free space fragmentation, metadata write amplification and (no)ssd"
- "BTRFS as a GlusterFS storage back-end, and what I've learned from
using it as such."
- "btrfs filesystem keeps allocating new chunks for no apparent reason"
- ... and a few more

I suspect there will be some "summary" mails at some point, but for now,
I'd recommend crawling through these threads first.

And now for your instant satisfaction, a short visual guide to the
difference, which shows actual btrfs behaviour instead of our guesswork
around it (taken from the second mail thread just mentioned):

-o ssd:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-19-noautodefrag-ichiban.mp4

-o nossd:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4

-- 
Hans van Kranenburg


Re: Btrfs/SSD

2017-04-17 Thread Chris Murphy
On Mon, Apr 17, 2017 at 1:26 PM, Austin S. Hemmelgarn
 wrote:
> On 2017-04-17 14:34, Chris Murphy wrote:

>> Nope. The first paragraph applies to NVMe machine with ssd mount
>> option. Few fragments.
>>
>> The second paragraph applies to SD Card machine with ssd_spread mount
>> option. Many fragments.
>
> Ah, apologies for my misunderstanding.
>>
>>
>> These are different versions of systemd-journald so I can't completely
>> rule out a difference in write behavior.
>
> There have only been a couple of changes in the write patterns that I know
> of, but I would double check that the values for Seal and Compress in the
> journald.conf file are the same, as I know for a fact that changing those
> does change the write patterns (not much, but they do change).

Same, unchanged defaults on both systems.

#Storage=auto
#Compress=yes
#Seal=yes
#SplitMode=uid
#SyncIntervalSec=5m
#RateLimitIntervalSec=30s
#RateLimitBurst=1000


The sync interval sec is curious. 5 minutes? Umm, I'm seeing nearly
constant hits every 2-5 seconds on the journal file; using filefrag.
I'm sure there's a better way to trace a single file being
read/written to than this, but...


 It's almost like we need these things to not fsync at all, and just
 rely on the filesystem commit time...
>>>
>>>
>>> Essentially yes, but that causes all kinds of other problems.
>>
>>
>> Drat.
>>
> Admittedly most of the problems are use-case specific (you can't afford to
> lose transactions in a financial database  for example, so it functionally
> has to call fsync after each transaction), but most of it stems from the
> fact that BTRFS is doing a lot of the same stuff that much of the 'problem'
> software is doing itself internally.
>

Seems like the old way of doing things, and the staleness of the
internet, have colluded to create a lot of nervousness and misuse of
fsync. The very fact Btrfs needs a log tree to deal with fsync's in a
semi-sane way...


-- 
Chris Murphy


Re: Btrfs/SSD

2017-04-17 Thread Austin S. Hemmelgarn

On 2017-04-17 14:34, Chris Murphy wrote:

On Mon, Apr 17, 2017 at 11:13 AM, Austin S. Hemmelgarn
 wrote:


What is a high end SSD these days? Built-in NVMe?


One with a good FTL in the firmware.  At minimum, the good Samsung EVO
drives, the high quality Intel ones, and the Crucial MX series, but probably
some others.  My choice of words here probably wasn't the best though.


It's a confusing market that sorta defies figuring out what we've got.

I have a Samsung EVO SATA SSD in one laptop, but then I have a Samsung
EVO+ SD Card in an Intel NUC. They use that same EVO branding on an
$11 SD Card.

And then there's the Samsung Electronics Co Ltd NVMe SSD Controller
SM951/PM951 in another laptop.
What makes it even more confusing is that, other than Samsung (who _only_ 
use their own flash and controllers), manufacturer does not map to 
controller choice consistently, and even two drives with the same 
controller may have different firmware and thus different degrees of 
reliability.  (Those OCZ drives that were such crap at data retention were 
the result of a firmware option that the controller manufacturer pretty 
much told them not to use on production devices.)




So long as this file is not reflinked or snapshot, filefrag shows a
pile of mostly 4096 byte blocks, thousands. But as they're pretty much
all continuous, the file fragmentation (extent count) is usually never
higher than 12. It meanders between 1 and 12 extents for its life.

Except on the system using ssd_spread mount option. That one has a
journal file that is +C, is not being snapshot, but has over 3000
extents per filefrag and btrfs-progs/debugfs. Really weird.


Given how the 'ssd' mount option behaves and the frequency that most systemd
instances write to their journals, that's actually reasonably expected.  We
look for big chunks of free space to write into and then align to 2M
regardless of the actual size of the write, which in turn means that files
like the systemd journal which see lots of small (relatively speaking)
writes will have way more extents than they should until you defragment
them.


Nope. The first paragraph applies to NVMe machine with ssd mount
option. Few fragments.

The second paragraph applies to SD Card machine with ssd_spread mount
option. Many fragments.

Ah, apologies for my misunderstanding.


These are different versions of systemd-journald so I can't completely
rule out a difference in write behavior.
There have only been a couple of changes in the write patterns that I 
know of, but I would double check that the values for Seal and Compress 
in the journald.conf file are the same, as I know for a fact that 
changing those does change the write patterns (not much, but they do 
change).
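
If anyone wants to A/B that, a minimal sketch is a drop-in that flips just 
those two options (assuming a systemd new enough to read journald.conf.d 
fragments):

  # Override only Compress and Seal, then restart journald.
  mkdir -p /etc/systemd/journald.conf.d
  printf '[Journal]\nCompress=no\nSeal=no\n' \
      > /etc/systemd/journald.conf.d/write-pattern-test.conf
  systemctl restart systemd-journald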




Now, systemd aside, there are databases that behave this same way
where there's a small section constantly being overwritten, and one or
more sections that grow the database file from within and at the end.
If this is made cow, the file will absolutely fragment a ton. And
especially if the changes are mostly 4KiB block sizes that then are
fsync'd.

It's almost like we need these things to not fsync at all, and just
rely on the filesystem commit time...


Essentially yes, but that causes all kinds of other problems.


Drat.

Admittedly most of the problems are use-case specific (you can't afford 
to lose transactions in a financial database  for example, so it 
functionally has to call fsync after each transaction), but most of it 
stems from the fact that BTRFS is doing a lot of the same stuff that 
much of the 'problem' software is doing itself internally.




Re: Btrfs/SSD

2017-04-17 Thread Imran Geriskovan
On 4/17/17, Roman Mamedov <r...@romanrm.net> wrote:
> "Austin S. Hemmelgarn" <ahferro...@gmail.com> wrote:

>> * Compression should help performance and device lifetime most of the
>> time, unless your CPU is fully utilized on a regular basis (in which
>> case it will hurt performance, but still improve device lifetimes).

> Days are long gone since the end user had to ever think about device lifetimes
> with SSDs. Refer to endurance studies such as
> It has been demonstrated that all SSDs on the market tend to overshoot even
> their rated TBW by several times, as a result it will take any user literally
> dozens of years to wear out the flash no matter which filesystem or what
> settings used. And most certainly it's not worth it changing anything
> significant in your workflow (such as enabling compression if it's
> otherwise inconvenient or not needed) just to save the SSD lifetime.

Going over the thread following questions come to my mind:

- What exactly does btrfs ssd option does relative to plain mode?

- Most (all?) SSDs employ wear leveling, don't they? That is, they are
constantly remapping their blocks under the hood. So isn't it
meaningless to speak of some kind of block forging/fragmentation/etc.
effect of any writing pattern?

- If so, doesn't it mean that there is no better ssd usage strategy
than minimizing the total bytes written? That is, whatever we do,
if it contributes to that goal it is good, otherwise bad. Is everything
else beyond user control? Is there a recommended setting?

- How about "data retention" experiences? It is known that
new ssds can hold data safely for a longer period. As they age
that margin gets shorter. As an extreme case, if I write to a new
ssd and shelve it, can I get my data back after 5 years?
How about a file written 5 years ago and never touched again, although
the rest of the ssd is in active use during that period?

- Yes, maybe lifetimes are becoming irrelevant. However, TBW still has
a direct relation to data retention capability.
Knowing that writing more data to an ssd can reduce the
"lifetime of your data" is a strange thought.

- But someone can come and say: Hey, don't worry about
"data retention years", because your ssd will already be dead
before data retention becomes a problem for you... Which is
relieving.. :)) Anyway, what are your opinions?


Re: Btrfs/SSD

2017-04-17 Thread Chris Murphy
On Mon, Apr 17, 2017 at 11:13 AM, Austin S. Hemmelgarn
 wrote:

>> What is a high end SSD these days? Built-in NVMe?
>
> One with a good FTL in the firmware.  At minimum, the good Samsung EVO
> drives, the high quality Intel ones, and the Crucial MX series, but probably
> some others.  My choice of words here probably wasn't the best though.

It's a confusing market that sorta defies figuring out what we've got.

I have a Samsung EVO SATA SSD in one laptop, but then I have a Samsung
EVO+ SD Card in an Intel NUC. They use that same EVO branding on an
$11 SD Card.

And then there's the Samsung Electronics Co Ltd NVMe SSD Controller
SM951/PM951 in another laptop.


>> So long as this file is not reflinked or snapshot, filefrag shows a
>> pile of mostly 4096 byte blocks, thousands. But as they're pretty much
>> all continuous, the file fragmentation (extent count) is usually never
>> higher than 12. It meanders between 1 and 12 extents for its life.
>>
>> Except on the system using ssd_spread mount option. That one has a
>> journal file that is +C, is not being snapshot, but has over 3000
>> extents per filefrag and btrfs-progs/debugfs. Really weird.
>
> Given how the 'ssd' mount option behaves and the frequency that most systemd
> instances write to their journals, that's actually reasonably expected.  We
> look for big chunks of free space to write into and then align to 2M
> regardless of the actual size of the write, which in turn means that files
> like the systemd journal which see lots of small (relatively speaking)
> writes will have way more extents than they should until you defragment
> them.

Nope. The first paragraph applies to NVMe machine with ssd mount
option. Few fragments.

The second paragraph applies to SD Card machine with ssd_spread mount
option. Many fragments.

These are different versions of systemd-journald so I can't completely
rule out a difference in write behavior.


>> Now, systemd aside, there are databases that behave this same way
>> where there's a small section constantly being overwritten, and one or
>> more sections that grow the database file from within and at the end.
>> If this is made cow, the file will absolutely fragment a ton. And
>> especially if the changes are mostly 4KiB block sizes that then are
>> fsync'd.
>>
>> It's almost like we need these things to not fsync at all, and just
>> rely on the filesystem commit time...
>
> Essentially yes, but that causes all kinds of other problems.

Drat.

-- 
Chris Murphy


Re: Btrfs/SSD

2017-04-17 Thread Roman Mamedov
On Mon, 17 Apr 2017 07:53:04 -0400
"Austin S. Hemmelgarn"  wrote:

> General info (not BTRFS specific):
> * Based on SMART attributes and other factors, current life expectancy 
> for light usage (normal desktop usage) appears to be somewhere around 
> 8-12 years depending on specifics of usage (assuming the same workload, 
> F2FS is at the very top of the range, BTRFS and NILFS2 are on the upper 
> end, XFS is roughly in the middle, ext4 and NTFS are on the low end 
> (tested using Windows 7's NTFS driver), and FAT32 is an outlier at the 
> bottom of the barrel).

Life expectancy for an SSD is defined not in years, but in TBW (terabytes
written), and AFAICT that's not "from host", but "to flash" (some SSDs will
show you both values in two separate SMART attributes out of the box, on some
it can be unlocked). Filesystems come into play only through the amount of write
amplification they cause (how much "to flash" exceeds "from host").
Do you have any test data to show that FSes are ranked in that order by the WA
they cause, or is it all about "general feel" and how they are branded (F2FS
says so, so it must be the best)?
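
For anyone who would rather look at numbers than branding, a rough sketch of 
how to eyeball that on a drive which exposes both counters; the attribute 
names grepped for below are assumptions and vary by vendor, as do the units 
(512-byte LBAs vs. GiB), so they have to be adapted per model:

  # Dump the host-write counter plus whatever flash-side write/wear counters
  # the drive exposes; WA is roughly flash writes divided by host writes
  # once the units are normalized.
  DEV=/dev/sda
  smartctl -A "$DEV" | grep -Ei 'lbas_written|host_writes|nand_writes|wear'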

> * Queued DISCARD support is still missing in most consumer SATA SSD's, 
> which in turn makes the trade-off on those between performance and 
> lifetime much sharper.

My choice was to make a script to run from crontab, using "fstrim" on all
mounted SSDs nightly, and aside from that all FSes are mounted with
"nodiscard". Best of the both worlds, and no interference with actual IO
operation.
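
Roughly this, as a sketch; the mount points are examples and need to match 
the local SSD-backed filesystems:

  #!/bin/sh
  # /etc/cron.daily/fstrim-ssd (mark it executable): trim each SSD-backed
  # filesystem once a day instead of mounting with the discard option.
  for mnt in / /home; do
      fstrim -v "$mnt" || echo "fstrim failed on $mnt" >&2
  done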

> * Modern (2015 and newer) SSD's seem to have better handling in the FTL 
> for the journaling behavior of filesystems like ext4 and XFS.  I'm not 
> sure if this is actually a result of the FTL being better, or some 
> change in the hardware.

Again, what makes you think this? Did you observe the write amplification
readings, and are those now demonstrably lower than on "2014 and older" SSDs?
So, by how much, and which models did you compare?

> * In my personal experience, Intel, Samsung, and Crucial appear to be 
> the best name brands (in relative order of quality).  I have personally 
> had bad experiences with SanDisk and Kingston SSD's, but I don't have 
> anything beyond circumstantial evidence indicating that it was anything 
> but bad luck on both counts.

Why not think in terms of platforms rather than "name brands", i.e. a controller
model + flash combination? For instance Intel have been using some other
companies' controllers in their SSDs. Kingston uses tons of various
controllers (Sandforce/Phison/Marvell/more?) depending on the model and range.

> * Files with NOCOW and filesystems with 'nodatacow' set will both hurt 
> performance for BTRFS on SSD's, and appear to reduce the lifetime of the 
> SSD.

"Appear to"? Just... what. So how many SSDs did you have fail under nocow?

Or maybe can we get serious in a technical discussion? Did you by any chance
mean that it causes more writes to the SSD and more "to flash" writes (resulting
in a higher WA)? If so, then by how much, and what was your test scenario comparing
the same usage with and without nocow?

> * Compression should help performance and device lifetime most of the 
> time, unless your CPU is fully utilized on a regular basis (in which 
> case it will hurt performance, but still improve device lifetimes).

Days are long gone since the end user had to ever think about device lifetimes
with SSDs. Refer to endurance studies such as 
http://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead
http://ssdendurancetest.com/
https://3dnews.ru/938764/
It has been demonstrated that all SSDs on the market tend to overshoot even
their rated TBW by several times, as a result it will take any user literally
dozens of years to wear out the flash no matter which filesystem or what
settings used. And most certainly it's not worth it changing anything
significant in your workflow (such as enabling compression if it's otherwise
inconvenient or not needed) just to save the SSD lifetime.

On Mon, 17 Apr 2017 13:13:39 -0400
"Austin S. Hemmelgarn"  wrote:

> > What is a high end SSD these days? Built-in NVMe?
> One with a good FTL in the firmware.  At minimum, the good Samsung EVO 
> drives, the high quality Intel ones

As opposed to bad Samsung EVO drives and low-quality Intel ones?

> and the Crucial MX series, but 
> probably some others.  My choice of words here probably wasn't the best 
> though.

Again, which controller? Crucial does not manufacture SSD controllers on their
own; they just package and brand stuff manufactured by someone else. So if you
meant Marvell-based SSDs, then that's many brands, not just Crucial.

> For a normal filesystem or BTRFS with nodatacow or NOCOW, the block gets 
> rewritten in-place.  This means that cheap FTL's will rewrite that erase 
> block in-place (which won't hurt performance but will impact device 
> lifetime), and good ones will rewrite into a free block somewhere else but
> may not free that original block for quite some time (which is bad for
> performance but slightly better for device lifetime).

Re: Btrfs/SSD

2017-04-17 Thread Austin S. Hemmelgarn

On 2017-04-17 12:58, Chris Murphy wrote:

On Mon, Apr 17, 2017 at 5:53 AM, Austin S. Hemmelgarn
 wrote:


Regarding BTRFS specifically:
* Given my recently newfound understanding of what the 'ssd' mount option
actually does, I'm inclined to recommend that people who are using high-end
SSD's _NOT_ use it as it will heavily increase fragmentation and will likely
have near zero impact on actual device lifetime (but may _hurt_
performance).  It will still probably help with mid and low-end SSD's.


What is a high end SSD these days? Built-in NVMe?
One with a good FTL in the firmware.  At minimum, the good Samsung EVO 
drives, the high quality Intel ones, and the Crucial MX series, but 
probably some others.  My choice of words here probably wasn't the best 
though.





* Files with NOCOW and filesystems with 'nodatacow' set will both hurt
performance for BTRFS on SSD's, and appear to reduce the lifetime of the
SSD.


Can you elaborate. It's an interesting problem, on a small scale the
systemd folks have journald set +C on /var/log/journal so that any new
journals are nocow. There is an initial fallocate, but the write
behavior is writing in the same place at the head and tail. But at the
tail, the writes get pushed toward the middle. So the file is growing
into its fallocated space from the tail. The header changes in the
same location, it's an overwrite.
For a normal filesystem or BTRFS with nodatacow or NOCOW, the block gets 
rewritten in-place.  This means that cheap FTL's will rewrite that erase 
block in-place (which won't hurt performance but will impact device 
lifetime), and good ones will rewrite into a free block somewhere else 
but may not free that original block for quite some time (which is bad 
for performance but slightly better for device lifetime).


When BTRFS does a COW operation on a block however, it will guarantee 
that that block moves.  Because of this, the old location will either:

1. Be discarded by the FS itself if the 'discard' mount option is set.
2. Be caught by a scheduled call to 'fstrim'.
3. Lay dormant for at least a while.

The first case is ideal for most FTL's, because it lets them know 
immediately that that data isn't needed and the space can be reused. 
The second is close to ideal, but defers telling the FTL that the block 
is unused, which can be better on some SSD's (some have firmware that 
handles wear-leveling better in batches).  The third is not ideal, but 
is still better than what happens with NOCOW or nodatacow set.
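
Concretely, cases 1 and 2 map to something like the following (a sketch; the 
mount point is an example, and fstrim.timer is only there on distros that 
ship util-linux's systemd units):

  # Case 1: have the filesystem issue discards itself.
  mount -o remount,discard /mnt/ssd

  # Case 2: batch the discards instead, either manually...
  fstrim -v /mnt/ssd
  # ...or on a schedule.
  systemctl enable --now fstrim.timer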


Overall, this boils down to the fact that most FTL's get slower if they 
can't wear-level the device properly, and in-place rewrites make it 
harder for them to do proper wear-leveling.


So long as this file is not reflinked or snapshot, filefrag shows a
pile of mostly 4096 byte blocks, thousands. But as they're pretty much
all continuous, the file fragmentation (extent count) is usually never
higher than 12. It meanders between 1 and 12 extents for its life.

Except on the system using ssd_spread mount option. That one has a
journal file that is +C, is not being snapshot, but has over 3000
extents per filefrag and btrfs-progs/debugfs. Really weird.
Given how the 'ssd' mount option behaves and the frequency that most 
systemd instances write to their journals, that's actually reasonably 
expected.  We look for big chunks of free space to write into and then 
align to 2M regardless of the actual size of the write, which in turn 
means that files like the systemd journal which see lots of small 
(relatively speaking) writes will have way more extents than they should 
until you defragment them.
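
On an 'ssd'-mounted filesystem the workaround is the obvious one; a minimal 
sketch, assuming the journals live in the default location, and keeping in 
mind that defragmenting breaks extent sharing with any snapshots:

  # Recursively defragment the systemd journal directory, listing each file.
  btrfs filesystem defragment -r -v /var/log/journal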


Now, systemd aside, there are databases that behave this same way
where there's a small section constantly being overwritten, and one or
more sections that grow the database file from within and at the end.
If this is made cow, the file will absolutely fragment a ton. And
especially if the changes are mostly 4KiB block sizes that then are
fsync'd.

It's almost like we need these things to not fsync at all, and just
rely on the filesystem commit time...

Essentially yes, but that causes all kinds of other problems.


Re: Btrfs/SSD

2017-04-17 Thread Chris Murphy
On Mon, Apr 17, 2017 at 5:53 AM, Austin S. Hemmelgarn
 wrote:

> Regarding BTRFS specifically:
> * Given my recently newfound understanding of what the 'ssd' mount option
> actually does, I'm inclined to recommend that people who are using high-end
> SSD's _NOT_ use it as it will heavily increase fragmentation and will likely
> have near zero impact on actual device lifetime (but may _hurt_
> performance).  It will still probably help with mid and low-end SSD's.

What is a high end SSD these days? Built-in NVMe?



> * Files with NOCOW and filesystems with 'nodatacow' set will both hurt
> performance for BTRFS on SSD's, and appear to reduce the lifetime of the
> SSD.

Can you elaborate. It's an interesting problem, on a small scale the
systemd folks have journald set +C on /var/log/journal so that any new
journals are nocow. There is an initial fallocate, but the write
behavior is writing in the same place at the head and tail. But at the
tail, the writes get pushed toward the middle. So the file is growing
into its fallocated space from the tail. The header changes in the
same location, it's an overwrite.
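
For reference, what journald does there can be reproduced by hand; a sketch, 
with the caveat that +C only takes effect for files created (or still empty) 
after the flag is set:

  # Mark the journal directory NOCOW so newly created journals inherit it.
  chattr +C /var/log/journal
  lsattr -d /var/log/journal    # the 'C' flag should now show up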

So long as this file is not reflinked or snapshot, filefrag shows a
pile of mostly 4096 byte blocks, thousands. But as they're pretty much
all continuous, the file fragmentation (extent count) is usually never
higher than 12. It meanders between 1 and 12 extents for its life.

Except on the system using ssd_spread mount option. That one has a
journal file that is +C, is not being snapshot, but has over 3000
extents per filefrag and btrfs-progs/debugfs. Really weird.

Now, systemd aside, there are databases that behave this same way
where there's a small section constantly being overwritten, and one or
more sections that grow the database file from within and at the end.
If this is made cow, the file will absolutely fragment a ton. And
especially if the changes are mostly 4KiB block sizes that then are
fsync'd.

It's almost like we need these things to not fsync at all, and just
rely on the filesystem commit time...





-- 
Chris Murphy


Re: Btrfs/SSD

2017-04-17 Thread Austin S. Hemmelgarn

On 2017-04-14 07:02, Imran Geriskovan wrote:

Hi,
Sometime ago we had some discussion about SSDs.
Within the limits of unknown/undocumented device infos,
we loosely had covered data retention capability/disk age/life time
interrelations, (in?)effectiveness of btrfs dup on SSDs, etc..

Now, as time passed and with some accumulated experience on SSDs
I think we again can have a status check/update on them if you
can share your experiences and best practices.

So if you have something to share about SSDs (it may or may not be
directly related with btrfs) I'm sure everybody here will be happy to
hear it.


General info (not BTRFS specific):
* Based on SMART attributes and other factors, current life expectancy 
for light usage (normal desktop usage) appears to be somewhere around 
8-12 years depending on specifics of usage (assuming the same workload, 
F2FS is at the very top of the range, BTRFS and NILFS2 are on the upper 
end, XFS is roughly in the middle, ext4 and NTFS are on the low end 
(tested using Windows 7's NTFS driver), and FAT32 is an outlier at the 
bottom of the barrel).
* Queued DISCARD support is still missing in most consumer SATA SSD's, 
which in turn makes the trade-off on those between performance and 
lifetime much sharper.
* Modern (2015 and newer) SSD's seem to have better handling in the FTL 
for the journaling behavior of filesystems like ext4 and XFS.  I'm not 
sure if this is actually a result of the FTL being better, or some 
change in the hardware.
* In my personal experience, Intel, Samsung, and Crucial appear to be 
the best name brands (in relative order of quality).  I have personally 
had bad experiences with SanDisk and Kingston SSD's, but I don't have 
anything beyond circumstantial evidence indicating that it was anything 
but bad luck on both counts.


Regarding BTRFS specifically:
* Given my recently newfound understanding of what the 'ssd' mount 
option actually does, I'm inclined to recommend that people who are 
using high-end SSD's _NOT_ use it as it will heavily increase 
fragmentation and will likely have near zero impact on actual device 
lifetime (but may _hurt_ performance).  It will still probably help with 
mid and low-end SSD's.
* Files with NOCOW and filesystems with 'nodatacow' set will both hurt 
performance for BTRFS on SSD's, and appear to reduce the lifetime of the 
SSD.
* Compression should help performance and device lifetime most of the 
time, unless your CPU is fully utilized on a regular basis (in which 
case it will hurt performance, but still improve device lifetimes).



Btrfs/SSD

2017-04-14 Thread Imran Geriskovan
Hi,
Some time ago we had some discussion about SSDs.
Within the limits of unknown/undocumented device info,
we loosely covered data retention capability/disk age/lifetime
interrelations, the (in?)effectiveness of btrfs dup on SSDs, etc.

Now, as time has passed and with some accumulated experience on SSDs,
I think we can again have a status check/update on them, if you
can share your experiences and best practices.

So if you have something to share about SSDs (it may or may not be
directly related with btrfs) I'm sure everybody here will be happy to
hear it.

Regards,
Imran


Re: [PATCH] btrfs: SSD related mount option dependency rework.

2014-10-26 Thread Qu Wenruo

Any new comments?

Cc: Satoru
Sorry for the late reply.
Unfortunately, it seems that your mail didn't show up in my inbox; it 
only appears in patchwork.


About writing the dependencies in btrfs.txt: since the patch is based on 
btrfs.txt, the dependencies are already documented there, so I'd rather 
not repeat them.


Thanks,
Qu

 Original Message 
Subject: [PATCH] btrfs: SSD related mount option dependency rework.
From: Qu Wenruo quwen...@cn.fujitsu.com
To: linux-btrfs@vger.kernel.org
Date: 2014-08-01 11:27

According to Documentations/filesystem/btrfs.txt, ssd/ssd_spread/nossd
have their own dependencies (see below), but only ssd_spread implying ssd is
implemented.

ssd_spread implies ssd, conflicts nossd.
ssd conflicts nossd.
nossd conflicts ssd and ssd_spread.

This patch adds the ssd{,_spread} conflict with nossd.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 fs/btrfs/super.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 8e16bca..2508a16 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -515,19 +515,22 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
                                            compress_type);
                         }
                         break;
-               case Opt_ssd:
-                       btrfs_set_and_info(root, SSD,
-                                          "use ssd allocation scheme");
-                       break;
                case Opt_ssd_spread:
                        btrfs_set_and_info(root, SSD_SPREAD,
                                           "use spread ssd allocation scheme");
+                       /* suppress the ssd mount option log */
                        btrfs_set_opt(info->mount_opt, SSD);
+                       /* fall through for other ssd routine */
+               case Opt_ssd:
+                       btrfs_set_and_info(root, SSD,
+                                          "use ssd allocation scheme");
+                       btrfs_clear_opt(info->mount_opt, NOSSD);
                        break;
                case Opt_nossd:
                        btrfs_set_and_info(root, NOSSD,
                                           "not using ssd allocation scheme");
                        btrfs_clear_opt(info->mount_opt, SSD);
+                       btrfs_clear_opt(info->mount_opt, SSD_SPREAD);
                        break;
                case Opt_barrier:
                        btrfs_clear_and_info(root, NOBARRIER,




Re: [PATCH] btrfs: SSD related mount option dependency rework.

2014-08-07 Thread Satoru Takeuchi
Hi Qu,

(2014/08/01 12:27), Qu Wenruo wrote:
 According to Documentations/filesystem/btrfs.txt, ssd/ssd_spread/nossd
 have their own dependencies (see below), but only ssd_spread implying ssd is
 implemented.
 
 ssd_spread implies ssd, conflicts nossd.
 ssd conflicts nossd.
 nossd conflicts ssd and ssd_spread.
 
 This patch adds the ssd{,_spread} conflict with nossd.

How about writing down the above-mentioned dependencies in
Documentations/filesystem/btrfs.txt too?

Thanks,
Satoru

 
 Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
 ---
 fs/btrfs/super.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 8e16bca..2508a16 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -515,19 +515,22 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
                                            compress_type);
                         }
                         break;
-               case Opt_ssd:
-                       btrfs_set_and_info(root, SSD,
-                                          "use ssd allocation scheme");
-                       break;
                case Opt_ssd_spread:
                        btrfs_set_and_info(root, SSD_SPREAD,
                                           "use spread ssd allocation scheme");
+                       /* suppress the ssd mount option log */
                        btrfs_set_opt(info->mount_opt, SSD);
+                       /* fall through for other ssd routine */
+               case Opt_ssd:
+                       btrfs_set_and_info(root, SSD,
+                                          "use ssd allocation scheme");
+                       btrfs_clear_opt(info->mount_opt, NOSSD);
                        break;
                case Opt_nossd:
                        btrfs_set_and_info(root, NOSSD,
                                           "not using ssd allocation scheme");
                        btrfs_clear_opt(info->mount_opt, SSD);
+                       btrfs_clear_opt(info->mount_opt, SSD_SPREAD);
                        break;
                case Opt_barrier:
                        btrfs_clear_and_info(root, NOBARRIER,
 



[PATCH] btrfs: SSD related mount option dependency rework.

2014-07-31 Thread Qu Wenruo
According to Documentations/filesystem/btrfs.txt, ssd/ssd_spread/nossd
have their own dependencies (see below), but only ssd_spread implying ssd is
implemented.

ssd_spread implies ssd, conflicts nossd.
ssd conflicts nossd.
nossd conflicts ssd and ssd_spread.

This patch adds the ssd{,_spread} conflict with nossd.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 fs/btrfs/super.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 8e16bca..2508a16 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -515,19 +515,22 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
                                            compress_type);
                         }
                         break;
-               case Opt_ssd:
-                       btrfs_set_and_info(root, SSD,
-                                          "use ssd allocation scheme");
-                       break;
                case Opt_ssd_spread:
                        btrfs_set_and_info(root, SSD_SPREAD,
                                           "use spread ssd allocation scheme");
+                       /* suppress the ssd mount option log */
                        btrfs_set_opt(info->mount_opt, SSD);
+                       /* fall through for other ssd routine */
+               case Opt_ssd:
+                       btrfs_set_and_info(root, SSD,
+                                          "use ssd allocation scheme");
+                       btrfs_clear_opt(info->mount_opt, NOSSD);
                        break;
                case Opt_nossd:
                        btrfs_set_and_info(root, NOSSD,
                                           "not using ssd allocation scheme");
                        btrfs_clear_opt(info->mount_opt, SSD);
+                       btrfs_clear_opt(info->mount_opt, SSD_SPREAD);
                        break;
                case Opt_barrier:
                        btrfs_clear_and_info(root, NOBARRIER,
-- 
2.0.3



BTRFS, SSD and single metadata

2014-06-16 Thread Swâmi Petaramesh
Hi,

I created a BTRFS filesystem over LVM over LUKS encryption on an SSD [yes, I 
know...], and I noticed that the FS got created with metadata in DUP mode, 
contrary to what man mkfs.btrfs says for SSDs - it is supposed to be 
SINGLE...

Well I don't know if my system didn't identify the SSD because of the LVM+LUKS 
stack (however it mounts well by itself with the ssd flag and accepts the 
discard option [yes, I know...]), or if the manpage is obsolete or if this 
feature just doesn't work...?

The SSD being a Micron RealSSD C400

For both SSD preservation and data integrity, would it be advisable to change 
metadata to SINGLE using a rebalance, or would I be better off just leaving 
things the way they are...?

TIA for any insight.

-- 
Swâmi Petaramesh sw...@petaramesh.org http://petaramesh.org PGP 9076E32E

All the misfortune of men comes from the fact that they do not live in _the_
world, but in _their_ world.
-- Heraclitus.



Re: BTRFS, SSD and single metadata

2014-06-16 Thread Austin S Hemmelgarn
On 2014-06-16 03:54, Swâmi Petaramesh wrote:
 Hi,
 
 I created a BTRFS filesytem over LVM over LUKS encryption on an SSD [yes, I 
 know...], and I noticed that the FS got created with metadata in DUP mode, 
 contrary to what man mkfs.btrfs says for SSDs - it would be supposed to be 
 SINGLE...
 
 Well I don't know if my system didn't identify the SSD because of the 
 LVM+LUKS 
 stack (however it mounts well by itself with the ssd flag and accepts the 
 discard option [yes, I know...]), or if the manpage is obsolete or if this 
 feature just doesn't work...?
 
 The SSD being a Micron RealSSD C400
 
 For both SSD preservation and data integrity, would it be advisable to change 
 metadata to SINGLE using a rebalance, or if I'd better just leave things 
 the 
 way they are...?
 
 TIA for any insight.
 
What mkfs.btrfs looks at is
/sys/block/whatever-device/queue/rotational; if that is 1, it knows
that the device isn't an SSD.  I believe that LVM passes through whatever
the next lower layer's value is, but dmcrypt (and by extension LUKS)
always forces it to 1 (possibly to prevent programs from using
heuristics for enabling discard).
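
A quick way to see what each layer reports, as a sketch (the device and dm 
names are examples and will differ per machine):

  # Compare the rotational flag on the raw disk and on the dm layers above it.
  lsblk -o NAME,TYPE,ROTA /dev/sda
  cat /sys/block/sda/queue/rotational     # physical SSD: expect 0
  cat /sys/block/dm-1/queue/rotational    # dm-crypt/LVM device: may report 1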





Re: BTRFS, SSD and single metadata

2014-06-16 Thread Swâmi Petaramesh
Hi Austin, and thanks for your reply.

On Monday, 16 June 2014, 07:09:55, Austin S Hemmelgarn wrote:
 
 What mkfs.btrfs looks at is
 /sys/block/whatever-device/queue/rotational, if that is 1 it knows
 that the device isn't a SSD.  I believe that LVM passes through whatever
 the next lower layer's value is, but dmcrypt (and by extension LUKS)
 always force it to a 1 (possibly to prevent programs from using
 heuristics for enabling discard)

In the current running condition, the system clearly sees that this is *not* 
rotational, even through the LVM/dmcrypt stack:

# mount | grep btrfs
/dev/mapper/VG-LINUX on / type btrfs 
(rw,noatime,seclabel,compress=lzo,ssd,discard,space_cache,autodefrag)

# ll /dev/mapper/VGV-LINUX
lrwxrwxrwx. 1 root root 7 16 juin  09:21 /dev/mapper/VG-LINUX - ../dm-1

# cat /sys/block/dm-1/queue/rotational 
0

...However, at mkfs.btrfs time, it might well not have seen it, as I made it 
from a live USB key in which both the lvm.conf and crypttab had not been 
tailored to allow trim commands...

However, now that the FS is created, I still wonder whether I should use a 
rebalance to change the metadata from DUP to SINGLE, or if I'd better stay 
with DUP...

Kind regards.


-- 
Swâmi Petaramesh sw...@petaramesh.org http://petaramesh.org PGP 9076E32E



Re: BTRFS, SSD and single metadata

2014-06-16 Thread Austin S Hemmelgarn
On 2014-06-16 07:18, Swâmi Petaramesh wrote:
 Hi Austin, and thanks for your reply.
 
 On Monday, 16 June 2014, 07:09:55, Austin S Hemmelgarn wrote:

 What mkfs.btrfs looks at is
 /sys/block/whatever-device/queue/rotational, if that is 1 it knows
 that the device isn't a SSD.  I believe that LVM passes through whatever
 the next lower layer's value is, but dmcrypt (and by extension LUKS)
 always force it to a 1 (possibly to prevent programs from using
 heuristics for enabling discard)
 
 In the current running condition, the system clearly sees this is *not* 
 rotational, even thru the LVM/dmcrypt stack :
 
 # mount | grep btrfs
 /dev/mapper/VG-LINUX on / type btrfs 
 (rw,noatime,seclabel,compress=lzo,ssd,discard,space_cache,autodefrag)
 
 # ll /dev/mapper/VGV-LINUX
 lrwxrwxrwx. 1 root root 7 16 juin  09:21 /dev/mapper/VG-LINUX - ../dm-1
 
 # cat /sys/block/dm-1/queue/rotational 
 0
 
 ...However, at mkfs.btrfs time, it might well not have seen it, as I made it 
 from a live USB key in which both the lvm.conf and crypttab had not been 
 tailored to allow trim commands...
 
 However, now that the FS is created, I still wonder whether I should use a 
 rebalance to change the metadata from DUP to SINGLE, or if I'd better stay 
 with DUP...
 
 Kind regards.
 
 
I'd personally stay with the DUP profile, but then that's just me being
paranoid.  You will almost certainly get better performance using the
SINGLE profile instead of DUP, but this is mostly due to it requiring
fewer blocks to be encrypted by LUKS (Which is almost certainly your
primary bottleneck unless you have some high-end crypto-accelerator card).
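
If you do decide to convert, the balance convert filter does it in place on 
the mounted filesystem; a sketch:

  # Convert metadata from DUP to single...
  btrfs balance start -mconvert=single /
  # ...or back to DUP later if you change your mind.
  btrfs balance start -mconvert=dup /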





Re: BTRFS, SSD and single metadata

2014-06-16 Thread Russell Coker
On Mon, 16 Jun 2014 07:23:14 Austin S Hemmelgarn wrote:
 I'd personally stay with the DUP profile, but then that's just me being
 paranoid.  You will almost certainly get better performance using the
 SINGLE profile instead of DUP, but this is mostly due to it requiring
 fewer blocks to be encrypted by LUKS (Which is almost certainly your
 primary bottleneck unless you have some high-end crypto-accelerator card).

On my Q8400 workstation running BTRFS over LUKS on an Intel SSD the primary 
bottleneck has always been BTRFS.  The message I wrote earlier today about 
BTRFS fallocate() performance was on this system, I had BTRFS using kernel CPU 
time for periods of 10+ seconds without ANY disk IO - so LUKS wasn't a 
performance issue.

So far I've never seen LUKS be a performance bottleneck.  When running LUKS on 
spinning media the disk seek performance will almost always be the bottleneck.

The worst case for LUKS is transferring large amounts of data such as 
contiguous reads.  In a contiguous read test I'm seeing 120MB/s for LUKS on a 
SSD and 200MB/s for direct access to the same SSD.  That is a reasonable 
difference, but it's not something I've been able to hit with any real-world 
use while BTRFS metadata performance is often an issue.
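
For anyone wanting to reproduce that kind of comparison, a sketch; the device 
and mapping names are examples, and both reads are non-destructive:

  # Raw cipher throughput of the CPU (no disk involved).
  cryptsetup benchmark

  # Sequential read straight from the SSD, bypassing the page cache...
  dd if=/dev/sda of=/dev/null bs=1M count=4096 iflag=direct

  # ...and the same amount read back through the dm-crypt/LVM mapping.
  dd if=/dev/mapper/vg-home of=/dev/null bs=1M count=4096 iflag=direct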

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS, SSD and single metadata

2014-06-16 Thread Duncan
Swâmi Petaramesh posted on Mon, 16 Jun 2014 09:54:01 +0200 as excerpted:

 I created a BTRFS filesytem over LVM over LUKS encryption on an SSD
 [yes, I know...], and I noticed that the FS got created with metadata in
 DUP mode, contrary to what man mkfs.btrfs says for SSDs - it would
 be supposed to be SINGLE...
 
 Well I don't know if my system didn't identify the SSD because of the
 LVM+LUKS stack (however it mounts well by itself with the ssd flag and
 accepts the discard option [yes, I know...]), or if the manpage is
 obsolete or if this feature just doesn't work...?

Does btrfs automatically add the ssd mount option or do you have to add 
it?  If you have to add it, that means btrfs isn't detecting the ssd, 
which would explain why mkfs.btrfs didn't detect it either... as you said 
very likely due to the LVM over LUKS stack.

I believe the detection is actually based on what the kernel reports.  I 
may be mistaken and I'm not running stacked devices ATM in order to 
check them, but check /sys/block/device/queue/rotational.  On my ssds, 
the value in that file is 0.  On my spinning rust, it's 1.  If that is 
indeed what it's looking at, you can verify that your hardware device is 
actually detected by the kernel as rotational 0, and then check the 
various layers of your stack and see where the rotational 0 (or the 
rotational file itself) gets dropped.

 The SSD being a Micron RealSSD C400
 
 For both SSD preservation and data integrity, would it be advisable to
 change metadata to SINGLE using a rebalance, or if I'd better just
 leave things the way they are...?
 
 TIA for any insight.

That's a very good question, on which there has been some debate on this 
list.

The first question:  Does your SSD firmware do compression and dedup, or 
not?  IIRC the sandforce (I believe that's the firmware name) firmware 
does compression and dedup, but not all firmware does.  I know a bullet-
point feature of my SSDs (several Corsair Neutron 256 GB, FWIW) is that 
they do NOT do this sort of compression and dedup -- what the system 
tells them to save is what they save, and if you tell it to save a 
hundred copies of the same thing, that's what it does.  (These SSDs are 
targeted at commercial usage and this is billed as a performance 
reliability feature.)

The reason originally given for defaulting to single mode metadata on 
SSDs was this possible dedup -- dup-mode metadata 
might actually end up single-copy-only due to the firmware compression 
and dedup in any case.  Between that and their typically smaller size, I 
guess the decision was that single was the best default.

However, it occurs to me that with the LUKS encryption layer, I'm not 
entirely sure if duplication at the btrfs level would end up as the same 
encrypted stream headed to the hardware in any case.  If it would encrypt 
the two copies of the dup-mode metadata as different, then the hardware 
dedup/compression wouldn't work on it anyway.  OTOH, if it encrypts them 
as the same stream headed to hardware, then the dedup concern would still apply.

Meanwhile... which is better and should you rebalance to single?  In 
terms of data integrity, dup mode is definitely better, since if there's 
damage to one copy such that it doesn't pass checksum verification, you 
still have the other copy to read from... and to rebuild the damaged copy 
from.

OTOH, single does take less space, and performance should be slightly 
better.  If you're keeping good backups anyway, or if the ssd's firmware 
might be mucking with things leaving you with only a single copy in any 
case, single mode could be a better choice.


FWIW, while most of my partitions are btrfs raid1 here, so the second 
copy is on a different physical device, /boot is an exception.  I have a 
separate /boot on each device, pointed at by the grub loaded on each 
device so I can use the BIOS boot selector to choose which one I boot.  
That lets me keep a working /boot that I boot most of the time on one 
device, and a backup /boot on the other device, in case something goes 
wrong with the first.

So those dedicated /boot partitions are an exception to my normal btrfs 
raid1.  They're both (working and primary backup) 256 MiB mixed-data/
metadata mode as they're so small, which means data and metadata must 
both be the same mode, and they're both set to dup mode.

Which means they effectively only hold (a bit under) 128 MiB worth of 
data, since it's dup mode for both data and metadata, but 128 MiB is fine 
for /boot, as long as I don't let too many kernels build up.

So I'd obviously recommend dup mode, unless you know your ssd's firmware 
is going to dedup it anyway.  But that's just me.  As I said, there has 
been some discussion about it on the list, and some people make other 
choices.  You can of course dig in the list archives if you want to see 
previous threads on the subject, but ultimately, it's upto you. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

Re: BTRFS, SSD and single metadata

2014-06-16 Thread Swâmi Petaramesh
On Monday, 16 June 2014, 12:16:33, Duncan wrote:
 Does btrfs automatically add the ssd mount option or do you have to add 
 it?  If you have to add it, that means btrfs isn't detecting the ssd, 

The first time I mounted the freshly created filesystem, it actually added the 
ssd option by itself. Thus indeed, it could see this was an SSD...

 However, it occurs to me that with the LUKS encryption layer, I'm not 
 entirely sure if duplication at the btrfs level would end up as the same 
 encrypted stream headed to the hardware in any case.  If it would encrypt 
 the two copies of the dup-mode metadata as different, then the hardware 
 dedup/compression wouldn't work on it anyway.  OTOH, if it encrypts them 
 as the same stream headed to hardware, then again, it would matter.

That makes an excellent point. Two copies of the same binary data in different 
filesystem sectors, processed through LUKS, will create entirely different binary 
ciphertext, so for sure both metadata copies will always be different and 
cannot be deduped by the SSD firmware...

Kind regards.

-- 
Swâmi Petaramesh sw...@petaramesh.org http://petaramesh.org PGP 9076E32E

Human beings, hemmed in by their needs, flail about aimlessly like rabbits
caught in a trap. Therefore, monk, turn away from these needs and find
freedom.
-- Buddha Shakyamuni



BTRFS SSD RAID 1: Does it trim on both devices? :)

2014-03-09 Thread Martin Steigerwald
Hi!

As of a few days ago this ThinkPad T520 has 780 GB of SSD capacity. The 300 GB
Intel SSD 320 was almost full, and the 480 GB Crucial m500 mSATA SSD
was cheap enough to just buy it.

I created a new logical volume, just on the mSATA, for big and rarely changed
files, and moved all the music and photo files to it.

Then I thought I´d work in place: just shrink the /home volume on the Intel
SSD 320, then add a similarly sized volume on the Crucial and have it
rebalanced to metadata and data RAID 1. Unfortunately the shrinking hung
at some point. I didn´t check with a scrub afterward, but I rsync´d all
data to a newly created volume on the Crucial and then rebalanced the
other way around. Since the rsync didn´t show any I/O errors I suppose
the BTRFS with the shrink failure was still okay. I'll report this shrink
failure separately.

So now I have a BTRFS RAID 1 for /home and /, and I wondered: does BTRFS
trim all devices when fstrim is issued?



Some details:

merkaba:~ btrfs fi show
Label: debian  uuid: […]
Total devices 2 FS bytes used 15.89GiB
devid1 size 30.00GiB used 19.03GiB path /dev/mapper/sata-debian
devid2 size 30.00GiB used 19.01GiB path /dev/dm-2

Label: home  uuid: […]
Total devices 2 FS bytes used 91.59GiB
devid1 size 150.00GiB used 102.00GiB path /dev/dm-0
devid2 size 150.00GiB used 102.00GiB path /dev/mapper/sata-home

Label: daten  uuid: […]
Total devices 1 FS bytes used 148.36GiB
devid1 size 200.00GiB used 150.02GiB path /dev/mapper/msata-daten



merkaba:~ LANG=C df -hT -t btrfs
Filesystem  Type   Size  Used Avail Use% Mounted on
/dev/mapper/sata-debian btrfs   60G   32G   26G  56% /
/dev/dm-0   btrfs  300G  184G  113G  62% /home
/dev/mapper/sata-debian btrfs   60G   32G   26G  56% /mnt/debian-zeit
/dev/dm-0   btrfs  300G  184G  113G  62% /mnt/home-zeit
/dev/mapper/msata-daten btrfs  200G  149G   51G  75% /daten

(the zeit mounts show the root sub volume for snapshots)



merkaba:[…] ./btrfs fi df /
Disk size:60.00GB
Disk allocated:   38.04GB
Disk unallocated: 21.96GB
Used: 15.89GB
Free (Estimated): 14.13GB   (Max: 25.10GB, min: 14.12GB)
Data to disk ratio:  50 %
merkaba:[…] ./btrfs fi df /home
Disk size:   300.00GB
Disk allocated:  204.00GB
Disk unallocated: 96.00GB
Used: 91.59GB
Free (Estimated): 58.41GB   (Max: 106.41GB, min: 58.41GB)
Data to disk ratio:  50 %



merkaba:[…] ./btrfs device disk-usage /
/dev/dm-2  30.00GB
   Data,RAID1:  17.00GB
   Metadata,RAID1:   2.00GB
   System,RAID1: 8.00MB
   Unallocated: 10.99GB

/dev/mapper/sata-debian30.00GB
   Data,Single:  8.00MB
   Data,RAID1:  17.00GB
   Metadata,Single:  8.00MB
   Metadata,RAID1:   2.00GB
   System,Single:4.00MB
   System,RAID1: 8.00MB
   Unallocated: 10.97GB

merkaba:[…] ./btrfs filesystem disk-usage -t /
                         Data    Data     Metadata  Metadata   System   System
                         Single  RAID1    Single    RAID1      Single   RAID1    Unallocated

/dev/dm-2                     -  17.00GB         -    2.00GB        -   8.00MB       10.99GB
/dev/mapper/sata-debian  8.00MB  17.00GB    8.00MB    2.00GB   4.00MB   8.00MB       10.97GB
                         ======  =======  ========  ========   ======   ======   ===========
Total                    8.00MB  17.00GB    8.00MB    2.00GB   4.00MB   8.00MB       21.96GB
Used                       0.00  15.13GB      0.00  778.00MB     0.00  16.00KB


merkaba:[…] ./btrfs device disk-usage /home   
/dev/dm-0 150.00GB
   Data,RAID1:  98.00GB
   Metadata,RAID1:   4.00GB
   System,Single:4.00MB
   Unallocated: 48.00GB

/dev/mapper/sata-home 150.00GB
   Data,RAID1:  98.00GB
   Metadata,RAID1:   4.00GB
   Unallocated: 48.00GB

merkaba:[…] ./btrfs filesystem disk-usage -t /home
                       Data     Metadata  System
                       RAID1    RAID1     Single   Unallocated
/dev/dm-0              98.00GB  4.00GB    4.00MB   48.00GB
/dev/mapper/sata-home  98.00GB  4.00GB    -        48.00GB
                       =======  ========  =======  ===========
Total                  98.00GB  4.00GB    4.00MB   96.00GB
Used                   89.60GB  1.99GB    16.00KB



And some healthy over-provisioning:

merkaba:~ vgs
  VG#PV #LV #SN Attr   VSize   VFree 
  msata   1   3   0 wz--n- 446,64g 66,64g
  sata1   3   0 wz--n- 278,99g 86,99g
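
If those free extents have ever been written to, trimming them once lets the
SSDs treat them as spare area again. A possible way to do that (only a sketch,
assuming the device-mapper stack passes discards through; the temporary LV
name is made up):

  lvcreate -l 100%FREE -n trim-tmp msata
  blkdiscard /dev/msata/trim-tmp
  lvremove -y msata/trim-tmp

and the same for the sata VG.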

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

Re: BTRFS SSD

2010-09-30 Thread Sander
Yuehai Xu wrote (ao):
 So, is it a bottleneck in the case of SSD since the cost for over
 write is very high? For every write, I think the superblocks should be
 overwritten, it might be much more frequent than other common blocks
 in SSD, even though SSD will do wear leveling inside by its FTL.

The FTL will make sure the write cycles are evenly divided among the
physical blocks, regardless of how often you overwrite a single spot on
the fs.

 What I current know is that for Intel x25-V SSD, the write throughput
 of BTRFS is almost 80% less than the one of EXT3 in the case of
 PostMark. This really confuses me.

Can you show the script you use to test this, provide some info
regarding your setup, and show the numbers you see?

Sander

-- 
Humilis IT Services and Solutions
http://www.humilis.net


Re: BTRFS SSD

2010-09-30 Thread David Brown

On 29/09/2010 23:31, Yuehai Xu wrote:

On Wed, Sep 29, 2010 at 3:59 PM, Sean Bartellwingedtachik...@gmail.com  wrote:

On Wed, Sep 29, 2010 at 02:45:29PM -0400, Yuehai Xu wrote:

On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartellwingedtachik...@gmail.com  wrote:

On Wed, Sep 29, 2010 at 11:30:14AM -0400, Yuehai Xu wrote:

I know BTRFS is a kind of Log-structured File System, which doesn't do
overwrite. Here is my question, suppose file A is overwritten by A',
instead of writing A' to the original place of A, a new place is
selected to store it. However, we know that the address of a file
should be recorded in its inode. In such case, the corresponding part
in inode of A should update from the original place A to the new place
A', is this a kind of overwrite actually? I think no matter what
design it is for Log-Structured FS, a mapping table is always needed,
such as inode map, DAT, etc. When a update operation happens for this
mapping table, is it actually a kind of over-write? If it is, is it a
bottleneck for the performance of write for SSD?


In btrfs, this is solved by doing the same thing for the inode--a new
place for the leaf holding the inode is chosen. Then the parent of the
leaf must point to the new position of the leaf, so the parent is moved,
and the parent's parent, etc. This goes all the way up to the
superblocks, which are actually overwritten one at a time.


You mean that there is no over-write for inode too, once the inode
need to be updated, this inode is actually written to a new place
while the only thing to do is to change the point of its parent to
this new place. However, for the last parent, or the superblock, does
it need to be overwritten?


Yes. The idea of copy-on-write, as used by btrfs, is that whenever
*anything* is changed, it is simply written to a new location. This
applies to data, inodes, and all of the B-trees used by the filesystem.
However, it's necessary to have *something* in a fixed place on disk
pointing to everything else. So the superblocks can't move, and they are
overwritten instead.



So, is it a bottleneck in the case of SSD since the cost for over
write is very high? For every write, I think the superblocks should be
overwritten, it might be much more frequent than other common blocks
in SSD, even though SSD will do wear leveling inside by its FTL.



SSDs already do copy-on-write.  They can't change small parts of the 
data in a block, but have to re-write the block.  While that could be 
done by reading the whole erase block to a ram buffer, changing the 
data, erasing the flash block, then re-writing, this is not what happens 
in practice.  To make efficient use of write blocks that are smaller 
than erase blocks, and to provide wear levelling, the flash disk will 
implement a small change to a block by writing a new copy of the 
modified block to a different part of the flash, then updating its block 
indirection tables.
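
As a rough worked example (the numbers are only illustrative): with a 512 KiB
erase block and a 4 KiB update, the naive read-erase-rewrite path would move
128 times the user data, whereas remapping just the 4 KiB page and updating
the indirection table writes little more than the page itself.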


BTRFS just makes this process a bit more explicit (except for superblock 
writes).



What I current know is that for Intel x25-V SSD, the write throughput
of BTRFS is almost 80% less than the one of EXT3 in the case of
PostMark. This really confuses me.



Different file systems have different strengths and weaknesses.  I 
haven't actually tested BTRFS much, but my understanding is that it will 
be significantly slower than EXT in certain cases, such as small 
modifications to large files (since copy-on-write means a lot of extra 
disk activity in such cases).  But for other things it is faster.  Also 
remember that BTRFS is under development - optimising for raw speed 
comes at a lower priority than correctness and safety of data, and 
implementation of BTRFS features.  Once everyone is happy with the 
stability of the file system and its functionality and tools, you can 
expect the speed to improve somewhat over time.




Re: BTRFS SSD

2010-09-30 Thread Yuehai Xu
On Thu, Sep 30, 2010 at 3:51 AM, David Brown da...@westcontrol.com wrote:
 On 29/09/2010 23:31, Yuehai Xu wrote:

 On Wed, Sep 29, 2010 at 3:59 PM, Sean Bartellwingedtachik...@gmail.com
  wrote:

 On Wed, Sep 29, 2010 at 02:45:29PM -0400, Yuehai Xu wrote:

 On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartellwingedtachik...@gmail.com
  wrote:

 On Wed, Sep 29, 2010 at 11:30:14AM -0400, Yuehai Xu wrote:

 I know BTRFS is a kind of Log-structured File System, which doesn't do
 overwrite. Here is my question, suppose file A is overwritten by A',
 instead of writing A' to the original place of A, a new place is
 selected to store it. However, we know that the address of a file
 should be recorded in its inode. In such case, the corresponding part
 in inode of A should update from the original place A to the new place
 A', is this a kind of overwrite actually? I think no matter what
 design it is for Log-Structured FS, a mapping table is always needed,
 such as inode map, DAT, etc. When a update operation happens for this
 mapping table, is it actually a kind of over-write? If it is, is it a
 bottleneck for the performance of write for SSD?

 In btrfs, this is solved by doing the same thing for the inode--a new
 place for the leaf holding the inode is chosen. Then the parent of the
 leaf must point to the new position of the leaf, so the parent is
 moved,
 and the parent's parent, etc. This goes all the way up to the
 superblocks, which are actually overwritten one at a time.

 You mean that there is no over-write for inode too, once the inode
 need to be updated, this inode is actually written to a new place
 while the only thing to do is to change the point of its parent to
 this new place. However, for the last parent, or the superblock, does
 it need to be overwritten?

 Yes. The idea of copy-on-write, as used by btrfs, is that whenever
 *anything* is changed, it is simply written to a new location. This
 applies to data, inodes, and all of the B-trees used by the filesystem.
 However, it's necessary to have *something* in a fixed place on disk
 pointing to everything else. So the superblocks can't move, and they are
 overwritten instead.


 So, is it a bottleneck in the case of SSD since the cost for over
 write is very high? For every write, I think the superblocks should be
 overwritten, it might be much more frequent than other common blocks
 in SSD, even though SSD will do wear leveling inside by its FTL.


 SSDs already do copy-on-write.  They can't change small parts of the data in
 a block, but have to re-write the block.  While that could be done by
 reading the whole erase block to a ram buffer, changing the data, erasing
 the flash block, then re-writing, this is not what happens in practice.  To
 make efficient use of write blocks that are smaller than erase blocks, and
 to provide wear levelling, the flash disk will implement a small change to a
 block by writing a new copy of the modified block to a different part of the
 flash, then updating its block indirection tables.

Yes, the FTL inside the SSD does that kind of job, and the overhead should be
small as long as the block mapping is page-level. However, a full page-level
mapping table is too large to be stored entirely in the SSD's SRAM, so many
complicated algorithms have been developed to optimize this. In other words,
SSDs might not always be smart enough to do wear leveling with small overhead.
This is my subjective opinion.


 BTRFS just makes this process a bit more explicit (except for superblock
 writes).

As you have said, the superblocks have to be overwritten. Is that frequent?
If it is, could it become a bottleneck for SSD throughput? After all, SSDs
are not happy with overwrites. Of course, few people really know what the FTL
algorithms actually are, and they determine how efficient an SSD really is.



 What I current know is that for Intel x25-V SSD, the write throughput
 of BTRFS is almost 80% less than the one of EXT3 in the case of
 PostMark. This really confuses me.


 Different file systems have different strengths and weaknesses.  I haven't
 actually tested BTRFS much, but my understanding is that it will be
 significantly slower than EXT in certain cases, such as small modifications
 to large files (since copy-on-write means a lot of extra disk activity in
 such cases).  But for other things it is faster.  Also remember that BTRFS
 is under development - optimising for raw speed comes at a lower priority
 than correctness and safety of data, and implementation of BTRFS features.
  Once everyone is happy with the stability of the file system and its
 functionality and tools, you can expect the speed to improve somewhat over
 time.

My test case for PostMark is:
set file size 9216 15360    (file sizes from 9216 to 15360 bytes)
set number 5                (number of files: 5)

Write throughput (MB/s) for different file systems on the Intel SSD X25-V:
EXT3: 28.09
NILFS2: 10
BTRFS: 17.35
EXT4: 31.04
XFS: 

Re: BTRFS SSD

2010-09-30 Thread Yuehai Xu
On Thu, Sep 30, 2010 at 3:15 AM, Sander san...@humilis.net wrote:
 Yuehai Xu wrote (ao):
 So, is it a bottleneck in the case of SSD since the cost for over
 write is very high? For every write, I think the superblocks should be
 overwritten, it might be much more frequent than other common blocks
 in SSD, even though SSD will do wear leveling inside by its FTL.

 The FTL will make sure the write cycles are evenly divided among the
 physical blocks, regardless of how often you overwrite a single spot on
 the fs.

 What I current know is that for Intel x25-V SSD, the write throughput
 of BTRFS is almost 80% less than the one of EXT3 in the case of
 PostMark. This really confuses me.

 Can you show the script you use to test this, provide some info
 regarding your setup, and show the numbers you see?

My test case for PostMark is:
set file size 9216 15360    (file sizes from 9216 to 15360 bytes)
set number 5                (number of files: 5)

Write throughput (MB/s) for different file systems on the Intel SSD X25-V:
EXT3: 28.09
NILFS2: 10
BTRFS: 17.35
EXT4: 31.04
XFS: 11.56
REISERFS: 28.09
EXT2: 15.94
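
For reference, that configuration corresponds roughly to a PostMark command
file like the following (a sketch; the target directory and keeping the
transaction count at its default are my assumptions, not something stated
above):

  # contents of a PostMark command file, run with:  postmark pmtest.cfg
  set location /mnt/test
  set size 9216 15360
  set number 5
  run
  quit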

Thanks,
Yuehai

        Sander

 --
 Humilis IT Services and Solutions
 http://www.humilis.net



Re: BTRFS SSD

2010-09-29 Thread Sean Bartell
On Wed, Sep 29, 2010 at 11:30:14AM -0400, Yuehai Xu wrote:
 I know BTRFS is a kind of Log-structured File System, which doesn't do
 overwrite. Here is my question, suppose file A is overwritten by A',
 instead of writing A' to the original place of A, a new place is
 selected to store it. However, we know that the address of a file
 should be recorded in its inode. In such case, the corresponding part
 in inode of A should update from the original place A to the new place
 A', is this a kind of overwrite actually? I think no matter what
 design it is for Log-Structured FS, a mapping table is always needed,
 such as inode map, DAT, etc. When a update operation happens for this
 mapping table, is it actually a kind of over-write? If it is, is it a
 bottleneck for the performance of write for SSD?

In btrfs, this is solved by doing the same thing for the inode--a new
place for the leaf holding the inode is chosen. Then the parent of the
leaf must point to the new position of the leaf, so the parent is moved,
and the parent's parent, etc. This goes all the way up to the
superblocks, which are actually overwritten one at a time.
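
With today's btrfs-progs you can watch exactly that happening (a sketch; the
device and the file touched on its mount point are placeholders):

  btrfs inspect-internal dump-super /dev/sdXn | grep -E '^(bytenr|generation)'
  touch /mnt/somefile; sync
  btrfs inspect-internal dump-super /dev/sdXn | grep -E '^(bytenr|generation)'
  # bytenr stays at the fixed offset 65536 while generation increases with
  # every committed transaction, i.e. the primary superblock really is
  # rewritten in place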

 What do you think the major work that BTRFS can do to improve the
 performance for SSD? I know FTL has becomes smarter and smarter, the
 idea of log-structured file system is always implemented inside the
 SSD by FTL, in that case, it sounds all the issues have been solved no
 matter what the FS it is in upper stack. But at least, from the
 results of benchmarks on the internet show that the performance from
 different FS are quite different, such as NILFS2 and BTRFS.


Re: BTRFS SSD

2010-09-29 Thread Yuehai Xu
Hi,

On Wed, Sep 29, 2010 at 11:37 AM, Dipl.-Ing. Michael Niederle
mniede...@gmx.at wrote:
 Hi Yuehai!

 I tested nilfs2 and btrfs for the use with flash based pen drives.

 nilfs2 performed incredibly well as long as there were enough free blocks. But
 the garbage collector of nilfs used too much IO-bandwidth to be useable (with
 slow-write flash devices).

I also tested write performance on an Intel X25-V SSD with PostMark, and the
results are totally different from those for the Intel X25-M
(http://www.usenix.org/event/lsf08/tech/shin_SSD.pdf). In that test, NILFS2
performs best overall; in my test, ext3 is the best while NILFS2 is the
worst, with almost 10 times lower write throughput than ext3.

So what is the file system's role in handling this tricky kind of storage?
Different file systems can apparently yield very different throughput.

The question is why nilfs2 and btrfs perform so well compared with ext3 in
that paper (setting my own results aside). Here I am only talking about SSDs:
internally, the FTL should always do the same thing as such a file system,
i.e. redirect each write to a new place instead of writing to the original
place, so the throughput of different file systems should be more or less the
same.




 btrfs on the other side performed very well - a lot better than conventional
 file systems like ext2/3 or reiserfs. After switching the mount-options to
 noatime I was able to run a complete Linux system from a (quite slow) pen
 drive without (much) problems. Performance on a fast pen drive is great. I'm
 using btrfs as the root file system on a daily basis since last Christmas
 without running into any problems.


Is the performance of a file system determined by the internal structure of
the SSD, by the structure of the file system, or by the interplay of both?

Thanks very much for replying.

 Greetings, Michael


Thanks,
Yuehai


Re: BTRFS SSD

2010-09-29 Thread Yuehai Xu
On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartell wingedtachik...@gmail.com wrote:
 On Wed, Sep 29, 2010 at 11:30:14AM -0400, Yuehai Xu wrote:
 I know BTRFS is a kind of Log-structured File System, which doesn't do
 overwrite. Here is my question, suppose file A is overwritten by A',
 instead of writing A' to the original place of A, a new place is
 selected to store it. However, we know that the address of a file
 should be recorded in its inode. In such case, the corresponding part
 in inode of A should update from the original place A to the new place
 A', is this a kind of overwrite actually? I think no matter what
 design it is for Log-Structured FS, a mapping table is always needed,
 such as inode map, DAT, etc. When a update operation happens for this
 mapping table, is it actually a kind of over-write? If it is, is it a
 bottleneck for the performance of write for SSD?

 In btrfs, this is solved by doing the same thing for the inode--a new
 place for the leaf holding the inode is chosen. Then the parent of the
 leaf must point to the new position of the leaf, so the parent is moved,
 and the parent's parent, etc. This goes all the way up to the
 superblocks, which are actually overwritten one at a time.

You mean that there is no overwrite for the inode either: once the inode
needs to be updated, it is actually written to a new place, and the only
thing left to do is to change its parent's pointer to this new place. But
what about the last parent, the superblock: does it need to be overwritten?

I am afraid I don't quite understand the meaning of your last sentence.

Thanks for replying,
Yuehai



 What do you think the major work that BTRFS can do to improve the
 performance for SSD? I know FTL has becomes smarter and smarter, the
 idea of log-structured file system is always implemented inside the
 SSD by FTL, in that case, it sounds all the issues have been solved no
 matter what the FS it is in upper stack. But at least, from the
 results of benchmarks on the internet show that the performance from
 different FS are quite different, such as NILFS2 and BTRFS.



Re: BTRFS SSD

2010-09-29 Thread Aryeh Gregor
On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartell wingedtachik...@gmail.com wrote:
 In btrfs, this is solved by doing the same thing for the inode--a new
 place for the leaf holding the inode is chosen. Then the parent of the
 leaf must point to the new position of the leaf, so the parent is moved,
 and the parent's parent, etc. This goes all the way up to the
 superblocks, which are actually overwritten one at a time.

Sorry for the useless question, but just out of curiosity: doesn't
this mean that btrfs has to do quite a lot more writes than ext4 for
small file operations?  E.g., if you append one block to a file, like
a log file, then ext3 should have to do about three writes: data,
metadata, and journal (and the latter is always sequential, so it's
cheap).  But btrfs will need to do more, rewriting parent nodes all
the way up the line for both the data and metadata blocks.  Why
doesn't this hurt performance a lot?


Re: BTRFS SSD

2010-09-29 Thread Sean Bartell
On Wed, Sep 29, 2010 at 02:45:29PM -0400, Yuehai Xu wrote:
 On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartell wingedtachik...@gmail.com 
 wrote:
  On Wed, Sep 29, 2010 at 11:30:14AM -0400, Yuehai Xu wrote:
  I know BTRFS is a kind of Log-structured File System, which doesn't do
  overwrite. Here is my question, suppose file A is overwritten by A',
  instead of writing A' to the original place of A, a new place is
  selected to store it. However, we know that the address of a file
  should be recorded in its inode. In such case, the corresponding part
  in inode of A should update from the original place A to the new place
  A', is this a kind of overwrite actually? I think no matter what
  design it is for Log-Structured FS, a mapping table is always needed,
  such as inode map, DAT, etc. When a update operation happens for this
  mapping table, is it actually a kind of over-write? If it is, is it a
  bottleneck for the performance of write for SSD?
 
  In btrfs, this is solved by doing the same thing for the inode--a new
  place for the leaf holding the inode is chosen. Then the parent of the
  leaf must point to the new position of the leaf, so the parent is moved,
  and the parent's parent, etc. This goes all the way up to the
  superblocks, which are actually overwritten one at a time.
 
 You mean that there is no over-write for inode too, once the inode
 need to be updated, this inode is actually written to a new place
 while the only thing to do is to change the point of its parent to
 this new place. However, for the last parent, or the superblock, does
 it need to be overwritten?

Yes. The idea of copy-on-write, as used by btrfs, is that whenever
*anything* is changed, it is simply written to a new location. This
applies to data, inodes, and all of the B-trees used by the filesystem.
However, it's necessary to have *something* in a fixed place on disk
pointing to everything else. So the superblocks can't move, and they are
overwritten instead.


Re: BTRFS SSD

2010-09-29 Thread Sean Bartell
On Wed, Sep 29, 2010 at 03:39:07PM -0400, Aryeh Gregor wrote:
 On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartell wingedtachik...@gmail.com 
 wrote:
  In btrfs, this is solved by doing the same thing for the inode--a new
  place for the leaf holding the inode is chosen. Then the parent of the
  leaf must point to the new position of the leaf, so the parent is moved,
  and the parent's parent, etc. This goes all the way up to the
  superblocks, which are actually overwritten one at a time.
 
 Sorry for the useless question, but just out of curiosity: doesn't
 this mean that btrfs has to do quite a lot more writes than ext4 for
 small file operations?  E.g., if you append one block to a file, like
 a log file, then ext3 should have to do about three writes: data,
 metadata, and journal (and the latter is always sequential, so it's
 cheap).  But btrfs will need to do more, rewriting parent nodes all
 the way up the line for both the data and metadata blocks.  Why
 doesn't this hurt performance a lot?

For a single change, it does write more. However, there are usually many
changes to children being performed at once, which only require one
change to the parent. Since it's moving everything to new places, btrfs
also has much more control over where writes occur, so all the leaves
and parents can be written sequentially. ext3 is a slave to the current
locations on disk.
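
As a rough illustration (numbers invented for the sake of the example): if one
commit batches a few hundred leaf updates under a tree three levels deep, the
walk back up to the root costs on the order of three extra node writes per
commit rather than per update, so the amortized overhead of a small append is
tiny.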


Re: BTRFS SSD

2010-09-29 Thread Yuehai Xu
On Wed, Sep 29, 2010 at 3:59 PM, Sean Bartell wingedtachik...@gmail.com wrote:
 On Wed, Sep 29, 2010 at 02:45:29PM -0400, Yuehai Xu wrote:
 On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartell wingedtachik...@gmail.com 
 wrote:
  On Wed, Sep 29, 2010 at 11:30:14AM -0400, Yuehai Xu wrote:
  I know BTRFS is a kind of Log-structured File System, which doesn't do
  overwrite. Here is my question, suppose file A is overwritten by A',
  instead of writing A' to the original place of A, a new place is
  selected to store it. However, we know that the address of a file
  should be recorded in its inode. In such case, the corresponding part
  in inode of A should update from the original place A to the new place
  A', is this a kind of overwrite actually? I think no matter what
  design it is for Log-Structured FS, a mapping table is always needed,
  such as inode map, DAT, etc. When a update operation happens for this
  mapping table, is it actually a kind of over-write? If it is, is it a
  bottleneck for the performance of write for SSD?
 
  In btrfs, this is solved by doing the same thing for the inode--a new
  place for the leaf holding the inode is chosen. Then the parent of the
  leaf must point to the new position of the leaf, so the parent is moved,
  and the parent's parent, etc. This goes all the way up to the
  superblocks, which are actually overwritten one at a time.

 You mean that there is no over-write for inode too, once the inode
 need to be updated, this inode is actually written to a new place
 while the only thing to do is to change the point of its parent to
 this new place. However, for the last parent, or the superblock, does
 it need to be overwritten?

 Yes. The idea of copy-on-write, as used by btrfs, is that whenever
 *anything* is changed, it is simply written to a new location. This
 applies to data, inodes, and all of the B-trees used by the filesystem.
 However, it's necessary to have *something* in a fixed place on disk
 pointing to everything else. So the superblocks can't move, and they are
 overwritten instead.


So, is it a bottleneck in the case of an SSD, since the cost of an overwrite
is very high? For every write, I think the superblocks have to be
overwritten, which might happen much more often than for other blocks on the
SSD, even though the SSD does wear leveling internally via its FTL.

What I currently know is that on an Intel X25-V SSD, the write throughput of
BTRFS is almost 80% lower than that of EXT3 in the PostMark benchmark. This
really confuses me.

Thanks,
Yuehai


Btrfs SSD autodetection and mount options

2009-06-11 Thread Chris Mason
Hello everyone,

A quick update on the btrfs SSD modes.  The pull request I just sent to Linus
includes autodetection of SSD devices based on the queue rotational flag.
You can see this for your devices in /sys/block/xxx/queue/rotational

If all the devices in your FS have a 0 in the rotational flag, btrfs
automatically enables ssd mode.  You can turn this off with
mount -o nossd.
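
For example (a sketch; the device name and mount point are placeholders, and
lsblk comes from a reasonably recent util-linux):

  cat /sys/block/sda/queue/rotational   # 0 = non-rotational, 1 = spinning
  lsblk -d -o NAME,ROTA                 # the same flag for all block devices
  mount /dev/sda2 /mnt
  grep /mnt /proc/mounts                # the active ssd / ssd_spread / nossd
                                        # option should show up here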

The default ssd flag tries to find rough groupings of blocks to allocate
on, and will try to pack blocks into the free space available.  So,
if you have something like this (pretend this is a bitmap of free blocks):

free | free | used | free | used | free | free | used

Btrfs SSD mode will collect a large region of mostly free blocks and
allocate from that.  This works well on newer and high-end SSDs that
prefer us to reuse blocks instead of spreading IO across the whole
device.

But, low end devices may have to do a read/modify/write cycle when we
actually do IO in this case.  I've added a new mount option for those
devices:

mount -o ssd_spread.  This is not autodetected, you still need to pass
it on the mount command line or in /etc/fstab.  In ssd_spread mode,
btrfs will try much harder to find a contiguous chunk of free blocks and
hand those out.

-chris
