Re: recommendations and contraindications of using btrfs for Oracle Database Server

2018-01-11 Thread Marat Khalili
I'd advise migrating away from this configuration ASAP. Each of the 
following is reason enough to do so:


1. Running BTRFS on top of kernel 3.10 (there can be vendor backports, 
but most likely they only include the absolutely critical fixes for 
BTRFS, not everything). And Red Hat (that's your vendor, 
right?) recently abandoned BTRFS.


2. Running a database on top of BTRFS with snapshots, unless the write load 
is very light and snapshots are very rare and few.


Besides, taking storage snapshots is hardly a good way to back up 
databases. Oracle certainly provides better methods, like online replication.


On 11/01/18 13:51, Ext-Strii-Houttemane Philippe wrote:

ORA-63999: data file suffered media failure
ORA-01114: IO error writing block to file 99 (block # 4)
ORA-01110: data file 99: '/oradata/PS92PRD/data/pcapp.dbf'
ORA-27072: File I/O error
Linux-x86_64 Error: 5: Input/output error
Additional information: 4
Additional information: 4

There might be messages in syslog/dmesg about this.


It never happens with other filesystem types; all the hardware has been checked.
I suspect a Btrfs feature activated via our mount options rather than a bug: 
Oracle sees a ghost or a duplicated block even though the copy-on-write feature is disabled.
I never saw this even while using BTRFS on kernel 3.11. But note that 
snapshots effectively override nodatacow, so you still get CoW for 
blocks shared with snapshots (that's the price of convenience).
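
For reference, a quick sanity check I would do here (the path is just the
one from the error message above): lsattr shows whether a data file actually
carries the No_COW attribute.

lsattr /oradata/PS92PRD/data/pcapp.dbf
# a 'C' flag means the file is marked No_COW; note that chattr +C only takes
# effect on empty files (or on a directory, to be inherited by new files)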



Mount options: defaults,nofail,nodatacow,nobarrier,noatime
"nobarrier" looks a bit scary, unless you're using some fancy 
battery-backed controller.



btrfs fi show:
Label: 'oradataBtrfs'
 Total devices 1 FS bytes used 1.98TiB
 devid1 size 3.18TiB used 2.02TiB path /dev/sdb1

At least you are not using BTRFS RAID in this configuration, good.

--

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Recommendations for balancing as part of regular maintenance?

2018-01-09 Thread Marat Khalili

On 08/01/18 19:34, Austin S. Hemmelgarn wrote:
A: While not strictly necessary, running regular filtered balances 
(for example `btrfs balance start -dusage=50 -dlimit=2 -musage=50 
-mlimit=4`, see `man btrfs-balance` for more info on what the options 
mean) can help keep a volume healthy by mitigating the things that 
typically cause ENOSPC errors. 


The choice of words is not very fortunate IMO. In my view, a volume 
ceasing to be "healthy" during normal operation presumes some bugs (or 
at least shortcomings) in the filesystem code. In this case I'd prefer to 
have a detailed understanding of the situation before copy-pasting 
commands from wiki pages. Remember, most users don't run cutting-edge 
kernels and tools, preferring LTS distribution releases instead, so one 
size might not fit all.
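
For illustration, one way to get that picture before running anything (mount
point assumed) is to compare allocation and usage per chunk type:

btrfs fi df /mnt/data
btrfs fi show /mnt/data
# devices almost fully allocated in 'fi show' while 'fi df' still shows lots
# of free space inside the chunks is the classic precondition for ENOSPC
# that a filtered balance is meant to address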


On 08/01/18 23:29, Martin Raiber wrote:

There have been reports of (rare) corruption caused by balance (won't be
detected by a scrub) here on the mailing list. So I would stay away
from btrfs balance unless it is absolutely needed (ENOSPC), and while it
is run I would try not to do anything else with regard to writes simultaneously.


This is my opinion too as a normal user, based on reading this list 
and my own attempts to recover from ENOSPC. I'd rather re-create the 
filesystem from scratch, or at least make a full verified backup, before 
attempting to fix problems with balance.


--

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: A Big Thank You, and some Notes on Current Recovery Tools.

2018-01-02 Thread Marat Khalili
> I think the 1-3TB Seagate drives are garbage.

There are known problems with the ST3000DM001, but first of all you should not put 
PC-oriented disks in RAID; they are not designed for it on multiple levels 
(vibration tolerance, error reporting...). There are similar horror stories 
about people filling whole cases with WD Greens and watching their (non-BTRFS) 
RAID 6 fail.

(Sorry for OT.)
-- 

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Improve subvolume usability for a normal user

2017-12-06 Thread Marat Khalili

On 07/12/17 08:27, Qu Wenruo wrote:

When doing a snapshot, btrfs only needs to increase the references of the
2nd-highest-level tree blocks of the original snapshot, rather than "walking
the tree".
(If the tree root level is 2, then the level-2 node is copied, while the
references of all level-1 tree blocks get increased.)


Out of curiosity, how does it interact with nocow files? Does every 
write to these files involve a backref walk?


--

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: command to get quota limit

2017-12-04 Thread Marat Khalili
AFAIK you don't need the subvolume id; `btrfs qgroup show -ref 
/path/to/subvolume` shows the necessary qgroup for me. Extracting the value 
you need is more involved:


btrfs qgroup show -ref --raw /path/to/subvolume | tail -n +3 | tr -s ' ' | cut -d ' ' -f 4

Not sure how robust this is, though.
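
A slightly less fragile variant of the same extraction (same assumption about
the output layout) would be:

btrfs qgroup show -ref --raw /path/to/subvolume | awk 'NR > 2 {print $4}'

but it still relies on the column order staying the same across btrfs-progs
versions.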

--

With Best Regards,
Marat Khalili

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: notification about corrupt files from "btrfs scrub" in cron

2017-11-23 Thread Marat Khalili

On 23/11/17 15:59, Mike Fleetwood wrote:


Cron starts configured jobs at the scheduled time asynchronously.
I.e. It doesn't block waiting for each command to finish.  Cron notices
when the job finishes and any output produced, written to stdout and/or
stderr, by the job is emailed to the user.  So no, a 2 hour job is not a
problem for cron.


Minor additional advice -- prepend your command with:


flock --nonblock /var/run/scrub.lock

to avoid running several scrubs simultaneously in case one takes more 
than 24 hours to finish.
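
For example (illustrative schedule and mount point, user-crontab syntax; -B
keeps the scrub in the foreground so the lock is held for its whole duration
and cron captures the output):

0 3 * * * flock --nonblock /var/run/scrub.lock btrfs scrub start -Bd /mnt/data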



--

With Best Regards,
Marat Khalili

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Tiered storage?

2017-11-15 Thread Marat Khalili


On 15/11/17 10:11, waxhead wrote:

hint: you need more than two for raid1 if you want to stay safe

Huh? Two is not enough? Does having three or more make a difference? (Or 
do you mean a hot spare?)


--

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)

2017-11-13 Thread Marat Khalili

On 14/11/17 06:39, Dave wrote:

My rsync command currently looks like this:

rsync -axAHv --inplace --delete-delay --exclude-from="/some/file"
"$source_snapshop/" "$backup_location"
As I learned from Kai Krakow on this mailing list, you should also add 
--no-whole-file if both sides are local. Otherwise target space usage 
can be much worse (but fragmentation much better).


I wonder what your justification for --delete-delay is; I just use --delete.

Here's what I use: --verbose --archive --hard-links --acls --xattrs 
--numeric-ids --inplace --delete --delete-excluded --stats. Since in my 
case the source is always remote, there's no --no-whole-file, but there 
is --numeric-ids.



In particular, I want to know if I should or should not be using these options:

 -H, --hard-links        preserve hard links
 -A, --acls              preserve ACLs (implies -p)
 -X, --xattrs            preserve extended attributes
 -x, --one-file-system   don't cross filesystem boundaries
I don't know of any semantic use of hard links in modern systems. There 
are ACLs on some files in /var/log/journal on systems with systemd. Synology 
actively uses ACLs, but its implementation is sadly incompatible with 
rsync. There can always be some ACLs or xattrs set by a sysadmin manually. 
As a result, I always specify the first three options where possible, just in 
case (even though the man page says that --hard-links may affect performance).



I had to use the "x" option to prevent rsync from deleting files in
snapshots in the backup location (as the source location does not
retain any snapshots). Is there a better way?
Don't keep snapshots under the rsync target; place them under ../snapshots 
(if snapper supports this):



# find . -maxdepth 2
.
./snapshots
./snapshots/2017-11-08T13:18:20+00:00
./snapshots/2017-11-08T15:10:03+00:00
./snapshots/2017-11-08T23:28:44+00:00
./snapshots/2017-11-09T23:41:30+00:00
./snapshots/2017-11-10T22:44:36+00:00
./snapshots/2017-11-11T21:48:19+00:00
./snapshots/2017-11-12T21:27:41+00:00
./snapshots/2017-11-13T23:29:49+00:00
./rsync
Or, specify them in --exclude and avoid using --delete-excluded. Or keep 
using -x if it works, why not?
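
For illustration, a minimal sketch of that layout in a backup script ($SOURCE,
the paths and the exact rsync options are assumptions; ./rsync must itself be
a subvolume for the snapshot to work):

cd /backup
rsync -aHAX --numeric-ids --inplace --no-whole-file --delete --delete-excluded \
    --exclude-from=/some/file "$SOURCE/" ./rsync/
btrfs subvolume snapshot -r ./rsync "./snapshots/$(date -u '+%FT%T+00:00')"

This way the snapshots live next to, not inside, the rsync target, so rsync
never sees them.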


--

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Should cp default to reflink?

2017-11-06 Thread Marat Khalili
Obviously (for me) yes, but who will decide? There should be a --no-reflink option 
for people trying to defragment something.

>Seems to me any request to duplicate should be optimized by default
>with an auto reflink when possible, and require an explicit option to
>inhibit.

The key word is "any". I use rsync much more often than cp within the same volume. 
-- 

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Problem with file system

2017-11-04 Thread Marat Khalili
>How is this an issue?  Discard is issued only once we're positive there's no
>reference to the freed blocks anywhere.  At that point, they're also open
>for reuse, thus they can be arbitrarily scribbled upon.

The point was: how about keeping this reference around for some time period?

>Unless your hardware is seriously broken (such as lying about barriers,
>which is nearly-guaranteed data loss on btrfs anyway), there's no way the
>filesystem will ever reference such blocks.

Buggy hardware happens. So do buggy filesystems ;) Besides, most filesystems let 
the user recover most data after losing just one sector; it would be a pity if BTRFS 
with all its COW coolness didn't. 

>Why would you special-case metadata?  Metadata that points to overwritten or
>discarded blocks is of no use either.

It takes significant time to overwrite a noticeable portion of the data on disk, but 
loss of metadata makes it all gone in a moment. Moreover, the user is usually prepared 
to lose some recently changed data in a crash, but not data that was not even 
touched.
-- 

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: defragmenting best practice?

2017-11-02 Thread Marat Khalili

On 02/11/17 04:39, Dave wrote:

I'm going to make this change now. What would be a good way to
implement this so that the change applies to the $HOME/.cache of each
user?
I'd make each user's .cache a symlink (it should work, but if it doesn't, 
use a bind mount) to a per-user directory on some separately mounted volume 
with the necessary options.
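
A minimal sketch for one user (paths are illustrative; /srv/cache is assumed
to be the separately mounted volume with the desired options, e.g. nodatacow
or compression):

user=alice
mkdir -p "/srv/cache/$user"
chown "$user:$user" "/srv/cache/$user"
rm -rf "/home/$user/.cache"
ln -s "/srv/cache/$user" "/home/$user/.cache"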


--

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)

2017-11-01 Thread Marat Khalili

On 01/11/17 09:51, Dave wrote:

As already said by Romain Mamedov, rsync is a viable alternative to
send-receive with much less hassle. According to some reports it can even be
faster.

Thanks for confirming. I must have missed those reports. I had never
considered this idea until now -- but I like it.

Are there any blogs or wikis where people have done something similar
to what we are discussing here?

I don't know of any. Probably someone needs to write one.


We will delete most snapshots on the live volume, but retain many (or
all) snapshots on the backup block device. Is that a good strategy,
given my goals?

Depending on the way you use it, retaining even a dozen snapshots on a live
volume might hurt performance (for high-performance databases) or be
completely transparent (for user folders). You may want to experiment with
this number.

We do experience severe performance problems now, especially with
Firefox. Part of my experiment is to reduce the number of snapshots on
the live volumes, hence this question.
Just for statistics, how many snapshots do you have and how often do you 
take them? It's on SSD, right?



Thanks. I hope you do find time to publish it. (And what do you mean
by portable?) For now, Snapper has a cleanup algorithm that we can
use. At least one of the tools listed here has a thinout algorithm
too: https://btrfs.wiki.kernel.org/index.php/Incremental_Backup
It is currently a small part of yet another home-grown backup tool, which 
is itself fairly big and tuned to a particular environment. I have thought many 
times that it would be very nice to have the thinning tool separately and 
with no unnecessary dependencies, but...


BTW, beware of deleting too many snapshots at once with any tool. Delete 
a few and let the filesystem stabilize before proceeding.
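
Something along these lines (paths and batch granularity are illustrative, and
`btrfs subvolume sync` needs a reasonably recent btrfs-progs):

for snap in /backup/snapshots/2017-01-*; do
    btrfs subvolume delete "$snap"
    btrfs subvolume sync /backup    # wait until deleted subvolumes are cleaned
done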



Should I consider a dedup tool like one of these?

Certainly NOT for snapshot-based backups: it is already deduplicated almost
as much as possible, dedup tools can only make it *less* deduplicated.

The question is whether to use a dedup tool on the live volume which
has a few snapshots. Even with the new strategy (based on rsync), the
live volume may sometimes have two snapshots (pre- and post- pacman
upgrades).
For a deduplication tool to be useful you ought to have some duplicate 
data on your live volume. Do you have any (e.g. many LXC containers with 
the same distribution)?



Also still wondering about these options: no-holes, skinny metadata,
or extended inode refs?

I don't know anything about any of these, sorry.

P.S. I still think you need some off-system backup solution too, either 
rsync+snapshot-based over ssh or e.g. Burp (shameless advertising: 
http://burp.grke.org/ ).


--

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)

2017-11-01 Thread Marat Khalili
I'm an active user of backups based on btrfs snapshots. Generally it works, 
with some caveats.


You seem to have two tasks: (1) same-volume snapshots (I would not call 
them backups) and (2) updating some backup volume (preferably on a 
different box). By solving them separately you can avoid some complexity, 
like the accidental removal of a snapshot that's still needed for updating 
the backup volume.



To reconcile those conflicting goals, the only idea I have come up
with so far is to use btrfs send-receive to perform incremental
backups as described here:
https://btrfs.wiki.kernel.org/index.php/Incremental_Backup .
As already said by Romain Mamedov, rsync is a viable alternative to 
send-receive with much less hassle. According to some reports it can 
even be faster.



Given the hourly snapshots, incremental backups are the only practical
option. They take mere moments. Full backups could take an hour or
more, which won't work with hourly backups.
I don't see much sense in re-doing full backups to the same physical 
device. If you care about backup integrity, it is probably more 
important to invest in backup verification. (OTOH, while you didn't 
reveal the data size, if a full backup takes just an hour on your system then 
why not?)



We will delete most snapshots on the live volume, but retain many (or
all) snapshots on the backup block device. Is that a good strategy,
given my goals?
Depending on the way you use it, retaining even a dozen snapshots on a 
live volume might hurt performance (for high-performance databases) or 
be completely transparent (for user folders). You may want to experiment 
with this number.


In any case I'd not recommend retaining ALL snapshots on the backup device, 
even if you have infinite space. Such a filesystem would be as dangerous 
as the demon core: only good for adding more snapshots (not even 
deleting them), and any little mistake will blow everything up. Keep a 
few dozen, a hundred at most.


Unlike with other backup systems, you can fairly easily remove snapshots in 
the middle of the sequence; use this opportunity. My thin-out rule is: remove 
a snapshot if the resulting gap will be less than some fraction (e.g. 1/4) of 
its age. One day I'll publish a portable solution on GitHub.
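
In the meantime, a rough sketch of that rule in shell (assuming snapshots
named by ISO-8601 timestamp under /backup/snapshots, an illustrative path,
and the 1/4 fraction hard-coded):

#!/bin/bash
set -eu
dir=/backup/snapshots
now=$(date +%s)
mapfile -t snaps < <(ls -1 "$dir" | sort)
kept=0                                  # index of the last snapshot we kept
for ((i = 1; i < ${#snaps[@]} - 1; i++)); do
    prev=$(date -d "${snaps[kept]}" +%s)
    next=$(date -d "${snaps[i+1]}" +%s)
    this=$(date -d "${snaps[i]}" +%s)
    if (( 4 * (next - prev) < now - this )); then
        btrfs subvolume delete "$dir/${snaps[i]}"   # resulting gap < 1/4 of age
    else
        kept=$i
    fi
done

The oldest and the newest snapshots are never touched by the loop.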



Given this minimal retention of snapshots on the live volume, should I
defrag it (assuming there is at least 50% free space available on the
device)? (BTW, is defrag OK on an NVMe drive? or an SSD?)

In the above procedure, would I perform that defrag before or after
taking the snapshot? Or should I use autodefrag?
I ended up using autodefrag and didn't try manual defragmentation. I don't 
use SSDs as backup volumes.



Should I consider a dedup tool like one of these?
Certainly NOT for snapshot-based backups: it is already deduplicated 
almost as much as possible, dedup tools can only make it *less* 
deduplicated.



* Footnote: On the backup device, maybe we will never delete
snapshots. In any event, that's not a concern now. We'll retain many,
many snapshots on the backup device.
Again, DO NOT do this; btrfs in its current state does not support it. 
A good rule of thumb for the time of some operations is the data size multiplied 
by the number of snapshots (raised to some power >= 1) and divided by the I/O or CPU 
speed. By creating snapshots it is very easy to create petabytes of data 
for the kernel to process, which it won't be able to do in many years.


--

With Best Regards,
Marat Khalili

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Problem with file system

2017-10-30 Thread Marat Khalili

On 31/10/17 00:37, Chris Murphy wrote:

But off hand it sounds like hardware was sabotaging the expected write
ordering. How to test a given hardware setup for that, I think, is
really overdue. It affects literally every file system, and Linux
storage technology.

It kinda sounds like to me something other than supers is being
overwritten too soon, and that's why it's possible for none of the
backup roots to find a valid root tree, because all four possible root
trees either haven't actually been written yet (still) or they've been
overwritten, even though the super is updated. But again, it's
speculation, we don't actually know why your system was no longer
mountable.
Just a detached view: I know hardware should respect ordering/barriers 
and such, but how hard is it really to avoid overwriting at least one 
complete metadata tree for half an hour (even better, yet another one 
for a day)? Just metadata, not data extents.


--

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs-subv-backup v0.1b

2017-10-26 Thread Marat Khalili

Hello Austin,

Looks very useful. Two questions:

1. Can you release it under some standard license recognized by GitHub, 
in case someone wants to include it in other projects? AGPL-3.0 would be 
nice.


2. I don't understand the mentioned restore performance issues. They shouldn't 
apply if the data is restored _after_ the subvolume structure is re-created, but 
even if (1) the data is already there, and (2) a copy-less move doesn't work 
between subvolumes (really a limitation of some older systems, not 
Python), there's a known workaround of creating a reflink and then 
removing the original.
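
Spelled out, the workaround is simply (paths illustrative):

cp --reflink=always /subvol-a/some-file /subvol-b/some-file && rm /subvol-a/some-file

which gives a move-like result without copying the data.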


--

With Best Regards,
Marat Khalili

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: subvolume: outputs message only when operation succeeds

2017-09-25 Thread Marat Khalili

On 25/09/17 17:33, Qu Wenruo wrote:
(Any in this case, anyone in the maillist can help review messages) 
If this is a question, I can help with assigning levels to messages. 
Although I think many levels are only required for complex daemons or 
network tools, while the btrfs utilities mostly perform atomic operations which 
either succeed or fail. But it's of course hard to be sure without 
seeing all the actual messages; probably there's some use for 4 levels.


--

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: subvolume: outputs message only when operation succeeds

2017-09-25 Thread Marat Khalili

On 25/09/17 11:08, Qu Wenruo wrote:
What about redirecting stdout to /dev/null and redirecting stderr to 
mail if return value is not 0?

s/if return value is not 0/if return value is 0/.

The main point is, if btrfs returns 0, then nothing to worry about.
(Unless there is a bug, even in that case keep an eye on stderr should 
be enough to catch that)


Redirection to /dev/null will work. However,

1) It will always look suspicious. grep -v with the expected message is at 
least clear about its intent and consequences.


2) Although shorter than grep -v, it will still take space in shell 
scripts and force one to remember which btrfs commands it has to be added 
after. This is already inconvenient enough to want a fix.


--

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: subvolume: outputs message only when operation succeeds

2017-09-25 Thread Marat Khalili

On 25/09/17 10:52, Hugo Mills wrote:


Isn't the correct way to catch the return value instead of grepping
the output?
It is, but if, for example, you're using the command in a cron
script which is expected to work, you don't want it producing output
because then you get a mail every time the script runs. So you have to
grep -v on the "success" output to make the successful script silent.


If it's some command not returning value properly, would you please
report it as a bug so we can fix it.

It's not the return value that's problematic (although those used
to be a real mess). It's the fact that a successful run of the command
produces noise on stdout, which most commands don't.
Yes, exactly: cron, mail -E and just long scripts where btrfs operations 
are small steps here and there.


(On top of this, actually catching the return value of the right 
command before `| grep -v` with errexit and pipefail on is so difficult 
that I usually end up rewriting the whole mess in Python. Which would be 
a nice result in itself if it didn't take a whole day in place of one 
minute for a bash line.)
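
For what it's worth, the closest I have to a working bash pattern is something
like this (the success-message pattern is illustrative and depends on the
btrfs-progs version):

set -o errexit -o pipefail
btrfs subvolume delete /mnt/data/snapshots/old | { grep -v '^Delete subvolume' || true; }
# with pipefail the pipeline fails exactly when btrfs fails; the `|| true`
# keeps grep's exit code 1 (everything filtered out) from tripping errexit
# on success

It is exactly the kind of noise that not printing anything on success would
remove.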


--

With Best Regards,
Marat Khalili

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: subvolume: outputs message only when operation succeeds

2017-09-25 Thread Marat Khalili

On 25/09/17 10:30, Nikolay Borisov wrote:

On 19.09.2017 10:41, Misono, Tomohiro wrote:

"btrfs subvolume create/delete" outputs the message of "Create/Delete
subvolume ..." even when an operation fails.
Since it is confusing, let's outputs the message only when an operation 
succeeds.

Please change the verb to past tense, more strongly signaling success -
i.e. "Created subvolume"
What about recalling some UNIX standards and returning to NOT outputting 
any message when an operation succeeds? My scripts are full of grep -v 
calls after each btrfs command, and this sucks (and I don't think I'm 
alone in this situation). If you change the message, a lot of scripts 
will have to be changed; at least make it worth it.


 --

With Best Regards,
Marat Khalili

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Does btrfs use crc32 for error correction?

2017-09-19 Thread Marat Khalili
It would be cool, but probably not wise IMHO, since on modern hardware you almost 
never get one-bit errors (usually it's a whole sector of garbage), and 
therefore you'd more often see an incorrect recovery than an actually fixed bit.
-- 

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: A user cannot remove his readonly snapshots?!

2017-09-16 Thread Marat Khalili

On 16/09/17 13:19, Ulli Horlacher wrote:

How do I know the btrfs filesystem for a given subvolume?
Do I really have to manually test the directory path upwards?


It was discussed recently: the answer is, unfortunately, yes, until 
someone patches df to do it for us. You can do it more or less 
efficiently by analyzing /proc/mounts .


--

With Best Regards,
Marat Khalili

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-15 Thread Marat Khalili

May I state my user's point of view:

I know of one application that uses O_DIRECT, and it is subtly broken on 
BTRFS. I know of no applications that use O_DIRECT and are not broken. 
(Really, more statistics would help here; probably some exist that 
provably work.) According to developers, making O_DIRECT work on BTRFS is 
difficult if not impossible. Isn't it time to disable O_DIRECT, like ZFS 
does AFAIU? Data safety is certainly more important than the performance 
gain it may or may not give some applications.


--

With Best Regards,
Marat Khalili

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qemu-kvm VM died during partial raid1 problems of btrfs

2017-09-13 Thread Marat Khalili

On 13/09/17 16:23, Chris Murphy wrote:

Right, known problem. To use o_direct implies also using nodatacow (or
at least nodatasum), e.g. xattr +C is set, done by qemu-img -o
nocow=on
https://www.spinics.net/lists/linux-btrfs/msg68244.html
Can you please elaborate? I don't have exactly the same problem as 
described at the link, but I'm still worried that qemu in particular can 
be less resilient to partial raid1 failures even on newer kernels, due 
to missing checksums for instance. (BTW I didn't find any xattrs on my 
VM images, nor do I plan to set any.)


--

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qemu-kvm VM died during partial raid1 problems of btrfs

2017-09-12 Thread Marat Khalili

On 12/09/17 14:12, Adam Borowski wrote:

Why would you need support in the hypervisor if cp --reflink=always is
enough?

+1 :)

But I've already found one problem: I use rsync snapshots for backups, 
and although rsync does have a --sparse argument, apparently it conflicts 
with --inplace. You cannot have all the nice things :(


I think I'll simply try to minimize the size of the VM root partitions and won't 
think too much about a gig or two of extra zeroes in the backup, at least until 
some autopunchholes mount option arrives.
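
(If your util-linux is new enough, one possible alternative is to punch the
holes back into the backup copy after the fact, e.g.
`fallocate --dig-holes backup/vm-root.img` with an illustrative image path,
so the zeroes don't stay allocated.)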


--

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qemu-kvm VM died during partial raid1 problems of btrfs

2017-09-12 Thread Marat Khalili

On 12/09/17 13:01, Duncan wrote:

AFAIK that's wrong -- the only time the app should see the error on btrfs
raid1 is if the second copy is also bad

So thought I, but...


IIRC from what I've read on-list, qcow2 isn't the best alternative for hosting 
VMs on
top of btrfs.
Yeah, I've seen discussions about it here too, but in my case VMs write 
very little (mostly logs and distro updates), so I decided it can live 
as it is for a while. But I'm looking for better solutions as long as 
they are not too complicated.



On 12/09/17 13:32, Adam Borowski wrote:

Just use raw -- btrfs already has every feature that qcow2 has, and does it
better.  This doesn't mean btrfs is the best choice for hosting VM files,
just that raw-over-btrfs is strictly better than qcow2-over-btrfs.
Thanks for the advice; I wasn't sure I wouldn't lose features, and was too lazy 
to investigate/ask. Now it looks simple.
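
For the record, a sketch of such a conversion (file names and directory are
illustrative; setting +C on the target directory first makes new files in it
inherit nodatacow):

chattr +C /var/lib/libvirt/images/
qemu-img convert -p -O raw vm.qcow2 /var/lib/libvirt/images/vm.raw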


--

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qemu-kvm VM died during partial raid1 problems of btrfs

2017-09-12 Thread Marat Khalili

On 12/09/17 12:21, Timofey Titovets wrote:

Can't reproduce that on latest kernel: 4.13.1
Great! Thank you very much for the test. Do you know if it's fixed in 
4.10? (Or which particular version fixed it?)


--

With Best Regards,
Marat Khalili

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qemu-kvm VM died during partial raid1 problems of btrfs

2017-09-12 Thread Marat Khalili

On 12/09/17 11:25, Timofey Titovets wrote:

AFAIK, if BTRFS gets a read error in RAID1 while reading, the application will
also see that error, and if the application can't handle it -> you've got a
problem.

So Btrfs RAID1 ONLY protects data, not the application (qemu in your case).
That's news to me! Why doesn't it try the other copy, and when does it 
correct the error then? Any idea how to work around it, at least for 
qemu? (Assemble the array from within the VM?)


--

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


qemu-kvm VM died during partial raid1 problems of btrfs

2017-09-12 Thread Marat Khalili
Thanks to the help from the list I've successfully replaced part of a 
btrfs raid1 filesystem. However, while I waited for the best opinions on the 
course of action, the root filesystem of one of the qemu-kvm VMs went 
read-only, and this root was of course backed by a qcow2 file on the 
problematic btrfs (the root filesystem of the VM itself is ext4, not 
btrfs). It is very well possible that it is a coincidence or something 
induced by heavier-than-usual IO load, but it is hard for me to ignore 
the possibility that somehow the hardware error was propagated to the VM. Is 
that possible?


No other processes on the machine developed any problems, but:
(1) it is very well possible that the problematic sector belonged to this 
qcow2 file;
(2) it is a Kernel VM after all, and it might bypass the normal IO paths of 
userspace processes;
(3) it is possible that it uses O_DIRECT or something, and btrfs raid1 
does not fully protect this kind of access.

Does this make any sense?

I could not log in to the VM normally to see the logs, and made the big mistake 
of rebooting it. Now all I see in its logs is a big hole, since, well, it 
went read-only :( I'll try to find out whether (1) above is true after I 
finish migrating data from the HDD and remove it. I wonder where else 
I can look?


--

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Please help with exact actions for raid1 hot-swap

2017-09-11 Thread Marat Khalili
Patrik, Duncan, thank you for the help. The `btrfs replace start 
/dev/sdb7 /dev/sdd7 /mnt/data` worked without a hitch (though I didn't 
try to reboot yet, still have grub/efi/several mdadm partitions to copy).


It also worked much faster than mdadm would have, apparently only moving the 
126GB used, not the 2.71TB total. Interestingly, according to the HDD lights it 
mostly read from the remaining /dev/sda, not from the replaced /dev/sdb 
(which must be completely readable now according to smartctl -- the 
problematic sector finally got remapped after ~1 day).


It now looks as follows:


$ sudo blkid /dev/sda7 /dev/sdb7 /dev/sdd7
/dev/sda7: LABEL="data" UUID="37d3313a-e2ad-4b7f-98fc-a01d815952e0" 
UUID_SUB="db644855-2334-4d61-a27b-9a591255aa39" TYPE="btrfs" 
PARTUUID="c5ceab7e-e5f8-47c8-b922-c5fa0678831f"

/dev/sdb7: PARTUUID="493923cd-9ecb-4ee8-988b-5d0bfa8991b3"
/dev/sdd7: LABEL="data" UUID="37d3313a-e2ad-4b7f-98fc-a01d815952e0" 
UUID_SUB="9c2f05e9-5996-479f-89ad-f94f7ce130e6" TYPE="btrfs" 
PARTUUID="178cd274-7251-4d25-9116-ce0732d2410b"

$ sudo btrfs fi show /dev/sdb7
ERROR: no btrfs on /dev/sdb7
$ sudo btrfs fi show /dev/sdd7
Label: 'data'  uuid: 37d3313a-e2ad-4b7f-98fc-a01d815952e0
Total devices 2 FS bytes used 108.05GiB
devid1 size 2.71TiB used 131.03GiB path /dev/sda7
devid2 size 2.71TiB used 131.03GiB path /dev/sdd7

Does this mean:
* I should not be afraid to reboot and find /dev/sdb7 mounted again?
* I will not be able to easily mount /dev/sdb7 on a different computer 
to do some tests?


Also, although /dev/sdd7 is much larger than /dev/sdb7 was, `btrfs fi 
show` still displays it as 2.71TiB, why?


--

With Best Regards,
Marat Khalili

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Please help with exact actions for raid1 hot-swap

2017-09-10 Thread Marat Khalili
It doesn't need the replaced disk to be readable, right? Then what prevents the same 
procedure from working without a spare bay?
-- 

With Best Regards,
Marat Khalili

On September 9, 2017 1:29:08 PM GMT+03:00, Patrik Lundquist 
<patrik.lundqu...@gmail.com> wrote:
>On 9 September 2017 at 12:05, Marat Khalili <m...@rqc.ru> wrote:
>> Forgot to add, I've got a spare empty bay if it can be useful here.
>
>That makes it much easier since you don't have to mount it degraded,
>with the risks involved.
>
>Add and partition the disk.
>
># btrfs replace start /dev/sdb7 /dev/sdc(?)7 /mnt/data
>
>Remove the old disk when it is done.
>
>> --
>>
>> With Best Regards,
>> Marat Khalili
>>
>> On September 9, 2017 10:46:10 AM GMT+03:00, Marat Khalili
><m...@rqc.ru> wrote:
>>>Dear list,
>>>
>>>I'm going to replace one hard drive (partition actually) of a btrfs
>>>raid1. Can you please spell exactly what I need to do in order to get
>>>my
>>>filesystem working as RAID1 again after replacement, exactly as it
>was
>>>before? I saw some bad examples of drive replacement in this list so
>I
>>>afraid to just follow random instructions on wiki, and putting this
>>>system out of action even temporarily would be very inconvenient.
>>>
>>>For this filesystem:
>>>
>>>> $ sudo btrfs fi show /dev/sdb7
>>>> Label: 'data'  uuid: 37d3313a-e2ad-4b7f-98fc-a01d815952e0
>>>> Total devices 2 FS bytes used 106.23GiB
>>>> devid1 size 2.71TiB used 126.01GiB path /dev/sda7
>>>> devid2 size 2.71TiB used 126.01GiB path /dev/sdb7
>>>> $ grep /mnt/data /proc/mounts
>>>> /dev/sda7 /mnt/data btrfs
>>>> rw,noatime,space_cache,autodefrag,subvolid=5,subvol=/ 0 0
>>>> $ sudo btrfs fi df /mnt/data
>>>> Data, RAID1: total=123.00GiB, used=104.57GiB
>>>> System, RAID1: total=8.00MiB, used=48.00KiB
>>>> Metadata, RAID1: total=3.00GiB, used=1.67GiB
>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>> $ uname -a
>>>> Linux host 4.4.0-93-generic #116-Ubuntu SMP Fri Aug 11 21:17:51 UTC
>>>> 2017 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>>I've got this in dmesg:
>>>
>>>> [Sep 8 20:31] ata6.00: exception Emask 0x0 SAct 0x7ecaa5ef SErr 0x0
>>>> action 0x0
>>>> [  +0.51] ata6.00: irq_stat 0x4008
>>>> [  +0.29] ata6.00: failed command: READ FPDMA QUEUED
>>>> [  +0.38] ata6.00: cmd 60/70:18:50:6c:f3/00:00:79:00:00/40 tag
>3
>>>> ncq 57344 in
>>>>res 41/40:00:68:6c:f3/00:00:79:00:00/40
>Emask
>>>> 0x409 (media error) 
>>>> [  +0.94] ata6.00: status: { DRDY ERR }
>>>> [  +0.26] ata6.00: error: { UNC }
>>>> [  +0.001195] ata6.00: configured for UDMA/133
>>>> [  +0.30] sd 6:0:0:0: [sdb] tag#3 FAILED Result:
>hostbyte=DID_OK
>>>> driverbyte=DRIVER_SENSE
>>>> [  +0.05] sd 6:0:0:0: [sdb] tag#3 Sense Key : Medium Error
>>>> [current] [descriptor]
>>>> [  +0.04] sd 6:0:0:0: [sdb] tag#3 Add. Sense: Unrecovered read
>>>> error - auto reallocate failed
>>>> [  +0.05] sd 6:0:0:0: [sdb] tag#3 CDB: Read(16) 88 00 00 00 00
>00
>>>
>>>> 79 f3 6c 50 00 00 00 70 00 00
>>>> [  +0.03] blk_update_request: I/O error, dev sdb, sector
>>>2045996136
>>>> [  +0.47] BTRFS error (device sda7): bdev /dev/sdb7 errs: wr 0,
>>>rd
>>>> 1, flush 0, corrupt 0, gen 0
>>>> [  +0.62] BTRFS error (device sda7): bdev /dev/sdb7 errs: wr 0,
>>>rd
>>>> 2, flush 0, corrupt 0, gen 0
>>>> [  +0.77] ata6: EH complete
>>>
>>>There's still 1 in Current_Pending_Sector line of smartctl output as
>of
>>>
>>>now, so it probably won't heal by itself.
>>>
>>>--
>>>
>>>With Best Regards,
>>>Marat Khalili
>>>--
>>>To unsubscribe from this list: send the line "unsubscribe
>linux-btrfs"
>>>in
>>>the body of a message to majord...@vger.kernel.org
>>>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe
>linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>--
>To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
>in
>the body of a message to majord...@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Please help with exact actions for raid1 hot-swap

2017-09-09 Thread Marat Khalili
Forgot to add, I've got a spare empty bay if it can be useful here.
--

With Best Regards,
Marat Khalili

On September 9, 2017 10:46:10 AM GMT+03:00, Marat Khalili <m...@rqc.ru> wrote:
>Dear list,
>
>I'm going to replace one hard drive (partition actually) of a btrfs 
>raid1. Can you please spell exactly what I need to do in order to get
>my 
>filesystem working as RAID1 again after replacement, exactly as it was 
>before? I saw some bad examples of drive replacement in this list so I 
>afraid to just follow random instructions on wiki, and putting this 
>system out of action even temporarily would be very inconvenient.
>
>For this filesystem:
>
>> $ sudo btrfs fi show /dev/sdb7
>> Label: 'data'  uuid: 37d3313a-e2ad-4b7f-98fc-a01d815952e0
>> Total devices 2 FS bytes used 106.23GiB
>> devid1 size 2.71TiB used 126.01GiB path /dev/sda7
>> devid2 size 2.71TiB used 126.01GiB path /dev/sdb7
>> $ grep /mnt/data /proc/mounts
>> /dev/sda7 /mnt/data btrfs 
>> rw,noatime,space_cache,autodefrag,subvolid=5,subvol=/ 0 0
>> $ sudo btrfs fi df /mnt/data
>> Data, RAID1: total=123.00GiB, used=104.57GiB
>> System, RAID1: total=8.00MiB, used=48.00KiB
>> Metadata, RAID1: total=3.00GiB, used=1.67GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>> $ uname -a
>> Linux host 4.4.0-93-generic #116-Ubuntu SMP Fri Aug 11 21:17:51 UTC 
>> 2017 x86_64 x86_64 x86_64 GNU/Linux
>
>I've got this in dmesg:
>
>> [Sep 8 20:31] ata6.00: exception Emask 0x0 SAct 0x7ecaa5ef SErr 0x0 
>> action 0x0
>> [  +0.51] ata6.00: irq_stat 0x4008
>> [  +0.29] ata6.00: failed command: READ FPDMA QUEUED
>> [  +0.38] ata6.00: cmd 60/70:18:50:6c:f3/00:00:79:00:00/40 tag 3 
>> ncq 57344 in
>>res 41/40:00:68:6c:f3/00:00:79:00:00/40 Emask 
>> 0x409 (media error) 
>> [  +0.94] ata6.00: status: { DRDY ERR }
>> [  +0.26] ata6.00: error: { UNC }
>> [  +0.001195] ata6.00: configured for UDMA/133
>> [  +0.30] sd 6:0:0:0: [sdb] tag#3 FAILED Result: hostbyte=DID_OK 
>> driverbyte=DRIVER_SENSE
>> [  +0.05] sd 6:0:0:0: [sdb] tag#3 Sense Key : Medium Error 
>> [current] [descriptor]
>> [  +0.04] sd 6:0:0:0: [sdb] tag#3 Add. Sense: Unrecovered read 
>> error - auto reallocate failed
>> [  +0.05] sd 6:0:0:0: [sdb] tag#3 CDB: Read(16) 88 00 00 00 00 00
>
>> 79 f3 6c 50 00 00 00 70 00 00
>> [  +0.03] blk_update_request: I/O error, dev sdb, sector
>2045996136
>> [  +0.47] BTRFS error (device sda7): bdev /dev/sdb7 errs: wr 0,
>rd 
>> 1, flush 0, corrupt 0, gen 0
>> [  +0.000062] BTRFS error (device sda7): bdev /dev/sdb7 errs: wr 0,
>rd 
>> 2, flush 0, corrupt 0, gen 0
>> [  +0.77] ata6: EH complete
>
>There's still 1 in Current_Pending_Sector line of smartctl output as of
>
>now, so it probably won't heal by itself.
>
>--
>
>With Best Regards,
>Marat Khalili
>--
>To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
>in
>the body of a message to majord...@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Please help with exact actions for raid1 hot-swap

2017-09-09 Thread Marat Khalili

Dear list,

I'm going to replace one hard drive (a partition, actually) of a btrfs 
raid1. Can you please spell out exactly what I need to do in order to get my 
filesystem working as RAID1 again after the replacement, exactly as it was 
before? I saw some bad examples of drive replacement on this list, so I'm 
afraid to just follow random instructions from the wiki, and putting this 
system out of action even temporarily would be very inconvenient.


For this filesystem:


$ sudo btrfs fi show /dev/sdb7
Label: 'data'  uuid: 37d3313a-e2ad-4b7f-98fc-a01d815952e0
Total devices 2 FS bytes used 106.23GiB
devid1 size 2.71TiB used 126.01GiB path /dev/sda7
devid2 size 2.71TiB used 126.01GiB path /dev/sdb7
$ grep /mnt/data /proc/mounts
/dev/sda7 /mnt/data btrfs 
rw,noatime,space_cache,autodefrag,subvolid=5,subvol=/ 0 0

$ sudo btrfs fi df /mnt/data
Data, RAID1: total=123.00GiB, used=104.57GiB
System, RAID1: total=8.00MiB, used=48.00KiB
Metadata, RAID1: total=3.00GiB, used=1.67GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
$ uname -a
Linux host 4.4.0-93-generic #116-Ubuntu SMP Fri Aug 11 21:17:51 UTC 
2017 x86_64 x86_64 x86_64 GNU/Linux


I've got this in dmesg:

[Sep 8 20:31] ata6.00: exception Emask 0x0 SAct 0x7ecaa5ef SErr 0x0 
action 0x0

[  +0.51] ata6.00: irq_stat 0x4008
[  +0.29] ata6.00: failed command: READ FPDMA QUEUED
[  +0.38] ata6.00: cmd 60/70:18:50:6c:f3/00:00:79:00:00/40 tag 3 
ncq 57344 in
   res 41/40:00:68:6c:f3/00:00:79:00:00/40 Emask 
0x409 (media error) 

[  +0.94] ata6.00: status: { DRDY ERR }
[  +0.26] ata6.00: error: { UNC }
[  +0.001195] ata6.00: configured for UDMA/133
[  +0.30] sd 6:0:0:0: [sdb] tag#3 FAILED Result: hostbyte=DID_OK 
driverbyte=DRIVER_SENSE
[  +0.05] sd 6:0:0:0: [sdb] tag#3 Sense Key : Medium Error 
[current] [descriptor]
[  +0.04] sd 6:0:0:0: [sdb] tag#3 Add. Sense: Unrecovered read 
error - auto reallocate failed
[  +0.05] sd 6:0:0:0: [sdb] tag#3 CDB: Read(16) 88 00 00 00 00 00 
79 f3 6c 50 00 00 00 70 00 00

[  +0.03] blk_update_request: I/O error, dev sdb, sector 2045996136
[  +0.47] BTRFS error (device sda7): bdev /dev/sdb7 errs: wr 0, rd 
1, flush 0, corrupt 0, gen 0
[  +0.62] BTRFS error (device sda7): bdev /dev/sdb7 errs: wr 0, rd 
2, flush 0, corrupt 0, gen 0

[  +0.77] ata6: EH complete


There's still a 1 in the Current_Pending_Sector line of the smartctl output as of 
now, so it probably won't heal by itself.


--

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is autodefrag recommended? -- re-duplication???

2017-09-05 Thread Marat Khalili

Dear experts,

My first reaction to just switching autodefrag on was positive, but the 
mentions of re-duplication are very scary. The main use of BTRFS here is 
backup snapshots, so re-duplication would be disastrous.


To stick to a concrete example, let there be two files, 4KB and 
4GB in size, referenced in read-only snapshots 100 times each, and suppose 
4KB of each file is rewritten each night and then another snapshot is 
created (let's ignore snapshot deletion here). AFAIU, 8KB of additional 
space (+metadata) will be allocated each night without autodefrag. With 
autodefrag, will it be perhaps 4KB+128KB, or something much worse?


--

With Best Regards,
Marat Khalili

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Is autodefrag recommended?

2017-09-04 Thread Marat Khalili

Hello list,
good time of the day,

More than once I have seen it mentioned on this list that the autodefrag option 
solves problems with no apparent drawbacks, yet it's not the default. 
Can you recommend just switching it on indiscriminately on all 
installations?


I'm currently on kernel 4.4 and can switch to 4.10 if necessary (it's 
Ubuntu that gives us this strange choice, no idea why it's not 4.9). 
Only spinning rust here, no SSDs.


--

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: number of subvolumes

2017-08-24 Thread Marat Khalili
> We find that typically apt is very slow on a machine with 50 or so snapshots 
> and raid10. Slow as in probably 10x slower than doing the same update on a 
> machine with 'single' and no snapshots.
>
> Other operations seem to be the same speed, especially disk benchmarks do not 
> seem to indicate any performance degradation.

For a meaningful discussion it is important to take into account the fact that 
dpkg infamously calls fsync after changing every bit of information, so 
basically you're measuring fsync speed. Which is slow on btrfs (compared to 
simpler filesystems), but unrelated to normal work.

I've got two near-identical servers here with several containers each, differing 
only in filesystem: btrfs raid1 on one (for historical reasons) and 
ext4/mdadm raid1 on the other, no snapshots, no reflinks. Each time, the containers on 
ext4 update several times faster, but in everyday operation there's no 
significant difference.
-- 

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: finding root filesystem of a subvolume?

2017-08-22 Thread Marat Khalili
Hmm, now I'm really confused, I just checked on the Ubuntu 17.04 and 
16.04.3 VM's I have (I only run current and the most recent LTS 
version), and neither of them behave like this. 


I was also shocked, but:


$ lsb_release -a
No LSB modules are available.
Distributor ID:Ubuntu
Description:Ubuntu 16.04.3 LTS
Release:16.04
Codename:xenial

$ df -T | grep /mnt/data/lxc

$ df -T /mnt/data/lxc
Filesystem Type  1K-blocks Used  Available Use% Mounted on
-  -2907008836 90829848 2815107576   4% /mnt/data/lxc


--

With Best Regards,
Marat Khalili

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: finding root filesystem of a subvolume?

2017-08-22 Thread Marat Khalili

I have no subvol=/ option at all:
It probably depends on the kernel, but I presume a missing subvol option means 
the same as subvol=/ .



I am only interested in mounted volumes.
If your initial path (/local/.backup/home) is a subvolume but it's not 
itself present in /proc/mounts, then it's probably mounted as part of some 
higher-level subvolume, but this higher-level subvolume does not have to 
be the root. Do you need the volume root, or just some higher-level subvolume 
that's mounted?


--

With Best Regards,
Marat Khalili

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: finding root filesystem of a subvolume?

2017-08-22 Thread Marat Khalili

On 22/08/17 15:50, Ulli Horlacher wrote:

It seems, I have to scan the subvolume path upwards until I found a real
mount point,
I think searching /proc/mounts for the same device with subvol=/ in the 
options is most straightforward. But what makes you think it's mounted 
at all?
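
For example, a /proc/mounts scan along these lines lists the top-level
(subvol=/) btrfs mounts together with their devices, against which the
device of the subvolume in question can then be matched:

awk '$3 == "btrfs" && $4 ~ /(^|,)subvol=\/($|,)/ {print $1, $2}' /proc/mounts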


--

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: slow btrfs with a single kworker process using 100% CPU

2017-08-20 Thread Marat Khalili
I'm a btrfs user, not a developer; developers can probably provide a more 
detailed explanation by looking at stack traces in dmesg etc., but it's 
possible that there's just no quick fix (yet). I presume these are 1413 
_full-volume_ snapshots. Then some operations have to process 
43.65TiB*1413=62PiB of data -- well, metadata for that data, but it's 
still a lot as you may guess, especially if it's all heavily fragmented. 
You can either gradually reduce the number of snapshots and wait (it may 
drag on for weeks or months), or copy everything to a different device, 
reformat this one, and then not create that many snapshots again.


As for "blocked for more than 120 seconds" messages in dmesg, I see them 
every night after I delete about a dozen of snapshots ~10TiB in _total_ 
volume, albeit with qgroups. These messages usually subside after about 
couple of hours. They only tell you what you already know: some btrfs 
operations are painfully slow.


--

With Best Regards,
Marat Khalili

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: slow btrfs with a single kworker process using 100% CPU

2017-08-16 Thread Marat Khalili

I've one system where a single kworker process is using 100% CPU;
sometimes a second process comes up with 100% CPU [btrfs-transacti]. Is
there anything I can do to get the old speed again or find the culprit?


1. Do you use quotas (qgroups)?

2. Do you have a lot of snapshots? Have you deleted some recently?

More info about your system would help too.

--

With Best Regards,
Marat Khalili

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Massive loss of disk space

2017-08-03 Thread Marat Khalili
On August 3, 2017 7:01:06 PM GMT+03:00, Goffredo Baroncelli 
>The file is physically extended
>
>ghigo@venice:/tmp$ fallocate -l 1000 foo.txt

For clarity let's replace the fallocate above with:
$ head -c 1000 /dev/urandom > foo.txt

>ghigo@venice:/tmp$ ls -l foo.txt
>-rw-r--r-- 1 ghigo ghigo 1000 Aug  3 18:00 foo.txt
>ghigo@venice:/tmp$ fallocate -o 500 -l 1000 foo.txt
>ghigo@venice:/tmp$ ls -l foo.txt
>-rw-r--r-- 1 ghigo ghigo 1500 Aug  3 18:00 foo.txt
>ghigo@venice:/tmp$

According to the explanation by Austin, foo.txt at this point somehow occupies 
2000 bytes of space, because I can reflink it and then write another 1000 bytes 
of data into it without losing the 1000 bytes I already have or running out of 
drive space. (Or is that only true while there are open file handles?)
-- 

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Massive loss of disk space

2017-08-03 Thread Marat Khalili

On 02/08/17 20:52, Goffredo Baroncelli wrote:

consider the following scenario:

a) create a 2GB file
b) fallocate -o 1GB -l 2GB
c) write from 1GB to 3GB

after b), the expectation is that c) always succeed [1]: i.e. there is enough 
space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the 
already allocated space because there could be a small time window where both 
the old and the new data exists on the disk.

Just curious. With the current implementation, in the following case:
a) create a 2GB file1 && create a 2GB file2
b) fallocate -o 1GB -l 2GB file1 && fallocate -o 1GB -l 2GB file2
c) write from 1GB to 3GB in file1 && write from 1GB to 3GB in file2
will (c) always succeed? I.e. does fallocate really allocate 2GB per 
file, or does it only allocate an additional 1GB and check free space for 
another 1GB? If it's only the latter, it is useless.
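
For anyone wanting to reproduce it, a rough translation of that scenario into
commands (file names are illustrative, and the same is then repeated for
file2; conv=notrunc stops dd from truncating the fallocated range):

dd if=/dev/zero of=file1 bs=1M count=2048                          # a) create a 2GB file
fallocate -o 1G -l 2G file1                                        # b) fallocate from 1GB to 3GB
dd if=/dev/zero of=file1 bs=1M seek=1024 count=2048 conv=notrunc   # c) write from 1GB to 3GB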


--

With Best Regards,
Marat Khalili

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Best Practice: Add new device to RAID1 pool

2017-07-24 Thread Marat Khalili
>> This may be a stupid question , but are your pool of butter (or BTRFS pool) 
>> by any chance hooked up via USB? If this is USB2.0 at 480mitb/s then it is 
>> about 57MB/s / 4 drives = roughly 14.25 or about 11MB/s if you shave off 
>> some overhead.
>
>Nope, USB 3. Typically on scrubs I get 110MB/s that winds down to 
>60MB/s as it progresses to the slow parts of the disk.

It could have degraded to USB2 due to a bad connection or loose electrical contacts. 
You know USB3 needs extra wires, and if it loses some it will connect (or 
reconnect) in USB2 mode. I'd check historical kernel messages just in case, 
and/or unmount and reconnect to be sure.
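
For instance, `lsusb -t` shows the negotiated speed per device (480M for USB2
vs 5000M for USB3), and `dmesg | grep -i usb` should show any renegotiation
events; both are quick, non-destructive checks.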
-- 

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kernel btrfs file system wedged -- is it toast?

2017-07-22 Thread Marat Khalili

> The btrfs developers should have known this, and announced this,
a long time ago, in various prominent ways that it would be difficult
for potential new users to miss. 

I'm also a user like you, and I felt like this too when I came here (BTW there 
are several traps in BTRFS, and others cause partial or whole filesystem 
loss, so you're lucky). There's truth in your words that some warning is 
needed, but in this open-source business it is not clear who should give it to 
whom. The developers on this list are actually spending their time adding such 
warnings to the kernel and command-line tools, but e.g. people using a GUI and not 
reading dmesg over breakfast won't see them anyway. The whole situation is 
unfortunate because hardware and OS vendors keep hyping BTRFS and making it 
the default in their products when it is clearly not ready, but you're now talking 
to and blaming the wrong people.

Personally, for me coming to this list was the most helpful thing in 
understanding BTRFS's current state and limitations. I'm still using it, although 
in a very careful and controlled manner. But browsing the list every day sadly 
takes time. If you can't afford that, or are running something absolutely 
critical, better look to other, more mature filesystems. After all, as the adage 
goes: "legacy is what we run in production".
-- 

With Best Regards,
Marat Khalili


Re: Exactly what is wrong with RAID5/6

2017-06-21 Thread Marat Khalili

On 21/06/17 06:48, Chris Murphy wrote:

Another possibility is to ensure a new write is written to a new *not*
full stripe, i.e. dynamic stripe size. So if the modification is a 50K
file on a 4 disk raid5; instead of writing 3 64K data strips + 1 64K
parity strip (a full stripe write); write out 1 64K data strip + 1 64K
parity strip. In effect, a 4 disk raid5 would quickly get not just 3
data + 1 parity strip Btrfs block groups; but 1 data + 1 parity, and 2
data + 1 parity chunks, and direct those writes to the proper chunk
based on size. Anyway that's beyond my ability to assess how much
allocator work that is. Balance I'd expect to rewrite everything to
max data strips possible; the optimization would only apply to normal
operation COW.
This will make some filesystems mostly RAID1, negating all the space savings 
of RAID5, won't it?


Isn't it easier to recalculate the parity block using the previous state 
of the two rewritten strips, parity and data? I don't understand all the 
performance implications, but it might scale better with the number of devices.
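
What I mean, as a toy example with one-byte 'strips' (just the arithmetic, 
assuming plain XOR parity as in RAID5):

$ old_data=0x5A; new_data=0x3C; old_parity=0xF0
$ new_parity=$(( old_parity ^ old_data ^ new_data ))   # XOR the old data out, XOR the new data in
$ printf 'new parity: 0x%02X\n' "$new_parity"
new parity: 0x96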


--

With Best Regards,
Marat Khalili


Re: understanding differences in recoverability of raid1 vs raid10 and performance implications of unusual numbers of devices

2017-06-01 Thread Marat Khalili
>raid 1 write data on all disks synchronously all time, no tricks.
>btrfs raid1 read data by PID%2
>0 - first copy
>1 - second copy

Meaning, a single-process database will only ever read one copy? At least, the 
meaning of first/second relative to physical devices depends on the extent, 
right?
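
(A trivial illustration of the rule as I read it -- hypothetical PIDs, just 
the arithmetic:)

$ for pid in 4321 4322; do echo "pid $pid reads copy $(( pid % 2 ))"; done
pid 4321 reads copy 1
pid 4322 reads copy 0
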
-- 

With Best Regards,
Marat Khalili


Re: btrfs mounts RO, kernel oops on RW

2017-05-29 Thread Marat Khalili

I'm not asking for a specific endorsement, but should I be considering
something like the seagate ironwolf or WD red drives?


You need two qualitative features from an HDD for RAID usage:
1) fail-fast behaviour (TLER in WD-speak), and
2) being designed to tolerate vibrations from other disks in the box.

Therefore you need _at least_ a WD Red or an alternative from Seagate. Paying 
more can only bring you quantitative benefits AFAIK. Just don't put 
desktop drives in RAID.
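
For the first point you can usually check and set the error recovery timeout 
with smartctl (a sketch, /dev/sdX being your drive; desktop drives often 
report the feature as unsupported or forget the setting on power cycle):

$ sudo smartctl -l scterc /dev/sdX         # show the current SCT Error Recovery Control (TLER) setting
$ sudo smartctl -l scterc,70,70 /dev/sdX   # set read/write recovery timeouts to 7.0 seconds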


(Sorry for being off-topic, but after some long recent discussions I 
don't feel as guilty. :) )


--

With Best Regards,
Marat Khalili



Re: btrfs-tools/linux 4.11: btrfs-cleaner misbehaving

2017-05-28 Thread Marat Khalili

If
you do have scripted snapshots being taken, be sure you have a script
thinning down your snapshot history as well.



I know Ivan P never mentioned qgroups, but just a warning for future 
readers: *with qgroups, don't let this script delete more than a couple 
dozen snapshots at once*, then wait for btrfs kernel activity to subside 
before trying again. Be especially careful when running this script 
in production for the very first time: it will most likely find too many 
snapshots to delete. (Temporarily removing all affected qgroups beforehand 
may also work.)
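
Something along these lines (a rough sketch with hypothetical paths; 
`btrfs subvolume sync` just waits until deleted subvolumes are actually 
cleaned up):

$ ls -d /mnt/pool/snapshots/daily-* | head -n 20 | while read snap; do
>     sudo btrfs subvolume delete "$snap"
> done
$ sudo btrfs subvolume sync /mnt/pool      # let the cleaner finish before the next batch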



--

With Best Regards,
Marat Khalili



Re: snapshot destruction making IO extremely slow

2017-05-24 Thread Marat Khalili

Hello,


It occurs when enabling quotas on a volume. When there are a
lot of snapshots that are deleted, the system becomes extremely
unresponsive (IO often waiting for 30s on a SSD). When I don't have
quotas, removing snapshots is fast.
Same problem here. It is now common knowledge on this list that qgroups 
cause performance problems. I try to avoid deleting many snapshots at 
once because of this.


--

With Best Regards,
Marat Khalili



Re: QGroups Semantics

2017-05-23 Thread Marat Khalili

Hit "Send" a little too early:

More complete workaround would be delayed cleanup. What about 
(re-)mount time? (Should also handle qgroups remaining 

... after subvolumes deleted on previous kernels.)

--

With Best Regards,
Marat Khalili

On 23/05/17 08:38, Marat Khalili wrote:

Just some user's point of view:


I propose the following changes:
1) We always cleanup level-0 qgroups by default, with no opt-out.
I see absolutely no reason to keep these around.


It WILL break scripts that try to do this cleanup themselves. OTOH it 
will simplify writing new ones.


Since qgroups are assigned sequential numbers, it must be possible to 
partially work it around by not returning an error on a repeated delete. 
But you cannot completely emulate qgroup presence without actually 
keeping it, so some scripts will still break.


More complete workaround would be delayed cleanup. What about 
(re-)mount time? (Should also handle qgroups remaining )



  We do not allow the creation of level-0 qgroups for (sub)volumes
that do not exist.


Probably I'm mistaken, but I see no reason for doing it even now, 
since I don't think it's possible to reliably assign an existing 0-level 
qgroup to a new subvolume. So this change should break nothing.



Why do we allow deleting a level 0 qgroup for a currently existing
subvolume?

4) Add a flag to the qgroup_delete_v2 ioctl, NO_SUBVOL_CHECK.
If the flag is present, it will allow you to delete qgroups which
reference active subvolumes.


Perhaps some people do cleanup in the reverse order? Other than that, I 
don't understand why this feature is needed, so IMO it's unlikely to 
be needed in a new API.


Of course, this is all just one datapoint for you.

--

With Best Regards,
Marat Khalili




Re: QGroups Semantics

2017-05-22 Thread Marat Khalili

Just some user's point of view:


I propose the following changes:
1) We always cleanup level-0 qgroups by default, with no opt-out.
I see absolutely no reason to keep these around.


It WILL break scripts that try to do this cleanup themselves. OTOH it 
will simplify writing new ones.
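
For reference, the kind of cleanup such scripts do today is roughly this (a 
sketch with a hypothetical mount point): list the level-0 qgroups and destroy 
those whose id no longer matches an existing subvolume.

$ mnt=/mnt/pool
$ subs=$(sudo btrfs subvolume list $mnt | awk '{print $2}')
$ sudo btrfs qgroup show $mnt | awk '$1 ~ /^0\// {print $1}' | while read qg; do
>     id=${qg#0/}
>     echo "$subs" | grep -qx "$id" || sudo btrfs qgroup destroy "$qg" "$mnt"
> done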


Since qgroups are assigned sequential numbers, it must be possible to 
partially work it around by not returning an error on a repeated delete. But 
you cannot completely emulate qgroup presence without actually keeping 
it, so some scripts will still break.


More complete workaround would be delayed cleanup. What about (re-)mount 
time? (Should also handle qgroups remaining )



  We do not allow the creation of level-0 qgroups for (sub)volumes
that do not exist.


Probably I'm mistaken, but I see no reason for doing it even now, since 
I don't think it's possible to reliably assign an existing 0-level qgroup 
to a new subvolume. So this change should break nothing.



Why do we allow deleting a level 0 qgroup for a currently existing
subvolume?

4) Add a flag to the qgroup_delete_v2 ioctl, NO_SUBVOL_CHECK.
If the flag is present, it will allow you to delete qgroups which
reference active subvolumes.


Perhaps some people do cleanup in the reverse order? Other than that, I don't 
understand why this feature is needed, so IMO it's unlikely to be needed 
in a new API.


Of course, this is all just one datapoint for you.

--

With Best Regards,
Marat Khalili


Re: Backing up BTRFS metadata

2017-05-12 Thread Marat Khalili

Indeed. This has been tried before, and I don't think it came to
anything.

What can/did go wrong?



I suspect it's still only capturing metadata, rather than data.

Yes. But data should still be there, on disk, right?


--

With Best Regards,
Marat Khalili



Re: Backing up BTRFS metadata

2017-05-11 Thread Marat Khalili


On 11/05/17 18:19, Chris Murphy wrote:

btrfs-image

Looks great, thank you!
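
For future readers, the basic usage looks like this (a sketch; device and 
output paths are examples, and restoring writes over the target device):

$ sudo btrfs-image -c 9 -t 4 /dev/sdb1 /backup/sdb1-metadata.img   # dump metadata only, compressed
$ sudo btrfs-image -r /backup/sdb1-metadata.img /dev/sdb1          # restore the metadata image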

--

With Best Regards,
Marat Khalili


Backing up BTRFS metadata

2017-05-11 Thread Marat Khalili
Sorry if the question sounds unorthodox: is there some simple way to read 
(and back up) all BTRFS metadata from a volume?


The motivation, of course, is the possibility of quickly recovering from 
catastrophic filesystem failures on a logical level. Some small amount of 
actual data that this metadata references may be overwritten between the 
backup and restore moments, but due to checksumming it can easily be caught 
(and either individually restored from backup or discarded).


--

With Best Regards,
Marat Khalili


Re: runtime btrfsck

2017-05-10 Thread Marat Khalili

Hello,

(Warning: I'm a user, not a developer here.)

In my experience (on kernel 4.4) it processed larger and slower devices 
within a day, BUT according to some recent topics runaway 
fragmentation (meaning in this case a large number of small extents 
regardless of their relative location) can significantly slow down BTRFS 
operations to the point of making them infeasible. Possible reasons for 
fragmentation are snapshotting volumes too often and/or running VM 
images from BTRFS without taking some precautions.
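
By 'precautions' I mean things like marking image files nodatacow and keeping 
an eye on how fragmented they already are (a sketch; note that chattr +C only 
takes effect on newly created, empty files and disables checksumming for them):

$ filefrag /var/lib/vm/disk.img          # the extent count is a rough proxy for fragmentation
$ touch /var/lib/vm/new-disk.img
$ chattr +C /var/lib/vm/new-disk.img     # then create/copy the image into this nodatacow file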


On top of this, the mount device name suggests there's another 
layer between BTRFS and the hardware. Are you sure it's not the bottleneck 
in this case?


--

With Best Regards,
Marat Khalili

On 10/05/17 10:02, Stefan Priebe - Profihost AG wrote:

I'm now trying btrfs progs 4.10.2. Is anybody out there who can tell me
something about the expected runtime or how to fix bad key ordering?

Greets,
Stefan

Am 06.05.2017 um 07:56 schrieb Stefan Priebe - Profihost AG:

It's still running. Is this the normal behaviour? Is there any other way
to fix the bad key ordering?

Greets,
Stefan

Am 02.05.2017 um 08:29 schrieb Stefan Priebe - Profihost AG:

Hello list,

i wanted to check an fs cause it has bad key ordering.

But btrfscheck is now running since 7 days. Current output:
# btrfsck -p --repair /dev/mapper/crypt_md0
enabling repair mode
Checking filesystem on /dev/mapper/crypt_md0
UUID: 37b15dd8-b2e1-4585-98d0-cc0fa2a5a7c9
bad key ordering 39 40
checking extents [O]

FS is a 12TB BTRFS Raid 0 over 3 mdadm Raid 5 devices. How long should
btrfsck run and is there any way to speed it up? btrfs tools is 4.8.5

Thanks!

Greets,
Stefan






BTRFS warning (device sda7): block group 181491728384 has wrong amount of free space

2017-05-02 Thread Marat Khalili

Dear all,

I cannot understand two messages in syslog; could someone please shed 
some light? Here they are:


Apr 29 08:54:03 container-name kernel: [792742.662375] BTRFS warning 
(device sda7): block group 181491728384 has wrong amount of free space
Apr 29 08:54:03 container-name kernel: [792742.662381] BTRFS warning 
(device sda7): failed to load free space cache for block group 
181491728384, rebuilding it now
Especially strange is the fact that the messages appear in the LXC container's 
syslog, but not in the syslog of the host system. I had only seen network and 
apparmor-related messages in container syslogs before.


I didn't run any usermode btrfs tools at the time (especially in the 
container, since they are not even installed there), but there's a quota 
set for this subvolume, and it was coming close to exhaustion due to a large 
mysql database. There are no snapshots this time. smartmontools finds no problems.



marat@host:~$ uname -a
Linux host 4.4.0-72-generic #93-Ubuntu SMP Fri Mar 31 14:07:41 UTC 
2017 x86_64 x86_64 x86_64 GNU/Linux

marat@host:~$ lsb_release -a
No LSB modules are available.
Distributor ID:Ubuntu
Description:Ubuntu 16.04.2 LTS
Release:16.04
Codename:xenial
marat@host:~$ btrfs --version
btrfs-progs v4.4
marat@host:~$ sudo btrfs qgroup show -F -pcre 
/mnt/lxc/container-name/rootfs

qgroupid   rfer       excl       max_rfer   max_excl   parent   child
--------   ----       ----       --------   --------   ------   -----
0/802      63.93GiB   63.93GiB   64.00GiB   none       ---      ---
marat@host:~$ sudo btrfs filesystem show /dev/sda7 # run after freeing 
space by clearing database

Label: 'data'  uuid: 37d3313a-e2ad-4b7f-98fc-a01d815952e0
Total devices 2 FS bytes used 47.73GiB
devid 1 size 2.71TiB used 114.01GiB path /dev/sda7
devid 2 size 2.71TiB used 114.01GiB path /dev/sdb7
marat@host:~$ sudo btrfs filesystem df /mnt/lxc/container-name/rootfs # 
run after freeing space by clearing database

Data, RAID1: total=111.00GiB, used=46.83GiB
System, RAID1: total=8.00MiB, used=32.00KiB
Metadata, RAID1: total=3.00GiB, used=983.11MiB
GlobalReserve, single: total=336.00MiB, used=0.00B
marat@host:~$ sudo lxc-attach -n container-name cat /proc/mounts | 
grep sda7
/dev/sda7 / btrfs 
rw,relatime,space_cache,subvolid=802,subvol=/lxc/container-name/rootfs 0 0


--

With Best Regards,
Marat Khalili



Re: Problem with file system

2017-04-24 Thread Marat Khalili

On 25/04/17 03:26, Qu Wenruo wrote:
IIRC qgroup for subvolume deletion will cause full subtree rescan 
which can cause tons of memory. 
Could it be this bad, 24GB of RAM for a 5.6TB volume? What does it even 
use this absurd amount of memory for? Is it swappable?


I haven't read about RAM limitations for running qgroups before, only 
about CPU load (which, importantly, only requires patience and does not crash 
servers).


--

With Best Regards,
Marat Khalili


Re: Prevent escaping btrfs quota

2017-04-21 Thread Marat Khalili
Just some food for thought: there's already a tag that correctly assigns 
filesystem objects to users. It is called owner(ship). Instead of making 
qgroups repeat ownership logic, why not base qgroup assignments on 
ownership itself? (At least on a per-subvolume basis.)


--

With Best Regards,
Marat Khalili



Re: Encountered kernel bug#72811. Advice on recovery?

2017-04-16 Thread Marat Khalili
Even making such a warning conditional on kernel version is 
problematic, because many distros backport major blocks of code, 
including perhaps btrfs fixes, and the nominally 3.14 or whatever 
kernel may actually be running btrfs and other fixes from 4.14 or 
later, by the time they actually drop support for whatever LTS distro 
version and quit backporting fixes.


This information could be stored in the kernel and made available to 
usermode tools via some proc file. This would be very useful 
_especially_ considering backporting. Raid56 could already be fixed (or 
not) by the time this is implemented, but no doubt there will still be 
other highly experimental capabilities, judging by how things go. And 
this feature itself could easily be backported.


Some machine-readable readiness level (ok / warning / override flag 
needed / known but disabled in kernel) plus a one-line text message 
displayed to users in cases 2-4 is all we need. If the proc file is missing 
or doesn't contain information about a specific capability, tools could 
default to the current behavior (AFAIR there are already warnings in some 
cases). The message should tersely cover any known issues, including 
stability, performance, compatibility and general readiness, and may 
contain links (to the btrfs wiki?) for more information. I expect the whole 
file to easily fit in 512 bytes.
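
Purely as an illustration of what I mean -- a hypothetical mock-up, nothing 
like it exists today and all names and wording are invented:

raid1:      ok
raid56:     warning              parity RAID still has known issues, see https://btrfs.wiki.kernel.org/index.php/RAID56
qgroups:    warning              heavy CPU load with many snapshots
feature-x:  override-needed      mount with an explicit override option to enable
feature-y:  disabled-in-kernel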


--

With Best Regards,
Marat Khalili


Deduplication tools

2017-04-13 Thread Marat Khalili
After reading this mailing list for a while I became a bit more cautious 
about using various BTRFS features, so I decided to ask just in case: is 
it safe to use out-of-band deduplication tools 
<https://btrfs.wiki.kernel.org/index.php/Deduplication>, and which of 
them are considered more stable/mainstream? Also, won't running these 
tools exacerbate the often-mentioned stability/performance problems with 
too many snapshots? Any first-hand experience is very welcome.
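
For concreteness, the kind of invocation I mean, taking duperemove from that 
wiki page as an example (a sketch, I have not run it myself yet):

$ sudo duperemove -dhr --hashfile=/var/tmp/dedupe.hash /mnt/pool/data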



--

With Best Regards,
Marat Khalili


Re: Do different btrfs volumes compete for CPU?

2017-04-05 Thread Marat Khalili

On 04/04/17 20:36, Peter Grandi wrote:

SATA works for external use, eSATA works well, but what really
matters is the chipset of the adapter card.
eSATA might be sound electrically, but mechanically it is awful. Try to 
run it for months in a crowded server room, and inevitably you'll get 
disconnections and data corruption. I tried different cables and brackets -- 
same result. If you have ever used an eSATA connector, you'd know the feeling.



In my experience JMicron is not so good, Marvell a bit better,
best is to use a recent motherboard chipset with a SATA-eSATA
internal cable and bracket.
That's exactly what I used to use: internal controller of Z77 chipset + 
bracket(s).



But that does not change the fact that it is a library and work
is initiated by user requests which are not per-subvolume, but
in effect per-volume.

That's the answer I was looking for.


It is a way to do so and not a very good way. There is no
obviously good way to define "real usage" in the presence of
hard-links and reflinking, and qgroups use just one way to
define it. A similar problem happens with processes in the
presence of shared pages, multiple mapped shared libraries etc.
No need to over-generalize. There's an obvious good way to define "real 
usage" of a subvolume and its snapshots as long as they don't share any 
data with other subvolumes, as is often the case. If they do share, two 
figures -- exclusive and referenced, like in qgroups -- are sufficient 
for most tasks.



The problem is that
both hard-links and ref-linking create really significant
ambiguities as to used space. Plus the same problem would happen
with directories instead of subvolumes and hard-links instead of
reflinked snapshots.
You're right, although with hard-links there's at least a remote chance of 
estimating storage use with usermode scripts.



ASMedia USB3 chipsets are fairly reliable at the least the card
ones on the system side. The ones on the disk side I don't know
much about.
This is getting increasingly off-topic, but our mainstay is CFI 5-disk 
DAS boxes (8253JDGG to be exact) filled with WD Reds in a RAID5 
configuration. They are no longer produced and are getting harder and harder 
to source, but have proven themselves very reliable. According to lsusb 
they contain a JMicron JMS567 SATA 6Gb/s bridge.


--

With Best Regards,
Marat Khalili


Re: mix ssd and hdd in single volume

2017-04-03 Thread Marat Khalili

On 02/04/17 03:13, Duncan wrote:

Meanwhile, since you appear to be designing a mass-market product, it's
worth mentioning that btrfs is considered, certainly by its devs and
users on this list, to be "still stabilizing, not fully stable and
mature." [...] That doesn't sound like a particularly good choice for a 
mass-market NAS
product to me.  Of course there's rockstor and others out there already
shipping such products, but they're risking their reputation and the
safety of their customer's data in the process, altho there's certainly a
few customers out there with the time, desire and technical know-how to
ensure the recommended backups and following current kernel and list, and
that's exactly the sort of people you'll find already here.  But that's
not sufficiently mass-market to appeal to most vendors.
You may want to look here: https://www.synology.com/en-global/dsm/Btrfs 
. Somebody forgot to tell Synology, which already supports btrfs on all 
hardware-capable devices. I think the Rubicon has been crossed for 
'mass-market NAS[es]', for good or not.


--

With Best Regards,
Marat Khalili



Re: Do different btrfs volumes compete for CPU?

2017-04-03 Thread Marat Khalili

On 01/04/17 13:17, Peter Grandi wrote:

That "USB-connected is a rather bad idea. On the IRC channel
#Btrfs whenever someone reports odd things happening I ask "is
that USB?" and usually it is and then we say "good luck!" :-).
You're right, but USB/eSATA arrays are dirt cheap in comparison with 
similar-performance SAN/NAS and similar solutions, which we unfortunately 
cannot really afford here.


Just a bit of back-story: I tried to use eSATA and ext4 first, but 
observed silent data corruption and irrecoverable kernel hangs -- 
apparently, SATA is not really designed for external use. That's when I 
switched to both USB and, coincidentally, btrfs, and stability became 
orders of magnitude better even on a re-purposed consumer-grade PC (Z77 
chipset, 3rd gen. i5) with a horribly outdated kernel. Now I'm rebuilding the 
same configuration on server-grade hardware (C610 chipset, 40 I/O-channel 
Xeon) and a modern kernel, and thus would be very surprised to find 
problems in USB throughput.



As written that question is meaningless: despite the current
mania for "threads"/"threadlets" a filesystem driver is a
library, not a set of processes (all those '[btrfs-*]'
threadlets are somewhat misguided ways to do background
stuff).

But these threadlets, misguided as they are, do exist, don't they?


* Qgroups are famously system CPU intensive, even if less so
   than in earlier releases, especially with subvolumes, so the
   16 hours CPU is both absurd and expected. I think that qgroups
   are still effectively unusable.
I understand that qgroups are very much a work in progress, but (correct me 
if I'm wrong) right now they're the only way to estimate the real usage of a 
subvolume and its snapshots. For instance, if I have a dozen 1TB 
subvolumes, each having ~50 snapshots, and suddenly run out of space on a 
24TB volume, how do I find the culprit without qgroups? Keeping an eye on 
storage use is essential for any real-life use of snapshots, and they 
are too convenient as a backup de-duplication tool to give up.
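
Concretely, with qgroups enabled the culprit hunt is a one-liner (assuming a 
btrfs-progs new enough to support --sort; otherwise piping through sort(1) 
works too):

$ sudo btrfs qgroup show --sort=-excl /mnt/pool | head -n 20   # biggest exclusive usage first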


Just a stray thought: btrfs seems to lack an object type in between 
volume and subvolume that would keep track of storage use by several 
subvolumes plus their snapshots, allow snapshotting/transferring multiple 
subvolumes at once, etc. Some kind of super-subvolume (supervolume?) that 
is hierarchical. With the increasing use of subvolumes/snapshots within a 
single system installation, and multiple system installations (belonging 
to different users) in one volume due to liberal use of LXC and similar 
technologies, this will become more and more of a pressing problem.



* The scheduler gives excessive priority to kernel threads, so
   they can crowd out user processes. When for whatever reason
   the system CPU percentage rises everything else usually
   suffers.
I thought it was clear, but it probably needs spelling out: while 1 core 
was completely occupied with the [btrfs-transacti] thread, 5 more were 
mostly idle, serving occasional network requests without any problems. 
Only the process that used storage intensively died. Fortunately or 
not, it's the only data point so far -- smaller snapshot cullings do not 
cause problems.



Only Intel/AMD USB chipsets and a few others are fairly
reliable, and for mass storage only with USB3 with UASPI, which
is basically SATA-over-USB (more precisely SCSI-command-set over
USB). Your system-side card seems to be recent enough to do
UASPI, but probably the peripheral-side chipset isn't. Things
are so bad with third-party chipsets that even several types of
add-on SATA and SAS cards are too buggy.
Thank you very much for this hint. The card is indeed an unknown factor 
here and I'll keep a close eye on it. The chip is an ASM1142, sadly not 
Intel/AMD, but quite popular nevertheless.


--

With Best Regards,
Marat Khalili



Re: Do different btrfs volumes compete for CPU?

2017-03-31 Thread Marat Khalili
Thank you very much for the reply and suggestions; more comments below. 
Still, is there a definite answer to the root question: are different btrfs 
volumes independent in terms of CPU, or are there some shared workers 
that can be a point of contention?



What would have been interesting would have been if you had any reports
from for instance htop during that time, showing wait percentage on the
various cores and status (probably D, disk-wait) of the innocent
process.  iotop output would of course have been even better, but also
rather more special-case so less commonly installed.
Curiously, I had iotop but not htop running. [btrfs-transacti] had 
some low-level activity in iotop (I still assume it was CPU-limited); 
the innocent process did not show any activity anywhere. Next time I'll 
also take note of the process state in ps (sadly, my omission this time).



I believe you will find that the problem isn't btrfs, but rather, I/O
contention
This possibility did not come to my mind. Can USB drivers still be that 
bad in 4.4? Is there any way to distinguish between these two situations 
(btrfs vs USB load)?


BTW, the USB adapter used is this one (though the storage array only supports 
USB 3.0): https://www.asus.com/Motherboard-Accessory/USB_31_TYPEA_CARD/



and that if you try the same thing with one of the
filesystems being for instance ext4, you'll see the same problem there as
well
Not sure if it's possible to reproduce the problem with ext4, since it's 
not possible to perform such extensive metadata operations there, and 
simply moving a large amount of data never created any problems for me 
regardless of the filesystem.


--

With Best Regards,
Marat Khalili



Do different btrfs volumes compete for CPU?

2017-03-31 Thread Marat Khalili
Approximately 16 hours ago I ran a script that deleted >~100 
snapshots and started a quota rescan on a large USB-connected btrfs volume 
(5.4 of 22 TB occupied now). The quota rescan only completed just now, with 
100% load from [btrfs-transacti] throughout this period, which is 
probably ~ok depending on your view of things.


What worries me is an innocent process using _another_, SATA-connected 
btrfs volume, which hung right after I started my script and took >30 
minutes to be SIGKILLed. There's nothing interesting in the kernel log, 
and attempts to attach strace to the process output nothing, but I of 
course suspect that it froze on a disk operation.


I wonder:
1) Can there be contention for CPU or some mutexes between kernel 
btrfs threads belonging to different volumes?
2) If yes, can anything be done about it other than mounting the volumes 
from (different) VMs?




$ uname -a; btrfs --version
Linux host 4.4.0-66-generic #87-Ubuntu SMP Fri Mar 3 15:29:05 UTC 2017 
x86_64 x86_64 x86_64 GNU/Linux

btrfs-progs v4.4


--

With Best Regards,
Marat Khalili



Re: Qgroups are not applied when snapshotting a subvol?

2017-03-28 Thread Marat Khalili
If we were going to reserve something, it should be a high number, not 
a low one.  Having 0 reserved makes some sense, but reserving other 
low numbers seems kind of odd when they aren't already reserved. 


I did some experiments.

Currently, assigning a higher-level qgroup to a lower-level qgroup is not 
possible. Consequently, assigning anything to a 0-level qgroup is not 
possible.


On the other hand, assigning qgroups while skipping levels (e.g. qgroup 
2/P to 10/Q) is possible. So setting the default snapshot level high is 
technically possible, but you won't be able to assign these high-level 
qgroups anywhere lower later.
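
In terms of commands, the experiments boil down to something like this (a 
sketch; the qgroup ids are examples):

$ sudo btrfs qgroup create 1/100 /mnt/pool
$ sudo btrfs qgroup assign 0/257 1/100 /mnt/pool    # low level into high level: works
$ sudo btrfs qgroup assign 1/100 0/257 /mnt/pool    # high level into low level: refused
$ sudo btrfs qgroup create 2/200 /mnt/pool
$ sudo btrfs qgroup create 10/500 /mnt/pool
$ sudo btrfs qgroup assign 2/200 10/500 /mnt/pool   # skipping levels: works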



although I hadn't realized that the snapshot command _does not_ have 
this argument, when it absolutely should. 

It does here in 4.4, it's just not documented :) I too found it by accident.



Perhaps have an option
Options always suit everyone except developers who need to implement and 
support them :)


Here I'd like to wrap up since I seriously doubt any real btrfs 
developers are still reading our discussion :)


--

With Best Regards,
Marat Khalili



Re: Qgroups are not applied when snapshotting a subvol?

2017-03-28 Thread Marat Khalili
There are a couple of reasons I'm advocating the specific behavior I 
outlined:
Some of your points are valid, but some break current behaviour and 
expectations or create technical difficulties.



1. It doesn't require any specific qgroup setup.  By definition, you 
can be 100% certain that the destination qgroup exists, and that you 
won't need to create new qgroups behind the user's back (given your 
suggestion, what happens when qgroup 1/N doesn't exist?).
This is a general problem with current qgroups: you have to reference 
them by some random numbers, not by user-assigned names like files. It 
would need to be fixed sooner or later with a syntax like L:<name> in 
place of L/N, or even some special syntax made specifically for path 
snapshots.


BTW, what about reserving level 1 for qgroups describing subvolumes and 
all their snapshots and forbidding manual management of qgroups at this 
level?



2. Just because it's the default, doesn't mean that the subvolume 
can't be reassigned to a different qgroup.  This also would not remove 
the ability to assign a specific qgroup through the snapshot creation 
command.  This is arguably a general point in favor of having any 
default of course, but it's still worth pointing out.
Currently 0/N qgroups are special in that they are created automatically 
and belong to the bottom of the hierarchy. It would be very nice to keep 
it this way.


Changing qgroup assignments after snapshot creation is very inconvenient 
because it requires a quota rescan and thus blocks all other quota operations.



3. Because BTRFS has COW semantics, the new snapshot should take up 
near zero space in the qgroup of it's parent.


Indeed it works this way in my experiments if you assign the snapshot to the 
1/N qgroup at creation, where 0/N also belongs. Moreover, it does not require 
a quota rescan, which is very nice.



4. This correlates with the behavior most people expect based on ZFS 
and LVM, which is that snapshots are tied to their parent.


I'm not against tying it to the parent. I'm against removing the snapshot's 
own qgroup.



At a minimum, it should belong to _some_ qgroup.  This could also be 
covered by having a designated 'default' qgroup that all new 
subvolumes created without a specified qgroup get put in, but I feel 
that that is somewhat orthogonal to the issue of how snapshots are 
handled. 

It belongs to its own 0/N' qgroup, but this is probably not what you mean.


--

With Best Regards,
Marat Khalili



Re: Qgroups are not applied when snapshotting a subvol?

2017-03-28 Thread Marat Khalili

The default should be to inherit the qgroup of the parent subvolume.
This behaviour is only good for this particular use-case. In the general 
case, the qgroups of a subvolume and its snapshots should exist separately, and 
both can be included in some higher-level qgroup (after all, that's what 
the qgroup hierarchy is for).


In my system I found it convenient to include a subvolume and its 
snapshots in qgroup 1/N, where 0/N is the qgroup of the bare subvolume. I think 
adopting this behaviour as the default would be more sensible.
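
Concretely, the scheme looks roughly like this (a sketch; 0/257 stands for the 
subvolume's automatically created qgroup, 1/257 is the umbrella one):

$ sudo btrfs qgroup create 1/257 /mnt/pool
$ sudo btrfs qgroup assign 0/257 1/257 /mnt/pool    # the bare subvolume goes under the umbrella
$ sudo btrfs qgroup limit 100G 1/257 /mnt/pool      # (optional) limit subvolume + snapshots together
$ sudo btrfs subvolume snapshot -i 1/257 /mnt/pool/vol /mnt/pool/snaps/vol-2017-03-28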


--

With Best Regards,
Marat Khalili

On 28/03/17 14:24, Austin S. Hemmelgarn wrote:

On 2017-03-27 15:32, Chris Murphy wrote:

How about if qgroups are enabled, then non-root user is prevented from
creating new subvolumes?

Or is there a way for a new nested subvolume to be included in its
parent's quota, rather than the new subvolume having a whole new quota
limit?

Tricky problem.
The default should be to inherit the qgroup of the parent subvolume. 
The organization of subvolumes is hierarchical, and sane people expect 
things to behave as they look.  Taking another angle, on ZFS, 'nested' 
(nested in quotes because ZFS' definition of 'nested' zvols is weird) 
inherit their parent's quota and reservations (essentially reverse 
quota), and they're not even inherently nested in the filesystem like 
subvolumes are, so we're differing from the only other widely used 
system that implements things in a similar manner.


As far as the subvolume thing, there should be an option to disable 
user creation of subvolumes, and ideally it should be on by default 
because:
1. Users can't delete subvolumes by default.  This means they can 
create but not destroy a resource by default, which means that a user 
can pretty easily accidentally cause issues for the system as a whole.
2. Correlating with 1, users being able to delete subvolumes by 
default is not safe on multiple levels (easy accidental data loss, 
numerous other issues), and thus user subvolume removal being off by 
default is significantly safer.






Re: backing up a file server with many subvolumes

2017-03-27 Thread Marat Khalili
Just some consideration, since I've faced a similar but not exactly the same 
problem: use rsync, but create snapshots on the target machine. A blind rsync 
will destroy the deduplication of your snapshots and take a huge amount of 
storage, so it's not a solution. But you can rsync --inplace your 
snapshots in chronological order to some folder and re-take snapshots of 
that folder, thus recreating your snapshot structure on the target. 
Obviously, it can/should be automated.
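
Roughly like this for each source snapshot, oldest first (a sketch with 
hypothetical paths; --inplace together with --no-whole-file is what keeps 
unchanged blocks shared with the previously taken snapshot on the target, 
and current/ itself has to be a subvolume):

$ rsync -aHAX --delete --inplace --no-whole-file \
>       remote:/pool/.snapshots/vol-2017-03-25/ /backup/vol/current/
$ btrfs subvolume snapshot -r /backup/vol/current /backup/vol/vol-2017-03-25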



--

With Best Regards,
Marat Khalili

On 26/03/17 06:00, J. Hart wrote:
I have a Btrfs filesystem on a backup server.  This filesystem has a 
directory to hold backups for filesystems from remote machines. In 
this directory is a subdirectory for each machine.  Under each machine 
subdirectory is one directory for each filesystem (ex /boot, /home, 
etc) on that machine.  In each filesystem subdirectory are incremental 
snapshot subvolumes for that filesystem.  The scheme is something like 
this:


/backup/<machine>/<filesystem>/<snapshot>/

I'd like to try to back up (duplicate) the file server filesystem 
containing these snapshot subvolumes for each remote machine.  The 
problem is that I don't think I can use send/receive to do this. 
"Btrfs send" requires "read-only" snapshots, and snapshots are not 
recursive as yet.  I think there are too many subvolumes which change 
too often to make doing this without recursion practical.


Any thoughts would be most appreciated.

J. Hart





Re: partial quota rescan

2017-02-10 Thread Marat Khalili
I suddenly discovered the undocumented (in the man page or anywhere else) -i 
parameter of 'btrfs subvolume snapshot' that adds the snapshot to a qgroup 
without invalidating statistics. Amazing!


--

With Best Regards,
Marat Khalili

On 08/02/17 18:46, Marat Khalili wrote:
I'm using (well, trying to use) qgroups to keep track of storage occupied by 
snapshots. I noticed that:
a) no two rescans can run in parallel, and there's no way to schedule 
another rescan while one is running;
b) seems like it's a whole-disk operation regardless of path specified 
in CLI.


I have only just started to fill my new 24TB btrfs volume using qgroups, 
but rescans already take a long time, and due to (a) above my scripts 
each time have to wait for the previous rescan to finish. Can anything 
be done about it, like trashing and recomputing only the statistics for a 
specific qgroup?


Linux host 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 
2017 x86_64 x86_64 x86_64 GNU/Linux

btrfs-progs v4.4





partial quota rescan

2017-02-08 Thread Marat Khalili
I'm using (well, trying to use) qgroups to keep track of storage occupied by 
snapshots. I noticed that:
a) no two rescans can run in parallel, and there's no way to schedule 
another rescan while one is running;
b) seems like it's a whole-disk operation regardless of path specified 
in CLI.


I have only just started to fill my new 24TB btrfs volume using qgroups, but 
rescans already take a long time, and due to (a) above my scripts each time 
have to wait for the previous rescan to finish. Can anything be 
done about it, like trashing and recomputing only the statistics for a 
specific qgroup?


Linux host 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 
x86_64 x86_64 x86_64 GNU/Linux

btrfs-progs v4.4

--
--

With Best Regards,
Marat Khalili