Re: defragmenting best practice?

2017-11-01 Thread Sean Greenslade
On Tue, Oct 31, 2017 at 05:47:54PM -0400, Dave wrote:
> I'm following up on all the suggestions regarding Firefox performance
> on BTRFS. 
>
> 
>
> 5. Firefox profile sync has not worked well for us in the past, so we
> don't use it.
> 6. Our machines generally have plenty of RAM so we could put the
> Firefox cache (and maybe profile) into RAM using a technique such as
> https://wiki.archlinux.org/index.php/Firefox/Profile_on_RAM. However,
> profile persistence is important.

> 4. Put the Firefox cache in RAM
> 
> 5. If needed, consider putting the Firefox profile in RAM

Have you looked into profile-sync-daemon?

https://wiki.archlinux.org/index.php/profile-sync-daemon

It basically does the "keep the profile in RAM but also sync it to HDD"
for you. I've used it for years, it works quite well.
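
If memory serves, setup on a systemd machine is roughly this; the unit
and config names are from the Arch package, so check the wiki page
above if yours differ:

psd                               # first run writes ~/.config/psd/psd.conf
$EDITOR ~/.config/psd/psd.conf    # list the browsers / profiles to manage
systemctl --user enable --now psd.service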

--Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: defragmenting best practice?

2017-09-21 Thread Sean Greenslade
On September 19, 2017 11:38:13 PM PDT, Dave  wrote:
>>On Thu 2017-08-31 (09:05), Ulli Horlacher wrote:
> 
>Here's my scenario. Some months ago I built an over-the-top powerful
>desktop computer / workstation and I was looking forward to really
>fantastic performance improvements over my 6 year old Ubuntu machine.
>I installed Arch Linux on BTRFS on the new computer (on an SSD). To my
>shock, it was no faster than my old machine. I focused a lot on
>Firefox performance because I use Firefox a lot and that was one of
>the applications in which I was most looking forward to better
>performance.
>
> 
>
>What would you guys do in this situation?

Check out profile sync daemon:

https://wiki.archlinux.org/index.php/profile-sync-daemon

It keeps the active profile files in a ramfs, periodically syncing them back to 
disk. It works quite well on my 7 year old netbook.
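
To sanity-check that it's active, something like this should show the
profile directory sitting on tmpfs (profile directory name hypothetical):

findmnt -T ~/.mozilla/firefox/xyz123.default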

--Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS converted from EXT4 becomes read-only after reboot

2017-05-08 Thread Sean Greenslade
On Mon, May 08, 2017 at 12:41:11PM -0400, Austin S. Hemmelgarn wrote:
> Send/receive is not likely to transfer the problem unless it has something
> to do with how things are reflinked.  Receive operates by recreating the
> sent subvolume from userspace using regular commands and the clone ioctls,
> so it won't replicate any low-level structural issues in the filesystem
> unless they directly involve the way extents are being shared (or are a side
> effect of that).  On top of that, if there is an issue on the sending side,
> send itself will probably not send that data, so it's actually only
> marginally more dangerous than using something like rsync to copy the data.

True, but my goal was to eliminate as many btrfs variables as I could.
To answer the original question, I used rsync to copy the data and
attributes (something like rsync -aHXp --numeric-ids) from a live CD to
an external hard drive (formatted ext4), then ran mkfs.btrfs on the
original partition, then re-ran the rsync in the opposite direction. It
worked quite well for me, and the problem hasn't resurfaced.
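
In outline, it was something along these lines (device names
hypothetical, flags from memory):

# from the live CD:
mount /dev/sdXn /mnt/old          # the ailing btrfs
mount /dev/sdY1 /mnt/ext          # external ext4 drive
rsync -aHXp --numeric-ids /mnt/old/ /mnt/ext/home-copy/
umount /mnt/old
mkfs.btrfs -f /dev/sdXn
mount /dev/sdXn /mnt/old
rsync -aHXp --numeric-ids /mnt/ext/home-copy/ /mnt/old/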

--Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS converted from EXT4 becomes read-only after reboot

2017-05-08 Thread Sean Greenslade
On May 8, 2017 11:28:42 AM EDT, Sanidhya Solanki  wrote:
>On Mon, 8 May 2017 10:16:44 -0400
>Alexandru Guzu  wrote:
> 
>> Sean, how would you approach the copy of the data back and forth if
>> the OS is on it? Would a Send-receive and then back work?
>
>You could use a Live-USB and then just dd it to remote or attached
>storage, if
>you want to be absolutely sure the data is not affected.

I would not suggest either of those. Send / receive might work, but since we 
don't know the source of the problem, it risks transferring the problem. DD 
would not solve the problem at all, since we're trying to rebuild the 
partition, not clone it.

--Sean


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS converted from EXT4 becomes read-only after reboot

2017-05-07 Thread Sean Greenslade
On May 3, 2017 4:28:11 PM EDT, Alexandru Guzu  wrote:
>Hi all,
>
>In a VirtualBox VM, I converted a EXT4 fs to BTRFS that is now running
>on Ubuntu 16.04 (Kernel 4.4.0-72). I was able to use the system for
>several weeks. I even did kernel updates, compression, deduplication
>without issues.
>
>Since today, a little while after booting (usually when I start
>opening applications), the FS becomes read-only. IMO, the only thing
>that might have changed that could affect this is a kernel upgrade.
>However, since the FS becomes RO, I cannot easily do a kernel update
>(I guess I'd have to chroot and do it).
>
>The stack trace is the following:
>
>[   88.836057] [ cut here ]
>[   88.836082] WARNING: CPU: 0 PID: 25 at
>/build/linux-wXdoVv/linux-4.4.0/fs/btrfs/inode.c:2931
>btrfs_finish_ordered_io+0x63b/0x650 [btrfs]()
>[   88.836083] BTRFS: Transaction aborted (error -95)
>[   88.836084] Modules linked in: nvram msr zram lz4_compress
>vboxsf(OE) joydev crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
>aesni_intel snd_intel8x0 aes_x86_64 snd_ac97_codec lrw gf128mul
>glue_helper ablk_helper cryptd ac97_bus snd_pcm snd_seq_midi
>snd_seq_midi_event snd_rawmidi snd_seq vboxvideo(OE) snd_seq_device
>snd_timer ttm input_leds drm_kms_helper serio_raw i2c_piix4 snd drm
>fb_sys_fops syscopyarea sysfillrect sysimgblt soundcore 8250_fintek
>vboxguest(OE) mac_hid parport_pc ppdev lp parport autofs4 btrfs xor
>raid6_pq hid_generic usbhid hid psmouse ahci libahci fjes video
>pata_acpi
>[   88.836116] CPU: 0 PID: 25 Comm: kworker/u2:1 Tainted: G
>OE   4.4.0-72-generic #93-Ubuntu
>[   88.836117] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
>VirtualBox 12/01/2006
>[   88.836130] Workqueue: btrfs-endio-write btrfs_endio_write_helper
>[btrfs]
>[   88.836132]  0286 618d3a00 88007cabbc78
>813f82b3
>[   88.836134]  88007cabbcc0 c01912a8 88007cabbcb0
>81081302
>[   88.836135]  880058bf01b0 8800355f2800 88007b9e64e0
>88007c66f098
>[   88.836137] Call Trace:
>[   88.836142]  [] dump_stack+0x63/0x90
>[   88.836145]  [] warn_slowpath_common+0x82/0xc0
>[   88.836147]  [] warn_slowpath_fmt+0x5c/0x80
>[   88.836159]  [] ? unpin_extent_cache+0x8f/0xe0
>[btrfs]
>[   88.836167]  [] ? btrfs_free_path+0x26/0x30
>[btrfs]
>[   88.836178]  []
>btrfs_finish_ordered_io+0x63b/0x650 [btrfs]
>[   88.836188]  [] finish_ordered_fn+0x15/0x20
>[btrfs]
>[   88.836200]  []
>btrfs_scrubparity_helper+0xca/0x2f0 [btrfs]
>[   88.836202]  [] ?
>__raw_callee_save___pv_queued_spin_unlock+0x11/0x20
>[   88.836214]  [] btrfs_endio_write_helper+0xe/0x10
>[btrfs]
>[   88.836217]  [] process_one_work+0x165/0x480
>[   88.836219]  [] worker_thread+0x4b/0x4c0
>[   88.836220]  [] ? process_one_work+0x480/0x480
>[   88.836222]  [] ? process_one_work+0x480/0x480
>[   88.836224]  [] kthread+0xd8/0xf0
>[   88.836225]  [] ?
>kthread_create_on_node+0x1e0/0x1e0
>[   88.836229]  [] ret_from_fork+0x3f/0x70
>[   88.836230]  [] ?
>kthread_create_on_node+0x1e0/0x1e0
>[   88.836232] ---[ end trace f4b8dbb54f0db139 ]---
>[   88.836234] BTRFS: error (device sda1) in
>btrfs_finish_ordered_io:2931: errno=-95 unknown
>[   88.836236] BTRFS info (device sda1): forced readonly
>[   88.836392] pending csums is 118784
>
>If I reboot with the Live CD version, I can run `scrub` and `check`
>without any issues. Also the FS stays read-write.
>
>Is this a known issue? Could it be due to the conversion, or can this
>happen on any system on any time. This is a test VM that I can
>re-make, but I would like to know if this can happen on a production
>system.
>
>Regards,
>Alex.
>--
>To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
>in
>the body of a message to majord...@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html

Just FYI, I ran into basically this same issue a while back. My solution
was to copy all the data off it, re-run mkfs.btrfs, then copy all the
data back. I found that none of the existing data was damaged, so I
think the actual issue is something left over from the conversion that
confuses the commit logic.

--Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is btrfs-convert able to deal with sparse files in a ext4 filesystem?

2017-04-01 Thread Sean Greenslade
On Sat, Apr 01, 2017 at 11:48:50AM +0200, Kai Herlemann wrote:
> Hi,
> I have on my ext4 filesystem some sparse files, mostly images from
> ext4 filesystems.
> Is btrfs-convert (4.9.1) able to deal with sparse files or can that
> cause any problems?

From personal experience, I would recommend not using btrfs-convert on
ext4 partitions. I attempted it on a /home partition on one of my
machines, and while it did succeed in converting, the fs it produced had
weird issues that caused transaction failures and thus semi-frequent
remount-ro. Btrfs-check, scrub, and balance were all unable to repair
the damage. I ended up recreating the partition from a backup.

As far as I know, there were no sparse files on this partition, either.

Just my one data point, for whatever it's worth.

--Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Cross-subvolume rename behavior

2017-03-23 Thread Sean Greenslade
On Thu, Mar 23, 2017 at 07:23:40AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-03-23 06:09, Hugo Mills wrote:
> >Direct rename (using rename(2)) isn't possible across subvols,
> > which is what the EXDEV result indicates. The solution is exactly what
> > mv does, which is reflink-and-delete (which is cheaper than
> > copy-and-delete, because no data is moved). In theory, you probably
> > could implement rename across subvolumes in the FS, but it would just
> > be moving the exact same operations from userspace to kernel space.
> >
> Doing so though would have the advantage that it theoretically could be made
> (almost) atomic like a normal rename is, whereas the fallback in mv is
> absolutely not atomic.
> > 
> >I think that the solution here is for the sshfs stack to be fixed
> > so that it passes the EXDEV up to the mv command properly, and passes
> > the subsequent server-side copy (reflink) back down correctly.
> 
> This would be wonderful in theory, but it can't pass down the reflink,
> because the SFTP protocol (which is what sshfs uses) doesn't even have the
> concept of reflinks, so implementing this completely would require a
> revision to the SFTP protocol, which I don't see as likely to happen.

Thanks for the insights, Hugo and Austin. That's more or less what I was
expecting, so I guess my next stop will be with the openssh folks.
Though I was looking over the SFTP protocol, and none of the SSH_FX_*
errors seem suitable for this purpose. This might require a workaround
in sshfs...

--Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Cross-subvolume rename behavior

2017-03-23 Thread Sean Greenslade
Hello, all. I'm currently tracking down the source of some strange
behavior in my setup. I recognize that this isn't strictly a btrfs
issue, but I figured I'd start at the bottom of the stack and work my
way up.

I have a server with a btrfs filesystem on it that I remotely access on
several systems via an sshfs mount. For the most part this works
perfectly, but I just discovered that moving files between subvolumes on
that mount fails with a confusing "Operation not permitted" error.

After doing some digging, it turns out it's not actually a permissions
error. If I do the same operation locally on the server, it succeeds,
but an strace of the mv reveals that the rename() syscall returns EXDEV.
The mv util takes this as a sign to fall back on the copy-and-delete
routine, so the move succeeds. Unfortunately, it seems that somewhere in
sshfs, sftp, or fuse, the EXDEV is getting turned into a generic
failure, which mv apparently interprets as "permission denied".
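
For reference, the local strace looked roughly like this (paths
hypothetical, output from memory):

$ strace mv /mnt/subvolA/somefile /mnt/subvolB/somefile 2>&1 | grep rename
rename("/mnt/subvolA/somefile", "/mnt/subvolB/somefile") = -1 EXDEV (Invalid cross-device link)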

So my question for the btrfs devs: is rename()-ing across subvolumes
not feasible, or is it simply that no one has implemented it yet?

Thanks for any insights,

--Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Incremental send robustness question

2016-10-16 Thread Sean Greenslade
On October 14, 2016 12:43:03 AM EDT, Duncan <1i5t5.dun...@cox.net> wrote:
>I see the specific questions have been answered, and alternatives
>explored in one direction, but I've another alternative, in a different
>direction, to suggest.
>
>First a disclaimer.  I'm a btrfs user/sysadmin and regular on the list,
>but I'm not a dev, and my own use-case doesn't involve send/receive, so
>what I know regarding send/receive is from the list and manpages, not
>personal experience.  With that in mind...
>
>It's worth noting that send/receive are subvolume-specific -- a send
>won't continue down into a subvolume.
>
>Also note that in addition to -p/parent, there's -c/clone-src.  The
>latter is more flexible than the super-strict parent option, at the
>expense of a fatter send-stream, as additional metadata is sent that
>specifies which clone the instructions are relative to.
>
>It should be possible to use the combination of these two facts to
>split and recombine your send stream in a firewall-timeout-friendly
>manner, as long as no individual files are so big that sending an
>individual file exceeds the timeout.
>
>1) Start by taking a read-only snapshot of your intended source
>subvolume, so you have an unchanging reference.
>
>2) Take multiple writable snapshots of it, and selectively delete
>subdirs (and files if necessary) from each writable snapshot, trimming
>each one to a size that should pass the firewall without interruption,
>so that the combination of all these smaller subvolumes contains the
>content of the single larger one.
>
>3) Take read-only snapshots of each of these smaller snapshots,
>suitable for sending.
>
>4) Do a non-incremental send of each of these smaller snapshots to the
>remote.
>
>If it's practical to keep the subvolume divisions, you can simply split
>the working tree into subvolumes and send those individually instead of
>doing the snapshot splitting above, in which case you can then use
>-p/parent on each as you were trying to do on the original, and you can
>stop here.
>
>If you need/prefer the single subvolume, continue...
>
>5) Do an incremental send of the original full snapshot, using multiple
>-c options to list each of the smaller snapshots.  Since all the data
>has already been transferred in the smaller snapshot sends, this send
>should be all metadata, no actual data.  It'll simply be combining the
>individual reference subvolumes into a single larger subvolume once
>again.
>
>6) Once you have the single larger subvolume on the receive side, you
>can delete the smaller snapshots, as you now have a copy of the larger
>subvolume on each side to do further incremental sends of the working
>copy against.
>
>7) I believe the first incremental send of the full working copy
>against the original larger snapshot will still have to use -c, while
>incremental sends based on that first one will be able to use the
>stricter but slimmer send-stream -p, with each one then using the
>previous one as the parent.  However, I'm not sure on that.  It may be
>that you have to continue using the fatter send-stream -c each time.
>
>Again, I don't have send/receive experience of my own, so hopefully
>someone who does can reply, either confirming that this should work and
>whether or not -p can be used after the initial setup, or explaining
>why the idea won't work.  But at this point, based on my own
>understanding, it seems like it should be perfectly workable to me. =:^)

I was considering doing something like this, but the simple solution of "just 
bring the disk over" won out. If that hadn't been possible, I might have done 
something like that, and I'm still mulling over possible solutions to similar / 
related problems.

I think the biggest solution would be support for partial / resuming receives. 
That'll probably go on my ever-growing list of things to possibly look into 
when I happen upon some free time. It sounds quite complicated, though...

--Sean


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Incremental send robustness question

2016-10-12 Thread Sean Greenslade
On Thu, Oct 13, 2016 at 01:14:51AM +0200, Hans van Kranenburg wrote:
> On 10/13/2016 12:29 AM, Sean Greenslade wrote:
> > Hi, all. I have a question about a backup plan I have involving
> > send/receive. As far as I can tell, there's no way to resume a send
> > that has been interrupted. In this case, my interruption comes from an
> > overbearing firewall that doesn't like long-lived connections. I'm
> > trying to do the initial (non-incremental) sync of the first snapshot
> > from my main server to my backup endpoint. The snapshot is ~900 GiB, and
> > the internet link is 25 Mbps, so this'll be going for quite a long time.
> 
> You can't resume an interrupted send. You'll have to remove the target
> subvolume on the destination and start again.
> 
> Pipe the send into a local file, and then use any tool that can reliably
> resume interrupted transfers to get it to the other side.
> 
> Or, if faster, put in on a disk and drive there with your car. :)

I may just end up doing that. Hugo's response gave me some crazy ideas
involving a custom build of split that waits for a command after each
output file fills, which would of course require an equally weird build
of cat that would stall the pipe indefinitely until all the files showed
up. Driving the HDD over would probably be a little simpler. =P
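
For anyone else stuck behind a similar firewall, the stage-it-locally
approach Hans describes would look something like this (host and paths
hypothetical):

btrfs send /pool/snapshots/base > /var/tmp/base.send
rsync --partial --append-verify /var/tmp/base.send backup:/var/tmp/
ssh backup 'btrfs receive /backup/snapshots < /var/tmp/base.send'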

> > And while we're at it, what are the failure modes for incremental sends?
> > Will it throw an error if the parents don't match, or will there just be
> > silent failures?
> 
> Create a list of possibilities, create some test filesystems, try it.

I may just do that, presuming I can find the spare time. Given that I'm
building a backup solution around this tech, it would definitely bolster
my confidence in it if I knew what its failure modes looked like.

Thanks to everyone for your fast responses.

--Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Incremental send robustness question

2016-10-12 Thread Sean Greenslade
Hi, all. I have a question about a backup plan I have involving
send/receive. As far as I can tell, there's no way to resume a send
that has been interrupted. In this case, my interruption comes from an
overbearing firewall that doesn't like long-lived connections. I'm
trying to do the initial (non-incremental) sync of the first snapshot
from my main server to my backup endpoint. The snapshot is ~900 GiB, and
the internet link is 25 Mbps, so this'll be going for quite a long time.

What I would like to do is "fake" the first snapshot transfer by
rsync-ing the files over. So my question is this: if I rsync a subvolume
(with the -a option to make all file times, permissions, ownerships,
etc. the same), is that good enough to then be used as a parent for
future incremental sends?

And while we're at it, what are the failure modes for incremental sends?
Will it throw an error if the parents don't match, or will there just be
silent failures? I would imagine receive would barf if it was told to
reference a file that didn't exist, but what if the referenced file is
there but contains different data? Are there checks for this sort of
thing, or is it always assumed that the parent subvols are identical and
if they're not, you're in undefined behavior land?

Thanks,

--Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is stability a joke? (wiki updated)

2016-09-19 Thread Sean Greenslade
On Mon, Sep 19, 2016 at 12:08:55AM -0400, Zygo Blaxell wrote:
> 
> At the end of the day I'm not sure fsck really matters.  If the filesystem
> is getting corrupted enough that both copies of metadata are broken,
> there's something fundamentally wrong with that setup (hardware bugs,
> software bugs, bad RAM, etc) and it's just going to keep slowly eating
> more data until the underlying problem is fixed, and there's no guarantee
> that a repair is going to restore data correctly.  If we exclude broken
> hardware, the only thing btrfs check is going to repair is btrfs kernel
> bugs...and in that case, why would we expect btrfs check to have fewer
> bugs than the filesystem itself?

I see btrfs check as having a very useful role: fixing known problems
introduced by previous versions of kernel / progs. In my ext conversion
thread, I seem to have discovered a problem introduced by convert,
balance, or defrag. The data and metadata seem to be OK, however the
filesystem cannot be written to without btrfs falling over. If this was
caused by some edge-case data in the btrfs partition, it makes a lot
more sense to have btrfs check repair it than it does to modify the
kernel code to work around this and possibly many other bugs. The upshot
to this is that since (potentially all of) the data is intact, a
functional btrfs check would save me the hassle of restoring from
backup.

--Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Post ext3 conversion problems

2016-09-19 Thread Sean Greenslade
On Mon, Sep 19, 2016 at 02:30:28PM +0800, Qu Wenruo wrote:
> All chunks are completed convert to DUP, no small chunk, all to its maximum
> chunk size.
> So from chunk level, nothing related to convert yet.
> 
> But for extent tree, I found several extents are heavily referred to.
> Like extent 158173081600 or 183996522496.
> 
> If you're not using off-band dedupe, then it's quite possible that's the
> remaining structure of convert.

I never ran any sort of dedup on this partition.

> Not pretty sure if it's related to the bug, but did you do the
> balance/defrag operation just after removing ext_save subvolume?

That's quite possible. I did it in a live boot, so I don't have the bash
history to check. I checked it just now using "btrfs subvol list -d",
and there's nothing listed. I ran a full balance after that, but the
problem remains. So whatever the problem is, it can survive a full
balance after the ext_save subvol is completely deleted.

--Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Post ext3 conversion problems

2016-09-18 Thread Sean Greenslade
On Mon, Sep 19, 2016 at 10:20:37AM +0800, Qu Wenruo wrote:
> 
> -95 is -EOPNOTSUPP.
> 
> Not a common errno in btrfs.
> 
> Most EOPNOTSUPP are related to discard and crapped fallcate/drop extents.
> 
> Then are you using discard mount option?

I did indeed have the discard mount option enabled. I tried booting with
discard disabled, but the same problem appeared.

> 
> Normally a btrfs-debug-tree would help in most case, but this time it seems
> to be a runtime scrub bug other than on-disk metadata corruption.
> 
> What I can see here is, with all your operation, your fs should be a normal
> btrfs, other than converted one.
> 
> To confirm my idea, would you please upload the following things if your
> filesystem is not too large?
> 
> # btrfs-debug-tree -t extent 
> # btrfs-debug-tree -t chunk 
> # btrfs-debug-tree -t dev 
> 
> There is no file/dir name/data contained in the dump. So it's just
> chunk/extent allocation info.
> You could upload them at ease.
> 
> > Not a mess, I think it's a good bug report. I think Qu and David know
> > more about the latest iteration of the convert code. If you can wait
> > until next week at least to see if they have questions that'd be best.
> > If you need to get access to the computer sooner than later I suggest
> > btrfs-image -c9 -t4 -s to make a filename sanitized copy of the
> > filesystem metadata for them to look at, just in case. They might be
> > able to figure out the problem just from the stack trace, but better
> > to have the image before blowing away the file system, just in case
> > they want it.
> 
> Yes, btrfs-image dump would be the best.
> Although sanitizing may takes a long time and the output may be too large.

I had posted a btrfs-image before. It was run with a single -s flag:

http://phead.us/tmp/sgreenslade_home_sanitized_2016-09-16.btrfs

Here's the debug tree data:

http://phead.us/tmp/wheatley_chunk_2016-09-18.dump.gz
http://phead.us/tmp/wheatley_extent_2016-09-18.dump.gz
http://phead.us/tmp/wheatley_dev_2016-09-18.dump.gz

Thanks,

--Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Post ext3 conversion problems

2016-09-16 Thread Sean Greenslade
On Fri, Sep 16, 2016 at 07:27:58PM -0700, Liu Bo wrote:
> Interesting, seems that we get errors from 
> 
> btrfs_finish_ordered_io
>   insert_reserved_file_extent
> __btrfs_drop_extents
> 
> And splitting an inline extent throws -95.

Heh, you beat me to the draw. I was just coming to the same conclusion
myself from poking at the source code. What's interesting is that it
seems to be a quite explicit thing:

	/* in __btrfs_drop_extents(): splitting an inline extent
	 * is refused outright with -EOPNOTSUPP (i.e. -95) */
	if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
		ret = -EOPNOTSUPP;
		break;
	}

So now the question is why is this happening? Clearly the presence of
inline extents isn't an issue by itself, since another one of my btrfs
/home partitions has plenty of them.

I added some debug prints to my kernel to catch the inode that tripped
the error. Here's the relevant chunk (with filenames scrubbed) from
btrfs-debug-tree:

Inode 140345 triggered the transaction abort.

leaf 175131459584 items 51 free space 7227 generation 118521 owner 5
fs uuid 1d9ee7c7-d13a-4c3c-b730-256c70841c5b
chunk uuid b67a1a82-ff22-48b5-af1b-9d5f85ebee25
item 0 key (140343 INODE_ITEM 0) itemoff 16123 itemsize 160
inode generation 1 transid 1 size 180 nbytes 0
block group 0 mode 40755 links 1 uid 1000 gid 1000
rdev 0 flags 0x0(none)
item 1 key (140343 INODE_REF 131327) itemoff 16107 itemsize 16
inode ref index 199 namelen 6 name: 
item 2 key (140343 DIR_ITEM 1073386496) itemoff 16072 itemsize 35
location key (142600 INODE_ITEM 0) type SYMLINK
namelen 5 datalen 0 name: 
item 3 key (140343 DIR_ITEM 1148422723) itemoff 16037 itemsize 35
location key (142601 INODE_ITEM 0) type SYMLINK
namelen 5 datalen 0 name: 
item 4 key (140343 DIR_ITEM 2415965623) itemoff 16004 itemsize 33
location key (131550 INODE_ITEM 0) type SYMLINK
namelen 3 datalen 0 name: 
item 5 key (140343 DIR_ITEM 2448077466) itemoff 15965 itemsize 39
location key (140565 INODE_ITEM 0) type FILE
namelen 9 datalen 0 name: 
item 6 key (140343 DIR_ITEM 2566671093) itemoff 15930 itemsize 35
location key (140564 INODE_ITEM 0) type SYMLINK
namelen 5 datalen 0 name: 
item 7 key (140343 DIR_ITEM 3391512089) itemoff 15873 itemsize 57
location key (142599 INODE_ITEM 0) type FILE
namelen 27 datalen 0 name: 
item 8 key (140343 DIR_ITEM 3621719155) itemoff 15838 itemsize 35
location key (131627 INODE_ITEM 0) type SYMLINK
namelen 5 datalen 0 name: 
item 9 key (140343 DIR_ITEM 3701680574) itemoff 15798 itemsize 40
location key (142603 INODE_ITEM 0) type FIFO
namelen 10 datalen 0 name: 
item 10 key (140343 DIR_ITEM 3816117430) itemoff 15763 itemsize 35
location key (140563 INODE_ITEM 0) type SYMLINK
namelen 5 datalen 0 name: 
item 11 key (140343 DIR_ITEM 4214885080) itemoff 15729 itemsize 34
location key (131544 INODE_ITEM 0) type SYMLINK
namelen 4 datalen 0 name: 
item 12 key (140343 DIR_ITEM 4253409616) itemoff 15687 itemsize 42
location key (140352 INODE_ITEM 0) type FILE
namelen 12 datalen 0 name: 
item 13 key (140343 DIR_INDEX 2) itemoff 15653 itemsize 34
location key (131544 INODE_ITEM 0) type SYMLINK
namelen 4 datalen 0 name: 
item 14 key (140343 DIR_INDEX 3) itemoff 15620 itemsize 33
location key (131550 INODE_ITEM 0) type SYMLINK
namelen 3 datalen 0 name: 
item 15 key (140343 DIR_INDEX 4) itemoff 15585 itemsize 35
location key (131627 INODE_ITEM 0) type SYMLINK
namelen 5 datalen 0 name: 
item 16 key (140343 DIR_INDEX 5) itemoff 15543 itemsize 42
location key (140352 INODE_ITEM 0) type FILE
namelen 12 datalen 0 name: 
item 17 key (140343 DIR_INDEX 6) itemoff 15508 itemsize 35
location key (140563 INODE_ITEM 0) type SYMLINK
namelen 5 datalen 0 name: 
item 18 key (140343 DIR_INDEX 7) itemoff 15473 itemsize 35
location key (140564 INODE_ITEM 0) type SYMLINK
namelen 5 datalen 0 name: 
item 19 key (140343 DIR_INDEX 8) itemoff 15434 itemsize 39
location key (140565 INODE_ITEM 0) type FILE
namelen 9 datalen 0 name: 
item 20 key (140343 DIR_INDEX 9) itemoff 15377 itemsize 57
location key (142599 INODE_ITEM 0) type FILE
namelen 27 datalen 0 name: 
item 21 key (140343 DIR_INDEX 10) itemoff 15342 itemsize 35
location key (142600 INODE_ITEM 0) type SYMLINK
namelen 5 datalen 0 name: 
item 22 key (140343 DIR_INDEX 

Re: Post ext3 conversion problems

2016-09-16 Thread Sean Greenslade
On Fri, Sep 16, 2016 at 05:45:59PM -0600, Chris Murphy wrote:
> On Fri, Sep 16, 2016 at 5:25 PM, Sean Greenslade
> <s...@seangreenslade.com> wrote:
> 
> > In the mean time, is there any way to make the kernel more verbose about
> > btrfs errors? It would be nice to see, for example, what was in the
> > transaction that failed, or at least what files / metadata it was
> > touching.
> 
> No idea. Maybe one of the compile time options:
> 
> 
> CONFIG_BTRFS_FS_CHECK_INTEGRITY=y
> This also requires mount options, either check_int or check_int_data
> CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> CONFIG_BTRFS_DEBUG=y
> https://patchwork.kernel.org/patch/846462/
> CONFIG_BTRFS_ASSERT=y
> 
> Actually, even before that maybe if you did a 'btrfs-debug-tree /dev/sdX'
> 
> That might explode in the vicinity of the problem. Thing is, btrfs
> check doesn't see anything wrong with the metadata, so chances are
> debug-tree won't either.

Hmm, I'll probably have a go at compiling the latest mainline kernel
with CONFIG_BTRFS_DEBUG enabled. It certainly can't hurt to try.

And as you suspected, btrfs-debug-tree didn't explode / error out on me.
I didn't thoroughly inspect the output (as I have very little
understanding of the btrfs internals), but it all seemed OK.

--Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Post ext3 conversion problems

2016-09-16 Thread Sean Greenslade
On Fri, Sep 16, 2016 at 02:23:44PM -0600, Chris Murphy wrote:
> Not a mess, I think it's a good bug report. I think Qu and David know
> more about the latest iteration of the convert code. If you can wait
> until next week at least to see if they have questions that'd be best.
> If you need to get access to the computer sooner than later I suggest
> btrfs-image -c9 -t4 -s to make a filename sanitized copy of the
> filesystem metadata for them to look at, just in case. They might be
> able to figure out the problem just from the stack trace, but better
> to have the image before blowing away the file system, just in case
> they want it.

I can hang on to the system in its current state, I don't particularly
need this machine fully operational.

Just to be proactive, I ran the btrfs-image as follows:

btrfs-image -c9 -t4 -s -w /dev/sda2 dumpfile

http://phead.us/tmp/sgreenslade_home_sanitized_2016-09-16.btrfs

In the mean time, is there any way to make the kernel more verbose about
btrfs errors? It would be nice to see, for example, what was in the
transaction that failed, or at least what files / metadata it was
touching.

--Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Post ext3 conversion problems

2016-09-16 Thread Sean Greenslade
Hi, all. I've been playing around with an old laptop of mine, and I
figured I'd use it as a learning / bugfinding opportunity. Its /home
partition was originally ext3. I have a full partition image of this
drive as a backup, so I can do (and have done) potentially destructive
things. The system disk is a ~6 year old SSD.

To start, I rebooted to a livedisk (Arch, kernel 4.7.2 w/progs 4.7.1)
and ran a simple btrfs-convert on it. After patching up the fstab and
rebooting, everything seemed fine. I deleted the recovery subvol, ran a
full balance, ran a full defrag, and rebooted again. I then decided to
try (as an experiment) using DUP mode for data and metadata. I ran that
balance without issue, then started using the machine. Sometime later, I
got the following remount ro:
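
Reconstructing the commands from memory (mount point and the recovery
subvolume name may not be exact; I believe convert names it ext2_saved
by default):

btrfs-convert /dev/sda2
# after patching fstab and rebooting into the new fs:
btrfs subvolume delete /home/ext2_saved
btrfs balance start /home
btrfs filesystem defragment -r /home
# later, the DUP experiment:
btrfs balance start -dconvert=dup -mconvert=dup /home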

[ 7316.764235] [ cut here ]
[ 7316.764292] WARNING: CPU: 2 PID: 14196 at fs/btrfs/inode.c:2954 
btrfs_finish_ordered_io+0x6bc/0x6d0 [btrfs]
[ 7316.764297] BTRFS: Transaction aborted (error -95)
[ 7316.764301] Modules linked in: fuse sha256_ssse3 sha256_generic hmac drbg 
ansi_cprng ctr ccm joydev mousedev uvcvideo videobuf2_vmalloc videobuf2_memops 
videobuf2_v4l2 videobuf2_core videodev media crc32c_generic iTCO_wdt btrfs 
iTCO_vendor_support arc4 xor ath9k raid6_pq ath9k_common ath9k_hw ath mac80211 
snd_hda_codec_realtek snd_hda_codec_generic psmouse input_leds coretemp 
snd_hda_intel led_class pcspkr snd_hda_codec cfg80211 snd_hwdep snd_hda_core 
snd_pcm lpc_ich snd_timer atl1c rfkill snd soundcore shpchp intel_agp wmi 
thermal fjes battery evdev ac tpm_tis mac_hid tpm sch_fq_codel vboxnetflt(O) 
vboxnetadp(O) pci_stub vboxpci(O) vboxdrv(O) loop sg acpi_cpufreq ip_tables 
x_tables ext4 crc16 jbd2 mbcache sd_mod serio_raw atkbd libps2 ahci libahci 
uhci_hcd libata scsi_mod ehci_pci ehci_hcd usbcore
[ 7316.764434]  usb_common i8042 serio i915 video button intel_gtt i2c_algo_bit 
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm
[ 7316.764462] CPU: 2 PID: 14196 Comm: kworker/u8:11 Tainted: G   O
4.7.3-5-ck #1
[ 7316.764467] Hardware name: ASUSTeK Computer INC. 1015PEM/1015PE, BIOS 0903 11/08/2010
[ 7316.764507] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
[ 7316.764513]  0286 6101f47d 8800230dbc78 
812f0215
[ 7316.764522]  8800230dbcc8  8800230dbcb8 
8107ae6f
[ 7316.764530]  0b8a0035 88007791afa8 8800751d9000 
880014101d40
[ 7316.764538] Call Trace:
[ 7316.764551]  [] dump_stack+0x63/0x8e
[ 7316.764560]  [] __warn+0xcf/0xf0
[ 7316.764567]  [] warn_slowpath_fmt+0x61/0x80
[ 7316.764605]  [] ? unpin_extent_cache+0xa2/0xf0 [btrfs]
[ 7316.764640]  [] ? btrfs_free_path+0x26/0x30 [btrfs]
[ 7316.764677]  [] btrfs_finish_ordered_io+0x6bc/0x6d0 [btrfs]
[ 7316.764715]  [] finish_ordered_fn+0x15/0x20 [btrfs]
[ 7316.764753]  [] btrfs_scrubparity_helper+0x7e/0x360 [btrfs]
[ 7316.764791]  [] btrfs_endio_write_helper+0xe/0x10 [btrfs]
[ 7316.764799]  [] process_one_work+0x1ed/0x490
[ 7316.764806]  [] worker_thread+0x49/0x500
[ 7316.764813]  [] ? process_one_work+0x490/0x490
[ 7316.764820]  [] kthread+0xda/0xf0
[ 7316.764830]  [] ret_from_fork+0x1f/0x40
[ 7316.764838]  [] ? kthread_worker_fn+0x170/0x170
[ 7316.764843] ---[ end trace 90f54effc5e294b0 ]---
[ 7316.764851] BTRFS: error (device sda2) in btrfs_finish_ordered_io:2954: 
errno=-95 unknown
[ 7316.764859] BTRFS info (device sda2): forced readonly
[ 7316.765396] pending csums is 9437184

After seeing this, I decided to attempt a repair (confident that I could
restore from backup if it failed). At the time, I was unaware of the
issues with progs 4.7.1, so when I ran the check and saw all the
incorrect backrefs messages, I figured that was my problem and ran the
--repair. Of course, this didn't make the messages go away on subsequent
checks, so I looked further and found this bug:

https://bugzilla.kernel.org/show_bug.cgi?id=155791

I updated progs to 4.7.2 and re-ran the --repair (I didn't save any of
the logs from these, unfortunately). The repair seemed to work (I also
used --init-extent-tree), as current checks don't report any errors.

The system boots and mounts the FS just fine. I can read from it all
day, scrubs complete without failure, but just using the system for a
while will eventually trigger the same "Transaction aborted (error -95)"
error.

I realize this is something of a mess, and that I was less than
methodical with my actions so far. Given that I have a full backup that
can be restored if need be (and I certainly could try running the
convert again), what is my best course of action?

Thanks,

--Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: does btrfs-receive use/compare the checksums from the btrfs-send side?

2016-08-30 Thread Sean Greenslade
On Sun, Aug 28, 2016 at 10:25:32PM +0200, Christoph Anton Mitterer wrote:
> On Sun, 2016-08-28 at 22:19 +0200, Adam Borowski wrote:
> > Transports over which you're likely to send a filesystem stream
> > already
> > protect against corruption.
> Well... in some cases,... but not always... just consider a plain old
> netcat...

Netcat uses TCP by default, so you get checksumming, retransmission,
and in-order delivery there: effectively a guaranteed-correct stream
transfer.
--Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


btrfs fi usage bug during shrink

2016-07-31 Thread Sean Greenslade
Hi, all. I was resizing (shrinking) a btrfs partition, and figured I'd
check in on how it was going with "btrfs fi usage." It was quite
startling:

$ sudo btrfs fi usage /mnt/

Overall:
    Device size:                 370.00GiB
    Device allocated:            372.03GiB
    Device unallocated:           16.00EiB
    Device missing:                  0.00B
    Used:                        360.56GiB
    Free (estimated):                0.00B  (min: 8.00EiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              224.00MiB  (used: 0.00B)

Data,single: Size:370.02GiB, Used:359.31GiB
   /dev/mapper/c1  370.02GiB

Metadata,DUP: Size:1.00GiB, Used:639.22MiB
   /dev/mapper/c1    2.00GiB

System,DUP: Size:8.00MiB, Used:64.00KiB
   /dev/mapper/c1   16.00MiB

Unallocated:
   /dev/mapper/c1   16.00EiB


It's reasonably obvious what's going on, here. The overall size has been
set to the final size, and now the worker is going through balancing all
the chunks that are now out of bounds. 
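
For context, the shrink in flight was just an ordinary resize down to
the target size shown above:

btrfs filesystem resize 370g /mnt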

I feel like "fi usage" should probably have some logic to detect this
situation and report something more sensible. Thankfully, it's only
transient, and returns to normal once the resize completes.

Thanks,

--Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID1 disk upgrade method

2016-02-13 Thread Sean Greenslade
On Thu, Jan 28, 2016 at 01:47:36PM -0500, Sean Greenslade wrote:
> OK, I just misunderstood how that syntax worked. All seems good now.
> I'll try to play around with some dummy configurations this weekend to
> see if I can reproduce the post-replace mount bug.

So I finally got some time to play with this, and I am entirely unable
to reproduce these errors with virtual loop disks. I'm going to chalk
these errors up to transient SATA nastiness, since that's happened on
this system before. Either way, there was no data loss during this
entire operation, so besides a few extra unplanned reboots, things went
extremely well. Excellent work on btrfs, devs, and thanks to everyone
who chimed in to help me. 

--Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid1 vs raid5

2016-01-06 Thread Sean Greenslade
On Tue, Jan 05, 2016 at 05:24:31PM +0100, Psalle wrote:
> Hello all and excuse me if this is a silly question. I looked around in the
> wiki and list archives but couldn't find any in-depth discussion about this:
> 
> I just realized that, since raid1 in btrfs is special (meaning only two
> copies in different devices), the effect in terms of resilience achieved
> with raid1 and raid5 are the same: you can lose one drive and not lose data.
> 
> So!, presuming that raid5 were at the same level of maturity, what would be
> the pros/cons of each mode?

This is true for "classic" RAID: assume you have 3x 1TB disks. RAID1
will give you 1.5TB, whereas RAID5 will give you 2TB.

RAID1 = 1/2 total disk space (assuming equally-sized disks)
RAID5 = (N-1)*single disk space (same assumption)

> As a corollary, I guess that if raid1 is considered a good compromise, then
> functional equivalents to raid6 and beyond could simply be implemented as
> "storing n copies in different devices", dropping any complex parity
> computations and making this mode entirely generic.

This is akin to what has been mentioned on the list earlier as "N-way
mirroring" and I agree that it will be very nice to have once
implemented. However it is not the same as RAID5/6 since the parity
schemes are used to get more usable storage than just simple mirroring
would allow for.

Thus, the main pro of RAID5/6 is more usable storage, and the main con
is more computational complexity (and thus more cpu requirements, slower
access time, more fragile error states, etc.)
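
For concreteness, both layouts are just profile choices at mkfs time
(or convertible later via balance); a sketch with hypothetical devices:

mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd
mkfs.btrfs -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd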

> Since this seems pretty obvious, I'd welcome your insights on what are
> the things I'm missing, since it doesn't exist (and it isn't planned
> to be this way, AFAIK). I can foresee consistency difficulties, but
> that seems hardly insurmountable if its being done for raid1?

Fixing an inconsistency in RAID1 is much easier than RAID5/6. No math,
just checking csums. Fixing an inconsistency in RAID5/6 involves busting
out the parity math. This is why repairing RAID5/6 only became possible
in btrfs relatively recently. Generating the parity data was relatively
easy, but rebuilding missing data with it was a more difficult task.

--Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Potential to loose data in case of disk failure

2015-11-11 Thread Sean Greenslade
On Wed, Nov 11, 2015 at 11:30:57AM -0600, Jim Murphy wrote:
> Hi all,
> 
> What am I missing or misunderstanding?  I have a newly
> purchased laptop I want/need to multi boot different OSs
> on.  As a result after partitioning I have ended up with two
> partitions on each of the two internal drives(sda3, sda8,
> sdb3 and sdb8).  FWIW, sda3 and sdb3 are the same size
> and sda8 and sdb8 are the same size.  As an end result
> I want one btrfs raid1 filesystem.  For lack of better terms,
> sda3 and sda8 "concatenated" together, sdb3 and sdb8
> "concatenated" together and then mirroring "sda" to "sdb"
> using only btrfs.  So far have found no use-case to cover
> this.
> 
> If I create a raid1 btrfs volume using all 4 "devices" as I
> understand it I would loose data if I were to loose a drive
> because two mirror possibilities would be:
> 
> sda3 mirrored to sda8
> sdb3 mirrored to sdb8
> 
> Is what I want to do possible without using MD-RAID and/or
> LVM?  If so would someone point me to the documentation
> I missed.  For whatever reason, I don't want to believe that
> this can't be done.  I want to believe that the code in btrfs
> is smart enough to know that sda3 and sda8 are on the same
> drive and would not try to mirror data between them except in
> a test setup.  I hope  I just missed some documentation,
> somewhere.
> 
> Thanks in advance for your help.  And last but not least,
> thanks to all for your work on btrfs.
> 
> Jim

That's a pretty unusual setup, so I'm not surprised there's no quick and
easy answer. The best solution in my opinion would be to shuffle your
partitions around and combine sda3 and sda8 into a single partition.
There's generally no reason to present btrfs with two different
partitions on the same disk.

If there's something that prevents you from doing that, you may be able
to use RAID10 or RAID6 somehow. I'm not really sure, though, so I'll
defer to others on the list for implementation details.

--Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-09-17 Thread Sean Greenslade
On Thu, Sep 17, 2015 at 07:56:08PM +0200, Gert Menke wrote:
> MD+LVM is very close to what I want, but md has no way to cope with silent
> data corruption. So if I'd want to use a guest filesystem that has no
> checksums either, I'm out of luck.
> I'm honestly a bit confused here - isn't checksumming one of the most
> obvious things to want in a software RAID setup? Is it a feature that might
> appear in the future? Maybe I should talk to the md guys...

MD is emulating hardware RAID. In hardware RAID, you are doing
work at the block level. Block-level RAID has no understanding of the
filesystem(s) running on top of it. Therefore it would have to checksum
groups of blocks, and store those checksums on the physical disks
somewhere, perhaps by keeping some portion of the drive for itself. But
then this is not very efficient, since it is maintaining checksums for
data that may be useless (blocks the FS is not currently using). So then
you might make the RAID filesystem-aware...and you now have BTRFS RAID.

Simply put, the block level is probably not an appropriate place for
checksumming to occur. BTRFS can make checksumming work much more
effectively and efficiently by doing it at the filesystem level.

--Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Metadata about to fill up, how to make it bigger next time?

2015-03-25 Thread Sean Greenslade
On Wed, Mar 25, 2015 at 04:45:09PM -0700, Anand Patil wrote:
> Hi everyone,
> 
> When I run btrfs fi df /path/to/fs, I see:
> 
> Data, single: total=53.01GiB, used=51.79GiB
> System, DUP: total=32.00MiB, used=16.00KiB
> Metadata, DUP: total=16.00GiB, used=14.72GiB
> 
> My most pressing question is, does that metadata line really mean that
> the filesystem is going to become unusable soon?

No. If you total up your allocations, you will notice that you have
~69GiB allocated, far shy of your total 1000GiB available. BTRFS
allocates in chunks as needed, so once it needs more space for metadata,
it will allocate some. You only get into trouble if there is no free
space to allocate from, and you are far from having that issue.
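
If your btrfs-progs is new enough, "btrfs filesystem usage" is an
easier read than "fi df" here, since it reports the unallocated pool
directly:

sudo btrfs filesystem usage /path/to/fs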

--Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is it necessary to balance a btrfs raid1 array?

2014-09-10 Thread Sean Greenslade
On Wed, Sep 10, 2014 at 08:43:25PM +0200, Goffredo Baroncelli wrote:
> May be that I am missing something obvious, however I have to ask what
> would be the purpose of balancing a two-disk RAID1 system.
> The balance command should move the data between the disks in order to
> avoid some disk being full and another empty; but this assumes that
> there is a non-symmetrical use of the disks. Which is not the case for
> a RAID1/two-disk system.

Balancing is not necessarily about data distribution between two disks.
You can balance a single disk BTRFS partition. It's more about balancing
how the data / metadata chunks are allocated and used. It also (during a
re-write of a chunk) honors the RAID rules of that chunk type.
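
You rarely need a full balance for that, either; the usage filters let
you compact just the mostly-empty chunks, e.g. something like:

btrfs balance start -dusage=50 -musage=50 /mnt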

> *scrub
> Regarding scrub, pay attention that some (consumer) disks are
> guaranteed for a (non-recoverable) error rate less than 1/10^14 [1]
> bit reads. 10^14 bits is something like 10TB. This means that if you
> read your system 5 times, you may get an error bit. I suppose
> that these are very conservative numbers, so the likelihood of an
> undetected error is (I hope) lower. But also I am inclined to think
> these numbers are evaluated in an ideal case (in terms of temperature,
> voltage, vibration); this means that the truth might be worse.
> 
> So if you compare these numbers with your average throughput,
> you can estimate the likelihood of an error. Pay attention
> that a scrub job means reading all your data: if you have 1T of data,
> and you perform a scrub each week, in three months you reach the 10^14
> bit reads.
> 
> This explains the interest in higher redundancy levels (raid 6 or more).
> 
> G.Baroncelli

I think there is a bit of misunderstanding here. Those disk error rates
are latent media errors. They're a function of production quality of the
platters and the amount of time the data rests on the drive. Reads do
not affect this, and in fact, can actually help reduce the error rate. 

When a hard drive does a read, it also reads the CRC values for the
sector that it just read. If it matches, the drive passes it on as good
data. If not, it attempts error correction on it. If it can correct the
error, it will return the corrected data and (hopefully) re-write the
data on the disk to "fix" the error permanently. I use quotes because
this could mean that that zone of media is damaged, and it will probably
error again. The disk will eventually re-allocate a sector that
repeatedly returns bad data. This is what you want to happen.

So doing reads, especially across the entire media surface, is a great
way to make the disk perform these sector checks. But sometimes the disk
cannot correct the error. Then the controller (if it is well-behaved)
will return a read error, or sometimes just bunk data. If the BTRFS
scrub sees bad data, it will detect it with its checksums, and if in a
RAID configuration, be able to locate a good copy of the data to
restore. 

Long story short, reads don't cause media errors, and scrubs help detect
errors early.
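
And a scrub is cheap to kick off:

btrfs scrub start /mnt
btrfs scrub status /mnt    # progress and error counts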

--Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is it necessary to balance a btrfs raid1 array?

2014-09-10 Thread Sean Greenslade
On Thu, Sep 11, 2014 at 12:28:56AM +0200, Goffredo Baroncelli wrote:
> The WD datasheet says something different. It reports "Non-recoverable
> read errors per bits read" as less than 1/10^14. They express the
> number of errors in terms of number of bits read.
> 
> You instead are saying that the error depends on the disk age.
> 
> These two sentences are very different.
> 
> (and of course all these values depend also on the product quality).

I'm not certain how those specs are determined. I was basing my
statements on knowledge of how read errors occur in rotating media.

> I think that there are two sources of error:
> - platter/disk degradation (due to ageing, wearing...), which may
> require a sector relocation
> - other sources of error which are not permanent and that may be
> corrected by a 2nd read
> 
> I don't have any idea about which one is bigger (even if I suspect the
> second).

They are both the same, generally. If the sector is damaged (e.g.
manufacturing fault), then it can do several things. It can always
return bad data, which will result in a reallocation. It can also
partially fail. For example, accept the data, but slowly lose it over
some period of time. It's still due to bad media, but if you were to
read it quickly enough, you may be able to catch it before it goes bad.
If the drive catches (and re-writes) it, then it may have staved off
losing that data that time around. 

>> So doing reads, especially across the entire media surface, is a great
>> way to make the disk perform these sector checks. But sometimes the
>> disk cannot correct the error.
> 
> I read this as: the error rate is greater than 1/10^14, but the CRC,
> multiple readings, and sector remapping lower the error rate below
> 1/10^14.
> 
> If behind this there is a dumb drive which returns an error as soon as
> the CRC doesn't match, or a smart drive which retries several times
> until it gets a good value, doesn't matter: the error rate is still
> 1/10^14.

Yes, the error rate is almost entirely determined by the manufacturing
of the physical media. Controllers can attempt to work around that, but
they won't go searching for media defects on their own (at least, I've
never seen a drive that does.)

>> Long story short, reads don't cause media errors, and scrubs help
>> detect errors early.
> 
> Nobody said that a read causes a media error; however, assuming (this
> is how I read the WD datasheet) that the error rate is constant, if you
> increase the number of reads then you get more errors.
> 
> Maybe I was not clear; however, I didn't want to say that scrubbing
> reduces the life of the disk. I wanted to point out that the size of
> the disk and the error rate are becoming comparable.

I know that wasn't your implication, but I wanted to be sure that things
weren't misinterpreted. I'll clarify:

Disks have latent errors. Nothing you can do will change this, and the
number of reads you do will not affect the error rate of the media. It
_will_ affect how often those errors are detected, however. And with
btrfs, this is a Good Thing(TM). If errors are found, they can be
corrected by either the disk controller itself (on the block level) or
the filesystem on its level. 

Scrub your disks, folks. A scrubbed disk is a happy disk.

--Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is it necessary to balance a btrfs raid1 array?

2014-09-10 Thread Sean Greenslade
On Wed, Sep 10, 2014 at 11:51:19PM -0400, Zygo Blaxell wrote:
> This is a complex topic.

I agree, and I make no claim to be an expert in any of this.

> Some disks have bugs in their firmware, and some of those bugs make the
> data sheets and most of this discussion entirely moot.  The firmware is
> gonna do what the firmware's gonna do.

Agreed. That's why I like that fact that btrfs provides another layer of
error checking / correction.

> It's a bad idea to try to rewrite a fading sector in some cases.
> If the drive is located in a climate-controlled data center then it
> should be OK; however, there are multiple causes of read failure and
> some of them will also cause writes to damage adjacent data on the disk.
> Spinning disks stop being able to position their heads properly around
> -10C or so, a fact that will be familiar to anyone who's tried to use a
> laptop outside in winter.  Maybe someone dropped the computer, and the
> read errors are due to the heads vibrating with the shock--a read retry
> a few milliseconds later would be OK, but a rewrite (without a delay,
> so the heads are still vibrating from the shock) would just wipe out
> some nearby data with no possibility of recovery.

Of course, the drive can't always know what's going on outside. It just
tries its best (we hope). 

> Most of the reallocations I've observed in the field happen when a
> sector is written, not read.

Very true. I believe what happens is that a sector is marked for
re-allocation when the read fails, and a write to that sector will
trigger the actual reallocation. Hence the pending reallocations SMART
attribute.

> Most disks can search for defects on their own, but the host has to
> issue a SMART command to initiate such a search.  They will also track
> defect rates and log recent error details (with varying degrees of
> bugginess).

And again, it's up to the questionable firmware's discretion as to how
that search is done / how thorough it is. And it has to be triggered by
the user / a script. I don't consider that to really be "on its own," as
btrfs scrub requires the same level of input/scripting.

> smartmontools is your friend.  It's not a replacement for btrfs scrub,
> but it collects occasionally useful complementary information about the
> health of the drive.

I can't find the link, but there was a study done that shows an
alarmingly high percentage of disk failures showed no SMART errors
before failing. 
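
Still, the useful bits are easy to collect, something like:

smartctl -a /dev/sdX        # attributes, error log, self-test log
smartctl -t long /dev/sdX   # kick off a full surface self-test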

> There used to be a firmware feature for drives to test themselves
> whenever they are spinning and idle for four continuous hours, but most
> modern disks will power themselves down if they are idle for much less
> time...and who has a disk that's idle for four hours at a time anyway?  ;)

My backup destination is touched once a day. It averages about 20 hours
a day idle. Though it probably doesn't need to be testing itself 80% of
the time. That would be a mite excessive =P

>> Scrub your disks, folks. A scrubbed disk is a happy disk.
> 
> Seconded.  Also remember that not all storage errors are due to disk
> failure.  There's a lot of RAM, high-speed signalling, and wire between
> the host CPU and a disk platter.  SMART self-tests won't detect failures
> in those, but scrubs will.

But we'll save the ECC RAM discussion for another day, perhaps.

--Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html