Re: Bad hard drive - checksum verify failure forces readonly mount

2016-07-05 Thread Vasco Almeida
Bug reported
https://bugzilla.kernel.org/show_bug.cgi?id=121491

Thank you for helping.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Bad hard drive - checksum verify failure forces readonly mount

2016-06-27 Thread Vasco Almeida
On Sun, 26-06-2016 at 13:54 -0600, Chris Murphy wrote:
> On Sun, Jun 26, 2016 at 7:05 AM, Vasco Almeida <vascomalme...@sapo.pt
> > wrote:
> > I have tried "btrfs check --repair /device" but that does not seem
> > to do any good.
> > http://paste.fedoraproject.org/384960/66945936/
> 
> It did fix things, in particular with the snapshot that was having
> problems being dropped. But it's not enough it seems to prevent it
> from going read only.
> 
> There's more than one bug here, you might see if the repair was good
> enough that it's possible to use btrfs-image now.

File system image available at (choose one link)
https://mega.nz/#!AkAEgKyB!RUa7G5xHIygWm0ALx5ZxQjjXNdFYa7lDRHJ_sW0bWLs
https://www.sendspace.com/file/i70cft

>  If not, use
> btrfs-debug-tree  > file.txt and post that file somewhere. This
> does expose file names. Maybe that'll shed some light on the problem.
> But also worth filing a bug at bugzilla.kernel.org with this debug
> tree referenced (probably too big to attach), maybe a dev will be
> able
> to look at it and improve things so they don't fail.

Should I file a bug report with the image dump linked above, the
btrfs-debug-tree output, or both?
I think I will use the subject of this thread as the summary when filing
the bug. Can you think of something more suitable, or is that fine?

> > What else can I do, or must I rebuild the file system?
> 
> Well, it's a long shot but you could try using --repair --init-csum-tree
> which will create a new csum tree. But that applies to data, if the
> problem with it going read only is due to metadata corruption this
> won't help. And then last you could try --init-extent-tree. Thing I
> can't answer is which order to do it in.
> 
> In any case there will be files that you shouldn't trust after csum
> has been recreated, anything corrupt will now have a new csum, so you
> can get silent data corruption. It's better to just blow away this
> file system and make a new one and reinstall the OS. But if you're
> feeling brave, you can try one or both of those additional options
> and
> see if they can help.

I think I will reinstall the OS since, even if I manage to recover the
file system from this issue, that OS will be something I cannot fully
trust.


Re: Bad hard drive - checksum verify failure forces readonly mount

2016-06-26 Thread Vasco Almeida
On Sat, 25-06-2016 at 14:54 -0600, Chris Murphy wrote:
> On Sat, Jun 25, 2016 at 2:10 PM, Vasco Almeida <vascomalme...@sapo.pt
> > wrote:
> > Quoting Chris Murphy <li...@colorremedies.com>:
> > > 3. btrfs-image so that devs can see what's causing the problem
> > > that
> > > the current code isn't handling well enough.
> > 
> > 
> > btrfs-image does not create dump image:
> > 
> > # btrfs-image /dev/mapper/vg_pupu-lv_opensuse_root
> > btrfs-lv_opensuse_root.image
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > Csum didn't match
> > Error reading metadata block
> > Error adding block -5
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > Csum didn't match
> > Error reading metadata block
> > Error flushing pending -5
> > create failed (Success)
> > # echo $?
> > 1
> 
> Well it's pretty strange to have DUP metadata and for the checksum
> verify to fail on both copies. I don't have much optimism that btrfsck
> repair can fix it either. But still it's worth a shot since there's
> not much else to go on.

I have tried "btrfs check --repair /device" but that does not seem to do
any good.
http://paste.fedoraproject.org/384960/66945936/

I then issued "mount /device /mnt" and, like before, it was mounted
readwrite and then forced readonly. I got some kernel oopses and traces.

I noticed that btrfs-balance was using ~100% CPU while the btrfs device
was mounted readonly. I let it run for about 20 minutes.
Then I had to reboot because the system was not responding well: I was
unable to open or close applications or use the internet. I did SysRq+reisu
(operations were enabled) and then pressed the reset button on the computer.

Unfortunately the dmesg dumps were lost after resetting the computer.

What else can I do, or must I rebuild the file system?


Re: Bad hard drive - checksum verify failure forces readonly mount

2016-06-25 Thread Vasco Almeida

Quoting Chris Murphy <li...@colorremedies.com>:

> On Fri, Jun 24, 2016 at 6:06 PM, Vasco Almeida <vascomalme...@sapo.pt> wrote:
> > Quoting Chris Murphy <li...@colorremedies.com>:
> > dmesg http://paste.fedoraproject.org/384352/80842814/
>
> [ 1837.386732] BTRFS info (device dm-9): continuing balance
> [ 1838.006038] BTRFS info (device dm-9): relocating block group
> 15799943168 flags 34
> [ 1838.684892] BTRFS info (device dm-9): relocating block group
> 10934550528 flags 36
> [ 1839.301453] ------------[ cut here ]------------
> [ 1839.301495] WARNING: CPU: 3 PID: 76 at fs/btrfs/extent-tree.c:1625
> lookup_inline_extent_backref+0x45c/0x5a0 [btrfs]()
>
> followed by
>
> [ 1839.301797] WARNING: CPU: 3 PID: 76 at fs/btrfs/extent-tree.c:2946
> btrfs_run_delayed_refs+0x29d/0x2d0 [btrfs]()
> [ 1839.301798] BTRFS: Transaction aborted (error -5)
> [...]
> [ 1839.301972] BTRFS: error (device dm-9) in
> btrfs_run_delayed_refs:2946: errno=-5 IO failure
> [ 1839.301975] BTRFS info (device dm-9): forced readonly

> So it looks like it was resuming a balance automatically, and while
> processing delayed references it's running into something it doesn't
> expect and doesn't have a way to fix, so it goes read only to avoid
> causing more problems.
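
To make the sequence described above easier to spot in a long log, here is a
minimal sketch of a filter that keeps only the lines where btrfs reports an
error, aborts a transaction, or forces the file system readonly. The function
name `btrfs_abort_events` is hypothetical, and the pattern is based only on the
messages quoted in this thread; other kernels may word them differently.

```shell
# Hypothetical helper: print only the lines that show btrfs giving up.
# Reads a saved dmesg log from the given file(s), or stdin if none are given.
btrfs_abort_events() {
    grep -E 'BTRFS.*(Transaction aborted|forced readonly|error \(device)' "$@"
}

# Example usage:
#   dmesg | btrfs_abort_events
```

Comparing the timestamps of the first and last matching lines gives a rough
idea of how long the mount stayed readwrite before being forced readonly.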

> I would do a couple things in order:
> 1. Mount ro and copy off what you want in case the whole thing gets
> worse and can't ever be mounted again.
> 2. Mount with only these options: -o skip_balance,subvolid=5,nospace_cache


I mounted with those options; it was readwrite at first and was then
forced readonly. You can see a delay between the first BTRFS messages and
the "BTRFS info: forced readonly" message in dmesg.


/dev/mapper/vg_pupu-lv_opensuse_root on /mnt type btrfs  
(ro,relatime,seclabel,nospace_cache,skip_balance,subvolid=5,subvol=/)
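
A quick way to check the state of a mount without reading the option list by
eye is to test for the `ro` flag in a line of `mount` output. This is only a
sketch; the function name `mount_is_readonly` is made up for illustration, and
it assumes the options appear in a trailing parenthesised list as shown above.

```shell
# Hypothetical helper: succeed if a line of `mount` output lists the "ro" option.
mount_is_readonly() {
    opts=${1##*\(}      # drop everything up to the last "("
    opts=${opts%%\)*}   # drop the closing ")" and anything after it
    case ",$opts," in
        *,ro,*) return 0 ;;
        *)      return 1 ;;
    esac
}

line='/dev/mapper/vg_pupu-lv_opensuse_root on /mnt type btrfs (ro,relatime,seclabel,nospace_cache,skip_balance,subvolid=5,subvol=/)'
if mount_is_readonly "$line"; then
    echo "readonly"
else
    echo "readwrite"
fi
```

Wrapping the option list in commas before matching avoids false positives on
options that merely contain the letters "ro".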




> If it mounts rw, don't do anything with it, just see if it cleans up
> after itself. It also looks from the previous trace it was trying to
> remove a snapshot and there are complaints of problems in that
> snapshot. So hopefully just waiting 5 minutes doing nothing and it'll
> clean up after itself (you can check with top to see if there are any
> btrfs related transactions that run including the btrfs-cleaner
> process) wait until they're done.


I can see the btrfs processes, including btrfs-cleaner, but they may not
be doing much since the device was forced readonly after mounting it.



> Then umount. If you want you could have two other consoles ready
> first, one for 'journalctl -f' and another for sysrq+t to issue in
> case you get a hang. This doesn't fix anything but it collects more
> information for a bug report for the devs.
>
> Once you get it umounted normally or by force, the next thing to do is


I unmounted it normally (umount /mnt) more than 20 minutes after
mounting it.



> 3. btrfs-image so that devs can see what's causing the problem that
> the current code isn't handling well enough.


btrfs-image does not create dump image:

# btrfs-image /dev/mapper/vg_pupu-lv_opensuse_root btrfs-lv_opensuse_root.image
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
Csum didn't match
Error reading metadata block
Error adding block -5
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
Csum didn't match
Error reading metadata block
Error flushing pending -5
create failed (Success)
# echo $?
1
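
Since the same message repeats many times, it can help to collapse the output
into one row per failing block. The sketch below counts how often each block
number appears in "checksum verify failed" lines; `summarize_csum_failures` is
a hypothetical name, and the parsing assumes the block number is the word right
after "on", as in the output above.

```shell
# Hypothetical helper: print "<block> <count>" for every block that failed
# checksum verification. Reads the given file(s), or stdin if none are given.
summarize_csum_failures() {
    awk '/checksum verify failed on/ {
             for (i = 1; i < NF; i++)
                 if ($i == "on") { count[$(i + 1)]++; break }
         }
         END { for (b in count) print b, count[b] }' "$@" | sort -n
}

# Example usage (btrfs-image writes these messages to the terminal, so
# capture both streams):
#   btrfs-image /dev/mapper/vg_pupu-lv_opensuse_root out.image 2>&1 | summarize_csum_failures
```

A single block showing up in every line, as here, points at one damaged
metadata block rather than widespread corruption.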



> 4. btrfs check --repair


Did not issue this command yet.
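
For reference, the four suggested steps can be written out as a dry-run script
that only prints the commands by default. This is a sketch under assumptions:
the `run` and `recovery_plan` names, the `/backup` destination, and the use of
`cp -a` for copying are illustrative and not from the thread; the mount
options, `btrfs-image`, and `btrfs check --repair` are the ones quoted above.

```shell
# Print (DRY_RUN=1, the default here) or execute (DRY_RUN=0) a command.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recovery_plan() {
    dev=$1; mnt=$2; backup=$3; image=$4
    # 1. Mount readonly and copy data off first.
    run mount -o ro "$dev" "$mnt"
    run cp -a "$mnt/." "$backup/"   # assumption: any copy tool would do
    run umount "$mnt"
    # 2. Mount with only the suggested options, wait about 5 minutes, unmount.
    run mount -o skip_balance,subvolid=5,nospace_cache "$dev" "$mnt"
    run sleep 300
    run umount "$mnt"
    # 3. Capture a metadata image for the developers.
    run btrfs-image "$dev" "$image"
    # 4. Only then attempt the repair.
    run btrfs check --repair "$dev"
}

recovery_plan /dev/mapper/vg_pupu-lv_opensuse_root /mnt /backup btrfs-lv_opensuse_root.image
```

Keeping the repair last preserves the damaged state for the image dump, which
is what makes the bug report useful.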

dmesg http://paste.fedoraproject.org/384799/14668851/

Thank you for helping.



Re: Bad hard drive - checksum verify failure forces readonly mount

2016-06-24 Thread Vasco Almeida

Quoting Chris Murphy <li...@colorremedies.com>:

> On Fri, Jun 24, 2016 at 9:52 AM, Vasco Almeida <vascomalme...@sapo.pt> wrote:
> > > From the pasted kernel messages:
> > > > Linux version 3.18.34-std473-amd64 (root@rl-sysrcd-p11) (gcc version 4.8.5
> > > > (Gentoo 4.8.5 p1.3, pie-0.6.2) ) #2 SMP Tue May 24 20:34:19 UTC 2016
> > > 3.18.34 is ancient. Find something newer and try to remount normally.
> >
> > Present information concerns openSUSE Leap 42.1 (x86_64) mount of root file
> > system at boot time. That should mount it normally. Hope that fits what you
> > mean.
>
> OK but it's not mounting it normally, it's still being forced readonly
> at btrfs_drop_snapshot and the only thing I'm coming up with search
> wise is that it's related to qgroups. Have you enabled quotas on this
> file system ever?


Unless openSUSE does that by default, I did not enable quotas. It is  
not something I am aware of doing.




> > btrfs-progs v4.1.2+20151002
>
> A lot of changes have happened since 4.1.2 I would still use something
> newer and try to repair it.


By repair, do you mean issuing "btrfs check --repair /device"?


> > $ /usr/sbin/btrfs fi df /
> > Data, single: total=10.01GiB, used=9.06GiB
> > System, DUP: total=64.00MiB, used=16.00KiB
> > Metadata, DUP: total=1.12GiB, used=596.69MiB
> > GlobalReserve, single: total=208.00MiB, used=0.00B
> >
> > I forgot to mention in my last e-mail that I ran Marc MERLIN's scrubbing
> > script [1] after mounting the device with "-o ro,recovery" on System
> > Rescue CD. Even after that, the device is forced readonly.
>
> OK but System Rescue CD uses an old kernel by btrfs standards, even
> accounting for all the backports in that particular version:
> 4.7.3) 2016-06-04:
> Standard kernels: Long-Term-Supported linux-3.18.34 (rescue32 + rescue64)
>
> So that's why I'm suggesting you use something newer, like 4.5.x, same
> for btrfs-progs. The old versions aren't working. There's no assurance
> it'll work with new versions, but that it doesn't get fixed up with
> old versions means you either try new versions or you rebuild the file
> system. *shrug*
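
Before retrying on another rescue system, the running kernel can be compared
against the suggested 4.5 baseline. The following is a small sketch; the
function name `kernel_at_least` is hypothetical, and it relies on `sort -V`
(version sort) from GNU coreutils being available.

```shell
# Hypothetical helper: succeed if kernel release $2 is at least version $1.
kernel_at_least() {
    min=$1
    cur=${2%%-*}    # strip the distribution suffix, e.g. "-300.fc24.x86_64"
    [ "$(printf '%s\n' "$min" "$cur" | sort -V | head -n 1)" = "$min" ]
}

if kernel_at_least 4.5 "$(uname -r)"; then
    echo "kernel is 4.5 or newer"
else
    echo "kernel is older than 4.5"
fi
```

With the two kernels mentioned in this thread, 3.18.34-std473-amd64 fails the
check and 4.5.7-300.fc24.x86_64 passes it.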


I am using Fedora 24 and have issued "mount
/dev/mapper/vg_pupu-lv_opensuse_root /mnt". I got some call traces and
scary messages that I did not get before on other systems. Please check
the dmesg output linked below.


Linux catarina 4.5.7-300.fc24.x86_64 #1 SMP Wed Jun 8 18:12:45 UTC  
2016 x86_64 x86_64 x86_64 GNU/Linux

btrfs-progs v4.5.2

# btrfs fi show
Label: none  uuid: ad167e92-fbb1-4148-b54d-6345b6fb26da
	Total devices 1 FS bytes used 9.63GiB
	devid    1 size 50.00GiB used 12.32GiB path /dev/mapper/vg_pupu-lv_opensuse_root

# btrfs fi df /mnt/
Data, single: total=10.01GiB, used=9.05GiB
System, DUP: total=32.00MiB, used=16.00KiB
Metadata, DUP: total=1.12GiB, used=597.62MiB
GlobalReserve, single: total=208.00MiB, used=224.00KiB

dmesg http://paste.fedoraproject.org/384352/80842814/
dmesg after umount http://paste.fedoraproject.org/384359/14668108/
diff between two http://paste.fedoraproject.org/384364/11704146/

btrfs check --readonly /dev/mapper/vg_pupu-lv_opensuse_root
http://paste.fedoraproject.org/384361/68112421/

After unmounting and mounting again, the device was mounted readwrite
normally again:
/dev/mapper/vg_pupu-lv_opensuse_root on /mnt type btrfs
(rw,relatime,seclabel,space_cache,subvolid=259,subvol=/@/.snapshots/1/snapshot)
But trying to unmount it afterwards makes the umount command hang. The device
no longer shows in the mount output, though.

CTRL-C or SIGTERM can't kill umount.

dmesg http://paste.fedoraproject.org/384371/14668130/




> > I would like to find a solution to be able to mount normally readwrite again
> > and hopefully understand what caused the issue.
>
> My best guess is qgroup related, there were a lot of problems with
> multiple quota implementations and snapshots and openSUSE does take
> many many snapshots. So that could be it. But without a reproducer
> it's hard to say what caused it.


Thank you again for your time and reply.



Bad hard drive - checksum verify failure forces readonly mount

2016-06-23 Thread Vasco Almeida
I was running openSUSE Leap 42.1 with btrfs and
LVM (Logical Volume Management).
Last time I checked the smartd log, I noticed there were
30 sectors pending reallocation and 1 unrecoverable bad
sector on the hard drive.
I think my hard drive has some corrupted sectors and now btrfs fails
some checksum verification and forces the mount readonly.
The device is successfully mounted readonly.
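
The sector counts mentioned above can also be read straight from the SMART
attribute table rather than from the smartd log. This is a sketch, not a
definitive method: the function name `smart_raw_value` is hypothetical, it
assumes the usual ten-column table printed by `smartctl -A`, and mapping the
two numbers to the `Current_Pending_Sector` and `Offline_Uncorrectable`
attributes is an assumption about this drive.

```shell
# Hypothetical helper: print the RAW_VALUE column for one named SMART attribute.
# Expects the attribute table printed by `smartctl -A` on stdin.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Example usage (the device path is a placeholder):
#   smartctl -A /dev/sdX | smart_raw_value Current_Pending_Sector
#   smartctl -A /dev/sdX | smart_raw_value Offline_Uncorrectable
```

A non-zero raw value for either attribute suggests the drive itself is the
likely source of the checksum failures.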

openSUSE dmesg reported:

BTRFS: dm-1 checksum verify failed on 437944320 wanted 39F45669 found
8BF8C752 level 0
(2 more times)
BTRFS: error (device dm-1) in btrfs_drop_snapshot:???: error=-5 IO failure
BTRFS: info (device dm-1): forced readonly

Now I'm on System Rescue CD and that is not reported.
I wrote those log lines down on paper, so there may be some typos.
Seemingly there is no journalctl installed on this system to check
the openSUSE logs again.

All the following logs are on System Rescue CD.
mount -o ro,recovery /dev/mapper/vg_pupu-lv_opensuse_root /mnt/opensuse
https://bpaste.net/show/263e5f7ae9d4

After mounting and umounting several times with and without "-o ro,recovery"
https://bpaste.net/show/43eb64decb63

btrfs check --readonly /dev/mapper/vg_pupu-lv_opensuse_root
https://bpaste.net/show/7ecf422c73a2


Would it be appropriate to run any of "btrfs check --repair /device" or
"btrfs check --init-csum-tree /device" to be able to mount readwrite again?

smartctl --all /dev/disk/by-id/ata-SAMSUNG_HD154UI_S1Y6JDWSC01351
https://bpaste.net/show/a6c132618974

btrfs check manpage: https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-check
btrfsck page: https://btrfs.wiki.kernel.org/index.php/Btrfsck
