Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-08 Thread Duncan
Marc Lehmann posted on Wed, 06 Jun 2018 21:06:35 +0200 as excerpted:

> Not sure what exactly you mean with btrfs mirroring (there are many
> btrfs features this could refer to), but the closest thing to that that
> I use is dup for metadata (which is always checksummed), data is always
> single. All btrfs filesystems are on lvm (not mirrored), and most (but
> not all) are encrypted. One affected fs is on a hardware raid
> controller, one is on an ssd. I have a single btrfs fs in that box with
> raid1 for metadata, as an experiment, but I haven't used it for testing
> yet.

On the off chance, tho it doesn't sound like it from your description...

You're not doing LVM snapshots of the volumes with btrfs on them, 
correct?  Because btrfs depends on filesystem GUIDs being just that, 
globally unique, using them to find the possible multiple devices of a 
multi-device btrfs (normal single-device filesystems don't have the 
issue, as they don't have to deal with multi-device the way btrfs 
does).  btrfs can get very confused, with data-loss potential, if it 
sees multiple copies of a device with the same filesystem GUID, as can 
happen if lvm snapshots (which obviously have the same filesystem GUID 
as the original) are taken and both the snapshot and the source are 
exposed to btrfs device scan (which is auto-triggered by udev when the 
new device appears), with one of them mounted.

Presumably you'd consider lvm snapshotting a form of mirroring, and 
you've already said you're not doing that in any form.  But just in 
case: this is a rather obscure trap that people using lvm can find 
themselves in without a clue as to the danger, and the resulting 
symptoms can be rather hard to troubleshoot if this possibility isn't 
considered.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-06 Thread james harvey
On Wed, Jun 6, 2018 at 3:06 PM, Marc Lehmann  wrote:
> On Tue, Jun 05, 2018 at 05:52:38PM -0400, james harvey wrote:
>> >> This is not always reproducible, but when deleting our journal, creating
>> >> log messages for a few hours and then doing the above manually has a ~50%
>> >> chance of corrupting the journal.
>> ...
>>
>> My strong bet is you have a hardware issue.
>
> Strange, what kind of hardware bug would affect multiple very different
> computers in exactly the same way?

Oops.  I missed when you clearly said: "All of this is reproducible on
two different boxes, so is unlikely to be a hardware issue."  I ran
into all these problems ultimately because of a badly designed Marvell
SATA controller.  I thought I had ruled out hardware issues by having
2 identical systems, and reproducing the problem on both.  Certainly
makes a hardware issue for you much less likely, especially if "very
different computers" means different motherboards.

FWIW, I have dropped caches a lot lately (not nearly as much as your
crons) and haven't had it corrupt anything, even in proximity to heavy
I/O.

>> going bad, bad cables, bad port, etc.  My strong bet is you're also
>> using BTRFS mirroring.
>
> Not sure what exactly you mean with btrfs mirroring (there are many btrfs
> features this could refer to), but the closest thing to that that I use is
> dup for metadata (which is always checksummed), data is always single. All
> btrfs filesystems are on lvm (not mirrored), and most (but not all) are
> encrypted. One affected fs is on a hardware raid controller, one is on an
> ssd. I have a single btrfs fs in that box with raid1 for metadata, as an
> experiment, but I haven't used it for testing yet.

Was referring to any type of data mirroring.  Data dup, btrfs
RAID1/5/6/10.  But, I see that's not the case here.

>> You're describing intermittent data corruption on files that I'm
>> thinking all have NOCOW turned on.
>
> The systemd journal files are nocow (I re-enabled that after I turned it
> off for a while), but the rtorrent directory (and the files in it) are
> not.
>
> I did experiment (a year ago) with nocow for torrent files and, more
> importantly, vm images, but it didn't really solve the "millions of
> fragments slow down" problem with btrfs, so I figured I can keep them cow
> and regularly copy them to defragment them. That's why I am quite sure cow
> is switched on long before I booted my first 4.14 kernel (and it still
> is).

Yeah, with data single, you wouldn't be seeing intermittent problems
if it was related to the bugs I was talking about.

>> it's done writing to a journal file, but in a way that guarantees it
>> to fail.  This has been reported to systemd at
>> https://github.com/systemd/systemd/issues/9112 but poettering has
>
> I am aware that systemd tries to turn on nocow, and I think this is actually
> a bug, but this wouldn't have an effect on rtorrent, which has corruption
> problems on a different fs. And boy would it be wonderful if Debian switched
> away from systemd, I feel I personally ran into every single bug that
> exists...

systemd turning on NOCOW isn't a bug.  systemd 219 intentionally
turned on NOCOW for journal files, attempting to improve performance
on btrfs.  220 made it user-configurable, defaulting to turning on
NOCOW.  But, yeah, the bugs I was talking about wouldn't affect
rtorrent files on a different fs, since you have NOCOW off on them,
and since they're data single.

> However, no matter how much systemd plays with btrfs flags, it shouldn't
> corrupt data.

Yeah, it doesn't in itself.  It just makes them susceptible to on-disk
corruption that btrfs would otherwise protect against with data
checksums.  And, if using compression and btrfs replace on current
kernels, it guarantees they will be corrupted.


Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-06 Thread Marc Lehmann
On Tue, Jun 05, 2018 at 05:52:38PM -0400, james harvey wrote:
> >> This is not always reproducible, but when deleting our journal, creating
> >> log messages for a few hours and then doing the above manually has a ~50%
> >> chance of corrupting the journal.
> ...
> 
> My strong bet is you have a hardware issue.

Strange, what kind of hardware bug would affect multiple very different
computers in exactly the same way?

> going bad, bad cables, bad port, etc.  My strong bet is you're also
> using BTRFS mirroring.

Not sure what exactly you mean with btrfs mirroring (there are many btrfs
features this could refer to), but the closest thing to that that I use is
dup for metadata (which is always checksummed), data is always single. All
btrfs filesystems are on lvm (not mirrored), and most (but not all) are
encrypted. One affected fs is on a hardware raid controller, one is on an
ssd. I have a single btrfs fs in that box with raid1 for metadata, as an
experiment, but I haven't used it for testing yet.

> You're describing intermittent data corruption on files that I'm
> thinking all have NOCOW turned on.

The systemd journal files are nocow (I re-enabled that after I turned it
off for a while), but the rtorrent directory (and the files in it) are
not.

I did experiment (a year ago) with nocow for torrent files and, more
importantly, vm images, but it didn't really solve the "millions of
fragments slow down" problem with btrfs, so I figured I can keep them cow
and regularly copy them to defragment them. That's why I am quite sure cow
is switched on long before I booted my first 4.14 kernel (and it still
is).

> it's done writing to a journal file, but in a way that guarantees it
> to fail.  This has been reported to systemd at
> https://github.com/systemd/systemd/issues/9112 but poettering has

I am aware that systemd tries to turn on nocow, and I think this is actually
a bug, but this wouldn't have an effect on rtorrent, which has corruption
problems on a different fs. And boy would it be wonderful if Debian switched
away from systemd, I feel I personally ran into every single bug that
exists...

However, no matter how much systemd plays with btrfs flags, it shouldn't
corrupt data.

> The context in which I ran into this problem involved several other bugs
> interacting: "btrfs replace" has been guaranteed to corrupt
> non-checksummed (NOCOW) compressed data - a combination that shouldn't
> happen, but does in some defragmentation situations due to another bug.
> In my situation, I don't have a hardware issue.

Yeah, btrfs is full of bugs that I constantly run into, but most of them
are containable, unlike this problem, which might or might not be a
btrfs bug - especially since all your bets seem to be wrong here.

-- 
The choice of a   Deliantra, the free code+content MORPG
  -==- _GNU_  http://www.deliantra.net
  ==-- _   generation
  ---==---(_)__  __   __  Marc Lehmann
  --==---/ / _ \/ // /\ \/ /  schm...@schmorp.de
  -=/_/_//_/\_,_/ /_/\_\


Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-06 Thread Liu Bo
On Wed, Jun 6, 2018 at 9:44 PM, Chris Mason  wrote:
> On 6 Jun 2018, at 9:38, Liu Bo wrote:
>
>> On Wed, Jun 6, 2018 at 8:18 AM, Chris Mason  wrote:
>>>
>>>
>>>
>>> On 5 Jun 2018, at 16:03, Andrew Morton wrote:
>>>
>>>> (switched to email.  Please respond via emailed reply-to-all, not via the
>>>> bugzilla web interface).
>>>>
>>>> On Tue, 05 Jun 2018 18:01:36 + bugzilla-dae...@bugzilla.kernel.org
>>>> wrote:
>>>>
>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=199931
>>>>>
>>>>> Bug ID: 199931
>>>>>Summary: systemd/rtorrent file data corruption when using
>>>>> echo 3 >/proc/sys/vm/drop_caches
>>>>
>>>> A long tale of woe here.  Chris, do you think the pagecache corruption
>>>> is a general thing, or is it possible that btrfs is contributing?
>>>>
>>>> Also, that 4.4 oom-killer regression sounds very serious.
>>>
>>>
>>>
>>> This week I found a bug in btrfs file write with how we handle stable
>>> pages.
>>> Basically it works like this:
>>>
>>> write(fd, some bytes less than a page)
>>> write(fd, some bytes into the same page)
>>> btrfs prefaults the userland page
>>> lock_and_cleanup_extent_if_need()   <- stable pages
>>> wait for writeback()
>>> clear_page_dirty_for_io()
>>>
>>> At this point we have a page that was dirty and is now clean.  That's
>>> normally fine, unless our prefaulted page isn't in ram anymore.
>>>
>>> iov_iter_copy_from_user_atomic() <--- uh oh
>>>
>>> If the copy_from_user fails, we drop all our locks and retry.  But along
>>> the
>>> way, we completely lost the dirty bit on the page.  If the page is
>>> dropped
>>> by drop_caches, the writes are lost.  We'll just read back the stale
>>> contents of that page during the retry loop.  This won't result in crc
>>> errors because the bytes we lost were never crc'd.
>>>
>>
>> So we're going to carefully redirty the page under the page lock, right?
>
>
> I don't think we actually need to clean it.  We have the page locked,
> writeback won't start until we unlock.
>

My concern is that the buggy thing is similar to the compression path,
where we also did the trick of clear_page_dirty_for_io and
redirty_pages to avoid any faults wandering in and changing pages
underneath, but it seems here we're fine if pages get changed in between.

>>
>>> It could result in zeros in the file because we're basically reading a
>>> hole,
>>> and those zeros could move around in the page depending on which part of
>>> the
>>> page was dirty when the writes were lost.
>>>
>>
>> I got a question, while re-reading this page, wouldn't it read
>> old/stale on-disk data?
>
>
> If it was never written we should be treating it like a hole, but I'll
> double check.
>

Okay, I think this would also happen in the overwrite case, where
stale data lies on disk.

thanks,
liubo


Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-06 Thread Chris Mason

On 6 Jun 2018, at 9:38, Liu Bo wrote:

> On Wed, Jun 6, 2018 at 8:18 AM, Chris Mason wrote:
>> On 5 Jun 2018, at 16:03, Andrew Morton wrote:
>>
>>> (switched to email.  Please respond via emailed reply-to-all, not via the
>>> bugzilla web interface).
>>>
>>> On Tue, 05 Jun 2018 18:01:36 + bugzilla-dae...@bugzilla.kernel.org
>>> wrote:
>>>
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=199931
>>>>
>>>> Bug ID: 199931
>>>>Summary: systemd/rtorrent file data corruption when using echo
>>>> 3 >/proc/sys/vm/drop_caches
>>>
>>> A long tale of woe here.  Chris, do you think the pagecache corruption
>>> is a general thing, or is it possible that btrfs is contributing?
>>>
>>> Also, that 4.4 oom-killer regression sounds very serious.
>>
>> This week I found a bug in btrfs file write with how we handle stable
>> pages.  Basically it works like this:
>>
>> write(fd, some bytes less than a page)
>> write(fd, some bytes into the same page)
>> btrfs prefaults the userland page
>> lock_and_cleanup_extent_if_need()   <- stable pages
>> wait for writeback()
>> clear_page_dirty_for_io()
>>
>> At this point we have a page that was dirty and is now clean.  That's
>> normally fine, unless our prefaulted page isn't in ram anymore.
>>
>> iov_iter_copy_from_user_atomic() <--- uh oh
>>
>> If the copy_from_user fails, we drop all our locks and retry.  But along
>> the way, we completely lost the dirty bit on the page.  If the page is
>> dropped by drop_caches, the writes are lost.  We'll just read back the
>> stale contents of that page during the retry loop.  This won't result in
>> crc errors because the bytes we lost were never crc'd.
>
> So we're going to carefully redirty the page under the page lock, right?

I don't think we actually need to clean it.  We have the page locked,
writeback won't start until we unlock.

>> It could result in zeros in the file because we're basically reading a
>> hole, and those zeros could move around in the page depending on which
>> part of the page was dirty when the writes were lost.
>
> I got a question, while re-reading this page, wouldn't it read
> old/stale on-disk data?

If it was never written we should be treating it like a hole, but I'll
double check.

-chris


Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-06 Thread Liu Bo
On Wed, Jun 6, 2018 at 8:18 AM, Chris Mason  wrote:
>
>
> On 5 Jun 2018, at 16:03, Andrew Morton wrote:
>
>> (switched to email.  Please respond via emailed reply-to-all, not via the
>> bugzilla web interface).
>>
>> On Tue, 05 Jun 2018 18:01:36 + bugzilla-dae...@bugzilla.kernel.org
>> wrote:
>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=199931
>>>
>>> Bug ID: 199931
>>>Summary: systemd/rtorrent file data corruption when using echo
>>> 3 >/proc/sys/vm/drop_caches
>>
>>
>> A long tale of woe here.  Chris, do you think the pagecache corruption
>> is a general thing, or is it possible that btrfs is contributing?
>>
>> Also, that 4.4 oom-killer regression sounds very serious.
>
>
> This week I found a bug in btrfs file write with how we handle stable pages.
> Basically it works like this:
>
> write(fd, some bytes less than a page)
> write(fd, some bytes into the same page)
> btrfs prefaults the userland page
> lock_and_cleanup_extent_if_need()   <- stable pages
> wait for writeback()
> clear_page_dirty_for_io()
>
> At this point we have a page that was dirty and is now clean.  That's
> normally fine, unless our prefaulted page isn't in ram anymore.
>
> iov_iter_copy_from_user_atomic() <--- uh oh
>
> If the copy_from_user fails, we drop all our locks and retry.  But along the
> way, we completely lost the dirty bit on the page.  If the page is dropped
> by drop_caches, the writes are lost.  We'll just read back the stale
> contents of that page during the retry loop.  This won't result in crc
> errors because the bytes we lost were never crc'd.
>

So we're going to carefully redirty the page under the page lock, right?

> It could result in zeros in the file because we're basically reading a hole,
> and those zeros could move around in the page depending on which part of the
> page was dirty when the writes were lost.
>

I got a question, while re-reading this page, wouldn't it read
old/stale on-disk data?

thanks,
liubo

> I spent a morning trying to trigger this with drop_caches and couldn't make
> it happen, even with schedule_timeout()s inserted and other tricks.  But I
> was able to get corruptions if I manually invalidated pages in the critical
> section.
>
> I'm working on a patch, and I'll check and see if any of the other recent
> fixes Dave integrated may have a less exotic explanation.
>
> -chris
>


Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-06 Thread Michal Hocko
On Tue 05-06-18 13:03:29, Andrew Morton wrote:
[...]
> > As for why we would do something silly as dropping the caches every hour
> > (in a cronjob), we started doing this recently because after kernel 4.4,
> > we got frequent OOM kills despite having gigabytes of available memory
> > (e.g. 12GB in use, 20GB page cache and 16GB empty swap and bang, mysql
> > gets killed). We found that the debian 4.9 kernel is unusable, and 4.14
> > works, *iff* we use the above as an hourly cron job, so we did that, and
> > afterwards ran into rtorrent/journald corruption issues. Without the echo
> > in place, mysql usually gets oom-killed after a few days of uptime.

Do you have any oom reports to share?
-- 
Michal Hocko
SUSE Labs


Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-05 Thread Chris Mason




On 5 Jun 2018, at 16:03, Andrew Morton wrote:

> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> On Tue, 05 Jun 2018 18:01:36 + bugzilla-dae...@bugzilla.kernel.org
> wrote:
>
>> https://bugzilla.kernel.org/show_bug.cgi?id=199931
>>
>> Bug ID: 199931
>>Summary: systemd/rtorrent file data corruption when using echo
>> 3 >/proc/sys/vm/drop_caches
>
> A long tale of woe here.  Chris, do you think the pagecache corruption
> is a general thing, or is it possible that btrfs is contributing?
>
> Also, that 4.4 oom-killer regression sounds very serious.

This week I found a bug in btrfs file write with how we handle stable
pages.  Basically it works like this:

write(fd, some bytes less than a page)
write(fd, some bytes into the same page)
btrfs prefaults the userland page
lock_and_cleanup_extent_if_need()   <- stable pages
wait for writeback()
clear_page_dirty_for_io()

At this point we have a page that was dirty and is now clean.  That's
normally fine, unless our prefaulted page isn't in ram anymore.

iov_iter_copy_from_user_atomic() <--- uh oh

If the copy_from_user fails, we drop all our locks and retry.  But along
the way, we completely lost the dirty bit on the page.  If the page is
dropped by drop_caches, the writes are lost.  We'll just read back the
stale contents of that page during the retry loop.  This won't result in
crc errors because the bytes we lost were never crc'd.

It could result in zeros in the file because we're basically reading a
hole, and those zeros could move around in the page depending on which
part of the page was dirty when the writes were lost.

I spent a morning trying to trigger this with drop_caches and couldn't
make it happen, even with schedule_timeout()s inserted and other tricks.
But I was able to get corruptions if I manually invalidated pages in
the critical section.

I'm working on a patch, and I'll check and see if any of the other
recent fixes Dave integrated may have a less exotic explanation.

-chris


Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-05 Thread james harvey
On Tue, Jun 5, 2018 at 4:03 PM, Andrew Morton  wrote:
> On Tue, 05 Jun 2018 18:01:36 + bugzilla-dae...@bugzilla.kernel.org wrote:
>
>> https://bugzilla.kernel.org/show_bug.cgi?id=199931
>>
>> Bug ID: 199931
>>Summary: systemd/rtorrent file data corruption when using echo
>> 3 >/proc/sys/vm/drop_caches
>
> A long tale of woe here.  Chris, do you think the pagecache corruption
> is a general thing, or is it possible that btrfs is contributing?
...
>> We found that
>>
>>echo 3 >/proc/sys/vm/drop_caches
>>
>> causes file data corruption. We found this because we saw systemd journal
>> corruption (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=897266) and
>> tracked this to a cron job dropping caches every hour. The filesystem in
>> use is btrfs, but I don't know if it only happens with this filesystem.
>> btrfs scrub reports no problems, so this is not filesystem metadata
>> corruption.
...
>> This is not always reproducible, but when deleting our journal, creating
>> log messages for a few hours and then doing the above manually has a ~50%
>> chance of corrupting the journal.
...

This sounds closely related to what Qu Wenruo (as the BTRFS expert and
patch writer) and I (from a reporter and research standpoint) have
been working on, but with a different twist.

My strong bet is you have a hardware issue.  Something like a drive
going bad, bad cables, bad port, etc.  My strong bet is you're also
using BTRFS mirroring.

You're describing intermittent data corruption on files that I'm
thinking all have NOCOW turned on.  On BTRFS, journald turns on NOCOW
for its journal files.  It makes an attempt to turn COW back on when
it's done writing to a journal file, but in a way that is guaranteed
to fail.  This has been reported to systemd at
https://github.com/systemd/systemd/issues/9112 but poettering has
expressed the desire to leave it the way it is rather than fix it.
(Granted the situation is going to be improved in the context of the
compression/replace bugs described below, by submitted patches, but
leaving the situation of other on-disk data corruption.)  My bet is
your torrent downloads also have NOCOW turned on.

When NOCOW is turned on, BTRFS also stops performing checksumming of
the data.  (Associated metadata is still checksummed.)

If your BTRFS volume uses mirroring, and you have corruption on one
mirror but not the other, you will get correct or corrupted data
pseudo-randomly depending on which disk is read from.

If your BTRFS volume doesn't use mirroring, then if it's a new file
still in the cache, it won't be corrupted, and after dropping the
cache and re-reading it, if you have a hardware issue, you'll be
reading a corrupted copy.  But, I suspect you are using mirroring, or
else you'd probably be getting unfixable checksum errors on COW files
as well.

Where with checksums and mirroring BTRFS would automatically recognize
a bad read, try the other mirror, and correct the bad copy, with NOCOW
on, even with mirroring, BTRFS has no way to know the data read is
corrupted.

The context in which I ran into this problem involved several other bugs
interacting: "btrfs replace" has been guaranteed to corrupt
non-checksummed (NOCOW) compressed data - a combination that shouldn't
happen, but does in some defragmentation situations due to another bug.
In my situation, I don't have a hardware issue.



If you're using BTRFS mirroring, there's an easy way for you to see if
I'm right.  Additions to btrfs-tools are in the works to detect this,
but you can manually do it in the meantime.

Run "filefrag -v <file>".

This isn't the ideal tool for the job (btrfs-debug-tree is) but it
will more quickly show you the starting block number and length of
blocks for each extent of your file.

For each extent line listed, run 2 commands: "btrfs-map-logical -l
<4096 * first (starting) physical_offset number> -b <4096 * length> -c
1 -o <output name>.1"; and the same but ending "-c 2 -o
<output name>.2".

So, if filefrag shows:
0: 0.. 23: 1201616.. 1201639: 24: last,shared,eof

You'd run (again, for each extent line, with appropriate -l and -b
values and output file name):
btrfs-map-logical -l 4921819136 -b 98304 -c 1 -o 4921819136.1
btrfs-map-logical -l 4921819136 -b 98304 -c 2 -o 4921819136.2

(If you are using BTRFS compression, and a flags column includes
"encoded", you want to use "-b 4096" because filefrag doesn't report
the proper ending physical_offset and length in this situation, and
they're always 4096 bytes.)

This will read each of the extents in your file from both mirrored
copies, and write them to separate files.

Then compare each set of .1 and .2 files.

They should never be different.  If they are, for one reason or
another, your mirrored copies differ, and you've found why dropping
cache causes an intermittent problem.

Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-05 Thread Andrew Morton
On Wed, 6 Jun 2018 06:22:25 +0900 Tetsuo Handa wrote:

> On 2018/06/06 5:03, Andrew Morton wrote:
> > 
> > (switched to email.  Please respond via emailed reply-to-all, not via the
> > bugzilla web interface).
> > 
> > On Tue, 05 Jun 2018 18:01:36 + bugzilla-dae...@bugzilla.kernel.org 
> > wrote:
> > 
> >> https://bugzilla.kernel.org/show_bug.cgi?id=199931
> >>
> >> Bug ID: 199931
> >>Summary: systemd/rtorrent file data corruption when using echo
> >> 3 >/proc/sys/vm/drop_caches
> > 
> > A long tale of woe here.  Chris, do you think the pagecache corruption
> > is a general thing, or is it possible that btrfs is contributing?
> 
> According to the timestamps of my testcases, I was observing a
> corrupted-bytes issue upon OOM-kill (without using btrfs) as of 2017 Aug 11.
> Thus, I don't think that this is specific to btrfs. But I can't find which
> patch fixed this issue.

That sounds different.  Here, the corruption is caused by performing
drop_caches, not by an oom-killing.



Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-05 Thread Tetsuo Handa
On 2018/06/06 5:03, Andrew Morton wrote:
> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> On Tue, 05 Jun 2018 18:01:36 + bugzilla-dae...@bugzilla.kernel.org wrote:
> 
>> https://bugzilla.kernel.org/show_bug.cgi?id=199931
>>
>> Bug ID: 199931
>>Summary: systemd/rtorrent file data corruption when using echo
>> 3 >/proc/sys/vm/drop_caches
> 
> A long tale of woe here.  Chris, do you think the pagecache corruption
> is a general thing, or is it possible that btrfs is contributing?

According to the timestamps of my testcases, I was observing a corrupted-bytes
issue upon OOM-kill (without using btrfs) as of 2017 Aug 11. Thus, I don't
think that this is specific to btrfs. But I can't find which patch fixed this
issue.


#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/stat.h>

#define NUMTHREADS 512
#define STACKSIZE 8192

static int pipe_fd[2] = { EOF, EOF };
static int file_writer(void *i)
{
	char buffer[4096] = { };
	int fd;

	snprintf(buffer, sizeof(buffer), "/tmp/file.%lu", (unsigned long) i);
	fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
	memset(buffer, 0xFF, sizeof(buffer));
	read(pipe_fd[0], buffer, 1);
	while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer));
	return 0;
}

int main(int argc, char *argv[])
{
	char *buf = NULL;
	unsigned long size;
	unsigned long i;
	char *stack;

	if (pipe(pipe_fd))
		return 1;
	stack = malloc(STACKSIZE * NUMTHREADS);
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);

		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	for (i = 0; i < NUMTHREADS; i++)
		if (clone(file_writer, stack + (i + 1) * STACKSIZE,
			  CLONE_THREAD | CLONE_SIGHAND | CLONE_VM | CLONE_FS |
			  CLONE_FILES, (void *) i) == -1)
			break;
	close(pipe_fd[1]);
	/* Will cause OOM due to overcommit; if not, use SysRq-f */
	for (i = 0; i < size; i += 4096)
		buf[i] = 0;
	kill(-1, SIGKILL);
	return 0;
}



#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
	char buffer2[64] = { };
	int ret = 0;
	int i;

	for (i = 0; i < 1024; i++) {
		int flag = 0;
		int fd;
		unsigned int byte[256];
		int j;

		snprintf(buffer2, sizeof(buffer2), "/tmp/file.%u", i);
		fd = open(buffer2, O_RDONLY);
		if (fd == EOF)
			continue;
		lseek(fd, -4096, SEEK_END);
		memset(byte, 0, sizeof(byte));
		while (1) {
			static unsigned char buffer[1048576];
			int len = read(fd, (char *) buffer, sizeof(buffer));

			if (len <= 0)
				break;
			for (j = 0; j < len; j++)
				if (buffer[j] != 0xFF)
					byte[buffer[j]]++;
		}
		close(fd);
		for (j = 0; j < 255; j++)
			if (byte[j]) {
				printf("ERROR: %u %u in %s\n", byte[j], j, buffer2);
				flag = 1;
			}
		if (flag == 0)
			unlink(buffer2);
		else
			ret = 1;
	}
	return ret;
}



Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-05 Thread Andrew Morton


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Tue, 05 Jun 2018 18:01:36 + bugzilla-dae...@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=199931
> 
> Bug ID: 199931
>Summary: systemd/rtorrent file data corruption when using echo
> 3 >/proc/sys/vm/drop_caches

A long tale of woe here.  Chris, do you think the pagecache corruption
is a general thing, or is it possible that btrfs is contributing?

Also, that 4.4 oom-killer regression sounds very serious.

>Product: Memory Management
>Version: 2.5
> Kernel Version: 4.14.33
>   Hardware: All
> OS: Linux
>   Tree: Mainline
> Status: NEW
>   Severity: normal
>   Priority: P1
>  Component: Other
>   Assignee: a...@linux-foundation.org
>   Reporter: bugzilla.kernel@plan9.de
> Regression: No
> 
> We found that
> 
>echo 3 >/proc/sys/vm/drop_caches
> 
> causes file data corruption. We found this because we saw systemd journal
> corruption (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=897266) and
> tracked this to a cron job dropping caches every hour. The filesystem in
> use is btrfs, but I don't know if it only happens with this filesystem.
> btrfs scrub reports no problems, so this is not filesystem metadata
> corruption.
> 
> Basically:
> 
># journalctl --verify
>[everything fine at this point]
># echo 3 >/proc/sys/vm/drop_caches
># journalctl --verify
>[journalctl now reporting corruption problems]
> 
> This is not always reproducible, but when deleting our journal, creating log
> messages for a few hours and then doing the above manually has a ~50% chance
> of corrupting the journal.
> 
> After investigating we found that rtorrent also suffers from corrupted
> downloads when using the above echo - basically, downloading torrents is fine,
> except when executing the above echo a few times during a download, after
> which rtorrent very likely reports a failed hash check.
> 
> All of this is reproducible on two different boxes, so is unlikely to be a
> hardware issue.
> 
> On one affected server we have over 50TB of files, many that have been created
> with the cronjob in place, and none of them are corrupted (we have md5sums of
> everything), so it seems to be related to something that systemd and rtorrent
> do, rather than a generic file corruption issue.
> 
> I also was able to "cmp -l" two corrupted files with their correct version,
> and the corruption manifests itself as streaks of ~100-3000 zero bytes
> instead of the real data. The start offset seems random, but the end offset
> seems to be always aligned to a 4K offset - speculating without the
> hindrance of knowledge, this feels like a race somewhere between writing to
> a mmapped area and freeing it, or so.
> 
> Here is the output of cmp -l between a working and a corrupted file, for two
> files:
> 
> http://data.plan9.de/01.cmp.txt
> http://data.plan9.de/02.cmp.txt
> 
> We also have a mysql database with hundreds of gigabytes of writes per day on
> one server which also does not seem to suffer from any corruption.
> 
> As for why we would do something silly as dropping the caches every hour (in a
> cronjob), we started doing this recently because after kernel 4.4, we got
> frequent OOM kills despite having gigabytes of available memory (e.g. 12GB in
> use, 20GB page cache and 16GB empty swap and bang, mysql gets killed). We
> found that the debian 4.9 kernel is unusable, and 4.14 works, *iff* we use
> the above as an hourly cron job, so we did that, and afterwards ran into
> rtorrent/journald corruption issues. Without the echo in place, mysql usually
> gets oom-killed after a few days of uptime.
> 
> -- 
> You are receiving this mail because:
> You are the assignee for the bug.