Re: reproducible builds with btrfs seed feature

2018-10-16 Thread Anand Jain




On 10/17/2018 03:49 AM, Chris Murphy wrote:

On Tue, Oct 16, 2018 at 2:13 AM, Anand Jain  wrote:



On 10/14/2018 06:28 AM, Chris Murphy wrote:


Is it practical and desirable to make Btrfs based OS installation
images reproducible? Or is Btrfs simply too complex and
non-deterministic? [1]

The main three problems with Btrfs right now for reproducibility are:
a. many objects have uuids other than the volume uuid; and mkfs only
lets us set the volume uuid
b. atime, ctime, mtime, otime; and no way to make them all the same
c. non-deterministic allocation of file extents, compression, inode
assignment, logical and physical address allocation

I'm imagining reproducible image creation would be a mkfs feature that
builds on Btrfs seed and --rootdir concepts to constrain Btrfs
features to maybe make reproducible Btrfs volumes possible:

- No raid
- Either all objects needing uuids can have those uuids specified by
switch, or possibly a defined set of uuids expressly for this use
case, or possibly all of them can just be zeros (eek? not sure)
- A flag to set all times the same
- Possibly require that target block device is zero filled before
creation of the Btrfs
- Possibly disallow subvolumes and snapshots
- Require the resulting image is seed/ro and maybe also a new
compat_ro flag to enforce that such Btrfs file systems cannot be
modified after the fact.
- Enforce a consistent means of allocation and compression

The end result is creating two Btrfs volumes would yield image files
with matching hashes.




If I had to guess, the biggest challenge would be allocation. But it's
also possible that such an image may have problems with "sprouts". A
non-removable sprout seems fairly straightforward and safe; but if a
"reproducible build" type of seed is removed, it seems like removal
needs to be smart enough to refresh *all* uuids found in the sprout: a
hard break from the seed.



Right. The seed fsid will be gone in a detached sprout.


I think already we get a new devid, volume uuid, and device uuid.


 Yes on the sprout.


Open question is whether any other uuid's need to be refreshed, such as
chunk uuid since that appears in every node and leaf.


 There are quite a number of uuids.


Any thoughts? Useful? Difficult to implement?


Recently Nikolay sent a patch to change the fsid on a mounted btrfs. However,
reproducible builds also need neutralized uuids, times, and bytenr(s).
Furthermore, although the on-disk layout won't change without notice,
block bytenrs might.


Seems like the mkfs population method of such a seed could be made
very deterministic as to what the start logical address and physical
address are.


 Can be. But it can change in future fixes as those aren't EXPORTED().


The vast majority of non-deterministic behavior comes
from the nature of kernel code having to handle so many complex inputs
and outputs, and negotiate them.





One question: why not have reproducible builds get the file data extents from
the image and stitch the hashes together to verify the hash? There could also
be a VFS ioctl to import and export filesystem images, for better
supportability of use cases such as reproducible builds.


Perhaps. I don't know the reproducible build requirements very well,
if all they really care about is the hash of the data extents, and
really how important fs metadata is. That is important when it comes
to fuzzing file systems that have no metadata checksumming like
squashfs; of course you'd have to checksum the whole file system
image.




Another feature the mkfs variety of seed image would need is
deduplication.  As far as I know, deduplication is kernel code only.
You'd want to be able to deduplicate, as well as compress, to have the
smallest distributed seed possible. And mksquashfs does deduplication
by default.


btrfs-image(8) already does compress.

I don't think mkfs is the right place to sanitize the uuid/fsid/time;
that should happen when we generate the btrfs-image.


 So a possible solution for reproducible builds:
   the usual mkfs.btrfs on the device;
   write the data;
   unmount; create the btrfs-image with uuid/fsid/time sanitized; mark it
   as a seed (RO);
   check/verify the hash of the image.

 If the hash matches, there are three ways to use this btrfs-image:
   Reset the seed (RO) flag; mount and use it;
   OR
   Mount the seed device; add a RW sprout; detach the seed;
   OR
   Don't set the RO flag at all (above) and just mount and use it.

Thanks, Anand
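
The "check/verify the hash of the image" step in the workflow above can be
sketched roughly as follows. This is only an illustration: the helper names
image_sha256 and images_match are made up for this sketch, SHA-256 is an
assumed choice of hash, and nothing here is part of btrfs-progs. It simply
hashes two image files and compares the digests.

```python
import hashlib


def image_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as image:
        # Read fixed-size chunks so large images do not need to fit in memory.
        for chunk in iter(lambda: image.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def images_match(path_a, path_b):
    """True if both image files have identical SHA-256 digests."""
    return image_sha256(path_a) == image_sha256(path_b)
```

Two builds would count as reproducible in this sense only if images_match()
returns True for their images, which is why every uuid, timestamp and bytenr
in the image has to be deterministic first.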






Re: btrfs check: Superblock bytenr is larger than device size

2018-10-16 Thread Qu Wenruo


On 2018/10/16 下午11:25, Anton Shepelev wrote:
> Qu Wenruo to Anton Shepelev:
> 
>>> On all our servers with BTRFS, which are otherwise working
>>> normally, `btrfs check /' complains that
>>>
>>> Superblock bytenr is larger than device size
>>> Couldn't open file system
>>>
>> Please try latest btrfs-progs and see if btrfs check
>> reports any error.
> 
> It is SUSE Linux Enterprise with its native repository, so I
> cannot update btrfs easily.  I will, though.

Then I recommend using the latest openSUSE Tumbleweed rescue ISO to do the
mount and check.

Thanks,
Qu

> 





Re: brtfs warning at ctree.h:1564 btrfs_update_device+0x220/0x230

2018-10-16 Thread Qu Wenruo


On 2018/10/17 上午5:27, Dmitry Katsubo wrote:
> Dear btrfs team,
> 
> I often observe kernel traces on linux-4.14.0 (most likely due to background
> "btrfs scrub") which contain the following "characterizing" line (for the rest
> see attachments):
> 
> btrfs_remove_chunk+0x26a/0x7e0 [btrfs]
> 
> I wonder if somebody from the developers team knows anything about this problem.
> It seems like after such a dump the btrfs volume continues to function OK.

It's a known minor problem.

"btrfs rescue fix-device-size " could fix it offline (unmounted)

Or, if you're using the fs as the root fs, shrinking the fs by 4K would
also solve the problem:

  # btrfs filesystem resize :-4K 

The cause is that old mkfs/kernel versions didn't align the device size
correctly, while later kernels are pretty picky about that alignment.
It's mostly a developer-oriented warning; it does no harm apart from producing
a lot of scary kernel warnings, which may slow down the logging system.
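
To illustrate what "aligned" means here, a minimal sketch of the arithmetic,
assuming a 4 KiB sector size (an assumption that matches the 4K in the resize
command above, but may differ on other setups). The function names are
hypothetical and this is not the kernel's code, only the same kind of check.

```python
SECTOR_SIZE = 4096  # assumed sector size in bytes


def is_aligned(total_bytes, sector_size=SECTOR_SIZE):
    """True if the device size is a whole multiple of the sector size."""
    return total_bytes % sector_size == 0


def unaligned_tail(total_bytes, sector_size=SECTOR_SIZE):
    """Number of bytes by which the size overshoots the last sector boundary."""
    return total_bytes % sector_size


# A device of 7814037168 512-byte blocks is aligned.
aligned_size = 7814037168 * 512
print(is_aligned(aligned_size))            # True

# A hypothetical size that is one 512-byte block larger is not.
odd_size = aligned_size + 512
print(is_aligned(odd_size))                # False
print(unaligned_tail(odd_size))            # 512
```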

Thanks,
Qu

> 
> Thanks for any information!
> 





Failover for unattached USB device

2018-10-16 Thread Dmitry Katsubo
Dear btrfs team / community,

Sometimes the kernel resets the USB subsystem (it looks like a hardware
problem), and all USB devices are detached and attached again. After a
few hours of struggle btrfs finally reaches the point where a read-only
filesystem mount is necessary. During this time, when I try to access the
mounted filesystem (/mnt/backups), it reports success for some directories and
an error for others:

root@debian:~# ll /mnt/backups/
total 14334
drwxr-xr-x 1 adm users    116 Sep 12 00:35 .
drwxrwxr-x 1 adm users    164 Sep 19 22:44 ..
-rw-r--r-- 1 adm users  79927 Feb  7  2018 contacts.zip
drwxr-xr-x 1 adm users    254 Feb  4  2018 attic
drwxr-xr-x 1 adm users     16 Feb 23  2018 recent
...
root@debian:~# ll /mnt/backups/attic/
ls: reading directory '/mnt/backups/attic/': Input/output error
total 0
drwxr-xr-x 1 adm users 254 Feb  4  2018 .
drwxr-xr-x 1 adm users 116 Sep 12 00:35 ..

It looks like this depends on whether the content is in disk cache...

What is surprising: when I try to create a file, I succeed:

root@debian:~# touch /mnt/backups/.mounted
root@debian:~# ll /mnt/backups/.mounted
-rw-r--r-- 1 root root 0 Sep 20 16:52 /mnt/backups/.mounted
root@debian:~# rm /mnt/backups/.mounted

My btrfs volume consists of two identical drives combined into RAID1 volume:

# btrfs filesystem df /mnt/backups
Data, RAID1: total=880.00GiB, used=878.96GiB
System, RAID1: total=8.00MiB, used=144.00KiB
Metadata, RAID1: total=2.00GiB, used=1.13GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

# btrfs filesystem show /mnt/backups
Label: none  uuid: a657364b-36d2-4c1f-8e5d-dc3d28166190
Total devices 2 FS bytes used 880.09GiB
devid 1 size 3.64TiB used 882.01GiB path /dev/sdf
devid 2 size 3.64TiB used 882.01GiB path /dev/sde

As a workaround I can monitor dmesg output but:

1. It would be nice if I could tell btrfs that I would like to mount read-only
after a certain error rate per minute is reached.
2. It would be nice if btrfs could detect that both drives are not available and
unmount (as mount read-only won't help much) the filesystem.

Kernel log for Linux v4.14.2 is attached.
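
The dmesg-monitoring workaround mentioned above could look roughly like the
sketch below. It only parses "BTRFS error ... errs:" lines of the form seen in
the attached log and counts them; the names parse_errs and too_many_errors and
the threshold are made up for illustration, and what to do once the threshold
is hit (for example remounting read-only) is left out.

```python
import re

# Matches lines such as:
# BTRFS error (device sdf): bdev /dev/sdh errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
ERRS_RE = re.compile(
    r"BTRFS error \(device (?P<fs_dev>\S+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

COUNTERS = ("wr", "rd", "flush", "corrupt", "gen")


def parse_errs(line):
    """Return (bdev, counters) for a BTRFS error-counter line, else None."""
    match = ERRS_RE.search(line)
    if match is None:
        return None
    counters = {name: int(match.group(name)) for name in COUNTERS}
    return match.group("bdev"), counters


def too_many_errors(lines, max_error_lines):
    """True if more than `max_error_lines` counter lines appear in `lines`."""
    hits = sum(1 for line in lines if parse_errs(line) is not None)
    return hits > max_error_lines
```

Feeding it one minute's worth of kernel log lines at a time would approximate
the "error rate per minute" idea from item 1 below the workaround.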

-- 
With best regards,
Dmitry
Jun 29 18:54:56 debian kernel: [1197865.440396] usb 4-2: USB disconnect, device 
number 3
Jun 29 18:54:56 debian kernel: [1197865.440403] usb 4-2.2: USB disconnect, 
device number 5
Jun 29 18:54:56 debian kernel: [1197865.476118] usb 4-2.3: USB disconnect, 
device number 8
Jun 29 18:54:56 debian kernel: [1197865.549379] usb 4-2.4: USB disconnect, 
device number 7
...
Jun 29 18:54:58 debian kernel: [1197867.517728] usb-storage 4-2.3:1.0: USB Mass 
Storage device detected
Jun 29 18:54:58 debian kernel: [1197867.524021] usb-storage 4-2.3:1.0: Quirks 
match for vid 152d pid 0567: 500
Jun 29 18:54:58 debian kernel: [1197867.603859] usb 4-2.4: new full-speed USB 
device number 13 using ehci-pci
Jun 29 18:54:58 debian kernel: [1197867.725595] usb-storage 4-2.4:1.2: USB Mass 
Storage device detected
Jun 29 18:54:58 debian kernel: [1197867.728602] scsi host9: usb-storage 
4-2.4:1.2
Jun 29 18:54:59 debian kernel: [1197868.528737] scsi 7:0:0:0: Direct-Access 
ST4000DM 004-2CV104   0125 PQ: 0 ANSI: 6
Jun 29 18:54:59 debian kernel: [1197868.529310] scsi 7:0:0:1: Direct-Access 
ST4000DM 004-2CV104   0125 PQ: 0 ANSI: 6
Jun 29 18:54:59 debian kernel: [1197868.530093] sd 7:0:0:0: Attached scsi 
generic sg5 type 0
Jun 29 18:54:59 debian kernel: [1197868.530588] sd 7:0:0:1: Attached scsi 
generic sg6 type 0
Jun 29 18:54:59 debian kernel: [1197868.533064] sd 7:0:0:1: [sdh] Very big 
device. Trying to use READ CAPACITY(16).
Jun 29 18:54:59 debian kernel: [1197868.533619] sd 7:0:0:1: [sdh] 7814037168 
512-byte logical blocks: (4.00 TB/3.64 TiB)
Jun 29 18:54:59 debian kernel: [1197868.533626] sd 7:0:0:1: [sdh] 4096-byte 
physical blocks
Jun 29 18:54:59 debian kernel: [1197868.534063] sd 7:0:0:1: [sdh] Write Protect 
is off
Jun 29 18:54:59 debian kernel: [1197868.534069] sd 7:0:0:1: [sdh] Mode Sense: 
67 00 10 08
Jun 29 18:54:59 debian kernel: [1197868.534422] sd 7:0:0:1: [sdh] No Caching 
mode page found
Jun 29 18:54:59 debian kernel: [1197868.534542] sd 7:0:0:1: [sdh] Assuming 
drive cache: write through
Jun 29 18:54:59 debian kernel: [1197868.535563] sd 7:0:0:1: [sdh] Very big 
device. Trying to use READ CAPACITY(16).
Jun 29 18:54:59 debian kernel: [1197868.536702] sd 7:0:0:0: [sdg] Very big 
device. Trying to use READ CAPACITY(16).
Jun 29 18:54:59 debian kernel: [1197868.537454] sd 7:0:0:0: [sdg] 7814037168 
512-byte logical blocks: (4.00 TB/3.64 TiB)
Jun 29 18:54:59 debian kernel: [1197868.537459] sd 7:0:0:0: [sdg] 4096-byte 
physical blocks
Jun 29 18:54:59 debian kernel: [1197868.538327] sd 7:0:0:0: [sdg] Write Protect 
is off
Jun 29 18:54:59 debian kernel: [1197868.538331] sd 7:0:0:0: [sdg] Mode Sense: 
67 00 10 08
...
Jun 29 20:22:35 debian kernel: [1203125.061068] BTRFS error (device sdf): bdev 
/dev/sdh errs: wr 0, rd 1, flush 0, corrupt 0, gen 0

brtfs warning at ctree.h:1564 btrfs_update_device+0x220/0x230

2018-10-16 Thread Dmitry Katsubo
Dear btrfs team,

I often observe kernel traces on linux-4.14.0 (most likely due to background
"btrfs scrub") which contain the following "characterizing" line (for the rest
see attachments):

btrfs_remove_chunk+0x26a/0x7e0 [btrfs]

I wonder if somebody from the developers team knows anything about this problem. It
seems like after such a dump the btrfs volume continues to function OK.

Thanks for any information!

-- 
With best regards,
Dmitry
Jun  7 16:26:31 debian kernel: [1176060.298759] [ cut here 
]
Jun  7 16:26:31 debian kernel: [1176060.298820] WARNING: CPU: 0 PID: 566 at 
/build/linux-SCFPgu/linux-4.14.2/fs/btrfs/ctree.h:1564 
btrfs_update_device+0x220/0x230 [btrfs]
Jun  7 16:26:31 debian kernel: [1176060.298823] Modules linked in: option 
usb_wwan usbserial ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter 
xt_REDIRECT nf_nat_redirect xt_physdev br_netfilter iptable_nat 
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c 
xt_tcpudp iptable_mangle arc4 bridge stp llc iTCO_wdt iTCO_vendor_support ppdev 
coretemp ath5k pcspkr serio_raw ath mac80211 sr9700 dm9601 cfg80211 usbnet mii 
i915 rfkill snd_hda_codec_realtek lpc_ich snd_hda_codec_generic mfd_core evdev 
snd_hda_intel snd_hda_codec sg snd_hda_core snd_hwdep snd_pcm_oss rng_core 
snd_mixer_oss video snd_pcm drm_kms_helper snd_timer drm snd parport_pc 
soundcore i2c_algo_bit parport shpchp button acpi_cpufreq binfmt_misc w83627hf 
hwmon_vid ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto ecb 
crypto_simd cryptd
Jun  7 16:26:31 debian kernel: [1176060.298930]  aes_i586 btrfs crc32c_generic 
xor zstd_decompress zstd_compress xxhash raid6_pq hid_generic usbhid hid uas 
usb_storage sr_mod cdrom sd_mod ata_generic i2c_i801 ata_piix libata 
firewire_ohci scsi_mod firewire_core crc_itu_t e1000e ptp pps_core ehci_pci 
uhci_hcd ehci_hcd usbcore usb_common
Jun  7 16:26:31 debian kernel: [1176060.298981] CPU: 0 PID: 566 Comm: 
btrfs-cleaner Tainted: GW   4.14.0-1-686-pae #1 Debian 4.14.2-1
Jun  7 16:26:31 debian kernel: [1176060.299162] Hardware name: AOpen 
i945GMx-IF/i945GMx-IF, BIOS i945GMx-IF R1.01 Mar.02.2007 AOpen Inc. 03/02/2007
Jun  7 16:26:31 debian kernel: [1176060.299327] task: f287e200 task.stack: 
f24e2000
Jun  7 16:26:31 debian kernel: [1176060.299448] EIP: 
btrfs_update_device+0x220/0x230 [btrfs]
Jun  7 16:26:31 debian kernel: [1176060.299450] EFLAGS: 00010206 CPU: 0
Jun  7 16:26:31 debian kernel: [1176060.299454] EAX:  EBX: f68bee00 
ECX: 000c EDX: 0200
Jun  7 16:26:31 debian kernel: [1176060.299457] ESI: ef0d9320 EDI:  
EBP: f24e3e9c ESP: f24e3e5c
Jun  7 16:26:31 debian kernel: [1176060.299460]  DS: 007b ES: 007b FS: 00d8 GS: 
00e0 SS: 0068
Jun  7 16:26:31 debian kernel: [1176060.299463] CR0: 80050033 CR2: 02aa3000 
CR3: 32b6ece0 CR4: 06f0
Jun  7 16:26:31 debian kernel: [1176060.299467] Call Trace:
Jun  7 16:26:31 debian kernel: [1176060.299561]  btrfs_remove_chunk+0x26a/0x7e0 
[btrfs]
Jun  7 16:26:31 debian kernel: [1176060.299686]  
btrfs_delete_unused_bgs+0x321/0x3f0 [btrfs]
Jun  7 16:26:31 debian kernel: [1176060.299819]  cleaner_kthread+0x13c/0x150 
[btrfs]
Jun  7 16:26:31 debian kernel: [1176060.299907]  kthread+0xf3/0x110
Jun  7 16:26:31 debian kernel: [1176060.33]  ? 
__btree_submit_bio_start+0x20/0x20 [btrfs]
Jun  7 16:26:31 debian kernel: [1176060.300099]  ? 
kthread_create_on_node+0x20/0x20
Jun  7 16:26:31 debian kernel: [1176060.300182]  ret_from_fork+0x19/0x24
Jun  7 16:26:31 debian kernel: [1176060.300249] Code: e9 81 fe ff ff 8d b6 00 
00 00 00 bf f4 ff ff ff e9 78 fe ff ff 8d b6 00 00 00 00 f3 90 eb a8 8d 74 26 
00 f3 90 e9 2b ff ff ff 90 <0f> ff e9 7a ff ff ff e8 14 4d 4c dc 8d 74 26 00 3e 
8d 74 26 00
Jun  7 16:26:31 debian kernel: [1176060.300626] ---[ end trace 32773559e9ec5e68 
]---
Jul  1 07:07:31 debian kernel: [1328228.484772] [ cut here 
]
Jul  1 07:07:31 debian kernel: [1328228.484822] WARNING: CPU: 0 PID: 26193 at 
/build/linux-SCFPgu/linux-4.14.2/fs/btrfs/ctree.h:1564 
btrfs_update_device+0x220/0x230 [btrfs]
Jul  1 07:07:31 debian kernel: [1328228.484824] Modules linked in: cpuid nfs 
lockd grace sunrpc fscache ipt_REJECT nf_reject_ipv4 xt_multiport 
iptable_filter xt_REDIRECT nf_nat_redirect xt_physdev br_netfilter iptable_nat 
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c 
xt_tcpudp iptable_mangle option usb_wwan usbserial arc4 bridge stp llc iTCO_wdt 
iTCO_vendor_support ppdev evdev ath5k ath mac80211 coretemp cfg80211 sr9700 
rfkill serio_raw dm9601 i915 usbnet pcspkr snd_hda_codec_realtek mii lpc_ich 
snd_hda_codec_generic mfd_core snd_hda_intel snd_hda_codec snd_hda_core 
snd_hwdep rng_core video snd_pcm_oss sg drm_kms_helper snd_mixer_oss drm 
snd_pcm snd_timer i2c_algo_bit snd soundcore parport_pc parport button shpchp 
acpi_cpufreq binfmt_misc w83627hf hwmon_vid ip_tables x_tables autofs4 ext4 
crc16 mbcache
Jul  1 07:07:31 debian kernel: 

Re: CRC mismatch

2018-10-16 Thread Chris Murphy
On Tue, Oct 16, 2018 at 9:42 AM, Austin S. Hemmelgarn
 wrote:
> On 2018-10-16 11:30, Anton Shepelev wrote:
>>
>> Hello, all
>>
>> What may be the reason for a CRC mismatch on a BTRFS file in
>> a virtual machine:
>>
>> csum failed ino 175524 off 1876295680 csum 451760558
>> expected csum 1446289185
>>
>> Shall I seek the culprit in the host machine or in the guest
>> one?  Supposing the host machine is healthy, what operations on
>> the guest might have caused a CRC mismatch?
>>
> Possible causes include:
>
> * On the guest side:
>   - Unclean shutdown of the guest system (not likely even if this did
> happen).
>   - A kernel bug in the guest.
>   - Something directly modifying the block device (also not very likely).
>
> * On the host side:
>   - Unclean shutdown of the host system without properly flushing data from
> the guest.  Not likely unless you're using an actively unsafe caching mode
> for the guest's storage back-end.
>   - At-rest data corruption in the storage back-end.
>   - A bug in the host-side storage stack.
>   - A transient error in the host-side storage stack.
>   - A bug in the hypervisor.
>   - Something directly modifying the back-end storage.
>
> Of these, the statistically most likely location for the issue is probably
> the storage stack on the host.

Is there still that O_DIRECT related "bug" (or more of a limitation)
if the guest is using cache=none on the block device?

Anton what virtual machine tech are you using? qemu/kvm managed with
virt-manager? The configuration affects host behavior; but the
negative effect manifests inside the guest as corruption. If I
remember correctly.

-- 
Chris Murphy


Re: reproducible builds with btrfs seed feature

2018-10-16 Thread Chris Murphy
On Tue, Oct 16, 2018 at 2:13 AM, Anand Jain  wrote:
>
>
> On 10/14/2018 06:28 AM, Chris Murphy wrote:
>>
>> Is it practical and desirable to make Btrfs based OS installation
>> images reproducible? Or is Btrfs simply too complex and
>> non-deterministic? [1]
>>
>> The main three problems with Btrfs right now for reproducibility are:
>> a. many objects have uuids other than the volume uuid; and mkfs only
>> lets us set the volume uuid
>> b. atime, ctime, mtime, otime; and no way to make them all the same
>> c. non-deterministic allocation of file extents, compression, inode
>> assignment, logical and physical address allocation
>>
>> I'm imagining reproducible image creation would be a mkfs feature that
>> builds on Btrfs seed and --rootdir concepts to constrain Btrfs
>> features to maybe make reproducible Btrfs volumes possible:
>>
>> - No raid
>> - Either all objects needing uuids can have those uuids specified by
>> switch, or possibly a defined set of uuids expressly for this use
>> case, or possibly all of them can just be zeros (eek? not sure)
>> - A flag to set all times the same
>> - Possibly require that target block device is zero filled before
>> creation of the Btrfs
>> - Possibly disallow subvolumes and snapshots
>> - Require the resulting image is seed/ro and maybe also a new
>> compat_ro flag to enforce that such Btrfs file systems cannot be
>> modified after the fact.
>> - Enforce a consistent means of allocation and compression
>>
>> The end result is creating two Btrfs volumes would yield image files
>> with matching hashes.
>
>
>> If I had to guess, the biggest challenge would be allocation. But it's
>> also possible that such an image may have problems with "sprouts". A
>> non-removable sprout seems fairly straightforward and safe; but if a
>> "reproducible build" type of seed is removed, it seems like removal
>> needs to be smart enough to refresh *all* uuids found in the sprout: a
>> hard break from the seed.
>
>
> Right. The seed fsid will be gone in a detached sprout.

I think already we get a new devid, volume uuid, and device uuid. Open
question is whether any other uuid's need to be refreshed, such as
chunk uuid since that appears in every node and leaf.


>> Any thoughts? Useful? Difficult to implement?
>
> Recently Nikolay sent a patch to change the fsid on a mounted btrfs. However,
> reproducible builds also need neutralized uuids, times, and bytenr(s).
> Furthermore, although the on-disk layout won't change without notice,
> block bytenrs might.

Seems like the mkfs population method of such a seed, could be made
very deterministic as to what the start logical address and physical
address are. The vast majority of non-deterministic behavior comes
from the nature of kernel code having to handle so many complex inputs
and outputs, and negotiate them.


> One question: why not have reproducible builds get the file data extents from
> the image and stitch the hashes together to verify the hash? There could also
> be a VFS ioctl to import and export filesystem images, for better
> supportability of use cases such as reproducible builds.

Perhaps. I don't know the reproducible build requirements very well,
if all they really care about is the hash of the data extents, and
really how important fs metadata is. That is important when it comes
to fuzzing file systems that have no metadata checksumming like
squashfs; of course you'd have to checksum the whole file system
image.

Another feature the mkfs variety of seed image would need,
deduplication.  As far as I know, deduplication is kernel code only.
You'd want to be able to deduplicate, as well as compress, to have the
smallest distributed seed possible. And mksquashfs does deduplication
by default.


-- 
Chris Murphy


Re: CRC mismatch

2018-10-16 Thread Austin S. Hemmelgarn

On 2018-10-16 11:30, Anton Shepelev wrote:

Hello, all

What may be the reason for a CRC mismatch on a BTRFS file in
a virtual machine:

csum failed ino 175524 off 1876295680 csum 451760558
expected csum 1446289185

Shall I seek the culprit in the host machine or in the guest
one?  Supposing the host machine is healthy, what operations on
the guest might have caused a CRC mismatch?


Possible causes include:

* On the guest side:
  - Unclean shutdown of the guest system (not likely even if this did
happen).
  - A kernel bug in the guest.
  - Something directly modifying the block device (also not very likely).

* On the host side:
  - Unclean shutdown of the host system without properly flushing data
from the guest.  Not likely unless you're using an actively unsafe
caching mode for the guest's storage back-end.
  - At-rest data corruption in the storage back-end.
  - A bug in the host-side storage stack.
  - A transient error in the host-side storage stack.
  - A bug in the hypervisor.
  - Something directly modifying the back-end storage.

Of these, the statistically most likely location for the issue is 
probably the storage stack on the host.


CRC mismatch

2018-10-16 Thread Anton Shepelev
Hello, all

What may be the reason for a CRC mismatch on a BTRFS file in
a virtual machine:

   csum failed ino 175524 off 1876295680 csum 451760558
   expected csum 1446289185

Shall I seek the culprit in the host machine or in the guest
one?  Supposing the host machine is healthy, what operations on
the guest might have caused a CRC mismatch?
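
For readers unfamiliar with why a single damaged block produces such a
message: btrfs checksums data with CRC32C, and any change to the stored bytes
yields a different value than the one recorded at write time. Below is a
minimal, slow, bit-by-bit CRC32C sketch in Python to show the effect. It is
not the kernel's implementation, and how the kernel seeds and prints the value
in the log line above is not modelled here.

```python
def crc32c(data, crc=0):
    """Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; apply the polynomial if that bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


block = bytes(4096)                  # a block as originally written
expected = crc32c(block)

damaged = bytearray(block)
damaged[100] ^= 0x01                 # one bit flipped somewhere below the fs
print(crc32c(bytes(damaged)) == expected)   # False: reported as a csum failure
```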

-- 
()  ascii ribbon campaign - against html e-mail
/\  http://preview.tinyurl.com/qcy6mjc [archived]


Re: btrfs check: Superblock bytenr is larger than device size

2018-10-16 Thread Anton Shepelev
Qu Wenruo to Anton Shepelev:

>>On all our servers with BTRFS, which are otherwise working
>>normally, `btrfs check /' complains that
>>
>>Superblock bytenr is larger than device size
>>Couldn't open file system
>>
>Please try latest btrfs-progs and see if btrfs check
>reports any error.

It is SUSE Linux Enterprise with its native repository, so I
cannot update btrfs easily.  I will, though.

-- 
()  ascii ribbon campaign - against html e-mail
/\  http://preview.tinyurl.com/qcy6mjc [archived]


Re: btrfs check: Superblock bytenr is larger than device size

2018-10-16 Thread Qu Wenruo


On 2018/10/16 下午10:05, Anton Shepelev wrote:
> Hello, all
> 
> On all our servers with BTRFS, which are otherwise working
> normally, `btrfs check /' complains that

Btrfs check shouldn't proceed on a mount point.

The latest version would report an error like:

  Opening filesystem to check...
  ERROR: not a regular file or block device: /mnt/btrfs
  ERROR: cannot open file system


> 
>Superblock bytenr is larger than device size

This shouldn't be a big problem, normally related to unaligned numbers,
and older kernel/btrfs-progs.

Please try latest btrfs-progs and see if btrfs check reports any error.

>Couldn't open file system
> 
> Since I am not using any "dangerous" options and want merely
> to analyse the system for errors without any modifications,
> I don't think I must unmount my FS, which in my case would
> mean booting from a live CD.  What may be causing this
> error?

Normally an old mkfs or an old kernel.

The latest btrfs check result would definitely help to solve the problem.
The following output will also help (it can be dumped even with the fs
mounted; it also needs the latest btrfs-progs):

  # btrfs ins dump-super -FfA 
  # btrfs ins dump-tree -t chunk 

Thanks,
Qu

> 





btrfs check: Superblock bytenr is larger than device size

2018-10-16 Thread Anton Shepelev
Hello, all

On all our servers with BTRFS, which are otherwise working
normally, `btrfs check /' complains that

   Superblock bytenr is larger than device size
   Couldn't open file system

Since I am not using any "dangerous" options and want merely
to analyse the system for errors without any modifications,
I don't think I must unmount my FS, which in my case would
mean booting from a live CD.  What may be causing this
error?

-- 
()  ascii ribbon campaign - against html e-mail
/\  http://preview.tinyurl.com/qcy6mjc [archived]


Re: [PATCH v7 1/6] mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS

2018-10-16 Thread David Sterba
On Fri, Oct 12, 2018 at 01:59:34PM -0700, Andrew Morton wrote:
> On Tue, 11 Sep 2018 15:34:44 -0700 Omar Sandoval  wrote:
> 
> > From: Omar Sandoval 
> > 
> > The SWP_FILE flag serves two purposes: to make swap_{read,write}page()
> > go through the filesystem, and to make swapoff() call
> > ->swap_deactivate(). For Btrfs, we want the latter but not the former,
> > so split this flag into two. This makes us always call
> > ->swap_deactivate() if ->swap_activate() succeeded, not just if it
> > didn't add any swap extents itself.
> > 
> > This also resolves the issue of the very misleading name of SWP_FILE,
> > which is only used for swap files over NFS.
> > 
> 
> Acked-by: Andrew Morton 

Andrew, can you please take the two patches through the mm tree? I'm not
going to send the btrfs swap patches in the upcoming merge window so it
would not make sense to add plain MM changes to btrfs tree.  The whole
series has been in linux-next for some time so it's just moving between
trees. Thanks.


Re: BTRFS bad block management. Does it exist?

2018-10-16 Thread Anand Jain





On 10/14/2018 07:08 PM, waxhead wrote:

In case BTRFS fails to WRITE to a disk. What happens?



Does the bad area get mapped out somehow?


There was a proposed patch, but it's not convincing, because the disk does
the bad block relocation transparently to the host. If the disk runs
out of its reserved list, then it's probably time to replace the disk; in my
experience the disk would have failed for some other, non-media error before
it runs out of the reserved list, and in that case host-performed
relocation won't help. Furthermore, at the file-system
level you can't accurately determine whether a block write
failed because of a bad-media error rather than because of a
target circuitry fault.


Does it try again until it succeed or
until it "times out" or reach a threshold counter?


Block IO timeout and retry are properties of the block layer, which
should handle them depending on the type of error.


The SD module already retries 5 times (when failfast is not set); this
should be tunable, and I think there was a patch for that on the ML.


We had a few discussions on the retry part in the past. [1]
[1]
https://www.spinics.net/lists/linux-btrfs/msg70240.html
https://www.spinics.net/lists/linux-btrfs/msg71779.html


Does it eventually try to write to a different disk (in case of using 
the raid1/10 profile?)


When there is a mirror copy it does not go into RO mode, and it leaves
write hole(s) scattered across transactions, as we don't fail the disk at
the first failed transaction. That means if a disk is at the nth transaction
per the super-block, it's not guaranteed that all previous transactions
have made it to the disk successfully in mirrored configs. I
consider this a bug. And there is a danger that it may read the junk
data, which is hard but not impossible to hit due to our unreasonable
(there is a patch in the ML to address that as well) hard-coded
pid-based read-mirror policy.


I sent a patch to fail the disk when the first write fails, so that we know
the last good integrity point of the FS based on the transaction id. That was
a long time back; I still believe it's an important patch. There weren't
enough comments, I guess, for it to go to the next step.


The current solution is to replace the offending disk _without_ reading
from it, to have a good recovery from the failed disk. As data centers
can't rely on admin-initiated manual recovery, there is also a patch to
do this automatically using the auto-replace feature; the patches are
in the ML. Again, there weren't enough comments, I guess, for it to go to the
next step.


Thanks, Anand


Re: [PATCH v2] Btrfs: fix null pointer dereference on compressed write path error

2018-10-16 Thread David Sterba
On Sat, Oct 13, 2018 at 12:37:25AM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> At inode.c:compress_file_range(), under the "free_pages_out" label, we can
> end up dereferencing the "pages" pointer when it has a NULL value. This
> case happens when "start" has a value of 0 and we fail to allocate memory
> for the "pages" pointer. When that happens we jump to the "cont" label and
> then enter the "if (start == 0)" branch where we immediately call the
> cow_file_range_inline() function. If that function returns 0 (success
> creating an inline extent) or an error (like -ENOMEM for example) we jump
> to the "free_pages_out" label and then access "pages[i]" leading to a NULL
> pointer dereference, since "nr_pages" has a value greater than zero at
> that point.
> 
> Fix this by setting "nr_pages" to 0 when we fail to allocate memory for
> the "pages" pointer.
> 
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201119
> Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads")
> Signed-off-by: Filipe Manana 

Added to misc-next, thanks.
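
For readers following along, the failure mode in the quoted commit
message can be sketched in user-space C. This is a simplified model,
not the code from inode.c: free_pages_out and compress_range_sketch
are hypothetical stand-ins, the allocation failure is simulated with
a flag, and the inline-extent step is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for the cleanup under "free_pages_out":
 * walk nr_pages entries of pages[] and release each one.
 * Returns the number of entries released.
 */
static size_t free_pages_out(void **pages, size_t nr_pages)
{
    size_t freed = 0;

    for (size_t i = 0; i < nr_pages; i++) {
        /* With pages == NULL and nr_pages > 0 this would crash. */
        free(pages[i]);
        freed++;
    }
    free(pages);
    return freed;
}

/*
 * Sketch of the fixed error path: when the array allocation fails,
 * reset nr_pages so that the later cleanup loop never runs.
 * alloc_fails simulates the failed allocation.
 */
static size_t compress_range_sketch(size_t nr_pages, int alloc_fails)
{
    void **pages = alloc_fails ? NULL : calloc(nr_pages, sizeof(*pages));

    if (!pages)
        nr_pages = 0;   /* the fix described in the commit message */

    /* ... inline-extent attempt elided; it ends in the cleanup ... */
    return free_pages_out(pages, nr_pages);
}
```

Without the reset of nr_pages, the simulated failure would reach the
loop with a NULL array and a non-zero count, which is the dereference
the patch removes.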


Re: reproducible builds with btrfs seed feature

2018-10-16 Thread Anand Jain




On 10/14/2018 06:28 AM, Chris Murphy wrote:

Is it practical and desirable to make Btrfs based OS installation
images reproducible? Or is Btrfs simply too complex and
non-deterministic? [1]

The main three problems with Btrfs right now for reproducibility are:
a. many objects have uuids other than the volume uuid; and mkfs only
lets us set the volume uuid
b. atime, ctime, mtime, otime; and no way to make them all the same
c. non-deterministic allocation of file extents, compression, inode
assignment, logical and physical address allocation

I'm imagining reproducible image creation would be a mkfs feature that
builds on Btrfs seed and --rootdir concepts to constrain Btrfs
features to maybe make reproducible Btrfs volumes possible:

- No raid
- Either all objects needing uuids can have those uuids specified by
switch, or possibly a defined set of uuids expressly for this use
case, or possibly all of them can just be zeros (eek? not sure)
- A flag to set all times the same
- Possibly require that target block device is zero filled before
creation of the Btrfs
- Possibly disallow subvolumes and snapshots
- Require the resulting image is seed/ro and maybe also a new
compat_ro flag to enforce that such Btrfs file systems cannot be
modified after the fact.
- Enforce a consistent means of allocation and compression

The end result is creating two Btrfs volumes would yield image files
with matching hashes.



If I had to guess, the biggest challenge would be allocation. But it's
also possible that such an image may have problems with "sprouts". A
non-removable sprout seems fairly straightforward and safe; but if a
"reproducible build" type of seed is removed, it seems like removal
needs to be smart enough to refresh *all* uuids found in the sprout: a
hard break from the seed.


Right. The seed fsid will be gone in a detached sprout.


Competing file systems, ext4 with make_ext4 fork, and squashfs. At the
moment I'm thinking it might be easier to teach squashfs integrity
checking than to make Btrfs reproducible.  But then I also think
restricting Btrfs features, and applying some requirements to
constrain Btrfs to make it reproducible, really enhances the Btrfs
seed-sprout feature.


> Any thoughts? Useful? Difficult to implement?

Recently Nikolay sent a patch to change the fsid on a mounted btrfs.
However, reproducible builds also need neutralized uuids, times, and
bytenr(s). Furthermore, although the on-disk layout won't change
without notice, the block bytenr might.


One question: why don't reproducible builds get the file data extents
from the image and stitch their hashes together to verify the hash?
There could also be a vfs ioctl to import and export filesystem
images, for better supportability of use cases similar to
reproducible builds.
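
A rough sketch of the stitching idea, in user-space C. Everything
here is an assumption for illustration: data_extent is a hypothetical
view of an exported extent, and FNV-1a is only a simple stand-in for
a real cryptographic digest. The point is that the extent location
(bytenr) is left out of the hash, so two images with the same file
contents but different allocation give the same result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one file data extent exported from an image. */
struct data_extent {
    uint64_t bytenr;            /* on-disk location, may differ per build */
    const unsigned char *data;  /* extent contents */
    size_t len;
};

#define FNV_OFFSET 0xcbf29ce484222325ULL
#define FNV_PRIME  0x100000001b3ULL

/* FNV-1a, used only as a simple stand-in for a real digest. */
static uint64_t fnv1a(uint64_t h, const unsigned char *p, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= FNV_PRIME;
    }
    return h;
}

/*
 * Hash each extent's contents, then fold the per-extent hashes
 * together in file order. bytenr is deliberately not hashed.
 */
static uint64_t stitch_extent_hashes(const struct data_extent *ext, size_t n)
{
    uint64_t total = FNV_OFFSET;

    assert(ext != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        uint64_t h = fnv1a(FNV_OFFSET, ext[i].data, ext[i].len);

        total = fnv1a(total, (const unsigned char *)&h, sizeof(h));
    }
    return total;
}
```

Metadata such as uuids and timestamps is not covered by this sketch;
it would have to be neutralized or excluded separately.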


For the seed-sprout feature, one thing I have in mind is to make it
image and subvolume granular rather than disk and fsid granular, and
to add the ability to propagate golden image (seed) updates, but I
haven't checked the feasibility yet.


Thanks, Anand



Squashfs might be a better fit for this use case *if* it can be taught
about integrity checking. It does per file checksums for the purpose
of deduplication but those checksums aren't retained for later
integrity checking.

[1] problems of reproducible system images
https://reproducible-builds.org/docs/system-images/

[2] purpose and motivation for reproducible builds
https://reproducible-builds.org/

[3] who is involved?
https://reproducible-builds.org/who/#Qubes%20OS