Re: Btrfs/SSD
> Traditional hard drives usually do this too these days (they've been
> under-provisioned since before SSD's existed), which is part of why older
> disks tend to be noisier and slower (the reserved space is usually at the far
> inside or outside of the platter, so using sectors from there to replace
> stuff leads to long seeks).

Not true. When an HDD uses 10% of its space as spare (10% is just an easy example), the alignment on disk is (US = used sector, SS = spare sector, BS = bad sector):

US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS

If a failure occurs, the drive actually shifts sectors up:

US US US US US US US US US SS
US US US BS BS BS US US US US
US US US US US US US US US US
US US US US US US US US US US
US US US US US US US US US SS
US US US BS US US US US US US
US US US US US US US US US SS
US US US US US US US US US SS

That strategy is in place precisely to mitigate the problem that you've described, and it has actually been in place since drives were using PATA :) So if your drive gets noisier over time, it's either a broken bearing or a demagnetised arm magnet causing it not to aim properly, so the drive has to readjust its position multiple times before hitting the right track.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
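The shift-up remapping described above can be sketched in a few lines. This is a toy model under stated assumptions (one spare after every nine user sectors, and the `build_track`/`remap_bad_sector` names are hypothetical), not real drive firmware logic:

```python
# Toy model of the spare-sector layout above: one spare (SS) after every
# 9 user sectors (US). When a sector goes bad, the drive shifts the
# following sectors up so the nearest spare absorbs the loss, keeping the
# replacement physically close (hence short seeks).

def build_track(groups=3, group_size=9):
    """Lay out a track as ['US', ..., 'SS', 'US', ..., 'SS', ...]."""
    return (["US"] * group_size + ["SS"]) * groups

def remap_bad_sector(track, bad_index):
    """Retire a bad sector by shifting its successors toward the next spare."""
    track = track[:]
    # Find the nearest spare at or after the failed sector.
    spare = next(i for i in range(bad_index, len(track)) if track[i] == "SS")
    # Every sector between the failure and the spare moves up by one slot,
    # so the spare becomes a used sector and the bad one is marked retired.
    for i in range(spare, bad_index, -1):
        track[i] = track[i - 1]
    track[bad_index] = "BS"
    return track

track = build_track()
track = remap_bad_sector(track, 3)
print(track[:11])
```

After the remap, the spare closest to the failure has been consumed and the bad sector sits retired in place, mirroring the diagrams above.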
Re: Btrfs/SSD
Theoretically all sectors in over-provisioning are erased; practically they are either erased, waiting to be erased, or broken. What you have to understand is that sectors on an SSD are not where you think they are: they can swap places with sectors in the over-provisioning area, they can swap places with each other, etc. The storage you see as a disk from 0 to MAX does not have to be arranged in sequence on the SSD (and mostly never is).

If you never trim, then when your device is 100% full you need to start overwriting data to keep writing, and this is where over-provisioning shines: the SSD pretends that you write to a sector while really you write to a sector in the over-provisioning area, and the two magically swap places without you knowing. The sector that was occupied ends up in the over-provisioning pool, and the SSD hardware performs a slow erase on it to make it free for the future. This mechanism is simple and transparent to users: you don't know that it happens, and the SSD does all the heavy lifting.

The over-provisioned area has more uses than that. For example, if you have a 1TB drive where you store 500GB of data that you never modify, the SSD will copy part of that data to the over-provisioned area, free the sectors that went unwritten for a while, then free the sectors that were continuously hammered by writes and put the static data there. This mechanism is wear levelling: the SSD internals make sure that sectors on the SSD get equal use over time. Despite some thinking it's pointless, imagine a situation where you've got a 1TB drive with 1GB free and you keep writing and modifying data in that free 1GB ... those sectors will quickly die due to the short flash life expectancy (some as short as 1k erases!).
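The remap-on-overwrite behaviour described above can be sketched as a tiny logical-to-physical map. This is an illustrative toy (the `ToyFTL` class and its structure are my invention, not any real flash translation layer), showing how an overwrite is served from the over-provisioning pool while the old physical sector queues for a background erase:

```python
# Toy flash translation layer: logical sectors map to arbitrary physical
# sectors; an "overwrite" really lands on a pre-erased over-provisioning
# sector, and the old physical sector joins the queue of dirty sectors
# waiting for a slow erase.

class ToyFTL:
    def __init__(self, logical_sectors=8, op_sectors=2):
        # Identity mapping to start with; a real drive scatters this.
        self.l2p = {l: l for l in range(logical_sectors)}
        # Pre-erased over-provisioning sectors, ready for writes.
        self.free_pool = list(range(logical_sectors, logical_sectors + op_sectors))
        self.dirty = []  # sectors waiting for a (slow) erase

    def overwrite(self, logical):
        """Redirect a logical overwrite to a fresh sector; retire the old one."""
        if not self.free_pool:
            # No erased sectors left: erase synchronously -> writes slow down.
            self.background_erase()
        new_phys = self.free_pool.pop(0)
        self.dirty.append(self.l2p[logical])  # old sector joins the erase queue
        self.l2p[logical] = new_phys

    def background_erase(self):
        """Erase queued sectors back into the free pool (trim lets this run early)."""
        self.free_pool.extend(self.dirty)
        self.dirty.clear()

ftl = ToyFTL()
ftl.overwrite(0)
ftl.overwrite(0)
```

After the two overwrites the free pool is exhausted, so a third overwrite must erase synchronously first, which is exactly the "brick wall" slowdown described later in the thread.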
So again: buy good quality drives (not hardcore enterprise drives, just good consumer ones), leave the rest to the drive, use an OS that gives you trim, and you should be golden.

> On 15 May 2017, at 00:01, Imran Geriskovan <imran.gerisko...@gmail.com> wrote:
>
> On 5/14/17, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:
>> In terms of over provisioning of SSD it’s a give and take relationship … on good drive there is enough over provisioning to allow a normal operation on systems without TRIM … now if you would use a 1TB drive daily without TRIM and have only 30GB stored on it you will have fantastic performance but if you will want to store 500GB at roughly 200GB you will hit a brick wall and you writes will slow dow to megabytes / s … this is symptom of drive running out of over provisioning space …
>
> What exactly happens on a non-trimmed drive? Does it begin to forge certain erase-blocks? If so which are those? What happens when you never trim and continue dumping data on it?
Re: Btrfs/SSD
All the stuff that Chris wrote holds true; I just wanted to add flash-specific information (from my experience writing low-level code for operating flash).

With flash, to erase you have to erase a large allocation block. It used to be 128kB (plus some CRC data and such it's more than 128kB, but we are talking functional data storage space); on newer setups it can be megabytes ... it's really device dependent. To erase a block you need to supply all of its bits with a voltage higher than is usually used for IO (it can be as high as 15V), so it requires either an external supply or a built-in charge pump to feed the block-erasure circuitry. This process generates a lot of heat and requires a lot of energy, so the consensus back in the day was that you could erase one block at a time, and this could take up to 200ms (0.2 seconds). After an erase you check whether all bits are set to 1 (the charged state), and only then is the block marked as ready for storage.

Of course, flash memories are moving forward, and in more demanding environments there are solutions where blocks are grouped, each group having a separate eraser circuit; this allows erasure to be performed in parallel in multiple parts of the flash module, but you are still bound to one erase at a time per group. Another problem is that the erase procedure increases the temperature locally. On flat flashes that's not much of a problem, but on emerging solutions like 3D flash we might see local temperature increases that would either degrade the life span of the flash or simply erase neighbouring blocks.
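The post-erase verification step described above (a flash erase drives every cell to the charged state, so the block must read back as all ones) can be sketched like this. A minimal sketch with assumed names (`ERASE_BLOCK`, `erase_ok`), treating a 128 KiB block as bytes:

```python
# Verify an erase block reads back as all 1s (0xFF bytes) before marking
# it usable, as firmware does after the slow erase cycle described above.

ERASE_BLOCK = 128 * 1024  # 128 KiB allocation block, as in older parts

def erase_ok(block: bytes) -> bool:
    """True if every byte of a full-sized block reads back 0xFF (all cells charged)."""
    return len(block) == ERASE_BLOCK and all(b == 0xFF for b in block)

erased = b"\xff" * ERASE_BLOCK
stuck = bytearray(erased)
stuck[4096] = 0xFE  # one cell failed to charge -> the block is bad

print(erase_ok(erased), erase_ok(bytes(stuck)))  # -> True False
```

A real controller would retire a block that repeatedly fails this check, feeding the bad-block accounting that wear levelling works around.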
In terms of over-provisioning on an SSD it's a give-and-take relationship. On a good drive there is enough over-provisioning to allow normal operation on systems without TRIM. Now, if you used a 1TB drive daily without TRIM and had only 30GB stored on it, you would have fantastic performance; but if you wanted to store 500GB, then at roughly 200GB you would hit a brick wall and your writes would slow down to megabytes per second. This is the symptom of the drive running out of over-provisioning space. If you ran an OS that issues trim, this problem would not exist, since the drive would know that the whole 970GB of space is free and it would have been pre-emptively erased days before.

And the last part: the drive is not aware of filesystems and partitions, so you could leave 400GB of this 1TB drive unpartitioned and you would still be cooked. Technically speaking, giving as much of the SSD as possible to a FS and OS that support trim will give you the best performance, because the drive will be notified about as much of the disk space that is actually free as possible.

So, to summarize:
- don't try to outsmart the built-in mechanics of an SSD (people who suggest that are just morons who want their 5 minutes of fame).
- don't buy a crap SSD and expect it to behave like a good one as long as you stay below a certain % of it ... it's stupid; buy a more reasonable but smaller SSD and store slow data on spinning rust.
- read more books and Wikipedia; not jumping down on you, but the internet is filled with people who provide false information, sometimes unknowingly, and swear by it (Dunning–Kruger effect :D), and some of them are very good at making all their theories sexy and stuff ... you simply have to get used to it ...
- if something is too good to be true, then it's not
- the promise of future performance gains is the domain of the "sleazy salesman"

> On 14 May 2017, at 17:21, Chris Murphy wrote:
>
> On Sat, May 13, 2017 at 3:39 AM, Duncan <1i5t5.dun...@cox.net> wrote:
>
>> When I was doing my ssd research the first time around, the going recommendation was to keep 20-33% of the total space on the ssd entirely unallocated, allowing it to use that space as an FTL erase-block management pool.
>
> Any brand name SSD has its own reserve above its specified size to ensure that there's decent performance, even when there is no trim hinting supplied by the OS; and thereby the SSD can only depend on LBA "overwrites" to know what blocks are to be freed up.
>
>> Anyway, that 20-33% left entirely unallocated/unpartitioned recommendation still holds, right?
>
> Not that I'm aware of. I've never done this by literally walling off space that I won't use. A fairly large percentage of my partitions have free space so it does effectively happen as far as the SSD is concerned. And I use the fstrim timer. Most of the file systems support trim.
>
> Anyway I've stuffed a Samsung 840 EVO to 98% full with an OS/file system that would not issue trim commands on this drive, and it was doing full performance writes through that point. Then deleted maybe 5% of the files, and then refilled the drive to 98% again, and it was the same performance. So it must have had enough in reserve to permit full performance "overwrites" which were in effect directed to reserve blocks as the freed up blocks were being erased. Thus the erasure happening on the fly
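On the "use an OS that gives you trim" advice above: on Linux this is usually done with periodic `fstrim` rather than the `discard` mount option. A sketch of the usual commands (needs root; `sda` is an example device name):

```shell
# One-off: trim all mounted filesystems that support it, verbosely.
fstrim --all --verbose

# Ongoing: most distros ship a periodic systemd timer for this.
systemctl enable --now fstrim.timer

# Check whether the block device advertises discard support at all
# (a non-zero discard_max_bytes means the device accepts TRIM).
cat /sys/block/sda/queue/discard_max_bytes
```

The timer approach batches trims instead of issuing a discard on every delete, which is gentler on drives with slow erase handling.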
Re: Shrinking a device - performance?
I’ve glazed over on “Not only that …” … can you make a youtube video of that? :)

> On 28 Mar 2017, at 16:06, Peter Grandi wrote:
>
>> I glazed over at “This is going to be long” … :)
>>> [ ... ]
>
> Not only that, you also top-posted while quoting it pointlessly in its entirety, to the whole mailing list. Well played :-).

It’s because I’m special :*

On a real note, thanks for giving a f to provide a detailed comment … too much open source stuff is based on short comments :/
Re: Shrinking a device - performance?
I glazed over at “This is going to be long” … :)

> On 28 Mar 2017, at 15:43, Peter Grandi wrote:
>
> This is going to be long because I am writing something detailed hoping pointlessly that someone in the future will find it by searching the list archives while doing research before setting up a new storage system, and they will be the kind of person that tolerates reading messages longer than Twitter. :-).
>
>> I’m currently shrinking a device and it seems that the performance of shrink is abysmal.
>
> When I read this kind of statement I am reminded of all the cases where someone left me to decatastrophize a storage system built on "optimistic" assumptions. The usual "optimism" is what I call the "syntactic approach", that is the axiomatic belief that any syntactically valid combination of features not only will "work", but very fast too and reliably despite slow cheap hardware and "unattentive" configuration. Some people call that the expectation that system developers provide or should provide an "O_PONIES" option. In particular I get very saddened when people use "performance" to mean "speed", as the difference between the two is very great.
>
> As a general consideration, shrinking a large filetree online in-place is an amazingly risky, difficult, slow operation and should be a last desperate resort (as apparently in this case), regardless of the filesystem type, and expecting otherwise is "optimistic".
>
> My guess is that very complex risky slow operations like that are provided by "clever" filesystem developers for "marketing" purposes, to win box-ticking competitions. That applies to those system developers who do know better; I suspect that even some filesystem developers are "optimistic" as to what they can actually achieve.
>
>> I intended to shrink a ~22TiB filesystem down to 20TiB.
>> This is still using LVM underneath so that I can’t just remove a device from the filesystem but have to use the resize command.
>
> That is actually a very good idea because Btrfs multi-device is not quite as reliable as DM/LVM2 multi-device.
>
>> Label: 'backy' uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4
>> Total devices 1 FS bytes used 18.21TiB
>> devid 1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy
>
> Maybe 'balance' should have been used a bit more.
>
>> This has been running since last Thursday, so roughly 3.5 days now. The “used” number in devid 1 has moved about 1TiB in this time. The filesystem is seeing regular usage (read and write) and when I’m suspending any application traffic I see about 1GiB of movement every now and then. Maybe once every 30 seconds or so. Does this sound fishy or normal to you?
>
> With consistent "optimism" this is a request to assess whether "performance" of some operations is adequate on a filetree without telling us either what the filetree contents look like, what the regular workload is, or what the storage layer looks like.
>
> Being one of the few system administrators crippled by lack of psychic powers :-), I rely on guesses and inferences here, and on having read the whole thread containing some belated details.
>
> From the ~22TB total capacity my guess is that the storage layer involves rotating hard disks, and from later details the filesystem contents seem to be heavily reflinked files of several GB in size, and the workload seems to be backups to those files from several source hosts. Considering the general level of "optimism" in the situation my wild guess is that the storage layer is based on large slow cheap rotating disks in the 4TB-8TB range, with very low IOPS-per-TB.
>
>> Thanks for that info. The 1min per 1GiB is what I saw too - the “it can take longer” wasn’t really explainable to me.
> > A contemporary rotating disk device can do around 0.5MB/s > transfer rate with small random accesses with barriers up to > around 80-160MB/s in purely sequential access without barriers. > > 1GB/m of simultaneous read-write means around 16MB/s reads plus > 16MB/s writes which is fairly good *performance* (even if slow > *speed*) considering that moving extents around, even across > disks, involves quite a bit of randomish same-disk updates of > metadata; because it all depends usually on how much randomish > metadata updates need to done, on any filesystem type, as those > must be done with barriers. > >> As I’m not using snapshots: would large files (100+gb) > > Using 100GB sized VM virtual disks (never mind with COW) seems > very unwise to me to start with, but of course a lot of other > people know better :-). Just like a lot of other people know > better that large single pool storage systems are awesome in > every respect :-): cost, reliability, speed, flexibility, > maintenance, etc. > >> with long chains of CoW history (specifically reflink copies) >> also hurt? > > Oh yes... They are about one
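The rates quoted above can be checked with simple arithmetic: 1 GiB of extents relocated per minute means roughly 17 MiB/s of reads plus the same again in writes (close to the ~16 MB/s the post quotes, which is the same calculation in decimal megabytes):

```python
# Back-of-envelope check of the "1GB/m of simultaneous read-write" figure:
# every byte of a moved extent is read once and written once.
GIB = 1024 ** 3
MIB = 1024 ** 2

moved_per_minute = 1 * GIB                # extent data relocated per minute
bytes_per_second = moved_per_minute / 60

read_mib_s = bytes_per_second / MIB       # reads of extent data...
write_mib_s = bytes_per_second / MIB      # ...plus the matching writes

print(round(read_mib_s, 1), round(write_mib_s, 1))  # -> 17.1 17.1
print(round(bytes_per_second / 1e6, 1))             # -> 17.9 (decimal MB/s each way)
```

Against a disk that can stream 80-160 MB/s sequentially, ~17 MB/s each way while also doing randomish metadata updates with barriers is, as the post argues, decent performance even though it feels slow.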
Re: FS gives kernel UPS on attempt to create snapshot and after running balance it's unmountable.
): forced readonly
Jan 23 05:00:02 server kernel: BTRFS: error (device sdc) in btrfs_run_delayed_refs:2960: errno=-2 No such entry
Jan 23 05:00:02 server kernel: BTRFS: error (device sdc) in create_pending_snapshot:1604: errno=-2 No such entry
Jan 23 05:00:02 server kernel: BTRFS warning (device sdc): Skipping commit of aborted transaction.
Jan 23 05:00:02 server kernel: BTRFS: error (device sdc) in cleanup_transaction:1854: errno=-2 No such entry

On 21 February 2017 at 22:18, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:
> Anyone ?
>
> On 18 Feb 2017, at 16:44, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:
>
> So Qu,
>
> currently my situation is that: I've tried btrfs scan --repair, and it did repair some stuff in qgroups ... then tried to mount it and, surprise surprise, the system locked out in 20 seconds.
>
> Reboot, again scan --repair = a lot of missing back pointers were repaired and the system is supposedly "OK". Attempted to mount it, and within 20 seconds the system locked up so hard it would not even reboot from ACPI.
>
> Installed elrepo kernel-ml and got 4.9.10.
>
> Another scan --repair = same problem with lots of back pointers missing, fixed; the system again seems "OK" ... another attempt to mount /dev/sdc /mnt2/main_pool and again after 20 seconds the system locks up hard.
>
> There is nothing in messages, nothing in dmesg ... I think the system locks up so hard that the master btrfs filesystem does not get time to push those logs to disk.
>
> On 16 February 2017 at 23:46, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:
>
> Thanks Qu,
>
> Just before I'll go and accidentally mess up this FS more - I've mentioned originally that this problem started with the FS not being able to create a snapshot (it would get remounted RO automatically) for about a month, and when I realised that there was a problem like that I attempted a full FS balance that caused this FS to be unmountable.
Is there any other debug you would require before I > proceed (I’ve got a lot i > > On 16 Feb 2017, at 01:26, Qu Wenruo <quwen...@cn.fujitsu.com> wrote: > > > > At 02/15/2017 10:11 PM, Tomasz Kusmierz wrote: > > So guys, any help here ? I’m kinda stuck now with system just idling > and doing nothing while I wait for some feedback ... > > > Sorry for the late reply. > > Busying debugging a kernel bug. > > On 14 Feb 2017, at 19:38, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote: > > [root@server ~]# btrfs-show-super -af /dev/sdc > superblock: bytenr=65536, device=/dev/sdc > - > csum_type 0 (crc32c) > csum_size 4 > csum0x17d56ce0 [match] > > > This superblock is good. > > bytenr 65536 > flags 0x1 > ( WRITTEN ) > magic _BHRfS_M [match] > fsid0576d577-8954-4a60-a02b-9492b3c29318 > label main_pool > generation 150682 > root5223857717248 > sys_array_size 321 > chunk_root_generation 150678 > root_level 1 > chunk_root 8669488005120 > chunk_root_level1 > log_root0 > log_root_transid0 > log_root_level 0 > total_bytes 16003191472128 > bytes_used 6411278503936 > sectorsize 4096 > nodesize16384 > leafsize16384 > stripesize 4096 > root_dir6 > num_devices 8 > compat_flags0x0 > compat_ro_flags 0x0 > incompat_flags 0x161 > ( MIXED_BACKREF | >BIG_METADATA | >EXTENDED_IREF | >SKINNY_METADATA ) > cache_generation150682 > uuid_tree_generation150679 > dev_item.uuid 46abffa8-7afe-451f-93c6-abb8e589c4e8 > dev_item.fsid 0576d577-8954-4a60-a02b-9492b3c29318 [match] > dev_item.type 0 > dev_item.total_bytes2000398934016 > dev_item.bytes_used 1647136735232 > dev_item.io_align 4096 > dev_item.io_width 4096 > dev_item.sector_size4096 > dev_item.devid 1 > dev_item.dev_group 0 > dev_item.seek_speed 0 > dev_item.bandwidth 0 > dev_item.generation 0 > sys_chunk_array[2048]: > item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 8669487824896) > length 67108864 owner 2 stripe_len 65536 type SYSTEM|RAID10 > io_align 65536 io_width 65536 sec
Re: FS gives kernel UPS on attempt to create snapshot and after running balance it's unmountable.
Anyone ?

On 18 Feb 2017, at 16:44, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:

So Qu,

currently my situation is that: I've tried btrfs scan --repair, and it did repair some stuff in qgroups ... then tried to mount it and, surprise surprise, the system locked out in 20 seconds.

Reboot, again scan --repair = a lot of missing back pointers were repaired and the system is supposedly "OK". Attempted to mount it, and within 20 seconds the system locked up so hard it would not even reboot from ACPI.

Installed elrepo kernel-ml and got 4.9.10.

Another scan --repair = same problem with lots of back pointers missing, fixed; the system again seems "OK" ... another attempt to mount /dev/sdc /mnt2/main_pool and again after 20 seconds the system locks up hard.

There is nothing in messages, nothing in dmesg ... I think the system locks up so hard that the master btrfs filesystem does not get time to push those logs to disk.

On 16 February 2017 at 23:46, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:

Thanks Qu,

Just before I'll go and accidentally mess up this FS more - I've mentioned originally that this problem started with the FS not being able to create a snapshot (it would get remounted RO automatically) for about a month, and when I realised that there was a problem like that I attempted a full FS balance that caused this FS to be unmountable. Is there any other debug you would require before I proceed (I've got a lot i

On 16 Feb 2017, at 01:26, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:

At 02/15/2017 10:11 PM, Tomasz Kusmierz wrote:

So guys, any help here ? I'm kinda stuck now with the system just idling and doing nothing while I wait for some feedback ...

Sorry for the late reply. Busy debugging a kernel bug.

On 14 Feb 2017, at 19:38, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:

[root@server ~]# btrfs-show-super -af /dev/sdc
superblock: bytenr=65536, device=/dev/sdc
-
csum_type 0 (crc32c)
csum_size 4
csum 0x17d56ce0 [match]

This superblock is good.
bytenr 65536 flags 0x1 ( WRITTEN ) magic _BHRfS_M [match] fsid0576d577-8954-4a60-a02b-9492b3c29318 label main_pool generation 150682 root5223857717248 sys_array_size 321 chunk_root_generation 150678 root_level 1 chunk_root 8669488005120 chunk_root_level1 log_root0 log_root_transid0 log_root_level 0 total_bytes 16003191472128 bytes_used 6411278503936 sectorsize 4096 nodesize16384 leafsize16384 stripesize 4096 root_dir6 num_devices 8 compat_flags0x0 compat_ro_flags 0x0 incompat_flags 0x161 ( MIXED_BACKREF | BIG_METADATA | EXTENDED_IREF | SKINNY_METADATA ) cache_generation150682 uuid_tree_generation150679 dev_item.uuid 46abffa8-7afe-451f-93c6-abb8e589c4e8 dev_item.fsid 0576d577-8954-4a60-a02b-9492b3c29318 [match] dev_item.type 0 dev_item.total_bytes2000398934016 dev_item.bytes_used 1647136735232 dev_item.io_align 4096 dev_item.io_width 4096 dev_item.sector_size4096 dev_item.devid 1 dev_item.dev_group 0 dev_item.seek_speed 0 dev_item.bandwidth 0 dev_item.generation 0 sys_chunk_array[2048]: item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 8669487824896) length 67108864 owner 2 stripe_len 65536 type SYSTEM|RAID10 io_align 65536 io_width 65536 sector_size 4096 num_stripes 8 sub_stripes 2 stripe 0 devid 7 offset 1083674984448 dev_uuid 566fb8a3-d6de-4230-8b70-a5fda0a120f6 stripe 1 devid 8 offset 1083674984448 dev_uuid 845aefb2-e0a6-479a-957b-a82fb7207d6c stripe 2 devid 1 offset 1365901312 dev_uuid 46abffa8-7afe-451f-93c6-abb8e589c4e8 stripe 3 devid 3 offset 1345978368 dev_uuid 95921633-2fc1-479f-a3ba-e6e5a1989755 stripe 4 devid 4 offset 1345978368 dev_uuid 20828f0e-4661-4987-ac11-72814c1e423a stripe 5 devid 5 offset 1345978368 dev_uuid 2c3cd71f-5178-48e7-8032-6b6eec023197 stripe 6 devid 6 offset 1345978368 dev_uuid 806a47e5-cac4-41c9-abb9-5c49506459e1 stripe 7 devid 2 offset 1345978368 dev_uuid e1358e0e-edaf-4505-9c71-ed0862c45841 And I didn't see anything wrong in sys_chunk_array. W
Re: FS gives kernel UPS on attempt to create snapshot and after running balance it's unmountable.
So Qu,

currently my situation is that: I've tried btrfs scan --repair, and it did repair some stuff in qgroups ... then tried to mount it and, surprise surprise, the system locked out in 20 seconds.

Reboot, again scan --repair = a lot of missing back pointers were repaired and the system is supposedly "OK". Attempted to mount it, and within 20 seconds the system locked up so hard it would not even reboot from ACPI.

Installed elrepo kernel-ml and got 4.9.10.

Another scan --repair = same problem with lots of back pointers missing, fixed; the system again seems "OK" ... another attempt to mount /dev/sdc /mnt2/main_pool and again after 20 seconds the system locks up hard.

There is nothing in messages, nothing in dmesg ... I think the system locks up so hard that the master btrfs filesystem does not get time to push those logs to disk.

On 16 February 2017 at 23:46, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:
> Thanks Qu,
>
> Just before I'll go and accidentally mess up this FS more - I've mentioned originally that this problem started with the FS not being able to create a snapshot (it would get remounted RO automatically) for about a month, and when I realised that there was a problem like that I attempted a full FS balance that caused this FS to be unmountable. Is there any other debug you would require before I proceed (I've got a lot i
>
> On 16 Feb 2017, at 01:26, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:
>
> At 02/15/2017 10:11 PM, Tomasz Kusmierz wrote:
>
> So guys, any help here ? I'm kinda stuck now with the system just idling and doing nothing while I wait for some feedback ...
>
> Sorry for the late reply. Busy debugging a kernel bug.
>
> On 14 Feb 2017, at 19:38, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:
>
> [root@server ~]# btrfs-show-super -af /dev/sdc
> superblock: bytenr=65536, device=/dev/sdc
> -
> csum_type 0 (crc32c)
> csum_size 4
> csum 0x17d56ce0 [match]
>
> This superblock is good.
> > bytenr 65536 > flags 0x1 > ( WRITTEN ) > magic _BHRfS_M [match] > fsid0576d577-8954-4a60-a02b-9492b3c29318 > label main_pool > generation 150682 > root5223857717248 > sys_array_size 321 > chunk_root_generation 150678 > root_level 1 > chunk_root 8669488005120 > chunk_root_level1 > log_root0 > log_root_transid0 > log_root_level 0 > total_bytes 16003191472128 > bytes_used 6411278503936 > sectorsize 4096 > nodesize16384 > leafsize16384 > stripesize 4096 > root_dir6 > num_devices 8 > compat_flags0x0 > compat_ro_flags 0x0 > incompat_flags 0x161 > ( MIXED_BACKREF | > BIG_METADATA | > EXTENDED_IREF | > SKINNY_METADATA ) > cache_generation150682 > uuid_tree_generation150679 > dev_item.uuid 46abffa8-7afe-451f-93c6-abb8e589c4e8 > dev_item.fsid 0576d577-8954-4a60-a02b-9492b3c29318 [match] > dev_item.type 0 > dev_item.total_bytes2000398934016 > dev_item.bytes_used 1647136735232 > dev_item.io_align 4096 > dev_item.io_width 4096 > dev_item.sector_size4096 > dev_item.devid 1 > dev_item.dev_group 0 > dev_item.seek_speed 0 > dev_item.bandwidth 0 > dev_item.generation 0 > sys_chunk_array[2048]: > item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 8669487824896) > length 67108864 owner 2 stripe_len 65536 type SYSTEM|RAID10 > io_align 65536 io_width 65536 sector_size 4096 > num_stripes 8 sub_stripes 2 > stripe 0 devid 7 offset 1083674984448 > dev_uuid 566fb8a3-d6de-4230-8b70-a5fda0a120f6 > stripe 1 devid 8 offset 1083674984448 > dev_uuid 845aefb2-e0a6-479a-957b-a82fb7207d6c > stripe 2 devid 1 offset 1365901312 > dev_uuid 46abffa8-7afe-451f-93c6-abb8e589c4e8 > stripe 3 devid 3 offset 1345978368 > dev_uuid 95921633-2fc1-479f-a3ba-e6e5a1989755 > stripe 4 devid 4 offset 1345978368 > dev_uuid 20828f0e-4661-4987-ac11-72814c1e423a > stripe
Re: FS gives kernel UPS on attempt to create snapshot and after running balance it's unmountable.
Thanks Qu, Just before I’ll go and accidentally mess up this FS more - I’ve mentioned originally that this problem started with FS not being able to create a snapshot ( it would get remounted RO automatically ) for about a month, and when I’ve realised that there is a problem like that I’ve attempted a full FS balance that caused this FS to be unmountable. Is there any other debug you would require before I proceed (I’ve got a lot i On 16 Feb 2017, at 01:26, Qu Wenruo <quwen...@cn.fujitsu.com> wrote: At 02/15/2017 10:11 PM, Tomasz Kusmierz wrote: So guys, any help here ? I’m kinda stuck now with system just idling and doing nothing while I wait for some feedback ... Sorry for the late reply. Busying debugging a kernel bug. On 14 Feb 2017, at 19:38, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote: [root@server ~]# btrfs-show-super -af /dev/sdc superblock: bytenr=65536, device=/dev/sdc - csum_type 0 (crc32c) csum_size 4 csum0x17d56ce0 [match] This superblock is good. bytenr 65536 flags 0x1 ( WRITTEN ) magic _BHRfS_M [match] fsid0576d577-8954-4a60-a02b-9492b3c29318 label main_pool generation 150682 root5223857717248 sys_array_size 321 chunk_root_generation 150678 root_level 1 chunk_root 8669488005120 chunk_root_level1 log_root0 log_root_transid0 log_root_level 0 total_bytes 16003191472128 bytes_used 6411278503936 sectorsize 4096 nodesize16384 leafsize16384 stripesize 4096 root_dir6 num_devices 8 compat_flags0x0 compat_ro_flags 0x0 incompat_flags 0x161 ( MIXED_BACKREF | BIG_METADATA | EXTENDED_IREF | SKINNY_METADATA ) cache_generation150682 uuid_tree_generation150679 dev_item.uuid 46abffa8-7afe-451f-93c6-abb8e589c4e8 dev_item.fsid 0576d577-8954-4a60-a02b-9492b3c29318 [match] dev_item.type 0 dev_item.total_bytes2000398934016 dev_item.bytes_used 1647136735232 dev_item.io_align 4096 dev_item.io_width 4096 dev_item.sector_size4096 dev_item.devid 1 dev_item.dev_group 0 dev_item.seek_speed 0 dev_item.bandwidth 0 dev_item.generation 0 sys_chunk_array[2048]: item 0 key 
(FIRST_CHUNK_TREE CHUNK_ITEM 8669487824896) length 67108864 owner 2 stripe_len 65536 type SYSTEM|RAID10 io_align 65536 io_width 65536 sector_size 4096 num_stripes 8 sub_stripes 2 stripe 0 devid 7 offset 1083674984448 dev_uuid 566fb8a3-d6de-4230-8b70-a5fda0a120f6 stripe 1 devid 8 offset 1083674984448 dev_uuid 845aefb2-e0a6-479a-957b-a82fb7207d6c stripe 2 devid 1 offset 1365901312 dev_uuid 46abffa8-7afe-451f-93c6-abb8e589c4e8 stripe 3 devid 3 offset 1345978368 dev_uuid 95921633-2fc1-479f-a3ba-e6e5a1989755 stripe 4 devid 4 offset 1345978368 dev_uuid 20828f0e-4661-4987-ac11-72814c1e423a stripe 5 devid 5 offset 1345978368 dev_uuid 2c3cd71f-5178-48e7-8032-6b6eec023197 stripe 6 devid 6 offset 1345978368 dev_uuid 806a47e5-cac4-41c9-abb9-5c49506459e1 stripe 7 devid 2 offset 1345978368 dev_uuid e1358e0e-edaf-4505-9c71-ed0862c45841 And I didn't see anything wrong in sys_chunk_array. Would you please try to mount the fs with latest kernel? Better later than v4.9, as in that version extra kernel messages are introduced to give more details about what's going wrong. Thanks, Qu backup_roots[4]: backup 0: backup_tree_root: 5223857717248 gen: 150680 level: 1 backup_chunk_root: 8669488005120 gen: 150678 level: 1 backup_extent_root: 5223867383808 gen: 150680 level: 2 backup_fs_root: 0 gen: 0 level: 0 backup_dev_root:5224791523328 gen: 150680 level: 1 backup_csum_root: 5224802140160 gen: 150680 level: 3 backup_total_bytes: 16003191472128 backup_bytes_used: 6411278503936 backup_num_devices: 8 backup 1: backup_tree_root: 5224155807744 gen: 150681 level: 1 backup_chunk_root: 8669488005120 gen: 150678 level: 1 backup
Re: FS gives kernel oops on attempt to create snapshot and after running balance it's unmountable.
So guys, any help here ? I’m kinda stuck now with system just idling and doing nothing while I wait for some feedback ... > On 14 Feb 2017, at 19:38, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote: > > [root@server ~]# btrfs-show-super -af /dev/sdc > superblock: bytenr=65536, device=/dev/sdc > - > csum_type 0 (crc32c) > csum_size 4 > csum0x17d56ce0 [match] > bytenr 65536 > flags 0x1 >( WRITTEN ) > magic _BHRfS_M [match] > fsid0576d577-8954-4a60-a02b-9492b3c29318 > label main_pool > generation 150682 > root5223857717248 > sys_array_size 321 > chunk_root_generation 150678 > root_level 1 > chunk_root 8669488005120 > chunk_root_level1 > log_root0 > log_root_transid0 > log_root_level 0 > total_bytes 16003191472128 > bytes_used 6411278503936 > sectorsize 4096 > nodesize16384 > leafsize16384 > stripesize 4096 > root_dir6 > num_devices 8 > compat_flags0x0 > compat_ro_flags 0x0 > incompat_flags 0x161 >( MIXED_BACKREF | > BIG_METADATA | > EXTENDED_IREF | > SKINNY_METADATA ) > cache_generation150682 > uuid_tree_generation150679 > dev_item.uuid 46abffa8-7afe-451f-93c6-abb8e589c4e8 > dev_item.fsid 0576d577-8954-4a60-a02b-9492b3c29318 [match] > dev_item.type 0 > dev_item.total_bytes2000398934016 > dev_item.bytes_used 1647136735232 > dev_item.io_align 4096 > dev_item.io_width 4096 > dev_item.sector_size4096 > dev_item.devid 1 > dev_item.dev_group 0 > dev_item.seek_speed 0 > dev_item.bandwidth 0 > dev_item.generation 0 > sys_chunk_array[2048]: >item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 8669487824896) >length 67108864 owner 2 stripe_len 65536 type SYSTEM|RAID10 >io_align 65536 io_width 65536 sector_size 4096 >num_stripes 8 sub_stripes 2 >stripe 0 devid 7 offset 1083674984448 >dev_uuid 566fb8a3-d6de-4230-8b70-a5fda0a120f6 >stripe 1 devid 8 offset 1083674984448 >dev_uuid 845aefb2-e0a6-479a-957b-a82fb7207d6c >stripe 2 devid 1 offset 1365901312 >dev_uuid 46abffa8-7afe-451f-93c6-abb8e589c4e8 >stripe 3 devid 3 offset 1345978368 >dev_uuid 95921633-2fc1-479f-a3ba-e6e5a1989755 >stripe 4 devid 
4 offset 1345978368 >dev_uuid 20828f0e-4661-4987-ac11-72814c1e423a >stripe 5 devid 5 offset 1345978368 >dev_uuid 2c3cd71f-5178-48e7-8032-6b6eec023197 >stripe 6 devid 6 offset 1345978368 >dev_uuid 806a47e5-cac4-41c9-abb9-5c49506459e1 >stripe 7 devid 2 offset 1345978368 >dev_uuid e1358e0e-edaf-4505-9c71-ed0862c45841 > backup_roots[4]: >backup 0: >backup_tree_root: 5223857717248 gen: 150680 level: > 1 >backup_chunk_root: 8669488005120 gen: 150678 level: > 1 >backup_extent_root: 5223867383808 gen: 150680 level: > 2 >backup_fs_root: 0 gen: 0 level: 0 >backup_dev_root:5224791523328 gen: 150680 level: > 1 >backup_csum_root: 5224802140160 gen: 150680 level: > 3 >backup_total_bytes: 16003191472128 >backup_bytes_used: 6411278503936 >backup_num_devices: 8 > >backup 1: >backup_tree_root: 5224155807744 gen: 150681 level: > 1 >backup_chunk_root: 8669488005120 gen: 150678 level: > 1 >backup_extent_root: 5224156233728 gen: 150681 level: > 2 >backup_fs_root: 0 gen: 0 level: 0 >backup_dev_root:5224633155584 gen: 150681 level: > 1 >backup_csum_root: 5224634941440 gen: 150681 level: > 3 >backup_total_bytes: 16003191472128 >
Re: FS gives kernel oops on attempt to create snapshot and after running balance it's unmountable.
backup_bytes_used: 6411278503936 backup_num_devices: 8 backup 1: backup_tree_root: 5224155807744 gen: 150681 level: 1 backup_chunk_root: 8669488005120 gen: 150678 level: 1 backup_extent_root: 5224156233728 gen: 150681 level: 2 backup_fs_root: 0 gen: 0 level: 0 backup_dev_root:5224633155584 gen: 150681 level: 1 backup_csum_root: 5224634941440 gen: 150681 level: 3 backup_total_bytes: 16003191472128 backup_bytes_used: 6411278503936 backup_num_devices: 8 backup 2: backup_tree_root: 5223857717248 gen: 150682 level: 1 backup_chunk_root: 8669488005120 gen: 150678 level: 1 backup_extent_root: 5223867383808 gen: 150682 level: 2 backup_fs_root: 0 gen: 0 level: 0 backup_dev_root:5224622358528 gen: 150682 level: 1 backup_csum_root: 5224675344384 gen: 150682 level: 3 backup_total_bytes: 16003191472128 backup_bytes_used: 6411278503936 backup_num_devices: 8 backup 3: backup_tree_root: 11179477942272 gen: 150679 level: 1 backup_chunk_root: 8669488005120 gen: 150678 level: 1 backup_extent_root: 11179488018432 gen: 150679 level: 2 backup_fs_root: 6217817456640 gen: 150497 level: 0 backup_dev_root:5224337244160 gen: 150679 level: 1 backup_csum_root: 11179492540416 gen: 150679 level: 3 backup_total_bytes: 16003191472128 backup_bytes_used: 6411278503936 backup_num_devices: 8 On 14 February 2017 at 00:25, Qu Wenruo <quwen...@cn.fujitsu.com> wrote: > > > At 02/14/2017 08:23 AM, Tomasz Kusmierz wrote: >> >> Forgot to mention: >> >> btrfs inspect-internal dump-super -af /dev/sdc > > > Your btrfs-progs is somewhat old, which doesn't integrate dump super into > inspect-internal. > > In that case, you can use btrfs-show-super -af instead. 
> > Thanks, > Qu > >> >> btrfs inspect-internal: unknown token 'dump-super' >> usage: btrfs inspect-internal >> >> btrfs inspect-internal inode-resolve [-v] >> Get file system paths for the given inode >> btrfs inspect-internal logical-resolve [-Pv] [-s bufsize] >> >> Get file system paths for the given logical address >> btrfs inspect-internal subvolid-resolve >> Get file system paths for the given subvolume ID. >> btrfs inspect-internal rootid >> Get tree ID of the containing subvolume of path. >> btrfs inspect-internal min-dev-size [options] >> Get the minimum size the device can be shrunk to. The >> >> query various internal information >> >> On 13 February 2017 at 14:58, Tomasz Kusmierz <tom.kusmi...@gmail.com> >> wrote: >>> >>> Problem is to send a larger log into this mailing list :/ >>> >>> Anyway: uname -a >>> Linux tevva-server 4.8.7-1.el7.elrepo.x86_64 #1 SMP Thu Nov 10 >>> 20:47:24 EST 2016 x86_64 x86_64 x86_64 GNU/Linux >>> >>> >>> cut from messages (bear in mind that this is a single cut with a bit >>> cut from inside of it to fit it in the email) >>> >>> Feb 10 00:17:14 server journal: ==> >>> /var/log/gitlab/gitlab-shell/gitlab-shell.log <== >>> Feb 10 00:17:30 server journal: 192.168.1.253 - wally_tm >>> [10/Feb/2017:00:17:29 +] "PROPFIND /remote.php/webdav/Pictures >>> HTTP/1.1" 207 1024 "-" "Mozilla/5.0 (Linux) mirall/2.1.1" >>> Feb 10 00:18:00 server kernel: BTRFS info (device sdc): found 22 extents >>> Feb 10 00:18:01 server journal: 192.168.1.253 - wally_tm >>> [10/Feb/2017:00:17:59 +] "PROPFIND /remote.php/webdav/Pictures >>> HTTP/1.1" 207 1024 "-" "Mozilla/5.0 (Linux) mirall/2.1.1" >>> Feb 10 00:18:05 server kernel: BTRFS info (device sdc): found 22 extents >>> Feb 10 00:18:06 server kernel: BTRFS info (device sdc): relocating >>> block group 12353563131904 flags 65 >>> Feb 10 00:18:06 server journal: >>> Feb 10 00:18:06 server journal: ==> /var/log/gitlab/sidekiq/current <== >>> Feb 10 00:18:06 server journal: 2017-02-10_00:18:06.99341 >>> 
2017-02-10T00:18:06.993Z 382 TID-otrr6ws48 P
Re: FS gives kernel oops on attempt to create snapshot and after running balance it's unmountable.
Forgot to mention: btrfs inspect-internal dump-super -af /dev/sdc btrfs inspect-internal: unknown token 'dump-super' usage: btrfs inspect-internal btrfs inspect-internal inode-resolve [-v] Get file system paths for the given inode btrfs inspect-internal logical-resolve [-Pv] [-s bufsize] Get file system paths for the given logical address btrfs inspect-internal subvolid-resolve Get file system paths for the given subvolume ID. btrfs inspect-internal rootid Get tree ID of the containing subvolume of path. btrfs inspect-internal min-dev-size [options] Get the minimum size the device can be shrunk to. The query various internal information On 13 February 2017 at 14:58, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote: > Problem is to send a larger log into this mailing list :/ > > Anyway: uname -a > Linux tevva-server 4.8.7-1.el7.elrepo.x86_64 #1 SMP Thu Nov 10 > 20:47:24 EST 2016 x86_64 x86_64 x86_64 GNU/Linux > > > cut from messages (bear in mind that this is a single cut with a bit > cut from inside of it to fit it in the email) > > Feb 10 00:17:14 server journal: ==> > /var/log/gitlab/gitlab-shell/gitlab-shell.log <== > Feb 10 00:17:30 server journal: 192.168.1.253 - wally_tm > [10/Feb/2017:00:17:29 +] "PROPFIND /remote.php/webdav/Pictures > HTTP/1.1" 207 1024 "-" "Mozilla/5.0 (Linux) mirall/2.1.1" > Feb 10 00:18:00 server kernel: BTRFS info (device sdc): found 22 extents > Feb 10 00:18:01 server journal: 192.168.1.253 - wally_tm > [10/Feb/2017:00:17:59 +] "PROPFIND /remote.php/webdav/Pictures > HTTP/1.1" 207 1024 "-" "Mozilla/5.0 (Linux) mirall/2.1.1" > Feb 10 00:18:05 server kernel: BTRFS info (device sdc): found 22 extents > Feb 10 00:18:06 server kernel: BTRFS info (device sdc): relocating > block group 12353563131904 flags 65 > Feb 10 00:18:06 server journal: > Feb 10 00:18:06 server journal: ==> /var/log/gitlab/sidekiq/current <== > Feb 10 00:18:06 server journal: 2017-02-10_00:18:06.99341 > 2017-02-10T00:18:06.993Z 382 TID-otrr6ws48 PruneOldEventsWorker > 
JID-99d3a4fb69be748c8674b5e1 INFO: start > Feb 10 00:18:06 server journal: 2017-02-10_00:18:06.99571 > 2017-02-10T00:18:06.995Z 382 TID-otrr6wqok INFO: Cron Jobs - add job > with name: prune_old_events_worker > Feb 10 00:18:07 server journal: 2017-02-10_00:18:07.00454 > 2017-02-10T00:18:07.004Z 382 TID-otrr6ws48 PruneOldEventsWorker > JID-99d3a4fb69be748c8674b5e1 INFO: done: 0.011 sec > Feb 10 00:18:30 server journal: 192.168.1.253 - wally_tm > [10/Feb/2017:00:18:29 +] "PROPFIND /remote.php/webdav/Pictures > HTTP/1.1" 207 1024 "-" "Mozilla/5.0 (Linux) mirall/2.1.1" > Feb 10 00:18:43 server kernel: BTRFS info (device sdc): found 32 extents > Feb 10 00:18:48 server kernel: BTRFS info (device sdc): found 32 extents > Feb 10 00:18:49 server kernel: BTRFS info (device sdc): relocating > block group 12349268164608 flags 65 > Feb 10 00:19:01 server journal: 192.168.1.253 - wally_tm > [10/Feb/2017:00:19:00 +] "PROPFIND /remote.php/webdav/Pictures > HTTP/1.1" 207 1024 "-" "Mozilla/5.0 (Linux) mirall/2.1.1" > Feb 10 00:19:02 server journal: 2017-02-10_00:19:02.51409 > 2017-02-10T00:19:02.513Z 382 TID-otrr6wqok INFO: Cron Jobs - add job > with name: prune_old_events_worker > Feb 10 00:19:02 server journal: 2017-02-10_00:19:02.51449 > 2017-02-10T00:19:02.514Z 382 TID-otrspth10 PruneOldEventsWorker > JID-4a162ace334771baf4befbb7 INFO: start > Feb 10 00:19:02 server journal: 2017-02-10_00:19:02.52994 > 2017-02-10T00:19:02.529Z 382 TID-otrspth10 PruneOldEventsWorker > JID-4a162ace334771baf4befbb7 INFO: done: 0.015 sec > Feb 10 00:19:26 server kernel: BTRFS info (device sdc): found 33 extents > Feb 10 00:19:31 server kernel: BTRFS info (device sdc): found 33 extents > Feb 10 00:19:31 server journal: 192.168.1.253 - wally_tm > [10/Feb/2017:00:19:29 +] "PROPFIND /remote.php/webdav/Pictures > HTTP/1.1" 207 1024 "-" "Mozilla/5.0 (Linux) mirall/2.1.1" > Feb 10 00:19:32 server kernel: BTRFS info (device sdc): relocating > block group 12344973197312 flags 65 > Feb 10 00:19:51 server 
kernel: perf: interrupt took too long (2513 > > 2500), lowering kernel.perf_event_max_sample_rate to 79000 > Feb 10 00:20:00 server journal: 192.168.1.253 - wally_tm > [10/Feb/2017:00:19:59 +] "PROPFIND /remote.php/webdav/Pictures > HTTP/1.1" 207 1024 "-" "Mozilla/5.0 (Linux) mirall/2.1.1" > Feb 10 00:20:10 server kernel: BTRFS info (device sdc): found 32 extents > Feb 10 00:20:10 server journal: 2017-02-10_00:20:10.15695 > 2017-02-10T00:20:10.156Z 382 TID-otrsptg48 > RepositoryCheck::BatchWork
Re: FS gives kernel oops on attempt to create snapshot and after running balance it's unmountable.
flags 258
Feb 10 00:29:11 server kernel: #011#011shared block backref parent 5224641380352
Feb 10 00:29:11 server kernel: #011item 12 key (12288258162688 169 0) itemoff 15854 itemsize 33
Feb 10 00:29:11 server kernel: #011#011extent refs 1 gen 142940 flags 258
Feb 10 00:29:11 server kernel: #011#011shared block backref parent 5224641380352
Feb 10 00:29:11 server kernel: #011item 13 key (12288258179072 169 0) itemoff 15821 itemsize 33
Feb 10 00:29:11 server kernel: #011#011extent refs 1 gen 142940 flags 258
Feb 10 00:29:11 server kernel: #011#011shared block backref parent 5224641380352
Feb 10 00:29:11 server kernel: #011item 14 key (12288258375680 169 0) itemoff 15788 itemsize 33
Feb 10 00:29:11 server kernel: #011#011extent refs 1 gen 142940 flags 258

On 13 Feb 2017, at 00:49, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:

At 02/12/2017 09:17 AM, Tomasz Kusmierz wrote:

Hi all,

So my main storage filesystem got some sort of weird corruption (as far as I can gather). Everything seems to work OK, but when I try to create a snapshot or run balance (no filters) it will get remounted read-only.

Kernel version please.

Fun part is that balance seems to be running even on the read-only FS, and I continuously get kernel traces in /var/log/messages, so it might as well be silently eating my data away in the background :/

Kernel backtrace please.

It would be better if you could paste the *first* kernel backtrace, as that could be the cause; the following kernel backtraces are just warnings from btrfs_abort_transaction() without meaningful output.

I just see some normal messages, but no kernel backtrace.

UPDATE: Yeah, after rebooting the system it does not even mount the FS; mount.btrfs sits in some sort of spinlock and consumes 100% of a single core.
UPDATE 2: System is completely cooked :/

[root@server ~]# btrfs fi show
Label: 'rockstor_server'  uuid: 5581a647-40ef-4a7a-9d73-847bf35a142b
	Total devices 1 FS bytes used 5.72GiB
	devid 1 size 53.17GiB used 7.03GiB path /dev/sda2

Label: 'broken_pool'  uuid: 26095277-a234-455b-8c97-8dac8ad934c8
	Total devices 2 FS bytes used 193.52GiB
	devid 1 size 1.82TiB used 196.03GiB path /dev/sdb
	devid 2 size 1.82TiB used 196.03GiB path /dev/sdi

Label: 'main_pool'  uuid: 0576d577-8954-4a60-a02b-9492b3c29318
	Total devices 8 FS bytes used 5.83TiB
	devid 1 size 1.82TiB used 1.50TiB path /dev/sdc
	devid 2 size 1.82TiB used 1.50TiB path /dev/sdd
	devid 3 size 1.82TiB used 1.50TiB path /dev/sde
	devid 4 size 1.82TiB used 1.50TiB path /dev/sdf
	devid 5 size 1.82TiB used 1.50TiB path /dev/sdg
	devid 6 size 1.82TiB used 1.50TiB path /dev/sdh
	devid 7 size 1.82TiB used 1.50TiB path /dev/sdj
	devid 8 size 1.82TiB used 1.50TiB path /dev/sdk

[root@server ~]# mount /dev/sdc /mnt2/main_pool/
mount: wrong fs type, bad option, bad superblock on /dev/sdc, missing codepage or helper program, or other error
In some cases useful info is found in syslog - try dmesg | tail or so.

[root@server ~]# mount /dev/sdd /mnt2/main_pool/
mount: wrong fs type, bad option, bad superblock on /dev/sdd, missing codepage or helper program, or other error
In some cases useful info is found in syslog - try dmesg | tail or so.

[root@server ~]# mount /dev/sde /mnt2/main_pool/
mount: wrong fs type, bad option, bad superblock on /dev/sde, missing codepage or helper program, or other error
In some cases useful info is found in syslog - try dmesg | tail or so.

dmesg tail returns:
[ 9507.835629] systemd-udevd[1873]: Validate module index
[ 9507.835656] systemd-udevd[1873]: Check if link configuration needs reloading.
[ 9507.835690] systemd-udevd[1873]: seq 3698 queued, 'add' 'bdi'
[ 9507.835873] systemd-udevd[1873]: seq 3698 forked new worker [13858]
[ 9507.836202] BTRFS info (device sdd): disk space caching is enabled
[ 9507.836204] BTRFS info (device sdd): has skinny extents
[ 9507.836322] systemd-udevd[13858]: seq 3698 running
[ 9507.836443] systemd-udevd[13858]: no db file to read /run/udev/data/+bdi:btrfs-4: No such file or directory
[ 9507.836474] systemd-udevd[13858]: RUN '/bin/mknod /dev/btrfs-control c 10 234' /etc/udev/rules.d/64-btrfs.rules:1
[ 9507.837366] systemd-udevd[13861]: starting '/bin/mknod /dev/btrfs-control c 10 234'
[ 9507.837833] BTRFS error (device sdd): failed to read the system array: -5
[ 9507.838231] systemd-udevd[13858]: '/bin/mknod /dev/btrfs-control c 10 234'(err) '/bin/mknod: '/dev/btrfs-control': File exists'
[ 9507.838262] systemd-udevd[13858]: '/bin/mknod /dev/btrfs-control c 10 234' [13861] exit with return code 1
[ 9507.854757] BTRFS: open_ctree failed
[ 9511.370878] BTRFS info (device sdd): disk space caching is enabled
[ 9511.370881] BTRFS info (device sdd): has skinny extents
[ 9511.375097] BTRFS error (device sdd): failed to read the system array: -5

Btrfs failed to read system chunk array from super block. Normally this means
FS gives kernel oops on attempt to create snapshot and after running balance it's unmountable.
Hi all,

So my main storage filesystem got some sort of weird corruption (as far as I can gather). Everything seems to work OK, but when I try to create a snapshot or run balance (no filters) it will get remounted read-only. Fun part is that balance seems to be running even on the read-only FS, and I continuously get kernel traces in /var/log/messages, so it might as well be silently eating my data away in the background :/

UPDATE: Yeah, after rebooting the system it does not even mount the FS; mount.btrfs sits in some sort of spinlock and consumes 100% of a single core.

UPDATE 2: System is completely cooked :/

[root@server ~]# btrfs fi show
Label: 'rockstor_server'  uuid: 5581a647-40ef-4a7a-9d73-847bf35a142b
	Total devices 1 FS bytes used 5.72GiB
	devid 1 size 53.17GiB used 7.03GiB path /dev/sda2

Label: 'broken_pool'  uuid: 26095277-a234-455b-8c97-8dac8ad934c8
	Total devices 2 FS bytes used 193.52GiB
	devid 1 size 1.82TiB used 196.03GiB path /dev/sdb
	devid 2 size 1.82TiB used 196.03GiB path /dev/sdi

Label: 'main_pool'  uuid: 0576d577-8954-4a60-a02b-9492b3c29318
	Total devices 8 FS bytes used 5.83TiB
	devid 1 size 1.82TiB used 1.50TiB path /dev/sdc
	devid 2 size 1.82TiB used 1.50TiB path /dev/sdd
	devid 3 size 1.82TiB used 1.50TiB path /dev/sde
	devid 4 size 1.82TiB used 1.50TiB path /dev/sdf
	devid 5 size 1.82TiB used 1.50TiB path /dev/sdg
	devid 6 size 1.82TiB used 1.50TiB path /dev/sdh
	devid 7 size 1.82TiB used 1.50TiB path /dev/sdj
	devid 8 size 1.82TiB used 1.50TiB path /dev/sdk

[root@server ~]# mount /dev/sdc /mnt2/main_pool/
mount: wrong fs type, bad option, bad superblock on /dev/sdc, missing codepage or helper program, or other error
In some cases useful info is found in syslog - try dmesg | tail or so.

[root@server ~]# mount /dev/sdd /mnt2/main_pool/
mount: wrong fs type, bad option, bad superblock on /dev/sdd, missing codepage or helper program, or other error
In some cases useful info is found in syslog - try dmesg | tail or so.
[root@server ~]# mount /dev/sde /mnt2/main_pool/
mount: wrong fs type, bad option, bad superblock on /dev/sde, missing codepage or helper program, or other error
In some cases useful info is found in syslog - try dmesg | tail or so.

dmesg tail returns:
[ 9507.835629] systemd-udevd[1873]: Validate module index
[ 9507.835656] systemd-udevd[1873]: Check if link configuration needs reloading.
[ 9507.835690] systemd-udevd[1873]: seq 3698 queued, 'add' 'bdi'
[ 9507.835873] systemd-udevd[1873]: seq 3698 forked new worker [13858]
[ 9507.836202] BTRFS info (device sdd): disk space caching is enabled
[ 9507.836204] BTRFS info (device sdd): has skinny extents
[ 9507.836322] systemd-udevd[13858]: seq 3698 running
[ 9507.836443] systemd-udevd[13858]: no db file to read /run/udev/data/+bdi:btrfs-4: No such file or directory
[ 9507.836474] systemd-udevd[13858]: RUN '/bin/mknod /dev/btrfs-control c 10 234' /etc/udev/rules.d/64-btrfs.rules:1
[ 9507.837366] systemd-udevd[13861]: starting '/bin/mknod /dev/btrfs-control c 10 234'
[ 9507.837833] BTRFS error (device sdd): failed to read the system array: -5
[ 9507.838231] systemd-udevd[13858]: '/bin/mknod /dev/btrfs-control c 10 234'(err) '/bin/mknod: '/dev/btrfs-control': File exists'
[ 9507.838262] systemd-udevd[13858]: '/bin/mknod /dev/btrfs-control c 10 234' [13861] exit with return code 1
[ 9507.854757] BTRFS: open_ctree failed
[ 9511.370878] BTRFS info (device sdd): disk space caching is enabled
[ 9511.370881] BTRFS info (device sdd): has skinny extents
[ 9511.375097] BTRFS error (device sdd): failed to read the system array: -5
[ 9511.392792] BTRFS: open_ctree failed
[ 9514.233627] BTRFS: device label main_pool devid 3 transid 150680 /dev/sde
[ 9514.234399] systemd-udevd[1873]: Validate module index
[ 9514.234431] systemd-udevd[1873]: Check if link configuration needs reloading.
[ 9514.234465] systemd-udevd[1873]: seq 3702 queued, 'add' 'bdi'
[ 9514.234522] systemd-udevd[1873]: passed 142 bytes to netlink monitor 0x5628f65d40d0
[ 9514.234554] systemd-udevd[13882]: seq 3702 running
[ 9514.234780] systemd-udevd[13882]: no db file to read /run/udev/data/+bdi:btrfs-6: No such file or directory
[ 9514.234790] BTRFS info (device sde): disk space caching is enabled
[ 9514.234792] BTRFS info (device sde): has skinny extents
[ 9514.234798] systemd-udevd[13882]: RUN '/bin/mknod /dev/btrfs-control c 10 234' /etc/udev/rules.d/64-btrfs.rules:1
[ 9514.235181] systemd-udevd[13906]: starting '/bin/mknod /dev/btrfs-control c 10 234'
[ 9514.236448] systemd-udevd[13882]: '/bin/mknod /dev/btrfs-control c 10 234'(err) '/bin/mknod: '/dev/btrfs-control': File exists'
[ 9514.236514] systemd-udevd[13882]: '/bin/mknod /dev/btrfs-control c 10 234' [13906] exit with return code 1
[ 9514.238726] BTRFS error (device sde): failed to read the system array: -5
[ 9514.255472] BTRFS: open_ctree failed
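When a mount fails like this, the relevant kernel lines are easy to lose in the udev noise. A small filter (a sketch; the pattern simply matches the `BTRFS error`/`BTRFS critical`/`BTRFS warning` prefixes the kernel uses, and the function name is hypothetical) pulls them out of the log:

```shell
# Print only the btrfs problem lines from kernel log text on stdin.
# Usage: dmesg | btrfs_errors
btrfs_errors() {
    grep -E 'BTRFS (error|critical|warning)'
}
```

Against the log above this would surface just the `failed to read the system array: -5` and similar lines, which is usually all a bug report needs.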
Re: Best practices for raid 1
That was a long-winded way of saying "there is no mechanism in btrfs to tell you exactly which device is missing", but thanks anyway.

> On 12 Jan 2017, at 12:47, Austin S. Hemmelgarn <ahferro...@gmail.com> wrote:
>
> On 2017-01-11 15:37, Tomasz Kusmierz wrote:
>> I would like to use this thread to ask a few questions:
>>
>> If we have 2 devices dying on us and we run RAID6 - this theoretically will still run (despite our current problems). Now let's say that we booted up a raid6 of 10 disks and 2 of them die, but the operator does NOT know the dev IDs of the disks that died. How does one remove those devices other than using "-missing"??? I ask because it's stated in multiple places to use "replace" when your device dies, but nobody ever states how to find out which /dev/ node is actually missing ... so when I want to use a replace, I don't know what to use within the command :/ ... This whole thing might have an additional complication - if the FS is full, then one would need to add disks, then remove missing.
> raid6 is a special case right now (aside from the fact that it's not safe for general usage) because it's the only profile on BTRFS that can sustain more than one failed disk. In the case that the devices aren't actually listed as missing (most disks won't disappear unless the cabling, storage controller, or disk electronics are bad), you can use btrfs fi show to see what the mapping is. If the disks are missing (again, not likely unless there's a pretty severe electrical failure somewhere), it's safer in that case to add enough devices to satisfy replication and storage constraints, then just run 'btrfs device delete missing' to get rid of the other disks.
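To make the devid mapping concrete: `btrfs fi show` prints one `devid N ... path /dev/...` line per device it can still see, so the devid of a truly missing disk is the one absent from that list. A small sketch of that set difference (the helper name is hypothetical, not a btrfs tool):

```shell
# Given the expected number of devices and the devids that
# `btrfs fi show` still lists, print the devids that have vanished.
find_missing_devids() {
    num_devices=$1; shift
    {
        seq 1 "$num_devices"     # every devid we expect, once
        printf '%s\n' "$@" "$@"  # every devid still present, twice
    } | sort | uniq -u           # lines seen exactly once = missing
}

# Example: an 8-device pool where `btrfs fi show` only listed
# devids 1 2 3 4 5 6 8 would give:
#   find_missing_devids 8 1 2 3 4 5 6 8   -> 7
# After that, the recovery Austin describes is roughly:
#   mount -o degraded /dev/sdc /mnt
#   btrfs device add /dev/sdNEW /mnt
#   btrfs device delete missing /mnt
```

This assumes devids were allocated contiguously from 1, which holds for a pool that never had devices removed; adjust the expected list otherwise.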
>> >> >>> On 10 Jan 2017, at 21:49, Chris Murphy <li...@colorremedies.com> wrote: >>> >>> On Tue, Jan 10, 2017 at 2:07 PM, Vinko Magecic >>> <vinko.mage...@construction.com> wrote: >>>> Hello, >>>> >>>> I set up a raid 1 with two btrfs devices and came across some situations >>>> in my testing that I can't get a straight answer on. >>>> >>>> 1) When replacing a volume, do I still need to `umount /path` and then >>>> `mount -o degraded ...` the good volume before doing the `btrfs replace >>>> start ...` ? >>> >>> No. If the device being replaced is unreliable, use -r to limit the >>> reads from the device being replaced. >>> >>> >>> >>>> I didn't see anything that said I had to and when I tested it without >>>> mounting the volume it was able to replace the device without any issue. >>>> Is that considered bad and could risk damage or has `replace` made it >>>> possible to replace devices without umounting the filesystem? >>> >>> It's always been possible even before 'replace'. >>> btrfs dev add >>> btrfs dev rem >>> >>> But there are some bugs in dev replace that Qu is working on; I think >>> they mainly negatively impact raid56 though. >>> >>> The one limitation of 'replace' is that the new block device must be >>> equal to or larger than the block device being replaced; where dev add >>>> dev rem doesn't require this. >>> >>> >>>> 2) Everything I see about replacing a drive says to use `/old/device >>>> /new/device` but what if the old device can't be read or no longer exists? >>> >>> The command works whether the device is present or not; but if it's >>> present and working then any errors on one device can be corrected by >>> the other, whereas if the device is missing, then any errors on the >>> remaining device can't be corrected. Off hand I'm not sure if the >>> replace continues and an error just logged...I think that's what >>> should happen. >>> >>> >>>> Would that be a `btrfs device add /new/device; btrfs balance start >>>> /new/device` ? 
>>> >>> dev add then dev rem; the balance isn't necessary. >>> >>>> >>>> 3) When I have the RAID1 with two devices and I want to grow it out, which >>>> is the better practice? Create a larger volume, replace the old device >>>> with the new device and then do it a second time for the other device, or >>>> attaching the new volumes to the label/uuid one at a time and with each >>>> one use `btrfs filesystem resize devid:max /mountpoint`. >>> >>> If you're replacing a 2x raid1 with t
Re: Best practices for raid 1
I would like to use this thread to ask a few questions:

If we have 2 devices dying on us and we run RAID6 - this theoretically will still run (despite our current problems). Now let's say that we booted up a raid6 of 10 disks and 2 of them die, but the operator does NOT know the dev IDs of the disks that died. How does one remove those devices other than using "-missing"??? I ask because it's stated in multiple places to use "replace" when your device dies, but nobody ever states how to find out which /dev/ node is actually missing ... so when I want to use a replace, I don't know what to use within the command :/ ... This whole thing might have an additional complication - if the FS is full, then one would need to add disks, then remove missing.

> On 10 Jan 2017, at 21:49, Chris Murphy wrote:
>
> On Tue, Jan 10, 2017 at 2:07 PM, Vinko Magecic wrote:
>> Hello,
>>
>> I set up a raid 1 with two btrfs devices and came across some situations in my testing that I can't get a straight answer on.
>>
>> 1) When replacing a volume, do I still need to `umount /path` and then `mount -o degraded ...` the good volume before doing the `btrfs replace start ...` ?
>
> No. If the device being replaced is unreliable, use -r to limit the reads from the device being replaced.
>
>> I didn't see anything that said I had to and when I tested it without mounting the volume it was able to replace the device without any issue. Is that considered bad and could risk damage or has `replace` made it possible to replace devices without umounting the filesystem?
>
> It's always been possible even before 'replace'.
> btrfs dev add
> btrfs dev rem
>
> But there are some bugs in dev replace that Qu is working on; I think they mainly negatively impact raid56 though.
>
> The one limitation of 'replace' is that the new block device must be equal to or larger than the block device being replaced; where dev add + dev rem doesn't require this.
> > >> 2) Everything I see about replacing a drive says to use `/old/device >> /new/device` but what if the old device can't be read or no longer exists? > > The command works whether the device is present or not; but if it's > present and working then any errors on one device can be corrected by > the other, whereas if the device is missing, then any errors on the > remaining device can't be corrected. Off hand I'm not sure if the > replace continues and an error just logged...I think that's what > should happen. > > >> Would that be a `btrfs device add /new/device; btrfs balance start >> /new/device` ? > > dev add then dev rem; the balance isn't necessary. > >> >> 3) When I have the RAID1 with two devices and I want to grow it out, which >> is the better practice? Create a larger volume, replace the old device with >> the new device and then do it a second time for the other device, or >> attaching the new volumes to the label/uuid one at a time and with each one >> use `btrfs filesystem resize devid:max /mountpoint`. > > If you're replacing a 2x raid1 with two bigger replacements, you'd use > 'btrfs replace' twice. Maybe it'd work concurrently, I've never tried > it, but useful for someone to test and see if it explodes because if > it's allowed, it should work or fail gracefully. > > There's no need to do filesystem resizes when doing either 'replace' > or 'dev add' followed by 'dev rem' because the fs resize is implied. > First it's resized/grown with add; and then it's resized/shrink with > remove. For replace there's a consolidation of steps, it's been a > while since I've looked at the code so I can't tell you what steps it > skips, what the state of the devices are in during the replace, which > one active writes go to. 
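For the "replace twice to grow a 2x raid1" case discussed above, the sequence can be planned up front. The sketch below only *prints* the commands rather than running them (device names and the mountpoint are placeholders); the trailing `resize devid:max` calls are kept because on many btrfs-progs versions replace onto a larger disk does not automatically grow the filesystem - verify against your version and drop them if the resize turns out to be implied:

```shell
# Print (not run) a replace-then-resize plan for growing a raid1 pool.
# $1 = mountpoint, remaining args = devid:new_device pairs.
grow_raid1_plan() {
    mnt=$1; shift
    for pair in "$@"; do
        devid=${pair%%:*}
        newdev=${pair#*:}
        # run these one at a time; wait for `btrfs replace status`
        # to report completion before starting the next one
        echo "btrfs replace start $devid $newdev $mnt"
    done
    for pair in "$@"; do
        devid=${pair%%:*}
        echo "btrfs filesystem resize ${devid}:max $mnt"
    done
}

# grow_raid1_plan /mnt 1:/dev/sdx 2:/dev/sdy
```

`btrfs replace start` accepts either a source device path or a devid, which is convenient when the old disk is already gone.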
> --
> Chris Murphy

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Best practices for raid 1
> On 10 Jan 2017, at 21:07, Vinko Magecic wrote:
>
> Hello,
>
> I set up a raid 1 with two btrfs devices and came across some situations in my testing that I can't get a straight answer on.
>
> 1) When replacing a volume, do I still need to `umount /path` and then `mount -o degraded ...` the good volume before doing the `btrfs replace start ...` ? I didn't see anything that said I had to, and when I tested it without mounting the volume it was able to replace the device without any issue. Is that considered bad and could risk damage, or has `replace` made it possible to replace devices without unmounting the filesystem?

No need to unmount, just replace old with new. Your scenario seems very convoluted and pointless.

> 2) Everything I see about replacing a drive says to use `/old/device /new/device` but what if the old device can't be read or no longer exists? Would that be a `btrfs device add /new/device; btrfs balance start /new/device` ?

In the case where the old device is missing you've got a few options:
- if you have enough space to fit the data and enough disks to comply with the redundancy - just remove the drive. For example, if you have 3 x 1TB drives with raid 1 and use less than 1TB of data in total - just remove one drive and you will have 2 x 1TB drives in raid 1, and btrfs will just rebalance stuff for you!
- if you don't have enough space to fit the data / enough disks left to comply with the raid level - your only option is to add a disk first, then remove the missing one (btrfs dev delete missing /mount_point_of_your_fs)

> 3) When I have the RAID1 with two devices and I want to grow it out, which is the better practice? Create a larger volume, replace the old device with the new device and then do it a second time for the other device, or attaching the new volumes to the label/uuid one at a time and with each one use `btrfs filesystem resize devid:max /mountpoint`.

You kinda misunderstand the principle of btrfs.
Btrfs will span across ALL the available space you've got. If you have multiple devices in this setup (remember that a partition IS A DEVICE), it will span across multiple devices and you can't change this.

Now, btrfs resize is meant for resizing the filesystem occupying a device (or partition). So the workflow is: if you want to shrink a device (partition), you first shrink the fs on this device, then size down the device (partition) ... if you want to increase the size of a device (partition), you increase the size of the device (partition), then you grow the filesystem within this device (partition). This is 100% irrespective of the total cumulative size of the filesystem.

Let's say you've got a btrfs filesystem spanning 3 x 1TB devices ... and those devices are partitions. You have a raid 1 setup - your complete amount of available space is 1.5TB. Let's say you want to shrink one of the partitions to 0.5TB -> first you shrink the FS on this partition (a balance will run automatically) -> you shrink the partition down to 0.5TB -> from now on your total available space is 1.25TB. Simple, right? :)

> Thanks
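The arithmetic in the example above (3 x 1TB raid1 = 1.5TB usable; shrink one member to 0.5TB and usable space drops to 1.25TB) generalises: with two copies of every extent, usable space is the smaller of half the total and the total minus the largest device. A sketch of that calculation (the helper is illustrative, not a btrfs tool; `btrfs fi usage` reports the real figure):

```shell
# Usable capacity of a btrfs raid1 pool, in whatever unit the
# arguments use. Every extent is stored twice on two *different*
# devices, so capacity is limited both by sum/2 and by how much
# of the largest device the others can mirror (sum - max).
raid1_usable() {
    sum=0; max=0
    for d in "$@"; do
        sum=$((sum + d))
        if [ "$d" -gt "$max" ]; then max=$d; fi
    done
    half=$((sum / 2)); rest=$((sum - max))
    if [ "$half" -lt "$rest" ]; then echo "$half"; else echo "$rest"; fi
}
```

With sizes in GB, `raid1_usable 1000 1000 1000` gives 1500 and `raid1_usable 1000 1000 500` gives 1250, matching the 1.5TB and 1.25TB figures in the email.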
Re: How to get back a deleted sub-volume.
Chris, the "btrfs-show-super -fa" gives me nothing useful to work with. "btrfs-find-root -a" is actually something that I was already using (see original post), but the list of roots it gave has a rather LARGE hole of about 200 generations, sitting between the point right after everything was removed and a point about one month before the whole situation. On 12 December 2016 at 04:14, Chris Murphy wrote: > Tomasz - try using 'btrfs-find-root -a ' I totally forgot about > this option. It goes through the extent tree and might have a chance > of finding additional generations that aren't otherwise being found. > You can then plug those tree roots into 'btrfs restore -t ' > and do it with the -D and -v options so it's a verbose dry run, and > see if the file listing it spits out is at all useful - if it has any > of the data you're looking for. > > > Chris Murphy
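Chris's suggestion (plug each tree root from btrfs-find-root into a dry-run btrfs restore) can be scripted. A hedged sketch, assuming the "Well block N(gen: G level: L)" output format quoted later in this thread; the device and target paths are the ones used in this thread, and the generated commands are only printed, not executed, so they can be reviewed first:

```shell
# Pull candidate tree roots out of saved btrfs-find-root output and
# print the dry-run restore commands to try, newest generation first.
# Field layout matches lines like:
#   Well block 919363862528(gen: 184540 level: 1) seems good, ...
list_restore_cmds() {
    awk '/^Well block/ {
        block = $3; sub(/\(.*/, "", block)   # strip the "(gen:" suffix
        print $4, block                      # generation, tree root block
    }' "$1" | sort -rn | while read -r gen block; do
        printf 'btrfs restore -t %s -D -v /dev/sda /mnt2/temp2/\n' "$block"
    done
}
```

Run the printed commands one by one (the -D keeps them read-only dry runs) and look for one whose file listing contains the data you're after.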
Re: How to get back a deleted sub-volume.
Chris, for all the time you've helped so far I really have to apologise: I've led you astray. The reason the subvolumes were deleted has nothing to do with btrfs itself. I'm using "Rockstor" to ease management tasks. This tool / environment / distribution treats a single btrfs FS as a "pool" (something in the line of zfs :/), and when one removes a pool from the system it will actually go and delete the subvolumes from the FS before unmounting it and removing the reference to it from its DB (yes, a bit shoddy, I know). So I'm not blaming anybody here for the disappearing subvolumes; it's just me coming back to believing in mankind, only to get kicked in the gonads by mankind's stupidity. ALSO, "importing" the FS into their "solution" is actually just mounting it and walking the tree of subvolumes to create all the references in the local DB (for Rockstor of course, still nothing to do with btrfs functionality). To be able to "import" I had to remove the before-mentioned snapshots, because the import script was timing out. So for a single subvolume (physically called "share") I was left with no snapshots (removed by me so the import would not time out), and then this subvolume was removed when I was trying to remove an FS (pool) from a running system. I've pulled both disks (2-disk raid1 FS) and I'm trying to rescue as much data as I can. The question is: why, when I removed the snapshots and (someone else removed) the subvolume, is there suddenly such a great gap in the generations of the FS (over 200 generations missing), with the most recent generation that can actually be touched by btrfs restore over a month old? How do I overcome that? On 11 December 2016 at 19:00, Chris Murphy <li...@colorremedies.com> wrote: > On Sun, Dec 11, 2016 at 10:40 AM, Tomasz Kusmierz > <tom.kusmi...@gmail.com> wrote: >> Hi, >> >> So, I've found my self in a pickle after following this steps: >> 1.
trying to migrate an array to different system, it became apparent >> that importing array there was not possible to import it because I've >> had a very large amount of snapshots (every 15 minutes during office >> hours amounting to few K) so I've had to remove snapshots for main >> data storage. > > True, there is no recursive incremental send. > >> 2. while playing with live array, it become apparent that some bright >> spark implemented a "delete all sub-volumes while removing array from >> system" ... needles to say that this behaviour is unexpected to say al >> least ... and I wanted to punch somebody in face. > > The technical part of this is vague. I'm guessing you used 'btrfs > device remove' butt it works no differently than lvremove - when a > device is removed from an array, it wipes the signature from that > device. You probably can restore that signature and use that device > again, depending on what the profile is for metadata and data, it may > be usable stand alone. > > Proposing assault is probably not the best way to ask for advice > though. Just a guess. > > > > >> >> Since then I was trying to rescue as much data as I can, luckily I >> managed to get a lot of data from snapshots for "other than share" >> volumes (because those were not deleted :/) but the most important >> volume "share" prove difficult. This subvolume comes out with a lot of >> errors on readout with "btrfs restore /dev/sda /mnt2/temp2/ -x -m -S >> -s -i -t". >> >> Also for some reason I can't use a lot of root blocks that I find with >> btrfs-find-root .. 
>> >> To put some detail here: >> btrfs-find-root -a /dev/sda >> Superblock thinks the generation is 184540 >> Superblock thinks the level is 1 >> Well block 919363862528(gen: 184540 level: 1) seems good, and it >> matches superblock >> Well block 919356325888(gen: 184539 level: 1) seems good, but >> generation/level doesn't match, want gen: 184540 level: 1 >> Well block 919343529984(gen: 184538 level: 1) seems good, but >> generation/level doesn't match, want gen: 184540 level: 1 >> Well block 920041308160(gen: 184537 level: 1) seems good, but >> generation/level doesn't match, want gen: 184540 level: 1 >> Well block 919941955584(gen: 184536 level: 1) seems good, but >> generation/level doesn't match, want gen: 184540 level: 1 >> Well block 919670538240(gen: 184535 level: 1) seems good, but >> generation/level doesn't match, want gen: 184540 level: 1 >> Well block 920045371392(gen: 184532 level: 1) seems good, but >> generation/level doesn't match, want gen: 184540 level: 1 >> Well block 920070209536(gen: 184531 level: 1) seems good, but >> generation/level doesn't match, want gen: 184540 level: 1 >> Well block 920117510144(g
How to get back a deleted sub-volume.
Hi, So, I've found myself in a pickle after the following steps: 1. Trying to migrate an array to a different system, it became apparent that importing the array there was not possible, because I had a very large amount of snapshots (every 15 minutes during office hours, amounting to a few thousand), so I had to remove the snapshots for the main data storage. 2. While playing with the live array, it became apparent that some bright spark implemented "delete all sub-volumes while removing an array from the system" ... needless to say, this behaviour is unexpected to say the least ... and I wanted to punch somebody in the face. 3. The backup off-site server that was making backups every 30 minutes was located in the CEO's house, and his wife decided that it's not necessary to have it connected (laughter can start roughly here). So I've got an array with all the data there (theoretically COW, right?) plus a plethora of snapshots (important data was snapshotted every 15 minutes during office hours to capture all the changes; other sub-volumes were snapshotted daily). This occurred roughly on 4-12-2016. Since then I've been trying to rescue as much data as I can; luckily I managed to get a lot of data from snapshots for the "other than share" volumes (because those were not deleted :/), but the most important volume, "share", proved difficult. This subvolume comes out with a lot of errors on readout with "btrfs restore /dev/sda /mnt2/temp2/ -x -m -S -s -i -t". Also, for some reason I can't use a lot of the root blocks that I find with btrfs-find-root ..
To put some detail here: btrfs-find-root -a /dev/sda Superblock thinks the generation is 184540 Superblock thinks the level is 1 Well block 919363862528(gen: 184540 level: 1) seems good, and it matches superblock Well block 919356325888(gen: 184539 level: 1) seems good, but generation/level doesn't match, want gen: 184540 level: 1 Well block 919343529984(gen: 184538 level: 1) seems good, but generation/level doesn't match, want gen: 184540 level: 1 Well block 920041308160(gen: 184537 level: 1) seems good, but generation/level doesn't match, want gen: 184540 level: 1 Well block 919941955584(gen: 184536 level: 1) seems good, but generation/level doesn't match, want gen: 184540 level: 1 Well block 919670538240(gen: 184535 level: 1) seems good, but generation/level doesn't match, want gen: 184540 level: 1 Well block 920045371392(gen: 184532 level: 1) seems good, but generation/level doesn't match, want gen: 184540 level: 1 Well block 920070209536(gen: 184531 level: 1) seems good, but generation/level doesn't match, want gen: 184540 level: 1 Well block 920117510144(gen: 184530 level: 1) seems good, but generation/level doesn't match, want gen: 184540 level: 1 <<< here stuff is gone Well block 920139055104(gen: 184511 level: 0) seems good, but generation/level doesn't match, want gen: 184540 level: 1 Well block 920139022336(gen: 184511 level: 0) seems good, but generation/level doesn't match, want gen: 184540 level: 1 Well block 920138989568(gen: 184511 level: 0) seems good, but generation/level doesn't match, want gen: 184540 level: 1 Well block 920138973184(gen: 184511 level: 0) seems good, but generation/level doesn't match, want gen: 184540 level: 1 Well block 920137596928(gen: 184511 level: 0) seems good, but generation/level doesn't match, want gen: 184540 level: 1 Well block 920137531392(gen: 184511 level: 0) seems good, but generation/level doesn't match, want gen: 184540 level: 1 Well block 920137515008(gen: 184511 level: 0) seems good, but generation/level 
doesn't match, want gen: 184540 level: 1 Well block 920135991296(gen: 184511 level: 0) seems good, but generation/level doesn't match, want gen: 184540 level: 1 Well block 920135958528(gen: 184511 level: 0) seems good, but generation/level doesn't match, want gen: 184540 level: 1 Well block 920135925760(gen: 184511 level: 0) seems good, but generation/level doesn't match, want gen: 184540 level: 1 Well block 920135827456(gen: 184511 level: 0) seems good, but generation/level doesn't match, want gen: 184540 level: 1 Well block 920135811072(gen: 184511 level: 0) seems good, but generation/level doesn't match, want gen: 184540 level: 1 Well block 920133697536(gen: 184511 level: 0) seems good, but generation/level doesn't match, want gen: 184540 level: 1 Well block 920133664768(gen: 184511 level: 0) seems good, but generation/level doesn't match, want gen: 184540 level: 1 Well block 92017088(gen: 184511 level: 0) seems good, but generation/level doesn't match, want gen: 184540 level: 1 Well block 920133206016(gen: 184511 level: 0) seems good, but generation/level doesn't match, want gen: 184540 level: 1 Well block 920132976640(gen: 184511 level: 0) seems good, but generation/level doesn't match, want gen: 184540 level: 1 Well block 920132878336(gen: 184511 level: 0) seems good, but generation/level doesn't match, want gen: 184540 level: 1 Well block 920132845568(gen: 184511 level: 0) seems good, but generation/level doesn't match,
Re: Convert from RAID 5 to 10
FYI, there is an old saying in embedded circles that evolved from Arthur C. Clarke's "Any sufficiently advanced technology is indistinguishable from magic." The engineering version states: "Any sufficiently advanced incompetence is indistinguishable from malice." Also, I'll quote you on the throwing-under-the-bus thing :) (I actually like that justification) On 1 December 2016 at 17:28, Chris Murphy <li...@colorremedies.com> wrote: > On Wed, Nov 30, 2016 at 1:29 PM, Tomasz Kusmierz <tom.kusmi...@gmail.com> > wrote: > >> Please, I beg you add another column to man and wiki stating clearly >> how many devices every profile can withstand to loose. I frequently >> have to explain how btrfs profiles work and show quotes from this >> mailing list because "dawning-kruger effect victims" keep poping up >> with statements like "in btrfs raid10 with 8 drives you can loose 4 >> drives" ... I seriously beg you guys, my beating stick is half broken >> by now. > > You need a new stick. It's called the ad hominem attack. When stupid > people say stupid things, the dispute is not about the facts or > opinions in the argument itself, but rather the person involved. There > is the possibility this is more than stupidity, it really borders on > maliciousness. Any ethical code of conduct for a list will accept ad > hominem attacks over the willful dissemination of provably wrong > information. When stupid assholes throw users under the bus with > provably wrong (and bad) advice, it becomes something of an obligation > to resort to name calling. > > Of course, I'd also like the wiki to clearly state the only profile > that tolerates more than one device loss is raid6; and be very > explicit with the manifestly wrong terminology being used by Btrfs's > raid10 terminology. That is a fairly egregious violation of common > terminology and the trust we're supposed to be developing, both in the > usage of common terms, but also in Btrfs specifically.
> > > > -- > Chris Murphy
Re: Convert from RAID 5 to 10
On 30 November 2016 at 19:09, Chris Murphywrote: > On Wed, Nov 30, 2016 at 7:37 AM, Austin S. Hemmelgarn > wrote: > >> The stability info could be improved, but _absolutely none_ of the things >> mentioned as issues with raid1 are specific to raid1. And in general, in >> the context of a feature stability matrix, 'OK' generally means that there >> are no significant issues with that specific feature, and since none of the >> issues outlined are specific to raid1, it does meet that description of >> 'OK'. > > Maybe the gotchas page needs a one or two liner for each profile's > gotchas compared to what the profile leads the user into believing. > The overriding gotcha with all Btrfs multiple device support is the > lack of monitoring and notification other than kernel messages; and > the raid10 actually being more like raid0+1 I think it certainly a > gotcha, however 'man mkfs.btrfs' contains a grid that very clearly > states raid10 can only safely lose 1 device. > > >> Looking at this another way, I've been using BTRFS on all my systems since >> kernel 3.16 (I forget what exact vintage that is in regular years). I've >> not had any data integrity or data loss issues as a result of BTRFS itself >> since 3.19, and in just the past year I've had multiple raid1 profile >> filesystems survive multiple hardware issues with near zero issues (with the >> caveat that I had to re-balance after replacing devices to convert a few >> single chunks to raid1), and that includes multiple disk failures and 2 bad >> PSU's plus about a dozen (not BTRFS related) kernel panics and 4 unexpected >> power loss events. I also have exhaustive monitoring, so I'm replacing bad >> hardware early instead of waiting for it to actually fail. > > Possibly nothing aids predictably reliable storage stacks than healthy > doses of skepticism and awareness of all limitations. 
:-D > > -- > Chris Murphy

Please, I beg you, add another column to the man page and wiki stating clearly how many device losses every profile can withstand. I frequently have to explain how btrfs profiles work and show quotes from this mailing list, because "Dunning-Kruger effect victims" keep popping up with statements like "in btrfs raid10 with 8 drives you can lose 4 drives" ... I seriously beg you guys, my beating stick is half broken by now.
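The requested column can be sketched as a lookup (my own summary of the grid in 'man mkfs.btrfs', not an official table). The point about raid10 is exactly this: it stripes mirrored chunk pairs, so only ONE device loss is guaranteed survivable regardless of array size; a second loss may hit both copies of some chunk.

```shell
# Guaranteed number of device losses each btrfs profile survives.
losses_survived() {
    case "$1" in
        single|raid0) echo 0 ;;
        dup)          echo 0 ;;  # two copies, but on the same device
        raid1|raid10) echo 1 ;;  # raid10 is NOT "half the drives"
        raid5)        echo 1 ;;
        raid6)        echo 2 ;;  # the only profile tolerating two
        *) echo "unknown profile: $1" >&2; return 1 ;;
    esac
}

losses_survived raid10   # -> 1, even with 8 drives in the array
```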
Re: RAID system with adaption to changed number of disks
I think you've just described all the benefits of btrfs in that type of configuration. Unfortunately, after btrfs RAID 5 & 6 was marked as OK, it got re-marked as "it will eat your data" (and there is a ton of people in random places popping up with raid 5 & 6 setups that just killed their data). On 11 October 2016 at 16:14, Philip Louis Moetteli wrote: > Hello, > > > I have to build a RAID 6 with the following 3 requirements: > > • Use different kinds of disks with different sizes. > • When a disk fails and there's enough space, the RAID should be able > to reconstruct itself out of the degraded state. Meaning, if I have e. g. a > RAID with 8 disks and 1 fails, I should be able to chose to transform this in > a non-degraded (!) RAID with 7 disks. > • Also the other way round: If I add a disk of what size ever, it > should redistribute the data, so that it becomes a RAID with 9 disks. > > I don’t care, if I have to do it manually. > I don’t care so much about speed either. > > Is BTrFS capable of doing that? > > > Thanks a lot for your help! >
Re: raid levels and NAS drives
On 10 October 2016 at 02:01, ronnie sahlberg wrote: > (without html this time.) > > Nas drives are more expensive but also more durable than the normal consumer > drives, but not as durable as enterprise drives. > They are meant for near continous use, compared to consumer/backup drives > that are meant for only occasional use and meant to spend the majority of > time spinned down. > > > They fall in-between consumer and enterprise gear. >

Again, you've read a marketing flyer ... Historically, "enterprise drive" equated to a drive with a SCSI interface; after that it equated to a drive with more exotic interfaces like SAS or FATA ... nowadays it means something more along the lines of "high [seek] performance, for which you pay an extra, extra, extra buck" (10k and 15k rpm arrays of 10 disks with databases on them serving plenty of people?). Currently, "consumer" = low-end drive, where you will not pay twice the price for a 10% performance increase. There is nothing in there about reliability!

Now, every [sane] storage engineer will choose "consumer" 5.4k drives for cold storage / slow-IO storage. For high-demand, very random seek patterns, everybody will go for extremely fast disks that will die in 12 months, because the cost and effort of replacing a failed disk is still less than assembling a comparable array from 7.2k disks (extra controller, extra bays, extra power, extra everything!). So:

1. Stop reading marketing material that is designed to suck money out of your pocket. Read the technical datasheet. Stop reading paid-for articles from so-called "specialists"; my company pays those people to publish articles that I write, dressed up to sound more technical, so I can tell you first-hand how much "horse" those are.
2. hdd: faster rpm = better seek + better sequential read/write; slower rpm = survives longer + takes less power + better $ per GB.
3. What do you need to use it for? A remote NAS box? A single 5.4k hdd will saturate your gigabit LAN, and 7 will saturate your SFP+ link: go for the best $ per GB. Local storage?
4 x 7.2k hdd in raid10 and you're talking good performance! Put more disks in and you can drop down to 5.4k. A high-demand database with thousands of people punching millions of queries a second? 15k, as many as you can!
4. For the time being, on btrfs give raid 5 & 6 a wide berth ... unless you back up your data [very] regularly; then, have fun :)
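The "one drive saturates gigabit, 7 saturate SFP+" claim is simple arithmetic. A hedged sketch: the ~180 MB/s sustained sequential figure for a modern 5.4k drive is my assumption, not the poster's.

```shell
# How many drives of a given sustained throughput it takes to fill a
# link.  Args: link speed in Mbit/s, per-drive throughput in MB/s.
drives_to_saturate() {
    link_mbps=$(( $1 / 8 ))                # Mbit/s -> MB/s
    echo $(( (link_mbps + $2 - 1) / $2 ))  # ceiling division
}

drives_to_saturate 1000  180   # gigabit LAN     -> 1
drives_to_saturate 10000 180   # 10GbE over SFP+ -> 7
```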
Re: raid levels and NAS drives
And what exactly are "NAS drives"? Are you talking marketing, by any chance? Please tell me you got the pun. On 10 October 2016 at 00:12, Charles Zeitler wrote: > Is there any advantage to using NAS drives > under RAID levels, as opposed to regular > 'desktop' drives for BTRFS? > > Charles Zeitler > > -- > The Perfect Is The Enemy Of > The Good Enough
Some help with the code.
This is predominantly for maintainers: I've noticed that there is a lot of code in btrfs ... and after a few glances I've noticed occurrences which beg for some refactoring, to make the code less of a pain to maintain. I'm speaking of occurrences where:
- within a function there are multiple checks for a NULL pointer (and for anything hanging off that pointer) before finally calling another function, passing the pointer to it, and watching it perform the same checks before deallocating whatever the pointer points to;
- single-line functions are called in only two places; and so on.
I know that you guys are busy, but maintaining code that is only growing must be a pain.
Re: Status of SMR with BTRFS
Sorry for the late reply; there was a lot of traffic in this thread, so:

1. I do apologise, I got the wrong end of the stick. I was convinced that btrfs was causing corruption on your disk, because some of the links that you had in the original post were pointing to topics with corruption going on; but you are concerned about performance - right?

2. I'm still not convinced that Seagate would miss out on a feature such as SMR and mistakenly call it TGMR ... a lot of money is involved with storing data, and an egg-in-the-face moment could cost them a lot ... ALSO, it's called "Barracuda"; historically this meant disks with good IO performance, and I can't imagine that somebody at Seagate would put the Barracuda label on stinking SMR (yes, SMR stinks! but more about that later on). Though I remember how Micron used to have a complete mess in their datasheets :/

3. f2fs, as well as jffs2 and logfs, will give tremendous performance on spinners :D Those filesystems are meant for tiny flash devices with minimal IO, minimal power available, and minimal erases, and to fit that market they try to minimise fragmentation of the flash, to serialise writes, and to eliminate jumping through the flash so it will self wear-level. The result on a spinner is that it will give you very static and sequential IO on writing data, but your reads will be crap ... and, like every "developed for flash" filesystem, it will expect your device to have a 100% functional TRIM that resets a block (usually 128kB) to 0xFF ... spinners don't do that ... and this is where you will see corruption. Many of those filesystems also require direct access to the device rather than block-device emulation. Also, on a flash device you can walk in and alter a bit in a byte as long as you change it from a physical "1" to "0"; on a spinner you need to rewrite a sector and the CRC and corrective data associated with it; on SMR you have to rewrite a whole BAND ... FUN!!
So every time your filesystem marks something for a future TRIM, it will try to set a single bit in the block's associated data (hahaha, your band needs to get rewritten), and this is how you will effectively kill sectors (bands) on your disk!

4. SMR stinks ... yes it does ... it's a recipe for disaster: slight modifications cause a lot of workload on the drive ... if you modify a "sector", most of a band needs to get rewritten ... this is where corruption creeps in and where the disk surface wears. I understand how the NSA may have a use case for it: google shifts data between server farms in the US and abroad, then they send a copy to the NSA (yes they do); the NSA stores it, but they don't care about single bit rot or minor defects, since the data does not get modified, just analysed in bulk and discarded, and the whole array gets written with fresh data ... amazon, on the other hand, gets paid for not having your data corrupted ... so they won't fancy SMR that much (maybe Glacier). See a pattern here? For a user, having that type of use case is just weird; if you want to back up your data, then you care about it ... and then I'm not convinced that SMR is truly for you. Also, a 5TB device connected over USB3 used as a backup :O :O :O I wouldn't keep my "just in case the internet was down" backup of pornhub on that setup :) And I'm not picking on you here ... I personally used a far better backup than that and still it failed, and still people bluntly pointed out how pathetic it was ... and they were right!

In terms of SMR, those are my brutal opinions ... and nothing more. I accept that most likely I'm wrong.
Hell, I've been wrong most of my life. It's just that after 10 years of engineering embedded devices for various applications, I'm very precautious, due to experiences with a lot of "some bright spark" (clueless guy that wanted to feel more intelligent than the engineers) "decided to use this revolutionary thing" (wanted to prove himself and based everything on luck) "and created a valid learning experience for the whole development team" (all the engineers wanted to kill him) "and we all came out of that stronger and with more experience" (he/she got fired). On 18 July 2016 at 20:30, Austin S. Hemmelgarn wrote: > On 2016-07-18 15:05, Hendrik Friedel wrote: >> >> Hello Austin, >> >> thanks for your reply. >> Ok, thanks; So, TGMR does not say whether or not the Device is SMR or not, right? >>> >>> I'm not 100% certain about that. Technically, the only non-firmware >>> difference is in the read head and the tracking. If it were me, I'd be >>> listing SMR instead of TGMR on the data sheet, but I'd be more than >>> willing to bet that many drive manufacturers won't think like that. While the Data-Sheet does not mention SMR and the 'Desktop' in the name rather than 'Archive' would indicate no SMR, some reviews indicate SMR (http://www.legitreviews.com/seagate-barracuda-st5000dm000-5tb-desktop-hard-drive-review_161241) >>> Beyond that, I'm not sure, >>> but I believe that their
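The band-rewrite cost described above is easy to put a number on. A hedged sketch: the 256 MiB band size is a typical figure for drive-managed SMR, not a spec for any particular model.

```shell
# Worst-case write amplification when a 4 KiB in-place update forces
# the drive to rewrite an entire SMR band.
band_kib=$((256 * 1024))   # assumed band size: 256 MiB, in KiB
write_kib=4                # one 4 KiB sector update
echo "up to $((band_kib / write_kib))x amplification"   # -> up to 65536x amplification
```

Even if only the tail of the band from the modified sector onward is rewritten, the average cost is still tens of thousands of times the logical write.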
Re: Status of SMR with BTRFS
Just please don't take this as picking on you or anything:

> It's a Seagate Expansion Desktop 5TB (USB3). It is probably a ST5000DM000.

This is a TGMR, not SMR, disk: http://www.seagate.com/www-content/product-content/desktop-hdd-fam/en-us/docs/100743772a.pdf So it still conforms to a standard recording strategy ...

>> There are two types: >> 1. SMR managed by device firmware. BTRFS sees that as a normal block >> device … problems you get are not related to BTRFS it self … > > That for sure. But the way BTRFS uses/writes data could cause problems in > conjunction with these devices still, no?

I'm sorry, but I'm confused now: what "magical way of using/writing data" do you actually mean? AFAIK btrfs sees the disk as a block device ... for example, devices have quite a varying physical sector layout, which is 512 bytes + some CRC + maybe ECC ... btrfs does not access this data; the drive does ... and to be honest, drives lie to you continuously! They use this ECC to magically bail out of a bad sector, give you the data, and silently switch to a spare sector ... Now think slowly and thoroughly about it: who would write (and maintain) code for a filesystem that accesses device-specific data for X vendors, each with Y model-specific configurations/caveats/firmwares/protocols? S.M.A.R.T. emerged to give a unifying interface to device statistics ... that is how bad it was ... FYI, in 2009 I was creating a product with linux that was booting from a flash-based FS ... some people required that the data would boot up unchanged after 20 years ... my answer was: "HOW"? Yes, I could ensure a certain file's integrity on readout by checking md5, but I could not warrant the integrity of a whole FS ... especially at a time when jffs2 was the only option on flash memories (yeah, it had to be RW as well @#$*@#$) ... so btrfs comes along and takes away most of those problems ... if you care about your data, do some research ... if not ...
maybe reiserFS is for you :)
Re: How can I get blockdev offsets of btrfs chunks for a file?
No answer here, but mate, if you are involved in anything that will provide a more automated backup tool for btrfs, you've got a lot of silent people rooting for you. > On 16 Jul 2016, at 00:21, Eric Wheeler wrote: > > Hello all, > > We do btrfs subvolume snapshots over time for backups. I would like to > traverse the files in the subvolumes and find the total unique chunk count > to calculate total space for a set of subvolumes. > > This sounds kind of like the beginning of what a deduplicator would do, > but I just want to count the blocks, so no submission for deduplication. > I started looking at bedup and other deduplicator code, but the answer to > this question wasn't obvious (to me, anyway). > > Questions: > > Is there an ioctl (or some other way) to get the block device offset for a > file (or file offset) so I can count the unique occurances? > > What API documentation should I review? > > Can you point me at the ioctl(s) that would handle this? > > > Thank you for your help! > > > -- > Eric Wheeler
Re: Status of SMR with BTRFS
Though I'm not a hardcore storage-system professional: what disk are you using? There are two types:
1. SMR managed by the device firmware. BTRFS sees that as a normal block device … problems you get are not related to BTRFS itself …
2. SMR managed by the host system. BTRFS still sees this as a block device … just emulated by the host system to look normal.
In the case of funky technologies like that, I would research how exactly the data is stored in terms of "BANDs", experiment with setting the leaf & sector size to match a band, then create a btrfs on the device. Run stress.sh on it for a couple of days. If you get errors, set up a btrfs raid1 filesystem on two standard disks and run stress.sh there to see whether you get errors on that system too, to eliminate the possibility that your system itself is generating the errors. Then come back and we will see what's going on :)
> On 15 Jul 2016, at 19:29, Hendrik Friedel wrote: > > Hello, > > I have a 5TB Seagate drive that uses SMR. > > I was wondering, if BTRFS is usable with this Harddrive technology. So, first > I searched the BTRFS wiki -nothing. Then google. > > * I found this: https://bbs.archlinux.org/viewtopic.php?id=203696 > But this turned out to be an issue not related to BTRFS. > > * Then this: http://www.snia.org/sites/default/files/SDC15_presentations/smr/ > HannesReinecke_Strategies_for_running_unmodified_FS_SMR.pdf > " BTRFS operation matches SMR parameters very closely [...] > > High number of misaligned write accesses ; points to an issue with btrfs > itself > > > * Then this: > http://superuser.com/questions/962257/fastest-linux-filesystem-on-shingled-disks > The BTRFS performance seemed good. > > > * Finally this: http://www.spinics.net/lists/linux-btrfs/msg48072.html > "So you can get mixed results when trying to use the SMR devices but I'd say > it will mostly not work. > But, btrfs has all the fundamental features in place, we'd have to make > adjustments to follow the SMR constraints:" > [...]
> I have some notes at > https://github.com/kdave/drafts/blob/master/btrfs/smr-mode.txt; > > > So, now I am wondering, what the state is today. "We" (I am happy to do that; > but not sure of access rights) should also summarize this in the wiki. > My use-case by the way are back-ups. I am thinking of using some of the > interesting BTRFS features for this (send/receive, deduplication) > > Greetings, > Hendrik
Re: raid1 has failing disks, but smart is clear
> Well, I was able to run memtest on the system last night, that passed with
> flying colors, so I'm now leaning toward the problem being in the sas card.
> But I'll have to run some more tests.

Seriously, run stress.sh for a couple of days. When I was running memtest it ran continuously for 3 days without an error; after a day of stress.sh the errors started showing up. Be VERY careful about trusting any tool of that sort, modern CPUs lie to you continuously !!!

1. You may think that you've written the best code on the planet for bypassing the CPU cache, but in reality, since CPUs are multicore, you can end up with overzealous multicore coherency machinery trapping you inside cache memory: all your testing will do is write a page (trapped in cache) and read it back from cache (the coherency mechanism, not the hit/miss one, keeps you inside L3), so you have no clue that you never touch the RAM; then the CPU just dumps your page to RAM and "job done".

2. Because of coherency problems and the real difficulties of non-blocking multicore access, you can have a DMA controller sucking pages straight out of your own cache: with the RAM marked dirty, the CPU will try to save time and feed the DMA engine directly out of L3 (worth mentioning since some testers resort to the crazy trick of forcing a round trip through a DMA device and back, precisely to force data out of L3).

3. This one is actually funny: some testers didn't claim the pages for their process, so for some reason the pages they were using never showed up as used / dirty, and all the testing was done in 32kB of L1 ... the tests were fast though :)

stress.sh will test the operation of the whole system !!! It shifts a lot of data, so the disks are engaged; the CPU keeps pumping out CRC32 all the time, so it's busy; and the RAM gets hit nicely as well due to the heavy DMA. Come to think of it, if your device nodes change during operation of the system, it might be the LSI card dying -> reinitializing -> rediscovering drives -> drives showing up at different nodes.
On my system I can hot-swap SATA and the drive will come up as a different dev node even though it was connected to the same port on the controller. I think, most important: I presume you run non-ECC RAM?
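A minimal sketch of the idea behind stress.sh as described above: write data, checksum it, then read it back and verify, so that disk, controller, RAM, and CPU all participate. Paths and sizes here are arbitrary; this is a toy loop, not the real stress.sh script.

```shell
#!/bin/sh
# Toy write/verify stress loop (not the real stress.sh).
set -e
dir=$(mktemp -d)
for i in 1 2 3 4; do
    # generate a random blob; checksums are recorded afterwards
    dd if=/dev/urandom of="$dir/blob$i" bs=1M count=4 2>/dev/null
done
( cd "$dir" && sha256sum blob* > sums )
# read everything back and verify; a real run would repeat this for days
if ( cd "$dir" && sha256sum -c --quiet sums ); then
    result=PASS
else
    result=FAIL
fi
rm -rf "$dir"
echo "$result"
```

A real stress run would also drop caches between write and verify (which needs root), so that the re-read actually hits the disks.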
Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.
> On 7 Jul 2016, at 02:46, Chris Murphy wrote:

Chaps, I didn't want this to spring up as a btrfs performance argument, BUT you are throwing around a lot of useful data; maybe divert some of it into the wiki? You know, us normal people might find it useful for making an educated choice at some point in the future :)

Interestingly, on my RAID10 with 6 disks I only get:

dd if=/mnt/share/asdf of=/dev/zero bs=100M
113+1 records in
113+1 records out
11874643004 bytes (12 GB, 11 GiB) copied, 45.3123 s, 262 MB/s

filefrag -v:

 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..    2471: 2101940598..2101943069:   2472:
   1:     2472..   12583: 1938312686..1938322797:  10112: 2101943070:
   2:    12584..   12837: 1937534654..1937534907:    254: 1938322798:
   3:    12838..   12839: 1937534908..1937534909:      2:
   4:    12840..   34109: 1902954063..1902975332:  21270: 1937534910:
   5:    34110..   53671: 1900857931..1900877492:  19562: 1902975333:
   6:    53672..   54055: 1900877493..1900877876:    384:
   7:    54056..   54063: 1900877877..1900877884:      8:
   8:    54064..   98041: 1900877885..1900921862:  43978:
   9:    98042..  117671: 1900921863..1900941492:  19630:
  10:   117672..  118055: 1900941493..1900941876:    384:
  11:   118056..  161833: 1900941877..1900985654:  43778:
  12:   161834..  204013: 1900985655..1901027834:  42180:
  13:   204014..  214269: 1901027835..1901038090:  10256:
  14:   214270..  214401: 1901038091..1901038222:    132:
  15:   214402..  214407: 1901038223..1901038228:      6:
  16:   214408..  258089: 1901038229..1901081910:  43682:
  17:   258090..  300139: 1901081911..1901123960:  42050:
  18:   300140..  310559: 1901123961..1901134380:  10420:
  19:   310560..  310695: 1901134381..1901134516:    136:
  20:   310696..  354251: 1901134517..1901178072:  43556:
  21:   354252..  396389: 1901178073..1901220210:  42138:
  22:   396390..  406353: 1901220211..1901230174:   9964:
  23:   406354..  406515: 1901230175..1901230336:    162:
  24:   406516..  406519: 1901230337..1901230340:      4:
  25:   406520..  450115: 1901230341..1901273936:  43596:
  26:   450116..  492161: 1901273937..1901315982:  42046:
  27:   492162..  524199: 1901315983..1901348020:  32038:
  28:   524200..  535355: 1901348021..1901359176:  11156:
  29:   535356..  535591: 1901359177..1901359412:    236:
  30:   535592.. 1315369: 1899830240..1900610017: 779778: 1901359413:
  31:  1315370.. 1357435: 1901359413..1901401478:  42066: 1900610018:
  32:  1357436.. 1368091: 1928101070..1928111725:  10656: 1901401479:
  33:  1368092.. 1368231: 1928111726..1928111865:    140:
  34:  1368232.. 2113959: 1899043808..1899789535: 745728: 1928111866:
  35:  2113960.. 2899082: 1898257376..1899042498: 785123: 1899789536: last,eof

If it were possible to read from 6 disks at once, maybe the performance for a linear read would be better. Anyway, this is a huge diversion from the original question, so maybe we should end here?
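As a sanity check, the 262 MB/s figure above is just bytes over seconds (dd reports decimal megabytes):

```shell
# 11874643004 bytes copied in 45.3123 s, per the dd output above
awk 'BEGIN { printf "%.0f\n", 11874643004 / 45.3123 / 1000000 }'
```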
Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.
> On 7 Jul 2016, at 00:22, Kai Krakow <hurikha...@gmail.com> wrote: > > Am Wed, 6 Jul 2016 13:20:15 +0100 > schrieb Tomasz Kusmierz <tom.kusmi...@gmail.com>: > >> When I think of it, I did move this folder first when filesystem was >> RAID 1 (or not even RAID at all) and then it was upgraded to RAID 1 >> then RAID 10. Was there a faulty balance around August 2014 ? Please >> remember that I’m using Ubuntu so it was probably kernel from Ubuntu >> 14.04 LTS >> >> Also, I would like to hear it from horses mouth: dos & donts for a >> long term storage where you moderately care about the data: RAID10 - >> flaky ? would RAID1 give similar performance ? > > The current implementation of RAID0 in btrfs is probably not very > optimized. RAID0 is a special case anyways: Stripes have a defined > width - I'm not sure what it is for btrfs, probably it's per chunk, so > it's 1GB, maybe it's 64k **. That means your data is usually not read > from multiple disks in parallel anyways as long as requests are below > stripe width (which is probably true for most access patterns except > copying files) - there's no immediate performance benefit. This holds > true for any RAID0 with read and write patterns below the stripe size. > Data is just more evenly distributed across devices and your > application will only benefit performance-wise if accesses spread > semi-random across the span of the whole file. And at least last time I > checked, it was stated that btrfs raid0 does not submit IOs in parallel > yet but first reads one stripe, then the next - so it doesn't submit > IOs to different devices in parallel. > > Getting to RAID1, btrfs is even less optimized: Stripe decision is based > on process pids instead of device load, read accesses won't distribute > evenly to different stripes per single process, it's only just reading > from the same single device - always. 
> Write access isn't faster anyways:
> Both stripes need to be written - writing RAID1 is single device
> performance only.
>
> So I guess, at this stage there's no big difference between RAID1 and
> RAID10 in btrfs (except maybe for large file copies), not for single
> process access patterns and neither for multi process access patterns.
> Btrfs can only benefit from RAID1 in multi process access patterns
> currently, as can btrfs RAID0 by design for usual small random access
> patterns (and maybe large sequential operations). But RAID1 with more
> than two disks and multi process access patterns is more or less equal
> to RAID10 because stripes are likely to be on different devices anyways.
>
> In conclusion: RAID1 is simpler than RAID10 and thus it's less likely to
> contain flaws or bugs.
>
> **: Please enlighten me, I couldn't find docs on this matter. :O

It's an eye opener - I think that this should end up on the btrfs WIKI … seriously! Anyway, my use case for this is "storage", therefore I predominantly copy large files.

> --
> Regards,
> Kai
>
> Replies to list-only preferred.
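The PID-based stripe decision described above can be illustrated with a toy function. This only mirrors the idea (reader's PID modulo the number of mirrors, regardless of device load); it is not btrfs's actual code.

```shell
# Pick a RAID1 mirror the way the mail above describes btrfs doing it:
# purely from the reading process's PID, never from device load.
pick_mirror() {
    pid=$1
    mirrors=$2
    echo $(( pid % mirrors ))
}
pick_mirror 1234 2   # even PID -> mirror 0
pick_mirror 1235 2   # odd PID  -> mirror 1
```

The consequence discussed in the thread follows directly: a single process keeps the same PID, so it keeps hitting the same mirror for its whole lifetime.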
Re: raid1 has failing disks, but smart is clear
> On 6 Jul 2016, at 23:14, Corey Coughlin wrote:
>
> Hi all,
> Hoping you all can help, have a strange problem, think I know what's going
> on, but could use some verification. I set up a raid1 type btrfs filesystem
> on an Ubuntu 16.04 system, here's what it looks like:
>
> btrfs fi show
> Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
>     Total devices 10 FS bytes used 3.42TiB
>     devid  1 size 1.82TiB used 1.18TiB path /dev/sdd
>     devid  2 size 698.64GiB used 47.00GiB path /dev/sdk
>     devid  3 size 931.51GiB used 280.03GiB path /dev/sdm
>     devid  4 size 931.51GiB used 280.00GiB path /dev/sdl
>     devid  5 size 1.82TiB used 1.17TiB path /dev/sdi
>     devid  6 size 1.82TiB used 823.03GiB path /dev/sdj
>     devid  7 size 698.64GiB used 47.00GiB path /dev/sdg
>     devid  8 size 1.82TiB used 1.18TiB path /dev/sda
>     devid  9 size 1.82TiB used 1.18TiB path /dev/sdb
>     devid 10 size 1.36TiB used 745.03GiB path /dev/sdh
>
> I added a couple disks, and then ran a balance operation, and that took about
> 3 days to finish. When it did finish, tried a scrub and got this message:
>
> scrub status for 597ee185-36ac-4b68-8961-d4adc13f95d4
>     scrub started at Sun Jun 26 18:19:28 2016 and was aborted after 01:16:35
>     total bytes scrubbed: 926.45GiB with 18849935 errors
>     error details: read=18849935
>     corrected errors: 5860, uncorrectable errors: 18844075, unverified errors: 0
>
> So that seems bad. Took a look at the devices and a few of them have errors:
> ...
> [/dev/sdi].generation_errs 0
> [/dev/sdj].write_io_errs 289436740
> [/dev/sdj].read_io_errs 289492820
> [/dev/sdj].flush_io_errs 12411
> [/dev/sdj].corruption_errs 0
> [/dev/sdj].generation_errs 0
> [/dev/sdg].write_io_errs 0
> ...
> [/dev/sda].generation_errs 0
> [/dev/sdb].write_io_errs 3490143
> [/dev/sdb].read_io_errs 111
> [/dev/sdb].flush_io_errs 268
> [/dev/sdb].corruption_errs 0
> [/dev/sdb].generation_errs 0
> [/dev/sdh].write_io_errs 5839
> [/dev/sdh].read_io_errs 2188
> [/dev/sdh].flush_io_errs 11
> [/dev/sdh].corruption_errs 1
> [/dev/sdh].generation_errs 16373
>
> So I checked the smart data for those disks, they seem perfect, no
> reallocated sectors, no problems. But one thing I did notice is that they
> are all WD Green drives. So I'm guessing that if they power down and get
> reassigned to a new /dev/sd* letter, that could lead to data corruption. I
> used idle3ctl to turn off the shut down mode on all the green drives in the
> system, but I'm having trouble getting the filesystem working without the
> errors. I tried a 'check --repair' command on it, and it seems to find a lot
> of verification errors, but it doesn't look like things are getting fixed.
> But I have all the data on it backed up on another system, so I can recreate
> this if I need to. But here's what I want to know:
>
> 1. Am I correct about the issues with the WD Green drives, if they change
> mounts during disk operations, will that corrupt data?

I just wanted to chip in on WD Green drives. I have a RAID10 running on 6x2TB of those, and have had for ~3 years. If a disk spins down and you try to access something, the kernel & FS & whole system will wait for the drive to spin back up, and everything works OK. I've never had a drive reassigned to a different /dev/sdX due to spin down / up. 2 years ago I had corruption due to not using ECC RAM in my system: one of the RAM modules started producing errors that were never caught by the CPU / MoBo.

Long story short, a guy here managed to point me in the right direction and I started shifting my data to a new and hopefully uncorrupted FS … but I was sceptical because of an issue similar to the one you describe, AND on a raid1, while mounted, I moved a disk from one SATA port to another: the FS picked up the disk in its new location and did not even blink (as far as I remember there was a syslog entry saying that the disk vanished, and then that it was added back).

Last word: you've got plenty of transfer-related errors in your SMART data; please be advised that this may mean:
- faulty cable
- faulty motherboard controller
- faulty drive controller
- bad RAM - yes, the motherboard CAN use your RAM for storing data and transfer-related structures … especially cheaper ones.

> 2. If that is the case:
> a.) Is there any way I can stop the /dev/sd* mount points from changing?
> Or can I set up the filesystem using UUIDs or something more solid? I
> googled about it, but found conflicting info

Don't take it the wrong way, but I'm personally surprised that anybody still uses device names rather than UUIDs. Devices change from boot to boot for a lot of people, and most distros moved to UUIDs (2 years ago? even swap is mounted via UUID now).

> b.) Or, is there something else changing my drive devices? I have most of
> drives on an LSI SAS 9201-16i card, is there something I need to
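To the UUID point above, a sketch of mounting by UUID: the device name and output below are illustrative (`blkid` reports the real UUID on your system), and the fstab line is an example, not a tested configuration.

```
# Find the filesystem UUID:
blkid /dev/sdb1
# /dev/sdb1: UUID="d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1" TYPE="btrfs"

# /etc/fstab entry that survives device-name reshuffling:
UUID=d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1  /mnt/share  btrfs  defaults  0  0
```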
Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.
> On 6 Jul 2016, at 22:41, Henk Slager <eye...@gmail.com> wrote: > > On Wed, Jul 6, 2016 at 2:20 PM, Tomasz Kusmierz <tom.kusmi...@gmail.com> > wrote: >> >>> On 6 Jul 2016, at 02:25, Henk Slager <eye...@gmail.com> wrote: >>> >>> On Wed, Jul 6, 2016 at 2:32 AM, Tomasz Kusmierz <tom.kusmi...@gmail.com> >>> wrote: >>>> >>>> On 6 Jul 2016, at 00:30, Henk Slager <eye...@gmail.com> wrote: >>>> >>>> On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz <tom.kusmi...@gmail.com> >>>> wrote: >>>> >>>> I did consider that, but: >>>> - some files were NOT accessed by anything with 100% certainty (well if >>>> there is a rootkit on my system or something in that shape than maybe yes) >>>> - the only application that could access those files is totem (well >>>> Nautilius checks extension -> directs it to totem) so in that case we would >>>> hear about out break of totem killing people files. >>>> - if it was a kernel bug then other large files would be affected. >>>> >>>> Maybe I’m wrong and it’s actually related to the fact that all those files >>>> are located in single location on file system (single folder) that might >>>> have a historical bug in some structure somewhere ? >>>> >>>> >>>> I find it hard to imagine that this has something to do with the >>>> folderstructure, unless maybe the folder is a subvolume with >>>> non-default attributes or so. How the files in that folder are created >>>> (at full disktransferspeed or during a day or even a week) might give >>>> some hint. You could run filefrag and see if that rings a bell. >>>> >>>> files that are 4096 show: >>>> 1 extent found >>> >>> I actually meant filefrag for the files that are not (yet) truncated >>> to 4k. For example for virtual machine imagefiles (CoW), one could see >>> an MBR write. >> 117 extents found >> filesize 15468645003 >> >> good / bad ? 
> > 117 extents for a 1.5G file is fine, with -v option you could see the > fragmentation at the start, but this won't lead to any hint why you > have the truncate issue. > >>>> I did forgot to add that file system was created a long time ago and it was >>>> created with leaf & node size = 16k. >>>> >>>> >>>> If this long time ago is >2 years then you have likely specifically >>>> set node size = 16k, otherwise with older tools it would have been 4K. >>>> >>>> You are right I used -l 16K -n 16K >>>> >>>> Have you created it as raid10 or has it undergone profile conversions? >>>> >>>> Due to lack of spare disks >>>> (it may sound odd for some but spending for more than 6 disks for home use >>>> seems like an overkill) >>>> and due to last I’ve had I had to migrate all data to new file system. >>>> This played that way that I’ve: >>>> 1. from original FS I’ve removed 2 disks >>>> 2. Created RAID1 on those 2 disks, >>>> 3. shifted 2TB >>>> 4. removed 2 disks from source FS and adde those to destination FS >>>> 5 shifted 2 further TB >>>> 6 destroyed original FS and adde 2 disks to destination FS >>>> 7 converted destination FS to RAID10 >>>> >>>> FYI, when I convert to raid 10 I use: >>>> btrfs balance start -mconvert=raid10 -dconvert=raid10 -sconvert=raid10 -f >>>> /path/to/FS >>>> >>>> this filesystem has 5 sub volumes. Files affected are located in separate >>>> folder within a “victim folder” that is within a one sub volume. >>>> >>>> >>>> It could also be that the ondisk format is somewhat corrupted (btrfs >>>> check should find that ) and that that causes the issue. 
>>>> >>>> >>>> root@noname_server:/mnt# btrfs check /dev/sdg1 >>>> Checking filesystem on /dev/sdg1 >>>> UUID: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1 >>>> checking extents >>>> checking free space cache >>>> checking fs roots >>>> checking csums >>>> checking root refs >>>> found 4424060642634 bytes used err is 0 >>>> total csum bytes: 4315954936 >>>> total tree bytes: 4522786816 >>>> total fs tree bytes: 61702144 >>>> total extent tree by
Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.
> On 6 Jul 2016, at 02:25, Henk Slager <eye...@gmail.com> wrote: > > On Wed, Jul 6, 2016 at 2:32 AM, Tomasz Kusmierz <tom.kusmi...@gmail.com> > wrote: >> >> On 6 Jul 2016, at 00:30, Henk Slager <eye...@gmail.com> wrote: >> >> On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz <tom.kusmi...@gmail.com> >> wrote: >> >> I did consider that, but: >> - some files were NOT accessed by anything with 100% certainty (well if >> there is a rootkit on my system or something in that shape than maybe yes) >> - the only application that could access those files is totem (well >> Nautilius checks extension -> directs it to totem) so in that case we would >> hear about out break of totem killing people files. >> - if it was a kernel bug then other large files would be affected. >> >> Maybe I’m wrong and it’s actually related to the fact that all those files >> are located in single location on file system (single folder) that might >> have a historical bug in some structure somewhere ? >> >> >> I find it hard to imagine that this has something to do with the >> folderstructure, unless maybe the folder is a subvolume with >> non-default attributes or so. How the files in that folder are created >> (at full disktransferspeed or during a day or even a week) might give >> some hint. You could run filefrag and see if that rings a bell. >> >> files that are 4096 show: >> 1 extent found > > I actually meant filefrag for the files that are not (yet) truncated > to 4k. For example for virtual machine imagefiles (CoW), one could see > an MBR write. 117 extents found filesize 15468645003 good / bad ? > >> I did forgot to add that file system was created a long time ago and it was >> created with leaf & node size = 16k. >> >> >> If this long time ago is >2 years then you have likely specifically >> set node size = 16k, otherwise with older tools it would have been 4K. >> >> You are right I used -l 16K -n 16K >> >> Have you created it as raid10 or has it undergone profile conversions? 
>> >> Due to lack of spare disks >> (it may sound odd for some but spending for more than 6 disks for home use >> seems like an overkill) >> and due to last I’ve had I had to migrate all data to new file system. >> This played that way that I’ve: >> 1. from original FS I’ve removed 2 disks >> 2. Created RAID1 on those 2 disks, >> 3. shifted 2TB >> 4. removed 2 disks from source FS and adde those to destination FS >> 5 shifted 2 further TB >> 6 destroyed original FS and adde 2 disks to destination FS >> 7 converted destination FS to RAID10 >> >> FYI, when I convert to raid 10 I use: >> btrfs balance start -mconvert=raid10 -dconvert=raid10 -sconvert=raid10 -f >> /path/to/FS >> >> this filesystem has 5 sub volumes. Files affected are located in separate >> folder within a “victim folder” that is within a one sub volume. >> >> >> It could also be that the ondisk format is somewhat corrupted (btrfs >> check should find that ) and that that causes the issue. >> >> >> root@noname_server:/mnt# btrfs check /dev/sdg1 >> Checking filesystem on /dev/sdg1 >> UUID: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1 >> checking extents >> checking free space cache >> checking fs roots >> checking csums >> checking root refs >> found 4424060642634 bytes used err is 0 >> total csum bytes: 4315954936 >> total tree bytes: 4522786816 >> total fs tree bytes: 61702144 >> total extent tree bytes: 41402368 >> btree space waste bytes: 72430813 >> file data blocks allocated: 4475917217792 >> referenced 4420407603200 >> >> No luck there :/ > > Indeed looks all normal. > >> In-lining on raid10 has caused me some trouble (I had 4k nodes) over >> time, it has happened over a year ago with kernels recent at that >> time, but the fs was converted from raid5 >> >> Could you please elaborate on that ? you also ended up with files that got >> truncated to 4096 bytes ? > > I did not have truncated to 4k files, but your case lets me think of > small files inlining. 
Default max_inline mount option is 8k and that > means that 0 to ~3k files end up in metadata. I had size corruptions > for several of those small sized files that were updated quite > frequent, also within commit time AFAIK. Btrfs check lists this as > errors 400, although fs operation is not disturbed. I don't know what > happens if those small files are being updated/rewritten and are just > belo
Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.
On 6 Jul 2016, at 00:30, Henk Slager <eye...@gmail.com> wrote:
>
> On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:
>> I did consider that, but:
>> - some files were NOT accessed by anything with 100% certainty (well if
>> there is a rootkit on my system or something in that shape than maybe yes)
>> - the only application that could access those files is totem (well
>> Nautilius checks extension -> directs it to totem) so in that case we would
>> hear about out break of totem killing people files.
>> - if it was a kernel bug then other large files would be affected.
>>
>> Maybe I’m wrong and it’s actually related to the fact that all those files
>> are located in single location on file system (single folder) that might
>> have a historical bug in some structure somewhere ?
>
> I find it hard to imagine that this has something to do with the
> folderstructure, unless maybe the folder is a subvolume with
> non-default attributes or so. How the files in that folder are created
> (at full disktransferspeed or during a day or even a week) might give
> some hint. You could run filefrag and see if that rings a bell.

Files that are 4096 bytes show:
1 extent found

>> I did forgot to add that file system was created a long time ago and it was
>> created with leaf & node size = 16k.
>
> If this long time ago is >2 years then you have likely specifically
> set node size = 16k, otherwise with older tools it would have been 4K.

You are right, I used -l 16K -n 16K.

> Have you created it as raid10 or has it undergone profile conversions?

Due to a lack of spare disks (it may sound odd to some, but spending on more than 6 disks for home use seems like overkill) and due to the loss I'd had, I had to migrate all data to a new file system. It played out like this:
1. removed 2 disks from the original FS
2. created a RAID1 on those 2 disks
3. shifted 2TB
4. removed 2 disks from the source FS and added those to the destination FS
5. shifted 2 further TB
6. destroyed the original FS and added its 2 disks to the destination FS
7. converted the destination FS to RAID10

FYI, when I convert to raid 10 I use:
btrfs balance start -mconvert=raid10 -dconvert=raid10 -sconvert=raid10 -f /path/to/FS

This filesystem has 5 subvolumes. The affected files are located in a separate folder within a "victim folder" that is within one subvolume.

> It could also be that the ondisk format is somewhat corrupted (btrfs
> check should find that) and that that causes the issue.

root@noname_server:/mnt# btrfs check /dev/sdg1
Checking filesystem on /dev/sdg1
UUID: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
found 4424060642634 bytes used err is 0
total csum bytes: 4315954936
total tree bytes: 4522786816
total fs tree bytes: 61702144
total extent tree bytes: 41402368
btree space waste bytes: 72430813
file data blocks allocated: 4475917217792
 referenced 4420407603200

No luck there :/

> In-lining on raid10 has caused me some trouble (I had 4k nodes) over
> time, it has happened over a year ago with kernels recent at that
> time, but the fs was converted from raid5

Could you please elaborate on that? Did you also end up with files truncated to 4096 bytes?

> You might want to run the python scripts from here:
> https://github.com/knorrie/python-btrfs

Will do.

> so that maybe you see how block-groups/chunks are filled etc.

>> (ps. this email client on OS X is driving me up the wall … have to correct
>> the corrections all the time :/)
>>
>>> On 4 Jul 2016, at 22:13, Henk Slager <eye...@gmail.com> wrote:
>>>
>>> On Sun, Jul 3, 2016 at 1:36 AM, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> My setup is that I use one file system for / and /home (on SSD) and a
>>>> larger raid 10 for /mnt/share (6 x 2TB).
>>>>
>>>> Today I've discovered that 14 of files that are supposed to be over
>>>> 2GB are in fact just 4096 bytes. I've checked the content of those 4KB
>>>> and it seems that it does contain information that were at the
>>>> beginnings of the files.
>>>>
>>>> I've experienced this problem in the past (3 - 4 years ago ?) but
>>>> attributed it to different problem that I've spoke with you guys here
>>>> about (corruption due to non ECC ram). At that time I did deleted
>>>> files affected (56) and si
Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.
I did consider that, but:
- some files were NOT accessed by anything, with 100% certainty (well, if there is a rootkit on my system or something of that shape, then maybe yes)
- the only application that could access those files is totem (well, Nautilus checks the extension -> directs it to totem), so in that case we would hear about an outbreak of totem killing people's files.
- if it was a kernel bug, then other large files would be affected.

Maybe I'm wrong and it's actually related to the fact that all those files are located in a single location on the file system (a single folder) that might have a historical bug in some structure somewhere?

I forgot to add that the file system was created a long time ago, with leaf & node size = 16k.

(ps. this email client on OS X is driving me up the wall … have to correct the corrections all the time :/)

> On 4 Jul 2016, at 22:13, Henk Slager <eye...@gmail.com> wrote:
>
> On Sun, Jul 3, 2016 at 1:36 AM, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:
>> Hi,
>>
>> My setup is that I use one file system for / and /home (on SSD) and a
>> larger raid 10 for /mnt/share (6 x 2TB).
>>
>> Today I've discovered that 14 of files that are supposed to be over
>> 2GB are in fact just 4096 bytes. I've checked the content of those 4KB
>> and it seems that it does contain information that were at the
>> beginnings of the files.
>>
>> I've experienced this problem in the past (3 - 4 years ago ?) but
>> attributed it to different problem that I've spoke with you guys here
>> about (corruption due to non ECC ram). At that time I did deleted
>> files affected (56) and similar problem was discovered a year but not
>> more than 2 years ago and I believe I've deleted the files.
>>
>> I periodically (once a month) run a scrub on my system to eliminate
>> any errors sneaking in. I believe I did a balance a half a year ago ?
>> to reclaim space after I deleted a large database.
>> >> root@noname_server:/mnt/share# btrfs fi show >> Label: none uuid: 060c2345-5d2f-4965-b0a2-47ed2d1a5ba2 >>Total devices 1 FS bytes used 177.19GiB >>devid3 size 899.22GiB used 360.06GiB path /dev/sde2 >> >> Label: none uuid: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1 >>Total devices 6 FS bytes used 4.02TiB >>devid1 size 1.82TiB used 1.34TiB path /dev/sdg1 >>devid2 size 1.82TiB used 1.34TiB path /dev/sdh1 >>devid3 size 1.82TiB used 1.34TiB path /dev/sdi1 >>devid4 size 1.82TiB used 1.34TiB path /dev/sdb1 >>devid5 size 1.82TiB used 1.34TiB path /dev/sda1 >>devid6 size 1.82TiB used 1.34TiB path /dev/sdf1 >> >> root@noname_server:/mnt/share# uname -a >> Linux noname_server 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24 >> 10:09:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux >> root@noname_server:/mnt/share# btrfs --version >> btrfs-progs v4.4 >> root@noname_server:/mnt/share# >> >> >> Problem is that stuff on this filesystem moves so slowly that it's >> hard to remember historical events ... it's like AWS glacier. What I >> can state with 100% certainty is that: >> - files that are affected are 2GB and over (safe to assume 4GB and over) >> - files affected were just read (and some not even read) never written >> after putting into storage >> - In the past I've assumed that files affected are due to size, but I >> have quite few ISO files some backups of virtual machines ... no >> problems there - seems like problem originates in one folder & size > >> 2GB & extension .mkv > > In case some application is the root cause of the issue, I would say > try to keep some ro snapshots done by a tool like snapper for example, > but maybe you do that already. It sounds also like this is some kernel > bug, snaphots won't help that much then I think.
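For reference, the monthly scrub mentioned above can be automated. A sketch of a cron.d entry (path and schedule are examples); `-B` keeps the scrub in the foreground so cron captures its summary:

```
# /etc/cron.d/btrfs-scrub -- run at 03:00 on the 1st of each month
0 3 1 * *  root  /bin/btrfs scrub start -B /mnt/share
```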
btrfs RAID 10 truncates files over 2G to 4096 bytes.
Hi,

My setup is that I use one file system for / and /home (on SSD) and a larger raid 10 for /mnt/share (6 x 2TB).

Today I've discovered that 14 files that are supposed to be over 2GB are in fact just 4096 bytes. I've checked the content of those 4KB and it seems that it does contain the information that was at the beginning of the files.

I've experienced this problem in the past (3 - 4 years ago?) but attributed it to a different problem that I've spoken with you guys here about (corruption due to non-ECC RAM). At that time I deleted the affected files (56), and a similar problem was discovered a year (but not more than 2 years) ago, and I believe I deleted those files too.

I periodically (once a month) run a scrub on my system to eliminate any errors sneaking in. I believe I did a balance half a year ago? to reclaim space after I deleted a large database.

root@noname_server:/mnt/share# btrfs fi show
Label: none  uuid: 060c2345-5d2f-4965-b0a2-47ed2d1a5ba2
	Total devices 1 FS bytes used 177.19GiB
	devid    3 size 899.22GiB used 360.06GiB path /dev/sde2

Label: none  uuid: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
	Total devices 6 FS bytes used 4.02TiB
	devid    1 size 1.82TiB used 1.34TiB path /dev/sdg1
	devid    2 size 1.82TiB used 1.34TiB path /dev/sdh1
	devid    3 size 1.82TiB used 1.34TiB path /dev/sdi1
	devid    4 size 1.82TiB used 1.34TiB path /dev/sdb1
	devid    5 size 1.82TiB used 1.34TiB path /dev/sda1
	devid    6 size 1.82TiB used 1.34TiB path /dev/sdf1

root@noname_server:/mnt/share# uname -a
Linux noname_server 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24 10:09:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
root@noname_server:/mnt/share# btrfs --version
btrfs-progs v4.4
root@noname_server:/mnt/share#

Problem is that stuff on this filesystem moves so slowly that it's hard to remember historical events ... it's like AWS Glacier.
What I can state with 100% certainty is that:
- files that are affected are 2GB and over (safe to assume 4GB and over)
- files affected were just read (and some not even read), never written after putting into storage
- in the past I've assumed that the affected files are due to size, but I have quite a few ISO files and some backups of virtual machines ... no problems there - seems like the problem originates in one folder & size > 2GB & extension .mkv
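[Editor's note: the symptoms above (exactly one 4 KiB block left, .mkv extension) translate directly into a find invocation for locating other casualties; a sketch, with the mount point taken from the report and overridable via DIR:]

```shell
# Mount point from the report; override DIR to scan elsewhere
DIR=${DIR:-/mnt/share}

# Files that shrank to exactly 4096 bytes - the truncation signature described above
find "$DIR" -type f -name '*.mkv' -size 4096c 2>/dev/null
```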
Btrfs transaction checksum corruption, losing root of the tree, bizarre UUID change.
Hi all!

So it's been some time with btrfs, and so far I was very pleased, but since I upgraded ubuntu from 13.10 to 14.04 problems started to occur (YES, I know this might be unrelated). In the past I've had problems with btrfs which turned out to be caused by static from a printer generating corruption in ram and hence checksum failures on the file system - so I'm not going to assume that there is something wrong with btrfs from the start. Anyway:

On my server I'm running 6 x 2TB disks in raid 10 for general storage and 2 x ~0.5TB in raid 1 for system. Might be unrelated, but after upgrading to 14.04 I started using ownCloud, which uses Apache + MySQL as a backing store - all data stored on the storage array, mysql on the system array. It all started with csum errors showing up in mysql data files and in some transactions!!! Generally the system was immediately switching to all-btrfs-read-only mode, forced by the kernel (don't have dmesg / syslog now). Removed the offending files, the problem seemed to go away, and I started from scratch. After 5 days the problem reappeared, now located around the same mysql files and in files managed by apache as the cloud. At this point, since these files are rather dear to me, I decided to pull out all the stops and try to rescue as much as I can.

As an exercise in btrfs management I ran btrfsck --repair - did not help. Repeated with --init-csum-tree - turned out that this left me with a blank system array. Nice! Could use some warning here. I moved all the drives to my main rig, which has a nice 16GB of ECC ram, so errors from ram, cpu or controller should theoretically be eliminated. I used the system array drives and a spare drive to extract all the dear-to-me files to a newly created array (1TB + 500GB + 640GB). Ran a scrub on it and everything seemed OK. At this point I deleted the dear-to-me files from the storage array and ran a scrub.
Scrub now showed even more csum errors in transactions and in one large file that was not touched FOR A VERY LONG TIME (size ~1GB). Deleted the file. Ran scrub - no errors. Copied the dear-to-me files back to the storage array. Ran scrub - no issues. Deleted the files from my backup array and decided to call it a day. Next day I decided to run a scrub once more just to be sure; this time it discovered a myriad of errors in files and transactions. Since I had no time to continue I decided to postpone to the next day - next day I started my rig and noticed that both the backup array and the storage array no longer mount. I attempted to rescue the situation without any luck. Power cycled the PC and on the next startup both arrays failed to mount; when I tried to mount the backup array, mount told me that this specific uuid DOES NOT EXIST!?!?!

my fstab uuid: fcf23e83-f165-4af0-8d1c-cd6f8d2788f4
new uuid: 771a4ed0-5859-4e10-b916-07aec4b1a60b

Tried to mount by /dev/sdb1 and it did mount. Tried by the new uuid and it did mount as well. Scrub passes with flying colours on the backup array, while the storage array still fails to mount with:

root@ubuntu-pc:~# mount /dev/sdd1 /arrays/@storage/
mount: wrong fs type, bad option, bad superblock on /dev/sdd1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail or so

(the same for any device in the array)

Honestly this is a question to the more senior guys - what should I do now? Chris Mason - have you got any updates to your old friend stress.sh? If not, I can try using the previous version that you provided to stress test my system - but this is the second system that exposes this erratic behaviour. Anyone - what can I do to rescue my beloved files? (no sarcasm with zfs / ext4 / tapes / DVDs)

ps. needless to say: SMART - no sata CRC errors, no reallocated sectors, no errors whatsoever (as much as I can see).
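[Editor's note: when a filesystem's UUID appears to change, comparing what fstab expects against what the kernel currently sees is a quick sanity check. A sketch - the fstab path is parameterised, and the blkid side needs root plus real block devices, so it is commented out:]

```shell
# UUIDs the system expects at boot (FSTAB is overridable for testing)
FSTAB=${FSTAB:-/etc/fstab}
grep -o 'UUID=[0-9a-fA-F-]*' "$FSTAB" | sed 's/^UUID=//' | sort -u

# UUIDs the kernel currently sees (run as root):
# blkid -s UUID -o value | sort -u
```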
open_ctree failure on upgrading 3.7 to 3.8 kernel
Hi,

Long story short: I've got a btrfs raid10 six-disk array plus 2 other disks with normal single-device btrfs filesystems. Everything was running happily under linux 3.5 and 3.7. 3.5 was a stock ubuntu kernel, 3.7 was a slightly less stock ubuntu kernel. Now I've upgraded my box to 3.8 and none of the btrfs file systems mount any more. I get open_ctree errors every time I try to mount them. When I reboot the system choosing the old kernel from grub - everything runs smoothly again. Was there any on-disk format change or compatibility change?

Some kernel.log output:
[   13.517952] device fsid 9415cddb-e3b8-4977-804c-369553a7eda7 devid 4 transid 30 /dev/sdh1
[   13.518535] btrfs: disk space caching is enabled
[   13.518773] btrfs: failed to read the system array on sdh1
[   13.523175] btrfs: open_ctree failed
Changing node leaf size on live partition.
Hi,

The question is pretty simple: how do I change the node size and leaf size on a previously created partition? Now, I know what most people will say: you should've been smarter while typing mkfs.btrfs. Well, I'm intending to convert an ext4 partition in place, but there seems to be no option for leaf and node size in that tool. If it's not possible I guess I'll have to create / from scratch and copy all my content there.

BTW. To Chris and the others who were involved - after fixing the static-electricity-from-printer issue I've been running a rock solid raid10 ever since! Great job guys, really appreciated.

Cheers, Tom.
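[Editor's note: as far as I know the node/leaf size of an existing btrfs cannot be changed in place; the only route is a fresh mkfs with -n/-l and a copy. A sketch against a file-backed image so no real partition is touched - the 16K values are the ones from the question, and the mkfs/mount steps need btrfs-progs and root, so they are commented out:]

```shell
# File-backed image stands in for a real partition (sparse, so cheap to create)
truncate -s 1G /tmp/btrfs-16k.img

# Recreate with 16K nodes/leaves, then copy the data over:
# mkfs.btrfs -n 16K -l 16K /tmp/btrfs-16k.img
# mount -o loop /tmp/btrfs-16k.img /mnt/new && cp -a /mnt/old/. /mnt/new/
```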
Re: btrfs for files >10GB = random spontaneous CRC failure.
On 16/01/13 09:21, Bernd Schubert wrote:
> On 01/16/2013 12:32 AM, Tom Kusmierz wrote:
>> p.s. bizzare that when I fill ext4 partition with test data everything check's up OK (crc over all files), but with Chris tool it gets corrupted - for both Adaptec crappy pcie controller and for mother board built in one. Also since courses of history proven that my testing facilities are crap - any suggestion's on how can I test ram, cpu controller would be appreciated.
>
> Similar issues had been the reason we wrote ql-fstest at q-leap. Maybe you could try that? You can easily see the pattern of the corruption with that. But maybe Chris' stress.sh also provides it. Anyway, I yesterday added support to specify min and max file size, as it before only used 1MiB to 1GiB sizes... It's a bit cryptic with bits, though, I will improve that later. https://bitbucket.org/aakef/ql-fstest/downloads
>
> Cheers, Bernd
>
> PS: But see my other thread, using ql-fstest I yesterday entirely broke a btrfs test file system resulting in kernel panics.

Hi,

It's been a while, but I think I should provide a definite answer, or simply the cause of the whole problem: it was a printer!

Long story short, I was going nuts trying to diagnose which bit of my server was going bad, and eventually I was down to blaming the interface card that connects the hotswappable disks to the mobo / pcie controllers. When I got back from my holiday I sat in front of the server and decided to go with ql-fstest, which reports errors in a very nice way with a very low lag (~2 minutes) after they occur. At this point my printer kicked in with a self clean and an error showed up after ~ two minutes - so I restarted the printer, and while it was going through its own POST with self clean another error showed up. The issue turned out to be that I was using one of those fantastic pci 4 port ethernet cards and the printer was connected directly to it - after moving it and everything else to a switch, all the problems and issues went away.
At the moment I've been running the server for 2 weeks without any corruptions, any random kernel btrfs crashes etc.

Anyway, I wanted to thank Chris and the rest of the btrfs dev people again for this fantastic filesystem, which let me discover how stupid a setup I was running and how deep into shiet I had put myself. CHEERS LADS!

Tom.
Re: btrfs for files >10GB = random spontaneous CRC failure.
On 05/02/13 12:49, Chris Mason wrote: On Tue, Feb 05, 2013 at 03:16:34AM -0700, Tomasz Kusmierz wrote: On 16/01/13 09:21, Bernd Schubert wrote: On 01/16/2013 12:32 AM, Tom Kusmierz wrote: p.s. bizzare that when I fill ext4 partition with test data everything check's up OK (crc over all files), but with Chris tool it gets corrupted - for both Adaptec crappy pcie controller and for mother board built in one. Also since courses of history proven that my testing facilities are crap - any suggestion's on how can I test ram, cpu controller would be appreciated. Similar issues had been the reason we wrote ql-fstest at q-leap. Maybe you could try that? You can easily see the pattern of the corruption with that. But maybe Chris' stress.sh also provides it. Anyway, I yesterday added support to specify min and max file size, as it before only used 1MiB to 1GiB sizes... It's a bit cryptic with bits, though, I will improve that later. https://bitbucket.org/aakef/ql-fstest/downloads Cheers, Bernd PS: But see my other thread, using ql-fstest I yesterday entirely broke a btrfs test file system resulting in kernel panics. Hi, Its been a while, but I think I should provide a definite anwser or simply what was the cause of whole problem: It was a printer! Long story short, I was going nuts trying to diagnose which bit of my server is going bad and effectively I was down to blaming a interface card that connects hotswapable disks to mobo / pcie controllers. When I've got back from my holiday I've sat in front of server and decided to go with ql-fstest which in a very nice way reports errors with a very low lag (~2 minutes) after they occurred. At this point my printer kicked in with self clean and error just showed up after ~ two minutes - so I've restarted printer and while it was going through it's own post with self clean another error showed up. 
>> Issue here turned out to be that I was using one of those fantastic pci 4 port ethernet cards and printer was directly to it - after moving it and everything else to switch all problem and issues have went away. AT the moment I'm running server for 2 weeks without any corruptions, any random kernel btrfs crashes etc.
>
> Wow, I've never heard that one before. You might want to try a different 4 port card and/or report it to the driver maintainer. That shouldn't happen ;)
>
> ql-fstest looks neat, I'll check it out (thanks Bernd).
>
> -chris

I forgot to mention that the server sits on a UPS while the printer is connected directly to mains - when thinking of it, this creates a ground shift effect, since nothing on a cheap PSU has a real ground. But anyway, this is not the fault of the 4 port card: I tried moving to a cheap ne2000 card and to the motherboard-integrated one, and the effect was the same. Also, diagnostics were veeery problematic, because besides the corruption on the hdd, memtest was returning corruptions in ram (but on very rare occasions), and a cpu test was returning a corruption on a ~1/day basis. I've replaced nearly everything on this server - including the psu (with the 1400W one from my dev rig) - to NO difference. I should mention as well that this printer is a colour laser printer which has 4 drums to clean, so I would assume that it produces enough static electricity to power a small cattle.

ps. it shouldn't be a driver issue, since the errors in ram were 1 - 4 bits big, located in the same 32-bit word - hence I think a single transfer had to be corrupt, rather than a whole eth packet shoved into random memory.
Re: btrfs for files >10GB = random spontaneous CRC failure.
On 05/02/13 13:46, Roman Mamedov wrote:
> On Tue, 05 Feb 2013 10:16:34 + Tomasz Kusmierz tom.kusmi...@gmail.com wrote:
>> that I was using one of those fantastic pci 4 port ethernet cards and printer was directly to it - after moving it and everything else to switch all problem and issues have went away. AT the moment I'm running server for 2 weeks without any corruptions, any random kernel btrfs crashes etc.
>
> If moving the printer over to a switch helped, perhaps it is indeed an electrical interference problem, but if your card is an old one from Sun, keep in mind that they also have some problems with DMA on machines with large amounts of RAM: "sunhme experiences corrupt packets if machine has more than 2GB of memory" https://bugzilla.kernel.org/show_bug.cgi?id=10790
>
> Not hard to envision a horror story scenario where a rogue network card would shred your filesystem buffer cache with network packets DMAed all over it, like bullets from a machine gun :) But in reality afaik IOMMU is supposed to protect against this.

As I said in my reply to Chris, it was definitely an electrical issue. Back in the days when cat5 ethernet was a novelty I learnt a simple lesson the hard way - don't skimp, always separate with a switch. I learnt it on networks where parties were not necessarily powered from the same circuit or even the same supply phase. Since this setup is limited to my home I violated my own old rule - and it backfired on me.

Anyway, thanks for the info on sunhme - WOW.
btrfs for files >10GB = random spontaneous CRC failure.
Hi,

Since I had some free time over Christmas, I decided to conduct a few tests on btrfs to see how it will cope as real life storage for ordinary users, and I've found that the filesystem will always mess up your files that are larger than 10GB.

Long story: I used my set of data that I've got nicely backed up on a personal raid 5 to populate btrfs volumes: music, slr pics and video (and just a few documents). Disks used in the test are all green 2TB disks from WD.

1. First I started with creating btrfs (4k blocks) on one disk, filling it up, then adding a second disk - convert to raid1 through balance - convert to raid10 through balance. Unfortunately converting to raid1 failed - because of CRC errors in 49 files that were bigger than 10GB. At this point I was a bit spooked that my controllers were failing or that the drives had some bad sectors. Tested everything (took a few days) and it turns out that there is no apparent issue with the hardware (bad sectors or io down to disks).

2. At this point I thought: cool, this will be a perfect test case for scrub to show its magical power! Created raid1 over two volumes - tried scrubbing - FAIL ... It turns out that magically I've got corrupted CRCs at exactly the same two logical locations on two different disks (~34 files >10GB affected), hence scrub can't do anything with it. It only reports them as uncorrectable errors.

3. Performed the same test on a raid10 setup (still 4k blocks). Same results (just a different file count).

Ok, time to dig more into this, because it's getting intriguing. I'm running ubuntu server 12.10 (64bit) with the stock kernel, so my next step was to get a 3.7.1 kernel + new btrfs tools straight from the git repo. Unfortunately 1, 2, 3 still produce the same results: corrupt CRCs only in files >10GB. At this point I thought: fine, maybe if I expand the allocation block - it will take fewer blocks for a big file to fit, resulting in properly storing those - time for 16K leaves :) (-n 16K -l 16K); sectors are still 4K for known reasons :P
Well, it does exactly the same thing - 1, 2, 3 same results, big files get automagically corrupted.

Something about the test data:
- music - not more than 200MB per file (typical mix of mp3 / aac), 10K files give or take
- pics - not more than 20MB (typical point & shoot + dslr), 6K files give or take
- video1 - collection of little ones, more than 300MB and less than 1.5GB each, ~400 files
- video2 - collection of 5GB - 18GB files, ~400 files

I guess stating that only files >10GB are affected is a long shot, but so far I've not seen a file of less than 10GB affected (I was not really thorough about checking sizes, but all affected files I've checked were more than 10GB).

ps. As a footnote I'll add that I've tried shuffling tests 1, 2, 3 without video2 and it all works just fine. If you've got any ideas for a workaround (other than zfs :D) I'm happy to try them out.

Tom.
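[Editor's note: for reference, the single-disk → raid1 → raid10 conversion described in test 1 is done by adding devices and rebalancing with convert filters; a sketch with a hypothetical mount point, commented out since it is destructive on real data:]

```shell
# btrfs device add /dev/sdY1 /mnt/test
# btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/test
# later, with four devices:
# btrfs balance start -dconvert=raid10 -mconvert=raid1 /mnt/test
```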
Re: btrfs for files >10GB = random spontaneous CRC failure.
On 14/01/13 11:25, Roman Mamedov wrote:
> Hello,
>
> On Mon, 14 Jan 2013 11:17:17 + Tomasz Kusmierz tom.kusmi...@gmail.com wrote:
>> this point I was a bit spooked up that my controllers are failing or
>
> Which controller manufacturer/model?

Well, this is a home server (which I prefer to tinker on). Two controllers were used: the motherboard built-in one, and a crappy Adaptec pcie one.

00:11.0 SATA controller: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]
02:00.0 RAID bus controller: Adaptec Serial ATA II RAID 1430SA (rev 02)

ps. MoBo is: ASUS M4A79T Deluxe
Re: btrfs for files >10GB = random spontaneous CRC failure.
On 14/01/13 14:59, Chris Mason wrote:
> On Mon, Jan 14, 2013 at 04:09:47AM -0700, Tomasz Kusmierz wrote:
>> Hi, Since I had some free time over Christmas, I decided to conduct few tests over btrFS to se how it will cope with real life storage for normal gray users and I've found that filesystem will always mess up your files that are larger than 10GB.
>
> Hi Tom,
>
> I'd like to nail down the test case a little better.
>
> 1) Create on one drive, fill with data
> 2) Add a second drive, convert to raid1
> 3) find corruptions?
>
> What happens if you start with two drives in raid1? In other words, I'm trying to see if this is a problem with the conversion code.
>
> -chris

Ok, my description might be a bit enigmatic, so to cut a long story short the tests are:

1) create a single-drive default btrfs volume on a single partition - fill with test data - scrub - admire errors.
2) create a raid1 (-d raid1 -m raid1) volume with two partitions on separate disks, each same size etc. - fill with test data - scrub - admire errors.
3) create a raid10 (-d raid10 -m raid1) volume with four partitions on separate disks, each same size etc. - fill with test data - scrub - admire errors.

All disks are the same age + size + model ... two different batches to avoid same-time failure.
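[Editor's note: test case 2 above, spelled out as commands; hypothetical devices and mount point, and destructive, so everything is commented out:]

```shell
# mkfs.btrfs -d raid1 -m raid1 /dev/sdX1 /dev/sdY1
# mount /dev/sdX1 /mnt/test
# cp -a /path/to/testdata /mnt/test/
# btrfs scrub start -B /mnt/test   # -B blocks until done and prints error counters
# btrfs scrub status /mnt/test
```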
Re: btrfs for files >10GB = random spontaneous CRC failure.
On 14/01/13 15:57, Chris Mason wrote:
> On Mon, Jan 14, 2013 at 08:22:36AM -0700, Tomasz Kusmierz wrote:
>> On 14/01/13 14:59, Chris Mason wrote:
>>> On Mon, Jan 14, 2013 at 04:09:47AM -0700, Tomasz Kusmierz wrote:
>>>> Hi, Since I had some free time over Christmas, I decided to conduct few tests over btrFS to se how it will cope with real life storage for normal gray users and I've found that filesystem will always mess up your files that are larger than 10GB.
>>>
>>> Hi Tom,
>>>
>>> I'd like to nail down the test case a little better.
>>>
>>> 1) Create on one drive, fill with data
>>> 2) Add a second drive, convert to raid1
>>> 3) find corruptions?
>>>
>>> What happens if you start with two drives in raid1? In other words, I'm trying to see if this is a problem with the conversion code.
>>>
>>> -chris
>>
>> Ok, my description might be a bit enigmatic so to cut long story short tests are:
>> 1) create a single drive default btrfs volume on single partition - fill with test data - scrub - admire errors.
>> 2) create a raid1 (-d raid1 -m raid1) volume with two partitions on separate disk, each same size etc. - fill with test data - scrub - admire errors.
>> 3) create a raid10 (-d raid10 -m raid1) volume with four partitions on separate disk, each same size etc. - fill with test data - scrub - admire errors.
>> all disks are same age + size + model ... two different batches to avoid same time failure.
>
> Ok, so we have two possible causes. #1: btrfs is writing garbage to your disks. #2: something in your kernel is corrupting your data. Since you're able to see this 100% of the time, let's assume that if #2 were true, we'd be able to trigger it on other filesystems.
>
> So, I've attached an old friend, stress.sh. Use it like this:
>
> stress.sh -n 5 -c your_source_directory -s your_btrfs_mount_point
>
> It will run in a loop with 5 parallel processes and make 5 copies of your data set into the destination. It will run forever until there are errors. You can use a higher process count (-n) to force more concurrency and use more ram.
> It may help to pin down all but 2 or 3 GB of your memory. What I'd like you to do is find a data set and command line that make the script find errors on btrfs. Then, try the same thing on xfs or ext4 and let it run at least twice as long. Then report back ;)
>
> -chris

Chris,

Will do - just please remember that 2TB of test data on consumer grade sata drives will take a while to test :)
Re: btrfs for files >10GB = random spontaneous CRC failure.
On 14/01/13 16:20, Roman Mamedov wrote:
> On Mon, 14 Jan 2013 15:22:36 + Tomasz Kusmierz tom.kusmi...@gmail.com wrote:
>> 1) create a single drive default btrfs volume on single partition - fill with test data - scrub - admire errors.
>
> Did you try ruling out btrfs as the cause of the problem? Maybe something else in your system is corrupting data, and btrfs just lets you know about that. I.e. on the same drive, create an Ext4 filesystem, copy some data to it which has known checksums (use md5sum or cfv to generate them in advance for data that is on another drive and is waiting to be copied); copy to that drive, flush caches, verify checksums of files at the destination.

Hi Roman,

Chris just provided his good old friend stress.sh, which should do that. So I'll dive into more testing :)

Tom.
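[Editor's note: Roman's manual check boils down to: checksum at the source, copy, drop caches, re-verify at the destination. A runnable sketch with SRC/DST as hypothetical placeholders; the drop_caches step needs root, so it is commented out:]

```shell
SRC=${SRC:-/path/to/source}   # hypothetical source directory
DST=${DST:-/path/to/dest}     # hypothetical destination (e.g. a fresh ext4 mount)
SUMS=${SUMS:-/tmp/sums.md5}

# 1. checksum everything at the source
( cd "$SRC" && find . -type f -exec md5sum {} + ) > "$SUMS"
# 2. copy
cp -a "$SRC/." "$DST/"
# 3. force re-reads from disk rather than the page cache (root only):
# sync && echo 3 > /proc/sys/vm/drop_caches
# 4. verify at the destination; --quiet prints only mismatches
( cd "$DST" && md5sum -c --quiet "$SUMS" ) && echo "verify OK"
```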