Re: Btrfs/SSD

2017-05-15 Thread Tomasz Kusmierz

> Traditional hard drives usually do this too these days (they've been 
> under-provisioned since before SSD's existed), which is part of why older 
> disks tend to be noisier and slower (the reserved space is usually at the far 
> inside or outside of the platter, so using sectors from there to replace 
> stuff leads to long seeks).

Not true. When an HDD uses 10% of its space as spare (10% is just an easy example),
the layout on disk is (US - used sector, SS - spare sector, BS - bad sector):

US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS

If a failure occurs, the drive actually shifts sectors up:

US US US US US US US US US SS
US US US BS BS BS US US US US
US US US US US US US US US US
US US US US US US US US US US
US US US US US US US US US SS
US US US BS US US US US US US
US US US US US US US US US SS
US US US US US US US US US SS

That strategy is in place precisely to mitigate the problem that you’ve described,
and it has actually been there since drives were using PATA :) So if your drive
gets noisier over time it’s either a broken bearing or a demagnetised arm magnet
causing the head to miss its aim, so the drive has to readjust its position
multiple times before hitting the right track.
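
If you want to see whether a drive has already started dipping into its spares,
SMART exposes the relevant counters; a quick check with smartctl from smartmontools
(assuming the drive is /dev/sda - attribute names vary a bit between vendors):

smartctl -A /dev/sda | grep -E -i 'Reallocated_Sector|Current_Pending_Sector|Seek_Error_Rate'

A non-zero and growing Reallocated_Sector_Ct means the drive is quietly burning
through its spare pool.
--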
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs/SSD

2017-05-14 Thread Tomasz Kusmierz
Theoretically all sectors in the over-provisioning pool are erased - in practice they
are either erased, waiting to be erased, or broken.

What you have to understand is that sectors on an SSD are not where you really
think they are - they can swap places with sectors in the over-provisioning area,
they can swap places with each other, etc. … the space you see as a disk from 0 to MAX
does not have to be arranged in sequence on the SSD (and mostly never is).

If you never trim, then when your device is 100% full you need to start overwriting
data to keep writing - this is where over-provisioning shines: the SSD pretends that
you write to a sector while really you write to a sector in the over-provisioning
area, and the two magically swap places without you knowing -> the sector that was
occupied ends up in the over-provisioning pool and the SSD hardware performs a slow
erase on it to make it free for the future. This mechanism is simple and transparent
for users -> you don’t know that it happens and the SSD does all the heavy lifting.

The over-provisioned area has more uses than that. For example, if you have a 1TB
drive where you store 500GB of data that you never modify -> the SSD will copy part
of that data into the over-provisioned area -> free the sectors it sat on (which had
gone unwritten for a while) -> then free the sectors that were continuously hammered
by writes and park the static data there. This mechanism is wear levelling - it means
that the SSD internals make sure that sectors on the SSD get equal use over time. For
those who think it’s pointless, imagine a situation where you’ve got a 1TB drive with
1GB free and you keep writing and modifying data in that 1GB of free space … those
sectors would quickly die due to the short flash life expectancy (some cells are rated
for as few as 1k erases!).

So again, buy good quality drives (not hardcore enterprise drives, just good consumer
ones), leave the rest to the drive, use an OS that issues trim, and you should be golden.
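
If you are not sure whether your setup issues trim at all, a minimal check/fix on
Linux (assuming the filesystem is mounted at / and the distro ships util-linux's
fstrim.timer):

fstrim -v /                          # one-off trim, prints how many bytes were discarded
systemctl enable --now fstrim.timer  # periodic trim, usually nicer than mount-time discard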

> On 15 May 2017, at 00:01, Imran Geriskovan <imran.gerisko...@gmail.com> wrote:
> 
> On 5/14/17, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:
>> In terms of over provisioning of SSD it’s a give and take relationship … on
>> good drive there is enough over provisioning to allow a normal operation on
>> systems without TRIM … now if you would use a 1TB drive daily without TRIM
>> and have only 30GB stored on it you will have fantastic performance but if
>> you will want to store 500GB, at roughly 200GB you will hit a brick wall and
>> your writes will slow down to megabytes/s … this is a symptom of the drive running
>> out of over provisioning space …
> 
> What exactly happens on a non-trimmed drive?
> Does it begin to forge certain erase-blocks? If so
> which are those? What happens when you never
> trim and continue dumping data on it?

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs/SSD

2017-05-14 Thread Tomasz Kusmierz
All the stuff that Chris wrote holds true, I just wanted to add flash-specific
information (from my experience of writing low level code for operating flash).

So with flash, to erase you have to erase a large allocation block; it usually
used to be 128kB (with some CRC data and such it's more than 128kB, but we are
talking functional data storage space), and on newer setups it can be megabytes
… device dependent really.
To erase a block you need to drive the whole block (128k x 8 bits) with a voltage
higher than is usually used for IO (it can be even 15V), so it requires an external
supply or a built-in charge pump to feed that voltage to the block erasure circuitry.
This process generates a lot of heat and requires a lot of energy, so the consensus
back in the day was that you could erase one block at a time and this could take up
to 200ms (0.2 second). After an erase you need to check whether all bits are set to 1
(charged state) and only then is the block marked as ready for storage.
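
On Linux you can at least see what a device advertises to the block layer for
discards - not necessarily the physical erase-block size, but the granularity the
kernel will work with; a quick look, assuming the SSD is sdb:

cat /sys/block/sdb/queue/discard_granularity   # in bytes, 0 means no discard support
cat /sys/block/sdb/queue/discard_max_bytes     # largest single discard the device accepts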

Of course, flash memories are moving forward, and in more demanding environments
there are solutions where blocks are grouped and each group has its own erase
circuit, which allows erasure to be performed in parallel in multiple parts of the
flash module - but you are still bound to one erase at a time per group.

Another problem is that the erase procedure does increase temperature locally. On
planar flash that's not much of a problem, but on emerging solutions like 3D flash
we might see undesired local temperature increases that would either degrade the
life span of the flash or simply erase neighbouring blocks.

In terms of over-provisioning of an SSD it’s a give and take relationship … on a
good drive there is enough over-provisioning to allow normal operation on systems
without TRIM … now if you used a 1TB drive daily without TRIM and had only 30GB
stored on it you would have fantastic performance, but if you wanted to store 500GB,
then at roughly 200GB you would hit a brick wall and your writes would slow down to
megabytes/s … this is a symptom of the drive running out of over-provisioning space
… if you ran an OS that issues trim, this problem would not exist, since the drive
would know that the whole 970GB of space is free and it would have been pre-emptively
erased days before.
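
A quick way to check whether trim is actually plumbed through on a given box (the
column names are as printed by util-linux's lsblk):

lsblk --discard   # non-zero DISC-GRAN / DISC-MAX means the device and the stack accept discards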

And the last part - the drive is not aware of filesystems and partitions … so you
could have 400GB of this 1TB drive left unpartitioned and still you would be cooked.
Technically speaking, giving as much of the SSD as possible to a filesystem, on an OS
that supports trim, will give you the best performance, because the drive will be
notified about as much of the actually-free disk space as possible …..
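
One caveat worth spelling out: fstrim only reaches mounted filesystems, so anything
outside a filesystem has to be discarded explicitly; a hedged sketch (blkdiscard
throws away whatever lives in the range you give it, so only point it at space you
genuinely do not use):

fstrim --all --verbose   # trim every mounted filesystem that supports it
# blkdiscard /dev/sdb3   # example only: discard a partition deliberately kept empty as spare area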

So, to summarise:
- don’t try to outsmart the built-in mechanics of the SSD (people that suggest that
are just morons that want their 5 minutes of fame).
- don’t buy a crap SSD and expect it to behave like a good one as long as you stay
below a certain % of it … it’s stupid; buy a smaller but more reasonable SSD and
store slow data on spinning rust.
- read more books and Wikipedia. Not jumping down on you, but the internet is filled
with people that provide false information, sometimes unknowingly, and swear by it
(Dunning–Kruger effect :D), and some of them are very good at making their theories
sexy … you simply have to get used to it…
- if something is too good to be true, then it’s not.
- a promise of future performance gains is the domain of the “sleazy salesman".



> On 14 May 2017, at 17:21, Chris Murphy  wrote:
> 
> On Sat, May 13, 2017 at 3:39 AM, Duncan <1i5t5.dun...@cox.net> wrote:
> 
>> When I was doing my ssd research the first time around, the going
>> recommendation was to keep 20-33% of the total space on the ssd entirely
>> unallocated, allowing it to use that space as an FTL erase-block
>> management pool.
> 
> Any brand name SSD has its own reserve above its specified size to
> ensure that there's decent performance, even when there is no trim
> hinting supplied by the OS; and thereby the SSD can only depend on LBA
> "overwrites" to know what blocks are to be freed up.
> 
> 
>> Anyway, that 20-33% left entirely unallocated/unpartitioned
>> recommendation still holds, right?
> 
> Not that I'm aware of. I've never done this by literally walling off
> space that I won't use. A fairly large percentage of my partitions
> have free space so it does effectively happen as far as the SSD is
> concerned. And I use fstrim timer. Most of the file systems support
> trim.
> 
> Anyway I've stuffed a Samsung 840 EVO to 98% full with an OS/file
> system that would not issue trim commands on this drive, and it was
> doing full performance writes through that point. Then deleted maybe
> 5% of the files, and then refill the drive to 98% again, and it was
> the same performance.  So it must have had enough in reserve to permit
> full performance "overwrites" which were in effect directed to reserve
> blocks as the freed up blocks were being erased. Thus the erasure
> happening on the fly 

Re: Shrinking a device - performance?

2017-03-28 Thread Tomasz Kusmierz
I’ve glazed over on “Not only that …” … can you make youtube video of that :
> On 28 Mar 2017, at 16:06, Peter Grandi  wrote:
> 
>> I glazed over at “This is going to be long” … :)
>>> [ ... ]
> 
> Not only that, you also top-posted while quoting it pointlessly
> in its entirety, to the whole mailing list. Well played :-).
It’s because I’m special :* 

On a real note, thanks for giving a f to provide a detailed comment … too much of
open source stuff is based on short comments :/

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Shrinking a device - performance?

2017-03-28 Thread Tomasz Kusmierz
I glazed over at “This is going to be long” … :)

> On 28 Mar 2017, at 15:43, Peter Grandi  wrote:
> 
> This is going to be long because I am writing something detailed
> hoping pointlessly that someone in the future will find it by
> searching the list archives while doing research before setting
> up a new storage system, and they will be the kind of person
> that tolerates reading messages longer than Twitter. :-).
> 
>> I’m currently shrinking a device and it seems that the
>> performance of shrink is abysmal.
> 
> When I read this kind of statement I am reminded of all the
> cases where someone left me to decatastrophize a storage system
> built on "optimistic" assumptions. The usual "optimism" is what
> I call the "syntactic approach", that is the axiomatic belief
> that any syntactically valid combination of features not only
> will "work", but very fast too and reliably despite slow cheap
> hardware and "unattentive" configuration. Some people call that
> the expectation that system developers provide or should provide
> an "O_PONIES" option. In particular I get very saddened when
> people use "performance" to mean "speed", as the difference
> between the two is very great.
> 
> As a general consideration, shrinking a large filetree online
> in-place is an amazingly risky, difficult, slow operation and
> should be a last desperate resort (as apparently in this case),
> regardless of the filesystem type, and expecting otherwise is
> "optimistic".
> 
> My guess is that very complex risky slow operations like that
> are provided by "clever" filesystem developers for "marketing"
> purposes, to win box-ticking competitions. That applies to those
> system developers who do know better; I suspect that even some
> filesystem developers are "optimistic" as to what they can
> actually achieve.
> 
>> I intended to shrink a ~22TiB filesystem down to 20TiB. This is
>> still using LVM underneath so that I can’t just remove a device
>> from the filesystem but have to use the resize command.
> 
> That is actually a very good idea because Btrfs multi-device is
> not quite as reliable as DM/LVM2 multi-device.
> 
>> Label: 'backy'  uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4
>>   Total devices 1 FS bytes used 18.21TiB
>>   devid1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy
> 
> Maybe 'balance' should have been used a bit more.
> 
>> This has been running since last Thursday, so roughly 3.5days
>> now. The “used” number in devid1 has moved about 1TiB in this
>> time. The filesystem is seeing regular usage (read and write)
>> and when I’m suspending any application traffic I see about
>> 1GiB of movement every now and then. Maybe once every 30
>> seconds or so. Does this sound fishy or normal to you?
> 
> With consistent "optimism" this is a request to assess whether
> "performance" of some operations is adequate on a filetree
> without telling us either what the filetree contents look like,
> what the regular workload is, or what the storage layer looks
> like.
> 
> Being one of the few system administrators crippled by lack of
> psychic powers :-), I rely on guesses and inferences here, and
> having read the whole thread containing some belated details.
> 
> From the ~22TB total capacity my guess is that the storage layer
> involves rotating hard disks, and from later details the
> filesystem contents seems to be heavily reflinked files of
> several GB in size, and workload seems to be backups to those
> files from several source hosts. Considering the general level
> of "optimism" in the situation my wild guess is that the storage
> layer is based on large slow cheap rotating disks in the 4TB-8TB
> range, with very low IOPS-per-TB.
> 
>> Thanks for that info. The 1min per 1GiB is what I saw too -
>> the “it can take longer” wasn’t really explainable to me.
> 
> A contemporary rotating disk device can do around 0.5MB/s
> transfer rate with small random accesses with barriers up to
> around 80-160MB/s in purely sequential access without barriers.
> 
> 1GB/m of simultaneous read-write means around 16MB/s reads plus
> 16MB/s writes which is fairly good *performance* (even if slow
> *speed*) considering that moving extents around, even across
> disks, involves quite a bit of randomish same-disk updates of
> metadata; because it all depends usually on how much randomish
> metadata updates need to done, on any filesystem type, as those
> must be done with barriers.
> 
>> As I’m not using snapshots: would large files (100+gb)
> 
> Using 100GB sized VM virtual disks (never mind with COW) seems
> very unwise to me to start with, but of course a lot of other
> people know better :-). Just like a lot of other people know
> better that large single pool storage systems are awesome in
> every respect :-): cost, reliability, speed, flexibility,
> maintenance, etc.
> 
>> with long chains of CoW history (specifically reflink copies)
>> also hurt?
> 
> Oh yes... They are about one 

Re: FS gives kernel UPS on attempt to create snapshot and after running balance it's unmountable.

2017-02-22 Thread Tomasz Kusmierz
): forced readonly
Jan 23 05:00:02 server kernel: BTRFS: error (device sdc) in
btrfs_run_delayed_refs:2960: errno=-2 No such entry
Jan 23 05:00:02 server kernel: BTRFS: error (device sdc) in
create_pending_snapshot:1604: errno=-2 No such entry
Jan 23 05:00:02 server kernel: BTRFS warning (device sdc): Skipping
commit of aborted transaction.
Jan 23 05:00:02 server kernel: BTRFS: error (device sdc) in
cleanup_transaction:1854: errno=-2 No such entry

On 21 February 2017 at 22:18, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:
> Anyone ?
>
> On 18 Feb 2017, at 16:44, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:
>
> So Qu,
>
> currently my situation is that:
> I've tried to run btrfs check --repair, and it did repair some stuff in
> qgroups ... then tried to mount it and, surprise surprise, the system
> locked up in 20 seconds.
>
> Reboot, again check --repair = a lot of missing back pointers were
> repaired and the system is supposedly "OK"; attempted to mount it and
> within 20 seconds the system locked up so hard it would not even reboot
> from ACPI.
>
> Installed the elrepo "kernel-ml" kernel (4.9.10).
>
> Another check --repair = same problem with lots of back pointers
> missing, fixed; the system again seems "OK" ... another attempt to
> mount /dev/sdc /mnt2/main_pool and again after 20 seconds the system
> locks up hard.
>
> There is nothing in messages, nothing in dmesg ... I think the system
> locks up so hard that the master btrfs filesystem does not get time to
> push those logs to disk.
>
>
>
>
>
>
> On 16 February 2017 at 23:46, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:
>
> Thanks Qu,
>
> Just before I’ll go and accidentally mess up this FS more - I’ve
> mentioned originally that this problem started with FS not being able
> to create a snapshot ( it would get remounted RO automatically ) for
> about a month, and when I’ve realised that there is a problem like
> that I’ve attempted a full FS balance that caused this FS to be
> unmountable. Is there any other debug you would require before I
> proceed (I’ve got a lot i
>
> On 16 Feb 2017, at 01:26, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:
>
>
>
> At 02/15/2017 10:11 PM, Tomasz Kusmierz wrote:
>
> So guys, any help here ? I’m kinda stuck now with system just idling
> and doing nothing while I wait for some feedback ...
>
>
> Sorry for the late reply.
>
> Busying debugging a kernel bug.
>
> On 14 Feb 2017, at 19:38, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:
>
> [root@server ~]#  btrfs-show-super -af /dev/sdc
> superblock: bytenr=65536, device=/dev/sdc
> -
> csum_type   0 (crc32c)
> csum_size   4
> csum0x17d56ce0 [match]
>
>
> This superblock is good.
>
> bytenr  65536
> flags   0x1
>  ( WRITTEN )
> magic   _BHRfS_M [match]
> fsid0576d577-8954-4a60-a02b-9492b3c29318
> label   main_pool
> generation  150682
> root5223857717248
> sys_array_size  321
> chunk_root_generation   150678
> root_level  1
> chunk_root  8669488005120
> chunk_root_level1
> log_root0
> log_root_transid0
> log_root_level  0
> total_bytes 16003191472128
> bytes_used  6411278503936
> sectorsize  4096
> nodesize16384
> leafsize16384
> stripesize  4096
> root_dir6
> num_devices 8
> compat_flags0x0
> compat_ro_flags 0x0
> incompat_flags  0x161
>  ( MIXED_BACKREF |
>BIG_METADATA |
>EXTENDED_IREF |
>SKINNY_METADATA )
> cache_generation150682
> uuid_tree_generation150679
> dev_item.uuid   46abffa8-7afe-451f-93c6-abb8e589c4e8
> dev_item.fsid   0576d577-8954-4a60-a02b-9492b3c29318 [match]
> dev_item.type   0
> dev_item.total_bytes2000398934016
> dev_item.bytes_used 1647136735232
> dev_item.io_align   4096
> dev_item.io_width   4096
> dev_item.sector_size4096
> dev_item.devid  1
> dev_item.dev_group  0
> dev_item.seek_speed 0
> dev_item.bandwidth  0
> dev_item.generation 0
> sys_chunk_array[2048]:
>  item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 8669487824896)
>  length 67108864 owner 2 stripe_len 65536 type SYSTEM|RAID10
>  io_align 65536 io_width 65536 sec

Re: FS gives kernel UPS on attempt to create snapshot and after running balance it's unmountable.

2017-02-21 Thread Tomasz Kusmierz
Anyone ?

On 18 Feb 2017, at 16:44, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:

So Qu,

currently my situation is that:
I've tried to run btrfs check --repair, and it did repair some stuff in
qgroups ... then tried to mount it and, surprise surprise, the system
locked up in 20 seconds.

Reboot, again check --repair = a lot of missing back pointers were
repaired and the system is supposedly "OK"; attempted to mount it and
within 20 seconds the system locked up so hard it would not even reboot
from ACPI.

Installed the elrepo "kernel-ml" kernel (4.9.10).

Another check --repair = same problem with lots of back pointers
missing, fixed; the system again seems "OK" ... another attempt to
mount /dev/sdc /mnt2/main_pool and again after 20 seconds the system
locks up hard.

There is nothing in messages, nothing in dmesg ... I think the system
locks up so hard that the master btrfs filesystem does not get time to
push those logs to disk.






On 16 February 2017 at 23:46, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:

Thanks Qu,

Just before I’ll go and accidentally mess up this FS more - I’ve
mentioned originally that this problem started with FS not being able
to create a snapshot ( it would get remounted RO automatically ) for
about a month, and when I’ve realised that there is a problem like
that I’ve attempted a full FS balance that caused this FS to be
unmountable. Is there any other debug you would require before I
proceed (I’ve got a lot i

On 16 Feb 2017, at 01:26, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:



At 02/15/2017 10:11 PM, Tomasz Kusmierz wrote:

So guys, any help here ? I’m kinda stuck now with system just idling
and doing nothing while I wait for some feedback ...


Sorry for the late reply.

Busying debugging a kernel bug.

On 14 Feb 2017, at 19:38, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:

[root@server ~]#  btrfs-show-super -af /dev/sdc
superblock: bytenr=65536, device=/dev/sdc
-
csum_type   0 (crc32c)
csum_size   4
csum0x17d56ce0 [match]


This superblock is good.

bytenr  65536
flags   0x1
 ( WRITTEN )
magic   _BHRfS_M [match]
fsid0576d577-8954-4a60-a02b-9492b3c29318
label   main_pool
generation  150682
root5223857717248
sys_array_size  321
chunk_root_generation   150678
root_level  1
chunk_root  8669488005120
chunk_root_level1
log_root0
log_root_transid0
log_root_level  0
total_bytes 16003191472128
bytes_used  6411278503936
sectorsize  4096
nodesize16384
leafsize16384
stripesize  4096
root_dir6
num_devices 8
compat_flags0x0
compat_ro_flags 0x0
incompat_flags  0x161
 ( MIXED_BACKREF |
   BIG_METADATA |
   EXTENDED_IREF |
   SKINNY_METADATA )
cache_generation150682
uuid_tree_generation150679
dev_item.uuid   46abffa8-7afe-451f-93c6-abb8e589c4e8
dev_item.fsid   0576d577-8954-4a60-a02b-9492b3c29318 [match]
dev_item.type   0
dev_item.total_bytes2000398934016
dev_item.bytes_used 1647136735232
dev_item.io_align   4096
dev_item.io_width   4096
dev_item.sector_size4096
dev_item.devid  1
dev_item.dev_group  0
dev_item.seek_speed 0
dev_item.bandwidth  0
dev_item.generation 0
sys_chunk_array[2048]:
 item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 8669487824896)
 length 67108864 owner 2 stripe_len 65536 type SYSTEM|RAID10
 io_align 65536 io_width 65536 sector_size 4096
 num_stripes 8 sub_stripes 2
 stripe 0 devid 7 offset 1083674984448
 dev_uuid 566fb8a3-d6de-4230-8b70-a5fda0a120f6
 stripe 1 devid 8 offset 1083674984448
 dev_uuid 845aefb2-e0a6-479a-957b-a82fb7207d6c
 stripe 2 devid 1 offset 1365901312
 dev_uuid 46abffa8-7afe-451f-93c6-abb8e589c4e8
 stripe 3 devid 3 offset 1345978368
 dev_uuid 95921633-2fc1-479f-a3ba-e6e5a1989755
 stripe 4 devid 4 offset 1345978368
 dev_uuid 20828f0e-4661-4987-ac11-72814c1e423a
 stripe 5 devid 5 offset 1345978368
 dev_uuid 2c3cd71f-5178-48e7-8032-6b6eec023197
 stripe 6 devid 6 offset 1345978368
 dev_uuid 806a47e5-cac4-41c9-abb9-5c49506459e1
 stripe 7 devid 2 offset 1345978368
 dev_uuid e1358e0e-edaf-4505-9c71-ed0862c45841


And I didn't see anything wrong in sys_chunk_array.


W

Re: FS gives kernel UPS on attempt to create snapshot and after running balance it's unmountable.

2017-02-18 Thread Tomasz Kusmierz
So Qu,

currently my situation is that:
I've tried to run btrfs check --repair, and it did repair some stuff in
qgroups ... then tried to mount it and, surprise surprise, the system
locked up in 20 seconds.

Reboot, again check --repair = a lot of missing back pointers were
repaired and the system is supposedly "OK"; attempted to mount it and
within 20 seconds the system locked up so hard it would not even reboot
from ACPI.

Installed the elrepo "kernel-ml" kernel (4.9.10).

Another check --repair = same problem with lots of back pointers
missing, fixed; the system again seems "OK" ... another attempt to
mount /dev/sdc /mnt2/main_pool and again after 20 seconds the system
locks up hard.

There is nothing in messages, nothing in dmesg ... I think the system
locks up so hard that the master btrfs filesystem does not get time to
push those logs to disk.
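
For what it's worth, one way to make at least the kernel messages leading up to the
freeze survive a reboot (assuming systemd/journald, as on this box) is to switch the
journal to persistent storage:

mkdir -p /var/log/journal && systemctl restart systemd-journald   # journal kept on disk from now on
journalctl -k -b -1                                               # kernel messages from the previous boot

It still won't catch the very last lines if the box freezes completely, but it beats
an empty /var/log/messages.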






On 16 February 2017 at 23:46, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:
> Thanks Qu,
>
> Just before I’ll go and accidentally mess up this FS more - I’ve
> mentioned originally that this problem started with FS not being able
> to create a snapshot ( it would get remounted RO automatically ) for
> about a month, and when I’ve realised that there is a problem like
> that I’ve attempted a full FS balance that caused this FS to be
> unmountable. Is there any other debug you would require before I
> proceed (I’ve got a lot i
>
> On 16 Feb 2017, at 01:26, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:
>
>
>
> At 02/15/2017 10:11 PM, Tomasz Kusmierz wrote:
>
> So guys, any help here ? I’m kinda stuck now with system just idling
> and doing nothing while I wait for some feedback ...
>
>
> Sorry for the late reply.
>
> Busying debugging a kernel bug.
>
> On 14 Feb 2017, at 19:38, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:
>
> [root@server ~]#  btrfs-show-super -af /dev/sdc
> superblock: bytenr=65536, device=/dev/sdc
> -
> csum_type   0 (crc32c)
> csum_size   4
> csum0x17d56ce0 [match]
>
>
> This superblock is good.
>
> bytenr  65536
> flags   0x1
>   ( WRITTEN )
> magic   _BHRfS_M [match]
> fsid0576d577-8954-4a60-a02b-9492b3c29318
> label   main_pool
> generation  150682
> root5223857717248
> sys_array_size  321
> chunk_root_generation   150678
> root_level  1
> chunk_root  8669488005120
> chunk_root_level1
> log_root0
> log_root_transid0
> log_root_level  0
> total_bytes 16003191472128
> bytes_used  6411278503936
> sectorsize  4096
> nodesize16384
> leafsize16384
> stripesize  4096
> root_dir6
> num_devices 8
> compat_flags0x0
> compat_ro_flags 0x0
> incompat_flags  0x161
>   ( MIXED_BACKREF |
> BIG_METADATA |
> EXTENDED_IREF |
> SKINNY_METADATA )
> cache_generation150682
> uuid_tree_generation150679
> dev_item.uuid   46abffa8-7afe-451f-93c6-abb8e589c4e8
> dev_item.fsid   0576d577-8954-4a60-a02b-9492b3c29318 [match]
> dev_item.type   0
> dev_item.total_bytes2000398934016
> dev_item.bytes_used 1647136735232
> dev_item.io_align   4096
> dev_item.io_width   4096
> dev_item.sector_size4096
> dev_item.devid  1
> dev_item.dev_group  0
> dev_item.seek_speed 0
> dev_item.bandwidth  0
> dev_item.generation 0
> sys_chunk_array[2048]:
>   item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 8669487824896)
>   length 67108864 owner 2 stripe_len 65536 type SYSTEM|RAID10
>   io_align 65536 io_width 65536 sector_size 4096
>   num_stripes 8 sub_stripes 2
>   stripe 0 devid 7 offset 1083674984448
>   dev_uuid 566fb8a3-d6de-4230-8b70-a5fda0a120f6
>   stripe 1 devid 8 offset 1083674984448
>   dev_uuid 845aefb2-e0a6-479a-957b-a82fb7207d6c
>   stripe 2 devid 1 offset 1365901312
>   dev_uuid 46abffa8-7afe-451f-93c6-abb8e589c4e8
>   stripe 3 devid 3 offset 1345978368
>   dev_uuid 95921633-2fc1-479f-a3ba-e6e5a1989755
>   stripe 4 devid 4 offset 1345978368
>   dev_uuid 20828f0e-4661-4987-ac11-72814c1e423a
>   stripe 

Re: FS gives kernel UPS on attempt to create snapshot and after running balance it's unmountable.

2017-02-16 Thread Tomasz Kusmierz
Thanks Qu,

Just before I’ll go and accidentally mess up this FS more - I’ve
mentioned originally that this problem started with FS not being able
to create a snapshot ( it would get remounted RO automatically ) for
about a month, and when I’ve realised that there is a problem like
that I’ve attempted a full FS balance that caused this FS to be
unmountable. Is there any other debug you would require before I
proceed (I’ve got a lot i
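
In the meantime I can run a read-only pass to capture the current state (btrfs check
does not modify anything unless --repair is given); something along these lines, with
the device name from earlier in the thread:

btrfs check /dev/sdc 2>&1 | tee check-main_pool.log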

On 16 Feb 2017, at 01:26, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:



At 02/15/2017 10:11 PM, Tomasz Kusmierz wrote:

So guys, any help here ? I’m kinda stuck now with system just idling
and doing nothing while I wait for some feedback ...


Sorry for the late reply.

Busying debugging a kernel bug.

On 14 Feb 2017, at 19:38, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:

[root@server ~]#  btrfs-show-super -af /dev/sdc
superblock: bytenr=65536, device=/dev/sdc
-
csum_type   0 (crc32c)
csum_size   4
csum0x17d56ce0 [match]


This superblock is good.

bytenr  65536
flags   0x1
  ( WRITTEN )
magic   _BHRfS_M [match]
fsid0576d577-8954-4a60-a02b-9492b3c29318
label   main_pool
generation  150682
root5223857717248
sys_array_size  321
chunk_root_generation   150678
root_level  1
chunk_root  8669488005120
chunk_root_level1
log_root0
log_root_transid0
log_root_level  0
total_bytes 16003191472128
bytes_used  6411278503936
sectorsize  4096
nodesize16384
leafsize16384
stripesize  4096
root_dir6
num_devices 8
compat_flags0x0
compat_ro_flags 0x0
incompat_flags  0x161
  ( MIXED_BACKREF |
BIG_METADATA |
EXTENDED_IREF |
SKINNY_METADATA )
cache_generation150682
uuid_tree_generation150679
dev_item.uuid   46abffa8-7afe-451f-93c6-abb8e589c4e8
dev_item.fsid   0576d577-8954-4a60-a02b-9492b3c29318 [match]
dev_item.type   0
dev_item.total_bytes2000398934016
dev_item.bytes_used 1647136735232
dev_item.io_align   4096
dev_item.io_width   4096
dev_item.sector_size4096
dev_item.devid  1
dev_item.dev_group  0
dev_item.seek_speed 0
dev_item.bandwidth  0
dev_item.generation 0
sys_chunk_array[2048]:
  item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 8669487824896)
  length 67108864 owner 2 stripe_len 65536 type SYSTEM|RAID10
  io_align 65536 io_width 65536 sector_size 4096
  num_stripes 8 sub_stripes 2
  stripe 0 devid 7 offset 1083674984448
  dev_uuid 566fb8a3-d6de-4230-8b70-a5fda0a120f6
  stripe 1 devid 8 offset 1083674984448
  dev_uuid 845aefb2-e0a6-479a-957b-a82fb7207d6c
  stripe 2 devid 1 offset 1365901312
  dev_uuid 46abffa8-7afe-451f-93c6-abb8e589c4e8
  stripe 3 devid 3 offset 1345978368
  dev_uuid 95921633-2fc1-479f-a3ba-e6e5a1989755
  stripe 4 devid 4 offset 1345978368
  dev_uuid 20828f0e-4661-4987-ac11-72814c1e423a
  stripe 5 devid 5 offset 1345978368
  dev_uuid 2c3cd71f-5178-48e7-8032-6b6eec023197
  stripe 6 devid 6 offset 1345978368
  dev_uuid 806a47e5-cac4-41c9-abb9-5c49506459e1
  stripe 7 devid 2 offset 1345978368
  dev_uuid e1358e0e-edaf-4505-9c71-ed0862c45841


And I didn't see anything wrong in sys_chunk_array.


Would you please try to mount the fs with latest kernel?
Better later than v4.9, as in that version extra kernel messages are
introduced to give more details about what's going wrong.

Thanks,
Qu

backup_roots[4]:
  backup 0:
  backup_tree_root:   5223857717248   gen: 150680 level: 1
  backup_chunk_root:  8669488005120   gen: 150678 level: 1
  backup_extent_root: 5223867383808   gen: 150680 level: 2
  backup_fs_root: 0   gen: 0  level: 0
  backup_dev_root:5224791523328   gen: 150680 level: 1
  backup_csum_root:   5224802140160   gen: 150680 level: 3
  backup_total_bytes: 16003191472128
  backup_bytes_used:  6411278503936
  backup_num_devices: 8

  backup 1:
  backup_tree_root:   5224155807744   gen: 150681 level: 1
  backup_chunk_root:  8669488005120   gen: 150678 level: 1
  backup

Re: FS gives kernel UPS on attempt to create snapshot and after running balance it's unmountable.

2017-02-15 Thread Tomasz Kusmierz
So guys, any help here ? I’m kinda stuck now with system just idling and doing 
nothing while I wait for some feedback ...
> On 14 Feb 2017, at 19:38, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:
> 
> [root@server ~]#  btrfs-show-super -af /dev/sdc
> superblock: bytenr=65536, device=/dev/sdc
> -
> csum_type   0 (crc32c)
> csum_size   4
> csum0x17d56ce0 [match]
> bytenr  65536
> flags   0x1
>( WRITTEN )
> magic   _BHRfS_M [match]
> fsid0576d577-8954-4a60-a02b-9492b3c29318
> label   main_pool
> generation  150682
> root5223857717248
> sys_array_size  321
> chunk_root_generation   150678
> root_level  1
> chunk_root  8669488005120
> chunk_root_level1
> log_root0
> log_root_transid0
> log_root_level  0
> total_bytes 16003191472128
> bytes_used  6411278503936
> sectorsize  4096
> nodesize16384
> leafsize16384
> stripesize  4096
> root_dir6
> num_devices 8
> compat_flags0x0
> compat_ro_flags 0x0
> incompat_flags  0x161
>( MIXED_BACKREF |
>  BIG_METADATA |
>  EXTENDED_IREF |
>  SKINNY_METADATA )
> cache_generation150682
> uuid_tree_generation150679
> dev_item.uuid   46abffa8-7afe-451f-93c6-abb8e589c4e8
> dev_item.fsid   0576d577-8954-4a60-a02b-9492b3c29318 [match]
> dev_item.type   0
> dev_item.total_bytes2000398934016
> dev_item.bytes_used 1647136735232
> dev_item.io_align   4096
> dev_item.io_width   4096
> dev_item.sector_size4096
> dev_item.devid  1
> dev_item.dev_group  0
> dev_item.seek_speed 0
> dev_item.bandwidth  0
> dev_item.generation 0
> sys_chunk_array[2048]:
>item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 8669487824896)
>length 67108864 owner 2 stripe_len 65536 type SYSTEM|RAID10
>io_align 65536 io_width 65536 sector_size 4096
>num_stripes 8 sub_stripes 2
>stripe 0 devid 7 offset 1083674984448
>dev_uuid 566fb8a3-d6de-4230-8b70-a5fda0a120f6
>stripe 1 devid 8 offset 1083674984448
>dev_uuid 845aefb2-e0a6-479a-957b-a82fb7207d6c
>stripe 2 devid 1 offset 1365901312
>dev_uuid 46abffa8-7afe-451f-93c6-abb8e589c4e8
>stripe 3 devid 3 offset 1345978368
>dev_uuid 95921633-2fc1-479f-a3ba-e6e5a1989755
>stripe 4 devid 4 offset 1345978368
>dev_uuid 20828f0e-4661-4987-ac11-72814c1e423a
>stripe 5 devid 5 offset 1345978368
>dev_uuid 2c3cd71f-5178-48e7-8032-6b6eec023197
>stripe 6 devid 6 offset 1345978368
>dev_uuid 806a47e5-cac4-41c9-abb9-5c49506459e1
>stripe 7 devid 2 offset 1345978368
>dev_uuid e1358e0e-edaf-4505-9c71-ed0862c45841
> backup_roots[4]:
>backup 0:
>backup_tree_root:   5223857717248   gen: 150680 level: 
> 1
>backup_chunk_root:  8669488005120   gen: 150678 level: 
> 1
>backup_extent_root: 5223867383808   gen: 150680 level: 
> 2
>backup_fs_root: 0   gen: 0  level: 0
>backup_dev_root:5224791523328   gen: 150680 level: 
> 1
>backup_csum_root:   5224802140160   gen: 150680 level: 
> 3
>backup_total_bytes: 16003191472128
>backup_bytes_used:  6411278503936
>backup_num_devices: 8
> 
>backup 1:
>backup_tree_root:   5224155807744   gen: 150681 level: 
> 1
>backup_chunk_root:  8669488005120   gen: 150678 level: 
> 1
>backup_extent_root: 5224156233728   gen: 150681 level: 
> 2
>backup_fs_root: 0   gen: 0  level: 0
>backup_dev_root:5224633155584   gen: 150681 level: 
> 1
>backup_csum_root:   5224634941440   gen: 150681 level: 
> 3
>backup_total_bytes: 16003191472128
>

Re: FS gives kernel UPS on attempt to create snapshot and after running balance it's unmountable.

2017-02-14 Thread Tomasz Kusmierz
backup_bytes_used:  6411278503936
backup_num_devices: 8

backup 1:
backup_tree_root:   5224155807744   gen: 150681 level: 1
backup_chunk_root:  8669488005120   gen: 150678 level: 1
backup_extent_root: 5224156233728   gen: 150681 level: 2
backup_fs_root: 0   gen: 0  level: 0
backup_dev_root:5224633155584   gen: 150681 level: 1
backup_csum_root:   5224634941440   gen: 150681 level: 3
backup_total_bytes: 16003191472128
backup_bytes_used:  6411278503936
backup_num_devices: 8

backup 2:
backup_tree_root:   5223857717248   gen: 150682 level: 1
backup_chunk_root:  8669488005120   gen: 150678 level: 1
backup_extent_root: 5223867383808   gen: 150682 level: 2
backup_fs_root: 0   gen: 0  level: 0
backup_dev_root:5224622358528   gen: 150682 level: 1
backup_csum_root:   5224675344384   gen: 150682 level: 3
backup_total_bytes: 16003191472128
backup_bytes_used:  6411278503936
backup_num_devices: 8

backup 3:
backup_tree_root:   11179477942272  gen: 150679 level: 1
backup_chunk_root:  8669488005120   gen: 150678 level: 1
backup_extent_root: 11179488018432  gen: 150679 level: 2
backup_fs_root: 6217817456640   gen: 150497 level: 0
backup_dev_root:5224337244160   gen: 150679 level: 1
backup_csum_root:   11179492540416  gen: 150679 level: 3
backup_total_bytes: 16003191472128
backup_bytes_used:  6411278503936
backup_num_devices: 8



On 14 February 2017 at 00:25, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:
>
>
> At 02/14/2017 08:23 AM, Tomasz Kusmierz wrote:
>>
>> Forgot to mention:
>>
>> btrfs inspect-internal dump-super -af /dev/sdc
>
>
> Your btrfs-progs is somewhat old, which doesn't integrate dump super into
> inspect-internal.
>
> In that case, you can use btrfs-show-super -af instead.
>
> Thanks,
> Qu
>
>>
>> btrfs inspect-internal: unknown token 'dump-super'
>> usage: btrfs inspect-internal <command> <args>
>>
>> btrfs inspect-internal inode-resolve [-v] <inode> <path>
>> Get file system paths for the given inode
>> btrfs inspect-internal logical-resolve [-Pv] [-s bufsize] <logical> <path>
>> Get file system paths for the given logical address
>> btrfs inspect-internal subvolid-resolve <subvolid> <path>
>> Get file system paths for the given subvolume ID.
>> btrfs inspect-internal rootid <path>
>> Get tree ID of the containing subvolume of path.
>> btrfs inspect-internal min-dev-size [options] <path>
>> Get the minimum size the device can be shrunk to. The
>>
>> query various internal information
>>
>> On 13 February 2017 at 14:58, Tomasz Kusmierz <tom.kusmi...@gmail.com>
>> wrote:
>>>
>>> Problem is to send a larger log into this mailing list :/
>>>
>>> Anyway: uname -a
>>> Linux tevva-server 4.8.7-1.el7.elrepo.x86_64 #1 SMP Thu Nov 10
>>> 20:47:24 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>>
>>> cut from messages (bear in mind that this is a single cut with a bit
>>> cut from inside of it to fit it in the email)
>>>
>>> Feb 10 00:17:14 server journal: ==>
>>> /var/log/gitlab/gitlab-shell/gitlab-shell.log <==
>>> Feb 10 00:17:30 server journal: 192.168.1.253 - wally_tm
>>> [10/Feb/2017:00:17:29 +] "PROPFIND /remote.php/webdav/Pictures
>>> HTTP/1.1" 207 1024 "-" "Mozilla/5.0 (Linux) mirall/2.1.1"
>>> Feb 10 00:18:00 server kernel: BTRFS info (device sdc): found 22 extents
>>> Feb 10 00:18:01 server journal: 192.168.1.253 - wally_tm
>>> [10/Feb/2017:00:17:59 +] "PROPFIND /remote.php/webdav/Pictures
>>> HTTP/1.1" 207 1024 "-" "Mozilla/5.0 (Linux) mirall/2.1.1"
>>> Feb 10 00:18:05 server kernel: BTRFS info (device sdc): found 22 extents
>>> Feb 10 00:18:06 server kernel: BTRFS info (device sdc): relocating
>>> block group 12353563131904 flags 65
>>> Feb 10 00:18:06 server journal:
>>> Feb 10 00:18:06 server journal: ==> /var/log/gitlab/sidekiq/current <==
>>> Feb 10 00:18:06 server journal: 2017-02-10_00:18:06.99341
>>> 2017-02-10T00:18:06.993Z 382 TID-otrr6ws48 P

Re: FS gives kernel UPS on attempt to create snapshot and after running balance it's unmountable.

2017-02-13 Thread Tomasz Kusmierz
Forgot to mention:

btrfs inspect-internal dump-super -af /dev/sdc

btrfs inspect-internal: unknown token 'dump-super'
usage: btrfs inspect-internal <command> <args>

btrfs inspect-internal inode-resolve [-v] <inode> <path>
Get file system paths for the given inode
btrfs inspect-internal logical-resolve [-Pv] [-s bufsize] <logical> <path>
Get file system paths for the given logical address
btrfs inspect-internal subvolid-resolve <subvolid> <path>
Get file system paths for the given subvolume ID.
btrfs inspect-internal rootid <path>
Get tree ID of the containing subvolume of path.
btrfs inspect-internal min-dev-size [options] <path>
Get the minimum size the device can be shrunk to. The

query various internal information

On 13 February 2017 at 14:58, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:
> Problem is to send a larger log into this mailing list :/
>
> Anyway: uname -a
> Linux tevva-server 4.8.7-1.el7.elrepo.x86_64 #1 SMP Thu Nov 10
> 20:47:24 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
>
>
> cut from messages (bear in mind that this is a single cut with a bit
> cut from inside of it to fit it in the email)
>
> Feb 10 00:17:14 server journal: ==>
> /var/log/gitlab/gitlab-shell/gitlab-shell.log <==
> Feb 10 00:17:30 server journal: 192.168.1.253 - wally_tm
> [10/Feb/2017:00:17:29 +] "PROPFIND /remote.php/webdav/Pictures
> HTTP/1.1" 207 1024 "-" "Mozilla/5.0 (Linux) mirall/2.1.1"
> Feb 10 00:18:00 server kernel: BTRFS info (device sdc): found 22 extents
> Feb 10 00:18:01 server journal: 192.168.1.253 - wally_tm
> [10/Feb/2017:00:17:59 +] "PROPFIND /remote.php/webdav/Pictures
> HTTP/1.1" 207 1024 "-" "Mozilla/5.0 (Linux) mirall/2.1.1"
> Feb 10 00:18:05 server kernel: BTRFS info (device sdc): found 22 extents
> Feb 10 00:18:06 server kernel: BTRFS info (device sdc): relocating
> block group 12353563131904 flags 65
> Feb 10 00:18:06 server journal:
> Feb 10 00:18:06 server journal: ==> /var/log/gitlab/sidekiq/current <==
> Feb 10 00:18:06 server journal: 2017-02-10_00:18:06.99341
> 2017-02-10T00:18:06.993Z 382 TID-otrr6ws48 PruneOldEventsWorker
> JID-99d3a4fb69be748c8674b5e1 INFO: start
> Feb 10 00:18:06 server journal: 2017-02-10_00:18:06.99571
> 2017-02-10T00:18:06.995Z 382 TID-otrr6wqok INFO: Cron Jobs - add job
> with name: prune_old_events_worker
> Feb 10 00:18:07 server journal: 2017-02-10_00:18:07.00454
> 2017-02-10T00:18:07.004Z 382 TID-otrr6ws48 PruneOldEventsWorker
> JID-99d3a4fb69be748c8674b5e1 INFO: done: 0.011 sec
> Feb 10 00:18:30 server journal: 192.168.1.253 - wally_tm
> [10/Feb/2017:00:18:29 +] "PROPFIND /remote.php/webdav/Pictures
> HTTP/1.1" 207 1024 "-" "Mozilla/5.0 (Linux) mirall/2.1.1"
> Feb 10 00:18:43 server kernel: BTRFS info (device sdc): found 32 extents
> Feb 10 00:18:48 server kernel: BTRFS info (device sdc): found 32 extents
> Feb 10 00:18:49 server kernel: BTRFS info (device sdc): relocating
> block group 12349268164608 flags 65
> Feb 10 00:19:01 server journal: 192.168.1.253 - wally_tm
> [10/Feb/2017:00:19:00 +] "PROPFIND /remote.php/webdav/Pictures
> HTTP/1.1" 207 1024 "-" "Mozilla/5.0 (Linux) mirall/2.1.1"
> Feb 10 00:19:02 server journal: 2017-02-10_00:19:02.51409
> 2017-02-10T00:19:02.513Z 382 TID-otrr6wqok INFO: Cron Jobs - add job
> with name: prune_old_events_worker
> Feb 10 00:19:02 server journal: 2017-02-10_00:19:02.51449
> 2017-02-10T00:19:02.514Z 382 TID-otrspth10 PruneOldEventsWorker
> JID-4a162ace334771baf4befbb7 INFO: start
> Feb 10 00:19:02 server journal: 2017-02-10_00:19:02.52994
> 2017-02-10T00:19:02.529Z 382 TID-otrspth10 PruneOldEventsWorker
> JID-4a162ace334771baf4befbb7 INFO: done: 0.015 sec
> Feb 10 00:19:26 server kernel: BTRFS info (device sdc): found 33 extents
> Feb 10 00:19:31 server kernel: BTRFS info (device sdc): found 33 extents
> Feb 10 00:19:31 server journal: 192.168.1.253 - wally_tm
> [10/Feb/2017:00:19:29 +] "PROPFIND /remote.php/webdav/Pictures
> HTTP/1.1" 207 1024 "-" "Mozilla/5.0 (Linux) mirall/2.1.1"
> Feb 10 00:19:32 server kernel: BTRFS info (device sdc): relocating
> block group 12344973197312 flags 65
> Feb 10 00:19:51 server kernel: perf: interrupt took too long (2513 >
> 2500), lowering kernel.perf_event_max_sample_rate to 79000
> Feb 10 00:20:00 server journal: 192.168.1.253 - wally_tm
> [10/Feb/2017:00:19:59 +] "PROPFIND /remote.php/webdav/Pictures
> HTTP/1.1" 207 1024 "-" "Mozilla/5.0 (Linux) mirall/2.1.1"
> Feb 10 00:20:10 server kernel: BTRFS info (device sdc): found 32 extents
> Feb 10 00:20:10 server journal: 2017-02-10_00:20:10.15695
> 2017-02-10T00:20:10.156Z 382 TID-otrsptg48
> RepositoryCheck::BatchWork

Re: FS gives kernel UPS on attempt to create snapshot and after running balance it's unmountable.

2017-02-13 Thread Tomasz Kusmierz
 flags 258
Feb 10 00:29:11 server kernel: #011#011shared block backref parent 5224641380352
Feb 10 00:29:11 server kernel: #011item 12 key (12288258162688 169 0)
itemoff 15854 itemsize 33
Feb 10 00:29:11 server kernel: #011#011extent refs 1 gen 142940 flags 258
Feb 10 00:29:11 server kernel: #011#011shared block backref parent 5224641380352
Feb 10 00:29:11 server kernel: #011item 13 key (12288258179072 169 0)
itemoff 15821 itemsize 33
Feb 10 00:29:11 server kernel: #011#011extent refs 1 gen 142940 flags 258
Feb 10 00:29:11 server kernel: #011#011shared block backref parent 5224641380352
Feb 10 00:29:11 server kernel: #011item 14 key (12288258375680 169 0)
itemoff 15788 itemsize 33
Feb 10 00:29:11 server kernel: #011#011extent refs 1 gen 142940 flags 258


























On 13 Feb 2017, at 00:49, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:



At 02/12/2017 09:17 AM, Tomasz Kusmierz wrote:

Hi all,

So my main storage filesystem got some sort of weird corruption (as far as
I can gather). Everything seems to work OK, but when I try to create a
snapshot or run balance (no filters) it gets remounted read only.


Kernel version please.


Fun part is that balance seems to be running even on the read-only FS, and
I continuously get kernel traces in /var/log/messages  so it might
as well be silently eating my data away in the background :/


Kernel backtrace please.

It would be better if you could paste the *first* kernel backtrace, as
that could be the cause, and following kernel backtrace is just
warning from btrfs_abort_transaction() without meaningful output.

I just see some normal messages, but no kernel backtrace.



UPDATE:

Yeah, after rebooting the system it does not even mount the FS,
mount.btrfs sits in some sort of spinlock and consumes 100% of a single
core.



UPDATE 2:

System is completely cooked :/

[root@server ~]# btrfs fi show
Label: 'rockstor_server'  uuid: 5581a647-40ef-4a7a-9d73-847bf35a142b
   Total devices 1 FS bytes used 5.72GiB
   devid1 size 53.17GiB used 7.03GiB path /dev/sda2

Label: 'broken_pool'  uuid: 26095277-a234-455b-8c97-8dac8ad934c8
   Total devices 2 FS bytes used 193.52GiB
   devid1 size 1.82TiB used 196.03GiB path /dev/sdb
   devid2 size 1.82TiB used 196.03GiB path /dev/sdi

Label: 'main_pool'  uuid: 0576d577-8954-4a60-a02b-9492b3c29318
   Total devices 8 FS bytes used 5.83TiB
   devid1 size 1.82TiB used 1.50TiB path /dev/sdc
   devid2 size 1.82TiB used 1.50TiB path /dev/sdd
   devid3 size 1.82TiB used 1.50TiB path /dev/sde
   devid4 size 1.82TiB used 1.50TiB path /dev/sdf
   devid5 size 1.82TiB used 1.50TiB path /dev/sdg
   devid6 size 1.82TiB used 1.50TiB path /dev/sdh
   devid7 size 1.82TiB used 1.50TiB path /dev/sdj
   devid8 size 1.82TiB used 1.50TiB path /dev/sdk

[root@server ~]# mount /dev/sdc /mnt2/main_pool/
mount: wrong fs type, bad option, bad superblock on /dev/sdc,
  missing codepage or helper program, or other error

  In some cases useful info is found in syslog - try
  dmesg | tail or so.
[root@server ~]# mount /dev/sdd /mnt2/main_pool/
mount: wrong fs type, bad option, bad superblock on /dev/sdd,
  missing codepage or helper program, or other error

  In some cases useful info is found in syslog - try
  dmesg | tail or so.
[root@server ~]# mount /dev/sde /mnt2/main_pool/
mount: wrong fs type, bad option, bad superblock on /dev/sde,
  missing codepage or helper program, or other error

  In some cases useful info is found in syslog - try
  dmesg | tail or so.


dmesg tail returns:
[ 9507.835629] systemd-udevd[1873]: Validate module index
[ 9507.835656] systemd-udevd[1873]: Check if link configuration needs reloading.
[ 9507.835690] systemd-udevd[1873]: seq 3698 queued, 'add' 'bdi'
[ 9507.835873] systemd-udevd[1873]: seq 3698 forked new worker [13858]
[ 9507.836202] BTRFS info (device sdd): disk space caching is enabled
[ 9507.836204] BTRFS info (device sdd): has skinny extents
[ 9507.836322] systemd-udevd[13858]: seq 3698 running
[ 9507.836443] systemd-udevd[13858]: no db file to read
/run/udev/data/+bdi:btrfs-4: No such file or directory
[ 9507.836474] systemd-udevd[13858]: RUN '/bin/mknod
/dev/btrfs-control c 10 234' /etc/udev/rules.d/64-btrfs.rules:1
[ 9507.837366] systemd-udevd[13861]: starting '/bin/mknod
/dev/btrfs-control c 10 234'
[ 9507.837833] BTRFS error (device sdd): failed to read the system array: -5
[ 9507.838231] systemd-udevd[13858]: '/bin/mknod /dev/btrfs-control c
10 234'(err) '/bin/mknod: '/dev/btrfs-control': File exists'
[ 9507.838262] systemd-udevd[13858]: '/bin/mknod /dev/btrfs-control c
10 234' [13861] exit with return code 1
[ 9507.854757] BTRFS: open_ctree failed
[ 9511.370878] BTRFS info (device sdd): disk space caching is enabled
[ 9511.370881] BTRFS info (device sdd): has skinny extents
[ 9511.375097] BTRFS error (device sdd): failed to read the system array: -5


Btrfs failed to read system chunk array from super block.
Normally this means 

FS gives kernel UPS on attempt to create snapshot and after running balance it's unmountable.

2017-02-11 Thread Tomasz Kusmierz
Hi all,

So my main storage filesystem got some sort of weird corruption (as far as
I can gather). Everything seems to work OK, but when I try to create a
snapshot or run balance (no filters) it gets remounted read only.

Fun part is that balance seems to be running even on the read-only FS, and
I continuously get kernel traces in /var/log/messages  so it might
as well be silently eating my data away in the background :/


UPDATE:

Yeah, after rebooting the system it does not even mount the FS,
mount.btrfs sits in some sort of spinlock and consumes 100% of a single
core.



UPDATE 2:

System is completely cooked :/

[root@server ~]# btrfs fi show
Label: 'rockstor_server'  uuid: 5581a647-40ef-4a7a-9d73-847bf35a142b
Total devices 1 FS bytes used 5.72GiB
devid1 size 53.17GiB used 7.03GiB path /dev/sda2

Label: 'broken_pool'  uuid: 26095277-a234-455b-8c97-8dac8ad934c8
Total devices 2 FS bytes used 193.52GiB
devid1 size 1.82TiB used 196.03GiB path /dev/sdb
devid2 size 1.82TiB used 196.03GiB path /dev/sdi

Label: 'main_pool'  uuid: 0576d577-8954-4a60-a02b-9492b3c29318
Total devices 8 FS bytes used 5.83TiB
devid1 size 1.82TiB used 1.50TiB path /dev/sdc
devid2 size 1.82TiB used 1.50TiB path /dev/sdd
devid3 size 1.82TiB used 1.50TiB path /dev/sde
devid4 size 1.82TiB used 1.50TiB path /dev/sdf
devid5 size 1.82TiB used 1.50TiB path /dev/sdg
devid6 size 1.82TiB used 1.50TiB path /dev/sdh
devid7 size 1.82TiB used 1.50TiB path /dev/sdj
devid8 size 1.82TiB used 1.50TiB path /dev/sdk

[root@server ~]# mount /dev/sdc /mnt2/main_pool/
mount: wrong fs type, bad option, bad superblock on /dev/sdc,
   missing codepage or helper program, or other error

   In some cases useful info is found in syslog - try
   dmesg | tail or so.
[root@server ~]# mount /dev/sdd /mnt2/main_pool/
mount: wrong fs type, bad option, bad superblock on /dev/sdd,
   missing codepage or helper program, or other error

   In some cases useful info is found in syslog - try
   dmesg | tail or so.
[root@server ~]# mount /dev/sde /mnt2/main_pool/
mount: wrong fs type, bad option, bad superblock on /dev/sde,
   missing codepage or helper program, or other error

   In some cases useful info is found in syslog - try
   dmesg | tail or so.


dmesg tail returns:
[ 9507.835629] systemd-udevd[1873]: Validate module index
[ 9507.835656] systemd-udevd[1873]: Check if link configuration needs reloading.
[ 9507.835690] systemd-udevd[1873]: seq 3698 queued, 'add' 'bdi'
[ 9507.835873] systemd-udevd[1873]: seq 3698 forked new worker [13858]
[ 9507.836202] BTRFS info (device sdd): disk space caching is enabled
[ 9507.836204] BTRFS info (device sdd): has skinny extents
[ 9507.836322] systemd-udevd[13858]: seq 3698 running
[ 9507.836443] systemd-udevd[13858]: no db file to read
/run/udev/data/+bdi:btrfs-4: No such file or directory
[ 9507.836474] systemd-udevd[13858]: RUN '/bin/mknod
/dev/btrfs-control c 10 234' /etc/udev/rules.d/64-btrfs.rules:1
[ 9507.837366] systemd-udevd[13861]: starting '/bin/mknod
/dev/btrfs-control c 10 234'
[ 9507.837833] BTRFS error (device sdd): failed to read the system array: -5
[ 9507.838231] systemd-udevd[13858]: '/bin/mknod /dev/btrfs-control c
10 234'(err) '/bin/mknod: '/dev/btrfs-control': File exists'
[ 9507.838262] systemd-udevd[13858]: '/bin/mknod /dev/btrfs-control c
10 234' [13861] exit with return code 1
[ 9507.854757] BTRFS: open_ctree failed
[ 9511.370878] BTRFS info (device sdd): disk space caching is enabled
[ 9511.370881] BTRFS info (device sdd): has skinny extents
[ 9511.375097] BTRFS error (device sdd): failed to read the system array: -5
[ 9511.392792] BTRFS: open_ctree failed
[ 9514.233627] BTRFS: device label main_pool devid 3 transid 150680 /dev/sde
[ 9514.234399] systemd-udevd[1873]: Validate module index
[ 9514.234431] systemd-udevd[1873]: Check if link configuration needs reloading.
[ 9514.234465] systemd-udevd[1873]: seq 3702 queued, 'add' 'bdi'
[ 9514.234522] systemd-udevd[1873]: passed 142 bytes to netlink
monitor 0x5628f65d40d0
[ 9514.234554] systemd-udevd[13882]: seq 3702 running
[ 9514.234780] systemd-udevd[13882]: no db file to read
/run/udev/data/+bdi:btrfs-6: No such file or directory
[ 9514.234790] BTRFS info (device sde): disk space caching is enabled
[ 9514.234792] BTRFS info (device sde): has skinny extents
[ 9514.234798] systemd-udevd[13882]: RUN '/bin/mknod
/dev/btrfs-control c 10 234' /etc/udev/rules.d/64-btrfs.rules:1
[ 9514.235181] systemd-udevd[13906]: starting '/bin/mknod
/dev/btrfs-control c 10 234'
[ 9514.236448] systemd-udevd[13882]: '/bin/mknod /dev/btrfs-control c
10 234'(err) '/bin/mknod: '/dev/btrfs-control': File exists'
[ 9514.236514] systemd-udevd[13882]: '/bin/mknod /dev/btrfs-control c
10 234' [13906] exit with return code 1
[ 9514.238726] BTRFS error (device sde): failed to read the system array: -5
[ 9514.255472] BTRFS: open_ctree failed
--
To unsubscribe from 

Re: Best practices for raid 1

2017-01-12 Thread Tomasz Kusmierz
That was a long-winded way of saying “there is no mechanism in btrfs to tell you
exactly which device is missing”, but thanks anyway.
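
For the record, the closest workaround I know of is to lean on the device listing and
the per-device counters, and to feed replace the numeric devid; a rough sketch,
assuming the filesystem is mounted at /mnt and /dev/sdx is the fresh disk:

btrfs fi show /mnt        # a dropped member shows up as "*** Some devices missing"
btrfs device stats /mnt   # per-device error counters
btrfs replace start -r <devid> /dev/sdx /mnt   # replace accepts the devid of the missing device

It still doesn't tell you which physical slot the dead disk sits in, which was the
point of my complaint.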

> On 12 Jan 2017, at 12:47, Austin S. Hemmelgarn <ahferro...@gmail.com> wrote:
> 
> On 2017-01-11 15:37, Tomasz Kusmierz wrote:
>> I would like to use this thread to ask few questions:
>> 
>> If we have 2 devices dying on us and we run RAID6 - this theoretically will
>> still run (despite our current problems). Now let’s say that we booted up a
>> raid6 of 10 disks and 2 of them died, but the operator does NOT know the dev
>> IDs of the disks that died. How does one remove those devices other than using
>> “-missing” ??? I ask because it’s stated in multiple places to use “replace”
>> when your device dies, but nobody ever states how to find out which /dev/
>> node is actually missing …. so when I want to use a replace, I don’t know
>> what to use within the command :/ … This whole thing might have an additional
>> complication - if the FS is full, then one would need to add disks before
>> removing the missing ones.
> raid6 is a special case right now (aside from the fact that it's not safe for 
> general usage) because it's the only profile on BTRFS that can sustain more 
> than one failed disk.  In the case that the devices aren't actually listed as 
> missing (most disks won't disappear unless the cabling, storage controller, 
> or disk electronics are bad), you can use btrfs fi show to see what the 
> mapping is.  If the disks are missing (again, not likely unless there's a 
> pretty severe electrical failure somewhere), it's safer in that case to add 
> enough devices to satisfy replication and storage constraints, then just run 
> 'btrfs device delete missing' to get rid of the other disks.
>> 
>> 
>>> On 10 Jan 2017, at 21:49, Chris Murphy <li...@colorremedies.com> wrote:
>>> 
>>> On Tue, Jan 10, 2017 at 2:07 PM, Vinko Magecic
>>> <vinko.mage...@construction.com> wrote:
>>>> Hello,
>>>> 
>>>> I set up a raid 1 with two btrfs devices and came across some situations 
>>>> in my testing that I can't get a straight answer on.
>>>> 
>>>> 1) When replacing a volume, do I still need to `umount /path` and then 
>>>> `mount -o degraded ...` the good volume before doing the `btrfs replace 
>>>> start ...` ?
>>> 
>>> No. If the device being replaced is unreliable, use -r to limit the
>>> reads from the device being replaced.
>>> 
>>> 
>>> 
>>>> I didn't see anything that said I had to and when I tested it without 
>>>> mounting the volume it was able to replace the device without any issue. 
>>>> Is that considered bad and could risk damage or has `replace` made it 
>>>> possible to replace devices without umounting the filesystem?
>>> 
>>> It's always been possible even before 'replace'.
>>> btrfs dev add 
>>> btrfs dev rem 
>>> 
>>> But there are some bugs in dev replace that Qu is working on; I think
>>> they mainly negatively impact raid56 though.
>>> 
>>> The one limitation of 'replace' is that the new block device must be
>>> equal to or larger than the block device being replaced; where dev add
>>>> dev rem doesn't require this.
>>> 
>>> 
>>>> 2) Everything I see about replacing a drive says to use `/old/device 
>>>> /new/device` but what if the old device can't be read or no longer exists?
>>> 
>>> The command works whether the device is present or not; but if it's
>>> present and working then any errors on one device can be corrected by
>>> the other, whereas if the device is missing, then any errors on the
>>> remaining device can't be corrected. Off hand I'm not sure if the
>>> replace continues and an error just logged...I think that's what
>>> should happen.
>>> 
>>> 
>>>> Would that be a `btrfs device add /new/device; btrfs balance start 
>>>> /new/device` ?
>>> 
>>> dev add then dev rem; the balance isn't necessary.
>>> 
>>>> 
>>>> 3) When I have the RAID1 with two devices and I want to grow it out, which 
>>>> is the better practice? Create a larger volume, replace the old device 
>>>> with the new device and then do it a second time for the other device, or 
>>>> attaching the new volumes to the label/uuid one at a time and with each 
>>>> one use `btrfs filesystem resize devid:max /mountpoint`.
>>> 
>>> If you're replacing a 2x raid1 with t

Re: Best practices for raid 1

2017-01-11 Thread Tomasz Kusmierz
I would like to use this thread to ask a few questions: 

If we have 2 devices dying on us and we run RAID6 - this theoretically will 
still run (despite our current problems). Now let's say that we booted up a 
raid6 of 10 disks and 2 of them died, but the operator does NOT know the dev 
IDs of the disks that died. How does one remove those devices other than by 
using "missing"? I ask because it is stated in multiple places to use "replace" 
when your device dies, but nobody ever states how to find out which /dev/ node 
is actually missing ... so when I want to use replace, I don't know what to put 
in the command :/ ... This whole thing might have an additional complication - 
if the FS is full, then one would need to add disks and then remove the missing 
one. 


> On 10 Jan 2017, at 21:49, Chris Murphy  wrote:
> 
> On Tue, Jan 10, 2017 at 2:07 PM, Vinko Magecic
>  wrote:
>> Hello,
>> 
>> I set up a raid 1 with two btrfs devices and came across some situations in 
>> my testing that I can't get a straight answer on.
>> 
>> 1) When replacing a volume, do I still need to `umount /path` and then 
>> `mount -o degraded ...` the good volume before doing the `btrfs replace 
>> start ...` ?
> 
> No. If the device being replaced is unreliable, use -r to limit the
> reads from the device being replaced.
> 
> 
> 
>> I didn't see anything that said I had to and when I tested it without 
>> mounting the volume it was able to replace the device without any issue. Is 
>> that considered bad and could risk damage or has `replace` made it possible 
>> to replace devices without umounting the filesystem?
> 
> It's always been possible even before 'replace'.
> btrfs dev add 
> btrfs dev rem 
> 
> But there are some bugs in dev replace that Qu is working on; I think
> they mainly negatively impact raid56 though.
> 
> The one limitation of 'replace' is that the new block device must be
> equal to or larger than the block device being replaced; where dev add
>> dev rem doesn't require this.
> 
> 
>> 2) Everything I see about replacing a drive says to use `/old/device 
>> /new/device` but what if the old device can't be read or no longer exists?
> 
> The command works whether the device is present or not; but if it's
> present and working then any errors on one device can be corrected by
> the other, whereas if the device is missing, then any errors on the
> remaining device can't be corrected. Off hand I'm not sure if the
> replace continues and an error just logged...I think that's what
> should happen.
> 
> 
>> Would that be a `btrfs device add /new/device; btrfs balance start 
>> /new/device` ?
> 
> dev add then dev rem; the balance isn't necessary.
> 
>> 
>> 3) When I have the RAID1 with two devices and I want to grow it out, which 
>> is the better practice? Create a larger volume, replace the old device with 
>> the new device and then do it a second time for the other device, or 
>> attaching the new volumes to the label/uuid one at a time and with each one 
>> use `btrfs filesystem resize devid:max /mountpoint`.
> 
> If you're replacing a 2x raid1 with two bigger replacements, you'd use
> 'btrfs replace' twice. Maybe it'd work concurrently, I've never tried
> it, but useful for someone to test and see if it explodes because if
> it's allowed, it should work or fail gracefully.
> 
> There's no need to do filesystem resizes when doing either 'replace'
> or 'dev add' followed by 'dev rem' because the fs resize is implied.
> First it's resized/grown with add; and then it's resized/shrink with
> remove. For replace there's a consolidation of steps, it's been a
> while since I've looked at the code so I can't tell you what steps it
> skips, what the state of the devices are in during the replace, which
> one active writes go to.
> 
> 
> -- 
> Chris Murphy
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Best practices for raid 1

2017-01-11 Thread Tomasz Kusmierz

> On 10 Jan 2017, at 21:07, Vinko Magecic  
> wrote:
> 
> Hello,
> 
> I set up a raid 1 with two btrfs devices and came across some situations in 
> my testing that I can't get a straight answer on.
> 1) When replacing a volume, do I still need to `umount /path` and then `mount 
> -o degraded ...` the good volume before doing the `btrfs replace start ...` ? 
> I didn't see anything that said I had to and when I tested it without 
> mounting the volume it was able to replace the device without any issue. Is 
> that considered bad and could risk damage or has `replace` made it possible 
> to replace devices without umounting the filesystem?

No need to unmount - just replace old with new. Your scenario seems very 
convoluted and pointless.
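
A minimal sketch of what I mean (paths are placeholders; -r just tells it to 
avoid reading from the old, flaky device where possible):

btrfs replace start -r /dev/old_disk /dev/new_disk /mnt/pool
btrfs replace status /mnt/pool        # the replace runs in the background, this shows progress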

> 2) Everything I see about replacing a drive says to use `/old/device 
> /new/device` but what if the old device can't be read or no longer exists? 
> Would that be a `btrfs device add /new/device; btrfs balance start 
> /new/device` ?
In case the old device is missing you've got a few options:
- if you have enough space to fit the data and enough disks to comply with 
redundancy - just remove the drive. For example, if you have 3 x 1TB drives in 
raid1 and use less than 1TB of data in total, just remove one drive and you 
will have 2 x 1TB drives in raid1, and btrfs will just rebalance the stuff for 
you!
- if you don't have enough space to fit the data / not enough disks left to 
comply with the raid level - your only option is to add a disk first and then 
remove the missing one (btrfs dev delete missing /mount_point_of_your_fs); see 
the rough sketch below.
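
Roughly, assuming a device really is gone (device names and mount point are 
placeholders):

mount -o degraded /dev/sdb /mnt/pool      # if a normal mount refuses because of the missing device
btrfs device add /dev/sdd /mnt/pool       # only when you need the extra space / redundancy
btrfs device delete missing /mnt/pool     # drops the dead device and re-replicates its chunks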

> 3) When I have the RAID1 with two devices and I want to grow it out, which is 
> the better practice? Create a larger volume, replace the old device with the 
> new device and then do it a second time for the other device, or attaching 
> the new volumes to the label/uuid one at a time and with each one use `btrfs 
> filesystem resize devid:max /mountpoint`.

You kinda misunderstand the principle of btrfs. Btrfs will span across ALL the 
available space you've got. If you have multiple devices in this setup 
(remember that a partition IS A DEVICE), it will span across multiple devices 
and you can't change this. Now, btrfs resize is meant for resizing the file 
system occupying a device (or partition). So the workflow is: if you want to 
shrink a device (partition), you first shrink the fs on this device, then size 
down the device (partition) ... if you want to increase the size of a device 
(partition), you increase the size of the device (partition) first, then grow 
the filesystem within it. This is 100% irrespective of the total cumulative 
size of the file system. 

Let's say you've got a btrfs file system spanning across 3 x 1TB devices ... 
and those devices are partitions. You have a raid1 setup - your complete amount 
of available space is 1.5TB. Let's say you want to shrink one of the partitions 
to 0.5TB -> first you shrink the FS on this partition (the relevant chunks get 
relocated automatically) -> you shrink the partition down to 0.5TB -> from now 
on your total available space is 1.25TB. 

Simples, right? :)
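
For the shrink example above, the rough order of commands would be something 
like this (devid 3, the size and the mount point are only illustrative):

btrfs filesystem resize 3:500G /mnt/pool   # shrink the fs on devid 3 first
# ...then shrink the underlying partition with your partitioning tool.
# Growing is the reverse: enlarge the partition first, then:
btrfs filesystem resize 3:max /mnt/pool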

> Thanks
> 
> 
> 
> 
>--
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How to get back a deleted sub-volume.

2016-12-12 Thread Tomasz Kusmierz
Chris,

the "btrfs-show-super -fa" gives me nothing useful to work with.

the "btrfs-find-root -a " is actually something that I was
already using (see original post), but the list of roots given had a
rather LARGE hole of 200 generations that is located between right
after I've had everything removed and 1 month before the whole
situation.
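
For reference, this is roughly the sequence Chris suggests below (the root
block number is a placeholder for one of the roots the tool lists; the target
directory is required even though -D is a dry run that writes nothing):

btrfs-find-root -a /dev/sda
btrfs restore -t <root_bytenr> -D -v /dev/sda /mnt2/scratch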

On 12 December 2016 at 04:14, Chris Murphy  wrote:
> Tomasz - try using 'btrfs-find-root -a ' I totally forgot about
> this option. It goes through the extent tree and might have a chance
> of finding additional generations that aren't otherwise being found.
> You can then plug those tree roots into 'btrfs restore -t '
> and do it with the -D and -v options so it's a verbose dry run, and
> see if the file listing it spits out is at all useful - if it has any
> of the data you're looking for.
>
>
> Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How to get back a deleted sub-volume.

2016-12-11 Thread Tomasz Kusmierz
Chris, for all the time you have helped so far I really have to apologise -
I've led you astray ... so, the reason the subvolumes were deleted has
nothing to do with btrfs itself. I'm using "Rockstor" to ease management
tasks. This tool / environment / distribution treats a singular btrfs FS as
a "pool" (something in line with zfs :/ ) and when one removes a pool from
the system it will actually go and delete the subvolumes from the FS before
unmounting it and removing the reference to it from its DB (yes, a bit
shite, I know). So I'm not blaming anybody here for the disappearing
subvolumes, it's just me coming back to believing in mankind only to get
kicked in the gonads by mankind's stupidity.

ALSO, importing the fs into their "solution" is actually just mounting it
and walking the tree of subvolumes to create all the references in the
local DB (for Rockstor of course, still nothing to do with btrfs
functionality). To be able to "import" I had to remove the before-mentioned
snapshots because the import script was timing out.

So for a single subvolume (physically called "share") I was left with no
snapshots (removed by me to make the import not time out), and then this
subvolume was removed when I was trying to remove the fs (pool) from a
running system.

I've pulled both disks (2-disk raid1 fs) and I'm trying to rescue as much
data as I can.

The question is: why, suddenly, when I removed the snapshots and (someone
else removed) the subvolume, is there such a great gap in the generations
of the FS (over 200 generations missing), and why is the most recent
generation that can actually be touched by btrfs restore over a month old?

How do I overcome that?



On 11 December 2016 at 19:00, Chris Murphy <li...@colorremedies.com> wrote:
> On Sun, Dec 11, 2016 at 10:40 AM, Tomasz Kusmierz
> <tom.kusmi...@gmail.com> wrote:
>> Hi,
>>
>> So, I've found my self in a pickle after following this steps:
>> 1. trying to migrate an array to different system, it became apparent
>> that importing array there was not possible to import it because I've
>> had a very large amount of snapshots (every 15 minutes during office
>> hours amounting to few K) so I've had to remove snapshots for main
>> data storage.
>
> True, there is no recursive incremental send.
>
>> 2. while playing with live array, it become apparent that some bright
>> spark implemented a "delete all sub-volumes while removing array from
>> system" ... needles to say that this behaviour is unexpected to say al
>> least ... and I wanted to punch somebody in face.
>
> The technical part of this is vague. I'm guessing you used 'btrfs
> device remove' but it works no differently than lvremove - when a
> device is removed from an array, it wipes the signature from that
> device.  You probably can restore that signature and use that device
> again, depending on what the profile is for metadata and data, it may
> be usable stand alone.
>
> Proposing assault is probably not the best way to ask for advice
> though. Just a guess.
>
>
>
>
>>
>> Since then I was trying to rescue as much data as I can, luckily I
>> managed to get a lot of data from snapshots for "other than share"
>> volumes (because those were not deleted :/) but the most important
>> volume "share" prove difficult. This subvolume comes out with a lot of
>> errors on readout with "btrfs restore /dev/sda /mnt2/temp2/ -x -m -S
>> -s -i -t".
>>
>> Also for some reason I can't use a lot of root blocks that I find with
>> btrfs-find-root ..
>>
>> To put some detail here:
>> btrfs-find-root -a /dev/sda
>> Superblock thinks the generation is 184540
>> Superblock thinks the level is 1
>> Well block 919363862528(gen: 184540 level: 1) seems good, and it
>> matches superblock
>> Well block 919356325888(gen: 184539 level: 1) seems good, but
>> generation/level doesn't match, want gen: 184540 level: 1
>> Well block 919343529984(gen: 184538 level: 1) seems good, but
>> generation/level doesn't match, want gen: 184540 level: 1
>> Well block 920041308160(gen: 184537 level: 1) seems good, but
>> generation/level doesn't match, want gen: 184540 level: 1
>> Well block 919941955584(gen: 184536 level: 1) seems good, but
>> generation/level doesn't match, want gen: 184540 level: 1
>> Well block 919670538240(gen: 184535 level: 1) seems good, but
>> generation/level doesn't match, want gen: 184540 level: 1
>> Well block 920045371392(gen: 184532 level: 1) seems good, but
>> generation/level doesn't match, want gen: 184540 level: 1
>> Well block 920070209536(gen: 184531 level: 1) seems good, but
>> generation/level doesn't match, want gen: 184540 level: 1
>> Well block 920117510144(g

How to get back a deleted sub-volume.

2016-12-11 Thread Tomasz Kusmierz
Hi,

So, I've found myself in a pickle after the following steps:
1. Trying to migrate an array to a different system, it became apparent
that importing the array there was not possible because I had a very large
number of snapshots (every 15 minutes during office hours, amounting to a
few thousand), so I had to remove the snapshots for the main data storage.
2. While playing with the live array, it became apparent that some bright
spark implemented a "delete all sub-volumes while removing the array from
the system" ... needless to say this behaviour is unexpected, to say the
least ... and I wanted to punch somebody in the face.
3. The off-site backup server that was making backups every 30 minutes
was located in the CEO's house, and his wife decided that it was not
necessary to have it connected.

(laughs can start roughly here)

So I've got the array with all the data there (theoretically COW, right?),
plus a plethora of snapshots (important data was snapshotted every 15
minutes during office hours to capture all the changes; other sub-volumes
were snapshotted daily).

This occurred roughly on 4-12-2016.

Since then I have been trying to rescue as much data as I can. Luckily I
managed to get a lot of data from snapshots for the "other than share"
volumes (because those were not deleted :/), but the most important
volume, "share", proved difficult. This subvolume comes out with a lot of
errors on readout with "btrfs restore /dev/sda /mnt2/temp2/ -x -m -S
-s -i -t".

Also for some reason I can't use a lot of root blocks that I find with
btrfs-find-root ..

To put some detail here:
btrfs-find-root -a /dev/sda
Superblock thinks the generation is 184540
Superblock thinks the level is 1
Well block 919363862528(gen: 184540 level: 1) seems good, and it
matches superblock
Well block 919356325888(gen: 184539 level: 1) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 919343529984(gen: 184538 level: 1) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 920041308160(gen: 184537 level: 1) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 919941955584(gen: 184536 level: 1) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 919670538240(gen: 184535 level: 1) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 920045371392(gen: 184532 level: 1) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 920070209536(gen: 184531 level: 1) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 920117510144(gen: 184530 level: 1) seems good, but
generation/level doesn't match, want gen: 184540 level: 1 <<< here
stuff is gone
Well block 920139055104(gen: 184511 level: 0) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 920139022336(gen: 184511 level: 0) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 920138989568(gen: 184511 level: 0) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 920138973184(gen: 184511 level: 0) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 920137596928(gen: 184511 level: 0) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 920137531392(gen: 184511 level: 0) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 920137515008(gen: 184511 level: 0) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 920135991296(gen: 184511 level: 0) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 920135958528(gen: 184511 level: 0) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 920135925760(gen: 184511 level: 0) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 920135827456(gen: 184511 level: 0) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 920135811072(gen: 184511 level: 0) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 920133697536(gen: 184511 level: 0) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 920133664768(gen: 184511 level: 0) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 92017088(gen: 184511 level: 0) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 920133206016(gen: 184511 level: 0) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 920132976640(gen: 184511 level: 0) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 920132878336(gen: 184511 level: 0) seems good, but
generation/level doesn't match, want gen: 184540 level: 1
Well block 920132845568(gen: 184511 level: 0) seems good, but
generation/level doesn't match, 

Re: Convert from RAID 5 to 10

2016-12-01 Thread Tomasz Kusmierz
FYI.
There is an old saying in embedded circles that evolved from Arthur C.
Clarke's "Any sufficiently advanced technology is indistinguishable from
magic." The engineering version states: "Any sufficiently advanced
incompetence is indistinguishable from malice."
Also, I'll quote you on the throwing-under-the-bus thing :) (I actually
like that justification)

On 1 December 2016 at 17:28, Chris Murphy <li...@colorremedies.com> wrote:
> On Wed, Nov 30, 2016 at 1:29 PM, Tomasz Kusmierz <tom.kusmi...@gmail.com> 
> wrote:
>
>> Please, I beg you add another column to man and wiki stating clearly
>> how many devices every profile can withstand to loose. I frequently
>> have to explain how btrfs profiles work and show quotes from this
>> mailing list because "dawning-kruger effect victims" keep poping up
>> with statements like "in btrfs raid10 with 8 drives you can loose 4
>> drives" ... I seriously beg you guys, my beating stick is half broken
>> by now.
>
> You need a new stick. It's called the ad hominem attack. When stupid
> people say stupid things, the dispute is not about the facts or
> opinions in the argument itself, but rather the person involved. There
> is the possibility this is more than stupidity, it really borders on
> maliciousness. Any ethical code of conduct for a list will accept ad
> hominem attacks over the willful dissemination of provably wrong
> information. When stupid assholes throw users under the bus with
> provably wrong (and bad) advice, it becomes something of an obligation
> to resort to name calling.
>
> Of course, I'd also like the wiki to clearly state the only profile
> that tolerates more than one device loss is raid6; and be very
> explicit with the manifestly wrong terminology being used by Btrfs's
> raid10 terminology. That is a fairly egregious violation of common
> terminology and the trust we're supposed to be developing, both in the
> usage of common terms, but also in Btrfs specifically.
>
>
>
> --
> Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Convert from RAID 5 to 10

2016-11-30 Thread Tomasz Kusmierz
On 30 November 2016 at 19:09, Chris Murphy  wrote:
> On Wed, Nov 30, 2016 at 7:37 AM, Austin S. Hemmelgarn
>  wrote:
>
>> The stability info could be improved, but _absolutely none_ of the things
>> mentioned as issues with raid1 are specific to raid1.  And in general, in
>> the context of a feature stability matrix, 'OK' generally means that there
>> are no significant issues with that specific feature, and since none of the
>> issues outlined are specific to raid1, it does meet that description of
>> 'OK'.
>
> Maybe the gotchas page needs a one or two liner for each profile's
> gotchas compared to what the profile leads the user into believing.
> The overriding gotcha with all Btrfs multiple device support is the
> lack of monitoring and notification other than kernel messages; and
> the raid10 actually being more like raid0+1 I think it certainly a
> gotcha, however 'man mkfs.btrfs' contains a grid that very clearly
> states raid10 can only safely lose 1 device.
>
>
>> Looking at this another way, I've been using BTRFS on all my systems since
>> kernel 3.16 (I forget what exact vintage that is in regular years).  I've
>> not had any data integrity or data loss issues as a result of BTRFS itself
>> since 3.19, and in just the past year I've had multiple raid1 profile
>> filesystems survive multiple hardware issues with near zero issues (with the
>> caveat that I had to re-balance after replacing devices to convert a few
>> single chunks to raid1), and that includes multiple disk failures and 2 bad
>> PSU's plus about a dozen (not BTRFS related) kernel panics and 4 unexpected
>> power loss events.  I also have exhaustive monitoring, so I'm replacing bad
>> hardware early instead of waiting for it to actually fail.
>
> Possibly nothing aids predictably reliable storage stacks than healthy
> doses of skepticism and awareness of all limitations. :-D
>
> --
> Chris Murphy
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Please, I beg you, add another column to the man page and wiki stating
clearly how many device losses every profile can withstand. I frequently
have to explain how btrfs profiles work and show quotes from this
mailing list because "Dunning-Kruger effect victims" keep popping up
with statements like "in btrfs raid10 with 8 drives you can lose 4
drives" ... I seriously beg you guys, my beating stick is half broken
by now.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID system with adaption to changed number of disks

2016-10-11 Thread Tomasz Kusmierz
I think you just described all the benefits of btrfs in that type of
configuration ... unfortunately, after btrfs RAID 5 & 6 was marked as
OK it got re-marked as "it will eat your data" (and there is a ton of
people in random places popping up with raid 5 & 6 setups that just
killed their data).

On 11 October 2016 at 16:14, Philip Louis Moetteli
 wrote:
> Hello,
>
>
> I have to build a RAID 6 with the following 3 requirements:
>
> • Use different kinds of disks with different sizes.
> • When a disk fails and there's enough space, the RAID should be able 
> to reconstruct itself out of the degraded state. Meaning, if I have e. g. a 
> RAID with 8 disks and 1 fails, I should be able to chose to transform this in 
> a non-degraded (!) RAID with 7 disks.
> • Also the other way round: If I add a disk of what size ever, it 
> should redistribute the data, so that it becomes a RAID with 9 disks.
>
> I don’t care, if I have to do it manually.
> I don’t care so much about speed either.
>
> Is BTrFS capable of doing that?
>
>
> Thanks a lot for your help!
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid levels and NAS drives

2016-10-09 Thread Tomasz Kusmierz
On 10 October 2016 at 02:01, ronnie sahlberg  wrote:
> (without html this time.)
>
> Nas drives are more expensive but also more durable than the normal consumer
> drives, but not as durable as enterprise drives.
> They are meant for near continous use, compared to consumer/backup drives
> that are meant for only occasional use and meant to spend the majority of
> time spinned down.
>
>
> They fall in-between consumer and enterprise gear.
>

Again, you've read a marketing flyer ...

Historically "enterprise drive" meant a drive with SCSI; after that it came
to mean a drive with more exotic interfaces like SAS or FATA ... nowadays it
means something more in line with "high [seek] performance, for which you
pay an extra extra extra buck" (10k, 15k rpm arrays of 10 disks with
databases on them serving plenty of people?).
Currently, consumer = low-end drive where you will not pay twice the
price for a 10% performance increase.

There is nothing there about reliability !!!
Now every [sane] storage engineer will choose "consumer" 5.4k drives
for cold storage / slow IO storage. For high-demand, very random seek
patterns everybody will go for an extremely fast disk that will die in 12
months, because the cost * effort of replacing a failed disk is still less
than assembling a comparable array from 7.2k disks (extra controller, extra
bays, extra power, extra everything!).

So:
1.
Stop reading marketing material that is designed to suck money out
of your pocket. Read the technical datasheet.
Stop reading paid-for articles from so-called "specialists"; my
company pays those people to put in articles that I write, to sound
more technical, so I can tell you how much "horse" those are.

2.
hdd:
faster rpm = better seek + better sequential read/write
slower rpm = survives longer + takes less power + better $ per GB

3.
what you need to use it for:
a remote NAS box? a single 5.4k hdd will saturate your gigabit LAN, and 7
will saturate your SFP+ - go for best $ per GB
local storage? 4 x 7.2k hdd in raid10 and you're talking good
performance! put more disks in and you can drop down to 5.4k
a high-demand database with thousands of people punching millions of
queries a second? 15k, as many as you can!

4.
For the time being, on btrfs give raid 5 & 6 a wide berth ... unless you
back up your data [very] regularly - then have fun :)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid levels and NAS drives

2016-10-09 Thread Tomasz Kusmierz
And what exactly are NAS drives ?

Are you talking marketing by any chance ? Please, tell me you got the pun.

On 10 October 2016 at 00:12, Charles Zeitler  wrote:
> Is there any advantage to using NAS drives
> under RAID levels,  as oppposed to regular
> 'desktop' drives for BTRFS?
>
> Charles Zeitler
>
> --
>  The Perfect Is The Enemy Of
>  The Good Enough
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Some help with the code.

2016-09-06 Thread Tomasz Kusmierz
This is predominantly for maintainers:

I've noticed that there is a lot of code in btrfs ... and after a few
glimpses I've noticed occurrences which beg for some refactoring to make
it less of a pain to maintain.

I'm speaking of occurrences where:
- within a function there are multiple checks for a null pointer and for
whatever is hanging off the end of that pointer, before finally calling
another function, passing the pointer to it, and watching it perform the
same checks before deallocating the stuff on the end of the pointer.
- single-line functions ... called in only two places

and so on.

I know that you guys are busy, but maintaining code that is only
growing must be a pain.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Status of SMR with BTRFS

2016-07-18 Thread Tomasz Kusmierz
Sorry for the late reply - there was a lot of traffic in this thread, so:

1. I do apologise, but I got the wrong end of the stick. I was
convinced that btrfs causes corruption on your disk because some of
the links that you had in the original post were pointing to topics
with corruption going on, but you are concerned about performance -
right?

2. I'm still not convinced that Seagate would miss out on such a feature
as SMR and mistakenly call it TGMR ... a lot of money is involved
with storing data and an egg-in-the-face moment can cost them a lot ...
ALSO it's called "Barracuda", which historically meant disks with good
IO performance; I can't think that somebody at Seagate would put the
Barracuda label on stinking SMR (yes, SMR stinks! but more about that
later on). Though I do remember how Micron used to have a complete mess
in their data sheets :/

3. f2fs, as well as jffs2 and logfs, will give tremendous performance on
spinners :D Those file systems are meant for tiny flash devices with
minimal IO, minimal power available and minimal erases, and to fit that
market they try to minimise fragmentation of the flash, serialise writes,
and eliminate jumping through the flash so it will self-wear-balance. The
result of that on a spinner is that it will give you very static and
sequential IO when writing data, but your reads will be crap ... and like
every single "developed for flash" filesystem it will expect your device
to have a 100% functional TRIM that results in a block (usually 128kB)
being reset to 0xFF ... spinners don't do that ... and this is where you
will see corruption. Also, a lot of those file systems require direct
access to the device rather than block device emulation. Also, on a flash
device you can walk in and alter a bit in a byte as long as you change it
from a physical "1" to "0"; on a spinner you need to rewrite a sector and
the CRC and corrective data associated with it, and on SMR you will have
to rewrite a whole BAND ... FUN !! So every time your filesystem marks
something for future TRIM it will try to set a single bit in the
block-associated data (hahaha, your band needs to get rewritten) and this
is how you will effectively kill sectors (bands) on your disk!

4. SMR stinks ... yes it does ... it's a recipe for disaster. Slight
modifications cause a lot of workload on the drive ... if you modify a
"sector", most of a band needs to get rewritten ... this is where
corruption creeps in and where the disk surface wears. I understand how
the NSA may have a use case for that - google shifts data between server
farms in the US and abroad, then they send a copy to the NSA (yes they
do), the NSA stores it, but they don't care about single bit rot or minor
defects; the data does not get modified, just analysed in bulk and
discarded, and the whole array gets written with fresh data ... amazon on
the other hand gets paid for not having your data corrupted ... so they
won't fancy SMR that much (maybe glacier). See a pattern here? For a user
to have that type of use case is just weird; if you want to back up your
data then you care about it ... and then I'm not convinced that SMR is
truly for you. Also, a 5TB device connected over USB3 used as a backup
:O :O :O :O :O :O I wouldn't keep my "just in case the internet was down
backup of pornhub" on that setup :) And I'm not picking on you here ... I
personally used a far better backup than that and still it failed, and
still people pointed out bluntly how pathetic it was ... and they were
right!


In terms of SMR those are my brutal opinions ... and nothing more. I
accept that most likely I'm wrong. Hell, I've been wrong most of my life;
it's just that after 10 years of engineering embedded devices for various
applications I'm very cautious, due to experiences with a lot of "some
bright spark" (clueless guy that wanted to feel more intelligent than the
engineers) "decided to use this revolutionising thing" (wanted to prove
himself and based everything on luck) "and created a valid learning
experience for the whole development team" (all the engineers wanted to
kill him) "and we all came out of that stronger and with more experience"
(he/she got fired).

On 18 July 2016 at 20:30, Austin S. Hemmelgarn  wrote:
> On 2016-07-18 15:05, Hendrik Friedel wrote:
>>
>> Hello Austin,
>>
>> thanks for your reply.
>>
 Ok, thanks; So, TGMR does not say whether or not the Device is SMR or
 not, right?
>>>
>>> I'm not 100% certain about that.  Technically, the only non-firmware
>>> difference is in the read head and the tracking.  If it were me, I'd be
>>> listing SMR instead of TGMR on the data sheet, but I'd be more than
>>> willing to bet that many drive manufacturers won't think like that.

 While the Data-Sheet does not mention SMR and the 'Desktop' in the name
 rather than 'Archive' would indicate no SMR, some reviews indicate SMR

 (http://www.legitreviews.com/seagate-barracuda-st5000dm000-5tb-desktop-hard-drive-review_161241)


>>>  Beyond that, I'm not sure,
>>> but I believe that their 

Re: Status of SMR with BTRFS

2016-07-16 Thread Tomasz Kusmierz
Just please don't take this as me picking on you or anything:

> It's a Seagate Expansion Desktop 5TB (USB3). It is probably a ST5000DM000.

this is a TGMR disk, not SMR:
http://www.seagate.com/www-content/product-content/desktop-hdd-fam/en-us/docs/100743772a.pdf
So it still conforms to the standard recording strategy ...


>> There are two types:
>> 1. SMR managed by device firmware. BTRFS sees that as a normal block
>> device … problems you get are not related to BTRFS it self …
>
> That for sure. But the way BTRFS uses/writes data could cause problems in
> conjunction with these devices still, no?
I'm sorry but I'm confused now - what "magical way of using/writing
data" do you actually mean? AFAIK btrfs sees the disk as a block device
... for example, devices have varying sector sizes, which are 512
bytes + some CRC + maybe ECC ... btrfs does not access this data, the
drive does ... to be honest, drives tend to lie to you continuously!
They use this ECC to magically bail out of a bad sector, give you the
data and silently switch to a spare sector ...

Now think slowly and thoroughly about it: who would write code (and
maintain it) for a file system that accesses device-specific data for X
vendors, each with Y model-specific
configurations/caveats/firmwares/protocols ... S.M.A.R.T. emerged to
give a unified interface to device statistics ... that is how bad it
was ...
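
For completeness, the usual way to actually read those unified statistics is
smartmontools - assuming the package is installed, and with /dev/sdX standing
in for a real disk:

smartctl -a /dev/sdX        # dumps whatever SMART attributes the drive chooses to expose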


FYI,
in 2009 I was creating a product with Linux that booted from a
flash-based FS ... some people required that the data would still boot up
unchanged after 20 years ... my answer was: "HOW". Yes, I could ensure a
certain file's integrity on readout by checking its md5, but I could not
guarantee whole-FS integrity ... especially at a time when jffs2 was the
only option on flash memories (yeah, it had to be RW as well @#$*@#$)
... so btrfs comes along and takes away most of those problems ... if you
care about your data, do some research ... if not ... maybe ReiserFS
is for you :)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How can I get blockdev offsets of btrfs chunks for a file?

2016-07-15 Thread Tomasz Kusmierz
No answer here, but mate, if you are involved in anything that will provide a 
more automated backup tool for btrfs, you've got a lot of silent people rooting 
for you.

> On 16 Jul 2016, at 00:21, Eric Wheeler  wrote:
> 
> Hello all,
> 
> We do btrfs subvolume snapshots over time for backups.  I would like to 
> traverse the files in the subvolumes and find the total unique chunk count 
> to calculate total space for a set of subvolumes.
> 
> This sounds kind of like the beginning of what a deduplicator would do, 
> but I just want to count the blocks, so no submission for deduplication.  
> I started looking at bedup and other deduplicator code, but the answer to 
> this question wasn't obvious (to me, anyway).
> 
> Questions:
> 
> Is there an ioctl (or some other way) to get the block device offset for a 
> file (or file offset) so I can count the unique occurances?
> 
> What API documentation should I review?
> 
> Can you point me at the ioctl(s) that would handle this?
> 
> 
> Thank you for your help!
> 
> 
> --
> Eric Wheeler
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Status of SMR with BTRFS

2016-07-15 Thread Tomasz Kusmierz
Though I'm not a hardcore storage system professional:

What disk are you using? There are two types:
1. SMR managed by the device firmware. BTRFS sees that as a normal block device 
... any problems you get are not related to BTRFS itself ...
2. SMR managed by the host system. BTRFS still sees this as a block device ... 
just emulated by the host system to look normal. 

In case of funky technologies like that I would research how exactly data is 
stored in terms of "BANDs" and experiment with setting the leaf & sector size 
to match a band, then create a btrfs on this device. 
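
Something along these lines - the numbers are purely illustrative, since 
nodesize tops out at 64KiB and sectorsize normally has to match the page size, 
so you cannot literally match a multi-megabyte band:

mkfs.btrfs -n 65536 -s 4096 /dev/sdX      # -n = nodesize (leaf size), -s = sectorsize
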
Run stress.sh on it for a couple of days.
If you get errors - set up a two-disk btrfs raid1 file system on standard disks 
and run stress.sh there to see whether you get errors on that system too, to 
rule out the possibility that your system itself is generating the errors. 
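
i.e. something like this for the control filesystem (disks and mount point are 
placeholders):

mkfs.btrfs -m raid1 -d raid1 /dev/sdY /dev/sdZ
mount /dev/sdY /mnt/test                  # then point the same stress workload at /mnt/test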

Then come back and we will see what’s going on :)


> On 15 Jul 2016, at 19:29, Hendrik Friedel  wrote:
> 
> Hello,
> 
> I have a 5TB Seagate drive that uses SMR.
> 
> I was wondering, if BTRFS is usable with this Harddrive technology. So, first 
> I searched the BTRFS wiki -nothing. Then google.
> 
> * I found this: https://bbs.archlinux.org/viewtopic.php?id=203696
> But this turned out to be an issue not related to BTRFS.
> 
> * Then this: http://www.snia.org/sites/default/files/SDC15_presentations/smr/ 
> HannesReinecke_Strategies_for_running_unmodified_FS_SMR.pdf
>  " BTRFS operation matches SMR parameters very closely [...]
> 
> High number of misaligned write accesses ; points to an issue with btrfs 
> itself
> 
> 
> * Then this: 
> http://superuser.com/questions/962257/fastest-linux-filesystem-on-shingled-disks
> The BTRFS performance seemed good.
> 
> 
> * Finally this: http://www.spinics.net/lists/linux-btrfs/msg48072.html
> "So you can get mixed results when trying to use the SMR devices but I'd say 
> it will mostly not work.
> But, btrfs has all the fundamental features in place, we'd have to make
> adjustments to follow the SMR constraints:"
> [...]
> I have some notes at
> https://github.com/kdave/drafts/blob/master/btrfs/smr-mode.txt;
> 
> 
> So, now I am wondering, what the state is today. "We" (I am happy to do that; 
> but not sure of access rights) should also summarize this in the wiki.
> My use-case by the way are back-ups. I am thinking of using some of the 
> interesting BTRFS features for this (send/receive, deduplication)
> 
> Greetings,
> Hendrik
> 
> 
> ---
> Diese E-Mail wurde von Avast Antivirus-Software auf Viren geprüft.
> https://www.avast.com/antivirus
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid1 has failing disks, but smart is clear

2016-07-08 Thread Tomasz Kusmierz
>
> Well, I was able to run memtest on the system last night, that passed with
> flying colors, so I'm now leaning toward the problem being in the sas card.
> But I'll have to run some more tests.
>

Seriously, use "stress.sh" for a couple of days. When I was running
memtest it ran continuously for 3 days without an error; one day of
stress.sh and errors started showing up.
Be VERY careful with trusting any tool of that sort - modern CPUs lie
to you continuously !!!
1. You may think that you've written the best code on the planet for
bypassing the CPU cache, but in reality, since CPUs are multicore, you can
end up with an overzealous MPMD trapping you inside your cache memory, and
all your testing will do is write a page (trapped in cache) and read it
back from cache (the coherency mechanism, not the miss/hit one) - it traps
you inside L3 so you have no clue you never touch the RAM; then the CPU
just dumps your page to RAM and "job done".
2. Due to coherency problems and real problems with non-blocking on
MPMD you can have a DMA controller sucking pages out of your own cache,
because the RAM is marked as dirty and the CPU will try to save time
and accelerate the operation by pushing the DMA straight out of L3 to
somewhere else (mentioning this since some testers use crazy ways of
forcing your RAM access via DMA to somewhere and back, to force dropping
out of L3).
3. This one is actually funny: some testers didn't claim the pages for
the process, so for some reason the pages they were using were not showing
up as used / dirty etc., so all the testing was done in 32kB of L1
... the tests were fast though :)

stress.sh will test the operation of the whole system !!! It shifts a lot
of data so the disks are engaged, the CPU keeps pumping out CRC32 all the
time so it's busy, and the RAM gets hit nicely as well due to the high DMA.

When I come to think about it, if your device nodes change during
operation of the system it might be the LSI card dying -> reinitialising
-> rediscovering drives -> drives showing up at different nodes. On my
system I can hot-swap SATA and it will come up with a different dev even
though it was connected to the same place on the controller.

I think, most importantly - I presume you run non-ECC RAM?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-06 Thread Tomasz Kusmierz

> On 7 Jul 2016, at 02:46, Chris Murphy  wrote:
> 

Chaps, I didn't want this to spring up as a btrfs performance argument,

BUT 

you are throwing around a lot of useful data - maybe divert some of it into the 
wiki? You know, us normal people might find it useful for making an educated 
choice at some point in the future? :)

Interestingly on my RAID10 with 6 disks I only get:

dd if=/mnt/share/asdf of=/dev/zero bs=100M
113+1 records in
113+1 records out
11874643004 bytes (12 GB, 11 GiB) copied, 45.3123 s, 262 MB/s


filefrag -v
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..2471: 2101940598..2101943069:   2472:
   1: 2472..   12583: 1938312686..1938322797:  10112: 2101943070:
   2:12584..   12837: 1937534654..1937534907:254: 1938322798:
   3:12838..   12839: 1937534908..1937534909:  2:
   4:12840..   34109: 1902954063..1902975332:  21270: 1937534910:
   5:34110..   53671: 1900857931..1900877492:  19562: 1902975333:
   6:53672..   54055: 1900877493..1900877876:384:
   7:54056..   54063: 1900877877..1900877884:  8:
   8:54064..   98041: 1900877885..1900921862:  43978:
   9:98042..  117671: 1900921863..1900941492:  19630:
  10:   117672..  118055: 1900941493..1900941876:384:
  11:   118056..  161833: 1900941877..1900985654:  43778:
  12:   161834..  204013: 1900985655..1901027834:  42180:
  13:   204014..  214269: 1901027835..1901038090:  10256:
  14:   214270..  214401: 1901038091..1901038222:132:
  15:   214402..  214407: 1901038223..1901038228:  6:
  16:   214408..  258089: 1901038229..1901081910:  43682:
  17:   258090..  300139: 1901081911..1901123960:  42050:
  18:   300140..  310559: 1901123961..1901134380:  10420:
  19:   310560..  310695: 1901134381..1901134516:136:
  20:   310696..  354251: 1901134517..1901178072:  43556:
  21:   354252..  396389: 1901178073..1901220210:  42138:
  22:   396390..  406353: 1901220211..1901230174:   9964:
  23:   406354..  406515: 1901230175..1901230336:162:
  24:   406516..  406519: 1901230337..1901230340:  4:
  25:   406520..  450115: 1901230341..1901273936:  43596:
  26:   450116..  492161: 1901273937..1901315982:  42046:
  27:   492162..  524199: 1901315983..1901348020:  32038:
  28:   524200..  535355: 1901348021..1901359176:  11156:
  29:   535356..  535591: 1901359177..1901359412:236:
  30:   535592.. 1315369: 1899830240..1900610017: 779778: 1901359413:
  31:  1315370.. 1357435: 1901359413..1901401478:  42066: 1900610018:
  32:  1357436.. 1368091: 1928101070..1928111725:  10656: 1901401479:
  33:  1368092.. 1368231: 1928111726..1928111865:140:
  34:  1368232.. 2113959: 1899043808..1899789535: 745728: 1928111866:
  35:  2113960.. 2899082: 1898257376..1899042498: 785123: 1899789536: last,eof


If it were possible to read from 6 disks at once, maybe this performance 
would be better for linear reads.

Anyway, this is a huge diversion from the original question, so maybe we will 
end here?


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-06 Thread Tomasz Kusmierz

> On 7 Jul 2016, at 00:22, Kai Krakow <hurikha...@gmail.com> wrote:
> 
> Am Wed, 6 Jul 2016 13:20:15 +0100
> schrieb Tomasz Kusmierz <tom.kusmi...@gmail.com>:
> 
>> When I think of it, I did move this folder first when filesystem was
>> RAID 1 (or not even RAID at all) and then it was upgraded to RAID 1
>> then RAID 10. Was there a faulty balance around August 2014 ? Please
>> remember that I’m using Ubuntu so it was probably kernel from Ubuntu
>> 14.04 LTS
>> 
>> Also, I would like to hear it from horses mouth: dos & donts for a
>> long term storage where you moderately care about the data: RAID10 -
>> flaky ? would RAID1 give similar performance ?
> 
> The current implementation of RAID0 in btrfs is probably not very
> optimized. RAID0 is a special case anyways: Stripes have a defined
> width - I'm not sure what it is for btrfs, probably it's per chunk, so
> it's 1GB, maybe it's 64k **. That means your data is usually not read
> from multiple disks in parallel anyways as long as requests are below
> stripe width (which is probably true for most access patterns except
> copying files) - there's no immediate performance benefit. This holds
> true for any RAID0 with read and write patterns below the stripe size.
> Data is just more evenly distributed across devices and your
> application will only benefit performance-wise if accesses spread
> semi-random across the span of the whole file. And at least last time I
> checked, it was stated that btrfs raid0 does not submit IOs in parallel
> yet but first reads one stripe, then the next - so it doesn't submit
> IOs to different devices in parallel.
> 
> Getting to RAID1, btrfs is even less optimized: Stripe decision is based
> on process pids instead of device load, read accesses won't distribute
> evenly to different stripes per single process, it's only just reading
> from the same single device - always. Write access isn't faster anyways:
> Both stripes need to be written - writing RAID1 is single device
> performance only.
> 
> So I guess, at this stage there's no big difference between RAID1 and
> RAID10 in btrfs (except maybe for large file copies), not for single
> process access patterns and neither for multi process access patterns.
> Btrfs can only benefit from RAID1 in multi process access patterns
> currently, as can btrfs RAID0 by design for usual small random access
> patterns (and maybe large sequential operations). But RAID1 with more
> than two disks and multi process access patterns is more or less equal
> to RAID10 because stripes are likely to be on different devices anyways.
> 
> In conclusion: RAID1 is simpler than RAID10 and thus its less likely to
> contain flaws or bugs.
> 
> **: Please enlighten me, I couldn't find docs on this matter.

:O 

It's an eye-opener - I think this should end up on the btrfs WIKI ... seriously!

Anyway, my use case for this is "storage", therefore I predominantly copy large 
files. 


> -- 
> Regards,
> Kai
> 
> Replies to list-only preferred.
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid1 has failing disks, but smart is clear

2016-07-06 Thread Tomasz Kusmierz

> On 6 Jul 2016, at 23:14, Corey Coughlin  wrote:
> 
> Hi all,
>Hoping you all can help, have a strange problem, think I know what's going 
> on, but could use some verification.  I set up a raid1 type btrfs filesystem 
> on an Ubuntu 16.04 system, here's what it looks like:
> 
> btrfs fi show
> Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
>Total devices 10 FS bytes used 3.42TiB
>devid1 size 1.82TiB used 1.18TiB path /dev/sdd
>devid2 size 698.64GiB used 47.00GiB path /dev/sdk
>devid3 size 931.51GiB used 280.03GiB path /dev/sdm
>devid4 size 931.51GiB used 280.00GiB path /dev/sdl
>devid5 size 1.82TiB used 1.17TiB path /dev/sdi
>devid6 size 1.82TiB used 823.03GiB path /dev/sdj
>devid7 size 698.64GiB used 47.00GiB path /dev/sdg
>devid8 size 1.82TiB used 1.18TiB path /dev/sda
>devid9 size 1.82TiB used 1.18TiB path /dev/sdb
>devid   10 size 1.36TiB used 745.03GiB path /dev/sdh
> 
> I added a couple disks, and then ran a balance operation, and that took about 
> 3 days to finish.  When it did finish, tried a scrub and got this message:
> 
> scrub status for 597ee185-36ac-4b68-8961-d4adc13f95d4
>scrub started at Sun Jun 26 18:19:28 2016 and was aborted after 01:16:35
>total bytes scrubbed: 926.45GiB with 18849935 errors
>error details: read=18849935
>corrected errors: 5860, uncorrectable errors: 18844075, unverified errors: > 0
> 
> So that seems bad.  Took a look at the devices and a few of them have errors:
> ...
> [/dev/sdi].generation_errs 0
> [/dev/sdj].write_io_errs   289436740
> [/dev/sdj].read_io_errs289492820
> [/dev/sdj].flush_io_errs   12411
> [/dev/sdj].corruption_errs 0
> [/dev/sdj].generation_errs 0
> [/dev/sdg].write_io_errs   0
> ...
> [/dev/sda].generation_errs 0
> [/dev/sdb].write_io_errs   3490143
> [/dev/sdb].read_io_errs111
> [/dev/sdb].flush_io_errs   268
> [/dev/sdb].corruption_errs 0
> [/dev/sdb].generation_errs 0
> [/dev/sdh].write_io_errs   5839
> [/dev/sdh].read_io_errs2188
> [/dev/sdh].flush_io_errs   11
> [/dev/sdh].corruption_errs 1
> [/dev/sdh].generation_errs 16373
> 
> So I checked the smart data for those disks, they seem perfect, no 
> reallocated sectors, no problems.  But one thing I did notice is that they 
> are all WD Green drives.  So I'm guessing that if they power down and get 
> reassigned to a new /dev/sd* letter, that could lead to data corruption.  I 
> used idle3ctl to turn off the shut down mode on all the green drives in the 
> system, but I'm having trouble getting the filesystem working without the 
> errors.  I tried a 'check --repair' command on it, and it seems to find a lot 
> of verification errors, but it doesn't look like things are getting fixed.
>  But I have all the data on it backed up on another system, so I can recreate 
> this if I need to.  But here's what I want to know:
> 
> 1.  Am I correct about the issues with the WD Green drives, if they change 
> mounts during disk operations, will that corrupt data?
I just wanted to chip in about the WD Green drives. I have a RAID10 running on
6x2TB of those, and have had for ~3 years. If a disk spins down and you try to
access something, the kernel, the FS and the whole system will wait for the
drive to spin back up and everything works OK. I've never had a drive
reassigned to a different /dev/sdX because of a spin down / up.
2 years ago I had corruption caused by not using ECC ram on my system: one of
the RAM modules started producing errors that were never caught by the CPU /
MoBo. Long story short, a guy here managed to point me in the right direction
and I started shifting my data to a hopefully new and not corrupted FS … but I
was sceptical because of the same issue you describe, so on a mounted raid1 I
moved a disk from one SATA port to another and the FS picked up the disk in
its new location and did not even blink (as far as I remember there was a
syslog entry saying the disk vanished and then that it was added back).

One last word: you've got plenty of transfer-related errors in your SMART data.
Please be advised that this may mean:
- a faulty cable
- a faulty mobo (motherboard) SATA controller
- a faulty drive controller
- bad RAM - yes, the motherboard CAN use your RAM for buffering data and
transfer-related structures … especially the cheaper boards. A quick
cross-check is sketched below.
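
For completeness, this is roughly how I cross-check transfer errors on my own
boxes (a sketch; the device and mount point are placeholders):

# per-device error counters kept by btrfs itself
btrfs device stats /mnt/yourpool

# SMART attributes; attribute 199 (UDMA_CRC_Error_Count) usually points at the
# path to the drive (cable, backplane, controller) rather than the platters
smartctl -A /dev/sdj | grep -i -e crc -e reallocated -e pending

A climbing UDMA CRC count with zero reallocated / pending sectors usually means
the drive itself is fine and the cabling or controller is the suspect.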

> 2.  If that is the case:
>a.) Is there any way I can stop the /dev/sd* mount points from changing?  
> Or can I set up the filesystem using UUIDs or something more solid?  I 
> googled about it, but found conflicting info
Don’t take it the wrong way, but I’m personally surprised that anybody still
relies on raw device names rather than UUIDs. Device names change from boot to
boot for a lot of people, and most distros moved to UUIDs a while ago (2 years?
even swap is mounted via UUID now) - see the sketch below.
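
For reference, this is the kind of setup I mean (a sketch; the UUID is the one
from your fi show output above, the mount point is a placeholder):

# find the filesystem UUID (any member device of the btrfs pool will do)
blkid /dev/sdd

# /etc/fstab entry - btrfs is mounted by the filesystem UUID, so it no longer
# matters which /dev/sdX letters the member disks get at boot
UUID=597ee185-36ac-4b68-8961-d4adc13f95d4  /mnt/pool  btrfs  defaults  0  0

# one-off mount by UUID for testing
mount -U 597ee185-36ac-4b68-8961-d4adc13f95d4 /mnt/pool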

>b.) Or, is there something else changing my drive devices?  I have most of 
> drives on an LSI SAS 9201-16i card, is there something I need to 

Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-06 Thread Tomasz Kusmierz

> On 6 Jul 2016, at 22:41, Henk Slager <eye...@gmail.com> wrote:
> 
> On Wed, Jul 6, 2016 at 2:20 PM, Tomasz Kusmierz <tom.kusmi...@gmail.com> 
> wrote:
>> 
>>> On 6 Jul 2016, at 02:25, Henk Slager <eye...@gmail.com> wrote:
>>> 
>>> On Wed, Jul 6, 2016 at 2:32 AM, Tomasz Kusmierz <tom.kusmi...@gmail.com> 
>>> wrote:
>>>> 
>>>> On 6 Jul 2016, at 00:30, Henk Slager <eye...@gmail.com> wrote:
>>>> 
>>>> On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz <tom.kusmi...@gmail.com>
>>>> wrote:
>>>> 
>>>> I did consider that, but:
>>>> - some files were NOT accessed by anything with 100% certainty (well if
>>>> there is a rootkit on my system or something in that shape than maybe yes)
>>>> - the only application that could access those files is totem (well
>>>> Nautilius checks extension -> directs it to totem) so in that case we would
>>>> hear about out break of totem killing people files.
>>>> - if it was a kernel bug then other large files would be affected.
>>>> 
>>>> Maybe I’m wrong and it’s actually related to the fact that all those files
>>>> are located in single location on file system (single folder) that might
>>>> have a historical bug in some structure somewhere ?
>>>> 
>>>> 
>>>> I find it hard to imagine that this has something to do with the
>>>> folderstructure, unless maybe the folder is a subvolume with
>>>> non-default attributes or so. How the files in that folder are created
>>>> (at full disktransferspeed or during a day or even a week) might give
>>>> some hint. You could run filefrag and see if that rings a bell.
>>>> 
>>>> files that are 4096 show:
>>>> 1 extent found
>>> 
>>> I actually meant filefrag for the files that are not (yet) truncated
>>> to 4k. For example for virtual machine imagefiles (CoW), one could see
>>> an MBR write.
>> 117 extents found
>> filesize 15468645003
>> 
>> good / bad ?
> 
> 117 extents for a 1.5G file is fine, with -v option you could see the
> fragmentation at the start, but this won't lead to any hint why you
> have the truncate issue.
> 
>>>> I did forgot to add that file system was created a long time ago and it was
>>>> created with leaf & node size = 16k.
>>>> 
>>>> 
>>>> If this long time ago is >2 years then you have likely specifically
>>>> set node size = 16k, otherwise with older tools it would have been 4K.
>>>> 
>>>> You are right I used -l 16K -n 16K
>>>> 
>>>> Have you created it as raid10 or has it undergone profile conversions?
>>>> 
>>>> Due to lack of spare disks
>>>> (it may sound odd for some but spending for more than 6 disks for home use
>>>> seems like an overkill)
>>>> and due to last I’ve had I had to migrate all data to new file system.
>>>> This played that way that I’ve:
>>>> 1. from original FS I’ve removed 2 disks
>>>> 2. Created RAID1 on those 2 disks,
>>>> 3. shifted 2TB
>>>> 4. removed 2 disks from source FS and adde those to destination FS
>>>> 5 shifted 2 further TB
>>>> 6 destroyed original FS and adde 2 disks to destination FS
>>>> 7 converted destination FS to RAID10
>>>> 
>>>> FYI, when I convert to raid 10 I use:
>>>> btrfs balance start -mconvert=raid10 -dconvert=raid10 -sconvert=raid10 -f
>>>> /path/to/FS
>>>> 
>>>> this filesystem has 5 sub volumes. Files affected are located in separate
>>>> folder within a “victim folder” that is within a one sub volume.
>>>> 
>>>> 
>>>> It could also be that the ondisk format is somewhat corrupted (btrfs
>>>> check should find that ) and that that causes the issue.
>>>> 
>>>> 
>>>> root@noname_server:/mnt# btrfs check /dev/sdg1
>>>> Checking filesystem on /dev/sdg1
>>>> UUID: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
>>>> checking extents
>>>> checking free space cache
>>>> checking fs roots
>>>> checking csums
>>>> checking root refs
>>>> found 4424060642634 bytes used err is 0
>>>> total csum bytes: 4315954936
>>>> total tree bytes: 4522786816
>>>> total fs tree bytes: 61702144
>>>> total extent tree by

Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-06 Thread Tomasz Kusmierz

> On 6 Jul 2016, at 02:25, Henk Slager <eye...@gmail.com> wrote:
> 
> On Wed, Jul 6, 2016 at 2:32 AM, Tomasz Kusmierz <tom.kusmi...@gmail.com> 
> wrote:
>> 
>> On 6 Jul 2016, at 00:30, Henk Slager <eye...@gmail.com> wrote:
>> 
>> On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz <tom.kusmi...@gmail.com>
>> wrote:
>> 
>> I did consider that, but:
>> - some files were NOT accessed by anything with 100% certainty (well if
>> there is a rootkit on my system or something in that shape than maybe yes)
>> - the only application that could access those files is totem (well
>> Nautilius checks extension -> directs it to totem) so in that case we would
>> hear about out break of totem killing people files.
>> - if it was a kernel bug then other large files would be affected.
>> 
>> Maybe I’m wrong and it’s actually related to the fact that all those files
>> are located in single location on file system (single folder) that might
>> have a historical bug in some structure somewhere ?
>> 
>> 
>> I find it hard to imagine that this has something to do with the
>> folderstructure, unless maybe the folder is a subvolume with
>> non-default attributes or so. How the files in that folder are created
>> (at full disktransferspeed or during a day or even a week) might give
>> some hint. You could run filefrag and see if that rings a bell.
>> 
>> files that are 4096 show:
>> 1 extent found
> 
> I actually meant filefrag for the files that are not (yet) truncated
> to 4k. For example for virtual machine imagefiles (CoW), one could see
> an MBR write.
117 extents found
filesize 15468645003

good / bad ?  
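
For the record, the invocation I'm running on the big files (a sketch; the
path is a placeholder):

filefrag -v /mnt/share/victim_folder/big_file.mkv

With -v it lists every extent with its logical and physical offsets, so
fragmentation at the start of a file is easy to spot.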
> 
>> I did forgot to add that file system was created a long time ago and it was
>> created with leaf & node size = 16k.
>> 
>> 
>> If this long time ago is >2 years then you have likely specifically
>> set node size = 16k, otherwise with older tools it would have been 4K.
>> 
>> You are right I used -l 16K -n 16K
>> 
>> Have you created it as raid10 or has it undergone profile conversions?
>> 
>> Due to lack of spare disks
>> (it may sound odd for some but spending for more than 6 disks for home use
>> seems like an overkill)
>> and due to last I’ve had I had to migrate all data to new file system.
>> This played that way that I’ve:
>> 1. from original FS I’ve removed 2 disks
>> 2. Created RAID1 on those 2 disks,
>> 3. shifted 2TB
>> 4. removed 2 disks from source FS and adde those to destination FS
>> 5 shifted 2 further TB
>> 6 destroyed original FS and adde 2 disks to destination FS
>> 7 converted destination FS to RAID10
>> 
>> FYI, when I convert to raid 10 I use:
>> btrfs balance start -mconvert=raid10 -dconvert=raid10 -sconvert=raid10 -f
>> /path/to/FS
>> 
>> this filesystem has 5 sub volumes. Files affected are located in separate
>> folder within a “victim folder” that is within a one sub volume.
>> 
>> 
>> It could also be that the ondisk format is somewhat corrupted (btrfs
>> check should find that ) and that that causes the issue.
>> 
>> 
>> root@noname_server:/mnt# btrfs check /dev/sdg1
>> Checking filesystem on /dev/sdg1
>> UUID: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
>> checking extents
>> checking free space cache
>> checking fs roots
>> checking csums
>> checking root refs
>> found 4424060642634 bytes used err is 0
>> total csum bytes: 4315954936
>> total tree bytes: 4522786816
>> total fs tree bytes: 61702144
>> total extent tree bytes: 41402368
>> btree space waste bytes: 72430813
>> file data blocks allocated: 4475917217792
>> referenced 4420407603200
>> 
>> No luck there :/
> 
> Indeed looks all normal.
> 
>> In-lining on raid10 has caused me some trouble (I had 4k nodes) over
>> time, it has happened over a year ago with kernels recent at that
>> time, but the fs was converted from raid5
>> 
>> Could you please elaborate on that ? you also ended up with files that got
>> truncated to 4096 bytes ?
> 
> I did not have truncated to 4k files, but your case lets me think of
> small files inlining. Default max_inline mount option is 8k and that
> means that 0 to ~3k files end up in metadata. I had size corruptions
> for several of those small sized files that were updated quite
> frequent, also within commit time AFAIK. Btrfs check lists this as
> errors 400, although fs operation is not disturbed. I don't know what
> happens if those small files are being updated/rewritten and are just
> belo

Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-05 Thread Tomasz Kusmierz
On 6 Jul 2016, at 00:30, Henk Slager <eye...@gmail.com> wrote:
> 
> On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:
>> I did consider that, but:
>> - some files were NOT accessed by anything with 100% certainty (well if 
>> there is a rootkit on my system or something in that shape than maybe yes)
>> - the only application that could access those files is totem (well 
>> Nautilius checks extension -> directs it to totem) so in that case we would 
>> hear about out break of totem killing people files.
>> - if it was a kernel bug then other large files would be affected.
>> 
>> Maybe I’m wrong and it’s actually related to the fact that all those files 
>> are located in single location on file system (single folder) that might 
>> have a historical bug in some structure somewhere ?
> 
> I find it hard to imagine that this has something to do with the
> folderstructure, unless maybe the folder is a subvolume with
> non-default attributes or so. How the files in that folder are created
> (at full disktransferspeed or during a day or even a week) might give
> some hint. You could run filefrag and see if that rings a bell.
files that are 4096 bytes show:
1 extent found
> 
>> I did forgot to add that file system was created a long time ago and it was 
>> created with leaf & node size = 16k.
> 
> If this long time ago is >2 years then you have likely specifically
> set node size = 16k, otherwise with older tools it would have been 4K.
You are right I used -l 16K -n 16K
> Have you created it as raid10 or has it undergone profile conversions?
Due to lack of spare disks
(it may sound odd to some, but spending on more than 6 disks for home use
seems like overkill)
and due to the loss I've had, I had to migrate all data to a new file system.
It played out this way:
1. from the original FS I removed 2 disks
2. created a RAID1 on those 2 disks
3. shifted 2TB
4. removed 2 disks from the source FS and added those to the destination FS
5. shifted a further 2TB
6. destroyed the original FS and added its 2 disks to the destination FS
7. converted the destination FS to RAID10

FYI, when I convert to raid 10 I use:
btrfs balance start -mconvert=raid10 -dconvert=raid10 -sconvert=raid10 -f 
/path/to/FS
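
Once such a balance finishes, this is roughly how I double-check that
everything really ended up as raid10 (a sketch; the path is a placeholder):

# progress while the conversion is still running
btrfs balance status /path/to/FS

# per-profile allocation afterwards - Data, Metadata and System should all
# report RAID10 once the convert has gone through
btrfs fi df /path/to/FS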

this filesystem has 5 subvolumes. The affected files live in a separate
folder, within a “victim folder”, which is within one subvolume.
> 
> It could also be that the ondisk format is somewhat corrupted (btrfs
> check should find that ) and that that causes the issue.

root@noname_server:/mnt# btrfs check /dev/sdg1
Checking filesystem on /dev/sdg1
UUID: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
found 4424060642634 bytes used err is 0
total csum bytes: 4315954936
total tree bytes: 4522786816
total fs tree bytes: 61702144
total extent tree bytes: 41402368
btree space waste bytes: 72430813
file data blocks allocated: 4475917217792
 referenced 4420407603200

No luck there :/

> In-lining on raid10 has caused me some trouble (I had 4k nodes) over
> time, it has happened over a year ago with kernels recent at that
> time, but the fs was converted from raid5
Could you please elaborate on that? Did you also end up with files that got
truncated to 4096 bytes?

> You might want to run the python scripts from here:
> https://github.com/knorrie/python-btrfs
Will do. 

> so that maybe you see how block-groups/chunks are filled etc.
> 
>> (ps. this email client on OS X is driving me up the wall … have to correct 
>> the corrections all the time :/)
>> 
>>> On 4 Jul 2016, at 22:13, Henk Slager <eye...@gmail.com> wrote:
>>> 
>>> On Sun, Jul 3, 2016 at 1:36 AM, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:
>>>> Hi,
>>>> 
>>>> My setup is that I use one file system for / and /home (on SSD) and a
>>>> larger raid 10 for /mnt/share (6 x 2TB).
>>>> 
>>>> Today I've discovered that 14 of files that are supposed to be over
>>>> 2GB are in fact just 4096 bytes. I've checked the content of those 4KB
>>>> and it seems that it does contain information that were at the
>>>> beginnings of the files.
>>>> 
>>>> I've experienced this problem in the past (3 - 4 years ago ?) but
>>>> attributed it to different problem that I've spoke with you guys here
>>>> about (corruption due to non ECC ram). At that time I did deleted
>>>> files affected (56) and si

Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-04 Thread Tomasz Kusmierz
I did consider that, but:
- some files were NOT accessed by anything, with 100% certainty (well, unless
there is a rootkit on my system or something of that shape, then maybe yes)
- the only application that could access those files is totem (well, Nautilus
checks the extension -> directs it to totem), so in that case we would hear
about an outbreak of totem killing people's files.
- if it was a kernel bug then other large files would be affected.

Maybe I'm wrong and it's actually related to the fact that all those files are
located in a single location on the file system (a single folder) that might
have a historical bug in some structure somewhere?

I forgot to add that the file system was created a long time ago and it was
created with leaf & node size = 16k.

(ps. this email client on OS X is driving me up the wall … have to correct the 
corrections all the time :/)

> On 4 Jul 2016, at 22:13, Henk Slager <eye...@gmail.com> wrote:
> 
> On Sun, Jul 3, 2016 at 1:36 AM, Tomasz Kusmierz <tom.kusmi...@gmail.com> 
> wrote:
>> Hi,
>> 
>> My setup is that I use one file system for / and /home (on SSD) and a
>> larger raid 10 for /mnt/share (6 x 2TB).
>> 
>> Today I've discovered that 14 of files that are supposed to be over
>> 2GB are in fact just 4096 bytes. I've checked the content of those 4KB
>> and it seems that it does contain information that were at the
>> beginnings of the files.
>> 
>> I've experienced this problem in the past (3 - 4 years ago ?) but
>> attributed it to different problem that I've spoke with you guys here
>> about (corruption due to non ECC ram). At that time I did deleted
>> files affected (56) and similar problem was discovered a year but not
>> more than 2 years ago and I believe I've deleted the files.
>> 
>> I periodically (once a month) run a scrub on my system to eliminate
>> any errors sneaking in. I believe I did a balance a half a year ago ?
>> to reclaim space after I deleted a large database.
>> 
>> root@noname_server:/mnt/share# btrfs fi show
>> Label: none  uuid: 060c2345-5d2f-4965-b0a2-47ed2d1a5ba2
>>Total devices 1 FS bytes used 177.19GiB
>>devid3 size 899.22GiB used 360.06GiB path /dev/sde2
>> 
>> Label: none  uuid: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
>>Total devices 6 FS bytes used 4.02TiB
>>devid1 size 1.82TiB used 1.34TiB path /dev/sdg1
>>devid2 size 1.82TiB used 1.34TiB path /dev/sdh1
>>devid3 size 1.82TiB used 1.34TiB path /dev/sdi1
>>devid4 size 1.82TiB used 1.34TiB path /dev/sdb1
>>devid5 size 1.82TiB used 1.34TiB path /dev/sda1
>>devid6 size 1.82TiB used 1.34TiB path /dev/sdf1
>> 
>> root@noname_server:/mnt/share# uname -a
>> Linux noname_server 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24
>> 10:09:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>> root@noname_server:/mnt/share# btrfs --version
>> btrfs-progs v4.4
>> root@noname_server:/mnt/share#
>> 
>> 
>> Problem is that stuff on this filesystem moves so slowly that it's
>> hard to remember historical events ... it's like AWS glacier. What I
>> can state with 100% certainty is that:
>> - files that are affected are 2GB and over (safe to assume 4GB and over)
>> - files affected were just read (and some not even read) never written
>> after putting into storage
>> - In the past I've assumed that files affected are due to size, but I
>> have quite few ISO files some backups of virtual machines ... no
>> problems there - seems like problem originates in one folder & size >
>> 2GB & extension .mkv
> 
> In case some application is the root cause of the issue, I would say
> try to keep some ro snapshots done by a tool like snapper for example,
> but maybe you do that already. It sounds also like this is some kernel
> bug, snapshots won't help that much then I think.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-02 Thread Tomasz Kusmierz
Hi,

My setup is that I use one file system for / and /home (on SSD) and a
larger raid 10 for /mnt/share (6 x 2TB).

Today I've discovered that 14 of my files that are supposed to be over
2GB are in fact just 4096 bytes. I've checked the content of those 4KB
and it seems they do contain the information that was at the
beginning of each file.

I've experienced this problem in the past (3 - 4 years ago?) but
attributed it to a different problem that I've spoken with you guys here
about (corruption due to non-ECC ram). At that time I deleted the
affected files (56); a similar problem appeared again at least a year but
not more than 2 years ago, and I believe I deleted those files too.

I periodically (once a month) run a scrub on my system to catch
any errors sneaking in. I believe I did a balance half a year ago?
to reclaim space after I deleted a large database.

root@noname_server:/mnt/share# btrfs fi show
Label: none  uuid: 060c2345-5d2f-4965-b0a2-47ed2d1a5ba2
    Total devices 1 FS bytes used 177.19GiB
    devid  3 size 899.22GiB used 360.06GiB path /dev/sde2

Label: none  uuid: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
    Total devices 6 FS bytes used 4.02TiB
    devid  1 size 1.82TiB used 1.34TiB path /dev/sdg1
    devid  2 size 1.82TiB used 1.34TiB path /dev/sdh1
    devid  3 size 1.82TiB used 1.34TiB path /dev/sdi1
    devid  4 size 1.82TiB used 1.34TiB path /dev/sdb1
    devid  5 size 1.82TiB used 1.34TiB path /dev/sda1
    devid  6 size 1.82TiB used 1.34TiB path /dev/sdf1

root@noname_server:/mnt/share# uname -a
Linux noname_server 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24
10:09:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
root@noname_server:/mnt/share# btrfs --version
btrfs-progs v4.4
root@noname_server:/mnt/share#


The problem is that stuff on this filesystem moves so slowly that it's
hard to remember historical events ... it's like AWS Glacier. What I
can state with 100% certainty is that:
- the affected files are 2GB and over (safe to assume 4GB and over)
- the affected files were only read (and some not even read), never written
after being put into storage
- in the past I assumed files were affected purely due to size, but I
have quite a few ISO files and some backups of virtual machines ... no
problems there - the problem seems to originate in one folder & size >
2GB & extension .mkv
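
For anyone wanting to check their own pool for the same symptom, this is
roughly what I use to list candidate files (a sketch; the mount point is mine
from above, the file path is a placeholder):

# files that are exactly one 4096-byte block - candidates for this truncation
find /mnt/share -type f -size 4096c

# double-check the reported size of a suspect file
stat -c '%n %s' /mnt/share/victim_folder/some_file.mkv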
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Btrfs transaction checksum corruption, losing root of the tree, bizarre UUID change.

2014-07-10 Thread Tomasz Kusmierz
Hi all !

So it's been some time with btrfs, and so far I was very pleased, but
since I upgraded ubuntu from 13.10 to 14.04 problems started to
occur (YES, I know this might be unrelated).

So in the past I've had problems with btrfs which turned out to be
caused by static from a printer generating corruption in ram and
causing checksum failures on the file system - so I'm not going to
assume from the start that there is something wrong with btrfs.

Anyway:
On my server I'm running 6 x 2TB disks in raid10 for general storage
and 2 x ~0.5TB in raid1 for the system. Might be unrelated, but after
upgrading to 14.04 I started using ownCloud, which uses Apache &
MySQL as its backing store - all data stored on the storage array, mysql
was on the system array.

It all started with csum errors showing up in mysql data files and in
some transactions !!! Generally the system immediately switched the
btrfs to read-only mode, forced by the kernel (don't have
dmesg / syslog now). I removed the offending files, the problem seemed to go
away and I started from scratch. After 5 days the problem reappeared, now
located around the same mysql files and in files managed by apache as the
cloud store. At this point, since these files are rather dear to me, I
decided to pull out all the stops and try to rescue as much as I can.

As an exercise in btrfs management I ran btrfsck --repair - did not
help. Repeated with --init-csum-tree - turned out that this left me
with a blank system array. Nice! Could use some warning here.

I've moved all the drives to my main rig which has a nice
16GB of ECC ram, so errors from ram, cpu or controller should be
theoretically eliminated. I used the system array drives and a spare
drive to extract all the files dear to me to a newly created array (1TB +
500GB + 640GB). Ran a scrub on it and everything seemed OK. At this
point I deleted the dear-to-me files from the storage array and ran a
scrub. Scrub now showed even more csum errors in transactions and in one
large file that had not been touched FOR A VERY LONG TIME (size ~1GB).
Deleted the file. Ran scrub - no errors. Copied the dear-to-me files back
to the storage array. Ran scrub - no issues. Deleted the files from my
backup array and decided to call it a day. Next day I decided to run a
scrub once more just to be sure; this time it discovered a myriad of errors
in files and transactions. Since I had no time to continue I decided
to postpone until the next day - next day I started my rig and noticed
that both the backup array and the storage array no longer mount. I
attempted to rescue the situation without any luck. Power cycled the PC
and on the next startup both arrays failed to mount; when I tried to mount
the backup array, mount told me that this specific uuid DOES NOT EXIST
!?!?!

my fstab uuid:
fcf23e83-f165-4af0-8d1c-cd6f8d2788f4
new uuid:
771a4ed0-5859-4e10-b916-07aec4b1a60b


I tried to mount by /dev/sdb1 and it did mount. Tried by the new uuid and it
mounted as well. Scrub passes with flying colours on the backup array
while the storage array still fails to mount with:

root@ubuntu-pc:~# mount /dev/sdd1 /arrays/@storage/
mount: wrong fs type, bad option, bad superblock on /dev/sdd1,
   missing codepage or helper program, or other error
   In some cases useful info is found in syslog - try
   dmesg | tail  or so

for any device in the array.

Honestly, this is a question for the more senior guys - what should I do now?

Chris Mason - have you got any updates to your old friend stress.sh?
If not I can try using the previous version that you provided to stress
test my system - but I think this is the second system that exposes this
erratic behaviour.

Anyone - what can I do to rescue my beloved files (no sarcasm about
zfs / ext4 / tapes / DVDs, please)?
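
For the record, the kind of read-only salvage attempt I have in mind before
doing anything destructive (a sketch; the device and target path are
placeholders):

# dry run first: -D only lists what would be recovered, nothing is written
btrfs restore -D /dev/sdd1 /arrays/rescue_target

# then the real thing: -i ignores errors and keeps going, -v shows each file
btrfs restore -i -v /dev/sdd1 /arrays/rescue_target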

ps. needless to say: SMART - no sata CRC errors, no reallocated sectors,
no errors whatsoever (as much as I can see).
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


open_ctree failure on upgrading 3.7 to 3.8 kernel

2013-05-03 Thread Tomasz Kusmierz

Hi,

Long story short:
I've got a btrfs raid10 six-disk array, plus 2 other disks just carrying
normal single-device btrfs filesystems.

Everything was running happily under linux 3.5 and 3.7.
3.5 was a stock ubuntu kernel, 3.7 was a slightly less stock ubuntu kernel.
Now I've upgraded my box to 3.8 and none of the btrfs file systems mounts
any more. I get open_ctree errors every time I try to mount them. When
I reboot the system choosing the old kernel from grub, everything runs
smoothly again. Was there any on-disk format change or compatibility change?





Some kernel.log output:

[ 13.517952] device fsid 9415cddb-e3b8-4977-804c-369553a7eda7 devid 4 
transid 30 /dev/sdh1

[ 13.518535] btrfs: disk space caching is enabled
[ 13.518773] btrfs: failed to read the system array on sdh1
[ 13.523175] btrfs: open_ctree failed
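
One thing I plan to rule out first, in case the 3.8 boot simply isn't
registering all members of the multi-device array before mounting (a sketch;
the mount point is a placeholder):

# make the kernel (re)discover all btrfs member devices, then retry the mount
btrfs device scan
mount /dev/sdh1 /mnt/storage

# what the tools think belongs to that filesystem
btrfs fi show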
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Changing node leaf size on live partition.

2013-02-23 Thread Tomasz Kusmierz

Hi,

Question is pretty simple:

How to change node size and leaf size on previously created partition?

Now, I know what most people will say: you should've been smarter while
typing mkfs.btrfs. Well, I'm intending to convert an ext4 partition in
place, but there seems to be no option for leaf and node size in that
tool. If it's not possible I guess I'll have to create / from scratch
and copy all my content there (roughly as sketched below).
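
In case it helps anyone searching later: as far as I can tell the node/leaf
size is baked in at mkfs time and cannot be changed on an existing filesystem,
so the recreate-and-copy route looks like this (a sketch; the device and paths
are placeholders):

# new filesystem with 16K nodes (older btrfs-progs also took -l 16K for the
# leaf size, as used elsewhere in this thread; newer ones derive it from -n)
mkfs.btrfs -n 16K /dev/sdX1

# mount it and copy the old contents across, preserving attributes
mount /dev/sdX1 /mnt/new_root
rsync -aHAX /mnt/old_root/ /mnt/new_root/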




BTW. To Chris and the others who were involved - after fixing that static
electricity from the printer issue I've been running a rock solid raid10
ever since! Great job guys, really appreciate it.



Cheers, Tom.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: btrfs for files >10GB = random spontaneous CRC failure.

2013-02-05 Thread Tomasz Kusmierz

On 16/01/13 09:21, Bernd Schubert wrote:

On 01/16/2013 12:32 AM, Tom Kusmierz wrote:


p.s. bizarre that when I fill an ext4 partition with test data everything
checks out OK (crc over all files), but with Chris' tool it gets
corrupted - for both the crappy Adaptec pcie controller and the motherboard
built-in one. Also, since the course of history has proven that my testing
facilities are crap - any suggestions on how I can test ram, cpu &
controller would be appreciated.


Similar issues had been the reason we wrote ql-fstest at q-leap. Maybe 
you could try that? You can easily see the pattern of the corruption 
with that. But maybe Chris' stress.sh also provides it.
Anyway, I yesterday added support to specify min and max file size, as 
it before only used 1MiB to 1GiB sizes... It's a bit cryptic with 
bits, though, I will improve that later.

https://bitbucket.org/aakef/ql-fstest/downloads


Cheers,
Bernd


PS: But see my other thread, using ql-fstest I yesterday entirely 
broke a btrfs test file system resulting in kernel panics.


Hi,

It's been a while, but I think I should provide a definitive answer, or
simply state what the cause of the whole problem was:


It was a printer!

Long story short, I was going nuts trying to diagnose which bit of my
server was going bad, and eventually I was down to blaming the interface
card that connects the hot-swappable disks to the mobo / pcie controllers.
When I got back from my holiday I sat in front of the server and decided to
go with ql-fstest, which in a very nice way reports errors with very
low lag (~2 minutes) after they occur. At that point my printer
kicked in with a self-clean and an error showed up after ~ two minutes
- so I restarted the printer, and while it was going through its own POST
with self-clean another error showed up. The issue turned out to be
that I was using one of those fantastic pci 4-port ethernet cards and the
printer was connected directly to it - after moving it and everything else
to a switch, all the problems and issues went away. At the moment I've been
running the server for 2 weeks without any corruption or any random kernel
btrfs crashes etc.



Anyway, I wanted to thank Chris and the rest of the btrfs dev people again
for this fantastic filesystem that let me discover how stupid a setup I was
running and how deep into shiet I had put myself.


CHEERS LADS !



Tom.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs for files >10GB = random spontaneous CRC failure.

2013-02-05 Thread Tomasz Kusmierz

On 05/02/13 12:49, Chris Mason wrote:

On Tue, Feb 05, 2013 at 03:16:34AM -0700, Tomasz Kusmierz wrote:

On 16/01/13 09:21, Bernd Schubert wrote:

On 01/16/2013 12:32 AM, Tom Kusmierz wrote:


p.s. bizzare that when I fill ext4 partition with test data everything
check's up OK (crc over all files), but with Chris tool it gets
corrupted - for both Adaptec crappy pcie controller and for mother board
built in one. Also since courses of history proven that my testing
facilities are crap - any suggestion's on how can I test ram, cpu 
controller would be appreciated.

Similar issues had been the reason we wrote ql-fstest at q-leap. Maybe
you could try that? You can easily see the pattern of the corruption
with that. But maybe Chris' stress.sh also provides it.
Anyway, I yesterday added support to specify min and max file size, as
it before only used 1MiB to 1GiB sizes... It's a bit cryptic with
bits, though, I will improve that later.
https://bitbucket.org/aakef/ql-fstest/downloads


Cheers,
Bernd


PS: But see my other thread, using ql-fstest I yesterday entirely
broke a btrfs test file system resulting in kernel panics.

Hi,

Its been a while, but I think I should provide a definite anwser or
simply what was the cause of whole problem:

It was a printer!

Long story short, I was going nuts trying to diagnose which bit of my
server is going bad and effectively I was down to blaming a interface
card that connects hotswapable disks to mobo / pcie controllers. When
I've got back from my holiday I've sat in front of server and decided to
go with ql-fstest which in a very nice way reports errors with a very
low lag (~2 minutes) after they occurred. At this point my printer
kicked in with self clean and error just showed up after ~ two minutes
- so I've restarted printer and while it was going through it's own post
with self clean another error showed up. Issue here turned out to be
that I was using one of those fantastic pci 4 port ethernet cards and
printer was directly to it - after moving it and everything else to
switch all problem and issues have went away. AT the moment I'm running
server for 2 weeks without any corruptions, any random kernel btrfs
crashes etc.

Wow, I've never heard that one before.  You might want to try a
different 4 port card and/or report it to the driver maintainer.  That
shouldn't happen ;)

ql-fstest looks neat, I'll check it out (thanks Bernd).
  
-chris


I forgot to mention that the server sits on a UPS, and the printer is directly
connected to the mains - thinking of it, that creates a ground shift
effect, since nothing on a cheap PSU has a real ground. But anyway this is
not the fault of the 4-port card; I've tried moving it to a cheap ne2000
and to the motherboard-integrated one and the effect was the same. Also
diagnostics were veeery problematic, because besides having corruption
on the hdd, memtest was returning corruption in ram, but on very rare
occasions, and a cpu test was returning corruption on a once-a-day basis.
I've replaced nearly everything on this server - including the psu (with the
1400W one from my dev rig) - with NO difference. I should mention as well
that this printer is a colour laser printer which has 4 drums to clean, so I
would assume that it produces enough static electricity to power a small
kettle.


ps. it shouldn't be a driver issue, since the errors in ram were 1 - 4 bits
big and located in the same 32-bit word - hence I think a single transfer had
to be corrupt rather than a whole eth packet being shoved into random memory.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs for files >10GB = random spontaneous CRC failure.

2013-02-05 Thread Tomasz Kusmierz

On 05/02/13 13:46, Roman Mamedov wrote:

On Tue, 05 Feb 2013 10:16:34 +
Tomasz Kusmierz tom.kusmi...@gmail.com wrote:


that I was using one of those fantastic pci 4 port ethernet cards and
printer was directly to it - after moving it and everything else to
switch all problem and issues have went away. AT the moment I'm running
server for 2 weeks without any corruptions, any random kernel btrfs
crashes etc.

If moving the printer over to a switch helped, perhaps it is indeed an
electrical interference problem, but if your card is an old one from Sun, keep
in mind that they also have some problems with DMA on machines with large
amounts of RAM:

   sunhme experiences corrupt packets if machine has more than 2GB of memory
   https://bugzilla.kernel.org/show_bug.cgi?id=10790

Not hard to envision a horror story scenario where a rogue network card would
shred your filesystem buffer cache with network packets DMAed all over it,
like bullets from a machine gun :) But in reality afaik IOMMU is supposed to
protect against this.

As I said in my reply to Chris, it was definitely an electrical issue. Back
in the days when cat5 ethernet was a novelty I learnt a simple
lesson the hard way - don't skimp, always separate with a switch. I learnt it
on networks where the parties were not necessarily powered from the same
circuit or even the same supply phase. Since this setup is limited to my home
I violated my own old rule - and it backfired on me.


Anyway thanks for info on sunhme - WOW 
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


btrfs for files >10GB = random spontaneous CRC failure.

2013-01-14 Thread Tomasz Kusmierz

Hi,

Since I had some free time over Christmas, I decided to conduct a few
tests on btrfs to see how it will cope as real-life storage for
normal users, and I've found that the filesystem will always mess up
your files that are larger than 10GB.


Long story:
I've used a set of data that I've got nicely backed up on a personal
raid5 to populate the btrfs volumes: music, slr pics and video (and just a
few documents). The disks used in the test are all green 2TB disks from WD.


1. First I started with creating btrfs (4k blocks) on one disk, filling
it up and then adding a second disk - convert to raid1 through balance -
convert to raid10 through balance. Unfortunately converting to raid1
failed - because of CRC errors in 49 files that were bigger than 10GB. At
this point I was a bit spooked that my controllers were failing or
that the drives had some bad sectors. Tested everything (took a few days)
and it turns out that there is no apparent issue with the hardware (bad
sectors or io down to the disks).
2. At this point I thought "cool, this will be a perfect test case for
scrub to show its magical power!". Created raid1 over two volumes -
try scrubbing - FAIL ... It turns out that magically I've got corrupted
CRCs in two exactly-the-same logical locations on two different disks (~34
files >10GB affected), hence scrub can't do anything with it. It only
reports them as uncorrectable errors (test 2 is sketched as commands right
after this list).
3. Performed the same test on a raid10 setup (still 4k blocks). Same results
(just a different file count).
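
For clarity, test 2 boils down to this sequence (a sketch; devices, paths and
the data set are placeholders):

# two-disk raid1 for both data and metadata
mkfs.btrfs -d raid1 -m raid1 /dev/sdX1 /dev/sdY1
mount /dev/sdX1 /mnt/test

# fill with the test data, then force everything to stable storage
cp -a /mnt/source/. /mnt/test/
sync

# -B runs the scrub in the foreground and prints the error summary at the end
btrfs scrub start -B /mnt/test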


Ok, time to dig more into this because it's starting to get intriguing. I'm
running ubuntu server 12.10 (64bit) with a stock kernel, so my next step
was to get a 3.7.1 kernel + new btrfs tools straight from the git repo.
Unfortunately 1 & 2 & 3 still produce the same results, corrupt CRCs only
in files >10GB.
At this point I thought "fine, maybe if I expand the allocation block
it will take fewer blocks for a big file to fit, resulting in it being
stored properly" - time for 16K leaves :) (-n 16K -l 16K);
sectors are still 4K for known reasons :P. Well, it does exactly the
same thing - 1 & 2 & 3 same results, big files get automagically corrupt.



Something about test data:
music - not more than 200MB per file (typical mix of mp3 & aac), 10K files
give or take.
pics - not more than 20MB (typical point & shoot + dslr), 6K files give or
take.
video1 - collection of little ones, more than 300MB but less than
1.5GB each, ~400 files

video2 - collection of 5GB - 18GB files, ~400 files

I guess that stating that only files >10GB are affected is a long
shot, but so far I've not seen a file of less than 10GB affected (I was not
really thorough about checking sizes, but all the files I did
check were more than 10GB).


ps. As a footnote I'll add that I've tried shuffling tests 1, 2 & 3
without video2 and it all worked just fine.


If you've got any ideas for a workaround (other than zfs :D) I'm happy
to try them out.


Tom.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


btrfs for files >10GB = random spontaneous CRC failure.

2013-01-14 Thread Tomasz Kusmierz

Hi,

Since I had some free time over Christmas, I decided to conduct a few
tests on btrfs to see how it will cope as real-life storage for
normal users, and I've found that the filesystem will always mess up
your files that are larger than 10GB.


Long story:
I've used a set of data that I've got nicely backed up on a personal
raid5 to populate the btrfs volumes: music, slr pics and video (and just a
few documents). The disks used in the test are all green 2TB disks from WD.


1. First I started with creating btrfs (4k blocks) on one disk, filling
it up and then adding a second disk - convert to raid1 through balance -
convert to raid10 through balance. Unfortunately converting to raid1
failed - because of CRC errors in 49 files that were bigger than 10GB. At
this point I was a bit spooked that my controllers were failing or
that the drives had some bad sectors. Tested everything (took a few days)
and it turns out that there is no apparent issue with the hardware (bad
sectors or io down to the disks).
2. At this point I thought "cool, this will be a perfect test case for
scrub to show its magical power!". Created raid1 over two volumes -
try scrubbing - FAIL ... It turns out that magically I've got corrupted
CRCs in two exactly-the-same logical locations (~34 files >10GB affected).
3. Performed the same test on a raid10 setup (still 4k blocks). Same results
(just a different file count).


Ok, time to dig more into this because it's starting to get intriguing. I'm
running ubuntu server 12.10 with a stock kernel, so my next step was to
get a 3.7.1 kernel + new btrfs tools straight from the git repo.
Unfortunately 1 & 2 & 3 still produce the same results, corrupt CRCs only
in files >10GB.
At this point I thought "fine, maybe if I expand the allocation block
it will take fewer blocks for a big file to fit, resulting in it being
stored properly" - time for 16K leaves :) (-n 16K -l 16K);
sectors are still 4K for known reasons :P. Well, it does exactly the
same thing - 1 & 2 & 3 same results, big files get automagically corrupt.



Something about test data:
music - not more than 200MB per file (typical mix of mp3 & aac), 10K files
give or take.
pics - not more than 20MB (typical point & shoot + dslr), 6K files give or
take.
video1 - collection of little ones, more than 300MB but less than
1.5GB each, ~400 files

video2 - collection of 5GB - 18GB files, ~400 files

I guess that stating that only files >10GB are affected is a long
shot, but so far I've not seen a file of less than 10GB affected (I was not
really thorough about checking sizes, but all the files I did
check were more than 10GB).


ps. As a footnote I'll add that I've tried shuffling tests 1, 2 & 3
without video2 and it all worked just fine.


If you've got any ideas for a workaround (other than zfs :D) I'm happy
to try them out.


Tom.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs for files >10GB = random spontaneous CRC failure.

2013-01-14 Thread Tomasz Kusmierz

On 14/01/13 11:25, Roman Mamedov wrote:

Hello,

On Mon, 14 Jan 2013 11:17:17 +
Tomasz Kusmierz tom.kusmi...@gmail.com wrote:


this point I was a bit spooked up that my controllers are failing or

Which controller manufacturer/model?

Well, this is a home server (which I prefer to tinker on). Two controllers
were used: the motherboard built-in one, and a crappy Adaptec pcie one.


00:11.0 SATA controller: Advanced Micro Devices [AMD] nee ATI 
SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]

02:00.0 RAID bus controller: Adaptec Serial ATA II RAID 1430SA (rev 02)


ps. MoBo is: ASUS M4A79T Deluxe
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs for files >10GB = random spontaneous CRC failure.

2013-01-14 Thread Tomasz Kusmierz

On 14/01/13 14:59, Chris Mason wrote:

On Mon, Jan 14, 2013 at 04:09:47AM -0700, Tomasz Kusmierz wrote:

Hi,

Since I had some free time over Christmas, I decided to conduct few
tests over btrFS to se how it will cope with real life storage for
normal gray users and I've found that filesystem will always mess up
your files that are larger than 10GB.

Hi Tom,

I'd like to nail down the test case a little better.

1) Create on one drive, fill with data
2) Add a second drive, convert to raid1
3) find corruptions?

What happens if you start with two drives in raid1?  In other words, I'm
trying to see if this is a problem with the conversion code.

-chris
Ok, my description might be a bit enigmatic, so to cut a long story short
the tests are:
1) create a single-drive default btrfs volume on a single partition -
fill with test data - scrub - admire errors.
2) create a raid1 (-d raid1 -m raid1) volume with two partitions on
separate disks, each the same size etc. - fill with test data - scrub -
admire errors.
3) create a raid10 (-d raid10 -m raid1) volume with four partitions on
separate disks, each the same size etc. - fill with test data - scrub -
admire errors.


all disks are the same age + size + model ... two different batches to avoid
same-time failure.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs for files >10GB = random spontaneous CRC failure.

2013-01-14 Thread Tomasz Kusmierz

On 14/01/13 15:57, Chris Mason wrote:

On Mon, Jan 14, 2013 at 08:22:36AM -0700, Tomasz Kusmierz wrote:

On 14/01/13 14:59, Chris Mason wrote:

On Mon, Jan 14, 2013 at 04:09:47AM -0700, Tomasz Kusmierz wrote:

Hi,

Since I had some free time over Christmas, I decided to conduct few
tests over btrFS to se how it will cope with real life storage for
normal gray users and I've found that filesystem will always mess up
your files that are larger than 10GB.

Hi Tom,

I'd like to nail down the test case a little better.

1) Create on one drive, fill with data
2) Add a second drive, convert to raid1
3) find corruptions?

What happens if you start with two drives in raid1?  In other words, I'm
trying to see if this is a problem with the conversion code.

-chris

Ok, my description might be a bit enigmatic so to cut long story short
tests are:
1) create a single drive default btrfs volume on single partition -
fill with test data - scrub - admire errors.
2) create a raid1 (-d raid1 -m raid1) volume with two partitions on
separate disk, each same size etc. - fill with test data - scrub -
admire errors.
3) create a raid10 (-d raid10 -m raid1) volume with four partitions on
separate disk, each same size etc. - fill with test data - scrub -
admire errors.

all disks are same age + size + model ... two different batches to avoid
same time failure.

Ok, so we have two possible causes.  #1 btrfs is writing garbage to your
disks.  #2 something in your kernel is corrupting your data.

Since you're able to see this 100% of the time, lets assume that if #2
were true, we'd be able to trigger it on other filesystems.

So, I've attached an old friend, stress.sh.  Use it like this:

stress.sh -n 5 -c your source directory -s your btrfs mount point

It will run in a loop with 5 parallel processes and make 5 copies of
your data set into the destination.  It will run forever until there are
errors.  You can use a higher process count (-n) to force more
concurrency and use more ram.  It may help to pin down all but 2 or 3 GB
of your memory.

What I'd like you to do is find a data set and command line that make
the script find errors on btrfs.  Then, try the same thing on xfs or
ext4 and let it run at least twice as long.  Then report back ;)

-chris


Chris,

Will do, just please remember that 2TB of test data on consumer
grade sata drives will take a while to test :)




--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs for files >10GB = random spontaneous CRC failure.

2013-01-14 Thread Tomasz Kusmierz

On 14/01/13 16:20, Roman Mamedov wrote:

On Mon, 14 Jan 2013 15:22:36 +
Tomasz Kusmierz tom.kusmi...@gmail.com wrote:


1) create a single drive default btrfs volume on single partition -
fill with test data - scrub - admire errors.

Did you try ruling out btrfs as the cause of the problem? Maybe something else
in your system is corrupting data, and btrfs just lets you know about that.

I.e. on the same drive, create an Ext4 filesystem, copy some data to it which
has known checksums (use md5sum or cfv to generate them in advance for data
that is on another drive and is waiting to be copied); copy to that drive,
flush caches, verify checksums of files at the destination.


Hi Roman,

Chris just provided his good old friend stress.sh, which should do exactly
that. So I'll dive into more testing :)
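
For reference, the manual variant Roman describes looks roughly like this (a
sketch; paths are placeholders):

# build a checksum manifest from the pristine source
cd /mnt/source && find . -type f -exec md5sum {} + > /tmp/manifest.md5

# copy, drop the page cache so we re-read from disk, then verify at the target
cp -a /mnt/source/. /mnt/ext4test/
sync && echo 3 > /proc/sys/vm/drop_caches
cd /mnt/ext4test && md5sum -c /tmp/manifest.md5 | grep -v ': OK$'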


Tom.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html