Re: [ceph-users] OSD crash after change of osd_memory_target

2020-01-22 Thread Igor Fedotov

Hi Martin,

looks like a bug to me.

You might want to remove all custom settings from the config database and 
try setting osd_memory_target only.
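Something along these lines, for example (option names taken from your 
set_mon_vals log; adjust to whatever "ceph config dump" shows for your cluster):

# ceph config dump
# ceph config rm osd osd_max_backfills
# ceph config rm osd osd_recovery_max_active
# ceph config set osd osd_memory_target 2147483648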


Would it help?


Thanks,

Igor

On 1/22/2020 3:43 PM, Martin Mlynář wrote:



Dne 21. 01. 20 v 21:12 Stefan Kooman napsal(a):

Quoting Martin Mlynář (nexus+c...@smoula.net):


Do you think this could help? The OSD does not even start; I'm getting a little
lost as to how flushing caches could help.

I might have misunderstood. I thought the OSDs crashed when you set the
config setting.


According to the trace, I suspect something around the processing of config values.

I've just set the same config setting on a test cluster and restarted an
OSD without problem. So, not sure what is going on there.

Gr. Stefan


I've compiled ceph-osd with debug symbols and got a better backtrace:

   -24> 2020-01-22 13:12:53.614 7f83ed064700  4 set_mon_vals no 
callback set
   -23> 2020-01-22 13:12:53.614 7f83ee867700 10 monclient: discarding 
stray monitor message auth_reply(proto 2 0 (0) Success) v1
   -22> 2020-01-22 13:12:53.614 7f83ed064700 10 set_mon_vals 
osd_crush_update_on_start = true
   -21> 2020-01-22 13:12:53.614 7f83ed064700 10 set_mon_vals 
osd_max_backfills = 64
   -20> 2020-01-22 13:12:53.614 7f83ed064700 10 set_mon_vals 
osd_memory_target = 2147483648
   -19> 2020-01-22 13:12:53.614 7f83ed064700 10 set_mon_vals 
osd_recovery_max_active = 40
   -18> 2020-01-22 13:12:53.614 7f83ed064700 10 set_mon_vals 
osd_recovery_max_single_start = 1000
   -17> 2020-01-22 13:12:53.614 7f83ed064700 10 set_mon_vals 
osd_recovery_sleep_hdd = 0.00
   -16> 2020-01-22 13:12:53.614 7f83ed064700 10 set_mon_vals 
osd_recovery_sleep_hybrid = 0.00
   -15> 2020-01-22 13:12:53.627 7f83f0276c40  0 set uid:gid to 
64045:64045 (ceph:ceph)
   -14> 2020-01-22 13:12:53.627 7f83f0276c40  0 ceph version 14.2.6 
(f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable), process 
ceph-osd, pid 622
   -13> 2020-01-22 13:12:53.649 7f83f0276c40  0 pidfile_write: ignore 
empty --pid-file
   -12> 2020-01-22 13:12:53.657 7f83f0276c40  5 asok(0x5580518fa000) 
init /var/run/ceph/ceph-osd.6.asok
   -11> 2020-01-22 13:12:53.657 7f83f0276c40  5 asok(0x5580518fa000) 
bind_and_listen /var/run/ceph/ceph-osd.6.asok
   -10> 2020-01-22 13:12:53.657 7f83f0276c40  5 asok(0x5580518fa000) 
register_command 0 hook 0x558051872fc0
    -9> 2020-01-22 13:12:53.657 7f83f0276c40  5 asok(0x5580518fa000) 
register_command version hook 0x558051872fc0
    -8> 2020-01-22 13:12:53.657 7f83f0276c40  5 asok(0x5580518fa000) 
register_command git_version hook 0x558051872fc0
    -7> 2020-01-22 13:12:53.657 7f83f0276c40  5 asok(0x5580518fa000) 
register_command help hook 0x558051874220
    -6> 2020-01-22 13:12:53.657 7f83f0276c40  5 asok(0x5580518fa000) 
register_command get_command_descriptions hook 0x558051874260
    -5> 2020-01-22 13:12:53.657 7f83ed865700  5 asok(0x5580518fa000) 
entry start
    -4> 2020-01-22 13:12:53.670 7f83f0276c40  5 object store type is 
bluestore
    -3> 2020-01-22 13:12:53.675 7f83f0276c40  1 bdev create path 
/var/lib/ceph/osd/ceph-6/block type kernel
    -2> 2020-01-22 13:12:53.675 7f83f0276c40  1 bdev(0x5580518f3f80 
/var/lib/ceph/osd/ceph-6/block) open path /var/lib/ceph/osd/ceph-6/block
    -1> 2020-01-22 13:12:53.675 7f83f0276c40  1 bdev(0x5580518f3f80 
/var/lib/ceph/osd/ceph-6/block) open size 3000588304384 
(0x2baa100, 2.7 TiB) block_size 4096 (4 KiB) rotational discard 
not supported
 0> 2020-01-22 13:12:53.714 7f83f0276c40 -1 *** Caught signal 
(Aborted) **

 in thread 7f83f0276c40 thread_name:ceph-osd

 ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) 
nautilus (stable)

 1: (()+0x2c19654) [0x558045ec6654]
 2: (()+0x12730) [0x7f83f0d1f730]
 3: (gsignal()+0x10b) [0x7f83f08027bb]
 4: (abort()+0x121) [0x7f83f07ed535]
 5: (()+0x8c983) [0x7f83f0bb5983]
 6: (()+0x928c6) [0x7f83f0bbb8c6]
 7: (()+0x92901) [0x7f83f0bbb901]
 8: (()+0x92b34) [0x7f83f034]
* 9: (void boost::throw_exception<boost::bad_get>(boost::bad_get const&)+0x7b) [0x5580454d5430]*
* 10: (Option::size_t&& boost::relaxed_get<...>(boost::variant<boost::blank, std::__cxx11::basic_string<...>, unsigned long, long, double, bool, entity_addr_t, entity_addrvec_t, std::chrono::duration<..., std::ratio<1l, 1l> >, Option::size_t, uuid_d>&&)+0x5b) [0x5580454d6223]*
 11: (Option::size_t&& boost::strict_get<...>(boost::variant<...>&&)+0x20) [0x5580454d4a39]
 12: (Option::size_t&& boost::get<...>(boost::variant<...
Re: [ceph-users] Luminous Bluestore OSDs crashing with ASSERT

2020-01-20 Thread Igor Fedotov

Hi Stefan,

these lines are result of transaction dump performed on a failure during 
transaction submission (which is shown as


"submit_transaction error: Corruption: block checksum mismatch code = 2"

Most probably they are of no interest (checksum errors are unlikely to 
be caused by transaction content), hence we need the earlier log entries 
to learn what caused that checksum mismatch.

It's hard to give a formal overview of what you should look for; from 
my troubleshooting experience, one may generally try to find:


- some previous error/warning indications (e.g. allocation, disk access, 
etc)


- prior OSD crashes (sometimes they might have different causes/stack 
traces/assertion messages)


- any timeout or retry indications

- any uncommon log patterns which aren't present during regular running 
but happen each time before the crash/failure.
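
As a rough first pass over the log (the path and OSD id are just examples), 
something like this can help surface such patterns:

# grep -niE 'error|fail|abort|timeout|retry|checksum' /var/log/ceph/ceph-osd.27.log | grep -vi 'checksum mismatch' | tail -n 100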



Anyway, I think the inspection should go much(?) deeper than it 
presumably has so far (judging from your log snippets).


Ceph keeps the last 10,000 log events at an increased log level and dumps 
them on crash, prefixed with a negative index running from -10000 up to -1.


-1> 2020-01-16 01:10:13.404090 7f3350a14700 -1 rocksdb:


It would be great if you could share several log snippets for different 
crashes containing these last 10,000 lines.
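
If it helps, these in-memory events can be pulled out of the crash log with 
something like (path is just an example):

# grep -E '^[[:space:]]*-[0-9]+> ' /var/log/ceph/ceph-osd.27.log > osd.27-crash-events.txt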



Thanks,

Igor


On 1/19/2020 9:42 PM, Stefan Priebe - Profihost AG wrote:

Hello Igor,

there's absolutely nothing in the logs before.

What do those lines mean:
Put( Prefix = O key =
0x7f8001cc45c881217262'd_data.4303206b8b4567.9632!='0xfffe6f0012'x'
Value size = 480)
Put( Prefix = O key =
0x7f8001cc45c881217262'd_data.4303206b8b4567.9632!='0xfffe'o'
Value size = 510)

on the right side I always see 0xfffe on all
failed OSDs.

greets,
Stefan
Am 19.01.20 um 14:07 schrieb Stefan Priebe - Profihost AG:

Yes, except that this happens on 8 different clusters with different hw but 
same ceph version and same kernel version.

Greets,
Stefan


Am 19.01.2020 um 11:53 schrieb Igor Fedotov :

So the intermediate summary is:

Any OSD in the cluster can experience an interim RocksDB checksum failure, which 
isn't present after an OSD restart.

No HW issues observed, no persistent artifacts (except OSD log) afterwards.

And it looks like the issue is rather specific to this cluster, as no similar 
reports from other users seem to be present.


Sorry, I'm out of ideas other than to collect all the failure logs and try to find 
something common in them. Maybe this will shed some light.

BTW, from my experience it might make sense to inspect the OSD log prior to the failure 
(any error messages and/or prior restarts, etc.); sometimes this provides 
some hints.


Thanks,

Igor



On 1/17/2020 2:30 PM, Stefan Priebe - Profihost AG wrote:
HI Igor,


Am 17.01.20 um 12:10 schrieb Igor Fedotov:
hmmm..

Just in case - suggest to check H/W errors with dmesg.

this happens on around 80 nodes - I don't expect all of those to have
unidentified hw errors. Also all of them are monitored - no dmesg output
contains any errors.


Also there are some (not very much though) chances this is another
incarnation of the following bug:
https://tracker.ceph.com/issues/22464
https://github.com/ceph/ceph/pull/24649

The corresponding PR works around it for main device reads (user data
only!) but theoretically it might still happen

either for DB device or DB data at main device.

Can you observe any bluefs spillovers? Are there any correlation between
failing OSDs and spillover presence if any, e.g. failing OSDs always
have a spillover. While OSDs without spillovers never face the issue...

To validate this hypothesis one can try to monitor/check (e.g. once a
day for a week or something) "bluestore_reads_with_retries" counter over
OSDs to learn if the issue is happening

in the system.  Non-zero values mean it's there for user data/main
device and hence is likely to happen for DB ones as well (which doesn't
have any workaround yet).

OK i checked bluestore_reads_with_retries on 360 osds but all of them say 0.



Additionally you might want to monitor memory usage as the above
mentioned PR denotes high memory pressure as potential trigger for these
read errors. So if such pressure happens the hypothesis becomes more valid.

we already do this heavily and have around 10GB of memory per OSD. Also
none of those machines show any io pressure at all.

All hosts show a constant rate of around 38GB to 45GB mem available in
/proc/meminfo.

Stefan


Thanks,

Igor

PS. Everything above is rather a speculation for now. Available
information is definitely not enough for extensive troubleshooting the
cases which happens that rarely.

You might want to start collecting failure-related information
(including but not limited to failure logs, perf counter dumps, system
resource reports etc) for future analysis.



On 1/16/2020 

Re: [ceph-users] OSD up takes 15 minutes after machine restarts

2020-01-20 Thread Igor Fedotov
No, bluestore_fsck_on_mount_deep is applied only when bluestore_fsck_on_mount 
is set to true.


Hence there is no fsck on mount in your case.
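
If in doubt, the effective values can be checked on a running OSD via the 
admin socket, e.g. (OSD id is just an example):

# ceph daemon osd.0 config get bluestore_fsck_on_mount
# ceph daemon osd.0 config get bluestore_fsck_on_mount_deep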


Thanks,

Igor


On 1/20/2020 10:25 AM, huxia...@horebdata.cn wrote:

HI, Igor,

could this cause the problem?



huxia...@horebdata.cn

*From:* Igor Fedotov <mailto:ifedo...@suse.de>
*Date:* 2020-01-19 11:41
*To:* huxia...@horebdata.cn <mailto:huxia...@horebdata.cn>;
ceph-users <mailto:ceph-users@lists.ceph.com>
*Subject:* Re: [ceph-users] OSD up takes 15 minutes after machine
restarts

Hi Samuel,


wondering if you have bluestore_fsck_on_mount option set to true?
Can you see high read load over OSD device(s) during the startup?

If so it might be fsck running which takes that long.


Thanks,

Igor


On 1/19/2020 11:53 AM, huxia...@horebdata.cn wrote:

Dear folks,

I had a strange situation with 3-node Ceph cluster on Luminous
12.2.12 with bluestore. Each machine has 5 OSDs on HDD, and each
OSD uses a 30GB DB/WAL partition on SSD. At the beginning without
much data, OSDs can quickly up if one node restarts.

Then I ran 4-day long stress tests with vdbench, and then I
restarts one node, and to my surprise, OSDs on that node takes ca
15 minutes to turn into up state. How can i speed up the OSD UP
process ?

thanks,

Samuel



huxia...@horebdata.cn

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous Bluestore OSDs crashing with ASSERT

2020-01-19 Thread Igor Fedotov

So the intermediate summary is:

Any OSD in the cluster can experience an interim RocksDB checksum failure, 
which isn't present after an OSD restart.


No HW issues observed, no persistent artifacts (except OSD log) afterwards.

And it looks like the issue is rather specific to this cluster, as no similar 
reports from other users seem to be present.



Sorry, I'm out of ideas other than to collect all the failure logs and try 
to find something common in them. Maybe this will shed some light.


BTW, from my experience it might make sense to inspect the OSD log prior to 
the failure (any error messages and/or prior restarts, etc.); sometimes this 
provides some hints.



Thanks,

Igor


On 1/17/2020 2:30 PM, Stefan Priebe - Profihost AG wrote:

HI Igor,

Am 17.01.20 um 12:10 schrieb Igor Fedotov:

hmmm..

Just in case - suggest to check H/W errors with dmesg.

this happens on around 80 nodes - I don't expect all of those to have
unidentified hw errors. Also all of them are monitored - no dmesg output
contains any errors.


Also there are some (not very much though) chances this is another
incarnation of the following bug:
https://tracker.ceph.com/issues/22464
https://github.com/ceph/ceph/pull/24649

The corresponding PR works around it for main device reads (user data
only!) but theoretically it might still happen

either for DB device or DB data at main device.

Can you observe any bluefs spillovers? Are there any correlation between
failing OSDs and spillover presence if any, e.g. failing OSDs always
have a spillover. While OSDs without spillovers never face the issue...

To validate this hypothesis one can try to monitor/check (e.g. once a
day for a week or something) "bluestore_reads_with_retries" counter over
OSDs to learn if the issue is happening

in the system.  Non-zero values mean it's there for user data/main
device and hence is likely to happen for DB ones as well (which doesn't
have any workaround yet).

OK i checked bluestore_reads_with_retries on 360 osds but all of them say 0.



Additionally you might want to monitor memory usage as the above
mentioned PR denotes high memory pressure as potential trigger for these
read errors. So if such pressure happens the hypothesis becomes more valid.

we already do this heavily and have around 10GB of memory per OSD. Also
none of those machines show any io pressure at all.

All hosts show a constant rate of around 38GB to 45GB mem available in
/proc/meminfo.

Stefan


Thanks,

Igor

PS. Everything above is rather a speculation for now. Available
information is definitely not enough for extensive troubleshooting the
cases which happens that rarely.

You might want to start collecting failure-related information
(including but not limited to failure logs, perf counter dumps, system
resource reports etc) for future analysis.



On 1/16/2020 11:58 PM, Stefan Priebe - Profihost AG wrote:

Hi Igor,

answers inline.

Am 16.01.20 um 21:34 schrieb Igor Fedotov:

you may want to run fsck against failing OSDs. Hopefully it will shed
some light.

fsck just says everything fine:

# ceph-bluestore-tool --command fsck --path /var/lib/ceph/osd/ceph-27/
fsck success



Also wondering if OSD is able to recover (startup and proceed working)
after facing the issue?

no recover needed. It just runs forever after restarting.


If so do you have any one which failed multiple times? Do you have logs
for these occurrences?

may be but there are most probably weeks or month between those failures
- most probably logs are already deleted.


Also please note that patch you mentioned doesn't fix previous issues
(i.e. duplicate allocations), it prevents from new ones only.

But fsck should show them if any...

None showed.

Stefan


Thanks,

Igor



On 1/16/2020 10:04 PM, Stefan Priebe - Profihost AG wrote:

Hi Igor,

ouch sorry. Here we go:

  -1> 2020-01-16 01:10:13.404090 7f3350a14700 -1 rocksdb:
submit_transaction error: Corruption: block checksum mismatch code = 2
Rocksdb transaction:
Put( Prefix = M key =
0x0402'.OBJ_0002.953BFD0A.bb85c.rbd%udata%e3e8eac6b8b4567%e1f2e..'


Value size = 97)
Put( Prefix = M key =
0x0402'.MAP_000BB85C_0002.953BFD0A.bb85c.rbd%udata%e3e8eac6b8b4567%e1f2e..'


Value size = 93)
Put( Prefix = M key =
0x0916'.823257.73922044' Value size = 196)
Put( Prefix = M key =
0x0916'.823257.73922045' Value size = 184)
Put( Prefix = M key = 0x0916'._info' Value size = 899)
Put( Prefix = O key =
0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f'x'


Value size = 418)
Put( Prefix = O key =
0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f0003'x'


Value size = 474)
Put( Prefix = O key =
0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.000

Re: [ceph-users] OSD up takes 15 minutes after machine restarts

2020-01-19 Thread Igor Fedotov

Hi Samuel,


wondering if you have the bluestore_fsck_on_mount option set to true? Can 
you see a high read load on the OSD device(s) during startup?


If so, it might be fsck running, which takes that long.
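
Something like iostat (from the sysstat package), run while the OSD is starting 
and watching the OSD's data device, should make that obvious - a sustained high 
%util / read rate for many minutes would point at fsck or some other full-device scan:

# iostat -x 5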


Thanks,

Igor


On 1/19/2020 11:53 AM, huxia...@horebdata.cn wrote:

Dear folks,

I had a strange situation with a 3-node Ceph cluster on Luminous 12.2.12 
with bluestore. Each machine has 5 OSDs on HDD, and each OSD uses a 
30GB DB/WAL partition on SSD. At the beginning, without much data, OSDs 
would come up quickly when a node restarted.


Then I ran 4-day-long stress tests with vdbench and restarted one node, 
and to my surprise the OSDs on that node took ca. 15 minutes to reach the 
up state. How can I speed up the OSD "up" process?


thanks,

Samuel



huxia...@horebdata.cn

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous Bluestore OSDs crashing with ASSERT

2020-01-17 Thread Igor Fedotov

hmmm..

Just in case - suggest to check H/W errors with dmesg.

Also there are some (not very much though) chances this is another 
incarnation of the following bug:


https://tracker.ceph.com/issues/22464

https://github.com/ceph/ceph/pull/24649

The corresponding PR works around it for main device reads (user data 
only!) but theoretically it might still happen


either for DB device or DB data at main device.

Can you observe any bluefs spillovers? Is there any correlation between 
failing OSDs and spillover presence, e.g. failing OSDs always have a 
spillover while OSDs without spillovers never face the issue?


To validate this hypothesis one can try to monitor/check (e.g. once a 
day for a week or so) the "bluestore_reads_with_retries" counter over the 
OSDs to learn if the issue is happening in the system. Non-zero values 
mean it's there for user data / the main device, and hence is likely to 
happen for DB reads as well (which don't have any workaround yet).
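
A quick way to check this on each OSD host (the OSD id/path are just examples):

# ceph daemon osd.27 perf dump | grep reads_with_retries
# for d in /var/lib/ceph/osd/ceph-*; do id=${d##*-}; echo -n "osd.$id: "; ceph daemon osd.$id perf dump | grep reads_with_retries; done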


Additionally you might want to monitor memory usage as the above 
mentioned PR denotes high memory pressure as potential trigger for these 
read errors. So if such pressure happens the hypothesis becomes more valid.



Thanks,

Igor

PS. Everything above is rather speculation for now. The available 
information is definitely not enough for extensive troubleshooting of 
cases which happen this rarely.


You might want to start collecting failure-related information 
(including but not limited to failure logs, perf counter dumps, system 
resource reports etc) for future analysis.




On 1/16/2020 11:58 PM, Stefan Priebe - Profihost AG wrote:

Hi Igor,

answers inline.

Am 16.01.20 um 21:34 schrieb Igor Fedotov:

you may want to run fsck against failing OSDs. Hopefully it will shed
some light.

fsck just says everything fine:

# ceph-bluestore-tool --command fsck --path /var/lib/ceph/osd/ceph-27/
fsck success



Also wondering if OSD is able to recover (startup and proceed working)
after facing the issue?

no recover needed. It just runs forever after restarting.


If so do you have any one which failed multiple times? Do you have logs
for these occurrences?

maybe, but there are most probably weeks or months between those failures
- most probably the logs are already deleted.


Also please note that the patch you mentioned doesn't fix previous issues
(i.e. duplicate allocations), it only prevents new ones.

But fsck should show them if any...

None showed.

Stefan


Thanks,

Igor



On 1/16/2020 10:04 PM, Stefan Priebe - Profihost AG wrote:

Hi Igor,

ouch sorry. Here we go:

     -1> 2020-01-16 01:10:13.404090 7f3350a14700 -1 rocksdb:
submit_transaction error: Corruption: block checksum mismatch code = 2
Rocksdb transaction:
Put( Prefix = M key =
0x0402'.OBJ_0002.953BFD0A.bb85c.rbd%udata%e3e8eac6b8b4567%e1f2e..'

Value size = 97)
Put( Prefix = M key =
0x0402'.MAP_000BB85C_0002.953BFD0A.bb85c.rbd%udata%e3e8eac6b8b4567%e1f2e..'

Value size = 93)
Put( Prefix = M key =
0x0916'.823257.73922044' Value size = 196)
Put( Prefix = M key =
0x0916'.823257.73922045' Value size = 184)
Put( Prefix = M key = 0x0916'._info' Value size = 899)
Put( Prefix = O key =
0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f'x'

Value size = 418)
Put( Prefix = O key =
0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f0003'x'

Value size = 474)
Put( Prefix = O key =
0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f0007c000'x'

Value size = 392)
Put( Prefix = O key =
0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f0009'x'

Value size = 317)
Put( Prefix = O key =
0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f000a'x'

Value size = 521)
Put( Prefix = O key =
0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f000f4000'x'

Value size = 558)
Put( Prefix = O key =
0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f0013'x'

Value size = 649)
Put( Prefix = O key =
0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f00194000'x'

Value size = 449)
Put( Prefix = O key =
0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f001cc000'x'

Value size = 580)
Put( Prefix = O key =
0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f0020'x'

Value size = 435)
Put( P

Re: [ceph-users] Luminous Bluestore OSDs crashing with ASSERT

2020-01-16 Thread Igor Fedotov
Put( Prefix = X key = 0x0d7a035b Value size = 14)
Put( Prefix = X key = 0x0d7a035c Value size = 14)
Put( Prefix = X key = 0x0d7a0355 Value size = 14)
Put( Prefix = X key = 0x0d7a0356 Value size = 17)
Put( Prefix = X key = 0x1a54f6e4 Value size = 14)
Put( Prefix = X key = 0x1b1c061e Value size = 14)
Put( Prefix = X key = 0x0d7a038f Value size = 14)
Put( Prefix = X key = 0x0d7a0389 Value size = 14)
Put( Prefix = X key = 0x0d7a0358 Value size = 14)
Put( Prefix = X key = 0x0d7a035f Value size = 14)
Put( Prefix = X key = 0x0d7a0357 Value size = 14)
Put( Prefix = X key = 0x0d7a0387 Value size = 14)
Put( Prefix = X key = 0x0d7a038a Value size = 14)
Put( Prefix = X key = 0x0d7a0388 Value size = 14)
Put( Prefix = X key = 0x134c3fbe Value size = 14)
Put( Prefix = X key = 0x134c3fb5 Value size = 14)
Put( Prefix = X key = 0x0d7a036e Value size = 14)
Put( Prefix = X key = 0x0d7a036d Value size = 14)
Put( Prefix = X key = 0x134c3fb8 Value size = 14)
Put( Prefix = X key = 0x0d7a0371 Value size = 14)
Put( Prefix = X key = 0x0d7a036a Value size = 14)
  0> 2020-01-16 01:10:13.413759 7f3350a14700 -1
/build/ceph/src/os/bluestore/BlueStore.cc: In function 'void
BlueStore::_kv_sync_thread()' thread 7f3350a14700 time 2020-01-16
01:10:13.404113
/build/ceph/src/os/bluestore/BlueStore.cc: 8808: FAILED assert(r == 0)

  ceph version 12.2.12-11-gd3eae83543
(d3eae83543bffc0fc6c43823feb637fa851b6213) luminous (stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x55c9a712d232]
  2: (BlueStore::_kv_sync_thread()+0x24c5) [0x55c9a6fb54b5]
  3: (BlueStore::KVSyncThread::entry()+0xd) [0x55c9a6ff608d]
  4: (()+0x7494) [0x7f33615f9494]
  5: (clone()+0x3f) [0x7f3360680acf]

I already picked those:
https://github.com/ceph/ceph/pull/28644

Greets,
Stefan
Am 16.01.20 um 17:00 schrieb Igor Fedotov:

Hi Stefan,

would you please share log snippet prior the assertions? Looks like
RocksDB is failing during transaction submission...


Thanks,

Igor

On 1/16/2020 11:56 AM, Stefan Priebe - Profihost AG wrote:

Hello,

does anybody know a fix for this ASSERT / crash?

2020-01-16 02:02:31.316394 7f8c3f5ab700 -1
/build/ceph/src/os/bluestore/BlueStore.cc: In function 'void
BlueStore::_kv_sync_thread()' thread 7f8c3f5ab700 time 2020-01-16
02:02:31.304993
/build/ceph/src/os/bluestore/BlueStore.cc: 8808: FAILED assert(r == 0)

   ceph version 12.2.12-11-gd3eae83543
(d3eae83543bffc0fc6c43823feb637fa851b6213) luminous (stable)
   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x55e6df9d9232]
   2: (BlueStore::_kv_sync_thread()+0x24c5) [0x55e6df8614b5]
   3: (BlueStore::KVSyncThread::entry()+0xd) [0x55e6df8a208d]
   4: (()+0x7494) [0x7f8c50190494]
   5: (clone()+0x3f) [0x7f8c4f217acf]

all bluestore OSDs are randomly crashing sometimes (once a week).

Greets,
Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous Bluestore OSDs crashing with ASSERT

2020-01-16 Thread Igor Fedotov

Hi Stefan,

would you please share a log snippet prior to the assertions? Looks like 
RocksDB is failing during transaction submission...



Thanks,

Igor

On 1/16/2020 11:56 AM, Stefan Priebe - Profihost AG wrote:

Hello,

does anybody know a fix for this ASSERT / crash?

2020-01-16 02:02:31.316394 7f8c3f5ab700 -1
/build/ceph/src/os/bluestore/BlueStore.cc: In function 'void
BlueStore::_kv_sync_thread()' thread 7f8c3f5ab700 time 2020-01-16
02:02:31.304993
/build/ceph/src/os/bluestore/BlueStore.cc: 8808: FAILED assert(r == 0)

  ceph version 12.2.12-11-gd3eae83543
(d3eae83543bffc0fc6c43823feb637fa851b6213) luminous (stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x55e6df9d9232]
  2: (BlueStore::_kv_sync_thread()+0x24c5) [0x55e6df8614b5]
  3: (BlueStore::KVSyncThread::entry()+0xd) [0x55e6df8a208d]
  4: (()+0x7494) [0x7f8c50190494]
  5: (clone()+0x3f) [0x7f8c4f217acf]

all bluestore OSDs are randomly crashing sometimes (once a week).

Greets,
Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Impact of a small DB size with Bluestore

2019-12-02 Thread Igor Fedotov

Hi Lars,

I've also seen interim space usage bursts during my experiments - up to 2x 
the max level size when the topmost RocksDB level is L3 (i.e. 25GB max). 
So I think 2x (which results in 60-64 GB for the DB) is a good grade 
when your DB is expected to be small or medium sized. I'm not sure this 
multiplier is adequate for large systems where L4 (250GB max) is 
expected - it results in a pretty large spare volume. But actually I don't 
have any real experience with that case.


FYI: one can learn per-device maximum bluefs space allocated since OSD 
restart using the following bluefs performance counters:


    l_bluefs_max_bytes_wal,
    l_bluefs_max_bytes_db,
    l_bluefs_max_bytes_slow,

which might give some insight for your system's real needs.
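
For example (OSD id is just an example; in the perf dump output the counters 
appear in the bluefs section without the l_ prefix):

# ceph daemon osd.3 perf dump | grep -E '"max_bytes_(wal|db|slow)"'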


Thanks,

Igor


On 12/2/2019 10:55 AM, Lars Täuber wrote:

Hi,

Tue, 26 Nov 2019 13:57:51 +
Simon Ironside  ==> ceph-users@lists.ceph.com :

Mattia Belluco said back in May:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-May/035086.html

"when RocksDB needs to compact a layer it rewrites it
*before* deleting the old data; if you'd like to be sure you db does not
spill over to the spindle you should allocate twice the size of the
biggest layer to allow for compaction."

I didn't spot anyone disagreeing so I used 64GiB DB/WAL partitions on
the SSDs in my most recent clusters to allow for this and to be certain
that I definitely had room for the WAL on top and wouldn't get caught
out by people saying GB (x1000^3 bytes) when they mean GiB (x1024^3
bytes). I left the rest of the SSD empty to make the most of wear
leveling, garbage collection etc.

Simon


this is something I'd like to get a comment on from a developer too.
So what about the doubled size for block_db?

Thanks
Lars

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-objectstore-tool crash when trying to recover pg from OSD

2019-11-07 Thread Igor Fedotov

Hi Eugene,

this looks like https://tracker.ceph.com/issues/42223 indeed.

Would you please find the first crash for these OSDs and share 
corresponding logs in the ticket.



Unfortunately I don't know of any reliable enough way to recover an OSD after 
such a failure. If one exists at all... :(


I've been told offline by Rafal Wadolowski that ceph-kvstore-tool's 
destructive-repair command helped in 1 of 2 attempts. But I would 
strongly advise refraining from using this command for now unless 
you're absolutely immune to data loss. It might make things even worse.


Thanks,

Igor

On 11/7/2019 11:56 AM, Eugene de Beste wrote:

Hi, does anyone have any feedback for me regarding this?

Here's the log I get when trying to restart the OSD via systemctl: 
https://pastebin.com/tshuqsLP
On Mon, 4 Nov 2019 at 12:42, Eugene de Beste wrote:


Hi everyone

I have a cluster that was initially set up with bad defaults in
Luminous. After upgrading to Nautilus I've had a few OSDs crash on
me, due to errors seemingly related to
https://tracker.ceph.com/issues/42223 and
https://tracker.ceph.com/issues/22678.

One of my pools has been running with min_size 1 (yes, I know) and
I am now stuck with incomplete pgs due to the aforementioned OSD crashes.

When trying to use the ceph-objectstore-tool to get the pgs out of
the OSD, I'm running into the same issue as trying to start the
OSD, which is the crashes. ceph-objectstore-tool core dumps and I
can't retrieve the pg.

Does anyone have any input on this? I would like to be able to
retrieve that data if possible.

Here's the log for ceph-objectstore-tool --debug --data-path
/var/lib/ceph/osd/ceph-22 --skip-journal-replay --skip-mount-omap
--op info --pgid 2.9f  -- https://pastebin.com/9aGtAfSv

Regards and thanks,
Eugene


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Negative Objects Number

2019-10-14 Thread Igor Fedotov

Hi Lazuardi,

never seen that. Just wondering what Ceph version are you running?


Thanks,

Igor

On 10/8/2019 3:52 PM, Lazuardi Nasution wrote:

Hi,

I get the following weird negative object counts on tiering pools. Why is this 
happening? How do I get back to normal?


Best regards,

[root@management-a ~]# ceph df detail
GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED     OBJECTS
    446T     184T      261T         58.62         22092k
POOLS:
    NAME                ID   CATEGORY   QUOTA OBJECTS   QUOTA BYTES   USED     %USED   MAX AVAIL   OBJECTS    DIRTY    READ     WRITE    RAW USED
    rbd                 1    -          N/A             N/A           0        0       25838G      0          0        0        1        0
    volumes             2    -          N/A             N/A           82647G   76.18   25838G      21177891   20681k   5897M    2447M    242T
    images              3    -          N/A             N/A           3683G    12.48   25838G      705881     689k     37844k   10630k   11049G
    backups             4    -          N/A             N/A           0        0       25838G      0          0        0        0        0
    vms                 5    -          N/A             N/A           3003G    10.41   25838G      772845     754k     623M     812M     9010G
    rbd_tiering         11   -          N/A             N/A           333      0       3492G       4          0        1        2        999
    volumes_tiering     12   -          N/A             N/A           9761M    0       3492G       -1233      338      2340M    1982M    0
    images_tiering      13   -          N/A             N/A           293k     0       3492G       129        0        34642k   3600k    880k
    backups_tiering     14   -          N/A             N/A           83       0       3492G       1          0        2        2        249
    vms_tiering         15   -          N/A             N/A           2758M    0       3492G       -32567     116      31942M   2875M    0


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph version 14.2.3-OSD fails

2019-10-11 Thread Igor Fedotov

Hi!

originally your issue looked like the ones from 
https://tracker.ceph.com/issues/42223


And it looks like a lack of some key information for the FreeListManager in 
RocksDB.


Once you have it present we can check the content of RocksDB to 
prove this hypothesis - please let me know if you want the guideline for 
that.



The last log is different, the key record is probably:

-2> 2019-10-09 23:03:47.011 7fb4295a7700 -1 rocksdb: submit_common 
error: Corruption: block checksum mismatch: expected 2181709173, got 
2130853119  in db/204514.sst offset 0 size 61648 code = 2 Rocksdb 
transaction:


which most probably denotes data corruption in DB. Unfortunately for now 
I can't say if this is related to the original issue or not.


This time it resembles the issue shared in this mailing list a while ago 
by Stefan Priebe. The post caption is "Bluestore OSDs keep crashing in 
BlueStore.cc: 8808: FAILED assert(r == 0)".


So first of all I'd suggest to distinguish these issues for now and try 
to troubleshoot them separately.



As for the first case I'm wondering if you have any OSDs still failing 
this way, i.e. asserting in allocator and showing 0 extents loaded: 
"_open_alloc loaded 0 B in 0 extents"


If so lets check DB content first.


For the second case I'm mostly wondering whether the issue is permanent for 
a specific OSD or disappears after an OSD/node restart, as it did in 
Stefan's case?



Thanks,

Igor


On 10/10/2019 1:59 PM, cephuser2345 user wrote:

Hi Igor,
since the last OSD crash we have had some 4 more. I tried to check RocksDB 
with ceph-kvstore-tool:

ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-71 compact
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-71 repair
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-71 
destructive-repair


nothing helped; we had to redeploy the OSD by removing it from the 
cluster and reinstalling.


We updated to ceph 14.2.4 two weeks or more ago and OSDs are still 
failing in the same way. I managed to capture the first fault by using 
"ceph crash ls" and added the log+meta to this email.

Can these logs shed some light?










On Thu, Sep 12, 2019 at 7:20 PM Igor Fedotov <ifedo...@suse.de> wrote:

Hi,

this line:

    -2> 2019-09-12 16:38:15.101 7fcd02fd1f80  1
bluestore(/var/lib/ceph/osd/ceph-71) _open_alloc loaded 0 B in
0 extents

tells me that OSD is unable to load free list manager
properly, i.e. list of free/allocated blocks in unavailable.

You might want to set 'debug bluestore = 10" and check
additional log output between

these two lines:

    -3> 2019-09-12 16:38:15.093 7fcd02fd1f80  1
bluestore(/var/lib/ceph/osd/ceph-71) _open_alloc opening
allocation metadata
    -2> 2019-09-12 16:38:15.101 7fcd02fd1f80  1
bluestore(/var/lib/ceph/osd/ceph-71) _open_alloc loaded 0 B in
0 extents

And/or check RocksDB records prefixed with "b" prefix using
ceph-kvstore-tool.


Igor


P.S.

Sorry, might be unresponsive for the next two week as I'm
going on vacation.


On 9/12/2019 7:04 PM, cephuser2345 user wrote:

Hi
we have updated  the ceph version from 14.2.2 to version 14.2.3.
the osd getting :

  -21        76.68713     host osd048
 66   hdd  12.78119         osd.66      up  1.0 1.0
 67   hdd  12.78119         osd.67      up  1.0 1.0
 68   hdd  12.78119         osd.68      up  1.0 1.0
 69   hdd  12.78119         osd.69      up  1.0 1.0
 70   hdd  12.78119         osd.70      up  1.0 1.0
 71   hdd  12.78119         osd.71    down  0 1.0

we can not   get the osd  up  getting error its happening in
alot of osds
can you please assist :)  added txt log
bluestore(/var/lib/ceph/osd/ceph-71) _open_alloc opening
allocation metadata
    -2> 2019-09-12 16:38:15.101 7fcd02fd1f80  1
bluestore(/var/lib/ceph/osd/ceph-71) _open_alloc loaded 0 B
in 0 extents
    -1> 2019-09-12 16:38:15.101 7fcd02fd1f80 -1
/build/ceph-14.2.3/src/os/bluestore/fastbmap_allocator_impl.h:
In function 'void
AllocatorLevel02::_mark_allocated(uint64_t, uint64_t)
[with L1 = AllocatorLevel01Loose; uint64_t = long unsigned
int]' thread 7fcd02fd1f80 time 2019-09-12 16:38:15.102539

___
ceph-users mailing list
ceph-users@lists.ceph.com  <mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 3,30,300 GB constraint of block.db size on SSD

2019-09-30 Thread Igor Fedotov

Hi Massimo,

On 9/29/2019 9:13 AM, Massimo Sgaravatto wrote:

In my ceph cluster I use spinning disks for bluestore OSDs and SSDs 
just for the block.db.


If I have got it right, right now:

a) only 3/30/300 GB can be used on the SSD, the rest of RocksDB spills over 
to the slow device, so you don't have any benefit with e.g. 250 GB reserved 
on the SSD for block.db wrt a configuration with only 30 GB on the SSD

Generally this is correct, except for peak points when the DB might 
temporarily need some extra space for compaction or other interim purposes. 
I've observed up to a 2x increase in the lab, so allocating some extra space 
might be useful.
b) because of a), the recommendation reported in the doc saying that 
the block.db size should not be smaller than 4% of block is basically 
wrong.

I'd say this is a very conservative estimate IMO.


Are there plans to change that in next releases ?
I am asking because I am going to buy new hardware and I'd like to 
understand if I should keep considering this 'constraint' when 
choosing the size of the SSD disks


Yes, we're working on a more intelligent DB space utilization scheme, 
which will allow interim volume sizes to be useful.


Here is a PR which is pending final (hopefully) review: 
https://github.com/ceph/ceph/pull/29687
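
In the meantime, one way to see how much DB space an OSD actually uses (and 
whether anything has spilled over to the slow device) is to look at the bluefs 
counters, e.g. (OSD id is just an example):

# ceph daemon osd.0 perf dump | grep -E '"(db|wal|slow)_(total|used)_bytes"'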




Thanks, Massimo


Thanks,

Igor


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

2019-09-12 Thread Igor Fedotov

Hi Stefan,

thanks for the update.

Relevant PR from Paul mentions kernels (4.9+): 
https://github.com/ceph/ceph/pull/23273


Not sure how correct this is. That's all I have..

Try asking Sage/Paul...

Also, could you please update the ticket with more details, e.g. what are 
the original and new kernel versions.



Thanks,

Igor

On 9/12/2019 8:20 PM, Stefan Priebe - Profihost AG wrote:

Hello Igor,

i can now confirm that this is indeed a kernel bug. The issue does no
longer happen on upgraded nodes.

Do you know more about it? I really would like to know in which version
it was fixed to prevent rebooting all ceph nodes.

Greets,
Stefan

Am 27.08.19 um 16:20 schrieb Igor Fedotov:

It sounds like OSD is "recovering" after checksum error.

I.e. just failed OSD shows no errors in fsck and is able to restart and
process new write requests for long enough period (longer than just a
couple of minutes). Are these statements true? If so I can suppose this
is accidental/volatile issue rather than data-at-rest corruption.
Something like data incorrectly read from disk.

Are you using standalone disk drive for DB/WAL or it's shared with main
one? Just in case as a low handing fruit - I'd suggest checking with
dmesg and smartctl for drive errors...

FYI: one more reference for the similar issue:
https://tracker.ceph.com/issues/24968

HW issue this time...


Also I recall an issue with some kernels that caused occasional invalid
data reads under high memory pressure/swapping:
https://tracker.ceph.com/issues/22464

IMO memory usage worth checking as well...


Igor


On 8/27/2019 4:52 PM, Stefan Priebe - Profihost AG wrote:

see inline

Am 27.08.19 um 15:43 schrieb Igor Fedotov:

see inline

On 8/27/2019 4:41 PM, Stefan Priebe - Profihost AG wrote:

Hi Igor,

Am 27.08.19 um 14:11 schrieb Igor Fedotov:

Hi Stefan,

this looks like a duplicate for

https://tracker.ceph.com/issues/37282

Actually the root cause selection might be quite wide.

   From HW issues to broken logic in RocksDB/BlueStore/BlueFS etc.

As far as I understand you have different OSDs which are failing,
right?

Yes i've seen this on around 50 different OSDs running different HW but
all run ceph 12.2.12. I've not seen this with 12.2.10 which we were
running before.


Is the set of these broken OSDs limited somehow?

No at least i'm not able to find



Any specific subset which is failing or something? E.g. just N of them
are failing from time to time.

No seems totally random.


Any similarities for broken OSDs (e.g. specific hardware)?

All run intel xeon CPUs and all run linux ;-)


Did you run fsck for any of broken OSDs? Any reports?

Yes but no reports.

Are you saying that fsck is fine for OSDs that showed this sort of
errors?

Yes fsck does not show a single error - everything is fine.


Any other errors/crashes in logs before these sort of issues happens?

No



Just in case - what allocator are you using?

tcmalloc

I meant BlueStore allocator - is it stupid or bitmap?

ah the default one i think this is stupid.

Greets,
Stefan


Greets,
Stefan


Thanks,

Igor



On 8/27/2019 1:03 PM, Stefan Priebe - Profihost AG wrote:

Hello,

since some month all our bluestore OSDs keep crashing from time to
time.
Currently about 5 OSDs per day.

All of them show the following trace:
Trace:
2019-07-24 08:36:48.995397 7fb19a711700 -1 rocksdb:
submit_transaction
error: Corruption: block checksum mismatch code = 2 Rocksdb
transaction:
Put( Prefix = M key =
0x09a5'.916366.74680351' Value size =
184)
Put( Prefix = M key = 0x09a5'._fastinfo' Value size =
186)
Put( Prefix = O key =
0x7f8003bb605f'd!rbd_data.afe49a6b8b4567.3c11!='0xfffe6f0024'x'



Value size = 530)
Put( Prefix = O key =
0x7f8003bb605f'd!rbd_data.afe49a6b8b4567.3c11!='0xfffe'o'



Value size = 510)
Put( Prefix = L key = 0x10ba60f1 Value size = 4135)
2019-07-24 08:36:49.012110 7fb19a711700 -1
/build/ceph/src/os/bluestore/BlueStore.cc: In function 'void
BlueStore::_kv_sync_thread()' thread 7fb19a711700 time 2019-07-24
08:36:48.995415
/build/ceph/src/os/bluestore/BlueStore.cc: 8808: FAILED assert(r
== 0)

ceph version 12.2.12-7-g1321c5e91f
(1321c5e91f3d5d35dd5aa5a0029a54b9a8ab9498) luminous (stable)
     1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x5653a010e222]
     2: (BlueStore::_kv_sync_thread()+0x24c5) [0x56539ff964b5]
     3: (BlueStore::KVSyncThread::entry()+0xd) [0x56539ffd708d]
     4: (()+0x7494) [0x7fb1ab2f6494]
     5: (clone()+0x3f) [0x7fb1aa37dacf]

I already opend up a tracker:
https://tracker.ceph.com/issues/41367

Can anybody help? Is this known?

Greets,
Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list

Re: [ceph-users] ceph version 14.2.3-OSD fails

2019-09-12 Thread Igor Fedotov

Hi,

this line:

    -2> 2019-09-12 16:38:15.101 7fcd02fd1f80  1 
bluestore(/var/lib/ceph/osd/ceph-71) _open_alloc loaded 0 B in 0 extents


tells me that OSD is unable to load free list manager properly, i.e. 
list of free/allocated blocks in unavailable.


You might want to set 'debug bluestore = 10" and check additional log 
output between


these two lines:

    -3> 2019-09-12 16:38:15.093 7fcd02fd1f80  1 
bluestore(/var/lib/ceph/osd/ceph-71) _open_alloc opening allocation metadata
    -2> 2019-09-12 16:38:15.101 7fcd02fd1f80  1 
bluestore(/var/lib/ceph/osd/ceph-71) _open_alloc loaded 0 B in 0 extents
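
E.g. (the OSD id is just an example; alternatively put "debug bluestore = 10" 
into that host's ceph.conf under the OSD section):

# ceph config set osd.71 debug_bluestore 10
# systemctl restart ceph-osd@71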


And/or check RocksDB records prefixed with "b" prefix using 
ceph-kvstore-tool.
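
For example (the path is just an example; the OSD must be stopped while the 
tool accesses its RocksDB):

# ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-71 list b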



Igor


P.S.

Sorry, might be unresponsive for the next two week as I'm going on 
vacation.



On 9/12/2019 7:04 PM, cephuser2345 user wrote:

Hi
we have updated  the ceph version from 14.2.2 to version 14.2.3.
the osd getting :

  -21        76.68713     host osd048
 66   hdd  12.78119         osd.66      up  1.0 1.0
 67   hdd  12.78119         osd.67      up  1.0 1.0
 68   hdd  12.78119         osd.68      up  1.0 1.0
 69   hdd  12.78119         osd.69      up  1.0 1.0
 70   hdd  12.78119         osd.70      up  1.0 1.0
 71   hdd  12.78119         osd.71    down        0 1.0

we can not   get the osd  up  getting error its happening in alot of osds
can you please assist :)  added txt log
bluestore(/var/lib/ceph/osd/ceph-71) _open_alloc opening allocation 
metadata
    -2> 2019-09-12 16:38:15.101 7fcd02fd1f80  1 
bluestore(/var/lib/ceph/osd/ceph-71) _open_alloc loaded 0 B in 0 extents
    -1> 2019-09-12 16:38:15.101 7fcd02fd1f80 -1 
/build/ceph-14.2.3/src/os/bluestore/fastbmap_allocator_impl.h: In 
function 'void AllocatorLevel02::_mark_allocated(uint64_t, 
uint64_t) [with L1 = AllocatorLevel01Loose; uint64_t = long unsigned 
int]' thread 7fcd02fd1f80 time 2019-09-12 16:38:15.102539


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

2019-08-27 Thread Igor Fedotov

It sounds like the OSD is "recovering" after the checksum error.

I.e. a just-failed OSD shows no errors in fsck and is able to restart and 
process new write requests for a long enough period (longer than just a 
couple of minutes). Are these statements true? If so, I can suppose this 
is an accidental/volatile issue rather than data-at-rest corruption - 
something like data being incorrectly read from disk.


Are you using a standalone disk drive for DB/WAL or is it shared with the main 
one? Just in case, as a low-hanging fruit, I'd suggest checking for drive errors 
with dmesg and smartctl...
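
E.g. something along these lines (the device name is just an example):

# dmesg -T | grep -iE 'i/o error|ata[0-9]|nvme|medium error'
# smartctl -a /dev/sdb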


FYI: one more reference for the similar issue: 
https://tracker.ceph.com/issues/24968


HW issue this time...


Also I recall an issue with some kernels that caused occasional invalid 
data reads under high memory pressure/swapping: 
https://tracker.ceph.com/issues/22464


IMO memory usage is worth checking as well...


Igor


On 8/27/2019 4:52 PM, Stefan Priebe - Profihost AG wrote:

see inline

Am 27.08.19 um 15:43 schrieb Igor Fedotov:

see inline

On 8/27/2019 4:41 PM, Stefan Priebe - Profihost AG wrote:

Hi Igor,

Am 27.08.19 um 14:11 schrieb Igor Fedotov:

Hi Stefan,

this looks like a duplicate for

https://tracker.ceph.com/issues/37282

Actually the root cause selection might be quite wide.

  From HW issues to broken logic in RocksDB/BlueStore/BlueFS etc.

As far as I understand you have different OSDs which are failing, right?

Yes i've seen this on around 50 different OSDs running different HW but
all run ceph 12.2.12. I've not seen this with 12.2.10 which we were
running before.


Is the set of these broken OSDs limited somehow?

No at least i'm not able to find



Any specific subset which is failing or something? E.g. just N of them
are failing from time to time.

No seems totally random.


Any similarities for broken OSDs (e.g. specific hardware)?

All run intel xeon CPUs and all run linux ;-)


Did you run fsck for any of broken OSDs? Any reports?

Yes but no reports.

Are you saying that fsck is fine for OSDs that showed this sort of errors?

Yes fsck does not show a single error - everything is fine.


Any other errors/crashes in logs before these sort of issues happens?

No



Just in case - what allocator are you using?

tcmalloc

I meant BlueStore allocator - is it stupid or bitmap?

ah the default one i think this is stupid.

Greets,
Stefan


Greets,
Stefan


Thanks,

Igor



On 8/27/2019 1:03 PM, Stefan Priebe - Profihost AG wrote:

Hello,

since some month all our bluestore OSDs keep crashing from time to
time.
Currently about 5 OSDs per day.

All of them show the following trace:
Trace:
2019-07-24 08:36:48.995397 7fb19a711700 -1 rocksdb: submit_transaction
error: Corruption: block checksum mismatch code = 2 Rocksdb
transaction:
Put( Prefix = M key =
0x09a5'.916366.74680351' Value size = 184)
Put( Prefix = M key = 0x09a5'._fastinfo' Value size = 186)
Put( Prefix = O key =
0x7f8003bb605f'd!rbd_data.afe49a6b8b4567.3c11!='0xfffe6f0024'x'


Value size = 530)
Put( Prefix = O key =
0x7f8003bb605f'd!rbd_data.afe49a6b8b4567.3c11!='0xfffe'o'


Value size = 510)
Put( Prefix = L key = 0x10ba60f1 Value size = 4135)
2019-07-24 08:36:49.012110 7fb19a711700 -1
/build/ceph/src/os/bluestore/BlueStore.cc: In function 'void
BlueStore::_kv_sync_thread()' thread 7fb19a711700 time 2019-07-24
08:36:48.995415
/build/ceph/src/os/bluestore/BlueStore.cc: 8808: FAILED assert(r == 0)

ceph version 12.2.12-7-g1321c5e91f
(1321c5e91f3d5d35dd5aa5a0029a54b9a8ab9498) luminous (stable)
    1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x5653a010e222]
    2: (BlueStore::_kv_sync_thread()+0x24c5) [0x56539ff964b5]
    3: (BlueStore::KVSyncThread::entry()+0xd) [0x56539ffd708d]
    4: (()+0x7494) [0x7fb1ab2f6494]
    5: (clone()+0x3f) [0x7fb1aa37dacf]

I already opend up a tracker:
https://tracker.ceph.com/issues/41367

Can anybody help? Is this known?

Greets,
Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

2019-08-27 Thread Igor Fedotov

see inline

On 8/27/2019 4:41 PM, Stefan Priebe - Profihost AG wrote:

Hi Igor,

Am 27.08.19 um 14:11 schrieb Igor Fedotov:

Hi Stefan,

this looks like a duplicate for

https://tracker.ceph.com/issues/37282

Actually the root cause selection might be quite wide.

 From HW issues to broken logic in RocksDB/BlueStore/BlueFS etc.

As far as I understand you have different OSDs which are failing, right?

Yes i've seen this on around 50 different OSDs running different HW but
all run ceph 12.2.12. I've not seen this with 12.2.10 which we were
running before.


Is the set of these broken OSDs limited somehow?

No at least i'm not able to find



Any specific subset which is failing or something? E.g. just N of them
are failing from time to time.

No seems totally random.


Any similarities for broken OSDs (e.g. specific hardware)?

All run intel xeon CPUs and all run linux ;-)


Did you run fsck for any of broken OSDs? Any reports?

Yes but no reports.

Are you saying that fsck is fine for OSDs that showed this sort of errors?




Any other errors/crashes in logs before these sort of issues happens?

No



Just in case - what allocator are you using?

tcmalloc

I meant BlueStore allocator - is it stupid or bitmap?


Greets,
Stefan


Thanks,

Igor



On 8/27/2019 1:03 PM, Stefan Priebe - Profihost AG wrote:

Hello,

since some month all our bluestore OSDs keep crashing from time to time.
Currently about 5 OSDs per day.

All of them show the following trace:
Trace:
2019-07-24 08:36:48.995397 7fb19a711700 -1 rocksdb: submit_transaction
error: Corruption: block checksum mismatch code = 2 Rocksdb transaction:
Put( Prefix = M key =
0x09a5'.916366.74680351' Value size = 184)
Put( Prefix = M key = 0x09a5'._fastinfo' Value size = 186)
Put( Prefix = O key =
0x7f8003bb605f'd!rbd_data.afe49a6b8b4567.3c11!='0xfffe6f0024'x'

Value size = 530)
Put( Prefix = O key =
0x7f8003bb605f'd!rbd_data.afe49a6b8b4567.3c11!='0xfffe'o'

Value size = 510)
Put( Prefix = L key = 0x10ba60f1 Value size = 4135)
2019-07-24 08:36:49.012110 7fb19a711700 -1
/build/ceph/src/os/bluestore/BlueStore.cc: In function 'void
BlueStore::_kv_sync_thread()' thread 7fb19a711700 time 2019-07-24
08:36:48.995415
/build/ceph/src/os/bluestore/BlueStore.cc: 8808: FAILED assert(r == 0)

ceph version 12.2.12-7-g1321c5e91f
(1321c5e91f3d5d35dd5aa5a0029a54b9a8ab9498) luminous (stable)
   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x5653a010e222]
   2: (BlueStore::_kv_sync_thread()+0x24c5) [0x56539ff964b5]
   3: (BlueStore::KVSyncThread::entry()+0xd) [0x56539ffd708d]
   4: (()+0x7494) [0x7fb1ab2f6494]
   5: (clone()+0x3f) [0x7fb1aa37dacf]

I already opend up a tracker:
https://tracker.ceph.com/issues/41367

Can anybody help? Is this known?

Greets,
Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

2019-08-27 Thread Igor Fedotov

Hi Stefan,

this looks like a duplicate for

https://tracker.ceph.com/issues/37282


Actually the range of possible root causes might be quite wide - 
from HW issues to broken logic in RocksDB/BlueStore/BlueFS etc.

As far as I understand you have different OSDs which are failing, right? 
Is the set of these broken OSDs limited somehow?


Any specific subset which is failing or something? E.g. just N of them 
are failing from time to time.


Any similarities for broken OSDs (e.g. specific hardware)?


Did you run fsck for any of broken OSDs? Any reports?

Any other errors/crashes in logs before these sort of issues happens?


Just in case - what allocator are you using?


Thanks,

Igor



On 8/27/2019 1:03 PM, Stefan Priebe - Profihost AG wrote:

Hello,

since some month all our bluestore OSDs keep crashing from time to time.
Currently about 5 OSDs per day.

All of them show the following trace:
Trace:
2019-07-24 08:36:48.995397 7fb19a711700 -1 rocksdb: submit_transaction
error: Corruption: block checksum mismatch code = 2 Rocksdb transaction:
Put( Prefix = M key =
0x09a5'.916366.74680351' Value size = 184)
Put( Prefix = M key = 0x09a5'._fastinfo' Value size = 186)
Put( Prefix = O key =
0x7f8003bb605f'd!rbd_data.afe49a6b8b4567.3c11!='0xfffe6f0024'x'
Value size = 530)
Put( Prefix = O key =
0x7f8003bb605f'd!rbd_data.afe49a6b8b4567.3c11!='0xfffe'o'
Value size = 510)
Put( Prefix = L key = 0x10ba60f1 Value size = 4135)
2019-07-24 08:36:49.012110 7fb19a711700 -1
/build/ceph/src/os/bluestore/BlueStore.cc: In function 'void
BlueStore::_kv_sync_thread()' thread 7fb19a711700 time 2019-07-24
08:36:48.995415
/build/ceph/src/os/bluestore/BlueStore.cc: 8808: FAILED assert(r == 0)

ceph version 12.2.12-7-g1321c5e91f
(1321c5e91f3d5d35dd5aa5a0029a54b9a8ab9498) luminous (stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x5653a010e222]
  2: (BlueStore::_kv_sync_thread()+0x24c5) [0x56539ff964b5]
  3: (BlueStore::KVSyncThread::entry()+0xd) [0x56539ffd708d]
  4: (()+0x7494) [0x7fb1ab2f6494]
  5: (clone()+0x3f) [0x7fb1aa37dacf]

I already opend up a tracker:
https://tracker.ceph.com/issues/41367

Can anybody help? Is this known?

Greets,
Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] WAL/DB size

2019-08-14 Thread Igor Fedotov

Hi Wido & Hemant,

On 8/14/2019 11:36 AM, Wido den Hollander wrote:


On 8/14/19 9:33 AM, Hemant Sonawane wrote:

Hello guys,

Thank you so much for your responses, I really appreciate it. But I would
like to mention one more thing which I forgot in my last email: I
am going to use this storage for OpenStack VMs. So will the answer
still be the same, that I should use 1GB for the WAL?


WAL 1GB is fine, yes.


I'd like to argue against this for a bit.

Actually a standalone WAL is only required when you have either a very small 
fast device (and don't want the DB to use it) or three devices of different 
performance behind the OSD (e.g. hdd, ssd, nvme). In that case the WAL should 
be located on the fastest one.


For the given use case you just have HDD and NVMe, and DB and WAL can 
safely collocate. This means you don't need to allocate a specific volume 
for the WAL, and hence there is no need to answer the question of how much 
space it needs. Simply allocate the DB and the WAL will live there automatically.




As this is an OpenStack/RBD only use-case I would say that 10GB of DB
per 1TB of disk storage is sufficient.


Given the RocksDB level granularity already mentioned in this thread, we tend
to prefer fixed allocation sizes, with 30-60 GB being close to optimal.


Anyway, I suggest using LVM for the DB/WAL volume, and maybe starting with a
smaller size (e.g. 32 GB per OSD), which leaves some spare space on your
NVMes and allows you to add more space if needed. (Just to note -
reclaiming already allocated but still unused space from an existing OSD
to gift it to another/new OSD is a more troublesome task than adding
some space from a spare volume.)
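
As an illustration only (device and volume names here are hypothetical),
carving per-OSD DB logical volumes out of an NVMe and creating an OSD with
one of them could look like:

    vgcreate ceph-db-nvme0 /dev/nvme0n1
    lvcreate -L 32G -n db-osd0 ceph-db-nvme0
    ceph-volume lvm create --data /dev/sdb --block.db ceph-db-nvme0/db-osd0

Leaving the rest of the VG unallocated keeps the spare space mentioned above
available for growing or adding DB volumes later.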



On Wed, 14 Aug 2019 at 05:54, Mark Nelson <mnel...@redhat.com> wrote:

 On 8/13/19 3:51 PM, Paul Emmerich wrote:

 > On Tue, Aug 13, 2019 at 10:04 PM Wido den Hollander <w...@42on.com> wrote:
 >> I just checked an RGW-only setup. 6TB drive, 58% full, 11.2GB of
 DB in
 >> use. No slow db in use.
 > random rgw-only setup here: 12TB drive, 77% full, 48GB metadata and
 > 10GB omap for index and whatever.
 >
 > That's 0.5% + 0.1%. And that's a setup that's using mostly erasure
 > coding and small-ish objects.
 >
 >
 >> I've talked with many people from the community and I don't see an
 >> agreement for the 4% rule.
 > agreed, 4% isn't a reasonable default.
 > I've seen setups with even 10% metadata usage, but these are weird
 > edge cases with very small objects on NVMe-only setups (obviously
 > without a separate DB device).
 >
 > Paul


 I agree, and I did quite a bit of the early space usage analysis.  I
 have a feeling that someone was trying to be well-meaning and make a
 simple ratio for users to target that was big enough to handle the
 majority of use cases.  The problem is that reality isn't that simple
 and one-size-fits all doesn't really work here.


 For RBD you can usually get away with far less than 4%.  A small
 fraction of that is often sufficient.  For tiny (say 4K) RGW objects
 (especially objects with very long names!) you potentially can end up
 using significantly more than 4%. Unfortunately there's no really good
 way for us to normalize this so long as RGW is using OMAP to store
 bucket indexes.  I think the best we can do long run is make it much
 clearer how space is being used on the block/db/wal devices and easier
 for users to shrink/grow the amount of "fast" disk they have on an OSD.
 Alternately we could put bucket indexes into rados objects instead of
 OMAP, but that would be a pretty big project (with its own challenges 
 but potentially also with rewards).


 Mark

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com 
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Thanks and Regards,

Hemant Sonawane


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 14.2.2 - OSD Crash

2019-08-07 Thread Igor Fedotov

Manuel,

well, this is a bit different from the tickets I shared... But still 
looks like slow DB access.


80+ seconds for submit/commit latency is TOO HIGH, this definitely might 
cause suicides...


Have you had a chance to inspect disk utilization?


Introducing an NVMe drive for WAL/DB might be a good idea; I can see up to 
20 GB allocated for META per OSD, so they would fit comfortably into a 480 GB 
NVMe drive.


Having a single drive for all of them isn't perfect from the performance and 
failure domain points of view though... I'd rather prefer 4-6 OSDs per drive...



As a workaround you might also try to disable deep scrubbing.
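
For example (cluster-wide flag, remember to unset it once the investigation
is done):

    ceph osd set nodeep-scrub
    # and later
    ceph osd unset nodeep-scrub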


Thanks,

Igor

On 8/7/2019 2:59 PM, EDH - Manuel Rios Fernandez wrote:


Hi Igor

Yes we got all in same device :

[root@CEPH-MON01 ~]# ceph osd df tree

ID CLASS   WEIGHT    REWEIGHT SIZE    RAW USE DATA    OMAP META 
AVAIL   %USE  VAR  PGS STATUS TYPE NAME


31 130.96783    - 131 TiB 114 TiB 114 TiB  14 MiB  204 GiB 17 TiB 
86.88 1.03   -    host CEPH008


5 archive  10.91399  0.80002  11 TiB 7.9 TiB 7.9 TiB 2.6 MiB   15 GiB 
3.0 TiB 72.65 0.86 181 up osd.5


6 archive  10.91399  1.0  11 TiB 9.4 TiB 9.3 TiB 5.8 MiB   17 GiB 
1.6 TiB 85.76 1.01 222 up osd.6


11 archive  10.91399  1.0  11 TiB  10 TiB  10 TiB  48 KiB   19 GiB 
838 GiB 92.50 1.09 251 up osd.11


45 archive  10.91399  1.0  11 TiB  10 TiB  10 TiB 148 KiB   18 GiB 
678 GiB 93.94 1.11 248 up osd.45


46 archive  10.91399  1.0  11 TiB 9.6 TiB 9.5 TiB 4.7 MiB   17 GiB 
1.4 TiB 87.52 1.04 235 up osd.46


47 archive  10.91399  1.0  11 TiB 8.8 TiB 8.8 TiB  68 KiB   17 GiB 
2.1 TiB 80.43 0.95 211 up osd.47


55 archive  10.91399  1.0  11 TiB 9.9 TiB 9.9 TiB 132 KiB   17 GiB 
1.0 TiB 90.74 1.07 243 up osd.55


70 archive  10.91399  1.0  11 TiB  10 TiB  10 TiB  44 KiB   19 GiB 
864 GiB 92.27 1.09 236 up osd.70


71 archive  10.91399  1.0  11 TiB 9.2 TiB 9.2 TiB  28 KiB   16 GiB 
1.7 TiB 84.19 1.00 228 up osd.71


78 archive  10.91399  1.0  11 TiB 8.9 TiB 8.9 TiB 182 KiB   16 GiB 
2.0 TiB 81.87 0.97 215 up osd.78


79 archive  10.91399  1.0  11 TiB  10 TiB  10 TiB 152 KiB   17 GiB 
958 GiB 91.43 1.08 238 up osd.79


91 archive  10.91399  1.0  11 TiB 9.7 TiB 9.7 TiB  92 KiB   17 GiB 
1.2 TiB 89.22 1.06 232 up osd.91


Disk are HGST of 12TB for archive porpourse.

In the same OSD we got some bluestore commit latency entries in the log:

2019-08-07 06:57:33.681 7f059b06e700  0 
bluestore(/var/lib/ceph/osd/ceph-46) queue_transactions slow operation 
observed for l_bluestore_submit_lat, latency = 11.163s


2019-08-07 06:57:33.703 7f05a8088700  0 
bluestore(/var/lib/ceph/osd/ceph-46) _txc_committed_kv slow operation 
observed for l_bluestore_commit_lat, latency = 11.1858s, txc = 
0x55e9e3ea2c00


2019-08-07 09:14:00.620 7f059d072700  0 
bluestore(/var/lib/ceph/osd/ceph-46) queue_transactions slow operation 
observed for l_bluestore_submit_lat, latency = 7.23777s


2019-08-07 09:14:00.650 7f05a8088700  0 
bluestore(/var/lib/ceph/osd/ceph-46) _txc_committed_kv slow operation 
observed for l_bluestore_commit_lat, latency = 7.26778s, txc = 
0x55eaafbf6600


2019-08-07 09:19:08.242 7f059e875700  0 
bluestore(/var/lib/ceph/osd/ceph-46) queue_transactions slow operation 
observed for l_bluestore_submit_lat, latency = 81.8293s


2019-08-07 09:19:08.291 7f05a8088700  0 
bluestore(/var/lib/ceph/osd/ceph-46) _txc_committed_kv slow operation 
observed for l_bluestore_commit_lat, latency = 81.8609s, txc = 
0x55ea05ee6000


2019-08-07 09:19:08.467 7f059b06e700  0 
bluestore(/var/lib/ceph/osd/ceph-46) queue_transactions slow operation 
observed for l_bluestore_submit_lat, latency = 87.7795s


2019-08-07 09:19:08.481 7f05a8088700  0 
bluestore(/var/lib/ceph/osd/ceph-46) _txc_committed_kv slow operation 
observed for l_bluestore_commit_lat, latency = 87.7928s, txc = 
0x55eaa7a40600


Maybe moving OMAP + META from all OSDs to a 480 GB NVMe per node would help 
in this situation, but I'm not sure.


Manuel

From: Igor Fedotov
Sent: Wednesday, 7 August 2019 13:10
To: EDH - Manuel Rios Fernandez; 'Ceph Users'

Subject: Re: [ceph-users] 14.2.2 - OSD Crash

Hi Manuel,

as Brad pointed out timeouts and suicides are rather consequences of 
some other issues with OSDs.


I recall at least two recent relevant tickets:

https://tracker.ceph.com/issues/36482

https://tracker.ceph.com/issues/40741 (see last comments)

Both had massive and slow reads from RocksDB which caused timeouts..

Visible symptom for both cases was  unexpectedly high read I/O from 
underlying disks (main and/or DB).


You can use iotop for inspection...

These were worsened by having significant part of DB at spinners due 
to spillovers. So wondering what's your layout in this respect:


what drives back troublesome OSDs, is there any spillover to slow 
device, how massive it is?


Also could you please inspect your OSD logs for the presence of lines 
containing the "slow operation observed" substring, and share them if any.

Re: [ceph-users] 14.2.2 - OSD Crash

2019-08-07 Thread Igor Fedotov

Hi Manuel,

as Brad pointed out timeouts and suicides are rather consequences of 
some other issues with OSDs.


I recall at least two recent relevant tickets:

https://tracker.ceph.com/issues/36482

https://tracker.ceph.com/issues/40741 (see last comments)

Both had massive and slow reads from RocksDB which caused timeouts..

Visible symptom for both cases was  unexpectedly high read I/O from 
underlying disks (main and/or DB).


You can use iotop for inspection...

These were worsened by having significant part of DB at spinners due to 
spillovers. So wondering what's your layout in this respect:


what drives back troublesome OSDs, is there any spillover to slow 
device, how massive it is?


Also could you please inspect your OSD logs for the presence of lines 
containing the "slow operation observed" substring, and share them if any.
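
As a rough sketch of both checks (osd ids and paths are placeholders):
spillover and DB usage can be read from the BlueFS perf counters, and the
slow operation warnings can be grepped from the OSD logs:

    ceph daemon osd.<id> perf dump bluefs | grep -E 'db_used_bytes|slow_used_bytes'
    grep "slow operation observed" /var/log/ceph/ceph-osd.<id>.log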



Hope this helps.

Thanks,

Igor



On 8/7/2019 2:16 AM, EDH - Manuel Rios Fernandez wrote:


Hi

We got a pair of OSD located in  node that crash randomly since 14.2.2

OS Version : Centos 7.6

There’re a ton of lines before crash , I will unespected:

--

3045> 2019-08-07 00:39:32.013 7fe9a4996700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7fe987e49700' had timed out after 15


-3044> 2019-08-07 00:39:32.013 7fe9a3994700  1 heartbeat_map 
is_healthy 'OSD::osd_op_tp thread 0x7fe987e49700' had timed out after 15


-3043> 2019-08-07 00:39:32.033 7fe9a4195700  1 heartbeat_map 
is_healthy 'OSD::osd_op_tp thread 0x7fe987e49700' had timed out after 15


-3042> 2019-08-07 00:39:32.033 7fe9a4996700  1 heartbeat_map 
is_healthy 'OSD::osd_op_tp thread 0x7fe987e49700' had timed out after 15


--

-

Some hundred lines of:

-164> 2019-08-07 00:47:36.628 7fe9a3994700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7fe98964c700' had timed out after 60


  -163> 2019-08-07 00:47:36.632 7fe9a3994700  1 heartbeat_map 
is_healthy 'OSD::osd_op_tp thread 0x7fe98964c700' had timed out after 60


  -162> 2019-08-07 00:47:36.632 7fe9a3994700  1 heartbeat_map 
is_healthy 'OSD::osd_op_tp thread 0x7fe98964c700' had timed out after 60


-

   -78> 2019-08-07 00:50:51.755 7fe995bfa700 10 monclient: tick

   -77> 2019-08-07 00:50:51.755 7fe995bfa700 10 monclient: 
_check_auth_rotating have uptodate secrets (they expire after 
2019-08-07 00:50:21.756453)


   -76> 2019-08-07 00:51:01.755 7fe995bfa700 10 monclient: tick

   -75> 2019-08-07 00:51:01.755 7fe995bfa700 10 monclient: 
_check_auth_rotating have uptodate secrets (they expire after 
2019-08-07 00:50:31.756604)


   -74> 2019-08-07 00:51:11.755 7fe995bfa700 10 monclient: tick

   -73> 2019-08-07 00:51:11.755 7fe995bfa700 10 monclient: 
_check_auth_rotating have uptodate secrets (they expire after 
2019-08-07 00:50:41.756788)


   -72> 2019-08-07 00:51:21.756 7fe995bfa700 10 monclient: tick

   -71> 2019-08-07 00:51:21.756 7fe995bfa700 10 monclient: 
_check_auth_rotating have uptodate secrets (they expire after 
2019-08-07 00:50:51.756982)


   -70> 2019-08-07 00:51:31.755 7fe995bfa700 10 monclient: tick

   -69> 2019-08-07 00:51:31.755 7fe995bfa700 10 monclient: 
_check_auth_rotating have uptodate secrets (they expire after 
2019-08-07 00:51:01.757206)


   -68> 2019-08-07 00:51:41.756 7fe995bfa700 10 monclient: tick

   -67> 2019-08-07 00:51:41.756 7fe995bfa700 10 monclient: 
_check_auth_rotating have uptodate secrets (they expire after 
2019-08-07 00:51:11.757364)


   -66> 2019-08-07 00:51:51.756 7fe995bfa700 10 monclient: tick

   -65> 2019-08-07 00:51:51.756 7fe995bfa700 10 monclient: 
_check_auth_rotating have uptodate secrets (they expire after 
2019-08-07 00:51:21.757535)


   -64> 2019-08-07 00:51:52.861 7fe987e49700  1 heartbeat_map 
clear_timeout 'OSD::osd_op_tp thread 0x7fe987e49700' had timed out 
after 15


   -63> 2019-08-07 00:51:52.861 7fe987e49700  1 heartbeat_map 
clear_timeout 'OSD::osd_op_tp thread 0x7fe987e49700' had suicide timed 
out after 150


   -62> 2019-08-07 00:51:52.948 7fe99966c700  5 
bluestore.MempoolThread(0x55ff04ad6a88) _tune_cache_size target: 
4294967296 heap: 6018998272 unmapped: 1721180160 mapped: 4297818112 
old cache_size: 1994018210 new cache size: 1992784572


   -61> 2019-08-07 00:51:52.948 7fe99966c700  5 
bluestore.MempoolThread(0x55ff04ad6a88) _trim_shards cache_size: 
1992784572 kv_alloc: 763363328 kv_used: 749381098 meta_alloc: 
763363328 meta_used: 654593191 data_alloc: 452984832 data_used: 455929856


   -60> 2019-08-07 00:51:57.923 7fe99966c700  5 
bluestore.MempoolThread(0x55ff04ad6a88) _trim_shards cache_size: 
1994110827 kv_alloc: 763363328 kv_used: 749381098 meta_alloc: 
763363328 meta_used: 654590799 data_alloc: 452984832 data_used: 451538944


   -59> 2019-08-07 00:51:57.973 7fe99966c700  5 
bluestore.MempoolThread(0x55ff04ad6a88) _tune_cache_size target: 
4294967296 heap: 6018998272 unmapped: 1725702144 mapped: 4293296128 
old cache_size: 1994110827 new cache size: 1994442069


   -58> 2019-08-07 00:52:01.756 7fe995bfa700 10 monclient: tick

   -57> 

Re: [ceph-users] Wrong ceph df result

2019-07-30 Thread Igor Fedotov

Hi Sylvain,

have you upgraded to Nautilus recently?

Have you added/repaired any OSDs since then?

If so then you're facing a known issue caused by a mixture of legacy and 
new approaches to collect pool statistics.


Sage shared detailed information on the issue in this mailing list under 
"Pool stats issue with upgrades to nautilus" topic.



Thanks,

Igor


On 7/30/2019 1:03 AM, Sylvain PORTIER wrote:

Hi,

When I get my ceph status, I do not understand the result:

ceph df detail
RAW STORAGE:
    CLASS SIZE    AVAIL   USED   RAW USED %RAW USED
    hdd   131 TiB 102 TiB 29 TiB   29 TiB 21.98
    TOTAL 131 TiB 102 TiB 29 TiB   29 TiB 21.98

POOLS:
    POOL    ID STORED   OBJECTS USED     %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR
    rbd      4 213 B         15 64 KiB       0    29 TiB N/A           N/A            15 0 B        0 B
    Backup   6 6.3 TiB    4.96M 9.6 TiB   9.95    58 TiB N/A           N/A         4.96M 0 B        0 B


I have only one image (rados) of 500 TiB on the pool Backup, and I 
really have 19TiB used on this drive (xfs).


What's wrong???

For 19 TiB used, ceph shows for Backup pool 9.6 TiB used ???

Thank you

Sylvain.


---
This email has been checked for viruses by Avast antivirus software.

https://www.avast.com/antivirus

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding block.db afterwards

2019-07-26 Thread Igor Fedotov

Hi Frank,

you can specify new db size in the following way:

CEPH_ARGS="--bluestore-block-db-size 107374182400" ceph-bluestore-tool 
bluefs-bdev-new-db 
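
Combining that with the path and target device from your original command,
the full invocation would look roughly like this (run with the OSD stopped;
the size below is 100 GB, adjust as needed):

    CEPH_ARGS="--bluestore-block-db-size 107374182400" \
        ceph-bluestore-tool bluefs-bdev-new-db \
        --path /var/lib/ceph/osd/ceph-35 --dev-target /dev/sdaa1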



Thanks,

Igor

On 7/26/2019 2:49 PM, Frank Rothenstein wrote:

Hi,

I'm running a small (3 hosts) ceph cluster. ATM I want to speed up my
cluster by adding separate block.db SSDs. The OSDs at creation were pure
spinning HDDs, with no "--block.db /dev/sdxx" parameter, so there is no
block.db symlink in /var/lib/ceph/osd/ceph-xx/
There is a "ceph-bluestore-tool bluefs-bdev-new-db ..." command which should
check for the existence of block.db and create a new one.
Documentation for this tool is fairly basic, the error output as well, so I'm
stuck. (The only way seems to be destroying and recreating the OSD with the
block.db parameter.)

CLI:

ceph-bluestore-tool bluefs-bdev-new-db --path /var/lib/ceph/osd/ceph-35
--dev-target /dev/sdaa1
Output:

inferring bluefs devices from bluestore path
DB size isn't specified, please set Ceph bluestore-block-db-size config
parameter

for DB size I tried "bluestore_block_db_size = 32212254720" in
ceph.conf
for path I tried different versions

Any help an this would be appreciated.

Frank




Frank Rothenstein

Systemadministrator
Fon: +49 3821 700 125
Fax: +49 3821 700 190
Internet: www.bodden-kliniken.de
E-Mail: f.rothenst...@bodden-kliniken.de


_
BODDEN-KLINIKEN Ribnitz-Damgarten GmbH
Sandhufe 2
18311 Ribnitz-Damgarten

Telefon: 03821-700-0
Telefax: 03821-700-240

E-Mail: i...@bodden-kliniken.de
Internet: http://www.bodden-kliniken.de

Registered office: Ribnitz-Damgarten, District court: Stralsund, HRB 2919,
Tax no.: 079/133/40188
Chairwoman of the supervisory board: Carmen Schröter, Managing directors:
Dr. Falko Milski, MBA; Dipl.-Kfm. (FH) Gunnar Bölke


The content of this e-mail is intended exclusively for the named addressee.
If you are not the intended addressee of this e-mail or their representative,
please note that any form of publication, reproduction or forwarding of the
content of this e-mail is not permitted.
We kindly ask you to inform the sender immediately and to delete the e-mail.


      © BODDEN-KLINIKEN Ribnitz-Damgarten GmbH 2019
*** Virus-free thanks to Kerio Mail Server AntiSPAM and Bitdefender
Antivirus ***



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Repair statsfs fail some osd 14.2.1 to 14.2.2

2019-07-23 Thread Igor Fedotov

Hi Manuel,

this looks like either corrupted data in the BlueStore database or a 
memory-related (some leakage?) issue.



This is reproducible, right?

Could you please file a ticket in the upstream tracker, rerun the repair with 
debug bluestore set to 5/20, and upload the corresponding log.
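
A possible way to do that in one go (the log path is arbitrary; the OSD path
is the one from your report):

    CEPH_ARGS="--debug-bluestore 5/20 --log-file /var/log/ceph/osd.10-repair.log" \
        ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-10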


Please observe memory consumption along the process too.


Thanks,

Igor


On 7/23/2019 11:28 AM, EDH - Manuel Rios Fernandez wrote:


Hi Ceph,

Upgraded last night from 14.2.1 to 14.2.2; 36 OSDs with old stats. 
We're still repairing stats one by one, but one failed.


Hope this helps.

CentOS Version: Linux CEPH006 3.10.0-957.10.1.el7.x86_64 #1 SMP Mon 
Mar 18 15:06:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux


[root@CEPH006 ~]# ceph-bluestore-tool repair --path 
/var/lib/ceph/osd/ceph-10


src/central_freelist.cc:333] tcmalloc: allocation failed 8192

terminate called after throwing an instance of 'ceph::buffer::bad_alloc'

  what(): buffer::bad_alloc

*** Caught signal (Aborted) **

in thread 7f823c8e3f00 thread_name:ceph-bluestore-

ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
nautilus (stable)


1: (()+0xf5d0) [0x7f8230dab5d0]

2: (gsignal()+0x37) [0x7f822f5762c7]

3: (abort()+0x148) [0x7f822f5779b8]

4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f822fe857d5]

5: (()+0x5e746) [0x7f822fe83746]

6: (()+0x5e773) [0x7f822fe83773]

7: (()+0x5e993) [0x7f822fe83993]

8: (()+0x250478) [0x7f82328c7478]

9: (ceph::buffer::create_aligned_in_mempool(unsigned int, unsigned 
int, int)+0x2b1) [0x7f8232bf6791]


10: (ceph::buffer::create_aligned(unsigned int, unsigned int)+0x22) 
[0x7f8232bf6812]


11: (ceph::buffer::copy(char const*, unsigned int)+0x2c) [0x7f8232bf71cc]

12: (BlueStore::Blob::decode(BlueStore::Collection*, 
ceph::buffer::v14_2_0::ptr::iterator_impl&, unsigned long, 
unsigned long*, bool)+0x23e) [0x55ba137eafce]


13: 
(BlueStore::ExtentMap::decode_some(ceph::buffer::v14_2_0::list&)+0x8d6) 
[0x55ba137f3536]


14: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, 
unsigned int)+0x2b2) [0x55ba137f3c82]


15: (BlueStore::_fsck(bool, bool)+0x22a5) [0x55ba138577e5]

16: (main()+0x107e) [0x55ba136b3ece]

17: (__libc_start_main()+0xf5) [0x7f822f562495]

18: (()+0x27321f) [0x55ba1379b21f]

2019-07-23 10:14:57.156 7f823c8e3f00 -1 *** Caught signal (Aborted) **

in thread 7f823c8e3f00 thread_name:ceph-bluestore-

ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
nautilus (stable)


1: (()+0xf5d0) [0x7f8230dab5d0]

2: (gsignal()+0x37) [0x7f822f5762c7]

3: (abort()+0x148) [0x7f822f5779b8]

4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f822fe857d5]

5: (()+0x5e746) [0x7f822fe83746]

6: (()+0x5e773) [0x7f822fe83773]

7: (()+0x5e993) [0x7f822fe83993]

8: (()+0x250478) [0x7f82328c7478]

9: (ceph::buffer::create_aligned_in_mempool(unsigned int, unsigned 
int, int)+0x2b1) [0x7f8232bf6791]


10: (ceph::buffer::create_aligned(unsigned int, unsigned int)+0x22) 
[0x7f8232bf6812]


11: (ceph::buffer::copy(char const*, unsigned int)+0x2c) [0x7f8232bf71cc]

12: (BlueStore::Blob::decode(BlueStore::Collection*, 
ceph::buffer::v14_2_0::ptr::iterator_impl&, unsigned long, 
unsigned long*, bool)+0x23e) [0x55ba137eafce]


13: 
(BlueStore::ExtentMap::decode_some(ceph::buffer::v14_2_0::list&)+0x8d6) 
[0x55ba137f3536]


14: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, 
unsigned int)+0x2b2) [0x55ba137f3c82]


15: (BlueStore::_fsck(bool, bool)+0x22a5) [0x55ba138577e5]

16: (main()+0x107e) [0x55ba136b3ece]

17: (__libc_start_main()+0xf5) [0x7f822f562495]

18: (()+0x27321f) [0x55ba1379b21f]

NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.


terminate called recursively

Aborted

---

CEPH Startup fail osd 10 fail.

ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
nautilus (stable)


1: (()+0xf5d0) [0x7f00ee9045d0]

2: (gsignal()+0x37) [0x7f00ed6f42c7]

3: (abort()+0x148) [0x7f00ed6f59b8]

4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f00ee0037d5]

5: (()+0x5e746) [0x7f00ee001746]

6: (()+0x5e773) [0x7f00ee001773]

7: (__cxa_rethrow()+0x49) [0x7f00ee0019e9]

8: (std::_Hashtablepg_log_dup_t*>, std::allocatorpg_log_dup_t*> >, std::__detail::_Select1st, 
std::equal_to, std::hash, 
std::__detail::_Mod_range_hashing, 
std::__detail::_Default_ranged_hash, 
std::__detail::_Prime_rehash_policy, 
std::__detail::_Hashtable_traits 
>::_M_insert_unique_node(unsigned long, unsigned long, 
std::__detail::_Hash_node, 
true>*)+0xfd) [0x55e0c6412a8d]


9: (std::__detail::_Map_basepg_log_dup_t*>, std::allocatorpg_log_dup_t*> >, std::__detail::_Select1st, 
std::equal_to, std::hash, 
std::__detail::_Mod_range_hashing, 
std::__detail::_Default_ranged_hash, 
std::__detail::_Prime_rehash_policy, 
std::__detail::_Hashtable_traits, 
true>::operator[](osd_reqid_t const&)+0x99) [0x55e0c64478c9]


10: (PGLog::merge_log_dups(pg_log_t const&)+0x328) [0x55e0c6441d28]

11: (PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, 

Re: [ceph-users] disk usage reported incorrectly

2019-07-17 Thread Igor Fedotov

Fix is on its way too...

See https://github.com/ceph/ceph/pull/28978

On 7/17/2019 8:55 PM, Paul Mezzanini wrote:

Oh my.  That's going to hurt with 788 OSDs.   Time for some creative shell 
scripts and stepping through the nodes.  I'll report back.
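
(For what it's worth, such a per-node script might look roughly like the
sketch below - it assumes the stats repair is done with "ceph-bluestore-tool
repair", and that OSDs are taken down one at a time with noout set, waiting
for the cluster to settle in between:)

    ceph osd set noout
    for osd in /var/lib/ceph/osd/ceph-*; do
        id=${osd##*-}
        systemctl stop ceph-osd@"$id"
        ceph-bluestore-tool repair --path "$osd"
        systemctl start ceph-osd@"$id"
    done
    ceph osd unset noout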

--
Paul Mezzanini
Sr Systems Administrator / Engineer, Research Computing
Information & Technology Services
Finance & Administration
Rochester Institute of Technology
o:(585) 475-3245 | pfm...@rit.edu

CONFIDENTIALITY NOTE: The information transmitted, including attachments, is
intended only for the person(s) or entity to which it is addressed and may
contain confidential and/or privileged material. Any review, retransmission,
dissemination or other use of, or taking of any action in reliance upon this
information by persons or entities other than the intended recipient is
prohibited. If you received this in error, please contact the sender and
destroy any copies of this information.


____
From: Igor Fedotov 
Sent: Wednesday, July 17, 2019 11:33 AM
To: Paul Mezzanini; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] disk usage reported incorrectly

Forgot to provide a workaround...

If that's the case then you need to repair each OSD with corresponding
command in ceph-objectstore-tool...

Thanks,

Igor.


On 7/17/2019 6:29 PM, Paul Mezzanini wrote:

Sometime after our upgrade to Nautilus our disk usage statistics went off the 
rails wrong.  I can't tell you exactly when it broke but I know that after the 
initial upgrade it worked at least for a bit.

Correct numbers should be something similar to: (These are copy/pasted from the 
autoscale-status report)

POOL              SIZE
cephfs_metadata   327.1G
cold-ec           98.36T
ceph-bulk-3r      142.6T
cephfs_data       31890G
ceph-hot-2r       5276G
kgcoe-cinder      103.2T
rbd               3098


Instead, we now show:

POOL              SIZE
cephfs_metadata   362.9G   (correct)
cold-ec           607.2G   (wrong)
ceph-bulk-3r      5186G    (wrong)
cephfs_data       1654G    (wrong)
ceph-hot-2r       5884G    (correct I think)
kgcoe-cinder      5761G    (wrong)
rbd               128.0k


`ceph fs status` reports similar numbers.  cold-ec, ceph-hot-2r and cephfs_data 
are all cephfs data pools and cephfs_metadata is unsurprisingly, cephfs 
metadata.  The remaining pools are all used for rbd.


Interestingly, the `ceph df` output for raw storage feels correct for each 
drive class while the pool usage is wrong:

RAW STORAGE:
  CLASS SIZEAVAIL   USEDRAW USED %RAW USED
  hdd   6.3 PiB 5.2 PiB 1.1 PiB  1.1 PiB 17.08
  nvme  175 TiB 161 TiB  14 TiB   14 TiB  7.82
  nvme-meta  14 TiB  11 TiB 2.2 TiB  2.5 TiB 18.45
  TOTAL 6.5 PiB 5.4 PiB 1.1 PiB  1.1 PiB 16.84

POOLS:
  POOLID STORED  OBJECTS USED%USED 
MAX AVAIL
  kgcoe-cinder24 1.9 TiB  29.49M 5.6 TiB  0.32  
 582 TiB
  ceph-bulk-3r32 1.7 TiB  88.28M 5.1 TiB  0.29  
 582 TiB
  cephfs_data 35 518 GiB 135.68M 1.6 TiB  0.09  
 582 TiB
  cephfs_metadata 36 363 GiB   5.63M 363 GiB  3.35  
 3.4 TiB
  rbd 37   931 B   5 128 KiB 0  
 582 TiB
  ceph-hot-2r 50 5.7 TiB  18.63M 5.7 TiB  3.72  
  74 TiB
  cold-ec 51 417 GiB 105.23M 607 GiB  0.02  
 2.1 PiB


Everything is on "ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) 
nautilus (stable)" and kernel 5.0.21 or 5.0.9.  I'm actually doing the patching now 
to pull the ceph cluster up to 5.0.21, same as the clients.  I'm not really sure where to 
dig into this one.  Everything is working fine except disk usage reporting.  This also 
completely blows up the autoscaler.

I feel like the question is obvious but I'll state it anyway.  How do I get 
this issue resolved?

Thanks
-paul

--
Paul Mezzanini
Sr Systems Administrator / Engineer, Research Computing
Information & Technology Services
Finance & Administration
Rochester Institute of Technology
o:(585) 475-3245 | pfm...@rit.edu

CONFIDENTIALITY NOTE: The information transmitted, including attachments, is
intended only for the person(s) or entity to which it is addressed and may
contain confidential and/or privileged material. Any review, retransmission,
dissemination or other use of, or taking of any action in reliance upon this
information by persons or entities other than the intended recipient is
prohibited. If you received this in error, please contact the sender and
destroy any copies of this information.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

__

Re: [ceph-users] disk usage reported incorrectly

2019-07-17 Thread Igor Fedotov

Forgot to provide a workaround...

If that's the case then you need to repair each OSD with corresponding 
command in ceph-objectstore-tool...


Thanks,

Igor.


On 7/17/2019 6:29 PM, Paul Mezzanini wrote:

Sometime after our upgrade to Nautilus our disk usage statistics went off the 
rails wrong.  I can't tell you exactly when it broke but I know that after the 
initial upgrade it worked at least for a bit.

Correct numbers should be something similar to: (These are copy/pasted from the 
autoscale-status report)

POOL              SIZE
cephfs_metadata   327.1G
cold-ec           98.36T
ceph-bulk-3r      142.6T
cephfs_data       31890G
ceph-hot-2r       5276G
kgcoe-cinder      103.2T
rbd               3098


Instead, we now show:

POOL              SIZE
cephfs_metadata   362.9G   (correct)
cold-ec           607.2G   (wrong)
ceph-bulk-3r      5186G    (wrong)
cephfs_data       1654G    (wrong)
ceph-hot-2r       5884G    (correct I think)
kgcoe-cinder      5761G    (wrong)
rbd               128.0k


`ceph fs status` reports similar numbers.  cold-ec, ceph-hot-2r and cephfs_data 
are all cephfs data pools and cephfs_metadata is unsurprisingly, cephfs 
metadata.  The remaining pools are all used for rbd.


Interestingly, the `ceph df` output for raw storage feels correct for each 
drive class while the pool usage is wrong:

RAW STORAGE:
 CLASS SIZEAVAIL   USEDRAW USED %RAW USED
 hdd   6.3 PiB 5.2 PiB 1.1 PiB  1.1 PiB 17.08
 nvme  175 TiB 161 TiB  14 TiB   14 TiB  7.82
 nvme-meta  14 TiB  11 TiB 2.2 TiB  2.5 TiB 18.45
 TOTAL 6.5 PiB 5.4 PiB 1.1 PiB  1.1 PiB 16.84
  
POOLS:

 POOLID STORED  OBJECTS USED%USED 
MAX AVAIL
 kgcoe-cinder24 1.9 TiB  29.49M 5.6 TiB  0.32   
582 TiB
 ceph-bulk-3r32 1.7 TiB  88.28M 5.1 TiB  0.29   
582 TiB
 cephfs_data 35 518 GiB 135.68M 1.6 TiB  0.09   
582 TiB
 cephfs_metadata 36 363 GiB   5.63M 363 GiB  3.35   
3.4 TiB
 rbd 37   931 B   5 128 KiB 0   
582 TiB
 ceph-hot-2r 50 5.7 TiB  18.63M 5.7 TiB  3.72   
 74 TiB
 cold-ec 51 417 GiB 105.23M 607 GiB  0.02   
2.1 PiB


Everything is on "ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) 
nautilus (stable)" and kernel 5.0.21 or 5.0.9.  I'm actually doing the patching now 
to pull the ceph cluster up to 5.0.21, same as the clients.  I'm not really sure where to 
dig into this one.  Everything is working fine except disk usage reporting.  This also 
completely blows up the autoscaler.

I feel like the question is obvious but I'll state it anyway.  How do I get 
this issue resolved?

Thanks
-paul

--
Paul Mezzanini
Sr Systems Administrator / Engineer, Research Computing
Information & Technology Services
Finance & Administration
Rochester Institute of Technology
o:(585) 475-3245 | pfm...@rit.edu

CONFIDENTIALITY NOTE: The information transmitted, including attachments, is
intended only for the person(s) or entity to which it is addressed and may
contain confidential and/or privileged material. Any review, retransmission,
dissemination or other use of, or taking of any action in reliance upon this
information by persons or entities other than the intended recipient is
prohibited. If you received this in error, please contact the sender and
destroy any copies of this information.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] disk usage reported incorrectly

2019-07-17 Thread Igor Fedotov

Hi Paul,

there was a post from Sage named "Pool stats issue with upgrades to 
nautilus" recently.


Perhaps that's the case if you have added a new OSD or repaired an existing one...


Thanks,

Igor


On 7/17/2019 6:29 PM, Paul Mezzanini wrote:


Sometime after our upgrade to Nautilus our disk usage statistics went off the 
rails wrong.  I can't tell you exactly when it broke but I know that after the 
initial upgrade it worked at least for a bit.

Correct numbers should be something similar to: (These are copy/pasted from the 
autoscale-status report)

POOL              SIZE
cephfs_metadata   327.1G
cold-ec           98.36T
ceph-bulk-3r      142.6T
cephfs_data       31890G
ceph-hot-2r       5276G
kgcoe-cinder      103.2T
rbd               3098


Instead, we now show:

POOL              SIZE
cephfs_metadata   362.9G   (correct)
cold-ec           607.2G   (wrong)
ceph-bulk-3r      5186G    (wrong)
cephfs_data       1654G    (wrong)
ceph-hot-2r       5884G    (correct I think)
kgcoe-cinder      5761G    (wrong)
rbd               128.0k


`ceph fs status` reports similar numbers.  cold-ec, ceph-hot-2r and cephfs_data 
are all cephfs data pools and cephfs_metadata is unsurprisingly, cephfs 
metadata.  The remaining pools are all used for rbd.


Interestingly, the `ceph df` output for raw storage feels correct for each 
drive class while the pool usage is wrong:

RAW STORAGE:
 CLASS SIZEAVAIL   USEDRAW USED %RAW USED
 hdd   6.3 PiB 5.2 PiB 1.1 PiB  1.1 PiB 17.08
 nvme  175 TiB 161 TiB  14 TiB   14 TiB  7.82
 nvme-meta  14 TiB  11 TiB 2.2 TiB  2.5 TiB 18.45
 TOTAL 6.5 PiB 5.4 PiB 1.1 PiB  1.1 PiB 16.84
  
POOLS:

 POOLID STORED  OBJECTS USED%USED 
MAX AVAIL
 kgcoe-cinder24 1.9 TiB  29.49M 5.6 TiB  0.32   
582 TiB
 ceph-bulk-3r32 1.7 TiB  88.28M 5.1 TiB  0.29   
582 TiB
 cephfs_data 35 518 GiB 135.68M 1.6 TiB  0.09   
582 TiB
 cephfs_metadata 36 363 GiB   5.63M 363 GiB  3.35   
3.4 TiB
 rbd 37   931 B   5 128 KiB 0   
582 TiB
 ceph-hot-2r 50 5.7 TiB  18.63M 5.7 TiB  3.72   
 74 TiB
 cold-ec 51 417 GiB 105.23M 607 GiB  0.02   
2.1 PiB


Everything is on "ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) 
nautilus (stable)" and kernel 5.0.21 or 5.0.9.  I'm actually doing the patching now 
to pull the ceph cluster up to 5.0.21, same as the clients.  I'm not really sure where to 
dig into this one.  Everything is working fine except disk usage reporting.  This also 
completely blows up the autoscaler.

I feel like the question is obvious but I'll state it anyway.  How do I get 
this issue resolved?

Thanks
-paul

--
Paul Mezzanini
Sr Systems Administrator / Engineer, Research Computing
Information & Technology Services
Finance & Administration
Rochester Institute of Technology
o:(585) 475-3245 | pfm...@rit.edu

CONFIDENTIALITY NOTE: The information transmitted, including attachments, is
intended only for the person(s) or entity to which it is addressed and may
contain confidential and/or privileged material. Any review, retransmission,
dissemination or other use of, or taking of any action in reliance upon this
information by persons or entities other than the intended recipient is
prohibited. If you received this in error, please contact the sender and
destroy any copies of this information.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 3 OSDs stopped and unable to restart

2019-07-12 Thread Igor Fedotov

Left some notes in the ticket..

On 7/11/2019 10:32 PM, Brett Chancellor wrote:
We moved the .rgw.meta pool over to SSD to try and improve 
performance; during the backfill the SSDs began dying en masse. Log attached 
to this case:

https://tracker.ceph.com/issues/40741

Right now the SSD's wont come up with either allocator and the cluster 
is pretty much dead.


What are the consequences of deleting the .rgw.meta pool? Can it be 
recreated?


On Wed, Jul 10, 2019 at 3:31 PM ifedo...@suse.de wrote:


You might want to try manual rocksdb compaction using
ceph-kvstore-tool..
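
A minimal sketch of that, assuming the OSD is stopped first (the osd id is a
placeholder):

    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> compact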

Sent from my Huawei tablet


 Original Message 
Subject: Re: [ceph-users] 3 OSDs stopped and unable to restart
From: Brett Chancellor
    To: Igor Fedotov
CC: Ceph Users

Once backfilling finished, the cluster was super slow,
most osd's were filled with heartbeat_map errors.  When an
OSD restarts it causes a cascade of other osd's to follow
suit and restart.. logs like..
  -3> 2019-07-10 18:34:50.046 7f34abf5b700 -1 osd.69
1348581 get_health_metrics reporting 21 slow ops, oldest
is osd_op(client.115295041.0:17575966 15.c37fa482
15.c37fa482 (undecoded)
ack+ondisk+write+known_if_redirected e1348522)
    -2> 2019-07-10 18:34:50.967 7f34acf5d700  1
heartbeat_map is_healthy 'OSD::osd_op_tp thread
0x7f3493f2b700' had timed out after 90
    -1> 2019-07-10 18:34:50.967 7f34acf5d700  1
heartbeat_map is_healthy 'OSD::osd_op_tp thread
0x7f3493f2b700' had suicide timed out after 150
     0> 2019-07-10 18:34:51.025 7f3493f2b700 -1 *** Caught
signal (Aborted) **
 in thread 7f3493f2b700 thread_name:tp_osd_tp

 ceph version 14.2.1
(d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
 1: (()+0xf5d0) [0x7f34b57c25d0]
 2: (pread64()+0x33) [0x7f34b57c1f63]
 3: (KernelDevice::read_random(unsigned long, unsigned
long, char*, bool)+0x238) [0x55bfdae5a448]
 4: (BlueFS::_read_random(BlueFS::FileReader*, unsigned
long, unsigned long, char*)+0xca) [0x55bfdae1271a]
 5: (BlueRocksRandomAccessFile::Read(unsigned long,
unsigned long, rocksdb::Slice*, char*) const+0x20)
[0x55bfdae3b440]
 6: (rocksdb::RandomAccessFileReader::Read(unsigned long,
unsigned long, rocksdb::Slice*, char*) const+0x960)
[0x55bfdb466ba0]
 7: (rocksdb::BlockFetcher::ReadBlockContents()+0x3e7)
[0x55bfdb420c27]
 8: (()+0x11146a4) [0x55bfdb40d6a4]
 9:

(rocksdb::BlockBasedTable::MaybeLoadDataBlockToCache(rocksdb::FilePrefetchBuffer*,
rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions
const&, rocksdb::BlockHandle const&, rocksdb::Slice,
rocksdb::BlockBasedTable::CachableEntry*,
bool, rocksdb::GetContext*)+0x2cc) [0x55bfdb40f63c]
 10: (rocksdb::DataBlockIter*

rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTable::Rep*,
rocksdb::ReadOptions const&, rocksdb::BlockHandle const&,
rocksdb::DataBlockIter*, bool, bool, bool,
rocksdb::GetContext*, rocksdb::Status,
rocksdb::FilePrefetchBuffer*)+0x169) [0x55bfdb41cb29]
 11:
(rocksdb::BlockBasedTableIterator::InitDataBlock()+0xc8) [0x55bfdb41e588]
 12:
(rocksdb::BlockBasedTableIterator::FindKeyForward()+0x8d) [0x55bfdb41e89d]
 13: (()+0x10adde9) [0x55bfdb3a6de9]
 14: (rocksdb::MergingIterator::Next()+0x44) [0x55bfdb4357c4]
 15: (rocksdb::DBIter::FindNextUserEntryInternal(bool,
bool)+0x762) [0x55bfdb32a092]
 16: (rocksdb::DBIter::Next()+0x1d6) [0x55bfdb32b6c6]
 17:
(RocksDBStore::RocksDBWholeSpaceIteratorImpl::next()+0x2d)
[0x55bfdad9fa8d]
 18: (BlueStore::_collection_list(BlueStore::Collection*,
ghobject_t const&, ghobject_t const&, int,
std::vector >*,
ghobject_t*)+0xdf6) [0x55bfdad12466]
 19:

(BlueStore::collection_list(boost::intrusive_ptr&,
ghobject_t const&, ghobject_t const&, int,
std::vector >*,
ghobject_t*)+0x9b) [0x55bfdad1393b]
 20: (PG::_delete_some(ObjectStore::Transaction*)+0x1e0)
[0x55bfda984120]
 21: (PG::RecoveryState::Deleting::react(PG::DeleteSome
const&)+0x38) [0x55bfda985598]
 22:
(boost::statechart::simple_state,

(boost::statechart::history_mode)0>::react_

Re: [ceph-users] 3 OSDs stopped and unable to restart

2019-07-09 Thread Igor Fedotov
This will cap a single BlueFS space allocation. Currently it attempts to 
allocate 70 GB, which seems to overflow some 32-bit length fields. With 
the adjustment such an allocation should be capped at ~700 MB.


I doubt there is any relation between this specific failure and the 
pool. At least at the moment.


In short the history is: the starting OSD tries to flush BlueFS data to 
disk, detects a lack of space and asks for more from the main device - the 
allocation succeeds but the returned extent has its length field set to 0.


On 7/9/2019 8:33 PM, Brett Chancellor wrote:
What does bluestore_bluefs_gift_ratio do?  I can't find any 
documentation on it.  Also do you think this could be related to the 
.rgw.meta pool having too many objects per PG? The disks that die 
always seem to be backfilling a pg from that pool, and they have ~550k 
objects per PG.


-Brett

On Tue, Jul 9, 2019 at 1:03 PM Igor Fedotov <ifedo...@suse.de> wrote:


Please try to set bluestore_bluefs_gift_ratio to 0.0002


On 7/9/2019 7:39 PM, Brett Chancellor wrote:

Too large for pastebin.. The problem is continually crashing new
OSDs. Here is the latest one.

On Tue, Jul 9, 2019 at 11:46 AM Igor Fedotov <ifedo...@suse.de> wrote:

could you please set debug bluestore to 20 and collect
startup log for this specific OSD once again?


On 7/9/2019 6:29 PM, Brett Chancellor wrote:

I restarted most of the OSDs with the stupid allocator (6 of
them wouldn't start unless bitmap allocator was set), but
I'm still seeing issues with OSDs crashing.  Interestingly
it seems that the dying OSDs are always working on a pg from
the .rgw.meta pool when they crash.

Log : https://pastebin.com/yuJKcPvX

On Tue, Jul 9, 2019 at 5:14 AM Igor Fedotov <ifedo...@suse.de> wrote:

Hi Brett,

in Nautilus you can do that via

ceph config set osd.N bluestore_allocator stupid

ceph config set osd.N bluefs_allocator stupid

See

https://ceph.com/community/new-mimic-centralized-configuration-management/
for more details on a new way of configuration options
setting.


A known issue with Stupid allocator is gradual write
request latency increase (occurred within several days
after OSD restart). Seldom observed though. There were
some posts about that behavior in the mail list  this year.

Thanks,

Igor.


On 7/8/2019 8:33 PM, Brett Chancellor wrote:



I'll give that a try.  Is it something like...
ceph tell 'osd.*' bluestore_allocator stupid
ceph tell 'osd.*' bluefs_allocator stupid

And should I expect any issues doing this?


On Mon, Jul 8, 2019 at 1:04 PM Igor Fedotov <ifedo...@suse.de> wrote:

I should read call stack more carefully... It's not
about lacking free space - this is rather the bug
from this ticket:

http://tracker.ceph.com/issues/40080


You should upgrade to v14.2.2 (once it's available)
or temporarily switch to stupid allocator as a
workaround.


Thanks,

Igor



    On 7/8/2019 8:00 PM, Igor Fedotov wrote:


Hi Brett,

looks like BlueStore is unable to allocate
additional space for BlueFS at main device. It's
either lacking free space or it's too fragmented...

Would you share osd log, please?

Also please run "ceph-bluestore-tool --path

bluefs-bdev-sizes" and share the output.

Thanks,

Igor

On 7/3/2019 9:59 PM, Brett Chancellor wrote:

Hi All! Today I've had 3 OSDs stop themselves and
are unable to restart, all with the same error.
These OSDs are all on different hosts. All are
running 14.2.1

I did try the following two commands
- ceph-kvstore-tool bluestore-kv
/var/lib/ceph/osd/ceph-80 list > keys
  ## This failed with the same error below
- ceph-bluestore-tool --path
/var/lib/ceph/osd/ceph-80 fsck
 ## After a couple of hours returned...
2019-07-03 18:30:02.095 7fe7c1c1ef00 -1
bluestore(/var/lib/ceph/osd/ceph-80) fsck
warning: legacy statfs record found, suggest to
run store repair to get consistent statistic reports
fsck success


## Error when trying to start one of the OSDs
 -12> 2019-07-03 18:36:57.450 7f5e42366700 -1

Re: [ceph-users] 3 OSDs stopped and unable to restart

2019-07-09 Thread Igor Fedotov

Please try to set bluestore_bluefs_gift_ratio to 0.0002
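
For example, applied per OSD through the config database (osd.N as in the
allocator commands earlier in this thread), followed by an OSD restart:

    ceph config set osd.N bluestore_bluefs_gift_ratio 0.0002
    systemctl restart ceph-osd@N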


On 7/9/2019 7:39 PM, Brett Chancellor wrote:
Too large for pastebin.. The problem is continually crashing new OSDs. 
Here is the latest one.


On Tue, Jul 9, 2019 at 11:46 AM Igor Fedotov <ifedo...@suse.de> wrote:


could you please set debug bluestore to 20 and collect startup log
for this specific OSD once again?


On 7/9/2019 6:29 PM, Brett Chancellor wrote:

I restarted most of the OSDs with the stupid allocator (6 of them
wouldn't start unless bitmap allocator was set), but I'm still
seeing issues with OSDs crashing.  Interestingly it seems that
the dying OSDs are always working on a pg from the .rgw.meta pool
when they crash.

Log : https://pastebin.com/yuJKcPvX

On Tue, Jul 9, 2019 at 5:14 AM Igor Fedotov <ifedo...@suse.de> wrote:

Hi Brett,

in Nautilus you can do that via

ceph config set osd.N bluestore_allocator stupid

ceph config set osd.N bluefs_allocator stupid

See

https://ceph.com/community/new-mimic-centralized-configuration-management/
for more details on a new way of configuration options setting.


A known issue with Stupid allocator is gradual write request
latency increase (occurred within several days after OSD
restart). Seldom observed though. There were some posts about
that behavior in the mail list  this year.

Thanks,

Igor.


On 7/8/2019 8:33 PM, Brett Chancellor wrote:



I'll give that a try.  Is it something like...
ceph tell 'osd.*' bluestore_allocator stupid
ceph tell 'osd.*' bluefs_allocator stupid

And should I expect any issues doing this?


On Mon, Jul 8, 2019 at 1:04 PM Igor Fedotov <ifedo...@suse.de> wrote:

I should read call stack more carefully... It's not
about lacking free space - this is rather the bug from
this ticket:

http://tracker.ceph.com/issues/40080


You should upgrade to v14.2.2 (once it's available) or
temporarily switch to stupid allocator as a workaround.


Thanks,

Igor



On 7/8/2019 8:00 PM, Igor Fedotov wrote:


Hi Brett,

looks like BlueStore is unable to allocate additional
space for BlueFS at main device. It's either lacking
free space or it's too fragmented...

Would you share osd log, please?

Also please run "ceph-bluestore-tool --path  bluefs-bdev-sizes" and share the
output.

Thanks,

Igor

On 7/3/2019 9:59 PM, Brett Chancellor wrote:

Hi All! Today I've had 3 OSDs stop themselves and are
unable to restart, all with the same error. These OSDs
are all on different hosts. All are running 14.2.1

I did try the following two commands
- ceph-kvstore-tool bluestore-kv
/var/lib/ceph/osd/ceph-80 list > keys
  ## This failed with the same error below
- ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-80
fsck
 ## After a couple of hours returned...
2019-07-03 18:30:02.095 7fe7c1c1ef00 -1
bluestore(/var/lib/ceph/osd/ceph-80) fsck warning:
legacy statfs record found, suggest to run store
repair to get consistent statistic reports
fsck success


## Error when trying to start one of the OSDs
 -12> 2019-07-03 18:36:57.450 7f5e42366700 -1 ***
Caught signal (Aborted) **
 in thread 7f5e42366700 thread_name:rocksdb:low0

 ceph version 14.2.1
(d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus
(stable)
 1: (()+0xf5d0) [0x7f5e50bd75d0]
 2: (gsignal()+0x37) [0x7f5e4f9ce207]
 3: (abort()+0x148) [0x7f5e4f9cf8f8]
 4: (ceph::__ceph_assert_fail(char const*, char
const*, int, char const*)+0x199) [0x55a7aaee96ab]
 5: (ceph::__ceph_assertf_fail(char const*, char
const*, int, char const*, char const*, ...)+0)
[0x55a7aaee982a]
 6: (interval_set,
std::allocator > > >::insert(unsigned long, unsigned long,
unsigned long*, unsigned long*)+0x3c6) [0x55a7ab212a66]
 7: (BlueStore::allocate_bluefs_freespace(unsigned
long, unsigned long, std::vector >*)+0x74e) [0x55a7ab48253e]
 8: (BlueFS::_expand_slow_device(unsigned long,
std::vector >&)+0x111) [0x55a7ab59e921]
 9: (BlueFS::_allocate(unsigned char, unsigned long,
bluefs_fnode_t*)+0x68b) [0x55a7ab59f68b]
 10: (BlueFS::_flush_range(BlueFS::FileWriter*,
unsigned long, unsigned long)

Re: [ceph-users] DR practice: "uuid != super.uuid" and csum error at blob offset 0x0

2019-07-09 Thread Igor Fedotov

Hi Mark,

I doubt read-only mode would help here.

Log replay is required to build a consistent store state and one can't 
bypass it. And it looks like your drive/controller still detects some errors 
while reading.


For the second issue this PR might help (you'll be able to disable csum 
verification and hopefully OSD will start) once it's merged 
(https://github.com/ceph/ceph/pull/26247). For now IMO the only way to 
go for you is to have a custom build with this patch manually applied.



Thanks,

Igor

On 7/9/2019 5:02 PM, Mark Lehrer wrote:

My main question is this - is there a way to stop any replay or
journaling during OSD startup and bring up the pool/fs in read-only
mode?

Here is a description of what I'm seeing.  I have a Luminous cluster
with CephFS and 16 8TB SSDs, using size=3.

I had a problem with one of my SAS controllers, and now I have at
least 3 OSDs that refuse to start.  The hardware appears to be fine
now.

I have my essential data backed up, but there are a few files that I
wouldn't mind saving so I want to use this as disaster recovery
practice.

The two problems I am seeing are:

1) On two of OSDs, there is a startup replay error after successfully
replaying quite a few blocks:

2019-07-06 16:08:05.281063 7f6baec66e40 10 bluefs _replay 0x1543000:
stop: uuid c366a2d6-e221-98b3-59fe-0f324c9dac8e != super.uuid
263428d5-8963-4339-8815-92ab6067e7a4
2019-07-06 16:08:05.281064 7f6baec66e40 10 bluefs _replay log file
size was 0x1543000
2019-07-06 16:08:05.281085 7f6baec66e40 -1 bluefs _replay file with
link count 0: file(ino 1485 size 0x15f4c43 mtime 2019-07-04
20:39:39.387601 bdev 1 allocated 160 extents
[1:0x3577150+10,1:0x3577160+10,1:0x3577170+10,1:0x35771c0+10,1:0x35771d0+10,1:0x3577220+10,1:0x3577230+10,1:0x3577280+10,1:0x3577290+10,1:0x35772a0+10,1:0x35772b0+10,1:0x35772c0+10,1:0x35772d0+10,1:0x35772e0+10,1:0x3577330+10,1:0x3577340+10,1:0x3577350+10,1:0x3577360+10,1:0x3577370+10,1:0x3577380+10,1:0x3577390+10,1:0x35773a0+10])
2019-07-06 16:08:05.281093 7f6baec66e40 -1 bluefs mount failed to
replay log: (5) Input/output error


2) The following error happens on at least two OSDs:

2019-07-06 15:58:46.621008 7fdcee030e40 -1
bluestore(/var/lib/ceph/osd/ceph-74) _verify_csum bad crc32c/0x1000
checksum at blob offset 0x0, got 0x147db0c5, expected 0x8f052c9,
device location [0x1~1000], logical extent 0x0~1000, object
#-1:7b3f43c4:::osd_superblock:0#


The system was archiving some unimportant files when it went down, so
I really don't care about any of the recent writes.

What are my recovery options here?  I was thinking that turning off
replaying and running in read-only mode would be feasible, but maybe
there are better options?

Thanks,
Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 3 OSDs stopped and unable to restart

2019-07-09 Thread Igor Fedotov
could you please set debug bluestore to 20 and collect startup log for 
this specific OSD once again?
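
One way to do that (osd.80 is just the id used earlier in this thread):

    ceph config set osd.80 debug_bluestore 20/20
    systemctl restart ceph-osd@80
    # the startup log then lands in /var/log/ceph/ceph-osd.80.log; afterwards:
    ceph config rm osd.80 debug_bluestore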



On 7/9/2019 6:29 PM, Brett Chancellor wrote:
I restarted most of the OSDs with the stupid allocator (6 of them 
wouldn't start unless bitmap allocator was set), but I'm still seeing 
issues with OSDs crashing. Interestingly it seems that the dying OSDs 
are always working on a pg from the .rgw.meta pool when they crash.


Log : https://pastebin.com/yuJKcPvX

On Tue, Jul 9, 2019 at 5:14 AM Igor Fedotov <ifedo...@suse.de> wrote:


Hi Brett,

in Nautilus you can do that via

ceph config set osd.N bluestore_allocator stupid

ceph config set osd.N bluefs_allocator stupid

See
https://ceph.com/community/new-mimic-centralized-configuration-management/
for more details on a new way of configuration options setting.


A known issue with Stupid allocator is gradual write request
latency increase (occurred within several days after OSD restart).
Seldom observed though. There were some posts about that behavior
in the mail list  this year.

Thanks,

Igor.


On 7/8/2019 8:33 PM, Brett Chancellor wrote:



I'll give that a try.  Is it something like...
ceph tell 'osd.*' bluestore_allocator stupid
ceph tell 'osd.*' bluefs_allocator stupid

And should I expect any issues doing this?


On Mon, Jul 8, 2019 at 1:04 PM Igor Fedotov <ifedo...@suse.de> wrote:

I should read call stack more carefully... It's not about
lacking free space - this is rather the bug from this ticket:

http://tracker.ceph.com/issues/40080


You should upgrade to v14.2.2 (once it's available) or
temporarily switch to stupid allocator as a workaround.


Thanks,

Igor



On 7/8/2019 8:00 PM, Igor Fedotov wrote:


Hi Brett,

looks like BlueStore is unable to allocate additional space
for BlueFS at main device. It's either lacking free space or
it's too fragmented...

Would you share osd log, please?

Also please run "ceph-bluestore-tool --path  bluefs-bdev-sizes" and share the output.

Thanks,

Igor

On 7/3/2019 9:59 PM, Brett Chancellor wrote:

Hi All! Today I've had 3 OSDs stop themselves and are
unable to restart, all with the same error. These OSDs are
all on different hosts. All are running 14.2.1

I did try the following two commands
- ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-80
list > keys
  ## This failed with the same error below
- ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-80 fsck
 ## After a couple of hours returned...
2019-07-03 18:30:02.095 7fe7c1c1ef00 -1
bluestore(/var/lib/ceph/osd/ceph-80) fsck warning: legacy
statfs record found, suggest to run store repair to get
consistent statistic reports
fsck success


## Error when trying to start one of the OSDs
 -12> 2019-07-03 18:36:57.450 7f5e42366700 -1 *** Caught
signal (Aborted) **
 in thread 7f5e42366700 thread_name:rocksdb:low0

 ceph version 14.2.1
(d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
 1: (()+0xf5d0) [0x7f5e50bd75d0]
 2: (gsignal()+0x37) [0x7f5e4f9ce207]
 3: (abort()+0x148) [0x7f5e4f9cf8f8]
 4: (ceph::__ceph_assert_fail(char const*, char const*,
int, char const*)+0x199) [0x55a7aaee96ab]
 5: (ceph::__ceph_assertf_fail(char const*, char const*,
int, char const*, char const*, ...)+0) [0x55a7aaee982a]
 6: (interval_set,
std::allocator > > >::insert(unsigned long, unsigned long, unsigned
long*, unsigned long*)+0x3c6) [0x55a7ab212a66]
 7: (BlueStore::allocate_bluefs_freespace(unsigned long,
unsigned long, std::vector >*)+0x74e) [0x55a7ab48253e]
 8: (BlueFS::_expand_slow_device(unsigned long,
std::vector >&)+0x111) [0x55a7ab59e921]
 9: (BlueFS::_allocate(unsigned char, unsigned long,
bluefs_fnode_t*)+0x68b) [0x55a7ab59f68b]
 10: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned
long, unsigned long)+0xe5) [0x55a7ab59fce5]
 11: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0x10b)
[0x55a7ab5a1b4b]
 12: (BlueRocksWritableFile::Flush()+0x3d) [0x55a7ab5bf84d]
 13: (rocksdb::WritableFileWriter::Flush()+0x19e)
[0x55a7abbedd0e]
 14: (rocksdb::WritableFileWriter::Sync(bool)+0x2e)
[0x55a7abbedfee]
 15:
(rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status
const&, rocksdb::CompactionJob::SubcompactionState*,
rocksdb::RangeDelAggregator*, CompactionIterationStats*,
rocksdb::Slice const*)+0xbaa) [0x55a7abc3b73a]
 16:

(rocksdb::CompactionJob::ProcessKeyValueCompact

Re: [ceph-users] 3 OSDs stopped and unable to restart

2019-07-09 Thread Igor Fedotov

Hi Brett,

in Nautilus you can do that via

ceph config set osd.N bluestore_allocator stupid

ceph config set osd.N bluefs_allocator stupid

See 
https://ceph.com/community/new-mimic-centralized-configuration-management/ 
for more details on a new way of configuration options setting.



A known issue with the Stupid allocator is a gradual write request latency 
increase (occurring within several days after an OSD restart). It is seldom 
observed though. There were some posts about that behavior in the mailing 
list this year.


Thanks,

Igor.


On 7/8/2019 8:33 PM, Brett Chancellor wrote:



I'll give that a try.  Is it something like...
ceph tell 'osd.*' bluestore_allocator stupid
ceph tell 'osd.*' bluefs_allocator stupid

And should I expect any issues doing this?


On Mon, Jul 8, 2019 at 1:04 PM Igor Fedotov <ifedo...@suse.de> wrote:


I should read call stack more carefully... It's not about lacking
free space - this is rather the bug from this ticket:

http://tracker.ceph.com/issues/40080


You should upgrade to v14.2.2 (once it's available) or temporarily
switch to stupid allocator as a workaround.


Thanks,

Igor



On 7/8/2019 8:00 PM, Igor Fedotov wrote:


Hi Brett,

looks like BlueStore is unable to allocate additional space for
BlueFS at main device. It's either lacking free space or it's too
fragmented...

Would you share osd log, please?

Also please run "ceph-bluestore-tool --path  bluefs-bdev-sizes" and share the output.

Thanks,

Igor

On 7/3/2019 9:59 PM, Brett Chancellor wrote:

Hi All! Today I've had 3 OSDs stop themselves and are unable to
restart, all with the same error. These OSDs are all on
different hosts. All are running 14.2.1

I did try the following two commands
- ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-80 list
> keys
  ## This failed with the same error below
- ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-80 fsck
 ## After a couple of hours returned...
2019-07-03 18:30:02.095 7fe7c1c1ef00 -1
bluestore(/var/lib/ceph/osd/ceph-80) fsck warning: legacy statfs
record found, suggest to run store repair to get consistent
statistic reports
fsck success


## Error when trying to start one of the OSDs
 -12> 2019-07-03 18:36:57.450 7f5e42366700 -1 *** Caught signal
(Aborted) **
 in thread 7f5e42366700 thread_name:rocksdb:low0

 ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972)
nautilus (stable)
 1: (()+0xf5d0) [0x7f5e50bd75d0]
 2: (gsignal()+0x37) [0x7f5e4f9ce207]
 3: (abort()+0x148) [0x7f5e4f9cf8f8]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int,
char const*)+0x199) [0x55a7aaee96ab]
 5: (ceph::__ceph_assertf_fail(char const*, char const*, int,
char const*, char const*, ...)+0) [0x55a7aaee982a]
 6: (interval_set,
std::allocator > >
>::insert(unsigned long, unsigned long, unsigned long*, unsigned
long*)+0x3c6) [0x55a7ab212a66]
 7: (BlueStore::allocate_bluefs_freespace(unsigned long,
unsigned long, std::vector >*)+0x74e) [0x55a7ab48253e]
 8: (BlueFS::_expand_slow_device(unsigned long,
std::vector >&)+0x111) [0x55a7ab59e921]
 9: (BlueFS::_allocate(unsigned char, unsigned long,
bluefs_fnode_t*)+0x68b) [0x55a7ab59f68b]
 10: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long,
unsigned long)+0xe5) [0x55a7ab59fce5]
 11: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0x10b)
[0x55a7ab5a1b4b]
 12: (BlueRocksWritableFile::Flush()+0x3d) [0x55a7ab5bf84d]
 13: (rocksdb::WritableFileWriter::Flush()+0x19e) [0x55a7abbedd0e]
 14: (rocksdb::WritableFileWriter::Sync(bool)+0x2e) [0x55a7abbedfee]
 15:
(rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status
const&, rocksdb::CompactionJob::SubcompactionState*,
rocksdb::RangeDelAggregator*, CompactionIterationStats*,
rocksdb::Slice const*)+0xbaa) [0x55a7abc3b73a]
 16:

(rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7d0)
[0x55a7abc3f150]
 17: (rocksdb::CompactionJob::Run()+0x298) [0x55a7abc40618]
 18: (rocksdb::DBImpl::BackgroundCompaction(bool*,
rocksdb::JobContext*, rocksdb::LogBuffer*,
rocksdb::DBImpl::PrepickedCompaction*)+0xcb7) [0x55a7aba7fb67]
 19:

(rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*,
rocksdb::Env::Priority)+0xd0) [0x55a7aba813c0]
 20: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x3a)
[0x55a7aba8190a]
 21: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned
long)+0x264) [0x55a7abc8d9c4]
 22:
(rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x4f)
[0x55a7abc8db4f]
 23: (()+0x129dfff) [0x55a7abd1afff]
 24: (()+0x7dd5) [0x7f5e50bcfdd5]
 25: (clone()+0x6d) [0x7f5e4fa95ead]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.

Re: [ceph-users] slow requests due to scrubbing of very small pg

2019-07-09 Thread Igor Fedotov

Hi Lukasz,

if this is FileStore then most probably my comments are irrelevant. The 
issue I expected is BlueStore-specific.


Unfortunately I'm not an expert in filestore hence unable to help in 
further investigation. Sorry...



Thanks,

Igor


On 7/9/2019 11:39 AM, Luk wrote:

We have (still) on these OSDs filestore.

Regards
Lukasz


Hi Igor,
Thank You for Your input, will try Your suggestion with
ceph-objectstore-tool.
But for now it looks like the main problem is this:
2019-07-09 09:29:25.410839 7f5e4b64f700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f5e20e87700' had timed out after 15
2019-07-09 09:29:25.410842 7f5e4b64f700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7f5e41651700' had timed out after 60
after  this  (a  lot of) logs osd become unresponsive for monitors and
they are marked down for a few seconds/minutes, sometimes it suicide:
2019-07-09 09:29:32.271361 7f5e3d649700  0 log_channel(cluster) log
[WRN] : Monitor daemon marked osd.118 down, but it is still running
2019-07-09 09:29:32.271381 7f5e3d649700  0 log_channel(cluster) log
[DBG] : map e71903 wrongly marked me down at e71902
2019-07-09 09:29:32.271393 7f5e3d649700  1 osd.118 71903 
start_waiting_for_healthy



maybe You (or any other Cepher) know how to deal with this problem?
Regards
Lukasz

Hi Lukasz,
I've seen something like that - slow requests and relevant OSD reboots
on suicide timeout at least twice with two different clusters. The root
cause was slow omap listing for some objects which had started to happen
after massive removals from RocksDB.
To verify if this is the case you can create a script that uses
ceph-objectstore-tool to list objects for the specific pg and then
list-omap for every returned object.
If omap listing for some object(s) takes too long (minutes in my case) -
you're facing the same issue.
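A rough sketch of such a script (untested, placeholder paths; the OSD must be
stopped first, and a FileStore OSD may additionally need --journal-path).
Each line printed by "--op list" is a JSON object spec the tool accepts back
as an argument:

systemctl stop ceph-osd@118
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-118 \
    --pgid 20.2 --op list > /tmp/pg-objects.json
while read -r obj; do
    echo "== $obj"
    time ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-118 \
        "$obj" list-omap > /dev/null
done < /tmp/pg-objects.json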
PR that implements automatic lookup for such "slow" objects in
ceph-objectstore-tool is under review:
https://github.com/ceph/ceph/pull/27985



The only known workaround for existing OSDs so far is manual DB
compaction. And https://github.com/ceph/ceph/pull/27627 hopefully fixes
the issue for newly deployed OSDs.
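For reference, one way to run such a compaction offline (a sketch assuming a
BlueStore OSD; the id is a placeholder):

systemctl stop ceph-osd@<id>
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> compact
systemctl start ceph-osd@<id>

Newer releases also expose an online "compact" command on the OSD admin
socket, if available in the version in use.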




Relevant upstream tickets are:
http://tracker.ceph.com/issues/36482
http://tracker.ceph.com/issues/40557



Hope this helps,
Igor
On 7/3/2019 9:54 AM, Luk wrote:

Hello,

I have a strange problem with scrubbing.

When scrubbing starts on a PG which belongs to the default.rgw.buckets.index
pool, I can see that this OSD is very busy (see attachment) and starts
showing many slow requests; after the scrubbing of this PG stops, the slow
requests stop immediately.

[root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# zgrep scrub 
/var/log/ceph/ceph-osd.118.log.1.gz  | grep -w 20.2
2019-07-03 00:14:57.496308 7fd4c7a09700  0 log_channel(cluster) log [DBG] : 
20.2 deep-scrub starts
2019-07-03 05:36:13.274637 7fd4ca20e700  0 log_channel(cluster) log [DBG] : 
20.2 deep-scrub ok
[root@stor-b02 /var/lib/ceph/osd/ceph-118/current]#

[root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# du -sh 20.2_*
636K20.2_head
0   20.2_TEMP
[root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# ls -1 -R 20.2_head | wc -l
4125
[root@stor-b02 /var/lib/ceph/osd/ceph-118/current]#

and on mon:

2019-07-03 00:48:44.793893 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6231090 : 
cluster [WRN] Health check failed: 105 slow requests are blocked > 32 sec. 
Implicated osds 118 (REQUEST_SLOW)
2019-07-03 00:48:54.086446 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6231097 : 
cluster [WRN] Health check update: 102 slow requests are blocked > 32 sec. 
Implicated osds 118 (REQUEST_SLOW)
2019-07-03 00:48:59.088240 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6231099 : 
cluster [WRN] Health check update: 91 slow requests are blocked > 32 sec. 
Implicated osds 118 (REQUEST_SLOW)

[...]

2019-07-03 05:36:19.695586 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6243211 : 
cluster [INF] Health check cleared: REQUEST_SLOW (was: 23 slow requests are 
blocked > 32 sec. Implicated osds 118)
2019-07-03 05:36:19.695700 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6243212 : 
cluster [INF] Cluster is now healthy

ceph version 12.2.9

it might be related to this (taken from:
https://ceph.com/releases/v12-2-11-luminous-released/)?:

"
There have been fixes to RGW dynamic and manual resharding, which no longer
leaves behind stale bucket instances to be removed manually. For finding and
cleaning up older instances from a reshard a radosgw-admin command reshard
stale-instances list and reshard stale-instances rm should do the necessary
cleanup.
"


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com







___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 3 OSDs stopped and unable to restart

2019-07-08 Thread Igor Fedotov
I should read call stack more carefully... It's not about lacking free 
space - this is rather the bug from this ticket:


http://tracker.ceph.com/issues/40080


You should upgrade to v14.2.2 (once it's available) or temporarily 
switch to stupid allocator as a workaround.



Thanks,

Igor



On 7/8/2019 8:00 PM, Igor Fedotov wrote:


Hi Brett,

looks like BlueStore is unable to allocate additional space for BlueFS 
at main device. It's either lacking free space or it's too fragmented...


Would you share osd log, please?

Also please run "ceph-bluestore-tool --path path-to-osd!!!> bluefs-bdev-sizes" and share the output.


Thanks,

Igor

On 7/3/2019 9:59 PM, Brett Chancellor wrote:
Hi All! Today I've had 3 OSDs stop themselves and are unable to 
restart, all with the same error. These OSDs are all on different 
hosts. All are running 14.2.1


I did try the following two commands
- ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-80 list > keys
  ## This failed with the same error below
- ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-80 fsck
 ## After a couple of hours returned...
2019-07-03 18:30:02.095 7fe7c1c1ef00 -1 
bluestore(/var/lib/ceph/osd/ceph-80) fsck warning: legacy statfs 
record found, suggest to run store repair to get consistent statistic 
reports

fsck success


## Error when trying to start one of the OSDs
 -12> 2019-07-03 18:36:57.450 7f5e42366700 -1 *** Caught signal 
(Aborted) **

 in thread 7f5e42366700 thread_name:rocksdb:low0

 ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) 
nautilus (stable)

 1: (()+0xf5d0) [0x7f5e50bd75d0]
 2: (gsignal()+0x37) [0x7f5e4f9ce207]
 3: (abort()+0x148) [0x7f5e4f9cf8f8]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x199) [0x55a7aaee96ab]
 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char 
const*, char const*, ...)+0) [0x55a7aaee982a]
 6: (interval_setlong, std::less, std::allocatorlong const, unsigned long> > > >::insert(unsigned long, unsigned 
long, unsigned long*, unsigned long*)+0x3c6) [0x55a7ab212a66]
 7: (BlueStore::allocate_bluefs_freespace(unsigned long, unsigned 
long, std::vectormempool::pool_allocator<(mempool::pool_index_t)4, 
bluestore_pextent_t> >*)+0x74e) [0x55a7ab48253e]
 8: (BlueFS::_expand_slow_device(unsigned long, 
std::vectormempool::pool_allocator<(mempool::pool_index_t)4, 
bluestore_pextent_t> >&)+0x111) [0x55a7ab59e921]
 9: (BlueFS::_allocate(unsigned char, unsigned long, 
bluefs_fnode_t*)+0x68b) [0x55a7ab59f68b]
 10: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, 
unsigned long)+0xe5) [0x55a7ab59fce5]

 11: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0x10b) [0x55a7ab5a1b4b]
 12: (BlueRocksWritableFile::Flush()+0x3d) [0x55a7ab5bf84d]
 13: (rocksdb::WritableFileWriter::Flush()+0x19e) [0x55a7abbedd0e]
 14: (rocksdb::WritableFileWriter::Sync(bool)+0x2e) [0x55a7abbedfee]
 15: 
(rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status 
const&, rocksdb::CompactionJob::SubcompactionState*, 
rocksdb::RangeDelAggregator*, CompactionIterationStats*, 
rocksdb::Slice const*)+0xbaa) [0x55a7abc3b73a]
 16: 
(rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7d0) 
[0x55a7abc3f150]

 17: (rocksdb::CompactionJob::Run()+0x298) [0x55a7abc40618]
 18: (rocksdb::DBImpl::BackgroundCompaction(bool*, 
rocksdb::JobContext*, rocksdb::LogBuffer*, 
rocksdb::DBImpl::PrepickedCompaction*)+0xcb7) [0x55a7aba7fb67]
 19: 
(rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, 
rocksdb::Env::Priority)+0xd0) [0x55a7aba813c0]

 20: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x3a) [0x55a7aba8190a]
 21: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x264) 
[0x55a7abc8d9c4]
 22: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x4f) 
[0x55a7abc8db4f]

 23: (()+0x129dfff) [0x55a7abd1afff]
 24: (()+0x7dd5) [0x7f5e50bcfdd5]
 25: (clone()+0x6d) [0x7f5e4fa95ead]
 NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 3 OSDs stopped and unable to restart

2019-07-08 Thread Igor Fedotov

Hi Brett,

looks like BlueStore is unable to allocate additional space for BlueFS 
at main device. It's either lacking free space or it's too fragmented...


Would you share osd log, please?

Also please run "ceph-bluestore-tool --path path-to-osd!!!> bluefs-bdev-sizes" and share the output.


Thanks,

Igor

On 7/3/2019 9:59 PM, Brett Chancellor wrote:
Hi All! Today I've had 3 OSDs stop themselves and are unable to 
restart, all with the same error. These OSDs are all on different 
hosts. All are running 14.2.1


I did try the following two commands
- ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-80 list > keys
  ## This failed with the same error below
- ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-80 fsck
 ## After a couple of hours returned...
2019-07-03 18:30:02.095 7fe7c1c1ef00 -1 
bluestore(/var/lib/ceph/osd/ceph-80) fsck warning: legacy statfs 
record found, suggest to run store repair to get consistent statistic 
reports

fsck success


## Error when trying to start one of the OSDs
 -12> 2019-07-03 18:36:57.450 7f5e42366700 -1 *** Caught signal 
(Aborted) **

 in thread 7f5e42366700 thread_name:rocksdb:low0

 ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) 
nautilus (stable)

 1: (()+0xf5d0) [0x7f5e50bd75d0]
 2: (gsignal()+0x37) [0x7f5e4f9ce207]
 3: (abort()+0x148) [0x7f5e4f9cf8f8]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x199) [0x55a7aaee96ab]
 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char 
const*, char const*, ...)+0) [0x55a7aaee982a]
 6: (interval_setlong, std::less, std::allocatorconst, unsigned long> > > >::insert(unsigned long, unsigned long, 
unsigned long*, unsigned long*)+0x3c6) [0x55a7ab212a66]
 7: (BlueStore::allocate_bluefs_freespace(unsigned long, unsigned 
long, std::vectormempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> 
>*)+0x74e) [0x55a7ab48253e]
 8: (BlueFS::_expand_slow_device(unsigned long, 
std::vectormempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> 
>&)+0x111) [0x55a7ab59e921]
 9: (BlueFS::_allocate(unsigned char, unsigned long, 
bluefs_fnode_t*)+0x68b) [0x55a7ab59f68b]
 10: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, 
unsigned long)+0xe5) [0x55a7ab59fce5]

 11: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0x10b) [0x55a7ab5a1b4b]
 12: (BlueRocksWritableFile::Flush()+0x3d) [0x55a7ab5bf84d]
 13: (rocksdb::WritableFileWriter::Flush()+0x19e) [0x55a7abbedd0e]
 14: (rocksdb::WritableFileWriter::Sync(bool)+0x2e) [0x55a7abbedfee]
 15: 
(rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status 
const&, rocksdb::CompactionJob::SubcompactionState*, 
rocksdb::RangeDelAggregator*, CompactionIterationStats*, 
rocksdb::Slice const*)+0xbaa) [0x55a7abc3b73a]
 16: 
(rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7d0) 
[0x55a7abc3f150]

 17: (rocksdb::CompactionJob::Run()+0x298) [0x55a7abc40618]
 18: (rocksdb::DBImpl::BackgroundCompaction(bool*, 
rocksdb::JobContext*, rocksdb::LogBuffer*, 
rocksdb::DBImpl::PrepickedCompaction*)+0xcb7) [0x55a7aba7fb67]
 19: 
(rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, 
rocksdb::Env::Priority)+0xd0) [0x55a7aba813c0]

 20: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x3a) [0x55a7aba8190a]
 21: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x264) 
[0x55a7abc8d9c4]
 22: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x4f) 
[0x55a7abc8db4f]

 23: (()+0x129dfff) [0x55a7abd1afff]
 24: (()+0x7dd5) [0x7f5e50bcfdd5]
 25: (clone()+0x6d) [0x7f5e4fa95ead]
 NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests due to scrubbing of very small pg

2019-07-04 Thread Igor Fedotov

Hi Lukasz,

I've seen something like that - slow requests and relevant OSD reboots 
on suicide timeout at least twice with two different clusters. The root 
cause was slow omap listing for some objects which had started to happen 
after massive removals from RocksDB.


To verify if this is the case you can create a script that uses 
ceph-objectstore-tool to list objects for the specific pg and then 
list-omap for every returned object.


If omap listing for some object(s) takes too long (minutes in my case) - 
you're facing the same issue.


PR that implements automatic lookup for such "slow" objects in 
ceph-objectstore-tool is under review: 
https://github.com/ceph/ceph/pull/27985



The only known workaround for existing OSDs so far is manual DB 
compaction. And https://github.com/ceph/ceph/pull/27627 hopefully fixes 
the issue for newly deployed OSDs.




Relevant upstream tickets are:

http://tracker.ceph.com/issues/36482

http://tracker.ceph.com/issues/40557


Hope this helps,

Igor

On 7/3/2019 9:54 AM, Luk wrote:

Hello,

I have a strange problem with scrubbing.

When scrubbing starts on a PG which belongs to the default.rgw.buckets.index
pool, I can see that this OSD is very busy (see attachment) and starts
showing many slow requests; after the scrubbing of this PG stops, the slow
requests stop immediately.

[root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# zgrep scrub 
/var/log/ceph/ceph-osd.118.log.1.gz  | grep -w 20.2
2019-07-03 00:14:57.496308 7fd4c7a09700  0 log_channel(cluster) log [DBG] : 
20.2 deep-scrub starts
2019-07-03 05:36:13.274637 7fd4ca20e700  0 log_channel(cluster) log [DBG] : 
20.2 deep-scrub ok
[root@stor-b02 /var/lib/ceph/osd/ceph-118/current]#

[root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# du -sh 20.2_*
636K20.2_head
0   20.2_TEMP
[root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# ls -1 -R 20.2_head | wc -l
4125
[root@stor-b02 /var/lib/ceph/osd/ceph-118/current]#

and on mon:

2019-07-03 00:48:44.793893 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6231090 : 
cluster [WRN] Health check failed: 105 slow requests are blocked > 32 sec. 
Implicated osds 118 (REQUEST_SLOW)
2019-07-03 00:48:54.086446 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6231097 : 
cluster [WRN] Health check update: 102 slow requests are blocked > 32 sec. 
Implicated osds 118 (REQUEST_SLOW)
2019-07-03 00:48:59.088240 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6231099 : 
cluster [WRN] Health check update: 91 slow requests are blocked > 32 sec. 
Implicated osds 118 (REQUEST_SLOW)

[...]

2019-07-03 05:36:19.695586 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6243211 : 
cluster [INF] Health check cleared: REQUEST_SLOW (was: 23 slow requests are 
blocked > 32 sec. Implicated osds 118)
2019-07-03 05:36:19.695700 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6243212 : 
cluster [INF] Cluster is now healthy

ceph version 12.2.9

it might be related to this (taken from:
https://ceph.com/releases/v12-2-11-luminous-released/)?:

"
There have been fixes to RGW dynamic and manual resharding, which no longer
leaves behind stale bucket instances to be removed manually. For finding and
cleaning up older instances from a reshard a radosgw-admin command reshard
stale-instances list and reshard stale-instances rm should do the necessary
cleanup.
"


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] troubleshooting space usage

2019-07-04 Thread Igor Fedotov

Yep, this looks fine..

hmm... sorry, but I'm out of ideas about what's happening...

Anyway, I think ceph reports are more trustworthy than rgw ones. Looks 
like some issue with rgw reporting, or maybe some object leakage.



Regards,

Igor


On 7/3/2019 6:34 PM, Andrei Mikhailovsky wrote:

Hi Igor.

The numbers are identical it seems:

    .rgw.buckets   19      15 TiB     78.22       4.3 TiB *8786934*

# cat /root/ceph-rgw.buckets-rados-ls-all |wc -l
*8786934*

Cheers


*From: *"Igor Fedotov" 
*To: *"andrei" 
*Cc: *"ceph-users" 
*Sent: *Wednesday, 3 July, 2019 13:49:02
*Subject: *Re: [ceph-users] troubleshooting space usage

Looks fine - comparing bluestore_allocated vs. bluestore_stored
shows a little difference. So that's not the allocation overhead.
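For anyone repeating that comparison, the counters live in the "bluestore"
section of the perf dump; one possible way to pull them (assuming jq is
installed):

ceph daemon osd.0 perf dump | jq '.bluestore | {bluestore_allocated, bluestore_stored}'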

What about comparing the object counts reported by the ceph and radosgw
tools?


Igor.


On 7/3/2019 3:25 PM, Andrei Mikhailovsky wrote:

Thanks Igor, Here is a link to the ceph perf data on several osds.

https://paste.ee/p/IzDMy

In terms of the object sizes. We use rgw to backup the data
from various workstations and servers. So, the sizes would be
from a few kb to a few gig per individual file.

Cheers



----

    *From: *"Igor Fedotov" 
*To: *"andrei" 
*Cc: *"ceph-users" 
*Sent: *Wednesday, 3 July, 2019 12:29:33
*Subject: *Re: [ceph-users] troubleshooting space usage

Hi Andrei,

Additionally I'd like to see performance counters dump for
a couple of HDD OSDs (obtained through 'ceph daemon osd.N
perf dump' command).

W.r.t average object size - I was thinking that you might
know what objects had been uploaded... If not then you
might want to estimate it by using "rados get" command on
the pool: retrieve some random object set and check their
sizes. But let's check performance counters first - most
probably they will show losses caused by allocation.
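A possible way to sample random objects and average their sizes (a sketch,
not from the thread; it uses "rados stat" instead of fetching the data and
assumes object names contain no spaces):

rados -p .rgw.buckets ls | shuf -n 100 > /tmp/sample.txt
while read -r o; do rados -p .rgw.buckets stat "$o"; done < /tmp/sample.txt \
    | awk '{ sum += $NF; n++ } END { print sum/n " bytes on average" }'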


Also I've just found similar issue (still unresolved) in
our internal tracker - but its root cause is definitely
different from allocation overhead. Looks like some
orphaned objects in the pool. Could you please compare and
share the amounts of objects in the pool reported by "ceph
(or rados) df detail" and radosgw tools?


Thanks,

Igor


On 7/3/2019 12:56 PM, Andrei Mikhailovsky wrote:

Hi Igor,

Many thanks for your reply. Here are the details about
the cluster:

1. Ceph version - 13.2.5-1xenial (installed from Ceph
repository for ubuntu 16.04)

2. main devices for radosgw pool - hdd. we do use a
few ssds for the other pool, but it is not used by radosgw

3. we use BlueStore

4. Average rgw object size - I have no idea how to
check that. Couldn't find a simple answer from google
either. Could you please let me know how to check that?

5. Ceph osd df tree:

6. Other useful info on the cluster:

# ceph osd df tree
ID  CLASS WEIGHT    REWEIGHT SIZE    USE     AVAIL  
%USE  VAR  PGS TYPE NAME

 -1       112.17979        - 113 TiB  90 TiB  23 TiB
79.25 1.00   - root uk
 -5       112.17979        - 113 TiB  90 TiB  23 TiB
79.25 1.00   -     datacenter ldex
-11       112.17979        - 113 TiB  90 TiB  23 TiB
79.25 1.00   -         room ldex-dc3
-13       112.17979        - 113 TiB  90 TiB  23 TiB
79.25 1.00   -             row row-a
 -4       112.17979        - 113 TiB  90 TiB  23 TiB
79.25 1.00   - rack ldex-rack-a5
 -2        28.04495        -  28 TiB  22 TiB 6.2 TiB
77.96 0.98   -   host arh-ibstorage1-ib


  0   hdd   2.73000  0.7 2.8 TiB 2.3 TiB 519 GiB
81.61 1.03 145       osd.0
  1   hdd   2.73000  1.0 2.8 TiB 1.9 TiB 847 GiB
70.00 0.88 130       osd.1
 2   hdd   2.73000  1.0 2.8 TiB 2.2 TiB 561 GiB
80.12 1.01 152         osd.2
  3   hdd   2.73000  1.0 2.8 TiB 2.3 TiB 469 GiB
83.41 1.05 160             osd.3
  4   hdd   2.73000  1.0 2.8 TiB 1.8 TiB 983 GiB
65.18 0.82 141             osd.4
  

Re: [ceph-users] troubleshooting space usage

2019-07-03 Thread Igor Fedotov
Looks fine - comparing bluestore_allocated vs. bluestore_stored shows a 
little difference. So that's not the allocation overhead.


What about comparing the object counts reported by the ceph and radosgw tools?


Igor.


On 7/3/2019 3:25 PM, Andrei Mikhailovsky wrote:

Thanks Igor, Here is a link to the ceph perf data on several osds.

https://paste.ee/p/IzDMy

In terms of the object sizes. We use rgw to backup the data from 
various workstations and servers. So, the sizes would be from a few kb 
to a few gig per individual file.


Cheers





*From: *"Igor Fedotov" 
*To: *"andrei" 
*Cc: *"ceph-users" 
*Sent: *Wednesday, 3 July, 2019 12:29:33
*Subject: *Re: [ceph-users] troubleshooting space usage

Hi Andrei,

Additionally I'd like to see performance counters dump for a
couple of HDD OSDs (obtained through 'ceph daemon osd.N perf dump'
command).

W.r.t average object size - I was thinking that you might know
what objects had been uploaded... If not then you might want to
estimate it by using "rados get" command on the pool: retrieve
some random object set and check their sizes. But let's check
performance counters first - most probably they will show losses
caused by allocation.


Also I've just found similar issue (still unresolved) in our
internal tracker - but its root cause is definitely different from
allocation overhead. Looks like some orphaned objects in the pool.
Could you please compare and share the amounts of objects in the
pool reported by "ceph (or rados) df detail" and radosgw tools?


Thanks,

Igor


On 7/3/2019 12:56 PM, Andrei Mikhailovsky wrote:

Hi Igor,

Many thanks for your reply. Here are the details about the
cluster:

1. Ceph version - 13.2.5-1xenial (installed from Ceph
repository for ubuntu 16.04)

2. main devices for radosgw pool - hdd. we do use a few ssds
for the other pool, but it is not used by radosgw

3. we use BlueStore

4. Average rgw object size - I have no idea how to check that.
Couldn't find a simple answer from google either. Could you
please let me know how to check that?

5. Ceph osd df tree:

6. Other useful info on the cluster:

# ceph osd df tree
ID  CLASS WEIGHT    REWEIGHT SIZE    USE AVAIL   %USE  VAR
 PGS TYPE NAME

 -1       112.17979        - 113 TiB  90 TiB  23 TiB 79.25
1.00   - root uk
 -5       112.17979        - 113 TiB  90 TiB  23 TiB 79.25
1.00   -     datacenter ldex
-11       112.17979        - 113 TiB  90 TiB  23 TiB 79.25
1.00   -         room ldex-dc3
-13       112.17979        - 113 TiB  90 TiB  23 TiB 79.25
1.00   -             row row-a
 -4       112.17979        - 113 TiB  90 TiB  23 TiB 79.25
1.00   -                 rack ldex-rack-a5
 -2        28.04495        -  28 TiB  22 TiB 6.2 TiB 77.96
0.98   -                     host arh-ibstorage1-ib


  0   hdd   2.73000  0.7 2.8 TiB 2.3 TiB 519 GiB 81.61
1.03 145                         osd.0
  1   hdd   2.73000  1.0 2.8 TiB 1.9 TiB 847 GiB 70.00
0.88 130                         osd.1
 2   hdd   2.73000  1.0 2.8 TiB 2.2 TiB 561 GiB 80.12 1.01
152                         osd.2
  3   hdd   2.73000  1.0 2.8 TiB 2.3 TiB 469 GiB 83.41
1.05 160 osd.3
  4   hdd   2.73000  1.0 2.8 TiB 1.8 TiB 983 GiB 65.18
0.82 141 osd.4
 32   hdd   5.45999  1.0 5.5 TiB 4.4 TiB 1.1 TiB 80.68
1.02 306 osd.32
 35   hdd   2.73000  1.0 2.8 TiB 1.7 TiB 1.0 TiB 62.89
0.79 126 osd.35
 36   hdd   2.73000  1.0 2.8 TiB 2.3 TiB 464 GiB 83.58
1.05 175 osd.36
 37   hdd   2.73000  0.8 2.8 TiB 2.5 TiB 301 GiB 89.34
1.13 160 osd.37
  5   ssd   0.74500  1.0 745 GiB 642 GiB 103 GiB 86.15
1.09  65 osd.5

 -3        28.04495        -  28 TiB  24 TiB 4.5 TiB 84.03
1.06   -                     host arh-ibstorage2-ib
  9   hdd   2.73000  0.95000 2.8 TiB 2.4 TiB 405 GiB 85.65
1.08 158 osd.9
 10   hdd   2.73000  0.8 2.8 TiB 2.4 TiB 352 GiB 87.52
1.10 169 osd.10
 11   hdd   2.73000  1.0 2.8 TiB 2.0 TiB 783 GiB 72.28
0.91 160 osd.11
 12   hdd   2.73000  0.84999 2.8 TiB 2.4 TiB 359 GiB 87.27
1.10 153 osd.12
 13   hdd   2.73000  1.0 2.8 TiB 2.4 TiB 348 GiB 87.69
1.11 169 osd.13
 14   hdd   2.73000  1.0 2.8 TiB 2.5 TiB 283 GiB 89.97
1.14 170 osd.14
 15   hdd   2.73000  1.0 2.8 TiB 2.2 TiB 560 GiB 80.18
1.01 155 osd.15
 16   hdd   2.73000  0.95000 2.8 TiB 2.4 TiB 332 G

Re: [ceph-users] troubleshooting space usage

2019-07-03 Thread Igor Fedotov
9 GiB 79.51 1.00 146   
                      osd.24
 25   hdd   2.73000  1.0 2.8 TiB 1.9 TiB 886 GiB 68.63 0.87 147   
                      osd.25
 31   hdd   5.45999  1.0 5.5 TiB 4.7 TiB 758 GiB 86.50 1.09 326   
                      osd.31
  6   ssd   0.74500  0.8 744 GiB 640 GiB 104 GiB 86.01 1.09  61   
                      osd.6


-17        28.04494        -  28 TiB  22 TiB 6.3 TiB 77.61 0.98   -   
                  host arh-ibstorage4-ib
  8   hdd   2.73000  1.0 2.8 TiB 1.9 TiB 909 GiB 67.80 0.86 141   
                      osd.8
 17   hdd   2.73000  1.0 2.8 TiB 1.9 TiB 904 GiB 67.99 0.86 144   
                      osd.17
 27   hdd   2.73000  1.0 2.8 TiB 2.1 TiB 654 GiB 76.84 0.97 152   
                      osd.27
 28   hdd   2.73000  1.0 2.8 TiB 2.3 TiB 481 GiB 82.98 1.05 153   
                      osd.28
 29   hdd   2.73000  1.0 2.8 TiB 1.9 TiB 829 GiB 70.65 0.89 137   
                      osd.29
 30   hdd   2.73000  1.0 2.8 TiB 2.0 TiB 762 GiB 73.03 0.92 142   
                      osd.30
 33   hdd   2.73000  1.0 2.8 TiB 2.3 TiB 501 GiB 82.25 1.04 166   
                      osd.33
 34   hdd   5.45998  1.0 5.5 TiB 4.5 TiB 968 GiB 82.77 1.04 325   
                      osd.34
 39   hdd   2.73000  0.95000 2.8 TiB 2.4 TiB 402 GiB 85.77 1.08 162   
                      osd.39
 38   ssd   0.74500  1.0 745 GiB 671 GiB  74 GiB 90.02 1.14  68   
                      osd.38

                       TOTAL 113 TiB  90 TiB  23 TiB 79.25
MIN/MAX VAR: 0.74/1.14  STDDEV: 8.14



# for i in $(radosgw-admin bucket list | jq -r '.[]'); do 
radosgw-admin bucket stats --bucket=$i | jq '.usage | ."rgw.main" | 
.size_kb' ; done | awk '{ SUM += $1} END { print SUM/1024/1024/1024 }'

6.59098


# ceph df


GLOBAL:
    SIZE        AVAIL      RAW USED     %RAW USED
    113 TiB     23 TiB       90 TiB         79.25

POOLS:
    NAME                           ID     USED  %USED     MAX AVAIL   
  OBJECTS
    Primary-ubuntu-1               5       27 TiB 87.56       3.9 TiB 
    7302534
    .users.uid                     15     6.8 KiB 0       3.9 TiB     
     39
    .users                         16       335 B 0       3.9 TiB     
     20
    .users.swift                   17        14 B 0       3.9 TiB     
      1
*.rgw.buckets                   19      15 TiB     79.88       3.9 TiB 
    8787763*
    .users.email                   22         0 B 0       3.9 TiB     
      0
    .log                           24     109 MiB 0       3.9 TiB     
 102301
    .rgw.buckets.extra             37         0 B 0       2.6 TiB     
      0
    .rgw.root                      44     2.9 KiB 0       2.6 TiB     
     16
    .rgw.meta                      45     1.7 MiB 0       2.6 TiB     
   6249
    .rgw.control                   46         0 B 0       2.6 TiB     
      8
    .rgw.gc                        47         0 B 0       2.6 TiB     
     32
    .usage                         52         0 B 0       2.6 TiB     
      0
    .intent-log                    53         0 B 0       2.6 TiB     
      0
    default.rgw.buckets.non-ec     54         0 B 0       2.6 TiB     
      0
    .rgw.buckets.index             55         0 B 0       2.6 TiB     
  11485
    .rgw                           56     491 KiB 0       2.6 TiB     
   1686
    Primary-ubuntu-1-ssd           57     1.2 TiB 92.39       105 GiB 
     379516



I am not too sure if the issue relates to the BlueStore overhead as I 
would probably have seen the discrepancy in my Primary-ubuntu-1 pool 
as well. However, the data usage on Primary-ubuntu-1 pool seems to be 
consistent with my expectations (precise numbers to be verified soon). 
The issue seems to be only with the .rgw.buckets pool where the "ceph 
df" output shows 15TB of usage and the sum of all buckets in that 
pool shows just over 6.5TB.


Cheers

Andrei


--------

*From: *"Igor Fedotov" 
*To: *"andrei" , "ceph-users"

*Sent: *Tuesday, 2 July, 2019 10:58:54
*Subject: *Re: [ceph-users] troubleshooting space usage

Hi Andrei,

The most obvious reason is space usage overhead caused by
BlueStore allocation granularity, e.g. if bluestore_min_alloc_size
is 64K and the average object size is 16K one will waste 48K per
object on average. This is rather speculation so far, as we lack
the key information about your cluster:

- Ceph version

- What are the main devices for OSD: hdd or ssd.

- BlueStore or FileStore.

- average RGW object size.

You might also want to collect and share performance counter dumps
(ceph daemon osd.N perf dump) and "ceph osd df tree" reports from a
couple of your OSDs.


Thanks,

Igor


On 7/2/2019 11:43 AM, Andrei Mikhailovsky wrote:

Bump!




*From: *"

Re: [ceph-users] troubleshooting space usage

2019-07-02 Thread Igor Fedotov

Hi Andrei,

The most obvious reason is space usage overhead caused by BlueStore 
allocation granularity, e.g. if bluestore_min_alloc_size is 64K and the 
average object size is 16K one will waste 48K per object on average. 
This is rather speculation so far, as we lack the key information about 
your cluster:


- Ceph version

- What are the main devices for OSD: hdd or ssd.

- BlueStore or FileStore.

- average RGW object size.

You might also want to collect and share performance counter dumps (ceph 
daemon osd.N perf dump) and "ceph osd df tree" reports from a couple of 
your OSDs.



Thanks,

Igor


On 7/2/2019 11:43 AM, Andrei Mikhailovsky wrote:

Bump!




*From: *"Andrei Mikhailovsky" 
*To: *"ceph-users" 
*Sent: *Friday, 28 June, 2019 14:54:53
*Subject: *[ceph-users] troubleshooting space usage

Hi

Could someone please explain / show how to troubleshoot the space
usage in Ceph and how to reclaim the unused space?

I have a small cluster with 40 osds, replica of 2, mainly used as
a backend for cloud stack as well as the S3 gateway. The used
space doesn't make any sense to me, especially the rgw pool, so I
am seeking help.

Here is what I found from the client:

Ceph -s shows the

 usage:   89 TiB used, 24 TiB / 113 TiB avail

Ceph df shows:

Primary-ubuntu-1               5       27 TiB 90.11       3.0 TiB
    7201098
Primary-ubuntu-1-ssd           57     1.2 TiB 89.62       143 GiB
     359260
.rgw.buckets         19      15 TiB     83.73       3.0 TiB 874

the usage of the Primary-ubuntu-1 and Primary-ubuntu-1-ssd is in
line with my expectations. However, the .rgw.buckets pool seems to
be using way too much. The usage of all rgw buckets shows 6.5TB
usage (looking at the size_kb values from the "radosgw-admin
bucket stats"). I am trying to figure out why .rgw.buckets is
using 15TB of space instead of the 6.5TB as shown from the bucket
usage.

Thanks

Andrei

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD bluestore initialization failed

2019-06-21 Thread Igor Fedotov
:0xc1f50+10,1:0xc1f60+10,1:0xc1f70+10,1:0xc1f80+10,1:0xc1f90+10,1:0xc1fa0+10,1:0xc1fb0+10,1:0xc1fc0+10,1:0xc1fd0+10,1:0xc1fe0+10,1:0xc1ff0+10,1:0xc2000+10,1:0xc2010+10,1:0xc2020+10,1:0xc2030+10,1:0xc2040+10,1:0xc2050+10,1:0xc2060+10,1:0xc2070+10,1:0xc2080+10,1:0xc2090+10,1:0xc20a0+10,1:0xc20b0+10,1:0xc20c0+10,1:0xc20d0+10,1:0xc2160+10,1:0xc2180+10])
2019-06-21 10:50:56.475433 7f462db84d00 10 bluefs _read h 
0x5632d7074b80 0x104000~1000 from file(ino 1 size 0x104000 mtime 
0.00 bdev 0 allocated 50 extents 
[1:0xc9bf0+10,1:0xc9bb0+40])
2019-06-21 10:50:56.475436 7f462db84d00 20 bluefs _read left 0xfc000 
len 0x1000

2019-06-21 10:50:56.475438 7f462db84d00 20 bluefs _read got 4096
2019-06-21 10:50:56.475440 7f462db84d00 10 bluefs _replay 0x104000: 
txn(seq 332735 len 0xca5 crc 0x4715a5c6)


The entire file is 17M and I can send it if necessary.

Saulo Augusto Silva

On Fri, 21 Jun 2019 at 06:42, Igor Fedotov <mailto:ifedo...@suse.de>> wrote:


Hi Saulo,

looks like disk I/O error.

Will you set debug_bluefs to 20 and collect the log, then share  a
few lines prior to the assertion?

Checking smartctl output might be a good idea too.
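
One way to capture that (a sketch, assuming the OSD is managed by systemd and
logs to the default location; <id> and <device> are placeholders):

# in ceph.conf, under [osd] (or [osd.<id>] for just the failing one):
#     debug_bluefs = 20
systemctl restart ceph-osd@<id>
tail -n 200 /var/log/ceph/ceph-osd.<id>.log
smartctl -a /dev/<device>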

Thanks,

Igor

On 6/21/2019 11:30 AM, Saulo Silva wrote:

Hi,

After a power failure, all OSDs from a pool fail with the
following error:

 -5> 2019-06-20 13:32:58.886299 7f146bcb2d00  4 rocksdb:

[/home/abuild/rpmbuild/BUILD/ceph-12.2.12-573-g67074fa839/src/rocksdb/db/version_set.cc:2859]
Recovered from manifest file:db/MANIFEST-003373
succeeded,manifest_file_number is 3373, next_file_number is 3598,
last_sequence is 319489940, log_number is 0,prev_log_number is
0,max_column_family is 0

    -4> 2019-06-20 13:32:58.886330 7f146bcb2d00  4 rocksdb:

[/home/abuild/rpmbuild/BUILD/ceph-12.2.12-573-g67074fa839/src/rocksdb/db/version_set.cc:2867]
Column family [default] (ID 0), log number is 3594

    -3> 2019-06-20 13:32:58.886401 7f146bcb2d00  4 rocksdb:
EVENT_LOG_v1 {"time_micros": 1561048378886391, "job": 1, "event":
"recovery_started", "log_files": [3592, 3594]}
    -2> 2019-06-20 13:32:58.886407 7f146bcb2d00  4 rocksdb:

[/home/abuild/rpmbuild/BUILD/ceph-12.2.12-573-g67074fa839/src/rocksdb/db/db_impl_open.cc:482]
Recovering log #3592 mode 0
    -1> 2019-06-20 13:33:06.629066 7f146bcb2d00  4 rocksdb:

[/home/abuild/rpmbuild/BUILD/ceph-12.2.12-573-g67074fa839/src/rocksdb/db/db_impl_open.cc:482]
Recovering log #3594 mode 0
     0> 2019-06-20 13:33:10.086512 7f146bcb2d00 -1

/home/abuild/rpmbuild/BUILD/ceph-12.2.12-573-g67074fa839/src/os/bluestore/BlueFS.cc:
In function 'int BlueFS::_read(BlueFS::FileReader*,
BlueFS::FileReaderBuffer*, uint64_t, size_t, ceph::bufferlist*,
char*)' thread 7f146bcb2d00 time 2019-06-20 13:33:10.073021

/home/abuild/rpmbuild/BUILD/ceph-12.2.12-573-g67074fa839/src/os/bluestore/BlueFS.cc:
996: FAILED assert(r == 0)

All of the OSDs only read 2 logs and then return the error.
Is it possible to delete the rocksdb log and start the OSD again?

Best Regards,

Saulo Augusto Silva

___
ceph-users mailing list
ceph-users@lists.ceph.com  <mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] understanding the bluestore blob, chunk and compression params

2019-06-21 Thread Igor Fedotov
Actually there are two issues here - the first one (fixed by #28688) is 
that OSD compression settings are not loaded when the OSD compression mode 
is none while the pool one isn't none.


Submitted https://github.com/ceph/ceph/pull/28688 to fix this part.

The second - the OSD doesn't see pool settings after a restart until some 
setting is (re)set.


Just made another ticket: http://tracker.ceph.com/issues/40483 
<http://tracker.ceph.com/issues/40480>



Thanks,

Igor


On 6/21/2019 12:44 PM, Dan van der Ster wrote:

http://tracker.ceph.com/issues/40480

On Thu, Jun 20, 2019 at 9:12 PM Dan van der Ster  wrote:

I will try to reproduce with logs and create a tracker once I find the
smoking gun...

It's very strange -- I had the osd mode set to 'passive', and pool
option set to 'force', and the osd was compressing objects for around
15 minutes. Then suddenly it just stopped compressing, until I did
'ceph daemon osd.130 config set bluestore_compression_mode force',
where it restarted immediately.

FTR, it *should* compress with osd bluestore_compression_mode=none and
the pool's compression_mode=force, right?

-- dan

-- Dan

On Thu, Jun 20, 2019 at 6:57 PM Igor Fedotov  wrote:

I'd like to see more details (preferably backed with logs) on this...

On 6/20/2019 6:23 PM, Dan van der Ster wrote:

P.S. I know this has been discussed before, but the
compression_(mode|algorithm) pool options [1] seem completely broken
-- With the pool mode set to force, we see that sometimes the
compression is invoked and sometimes it isn't. AFAICT,
the only way to compress every object is to set
bluestore_compression_mode=force on the osd.

-- dan

[1] http://docs.ceph.com/docs/master/rados/operations/pools/#set-pool-values


On Thu, Jun 20, 2019 at 4:33 PM Dan van der Ster  wrote:

Hi all,

I'm trying to compress an rbd pool via backfilling the existing data,
and the allocated space doesn't match what I expect.

Here is the test: I marked osd.130 out and waited for it to erase all its data.
Then I set (on the pool) compression_mode=force and compression_algorithm=zstd.
Then I marked osd.130 to get its PGs/objects back (this time compressing them).

After a few 10s of minutes we have:
  "bluestore_compressed": 989250439,
  "bluestore_compressed_allocated": 3859677184,
  "bluestore_compressed_original": 7719354368,

So, the allocated is exactly 50% of original, but we are wasting space
because compressed is 12.8% of original.

I don't understand why...

The rbd images all use 4MB objects, and we use the default chunk and
blob sizes (in v13.2.6):
 osd_recovery_max_chunk = 8MB
 bluestore_compression_max_blob_size_hdd = 512kB
 bluestore_compression_min_blob_size_hdd = 128kB
 bluestore_max_blob_size_hdd = 512kB
 bluestore_min_alloc_size_hdd = 64kB

  From my understanding, backfilling should read a whole 4MB object from
the src osd, then write it to osd.130's bluestore, compressing in
512kB blobs. Those compress on average at 12.8% so I would expect to
see allocated being closer to bluestore_min_alloc_size_hdd /
bluestore_compression_max_blob_size_hdd = 12.5%.

Does someone understand where the 0.5 ratio is coming from?

Thanks!

Dan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] understanding the bluestore blob, chunk and compression params

2019-06-21 Thread Igor Fedotov



On 6/20/2019 10:12 PM, Dan van der Ster wrote:

I will try to reproduce with logs and create a tracker once I find the
smoking gun...

It's very strange -- I had the osd mode set to 'passive', and pool
option set to 'force', and the osd was compressing objects for around
15 minutes. Then suddenly it just stopped compressing, until I did
'ceph daemon osd.130 config set bluestore_compression_mode force',
where it restarted immediately.

FTR, it *should* compress with osd bluestore_compression_mode=none and
the pool's compression_mode=force, right?
right, but it looks like there is a bug: the osd compression algorithm isn't 
applied when the osd compression mode is set to none. Hence no compression if 
the pool lacks an explicit algorithm specification.
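
A quick way to see what each layer actually has configured (a sketch; the
pool name and osd id are placeholders):

ceph osd pool get <pool> compression_mode
ceph osd pool get <pool> compression_algorithm
ceph daemon osd.<id> config get bluestore_compression_mode
ceph daemon osd.<id> config get bluestore_compression_algorithm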


-- dan

-- Dan

On Thu, Jun 20, 2019 at 6:57 PM Igor Fedotov  wrote:

I'd like to see more details (preferably backed with logs) on this...

On 6/20/2019 6:23 PM, Dan van der Ster wrote:

P.S. I know this has been discussed before, but the
compression_(mode|algorithm) pool options [1] seem completely broken
-- With the pool mode set to force, we see that sometimes the
compression is invoked and sometimes it isn't. AFAICT,
the only way to compress every object is to set
bluestore_compression_mode=force on the osd.

-- dan

[1] http://docs.ceph.com/docs/master/rados/operations/pools/#set-pool-values


On Thu, Jun 20, 2019 at 4:33 PM Dan van der Ster  wrote:

Hi all,

I'm trying to compress an rbd pool via backfilling the existing data,
and the allocated space doesn't match what I expect.

Here is the test: I marked osd.130 out and waited for it to erase all its data.
Then I set (on the pool) compression_mode=force and compression_algorithm=zstd.
Then I marked osd.130 to get its PGs/objects back (this time compressing them).

After a few 10s of minutes we have:
  "bluestore_compressed": 989250439,
  "bluestore_compressed_allocated": 3859677184,
  "bluestore_compressed_original": 7719354368,

So, the allocated is exactly 50% of original, but we are wasting space
because compressed is 12.8% of original.

I don't understand why...

The rbd images all use 4MB objects, and we use the default chunk and
blob sizes (in v13.2.6):
 osd_recovery_max_chunk = 8MB
 bluestore_compression_max_blob_size_hdd = 512kB
 bluestore_compression_min_blob_size_hdd = 128kB
 bluestore_max_blob_size_hdd = 512kB
 bluestore_min_alloc_size_hdd = 64kB

  From my understanding, backfilling should read a whole 4MB object from
the src osd, then write it to osd.130's bluestore, compressing in
512kB blobs. Those compress on average at 12.8% so I would expect to
see allocated being closer to bluestore_min_alloc_size_hdd /
bluestore_compression_max_blob_size_hdd = 12.5%.

Does someone understand where the 0.5 ratio is coming from?

Thanks!

Dan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD bluestore initialization failed

2019-06-21 Thread Igor Fedotov

Hi Saulo,

looks like disk I/O error.

Will you set debug_bluefs to 20 and collect the log, then share a few 
lines prior to the assertion?


Checking smartctl output might be a good idea too.

Thanks,

Igor

On 6/21/2019 11:30 AM, Saulo Silva wrote:

Hi,

After a power failure, all OSDs from a pool fail with the 
following error:


 -5> 2019-06-20 13:32:58.886299 7f146bcb2d00  4 rocksdb: 
[/home/abuild/rpmbuild/BUILD/ceph-12.2.12-573-g67074fa839/src/rocksdb/db/version_set.cc:2859] 
Recovered from manifest file:db/MANIFEST-003373 
succeeded,manifest_file_number is 3373, next_file_number is 3598, 
last_sequence is 319489940, log_number is 0,prev_log_number is 
0,max_column_family is 0


    -4> 2019-06-20 13:32:58.886330 7f146bcb2d00  4 rocksdb: 
[/home/abuild/rpmbuild/BUILD/ceph-12.2.12-573-g67074fa839/src/rocksdb/db/version_set.cc:2867] 
Column family [default] (ID 0), log number is 3594


    -3> 2019-06-20 13:32:58.886401 7f146bcb2d00  4 rocksdb: 
EVENT_LOG_v1 {"time_micros": 1561048378886391, "job": 1, "event": 
"recovery_started", "log_files": [3592, 3594]}
    -2> 2019-06-20 13:32:58.886407 7f146bcb2d00  4 rocksdb: 
[/home/abuild/rpmbuild/BUILD/ceph-12.2.12-573-g67074fa839/src/rocksdb/db/db_impl_open.cc:482] 
Recovering log #3592 mode 0
    -1> 2019-06-20 13:33:06.629066 7f146bcb2d00  4 rocksdb: 
[/home/abuild/rpmbuild/BUILD/ceph-12.2.12-573-g67074fa839/src/rocksdb/db/db_impl_open.cc:482] 
Recovering log #3594 mode 0
     0> 2019-06-20 13:33:10.086512 7f146bcb2d00 -1 
/home/abuild/rpmbuild/BUILD/ceph-12.2.12-573-g67074fa839/src/os/bluestore/BlueFS.cc: 
In function 'int BlueFS::_read(BlueFS::FileReader*, 
BlueFS::FileReaderBuffer*, uint64_t, size_t, ceph::bufferlist*, 
char*)' thread 7f146bcb2d00 time 2019-06-20 13:33:10.073021
/home/abuild/rpmbuild/BUILD/ceph-12.2.12-573-g67074fa839/src/os/bluestore/BlueFS.cc: 
996: FAILED assert(r == 0)


All of the OSDs only read 2 logs and then return the error.
Is it possible to delete the rocksdb log and start the OSD again?

Best Regards,

Saulo Augusto Silva

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] understanding the bluestore blob, chunk and compression params

2019-06-20 Thread Igor Fedotov



On 6/20/2019 8:55 PM, Dan van der Ster wrote:

On Thu, Jun 20, 2019 at 6:55 PM Igor Fedotov  wrote:

Hi Dan,

bluestore_compression_max_blob_size is applied for objects marked with
some additional hints only:

if ((alloc_hints & CEPH_OSD_ALLOC_HINT_FLAG_SEQUENTIAL_READ) &&
(alloc_hints & CEPH_OSD_ALLOC_HINT_FLAG_RANDOM_READ) == 0 &&
(alloc_hints & (CEPH_OSD_ALLOC_HINT_FLAG_IMMUTABLE |
CEPH_OSD_ALLOC_HINT_FLAG_APPEND_ONLY)) &&
(alloc_hints & CEPH_OSD_ALLOC_HINT_FLAG_RANDOM_WRITE) == 0) {

  dout(20) << __func__ << " will prefer large blob and csum sizes" <<
dendl;


For regular objects "bluestore_compression_max_blob_size" is used. Which
results in minimum ratio  = 0.5

I presume you mean bluestore_compression_min_blob_size ...

yeah, indeed.


Going back to the thread Frank linked later in this thread I
understand now I can double bluestore_compression_min_blob_size to get
0.25, or halve bluestore_min_alloc_size_hdd  (at osd creation time) to
get 0.25. That seems clear now (though I wonder if the option names
are slightly misleading ...)
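
Making that arithmetic explicit (a restatement of the thread's conclusion,
not new information): for a regular object the best allocated/original ratio
is roughly min_alloc_size / compression_min_blob_size, since each compressed
blob still occupies at least one allocation unit:

awk 'BEGIN { printf "%.2f %.2f %.2f\n", 64/128, 64/256, 32/128 }'
# 0.50 with the hdd defaults (64K min_alloc, 128K min blob)
# 0.25 after doubling bluestore_compression_min_blob_size_hdd to 256K
# 0.25 after redeploying OSDs with bluestore_min_alloc_size_hdd=32K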

Now I'll try to observe any performance impact of increased
min_blob_size... Do you recall if there were some benchmarks done to
pick those current defaults?


Unfortunately I don't recall any...




Thanks!

Dan


-- Dan



Thanks,

Igor

On 6/20/2019 5:33 PM, Dan van der Ster wrote:

Hi all,

I'm trying to compress an rbd pool via backfilling the existing data,
and the allocated space doesn't match what I expect.

Here is the test: I marked osd.130 out and waited for it to erase all its data.
Then I set (on the pool) compression_mode=force and compression_algorithm=zstd.
Then I marked osd.130 to get its PGs/objects back (this time compressing them).

After a few 10s of minutes we have:
  "bluestore_compressed": 989250439,
  "bluestore_compressed_allocated": 3859677184,
  "bluestore_compressed_original": 7719354368,

So, the allocated is exactly 50% of original, but we are wasting space
because compressed is 12.8% of original.

I don't understand why...

The rbd images all use 4MB objects, and we use the default chunk and
blob sizes (in v13.2.6):
 osd_recovery_max_chunk = 8MB
 bluestore_compression_max_blob_size_hdd = 512kB
 bluestore_compression_min_blob_size_hdd = 128kB
 bluestore_max_blob_size_hdd = 512kB
 bluestore_min_alloc_size_hdd = 64kB

  From my understanding, backfilling should read a whole 4MB object from
the src osd, then write it to osd.130's bluestore, compressing in
512kB blobs. Those compress on average at 12.8% so I would expect to
see allocated being closer to bluestore_min_alloc_size_hdd /
bluestore_compression_max_blob_size_hdd = 12.5%.

Does someone understand where the 0.5 ratio is coming from?

Thanks!

Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] understanding the bluestore blob, chunk and compression params

2019-06-20 Thread Igor Fedotov

I'd like to see more details (preferably backed with logs) on this...

On 6/20/2019 6:23 PM, Dan van der Ster wrote:

P.S. I know this has been discussed before, but the
compression_(mode|algorithm) pool options [1] seem completely broken
-- With the pool mode set to force, we see that sometimes the
compression is invoked and sometimes it isn't. AFAICT,
the only way to compress every object is to set
bluestore_compression_mode=force on the osd.

-- dan

[1] http://docs.ceph.com/docs/master/rados/operations/pools/#set-pool-values


On Thu, Jun 20, 2019 at 4:33 PM Dan van der Ster  wrote:

Hi all,

I'm trying to compress an rbd pool via backfilling the existing data,
and the allocated space doesn't match what I expect.

Here is the test: I marked osd.130 out and waited for it to erase all its data.
Then I set (on the pool) compression_mode=force and compression_algorithm=zstd.
Then I marked osd.130 to get its PGs/objects back (this time compressing them).

After a few 10s of minutes we have:
 "bluestore_compressed": 989250439,
 "bluestore_compressed_allocated": 3859677184,
 "bluestore_compressed_original": 7719354368,

So, the allocated is exactly 50% of original, but we are wasting space
because compressed is 12.8% of original.

I don't understand why...

The rbd images all use 4MB objects, and we use the default chunk and
blob sizes (in v13.2.6):
osd_recovery_max_chunk = 8MB
bluestore_compression_max_blob_size_hdd = 512kB
bluestore_compression_min_blob_size_hdd = 128kB
bluestore_max_blob_size_hdd = 512kB
bluestore_min_alloc_size_hdd = 64kB

 From my understanding, backfilling should read a whole 4MB object from
the src osd, then write it to osd.130's bluestore, compressing in
512kB blobs. Those compress on average at 12.8% so I would expect to
see allocated being closer to bluestore_min_alloc_size_hdd /
bluestore_compression_max_blob_size_hdd = 12.5%.

Does someone understand where the 0.5 ratio is coming from?

Thanks!

Dan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] understanding the bluestore blob, chunk and compression params

2019-06-20 Thread Igor Fedotov

Hi Dan,

bluestore_compression_max_blob_size is applied for objects marked with 
some additional hints only:


  if ((alloc_hints & CEPH_OSD_ALLOC_HINT_FLAG_SEQUENTIAL_READ) &&
  (alloc_hints & CEPH_OSD_ALLOC_HINT_FLAG_RANDOM_READ) == 0 &&
  (alloc_hints & (CEPH_OSD_ALLOC_HINT_FLAG_IMMUTABLE |
  CEPH_OSD_ALLOC_HINT_FLAG_APPEND_ONLY)) &&
  (alloc_hints & CEPH_OSD_ALLOC_HINT_FLAG_RANDOM_WRITE) == 0) {

    dout(20) << __func__ << " will prefer large blob and csum sizes" << 
dendl;



For regular objects "bluestore_compression_max_blob_size" is used. Which 
results in minimum ratio  = 0.5



Thanks,

Igor

On 6/20/2019 5:33 PM, Dan van der Ster wrote:

Hi all,

I'm trying to compress an rbd pool via backfilling the existing data,
and the allocated space doesn't match what I expect.

Here is the test: I marked osd.130 out and waited for it to erase all its data.
Then I set (on the pool) compression_mode=force and compression_algorithm=zstd.
Then I marked osd.130 to get its PGs/objects back (this time compressing them).

After a few 10s of minutes we have:
 "bluestore_compressed": 989250439,
 "bluestore_compressed_allocated": 3859677184,
 "bluestore_compressed_original": 7719354368,

So, the allocated is exactly 50% of original, but we are wasting space
because compressed is 12.8% of original.

I don't understand why...

The rbd images all use 4MB objects, and we use the default chunk and
blob sizes (in v13.2.6):
osd_recovery_max_chunk = 8MB
bluestore_compression_max_blob_size_hdd = 512kB
bluestore_compression_min_blob_size_hdd = 128kB
bluestore_max_blob_size_hdd = 512kB
bluestore_min_alloc_size_hdd = 64kB

 From my understanding, backfilling should read a whole 4MB object from
the src osd, then write it to osd.130's bluestore, compressing in
512kB blobs. Those compress on average at 12.8% so I would expect to
see allocated being closer to bluestore_min_alloc_size_hdd /
bluestore_compression_max_blob_size_hdd = 12.5%.

Does someone understand where the 0.5 ratio is coming from?

Thanks!

Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] BlueFS spillover detected - 14.2.1

2019-06-18 Thread Igor Fedotov
Yes, for now I'd personally prefer 60-64 GB per DB volume unless one is 
unable to allocate 300+ GB.


This is 2x larger than what your DBs keep right now (and is pretty much in 
line with the RocksDB Level 3 max size).



Thanks,

Igor


On 6/18/2019 9:30 PM, Brett Chancellor wrote:
Thanks Igor. I'm fine turning the warnings off, but it's curious that 
only this cluster is showing the alerts. Is there any value in 
rebuilding them with smaller SSD metadata volumes? Say 60GB or 30GB?


-Brett

On Tue, Jun 18, 2019 at 1:55 PM Igor Fedotov <mailto:ifedo...@suse.de>> wrote:


Hi Brett,

this issue has been with you since long before the upgrade to 14.2.1. This
upgrade just made the corresponding alert visible.

You can turn the alert off by setting
bluestore_warn_on_bluefs_spillover=false.
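
On 14.2.x this can be applied cluster-wide through the config database, e.g.
(an assumption; it can equally be set in ceph.conf):

ceph config set osd bluestore_warn_on_bluefs_spillover false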

But generally this warning shows a DB data layout inefficiency -
some data is kept on the slow device - which might have some negative
performance impact.

Unfortunately that's a known issue with the current RocksDB/BlueStore
interaction - spillovers to the slow device might take place even when
there is plenty of free space on the fast one.
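
To see how much each OSD actually spills, the "bluefs" perf counters can be
compared per OSD (a sketch, assuming jq):

ceph daemon osd.0 perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'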


Thanks,

Igor



On 6/18/2019 8:46 PM, Brett Chancellor wrote:

Does anybody have a fix for BlueFS spillover detected? This
started happening 2 days after an upgrade to 14.2.1 and has
increased from 3 OSDs to 118 in the last 4 days.  I read you
could fix it by rebuilding the OSDs, but rebuilding the 264 OSDs
on this cluster will take months of rebalancing.

$ sudo ceph health detail
HEALTH_WARN BlueFS spillover detected on 118 OSD(s)
BLUEFS_SPILLOVER BlueFS spillover detected on 118 OSD(s)
     osd.0 spilled over 22 GiB metadata from 'db' device (28 GiB
used of 148 GiB) to slow device
     osd.1 spilled over 103 GiB metadata from 'db' device (28 GiB
used of 148 GiB) to slow device
     osd.5 spilled over 21 GiB metadata from 'db' device (28 GiB
used of 148 GiB) to slow device
     osd.6 spilled over 64 GiB metadata from 'db' device (28 GiB
used of 148 GiB) to slow device
     osd.11 spilled over 22 GiB metadata from 'db' device (28 GiB
used of 148 GiB) to slow device
     osd.13 spilled over 23 GiB metadata from 'db' device (29 GiB
used of 148 GiB) to slow device
     osd.21 spilled over 102 GiB metadata from 'db' device (28
GiB used of 148 GiB) to slow device
     osd.22 spilled over 103 GiB metadata from 'db' device (28
GiB used of 148 GiB) to slow device
     osd.23 spilled over 24 GiB metadata from 'db' device (28 GiB
used of 148 GiB) to slow device
     osd.24 spilled over 25 GiB metadata from 'db' device (28 GiB
used of 148 GiB) to slow device
     osd.25 spilled over 24 GiB metadata from 'db' device (28 GiB
used of 148 GiB) to slow device
     osd.26 spilled over 64 GiB metadata from 'db' device (28 GiB
used of 148 GiB) to slow device
     osd.27 spilled over 21 GiB metadata from 'db' device (28 GiB
used of 148 GiB) to slow device
     osd.30 spilled over 65 GiB metadata from 'db' device (28 GiB
used of 148 GiB) to slow device
     osd.32 spilled over 21 GiB metadata from 'db' device (29 GiB
used of 148 GiB) to slow device
     osd.34 spilled over 24 GiB metadata from 'db' device (28 GiB
used of 148 GiB) to slow device
     osd.42 spilled over 25 GiB metadata from 'db' device (28 GiB
used of 148 GiB) to slow device
     osd.45 spilled over 103 GiB metadata from 'db' device (28
GiB used of 148 GiB) to slow device
     osd.46 spilled over 24 GiB metadata from 'db' device (29 GiB
used of 148 GiB) to slow device
     osd.47 spilled over 63 GiB metadata from 'db' device (28 GiB
used of 148 GiB) to slow device
     osd.48 spilled over 63 GiB metadata from 'db' device (28 GiB
used of 148 GiB) to slow device
     osd.49 spilled over 62 GiB metadata from 'db' device (28 GiB
used of 148 GiB) to slow device
     osd.50 spilled over 24 GiB metadata from 'db' device (28 GiB
used of 148 GiB) to slow device
     osd.52 spilled over 140 GiB metadata from 'db' device (28
GiB used of 148 GiB) to slow device
     osd.53 spilled over 22 GiB metadata from 'db' device (28 GiB
used of 148 GiB) to slow device
     osd.54 spilled over 59 GiB metadata from 'db' device (29 GiB
used of 148 GiB) to slow device
     osd.55 spilled over 134 GiB metadata from 'db' device (28
GiB used of 148 GiB) to slow device
     osd.56 spilled over 19 GiB metadata from 'db' device (28 GiB
used of 148 GiB) to slow device
     osd.57 spilled over 61 GiB metadata from 'db' device (28 GiB
used of 148 GiB) to slow device
     osd.58 spilled over 66 GiB metadata from 'db' device (28 GiB
used of 148 GiB) to slow device
     osd.59 spilled over 24 GiB metadata from 'db' device (28 GiB
used of 148 GiB) to slow device
  

Re: [ceph-users] BlueFS spillover detected - 14.2.1

2019-06-18 Thread Igor Fedotov

Hi Brett,

this issue has been with you since long before the upgrade to 14.2.1. The upgrade 
just made the corresponding alert visible.


You can turn the alert off by setting 
bluestore_warn_on_bluefs_spillover=false.
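
In case it helps, this is roughly how that setting can be applied on a 
14.2.x cluster (a sketch; assumes the centralized config database is in 
use, otherwise put the option into ceph.conf and restart the OSDs):

# persist the setting in the monitors' config database
ceph config set osd bluestore_warn_on_bluefs_spillover false

# or push it to all running OSDs without a restart
ceph tell osd.* injectargs '--bluestore_warn_on_bluefs_spillover=false'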


But generally this warning indicates a DB data layout inefficiency - some data 
is kept on the slow device - which might have some negative performance impact.


Unfortunately that's a known issue with the current RocksDB/BlueStore 
interaction - spillovers to the slow device might take place even when there 
is plenty of free space on the fast one.



Thanks,

Igor



On 6/18/2019 8:46 PM, Brett Chancellor wrote:
Does anybody have a fix for BlueFS spillover detected? This started 
happening 2 days after an upgrade to 14.2.1 and has increased from 3 
OSDs to 118 in the last 4 days. I read you could fix it by rebuilding 
the OSDs, but rebuilding the 264 OSDs on this cluster will take months 
of rebalancing.


$ sudo ceph health detail
HEALTH_WARN BlueFS spillover detected on 118 OSD(s)
BLUEFS_SPILLOVER BlueFS spillover detected on 118 OSD(s)
     osd.0 spilled over 22 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.1 spilled over 103 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.5 spilled over 21 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.6 spilled over 64 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.11 spilled over 22 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.13 spilled over 23 GiB metadata from 'db' device (29 GiB used 
of 148 GiB) to slow device
     osd.21 spilled over 102 GiB metadata from 'db' device (28 GiB 
used of 148 GiB) to slow device
     osd.22 spilled over 103 GiB metadata from 'db' device (28 GiB 
used of 148 GiB) to slow device
     osd.23 spilled over 24 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.24 spilled over 25 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.25 spilled over 24 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.26 spilled over 64 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.27 spilled over 21 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.30 spilled over 65 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.32 spilled over 21 GiB metadata from 'db' device (29 GiB used 
of 148 GiB) to slow device
     osd.34 spilled over 24 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.42 spilled over 25 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.45 spilled over 103 GiB metadata from 'db' device (28 GiB 
used of 148 GiB) to slow device
     osd.46 spilled over 24 GiB metadata from 'db' device (29 GiB used 
of 148 GiB) to slow device
     osd.47 spilled over 63 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.48 spilled over 63 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.49 spilled over 62 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.50 spilled over 24 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.52 spilled over 140 GiB metadata from 'db' device (28 GiB 
used of 148 GiB) to slow device
     osd.53 spilled over 22 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.54 spilled over 59 GiB metadata from 'db' device (29 GiB used 
of 148 GiB) to slow device
     osd.55 spilled over 134 GiB metadata from 'db' device (28 GiB 
used of 148 GiB) to slow device
     osd.56 spilled over 19 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.57 spilled over 61 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.58 spilled over 66 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.59 spilled over 24 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.61 spilled over 24 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.62 spilled over 59 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.65 spilled over 19 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.67 spilled over 62 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.69 spilled over 20 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.71 spilled over 21 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.73 spilled over 24 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.74 spilled over 17 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow device
     osd.75 spilled over 24 GiB metadata from 'db' device (28 GiB used 
of 148 GiB) to slow 

Re: [ceph-users] bluestore_allocated vs bluestore_stored

2019-06-17 Thread Igor Fedotov

Hi Maged,

min_alloc_size determines allocation granularity, hence if the object size 
isn't aligned with its value, allocation overhead still takes place.


E.g. with min_alloc_size = 16K and object size = 24K total allocation 
(i.e. bluestore_allocated) would be 32K.
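
Just to illustrate the rounding (a back-of-the-envelope sketch, not code 
taken from BlueStore itself):

# round an object size up to the next min_alloc_size multiple
min_alloc=16384    # 16K, as in the example above
obj=24576          # 24K object
echo $(( (obj + min_alloc - 1) / min_alloc * min_alloc ))    # prints 32768, i.e. 32K allocated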


And yes, this overhead is permanent.

Thanks,

Igor

On 6/17/2019 1:06 AM, Maged Mokhtar wrote:

Hi all,

I want to understand more the difference between bluestore_allocated 
and bluestore_stored in the case of no compression. If i am writing 
fixed objects with sizes greater than min alloc size, would 
bluestore_allocated still be higher than bluestore_stored ? If so, is 
this a permanent overhead/penalty or is something the allocator can 
re-use/optimize later as more objects are stored ?


Appreciate any help.

Cheers /Maged



Re: [ceph-users] OSD hanging on 12.2.12 by message worker

2019-06-07 Thread Igor Fedotov

Hi Max,

I don't think this is an allocator-related issue. The symptoms that 
triggered us to start using the bitmap allocator over the stupid one were:


- write op latency gradually increasing over time (days not hours)

- perf showing a significant amount of time spent in allocator-related 
functions


- OSD reboot was the only remedy.

It had nothing to do with network activity and/or client restarts.


Thanks,

Igor


On 6/7/2019 11:05 AM, Max Vernimmen wrote:
Thank you for the suggestion to use the bitmap allocator. I looked at 
the ceph documentation and could find no mention of this setting. This 
makes me wonder how safe and production ready this setting really is. 
I'm hesitant to apply that to our production environment.
If the allocator setting helps to resolve the problem then it looks to 
me like there is a bug in the 'stupid' allocator that is causing this 
behavior. Would this qualify for creating a bug report or is some more 
debugging needed before I can do that?


On Thu, Jun 6, 2019 at 11:18 AM Stefan Kooman wrote:


Quoting Max Vernimmen (vernim...@textkernel.nl):
>
> This is happening several times per day after we made several
changes at
> the same time:
>
>    - add physical ram to the ceph nodes
>    - move from fixed 'bluestore cache size hdd|sdd' and
'bluestore cache kv
>    max' to 'bluestore cache autotune = 1' and 'osd memory target =
>    20401094656'.
>    - update ceph from 12.2.8 to 12.2.11
>    - update clients from 12.2.8 to 12.2.11
>
> We have since upgraded the ceph nodes to 12.2.12 but it did not
help to fix
> this problem.

Have you tried the new bitmap allocator for the OSDs already
(available
since 12.2.12):

[osd]

# MEMORY ALLOCATOR
bluestore_allocator = bitmap
bluefs_allocator = bitmap

The issues you are reporting sound like an issue many of us have
seen on
luminous and mimic clusters and has been identified to be caused
by the
"stupid allocator" memory allocator.

Gr. Stefan


-- 
| BIT BV http://www.bit.nl/       Kamer van Koophandel 09090351

| GPG: 0xD14839C6                   +31 318 648 688 / i...@bit.nl




--
Max Vernimmen
Senior DevOps Engineer
Textkernel

--
Textkernel BV, Nieuwendammerkade 26/a5, 1022 AB, Amsterdam, NL
-



Re: [ceph-users] SSD Sizing for DB/WAL: 4% for large drives?

2019-05-28 Thread Igor Fedotov

Hi Jake,

just my 2 cents - I'd suggest using LVM for DB/WAL to be able to 
seamlessly extend their sizes if needed.


Once you've configured it this way, and if you're able to add more NVMe 
later, you're almost free to select any size at the initial stage.
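
For example, growing a DB volume later would look roughly like this (a 
sketch; LV names and sizes are illustrative, and the OSD should be stopped 
before expanding BlueFS):

systemctl stop ceph-osd@12
lvextend -L+30G /dev/vg0/osd12db
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-12
systemctl start ceph-osd@12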



Thanks,

Igor


On 5/28/2019 4:13 PM, Jake Grimmett wrote:

Dear All,

Quick question regarding SSD sizing for a DB/WAL...

I understand 4% is generally recommended for a DB/WAL.

Does this 4% continue for "large" 12TB drives, or can we  economise and
use a smaller DB/WAL?

Ideally I'd fit a smaller drive providing a 266GB DB/WAL per 12TB OSD,
rather than 480GB. i.e. 2.2% rather than 4%.

Will "bad things" happen as the OSD fills with a smaller DB/WAL?

By the way the cluster will mainly be providing CephFS, fairly large
files, and will use erasure encoding.

many thanks for any advice,

Jake




Re: [ceph-users] Luminous OSD: replace block.db partition

2019-05-28 Thread Igor Fedotov

Konstantin,

one should resize the device before using the bluefs-bdev-expand command.

So the first question should be: what's the backend for block.db - a simple 
device partition, an LVM volume, or a raw file?


LVM volume and raw file resizing is quite simple, while the partition case 
might need manual data movement to another target via dd or something.



Thanks,

Igor


On 5/28/2019 12:28 PM, Konstantin Shalygin wrote:



Hello - I have created an OSD with 20G block.db, now I wanted to change the
block.db to 100G size.
Please let us know if there is a process for the same.

PS: Ceph version 12.2.4 with bluestore backend.



You should upgrade to 12.2.11+ first! Expand your block.db via 
`ceph-bluestore-tool bluefs-bdev-expand --path 
/var/lib/ceph/osd/ceph-`




k




Re: [ceph-users] Lost OSD - 1000: FAILED assert(r == 0)

2019-05-24 Thread Igor Fedotov

Hi Guillaume,

Could you please set debug-bluefs to 20, restart OSD and collect the 
whole log.
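
Something along these lines should do it (a sketch; osd.35 as in your 
report, adjust paths/sections for the Kolla container layout):

# raise BlueFS/BlueStore logging for the failing OSD, then restart it
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd.35]
debug bluefs = 20/20
debug bluestore = 20/20
EOF
# restart the OSD container and collect /var/log/ceph/ceph-osd.35.log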



Thanks,

Igor

On 5/24/2019 4:50 PM, Guillaume Chenuet wrote:

Hi,

We are running a Ceph cluster with 36 OSDs split across 3 servers (12 
OSDs per server) and Ceph version 
12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable).


This cluster is used by an OpenStack private cloud and deployed with 
OpenStack Kolla. Every OSD runs in a Docker container on the server, 
and MON, MGR, MDS, and RGW are running on 3 other servers.


This week, one OSD crashed and failed to restart, with this stack trace:

 Running command: '/usr/bin/ceph-osd -f --public-addr 10.106.142.30 
--cluster-addr 10.106.142.30 -i 35'
+ exec /usr/bin/ceph-osd -f --public-addr 10.106.142.30 --cluster-addr 
10.106.142.30 -i 35
starting osd.35 at - osd_data /var/lib/ceph/osd/ceph-35 
/var/lib/ceph/osd/ceph-35/journal
/builddir/build/BUILD/ceph-12.2.11/src/os/bluestore/BlueFS.cc: In 
function 'int BlueFS::_read(BlueFS::FileReader*, 
BlueFS::FileReaderBuffer*, uint64_t, size_t, ceph::bufferlist*, 
char*)' thread 7efd088d6d80 time 2019-05-24 05:40:47.799918
/builddir/build/BUILD/ceph-12.2.11/src/os/bluestore/BlueFS.cc: 1000: 
FAILED assert(r == 0)
 ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) 
luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x110) [0x556f7833f8f0]
 2: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, 
unsigned long, unsigned long, ceph::buffer::list*, char*)+0xca4) 
[0x556f782b5574]

 3: (BlueFS::_replay(bool)+0x2ef) [0x556f782c82af]
 4: (BlueFS::mount()+0x1d4) [0x556f782cc014]
 5: (BlueStore::_open_db(bool)+0x1847) [0x556f781e0ce7]
 6: (BlueStore::_mount(bool)+0x40e) [0x556f782126ae]
 7: (OSD::init()+0x3bd) [0x556f77dbbaed]
 8: (main()+0x2d07) [0x556f77cbe667]
 9: (__libc_start_main()+0xf5) [0x7efd04fa63d5]
 10: (()+0x4c1f73) [0x556f77d5ef73]
 NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.

*** Caught signal (Aborted) **
 in thread 7efd088d6d80 thread_name:ceph-osd
 ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) 
luminous (stable)

 1: (()+0xa63931) [0x556f78300931]
 2: (()+0xf5d0) [0x7efd05f995d0]
 3: (gsignal()+0x37) [0x7efd04fba207]
 4: (abort()+0x148) [0x7efd04fbb8f8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x284) [0x556f7833fa64]
 6: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, 
unsigned long, unsigned long, ceph::buffer::list*, char*)+0xca4) 
[0x556f782b5574]

 7: (BlueFS::_replay(bool)+0x2ef) [0x556f782c82af]
 8: (BlueFS::mount()+0x1d4) [0x556f782cc014]
 9: (BlueStore::_open_db(bool)+0x1847) [0x556f781e0ce7]
 10: (BlueStore::_mount(bool)+0x40e) [0x556f782126ae]
 11: (OSD::init()+0x3bd) [0x556f77dbbaed]
 12: (main()+0x2d07) [0x556f77cbe667]
 13: (__libc_start_main()+0xf5) [0x7efd04fa63d5]
 14: (()+0x4c1f73) [0x556f77d5ef73]

The cluster health is OK and Ceph sees this OSD as shutdown.

I tried to find more information on the internet about this error 
without luck.

Do you have any idea or input about this error, please?

Thanks,
Guillaume




Re: [ceph-users] Scrub Crash OSD 14.2.1

2019-05-17 Thread Igor Fedotov

Hi Manuel,

Just in case - have you done any manipulation of the underlying 
disk/partition/volume - resize, replacement, etc.?



Thanks,

Igor

On 5/17/2019 3:00 PM, EDH - Manuel Rios Fernandez wrote:


Hi ,

Today we got some osd that crash after scrub. Version 14.2.1

2019-05-17 12:49:40.955 7fd980d8fd80  4 rocksdb: EVENT_LOG_v1 
{"time_micros": 1558090180955778, "job": 1, "event": "recovery_finished"}


2019-05-17 12:49:40.967 7fd980d8fd80  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/rocksdb/db/db_impl_open.cc:1287] 
DB pointer 0x55cbfcfc9000


2019-05-17 12:49:40.967 7fd980d8fd80  1 
bluestore(/var/lib/ceph/osd/ceph-7) _open_db opened rocksdb path db 
options 
compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152


2019-05-17 12:49:40.967 7fd980d8fd80  1 
bluestore(/var/lib/ceph/osd/ceph-7) _upgrade_super from 2, latest 2


2019-05-17 12:49:40.967 7fd980d8fd80  1 
bluestore(/var/lib/ceph/osd/ceph-7) _upgrade_super done


2019-05-17 12:49:41.090 7fd980d8fd80  0  
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/cls/cephfs/cls_cephfs.cc:197: 
loading cephfs


2019-05-17 12:49:41.092 7fd980d8fd80  0  
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/cls/hello/cls_hello.cc:296: 
loading cls_hello


2019-05-17 12:49:41.093 7fd980d8fd80  0 _get_class not permitted to 
load kvs


2019-05-17 12:49:41.096 7fd980d8fd80  0 _get_class not permitted to 
load lua


2019-05-17 12:49:41.121 7fd980d8fd80  0 _get_class not permitted to 
load sdk


2019-05-17 12:49:41.124 7fd980d8fd80  0 osd.7 135670 crush map has 
features 283675107524608, adjusting msgr requires for clients


2019-05-17 12:49:41.124 7fd980d8fd80  0 osd.7 135670 crush map has 
features 283675107524608 was 8705, adjusting msgr requires for mons


2019-05-17 12:49:41.124 7fd980d8fd80  0 osd.7 135670 crush map has 
features 3026702624700514304, adjusting msgr requires for osds


2019-05-17 12:49:50.430 7fd980d8fd80  0 osd.7 135670 load_pgs

2019-05-17 12:50:09.302 7fd980d8fd80  0 osd.7 135670 load_pgs opened 
201 pgs


2019-05-17 12:50:09.303 7fd980d8fd80  0 osd.7 135670 using 
weightedpriority op queue with priority op cut off at 64.


2019-05-17 12:50:09.324 7fd980d8fd80 -1 osd.7 135670 log_to_monitors 
{default=true}


2019-05-17 12:50:09.361 7fd980d8fd80 -1 osd.7 135670 
mon_cmd_maybe_osd_create fail: 'osd.7 has already bound to class 
'archive', can not reset class to 'hdd'; use 'ceph osd crush 
rm-device-class ' to remove old class first': (16) Device or 
resource busy


2019-05-17 12:50:09.365 7fd980d8fd80  0 osd.7 135670 done with init, 
starting boot process


2019-05-17 12:50:09.371 7fd97339d700 -1 osd.7 135670 set_numa_affinity 
unable to identify public interface 'vlan.4094' numa node: (2) No such 
file or directory


2019-05-17 12:50:16.443 7fd95f375700 -1 bdev(0x55cbfcec4e00 
/var/lib/ceph/osd/ceph-7/block) read_random 0x5428527b5be~15b3 error: 
(14) Bad address


2019-05-17 12:50:16.467 7fd95f375700 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/os/bluestore/BlueFS.cc: 
In function 'int BlueFS::_read_random(BlueFS::FileReader*, uint64_t, 
size_t, char*)' thread 7fd95f375700 time 2019-05-17 12:50:16.445954


/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/os/bluestore/BlueFS.cc: 
1337: FAILED ceph_assert(r == 0)


ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) 
nautilus (stable)


1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x14a) [0x55cbf14e265c]


2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char 
const*, char const*, ...)+0) [0x55cbf14e282a]


3: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned 
long, char*)+0x71a) [0x55cbf1b8fd6a]


4: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, 
rocksdb::Slice*, char*) const+0x20) [0x55cbf1bb8440]


5: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned 
long, rocksdb::Slice*, char*) const+0x960) [0x55cbf21e3ba0]


6: (rocksdb::BlockFetcher::ReadBlockContents()+0x3e7) [0x55cbf219dc27]

7: (()+0x11146a4) [0x55cbf218a6a4]

8: 
(rocksdb::BlockBasedTable::MaybeLoadDataBlockToCache(rocksdb::FilePrefetchBuffer*, 
rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, 

Re: [ceph-users] OSDs failing to boot

2019-05-09 Thread Igor Fedotov

Hi Paul,

could you please set both "debug bluestore" and "debug bluefs" to 20, 
run again and share the resulting log.
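
If it is the ceph-bluestore-tool fsck run that aborts, something like this 
should capture a verbose log (a sketch; I'm assuming the tool's --log-file 
and --log-level options here, and the OSD path is illustrative):

ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0 \
    --log-file /tmp/ceph-osd.0.fsck.log --log-level 20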


Thanks,

Igor

On 5/9/2019 2:34 AM, Rawson, Paul L. wrote:

Hi Folks,

I'm having trouble getting some of my OSDs to boot. At some point, these
disks got very full. I fixed the rule that was causing that, and they
are on average ~30% full now.

I'm getting the following in my logs:

      -1> 2019-05-08 16:05:18.956 7fdc7adbbf00 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/include/interval_set.h:
In function 'void interval_set::insert(T, T, T*, T*) [with T =
long unsigned int; Map = std::map, std::allocator > >]' thread 7fdc7adbbf00 time
2019-05-08 16:05:18.953372
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/include/interval_set.h:
490: FAILED ceph_assert(p->first > start+len)

   ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972)
nautilus (stable)
   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x14a) [0x7fdc70daa676]
   2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char
const*, char const*, ...)+0) [0x7fdc70daa844]
   3: (interval_set, std::allocator > > >::insert(unsigned long, unsigned long, unsigned
long*, unsigned long*)+0x45f) [0x55b8960e03df]
   4: (BlueStore::allocate_bluefs_freespace(unsigned long, unsigned long,
std::vector
  >*)+0x74e) [0x55b89611d13e]
   5: (BlueFS::_expand_slow_device(unsigned long,
std::vector
  >&)+0x111) [0x55b8960c8211]
   6: (BlueFS::_allocate(unsigned char, unsigned long,
bluefs_fnode_t*)+0x68b) [0x55b8960c8f7b]
   7: (BlueFS::_allocate(unsigned char, unsigned long,
bluefs_fnode_t*)+0x362) [0x55b8960c8c52]
   8: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned
long)+0xe5) [0x55b8960c95d5]
   9: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0x10b) [0x55b8960cb43b]
   10: (BlueRocksWritableFile::Flush()+0x3d) [0x55b8962bdfcd]
   11: (rocksdb::WritableFileWriter::Flush()+0x19e) [0x55b896531a4e]
   12: (rocksdb::WritableFileWriter::Sync(bool)+0x2e) [0x55b896531d2e]
   13: (rocksdb::BuildTable(std::string const&, rocksdb::Env*,
rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&,
rocksdb::EnvOptions const&, rocksdb::TableCache*,
rocksdb::InternalIteratorBase*,
std::unique_ptr,
std::default_delete > >,
rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&,
std::vector >,
std::allocator > > > const*,
unsigned int, std::string const&, std::vector >, unsigned long,
rocksdb::SnapshotChecker*, rocksdb::CompressionType,
rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*,
rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int,
rocksdb::Env::IOPriority, rocksdb::TableProperties*, int, unsigned long,
unsigned long, rocksdb::Env::WriteLifeTimeHint)+0x2368) [0x55b89655fb68]
   14: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int,
rocksdb::ColumnFamilyData*, rocksdb::MemTable*,
rocksdb::VersionEdit*)+0xc66) [0x55b8963d48c6]
   15: (rocksdb::DBImpl::RecoverLogFiles(std::vector > const&, unsigned long*, bool)+0x1dce)
[0x55b8963d6f1e]
   16:
(rocksdb::DBImpl::Recover(std::vector > const&, bool, bool,
bool)+0x809) [0x55b8963d7db9]
   17: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::string
const&, std::vector > const&,
std::vector >*, rocksdb::DB**, bool,
bool)+0x658) [0x55b8963d8bc8]
   18: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::string const&,
std::vector > const&,
std::vector >*, rocksdb::DB**)+0x24)
[0x55b8963da3a4]
   19: (RocksDBStore::do_open(std::ostream&, bool, bool,
std::vector > const*)+0x1660) [0x55b8961c2a80]
   20: (BlueStore::_open_db(bool, bool, bool)+0xf8e) [0x55b89611b37e]
   21: (BlueStore::_open_db_and_around(bool)+0x165) [0x55b8961388b5]
   22: (BlueStore::_fsck(bool, bool)+0xe5c) [0x55b8961692dc]
   23: (main()+0x107e) [0x55b895fc682e]
   24: (__libc_start_main()+0xf5) [0x7fdc6da4e3d5]
   25: (()+0x2718cf) [0x55b8960ac8cf]

   0> 2019-05-08 16:05:18.960 7fdc7adbbf00 -1 *** Caught signal
(Aborted) **
   in thread 7fdc7adbbf00 thread_name:ceph-bluestore-

   ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972)
nautilus (stable)
   1: (()+0xf5d0) [0x7fdc6f2905d0]
   2: (gsignal()+0x37) [0x7fdc6da62207]
   3: (abort()+0x148) [0x7fdc6da638f8]
   4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x199) [0x7fdc70daa6c5]
   5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char
const*, char const*, ...)+0) [0x7fdc70daa844]
   6: (interval_set, std::allocator > > >::insert(unsigned long, unsigned long, unsigned
long*, unsigned long*)+0x45f) [0x55b8960e03df]
   7: (BlueStore::allocate_bluefs_freespace(unsigned long, unsigned long,
std::vector
  >*)+0x74e) [0x55b89611d13e]
   8: 

Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-06 Thread Igor Fedotov
-05-06 15:11:45.702 7f28f4321d80  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.5/rpm/el7/BUILD/ceph-13.2.5/src/rocksdb/db/version_set.cc:3370] 
Column family [default] (ID 0), log number is 100045


   -11> 2019-05-06 15:11:45.712 7f28f4321d80  4 rocksdb: EVENT_LOG_v1 
{"time_micros": 1557148305719525, "job": 1, "event": 
"recovery_started", "log_files": [100047]}
   -10> 2019-05-06 15:11:45.712 7f28f4321d80  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.5/rpm/el7/BUILD/ceph-13.2.5/src/rocksdb/db/db_impl_open.cc:551] 
Recovering log #100047 mode 0
    -9> 2019-05-06 15:11:45.712 7f28f4321d80  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.5/rpm/el7/BUILD/ceph-13.2.5/src/rocksdb/db/version_set.cc:2863] 
Creating manifest 100049


    -8> 2019-05-06 15:11:45.712 7f28f4321d80  4 rocksdb: EVENT_LOG_v1 
{"time_micros": 1557148305722323, "job": 1, "event": 
"recovery_finished"}
    -7> 2019-05-06 15:11:45.712 7f28f4321d80  5 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.5/rpm/el7/BUILD/ceph-13.2.5/src/rocksdb/db/db_impl_files.cc:380] 
[JOB 2] Delete db//MANIFEST-100046 type=3 #100046 -- OK


    -6> 2019-05-06 15:11:45.712 7f28f4321d80  5 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.5/rpm/el7/BUILD/ceph-13.2.5/src/rocksdb/db/db_impl_files.cc:380] 
[JOB 2] Delete db//100047.log type=0 #100047 -- OK


    -5> 2019-05-06 15:11:45.712 7f28f4321d80  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.5/rpm/el7/BUILD/ceph-13.2.5/src/rocksdb/db/db_impl_open.cc:1218] 
DB pointer 0x55b215bee000
    -4> 2019-05-06 15:11:45.712 7f28f4321d80  1 
bluestore(/var/lib/ceph/osd/ceph-3) _open_db opened rocksdb path db 
options 
compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152

    -3> 2019-05-06 15:11:45.732 7f28f4321d80  1 freelist init
    -2> 2019-05-06 15:11:45.742 7f28f4321d80  1 
bluestore(/var/lib/ceph/osd/ceph-3) _open_alloc opening allocation 
metadata
    -1> 2019-05-06 15:11:45.812 7f28f4321d80  1 
bluestore(/var/lib/ceph/osd/ceph-3) _open_alloc loaded 149 GiB in 
5011 extents
 0> 2019-05-06 15:11:45.822 7f28f4321d80 -1 *** Caught signal 
(Segmentation fault) **

 in thread 7f28f4321d80 thread_name:ceph-osd

 ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic 
(stable)

 1: (()+0x913410) [0x55b213c42410]
 2: (()+0xf5d0) [0x7f28e82845d0]
 3: (BitMapAreaIN::set_blocks_used_int(long, long)+0x41) 
[0x55b213c23d41]

 4: (BitMapAreaIN::set_blocks_used(long, long)+0x35) [0x55b213c24025]
 5: (BitMapAreaIN::set_blocks_used_int(long, long)+0x7d) 
[0x55b213c23d7d]

 6: (BitAllocator::set_blocks_used(long, long)+0x84) [0x55b213c25c84]
 7: (BlueStore::_open_alloc()+0x352) [0x55b213add752]
 8: (BlueStore::_mount(bool, bool)+0x642) [0x55b213b53822]
 9: (OSD::init()+0x28f) [0x55b2136fd08f]
 10: (main()+0x23a3) [0x55b2135db363]
 11: (__libc_start_main()+0xf5) [0x7f28e72913d5]
 12: (()+0x384ab0) [0x55b2136b3ab0]
 NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.


--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 1
  max_new 1000
  log_file /var/log/ceph/ceph-osd.

Re: [ceph-users] Bluestore Compression

2019-05-02 Thread Igor Fedotov

Hi Ashley,

The general rule is that the compression switch does not affect existing data but 
controls future write request processing.


You can enable/disable compression at any time.

Once disabled, no more compression happens. Data that has already been 
compressed remains in this state until removal, overwrite, or (under 
some circumstances, when keeping it compressed isn't beneficial any 
more) garbage collection. The latter is mostly triggered by partial 
overwrites.
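
For reference, the per-pool switches look like this (a sketch; the pool 
name is illustrative):

# enable compression for new writes on a pool
ceph osd pool set mypool compression_mode aggressive
ceph osd pool set mypool compression_algorithm snappy

# disable it again later; already-compressed data stays compressed until rewritten
ceph osd pool set mypool compression_mode none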



Thanks,

Igor

On 5/2/2019 12:20 PM, Ashley Merrick wrote:

Hello,

I am aware that when enabling compression in bluestore it will only 
compress new data.


However, if I had compression enabled for a period of time, is it then 
possible to disable compression and have any data that was compressed 
continue to be decompressed on read as normal, while any new data is not 
compressed?


Or once it's enabled for a pool there is no going back apart from 
creating a new fresh pool?


, Ashley



Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-01 Thread Igor Fedotov

Hi Igor,

yeah, BlueStore allocators are absolutely interchangeable. You can 
switch between them for free.
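
A minimal sketch of such a switch (assuming plain ceph.conf-based 
configuration; restart the OSDs one at a time and let the cluster settle in 
between):

# append the allocator settings on each OSD host
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
bluestore_allocator = bitmap
bluefs_allocator = bitmap
EOF
# then restart the OSDs one by one, e.g.
systemctl restart ceph-osd@3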



Thanks,

Igor


On 5/1/2019 8:59 AM, Igor Podlesny wrote:

On Tue, 30 Apr 2019 at 20:56, Igor Podlesny  wrote:

On Tue, 30 Apr 2019 at 19:10, Denny Fuchs  wrote:
[..]

Any suggestions ?

-- Try different allocator.

Ah, BTW, besides the memory allocator there's another option: the recently
backported bitmap allocator.
Igor Fedotov wrote that it's expected to have a smaller memory footprint
over time:

 http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-April/034299.html

Also, I'm not sure if it's okay to switch existing OSDs "on the fly"
-- changing the config and restarting the OSDs.
Igor (Fedotov), can you please elaborate on this matter?




Re: [ceph-users] Is it possible to run a standalone Bluestore instance?

2019-04-17 Thread Igor Fedotov

Or try full rebuild?

On 4/17/2019 5:37 PM, Igor Fedotov wrote:
Could you please check if libfio_ceph_objectstore.so has been rebuilt 
with your last build?


On 4/17/2019 6:37 AM, Can Zhang wrote:

Thanks for your suggestions.

I tried to build libfio_ceph_objectstore.so, but it fails to load:

```
$ LD_LIBRARY_PATH=./lib ./bin/fio --enghelp=libfio_ceph_objectstore.so

fio: engine libfio_ceph_objectstore.so not loadable
IO engine libfio_ceph_objectstore.so not found
```

I managed to print the dlopen error, it said:

```
dlopen error: ./lib/libfio_ceph_objectstore.so: undefined symbol:
_ZTIN13PriorityCache8PriCacheE
```

I found a not-so-relevant
issue(https://tracker.ceph.com/issues/38360), the error seems to be
caused by mixed versions. My build environment is CentOS 7.5.1804 with
SCL devtoolset-7, and ceph is latest master branch. Does someone know
about the symbol?


Best,
Can Zhang

Best,
Can Zhang


On Tue, Apr 16, 2019 at 8:37 PM Igor Fedotov  wrote:

Besides already mentioned store_test.cc one can also use ceph
objectstore fio plugin
(https://github.com/ceph/ceph/tree/master/src/test/fio) to access
standalone BlueStore instance from FIO benchmarking tool.


Thanks,

Igor

On 4/16/2019 7:58 AM, Can ZHANG wrote:

Hi,

I'd like to run a standalone Bluestore instance so as to test and tune
its performance. Are there any tools about it, or any suggestions?



Best,
Can Zhang



Re: [ceph-users] Is it possible to run a standalone Bluestore instance?

2019-04-17 Thread Igor Fedotov
Could you please check if libfio_ceph_objectstore.so has been rebuilt 
with your last build?


On 4/17/2019 6:37 AM, Can Zhang wrote:

Thanks for your suggestions.

I tried to build libfio_ceph_objectstore.so, but it fails to load:

```
$ LD_LIBRARY_PATH=./lib ./bin/fio --enghelp=libfio_ceph_objectstore.so

fio: engine libfio_ceph_objectstore.so not loadable
IO engine libfio_ceph_objectstore.so not found
```

I managed to print the dlopen error, it said:

```
dlopen error: ./lib/libfio_ceph_objectstore.so: undefined symbol:
_ZTIN13PriorityCache8PriCacheE
```

I found a not-so-relevant
issue(https://tracker.ceph.com/issues/38360), the error seems to be
caused by mixed versions. My build environment is CentOS 7.5.1804 with
SCL devtoolset-7, and ceph is latest master branch. Does someone know
about the symbol?


Best,
Can Zhang

Best,
Can Zhang


On Tue, Apr 16, 2019 at 8:37 PM Igor Fedotov  wrote:

Besides already mentioned store_test.cc one can also use ceph
objectstore fio plugin
(https://github.com/ceph/ceph/tree/master/src/test/fio) to access
standalone BlueStore instance from FIO benchmarking tool.


Thanks,

Igor

On 4/16/2019 7:58 AM, Can ZHANG wrote:

Hi,

I'd like to run a standalone Bluestore instance so as to test and tune
its performance. Are there any tools about it, or any suggestions?



Best,
Can Zhang



Re: [ceph-users] BlueStore bitmap allocator under Luminous and Mimic

2019-04-16 Thread Igor Fedotov


On 4/15/2019 4:17 PM, Wido den Hollander wrote:


On 4/15/19 2:55 PM, Igor Fedotov wrote:

Hi Wido,

the main driver for this backport was multiple complaints about write op
latency increasing over time.

E.g. see thread named:  "ceph osd commit latency increase over time,
until restart" here.

Or http://tracker.ceph.com/issues/38738

Most symptoms showed Stupid Allocator as a root cause for that.

Hence we made the decision to backport the bitmap allocator, which should
work as a fix/workaround.


I see, that makes things clear. Anything users should take into account
when setting:

[osd]
bluestore_allocator = bitmap
bluefs_allocator = bitmap

Writing this here for archival purposes so that users who have the same
question can find it easily.


Nothing specific, but a somewhat different memory usage pattern: the stupid 
allocator has a more dynamic memory usage approach while the bitmap allocator 
is absolutely static in this respect. So depending on the use case an OSD 
might require more or less RAM. E.g. on a fresh deployment the stupid 
allocator's memory requirements are most probably lower than the bitmap 
allocator's ones. But RAM usage for the bitmap allocator doesn't change with 
OSD evolution, while that of the stupid allocator might grow unexpectedly high.
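
If someone wants to watch this, the allocator footprint shows up in the 
bluestore_alloc mempool on a running OSD (a sketch; the osd id is 
illustrative):

ceph daemon osd.3 dump_mempools | grep -A 3 bluestore_alloc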


FWIW the resulting disk fragmentation might be different too. The same applies 
to their performance, but I'm not sure if the latter is visible with the 
full Ceph stack.





Wido


Thanks,

Igor


On 4/15/2019 3:39 PM, Wido den Hollander wrote:

Hi,

With the release of 12.2.12 the bitmap allocator for BlueStore is now
available under Mimic and Luminous.

[osd]
bluestore_allocator = bitmap
bluefs_allocator = bitmap

Before setting this in production: What might the implications be and
what should be thought of?

  From what I've read the bitmap allocator seems to be better in read
performance and uses less memory.

In Nautilus bitmap is the default, but L and M still default to stupid.

Since the bitmap allocator was backported there must be a use-case to
use the bitmap allocator instead of stupid.

Thanks!

Wido


Re: [ceph-users] Is it possible to run a standalone Bluestore instance?

2019-04-16 Thread Igor Fedotov
Besides already mentioned store_test.cc one can also use ceph 
objectstore fio plugin 
(https://github.com/ceph/ceph/tree/master/src/test/fio) to access 
standalone BlueStore instance from FIO benchmarking tool.



Thanks,

Igor

On 4/16/2019 7:58 AM, Can ZHANG wrote:

Hi,

I'd like to run a standalone Bluestore instance so as to test and tune 
its performance. Are there any tools about it, or any suggestions?




Best,
Can Zhang



Re: [ceph-users] BlueStore bitmap allocator under Luminous and Mimic

2019-04-15 Thread Igor Fedotov

Hi Wido,

the main driver for this backport was multiple complaints about write op 
latency increasing over time.


E.g. see thread named:  "ceph osd commit latency increase over time, 
until restart" here.


Or http://tracker.ceph.com/issues/38738

Most symptoms showed Stupid Allocator as a root cause for that.

Hence we made the decision to backport the bitmap allocator, which should 
work as a fix/workaround.



Thanks,

Igor


On 4/15/2019 3:39 PM, Wido den Hollander wrote:

Hi,

With the release of 12.2.12 the bitmap allocator for BlueStore is now
available under Mimic and Luminous.

[osd]
bluestore_allocator = bitmap
bluefs_allocator = bitmap

Before setting this in production: What might the implications be and
what should be thought of?

 From what I've read the bitmap allocator seems to be better in read
performance and uses less memory.

In Nautilus bitmap is the default, but L and M still default to stupid.

Since the bitmap allocator was backported there must be a use-case to
use the bitmap allocator instead of stupid.

Thanks!

Wido


Re: [ceph-users] bluefs-bdev-expand experience

2019-04-12 Thread Igor Fedotov



On 4/11/2019 11:23 PM, Yury Shevchuk wrote:

Hi Igor!

I have upgraded from Luminous to Nautilus and now slow device
expansion works indeed.  The steps are shown below to round up the
topic.

node2# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZERAW USE DATAOMAPMETAAVAIL   %USE  
VAR  PGS STATUS
  0   hdd 0.22739  1.0 233 GiB  91 GiB  90 GiB 208 MiB 816 MiB 142 GiB 
38.92 1.04 128 up
  1   hdd 0.22739  1.0 233 GiB  91 GiB  90 GiB 200 MiB 824 MiB 142 GiB 
38.92 1.04 128 up
  3   hdd 0.227390 0 B 0 B 0 B 0 B 0 B 0 B 
00   0   down
  2   hdd 0.22739  1.0 481 GiB 172 GiB  90 GiB 201 MiB 823 MiB 309 GiB 
35.70 0.96 128 up
 TOTAL 947 GiB 353 GiB 269 GiB 610 MiB 2.4 GiB 594 GiB 37.28
MIN/MAX VAR: 0.96/1.04  STDDEV: 1.62

node2# lvextend -L+50G /dev/vg0/osd2
   Size of logical volume vg0/osd2 changed from 400.00 GiB (102400 extents) to 
450.00 GiB (115200 extents).
   Logical volume vg0/osd2 successfully resized.

node2# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
inferring bluefs devices from bluestore path
2019-04-11 22:28:00.240 7f2e24e190c0 -1 bluestore(/var/lib/ceph/osd/ceph-2) 
_lock_fsid failed to lock /var/lib/ceph/osd/ceph-2/fsid (is another ceph-osd 
still running?)(11) Resource temporarily unavailable
...
*** Caught signal (Aborted) **
[two pages of stack dump stripped]

My mistake in the first place: I tried to expand a running OSD again.

node2# systemctl stop ceph-osd.target

node2# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
inferring bluefs devices from bluestore path
0 : device size 0x4000 : own 0x[1000~3000] = 0x3000 : using 0x8ff000
1 : device size 0x144000 : own 0x[2000~143fffe000] = 0x143fffe000 : using 
0x24dfe000
2 : device size 0x708000 : own 0x[30~4] = 0x4 : 
using 0x0
Expanding...
2 : expanding  from 0x64 to 0x708000
2 : size label updated to 483183820800

node2# ceph-bluestore-tool show-label --dev /dev/vg0/osd2 | grep size
 "size": 483183820800,

node2# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZERAW USE DATAOMAPMETAAVAIL   %USE  
VAR  PGS STATUS
  0   hdd 0.22739  1.0 233 GiB  91 GiB  90 GiB 208 MiB 816 MiB 142 GiB 
38.92 1.10 128 up
  1   hdd 0.22739  1.0 233 GiB  91 GiB  90 GiB 200 MiB 824 MiB 142 GiB 
38.92 1.10 128 up
  3   hdd 0.227390 0 B 0 B 0 B 0 B 0 B 0 B 
00   0   down
  2   hdd 0.22739  1.0 531 GiB 172 GiB  90 GiB 185 MiB 839 MiB 359 GiB 
32.33 0.91 128 up
 TOTAL 997 GiB 353 GiB 269 GiB 593 MiB 2.4 GiB 644 GiB 35.41
MIN/MAX VAR: 0.91/1.10  STDDEV: 3.37

It worked: AVAIL = 594+50 = 644.  Great!
Thanks a lot for your help.

And one more question regarding your last remark is inline below.

On Wed, Apr 10, 2019 at 09:54:35PM +0300, Igor Fedotov wrote:

On 4/9/2019 1:59 PM, Yury Shevchuk wrote:

Igor, thank you, Round 2 is explained now.

Main aka block aka slow device cannot be expanded in Luminus, this
functionality will be available after upgrade to Nautilus.
Wal and db devices can be expanded in Luminous.

Now I have recreated osd2 once again to get rid of the paradoxical
cepf osd df output and tried to test db expansion, 40G -> 60G:

node2:/# ceph-volume lvm zap --destroy --osd-id 2
node2:/# ceph osd lost 2 --yes-i-really-mean-it
node2:/# ceph osd destroy 2 --yes-i-really-mean-it
node2:/# lvcreate -L1G -n osd2wal vg0
node2:/# lvcreate -L40G -n osd2db vg0
node2:/# lvcreate -L400G -n osd2 vg0
node2:/# ceph-volume lvm create --osd-id 2 --bluestore --data vg0/osd2 
--block.db vg0/osd2db --block.wal vg0/osd2wal

node2:/# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE AVAIL  %USE VAR  PGS
   0   hdd 0.22739  1.0 233GiB 9.49GiB 223GiB 4.08 1.24 128
   1   hdd 0.22739  1.0 233GiB 9.49GiB 223GiB 4.08 1.24 128
   3   hdd 0.227390 0B  0B 0B00   0
   2   hdd 0.22739  1.0 400GiB 9.49GiB 391GiB 2.37 0.72 128
  TOTAL 866GiB 28.5GiB 837GiB 3.29
MIN/MAX VAR: 0.72/1.24  STDDEV: 0.83

node2:/# lvextend -L+20G /dev/vg0/osd2db
Size of logical volume vg0/osd2db changed from 40.00 GiB (10240 extents) to 
60.00 GiB (15360 extents).
Logical volume vg0/osd2db successfully resized.

node2:/# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
inferring bluefs devices from bluestore path
   slot 0 /var/lib/ceph/osd/ceph-2//block.wal
   slot 1 /var/lib/ceph/osd/ceph-2//block.db
   slot 2 /var/lib/ceph/osd/ceph-2//block
0 : size 0x4000 : own 0x[1000~3000]
1 : size 0xf : own 0x[2000~9e000]
2 : size 0x64 : own 0x[30~4]
Expanding...
1 : expanding  from 0xa to 0xf
1 : size label updated to 64424509440

node2:/# ceph-bluestore-tool show-label --dev /dev/vg0/osd2db | grep size
  "size": 64424509440,

The label upda

Re: [ceph-users] bluefs-bdev-expand experience

2019-04-10 Thread Igor Fedotov


On 4/9/2019 1:59 PM, Yury Shevchuk wrote:

Igor, thank you, Round 2 is explained now.

Main aka block aka slow device cannot be expanded in Luminus, this
functionality will be available after upgrade to Nautilus.
Wal and db devices can be expanded in Luminous.

Now I have recreated osd2 once again to get rid of the paradoxical
cepf osd df output and tried to test db expansion, 40G -> 60G:

node2:/# ceph-volume lvm zap --destroy --osd-id 2
node2:/# ceph osd lost 2 --yes-i-really-mean-it
node2:/# ceph osd destroy 2 --yes-i-really-mean-it
node2:/# lvcreate -L1G -n osd2wal vg0
node2:/# lvcreate -L40G -n osd2db vg0
node2:/# lvcreate -L400G -n osd2 vg0
node2:/# ceph-volume lvm create --osd-id 2 --bluestore --data vg0/osd2 
--block.db vg0/osd2db --block.wal vg0/osd2wal

node2:/# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE AVAIL  %USE VAR  PGS
  0   hdd 0.22739  1.0 233GiB 9.49GiB 223GiB 4.08 1.24 128
  1   hdd 0.22739  1.0 233GiB 9.49GiB 223GiB 4.08 1.24 128
  3   hdd 0.227390 0B  0B 0B00   0
  2   hdd 0.22739  1.0 400GiB 9.49GiB 391GiB 2.37 0.72 128
 TOTAL 866GiB 28.5GiB 837GiB 3.29
MIN/MAX VAR: 0.72/1.24  STDDEV: 0.83

node2:/# lvextend -L+20G /dev/vg0/osd2db
   Size of logical volume vg0/osd2db changed from 40.00 GiB (10240 extents) to 
60.00 GiB (15360 extents).
   Logical volume vg0/osd2db successfully resized.

node2:/# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
inferring bluefs devices from bluestore path
  slot 0 /var/lib/ceph/osd/ceph-2//block.wal
  slot 1 /var/lib/ceph/osd/ceph-2//block.db
  slot 2 /var/lib/ceph/osd/ceph-2//block
0 : size 0x4000 : own 0x[1000~3000]
1 : size 0xf : own 0x[2000~9e000]
2 : size 0x64 : own 0x[30~4]
Expanding...
1 : expanding  from 0xa to 0xf
1 : size label updated to 64424509440

node2:/# ceph-bluestore-tool show-label --dev /dev/vg0/osd2db | grep size
 "size": 64424509440,

The label updated correctly, but ceph osd df have not changed.
I expected to see 391 + 20 = 411GiB in AVAIL column, but it stays at 391:

node2:/# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE AVAIL  %USE VAR  PGS
  0   hdd 0.22739  1.0 233GiB 9.50GiB 223GiB 4.08 1.24 128
  1   hdd 0.22739  1.0 233GiB 9.50GiB 223GiB 4.08 1.24 128
  3   hdd 0.227390 0B  0B 0B00   0
  2   hdd 0.22739  1.0 400GiB 9.49GiB 391GiB 2.37 0.72 128
 TOTAL 866GiB 28.5GiB 837GiB 3.29
MIN/MAX VAR: 0.72/1.24  STDDEV: 0.83

I have restarted monitors on all three nodes, 391GiB stays intact.

OK, but I used bluefs-bdev-expand on running OSD... probably not good,
it seems to fork by opening bluefs directly... trying once again:

node2:/# systemctl stop ceph-osd@2

node2:/# lvextend -L+20G /dev/vg0/osd2db
   Size of logical volume vg0/osd2db changed from 60.00 GiB (15360 extents) to 
80.00 GiB (20480 extents).
   Logical volume vg0/osd2db successfully resized.

node2:/# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
inferring bluefs devices from bluestore path
  slot 0 /var/lib/ceph/osd/ceph-2//block.wal
  slot 1 /var/lib/ceph/osd/ceph-2//block.db
  slot 2 /var/lib/ceph/osd/ceph-2//block
0 : size 0x4000 : own 0x[1000~3000]
1 : size 0x14 : own 0x[2000~9e000]
2 : size 0x64 : own 0x[30~4]
Expanding...
1 : expanding  from 0xa to 0x14
1 : size label updated to 85899345920

node2:/# systemctl start ceph-osd@2
node2:/# systemctl restart ceph-mon@pier42

node2:/# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE AVAIL  %USE VAR  PGS
  0   hdd 0.22739  1.0 233GiB 9.49GiB 223GiB 4.08 1.24 128
  1   hdd 0.22739  1.0 233GiB 9.50GiB 223GiB 4.08 1.24 128
  3   hdd 0.227390 0B  0B 0B00   0
  2   hdd 0.22739  1.0 400GiB 9.50GiB 391GiB 2.37 0.72   0
 TOTAL 866GiB 28.5GiB 837GiB 3.29
MIN/MAX VAR: 0.72/1.24  STDDEV: 0.83

Something is wrong.  Maybe I was wrong expecting db change to appear
in AVAIL column?  From Bluestore description I got db and slow should
sum up, no?


It was a while ago when db and slow were summed to provide the total store 
size. In the latest Luminous releases that's not true anymore. Ceph uses 
the slow device space only to report SIZE/AVAIL. There is some adjustment 
for the BlueFS part residing on the slow device, but the DB device is definitely 
out of the calculation here for now.


You can also note that the reported SIZE for osd.2 is 400GiB in your case, 
which is exactly in line with the slow device capacity. Hence no DB space is involved.
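
If you want to see how much of the DB device is used anyway, that is still 
reported via the BlueFS perf counters (a sketch; counter names as exposed in 
the bluefs section of perf dump):

ceph daemon osd.2 perf dump | grep -E '"(db|wal|slow)_(total|used)_bytes"'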




Thanks for your help,


-- Yury

On Mon, Apr 08, 2019 at 10:17:24PM +0300, Igor Fedotov wrote:

Hi Yuri,

both issues from Round 2 relate to unsupported expansion for main device.

In fact it doesn't work and silently bypasses the operation in your case.

Please try with a different device...


Also I've just submitted a PR for mimic to indicate the bypass, and will
backport it to Luminous once the mimic patch is approved.

Re: [ceph-users] How to reduce HDD OSD flapping due to rocksdb compacting event?

2019-04-10 Thread Igor Fedotov

It's ceph-bluestore-tool.
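
A rough sketch of how such a migration looks with it (assuming a Nautilus 
ceph-bluestore-tool; device names are illustrative and the OSD must be 
stopped first):

systemctl stop ceph-osd@7
# attach a new (SSD-backed) DB device to the OSD
ceph-bluestore-tool bluefs-bdev-new-db --path /var/lib/ceph/osd/ceph-7 --dev-target /dev/vg0/osd7db
# move the RocksDB data currently sitting on the slow device over to it
ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-7 \
    --devs-source /var/lib/ceph/osd/ceph-7/block --dev-target /var/lib/ceph/osd/ceph-7/block.db
systemctl start ceph-osd@7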

On 4/10/2019 10:27 AM, Wido den Hollander wrote:


On 4/10/19 9:25 AM, jes...@krogh.cc wrote:

On 4/10/19 9:07 AM, Charles Alva wrote:

Hi Ceph Users,

Is there a way to minimize the rocksdb compaction events so that they
won't use all the spinning disk IO and the OSD avoids being marked
as down due to failing to send heartbeats to others?

Right now we have frequent high disk IO utilization every 20-25
minutes, when rocksdb reaches level 4 with 67GB of data to compact.


How big is the disk? RocksDB will need to compact at some point and it
seems that the HDD can't keep up.

I've seen this with many customers and in those cases we offloaded the
WAL+DB to an SSD.

Guess the SSD needs to be pretty durable to handle that?


Always use DC-grade SSDs, but you don't need to buy the most expensive
ones you can find. ~1.5DWPD is sufficient.


Is there a "migration path" to offload this or is it needed to destroy
and re-create the OSD?


In Nautilus release (and maybe Mimic) there is a tool to migrate the DB
to a different device without the need to re-create the OSD. This is
bluestore-dev-tool I think.

Wido


Thanks.

Jesper





Re: [ceph-users] bluefs-bdev-expand experience

2019-04-08 Thread Igor Fedotov

Hi Yuri,

both issues from Round 2 relate to unsupported expansion for main device.

In fact it doesn't work and silently bypasses the operation in your case.

Please try with a different device...


Also I've just submitted a PR for mimic to indicate the bypass, and will 
backport it to Luminous once the mimic patch is approved.


See https://github.com/ceph/ceph/pull/27447


Thanks,

Igor

On 4/5/2019 4:07 PM, Yury Shevchuk wrote:

On Fri, Apr 05, 2019 at 02:42:53PM +0300, Igor Fedotov wrote:

wrt Round 1 - an ability to expand block(main) device has been added to
Nautilus,

see: https://github.com/ceph/ceph/pull/25308

Oh, that's good.  But still separate wal may be good for studying
load on each volume (blktrace) or moving db/wal to another physical
disk by means of LVM transparently to ceph.


wrt Round 2:

- not setting 'size' label looks like a bug although I recall I fixed it...
Will double check.

- wrong stats output is probably related to the lack of monitor restart -
could you please try that and report back if it helps? Or even restart the
whole cluster.. (well I understand that's a bad approach for production but
just to verify my hypothesis)

Mon restart didn't help:

node0:~# systemctl restart ceph-mon@0
node1:~# systemctl restart ceph-mon@1
node2:~# systemctl restart ceph-mon@2
node2:~# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZEUSE AVAIL  %USE  VAR  PGS
  0   hdd 0.22739  1.0  233GiB 9.44GiB 223GiB  4.06 0.12 128
  1   hdd 0.22739  1.0  233GiB 9.44GiB 223GiB  4.06 0.12 128
  3   hdd 0.227390  0B  0B 0B 00   0
  2   hdd 0.22739  1.0  800GiB  409GiB 391GiB 51.18 1.51 128
 TOTAL 1.24TiB  428GiB 837GiB 33.84
MIN/MAX VAR: 0.12/1.51  STDDEV: 26.30

Restarting mgrs and then all ceph daemons on all three nodes didn't
help either:

node2:~# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZEUSE AVAIL  %USE  VAR  PGS
  0   hdd 0.22739  1.0  233GiB 9.43GiB 223GiB  4.05 0.12 128
  1   hdd 0.22739  1.0  233GiB 9.43GiB 223GiB  4.05 0.12 128
  3   hdd 0.227390  0B  0B 0B 00   0
  2   hdd 0.22739  1.0  800GiB  409GiB 391GiB 51.18 1.51 128
 TOTAL 1.24TiB  428GiB 837GiB 33.84
MIN/MAX VAR: 0.12/1.51  STDDEV: 26.30

Maybe we should upgrade to v14.2.0 Nautilus instead of studying old
bugs... after all, this is a toy cluster for now.

Thank you for responding,


-- Yury


On 4/5/2019 2:06 PM, Yury Shevchuk wrote:

Hello all!

We have a toy 3-node Ceph cluster running Luminous 12.2.11 with one
bluestore osd per node.  We started with pretty small OSDs and would
like to be able to expand OSDs whenever needed.  We had two issues
with the expansion: one turned out user-serviceable while the other
probably needs developers' look.  I will describe both shortly.

Round 1
~~~
Trying to expand osd.2 by 1TB:

# lvextend -L+1T /dev/vg0/osd2
  Size of logical volume vg0/osd2 changed from 232.88 GiB (59618 extents) 
to 1.23 TiB (321762 extents).
  Logical volume vg0/osd2 successfully resized.

# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-2//block
1 : size 0x13a3880 : own 0x[1bf220~25430]
Expanding...
1 : can't be expanded. Bypassing...
#

It didn't work.  The explanation can be found in
ceph/src/os/bluestore/BlueFS.cc at line 310:

// returns true if specified device is under full bluefs control
// and hence can be expanded
bool BlueFS::is_device_expandable(unsigned id)
{
  if (id >= MAX_BDEV || bdev[id] == nullptr) {
return false;
  }
  switch(id) {
  case BDEV_WAL:
return true;

  case BDEV_DB:
// true if DB volume is non-shared
return bdev[BDEV_SLOW] != nullptr;
  }
  return false;
}

So we have to use separate block.db and block.wal for OSD to be
expandable.  Indeed, our OSDs were created without separate block.db
and block.wal, like this:

ceph-volume lvm create --bluestore --data /dev/vg0/osd2

Recreating osd.2 with separate block.db and block.wal:

# ceph-volume lvm zap --destroy --osd-id 2
# lvcreate -L1G -n osd2wal vg0
  Logical volume "osd2wal" created.
# lvcreate -L40G -n osd2db vg0
  Logical volume "osd2db" created.
# lvcreate -L400G -n osd2 vg0
  Logical volume "osd2" created.
# ceph-volume lvm create --osd-id 2 --bluestore --data vg0/osd2 --block.db 
vg0/osd2db --block.wal vg0/osd2wal

Resync takes some time, and then we have expandable osd.2.


Round 2
~~~
Trying to expand osd.2 from 400G to 700G:

# lvextend -L+300G /dev/vg0/osd2
  Size of logical volume vg0/osd2 changed from 400.00 GiB (102400 extents) 
to 700.00 GiB (179200 extents).
  Logical volume vg0/osd2 successfully resized.

# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
inferr

Re: [ceph-users] bluefs-bdev-expand experience

2019-04-05 Thread Igor Fedotov

Hi Yuri,

wrt Round 1 - an ability to expand block(main) device has been added to 
Nautilus,


see: https://github.com/ceph/ceph/pull/25308


wrt Round 2:

- not setting 'size' label looks like a bug although I recall I fixed 
it... Will double check.


- wrong stats output is probably related to the lack of monitor restart 
- could you please try that and report back if it helps? Or even restart 
the whole cluster.. (well I understand that's a bad approach for 
production but just to verify my hypothesis)



Thanks,

Igor


On 4/5/2019 2:06 PM, Yury Shevchuk wrote:

Hello all!

We have a toy 3-node Ceph cluster running Luminous 12.2.11 with one
bluestore osd per node.  We started with pretty small OSDs and would
like to be able to expand OSDs whenever needed.  We had two issues
with the expansion: one turned out user-serviceable while the other
probably needs developers' look.  I will describe both shortly.

Round 1
~~~
Trying to expand osd.2 by 1TB:

   # lvextend -L+1T /dev/vg0/osd2
 Size of logical volume vg0/osd2 changed from 232.88 GiB (59618 extents) to 
1.23 TiB (321762 extents).
 Logical volume vg0/osd2 successfully resized.

   # ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2
   inferring bluefs devices from bluestore path
slot 1 /var/lib/ceph/osd/ceph-2//block
   1 : size 0x13a3880 : own 0x[1bf220~25430]
   Expanding...
   1 : can't be expanded. Bypassing...
   #

It didn't work.  The explanation can be found in
ceph/src/os/bluestore/BlueFS.cc at line 310:

   // returns true if specified device is under full bluefs control
   // and hence can be expanded
   bool BlueFS::is_device_expandable(unsigned id)
   {
 if (id >= MAX_BDEV || bdev[id] == nullptr) {
   return false;
 }
 switch(id) {
 case BDEV_WAL:
   return true;

 case BDEV_DB:
   // true if DB volume is non-shared
   return bdev[BDEV_SLOW] != nullptr;
 }
 return false;
   }

So we have to use separate block.db and block.wal for OSD to be
expandable.  Indeed, our OSDs were created without separate block.db
and block.wal, like this:

   ceph-volume lvm create --bluestore --data /dev/vg0/osd2

Recreating osd.2 with separate block.db and block.wal:

   # ceph-volume lvm zap --destroy --osd-id 2
   # lvcreate -L1G -n osd2wal vg0
 Logical volume "osd2wal" created.
   # lvcreate -L40G -n osd2db vg0
 Logical volume "osd2db" created.
   # lvcreate -L400G -n osd2 vg0
 Logical volume "osd2" created.
   # ceph-volume lvm create --osd-id 2 --bluestore --data vg0/osd2 --block.db 
vg0/osd2db --block.wal vg0/osd2wal

Resync takes some time, and then we have expandable osd.2.


Round 2
~~~
Trying to expand osd.2 from 400G to 700G:

   # lvextend -L+300G /dev/vg0/osd2
 Size of logical volume vg0/osd2 changed from 400.00 GiB (102400 extents) 
to 700.00 GiB (179200 extents).
 Logical volume vg0/osd2 successfully resized.

   # ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
   inferring bluefs devices from bluestore path
slot 0 /var/lib/ceph/osd/ceph-2//block.wal
slot 1 /var/lib/ceph/osd/ceph-2//block.db
slot 2 /var/lib/ceph/osd/ceph-2//block
   0 : size 0x4000 : own 0x[1000~3000]
   1 : size 0xa : own 0x[2000~9e000]
   2 : size 0xaf : own 0x[30~4]
   Expanding...
   #


This time expansion appears to work: 0xaf = 700GiB.

However, the size in the block device label has not changed:

   # ceph-bluestore-tool show-label --dev /dev/vg0/osd2
   {
   "/dev/vg0/osd2": {
   "osd_uuid": "a18ff7f7-0de1-4791-ba4b-f3b6d2221f44",
   "size": 429496729600,

429496729600 = 400GiB
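
(A quick arithmetic check of that label value, as a shell one-liner:)

    # 429496729600 / 1024^3 = 400, i.e. the label still carries the old 400 GiB size
    echo $(( 429496729600 / 1024 / 1024 / 1024 ))
    # prints: 400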

Worse, ceph osd df shows the added space as used, not available:

# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL  %USE  VAR  PGS
  0   hdd 0.22739  1.0  233GiB 8.06GiB 225GiB  3.46 0.13 128
  1   hdd 0.22739  1.0  233GiB 8.06GiB 225GiB  3.46 0.13 128
  2   hdd 0.22739  1.0  700GiB  301GiB 399GiB 43.02 1.58  64
 TOTAL 1.14TiB  317GiB 849GiB 27.21
MIN/MAX VAR: 0.13/1.58  STDDEV: 21.43

If I expand osd.2 by another 100G, the space also goes to
"USE" column:

pier42:~# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL  %USE  VAR  PGS
  0   hdd 0.22739  1.0  233GiB 8.05GiB 225GiB  3.46 0.10 128
  1   hdd 0.22739  1.0  233GiB 8.05GiB 225GiB  3.46 0.10 128
  3   hdd 0.22739        0      0B      0B     0B     0    0   0
  2   hdd 0.22739  1.0  800GiB  408GiB 392GiB 51.01 1.52 128
 TOTAL 1.24TiB  424GiB 842GiB 33.51
MIN/MAX VAR: 0.10/1.52  STDDEV: 26.54

So OSD expansion almost works, but not quite.  If you had better luck
with bluefs-bdev-expand, could you please share your story?


-- Yury
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-03-01 Thread Igor Fedotov

resending, not sure the prev email reached the mailing list...


Hi Chen,

thanks for the update. Will prepare patch to periodically reset 
StupidAllocator today.


And just to let you know below is an e-mail from AdamK from RH which 
might explain the issue with the allocator.


Also please note that StupidAllocator might not perform full 
defragmentation at run-time. That's why we observed (as mentioned somewhere 
in the thread) fragmentation growth while the OSD is running and its drop on 
restart. Such a restart rebuilds the internal tree and eliminates the 
run-time defragmentation flaws. Maybe that's the case.



Thanks,

Igor

 Forwarded Message 
Subject:     High CPU in StupidAllocator
Date:     Tue, 12 Feb 2019 10:24:37 +0100
From:     Adam Kupczyk 
To:     IGOR FEDOTOV 


Hi Igor,

I have observed that StupidAllocator can burn a lot of CPU in 
StupidAllocator::allocate_int().

This comes from loops:
while (p != free[bin].end()) {
    if (_aligned_len(p, alloc_unit) >= want_size) {
  goto found;
    }
    ++p;
}

It happens when want_size is close to the upper limit of the bin's size range.
For example, free[5] contains sizes 8192..16383.
When requesting a size like 16000 it is quite likely that multiple chunks 
must be checked.


I have made an attempt to improve it by increasing amount of buckets.
It is done in aclamk/wip-bs-stupid-allocator-2 .

Best regards,

Adam Kupczyk



On 3/1/2019 11:46 AM, Xiaoxi Chen wrote:

igor,
   I can test the patch if we have a package.
   My environment and workload can consistently reproduce the latency 
2-3 days after restarting.
    Sage tells me to try the bitmap allocator to make sure the stupid 
allocator is the bad guy. I have some osds on luminous + bitmap and 
some osds on 14.1.0 + bitmap.  Both look positive so far, but I need 
more time to be sure.
     The perf, log and admin socket analysis leads to the theory that 
in alloc_int the loop sometimes takes a long time with allocator locks 
held, which blocks the release part called from _txc_finish in 
kv_finalize_thread; this thread is also the one that calculates 
state_kv_committing_lat and the overall commit_lat. You can see from the 
admin socket that state_done_latency has a similar trend to commit_latency.
    But we cannot find a theory to explain why a reboot helps: the 
allocator btree will be rebuilt from the freelist manager and it should be 
exactly the same as it was prior to the reboot.  Anything related to pg 
recovery?


   Anyway, as I have a live env and workload, I am more than willing 
to work with you for further investigation


-Xiaoxi

Igor Fedotov mailto:ifedo...@suse.de>> 于 
2019年3月1日周五 上午6:21写道:


Also I think it makes sense to create a ticket at this point. Any
volunteers?

On 3/1/2019 1:00 AM, Igor Fedotov wrote:
> Wondering if somebody would be able to apply simple patch that
> periodically resets StupidAllocator?
>
> Just to verify/disprove the hypothesis it's allocator related
>
> On 2/28/2019 11:57 PM, Stefan Kooman wrote:
>> Quoting Wido den Hollander (w...@42on.com <mailto:w...@42on.com>):
>>> Just wanted to chime in, I've seen this with
Luminous+BlueStore+NVMe
>>> OSDs as well. Over time their latency increased until we
started to
>>> notice I/O-wait inside VMs.
>> On a Luminous 12.2.8 cluster with only SSDs we also hit this
issue I
>> guess. After restarting the OSD servers the latency would drop
to normal
>> values again. See https://owncloud.kooman.org/s/BpkUc7YM79vhcDj
>>
>> Reboots were finished at ~ 19:00.
>>
>> Gr. Stefan
>>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-03-01 Thread Igor Fedotov

Hi Chen,

thanks for the update. Will prepare patch to periodically reset 
StupidAllocator today.


And just to let you know below is an e-mail from AdamK from RH which 
might explain the issue with the allocator.


Also please note that StupidAllocator might not perform full 
defragmentation at run-time. That's why we observed (as mentioned somewhere 
in the thread) fragmentation growth while the OSD is running and its drop on 
restart. Such a restart rebuilds the internal tree and eliminates the 
run-time defragmentation flaws. Maybe that's the case.



Thanks,

Igor

 Forwarded Message 

Subject:High CPU in StupidAllocator
Date:   Tue, 12 Feb 2019 10:24:37 +0100
From:   Adam Kupczyk 
To: IGOR FEDOTOV 



Hi Igor,

I have observed that StupidAllocator can burn a lot of CPU in 
StupidAllocator::allocate_int().

This comes from loops:
while (p != free[bin].end()) {
    if (_aligned_len(p, alloc_unit) >= want_size) {
      goto found;
    }
    ++p;
}

It happens when want_size is close to the upper limit of the bin's size range.
For example, free[5] contains sizes 8192..16383.
When requesting a size like 16000 it is quite likely that multiple chunks 
must be checked.


I have made an attempt to improve it by increasing amount of buckets.
It is done in aclamk/wip-bs-stupid-allocator-2 .

Best regards,

Adam Kupczyk



On 3/1/2019 11:46 AM, Xiaoxi Chen wrote:

igor,
   I can test the patch if we have a package.
   My environment and workload can consistently reproduce the latency 
2-3 days after restarting.
    Sage tells me to try the bitmap allocator to make sure the stupid 
allocator is the bad guy. I have some osds on luminous + bitmap and 
some osds on 14.1.0 + bitmap.  Both look positive so far, but I need 
more time to be sure.
     The perf, log and admin socket analysis leads to the theory that 
in alloc_int the loop sometimes takes a long time with allocator locks 
held, which blocks the release part called from _txc_finish in 
kv_finalize_thread; this thread is also the one that calculates 
state_kv_committing_lat and the overall commit_lat. You can see from the 
admin socket that state_done_latency has a similar trend to commit_latency.
    But we cannot find a theory to explain why a reboot helps: the 
allocator btree will be rebuilt from the freelist manager and it should be 
exactly the same as it was prior to the reboot.  Anything related to pg 
recovery?


   Anyway, as I have a live env and workload, I am more than willing 
to work with you for further investigation


-Xiaoxi

Igor Fedotov mailto:ifedo...@suse.de>> 于 
2019年3月1日周五 上午6:21写道:


Also I think it makes sense to create a ticket at this point. Any
volunteers?

On 3/1/2019 1:00 AM, Igor Fedotov wrote:
> Wondering if somebody would be able to apply simple patch that
> periodically resets StupidAllocator?
>
> Just to verify/disprove the hypothesis it's allocator related
>
> On 2/28/2019 11:57 PM, Stefan Kooman wrote:
>> Quoting Wido den Hollander (w...@42on.com <mailto:w...@42on.com>):
>>> Just wanted to chime in, I've seen this with
Luminous+BlueStore+NVMe
>>> OSDs as well. Over time their latency increased until we
started to
>>> notice I/O-wait inside VMs.
>> On a Luminous 12.2.8 cluster with only SSDs we also hit this
issue I
>> guess. After restarting the OSD servers the latency would drop
to normal
>> values again. See https://owncloud.kooman.org/s/BpkUc7YM79vhcDj
>>
>> Reboots were finished at ~ 19:00.
>>
>> Gr. Stefan
>>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-28 Thread Igor Fedotov
Also I think it makes sense to create a ticket at this point. Any 
volunteers?


On 3/1/2019 1:00 AM, Igor Fedotov wrote:
Wondering if somebody would be able to apply simple patch that 
periodically resets StupidAllocator?


Just to verify/disprove the hypothesis it's allocator related

On 2/28/2019 11:57 PM, Stefan Kooman wrote:

Quoting Wido den Hollander (w...@42on.com):

Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe
OSDs as well. Over time their latency increased until we started to
notice I/O-wait inside VMs.

On a Luminous 12.2.8 cluster with only SSDs we also hit this issue I
guess. After restarting the OSD servers the latency would drop to normal
values again. See https://owncloud.kooman.org/s/BpkUc7YM79vhcDj

Reboots were finished at ~ 19:00.

Gr. Stefan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-28 Thread Igor Fedotov
Wondering if somebody would be able to apply simple patch that 
periodically resets StupidAllocator?


Just to verify/disprove the hypothesis it's allocator related

On 2/28/2019 11:57 PM, Stefan Kooman wrote:

Quoting Wido den Hollander (w...@42on.com):
  

Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe
OSDs as well. Over time their latency increased until we started to
notice I/O-wait inside VMs.

On a Luminous 12.2.8 cluster with only SSDs we also hit this issue I
guess. After restarting the OSD servers the latency would drop to normal
values again. See https://owncloud.kooman.org/s/BpkUc7YM79vhcDj

Reboots were finished at ~ 19:00.

Gr. Stefan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Blocked ops after change from filestore on HDD to bluestore on SDD

2019-02-27 Thread Igor Fedotov

Hi Uwe,

AFAIR the Samsung 860 Pro isn't aimed at the enterprise market; you shouldn't 
use consumer SSDs for Ceph.


I had some experience with the Samsung 960 Pro a while ago and it turned out 
that it handled fsync-ed writes very slowly (compared to the 
original/advertised performance), which can probably be explained by the 
lack of power loss protection on these drives. I suppose it's the same 
in your case.


Here are a couple links on the topic:

https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
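
(For reference, a rough fio invocation in the spirit of those articles for
measuring synchronous 4K write performance - /dev/sdX and the runtime are
placeholders, and this writes to the device, so only run it on a scratch disk:)

    # queue depth 1, single job, synchronous 4K writes - the journal/WAL access pattern
    fio --name=sync-write-test --filename=/dev/sdX \
        --direct=1 --sync=1 --rw=write --bs=4k \
        --numjobs=1 --iodepth=1 \
        --runtime=60 --time_based --group_reporting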


Thanks,

Igor



On 2/27/2019 12:01 AM, Uwe Sauter wrote:

Hi,

TL;DR: In my Ceph clusters I replaced all OSDs from HDDs of several 
brands and models with Samsung 860 Pro SSDs and used the opportunity 
to switch from filestore to bluestore. Now I'm seeing blocked ops in 
Ceph and file system freezes inside VMs. Any suggestions?



I have two Proxmox clusters for virtualization which use Ceph on HDDs 
as backend storage for VMs. About half a year ago I had to increase 
the pool size and used the occasion to switch from filestore to 
bluestore. That was when trouble started. Both clusters showed blocked 
ops that caused freezes inside VMs which needed a reboot to function 
properly again. I wasn't able to identify the cause of the blocking 
ops but I blamed the low performance of the HDDs. It was also the time 
when patches for Spectre/Meltdown were released. Kernel 4.13.x didn't 
show the behavior while kernel 4.15.x did. After several weeks of 
debugging the workaround was to go back to filestore.


Today I replace all HDDs with brand new Samsung 860 Pro SSDs and 
switched to bluestore again (on one cluster). And… the blocked ops 
reappeared. I am out of ideas about the cause.


Any idea why bluestore is so much more demanding on the storage 
devices compared to filestore?


Before switching back to filestore do you have any suggestions for 
debugging? Anything special to check for in the network?


The clusters are both connected via 10GbE (MTU 9000) and are only 
lightly loaded (15 VMs on the first, 6 VMs on the second). Each host 
has 3 SSDs and 64GB memory.


"rados bench" gives decent results for 4M block size but 4K block size 
triggers blocked ops (and only finishes after I restart the OSD with 
the blocked ops). Results below.



Thanks,

Uwe




Results from "rados bench" runs with 4K block size when the cluster 
didn't block:


root@px-hotel-cluster:~# rados bench -p scbench 60 write -b 4K -t 16 
--no-cleanup

hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 
for up to 60 seconds or 0 objects

Object prefix: benchmark_data_px-hotel-cluster_3814550
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s) avg 
lat(s)

    0   0 0 0 0 0 -   0
    1  16  2338  2322   9.06888   9.07031 0.0068972   
0.0068597
    2  16  4631  4615   9.01238   8.95703   0.0076618 
0.00692027
    3  16  6936  6920   9.00928   9.00391   0.0066511 
0.00692966
    4  16  9173  9157   8.94133   8.73828  0.00416256 
0.00698071
    5  16 11535 11519   8.99821   9.22656  0.00799875 
0.00693842
    6  16 13892 13876   9.03287   9.20703  0.00688782 
0.00691459
    7  15 16173 16158   9.01578   8.91406  0.00791589 
0.00692736
    8  16 18406 18390   8.97854   8.71875  0.00745151 
0.00695723
    9  16 20681 20665   8.96822   8.88672   0.0072881 
0.00696475
   10  16 23037 23021   8.99163   9.20312 0.00728763   
0.0069473
   11  16 24261 24245   8.60882   4.78125  0.00502342 
0.00725673
   12  16 25420 25404   8.26863   4.52734  0.00443917 
0.00750865
   13  16 27347 27331   8.21154   7.52734  0.00670819 
0.00760455
   14  16 28750 28734   8.01642   5.48047  0.00617038 
0.00779322
   15  16 30222 30206    7.8653  5.75  0.00700398 
0.00794209
   16  16 32180 32164    7.8517   7.64844 0.00704785   
0.0079573
   17  16 34527 34511   7.92907   9.16797  0.00582831 
0.00788017
   18  15 36969 36954   8.01868   9.54297  0.00635168 
0.00779228
   19  16 39059 39043   8.02609   8.16016  0.00622597 
0.00778436
2019-02-26 21:55:41.623245 min lat: 0.00337595 max lat: 0.431158 avg 
lat: 0.00779143
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s) avg 
lat(s)
   20  16 41079 41063   8.01928   7.89062  0.00649895 
0.00779143
   21  16 43076 43060   8.00878   7.80078  0.00726145 
0.00780128
   22  16 45433 45417   8.06321   9.20703  0.00455727 
0.00774944
   23  16 47763 47747   8.10832   9.10156  0.00582818 
0.00770599
   24  16 50079 50063   8.14738   9.04688   0.0051125 
0.00766894
   25  16 52477 52461   8.19614   9.36719  

Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-20 Thread Igor Fedotov

You're right - WAL/DB expansion capability is present in Luminous+ releases.

But David meant volume migration stuff which appeared in Nautilus, see:

https://github.com/ceph/ceph/pull/23103
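
(A sketch only - the subcommand and flags below are the Nautilus-era syntax,
so check ceph-bluestore-tool --help on your release before relying on them.
It shows how DB data that spilled onto the main device could be moved back to
the dedicated DB volume of osd.0:)

    systemctl stop ceph-osd@0
    # move BlueFS files from the slow (main) device onto the block.db device
    ceph-bluestore-tool bluefs-bdev-migrate \
        --path /var/lib/ceph/osd/ceph-0 \
        --devs-source /var/lib/ceph/osd/ceph-0/block \
        --dev-target /var/lib/ceph/osd/ceph-0/block.db
    systemctl start ceph-osd@0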


Thanks,

Igor

On 2/20/2019 9:22 AM, Konstantin Shalygin wrote:

On 2/19/19 11:46 PM, David Turner wrote:
I don't know that there's anything that can be done to resolve this 
yet without rebuilding the OSD. Based on a Nautilus tool being able 
to resize the DB device, I'm assuming that Nautilus is also capable 
of migrating the DB/WAL between devices.  That functionality would 
allow anyone to migrate their DB back off of their spinner which is 
what's happening to you.  I don't believe that sort of tooling exists 
yet, though, without compiling the Nautilus Beta tooling for yourself.


I think you are wrong there: initially the bluestore tool could expand only 
wal/db devices [1]. With the latest releases of Mimic and Luminous this 
should work fine.


And only master received the feature for expanding the main device [2].



[1] 
https://github.com/ceph/ceph/commit/2184e3077caa9de5f21cc901d26f6ecfb76de9e1


[2] 
https://github.com/ceph/ceph/commit/d07c10dfc02e4cdeda288bf39b8060b10da5bbf9
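
(On a release that contains [2], expanding the main device becomes the same
two-step procedure that already works for WAL/DB volumes - a sketch for osd.2,
with the LV name and size as placeholders:)

    lvextend -L+300G /dev/vg0/osd2
    systemctl stop ceph-osd@2
    ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2
    systemctl start ceph-osd@2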


k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-19 Thread Igor Fedotov

Hi Alexander,

I think op_w_process_latency includes replication times, not 100% sure 
though.


So restarting other nodes might affect latencies at this specific OSD.


Thanks,

Igor

On 2/16/2019 11:29 AM, Alexandre DERUMIER wrote:

There are 10 OSDs in these systems with 96GB of memory in total. We are
running with memory target on 6G right now to make sure there is no
leakage. If this runs fine for a longer period we will go to 8GB per OSD
so it will max out on 80GB leaving 16GB as spare.

Thanks Wido. I send results monday with my increased memory



@Igor:

I have also noticed that sometimes when I have bad latency on an osd on node1 
(restarted 12h ago, for example)
(op_w_process_latency).

If I restart osds on other nodes (last restarted some days ago, so with bigger 
latency), it reduces the latency on the osd of node1 too.

does the "op_w_process_latency" counter include replication time?

- Mail original -
De: "Wido den Hollander" 
À: "aderumier" 
Cc: "Igor Fedotov" , "ceph-users" , 
"ceph-devel" 
Envoyé: Vendredi 15 Février 2019 14:59:30
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

On 2/15/19 2:54 PM, Alexandre DERUMIER wrote:

Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe
OSDs as well. Over time their latency increased until we started to
notice I/O-wait inside VMs.

I'm also noticing it in the VMs. BTW, what is your NVMe disk size?

Samsung PM983 3.84TB SSDs in both clusters.




A restart fixed it. We also increased memory target from 4G to 6G on
these OSDs as the memory would allow it.

I have set memory to 6GB this morning, with 2 osds of 3TB for 6TB nvme.
(my last test was 8gb with 1osd of 6TB, but that didn't help)

There are 10 OSDs in these systems with 96GB of memory in total. We are
running with memory target on 6G right now to make sure there is no
leakage. If this runs fine for a longer period we will go to 8GB per OSD
so it will max out on 80GB leaving 16GB as spare.

As these OSDs were all restarted earlier this week I can't tell how it
will hold up over a longer period. Monitoring (Zabbix) shows the latency
is fine at the moment.

Wido



- Mail original -----
De: "Wido den Hollander" 
À: "Alexandre Derumier" , "Igor Fedotov" 
Cc: "ceph-users" , "ceph-devel" 

Envoyé: Vendredi 15 Février 2019 14:50:34
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

On 2/15/19 2:31 PM, Alexandre DERUMIER wrote:

Thanks Igor.

I'll try to create multiple osds by nvme disk (6TB) to see if behaviour is 
different.

I have other clusters (same ceph.conf), but with 1,6TB drives, and I don't see 
this latency problem.



Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe
OSDs as well. Over time their latency increased until we started to
notice I/O-wait inside VMs.

A restart fixed it. We also increased memory target from 4G to 6G on
these OSDs as the memory would allow it.

But we noticed this on two different 12.2.10/11 clusters.

A restart made the latency drop. Not only the numbers, but the
real-world latency as experienced by a VM as well.

Wido






- Mail original -
De: "Igor Fedotov" 
Cc: "ceph-users" , "ceph-devel" 

Envoyé: Vendredi 15 Février 2019 13:47:57
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

Hi Alexander,

I've read through your reports, nothing obvious so far.

I can only see several times average latency increase for OSD write ops
(in seconds)
0.002040060 (first hour) vs.

0.002483516 (last 24 hours) vs.
0.008382087 (last hour)

subop_w_latency:
0.000478934 (first hour) vs.
0.000537956 (last 24 hours) vs.
0.003073475 (last hour)

and OSD read ops, osd_r_latency:

0.000408595 (first hour)
0.000709031 (24 hours)
0.004979540 (last hour)

What's interesting is that such latency differences are observed neither at
the BlueStore level (any _lat params under the "bluestore" section) nor at the
rocksdb one.

Which probably means that the issue is rather somewhere above BlueStore.

Suggest to proceed with perf dumps collection to see if the picture
stays the same.

W.r.t. memory usage you observed I see nothing suspicious so far - No
decrease in RSS report is a known artifact that seems to be safe.

Thanks,
Igor

On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote:

Hi Igor,

Thanks again for helping !



I have upgrade to last mimic this weekend, and with new autotune memory,
I have setup osd_memory_target to 8G. (my nvme are 6TB)
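
(For completeness, one way that could be set on Mimic with the centralized
config store - the value is in bytes and osd.0 is only a placeholder:)

    # 8 GiB per OSD; scope it to a single daemon with "osd.0" instead of "osd"
    ceph config set osd osd_memory_target 8589934592
    ceph daemon osd.0 config get osd_memory_target   # verify what the daemon runs with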


I have done a lot of perf dump and mempool dump and ps of process to

see rss memory at different hours,

here the reports for osd.0:

http://odisoweb1.odiso.net/perfanalysis/


osd has been started the 12-02-2019 at 08:00

first report after 1h running
http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt


http://odisoweb1.odiso.net/perfanal

Re: [ceph-users] single OSDs cause cluster hickups

2019-02-15 Thread Igor Fedotov

Yeah.

I've been monitoring such issue reports for a while and it looks like 
something is definitely wrong with response times under certain 
circumstances. Not sure if all these reports have the same root cause 
though.


Scrubbing seems to be one of the triggers.

Perhaps we need more low-level detection/warning for high response times 
from HW and/or DB.


Planning to look shortly at how feasible such warning means would be.


Thanks,
Igor

On 2/15/2019 3:24 PM, Denny Kreische wrote:

Hi Igor,

Thanks for your reply.
I can verify, discard is disabled in our cluster:

10:03 root@node106b [fra]:~# ceph daemon osd.417 config show | grep discard
 "bdev_async_discard": "false",
 "bdev_enable_discard": "false",
[...]

So there must be something else causing the problems.

Thanks,
Denny



Am 15.02.2019 um 12:41 schrieb Igor Fedotov :

Hi Denny,

Do not remember exactly when discards appeared in BlueStore but they are 
disabled by default:

See bdev_enable_discard option.


Thanks,

Igor

On 2/15/2019 2:12 PM, Denny Kreische wrote:

Hi,

two weeks ago we upgraded one of our ceph clusters from luminous 12.2.8 to 
mimic 13.2.4, cluster is SSD-only, bluestore-only, 68 nodes, 408 OSDs.
somehow we see strange behaviour since then. Single OSDs seem to block for 
around 5 minutes and this causes the whole cluster and connected applications 
to hang. This happened 5 times during the last 10 days at irregular times, it 
didn't happen before the upgrade.

OSD log shows something like this (more log here: 
https://pastebin.com/6BYam5r4):

[...]
2019-02-14 23:53:39.754 7f379a368700 -1 osd.417 340516 get_health_metrics 
reporting 3 slow ops, oldest is osd_op(client.84226977.0:5112539976 0.dff 
0.1d783dff (undecoded) ondisk+read+known_if_redirected e340516)
2019-02-14 23:53:40.706 7f379a368700 -1 osd.417 340516 get_health_metrics 
reporting 7 slow ops, oldest is osd_op(client.84226977.0:5112539976 0.dff 
0.1d783dff (undecoded) ondisk+read+known_if_redirected e340516)
[...]

In this example osd.417 seems to have a problem. I can see same log line in 
other osd logs with placement groups related to osd.417.
I assume that all placement groups related to osd.417 are hanging or blocked 
when osd.417 is blocked.

How can I see in detail what might cause a certain OSD to stop working?
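
(One way to look at that - a sketch using the OSD admin socket on the node
that hosts osd.417; these dumps show what is currently blocked and what was
slow recently:)

    ceph daemon osd.417 dump_blocked_ops      # ops currently flagged as blocked
    ceph daemon osd.417 dump_ops_in_flight    # everything the OSD is working on
    ceph daemon osd.417 dump_historic_ops     # recent ops with per-stage timings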

The cluster consists of 3 different SSD vendors (micron, samsung, intel), but 
only micron disks are affected until now. we earlier had problems with micron 
SSDs with filestore (xfs), it was fstrim to cause single OSDs to block for 
several minutes. we migrated to bluestore about a year ago. just in case, is 
there any kind of ssd trim/discard happening in bluestore since mimic?

Thanks,
Denny

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-15 Thread Igor Fedotov

Hi Alexander,

I've read through your reports, nothing obvious so far.

I can only see several times average latency increase for OSD write ops 
(in seconds)

0.002040060 (first hour) vs.

0.002483516 (last 24 hours) vs.
0.008382087 (last hour)

subop_w_latency:
0.000478934 (first hour) vs.
0.000537956 (last 24 hours) vs.
0.003073475 (last hour)

and OSD read ops, osd_r_latency:

0.000408595 (first hour)
0.000709031 (24 hours)
0.004979540 (last hour)

What's interesting is that such latency differences are observed neither at
the BlueStore level (any _lat params under the "bluestore" section) nor at the
rocksdb one.


Which probably means that the issue is rather somewhere above BlueStore.

Suggest to proceed with perf dumps collection to see if the picture 
stays the same.


W.r.t. memory usage you observed I see nothing suspicious so far - No 
decrease in RSS report is a known artifact that seems to be safe.


Thanks,
Igor

On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote:
> Hi Igor,
>
> Thanks again for helping !
>
>
>
> I have upgrade to last mimic this weekend, and with new autotune memory,
> I have setup osd_memory_target to 8G.  (my nvme are 6TB)
>
>
> I have done a lot of perf dump and mempool dump and ps of process to 
see rss memory at different hours,

> here the reports for osd.0:
>
> http://odisoweb1.odiso.net/perfanalysis/
>
>
> osd has been started the 12-02-2019 at 08:00
>
> first report after 1h running
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt
> 
http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt

> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt
>
>
>
> report  after 24 before counter resets
>
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt
> 
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt

> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt
>
> report 1h after counter reset
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt
> 
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt

> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt
>
>
>
>
> I'm seeing the bluestore buffer bytes memory increasing up to 4G  
around 12-02-2019 at 14:00

> http://odisoweb1.odiso.net/perfanalysis/graphs2.png
> Then after that, slowly decreasing.
>
>
> Another strange thing,
> I'm seeing total bytes at 5G at 12-02-2018.13:30
> 
http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt
> Then is decreasing over time (around 3,7G this morning), but RSS is 
still at 8G

>
>
> I'm graphing mempools counters too since yesterday, so I'll able to 
track them over time.

>
> - Mail original -
> De: "Igor Fedotov" 
> À: "Alexandre Derumier" 
> Cc: "Sage Weil" , "ceph-users" 
, "ceph-devel" 

> Envoyé: Lundi 11 Février 2019 12:03:17
> Objet: Re: [ceph-users] ceph osd commit latency increase over time, 
until restart

>
> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote:
>> another mempool dump after 1h run. (latency ok)
>>
>> Biggest difference:
>>
>> before restart
>> -
>> "bluestore_cache_other": {
>> "items": 48661920,
>> "bytes": 1539544228
>> },
>> "bluestore_cache_data": {
>> "items": 54,
>> "bytes": 643072
>> },
>> (other caches seem to be quite low too, like bluestore_cache_other 
take all the memory)

>>
>>
>> After restart
>> -
>> "bluestore_cache_other": {
>> "items": 12432298,
>> "bytes": 500834899
>> },
>> "bluestore_cache_data": {
>> "items": 40084,
>> "bytes": 1056235520
>> },
>>
> This is fine as cache is warming after restart and some rebalancing
> between data and metadata might occur.
>
> What relates to allocator and most probably to fragmentation growth is :
>
> "bluestore_alloc": {
> "items": 165053952,
> "bytes": 165053952
> },
>
> which had been higher before the reset (if I got these dumps' order
> properly)
>
> "bluestore_alloc": {
> "items": 210243456,
> "bytes": 210243456
> },
>
> But as I mentioned - I'm not 100% sure this might cause such a huge
> latency increase...
>
> Do you have perf counters dump after the restart?
>
> Could you collect some more dumps - for both mempool and perf counters?
>
> So ideally I'd like to have:
>
> 1) memp

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-15 Thread Igor Fedotov

Hi Alexander,

I've read through your reports, nothing obvious so far.

I can only see several times average latency increase for OSD write ops 
(in seconds)


0.002040060 (first hour) vs.

0.002483516 (last 24 hours) vs.
0.008382087 (last hour)

subop_w_latency:
0.000478934 (first hour) vs.
0.000537956 (last 24 hours) vs.
0.003073475 (last hour)

and OSD read ops, osd_r_latency:

0.000408595 (first hour)
0.000709031 (24 hours)
0.004979540 (last hour)
  
What's interesting is that such latency differences are observed neither at the BlueStore level (any _lat params under the "bluestore" section) nor at the rocksdb one.


Which probably means that the issue is rather somewhere above BlueStore.

Suggest to proceed with perf dumps collection to see if the picture stays the 
same.

W.r.t. memory usage you observed I see nothing suspicious so far - No decrease 
in RSS report is a known artifact that seems to be safe.

Thanks,
Igor

On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote:

Hi Igor,

Thanks again for helping !



I have upgrade to last mimic this weekend, and with new autotune memory,
I have setup osd_memory_target to 8G.  (my nvme are 6TB)


I have done a lot of perf dump and mempool dump and ps of process to see rss 
memory at different hours,
here the reports for osd.0:

http://odisoweb1.odiso.net/perfanalysis/


osd has been started the 12-02-2019 at 08:00

first report after 1h running
http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt
http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt
http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt



report  after 24 before counter resets

http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt

report 1h after counter reset
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt




I'm seeing the bluestore buffer bytes memory increasing up to 4G  around 
12-02-2019 at 14:00
http://odisoweb1.odiso.net/perfanalysis/graphs2.png
Then after that, slowly decreasing.


Another strange thing,
I'm seeing total bytes at 5G at 12-02-2018.13:30
http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt
Then is decreasing over time (around 3,7G this morning), but RSS is still at 8G


I'm graphing mempools counters too since yesterday, so I'll able to track them 
over time.

- Mail original -
De: "Igor Fedotov" 
À: "Alexandre Derumier" 
Cc: "Sage Weil" , "ceph-users" , 
"ceph-devel" 
Envoyé: Lundi 11 Février 2019 12:03:17
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote:

another mempool dump after 1h run. (latency ok)

Biggest difference:

before restart
-
"bluestore_cache_other": {
"items": 48661920,
"bytes": 1539544228
},
"bluestore_cache_data": {
"items": 54,
"bytes": 643072
},
(other caches seem to be quite low too, like bluestore_cache_other take all the 
memory)


After restart
-
"bluestore_cache_other": {
"items": 12432298,
"bytes": 500834899
},
"bluestore_cache_data": {
"items": 40084,
"bytes": 1056235520
},


This is fine as cache is warming after restart and some rebalancing
between data and metadata might occur.

What relates to allocator and most probably to fragmentation growth is :

"bluestore_alloc": {
"items": 165053952,
"bytes": 165053952
},

which had been higher before the reset (if I got these dumps' order
properly)

"bluestore_alloc": {
"items": 210243456,
"bytes": 210243456
},

But as I mentioned - I'm not 100% sure this might cause such a huge
latency increase...

Do you have perf counters dump after the restart?

Could you collect some more dumps - for both mempool and perf counters?

So ideally I'd like to have:

1) mempool/perf counters dumps after the restart (1hour is OK)

2) mempool/perf counters dumps in 24+ hours after restart

3) reset perf counters after 2), wait for 1 hour (and without OSD
restart) and dump mempool/perf counters again.

So we'll be able to learn both allocator mem usage growth and operation
latency distribution for the following periods:

a) 1st hour after restart

b) 25th hour.


Thanks,

Igor



full mempool dump after restart
---

{
"mempool": {
"by_pool": {
"bloom_filter": {
"items": 0,
"bytes": 0
},
"bluestore_alloc": {
"items": 165053952,

Re: [ceph-users] single OSDs cause cluster hickups

2019-02-15 Thread Igor Fedotov

Hi Denny,

Do not remember exactly when discards appeared in BlueStore but they are 
disabled by default:


See bdev_enable_discard option.


Thanks,

Igor

On 2/15/2019 2:12 PM, Denny Kreische wrote:

Hi,

two weeks ago we upgraded one of our ceph clusters from luminous 12.2.8 to 
mimic 13.2.4, cluster is SSD-only, bluestore-only, 68 nodes, 408 OSDs.
somehow we see strange behaviour since then. Single OSDs seem to block for 
around 5 minutes and this causes the whole cluster and connected applications 
to hang. This happened 5 times during the last 10 days at irregular times, it 
didn't happen before the upgrade.

OSD log shows something like this (more log here: 
https://pastebin.com/6BYam5r4):

[...]
2019-02-14 23:53:39.754 7f379a368700 -1 osd.417 340516 get_health_metrics 
reporting 3 slow ops, oldest is osd_op(client.84226977.0:5112539976 0.dff 
0.1d783dff (undecoded) ondisk+read+known_if_redirected e340516)
2019-02-14 23:53:40.706 7f379a368700 -1 osd.417 340516 get_health_metrics 
reporting 7 slow ops, oldest is osd_op(client.84226977.0:5112539976 0.dff 
0.1d783dff (undecoded) ondisk+read+known_if_redirected e340516)
[...]

In this example osd.417 seems to have a problem. I can see same log line in 
other osd logs with placement groups related to osd.417.
I assume that all placement groups related to osd.417 are hanging or blocked 
when osd.417 is blocked.

How can I see in detail what might cause a certain OSD to stop working?

The cluster consists of 3 different SSD vendors (micron, samsung, intel), but 
only micron disks are affected until now. we earlier had problems with micron 
SSDs with filestore (xfs), it was fstrim to cause single OSDs to block for 
several minutes. we migrated to bluestore about a year ago. just in case, is 
there any kind of ssd trim/discard happening in bluestore since mimic?

Thanks,
Denny

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-11 Thread Igor Fedotov


On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote:

another mempool dump after 1h run. (latency ok)

Biggest difference:

before restart
-
"bluestore_cache_other": {
"items": 48661920,
"bytes": 1539544228
},
"bluestore_cache_data": {
"items": 54,
"bytes": 643072
},
(other caches seem to be quite low too, like bluestore_cache_other take all the 
memory)


After restart
-
"bluestore_cache_other": {
  "items": 12432298,
   "bytes": 500834899
},
"bluestore_cache_data": {
  "items": 40084,
  "bytes": 1056235520
},

This is fine as cache is warming after restart and some rebalancing 
between data and metadata  might occur.


What relates to allocator and most probably to fragmentation growth is :

"bluestore_alloc": {
"items": 165053952,
"bytes": 165053952
},

which had been higher before the reset (if I got these dumps' order 
properly)


"bluestore_alloc": {
"items": 210243456,
"bytes": 210243456
},

But as I mentioned - I'm not 100% sure this might cause such a huge 
latency increase...


Do you have perf counters dump after the restart?

Could you collect some more dumps - for both mempool and perf counters?

So ideally I'd like to have:

1) mempool/perf counters dumps after the restart (1hour is OK)

2) mempool/perf counters dumps in 24+ hours after restart

3) reset perf counters  after 2), wait for 1 hour (and without OSD 
restart) and dump mempool/perf counters again.


So we'll be able to learn both allocator mem usage growth and operation 
latency distribution for the following periods:


a) 1st hour after restart

b) 25th hour.
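
(A minimal sketch of that collection procedure, assuming osd.0 and the default
admin socket; the file-name timestamps are just a convention:)

    # 1) about one hour after the restart
    ceph daemon osd.0 dump_mempools > osd.0.$(date +%F.%H%M).mempools.json
    ceph daemon osd.0 perf dump     > osd.0.$(date +%F.%H%M).perf.json

    # 2) the same two dumps ~24 hours later, then reset the counters
    ceph daemon osd.0 perf reset all

    # 3) one hour after the reset, dump both again (no OSD restart in between)
    ceph daemon osd.0 dump_mempools > osd.0.$(date +%F.%H%M).mempools.json
    ceph daemon osd.0 perf dump     > osd.0.$(date +%F.%H%M).perf.json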


Thanks,

Igor



full mempool dump after restart
---

{
 "mempool": {
 "by_pool": {
 "bloom_filter": {
 "items": 0,
 "bytes": 0
 },
 "bluestore_alloc": {
 "items": 165053952,
 "bytes": 165053952
 },
 "bluestore_cache_data": {
 "items": 40084,
 "bytes": 1056235520
 },
 "bluestore_cache_onode": {
 "items": 5,
 "bytes": 14935200
 },
 "bluestore_cache_other": {
 "items": 12432298,
 "bytes": 500834899
 },
 "bluestore_fsck": {
 "items": 0,
 "bytes": 0
 },
 "bluestore_txc": {
 "items": 11,
 "bytes": 8184
 },
 "bluestore_writing_deferred": {
 "items": 5047,
 "bytes": 22673736
 },
 "bluestore_writing": {
 "items": 91,
 "bytes": 1662976
 },
 "bluefs": {
 "items": 1907,
 "bytes": 95600
 },
 "buffer_anon": {
 "items": 19664,
 "bytes": 25486050
 },
 "buffer_meta": {
 "items": 46189,
 "bytes": 2956096
 },
 "osd": {
 "items": 243,
 "bytes": 3089016
 },
 "osd_mapbl": {
 "items": 17,
 "bytes": 214366
 },
 "osd_pglog": {
 "items": 889673,
 "bytes": 367160400
 },
 "osdmap": {
 "items": 3803,
 "bytes": 224552
     },
     "osdmap_mapping": {
 "items": 0,
 "bytes": 0
 },
 "pgmap": {
 "items": 0,
 "bytes": 0
 },
 "mds_co": {
 "items": 0,
 "bytes": 0
 },
 "unittest_1": {
 "items": 0,
 "bytes": 0
 },
 "unittest_2": {
 "items": 0,
 "bytes": 0
 }
 },
 &

Re: [ceph-users] SSD OSD crashing after upgrade to 12.2.10

2019-02-07 Thread Igor Fedotov


On 2/7/2019 6:06 PM, Eugen Block wrote:
At first - you should upgrade to 12.2.11 (or bring the mentioned 
patch in by other means) to fix rename procedure which will avoid new 
inconsistent objects appearance in DB. This should at least reduce 
the OSD crash frequency.


We'll have to wait until 12.2.11 is available for openSUSE, I'm not 
sure how long it will take.


So I'd like to have fsck report to verify that. No matter if you do 
fsck before or after the upgrade.


Once we have the fsck report we can proceed with the repair, which is a 
somewhat risky procedure, so maybe I should try to simulate the 
inconsistency in question and check whether the built-in repair handles it 
properly. We'll see; let's get the fsck report first.


I'll try to run the fsck today, I have to wait until there are fewer 
clients active. Depending on the log file size, would it be okay to 
attach it to an email and send it directly to you or what is the best 
procedure for you?


Reasonable (up to 10MB?) email attachment is OK, for larger ones - 
whatever publicly available site is fine.




Thanks for your support!
Eugen

Zitat von Igor Fedotov :


Eugen,

At first - you should upgrade to 12.2.11 (or bring the mentioned 
patch in by other means) to fix rename procedure which will avoid new 
inconsistent objects appearance in DB. This should at least reduce 
the OSD crash frequency.


At second - theoretically previous crashes could result in persistent 
inconsistent objects in your DB. I haven't seen that in real life 
before but probably they exist. We need to check. If so OSD crashes 
might still occur.


So I'd like to have fsck report to verify that. No matter if you do 
fsck before or after the upgrade.


Once we have the fsck report we can proceed with the repair, which is a 
somewhat risky procedure, so maybe I should try to simulate the 
inconsistency in question and check whether the built-in repair handles it 
properly. We'll see; let's get the fsck report first.



W.r.t to running ceph-bluestore-tool - you might want to specify log 
file and increase log level to 20 using --log-file and --log-level 
options.
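
(Putting those hints together, a possible invocation - paths are examples, and
the OSD has to be stopped first since fsck needs exclusive access to the store:)

    systemctl stop ceph-osd@10
    ceph-bluestore-tool fsck \
        --path /var/lib/ceph/osd/ceph-10 \
        --log-file /var/log/ceph/ceph-osd.10.fsck.log \
        --log-level 20
    systemctl start ceph-osd@10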



On 2/7/2019 4:45 PM, Eugen Block wrote:

Hi Igor,

thanks for the quick response!
Just to make sure I don't misunderstand, and because it's a 
production cluster:
before anything else I should run fsck on that OSD? Depending on the 
result we'll decide how to continue, right?
Is there anything else to be enabled for that command or can I 
simply run 'ceph-bluestore-tool fsck --path 
/var/lib/ceph/osd/ceph-'?


Any other obstacles I should be aware of when running fsck?

Thanks!
Eugen


Zitat von Igor Fedotov :


Hi Eugen,

looks like this isn't [1] but rather

https://tracker.ceph.com/issues/38049

and

https://tracker.ceph.com/issues/36541 (= 
https://tracker.ceph.com/issues/36638 for luminous)


Hence it's not fixed in 12.2.10, target release is 12.2.11


Also please note the patch only helps to avoid new occurrences of the 
issue. But there is some chance that inconsistencies caused by it 
earlier are still present in the DB, and the assertion might still happen 
(hopefully with less frequency).


So could you please run fsck for OSDs that were broken once and 
share the results?


Then we can decide if it makes sense to proceed with the repair.


Thanks,

Igor

On 2/7/2019 3:37 PM, Eugen Block wrote:

Hi list,

I found this thread [1] about crashing SSD OSDs, although that was 
about an upgrade to 12.2.7, we just hit (probably) the same issue 
after our update to 12.2.10 two days ago in a production cluster.
Just half an hour ago I saw one OSD (SSD) crashing (for the first 
time):


2019-02-07 13:02:07.682178 mon.host1 mon.0 :6789/0 109754 : 
cluster [INF] osd.10 failed (root=default,host=host1) (connection 
refused reported by osd.20)
2019-02-07 13:02:08.623828 mon.host1 mon.0 :6789/0 109771 : 
cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)


One minute later, the OSD was back online.
This is the stack trace reported in syslog:

---cut here---
2019-02-07T13:01:51.181027+01:00 host1 ceph-osd[1136505]: *** 
Caught signal (Aborted) **
2019-02-07T13:01:51.181232+01:00 host1 ceph-osd[1136505]: in 
thread 7f75ce646700 thread_name:bstore_kv_final
2019-02-07T13:01:51.185873+01:00 host1 ceph-osd[1136505]: ceph 
version 12.2.10-544-gb10c702661 
(b10c702661a31c8563b3421d6d664de93a0cb0e2) luminous (stable)
2019-02-07T13:01:51.186077+01:00 host1 ceph-osd[1136505]: 1: 
(()+0xa587d9) [0x560b921cc7d9]
2019-02-07T13:01:51.186226+01:00 host1 ceph-osd[1136505]: 2: 
(()+0x10b10) [0x7f75d8386b10]
2019-02-07T13:01:51.186368+01:00 host1 ceph-osd[1136505]: 3: 
(gsignal()+0x37) [0x7f75d73508d7]
2019-02-07T13:01:51.186773+01:00 host1 ceph-osd[1136505]: 4: 
(abort()+0x13a) [0x7f75d7351caa]
2019-02-07T13:01:51.186906+01:00 host1 ceph-osd[1136505]: 5: 
(ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x280) [0x560b922096d0]
2019-02-07T13:01:51.187027+01:00 host1 ceph-osd[1136505]: 6: 
(interval_setunsigned long, std::less, 
mempool

Re: [ceph-users] SSD OSD crashing after upgrade to 12.2.10

2019-02-07 Thread Igor Fedotov

Eugen,

At first - you should upgrade to 12.2.11 (or bring the mentioned patch 
in by other means) to fix rename procedure which will avoid new 
inconsistent objects appearance in DB. This should at least reduce the 
OSD crash frequency.


At second - theoretically previous crashes could result in persistent 
inconsistent objects in your DB. I haven't seen that in real life before 
but probably they exist. We need to check. If so OSD crashes might still 
occur.


So I'd like to have fsck report to verify that. No matter if you do fsck 
before or after the upgrade.


Once we have the fsck report we can proceed with the repair, which is a 
somewhat risky procedure, so maybe I should try to simulate the inconsistency 
in question and check whether the built-in repair handles it properly. We'll 
see; let's get the fsck report first.



W.r.t to running ceph-bluestore-tool - you might want to specify log 
file and increase log level to 20 using --log-file and --log-level options.



On 2/7/2019 4:45 PM, Eugen Block wrote:

Hi Igor,

thanks for the quick response!
Just to make sure I don't misunderstand, and because it's a production 
cluster:
before anything else I should run fsck on that OSD? Depending on the 
result we'll decide how to continue, right?
Is there anything else to be enabled for that command or can I simply 
run 'ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-'?


Any other obstacles I should be aware of when running fsck?

Thanks!
Eugen


Zitat von Igor Fedotov :


Hi Eugen,

looks like this isn't [1] but rather

https://tracker.ceph.com/issues/38049

and

https://tracker.ceph.com/issues/36541 (= 
https://tracker.ceph.com/issues/36638 for luminous)


Hence it's not fixed in 12.2.10, target release is 12.2.11


Also please note the patch only helps to avoid new occurrences of the 
issue. But there is some chance that inconsistencies caused by it 
earlier are still present in the DB, and the assertion might still happen 
(hopefully with less frequency).


So could you please run fsck for OSDs that were broken once and share 
the results?


Then we can decide if it makes sense to proceed with the repair.


Thanks,

Igor

On 2/7/2019 3:37 PM, Eugen Block wrote:

Hi list,

I found this thread [1] about crashing SSD OSDs, although that was 
about an upgrade to 12.2.7, we just hit (probably) the same issue 
after our update to 12.2.10 two days ago in a production cluster.
Just half an hour ago I saw one OSD (SSD) crashing (for the first 
time):


2019-02-07 13:02:07.682178 mon.host1 mon.0 :6789/0 109754 : 
cluster [INF] osd.10 failed (root=default,host=host1) (connection 
refused reported by osd.20)
2019-02-07 13:02:08.623828 mon.host1 mon.0 :6789/0 109771 : 
cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)


One minute later, the OSD was back online.
This is the stack trace reported in syslog:

---cut here---
2019-02-07T13:01:51.181027+01:00 host1 ceph-osd[1136505]: *** Caught 
signal (Aborted) **
2019-02-07T13:01:51.181232+01:00 host1 ceph-osd[1136505]:  in thread 
7f75ce646700 thread_name:bstore_kv_final
2019-02-07T13:01:51.185873+01:00 host1 ceph-osd[1136505]: ceph 
version 12.2.10-544-gb10c702661 
(b10c702661a31c8563b3421d6d664de93a0cb0e2) luminous (stable)
2019-02-07T13:01:51.186077+01:00 host1 ceph-osd[1136505]:  1: 
(()+0xa587d9) [0x560b921cc7d9]
2019-02-07T13:01:51.186226+01:00 host1 ceph-osd[1136505]:  2: 
(()+0x10b10) [0x7f75d8386b10]
2019-02-07T13:01:51.186368+01:00 host1 ceph-osd[1136505]:  3: 
(gsignal()+0x37) [0x7f75d73508d7]
2019-02-07T13:01:51.186773+01:00 host1 ceph-osd[1136505]:  4: 
(abort()+0x13a) [0x7f75d7351caa]
2019-02-07T13:01:51.186906+01:00 host1 ceph-osd[1136505]:  5: 
(ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x280) [0x560b922096d0]
2019-02-07T13:01:51.187027+01:00 host1 ceph-osd[1136505]:  6: 
(interval_setunsigned long, std::less, 
mempool::pool_allocator<(mempool::pool_index_t)1, std::pairlong const, unsigned long> >, 256> >::insert(unsigned long, unsigned 
long, unsigned long*, unsigned long*)+0xef2) [0x560b921bd432]
2019-02-07T13:01:51.187167+01:00 host1 ceph-osd[1136505]:  7: 
(StupidAllocator::_insert_free(unsigned long, unsigned long)+0x126) 
[0x560b921b4a06]
2019-02-07T13:01:51.187294+01:00 host1 ceph-osd[1136505]:  8: 
(StupidAllocator::release(unsigned long, unsigned long)+0x7d) 
[0x560b921b4f4d]
2019-02-07T13:01:51.187418+01:00 host1 ceph-osd[1136505]:  9: 
(BlueStore::_txc_release_alloc(BlueStore::TransContext*)+0x72) 
[0x560b9207fa22]
2019-02-07T13:01:51.187539+01:00 host1 ceph-osd[1136505]:  10: 
(BlueStore::_txc_finish(BlueStore::TransContext*)+0x5d7) 
[0x560b92092d77]
2019-02-07T13:01:51.187661+01:00 host1 ceph-osd[1136505]:  11: 
(BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x1f6) 
[0x560b920a3fa6]
2019-02-07T13:01:51.187781+01:00 host1 ceph-osd[1136505]:  12: 
(BlueStore::_kv_finalize_thread()+0x620) [0x560b920a58e0]
2019-02-07T13:01:51.187898+01:00 host1 ceph-osd[1136505]:  13: 
(BlueStore::KVFinalizeThre

Re: [ceph-users] SSD OSD crashing after upgrade to 12.2.10

2019-02-07 Thread Igor Fedotov

Hi Eugen,

looks like this isn't [1] but rather

https://tracker.ceph.com/issues/38049

and

https://tracker.ceph.com/issues/36541 (= 
https://tracker.ceph.com/issues/36638 for luminous)


Hence it's not fixed in 12.2.10, target release is 12.2.11


Also please note the patch only helps to avoid new occurrences of the 
issue. But there is some chance that inconsistencies caused by it earlier 
are still present in the DB, and the assertion might still happen (hopefully 
with less frequency).


So could you please run fsck for OSDs that were broken once and share 
the results?


Then we can decide if it makes sense to proceed with the repair.


Thanks,

Igor

On 2/7/2019 3:37 PM, Eugen Block wrote:

Hi list,

I found this thread [1] about crashing SSD OSDs, although that was 
about an upgrade to 12.2.7, we just hit (probably) the same issue 
after our update to 12.2.10 two days ago in a production cluster.

Just half an hour ago I saw one OSD (SSD) crashing (for the first time):

2019-02-07 13:02:07.682178 mon.host1 mon.0 :6789/0 109754 : 
cluster [INF] osd.10 failed (root=default,host=host1) (connection 
refused reported by osd.20)
2019-02-07 13:02:08.623828 mon.host1 mon.0 :6789/0 109771 : 
cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)


One minute later, the OSD was back online.
This is the stack trace reported in syslog:

---cut here---
2019-02-07T13:01:51.181027+01:00 host1 ceph-osd[1136505]: *** Caught 
signal (Aborted) **
2019-02-07T13:01:51.181232+01:00 host1 ceph-osd[1136505]:  in thread 
7f75ce646700 thread_name:bstore_kv_final
2019-02-07T13:01:51.185873+01:00 host1 ceph-osd[1136505]:  ceph 
version 12.2.10-544-gb10c702661 
(b10c702661a31c8563b3421d6d664de93a0cb0e2) luminous (stable)
2019-02-07T13:01:51.186077+01:00 host1 ceph-osd[1136505]:  1: 
(()+0xa587d9) [0x560b921cc7d9]
2019-02-07T13:01:51.186226+01:00 host1 ceph-osd[1136505]:  2: 
(()+0x10b10) [0x7f75d8386b10]
2019-02-07T13:01:51.186368+01:00 host1 ceph-osd[1136505]:  3: 
(gsignal()+0x37) [0x7f75d73508d7]
2019-02-07T13:01:51.186773+01:00 host1 ceph-osd[1136505]:  4: 
(abort()+0x13a) [0x7f75d7351caa]
2019-02-07T13:01:51.186906+01:00 host1 ceph-osd[1136505]:  5: 
(ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x280) [0x560b922096d0]
2019-02-07T13:01:51.187027+01:00 host1 ceph-osd[1136505]:  6: 
(interval_setlong, std::less, 
mempool::pool_allocator<(mempool::pool_index_t)1, std::pairlong const, unsigned long> >, 256> >::insert(unsigned long, unsigned 
long, unsigned long*, unsigned long*)+0xef2) [0x560b921bd432]
2019-02-07T13:01:51.187167+01:00 host1 ceph-osd[1136505]:  7: 
(StupidAllocator::_insert_free(unsigned long, unsigned long)+0x126) 
[0x560b921b4a06]
2019-02-07T13:01:51.187294+01:00 host1 ceph-osd[1136505]:  8: 
(StupidAllocator::release(unsigned long, unsigned long)+0x7d) 
[0x560b921b4f4d]
2019-02-07T13:01:51.187418+01:00 host1 ceph-osd[1136505]:  9: 
(BlueStore::_txc_release_alloc(BlueStore::TransContext*)+0x72) 
[0x560b9207fa22]
2019-02-07T13:01:51.187539+01:00 host1 ceph-osd[1136505]:  10: 
(BlueStore::_txc_finish(BlueStore::TransContext*)+0x5d7) [0x560b92092d77]
2019-02-07T13:01:51.187661+01:00 host1 ceph-osd[1136505]:  11: 
(BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x1f6) 
[0x560b920a3fa6]
2019-02-07T13:01:51.187781+01:00 host1 ceph-osd[1136505]:  12: 
(BlueStore::_kv_finalize_thread()+0x620) [0x560b920a58e0]
2019-02-07T13:01:51.187898+01:00 host1 ceph-osd[1136505]:  13: 
(BlueStore::KVFinalizeThread::entry()+0xd) [0x560b920fb57d]
2019-02-07T13:01:51.188017+01:00 host1 ceph-osd[1136505]:  14: 
(()+0x8744) [0x7f75d837e744]
2019-02-07T13:01:51.188138+01:00 host1 ceph-osd[1136505]:  15: 
(clone()+0x6d) [0x7f75d7405aad]
2019-02-07T13:01:51.188271+01:00 host1 ceph-osd[1136505]: 2019-02-07 
13:01:51.185833 7f75ce646700 -1 *** Caught signal (Aborted) **

---cut here---

Is there anything we can do about this? The issue in [1] doesn't seem 
to be resolved, yet. Debug logging is not enabled, so I don't have 
more detailed information except the full stack trace from the OSD 
log. Any help is appreciated!


Regards,
Eugen

[1] 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029616.html


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-05 Thread Igor Fedotov


On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote:

but I don't see l_bluestore_fragmentation counter.
(but I have bluestore_fragmentation_micros)

ok, this is the same

   b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros",
 "How fragmented bluestore free space is (free extents / max possible 
number of free extents) * 1000");


Here a graph on last month, with bluestore_fragmentation_micros and latency,

http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png
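
(For anyone who wants to graph the same counter, it can be read straight from
the admin socket - a sketch assuming osd.0:)

    # allocator fragmentation estimate in micro-units (0..1000000)
    ceph daemon osd.0 perf dump | grep bluestore_fragmentation_micros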


hmm, so fragmentation grows eventually and drops on OSD restarts, doesn't 
it? Is it the same for other OSDs?


This proves some issue with the allocator - generally fragmentation 
might grow but it shouldn't reset on restart. Looks like some intervals 
aren't properly merged in run-time.


On the other side I'm not completely sure that latency degradation is 
caused by that - fragmentation growth is relatively small - I don't see 
how this might impact performance that high.


Wondering if you have OSD mempool monitoring (dump_mempools command 
output on admin socket) reports? Do you have any historic data?


If not, may I have the current output and, say, a couple more samples at 
8-12 hour intervals?
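
For reference, the report Igor is asking about can be collected from the admin socket like this (osd.0 is a placeholder; repeating it at the suggested 8-12 hour interval and keeping the timestamped files gives the historic samples):

    ceph daemon osd.0 dump_mempools > mempools.$(date +%Y%m%d-%H%M).json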



Wrt backporting the bitmap allocator to Mimic - we haven't had such plans 
before, but I'll discuss this at the BlueStore meeting shortly.



Thanks,

Igor


- Original Message -
From: "Alexandre Derumier" 
To: "Igor Fedotov" 
Cc: "Stefan Priebe, Profihost AG" , "Mark Nelson" , "Sage Weil" 
, "ceph-users" , "ceph-devel" 
Sent: Monday, 4 February 2019 16:04:38
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

Thanks Igor,


Could you please collect BlueStore performance counters right after OSD
startup and once you get high latency.

Specifically 'l_bluestore_fragmentation' parameter is of interest.

I'm already monitoring with
"ceph daemon osd.x perf dump" (I have 2 months of history with all counters),

but I don't see l_bluestore_fragmentation counter.

(but I have bluestore_fragmentation_micros)



Also if you're able to rebuild the code I can probably make a simple
patch to track latency and some other internal allocator's parameters to
make sure it's degraded and learn more details.

Sorry, it's a critical production cluster, I can't test on it :(
But I have a test cluster; maybe I can put some load on it and try to 
reproduce.




More vigorous fix would be to backport bitmap allocator from Nautilus
and try the difference...

Any plan to backport it to Mimic? (But I can wait for Nautilus.)
The perf results of the new bitmap allocator seem very promising from what I've 
seen in the PR.



- Original Message -
From: "Igor Fedotov" 
To: "Alexandre Derumier" , "Stefan Priebe, Profihost AG" 
, "Mark Nelson" 
Cc: "Sage Weil" , "ceph-users" , 
"ceph-devel" 
Sent: Monday, 4 February 2019 15:51:30
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

Hi Alexandre,

looks like a bug in StupidAllocator.

Could you please collect BlueStore performance counters right after OSD
startup and once you get high latency.

Specifically 'l_bluestore_fragmentation' parameter is of interest.

Also if you're able to rebuild the code I can probably make a simple
patch to track latency and some other internal allocator's parameters to
make sure it's degraded and learn more details.


More vigorous fix would be to backport bitmap allocator from Nautilus
and try the difference...


Thanks,

Igor


On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote:

Hi again,

I spoke too soon, the problem has occurred again, so it's not tcmalloc cache 
size related.


I have noticed something using a simple "perf top":

each time I have this problem (I have seen exactly the same behaviour 4 times),

when latency is bad, perf top gives me:

StupidAllocator::_aligned_len
and
btree::btree_iterator, mempoo
l::pool_allocator<(mempool::pool_index_t)1, std::pair >, 
256> >, std::pair&, std::pair*>::increment_slow()

(around 10-20% of the time for both)


when latency is good, I don't see them at all.


I have used Mark's wallclock profiler, here are the results:

http://odisoweb1.odiso.net/gdbpmp-ok.txt

http://odisoweb1.odiso.net/gdbpmp-bad.txt


here is an extract of the thread with btree::btree_iterator && 
StupidAllocator::_aligned_len


+ 100.00% clone
+ 100.00% start_thread
+ 100.00% ShardedThreadPool::WorkThreadSharded::entry()
+ 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int)
+ 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
+ 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr&, 
ThreadPool::TPHandle&)
| + 70.00% OSD::dequeue_op(boost::intrusive_ptr, 
boost::intrusive_ptr, ThreadPool::TPHandle&)
| + 70.00% PrimaryLogPG::do_request(boost::intrusive_p

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-04 Thread Igor Fedotov

Hi Alexandre,

looks like a bug in StupidAllocator.

Could you please collect BlueStore performance counters right after OSD 
startup and once you get high latency.


Specifically 'l_bluestore_fragmentation' parameter is of interest.

Also if you're able to rebuild the code I can probably make a simple 
patch to track latency and some other internal allocator's parameters to 
make sure it's degraded and learn more details.



More vigorous fix would be to backport bitmap allocator from Nautilus 
and try the difference...



Thanks,

Igor


On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote:

Hi again,

I spoke too soon, the problem has occurred again, so it's not tcmalloc cache 
size related.


I have noticed something using a simple "perf top":

each time I have this problem (I have seen exactly the same behaviour 4 times),

when latency is bad, perf top gives me:

StupidAllocator::_aligned_len
and
btree::btree_iterator, mempoo
l::pool_allocator<(mempool::pool_index_t)1, std::pair >, 
256> >, std::pair&, std::pair*>::increment_slow()

(around 10-20% of the time for both)


when latency is good, I don't see them at all.


I have used Mark's wallclock profiler, here are the results:

http://odisoweb1.odiso.net/gdbpmp-ok.txt

http://odisoweb1.odiso.net/gdbpmp-bad.txt


here is an extract of the thread with btree::btree_iterator && 
StupidAllocator::_aligned_len


+ 100.00% clone
   + 100.00% start_thread
 + 100.00% ShardedThreadPool::WorkThreadSharded::entry()
   + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int)
 + 100.00% OSD::ShardedOpWQ::_process(unsigned int, 
ceph::heartbeat_handle_d*)
   + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr&, 
ThreadPool::TPHandle&)
   | + 70.00% OSD::dequeue_op(boost::intrusive_ptr, 
boost::intrusive_ptr, ThreadPool::TPHandle&)
   |   + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr&, 
ThreadPool::TPHandle&)
   | + 68.00% 
PGBackend::handle_message(boost::intrusive_ptr)
   | | + 68.00% 
ReplicatedBackend::_handle_message(boost::intrusive_ptr)
   | |   + 68.00% 
ReplicatedBackend::do_repop(boost::intrusive_ptr)
   | | + 67.00% non-virtual thunk to 
PrimaryLogPG::queue_transactions(std::vector >&, boost::intrusive_ptr)
   | | | + 67.00% 
BlueStore::queue_transactions(boost::intrusive_ptr&, 
std::vector >&, 
boost::intrusive_ptr, ThreadPool::TPHandle*)
   | | |   + 66.00% 
BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
ObjectStore::Transaction*)
   | | |   | + 66.00% BlueStore::_write(BlueStore::TransContext*, 
boost::intrusive_ptr&, 
boost::intrusive_ptr&, unsigned long, unsigned long, 
ceph::buffer::list&, unsigned int)
   | | |   |   + 66.00% BlueStore::_do_write(BlueStore::TransContext*, 
boost::intrusive_ptr&, 
boost::intrusive_ptr, unsigned long, unsigned long, 
ceph::buffer::list&, unsigned int)
   | | |   | + 65.00% 
BlueStore::_do_alloc_write(BlueStore::TransContext*, 
boost::intrusive_ptr, 
boost::intrusive_ptr, BlueStore::WriteContext*)
   | | |   | | + 64.00% StupidAllocator::allocate(unsigned long, 
unsigned long, unsigned long, long, std::vector >*)
   | | |   | | | + 64.00% 
StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned 
long*, unsigned int*)
   | | |   | | |   + 34.00% btree::btree_iterator, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair >, 256> >, std::pair&, std::pair*>::increment_slow()
   | | |   | | |   + 26.00% StupidAllocator::_aligned_len(interval_set, 
mempool::pool_allocator<(mempool::pool_index_t)1, std::pair 
>, 256> >::iterator, unsigned long)



- Original Message -
From: "Alexandre Derumier" 
To: "Stefan Priebe, Profihost AG" 
Cc: "Sage Weil" , "ceph-users" , 
"ceph-devel" 
Sent: Monday, 4 February 2019 09:38:11
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

Hi,

some news:

I have tried different transparent hugepage values (madvise, never): no 
change

I have tried to increase bluestore_cache_size_ssd to 8G: no change

I have tried to increase TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it 
seems to help, after 24h I'm still around 1.5ms (need to wait some more days to 
be sure).
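
For reference, this tcmalloc setting is normally applied through the OSD process environment file - /etc/default/ceph on Debian/Ubuntu or /etc/sysconfig/ceph on RHEL/SUSE, the exact location being distribution-dependent - e.g.:

    # in /etc/default/ceph (or /etc/sysconfig/ceph)
    TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456

followed by restarting the OSDs, e.g. "systemctl restart ceph-osd.target"; 268435456 is the 256MB value mentioned above.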


Note that this behaviour seems to happen much faster (< 2 days) on my big NVMe 
drives (6TB);
my other clusters use 1.6TB SSDs.

Currently I'm using only 1 OSD per NVMe (I don't have more than 5000 IOPS per 
OSD), but I'll try this week with 2 OSDs per NVMe, to see if it helps.


BTW, has somebody already tested Ceph without tcmalloc, with glibc >= 
2.26 (which also has a thread cache)?


Regards,

Alexandre


- Original Message -
From: "aderumier" 
To: "Stefan Priebe, Profihost AG" 
Cc: "Sage Weil" , "ceph-users" , 
"ceph-devel" 
Sent: Wednesday, 30 January 2019 19:58:15
Subject: Re: 

Re: [ceph-users] bluestore block.db

2019-01-25 Thread Igor Fedotov

Hi Frank,

you might want to use ceph-kvstore-tool, e.g.

ceph-kvstore-tool bluestore-kv <path-to-osd> dump
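
For example (the OSD path is a placeholder; the OSD must be stopped before opening its store, and the dump can be large):

    systemctl stop ceph-osd@6
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-6 list > osd6-keys.txt
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-6 dump > osd6-dump.txt
    systemctl start ceph-osd@6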


Thanks,

Igor

On 1/25/2019 10:49 PM, F Ritchie wrote:

Hi all,

Is there a way to dump the contents of block.db to a text file?

I am not trying to fix a problem, just curious and wanting to poke around.

thx
Frank


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore 32bit max_object_size limit

2019-01-21 Thread Igor Fedotov


On 1/18/2019 6:33 PM, KEVIN MICHAEL HRPCEK wrote:



On 1/18/19 7:26 AM, Igor Fedotov wrote:


Hi Kevin,

On 1/17/2019 10:50 PM, KEVIN MICHAEL HRPCEK wrote:

Hey,

I recall reading about this somewhere but I can't find it in the 
docs or list archive, and confirmation from a dev or someone who 
knows for sure would be nice. What I recall is that bluestore has a 
max 4GB file size limit based on the design of bluestore, not the 
osd_max_object_size setting. The bluestore source seems to suggest 
this by setting OBJECT_MAX_SIZE to the 32-bit max, giving an error 
if osd_max_object_size is > OBJECT_MAX_SIZE, and not writing the 
data if offset+length >= OBJECT_MAX_SIZE. So it seems like the 
in-OSD file size integer can't exceed 32 bits, which is 4GB, like FAT32. 
Am I correct, or maybe I'm reading all this wrong?


You're correct, BlueStore doesn't support objects larger than 
OBJECT_MAX_SIZE (i.e. 4GB).



Thanks for confirming that!





If bluestore has a hard 4GB object limit using radosstriper to break 
up an object would work, but does using an EC pool that breaks up 
the object to shards smaller than OBJECT_MAX_SIZE have the same 
effect as radosstriper to get around a 4GB limit? We use rados 
directly and would like to move to bluestore but we have some large 
objects <= 13G that may need attention if this 4GB limit does exist 
and an ec pool doesn't get around it.
Theoretically object splitting using EC might help. But I'm not sure 
whether one needs to adjust osd_max_object_size greater than 4GB to 
permit 13GB object usage in an EC pool. If it's needed then the 
osd_max_object_size <= OBJECT_MAX_SIZE constraint is violated and 
BlueStore wouldn't start.
In my experience I had to increase osd_max_object_size from the 128M 
default (which it changed to a couple of versions ago) to ~20G to be able to 
write our largest objects with some margin. Do you think there is 
another way to handle osd_max_object_size > OBJECT_MAX_SIZE so that 
bluestore will start, and EC pools or striping can be used to write 
objects that are greater than OBJECT_MAX_SIZE but where each stripe/shard 
ends up smaller than OBJECT_MAX_SIZE?


I'm not very familiar with the logic osd_max_object_size controls at the OSD 
level. But IMO there might be two logically valid options:


1) This is the maximum user (RADOS?) object size. In this case the verification 
at BlueStore is a bit incorrect, as EC might be in the path and hence one 
can still have a 4+ GB object stored. If that's the case then it's enough 
to just remove the corresponding assertion at BlueStore.


2) This is the maximum object size provided to the object store. Then one should 
be able to upload objects larger than this threshold using EC.


I'm going to verify this behavior and come up with corresponding fixes, 
if any, shortly.


Unfortunately in the short term I don't see any workarounds for your case 
other than having a custom build that has the assertion at BlueStore removed.







https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L88
#define OBJECT_MAX_SIZE 0xffffffff // 32 bits

https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L4395
  // sanity check(s)
   auto osd_max_object_size =
 cct->_conf.get_val("osd_max_object_size");
   if (osd_max_object_size >= (size_t)OBJECT_MAX_SIZE) {
 derr << __func__ << " osd_max_object_size >= 0x" << std::hex << 
OBJECT_MAX_SIZE
   << "; BlueStore has hard limit of 0x" << OBJECT_MAX_SIZE << "." <<  std::dec 
<< dendl;
 return -EINVAL;
   }


https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L12331
   if (offset + length >= OBJECT_MAX_SIZE) {
 r = -E2BIG;
   } else {
 _assign_nid(txc, o);
 r = _do_write(txc, c, o, offset, length, bl, fadvise_flags);
 txc->write_onode(o);
   }

Thanks!
Kevin
--
Kevin Hrpcek
NASA SNPP Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Thanks,

Igor



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore 32bit max_object_size limit

2019-01-18 Thread Igor Fedotov

Hi Kevin,

On 1/17/2019 10:50 PM, KEVIN MICHAEL HRPCEK wrote:

Hey,

I recall reading about this somewhere but I can't find it in the docs 
or list archive, and confirmation from a dev or someone who knows for 
sure would be nice. What I recall is that bluestore has a max 4GB file 
size limit based on the design of bluestore, not the 
osd_max_object_size setting. The bluestore source seems to suggest 
this by setting OBJECT_MAX_SIZE to the 32-bit max, giving an error if 
osd_max_object_size is > OBJECT_MAX_SIZE, and not writing the data if 
offset+length >= OBJECT_MAX_SIZE. So it seems like the in-OSD file 
size integer can't exceed 32 bits, which is 4GB, like FAT32. Am I correct, 
or maybe I'm reading all this wrong?


You're correct, BlueStore doesn't support objects larger than 
OBJECT_MAX_SIZE (i.e. 4GB).





If bluestore has a hard 4GB object limit using radosstriper to break 
up an object would work, but does using an EC pool that breaks up the 
object to shards smaller than OBJECT_MAX_SIZE have the same effect as 
radosstriper to get around a 4GB limit? We use rados directly and 
would like to move to bluestore but we have some large objects <= 13G 
that may need attention if this 4GB limit does exist and an ec pool 
doesn't get around it.
Theoretically object splitting using EC might help. But I'm not sure whether 
one needs to adjust osd_max_object_size greater than 4GB to permit 13GB 
object usage in an EC pool. If it's needed then the osd_max_object_size <= 
OBJECT_MAX_SIZE constraint is violated and BlueStore wouldn't start.



https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L88
#define OBJECT_MAX_SIZE 0xffffffff // 32 bits

https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L4395
  // sanity check(s)
   auto osd_max_object_size =
 cct->_conf.get_val("osd_max_object_size");
   if (osd_max_object_size >= (size_t)OBJECT_MAX_SIZE) {
 derr << __func__ << " osd_max_object_size >= 0x" << std::hex << 
OBJECT_MAX_SIZE
   << "; BlueStore has hard limit of 0x" << OBJECT_MAX_SIZE << "." <<  std::dec 
<< dendl;
 return -EINVAL;
   }


https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L12331
   if (offset + length >= OBJECT_MAX_SIZE) {
 r = -E2BIG;
   } else {
 _assign_nid(txc, o);
 r = _do_write(txc, c, o, offset, length, bl, fadvise_flags);
 txc->write_onode(o);
   }

Thanks!
Kevin
--
Kevin Hrpcek
NASA SNPP Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Thanks,

Igor

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] `ceph-bluestore-tool bluefs-bdev-expand` corrupts OSDs

2019-01-11 Thread Igor Fedotov

Hi Hector,

just realized that you're trying to expand the main (and exclusive) device, 
which isn't supported in Mimic.


Here is the bluestore_tool complaint (pretty confusing, and not preventing 
the partial expansion though) while expanding:


expanding dev 1 from 0x1df2eb0 to 0x3a38120
Can't find device path for dev 1


Actually this is covered by the following ticket and its backport 
descendants:


https://tracker.ceph.com/issues/37360

https://tracker.ceph.com/issues/37494

https://tracker.ceph.com/issues/37495


In short - we're planning to support main device expansion for Nautilus+ 
and to introduce better error handling for the case in Mimic and 
Luminous. Nautilus PR has been merged, M & L PRs are pending review at 
the moment:


https://github.com/ceph/ceph/pull/25348

https://github.com/ceph/ceph/pull/25384


Thanks,

Igor


On 1/11/2019 1:18 PM, Hector Martin wrote:

Sorry for the late reply,

Here's what I did this time around. osd.0 and osd.1 should be 
identical, except osd.0 was recreated (that's the first one that 
failed) and I'm trying to expand osd.1 from its original size.


# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-0 | 
grep size

    "size": 4000780910592,
# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1 | 
grep size

    "size": 3957831237632,
# blockdev --getsize64 /var/lib/ceph/osd/ceph-0/block
4000780910592
# blockdev --getsize64 /var/lib/ceph/osd/ceph-1/block
4000780910592

As you can see the osd.1 block device is already resized

# ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-0
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-0/block
1 : size 0x3a38120 : own 0x[1bf1f40~2542a0]
# ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-1
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-1/block
1 : size 0x3a38120 : own 0x[1ba5270~24dc40]

# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-1
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-1/block
start:
1 : size 0x3a38120 : own 0x[1ba5270~24dc40]
expanding dev 1 from 0x1df2eb0 to 0x3a38120
Can't find device path for dev 1

Unfortunately I forgot to run this with debugging enabled.

This seems like it didn't touch the first 8K (label), so unfortunately 
I cannot undo it. I guess this information is stored elsewhere.


I did notice that the size label was not updated, so I updated it 
manually:


# ceph-bluestore-tool set-label-key --dev 
/var/lib/ceph/osd/ceph-1/block --key size --value 4000780910592


# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1 | 
grep size

    "size": 4000780910592,

This is what bluefs-bdev-sizes says:

# ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-1
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-1/block
1 : size 0x3a38120 : own 0x[1ba5270~1e92eb0]

fsck reported "fsck succcess". The log is massive, I can host it 
somewhere if needed.


Starting the OSD fails with:

# ceph-osd --id 1
2019-01-11 18:51:00.745 7f06a72c62c0 -1 Public network was set, but 
cluster network was not set
2019-01-11 18:51:00.745 7f06a72c62c0 -1 Using public network also 
for cluster network
starting osd.1 at - osd_data /var/lib/ceph/osd/ceph-1 
/var/lib/ceph/osd/ceph-1/journal
2019-01-11 18:51:08.902 7f06a72c62c0 -1 
bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace bluefs 
extra 0x[1df2eb0~1c45270]
2019-01-11 18:51:09.301 7f06a72c62c0 -1 osd.1 0 OSD:init: unable to 
mount object store
2019-01-11 18:51:09.301 7f06a72c62c0 -1  ** ERROR: osd init failed: 
(5) Input/output error


That "bluefs extra" line seems to be the issue. From a full log:

2019-01-11 18:56:00.135 7fb74a8272c0 10 
bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace

2019-01-11 18:56:00.135 7fb74a8272c0 10 bluefs get_block_extents bdev 1
2019-01-11 18:56:00.135 7fb74a8272c0 10 
bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace bluefs 
says 0x[1ba5270~1e92eb0]
2019-01-11 18:56:00.135 7fb74a8272c0 10 
bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace super 
says  0x[1ba5270~24dc40]
2019-01-11 18:56:00.135 7fb74a8272c0 -1 
bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace bluefs 
extra 0x[1df2eb0~1c45270]
2019-01-11 18:56:00.135 7fb74a8272c0 10 
bluestore(/var/lib/ceph/osd/ceph-1) _flush_cache


And that is where the -EIO is coming from:
https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L5305 



So I guess there is an inconsistency between some metadata here?

On 27/12/2018 20:46, Igor Fedotov wrote:

Hector,

One more thing to mention - after expansion please run fsck using
ceph-bluestore-tool prior to running osd daemon and collect another log
using CEPH_ARGS variable.


Th

Re: [ceph-users] `ceph-bluestore-tool bluefs-bdev-expand` corrupts OSDs

2018-12-27 Thread Igor Fedotov

Hector,

One more thing to mention - after expansion please run fsck using 
ceph-bluestore-tool prior to running osd daemon and collect another log 
using CEPH_ARGS variable.



Thanks,

Igor

On 12/27/2018 2:41 PM, Igor Fedotov wrote:

Hi Hector,

I've never tried bluefs-bdev-expand over encrypted volumes but it 
works absolutely fine for me in other cases.


So it would be nice to troubleshoot this a bit.

Suggest to do the following:

1) Back up the first 8K of all OSD.1 devices (block, db and wal) using dd. 
This will probably allow you to recover from the failed expansion and 
repeat it multiple times (see the example after this list).


2) Collect current volume sizes with bluefs-bdev-sizes command and 
actual devices sizes using 'lsblk --bytes'.


3) Do lvm volume expansion and then collect dev sizes with 'lsblk 
--bytes' once again


4) Invoke bluefs-bdev-expand for osd.1 with 
CEPH_ARGS="--debug-bluestore 20 --debug-bluefs 20 --log-file 
bluefs-bdev-expand.log"
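
As a rough illustration of step 1 (device paths are placeholders; 8K = two 4K blocks):

    dd if=/var/lib/ceph/osd/ceph-1/block     of=osd1-block.label.bak     bs=4096 count=2
    dd if=/var/lib/ceph/osd/ceph-1/block.db  of=osd1-block.db.label.bak  bs=4096 count=2
    dd if=/var/lib/ceph/osd/ceph-1/block.wal of=osd1-block.wal.label.bak bs=4096 count=2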


Perhaps it makes sense to open a ticket at ceph bug tracker to proceed...


Thanks,

Igor




On 12/27/2018 12:19 PM, Hector Martin wrote:

Hi list,

I'm slightly expanding the underlying LV for two OSDs and figured I 
could use ceph-bluestore-tool to avoid having to re-create them from 
scratch.


I first shut down the OSD, expanded the LV, and then ran:
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-0

I forgot I was using encryption, so the overlying dm-crypt mapping 
stayed the same when I resized the underlying LV. I was surprised by 
the output of ceph-bluestore-tool, which suggested a size change by a 
significant amount (I was changing the LV size only by a few 
percent). I then checked the underlying `block` device and realized 
its size had not changed, so the command should've been a no-op. I 
then tried to restart the OSD, and it failed with an I/O error. I 
ended up re-creating that OSD and letting it recover.


I have another OSD (osd.1) in the original state where I could run 
this test again if needed. Unfortunately I don't have the output of 
the first test any more.


Is `ceph-bluestore-tool bluefs-bdev-expand` supposed to work? I get 
the feeling it gets the size wrong and corrupts OSDs by expanding it 
too much. If this is indeed supposed to work I would be happy to test 
this again with osd.1 if needed and see if I can get it fixed. 
Otherwise I'll just re-create it and move on.


# ceph --version
ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic 
(stable)



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] `ceph-bluestore-tool bluefs-bdev-expand` corrupts OSDs

2018-12-27 Thread Igor Fedotov

Hi Hector,

I've never tried bluefs-bdev-expand over encrypted volumes but it works 
absolutely fine for me in other cases.


So it would be nice to troubleshoot this a bit.

Suggest to do the following:

1) Back up the first 8K of all OSD.1 devices (block, db and wal) using dd. 
This will probably allow you to recover from the failed expansion and repeat 
it multiple times.


2) Collect current volume sizes with bluefs-bdev-sizes command and 
actual devices sizes using 'lsblk --bytes'.


3) Do lvm volume expansion and then collect dev sizes with 'lsblk 
--bytes' once again


4) Invoke bluefs-bdev-expand for osd.1 with CEPH_ARGS="--debug-bluestore 
20 --debug-bluefs 20 --log-file bluefs-bdev-expand.log"


Perhaps it makes sense to open a ticket at ceph bug tracker to proceed...


Thanks,

Igor




On 12/27/2018 12:19 PM, Hector Martin wrote:

Hi list,

I'm slightly expanding the underlying LV for two OSDs and figured I 
could use ceph-bluestore-tool to avoid having to re-create them from 
scratch.


I first shut down the OSD, expanded the LV, and then ran:
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-0

I forgot I was using encryption, so the overlying dm-crypt mapping 
stayed the same when I resized the underlying LV. I was surprised by 
the output of ceph-bluestore-tool, which suggested a size change by a 
significant amount (I was changing the LV size only by a few percent). 
I then checked the underlying `block` device and realized its size had 
not changed, so the command should've been a no-op. I then tried to 
restart the OSD, and it failed with an I/O error. I ended up 
re-creating that OSD and letting it recover.


I have another OSD (osd.1) in the original state where I could run 
this test again if needed. Unfortunately I don't have the output of 
the first test any more.


Is `ceph-bluestore-tool bluefs-bdev-expand` supposed to work? I get 
the feeling it gets the size wrong and corrupts OSDs by expanding it 
too much. If this is indeed supposed to work I would be happy to test 
this again with osd.1 if needed and see if I can get it fixed. 
Otherwise I'll just re-create it and move on.


# ceph --version
ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic 
(stable)



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SLOW SSD's after moving to Bluestore

2018-12-11 Thread Igor Fedotov

Hi Tyler,

I suspect you have BlueStore DB/WAL on these drives as well, don't you?

Then perhaps you have performance issues with f[data]sync requests which 
DB/WAL invoke pretty frequently.


See the following links for details:

https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

The latter link shows pretty poor numbers for M500DC drives.
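
For reference, a rough way to measure sync write performance along the lines of the second link (the fio parameters are an approximation of that test; writing to a raw device destroys data on it, so use a spare device or partition):

    fio --name=sync-write-test --filename=/dev/sdX --direct=1 --sync=1 \
        --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=60 \
        --time_based --group_reporting

Good datacenter SSDs sustain thousands of such 4k sync writes per second; many consumer drives drop to a few hundred or less, which shows up as exactly this kind of latency under BlueStore DB/WAL load.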


Thanks,

Igor


On 12/11/2018 4:58 AM, Tyler Bishop wrote:


Older Crucial/Micron M500/M600
_

*Tyler Bishop*
EST 2007


O:513-299-7108 x1000
M:513-646-5809
http://BeyondHosting.net 





On Mon, Dec 10, 2018 at 8:57 PM Christian Balzer > wrote:


Hello,

On Mon, 10 Dec 2018 20:43:40 -0500 Tyler Bishop wrote:

> I don't think thats my issue here because I don't see any IO to
justify the
> latency.  Unless the IO is minimal and its ceph issuing a bunch
of discards
> to the ssd and its causing it to slow down while doing that.
>

What does atop have to say?

Discards/Trims are usually visible in it, this is during a fstrim of a
RAID1 / :
---
DSK |          sdb  | busy     81% |  read       0 | write  8587 
| MBw/s 2323.4 |  avio 0.47 ms |
DSK |          sda  | busy     70% |  read       2 | write  8587 
| MBw/s 2323.4 |  avio 0.41 ms |
---

The numbers tend to be a lot higher than what the actual interface is
capable of, clearly the SSD is reporting its internal activity.

In any case, it should give a good insight of what is going on
activity
wise.
Also for posterity and curiosity, what kind of SSDs?

Christian

> Log isn't showing anything useful and I have most debugging
disabled.
>
>
>
> On Mon, Dec 10, 2018 at 7:43 PM Mark Nelson mailto:mnel...@redhat.com>> wrote:
>
> > Hi Tyler,
> >
> > I think we had a user a while back that reported they had
background
> > deletion work going on after upgrading their OSDs from
filestore to
> > bluestore due to PGs having been moved around.  Is it possible
that your
> > cluster is doing a bunch of work (deletion or otherwise)
beyond the
> > regular client load?  I don't remember how to check for this
off the top
> > of my head, but it might be something to investigate.  If
that's what it
> > is, we just recently added the ability to throttle background
deletes:
> >
> > https://github.com/ceph/ceph/pull/24749
> >
> >
> > If the logs/admin socket don't tell you anything, you could
also try
> > using our wallclock profiler to see what the OSD is spending
it's time
> > doing:
> >
> > https://github.com/markhpc/gdbpmp/
> >
> >
> > ./gdbpmp -t 1000 -p`pidof ceph-osd` -o foo.gdbpmp
> >
> > ./gdbpmp -i foo.gdbpmp -t 1
> >
> >
> > Mark
> >
> > On 12/10/18 6:09 PM, Tyler Bishop wrote:
> > > Hi,
> > >
> > > I have an SSD only cluster that I recently converted from
filestore to
> > > bluestore and performance has totally tanked. It was fairly
decent
> > > before, only having a little additional latency than
expected.  Now
> > > since converting to bluestore the latency is extremely high,
SECONDS.
> > > I am trying to determine if it an issue with the SSD's or
Bluestore
> > > treating them differently than filestore... potential garbage
> > > collection? 24+ hrs ???
> > >
> > > I am now seeing constant 100% IO utilization on ALL of the
devices and
> > > performance is terrible!
> > >
> > > IOSTAT
> > >
> > > avg-cpu:  %user   %nice %system %iowait %steal   %idle
> > >            1.37    0.00    0.34   18.59 0.00   79.70
> > >
> > > Device:         rrqm/s   wrqm/s     r/s     w/s rkB/s    wkB/s
> > > avgrq-sz avgqu-sz   await r_await w_await svctm  %util
> > > sda               0.00     0.00    0.00 9.50  0.00    64.00
> > > 13.47     0.01    1.16    0.00    1.16  1.11  1.05
> > > sdb               0.00    96.50    4.50   46.50 34.00 11776.00
> > >  463.14   132.68 1174.84  782.67 1212.80 19.61 100.00
> > > dm-0              0.00     0.00    5.50  128.00 44.00  8162.00
> > >  122.94   507.84 1704.93  674.09 1749.23  7.49 100.00
> > >
> > > avg-cpu:  %user   %nice %system %iowait %steal  

Re: [ceph-users] How to recover from corrupted RocksDb

2018-11-29 Thread Igor Fedotov

Yeah, that may be the way.

Preferably disable compaction during this procedure though.

To do that please set

bluestore rocksdb options = "disable_auto_compactions=true"

in [osd] section in ceph.conf
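
For example, in ceph.conf:

    [osd]
    bluestore rocksdb options = "disable_auto_compactions=true"

and after restarting the OSD the effective value can be checked via the admin socket (the OSD id is a placeholder):

    ceph daemon osd.6 config get bluestore_rocksdb_options

Remember to drop the override again once the export is done, since compactions are needed for normal operation.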


Thanks,

Igor


On 11/29/2018 4:54 PM, Paul Emmerich wrote:

does objectstore-tool still work? If yes:

export all the PGs on the OSD with objectstore-tool and import them
into a new OSD.

Paul
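
For reference, a sketch of that export/import cycle (OSD ids, the PG id and the file path are placeholders; both OSDs must be stopped while ceph-objectstore-tool runs):

    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --op list-pgs
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --pgid 3.26 --op export --file /mnt/backup/pg-3.26.export
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-27 \
        --op import --file /mnt/backup/pg-3.26.export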




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to recover from corrupted RocksDb

2018-11-29 Thread Igor Fedotov
'ceph-bluestore-tool repair' checks and repairs BlueStore metadata 
consistency, not RocksDB's.


It looks like you're observing a CRC mismatch during DB compaction, which 
is probably not triggered during the repair.


The good point is that it looks like BlueStore's metadata is consistent and 
hence data recovery is still potentially possible - I can't build up a 
working procedure using existing tools though...


Let me check if one can disable DB compaction using rocksdb settings.


On 11/29/2018 1:42 PM, Mario Giammarco wrote:
The only strange thing is that ceph-bluestore-tool says that the repair 
was done, no errors are found and all is ok.

I ask myself what that tool really does.
Mario

On Thu 29 Nov 2018 at 11:03, Wido den Hollander 
<w...@42on.com> wrote:




On 11/29/18 10:45 AM, Mario Giammarco wrote:
> I have only that copy, it is a showroom system but someone put a
> production vm on it.
>

I have a feeling this won't be easy to fix or actually fixable:

- Compaction error: Corruption: block checksum mismatch
- submit_transaction error: Corruption: block checksum mismatch

RocksDB got corrupted on that OSD and won't be able to start now.

I wouldn't know where to start with this OSD.

Wido

> On Thu 29 Nov 2018 at 10:43, Wido den Hollander
> <w...@42on.com> wrote:
>
>
>
>     On 11/29/18 10:28 AM, Mario Giammarco wrote:
>     > Hello,
>     > I have a ceph installation in a proxmox cluster.
>     > Due to a temporary hardware glitch now I get this error on
osd startup
>     >
>     >     -6> 2018-11-26 18:02:33.179327 7fa1d784be00  0 osd.0 1033
>     crush map
>     >     has features 1009089991638532096, adjusting msgr
requires for
>     osds
>     >    -5> 2018-11-26 18:02:34.143084 7fa1c33f9700  3 rocksdb:
>     >
>
  [/build/ceph-12.2.9/src/rocksdb/db/db_impl_compaction_flush.cc:1591]
>     >     Compaction error: Corruption: block checksum mismatch
>     >     -4> 2018-11-26 18:02:34.143123 7fa1c33f9700 4 rocksdb:
>     (Original Log
>     >     Time 2018/11/26-18:02:34.143021)
>     >  [/build/ceph-12.2.9/src/rocksdb/db/compaction_job.cc:621]
>     [default]
>     >     compacted to: base level 1 max bytes
base268435456 files[17$
>     >
>     >     -3> 2018-11-26 18:02:34.143126 7fa1c33f9700 4 rocksdb:
>     (Original Log
>     >     Time 2018/11/26-18:02:34.143068) EVENT_LOG_v1
{"time_micros":
>     >     1543251754143044, "job": 3, "event":
"compaction_finished",
>     >     "compaction_time_micros": 1997048, "out$
>     >    -2> 2018-11-26 18:02:34.143152 7fa1c33f9700  2 rocksdb:
>     >
>
  [/build/ceph-12.2.9/src/rocksdb/db/db_impl_compaction_flush.cc:1275]
>     >     Waiting after background compaction error: Corruption:
block
>     >     checksum mismatch, Accumulated background err$
>     >    -1> 2018-11-26 18:02:34.674171 7fa1c4bfc700 -1 rocksdb:
>     >     submit_transaction error: Corruption: block checksum
mismatch
>     code =
>     >     2 Rocksdb transaction:
>     >     Delete( Prefix = O key =
>     >
>
  
0x7f7ffb6400217363'rub_3.26!='0xfffe'o')

>     >     Put( Prefix = S key = 'nid_max' Value size = 8)
>     >     Put( Prefix = S key = 'blobid_max' Value size = 8)
>     >     0> 2018-11-26 18:02:34.675641 7fa1c4bfc700 -1
>     >  /build/ceph-12.2.9/src/os/bluestore/BlueStore.cc: In function
>     'void
>     >     BlueStore::_kv_sync_thread()' thread 7fa1c4bfc700 time
2018-11-26
>     >     18:02:34.674193
>     >  /build/ceph-12.2.9/src/os/bluestore/BlueStore.cc: 8717:
FAILED
>     >     assert(r == 0)
>     >
>     >     ceph version 12.2.9
(9e300932ef8a8916fb3fda78c58691a6ab0f4217)
>     >     luminous (stable)
>     >     1: (ceph::__ceph_assert_fail(char const*, char const*,
int, char
>     >     const*)+0x102) [0x55ec83876092]
>     >     2: (BlueStore::_kv_sync_thread()+0x24b5) [0x55ec836ffb55]
>     >     3: (BlueStore::KVSyncThread::entry()+0xd)
[0x55ec8374040d]
>     >     4: (()+0x7494) [0x7fa1d5027494]
>     >     5: (clone()+0x3f) [0x7fa1d4098acf]
>     >
>     >
>     > I have tried to recover it using ceph-bluestore-tool fsck
and repair
>     > DEEP but it says it is ALL ok.
>     > I see that rocksd ldb tool needs .db files to recover and
not a
>     > partition so I cannot use it.
>     > I do not understand why I cannot start osd if
ceph-bluestore-tools
>     says
>     > me I have lost no data.
>     > Can you help me?
>
>     Why would you try to recover a individual OSD? If all your
Placement
> 

Re: [ceph-users] Raw space usage in Ceph with Bluestore

2018-11-28 Thread Igor Fedotov

Hi Jody,

yes, this is a known issue.

Indeed, currently 'ceph df detail' reports raw space usage in the GLOBAL 
section and 'logical' usage in the POOLS one, while the logical numbers have some flaws.


There is a pending PR targeted to Nautilus to fix that:

https://github.com/ceph/ceph/pull/19454

If you want to do an analysis at exactly the per-pool level, this PR is the 
only means AFAIK.



If per-cluster stats are fine then you can also inspect corresponding 
OSD performance counters and sum over all OSDs to get per-cluster info.


This is the most precise but quite inconvenient method for low-level 
per-osd space analysis.


 "bluestore": {
...

   "bluestore_allocated": 655360, # space allocated at BlueStore 
for the specific OSD
    "bluestore_stored": 34768,  # amount of data stored at 
BlueStore for the specific OSD

...

Please note that aggregate numbers built from these parameters include 
all the replication/EC overhead. And the bluestore_stored vs. 
bluestore_allocated difference is due to allocation overhead and/or 
applied compression.
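
For reference, a small sketch of summing these counters over the OSDs running on one host (the socket glob and the jq dependency are assumptions about the deployment):

    total=0
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        alloc=$(ceph daemon "$sock" perf dump | jq '.bluestore.bluestore_allocated')
        total=$((total + alloc))
    done
    echo "bluestore_allocated on this host: $total bytes"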



Thanks,

Igor


On 11/29/2018 12:27 AM, Glider, Jody wrote:


Hello,

I’m trying to find a way to determine real/physical/raw storage 
capacity usage when storing a similar set of objects in different 
pools, for example a 3-way replicated pool vs. a 4+2 erasure coded 
pool, and in particular how this ratio changes from small (where 
Bluestore block size matters more) to large object sizes.


I find that /ceph df detail/ and /rados df/ don’t report on really-raw 
storage, I guess because they’re perceiving ‘raw’ storage from their 
perspective only. If I write a set of objects to each pool, rados df 
shows the space used as the summation of the logical size of the 
objects, while ceph df detail shows the raw used storage as the object 
size * the redundancy factor (e.g. 3 for 3-way replication and 1.5 for 
4+2 erasure code).


Any suggestions?

Jody Glider, Principal Storage Architect

Cloud Architecture and Engineering, SAP Labs LLC

3412 Hillview Ave (PAL 02 23.357), Palo Alto, CA 94304

E j.gli...@sap.com , T   +1 650-320-3306, M   
+1 650-441-0241




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RocksDB and WAL migration to new block device

2018-11-22 Thread Igor Fedotov

Hi Florian,


On 11/21/2018 7:01 PM, Florian Engelmann wrote:

Hi Igor,

Sad to say, but I failed to build the tool. I tried to build the whole 
project as documented here:


http://docs.ceph.com/docs/mimic/install/build-ceph/

But as my workstation is running Ubuntu the binary fails on SLES:

./ceph-bluestore-tool --help
./ceph-bluestore-tool: symbol lookup error: ./ceph-bluestore-tool: 
undefined symbol: _ZNK7leveldb6Status8ToStringB5cxx11Ev


I did copy all libraries to ~/lib and exported LD_LIBRARY_PATH but it 
did not solve the problem.


Is there any simple method to just build the bluestore-tool standalone 
and static?



Unfortunately I don't know such a method.

Maybe try hex editing instead?


All the best,
Florian


Am 11/21/18 um 9:34 AM schrieb Igor Fedotov:
Actually  (given that your devices are already expanded) you don't 
need to expand them once again - one can just update size labels with 
my new PR.


For new migrations you can use updated bluefs expand command which 
sets size label automatically though.



Thanks,
Igor
On 11/21/2018 11:11 AM, Florian Engelmann wrote:
Great support, Igor! Both thumbs up! We will try to build the tool 
today and expand those bluefs devices once again.



Am 11/20/18 um 6:54 PM schrieb Igor Fedotov:

FYI: https://github.com/ceph/ceph/pull/25187


On 11/20/2018 8:13 PM, Igor Fedotov wrote:


On 11/20/2018 7:05 PM, Florian Engelmann wrote:

Am 11/20/18 um 4:59 PM schrieb Igor Fedotov:



On 11/20/2018 6:42 PM, Florian Engelmann wrote:

Hi Igor,



what's your Ceph version?


12.2.8 (SES 5.5 - patched to the latest version)



Can you also check the output for

ceph-bluestore-tool show-label -p 


ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-0/
infering bluefs devices from bluestore path
{
    "/var/lib/ceph/osd/ceph-0//block": {
    "osd_uuid": "1e5b3908-20b1-41e4-b6eb-f5636d20450b",
    "size": 8001457295360,
    "btime": "2018-06-29 23:43:12.088842",
    "description": "main",
    "bluefs": "1",
    "ceph_fsid": "a146-6561-307e-b032-c5cee2ee520c",
    "kv_backend": "rocksdb",
    "magic": "ceph osd volume v026",
    "mkfs_done": "yes",
    "ready": "ready",
    "whoami": "0"
    },
    "/var/lib/ceph/osd/ceph-0//block.wal": {
    "osd_uuid": "1e5b3908-20b1-41e4-b6eb-f5636d20450b",
    "size": 524288000,
    "btime": "2018-06-29 23:43:12.098690",
    "description": "bluefs wal"
    },
    "/var/lib/ceph/osd/ceph-0//block.db": {
    "osd_uuid": "1e5b3908-20b1-41e4-b6eb-f5636d20450b",
    "size": 524288000,
    "btime": "2018-06-29 23:43:12.098023",
    "description": "bluefs db"
    }
}





It should report 'size' labels for every volume, please check 
they contain new values.




That's exactly the problem: neither "ceph-bluestore-tool 
show-label" nor "ceph daemon osd.0 perf dump|jq '.bluefs'" 
recognized the new sizes. But we are 100% sure the new devices 
are used as we already deleted the old ones...


We tried to delete the "key" "size" to add one with the new 
value but:


ceph-bluestore-tool rm-label-key --dev 
/var/lib/ceph/osd/ceph-0/block.db -k size

key 'size' not present

even if:

ceph-bluestore-tool show-label --dev 
/var/lib/ceph/osd/ceph-0/block.db

{
    "/var/lib/ceph/osd/ceph-0/block.db": {
    "osd_uuid": "1e5b3908-20b1-41e4-b6eb-f5636d20450b",
    "size": 524288000,
    "btime": "2018-06-29 23:43:12.098023",
    "description": "bluefs db"
    }
}

So it looks like the key "size" is "read-only"?


There was a bug in updating specific keys, see
https://github.com/ceph/ceph/pull/24352

This PR also eliminates the need to set sizes manually on 
bdev-expand.


I thought it had been backported to Luminous but it looks like 
it wasn't.

Will submit a PR shortly.




Thank you so much Igor! So we have to decide how to proceed. 
Maybe you could help us here as well.


Option A: Wait for this fix to be available. -> could last weeks 
or even months
if you can build a custom version of ceph_bluestore_tool then this 
is a short path. I'll submit a patch today or tomorrow which you 
need to integrate into your private build.

Then you need to upgrade just the tool and apply new sizes.



Option B: Recreate OSDs "one-by-one". -> will take a very long 
time as well

No need for that IMO.


Option C: Is there some "lowlevel" command allowing us to fix 
those sizes?
Well hex editor might help h

Re: [ceph-users] RocksDB and WAL migration to new block device

2018-11-21 Thread Igor Fedotov
Actually  (given that your devices are already expanded) you don't need 
to expand them once again - one can just update size labels with my new PR.


For new migrations you can use updated bluefs expand command which sets 
size label automatically though.



Thanks,
Igor
On 11/21/2018 11:11 AM, Florian Engelmann wrote:
Great support, Igor! Both thumbs up! We will try to build the tool 
today and expand those bluefs devices once again.



Am 11/20/18 um 6:54 PM schrieb Igor Fedotov:

FYI: https://github.com/ceph/ceph/pull/25187


On 11/20/2018 8:13 PM, Igor Fedotov wrote:


On 11/20/2018 7:05 PM, Florian Engelmann wrote:

Am 11/20/18 um 4:59 PM schrieb Igor Fedotov:



On 11/20/2018 6:42 PM, Florian Engelmann wrote:

Hi Igor,



what's your Ceph version?


12.2.8 (SES 5.5 - patched to the latest version)



Can you also check the output for

ceph-bluestore-tool show-label -p 


ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-0/
infering bluefs devices from bluestore path
{
    "/var/lib/ceph/osd/ceph-0//block": {
    "osd_uuid": "1e5b3908-20b1-41e4-b6eb-f5636d20450b",
    "size": 8001457295360,
    "btime": "2018-06-29 23:43:12.088842",
    "description": "main",
    "bluefs": "1",
    "ceph_fsid": "a146-6561-307e-b032-c5cee2ee520c",
    "kv_backend": "rocksdb",
    "magic": "ceph osd volume v026",
    "mkfs_done": "yes",
    "ready": "ready",
    "whoami": "0"
    },
    "/var/lib/ceph/osd/ceph-0//block.wal": {
    "osd_uuid": "1e5b3908-20b1-41e4-b6eb-f5636d20450b",
    "size": 524288000,
    "btime": "2018-06-29 23:43:12.098690",
    "description": "bluefs wal"
    },
    "/var/lib/ceph/osd/ceph-0//block.db": {
    "osd_uuid": "1e5b3908-20b1-41e4-b6eb-f5636d20450b",
    "size": 524288000,
    "btime": "2018-06-29 23:43:12.098023",
    "description": "bluefs db"
    }
}





It should report 'size' labels for every volume, please check 
they contain new values.




That's exactly the problem: neither "ceph-bluestore-tool 
show-label" nor "ceph daemon osd.0 perf dump|jq '.bluefs'" 
recognized the new sizes. But we are 100% sure the new devices are 
used as we already deleted the old ones...


We tried to delete the "key" "size" to add one with the new value 
but:


ceph-bluestore-tool rm-label-key --dev 
/var/lib/ceph/osd/ceph-0/block.db -k size

key 'size' not present

even if:

ceph-bluestore-tool show-label --dev 
/var/lib/ceph/osd/ceph-0/block.db

{
    "/var/lib/ceph/osd/ceph-0/block.db": {
    "osd_uuid": "1e5b3908-20b1-41e4-b6eb-f5636d20450b",
    "size": 524288000,
    "btime": "2018-06-29 23:43:12.098023",
    "description": "bluefs db"
    }
}

So it looks like the key "size" is "read-only"?


There was a bug in updating specific keys, see
https://github.com/ceph/ceph/pull/24352

This PR also eliminates the need to set sizes manually on 
bdev-expand.


I thought it had been backported to Luminous but it looks like it 
wasn't.

Will submit a PR shortly.




Thank you so much Igor! So we have to decide how to proceed. Maybe 
you could help us here as well.


Option A: Wait for this fix to be available. -> could last weeks or 
even months
if you can build a custom version of ceph_bluestore_tool then this 
is a short path. I'll submit a patch today or tomorrow which you 
need to integrate into your private build.

Then you need to upgrade just the tool and apply new sizes.



Option B: Recreate OSDs "one-by-one". -> will take a very long time 
as well

No need for that IMO.


Option C: Is there some "lowlevel" command allowing us to fix those 
sizes?
Well, a hex editor might help here as well. What you need is just to 
update the 64-bit size value in the block.db and block.wal files. In my lab I 
can find it at offset 0x52. Most probably this is a fixed location, 
but it's better to check beforehand - the existing value should 
correspond to the one reported with show-label. Or I can do 
that for you - please send the first 4K chunks to me along with the 
corresponding label report.
Then update with the new values - the field has to contain exactly the 
same size as your new partition.
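
As a rough illustration of inspecting that field before changing anything (the device path is a placeholder; the value is stored little-endian, and as noted above the offset should be cross-checked against the show-label output first):

    xxd -s 0x52 -l 8 /var/lib/ceph/osd/ceph-0/block.db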










Thanks,

Igor


On 11/20/2018 5:29 PM, Florian Engelmann wrote:

Hi,

today we migrated all of our rocksdb and wal devices to new 
ones. The new ones are much bigger (500MB for wal/db -> 60GB db 
and 2G WAL) and LVM based.


We migrated like:

    export OSD=x


Re: [ceph-users] RocksDB and WAL migration to new block device

2018-11-20 Thread Igor Fedotov

FYI: https://github.com/ceph/ceph/pull/25187


On 11/20/2018 8:13 PM, Igor Fedotov wrote:


On 11/20/2018 7:05 PM, Florian Engelmann wrote:

Am 11/20/18 um 4:59 PM schrieb Igor Fedotov:



On 11/20/2018 6:42 PM, Florian Engelmann wrote:

Hi Igor,



what's your Ceph version?


12.2.8 (SES 5.5 - patched to the latest version)



Can you also check the output for

ceph-bluestore-tool show-label -p 


ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-0/
infering bluefs devices from bluestore path
{
    "/var/lib/ceph/osd/ceph-0//block": {
    "osd_uuid": "1e5b3908-20b1-41e4-b6eb-f5636d20450b",
    "size": 8001457295360,
    "btime": "2018-06-29 23:43:12.088842",
    "description": "main",
    "bluefs": "1",
    "ceph_fsid": "a146-6561-307e-b032-c5cee2ee520c",
    "kv_backend": "rocksdb",
    "magic": "ceph osd volume v026",
    "mkfs_done": "yes",
    "ready": "ready",
    "whoami": "0"
    },
    "/var/lib/ceph/osd/ceph-0//block.wal": {
    "osd_uuid": "1e5b3908-20b1-41e4-b6eb-f5636d20450b",
    "size": 524288000,
    "btime": "2018-06-29 23:43:12.098690",
    "description": "bluefs wal"
    },
    "/var/lib/ceph/osd/ceph-0//block.db": {
    "osd_uuid": "1e5b3908-20b1-41e4-b6eb-f5636d20450b",
    "size": 524288000,
    "btime": "2018-06-29 23:43:12.098023",
    "description": "bluefs db"
    }
}





It should report 'size' labels for every volume, please check they 
contain new values.




That's exactly the problem: neither "ceph-bluestore-tool 
show-label" nor "ceph daemon osd.0 perf dump|jq '.bluefs'" 
recognized the new sizes. But we are 100% sure the new devices are 
used as we already deleted the old ones...


We tried to delete the "key" "size" to add one with the new value but:

ceph-bluestore-tool rm-label-key --dev 
/var/lib/ceph/osd/ceph-0/block.db -k size

key 'size' not present

even if:

ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-0/block.db
{
    "/var/lib/ceph/osd/ceph-0/block.db": {
    "osd_uuid": "1e5b3908-20b1-41e4-b6eb-f5636d20450b",
    "size": 524288000,
    "btime": "2018-06-29 23:43:12.098023",
    "description": "bluefs db"
    }
}

So it looks like the key "size" is "read-only"?


There was a bug in updating specific keys, see
https://github.com/ceph/ceph/pull/24352

This PR also eliminates the need to set sizes manually on bdev-expand.

I thought it had been backported to Luminous but it looks like it 
wasn't.

Will submit a PR shortly.




Thank you so much Igor! So we have to decide how to proceed. Maybe 
you could help us here as well.


Option A: Wait for this fix to be available. -> could last weeks or 
even months
if you can build a custom version of ceph_bluestore_tool then this is 
a short path. I'll submit a patch today or tomorrow which you need to 
integrate into your private build.

Then you need to upgrade just the tool and apply new sizes.



Option B: Recreate OSDs "one-by-one". -> will take a very long time 
as well

No need for that IMO.


Option C: Is there some "lowlevel" command allowing us to fix those 
sizes?
Well hex editor might help here as well. What you need is just to 
update 64bit size value in block.db and block.wal files. In my lab I 
can find it at offset 0x52. Most probably this is the fixed location 
but it's better to check beforehand - existing value should contain 
value corresponding to the one reported with show-label. Or I can do 
that for you - please send the  first 4K chunks to me along with 
corresponding label report.
Then update with new values - the field has to contain exactly the 
same size as your new partition.










Thanks,

Igor


On 11/20/2018 5:29 PM, Florian Engelmann wrote:

Hi,

today we migrated all of our rocksdb and wal devices to new ones. 
The new ones are much bigger (500MB for wal/db -> 60GB db and 2G 
WAL) and LVM based.


We migrated like:

    export OSD=x

    systemctl stop ceph-osd@$OSD

    lvcreate -n db-osd$OSD -L60g data || exit 1
    lvcreate -n wal-osd$OSD -L2g data || exit 1

    dd if=/var/lib/ceph/osd/ceph-$OSD/block.wal 
of=/dev/data/wal-osd$OSD bs=1M || exit 1
    dd if=/var/lib/ceph/osd/ceph-$OSD/block.db 
of=/dev/data/db-osd$OSD bs=1M  || exit 1


    rm -v /var/lib/ceph/osd/ceph-$OSD/block.db || exit 1
    rm -v /var/lib/ceph/osd/ceph-$OSD/block.wal || exit 1
    ln -vs /dev/data/db-osd$OSD 
/var/lib/ceph/osd/ceph-$OSD/block.db 
