Re: [ceph-users] Different disk sizes after Luminous upgrade 12.2.2 --> 12.2.5

2018-05-25 Thread Igor Fedotov
Hi Eugen, This difference was introduced by the following PR: https://github.com/ceph/ceph/pull/20487 (commit os/bluestore: do not account DB volume space in total one reported by statfs method). The rationale is to show block device capacity as total only. And don't add DB space to it.

Re: [ceph-users] issues with ceph nautilus version

2018-06-20 Thread Igor Fedotov
Hi Raju, This is a bug in new BlueStore's bitmap allocator. This PR will most probably fix that: https://github.com/ceph/ceph/pull/22610 Also you may try to switch bluestore and bluefs allocators (bluestore_allocator and bluefs_allocator parameters respectively) to stupid and restart

Re: [ceph-users] Ceph Luminous RocksDB vs WalDB?

2018-06-27 Thread Igor Fedotov
Hi Pardhiv, there is no WalDB in Ceph. It's WAL (Write Ahead Log) that is a way to ensure write safety in RocksDB. In other words - that's just a RocksDB subsystem which can use separate volume though. In general, for BlueStore/BlueFS one can either allocate separate volumes for WAL and DB

Re: [ceph-users] Recreating a purged OSD fails

2018-06-27 Thread Igor Fedotov
Looks like a known issue tracked by http://tracker.ceph.com/issues/24423 http://tracker.ceph.com/issues/24599 Regards, Igor On 6/27/2018 9:40 AM, Steffen Winther Sørensen wrote: List, Had a failed disk behind an OSD in a Mimic Cluster 13.2.0, so I tried following the doc on removal of

Re: [ceph-users] Monitoring bluestore compression ratio

2018-06-27 Thread Igor Fedotov
Hi David, First of all 'bluestore_extent_compress' is unrelated to data compression - that's amount of merged onode's extent map entries. When BlueStore detects neighboring extents in an onode - it might merge them into a single map entry, e.g. write 0x4000~1000 ... ... write

Re: [ceph-users] Monitoring bluestore compression ratio

2018-06-27 Thread Igor Fedotov
And yes - first 3 parameters from this list are the right and the only way to inspect compression effectiveness so far. Corresponding updates to show that with "ceph df" are on the way and are targeted for Nautilus. Thanks, Igor On 6/26/2018 4:53 PM, David Turner wrote: ceph daemon

Re: [ceph-users] Bluestore on HDD+SSD sync write latency experiences

2018-05-02 Thread Igor Fedotov
Hi Nick, On 5/1/2018 11:50 PM, Nick Fisk wrote: Hi all, Slowly getting round to migrating clusters to Bluestore but I am interested in how people are handling the potential change in write latency coming from Filestore? Or maybe nobody is really seeing much difference? As we all know, in

Re: [ceph-users] Bluestore with so many small files

2018-02-13 Thread Igor Fedotov
Hi Behnam, On 2/12/2018 4:06 PM, Behnam Loghmani wrote: Hi there, I am using ceph Luminous 12.2.2 with: 3 osds (each osd is 100G) - no WAL/DB separation. 3 mons 1 rgw cluster size 3 I stored lots of thumbnails with very small size on ceph with radosgw. Actual size of files is something

Re: [ceph-users] ceph df: Raw used vs. used vs. actual bytes in cephfs

2018-02-20 Thread Igor Fedotov
Another space "leak" might be due to BlueStore misbehavior that takes DB partition(s) space into account when calculating total store size. And all this space is immediately marked as used even for an empty store. So if you have 3 OSD with 10 Gb DB device each you unconditionally get 30 Gb used
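
The arithmetic in this example can be sketched as follows. This is only an illustration of the accounting described in the message, assuming every OSD has a DB device of the same size; the function name is hypothetical and not part of Ceph.

```python
def db_space_reported_as_used(num_osds: int, db_size_gb: float) -> float:
    """Space shown as used on an empty store when DB volumes are counted as used.

    Hypothetical helper: it only multiplies the per-OSD DB size by the OSD count.
    """
    return num_osds * db_size_gb


# 3 OSDs with a 10 Gb DB device each, as in the message
print(db_space_reported_as_used(3, 10))
```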

Re: [ceph-users] bluestore min alloc size vs. wasted space

2018-02-20 Thread Igor Fedotov
On 2/20/2018 11:57 AM, Flemming Frandsen wrote: I have set up a little ceph installation and added about 80k files of various sizes, then I added 1M files of 1 byte each totalling 1 MB, to see what kind of overhead is incurred per object. The overhead for adding 1M objects seems to be

Re: [ceph-users] Bluestore: inaccurate disk usage statistics problem?

2018-01-04 Thread Igor Fedotov
Additional issue with the disk usage statistics I've just realized is that BlueStore's statfs call reports total disk space as   block device total space + DB device total space while available space is measured as   block device's free space + bluefs free space at block device -

Re: [ceph-users] Bluestore: inaccurate disk usage statistics problem?

2018-01-04 Thread Igor Fedotov
On 1/4/2018 5:27 PM, Sage Weil wrote: On Thu, 4 Jan 2018, Igor Fedotov wrote: Additional issue with the disk usage statistics I've just realized is that BlueStore's statfs call reports total disk space as   block device total space + DB device total space while available space is measured

Re: [ceph-users] Bluestore: inaccurate disk usage statistics problem?

2018-01-04 Thread Igor Fedotov
On 1/4/2018 5:52 PM, Sage Weil wrote: On Thu, 4 Jan 2018, Igor Fedotov wrote: On 1/4/2018 5:27 PM, Sage Weil wrote: On Thu, 4 Jan 2018, Igor Fedotov wrote: Additional issue with the disk usage statistics I've just realized is that BlueStore's statfs call reports total disk space as   block

Re: [ceph-users] mimic/bluestore cluster can't allocate space for bluefs

2018-08-14 Thread Igor Fedotov
Hi Jakub, for the crashing OSD could you please set debug_bluestore=10 bluestore_bluefs_balance_failure_dump_interval=1 and collect more logs. This will hopefully provide more insight on why additional space isn't allocated for bluefs. Thanks, Igor On 8/14/2018 12:41 PM, Jakub

Re: [ceph-users] Bluestore : how to check where the WAL is stored ?

2018-08-16 Thread Igor Fedotov
Hi Herve, actually absence of block.wal symlink is a good enough symptom that DB and WAL are merged. But you can also inspect OSD startup log or check bluefs perf counters after some load - corresponding WAL counters (total/used) should be zero. Thanks, Igor On 8/16/2018 4:55 PM, Hervé
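
A minimal sketch of the counter check mentioned here, assuming the bluefs section of the perf dump has already been loaded into a dict. The message only says "WAL counters (total/used)"; the key names wal_total_bytes and wal_used_bytes are an assumption, and the helper name is hypothetical.

```python
def wal_looks_merged_with_db(bluefs_counters: dict) -> bool:
    """Return True when both WAL counters are zero (or absent).

    The key names below are assumed, not taken from the message.
    """
    total = bluefs_counters.get("wal_total_bytes", 0)
    used = bluefs_counters.get("wal_used_bytes", 0)
    return total == 0 and used == 0


# Illustrative input only, not real cluster output
print(wal_looks_merged_with_db({"wal_total_bytes": 0, "wal_used_bytes": 0}))
```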

Re: [ceph-users] Bluestore : how to check where the WAL is stored ?

2018-08-16 Thread Igor Fedotov
parameter. But I suppose this is the default settings and I've never seen anybody tuning it. And bluestore_block_wal_size is just ignored in your case. Thanks, Igor Regards, Hervé Le 16/08/2018 à 16:05, Igor Fedotov a écrit : Hi Herve actually absence of block.wal symlink is good enough symptom

Re: [ceph-users] Cephfs slow 6MB/s and rados bench sort of ok.

2018-08-28 Thread Igor Fedotov
Hi Marc, In general dd isn't the best choice for benchmarking. In your case there are at least 3 differences from rados bench: 1) If I haven't missed something then you're comparing reads vs. writes 2) Block size is different (512 bytes for dd vs. 4M for rados bench) 3) Just a single dd
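
To put the block size difference in numbers, here is a small sketch. It assumes "4M" means 4 MiB and simply counts how many 512-byte requests are needed to move the same amount of data as one rados bench request; the helper name is hypothetical.

```python
def requests_per_large_block(small_block: int, large_block: int) -> int:
    """Number of small-block requests needed to transfer one large block."""
    return large_block // small_block


DD_BLOCK = 512                      # bytes, block size used by dd in this thread
RADOS_BENCH_BLOCK = 4 * 1024 * 1024  # assumed: 4M read as 4 MiB

print(requests_per_large_block(DD_BLOCK, RADOS_BENCH_BLOCK))
```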

Re: [ceph-users] BlueStore options in ceph.conf not being used

2018-08-22 Thread Igor Fedotov
Hi Robert, bluestore_allocated just tracks how much space was allocated from slow (aka block) device to keep actual user data. It has nothing to do with DB and/or WAL. There are counters in bluefs section which track corresponding DB/WAL usage. Thanks, Igor On 8/22/2018 8:34 PM, Robert

Re: [ceph-users] resize wal/db

2018-07-17 Thread Igor Fedotov
rification: we did NOT resize the block.db in the described procedure! We used the exact same block size for the new lvm based block.db as it was before. This is also mentioned in the article. Regards, Eugen Zitat von Igor Fedotov : Hi Zhang, There is no way to resize DB while OSD

Re: [ceph-users] ceph bluestore data cache on osd

2018-07-23 Thread Igor Fedotov
Firstly I'd suggest to inspect bluestore performance counters before and after adjusting cache parameters (and after running the same test suite). Namely: "bluestore_buffer_bytes" "bluestore_buffer_hit_bytes" "bluestore_buffer_miss_bytes" Is hit ratio (bluestore_buffer_hit_bytes) much
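
One possible way to turn these counters into a single number is sketched below. It assumes the hit ratio is defined as hit bytes divided by hit plus miss bytes, which the truncated message does not spell out; the helper name and the sample values are hypothetical.

```python
def buffer_hit_ratio(hit_bytes: int, miss_bytes: int) -> float:
    """Fraction of buffered read bytes served from cache.

    Assumed definition: hit / (hit + miss). Returns 0.0 when there was no traffic.
    """
    total = hit_bytes + miss_bytes
    if total == 0:
        return 0.0
    return hit_bytes / total


# Made-up values standing in for bluestore_buffer_hit_bytes / bluestore_buffer_miss_bytes
print(buffer_hit_ratio(75, 25))
```

Comparing the value before and after a cache parameter change, on the same test suite, gives a rough indication of whether the change helped.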

Re: [ceph-users] resize wal/db

2018-07-16 Thread Igor Fedotov
Hi Zhang, There is no way to resize DB while OSD is running. There is a bit shorter "unofficial" but risky way than redeploying OSD though. But you'll need to tag specific OSD out for a while in any case. You will also need either additional free partition(s) or initial deployment had to be

Re: [ceph-users] resize wal/db

2018-07-17 Thread Igor Fedotov
On 7/17/2018 5:02 PM, Nicolas Huillard wrote: Le mardi 17 juillet 2018 à 16:20 +0300, Igor Fedotov a écrit : Right, but procedure described in the blog can be pretty easily adjusted to do a resize. Sure, but if I remember correctly, Ceph itself cannot use the increased size: you'll end up

Re: [ceph-users] v12.2.8 Luminous released

2018-09-06 Thread Igor Fedotov
us mon (issue#24481, pr#22655, Sage Weil) * os/bluestore: fix flush_commit locking (issue#21480, pr#22904, Sage Weil) * os/bluestore: fix incomplete faulty range marking when doing compression (issue#21480, pr#22909, Igor Fedotov) * os/bluestore: fix races on SharedBlob::coll in ~SharedBlob (issue#248

Re: [ceph-users] Bluestore DB size and onode count

2018-09-10 Thread Igor Fedotov
Hi Nick. On 9/10/2018 1:30 PM, Nick Fisk wrote: If anybody has 5 minutes could they just clarify a couple of things for me 1. onode count, should this be equal to the number of objects stored on the OSD? Through reading several posts, there seems to be a general indication that this is the

Re: [ceph-users] Bluestore DB size and onode count

2018-09-10 Thread Igor Fedotov
On 9/10/2018 8:26 PM, Mark Nelson wrote: On 09/10/2018 12:22 PM, Igor Fedotov wrote: Just in case - is slow_used_bytes equal to 0? Some DB data might reside at slow device if spill over has happened. Which doesn't require full DB volume to happen - that's by RocksDB's design

Re: [ceph-users] corrupt OSD: BlueFS.cc: 828: FAILED assert

2018-07-05 Thread Igor Fedotov
Hi Jake, IMO it doesn't make sense to recover from this drive/data as the damage coverage looks pretty wide. By modifying BlueFS code you can bypass that specific assertion but most probably BlueFS and  other BlueStore stuff are pretty inconsistent and most probably are unrecoverable at the

Re: [ceph-users] jemalloc / Bluestore

2018-07-05 Thread Igor Fedotov
Hi Uwe, AFAIK jemalloc isn't recommended for use with BlueStore anymore. tcmalloc is the right way so far. Thanks, Igor On 7/5/2018 4:08 PM, Uwe Sauter wrote: Hi all, is using jemalloc still recommended for Ceph? There are multiple sites (e.g.

Re: [ceph-users] Small ceph cluster design question

2018-07-06 Thread Igor Fedotov
Hi Satish, just one caution here. I tried DL380 G8 with NVME Samsung 960 Pro drive. The latter (despite pretty good overall performance numbers) is hardly suitable for BlueStore WAL/DB deployment since it handles (f)synced writes very slowly. Which is crucial for DB/WAL. The root cause IMO is

Re: [ceph-users] SPDK for BlueStore rocksDB

2018-01-24 Thread Igor Fedotov
Jorge, I'd suggest to start with regular (non-SPDK) configuration and deploy test cluster. Then do some benchmarking against it and check if nvme drive is the actual bottleneck. I doubt it is though. I did some experiments a while ago and didn't see any benefit from SPDK in my case -

Re: [ceph-users] Bluestore DB size and onode count

2018-09-11 Thread Igor Fedotov
PM, Igor Fedotov wrote: Hi Nick. On 9/10/2018 1:30 PM, Nick Fisk wrote: If anybody has 5 minutes could they just clarify a couple of things for me 1. onode count, should this be equal to the number of objects stored on the OSD? Through reading several posts, there seems to be a general

Re: [ceph-users] Memory leak in Ceph OSD?

2018-03-26 Thread Igor Fedotov
Hi Alex, I can see your bug report: https://tracker.ceph.com/issues/23462 if your settings from there are applicable for your comment here then you have bluestore cache size limit set to 5 Gb that totals in 90 Gb RAM for  18 OSD for BlueStore cache only. There is also additional memory
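
The cache total quoted in this message follows from a simple multiplication, sketched below. The helper name is hypothetical, and the sketch covers only the BlueStore cache limit; it does not model the additional memory the message goes on to mention.

```python
def total_bluestore_cache_gb(cache_size_gb: float, num_osds: int) -> float:
    """RAM consumed by BlueStore cache alone if every OSD fills its cache limit."""
    return cache_size_gb * num_osds


# 5 Gb cache limit per OSD, 18 OSDs, as in the message
print(total_bluestore_cache_gb(5, 18))
```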

Re: [ceph-users] Cluster is empty but it still use 1Gb of data

2018-03-02 Thread Igor Fedotov
Yes, by default BlueStore reports 1Gb per OSD as used by BlueFS. On 3/2/2018 8:10 PM, Max Cuttins wrote: Umh Taking a look to your computation I think the ratio OSD/Overhead it's really about 1.1Gb per OSD. Because I have 9 NVMe OSD alive right now. So about 9.5Gb of overhead. So I

Re: [ceph-users] How to "apply" and monitor bluestore compression?

2018-02-26 Thread Igor Fedotov
Hi Martin, On 2/26/2018 6:19 PM, Martin Emrich wrote: Hi! I just migrated my backup cluster from filestore to bluestore (8 OSDs, one OSD at a time, took two weeks but went smoothly). I also enabled compression on a pool beforehand and am impressed by the compression ratio (snappy,

Re: [ceph-users] Memory leak in Ceph OSD?

2018-02-28 Thread Igor Fedotov
Hi Stefan, can you disable compression and check if memory is still leaking. If it stops then the issue is definitely somewhere along the "compress" path. Thanks, Igor On 2/28/2018 6:18 PM, Stefan Kooman wrote: Hi, TL;DR: we see "used" memory grows indefinitely on our OSD servers.

Re: [ceph-users] Cluster is empty but it still use 1Gb of data

2018-03-02 Thread Igor Fedotov
Hi Max, how many OSDs do you have? Are they bluestore? what's the "ceph df detail" output? On 3/2/2018 1:21 PM, Max Cuttins wrote: Hi everybody, i deleted everything from the cluster after some test with RBD. Now I see that there something still in use:   data:     pools:   0

Re: [ceph-users] Ceph Luminous RocksDB vs WalDB?

2018-06-28 Thread Igor Fedotov
think that the first one is the right way to go. The second command only specifies the db partition but no dedicated WAL partition. The first one should do the trick. On 28.06.2018 22:58, Igor Fedotov wrote: I think the second variant is what you need. But I'm not the guru in ceph-deploy so

Re: [ceph-users] Luminous BlueStore OSD - Still a way to pinpoint an object?

2018-06-28 Thread Igor Fedotov
You can access offline OSD using ceph-objectstore-tool which allows to enumerate and access specific objects. Not sure this makes sense for any purposes other than low-level debugging though.. Thanks, Igor On 6/28/2018 5:42 AM, Yu Haiyang wrote: Hi All, Previously I read this article

Re: [ceph-users] Ceph Luminous RocksDB vs WalDB?

2018-06-28 Thread Igor Fedotov
is ssd disk for osd /dev/nvmen0n1p1 is 10G partition /dev/nvme0n1p2 is 25G partition Thanks, Pardhiv K On Wed, Jun 27, 2018 at 9:08 AM Igor Fedotov <ifedo...@suse.de> wrote: Hi Pardhiv, there is no WalDB in Ceph. It's WAL (Write Ahead Log) that is a way to ensure wri

Re: [ceph-users] OSD log being spammed with BlueStore stupidallocator dump

2018-10-15 Thread Igor Fedotov
Hi Wido, once you apply the PR you'll probably see the initial error in the log that triggers the dump. Which is most probably the lack of space reported by _balance_bluefs_freespace() function. If so this means that BlueFS rebalance is unable to allocate contiguous 1M chunk at main device

Re: [ceph-users] OSD log being spammed with BlueStore stupidallocator dump

2018-10-15 Thread Igor Fedotov
On 10/15/2018 11:47 PM, Wido den Hollander wrote: Hi, On 10/15/2018 10:43 PM, Igor Fedotov wrote: Hi Wido, once you apply the PR you'll probably see the initial error in the log that triggers the dump. Which is most probably the lack of space reported by _balance_bluefs_freespace() function

Re: [ceph-users] Luminous with osd flapping, slow requests when deep scrubbing

2018-10-15 Thread Igor Fedotov
Perhaps this is the same issue as indicated here: https://tracker.ceph.com/issues/36364 Can you check OSD iostat reports for similarities to this ticket, please? Thanks, Igor On 10/15/2018 2:26 PM, Andrei Mikhailovsky wrote: Hello, I am currently running Luminous 12.2.8 on Ubuntu with

Re: [ceph-users] OSD log being spammed with BlueStore stupidallocator dump

2018-10-16 Thread Igor Fedotov
On 10/16/2018 6:57 AM, Wido den Hollander wrote: On 10/16/2018 12:04 AM, Igor Fedotov wrote: On 10/15/2018 11:47 PM, Wido den Hollander wrote: Hi, On 10/15/2018 10:43 PM, Igor Fedotov wrote: Hi Wido, once you apply the PR you'll probably see the initial error in the log that triggers

Re: [ceph-users] slow_used_bytes - SlowDB being used despite lots of space free in BlockDB on SSD?

2018-10-18 Thread Igor Fedotov
On 10/18/2018 7:49 PM, Nick Fisk wrote: Hi, Ceph Version = 12.2.8 8TB spinner with 20G SSD partition Perf dump shows the following: "bluefs": { "gift_bytes": 0, "reclaim_bytes": 0, "db_total_bytes": 21472731136, "db_used_bytes": 3467640832,
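
A hedged sketch of how the quoted counters can be read: it parses a perf dump fragment containing only the four bluefs values visible in this snippet and computes how full the DB volume is. The helper name is hypothetical, and the rest of the perf dump (including slow_used_bytes) is not reproduced because it is cut off here.

```python
import json

# Only the fields visible in the quoted snippet; the real dump has more.
PERF_DUMP_FRAGMENT = """
{
  "bluefs": {
    "gift_bytes": 0,
    "reclaim_bytes": 0,
    "db_total_bytes": 21472731136,
    "db_used_bytes": 3467640832
  }
}
"""


def db_used_fraction(perf_dump: dict) -> float:
    """Fraction of the DB volume in use according to the bluefs counters."""
    bluefs = perf_dump["bluefs"]
    return bluefs["db_used_bytes"] / bluefs["db_total_bytes"]


fraction = db_used_fraction(json.loads(PERF_DUMP_FRAGMENT))
print(f"DB volume used: {fraction:.1%}")
```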

Re: [ceph-users] bluestore compression enabled but no data compressed

2018-10-19 Thread Igor Fedotov
Hi Frank, On 10/19/2018 2:19 PM, Frank Schilder wrote: Hi David, sorry for the slow response, we had a hell of a week at work. OK, so I had compression mode set to aggressive on some pools, but the global option was not changed, because I interpreted the documentation as "pool settings take

Re: [ceph-users] slow_used_bytes - SlowDB being used despite lots of space free in BlockDB on SSD?

2018-10-19 Thread Igor Fedotov
Hi Nick On 10/19/2018 10:14 AM, Nick Fisk wrote: -Original Message- From: Igor Fedotov [mailto:ifedo...@suse.de] Sent: 19 October 2018 01:03 To: n...@fisk.me.uk; ceph-users@lists.ceph.com Subject: Re: [ceph-users] slow_used_bytes - SlowDB being used despite lots of space free

[ceph-users] weekly report 41(ifed)

2018-10-16 Thread Igor Fedotov
[read]: [amber]: [green]: * Almost done with: os/bluestore: allow ceph-bluestore-tool to coalesce BlueFS backing volumes [#core] https://github.com/ceph/ceph/pull/23103 Got a preliminary approval from Sage but still working on bringing in the

Re: [ceph-users] bluestore compression enabled but no data compressed

2018-10-23 Thread Igor Fedotov
Hi Frank, On 10/23/2018 2:56 PM, Frank Schilder wrote: Dear David and Igor, thank you very much for your help. I have one more question about chunk sizes and data granularity on bluestore and will summarize the information I got on bluestore compression at the end. 1) Compression ratio

Re: [ceph-users] ceph-bluestore-tool failed

2018-10-31 Thread Igor Fedotov
You might want to try --path option instead of --dev one. On 10/31/2018 7:29 AM, ST Wong (ITSC) wrote: Hi all, We deployed a testing mimic CEPH cluster using bluestore.    We can’t run ceph-bluestore-tool on OSD with following error: --- # ceph-bluestore-tool show-label --dev

Re: [ceph-users] SSD sizing for Bluestore

2018-11-13 Thread Igor Fedotov
Hi Brendan in fact you can alter RocksDB settings by using bluestore_rocksdb_options config parameter. And hence change "max_bytes_for_level_base" and others. Not sure about dynamic level sizing though. Current defaults are:

Re: [ceph-users] After 13.2.2 upgrade: bluefs mount failed to replay log: (5) Input/output error

2018-10-03 Thread Igor Fedotov
I've seen somewhat similar behavior in a log from Sergey Malinin in another thread ("mimic: 3/4 OSDs crashed...") He claimed it happened after LVM volume expansion. Isn't this the case for you? Am I right that you use LVM volumes? On 10/3/2018 11:22 AM, Kevin Olbrich wrote: Small

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-03 Thread Igor Fedotov
up I will be in the same trouble under some heavy load scenario? On 10/2/2018 9:15 AM, Igor Fedotov wrote: Even with a single device bluestore has a sort of implicit "BlueFS partition" where DB is stored.  And it dynamically adjusts (rebalances) the space for that partition in

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-03 Thread Igor Fedotov
PM, Sergey Malinin wrote: Repair has gone farther but failed on something different - this time it appears to be related to store inconsistency rather than lack of free space. Emailed log to you, beware: over 2GB uncompressed. On 3.10.2018, at 13:15, Igor Fedotov wrote: You may want to try

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Igor Fedotov
You mentioned repair had worked before, is that correct? What's the difference now except the applied patch? Different OSD? Anything else? On 10/2/2018 3:52 PM, Sergey Malinin wrote: It didn't work, emailed logs to you. On 2.10.2018, at 14:43, Igor Fedotov wrote: The major change

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Igor Fedotov
nmountable with IO error. 2 of 3 OSDs got bluefs log corrupted (bluestore tool segfaults at the very end of bluefs-log-dump), I'm not sure whether corruption occurred before or after volume expansion. On 2.10.2018, at 16:07, Igor Fedotov wrote: You mentioned repair had worked before, is th

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Igor Fedotov
The major change is in get_bluefs_rebalance_txn function, it lacked bluefs_rebalance_txn assignment.. On 10/2/2018 2:40 PM, Sergey Malinin wrote: PR doesn't seem to have changed since yesterday. Am I missing something? On 2.10.2018, at 14:15, Igor Fedotov wrote: Please update the patch

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Igor Fedotov
. Thanks, Igor On 10/1/2018 5:32 PM, Igor Fedotov wrote: On 10/1/2018 5:03 PM, Sergey Malinin wrote: Before I received your response, I had already added 20GB to the OSD (by expanding LV followed by bluefs-bdev-expand) and ran "ceph-kvstore-tool bluestore-kv compact", however it still

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Igor Fedotov
So you should call repair which rebalances (i.e. allocates additional space) BlueFS space. Hence allowing OSD to start. Thanks, Igor On 10/1/2018 7:22 PM, Igor Fedotov wrote: Not exactly. The rebalancing from this kv_sync_thread still might be deferred due to the nature of this thread

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Igor Fedotov
1 18:02:41.267 7fc9226c6240 1 bdev(0x55d053c32a80 /var/lib/ceph/osd/ceph-1/block) close 2018-10-01 18:02:41.443 7fc9226c6240 -1 osd.1 0 OSD:init: unable to mount object store 2018-10-01 18:02:41.443 7fc9226c6240 -1 ** ERROR: osd init failed: (5) Input/output error On 1.10.2018, at 18:09, Igor

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Igor Fedotov
s.insert(p.offset, p.length); } + bufferlist bl; + encode(bluefs_extents, bl); + dout(10) << __func__ << " bluefs_extents now 0x" << std::hex +<< bluefs_extents << std::dec << dendl; + synct->set(PREFIX_S

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Igor Fedotov
and what I'm reading in this thread doesn't give me a good feeling at all. Documentation on the topic is very sketchy and online posts contradict each other sometimes. Thank you in advance, On 10/2/2018 8:52 AM, Igor Fedotov wrote: May I have a repair log for that "already expanded"

Re: [ceph-users] Bluestore DB showing as ssd

2018-09-28 Thread Igor Fedotov
Hi Brett, most probably your device is reported as hdd by the kernel, please check by running the following:  cat /sys/block/sdf/queue/rotational It should be 0 for SSD. But as far as I know BlueFS (i.e. DB+WAL stuff) doesn't have any specific behavior which depends on this flag so most
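
The same check can be scripted. The sketch below reads the sysfs flag shown in the message and interprets it; the function names are hypothetical, and the device name is whatever applies on your host (sdf in this thread).

```python
from pathlib import Path


def parse_rotational(flag_text: str) -> bool:
    """Interpret the contents of queue/rotational: "1" means rotational (HDD)."""
    return flag_text.strip() == "1"


def is_rotational(device: str, sys_block: str = "/sys/block") -> bool:
    """Read /sys/block/<device>/queue/rotational and report whether it is set."""
    flag_file = Path(sys_block) / device / "queue" / "rotational"
    return parse_rotational(flag_file.read_text())
```

For example, is_rotational("sdf") is expected to return False when the kernel reports the device as an SSD.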

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Igor Fedotov
Hi Sergey, could you please provide more details on your OSDs ? What are sizes for DB/block devices? Do you have any modifications in BlueStore config settings? Can you share stats you're referring to? Thanks, Igor On 10/1/2018 12:29 PM, Sergey Malinin wrote: Hello, 3 of 4 NVME OSDs

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Igor Fedotov
ic": "ceph osd volume v026",     "mkfs_done": "yes",     "osd_key": "AQCsaZZbYTxXJBAAe3jJI4p6WbMjvA8CBBUJbA==",     "ready": "ready",     "whoami": "0"     },     "dev/osd0/block.wal

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Igor Fedotov
nst*)+0x3db) [0x56193276098b] 8: (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7d9) [0x561932763da9] 9: (rocksdb::CompactionJob::Run()+0x314) [0x561932765504] 10: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::Log

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Igor Fedotov
Malinin wrote: I was able to apply patches to mimic, but nothing changed. One osd that I had space expanded on fails with bluefs mount IO error, others keep failing with enospc. On 1.10.2018, at 19:26, Igor Fedotov wrote: So you should call repair which rebalances (i.e. allocates additional

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-03 Thread Igor Fedotov
. However, expanding the volume immediately renders bluefs unmountable with IO error. 2 of 3 OSDs got bluefs log corrupted (bluestore tool segfaults at the very end of bluefs-log-dump), I'm not sure whether corruption occurred before or after volume expansion. On 2.10.2018, at 16:07, Igor Fedotov w

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Igor Fedotov
-f0c9-4186-aa21-4e5c0172cd93 On 2.10.2018, at 11:26, Igor Fedotov wrote: You did repair for any of this OSDs, didn't you? For all of them? Would you please provide the log for both types (failed on mount and failed with enospc) of failing OSDs. Prior to collecting please remove existing ones

Re: [ceph-users] Bluestore 32bit max_object_size limit

2019-01-18 Thread Igor Fedotov
Hi Kevin, On 1/17/2019 10:50 PM, KEVIN MICHAEL HRPCEK wrote: Hey, I recall reading about this somewhere but I can't find it in the docs or list archive and confirmation from a dev or someone who knows for sure would be nice. What I recall is that bluestore has a max 4GB file size limit

Re: [ceph-users] Bluestore 32bit max_object_size limit

2019-01-21 Thread Igor Fedotov
On 1/18/2019 6:33 PM, KEVIN MICHAEL HRPCEK wrote: On 1/18/19 7:26 AM, Igor Fedotov wrote: Hi Kevin, On 1/17/2019 10:50 PM, KEVIN MICHAEL HRPCEK wrote: Hey, I recall reading about this somewhere but I can't find it in the docs or list archive and confirmation from a dev or someone who

Re: [ceph-users] `ceph-bluestore-tool bluefs-bdev-expand` corrupts OSDs

2018-12-27 Thread Igor Fedotov
Hi Hector, I've never tried bluefs-bdev-expand over encrypted volumes but it works absolutely fine for me in other cases. So it would be nice to troubleshoot this a bit. Suggest to do the following: 1) Backup first 8K for all OSD.1 devices (block, db and wal) using dd. This will probably

Re: [ceph-users] `ceph-bluestore-tool bluefs-bdev-expand` corrupts OSDs

2018-12-27 Thread Igor Fedotov
Hector, One more thing to mention - after expansion please run fsck using ceph-bluestore-tool prior to running osd daemon and collect another log using CEPH_ARGS variable. Thanks, Igor On 12/27/2018 2:41 PM, Igor Fedotov wrote: Hi Hector, I've never tried bluefs-bdev-expand over

Re: [ceph-users] SLOW SSD's after moving to Bluestore

2018-12-11 Thread Igor Fedotov
Hi Tyler, I suspect you have BlueStore DB/WAL at these drives as well, don't you? Then perhaps you have performance issues with f[data]sync requests which DB/WAL invoke pretty frequently. See the following links for details:

Re: [ceph-users] How to recover from corrupted RocksDb

2018-11-29 Thread Igor Fedotov
'ceph-bluestore-tool repair' checks and repairs BlueStore metadata consistency not RocksDB one. It looks like you're observing CRC mismatch during DB compaction which is probably not triggered during the repair. Good point is that it looks like Bluestore's metadata are consistent and hence

Re: [ceph-users] How to recover from corrupted RocksDb

2018-11-29 Thread Igor Fedotov
Yeah, that may be the way. Preferably to disable compaction during this procedure though. To do that please set bluestore rocksdb options = "disable_auto_compactions=true" in [osd] section in ceph.conf Thanks, Igor On 11/29/2018 4:54 PM, Paul Emmerich wrote: does objectstore-tool still

Re: [ceph-users] RocksDB and WAL migration to new block device

2018-11-21 Thread Igor Fedotov
, Florian Engelmann wrote: Great support Igor Both thumbs up! We will try to build the tool today and expand those bluefs devices once again. Am 11/20/18 um 6:54 PM schrieb Igor Fedotov: FYI: https://github.com/ceph/ceph/pull/25187 On 11/20/2018 8:13 PM, Igor Fedotov wrote: On 11/20/2018 7

Re: [ceph-users] RocksDB and WAL migration to new block device

2018-11-22 Thread Igor Fedotov
the bluestore-tool standalone and static? Unfortunately I don't know such a method. May be try hex editing instead? All the best, Florian Am 11/21/18 um 9:34 AM schrieb Igor Fedotov: Actually  (given that your devices are already expanded) you don't need to expand them once again - one can just

Re: [ceph-users] RocksDB and WAL migration to new block device

2018-11-20 Thread Igor Fedotov
Hi Florian, what's your Ceph version? Can you also check the output for ceph-bluestore-tool show-label -p It should report 'size' labels for every volume, please check they contain new values. Thanks, Igor On 11/20/2018 5:29 PM, Florian Engelmann wrote: Hi, today we migrated all of

Re: [ceph-users] RocksDB and WAL migration to new block device

2018-11-20 Thread Igor Fedotov
On 11/20/2018 6:42 PM, Florian Engelmann wrote: Hi Igor, what's your Ceph version? 12.2.8 (SES 5.5 - patched to the latest version) Can you also check the output for ceph-bluestore-tool show-label -p ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-0/ infering bluefs

Re: [ceph-users] RocksDB and WAL migration to new block device

2018-11-20 Thread Igor Fedotov
On 11/20/2018 7:05 PM, Florian Engelmann wrote: Am 11/20/18 um 4:59 PM schrieb Igor Fedotov: On 11/20/2018 6:42 PM, Florian Engelmann wrote: Hi Igor, what's your Ceph version? 12.2.8 (SES 5.5 - patched to the latest version) Can you also check the output for ceph-bluestore-tool

Re: [ceph-users] RocksDB and WAL migration to new block device

2018-11-20 Thread Igor Fedotov
FYI: https://github.com/ceph/ceph/pull/25187 On 11/20/2018 8:13 PM, Igor Fedotov wrote: On 11/20/2018 7:05 PM, Florian Engelmann wrote: Am 11/20/18 um 4:59 PM schrieb Igor Fedotov: On 11/20/2018 6:42 PM, Florian Engelmann wrote: Hi Igor, what's your Ceph version? 12.2.8 (SES 5.5

Re: [ceph-users] Raw space usage in Ceph with Bluestore

2018-11-28 Thread Igor Fedotov
Hi Jody, yes, this is a known issue. Indeed, currently 'ceph df detail' reports raw space usage in GLOBAL section and 'logical' in the POOLS one. While logical one has some flaws. There is a pending PR targeted to Nautilus to fix that: https://github.com/ceph/ceph/pull/19454 If you want to

Re: [ceph-users] `ceph-bluestore-tool bluefs-bdev-expand` corrupts OSDs

2019-01-11 Thread Igor Fedotov
019-01-11 18:56:00.135 7fb74a8272c0 10 bluestore(/var/lib/ceph/osd/ceph-1) _flush_cache And that is where the -EIO is coming from: https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L5305 So I guess there is an inconsistency between some metadata here? On 27/12/20

Re: [ceph-users] bluefs-bdev-expand experience

2019-04-05 Thread Igor Fedotov
Hi Yuri, wrt Round 1 - an ability to expand block(main) device has been added to Nautilus, see: https://github.com/ceph/ceph/pull/25308 wrt Round 2: - not setting 'size' label looks like a bug although I recall I fixed it... Will double check. - wrong stats output is probably related to

Re: [ceph-users] Blocked ops after change from filestore on HDD to bluestore on SDD

2019-02-27 Thread Igor Fedotov
Hi Uwe, AFAIR Samsung 860 Pro isn't for enterprise market, you shouldn't use consumer SSDs for Ceph. I had some experience with Samsung 960 Pro a while ago and it turned out that it handled fsync-ed writes very slowly (compared to the original/advertised performance). Which one can

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-28 Thread Igor Fedotov
Also I think it makes sense to create a ticket at this point. Any volunteers? On 3/1/2019 1:00 AM, Igor Fedotov wrote: Wondering if somebody would be able to apply simple patch that periodically resets StupidAllocator? Just to verify/disprove the hypothesis it's allocator related On 2/28

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-28 Thread Igor Fedotov
Wondering if somebody would be able to apply simple patch that periodically resets StupidAllocator? Just to verify/disprove the hypothesis it's allocator related On 2/28/2019 11:57 PM, Stefan Kooman wrote: Quoting Wido den Hollander (w...@42on.com): Just wanted to chime in, I've seen

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-03-01 Thread Igor Fedotov
Subject: High CPU in StupidAllocator Date: Tue, 12 Feb 2019 10:24:37 +0100 From: Adam Kupczyk To: IGOR FEDOTOV Hi Igor, I have observed that StupidAllocator can burn a lot of CPU in StupidAllocator::allocate_int(). This comes from loops: while (p != free[bin].end

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-03-01 Thread Igor Fedotov
the case. Thanks, Igor Forwarded Message Subject: High CPU in StupidAllocator Date: Tue, 12 Feb 2019 10:24:37 +0100 From: Adam Kupczyk To: IGOR FEDOTOV Hi Igor, I have observed that StupidAllocator can burn a lot of CPU in StupidAllocator::allocate_int

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-19 Thread Igor Fedotov
eplication time ? - Original message - From: "Wido den Hollander" To: "aderumier" Cc: "Igor Fedotov" , "ceph-users" , "ceph-devel" Sent: Friday, 15 February 2019 14:59:30 Subject: Re: [ceph-users] ceph osd commit latency increase over time,

Re: [ceph-users] SSD OSD crashing after upgrade to 12.2.10

2019-02-07 Thread Igor Fedotov
it directly to you or what is the best procedure for you? Reasonable (up to 10MB?) email attachment is OK, for larger ones - whatever publicly available site is fine. Thanks for your support! Eugen Quoting Igor Fedotov: Eugen, At first - you should upgrade to 12.2.11 (or bring

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-15 Thread Igor Fedotov
00 }, "buffer_anon": { "items": 19664, "bytes": 25486050 }, "buffer_meta": { "items": 46189, "bytes": 2956096 }, "osd": { "items": 243, "bytes": 3089016 }, "osd_mapbl": { "items": 17, "

Re: [ceph-users] single OSDs cause cluster hickups

2019-02-15 Thread Igor Fedotov
daemon osd.417 config show | grep discard "bdev_async_discard": "false", "bdev_enable_discard": "false", [...] So there must be something else causing the problems. Thanks, Denny On 15.02.2019 at 12:41, Igor Fedotov wrote: Hi Denny, Do not

Re: [ceph-users] single OSDs cause cluster hickups

2019-02-15 Thread Igor Fedotov
Hi Denny, I do not remember exactly when discards appeared in BlueStore, but they are disabled by default; see the bdev_enable_discard option. Thanks, Igor On 2/15/2019 2:12 PM, Denny Kreische wrote: Hi, two weeks ago we upgraded one of our ceph clusters from luminous 12.2.8 to mimic 13.2.4,
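The check discussed in this thread (looking at the discard-related options of a running OSD) can be sketched in Python. This is a minimal illustration, assuming the output of 'ceph daemon osd.N config show' has been captured as a flat JSON object of option names to string values; the function name discard_options and the sample text are hypothetical, and the bdev_block_size entry is included only to show that unrelated keys are filtered out.

```python
import json


def discard_options(config_json: str) -> dict:
    """Return every option whose name contains 'discard' from a config dump."""
    config = json.loads(config_json)
    return {name: value for name, value in config.items() if "discard" in name}


# Illustrative sample shaped like the two lines quoted in the thread;
# a real config dump contains many more keys.
sample = (
    '{"bdev_async_discard": "false", '
    '"bdev_enable_discard": "false", '
    '"bdev_block_size": "4096"}'
)

for name, value in sorted(discard_options(sample).items()):
    print(f"{name} = {value}")
```

Filtering on the parsed JSON rather than on grep output keeps the option values attached to their names, which makes it easier to compare several OSDs.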

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-15 Thread Igor Fedotov
18.13:30.dump_mempools.txt > Then it is decreasing over time (around 3,7G this morning), but RSS is still at 8G > > > I'm graphing mempools counters too since yesterday, so I'll be able to track them over time. > > - Original message - > From: "Igor Fedotov" >
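For tracking mempool counters like the ones quoted in this thread, a small helper that ranks pools by byte usage may be useful. The sketch below assumes the per-pool entries have already been extracted into a dict of pool name to {"items", "bytes"}; the function name top_pools is hypothetical, and the sample values are the ones that appear in the quoted fragments.

```python
def top_pools(pools: dict, n: int = 3) -> list:
    """Return (name, bytes) for the n pools with the largest 'bytes', biggest first."""
    ranked = sorted(pools.items(), key=lambda kv: kv[1]["bytes"], reverse=True)
    return [(name, stats["bytes"]) for name, stats in ranked[:n]]


# Values copied from the mempool fragments quoted in this thread.
sample = {
    "buffer_anon": {"items": 19664, "bytes": 25486050},
    "buffer_meta": {"items": 46189, "bytes": 2956096},
    "osd": {"items": 243, "bytes": 3089016},
    "osdmap": {"items": 3803, "bytes": 224552},
}

for name, size in top_pools(sample):
    print(f"{name}: {size} bytes")
```

Running the same ranking on dumps taken at different times shows which pool, if any, accounts for growth between them.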

Re: [ceph-users] SSD OSD crashing after upgrade to 12.2.10

2019-02-07 Thread Igor Fedotov
Hi Eugen, looks like this isn't [1] but rather https://tracker.ceph.com/issues/38049 and https://tracker.ceph.com/issues/36541 (= https://tracker.ceph.com/issues/36638 for luminous). Hence it's not fixed in 12.2.10; the target release is 12.2.11. Also please note the patch allows avoiding new

Re: [ceph-users] SSD OSD crashing after upgrade to 12.2.10

2019-02-07 Thread Igor Fedotov
on the result we'll decide how to continue, right? Is there anything else to be enabled for that command or can I simply run 'ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-'? Any other obstacles I should be aware of when running fsck? Thanks! Eugen Quoting Igor Fedotov: Hi

Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-20 Thread Igor Fedotov
You're right - WAL/DB expansion capability is present in Luminous+ releases. But David meant volume migration stuff which appeared in Nautilus, see: https://github.com/ceph/ceph/pull/23103 Thanks, Igor On 2/20/2019 9:22 AM, Konstantin Shalygin wrote: On 2/19/19 11:46 PM, David Turner

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-11 Thread Igor Fedotov
"osdmap": { "items": 3803, "bytes": 224552 }, "osdmap_mapping": { "items": 0, "bytes": 0 }, "pgmap": {

Re: [ceph-users] bluestore block.db

2019-01-25 Thread Igor Fedotov
Hi Frank, you might want to use ceph-kvstore-tool, e.g. ceph-kvstore-tool bluestore-kv dump Thanks, Igor On 1/25/2019 10:49 PM, F Ritchie wrote: Hi all, Is there a way to dump the contents of block.db to a text file? I am not trying to fix a problem just curious and want to poke around.

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-04 Thread Igor Fedotov
Hi Alexandre, looks like a bug in StupidAllocator. Could you please collect BlueStore performance counters right after OSD startup and again once you get high latency? Specifically, the 'l_bluestore_fragmentation' parameter is of interest. Also if you're able to rebuild the code I can probably make a
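Comparing the two counter snapshots requested here (one right after OSD startup, one at high latency) can be sketched as a simple difference over numeric counters. This is an illustration only: it assumes each snapshot has already been flattened into a dict of counter name to value, the helper name counter_deltas is hypothetical, and the sample numbers are invented. The key in a real counter dump may also be spelled differently from the 'l_bluestore_fragmentation' name used in the message.

```python
def counter_deltas(before: dict, after: dict) -> dict:
    """Return after - before for every numeric counter present in both snapshots."""
    return {
        name: after[name] - before[name]
        for name in before
        if name in after
        and isinstance(before[name], (int, float))
        and isinstance(after[name], (int, float))
    }


# Invented sample values, for illustration only.
at_startup = {"l_bluestore_fragmentation": 120, "hypothetical_counter": 10}
at_high_latency = {"l_bluestore_fragmentation": 950, "hypothetical_counter": 10}

for name, delta in sorted(counter_deltas(at_startup, at_high_latency).items()):
    print(f"{name}: {delta:+}")
```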
