Re: [ceph-users] OSD's hang after network blip

2020-01-16 Thread Nick Fisk
On Thursday, January 16, 2020 09:15 GMT, Dan van der Ster  
wrote: 
 
> Hi Nick,
> 
> We saw the exact same problem yesterday after a network outage -- a few of
> our down OSDs were stuck down until we restarted their processes.
> 
> -- Dan
> 
> 
> On Wed, Jan 15, 2020 at 3:37 PM Nick Fisk  wrote:
> 
> > Hi All,
> >
> > Running 14.2.5, currently experiencing some network blips isolated to a
> > single rack, which is under investigation. However, it appears that following
> > a network blip, random OSDs in unaffected racks sometimes do not recover
> > from the incident and are left running in a zombie state. The OSDs appear
> > to be running from a process perspective, but the cluster thinks they are
> > down and they will not rejoin the cluster until the OSD process is
> > restarted, which incidentally takes a lot longer than usual (the systemctl
> > command takes a couple of minutes to complete).
> >
> > If the OSD is left in this state, CPU and memory usage of the process
> > appear to climb, but it never rejoins, at least not in the several hours I
> > have left them. I'm not exactly sure what the OSD is trying to do during
> > this period. There's nothing in the logs during this hung state to indicate
> > that anything is happening, but I will try and inject more verbose logging
> > next time it occurs.
> >
> > Has anybody come across this before, or does anyone have any ideas? In the
> > past, as long as OSDs have been running they have always re-joined following
> > any network issues.
> >
> > Nick
> >
> > Sample from OSD and cluster logs below. Blip happened at 12:06, I
> > restarted OSD at 12:26
> >
> > OSD Logs from OSD that hung (Note this OSD was not directly affected by
> > network outage)
> > 2020-01-15 12:06:32.234 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no
> > reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first
> > ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no
> > reply from [*:*:*:5::13]:6854 osd.49 ever on either front or back, first
> > ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no
> > reply from [*:*:*:5::13]:6834 osd.51 ever on either front or back, first
> > ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no
> > reply from [*:*:*:5::13]:6862 osd.52 ever on either front or back, first
> > ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no
> > reply from [*:*:*:5::13]:6875 osd.53 ever on either front or back, first
> > ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no
> > reply from [*:*:*:5::13]:6894 osd.54 ever on either front or back, first
> > ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no
> > reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first
> > ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:34.034 7f419480a700  0 log_channel(cluster) log [WRN] :
> > Monitor daemon marked osd.43 down, but it is still running
> > 2020-01-15 12:06:34.034 7f419480a700  0 log_channel(cluster) log [DBG] :
> > map e2342992 wrongly marked me down at e2342992
> > 2020-01-15 12:06:34.034 7f419480a700  1 osd.43 2342992
> > start_waiting_for_healthy
> > 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no
> > reply from [*:*:*:5::13]:6854 osd.49 ever on either front or back, first
> > ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no
> > reply from [*:*:*:5::13]:6834 osd.51 ever on either front or back, first
> > ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no
> > reply from [*:*:*:5::13]:6862 osd.52 ever on either front or back, first
> > ping sent 2020-01-15 12:06:11.

Re: [ceph-users] OSD's hang after network blip

2020-01-15 Thread Nick Fisk
On Wednesday, January 15, 2020 14:37 GMT, "Nick Fisk"  wrote: 
 
> Hi All,
> 
> Running 14.2.5, currently experiencing some network blips isolated to a 
> single rack, which is under investigation. However, it appears that following 
> a network blip, random OSDs in unaffected racks sometimes do not recover 
> from the incident and are left running in a zombie state. The OSDs appear to 
> be running from a process perspective, but the cluster thinks they are down 
> and they will not rejoin the cluster until the OSD process is restarted, 
> which incidentally takes a lot longer than usual (the systemctl command takes 
> a couple of minutes to complete).
> 
> If the OSD is left in this state, CPU and memory usage of the process appear 
> to climb, but it never rejoins, at least not in the several hours I have left 
> them. I'm not exactly sure what the OSD is trying to do during this period. 
> There's nothing in the logs during this hung state to indicate that anything 
> is happening, but I will try and inject more verbose logging next time it 
> occurs.
> 
> Has anybody come across this before, or does anyone have any ideas? In the 
> past, as long as OSDs have been running they have always re-joined following 
> any network issues.
> 
> Nick
> 
> Sample from OSD and cluster logs below. Blip happened at 12:06, I restarted 
> OSD at 12:26
> 
> OSD Logs from OSD that hung (Note this OSD was not directly affected by 
> network outage)
> 2020-01-15 12:06:32.234 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no 
> reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first ping 
> sent 2020-01-15 12:06:1


 
It's just happened again and I managed to pull this out of debug_osd 20:

2020-01-15 16:29:01.464 7ff1763df700 10 osd.87 2343121 handle_osd_ping osd.182 
v2:[2a03:25e0:253:5::76]:6839/8394683 says i am down in 2343138
2020-01-15 16:29:01.464 7ff1763df700 10 osd.87 2343121 handle_osd_ping osd.184 
v2:[2a03:25e0:253:5::76]:6814/7394522 says i am down in 2343138
2020-01-15 16:29:01.464 7ff1763df700 10 osd.87 2343121 handle_osd_ping osd.190 
v2:[2a03:25e0:253:5::76]:6860/5986687 says i am down in 2343138
2020-01-15 16:29:01.668 7ff1763df700 10 osd.87 2343121 handle_osd_ping osd.19 
v2:[2a03:25e0:253:5::12]:6815/5153900 says i am down in 2343138

And this from the daemon status output:
sudo ceph daemon osd.87 status
{
"cluster_fsid": "c1703b54-b4cd-41ab-a3ba-4fab241b62f3",
"osd_fsid": "0cd8fe7d-17be-4982-b76f-ef1cbed0c19b",
"whoami": 87,
"state": "waiting_for_healthy",
"oldest_map": 2342407,
"newest_map": 2343121,
"num_pgs": 218
}

So the OSD doesn't seem to be getting the latest map from the mons. Map 2343138 
obviously has osd.87 marked down, hence the error messages from the osd_pings. 
But I'm guessing the latest map the OSD has, 2343121, still has it marked up, so 
it never tries to "re-connect"?
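For anyone wanting to check the same thing, comparing the OSD's view of the map 
with the cluster's and turning up logging via the admin socket is roughly the 
following (osd.87 is just the example from above):

ceph osd dump | head -1                        # current cluster epoch
sudo ceph daemon osd.87 status                 # newest_map this OSD has seen
sudo ceph daemon osd.87 config set debug_osd 20
sudo ceph daemon osd.87 config set debug_ms 1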

Seems similar to this post from a few years back, which didn't seem to end with 
any form of resolution:
https://www.spinics.net/lists/ceph-devel/msg31788.html

Also found this PR for Nautilus which suggested it might be a fix for the 
issue, but it should already be part of the release I'm running:
https://github.com/ceph/ceph/pull/23958

Nick



[ceph-users] OSD's hang after network blip

2020-01-15 Thread Nick Fisk
Hi All,

Running 14.2.5, currently experiencing some network blips isolated to a single 
rack, which is under investigation. However, it appears that following a network 
blip, random OSDs in unaffected racks sometimes do not recover from the incident 
and are left running in a zombie state. The OSDs appear to be running from a 
process perspective, but the cluster thinks they are down and they will not 
rejoin the cluster until the OSD process is restarted, which incidentally takes 
a lot longer than usual (the systemctl command takes a couple of minutes to 
complete).

If the OSD is left in this state, CPU and memory usage of the process appear to 
climb, but it never rejoins, at least not in the several hours I have left them. 
I'm not exactly sure what the OSD is trying to do during this period. There's 
nothing in the logs during this hung state to indicate that anything is 
happening, but I will try and inject more verbose logging next time it occurs.

Has anybody come across this before, or does anyone have any ideas? In the past, 
as long as OSDs have been running they have always re-joined following any 
network issues.

Nick

Sample from OSD and cluster logs below. Blip happened at 12:06, I restarted OSD 
at 12:26

OSD Logs from OSD that hung (Note this OSD was not directly affected by network 
outage)
2020-01-15 12:06:32.234 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no 
reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first ping 
sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no 
reply from [*:*:*:5::13]:6854 osd.49 ever on either front or back, first ping 
sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no 
reply from [*:*:*:5::13]:6834 osd.51 ever on either front or back, first ping 
sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no 
reply from [*:*:*:5::13]:6862 osd.52 ever on either front or back, first ping 
sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no 
reply from [*:*:*:5::13]:6875 osd.53 ever on either front or back, first ping 
sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no 
reply from [*:*:*:5::13]:6894 osd.54 ever on either front or back, first ping 
sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no 
reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first ping 
sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
2020-01-15 12:06:34.034 7f419480a700  0 log_channel(cluster) log [WRN] : 
Monitor daemon marked osd.43 down, but it is still running
2020-01-15 12:06:34.034 7f419480a700  0 log_channel(cluster) log [DBG] : map 
e2342992 wrongly marked me down at e2342992
2020-01-15 12:06:34.034 7f419480a700  1 osd.43 2342992 start_waiting_for_healthy
2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no 
reply from [*:*:*:5::13]:6854 osd.49 ever on either front or back, first ping 
sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no 
reply from [*:*:*:5::13]:6834 osd.51 ever on either front or back, first ping 
sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no 
reply from [*:*:*:5::13]:6862 osd.52 ever on either front or back, first ping 
sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no 
reply from [*:*:*:5::13]:6875 osd.53 ever on either front or back, first ping 
sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no 
reply from [*:*:*:5::13]:6894 osd.54 ever on either front or back, first ping 
sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no 
reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first ping 
sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)

Cluster logs
2020-01-15 12:06:09.740607 mon.mc-ceph-mon1 (mon.0) 531400 : cluster [DBG] 
osd.43 reported failed by osd.57
2020-01-15 12:06:09.945163 mon.mc-ceph-mon1 (mon.0) 531683 : cluster [DBG] 
osd.43 reported failed by osd.63
2020-01-15 12:06:09.945287 mon.mc-ceph-mon1 (mon.0) 531684 : cluster [INF] 
osd.43 f

Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-25 Thread Nick Fisk
> -Original Message-
> From: Vitaliy Filippov 
> Sent: 23 February 2019 20:31
> To: n...@fisk.me.uk; Serkan Çoban 
> Cc: ceph-users 
> Subject: Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow 
> storage for db - why?
> 
> 
> Numbers are easy to calculate from RocksDB parameters, however I also don't 
> understand why it's 3 -> 30 -> 300...
> 
> Default memtables are 256 MB, there are 4 of them, so L0 should be 1 GB,
> L1 should be 10 GB, and L2 should be 100 GB?

From how I understand it, RocksDB levels increment by a factor of x10:
256MB+2.56GB+25.6GB=~28-29GB

Although that is a greatly simplified way of looking at it; this link explains 
it in more detail:
https://github.com/facebook/rocksdb/wiki/Leveled-Compaction
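As a rough sketch of that maths, assuming Ceph's default RocksDB settings 
(max_bytes_for_level_base = 256MB, max_bytes_for_level_multiplier = 10):

base_mb=256; mult=10
for level in 1 2 3; do
    echo "L${level} max size: $(( base_mb * mult ** (level - 1) )) MB"
done
# L1 = 256MB, L2 = 2560MB, L3 = 25600MB. A level only lives on flash if the
# whole level fits, hence the ~3GB / ~30GB / ~300GB "useful" DB sizes.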


> 
> >>> These sizes are roughly 3GB,30GB,300GB. Anything in-between those
> >>> sizes is pointless. Only ~3GB of SSD will ever be used out of a
> >>> 28GB partition. Likewise a 240GB partition is also pointless as only
> >>> ~30GB will be used.
> >
> > Where did you get those numbers? I would like to read more if you can
> > point to a link.
> 
> --
> With best regards,
>Vitaliy Filippov



Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-25 Thread Nick Fisk



> -Original Message-
> From: Konstantin Shalygin 
> Sent: 22 February 2019 14:23
> To: Nick Fisk 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow 
> storage for db - why?
> 
> Bluestore/RocksDB will only put the next level up size of DB on flash if the 
> whole size will fit.
> These sizes are roughly 3GB, 30GB and 300GB. Anything in-between those sizes 
> is pointless. Only ~3GB of SSD will ever be used out of a
> 28GB partition. Likewise a 240GB partition is also pointless as only ~30GB 
> will be used.
> 
> I'm currently running 30GB partitions on my cluster with a mix of 6,8,10TB 
> disks. The 10TB's are about 75% full and use around 14GB,
> this is on mainly 3x Replica RBD(4MB objects)
> 
> Nick
> 
> Can you explain more? You mean that I should increase my 28Gb to 30Gb and 
> this do a trick?
> How is your db_slow size? We should control it? You control it? How?

Yes, I was in a similar situation initially, where I had deployed my OSDs with 
25GB DB partitions and, after 3GB of DB was used, everything else was going into 
the slow DB on disk. From memory, 29GB was just enough to make the DB fit on 
flash, but 30GB is a safe round figure to aim for. With a 30GB DB partition, 
most RBD-type workloads should keep all DB data on flash, even for fairly large 
disks running erasure coding.
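
If it helps, a quick way to keep an eye on spill-over across a host is to loop 
over the OSD admin sockets and pull out the bluefs counters; a minimal sketch, 
assuming a default install layout:

for sock in /var/run/ceph/ceph-osd.*.asok; do
    echo "== $sock =="
    sudo ceph daemon "$sock" perf dump bluefs | grep -E '"(db|slow)_(total|used)_bytes"'
done
# A non-zero slow_used_bytes means part of the DB has spilled onto the data disk.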

Nick

> 
> 
> k



Re: [ceph-users] Bluestore HDD Cluster Advice

2019-02-22 Thread Nick Fisk
>Yes and no... bluestore seems to not work really optimal. For example,
>it has no filestore-like journal waterlining and flushes the deferred
>write queue just every 32 writes (deferred_batch_ops). And when it does
>that it's basically waiting for the HDD to commit and slowing down all
>further writes. And even worse, I found it to be basically untunable
>because when I tried to increase that limit to 1024 - OSDs waited for
>1024 writes to accumulate and then started to flush them in one batch
>which led to a HUGE write stall (tens of seconds). Commiting every 32
>writes is probably good for the thing they gently call "tail latency"
>(sudden latency spikes!) But it has the downside of that the latency is
>just consistently high :-P (ok, consistently average). 

What IO size are you testing? Bluestore will only defer writes under 32KB in 
size by default. Unless you are writing sequentially, only a limited amount of 
buffering via SSD is going to help; you will eventually hit the limits of the 
disk. Could you share some more details, as I'm interested in this topic as 
well.
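
For reference, the cut-off I mean is the bluestore_prefer_deferred_size_hdd 
option; something like the below shows what an OSD is actually using (osd.0 is 
just an example, and if I remember correctly the defaults are 32768 for 
HDD-backed OSDs and 0 for SSD-backed ones):

sudo ceph daemon osd.0 config show | grep bluestore_prefer_deferred
# only writes smaller than this value are deferred via the WAL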

>
>In my small cluster with HGST drives and Intel SSDs for WAL+DB I've
>found the single-thread write latency (fio -iodepth=1 -ioengine=rbd) to
>be similar to a cluster without SSDs at all, it gave me only ~40-60
>iops. As I understand this is exactly because bluestore is flushing data
>each 32 writes and waiting for HDDs to commit all the time. One thing
>that helped me a lot was to disable the drives' volatile write cache
>(`hdparm -W 0 /dev/sdXX`). After doing that I have ~500-600 iops for the
>single-thread load! Which looks like it's finally committing data using
>the WAL correctly. My guess is that this is because HGST drives, in
>addition to a normal volatile write cache, have the thing called "Media
>Cache" which allows the HDD to acknowledge random writes by writing them
>to a temporary place on the platters without doing much seeks, and this
>thing gets enabled only when you disable the volatile cache. 

Interesting, I will have to investigate this further! I wish there were more 
details around this technology from HGST.
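
For anyone wanting to try the same, disabling the volatile cache on all the data 
drives is something along these lines (the device range is obviously specific to 
your host, and hdparm -W 0 does not persist across reboots, so it needs a udev 
rule or startup script to make it permanent):

for dev in /dev/sd[b-m]; do
    sudo hdparm -W 0 "$dev"    # turn off the drive's volatile write cache
done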

>
>At the same time, deferred writes slightly help performance when you
>don't have SSD. But the difference we talking is like tens of iops (30
>vs 40), so it's not noticeable in the SSD era :).

What size IOs are you testing with? I see a difference going from around 
50 IOPs up to over a thousand for a single-threaded 4KB sequential test.

>
>So - in theory yes, deferred writes should be acknowledged by the WAL.
>In practice, bluestore is a big mess of threads, locks and extra writes,
>so this is not always so. In fact, I would recommend you trying bcache
>as an option, it may work better, although I've not tested it myself yet
>:-) 
>
>What about the size of WAL/DB: 
>
>1) you don't need to put them on separate partitions, bluestore
>automatically allocates the available space 
>
>2) 8TB disks only take 16-17 GB for WAL+DB in my case. SSD partitions I
>have allocated for OSDs are just 20GB and it's also OK because bluestore
>can move parts of its DB to the main data device when it runs out of
>space on SSD partition.

Careful here: Bluestore/RocksDB will only put the next level of the DB on the 
flash device if the whole level will fit. These cutoffs are around 3GB, 30GB 
and 300GB by default, so anything in between will not be used. In your example, 
a 20GB flash partition will mean that a large amount of RocksDB ends up on the 
spinning disk (slow_used_bytes).

>
>14 февраля 2019 г. 6:40:35 GMT+03:00, John Petrini
> пишет: 
>
>> Okay that makes more sense, I didn't realize the WAL functioned in a similar 
>> manner to filestore journals (though now that I've had another read of Sage's 
>> blog post, New in Luminous: BlueStore, I notice he does cover this). Is this 
>> to say that writes are acknowledged as soon as they hit the WAL?
>> 
>> Also this raises another question regarding sizing. The Ceph documentation 
>> suggests allocating as much available space as possible to blocks.db but what 
>> about WAL? We'll have 120GB per OSD available on each SSD. Any suggestion on 
>> how we might divvy that between the WAL and DB?




Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-22 Thread Nick Fisk
>On 2/16/19 12:33 AM, David Turner wrote:
>> The answer is probably going to be in how big your DB partition is vs 
>> how big your HDD disk is.  From your output it looks like you have a 
>> 6TB HDD with a 28GB Blocks.DB partition.  Even though the DB used 
>> size isn't currently full, I would guess that at some point since 
>> this OSD was created that it did fill up and what you're seeing is 
>> the part of the DB that spilled over to the data disk. This is why 
>> the official recommendation (that is quite cautious, but cautious 
>> because some use cases will use this up) for a blocks.db partition is 
>> 4% of the data drive.  For your 6TB disks that's a recommendation of 
>> 240GB per DB partition.  Of course the actual size of the DB needed 
>> is dependent on your use case.  But pretty much every use case for a 
>> 6TB disk needs a bigger partition than 28GB.
>
>
>My current db size of osd.33 is 7910457344 bytes, and osd.73 is
>2013265920+4685037568 bytes. 7544Mbyte (24.56% of db_total_bytes) vs
>6388Mbyte (6.69% of db_total_bytes).
>
>Why osd.33 is not used slow storage at this case?

Bluestore/RocksDB will only put the next level of the DB on flash if the whole 
level will fit. These sizes are roughly 3GB, 30GB and 300GB. Anything in-between 
those sizes is pointless: only ~3GB of SSD will ever be used out of a 28GB 
partition, and likewise a 240GB partition is also pointless as only ~30GB will 
be used.

I'm currently running 30GB partitions on my cluster with a mix of 6, 8 and 10TB 
disks. The 10TB disks are about 75% full and use around 14GB; this is mainly 3x 
replica RBD (4MB objects).

Nick



Re: [ceph-users] slow_used_bytes - SlowDB being used despite lots of space free in BlockDB on SSD?

2018-10-30 Thread Nick Fisk
> > >> On 10/18/2018 7:49 PM, Nick Fisk wrote:
> > >>> Hi,
> > >>>
> > >>> Ceph Version = 12.2.8
> > >>> 8TB spinner with 20G SSD partition
> > >>>
> > >>> Perf dump shows the following:
> > >>>
> > >>> "bluefs": {
> > >>>   "gift_bytes": 0,
> > >>>   "reclaim_bytes": 0,
> > >>>   "db_total_bytes": 21472731136,
> > >>>   "db_used_bytes": 3467640832,
> > >>>   "wal_total_bytes": 0,
> > >>>   "wal_used_bytes": 0,
> > >>>   "slow_total_bytes": 320063143936,
> > >>>   "slow_used_bytes": 4546625536,
> > >>>   "num_files": 124,
> > >>>   "log_bytes": 11833344,
> > >>>   "log_compactions": 4,
> > >>>   "logged_bytes": 316227584,
> > >>>   "files_written_wal": 2,
> > >>>   "files_written_sst": 4375,
> > >>>   "bytes_written_wal": 204427489105,
> > >>>   "bytes_written_sst": 248223463173
> > >>>
> > >>> Am I reading that correctly, about 3.4GB used out of 20GB on the SSD, 
> > >>> yet 4.5GB of DB is stored on the spinning disk?
> > >> Correct. Most probably the rationale for this is the layered scheme
> > >> RocksDB uses to keep its sst. For each level It has a maximum
> > >> threshold (determined by level no, some base value and
> > >> corresponding multiplier - see max_bytes_for_level_base &
> > >> max_bytes_for_level_multiplier at
> > >> https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide)
> > >> If the next level  (at its max size) doesn't fit into the space 
> > >> available at DB volume - it's totally spilled over to slow device.
> > >> IIRC level_base is about 250MB and multiplier is 10 so the third level 
> > >> needs 25Gb and hence doesn't fit into your DB volume.
> > >>
> > >> In fact  DB volume of 20GB is VERY small for 8TB OSD - just 0.25% of the 
> > >> slow one. AFAIR current recommendation is about 4%.
> > >>
> > > Thanks Igor, these nodes were designed back in the filestore days
> > > where Small 10DWPD SSD's were all the rage, I might be able to
> > shrink the OS/swap partition and get each DB partition up to 25/26GB,
> > they are not going to get any bigger than that as that’s the NVME
> > completely filled. But I'm then going have to effectively wipe all the 
> > disks I've done so far and re-backfill. ☹ Are there any tunables to
> change this behaviour post OSD deployment to move data back onto SSD?
> > None I'm aware of.
> >
> > However I've just completed development for offline BlueFS volume
> > migration feature within ceph-bluestore-tool. It allows DB/WAL volumes
> > allocation and resizing as well as moving BlueFS data between volumes (with 
> > some limitations unrelated to your case). Hence one
> doesn't need slow backfilling to adjust BlueFS volume configuration.
> > Here is the PR (Nautilus only for now):
> > https://github.com/ceph/ceph/pull/23103
> 
> That sounds awesome, I might look at leaving the current OSD's how they are 
> and look to "fix" them when Nautilus comes out.
> 
> >
> > >
> > > On a related note, does frequently accessed data move into the SSD,
> > > or is the overspill a one way ticket? I would assume writes
> > would cause data in rocksdb to be written back into L0 and work its way 
> > down, but I'm not sure about reads?
> > AFAIK reads don't trigger any data layout changes.
> >
> 
> 
> 
> > >
> > > So I think the lesson from this is that despite whatever DB usage
> > > you may think you may end up with, always make sure your SSD
> > partition is bigger than 26GB (L0+L1)?
> > In fact that's
> > L0+L1 (2x250Mb), L2(2500MB), L3(25000MB) which is about 28GB.
> 
> Well I upgraded a new node and after shrinking the OS, I managed to assign 
> 29GB as the DB's. It's just finished backfilling and
> disappointingly it looks like the DB has over spilled onto the disks ☹ So the 
> magic minimum number is going to be somewhere between
> 30GB and 40GB. I might be able to squeeze 3

Re: [ceph-users] slow_used_bytes - SlowDB being used despite lots of space free in BlockDB on SSD?

2018-10-20 Thread Nick Fisk
> >> On 10/18/2018 7:49 PM, Nick Fisk wrote:
> >>> Hi,
> >>>
> >>> Ceph Version = 12.2.8
> >>> 8TB spinner with 20G SSD partition
> >>>
> >>> Perf dump shows the following:
> >>>
> >>> "bluefs": {
> >>>   "gift_bytes": 0,
> >>>   "reclaim_bytes": 0,
> >>>   "db_total_bytes": 21472731136,
> >>>   "db_used_bytes": 3467640832,
> >>>   "wal_total_bytes": 0,
> >>>   "wal_used_bytes": 0,
> >>>   "slow_total_bytes": 320063143936,
> >>>   "slow_used_bytes": 4546625536,
> >>>   "num_files": 124,
> >>>   "log_bytes": 11833344,
> >>>   "log_compactions": 4,
> >>>   "logged_bytes": 316227584,
> >>>   "files_written_wal": 2,
> >>>   "files_written_sst": 4375,
> >>>   "bytes_written_wal": 204427489105,
> >>>   "bytes_written_sst": 248223463173
> >>>
> >>> Am I reading that correctly, about 3.4GB used out of 20GB on the SSD, yet 
> >>> 4.5GB of DB is stored on the spinning disk?
> >> Correct. Most probably the rationale for this is the layered scheme
> >> RocksDB uses to keep its sst. For each level It has a maximum
> >> threshold (determined by level no, some base value and corresponding
> >> multiplier - see max_bytes_for_level_base &
> >> max_bytes_for_level_multiplier at
> >> https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide)
> >> If the next level  (at its max size) doesn't fit into the space available 
> >> at DB volume - it's totally spilled over to slow device.
> >> IIRC level_base is about 250MB and multiplier is 10 so the third level 
> >> needs 25Gb and hence doesn't fit into your DB volume.
> >>
> >> In fact  DB volume of 20GB is VERY small for 8TB OSD - just 0.25% of the 
> >> slow one. AFAIR current recommendation is about 4%.
> >>
> > Thanks Igor, these nodes were designed back in the filestore days where 
> > Small 10DWPD SSD's were all the rage, I might be able to
> shrink the OS/swap partition and get each DB partition up to 25/26GB, they 
> are not going to get any bigger than that as that’s the
> NVME completely filled. But I'm then going have to effectively wipe all the 
> disks I've done so far and re-backfill. ☹ Are there any
> tunables to change this behaviour post OSD deployment to move data back onto 
> SSD?
> None I'm aware of.
> 
> However I've just completed development for offline BlueFS volume migration 
> feature within ceph-bluestore-tool. It allows DB/WAL
> volumes allocation and resizing as well as moving BlueFS data between volumes 
> (with some limitations unrelated to your case). Hence
> one doesn't need slow backfilling to adjust BlueFS volume configuration.
> Here is the PR (Nautilus only for now):
> https://github.com/ceph/ceph/pull/23103

That sounds awesome. I might look at leaving the current OSDs how they are and 
look to "fix" them when Nautilus comes out.
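
Based on that PR, my understanding is that the eventual workflow on Nautilus 
will look roughly like the below; the OSD path is just an example and I haven't 
tested this myself yet:

systemctl stop ceph-osd@0
# after growing the underlying DB partition, let BlueFS see the new size:
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-0
# then move any spilled-over BlueFS data from the slow device to the DB device:
ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-0 \
    --devs-source /var/lib/ceph/osd/ceph-0/block \
    --dev-target /var/lib/ceph/osd/ceph-0/block.db
systemctl start ceph-osd@0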

> 
> >
> > On a related note, does frequently accessed data move into the SSD, or is 
> > the overspill a one way ticket? I would assume writes
> would cause data in rocksdb to be written back into L0 and work its way down, 
> but I'm not sure about reads?
> AFAIK reads don't trigger any data layout changes.
> 



> >
> > So I think the lesson from this is that despite whatever DB usage you may 
> > think you may end up with, always make sure your SSD
> partition is bigger than 26GB (L0+L1)?
> In fact that's
> L0+L1 (2x250Mb), L2(2500MB), L3(25000MB) which is about 28GB.

Well, I upgraded a new node and, after shrinking the OS, I managed to assign 
29GB for the DBs. It's just finished backfilling and, disappointingly, it looks 
like the DB has spilled over onto the disks ☹ So the magic minimum number is 
going to be somewhere between 30GB and 40GB. I might be able to squeeze 30G 
partitions out if I go for a tiny OS disk and no swap. I will try that on the 
next one, hoping that 30G does it.



> 
> One more observation from my side - RocksDB might additionally use up to 100% 
> of the level maximum size during compaction -
> hence it might make sense to have up to 25GB of additional spare space. 
> Surely this spare space wouldn't be fully used mos

Re: [ceph-users] slow_used_bytes - SlowDB being used despite lots of space free in BlockDB on SSD?

2018-10-19 Thread Nick Fisk
> -Original Message-
> From: Nick Fisk [mailto:n...@fisk.me.uk]
> Sent: 19 October 2018 08:15
> To: 'Igor Fedotov' ; ceph-users@lists.ceph.com
> Subject: RE: [ceph-users] slow_used_bytes - SlowDB being used despite lots of 
> space free in BlockDB on SSD?
> 
> > -Original Message-
> > From: Igor Fedotov [mailto:ifedo...@suse.de]
> > Sent: 19 October 2018 01:03
> > To: n...@fisk.me.uk; ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] slow_used_bytes - SlowDB being used despite lots 
> > of space free in BlockDB on SSD?
> >
> >
> >
> > On 10/18/2018 7:49 PM, Nick Fisk wrote:
> > > Hi,
> > >
> > > Ceph Version = 12.2.8
> > > 8TB spinner with 20G SSD partition
> > >
> > > Perf dump shows the following:
> > >
> > > "bluefs": {
> > >  "gift_bytes": 0,
> > >  "reclaim_bytes": 0,
> > >  "db_total_bytes": 21472731136,
> > >  "db_used_bytes": 3467640832,
> > >  "wal_total_bytes": 0,
> > >  "wal_used_bytes": 0,
> > >  "slow_total_bytes": 320063143936,
> > >  "slow_used_bytes": 4546625536,
> > >  "num_files": 124,
> > >  "log_bytes": 11833344,
> > >  "log_compactions": 4,
> > >  "logged_bytes": 316227584,
> > >  "files_written_wal": 2,
> > >  "files_written_sst": 4375,
> > >  "bytes_written_wal": 204427489105,
> > >  "bytes_written_sst": 248223463173
> > >
> > > Am I reading that correctly, about 3.4GB used out of 20GB on the SSD, yet 
> > > 4.5GB of DB is stored on the spinning disk?
> > Correct. Most probably the rationale for this is the layered scheme
> > RocksDB uses to keep its sst. For each level It has a maximum
> > threshold (determined by level no, some base value and corresponding
> > multiplier - see max_bytes_for_level_base &
> > max_bytes_for_level_multiplier at
> > https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide)
> > If the next level  (at its max size) doesn't fit into the space available 
> > at DB volume - it's totally spilled over to slow device.
> > IIRC level_base is about 250MB and multiplier is 10 so the third level 
> > needs 25Gb and hence doesn't fit into your DB volume.
> >
> > In fact  DB volume of 20GB is VERY small for 8TB OSD - just 0.25% of the 
> > slow one. AFAIR current recommendation is about 4%.
> >
> > >
> 
> Thanks Igor, these nodes were designed back in the filestore days where Small 
> 10DWPD SSD's were all the rage, I might be able to
> shrink the OS/swap partition and get each DB partition up to 25/26GB, they 
> are not going to get any bigger than that as that’s the
> NVME completely filled. But I'm then going have to effectively wipe all the 
> disks I've done so far and re-backfill. ☹ Are there any
> tunables to change this behaviour post OSD deployment to move data back onto 
> SSD?
> 
> On a related note, does frequently accessed data move into the SSD, or is the 
> overspill a one way ticket? I would assume writes would
> cause data in rocksdb to be written back into L0 and work its way down, but 
> I'm not sure about reads?
> 
> This is from a similar slightly newer node with 10TB spinners and 40G 
> partition
> "bluefs": {
> "gift_bytes": 0,
> "reclaim_bytes": 0,
> "db_total_bytes": 53684985856,
> "db_used_bytes": 10380902400,
> "wal_total_bytes": 0,
> "wal_used_bytes": 0,
> "slow_total_bytes": 400033841152,
> "slow_used_bytes": 0,
> "num_files": 165,
> "log_bytes": 15683584,
> "log_compactions": 8,
> "logged_bytes": 384712704,
> "files_written_wal": 2,
> "files_written_sst": 11317,
> "bytes_written_wal": 564218701044,
> "bytes_written_sst": 618268958848
> 
> So I see your point about the 25G file size making it over spill the 
> partition, as it obvious in this case that the 10G of DB used is
> completely stored on the SSD. Theses OSD's are about 70% full, so I'm not 
> expecting a massive increase in usage. Albeit if I move to EC
> pools, I should expect 

Re: [ceph-users] slow_used_bytes - SlowDB being used despite lots of space free in BlockDB on SSD?

2018-10-19 Thread Nick Fisk
> -Original Message-
> From: Igor Fedotov [mailto:ifedo...@suse.de]
> Sent: 19 October 2018 01:03
> To: n...@fisk.me.uk; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] slow_used_bytes - SlowDB being used despite lots of 
> space free in BlockDB on SSD?
> 
> 
> 
> On 10/18/2018 7:49 PM, Nick Fisk wrote:
> > Hi,
> >
> > Ceph Version = 12.2.8
> > 8TB spinner with 20G SSD partition
> >
> > Perf dump shows the following:
> >
> > "bluefs": {
> >  "gift_bytes": 0,
> >  "reclaim_bytes": 0,
> >  "db_total_bytes": 21472731136,
> >  "db_used_bytes": 3467640832,
> >  "wal_total_bytes": 0,
> >  "wal_used_bytes": 0,
> >  "slow_total_bytes": 320063143936,
> >  "slow_used_bytes": 4546625536,
> >  "num_files": 124,
> >  "log_bytes": 11833344,
> >  "log_compactions": 4,
> >  "logged_bytes": 316227584,
> >  "files_written_wal": 2,
> >  "files_written_sst": 4375,
> >  "bytes_written_wal": 204427489105,
> >  "bytes_written_sst": 248223463173
> >
> > Am I reading that correctly, about 3.4GB used out of 20GB on the SSD, yet 
> > 4.5GB of DB is stored on the spinning disk?
> Correct. Most probably the rationale for this is the layered scheme RocksDB 
> uses to keep its sst. For each level It has a maximum
> threshold (determined by level no, some base value and corresponding 
> multiplier - see max_bytes_for_level_base &
> max_bytes_for_level_multiplier at
> https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide)
> If the next level  (at its max size) doesn't fit into the space available at 
> DB volume - it's totally spilled over to slow device.
> IIRC level_base is about 250MB and multiplier is 10 so the third level needs 
> 25Gb and hence doesn't fit into your DB volume.
> 
> In fact  DB volume of 20GB is VERY small for 8TB OSD - just 0.25% of the slow 
> one. AFAIR current recommendation is about 4%.
> 
> >

Thanks Igor. These nodes were designed back in the filestore days when small 
10DWPD SSDs were all the rage. I might be able to shrink the OS/swap partition 
and get each DB partition up to 25/26GB; they are not going to get any bigger 
than that, as that's the NVMe completely filled. But I'm then going to have to 
effectively wipe all the disks I've done so far and re-backfill. ☹ Are there 
any tunables to change this behaviour post OSD deployment to move data back 
onto SSD?

On a related note, does frequently accessed data move onto the SSD, or is the 
overspill a one-way ticket? I would assume writes would cause data in RocksDB 
to be written back into L0 and work its way down, but I'm not sure about reads?

This is from a similar slightly newer node with 10TB spinners and 40G partition
"bluefs": {
"gift_bytes": 0,
"reclaim_bytes": 0,
"db_total_bytes": 53684985856,
"db_used_bytes": 10380902400,
"wal_total_bytes": 0,
"wal_used_bytes": 0,
"slow_total_bytes": 400033841152,
"slow_used_bytes": 0,
"num_files": 165,
"log_bytes": 15683584,
"log_compactions": 8,
"logged_bytes": 384712704,
"files_written_wal": 2,
"files_written_sst": 11317,
"bytes_written_wal": 564218701044,
"bytes_written_sst": 618268958848

So I see your point about the 25G file size making it spill over the partition, 
as it is obvious in this case that the 10G of DB used is completely stored on 
the SSD. These OSDs are about 70% full, so I'm not expecting a massive increase 
in usage. Albeit if I move to EC pools, I should expect maybe a doubling in 
objects, so db_used might double, but it should still be within the 40G 
hopefully.

The 4% rule would not be workable in my case: there are 12x10TB disks in these 
nodes, so I would need nearly 5TB worth of SSD, which would likely cost a 
similar amount to the whole node plus disks. I get the fact that any 
recommendation needs to take the worst case into account, but I would imagine 
that for a lot of simple RBD-only use cases this number is quite inflated.

So I think the lesson from this is that, whatever DB usage you think you may 
end up with, always make sure your SSD partition is bigger than 26GB (L0+L1)?

> > Am I also understanding correctly that BlueFS has reserved 300G of space on 
> > the spinning disk?
> Right.
> > Found a previous bug tracker for something which looks exactly the same 
> > case, but should be fixed now:
> > https://tracker.ceph.com/issues/22264
> >
> > Thanks,
> > Nick
> >


[ceph-users] slow_used_bytes - SlowDB being used despite lots of space free in BlockDB on SSD?

2018-10-18 Thread Nick Fisk
Hi,

Ceph Version = 12.2.8   
8TB spinner with 20G SSD partition 

Perf dump shows the following:

"bluefs": {
"gift_bytes": 0,
"reclaim_bytes": 0,
"db_total_bytes": 21472731136,
"db_used_bytes": 3467640832,
"wal_total_bytes": 0,
"wal_used_bytes": 0,
"slow_total_bytes": 320063143936,
"slow_used_bytes": 4546625536,
"num_files": 124,
"log_bytes": 11833344,
"log_compactions": 4,
"logged_bytes": 316227584,
"files_written_wal": 2,
"files_written_sst": 4375,
"bytes_written_wal": 204427489105,
"bytes_written_sst": 248223463173

Am I reading that correctly, about 3.4GB used out of 20GB on the SSD, yet 4.5GB 
of DB is stored on the spinning disk?

Am I also understanding correctly that BlueFS has reserved 300G of space on the 
spinning disk?

Found a previous bug tracker issue for something which looks like exactly the 
same case, but which should be fixed by now:
https://tracker.ceph.com/issues/22264

Thanks,
Nick



Re: [ceph-users] Bluestore DB size and onode count

2018-09-10 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark 
> Nelson
> Sent: 10 September 2018 18:27
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Bluestore DB size and onode count
> 
> On 09/10/2018 12:22 PM, Igor Fedotov wrote:
> 
> > Hi Nick.
> >
> >
> > On 9/10/2018 1:30 PM, Nick Fisk wrote:
> >> If anybody has 5 minutes could they just clarify a couple of things
> >> for me
> >>
> >> 1. onode count, should this be equal to the number of objects stored
> >> on the OSD?
> >> Through reading several posts, there seems to be a general indication
> >> that this is the case, but looking at my OSD's the maths don't
> >> work.
> > onode_count is the number of onodes in the cache, not the total number
> > of onodes at an OSD.
> > Hence the difference...

Ok, thanks, that makes sense. I assume there isn't actually a counter which 
gives you the total number of objects on an OSD then?
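
For anyone else wondering, summing the per-PG object counts for an OSD gets 
fairly close; a rough sketch (osd.0 as the example, and the plain-output column 
positions may vary a little between releases):

ceph pg ls-by-osd 0 | awk '$1 ~ /^[0-9]+\./ {sum += $2} END {print sum " objects"}'
# column 2 of "ceph pg ls-by-osd" is the per-PG object count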

> >>
> >> Eg.
> >> ceph osd df
> >> ID CLASS WEIGHT  REWEIGHT SIZE  USEAVAIL  %USE  VAR  PGS
> >>   0   hdd 2.73679  1.0 2802G  1347G  1454G 48.09 0.69 115
> >>
> >> So 3TB OSD, roughly half full. This is pure RBD workload (no
> >> snapshots or anything clever) so let's assume worse case scenario of
> >> 4MB objects (Compression is on however, which would only mean more
> >> objects for given size)
> >> 1347000/4=~336750 expected objects
> >>
> >> sudo ceph daemon osd.0 perf dump | grep blue
> >>  "bluefs": {
> >>  "bluestore": {
> >>  "bluestore_allocated": 1437813964800,
> >>  "bluestore_stored": 2326118994003,
> >>  "bluestore_compressed": 445228558486,
> >>  "bluestore_compressed_allocated": 547649159168,
> >>  "bluestore_compressed_original": 1437773843456,
> >>  "bluestore_onodes": 99022,
> >>  "bluestore_onode_hits": 18151499,
> >>  "bluestore_onode_misses": 4539604,
> >>  "bluestore_onode_shard_hits": 10596780,
> >>  "bluestore_onode_shard_misses": 4632238,
> >>  "bluestore_extents": 896365,
> >>  "bluestore_blobs": 861495,
> >>
> >> 99022 onodes, anyone care to enlighten me?
> >>
> >> 2. block.db Size
> >> sudo ceph daemon osd.0 perf dump | grep db
> >>  "db_total_bytes": 8587829248,
> >>  "db_used_bytes": 2375024640,
> >>
> >> 2.3GB=0.17% of data size. This seems a lot lower than the 1%
> >> recommendation (10GB for every 1TB) or 4% given in the official docs. I
> >> know that different workloads will have differing overheads and
> >> potentially smaller objects. But am I understanding these figures
> >> correctly as they seem dramatically lower?
> > Just in case - is slow_used_bytes equal to 0? Some DB data might
> > reside at slow device if spill over has happened. Which doesn't
> > require full DB volume to happen - that's by RocksDB's design.
> >
> > And recommended numbers are a bit... speculative. So it's quite
> > possible that you numbers are absolutely adequate.
> 
> FWIW, these are the numbers I came up with after examining the SST files
> generated under different workloads:
> 
> https://protect-eu.mimecast.com/s/7e0iCJq9Bh6pZCzILpy?domain=drive.google.com
> 

Thanks for your input, Mark and Igor. Mark, I can see your RBD figures aren't 
too far off mine, so all looks to be as expected then.

> >>
> >> Regards,
> >> Nick
> >>


[ceph-users] Bluestore DB size and onode count

2018-09-10 Thread Nick Fisk
If anybody has 5 minutes could they just clarify a couple of things for me

1. onode count, should this be equal to the number of objects stored on the OSD?
Through reading several posts, there seems to be a general indication that this 
is the case, but looking at my OSDs the maths doesn't work.

Eg.
ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE  USEAVAIL  %USE  VAR  PGS
 0   hdd 2.73679  1.0 2802G  1347G  1454G 48.09 0.69 115

So a 3TB OSD, roughly half full. This is a pure RBD workload (no snapshots or 
anything clever), so let's assume a worst-case scenario of 4MB objects 
(compression is on, however, which would only mean more objects for a given 
size):
1347000/4 = ~336750 expected objects

sudo ceph daemon osd.0 perf dump | grep blue
"bluefs": {
"bluestore": {
"bluestore_allocated": 1437813964800,
"bluestore_stored": 2326118994003,
"bluestore_compressed": 445228558486,
"bluestore_compressed_allocated": 547649159168,
"bluestore_compressed_original": 1437773843456,
"bluestore_onodes": 99022,
"bluestore_onode_hits": 18151499,
"bluestore_onode_misses": 4539604,
"bluestore_onode_shard_hits": 10596780,
"bluestore_onode_shard_misses": 4632238,
"bluestore_extents": 896365,
"bluestore_blobs": 861495,

99022 onodes, anyone care to enlighten me?

2. block.db Size
sudo ceph daemon osd.0 perf dump | grep db
"db_total_bytes": 8587829248,
"db_used_bytes": 2375024640,

2.3GB = 0.17% of the data size. This seems a lot lower than the 1% 
recommendation (10GB for every 1TB) or the 4% given in the official docs. I 
know that different workloads will have differing overheads and potentially 
smaller objects, but am I understanding these figures correctly? They seem 
dramatically lower.

Regards,
Nick



[ceph-users] Tiering stats are blank on Bluestore OSD's

2018-09-10 Thread Nick Fisk
After upgrading a number of OSD's to Bluestore I have noticed that the cache 
tier OSD's which have so far been upgraded are no
longer logging tier_* stats

"tier_promote": 0,
"tier_flush": 0,
"tier_flush_fail": 0,
"tier_try_flush": 0,
"tier_try_flush_fail": 0,
"tier_evict": 0,
"tier_whiteout": 0,
"tier_dirty": 0,
"tier_clean": 0,
"tier_delay": 0,
"tier_proxy_read": 0,
"tier_proxy_write": 0,
"osd_tier_flush_lat": {
"osd_tier_promote_lat": {
"osd_tier_r_lat": {

Example from Filestore OSD (both are running 12.2.8)
"tier_promote": 265140,
"tier_flush": 0,
"tier_flush_fail": 0,
"tier_try_flush": 88942,
"tier_try_flush_fail": 0,
"tier_evict": 264773,
"tier_whiteout": 35,
"tier_dirty": 89314,
"tier_clean": 89207,
"tier_delay": 0,
"tier_proxy_read": 1446068,
"tier_proxy_write": 10957517,
"osd_tier_flush_lat": {
"osd_tier_promote_lat": {
"osd_tier_r_lat": {

"New Issue" button on tracker seems to cause a 500 error btw

Nick



Re: [ceph-users] help needed

2018-09-06 Thread Nick Fisk
If it helps, I'm seeing about 3GB of DB usage for a 3TB OSD that is about 60% 
full. This is with a pure RBD workload; I believe this can vary depending on 
what your Ceph use case is.

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of David 
Turner
Sent: 06 September 2018 14:09
To: Muhammad Junaid 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] help needed

 

The official ceph documentation recommendations for a db partition for a 4TB 
bluestore osd would be 160GB each.

 

Samsung Evo Pro is not an Enterprise class SSD. A quick search of the ML will 
allow which SSDs people are using.

 

As was already suggested, the better option is an HBA as opposed to a raid 
controller. If you are set on your controllers, write-back is fine as long as 
you have BBU. Otherwise you should be using write-through.

 

On Thu, Sep 6, 2018, 8:54 AM Muhammad Junaid <junaid.fsd...@gmail.com> wrote:

Thanks. Can you please clarify, if we use any other enterprise class SSD for 
journal, should we enable write-back caching available on raid controller for 
journal device or connect it as write through. Regards.

 

On Thu, Sep 6, 2018 at 4:50 PM Marc Roos <m.r...@f1-outsourcing.eu> wrote:

 


Do not use Samsung 850 PRO for journal
Just use LSI logic HBA (eg. SAS2308)


-Original Message-
From: Muhammad Junaid [mailto:junaid.fsd...@gmail.com] 
Sent: donderdag 6 september 2018 13:18
To: ceph-users@lists.ceph.com  
Subject: [ceph-users] help needed

Hi there

Hope, every one will be fine. I need an urgent help in ceph cluster 
design. We are planning 3 OSD node cluster in the beginning. Details are 
as under:

Servers: 3 * DELL R720xd
OS Drives: 2 2.5" SSD
OSD Drives: 10  3.5" SAS 7200rpm 3/4 TB
Journal Drives: 2 SSD's Samsung 850 PRO 256GB each Raid controller: PERC 
H710 (512MB Cache) OSD Drives: On raid0 mode Journal Drives: JBOD Mode 
Rocks db: On same Journal drives

My question is: is this setup good for a start? And critical question 
is: should we enable write back caching on controller for Journal 
drives? Pls suggest. Thanks in advance. Regards.

Muhammad Junaid 






Re: [ceph-users] CephFS+NFS For VMWare

2018-07-02 Thread Nick Fisk


Quoting Ilya Dryomov :


On Fri, Jun 29, 2018 at 8:08 PM Nick Fisk  wrote:


This is for us peeps using Ceph with VMWare.



My current favoured solution for consuming Ceph in VMWare is via  
RBD’s formatted with XFS and exported via NFS to ESXi. This seems  
to perform better than iSCSI+VMFS which seems to not play nicely  
with Ceph’s PG contention issues particularly if working with thin  
provisioned VMDK’s.




I’ve still been noticing some performance issues however, mainly  
noticeable when doing any form of storage migrations. This is  
largely due to the way vSphere transfers VM’s in 64KB IO’s at a QD  
of 32. vSphere does this so Arrays with QOS can balance the IO  
easier than if larger IO’s were submitted. However Ceph’s PG  
locking means that only one or two of these IO’s can happen at a  
time, seriously lowering throughput. Typically you won’t be able to  
push more than 20-25MB/s during a storage migration




There is also another issue in that the IO needed for the XFS  
journal on the RBD, can cause contention and effectively also means  
every NFS write IO sends 2 down to Ceph. This can have an impact on  
latency as well. Due to possible PG contention caused by the XFS  
journal updates when multiple IO’s are in flight, you normally end  
up making more and more RBD’s to try and spread the load. This  
normally means you end up having to do storage migrations…..you can  
see where I’m getting at here.




I’ve been thinking for a while that CephFS works around a lot of  
these limitations.




1.   It supports fancy striping, so should mean there is less  
per object contention


Hi Nick,

Fancy striping is supported since 4.17.  I think its primary use case
is small sequential I/Os, so not sure if it is going to help much, but
it might be worth doing some benchmarking.


Thanks Ilya, I will try to find some time to also investigate this.
Nick
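
For reference, this is roughly what I had in mind for testing it; creating an 
image with non-default striping is something like the below (the pool and image 
names are just placeholders, and the stripe unit must evenly divide the object 
size):

rbd create vmware-pool/nfs-backing01 --size 2T \
    --object-size 4M --stripe-unit 65536 --stripe-count 16
rbd info vmware-pool/nfs-backing01    # confirm the striping settings took effect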



Thanks,

Ilya






Re: [ceph-users] CephFS+NFS For VMWare

2018-06-30 Thread Nick Fisk
Hi Paul,

 

Thanks for your response. Is there anything you can go into more detail on and 
share with the list? I'm sure it would be much appreciated by more than just 
myself.

 

I was planning on the kernel CephFS client and kernel NFS server; both seem to 
achieve better performance, although stability is the greater concern.

 

Thanks,

Nick

From: Paul Emmerich [mailto:paul.emmer...@croit.io] 
Sent: 29 June 2018 17:57
To: Nick Fisk 
Cc: ceph-users 
Subject: Re: [ceph-users] CephFS+NFS For VMWare

 

VMWare can be quite picky about NFS servers.

Some things that you should test before deploying anything with that in 
production:

 

* failover

* reconnects after NFS reboots or outages

* NFS3 vs NFS4

* Kernel NFS (which kernel version? cephfs-fuse or cephfs-kernel?) vs NFS 
Ganesha (VFS FSAL vs. Ceph FSAL)

* Stress tests with lots of VMWare clients - we had a setup than ran fine with 
5 big VMWare hypervisors but started to get random deadlocks once we added 5 
more

 

We are running CephFS + NFS + VMWare in production but we've encountered *a 
lot* of problems until we got that stable for a few configurations.

Be prepared to debug NFS problems at a low level with tcpdump and a careful 
read of the RFC and NFS server source ;)

 

Paul

 

2018-06-29 18:48 GMT+02:00 Nick Fisk <n...@fisk.me.uk>:

This is for us peeps using Ceph with VMWare.

 

My current favoured solution for consuming Ceph in VMWare is via RBD’s 
formatted with XFS and exported via NFS to ESXi. This seems to perform better 
than iSCSI+VMFS which seems to not play nicely with Ceph’s PG contention issues 
particularly if working with thin provisioned VMDK’s.

 

I’ve still been noticing some performance issues however, mainly noticeable 
when doing any form of storage migrations. This is largely due to the way 
vSphere transfers VM’s in 64KB IO’s at a QD of 32. vSphere does this so Arrays 
with QOS can balance the IO easier than if larger IO’s were submitted. However 
Ceph’s PG locking means that only one or two of these IO’s can happen at a 
time, seriously lowering throughput. Typically you won’t be able to push more 
than 20-25MB/s during a storage migration

 

There is also another issue in that the IO needed for the XFS journal on the 
RBD, can cause contention and effectively also means every NFS write IO sends 2 
down to Ceph. This can have an impact on latency as well. Due to possible PG 
contention caused by the XFS journal updates when multiple IO’s are in flight, 
you normally end up making more and more RBD’s to try and spread the load. This 
normally means you end up having to do storage migrations… you can see what 
I'm getting at here.

 

I’ve been thinking for a while that CephFS works around a lot of these 
limitations. 

 

1.   It supports fancy striping, so should mean there is less per object 
contention

2.   There is no FS in the middle to maintain a journal and other 
associated IO

3.   A single large NFS mount should have none of the disadvantages seen 
with a single RBD

4.   No need to migrate VM’s about because of #3

5.   No need to fstrim after deleting VM’s

6.   Potential to do away with pacemaker and use LVS to do active/active 
NFS as ESXi does its own locking with files

 

With this in mind I exported a CephFS mount via NFS and then mounted it to an 
ESXi host as a test.

 

Initial results are looking very good. I’m seeing storage migrations to the NFS 
mount going at over 200MB/s, which equates to several thousand IO’s and seems 
to be writing at the intended QD32.

 

I need to do more testing to make sure everything works as intended, but like I 
say, promising initial results. 

 

Further testing needs to be done to see what sort of MDS performance is 
required; I would imagine that since we are mainly dealing with large files, it 
might not be that critical. I also need to consider the stability of CephFS: 
RBD is relatively simple and is in use by a large proportion of the Ceph 
community, whereas CephFS is a lot easier to "upset".

 

Nick






-- 

Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90



[ceph-users] CephFS+NFS For VMWare

2018-06-29 Thread Nick Fisk
This is for us peeps using Ceph with VMWare.

 

My current favoured solution for consuming Ceph in VMWare is via RBD's 
formatted with XFS and exported via NFS to ESXi. This seems
to perform better than iSCSI+VMFS which seems to not play nicely with Ceph's PG 
contention issues particularly if working with thin
provisioned VMDK's.

 

I've still been noticing some performance issues however, mainly noticeable 
when doing any form of storage migrations. This is
largely due to the way vSphere transfers VM's in 64KB IO's at a QD of 32. 
vSphere does this so arrays with QoS can balance the IO more easily than if 
larger IO's were submitted. However Ceph's PG locking means that only one or two 
of these IO's can happen at a time, seriously lowering throughput. Typically you 
won't be able to push more than 20-25MB/s during a storage migration.
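
As a rough sanity check of that figure (the ~3ms per serialized 64KB write used 
below is an assumption, not a measurement): with PG locking collapsing the 
effective queue depth to around one outstanding IO,

    1 / 0.003s ~ 330 IOPS, and 330 x 64KB ~ 21MB/s

which lines up with the 20-25MB/s seen during migrations.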

 

There is also another issue in that the IO needed for the XFS journal on the 
RBD can cause contention, and it effectively also means every NFS write IO 
sends 2 down to Ceph. This can have an impact on latency as 
well. Due to possible PG contention caused by the XFS
journal updates when multiple IO's are in flight, you normally end up making 
more and more RBD's to try and spread the load. This
normally means you end up having to do storage migrations... you can see what 
I'm getting at here.

 

I've been thinking for a while that CephFS works around a lot of these 
limitations. 

 

1.   It supports fancy striping, so should mean there is less per object 
contention

2.   There is no FS in the middle to maintain a journal and other 
associated IO

3.   A single large NFS mount should have none of the disadvantages seen 
with a single RBD

4.   No need to migrate VM's about because of #3

5.   No need to fstrim after deleting VM's

6.   Potential to do away with pacemaker and use LVS to do active/active 
NFS as ESXi does its own locking with files

 

With this in mind I exported a CephFS mount via NFS and then mounted it to an 
ESXi host as a test.
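
For anyone wanting to try the same thing, the rough shape of the test setup is 
below. This is only a sketch: the hostnames, mount options and export options 
are assumptions, not the exact configuration used here.

# On the NFS gateway: mount CephFS with the kernel client and export it
mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
echo '/mnt/cephfs *(rw,sync,no_root_squash)' >> /etc/exports
exportfs -ra

# On the ESXi host: add the export as an NFS datastore
esxcli storage nfs add --host=nfs-gw1 --share=/mnt/cephfs --volume-name=cephfs-ds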

 

Initial results are looking very good. I'm seeing storage migrations to the NFS 
mount going at over 200MB/s, which equates to
several thousand IO's and seems to be writing at the intended QD32.

 

I need to do more testing to make sure everything works as intended, but like I 
say, promising initial results. 

 

Further testing needs to be done to see what sort of MDS performance is 
required; I would imagine that, since we are mainly dealing with large files, it 
might not be that critical. I also need to consider the stability of CephFS: RBD 
is relatively simple and is in use by a large proportion of the Ceph community, 
whereas CephFS is a lot easier to "upset".

 

Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

2018-06-14 Thread Nick Fisk
For completeness, in case anyone has this issue in the future and stumbles 
across this thread:

If your OSD is crashing and you are still running on a Luminous build that does 
not have the fix in the pull request below, you will
need to compile the ceph-osd binary and replace it on the affected OSD node. 
This will get your OSD's/cluster back up and running.
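
For reference, the rebuild-and-swap is roughly the following. This is only a 
sketch: the commit to cherry-pick is the one from the pull request linked below, 
and the build steps will vary by distro.

git clone --branch v12.2.5 https://github.com/ceph/ceph.git
cd ceph
git fetch origin pull/22396/head && git cherry-pick FETCH_HEAD
./install-deps.sh && ./do_cmake.sh
cd build && make ceph-osd
# stop the affected OSD, back up /usr/bin/ceph-osd, drop in the new binary, restart it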

With regard to the stray object/clone, I was unable to remove it using the 
objectstore tool; I'm guessing this is because, as far as the OSD is concerned, 
it believes that clone should have already been deleted. I 
am still running Filestore on this cluster and
simply removing the clone object from the OSD PG folder (Note: the object won't 
have _head in its name) and then running a deep
scrub on the PG again fixed the issue for me.
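
In concrete terms that boils down to something like this (a sketch: it assumes 
systemd-managed OSDs and that the OSD is stopped while the file is removed; the 
path is the clone file quoted further down in the thread, i.e. the one without 
_head in its name):

systemctl stop ceph-osd@46
rm '/var/lib/ceph/osd/ceph-46/current/1.2ca_head/DIR_A/DIR_C/DIR_2/DIR_D/DIR_0/rbd\udata.0c4c14238e1f29.000bf479__1c_F930D2CA__1'
systemctl start ceph-osd@46
ceph pg deep-scrub 1.2ca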

Nick

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Nick 
Fisk
Sent: 07 June 2018 14:01
To: 'ceph-users' 
Subject: Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

So I've recompiled a 12.2.5 ceph-osd binary with the fix included in 
https://github.com/ceph/ceph/pull/22396

The OSD has restarted as expected and the PG is now active+clean... success so 
far.

What's the best method to clean up the stray snapshot on OSD.46? I'm guessing 
using the object-store-tool, but not sure if I want to
clean the clone metadata or try and remove the actual snapshot object.

-Original Message-----
From: ceph-users  On Behalf Of Nick Fisk
Sent: 05 June 2018 17:22
To: 'ceph-users' 
Subject: Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

So, from what I can see I believe this issue is being caused by one of the 
remaining OSD's acting for this PG containing a snapshot
file of the object

/var/lib/ceph/osd/ceph-46/current/1.2ca_head/DIR_A/DIR_C/DIR_2/DIR_D/DIR_0/rbd\udata.0c4c14238e1f29.000bf479__head_F930D2CA_
_1
/var/lib/ceph/osd/ceph-46/current/1.2ca_head/DIR_A/DIR_C/DIR_2/DIR_D/DIR_0/rbd\udata.0c4c14238e1f29.000bf479__1c_F930D2CA__1

Both the OSD which crashed and the other acting OSD don't have this "1c" 
snapshot file. Is the solution to use the objectstore tool to remove this "1c" 
snapshot object and then allow things to backfill?


-Original Message-
From: ceph-users  On Behalf Of Nick Fisk
Sent: 05 June 2018 16:43
To: 'ceph-users' 
Subject: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

Hi,

After a RBD snapshot was removed, I seem to be having OSD's assert when they 
try and recover pg 1.2ca. The issue seems to follow the
PG around as OSD's fail. I've seen this bug tracker and associated mailing list 
post, but would appreciate if anyone can give any
pointers. https://tracker.ceph.com/issues/23030,

Cluster is 12.2.5 with Filestore and 3x replica pools

I noticed after the snapshot was removed that there were 2 inconsistent PG's. I 
ran a repair on both and one of them (1.2ca) seems 
to have triggered this issue.


Snippet of log from two different OSD's. 1st one is from the original OSD 
holding the PG, 2nd is from where the OSD was marked out
and it was trying to be recovered to:


-4> 2018-06-05 15:15:45.997225 7febce7a5700  1 -- 
[2a03:25e0:254:5::112]:6819/3544315 --> [2a03:25e0:253:5::14]:0/3307345 --
osd_ping(ping_reply e2196479 stamp 2018-06-05 15:15:45.994907) v4 -- 
0x557d3d72f800 con 0
-3> 2018-06-05 15:15:46.018088 7febb2954700  1 -- 
[2a03:25e0:254:5::112]:6817/3544315 --> [2a03:25e0:254:5::12]:6809/5784710 --
MOSDPGPull(1.2ca e2196479/2196477 cost 8389608) v3 -- 0x557d4d180b40 con 0
-2> 2018-06-05 15:15:46.018412 7febce7a5700  5 -- 
[2a03:25e0:254:5::112]:6817/3544315 >> [2a03:25e0:254:5::12]:6809/5784710
conn(0x557d4b1a9000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=13470 
cs=1 l=0).
rx osd.46 seq 13 0x557d4d180b40 MOSDPGPush(1.2ca 2196479/2196477 
[PushOp(1:534b0c9f:::rbd_data.0c4c14238e1f29.000bf479:1c,
version: 2195927'1249660, data_included: [], data_size: 0, omap_header_size: 0, 
omap_ent
ries_size: 0, attrset_size: 1, recovery_info:
ObjectRecoveryInfo(1:534b0c9f:::rbd_data.0c4c14238e1f29.000bf479:1c@2195927'1249660,
 size: 0, copy_subset: [], clone_subset:
{}, snapset: 1c=[]:{}), after_progress:
ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, 
omap_recovered_to:, omap_complete:true, error:false),
before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, 
data_complete:false, omap _recovered_to:, omap_complete:false,
error:false))]) v3
-1> 2018-06-05 15:15:46.018425 7febce7a5700  1 -- 
[2a03:25e0:254:5::112]:6817/3544315 <== osd.46
[2a03:25e0:254:5::12]:6809/5784710 13  MOSDPGPush(1.2ca 2196479/2196477 
[PushOp(1:534b0c9f:::rbd_data.0c4c14238e1f
29.000bf479:1c, version: 2195927'1249660, data_included: [], data_size:

Re: [ceph-users] How to fix a Ceph PG in unkown state with no OSDs?

2018-06-14 Thread Nick Fisk
I've seen things like this happen if you end up with extreme weighting towards 
a small set of OSD's. CRUSH tries a slightly different combination of OSD's at 
each attempt, but with an extremely lopsided weighting it can run out of 
attempts before it finds a set of OSD's which matches the crush map.

 

Setting the number of crush tries up really high, like into the several 
hundreds, can help if you need extremely lopsided weights. I think from memory 
the crush setting is choose_total_tries.
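
For reference, that tunable lives in the decompiled crush map, so the usual 
round trip looks like this (the value 100 is only an example):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt and raise:  tunable choose_total_tries 100
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new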

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Gregory Farnum
Sent: 14 June 2018 21:47
To: Oliver Schulz 
Cc: ceph-users 
Subject: Re: [ceph-users] How to fix a Ceph PG in unkown state with no OSDs?

 

I don't think there's a way to help them. They "should" get priority in 
recovery, but there were a number of bugs with it in various versions and 
forcing that kind of priority without global decision making is prone to issues.

 

But yep, looks like things will eventually become all good now. :)

 

On Thu, Jun 14, 2018 at 4:39 PM Oliver Schulz <oliver.sch...@tu-dortmund.de> wrote:

Thanks, Greg!!

I reset all the OSD weights to 1.00, and I think I'm in a much
better state now. The only trouble left in "ceph health detail" is

PG_DEGRADED Degraded data redundancy: 4/404985012 objects degraded 
(0.000%), 3 pgs degraded
 pg 2.47 is active+recovery_wait+degraded+remapped, acting [177,68,187]
 pg 2.1fd is active+recovery_wait+degraded+remapped, acting [36,83,185]
 pg 2.748 is active+recovery_wait+degraded, acting [31,8,149]

(There's a lot of misplaced PGs now, obviously). The interesting
thing is that my "lost" PG is back, too, with three acting OSDs.

Maybe I dodged the bullet - what do you think?

One question: Is there a way to give recovery of the three
degraded PGs priority over backfilling the misplaced ones?
I tried "ceph pg force-recovery" but it didn't seem to have
any effect, they were still on "recovery_wait", after.


Cheers,

Oliver


On 14.06.2018 22:09, Gregory Farnum wrote:
> On Thu, Jun 14, 2018 at 4:07 PM Oliver Schulz 
> <oliver.sch...@tu-dortmund.de> wrote:
> 
> Hi Greg,
> 
> I increased the hard limit and rebooted everything. The
> PG without acting OSDs still has none, but I also have
> quite a few PGs with that look like this, now:
> 
>   pg 1.79c is stuck undersized for 470.640254, current state
> active+undersized+degraded, last acting [179,154]
> 
> I had that problem before (only two acting OSDs on a few PGs),
> I always solved it by setting the primary OSD to out and then
> back in a few seconds later (resulting in a very quick recovery,
> then all was fine again). But maybe that's not the ideal solution?
> 
> Here's "ceph pg map" for one of them:
> 
>   osdmap e526060 pg 1.79c (1.79c) -> up [179,154] acting [179,154]
> 
> I also have two PG's that have only one acting OSD, now:
> 
>   osdmap e526060 pg 0.58a (0.58a) -> up [174] acting [174]
>   osdmap e526060 pg 2.139 (2.139) -> up [61] acting [61]
> 
> How can I make Ceph assign three OSD's to all of these weird PGs?
> Before the reboot, they all did have three OSDs assigned (except for
> the one that has none), and they were not shown as degraded.
> 
> 
>   > If it's the second, then fixing the remapping problem will
> resolve it.
>   > That's probably/hopefully just by undoing the remap-by-utilization
>   > changes.
> 
> How do I do that, best? Just set all the weights back to 1.00?
> 
> 
> Yeah. This is probably the best way to fix up the other undersized PGs — 
> at least, assuming it doesn't result in an over-full PG!
> 
> I don't work with overflowing OSDs/clusters often, but my suspicion is 
> you're better off with something like CERN's reweight scripts than using 
> reweight-by-utilization. Unless it's improved without my noticing, that 
> algorithm just isn't very good. :/
> -Greg
> 
> 
> 
> Cheers,
> 
> Oliver
> 
> 
> P.S.: Thanks so much for helping!
> 
> 
> 
> On 14.06.2018 21:37, Gregory Farnum wrote:
>  > On Thu, Jun 14, 2018 at 3:26 PM Oliver Schulz
>  > <oliver.sch...@tu-dortmund.de> wrote:
>  >
>  > But the contents of the remapped PGs should still be
>  > Ok, right? What confuses me is that they don't
>  > backfill - why don't the "move" where they belong?
>  >
>  > As for the PG hard limit, yes, I ran into this. Our
>  > cluster had been very (very) full, but I wanted the
>  > new OSD nodes to use bluestore, so I upda

Re: [ceph-users] Why the change from ceph-disk to ceph-volume and lvm? (and just not stick with direct disk access)

2018-06-08 Thread Nick Fisk
http://docs.ceph.com/docs/master/ceph-volume/simple/

?
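
In other words, something along these lines (a sketch only; the device path is 
an example):

# scan an existing ceph-disk style OSD and persist its metadata as JSON
ceph-volume simple scan /dev/sdb1
# activate it without converting anything to LVM
ceph-volume simple activate --all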

 

From: ceph-users  On Behalf Of Konstantin 
Shalygin
Sent: 08 June 2018 11:11
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Why the change from ceph-disk to ceph-volume and lvm? 
(and just not stick with direct disk access)

 

What is the reasoning behind switching to lvm? Does it make sense to go 
through (yet) another layer to access the disk? Why creating this 
dependency and added complexity? It is fine as it is, or not?

In fact, the question is why one tool is replaced by another without preserving 
its functionality.
Why lvm, why not bcache?

It seems to me that someone on the dev team has pushed the idea that lvm solves 
all problems.
But it also adds overhead, and since it is a kernel module, a kernel update can 
bring a performance drop, changes in module settings, etc.
I understand that for Red Hat Storage this is a solution, but for a community 
with different distributions and hardware this may be superfluous.
I would like the possibility of preparing osd's with direct disk access to be 
restored, even if it is not the default.
This would also preserve existing configurations for ceph-ansible. Actually, I 
didn't know whether my osd's were created by ceph-disk, ceph-volume or whatever 
before this deprecation.





k



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

2018-06-07 Thread Nick Fisk
So I've recompiled a 12.2.5 ceph-osd binary with the fix included in 
https://github.com/ceph/ceph/pull/22396

The OSD has restarted as expected and the PG is now active+clean... success so 
far.

What's the best method to clean up the stray snapshot on OSD.46? I'm guessing 
using the object-store-tool, but not sure if I want to
clean the clone metadata or try and remove the actual snapshot object.
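
For what it's worth, the objectstore-tool route would look roughly like the 
following. This is a sketch only, and as the 2018-06-14 follow-up in this 
thread notes, it did not actually work for this particular stray clone:

systemctl stop ceph-osd@46
# list objects in the PG and find the JSON entry for the :1c clone
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-46 --journal-path /var/lib/ceph/osd/ceph-46/journal --op list --pgid 1.2ca | grep 0c4c14238e1f29
# remove it, substituting the JSON printed by the previous command
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-46 --journal-path /var/lib/ceph/osd/ceph-46/journal '<object-json-from-list>' remove
systemctl start ceph-osd@46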

-Original Message-
From: ceph-users  On Behalf Of Nick Fisk
Sent: 05 June 2018 17:22
To: 'ceph-users' 
Subject: Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

So, from what I can see I believe this issue is being caused by one of the 
remaining OSD's acting for this PG containing a snapshot
file of the object

/var/lib/ceph/osd/ceph-46/current/1.2ca_head/DIR_A/DIR_C/DIR_2/DIR_D/DIR_0/rbd\udata.0c4c14238e1f29.000bf479__head_F930D2CA_
_1
/var/lib/ceph/osd/ceph-46/current/1.2ca_head/DIR_A/DIR_C/DIR_2/DIR_D/DIR_0/rbd\udata.0c4c14238e1f29.000bf479__1c_F930D2CA__1

Both the OSD which crashed and the other acting OSD don't have this "1c" 
snapshot file. Is the solution to use the objectstore tool to remove this "1c" 
snapshot object and then allow things to backfill?


-----Original Message-
From: ceph-users  On Behalf Of Nick Fisk
Sent: 05 June 2018 16:43
To: 'ceph-users' 
Subject: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

Hi,

After a RBD snapshot was removed, I seem to be having OSD's assert when they 
try and recover pg 1.2ca. The issue seems to follow the
PG around as OSD's fail. I've seen this bug tracker and associated mailing list 
post, but would appreciate if anyone can give any
pointers. https://tracker.ceph.com/issues/23030,

Cluster is 12.2.5 with Filestore and 3x replica pools

I noticed after the snapshot was removed that there were 2 inconsistent PG's. I 
ran a repair on both and one of them (1.2ca) seems 
to have triggered this issue.


Snippet of log from two different OSD's. 1st one is from the original OSD 
holding the PG, 2nd is from where the OSD was marked out
and it was trying to be recovered to:


-4> 2018-06-05 15:15:45.997225 7febce7a5700  1 -- 
[2a03:25e0:254:5::112]:6819/3544315 --> [2a03:25e0:253:5::14]:0/3307345 --
osd_ping(ping_reply e2196479 stamp 2018-06-05 15:15:45.994907) v4 -- 
0x557d3d72f800 con 0
-3> 2018-06-05 15:15:46.018088 7febb2954700  1 -- 
[2a03:25e0:254:5::112]:6817/3544315 --> [2a03:25e0:254:5::12]:6809/5784710 --
MOSDPGPull(1.2ca e2196479/2196477 cost 8389608) v3 -- 0x557d4d180b40 con 0
-2> 2018-06-05 15:15:46.018412 7febce7a5700  5 -- 
[2a03:25e0:254:5::112]:6817/3544315 >> [2a03:25e0:254:5::12]:6809/5784710
conn(0x557d4b1a9000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=13470 
cs=1 l=0).
rx osd.46 seq 13 0x557d4d180b40 MOSDPGPush(1.2ca 2196479/2196477 
[PushOp(1:534b0c9f:::rbd_data.0c4c14238e1f29.000bf479:1c,
version: 2195927'1249660, data_included: [], data_size: 0, omap_header_size: 0, 
omap_ent
ries_size: 0, attrset_size: 1, recovery_info:
ObjectRecoveryInfo(1:534b0c9f:::rbd_data.0c4c14238e1f29.000bf479:1c@2195927'1249660,
 size: 0, copy_subset: [], clone_subset:
{}, snapset: 1c=[]:{}), after_progress:
ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, 
omap_recovered_to:, omap_complete:true, error:false),
before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, 
data_complete:false, omap _recovered_to:, omap_complete:false,
error:false))]) v3
-1> 2018-06-05 15:15:46.018425 7febce7a5700  1 -- 
[2a03:25e0:254:5::112]:6817/3544315 <== osd.46
[2a03:25e0:254:5::12]:6809/5784710 13  MOSDPGPush(1.2ca 2196479/2196477 
[PushOp(1:534b0c9f:::rbd_data.0c4c14238e1f
29.000bf479:1c, version: 2195927'1249660, data_included: [], data_size: 
0, omap_header_size: 0, omap_entries_size: 0,
attrset_size: 1, recovery_info: 
ObjectRecoveryInfo(1:534b0c9f:::rbd_data.0c4c14238e1f29.0
00bf479:1c@2195927'1249660, size: 0, copy_subset: [], clone_subset: {}, 
snapset: 1c=[]:{}), after_progress:
ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, 
omap_recovered_to:, omap_complete:t rue, error:false),
before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, 
data_complete:false, omap_recovered_to:, omap_complete:false,
error:false))]) v3  885+0+0 (695790480 0 0) 0x557d4d180b40 con 0x5
57d4b1a9000
 0> 2018-06-05 15:15:46.022099 7febb2954700 -1 
/build/ceph-12.2.5/src/osd/PrimaryLogPG.cc: In function 'virtual void
PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, 
ObjectContextRef, bool , ObjectStore::Transaction*)'
thread 7febb2954700 time 2018-06-05 15:15:46.019130
/build/ceph-12.2.5/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != 
recovery_info.ss.clone_snaps.end())





Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

2018-06-05 Thread Nick Fisk
So, from what I can see I believe this issue is being caused by one of the 
remaining OSD's acting for this PG containing a snapshot
file of the object

/var/lib/ceph/osd/ceph-46/current/1.2ca_head/DIR_A/DIR_C/DIR_2/DIR_D/DIR_0/rbd\udata.0c4c14238e1f29.000bf479__head_F930D2CA_
_1
/var/lib/ceph/osd/ceph-46/current/1.2ca_head/DIR_A/DIR_C/DIR_2/DIR_D/DIR_0/rbd\udata.0c4c14238e1f29.000bf479__1c_F930D2CA__1

Both the OSD which crashed and the other acting OSD don't have this "1c" 
snapshot file. Is the solution to use the objectstore tool to remove this "1c" 
snapshot object and then allow things to backfill?


-Original Message-
From: ceph-users  On Behalf Of Nick Fisk
Sent: 05 June 2018 16:43
To: 'ceph-users' 
Subject: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

Hi,

After a RBD snapshot was removed, I seem to be having OSD's assert when they 
try and recover pg 1.2ca. The issue seems to follow the
PG around as OSD's fail. I've seen this bug tracker and associated mailing list 
post, but would appreciate if anyone can give any
pointers. https://tracker.ceph.com/issues/23030,

Cluster is 12.2.5 with Filestore and 3x replica pools

I noticed after the snapshot was removed that there were 2 inconsistent PG's. I 
ran a repair on both and one of them (1.2ca) seems 
to have triggered this issue.


Snippet of log from two different OSD's. 1st one is from the original OSD 
holding the PG, 2nd is from where the OSD was marked out
and it was trying to be recovered to:


-4> 2018-06-05 15:15:45.997225 7febce7a5700  1 -- 
[2a03:25e0:254:5::112]:6819/3544315 --> [2a03:25e0:253:5::14]:0/3307345 --
osd_ping(ping_reply e2196479 stamp 2018-06-05 15:15:45.994907) v4 -- 
0x557d3d72f800 con 0
-3> 2018-06-05 15:15:46.018088 7febb2954700  1 -- 
[2a03:25e0:254:5::112]:6817/3544315 --> [2a03:25e0:254:5::12]:6809/5784710 --
MOSDPGPull(1.2ca e2196479/2196477 cost 8389608) v3 -- 0x557d4d180b40 con 0
-2> 2018-06-05 15:15:46.018412 7febce7a5700  5 -- 
[2a03:25e0:254:5::112]:6817/3544315 >> [2a03:25e0:254:5::12]:6809/5784710
conn(0x557d4b1a9000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=13470 
cs=1 l=0).
rx osd.46 seq 13 0x557d4d180b40 MOSDPGPush(1.2ca 2196479/2196477 
[PushOp(1:534b0c9f:::rbd_data.0c4c14238e1f29.000bf479:1c,
version: 2195927'1249660, data_included: [], data_size: 0, omap_header_size: 0, 
omap_ent
ries_size: 0, attrset_size: 1, recovery_info:
ObjectRecoveryInfo(1:534b0c9f:::rbd_data.0c4c14238e1f29.000bf479:1c@2195927'1249660,
 size: 0, copy_subset: [], clone_subset:
{}, snapset: 1c=[]:{}), after_progress:
ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, 
omap_recovered_to:, omap_complete:true, error:false),
before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, 
data_complete:false, omap _recovered_to:, omap_complete:false,
error:false))]) v3
-1> 2018-06-05 15:15:46.018425 7febce7a5700  1 -- 
[2a03:25e0:254:5::112]:6817/3544315 <== osd.46
[2a03:25e0:254:5::12]:6809/5784710 13  MOSDPGPush(1.2ca 2196479/2196477 
[PushOp(1:534b0c9f:::rbd_data.0c4c14238e1f
29.000bf479:1c, version: 2195927'1249660, data_included: [], data_size: 
0, omap_header_size: 0, omap_entries_size: 0,
attrset_size: 1, recovery_info: 
ObjectRecoveryInfo(1:534b0c9f:::rbd_data.0c4c14238e1f29.0
00bf479:1c@2195927'1249660, size: 0, copy_subset: [], clone_subset: {}, 
snapset: 1c=[]:{}), after_progress:
ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, 
omap_recovered_to:, omap_complete:t rue, error:false),
before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, 
data_complete:false, omap_recovered_to:, omap_complete:false,
error:false))]) v3  885+0+0 (695790480 0 0) 0x557d4d180b40 con 0x5
57d4b1a9000
 0> 2018-06-05 15:15:46.022099 7febb2954700 -1 
/build/ceph-12.2.5/src/osd/PrimaryLogPG.cc: In function 'virtual void
PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, 
ObjectContextRef, bool , ObjectStore::Transaction*)'
thread 7febb2954700 time 2018-06-05 15:15:46.019130
/build/ceph-12.2.5/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != 
recovery_info.ss.clone_snaps.end())







-4> 2018-06-05 16:28:59.560140 7fcd7b655700  5 -- 
[2a03:25e0:254:5::113]:6829/525383 >> [2a03:25e0:254:5::12]:6809/5784710
conn(0x557447510800 :6829 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH 
pgs=13524 cs
=1 l=0). rx osd.46 seq 6 0x5574480d0d80 MOSDPGPush(1.2ca 2196813/2196812
[PushOp(1:534b0c9f:::rbd_data.0c4c14238e1f29.000bf479:1c, version: 
2195927'1249660, data_included: [], data_size: 0,
omap_header_s
ize: 0, omap_entries_size: 0, attrset_size: 1, recovery_info:
ObjectRecoveryInfo(1:534b0c9f:::rbd_data.0c4c14238e1f29.000bf479:1c@2195927'1249660,
 size: 4194304, copy_subset: [],
clone_subset: 

Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

2018-06-05 Thread Nick Fisk
 

 

From: ceph-users  On Behalf Of Paul Emmerich
Sent: 05 June 2018 17:02
To: n...@fisk.me.uk
Cc: ceph-users 
Subject: Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

 

 

 

2018-06-05 17:42 GMT+02:00 Nick Fisk <n...@fisk.me.uk>:

Hi,

After a RBD snapshot was removed, I seem to be having OSD's assert when they 
try and recover pg 1.2ca. The issue seems to follow the
PG around as OSD's fail. I've seen this bug tracker and associated mailing list 
post, but would appreciate if anyone can give any
pointers. https://tracker.ceph.com/issues/23030,

 

I've originally reported that issue. Our cluster that was seeing this somehow 
got magically better without us intervening.

We might have deleted the offending snapshot or RBD, not sure.

 

Thanks Paul. It looks like the snapshot has gone, but there are obviously some 
remains of it still around.

 

 

Paul





-- 

Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

2018-06-05 Thread Nick Fisk
Hi,

After a RBD snapshot was removed, I seem to be having OSD's assert when they 
try and recover pg 1.2ca. The issue seems to follow the
PG around as OSD's fail. I've seen this bug tracker and associated mailing list 
post, but would appreciate if anyone can give any
pointers. https://tracker.ceph.com/issues/23030,

Cluster is 12.2.5 with Filestore and 3x replica pools

I noticed after the snapshot was removed that there were 2 inconsistent PG's. I 
ran a repair on both and one of them (1.2ca) seems 
to have triggered this issue.


Snippet of log from two different OSD's. 1st one is from the original OSD 
holding the PG, 2nd is from where the OSD was marked out
and it was trying to be recovered to:


-4> 2018-06-05 15:15:45.997225 7febce7a5700  1 -- 
[2a03:25e0:254:5::112]:6819/3544315 --> [2a03:25e0:253:5::14]:0/3307345 --
osd_ping(ping_reply e2196479 stamp 2018-06-05 15:15:45.994907) v4 -- 
0x557d3d72f800 con 0
-3> 2018-06-05 15:15:46.018088 7febb2954700  1 -- 
[2a03:25e0:254:5::112]:6817/3544315 --> [2a03:25e0:254:5::12]:6809/5784710 --
MOSDPGPull(1.2ca e2196479/2196477 cost 8389608) v3 -- 0x557d4d180b40 con 0
-2> 2018-06-05 15:15:46.018412 7febce7a5700  5 -- 
[2a03:25e0:254:5::112]:6817/3544315 >> [2a03:25e0:254:5::12]:6809/5784710
conn(0x557d4b1a9000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=13470 
cs=1 l=0).
rx osd.46 seq 13 0x557d4d180b40 MOSDPGPush(1.2ca 2196479/2196477 
[PushOp(1:534b0c9f:::rbd_data.0c4c14238e1f29.000bf479:1c,
version: 2195927'1249660, data_included: [], data_size: 0, omap_header_size: 0, 
omap_ent
ries_size: 0, attrset_size: 1, recovery_info:
ObjectRecoveryInfo(1:534b0c9f:::rbd_data.0c4c14238e1f29.000bf479:1c@2195927'1249660,
 size: 0, copy_subset: [], clone_subset:
{}, snapset: 1c=[]:{}), after_progress:
ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, 
omap_recovered_to:, omap_complete:true, error:false),
before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, 
data_complete:false, omap
_recovered_to:, omap_complete:false, error:false))]) v3
-1> 2018-06-05 15:15:46.018425 7febce7a5700  1 -- 
[2a03:25e0:254:5::112]:6817/3544315 <== osd.46
[2a03:25e0:254:5::12]:6809/5784710 13  MOSDPGPush(1.2ca 2196479/2196477 
[PushOp(1:534b0c9f:::rbd_data.0c4c14238e1f
29.000bf479:1c, version: 2195927'1249660, data_included: [], data_size: 
0, omap_header_size: 0, omap_entries_size: 0,
attrset_size: 1, recovery_info: 
ObjectRecoveryInfo(1:534b0c9f:::rbd_data.0c4c14238e1f29.0
00bf479:1c@2195927'1249660, size: 0, copy_subset: [], clone_subset: {}, 
snapset: 1c=[]:{}), after_progress:
ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, 
omap_recovered_to:, omap_complete:t
rue, error:false), before_progress: ObjectRecoveryProgress(first, 
data_recovered_to:0, data_complete:false, omap_recovered_to:,
omap_complete:false, error:false))]) v3  885+0+0 (695790480 0 0) 
0x557d4d180b40 con 0x5
57d4b1a9000
 0> 2018-06-05 15:15:46.022099 7febb2954700 -1 
/build/ceph-12.2.5/src/osd/PrimaryLogPG.cc: In function 'virtual void
PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, 
ObjectContextRef, bool
, ObjectStore::Transaction*)' thread 7febb2954700 time 2018-06-05 
15:15:46.019130
/build/ceph-12.2.5/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != 
recovery_info.ss.clone_snaps.end())







-4> 2018-06-05 16:28:59.560140 7fcd7b655700  5 -- 
[2a03:25e0:254:5::113]:6829/525383 >> [2a03:25e0:254:5::12]:6809/5784710
conn(0x557447510800 :6829 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH 
pgs=13524 cs
=1 l=0). rx osd.46 seq 6 0x5574480d0d80 MOSDPGPush(1.2ca 2196813/2196812
[PushOp(1:534b0c9f:::rbd_data.0c4c14238e1f29.000bf479:1c, version: 
2195927'1249660, data_included: [], data_size: 0,
omap_header_s
ize: 0, omap_entries_size: 0, attrset_size: 1, recovery_info:
ObjectRecoveryInfo(1:534b0c9f:::rbd_data.0c4c14238e1f29.000bf479:1c@2195927'1249660,
 size: 4194304, copy_subset: [],
clone_subset: {}, snapse
t: 1c=[]:{}), after_progress: ObjectRecoveryProgress(!first, 
data_recovered_to:0, data_complete:true, omap_recovered_to:,
omap_complete:true, error:false), before_progress: 
ObjectRecoveryProgress(first, data_rec
overed_to:0, data_complete:false, omap_recovered_to:, omap_complete:false, 
error:false))]) v3
-3> 2018-06-05 16:28:59.560155 7fcd7b655700  1 -- 
[2a03:25e0:254:5::113]:6829/525383 <== osd.46
[2a03:25e0:254:5::12]:6809/5784710 6  MOSDPGPush(1.2ca 2196813/2196812 
[PushOp(1:534b0c9f:::rbd_data.0c4c14
238e1f29.000bf479:1c, version: 2195927'1249660, data_included: [], 
data_size: 0, omap_header_size: 0, omap_entries_size: 0,
attrset_size: 1, recovery_info: ObjectRecoveryInfo(1:534b0c9f:::rbd_data.0c4c14
238e1f29.000bf479:1c@2195927'1249660, size: 4194304, copy_subset: [], 
clone_subset: {}, snapset: 1c=[]:{}), after_progress:
ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, omap_re
covered_to:,

Re: [ceph-users] Intel Xeon Scalable and CPU frequency scaling on NVMe/SSD Ceph OSDs

2018-05-14 Thread Nick Fisk
Hi Wido,

Are you trying this setting?

/sys/devices/system/cpu/intel_pstate/min_perf_pct



-Original Message-
From: ceph-users  On Behalf Of Wido den
Hollander
Sent: 14 May 2018 14:14
To: n...@fisk.me.uk; 'Blair Bethwaite' 
Cc: 'ceph-users' 
Subject: Re: [ceph-users] Intel Xeon Scalable and CPU frequency scaling on
NVMe/SSD Ceph OSDs



On 05/01/2018 10:19 PM, Nick Fisk wrote:
> 4.16 required?
> https://www.phoronix.com/scan.php?page=news_item&px=Skylake-X-P-State-
> Linux-
> 4.16
> 

I've been trying with the 4.16 kernel for the last few days, but still, it's
not working.

The CPU's keep clocking down to 800Mhz

I've set scaling_min_freq=scaling_max_freq in /sys, but that doesn't change
a thing. The CPUs keep scaling down.

Still not close to the 1ms latency with these CPUs :(

Wido

> 
> -Original Message-
> From: ceph-users  On Behalf Of 
> Blair Bethwaite
> Sent: 01 May 2018 16:46
> To: Wido den Hollander 
> Cc: ceph-users ; Nick Fisk 
> 
> Subject: Re: [ceph-users] Intel Xeon Scalable and CPU frequency 
> scaling on NVMe/SSD Ceph OSDs
> 
> Also curious about this over here. We've got a rack's worth of R740XDs 
> with Xeon 4114's running RHEL 7.4 and intel-pstate isn't even active 
> on them, though I don't believe they are any different at the OS level 
> to our Broadwell nodes (where it is loaded).
> 
> Have you tried poking the kernel's pmqos interface for your use-case?
> 
> On 2 May 2018 at 01:07, Wido den Hollander  wrote:
>> Hi,
>>
>> I've been trying to get the lowest latency possible out of the new 
>> Xeon Scalable CPUs and so far I got down to 1.3ms with the help of Nick.
>>
>> However, I can't seem to pin the CPUs to always run at their maximum 
>> frequency.
>>
>> If I disable power saving in the BIOS they stay at 2.1Ghz (Silver 
>> 4110), but that disables the boost.
>>
>> With the Power Saving enabled in the BIOS and when giving the OS all 
>> control for some reason the CPUs keep scaling down.
>>
>> $ echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
>>
>> cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009 Report 
>> errors and bugs to cpuf...@vger.kernel.org, please.
>> analyzing CPU 0:
>>   driver: intel_pstate
>>   CPUs which run at the same hardware frequency: 0
>>   CPUs which need to have their frequency coordinated by software: 0
>>   maximum transition latency: 0.97 ms.
>>   hardware limits: 800 MHz - 3.00 GHz
>>   available cpufreq governors: performance, powersave
>>   current policy: frequency should be within 800 MHz and 3.00 GHz.
>>   The governor "performance" may decide which speed to
use
>>   within this range.
>>   current CPU frequency is 800 MHz.
>>
>> I do see the CPUs scale up to 2.1Ghz, but they quickly scale down 
>> again to 800Mhz and that hurts latency. (50% difference!)
>>
>> With the CPUs scaling down to 800Mhz my latency jumps from 1.3ms to 
>> 2.4ms on avg. With turbo enabled I hope to get down to 1.1~1.2ms on avg.
>>
>> $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
>> performance
>>
>> Everything seems to be OK and I would expect the CPUs to stay at 
>> 2.10Ghz, but they aren't.
>>
>> C-States are also pinned to 0 as a boot parameter for the kernel:
>>
>> processor.max_cstate=1 intel_idle.max_cstate=0
>>
>> Running Ubuntu 16.04.4 with the 4.13 kernel from the HWE from Ubuntu.
>>
>> Has anybody tried this yet with the recent Intel Xeon Scalable CPUs?
>>
>> Thanks,
>>
>> Wido
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> --
> Cheers,
> ~Blairo
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Scrubbing impacting write latency since Luminous

2018-05-10 Thread Nick Fisk
Hi All,

I've just upgraded our main cluster to Luminous and have noticed that where
before the cluster 64k write latency was always hovering around 2ms
regardless of what scrubbing was going on, since the upgrade to Luminous,
scrubbing takes the average latency up to around 5-10ms and deep scrubbing
pushes it into the 30ms region.

No other changes apart from the upgrade have taken place. Is anyone aware of
any major changes in the way scrubbing is carried out Jewel->Luminous, which
may be causing this?

Thanks,
Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore on HDD+SSD sync write latency experiences

2018-05-03 Thread Nick Fisk


Hi Dan,

Quoting Dan van der Ster :


Hi Nick,

Our latency probe results (4kB rados bench) didn't change noticeably
after converting a test cluster from FileStore (sata SSD journal) to
BlueStore (sata SSD db). Those 4kB writes take 3-4ms on average from a
random VM in our data centre. (So bluestore DB seems equivalent to
FileStore journal for small writes).

Otherwise, our other monitoring (osd log analysis) shows that the vast
majority of writes are under 32kB, and the average write size is 42kB
(with a long tail out to 4MB).

So... do you think is this a *real* issue that would impact user
observed latency, given that they are mostly doing small writes?
(maybe your environment is very different?)
I'm not saying that tuning the deferred write threshold up wouldn't
help, but it's not obvious that deferring writes is better on the
whole.


Probably for a lot of users running VM's, like you say most writes  
will be under 32kB and won't notice much difference. My workload is  
where the client is submitting largely sequential 1MB writes in sync  
(NFS), but at a fairly low queue depth. The cluster needs to ack them  
as fast as possible. So in this case writing the IO's through the NVME  
first seems to help by quite a large margin.




I'm curious what the original rationale for 32kB was?

Cheers, Dan


On Tue, May 1, 2018 at 10:50 PM, Nick Fisk  wrote:

Hi all,



Slowly getting round to migrating clusters to Bluestore but I am interested
in how people are handling the potential change in write latency coming from
Filestore? Or maybe nobody is really seeing much difference?



As we all know, in Bluestore, writes are not double written and in most
cases go straight to disk. Whilst this is awesome for people with pure SSD
or pure HDD clusters as the amount of overhead is drastically reduced, for
people with HDD+SSD journals in Filestore land, the double write had the
side effect of acting like a battery backed cache, accelerating writes when
not under saturation.



In some brief testing I am seeing Filestore OSD’s with NVME journal show an
average apply latency of around 1-2ms whereas some new Bluestore OSD’s in
the same cluster are showing 20-40ms. I am fairly certain this is due to
writes exhibiting the latency of the underlying 7.2k disk. Note, cluster is
very lightly loaded, this is not anything being driven into saturation.



I know there is a deferred write tuning knob which adjusts the cutover for
when an object is double written, but at the default of 32kb, I suspect a
lot of IO’s even in the 1MB area are still drastically slower going straight
to disk than if double written to NVME 1st. Has anybody else done any
investigation in this area? Is there any long term harm in running a cluster
deferring writes up to 1MB+ in size to mimic the Filestore double write
approach?



I also suspect after looking through github that deferred writes only happen
when overwriting an existing object or blob (not sure which case applies),
so new allocations are still written straight to disk. Can anyone confirm?



PS. If your spinning disks are connected via a RAID controller with BBWC
then you are not affected by this.



Thanks,

Nick


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore on HDD+SSD sync write latency experiences

2018-05-03 Thread Nick Fisk
Hi Nick,

On 5/1/2018 11:50 PM, Nick Fisk wrote:

Hi all,

 

Slowly getting round to migrating clusters to Bluestore but I am interested
in how people are handling the potential change in write latency coming from
Filestore? Or maybe nobody is really seeing much difference?

 

As we all know, in Bluestore, writes are not double written and in most
cases go straight to disk. Whilst this is awesome for people with pure SSD
or pure HDD clusters as the amount of overhead is drastically reduced, for
people with HDD+SSD journals in Filestore land, the double write had the
side effect of acting like a battery backed cache, accelerating writes when
not under saturation.

 

In some brief testing I am seeing Filestore OSD's with NVME journal show an
average apply latency of around 1-2ms whereas some new Bluestore OSD's in
the same cluster are showing 20-40ms. I am fairly certain this is due to
writes exhibiting the latency of the underlying 7.2k disk. Note, cluster is
very lightly loaded, this is not anything being driven into saturation.

 

I know there is a deferred write tuning knob which adjusts the cutover for
when an object is double written, but at the default of 32kb, I suspect a
lot of IO's even in the 1MB area are still drastically slower going straight
to disk than if double written to NVME 1st. Has anybody else done any
investigation in this area? Is there any long term harm in running a cluster
deferring writes up to 1MB+ in size to mimic the Filestore double write
approach?

This should work fine with low load, but be careful when the load rises:
RocksDB and the corresponding stuff around it might become a bottleneck in this
scenario.

 

Yep, this cluster has extremely low load, but client is submitting largely
sequential 1MB writes in sync (NFS). Cluster needs to ack them as fast as
possible.

 

I also suspect after looking through github that deferred writes only happen
when overwriting an existing object or blob (not sure which case applies),
so new allocations are still written straight to disk. Can anyone confirm?

"small" writes (length < min_alloc_size) are direct if they go to unused
chunk (4K or more depending on checksum settings) of an existing mutable
block and write length > bluestore_prefer_deferred_size only. 
E.g. appending with 4K data  blocks to an object at HDD will trigger
deferred mode for the first of every 16 writes (given that default
min_alloc_size for HDD is 64K). Rest 15 go direct.

"big" writes are unconditionally deferred if length <=
bluestore_prefer_deferred_size.

 

So according to defaults and assuming an RBD comprised of 4MB objects:

1. Write between 32K and 64K will go direct if written to an unused chunks

2. Write below 32K to existing and new chunk will be deferred

3. Everything above 32K will be direct

 

If I was to increase the deferred write to 1MB:

1. Everything between 64K and 1MB is deferred

2. Above 1MB is direct

3. Below 64K, still deferred for new and existing because
bluestore_prefer_deferred_size > min_alloc_size
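
If I do experiment with this, it should just be a per-OSD config change, e.g. 
using the HDD-specific variant of the knob discussed above (value purely 
illustrative, and the OSDs would need restarting for it to take effect):

[osd]
# defer (double write via the WAL device) anything up to 1MB on HDD OSDs
bluestore_prefer_deferred_size_hdd = 1048576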

 





 

PS. If your spinning disks are connected via a RAID controller with BBWC
then you are not affected by this.

 

Thanks,

Nick






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore on HDD+SSD sync write latency experiences

2018-05-03 Thread Nick Fisk
-Original Message-
From: Alex Gorbachev  
Sent: 02 May 2018 22:05
To: Nick Fisk 
Cc: ceph-users 
Subject: Re: [ceph-users] Bluestore on HDD+SSD sync write latency experiences

Hi Nick,

On Tue, May 1, 2018 at 4:50 PM, Nick Fisk  wrote:
> Hi all,
>
>
>
> Slowly getting round to migrating clusters to Bluestore but I am 
> interested in how people are handling the potential change in write 
> latency coming from Filestore? Or maybe nobody is really seeing much 
> difference?
>
>
>
> As we all know, in Bluestore, writes are not double written and in 
> most cases go straight to disk. Whilst this is awesome for people with 
> pure SSD or pure HDD clusters as the amount of overhead is drastically 
> reduced, for people with HDD+SSD journals in Filestore land, the 
> double write had the side effect of acting like a battery backed 
> cache, accelerating writes when not under saturation.
>
>
>
> In some brief testing I am seeing Filestore OSD’s with NVME journal 
> show an average apply latency of around 1-2ms whereas some new 
> Bluestore OSD’s in the same cluster are showing 20-40ms. I am fairly 
> certain this is due to writes exhibiting the latency of the underlying 
> 7.2k disk. Note, cluster is very lightly loaded, this is not anything being 
> driven into saturation.
>
>
>
> I know there is a deferred write tuning knob which adjusts the cutover 
> for when an object is double written, but at the default of 32kb, I 
> suspect a lot of IO’s even in the 1MB area are still drastically 
> slower going straight to disk than if double written to NVME 1st. Has 
> anybody else done any investigation in this area? Is there any long 
> term harm in running a cluster deferring writes up to 1MB+ in size to 
> mimic the Filestore double write approach?
>
>
>
> I also suspect after looking through github that deferred writes only 
> happen when overwriting an existing object or blob (not sure which 
> case applies), so new allocations are still written straight to disk. Can 
> anyone confirm?
>
>
>
> PS. If your spinning disks are connected via a RAID controller with 
> BBWC then you are not affected by this.

We saw this behavior even on Areca 1883, which does buffer HDD writes.
The way out was to put WAL and DB on NVMe drives and that solved performance 
problems.

Just to confirm, our problem is not poor performance of the RocksDB when 
running on HDD, but the direct write to disk of data. Or have I misunderstood 
your comment?

--
Alex Gorbachev
Storcium

>
>
>
> Thanks,
>
> Nick
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bluestore on HDD+SSD sync write latency experiences

2018-05-01 Thread Nick Fisk
Hi all,

 

Slowly getting round to migrating clusters to Bluestore but I am interested
in how people are handling the potential change in write latency coming from
Filestore? Or maybe nobody is really seeing much difference?

 

As we all know, in Bluestore, writes are not double written and in most
cases go straight to disk. Whilst this is awesome for people with pure SSD
or pure HDD clusters as the amount of overhead is drastically reduced, for
people with HDD+SSD journals in Filestore land, the double write had the
side effect of acting like a battery backed cache, accelerating writes when
not under saturation.

 

In some brief testing I am seeing Filestore OSD's with NVME journal show an
average apply latency of around 1-2ms whereas some new Bluestore OSD's in
the same cluster are showing 20-40ms. I am fairly certain this is due to
writes exhibiting the latency of the underlying 7.2k disk. Note, cluster is
very lightly loaded, this is not anything being driven into saturation.

 

I know there is a deferred write tuning knob which adjusts the cutover for
when an object is double written, but at the default of 32kb, I suspect a
lot of IO's even in the 1MB area are still drastically slower going straight
to disk than if double written to NVME 1st. Has anybody else done any
investigation in this area? Is there any long term harm in running a cluster
deferring writes up to 1MB+ in size to mimic the Filestore double write
approach?

 

I also suspect after looking through github that deferred writes only happen
when overwriting an existing object or blob (not sure which case applies),
so new allocations are still written straight to disk. Can anyone confirm?

 

PS. If your spinning disks are connected via a RAID controller with BBWC
then you are not affected by this.

 

Thanks,

Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel Xeon Scalable and CPU frequency scaling on NVMe/SSD Ceph OSDs

2018-05-01 Thread Nick Fisk
4.16 required?
https://www.phoronix.com/scan.php?page=news_item&px=Skylake-X-P-State-Linux-
4.16


-Original Message-
From: ceph-users  On Behalf Of Blair
Bethwaite
Sent: 01 May 2018 16:46
To: Wido den Hollander 
Cc: ceph-users ; Nick Fisk 
Subject: Re: [ceph-users] Intel Xeon Scalable and CPU frequency scaling on
NVMe/SSD Ceph OSDs

Also curious about this over here. We've got a rack's worth of R740XDs with
Xeon 4114's running RHEL 7.4 and intel-pstate isn't even active on them,
though I don't believe they are any different at the OS level to our
Broadwell nodes (where it is loaded).

Have you tried poking the kernel's pmqos interface for your use-case?

On 2 May 2018 at 01:07, Wido den Hollander  wrote:
> Hi,
>
> I've been trying to get the lowest latency possible out of the new 
> Xeon Scalable CPUs and so far I got down to 1.3ms with the help of Nick.
>
> However, I can't seem to pin the CPUs to always run at their maximum 
> frequency.
>
> If I disable power saving in the BIOS they stay at 2.1Ghz (Silver 
> 4110), but that disables the boost.
>
> With the Power Saving enabled in the BIOS and when giving the OS all 
> control for some reason the CPUs keep scaling down.
>
> $ echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
>
> cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009 Report 
> errors and bugs to cpuf...@vger.kernel.org, please.
> analyzing CPU 0:
>   driver: intel_pstate
>   CPUs which run at the same hardware frequency: 0
>   CPUs which need to have their frequency coordinated by software: 0
>   maximum transition latency: 0.97 ms.
>   hardware limits: 800 MHz - 3.00 GHz
>   available cpufreq governors: performance, powersave
>   current policy: frequency should be within 800 MHz and 3.00 GHz.
>   The governor "performance" may decide which speed to use
>   within this range.
>   current CPU frequency is 800 MHz.
>
> I do see the CPUs scale up to 2.1Ghz, but they quickly scale down 
> again to 800Mhz and that hurts latency. (50% difference!)
>
> With the CPUs scaling down to 800Mhz my latency jumps from 1.3ms to 
> 2.4ms on avg. With turbo enabled I hope to get down to 1.1~1.2ms on avg.
>
> $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
> performance
>
> Everything seems to be OK and I would expect the CPUs to stay at 
> 2.10Ghz, but they aren't.
>
> C-States are also pinned to 0 as a boot parameter for the kernel:
>
> processor.max_cstate=1 intel_idle.max_cstate=0
>
> Running Ubuntu 16.04.4 with the 4.13 kernel from the HWE from Ubuntu.
>
> Has anybody tried this yet with the recent Intel Xeon Scalable CPUs?
>
> Thanks,
>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs down after adding 260 OSDs & increasing PGs

2018-01-29 Thread Nick Fisk
Hi Jake,

I suspect you have hit an issue that a few others and I have hit in
Luminous. By increasing the number of PG's before all the data had
re-balanced, you have probably exceeded the hard PG-per-OSD limit.

See this thread
https://www.spinics.net/lists/ceph-users/msg41231.html
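
As an interim measure the limits can be loosened while things re-balance, e.g. 
(Luminous option names; the values are only illustrative, and these normally go 
in ceph.conf and need the daemons restarting, or can be attempted with 
injectargs):

[global]
mon_max_pg_per_osd = 400
osd_max_pg_per_osd_hard_ratio = 4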

Nick

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Jake Grimmett
> Sent: 29 January 2018 12:46
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] pgs down after adding 260 OSDs & increasing PGs
> 
> Dear All,
> 
> Our ceph luminous (12.2.2) cluster has just broken, due to either adding
> 260 OSDs drives in one go, or to increasing the PG number from 1024 to
> 4096 in one go, or a combination of both...
> 
> Prior to the upgrade, the cluster consisted of 10 dual v4 Xeon nodes
running
> SL7.4, each node had 19 bluestore OSDs (8TB Seagate Ironwolf) & 64GB ram.
> 
> The cluster has just two pools;
> 1) 1024 pg/pgs 8+2 EC pool on 190 x hdd.
> 2) 4 nodes have 1 NVMe SSD's used for a 3x replicated MDS pool.
> 
> Cluster provides 500TB CephFS used for scratch space, four snapshots taken
> daily and kept for one week only.
> 
> Everything was working perfectly, until 26 OSD's were added to each node,
> bringing the total hdd OSD count to 450. (all 8TB Ironwolf)
> 
> After adding all 260 OSD's with ceph-deploy, ceph health shows
> 
> HEALTH_WARN noout flag(s) set;
> 732950716/1219068139 objects misplaced (60.124%); Degraded data
> redundancy: 1024 pgs unclean; too few PGs per OSD (23 < min 30)
> 
> So far so good, I'd expected to see the cluster rebalancing, the complaint
about
> too few pgs per OSD seemed reasonable.
> 
> Without waiting for the cluster to rebalance, I increased the pg/pgs to
4096. At
> this point, ceph health showed this:
> 
> HEALTH_ERR 135858073/1219068139 objects misplaced (11.144%); Reduced
> data availability: 3119 pgs inactive; Degraded data redundancy:
> 210609/1219068139 objects degraded (0.017%),
> 4088 pgs unclean, 1002 pgs degraded,
> 1002 pgs undersized; 5 stuck requests are blocked > 4096 sec
> 
> We then left the cluster to rebalance.
> 
> Next morning, two ceph nodes were down, and I could see lots of oom-killer
> messages in the logs.
> Each node only has 64GB for 45 OSD's which is probably the cause of this.
> 
> as a short term fix, we limited RAM usage by adding this to ceph.conf
> bluestore_cache_size = 104857600 bluestore_cache_kv_max = 67108864
> 
> This appears to stop the oom problems, so we waited while the cluster
> rebalanced, until it stopped reporting "objects misplaced".
> This took a couple of days...
> 
> The problem now, is that although all of the OSD's are up, lots of pgs are
down,
> degraded, unclean, and it is not clear how to fix this.
> 
> I have tried issuing osd scrub, and pg repair commands, but these do not
appear
> to do anything.
> 
> cephfs will mount, but then locks up when it hits a pg that is down.
> 
> I have tried sequentially restarting all OSD's on each node, slowly
walking
> through the cluster several times, but this does not fix things.
> 
> Current Status:
> # ceph health
> HEALTH_ERR Reduced data availability:
> 3021 pgs inactive, 23 pgs down, 23 pgs stale; Degraded data redundancy:
3021
> pgs unclean, 1879 pgs degraded,
> 1879 pgs undersized; 1 stuck requests are blocked > 4096 sec
> 
> ceph health detail (see http://p.ip.fi/Pwdb ) contains many lines such as:
> 
> pg 4.ffe is stuck unclean for 470551.768849, current state
> activating+remapped, last acting [156,175,33,169,135,85,165,55,148,178]
> 
> pg 4.fff is stuck undersized for 49509.580577, current state
> activating+undersized+degraded+remapped, last acting
> [44,12,185,125,69,29,119,102,81,2147483647]
> 
> (Presumably the OSD number "2147483647" is due to Erasure Encoding,
> as per
>  May/001660.html>
> ?)
> 
> Tailing the stuck osd log with debug osd = 20 shows this:
> 
> 2018-01-29 11:56:35.204391 7f0dab4fd700 20 osd.46 15482 share_map_peer
> 0x5647cb336800 already has epoch 15482
> 2018-01-29 11:56:35.213226 7f0da7537700 10 osd.46 15482
> tick_without_osd_lock
> 2018-01-29 11:56:35.213252 7f0da7537700 20 osd.46 15482
> scrub_random_backoff lost coin flip, randomly backing off
> 2018-01-29 11:56:35.213257 7f0da7537700 10 osd.46 15482
> promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0
> bytes; target 25 obj/sec or 5120 k bytes/sec
> 2018-01-29 11:56:35.213263 7f0da7537700 20 osd.46 15482
> promote_throttle_recalibrate  new_prob 1000
> 2018-01-29 11:56:35.213266 7f0da7537700 10 osd.46 15482
> promote_throttle_recalibrate  actual 0, actual/prob ratio 1, adjusted
> new_prob 1000, prob 1000 -> 1000
> 2018-01-29 11:56:35.232884 7f0dab4fd700 20 osd.46 15482 share_map_peer
> 0x5647cabf3800 already has epoch 15482
> 
> Currently this cluster is just storing scratch data, so could be wiped,
> however we would be more confident about using ceph widely if we can fix
> er

Re: [ceph-users] BlueStore.cc: 9363: FAILED assert(0 == "unexpected error")

2018-01-26 Thread Nick Fisk
I can see this in the logs:

 

2018-01-25 06:05:56.292124 7f37fa6ea700 -1 log_channel(cluster) log [ERR] : 
full status failsafe engaged, dropping updates, now 101% full

2018-01-25 06:05:56.325404 7f3803f9c700 -1 bluestore(/var/lib/ceph/osd/ceph-9) 
_do_alloc_write failed to reserve 0x4000

2018-01-25 06:05:56.325434 7f3803f9c700 -1 bluestore(/var/lib/ceph/osd/ceph-9) 
_do_write _do_alloc_write failed with (28) No space left on device

2018-01-25 06:05:56.325462 7f3803f9c700 -1 bluestore(/var/lib/ceph/osd/ceph-9) 
_txc_add_transaction error (28) No space left on device not handled on 
operation 10 (op 0, counting from 0)

 

Are they out of space, or is something mis-reporting?
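
Worth checking what the cluster itself reports, e.g.:

ceph df
ceph osd df tree            # per-OSD utilisation and variance
ceph osd dump | grep ratio  # full / backfillfull / nearfull ratios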

 

Nick

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of David 
Turner
Sent: 26 January 2018 13:03
To: ceph-users 
Subject: [ceph-users] BlueStore.cc: 9363: FAILED assert(0 == "unexpected error")

 

http://tracker.ceph.com/issues/22796

 

I was curious if anyone here had any ideas or experience with this problem.  I 
created the tracker for this yesterday when I woke up to find all 3 of my SSD 
OSDs not running and unable to start due to this segfault.  These OSDs are in 
my small home cluster and hold the cephfs_cache and cephfs_metadata pools.

 

To recap, I upgraded from 10.2.10 to 12.2.2, successfully swapped out my 9 OSDs 
to Bluestore, reconfigured my crush rules to utilize OSD classes, failed to 
remove the CephFS cache tier due to http://tracker.ceph.com/issues/22754, 
created these 3 SSD OSDs and updated the cephfs_cache and cephfs_metadata pools 
to use the replicated_ssd crush rule... fast forward 2 days of this working 
great to me waking up with all 3 of them crashed and unable to start.  There is 
an OSD log with debug bluestore = 5 attached to the tracker at the top of the 
email.

 

My CephFS is completely down while these 2 pools are inaccessible.  The OSDs 
themselves are in-tact if I need to move the data out manually to the HDDs or 
something.  Any help is appreciated.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD servers swapping despite having free memory capacity

2018-01-24 Thread Nick Fisk
I know this may be a bit vague, but it also suggests the "try a newer kernel" 
approach. We had constant problems with hosts mounting a number of RBD volumes 
formatted with XFS. The servers would start aggressively swapping even though 
the actual memory in use was nowhere near even 50% and eventually processes 
started dying/hanging (Not OOM though). I couldn't quite put my finger on what 
was actually using the memory, but it looked almost like the page cache was not 
releasing memory when requested.

This was happening on the  4.10 kernel, updated to 4.14 and the problem 
completely disappeared.

I've attached a graph (if it gets through) showing the memory change between 
4.10 and 4.14 on the 22nd Nov

Nick


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Warren Wang
> Sent: 24 January 2018 17:54
> To: Blair Bethwaite 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] OSD servers swapping despite having free memory
> capacity
> 
> Forgot to mention another hint. If kswapd is constantly using CPU, and your 
> sar -
> r ALL and sar -B stats look like it's thrashing, kswapd is probably busy 
> evicting
> things from memory in order to make a larger order allocation.
> 
> The other thing I can think of is if you have OSDs locking up and getting
> corrupted, there is a severe XFS bug where the kernel will throw a NULL 
> pointer
> dereference under heavy memory pressure. Again, it's due to memory issues,
> but you will see the message in your kernel logs. It's fixed in upstream 
> kernels as
> of this month. I forget what version exactly. 4.4.0-102?
> https://launchpad.net/bugs/1729256
> 
> Warren Wang
> 
> On 1/23/18, 11:01 PM, "Blair Bethwaite"  wrote:
> 
> +1 to Warren's advice on checking for memory fragmentation. Are you
> seeing kmem allocation failures in dmesg on these hosts?
> 
> On 24 January 2018 at 10:44, Warren Wang 
> wrote:
> > Check /proc/buddyinfo for memory fragmentation. We have some pretty
> severe memory frag issues with Ceph to the point where we keep excessive
> min_free_kbytes configured (8GB), and are starting to order more memory than
> we actually need. If you have a lot of objects, you may find that you need to
> increase vfs_cache_pressure as well, to something like the default of 100.
> >
> > In your buddyinfo, the columns represent the quantity of each page size
> available. So if you only see numbers in the first 2 columns, you only have 
> 4K and
> 8K pages available, and will fail any allocations larger than that. The 
> problem is
> so severe for us that we have stopped using jumbo frames due to dropped
> packets as a result of not being able to DMA map pages that will fit 9K 
> frames.
> >
> > In short, you might have enough memory, but not contiguous. It's even
> worse on RGW nodes.
> >
> > Warren Wang
> >
> > On 1/23/18, 2:56 PM, "ceph-users on behalf of Samuel Taylor Liston" 
>  users-boun...@lists.ceph.com on behalf of sam.lis...@utah.edu> wrote:
> >
> > We have a 9 - node (16 - 8TB OSDs per node) running jewel on centos 
> 7.4.
> The OSDs are configured with encryption.  The cluster is accessed via two -
> RGWs  and there are 3 - mon servers.  The data pool is using 6+3 erasure 
> coding.
> >
> > About 2 weeks ago I found two of the nine servers wedged and had to
> hard power cycle them to get them back.  In this hard reboot 22 - OSDs came
> back with either a corrupted encryption or data partitions.  These OSDs were
> removed and recreated, and the resultant rebalance moved along just fine for
> about a week.  At the end of that week two different nodes were unresponsive
> complaining of page allocation failures.  This is when I realized the nodes 
> were
> heavy into swap.  These nodes were configured with 64GB of RAM as a cost
> saving going against the 1GB per 1TB recommendation.  We have since then
> doubled the RAM in each of the nodes giving each of them more than the 1GB
> per 1TB ratio.
> >
> > The issue I am running into is that these nodes are still swapping; 
> a lot,
> and over time becoming unresponsive, or throwing page allocation failures.  As
> an example, “free” will show 15GB of RAM usage (out of 128GB) and 32GB of
> swap.  I have configured swappiness to 0 and and also turned up the
> vm.min_free_kbytes to 4GB to try to keep the kernel happy, and yet I am still
> filling up swap.  It only occurs when the OSDs have mounted partitions and 
> ceph-
> osd daemons active.
> >
> > Anyone have an idea where this swap usage might be coming from?
> > Thanks for any insight,
> >
> > Sam Liston (sam.lis...@utah.edu)
> > 
> > Center for High Performance Computing
> > 155 S. 1452 E. Rm 405
> > Salt Lake City, Utah 84112 (801)232-6932
> > 
> >
> >
> >
> > 
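
For anyone hitting the same symptoms, the checks and tunings discussed above boil down 
to something like this (a sketch; the 8GB min_free_kbytes figure is simply the value 
Warren quotes, size it for your own hosts and remember it is RAM you give up):

cat /proc/buddyinfo                      # non-zero counts only in the first couple of columns = badly fragmented
sar -B 1 5                               # page scan/steal rates, shows kswapd working hard
sysctl -w vm.min_free_kbytes=8388608     # keep ~8GB free for higher-order allocations
sysctl -w vm.vfs_cache_pressure=100      # default reclaim pressure on dentry/inode caches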

Re: [ceph-users] What is the should be the expected latency of 10Gbit network connections

2018-01-22 Thread Nick Fisk
Anyone with 25G ethernet willing to do the test? Would love to see what the 
latency figures are for that.
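
For anyone wanting to repeat it, the test being compared here is a plain sockperf 
ping-pong run, roughly as follows (a sketch; the address and port match the client 
invocation visible at the end of Marc's output below):

sockperf server -i 192.168.0.12 -p 5001             # on the receiving host
sockperf ping-pong -i 192.168.0.12 -p 5001 -t 10    # on the sending host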

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Maged 
Mokhtar
Sent: 22 January 2018 11:28
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] What is the should be the expected latency of 10Gbit 
network connections

 

On 2018-01-22 08:39, Wido den Hollander wrote:



On 01/20/2018 02:02 PM, Marc Roos wrote: 

  If I test my connections with sockperf via a 1Gbit switch I get around
25usec; when I test the 10Gbit connection via the switch I have around
12usec. Is that normal? Or should there be a difference of 10x?


No, that's normal.

Tests with 8k ping packets over different links I did:

1GbE:  0.800ms
10GbE: 0.200ms
40GbE: 0.150ms

Wido




sockperf ping-pong

sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=10.100 sec; SentMessages=432875;
ReceivedMessages=432874
sockperf: = Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=10.000 sec; SentMessages=428640;
ReceivedMessages=428640
sockperf: > avg-lat= 11.609 (std-dev=1.684)
sockperf: # dropped messages = 0; # duplicated messages = 0; #
out-of-order messages = 0
sockperf: Summary: Latency is 11.609 usec
sockperf: Total 428640 observations; each percentile contains 4286.40
observations
sockperf: ---> <MAX> observation =  856.944
sockperf: ---> percentile  99.99 =   39.789
sockperf: ---> percentile  99.90 =   20.550
sockperf: ---> percentile  99.50 =   17.094
sockperf: ---> percentile  99.00 =   15.578
sockperf: ---> percentile  95.00 =   12.838
sockperf: ---> percentile  90.00 =   12.299
sockperf: ---> percentile  75.00 =   11.844
sockperf: ---> percentile  50.00 =   11.409
sockperf: ---> percentile  25.00 =   11.124
sockperf: ---> <MIN> observation =8.888

sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.100 sec; SentMessages=22065;
ReceivedMessages=22064
sockperf: = Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=1.000 sec; SentMessages=20056;
ReceivedMessages=20056
sockperf: > avg-lat= 24.861 (std-dev=1.774)
sockperf: # dropped messages = 0; # duplicated messages = 0; #
out-of-order messages = 0
sockperf: Summary: Latency is 24.861 usec
sockperf: Total 20056 observations; each percentile contains 200.56
observations
sockperf: ---> <MAX> observation =   77.158
sockperf: ---> percentile  99.99 =   54.285
sockperf: ---> percentile  99.90 =   37.864
sockperf: ---> percentile  99.50 =   34.406
sockperf: ---> percentile  99.00 =   33.337
sockperf: ---> percentile  95.00 =   27.497
sockperf: ---> percentile  90.00 =   26.072
sockperf: ---> percentile  75.00 =   24.618
sockperf: ---> percentile  50.00 =   24.443
sockperf: ---> percentile  25.00 =   24.361
sockperf: ---> <MIN> observation =   16.746
[root@c01 sbin]# sockperf ping-pong -i 192.168.0.12 -p 5001 -t 10
sockperf: == version #2.6 ==
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on
socket(s)








___
ceph-users mailing list
ceph-users@lists.ceph.com  
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com  
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

I find the ping command with the flood option handy for measuring latency; it gives 
min/max/average/std deviation stats.

example:

ping  -c 10 -f 10.0.1.12

Maged

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ubuntu 17.10 or Debian 9.3 + Luminous = random OS hang ?

2018-01-21 Thread Nick Fisk
How up to date is your VM environment? We saw something very similar last year 
with Linux VM’s running newish kernels. It turns out newer kernels supported a 
new feature of the vmxnet3 adapters which had a bug in ESXi. The fix was 
released some time last year in ESXi 6.5 U1, or a workaround was to set an option 
in the VM config.

 

https://kb.vmware.com/s/article/2151480

 

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Youzhong Yang
Sent: 21 January 2018 19:50
To: Brad Hubbard 
Cc: ceph-users 
Subject: Re: [ceph-users] Ubuntu 17.10 or Debian 9.3 + Luminous = random OS 
hang ?

 

As someone suggested, I installed linux-generic-hwe-16.04 package on Ubuntu 
16.04 to get kernel of 17.10, and then rebooted all VMs, here is what I 
observed:

- ceph monitor node froze upon reboot, in another case froze after a few 
minutes 

- ceph OSD hosts easily froze

- ceph admin node (which runs no ceph service but ceph-deploy) never freezes

- ceph rgw nodes and ceph mgr so far so good

 

Here are two images I captured:

 

https://drive.google.com/file/d/11hMJwhCF6Tj8LD3nlpokG0CB_oZqI506/view?usp=sharing

https://drive.google.com/file/d/1tzDQ3DYTnfDHh_hTQb0ISZZ4WZdRxHLv/view?usp=sharing

 

Thanks.

 

On Sat, Jan 20, 2018 at 7:03 PM, Brad Hubbard <bhubb...@redhat.com> wrote:

On Fri, Jan 19, 2018 at 11:54 PM, Youzhong Yang <youzh...@gmail.com> wrote:
> I don't think it's hardware issue. All the hosts are VMs. By the way, using
> the same set of VMWare hypervisors, I switched back to Ubuntu 16.04 last
> night, so far so good, no freeze.

Too little information to make any sort of assessment I'm afraid but,
at this stage, this doesn't sound like a ceph issue.


>
> On Fri, Jan 19, 2018 at 8:50 AM, Daniel Baumann   >
> wrote:
>>
>> Hi,
>>
>> On 01/19/18 14:46, Youzhong Yang wrote:
>> > Just wondering if anyone has seen the same issue, or it's just me.
>>
>> we're using debian with our own backported kernels and ceph, works rock
>> solid.
>>
>> what you're describing sounds more like hardware issues to me. if you
>> don't fully "trust"/have confidence in your hardware (and your logs
>> don't reveal anything), I'd recommend running some burn-in tests
>> (memtest, cpuburn, etc.) on them for 24 hours/machine to rule out
>> cpu/ram/etc. issues.
>>
>> Regards,
>> Daniel
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com  
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com  
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>




--
Cheers,
Brad

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster crash - FAILED assert(interval.last > last)

2018-01-11 Thread Nick Fisk
I take my hat off to you, well done for solving that!!!

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Zdenek Janda
> Sent: 11 January 2018 13:01
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Cluster crash - FAILED assert(interval.last > last)
> 
> Hi,
> we have restored damaged ODS not starting after bug caused by this issue,
> detailed steps are for reference at
> http://tracker.ceph.com/issues/21142#note-9 , should anybody hit into this 
> this
> should fix it for you.
> Thanks
> Zdenek Janda
> 
> 
> 
> 
> On 11.1.2018 11:40, Zdenek Janda wrote:
> > Hi,
> > I have succeeded in identifying faulty PG:
> >
> >  -3450> 2018-01-11 11:32:20.015658 7f066e2a3e00 10 osd.15 15340 12.62d
> > needs 13939-15333  -3449> 2018-01-11 11:32:20.019405 7f066e2a3e00  1
> > osd.15 15340 build_past_intervals_parallel over 13939-15333  -3448>
> > 2018-01-11 11:32:20.019436 7f066e2a3e00 10 osd.15 15340
> > build_past_intervals_parallel epoch 13939  -3447> 2018-01-11
> > 11:32:20.019447 7f066e2a3e00 20 osd.15 0 get_map
> > 13939 - loading and decoding 0x55d39deefb80  -3446> 2018-01-11
> > 11:32:20.249771 7f066e2a3e00 10 osd.15 0 add_map_bl
> > 13939 27475 bytes
> >  -3445> 2018-01-11 11:32:20.250392 7f066e2a3e00 10 osd.15 15340
> > build_past_intervals_parallel epoch 13939 pg 12.62d first map, acting
> > [21,9] up [21,9], same_interval_since = 13939  -3444> 2018-01-11
> > 11:32:20.250505 7f066e2a3e00 10 osd.15 15340
> > build_past_intervals_parallel epoch 13940  -3443> 2018-01-11
> > 11:32:20.250529 7f066e2a3e00 20 osd.15 0 get_map
> > 13940 - loading and decoding 0x55d39deef800  -3442> 2018-01-11
> > 11:32:20.251883 7f066e2a3e00 10 osd.15 0 add_map_bl
> > 13940 27475 bytes
> > 
> > -3> 2018-01-11 11:32:26.973843 7f066e2a3e00 10 osd.15 15340
> > build_past_intervals_parallel epoch 15087
> > -2> 2018-01-11 11:32:26.973999 7f066e2a3e00 20 osd.15 0 get_map
> > 15087 - loading and decoding 0x55d3f9e7e700
> > -1> 2018-01-11 11:32:26.984286 7f066e2a3e00 10 osd.15 0 add_map_bl
> > 15087 11409 bytes
> >  0> 2018-01-11 11:32:26.990595 7f066e2a3e00 -1
> > /build/ceph-12.2.1/src/osd/osd_types.cc: In function 'virtual void
> > pi_compact_rep::add_interval(bool, const PastIntervals::pg_interval_t&)'
> > thread 7f066e2a3e00 time 2018-01-11 11:32:26.984716
> > /build/ceph-12.2.1/src/osd/osd_types.cc: 3205: FAILED
> > assert(interval.last > last)
> >
> > Lets see what can be done about this PG.
> >
> > Thanks
> > Zdenek Janda
> >
> >
> > On 11.1.2018 11:20, Zdenek Janda wrote:
> >> Hi,
> >>
> >> updated the issue at http://tracker.ceph.com/issues/21142#note-5 with
> >> last 1 lines of strace before ABRT. Crash ends with:
> >>
> >>  0.002429 pread64(22,
> >>
> "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215
> >> {\354:\0\0"...,
> >> 12288, 908492996608) = 12288
> >>  0.007869 pread64(22,
> >>
> "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215
> >> {\355:\0\0"...,
> >> 12288, 908493324288) = 12288
> >>  0.004220 pread64(22,
> >>
> "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215
> >> {\356:\0\0"...,
> >> 12288, 908499615744) = 12288
> >>  0.009143 pread64(22,
> >>
> "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215
> >> {\357:\0\0"...,
> >> 12288, 908500926464) = 12288
> >>  0.010802 write(2, "/build/ceph-12.2.1/src/osd/osd_t"...,
> >> 275/build/ceph-12.2.1/src/osd/osd_types.cc: In function 'virtual void
> >> pi_compact_rep::add_interval(bool, const PastIntervals::pg_interval_t&)'
> >> thread 7fb85e234e00 time 2018-01-11 11:02:54.783628
> >> /build/ceph-12.2.1/src/osd/osd_types.cc: 3205: FAILED
> >> assert(interval.last > last)
> >>
> >> Any suggestions are welcome, need to understand mechanism why this
> >> happened
> >>
> >> Thanks
> >> Zdenek Janda
> >>
> >>
> >> On 11.1.2018 10:48, Josef Zelenka wrote:
> >>> I have posted logs/strace from our osds with details to a ticket in
> >>> the ceph bug tracker - see here
> >>> http://tracker.ceph.com/issues/21142. You can see where exactly the
> >>> OSDs crash etc, this can be of help if someone decides to debug it.
> >>>
> >>> JZ
> >>>
> >>>
> >>> On 10/01/18 22:05, Josef Zelenka wrote:
> 
>  Hi, today we had a disasterous crash - we are running a 3 node, 24
>  osd in total cluster (8 each) with SSDs for blockdb, HDD for
>  bluestore data. This cluster is used as a radosgw backend, for
>  storing a big number of thumbnails for a file hosting site - around
>  110m files in total. We were adding an interface to the nodes which
>  required a restart, but after restarting one of the nodes, a lot of
>  the OSDs were kicked out of the cluster and rgw stopped working. We
>  have a lot of pgs down and unfound atm. OSDs can't be started(aside
>  from some, that's a mystery) with this error -  FAILED assert (
>  interval.last >
>  last) - they just periodically res

[ceph-users] Linux Meltdown (KPTI) fix and how it affects performance?

2018-01-04 Thread Nick Fisk
Hi All,

As the KPTI fix largely only affects the performance where there are a large
number of syscalls made, which Ceph does a lot of, I was wondering if
anybody has had a chance to perform any initial tests. I suspect small write
latencies will be the worst affected?

Although I'm thinking the backend Ceph OSD's shouldn't really be at risk
from these vulnerabilities, due to them not being directly user facing, and so
could have this workaround disabled?
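
For anyone testing the comparison, a rough way to check whether KPTI is active, and to 
disable it on an OSD-only host, is below (a sketch; disabling it removes the Meltdown 
mitigation, so only consider it for hosts you treat as non user-facing):

dmesg | grep -i isolation    # "Kernel/User page tables isolation: enabled" when KPTI is on
# to disable, boot with the kernel parameter pti=off (or nopti on some kernels) and reboot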

Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache tiering on Erasure coded pools

2017-12-27 Thread Nick Fisk
Also carefully read the word of caution section on David's link (which is 
absent in the jewel version of the docs), a cache tier in front of an erasure 
coded data pool for RBD is almost always a bad idea.

 

 

I would say that statement is incorrect if using Bluestore: with Bluestore, small 
writes are supported on erasure coded pools, so that “always a bad 
idea” should be read as “can be a bad idea”.
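
For completeness, the Luminous route without a cache tier looks roughly like this 
(a sketch; pool names and PG counts are placeholders, and allow_ec_overwrites requires 
the EC pool to sit entirely on Bluestore OSDs):

ceph osd pool create rbdmeta 64 64 replicated
ceph osd pool create rbddata 64 64 erasure
ceph osd pool set rbddata allow_ec_overwrites true
rbd create rbdmeta/myimage --size 100G --data-pool rbddata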

 

Nick

 

 

Caspar




Met vriendelijke groet,

Caspar Smit
Systemengineer
SuperNAS
Dorsvlegelstraat 13
1445 PA Purmerend

t: (+31) 299 410 414
e: caspars...@supernas.eu  
w: www.supernas.eu  

 

2017-12-26 23:12 GMT+01:00 David Turner <drakonst...@gmail.com>:

Please use the version of the docs for your installed version of ceph.  Now the 
Jewel in your URL and the Luminous in mine.  In Luminous you no longer need a 
cache tier to use EC with RBDs.

http://docs.ceph.com/docs/luminous/rados/operations/cache-tiering/

 

On Tue, Dec 26, 2017, 4:21 PM Karun Josy <karunjo...@gmail.com> wrote:

Hi,

 

We are using Erasure coded pools in a ceph cluster for RBD images.

Ceph version is 12.2.2 Luminous.

 

-

http://docs.ceph.com/docs/jewel/rados/operations/cache-tiering/

-

 

Here it says we can use a Cache tiering infront of ec pools.

To use erasure code with RBD we  have a replicated pool to store metadata and  
ecpool as data pool .

 

Is it possible to setup cache tiering since there is already a replicated pool 
that is being used ?

 

 

 

 

 

 

 

 




Karun Josy

___
ceph-users mailing list
ceph-users@lists.ceph.com  
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com  
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore Compression not inheriting pool option

2017-12-13 Thread Nick Fisk
Thanks for confirming, logged
http://tracker.ceph.com/issues/22419


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Stefan Kooman
> Sent: 12 December 2017 20:35
> To: Nick Fisk 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Bluestore Compression not inheriting pool option
> 
> Quoting Nick Fisk (n...@fisk.me.uk):
> > Hi All,
> >
> > Has anyone been testing the bluestore pool compression option?
> >
> > I have set compression=snappy on a RBD pool. When I add a new
> > bluestore OSD, data is not being compressed when backfilling,
> > confirmed by looking at the perf dump results. If I then set again the
> > compression type on the pool to snappy, then immediately data starts
> > getting compressed. It seems like when a new OSD joins the cluster, it
> > doesn't pick up the existing compression setting on the pool.
> >
> > Anyone seeing anything similar? I will raise a bug if anyone can
confirm.
> 
> Yes. I tried to reproduce your issue and I'm seeing the same thing. The
things
> I did to reproduce:
> 
> - check for compressed objects beforehand on osd (no compressed objects
>   where there)
> 
> - remove one of the osds in the cluster
> 
> - ceph osd pool set CEPH-TEST-ONE compression_algorithm snappy
> - ceph osd pool set CEPH-TEST-ONE compression_mode force
> -  rbd clone a rbd image
> - let cluster heal again
> - check for compressed bluestore objects on "new" osd (ceph daemon osd.0
> perf dump | grep blue):
> 
> "bluestore_compressed": 0,
> "bluestore_compressed_allocated": 0,
> "bluestore_compressed_original": 0
> 
> - check for compressed bluestore objects on already existing osd (ceph
> daemon
> osd.1 perf dump | grep blue):
> 
> "bluestore_compressed": 2991873,
> "bluestore_compressed_allocated": 3637248,
> "bluestore_compressed_original": 10895360,
> 
> Gr. Stefan
> 
> --
> | BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Odd object blocking IO on PG

2017-12-13 Thread Nick Fisk
Boom!! Fixed it. Not sure if the behavior I stumbled across is correct, but this 
has the potential to break a few things for people moving from Jewel to Luminous 
if they had a few too many PG's.

 

Firstly, how I stumbled across it. I whacked the logging up to max on OSD 68 
and saw this mentioned in the logs

 

osd.68 106454 maybe_wait_for_max_pg withhold creation of pg 0.1cf: 403 >= 400

 

This made me search through the code for this warning string

 

https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L4221

 

Which jogged my memory about the changes in Luminous regarding max PG’s 
warning, and in particular these two config options

mon_max_pg_per_osd

osd_max_pg_per_osd_hard_ratio

 

In my cluster I have just over 200 PG's per OSD, but the node with OSD.68 in 
has 8TB disks instead of the 3TB in the rest of the cluster. This means these 
OSD’s were taking a lot more PG’s than the average would suggest. So in 
Luminous 200x2 gives a hard limit of 400, which is what that error message in 
the log suggests is the limit. I set the osd_max_pg_per_osd_hard_ratio  option 
to 3 and restarted the OSD and hey presto everything fell into line.
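
For reference, the change boils down to something like this (a sketch; 3 is simply the 
ratio that worked for this cluster, and mon_max_pg_per_osd is left at the 200 already 
in effect here):

# ceph.conf
[global]
mon_max_pg_per_osd = 200
osd_max_pg_per_osd_hard_ratio = 3

# then restart the affected OSD
systemctl restart ceph-osd@68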

 

Now a question. I get the idea behind these settings, to stop people creating too many 
pools or pools with too many PG's. But is it correct that they can break an existing 
pool which is simply creating a new PG on an OSD because the CRUSH layout has been modified?

 

Nick

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Nick 
Fisk
Sent: 13 December 2017 11:14
To: 'Gregory Farnum' 
Cc: 'ceph-users' 
Subject: Re: [ceph-users] Odd object blocking IO on PG

 

 

On Tue, Dec 12, 2017 at 12:33 PM Nick Fisk <n...@fisk.me.uk> wrote:


> That doesn't look like an RBD object -- any idea who is
> "client.34720596.1:212637720"?

So I think these might be proxy ops from the cache tier, as there are also
block ops on one of the cache tier OSD's, but this time it actually lists
the object name. Block op on cache tier.

   "description": "osd_op(client.34720596.1:212637720 17.ae78c1cf
17:f3831e75:::rbd_data.15a5e20238e1f29.000388ad:head [set-alloc-hint
object_size 4194304 write_size 4194304,write 2584576~16384] snapc 0=[]
RETRY=2 ondisk+retry+write+known_if_redirected e104841)",
"initiated_at": "2017-12-12 16:25:32.435718",
"age": 13996.681147,
"duration": 13996.681203,
"type_data": {
"flag_point": "reached pg",
"client_info": {
"client": "client.34720596",
"client_addr": "10.3.31.41:0/2600619462 
<http://10.3.31.41:0/2600619462> ",
"tid": 212637720

I'm a bit baffled at the moment about what's going on. The pg query (attached) is not
showing in the main status that it has been blocked from peering or that
there are any missing objects. I've tried restarting all OSD's I can see
relating to the PG in case they needed a bit of a nudge.

 

Did that fix anything? I don't see anything immediately obvious but I'm not 
practiced in quickly reading that pg state output.

 

What's the output of "ceph -s"?

 

Hi Greg,

 

No restarting OSD’s didn’t seem to help. But I did make some progress late last 
night. By stopping OSD.68 the cluster unlocks itself and IO can progress. 
However as soon as it starts back up, 0.1cf and a couple of other PG’s again 
get stuck in an activating state. If I out the OSD, either with it up or down, 
then some other PG’s seem to get hit by the same problem as CRUSH moves PG 
mappings around to other OSD’s.

 

So there definitely seems to be some sort of weird peering issue somewhere. I 
have seen a very similar issue before on this cluster where after running the 
crush reweight script to balance OSD utilization, the weight got set too low 
and PG’s were unable to peer. I’m not convinced this is what’s happening here 
as all the weights haven’t changed, but I’m intending to explore this further 
just in case.

 

With 68 down

pgs: 1071783/48650631 objects degraded (2.203%)

 5923 active+clean

 399  active+undersized+degraded

 7active+clean+scrubbing+deep

 7active+clean+remapped

 

With it up

pgs: 0.047% pgs not active

 67271/48651279 objects degraded (0.138%)

 15602/48651279 objects misplaced (0.032%)

 6051 active+clean

 273  active+recovery_wait+degraded

 4active+clean+scrubbing+deep

 4active+remapped+backfill_wait

3activating+remapped

1.  active+recovering+degraded

 

PG Dump

ceph pg dump | grep activatin

dumped all

2.389 0  

Re: [ceph-users] Health Error : Request Stuck

2017-12-13 Thread Nick Fisk
Ok, great glad you got your issue sorted. I’m still battling along with mine.

 

From: Karun Josy [mailto:karunjo...@gmail.com] 
Sent: 13 December 2017 12:22
To: n...@fisk.me.uk
Cc: ceph-users 
Subject: Re: [ceph-users] Health Error : Request Stuck

 

Hi Nick,

 

Finally, was able to correct the issue!

 

We found that there were many slow requests in ceph health detail. 
And found that some osds were slowing the cluster down.

 

Initially the cluster was unusable when there were 10 PGs with 
"activating+remapped" status and slow requests.

Slow requests were mainly on 2 osds. And we restarted osd daemons one by one, 
which cleared the blocked requests.

 

And that made the cluster reusable. However, there were 4 PGs still in inactive 
state.

So I took down one of the osd with slow requests for some time, and allowed the 
cluster to rebalance.

And it worked!

 

To be honest, not exactly sure it's the correct way. 
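
In command terms, that works out to something like this (a sketch; the real OSD ids 
come from ceph health detail):

ceph health detail                       # lists the OSDs with slow/stuck requests
ceph daemon osd.<id> dump_ops_in_flight  # see what the blocked ops are waiting on
systemctl restart ceph-osd@<id>          # restart an OSD holding slow requests
systemctl stop ceph-osd@<id>             # or take a misbehaving OSD down for a while and let the cluster rebalance
systemctl start ceph-osd@<id>            # bring it back once things settle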

 

P.S : I had upgraded to Luminous 12.2.2 yesterday. 

 




Karun Josy

 

On Wed, Dec 13, 2017 at 4:31 PM, Nick Fisk <n...@fisk.me.uk> wrote:

Hi Karun,

 

I too am experiencing something very similar with a PG stuck in 
activating+remapped state after re-introducing an OSD back into the cluster as 
Bluestore. Although this new OSD is not the one listed against the PG’s stuck 
activating. I also see the same thing as you where the up set is different to 
the acting set.

 

Can I just ask what ceph version you are running and the output of ceph osd 
tree?

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Karun Josy
Sent: 13 December 2017 07:06
To: ceph-users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Health Error : Request Stuck

 

Cluster is unusable because of inactive PGs. How can we correct it?

 

=

ceph pg dump_stuck inactive

ok

PG_STAT STATE   UP   UP_PRIMARY ACTING   ACTING_PRIMARY

1.4bactivating+remapped [5,2,0,13,1]  5 [5,2,13,1,4]  5

1.35activating+remapped [2,7,0,1,12]  2 [2,7,1,12,9]  2

1.12activating+remapped  [1,3,5,0,7]  1  [1,3,5,7,2]  1

1.4eactivating+remapped  [1,3,0,9,2]  1  [1,3,0,9,5]  1

2.3bactivating+remapped [13,1,0] 13 [13,1,2] 13

1.19activating+remapped [2,13,8,9,0]  2 [2,13,8,9,1]  2

1.1eactivating+remapped [2,3,1,10,0]  2 [2,3,1,10,5]  2

2.29activating+remapped [1,0,13]  1 [1,8,11]  1

1.6factivating+remapped [8,2,0,4,13]  8 [8,2,4,13,1]  8

1.74activating+remapped [7,13,2,0,4]  7 [7,13,2,4,1]  7






Karun Josy

 

On Wed, Dec 13, 2017 at 8:27 AM, Karun Josy <karunjo...@gmail.com> wrote:

Hello,

 

We added a new disk to the cluster and while rebalancing we are getting error 
warnings.

 

=

Overall status: HEALTH_ERR

REQUEST_SLOW: 1824 slow requests are blocked > 32 sec

REQUEST_STUCK: 1022 stuck requests are blocked > 4096 sec

==

 

The load in the servers seems to be very low.

 

How can I correct it?

 

 

Karun 

 

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Odd object blocking IO on PG

2017-12-13 Thread Nick Fisk
 

On Tue, Dec 12, 2017 at 12:33 PM Nick Fisk <n...@fisk.me.uk> wrote:


> That doesn't look like an RBD object -- any idea who is
> "client.34720596.1:212637720"?

So I think these might be proxy ops from the cache tier, as there are also
block ops on one of the cache tier OSD's, but this time it actually lists
the object name. Block op on cache tier.

   "description": "osd_op(client.34720596.1:212637720 17.ae78c1cf
17:f3831e75:::rbd_data.15a5e20238e1f29.000388ad:head [set-alloc-hint
object_size 4194304 write_size 4194304,write 2584576~16384] snapc 0=[]
RETRY=2 ondisk+retry+write+known_if_redirected e104841)",
"initiated_at": "2017-12-12 16:25:32.435718",
"age": 13996.681147,
"duration": 13996.681203,
"type_data": {
"flag_point": "reached pg",
"client_info": {
"client": "client.34720596",
"client_addr": "10.3.31.41:0/2600619462 
<http://10.3.31.41:0/2600619462> ",
"tid": 212637720

I'm a bit baffled at the moment about what's going on. The pg query (attached) is not
showing in the main status that it has been blocked from peering or that
there are any missing objects. I've tried restarting all OSD's I can see
relating to the PG in case they needed a bit of a nudge.

 

Did that fix anything? I don't see anything immediately obvious but I'm not 
practiced in quickly reading that pg state output.

 

What's the output of "ceph -s"?

 

Hi Greg,

 

No restarting OSD’s didn’t seem to help. But I did make some progress late last 
night. By stopping OSD.68 the cluster unlocks itself and IO can progress. 
However as soon as it starts back up, 0.1cf and a couple of other PG’s again 
get stuck in an activating state. If I out the OSD, either with it up or down, 
then some other PG’s seem to get hit by the same problem as CRUSH moves PG 
mappings around to other OSD’s.

 

So there definitely seems to be some sort of weird peering issue somewhere. I 
have seen a very similar issue before on this cluster where after running the 
crush reweight script to balance OSD utilization, the weight got set too low 
and PG’s were unable to peer. I’m not convinced this is what’s happening here 
as all the weights haven’t changed, but I’m intending to explore this further 
just in case.

 

With 68 down

pgs: 1071783/48650631 objects degraded (2.203%)

 5923 active+clean

 399  active+undersized+degraded

 7active+clean+scrubbing+deep

 7active+clean+remapped

 

With it up

pgs: 0.047% pgs not active

 67271/48651279 objects degraded (0.138%)

 15602/48651279 objects misplaced (0.032%)

 6051 active+clean

 273  active+recovery_wait+degraded

 4active+clean+scrubbing+deep

 4active+remapped+backfill_wait

3activating+remapped

1.  active+recovering+degraded

 

PG Dump

ceph pg dump | grep activatin

dumped all

2.389 0  00 0   0   0 1500  
   1500   activating+remapped 2017-12-13 11:08:50.990526  
76271'34230106239:160310 [68,60,58,59,29,23] 68 [62,60,58,59,29,23] 
62  76271'34230 2017-12-13 09:00:08.359690  76271'34230 
2017-12-10 10:05:10.931366

0.1cf  3947  00 0   0 16472186880 1577  
   1577   activating+remapped 2017-12-13 11:08:50.641034   
106236'7512915   106239:6176548   [34,68,8] 34   
[34,8,53] 34   106138'7512682 2017-12-13 10:27:37.400613   
106138'7512682 2017-12-13 10:27:37.400613

2.210 0  00 0   0   0 1500  
   1500   activating+remapped 2017-12-13 11:08:50.686193  
76271'33304 106239:96797 [68,67,34,36,16,15] 68 [62,67,34,36,16,15] 
62  76271'33304 2017-12-12 00:49:21.038437  76271'33304 
2017-12-10 16:05:12.751425

 

 


>
> On Tue, Dec 12, 2017 at 12:36 PM, Nick Fisk  <mailto:n...@fisk.me.uk> > wrote:
> > Does anyone know what this object (0.ae78c1cf) might be, it's not your
> > normal run of the mill RBD object and I can't seem to find it in the
> > pool using rados --all ls . It seems to be leaving the 0.1cf PG stuck
> > in an
> > activating+remapped state and blocking IO. Pool 0 is just a pure RBD
> > activating+pool
> > with a cache tier above it. There is no current mention of unfound
> > objects or any other obvious issues.
> >
> > There is some

Re: [ceph-users] Health Error : Request Stuck

2017-12-13 Thread Nick Fisk
Hi Karun,

 

I too am experiencing something very similar with a PG stuck in 
activating+remapped state after re-introducing an OSD back into the cluster as 
Bluestore. Although this new OSD is not the one listed against the PG’s stuck 
activating. I also see the same thing as you where the up set is different to 
the acting set.

 

Can I just ask what ceph version you are running and the output of ceph osd 
tree?

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Karun 
Josy
Sent: 13 December 2017 07:06
To: ceph-users 
Subject: Re: [ceph-users] Health Error : Request Stuck

 

Cluster is unusable because of inactive PGs. How can we correct it?

 

=

ceph pg dump_stuck inactive

ok

PG_STAT STATE   UP   UP_PRIMARY ACTING   ACTING_PRIMARY

1.4bactivating+remapped [5,2,0,13,1]  5 [5,2,13,1,4]  5

1.35activating+remapped [2,7,0,1,12]  2 [2,7,1,12,9]  2

1.12activating+remapped  [1,3,5,0,7]  1  [1,3,5,7,2]  1

1.4eactivating+remapped  [1,3,0,9,2]  1  [1,3,0,9,5]  1

2.3bactivating+remapped [13,1,0] 13 [13,1,2] 13

1.19activating+remapped [2,13,8,9,0]  2 [2,13,8,9,1]  2

1.1eactivating+remapped [2,3,1,10,0]  2 [2,3,1,10,5]  2

2.29activating+remapped [1,0,13]  1 [1,8,11]  1

1.6factivating+remapped [8,2,0,4,13]  8 [8,2,4,13,1]  8

1.74activating+remapped [7,13,2,0,4]  7 [7,13,2,4,1]  7






Karun Josy

 

On Wed, Dec 13, 2017 at 8:27 AM, Karun Josy <karunjo...@gmail.com> wrote:

Hello,

 

We added a new disk to the cluster and while rebalancing we are getting error 
warnings.

 

=

Overall status: HEALTH_ERR

REQUEST_SLOW: 1824 slow requests are blocked > 32 sec

REQUEST_STUCK: 1022 stuck requests are blocked > 4096 sec

==

 

The load in the servers seems to be very low.

 

How can I correct it?

 

 

Karun 

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Odd object blocking IO on PG

2017-12-12 Thread Nick Fisk
 
> That doesn't look like an RBD object -- any idea who is
> "client.34720596.1:212637720"?

So I think these might be proxy ops from the cache tier, as there are also
block ops on one of the cache tier OSD's, but this time it actually lists
the object name. Block op on cache tier.

   "description": "osd_op(client.34720596.1:212637720 17.ae78c1cf
17:f3831e75:::rbd_data.15a5e20238e1f29.000388ad:head [set-alloc-hint
object_size 4194304 write_size 4194304,write 2584576~16384] snapc 0=[]
RETRY=2 ondisk+retry+write+known_if_redirected e104841)",
"initiated_at": "2017-12-12 16:25:32.435718",
"age": 13996.681147,
"duration": 13996.681203,
"type_data": {
"flag_point": "reached pg",
"client_info": {
"client": "client.34720596",
"client_addr": "10.3.31.41:0/2600619462",
"tid": 212637720

I'm a bit baffled at the moment about what's going on. The pg query (attached) is not
showing in the main status that it has been blocked from peering or that
there are any missing objects. I've tried restarting all OSD's I can see
relating to the PG in case they needed a bit of a nudge.

> 
> On Tue, Dec 12, 2017 at 12:36 PM, Nick Fisk  wrote:
> > Does anyone know what this object (0.ae78c1cf) might be, it's not your
> > normal run of the mill RBD object and I can't seem to find it in the
> > pool using rados --all ls . It seems to be leaving the 0.1cf PG stuck
> > in an
> > activating+remapped state and blocking IO. Pool 0 is just a pure RBD
> > activating+pool
> > with a cache tier above it. There is no current mention of unfound
> > objects or any other obvious issues.
> >
> > There is some backfilling going on, on another OSD which was upgraded
> > to bluestore, which was when the issue started. But I can't see any
> > link in the PG dump with upgraded OSD. My only thought so far is to
> > wait for this backfilling to finish and then deep-scrub this PG and
> > see if that reveals anything?
> >
> > Thanks,
> > Nick
> >
> >  "description": "osd_op(client.34720596.1:212637720 0.1cf 0.ae78c1cf
> > (undecoded)
> > ondisk+retry+write+ignore_cache+ignore_overlay+known_if_redirected
> > e105014)",
> > "initiated_at": "2017-12-12 17:10:50.030660",
> > "age": 335.948290,
> > "duration": 335.948383,
> > "type_data": {
> > "flag_point": "delayed",
> > "events": [
> > {
> > "time": "2017-12-12 17:10:50.030660",
> > "event": "initiated"
> > },
> > {
> > "time": "2017-12-12 17:10:50.030692",
> > "event": "queued_for_pg"
> > },
> > {
> > "time": "2017-12-12 17:10:50.030719",
> > "event": "reached_pg"
> > },
> > {
> > "time": "2017-12-12 17:10:50.030727",
> > "event": "waiting for peered"
> > },
> > {
> > "time": "2017-12-12 17:10:50.197353",
> > "event": "reached_pg"
> > },
> > {
> > "time": "2017-12-12 17:10:50.197355",
> > "event": "waiting for peered"
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> --
> Jason
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
{
"state": "activating+remapped",
"snap_trimq": "[]",
"epoch": 105385,
"up": [
34,
68,
8
],
"acting

[ceph-users] Bluestore Compression not inheriting pool option

2017-12-12 Thread Nick Fisk
Hi All,

Has anyone been testing the bluestore pool compression option?

I have set compression=snappy on a RBD pool. When I add a new bluestore OSD,
data is not being compressed when backfilling, confirmed by looking at the
perf dump results. If I then set again the compression type on the pool to
snappy, then immediately data starts getting compressed. It seems like when
a new OSD joins the cluster, it doesn't pick up the existing compression
setting on the pool.

Anyone seeing anything similar? I will raise a bug if anyone can confirm.

Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Odd object blocking IO on PG

2017-12-12 Thread Nick Fisk
Does anyone know what this object (0.ae78c1cf) might be, it's not your
normal run of the mill RBD object and I can't seem to find it in the pool
using rados --all ls . It seems to be leaving the 0.1cf PG stuck in an
activating+remapped state and blocking IO. Pool 0 is just a pure RBD pool
with a cache tier above it. There is no current mention of unfound objects
or any other obvious issues.  

There is some backfilling going on, on another OSD which was upgraded to
bluestore, which was when the issue started. But I can't see any link in the
PG dump with upgraded OSD. My only thought so far is to wait for this
backfilling to finish and then deep-scrub this PG and see if that reveals
anything?
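
For anyone following along, the blocked op shown below was pulled from the primary 
OSD's admin socket; the relevant commands are roughly (a sketch, using the PG from 
this thread and a placeholder OSD id):

ceph daemon osd.<id> dump_ops_in_flight   # shows the op and its "flag_point"
ceph pg 0.1cf query                       # peering/recovery state for the PG
ceph pg deep-scrub 0.1cf                  # the deep-scrub mentioned above, once backfill finishes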

Thanks,
Nick

 "description": "osd_op(client.34720596.1:212637720 0.1cf 0.ae78c1cf
(undecoded)
ondisk+retry+write+ignore_cache+ignore_overlay+known_if_redirected
e105014)",
"initiated_at": "2017-12-12 17:10:50.030660",
"age": 335.948290,
"duration": 335.948383,
"type_data": {
"flag_point": "delayed",
"events": [
{
"time": "2017-12-12 17:10:50.030660",
"event": "initiated"
},
{
"time": "2017-12-12 17:10:50.030692",
"event": "queued_for_pg"
},
{
"time": "2017-12-12 17:10:50.030719",
"event": "reached_pg"
},
{
"time": "2017-12-12 17:10:50.030727",
"event": "waiting for peered"
},
{
"time": "2017-12-12 17:10:50.197353",
"event": "reached_pg"
},
{
"time": "2017-12-12 17:10:50.197355",
"event": "waiting for peered"

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] what's the maximum number of OSDs per OSD server?

2017-12-10 Thread Nick Fisk
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Igor 
Mendelev
Sent: 10 December 2017 17:37
To: n...@fisk.me.uk; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] what's the maximum number of OSDs per OSD server?

 

Expected number of nodes for initial setup is 10-15 and of OSDs - 1,500-2,000. 

 

Networking is planned to be 2 100GbE or 2 dual 50GbE in x16 slots (per OSD 
node).

 

JBODs are to be connected with 3-4 x8 SAS3 HBAs (4 4x SAS3 ports each)

 

Choice of hardware is done considering (non-trivial) per-server sw licensing 
costs -

so small (12-24 HDD) nodes are certainly not optimal regardless of CPUs cost 
(which

is estimated to be below 10% of the total cost in the setup I'm currently 
considering).

 

EC (4+2 or 8+3 etc - TBD) - not 3x replication - is planned to be used for most 
of the storage space.

 

Main applications are expected to be archiving and sequential access to large 
(multiGB) files/objects.

 

Nick, which physical limitations you're referring to ?

 

Thanks.

 

 

Hi Igor,

 

I guess I meant physical annoyances rather than limitations. Being able to pull 
out a 1 or 2U node is always much less of a chore vs dealing with several U of 
SAS interconnected JBOD’s. 

 

If you have some license reason for larger nodes, then there is a very valid 
argument for larger nodes. Is this license cost related in some way to Ceph (I 
thought Redhat was capacity based) or is this some sort of collocated software? 
Just make sure you size the nodes to the point that, if one has to be taken 
offline for any reason, you are happy with the resulting state of the 
cluster, including the peering when suddenly taking ~200 OSD’s offline/online.

 

Nick

 

 

On Sun, Dec 10, 2017 at 11:17 AM, Nick Fisk <n...@fisk.me.uk> wrote:

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Igor Mendelev
Sent: 10 December 2017 15:39
To: ceph-users@lists.ceph.com
Subject: [ceph-users] what's the maximum number of OSDs per OSD server?

 

Given that servers with 64 CPU cores (128 threads @ 2.7GHz) and up to 2TB RAM - 
as well as 12TB HDDs - are easily available and somewhat reasonably priced I 
wonder what's the maximum number of OSDs per OSD server (if using 10TB or 12TB 
HDDs) and how much RAM does it really require if total storage capacity for 
such OSD server is on the order of 1,000+ TB - is it still 1GB RAM per TB of 
HDD or it could be less (during normal operations - and extended with NVMe SSDs 
swap space for extra space during recovery)?

 

Are there any known scalability limits in Ceph Luminous (12.2.2 with BlueStore) 
and/or Linux that'll make such high capacity OSD server not scale well (using 
sequential IO speed per HDD as a metric)?

 

Thanks.

 

How many total OSD’s will you have? If you are planning on having thousands 
then dense nodes might make sense. Otherwise you are leaving yourself open to 
having a small number of very large nodes, which will likely shoot you in the 
foot further down the line. Also don’t forget, unless this is purely for 
archiving, you will likely need to scale the networking up per node, 2x10G 
won’t cut it when you have 10-20+ disks per node.

 

With Bluestore, you are probably looking at around 2-3GB of RAM per OSD, so say 
4GB to be on the safe side.

7.2k HDD’s will likely only use a small proportion of a CPU core due to their 
limited IO potential. I would imagine that even with 90 bay JBOD's, you will 
run into physical limitations before you hit CPU ones. 

 

Without knowing your exact requirements, I would suggest that a larger number of 
smaller nodes might be a better idea. If you choose your hardware right, you 
can often get the cost down to comparable levels by not going with top of the 
range kit, i.e. Xeon E3's or D's vs dual socket E5's.

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] what's the maximum number of OSDs per OSD server?

2017-12-10 Thread Nick Fisk
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Igor 
Mendelev
Sent: 10 December 2017 15:39
To: ceph-users@lists.ceph.com
Subject: [ceph-users] what's the maximum number of OSDs per OSD server?

 

Given that servers with 64 CPU cores (128 threads @ 2.7GHz) and up to 2TB RAM - 
as well as 12TB HDDs - are easily available and somewhat reasonably priced I 
wonder what's the maximum number of OSDs per OSD server (if using 10TB or 12TB 
HDDs) and how much RAM does it really require if total storage capacity for 
such OSD server is on the order of 1,000+ TB - is it still 1GB RAM per TB of 
HDD or it could be less (during normal operations - and extended with NVMe SSDs 
swap space for extra space during recovery)?

 

Are there any known scalability limits in Ceph Luminous (12.2.2 with BlueStore) 
and/or Linux that'll make such high capacity OSD server not scale well (using 
sequential IO speed per HDD as a metric)?

 

Thanks.

 

How many total OSD’s will you have? If you are planning on having thousands 
then dense nodes might make sense. Otherwise you are leaving yourself open to 
having a small number of very large nodes, which will likely shoot you in the 
foot further down the line. Also don’t forget, unless this is purely for 
archiving, you will likely need to scale the networking up per node, 2x10G 
won’t cut it when you have 10-20+ disks per node.

 

With Bluestore, you are probably looking at around 2-3GB of RAM per OSD, so say 
4GB to be on the safe side.

7.2k HDD’s will likely only use a small proportion of a CPU core due to their 
limited IO potential. I would imagine that even with 90 bay JBOD's, you will 
run into physical limitations before you hit CPU ones. 

 

Without knowing your exact requirements, I would suggest that a larger number of 
smaller nodes might be a better idea. If you choose your hardware right, you 
can often get the cost down to comparable levels by not going with top of the 
range kit, i.e. Xeon E3's or D's vs dual socket E5's.
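
On the RAM side, most of a Bluestore OSD's steady-state memory is its cache, which can 
be capped in ceph.conf if you need predictable sizing (a sketch; the values are only 
examples, and actual RSS will be the cache plus allocator and metadata overheads, 
hence the 2-3GB+ per OSD figure above):

[osd]
bluestore_cache_size_hdd = 1073741824    # e.g. 1 GiB of cache per HDD OSD
bluestore_cache_size_ssd = 3221225472    # e.g. 3 GiB per flash OSD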

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-27 Thread Nick Fisk
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of German 
Anders
Sent: 27 November 2017 14:44
To: Maged Mokhtar 
Cc: ceph-users 
Subject: Re: [ceph-users] ceph all-nvme mysql performance tuning

 

Hi Maged,

 

Thanks a lot for the response. We tried with different numbers of threads and 
we're getting almost the same kind of difference between the storage types. 
Going to try with different rbd stripe size, object size values and see if we 
get more competitive numbers. Will get back with more tests and param changes 
to see if we get better :)

 

 

Just to echo a couple of comments. Ceph will always struggle to match the 
performance of a traditional array for mainly 2 reasons.

 

1.  You are replacing some sort of dual ported SAS or internally RDMA 
connected device with a network for Ceph replication traffic. This will 
instantly have a large impact on write latency
2.  Ceph locks at the PG level and a PG will most likely cover at least one 
4MB object, so lots of small accesses to the same blocks (on a block device) 
will wait on each other and go effectively at a single threaded rate.

 

The best thing you can do to mitigate these is to run the fastest journal/WAL 
devices you can, the fastest network connections (i.e. 25Gb/s), and run your CPU's at 
max C and P states.

 

You stated that you are running the performance profile on the CPU’s. Could you 
also just double check that the C-states are being held at C1(e)? There are a 
few utilities that can show this in realtime.
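
For reference, a few of those utilities and the usual knobs (a sketch; package names 
and exact kernel parameters vary by distro and CPU generation):

turbostat                                 # live per-core C-state residency and frequencies
cpupower monitor                          # similar residency view
cpupower idle-info                        # which C-states are enabled
cpupower frequency-set -g performance     # pin the performance governor
# and/or boot with: intel_idle.max_cstate=1 processor.max_cstate=1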

 

Other than that, although there could be some minor tweaks, you are probably 
nearing the limit of what you can hope to achieve.

 

Nick

 

 

Thanks,

 

Best,




German

 

2017-11-27 11:36 GMT-03:00 Maged Mokhtar <mmokh...@petasan.org>:

On 2017-11-27 15:02, German Anders wrote:

Hi All,

 

I've a performance question, we recently install a brand new Ceph cluster with 
all-nvme disks, using ceph version 12.2.0 with bluestore configured. The 
back-end of the cluster is using a bond IPoIB (active/passive) , and for the 
front-end we are using a bonding config with active/active (20GbE) to 
communicate with the clients.

 

The cluster configuration is the following:

 

MON Nodes:

OS: Ubuntu 16.04.3 LTS | kernel 4.12.14 

3x 1U servers:

  2x Intel Xeon E5-2630v4 @2.2Ghz

  128G RAM

  2x Intel SSD DC S3520 150G (in RAID-1 for OS)

  2x 82599ES 10-Gigabit SFI/SFP+ Network Connection

 

OSD Nodes:

OS: Ubuntu 16.04.3 LTS | kernel 4.12.14

4x 2U servers:

  2x Intel Xeon E5-2640v4 @2.4Ghz

  128G RAM

  2x Intel SSD DC S3520 150G (in RAID-1 for OS)

  1x Ethernet Controller 10G X550T

  1x 82599ES 10-Gigabit SFI/SFP+ Network Connection

  12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons

  1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)

 

 

Here's the tree:

 

ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF

-7   48.0 root root

-5   24.0 rack rack1

-1   12.0 node cpn01

 0  nvme  1.0 osd.0  up  1.0 1.0

 1  nvme  1.0 osd.1  up  1.0 1.0

 2  nvme  1.0 osd.2  up  1.0 1.0

 3  nvme  1.0 osd.3  up  1.0 1.0

 4  nvme  1.0 osd.4  up  1.0 1.0

 5  nvme  1.0 osd.5  up  1.0 1.0

 6  nvme  1.0 osd.6  up  1.0 1.0

 7  nvme  1.0 osd.7  up  1.0 1.0

 8  nvme  1.0 osd.8  up  1.0 1.0

 9  nvme  1.0 osd.9  up  1.0 1.0

10  nvme  1.0 osd.10 up  1.0 1.0

11  nvme  1.0 osd.11 up  1.0 1.0

-3   12.0 node cpn03

24  nvme  1.0 osd.24 up  1.0 1.0

25  nvme  1.0 osd.25 up  1.0 1.0

26  nvme  1.0 osd.26 up  1.0 1.0

27  nvme  1.0 osd.27 up  1.0 1.0

28  nvme  1.0 osd.28 up  1.0 1.0

29  nvme  1.0 osd.29 up  1.0 1.0

30  nvme  1.0 osd.30 up  1.0 1.0

31  nvme  1.0 osd.31 up  1.0 1.0

32  nvme  1.0 osd.32 up  1.0 1.0

33  nvme  1.0 osd.33 up  1.0 1.0

34  nvme  1.0 osd.34 up  1.0 1.0

35  nvme  1.0 osd.35 up  1.0 1.0

-6   24.0 rack rack2

-2   12.0 node cpn02

12  nvme  1.0 osd.12 up  1.0 1.0

13  nvme  1.0 osd.13 up  1.0 1.0

14  nvme  1.0 osd.14 up  1.0 1.0

15  nvme  1.0 osd.15 up  1.0 1.0

16  nvme  1.0 osd.16 up  1.0 1.0

17  nvme  1.0 osd.17 up  1.0 1.0

18  nvme  1

Re: [ceph-users] Bluestore performance 50% of filestore

2017-11-18 Thread Nick Fisk
al) likes high io depths so writes can hit all of 
> the
> drives at the same time.  There are tricks (like journals, writahead logs,
> centralized caches, etc) that can help mitigate this, but I suspect you'll see
> much better performance with more concurrent writes.
> 
> Regarding file size, the smaller the file, the more likely those tricks 
> mentioned
> above are to help you.  Based on your results, it appears filestore may be
> doing a better job of it than bluestore.  The question you have to ask is
> whether or not this kind of test represents what you are likely to see for 
> real
> on your cluster.
> 
> Doing writes over a much larger file, say 3-4x over the total amount of RAM
> in all of the nodes, helps you get a better idea of what the behavior is like
> when those tricks are less effective.  I think that's probably a more likely
> scenario in most production environments, but it's up to you which workload
> you think better represents what you are going to see in practice.  A while
> back Nick Fisk showed some results wehre bluestore was slower than
> filestore at small sync writes and it could be that we simply have more work
> to do in this area.  On the other hand, we pretty consistently see bluestore
> doing better than filestore with 4k random writes and higher IO depths,
> which is why I'd be curious to see how it goes if you try that.
> 
> Mark
> 
> On 11/16/2017 10:11 AM, Milanov, Radoslav Nikiforov wrote:
> > No,
> > What test parameters (iodepth/file size/numjobs) would make sense  for 3
> node/27OSD@4TB ?
> > - Rado
> >
> > -Original Message-
> > From: Mark Nelson [mailto:mnel...@redhat.com]
> > Sent: Thursday, November 16, 2017 10:56 AM
> > To: Milanov, Radoslav Nikiforov ; David Turner
> > 
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Bluestore performance 50% of filestore
> >
> > Did you happen to have a chance to try with a higher io depth?
> >
> > Mark
> >
> > On 11/16/2017 09:53 AM, Milanov, Radoslav Nikiforov wrote:
> >> FYI
> >>
> >> Having 50GB bock.db made no difference on the performance.
> >>
> >>
> >>
> >> - Rado
> >>
> >>
> >>
> >> *From:*David Turner [mailto:drakonst...@gmail.com]
> >> *Sent:* Tuesday, November 14, 2017 6:13 PM
> >> *To:* Milanov, Radoslav Nikiforov 
> >> *Cc:* Mark Nelson ; ceph-users@lists.ceph.com
> >> *Subject:* Re: [ceph-users] Bluestore performance 50% of filestore
> >>
> >>
> >>
> >> I'd probably say 50GB to leave some extra space over-provisioned.
> >> 50GB should definitely prevent any DB operations from spilling over to the
> HDD.
> >>
> >>
> >>
> >> On Tue, Nov 14, 2017, 5:43 PM Milanov, Radoslav Nikiforov
> >> mailto:rad...@bu.edu>> wrote:
> >>
> >> Thank you,
> >>
> >> It is 4TB OSDs and they might become full someday, I’ll try 60GB db
> >> partition – this is the max OSD capacity.
> >>
> >>
> >>
> >> - Rado
> >>
> >>
> >>
> >> *From:*David Turner [mailto:drakonst...@gmail.com
> >> <mailto:drakonst...@gmail.com>]
> >> *Sent:* Tuesday, November 14, 2017 5:38 PM
> >>
> >>
> >> *To:* Milanov, Radoslav Nikiforov  >> <mailto:rad...@bu.edu>>
> >>
> >> *Cc:*Mark Nelson  <mailto:mnel...@redhat.com>>;
> >> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> >>
> >>
> >> *Subject:* Re: [ceph-users] Bluestore performance 50% of
> >> filestore
> >>
> >>
> >>
> >> You have to configure the size of the db partition in the config
> >> file for the cluster.  If you're db partition is 1GB, then I can all
> >> but guarantee that you're using your HDD for your blocks.db very
> >> quickly into your testing.  There have been multiple threads
> >> recently about what size the db partition should be and it seems to
> >> be based on how many objects your OSD is likely to have on it.  The
> >> recommendation has been to err on the side of bigger.  If you're
> >> running 10TB OSDs and anticipate filling them up, then you probably
> >> want closer to an 80GB+ db partition.  That's why I asked how full
> >> your cluster was and how large your HDDs are.
> >>
> >>
> >>
> >&
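
For anyone wanting to repeat the comparison at the higher queue depths Mark suggests, 
a test along these lines may be useful (a sketch; it assumes fio was built with RBD 
support and that a test image already exists, and the pool/image names, depths and 
runtime are placeholders):

fio --name=4k-randwrite --ioengine=rbd --clientname=admin --pool=rbd \
    --rbdname=fio-test --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 \
    --time_based --runtime=300 --group_reporting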

Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-08 Thread Nick Fisk
> -Original Message-
> From: Mark Nelson [mailto:mnel...@redhat.com]
> Sent: 08 November 2017 21:42
> To: n...@fisk.me.uk; 'Wolfgang Lendl' 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] bluestore - wal,db on faster devices?
> 
> 
> 
> On 11/08/2017 03:16 PM, Nick Fisk wrote:
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of Mark Nelson
> >> Sent: 08 November 2017 19:46
> >> To: Wolfgang Lendl 
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] bluestore - wal,db on faster devices?
> >>
> >> Hi Wolfgang,
> >>
> >> You've got the right idea.  RBD is probably going to benefit less
> >> since
> > you
> >> have a small number of large objects and little extra OMAP data.
> >> Having the allocation and object metadata on flash certainly
> >> shouldn't
> > hurt,
> >> and you should still have less overhead for small (<64k) writes.
> >> With RGW however you also have to worry about bucket index updates
> >> during writes and that's a big potential bottleneck that you don't
> >> need to worry about with RBD.
> >
> > If you are running anything which is sensitive to sync write latency,
> > like databases. You will see a big performance improvement in using WAL
> on SSD.
> > As Mark says, small writes will get ack'd once written to SSD.
> > ~10-200us vs 1-2us difference. It will also batch lots of
> > these small writes together and write them to disk in bigger chunks
> > much more effectively. If you want to run active workloads on RBD and
> > want them to match enterprise storage array with BBWC type
> > performance, I would say DB and WAL on SSD is a requirement.
> 
> Hi Nick,
> 
> You've done more investigation in this area than most I think.  Once you get
> to the point under continuous load where RocksDB is compacting, do you see
> better than a 2X gain?
> 
> Mark

Hi Mark,

I've not really been testing it in a way where all the OSD's would be under 
100% load for a long period of time. It's been more of a real world user facing 
test where IO comes and goes in short bursts and spikes. I've been busy in other 
areas for the last few months and so have sort of missed out on all the 
official Luminous/bluestore goodness. I hope to get round to doing some more 
testing towards the end of the year though. Once I do, I will look into the 
compaction and see what impact it might be having.

> 
> >
> >>
> >> Mark
> >>
> >> On 11/08/2017 01:01 PM, Wolfgang Lendl wrote:
> >>> Hi Mark,
> >>>
> >>> thanks for your reply!
> >>> I'm a big fan of keeping things simple - this means that there has
> >>> to be a very good reason to put the WAL and DB on a separate device
> >>> otherwise I'll keep it collocated (and simpler).
> >>>
> >>> as far as I understood - putting the WAL,DB on a faster (than hdd)
> >>> device makes more sense in cephfs and rgw environments (more
> >> metadata)
> >>> - and less sense in rbd environments - correct?
> >>>
> >>> br
> >>> wolfgang
> >>>
> >>> On 11/08/2017 02:21 PM, Mark Nelson wrote:
> >>>> Hi Wolfgang,
> >>>>
> >>>> In bluestore the WAL serves sort of a similar purpose to
> >>>> filestore's journal, but bluestore isn't dependent on it for
> >>>> guaranteeing durability of large writes.  With bluestore you can
> >>>> often get higher large-write throughput than with filestore when
> >>>> using HDD-only or flash-only OSDs.
> >>>>
> >>>> Bluestore also stores allocation, object, and cluster metadata in
> >>>> the DB.  That, in combination with the way bluestore stores
> >>>> objects, dramatically improves behavior during certain workloads.
> >>>> A big one is creating millions of small objects as quickly as
> >>>> possible.  In filestore, PG splitting has a huge impact on
> >>>> performance and tail latency.  Bluestore is much better just on
> >>>> HDD, and putting the DB and WAL on flash makes it better still
> >>>> since metadata no longer is a bottleneck.
> >>>>
> >>>> Bluestore does have a couple of shortcomings vs filestore currently.
> >>>> The allocator is not as good as XFS's and 

Re: [ceph-users] Blog post: storage server power consumption

2017-11-08 Thread Nick Fisk
Also look at the new WD 10TB Reds if you want very low-use archive storage.
Because they spin at 5400 RPM, they only use 2.8W at idle.

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Jack
> Sent: 06 November 2017 22:31
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Blog post: storage server power consumption
> 
> Online does that on C14 (https://www.online.net/en/c14)
> 
> IIRC, 52 spinning disks per RU, with only 2 disks usable at a time. There is
> some custom hardware, though, and it is really designed for cold storage (an
> IO must wait for an idle slot, power on the device, do the IO, power off the
> device and release the slot). They use 1GB as a block size.
> 
> I do not think this will work anyhow with Ceph
> 
> On 06/11/2017 23:12, Simon Leinen wrote:
> > The last paragraph contains a challenge to developers: Can we save
> > more power in "cold storage" applications by turning off idle disks?
> > Crazy idea, or did anyone already try this?
> >
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recovery operations and ioprio options

2017-11-08 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> ??? ???
> Sent: 08 November 2017 16:21
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Recovery operations and ioprio options
> 
> Hello,
> Today we use ceph jewel with:
>   osd disk thread ioprio class=idle
>   osd disk thread ioprio priority=7
> and "nodeep-scrub" flag is set.
> 
> We want to change scheduler from CFQ to deadline, so these options will
> lose effect.
> I've tried to find out what operations are performed in "disk thread".
What I
> found is that only scrubbing and snap-trimming operations are performed in
> "disk thread".

In Jewel those operations are now in the main OSD thread, so setting the
ioprio options will have no effect. Use the scrub and snap trim sleep options
to throttle them instead.
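For illustration, a minimal ceph.conf sketch of those sleep-based throttles
(Jewel option names; the values are examples to tune for your cluster, and since
the sleeps happen in the main op thread in Jewel, keep them small):

    [osd]
    osd_scrub_sleep = 0.1       # seconds slept between scrub chunks
    osd_snap_trim_sleep = 0.1   # seconds slept between snap trim operations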

> 
> Do these options affect recovery operations?
> Are there any other operations in "disk thread", except scrubbing and
snap-
> trimming?
> 
> --
> Regards,
> Aleksei Zakharov
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-08 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Mark Nelson
> Sent: 08 November 2017 19:46
> To: Wolfgang Lendl 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] bluestore - wal,db on faster devices?
> 
> Hi Wolfgang,
> 
> You've got the right idea.  RBD is probably going to benefit less since
you
> have a small number of large objects and little extra OMAP data.
> Having the allocation and object metadata on flash certainly shouldn't
hurt,
> and you should still have less overhead for small (<64k) writes.
> With RGW however you also have to worry about bucket index updates
> during writes and that's a big potential bottleneck that you don't need to
> worry about with RBD.

If you are running anything which is sensitive to sync write latency, like
databases, you will see a big performance improvement from putting the WAL on SSD.
As Mark says, small writes will get ack'd once written to SSD: roughly 10-200us vs
1-2ms. The WAL will also batch lots of these small writes
together and write them to disk in bigger chunks much more effectively. If
you want to run active workloads on RBD and want them to match an enterprise
storage array with BBWC-type performance, I would say DB and WAL on SSD is a
requirement.
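As a rough way to see the difference, a hedged fio sketch for measuring sync write
latency from a client on an RBD-backed filesystem (path, size and runtime below are
examples only):

    fio --name=synclat --filename=/mnt/rbdtest/fio.dat --size=1G \
        --rw=write --bs=4k --ioengine=libaio --iodepth=1 \
        --direct=1 --sync=1 --runtime=60 --time_based --group_reporting

Compare the completion latencies with and without the WAL/DB on SSD; the gap should
show up almost entirely in these small sync writes.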

> 
> Mark
> 
> On 11/08/2017 01:01 PM, Wolfgang Lendl wrote:
> > Hi Mark,
> >
> > thanks for your reply!
> > I'm a big fan of keeping things simple - this means that there has to
> > be a very good reason to put the WAL and DB on a separate device
> > otherwise I'll keep it collocated (and simpler).
> >
> > as far as I understood - putting the WAL,DB on a faster (than hdd)
> > device makes more sense in cephfs and rgw environments (more
> metadata)
> > - and less sense in rbd environments - correct?
> >
> > br
> > wolfgang
> >
> > On 11/08/2017 02:21 PM, Mark Nelson wrote:
> >> Hi Wolfgang,
> >>
> >> In bluestore the WAL serves sort of a similar purpose to filestore's
> >> journal, but bluestore isn't dependent on it for guaranteeing
> >> durability of large writes.  With bluestore you can often get higher
> >> large-write throughput than with filestore when using HDD-only or
> >> flash-only OSDs.
> >>
> >> Bluestore also stores allocation, object, and cluster metadata in the
> >> DB.  That, in combination with the way bluestore stores objects,
> >> dramatically improves behavior during certain workloads.  A big one
> >> is creating millions of small objects as quickly as possible.  In
> >> filestore, PG splitting has a huge impact on performance and tail
> >> latency.  Bluestore is much better just on HDD, and putting the DB
> >> and WAL on flash makes it better still since metadata no longer is a
> >> bottleneck.
> >>
> >> Bluestore does have a couple of shortcomings vs filestore currently.
> >> The allocator is not as good as XFS's and can fragment more over time.
> >> There is no server-side readahead so small sequential read
> >> performance is very dependent on client-side readahead.  There's
> >> still a number of optimizations to various things ranging from
> >> threading and locking in the shardedopwq to pglog and dup_ops that
> >> potentially could improve performance.
> >>
> >> I have a blog post that we've been working on that explores some of
> >> these things but I'm still waiting on review before I publish it.
> >>
> >> Mark
> >>
> >> On 11/08/2017 05:53 AM, Wolfgang Lendl wrote:
> >>> Hello,
> >>>
> >>> it's clear to me getting a performance gain from putting the journal
> >>> on a fast device (ssd,nvme) when using filestore backend.
> >>> it's not when it comes to bluestore - are there any resources,
> >>> performance test, etc. out there how a fast wal,db device impacts
> >>> performance?
> >>>
> >>>
> >>> br
> >>> wolfgang
> >>>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] VMware + Ceph using NFS sync/async ?

2017-08-16 Thread Nick Fisk
Hi Matt,

 

Well-behaved applications are the problem here. ESXi sends all writes as sync 
writes. So although guest OS's will still do their own buffering, any ESXi-level 
operation is all done as sync. This is probably most visible when 
migrating VMs between datastores: everything gets done as sync 64KB IOs, 
meaning copying a 1TB VM can often take nearly 24 hours.

 

Osama, can you describe the difference in performance you see between Openstack 
and ESXi and what type of operations these are? Sync writes should be the same 
no matter the client, except in the NFS case you will have an extra network hop 
and potentially a little bit of PG congestion around the FS journal on the RBD 
device.

 

Osama, you can’t compare Ceph to a SAN. Just in terms of network latency you 
have an extra 2 hops. In an ideal scenario you might be able to get Ceph write 
latency down to 0.5-1ms for a 4KB IO, compared to about 0.1-0.3ms for a 
storage array. However, what you will find with Ceph is that other things start 
to increase this average long before you would start to see this on storage 
arrays. 

 

The migration is a good example of this. As I said, ESXi migrates a VM in 64KB 
IO’s, but does 32 of these blocks in parallel at a time. On storage arrays, 
these 64KB IO’s are coalesced in the battery-protected write cache into bigger 
IO’s before being persisted to disk. The storage array can also accept all 32 
of these requests at once.

 

A similar thing happens in Ceph/RBD/NFS via the Ceph filestore journal, but 
that coalescing is now an extra 2 hops away and with a bit of extra latency 
introduced by the Ceph code, we are already a bit slower. But here’s the 
killer, PG locking!!! You can’t write 32 IO’s in parallel to the same 
object/PG, each one has to be processed sequentially because of the locks. 
(Please someone correct me if I’m wrong here). If your 64KB write latency is 
2ms, then you can only do 500 64KB IO’s a second. 64KB*500=~30MB/s vs a Storage 
Array which would be doing the operation in the hundreds of MB/s range.

 

Note: When proper iSCSI for RBD support is finished, you might be able to use 
the VAAI offloads, which would dramatically increase performance for migrations 
as well.

 

Also once persistent SSD write caching for librbd becomes available, a lot of 
these problems will go away, as the SSD will behave like a storage array’s 
write cache and will only be 1 hop away from the client as well.

 

From: Matt Benjamin [mailto:mbenj...@redhat.com] 
Sent: 16 August 2017 14:49
To: Osama Hasebou 
Cc: n...@fisk.me.uk; ceph-users 
Subject: Re: [ceph-users] VMware + Ceph using NFS sync/async ?

 

Hi Osama,

I don't have a clear sense of the application workflow here--and Nick 
appears to--but I thought it worth noting that NFSv3 and NFSv4 clients 
shouldn't normally need the sync mount option to achieve i/o stability with 
well-behaved applications.  In both versions of the protocol, an application 
write that is synchronous (or, more typically, the equivalent application sync 
barrier) should not succeed until an NFS-protocol COMMIT (or in some cases 
w/NFSv4, WRITE w/stable flag set) has been acknowledged by the NFS server.  If 
the NFS i/o stability model is insufficient for your workflow, moreover, I'd 
be worried that -osync writes (which might be incompletely applied during a 
failure event) may not be correctly enforcing your invariant, either.

 

Matt

 

On Wed, Aug 16, 2017 at 8:33 AM, Osama Hasebou <osama.hase...@csc.fi> wrote:

Hi Nick,

 

Thanks for replying! If Ceph is combined with Openstack, does that mean that 
when Openstack writes are happening, the data is not fully sync'd (as in written 
to disk) before it starts receiving more data, so it is effectively acting as async? 
In that scenario there is a chance of data loss if things go bad, i.e. a power 
outage or something like that?

 

As for the slow operations, reading is quite fine when I compare it to a SAN 
storage system connected to VMware. It is writing data, small chunks or big 
ones, that suffers when trying to use the sync option with FIO for benchmarking.

 

In that case, I wonder, is no one using CEPH with VMware in a production 
environment ?

 

Cheers.

 

Regards,
Ossi

 

 

 

Hi Osama,

 

This is a known problem with many software defined storage stacks, but 
potentially slightly worse with Ceph due to extra overheads. Sync writes have 
to wait until all copies of the data are written to disk by the OSD and 
acknowledged back to the client. The extra network hops for replication and NFS 
gateways add significant latency which impacts the time it takes to carry out 
small writes. The Ceph code also takes time to process each IO request.

 

What particular operations are you finding slow? Storage vmotions are just bad, 
and I don’t think there is much that can be done about them as they are split 
into lots of 64kb IO’s.

 

One thing you can try is to force the CPU’s on your OSD nodes to run 

Re: [ceph-users] VMware + Ceph using NFS sync/async ?

2017-08-14 Thread Nick Fisk
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Osama Hasebou
Sent: 14 August 2017 12:27
To: ceph-users 
Subject: [ceph-users] VMware + Ceph using NFS sync/async ?

 

Hi Everyone,

 

We started testing the idea of using Ceph storage with VMware. The idea was
to provide Ceph storage through Openstack to VMware, by creating a virtual
machine coming from Ceph + Openstack which acts as an NFS gateway, then
mounting that storage on top of the VMware cluster.

 

When mounting the NFS exports using the sync option, we noticed a huge
degradation in performance which makes it too slow to use in production.
The async option makes it much better, but then there is the risk that if
a failure should happen, some data might be lost in that
scenario.

 

Now I understand that some people in the Ceph community are using Ceph with
VMware via NFS gateways, so if you could kindly shed some light on your
experience, whether you use it for production purposes, and how you mitigated
the sync/async options and kept write performance, that would be great.

 

 

Thanks you!!!

 

Regards,
Ossi

 

Hi Osama,

 

This is a known problem with many software defined storage stacks, but
potentially slightly worse with Ceph due to extra overheads. Sync writes
have to wait until all copies of the data are written to disk by the OSD and
acknowledged back to the client. The extra network hops for replication and
NFS gateways add significant latency which impacts the time it takes to
carry out small writes. The Ceph code also takes time to process each IO
request.

 

What particular operations are you finding slow? Storage vmotions are just
bad, and I don't think there is much that can be done about them as they are
split into lots of 64kb IO's.

 

One thing you can try is to force the CPU's on your OSD nodes to run at C1
cstate and force their minimum frequency to 100%. This can have quite a
large impact on latency. Also you don't specify your network, but 10G is a
must.

 

Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] luminous/bluetsore osd memory requirements

2017-08-14 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Ronny Aasen
> Sent: 14 August 2017 18:55
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] luminous/bluetsore osd memory requirements
> 
> On 10.08.2017 17:30, Gregory Farnum wrote:
> > This has been discussed a lot in the performance meetings so I've
> > added Mark to discuss. My naive recollection is that the per-terabyte
> > recommendation will be more realistic  than it was in the past (an
> > effective increase in memory needs), but also that it will be under
> > much better control than previously.
> 
> 
> Is there any way to tune or reduce the memory footprint? perhaps by
> sacrificing performace ? our jewel cluster osd servers is maxed out on
> memory. And with the added memory requirements I  fear we may not be
> able to upgrade to luminous/bluestore..

Check out this PR; it shows the settings that control the memory used for
the BlueStore cache, and their defaults:

https://github.com/ceph/ceph/pull/16157
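For convenience, a hedged sketch of the Luminous-era BlueStore cache options in
ceph.conf (the values below are examples of trading read cache for RAM, not
recommendations):

    [osd]
    # shrink the per-OSD BlueStore cache (defaults are roughly 1 GB for HDD, 3 GB for SSD)
    bluestore_cache_size_hdd = 536870912    # 512 MB
    bluestore_cache_size_ssd = 1073741824   # 1 GB
    # bluestore_cache_size, if set, overrides both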


> 
> kind regards
> Ronny Aasen
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] luminous/bluetsore osd memory requirements

2017-08-13 Thread Nick Fisk
Hi David,

 

No serious testing, but I have had various disks fail and nodes go offline, etc. over the 
last 12 months, and I’m still only seeing 15-20% CPU max for user+system.

 

From: David Turner [mailto:drakonst...@gmail.com] 
Sent: 12 August 2017 21:20
To: n...@fisk.me.uk; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] luminous/bluetsore osd memory requirements

 

Did you do any of that testing to involve a degraded cluster, backfilling, 
peering, etc? A healthy cluster running normally uses sometimes 4x less memory 
and CPU resources as a cluster consistently peering and degraded.

 

On Sat, Aug 12, 2017, 2:40 PM Nick Fisk <n...@fisk.me.uk> wrote:

I was under the impression the memory requirements for Bluestore would be
around 2-3GB per OSD regardless of capacity.

CPU wise, I would lean towards working out how much total Ghz you require
and then get whatever CPU you need to get there, but with a preference of
Ghz over cores. Yes, there will be a slight overhead to having more threads
running on a lower number of cores, but I believe this is fairly minimal in
comparison to the speed boost obtained by the single threaded portion of the
data path in each OSD from running on a faster Ghz core. Each PG takes a
lock for each operation and so any other requests for the same PG will queue
up and be processed sequentially. The faster you can process through this
stage the better. I'm pretty sure if you graphed PG activity on an average
cluster, you would see a high skew to a certain number of PG's being hit
more often than others. I think Mark N has been experiencing the effects of
the PG locking in recent tests.

Also don't forget to make sure your CPUs are running at c-state C1 and max
Freq. This can sometimes give up to a 4x reduction in latency.

Also, if you look at the number of threads running on a OSD node, it will be
in the 10's of 100's of threads, each OSD process itself has several
threads. So don't think that 12 OSD's=12 core processor.

I did some tests to measure cpu usage per IO, which you may find useful.

http://www.sys-pro.co.uk/how-many-mhz-does-a-ceph-io-need/

I can max out 12x7.2k disks on a E3 1240 CPU and its only running at about
15-20%.

I haven't done any proper Bluestore tests, but from some rough testing the
CPU usage wasn't too dissimilar from Filestore.

Depending on if you are running hdd's or ssd's and how many per node. I
would possibly look at the single socket E3's or E5's.

Although saying that, the recent AMD and Intel announcements also have some
potentially interesting single socket Ceph potentials in the mix.

Hope that helps.

Nick

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of Stijn De Weirdt
> Sent: 12 August 2017 14:41
> To: David Turner <drakonst...@gmail.com>; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] luminous/bluetsore osd memory requirements
>
> hi david,
>
> sure i understand that. but how bad does it get when you oversubscribe
> OSDs? if context switching itself is dominant, then using HT should
> allow to run double the amount of OSDs on same CPU (on OSD per HT
> core); but if the issue is actual cpu cycles, HT won't help that much
> either (1 OSD per HT core vs 2 OSD per phys core).
>
> i guess the reason for this is that OSD processes have lots of threads?
>
> maybe i can run some tests on a ceph test cluster myself ;)
>
> stijn
>
>
> On 08/12/2017 03:13 PM, David Turner wrote:
> > The reason for an entire core peer osd is that it's trying to avoid
> > context switching your CPU to death. If you have a quad-core
> > processor with HT, I wouldn't recommend more than 8 osds on the box.
> > I probably would go with 7 myself to keep one core available for
> > system operations. This recommendation has nothing to do with GHz.
> > Higher GHz per core will likely improve your cluster latency. Of
> > course if your use case says that you only need very minimal
> > through-put... There is no need to hit or exceed the recommendation.
> > The number of cores recommendation is not changing for bluestore. It
> > might add a recommendation of how fast your processor should be...
> > But making it based on how much GHz per TB is an invitation to context
switch to death.
> >
> > On Sat, Aug 12, 2017, 8:40 AM Stijn De Weirdt <stijn.dewei...@ugent.be>
> > wrote:
> >
> >> hi all,
> >>
> >> thanks for all the feedback. it's clear we should stick to the
> >> 1GB/TB for the memory.
> >>
> >> any (changes to) recommendation for the CPU? in particular, is it
&

Re: [ceph-users] luminous/bluetsore osd memory requirements

2017-08-12 Thread Nick Fisk
I was under the impression the memory requirements for Bluestore would be
around 2-3GB per OSD regardless of capacity.

CPU wise, I would lean towards working out how much total Ghz you require
and then get whatever CPU you need to get there, but with a preference of
Ghz over cores. Yes, there will be a slight overhead to having more threads
running on a lower number of cores, but I believe this is fairly minimal in
comparison to the speed boost obtained by the single threaded portion of the
data path in each OSD from running on a faster Ghz core. Each PG takes a
lock for each operation and so any other requests for the same PG will queue
up and be processed sequentially. The faster you can process through this
stage the better. I'm pretty sure if you graphed PG activity on an average
cluster, you would see a high skew to a certain number of PG's being hit
more often than others. I think Mark N has been experiencing the effects of
the PG locking in recent tests.

Also don't forget to make sure your CPUs are running at c-state C1 and max
Freq. This can sometimes give up to a 4x reduction in latency.
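As a sketch of how that is usually done (assuming Intel boxes with intel_idle and
intel_pstate; the kernel parameters and tools below are the common route, adjust
for your distro):

    # limit C-states via the kernel command line, then reboot:
    #   intel_idle.max_cstate=1 processor.max_cstate=1
    # pin the frequency governor to performance
    cpupower frequency-set -g performance
    # verify
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor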

Also, if you look at the number of threads running on an OSD node, it will be
in the 10's of 100's of threads, as each OSD process itself has several
threads. So don't think that 12 OSD's = a 12-core processor.

I did some tests to measure cpu usage per IO, which you may find useful. 

http://www.sys-pro.co.uk/how-many-mhz-does-a-ceph-io-need/

I can max out 12x7.2k disks on an E3 1240 CPU and it's only running at about
15-20%.

I haven't done any proper Bluestore tests, but from some rough testing the
CPU usage wasn't too dissimilar from Filestore.

Depending on if you are running hdd's or ssd's and how many per node. I
would possibly look at the single socket E3's or E5's.

Although saying that, the recent AMD and Intel announcements also have some
potentially interesting single socket Ceph potentials in the mix.

Hope that helps.

Nick

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
> Of Stijn De Weirdt
> Sent: 12 August 2017 14:41
> To: David Turner ; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] luminous/bluetsore osd memory requirements
> 
> hi david,
> 
> sure i understand that. but how bad does it get when you oversubscribe 
> OSDs? if context switching itself is dominant, then using HT should 
> allow to run double the amount of OSDs on same CPU (on OSD per HT 
> core); but if the issue is actual cpu cycles, HT won't help that much 
> either (1 OSD per HT core vs 2 OSD per phys core).
> 
> i guess the reason for this is that OSD processes have lots of threads?
> 
> maybe i can run some tests on a ceph test cluster myself ;)
> 
> stijn
> 
> 
> On 08/12/2017 03:13 PM, David Turner wrote:
> > The reason for an entire core peer osd is that it's trying to avoid 
> > context switching your CPU to death. If you have a quad-core 
> > processor with HT, I wouldn't recommend more than 8 osds on the box. 
> > I probably would go with 7 myself to keep one core available for 
> > system operations. This recommendation has nothing to do with GHz. 
> > Higher GHz per core will likely improve your cluster latency. Of 
> > course if your use case says that you only need very minimal 
> > through-put... There is no need to hit or exceed the recommendation. 
> > The number of cores recommendation is not changing for bluestore. It 
> > might add a recommendation of how fast your processor should be... 
> > But making it based on how much GHz per TB is an invitation to context
switch to death.
> >
> > On Sat, Aug 12, 2017, 8:40 AM Stijn De Weirdt 
> > 
> > wrote:
> >
> >> hi all,
> >>
> >> thanks for all the feedback. it's clear we should stick to the 
> >> 1GB/TB for the memory.
> >>
> >> any (changes to) recommendation for the CPU? in particular, is it 
> >> still the rather vague "1 HT core per OSD" (or was it "1 1Ghz HT 
> >> core per OSD"? it would be nice if we had some numbers like 
> >> required specint per TB and/or per Gbs. also any indication how 
> >> much more cpu EC uses (10%, 100%, ...)?
> >>
> >> i'm aware that this also depeneds on the use case, but i'll take 
> >> any pointers i can get. we will probably end up overprovisioning, 
> >> but it would be nice if we can avoid a whole cpu (32GB dimms are 
> >> cheap, so lots of ram with single socket is really possible these
days).
> >>
> >> stijn
> >>
> >> On 08/10/2017 05:30 PM, Gregory Farnum wrote:
> >>> This has been discussed a lot in the performance meetings so I've 
> >>> added Mark to discuss. My naive recollection is that the 
> >>> per-terabyte recommendation will be more realistic  than it was in 
> >>> the past (an effective increase in memory needs), but also that it 
> >>> will be under much better control than previously.
> >>>
> >>> On Thu, Aug 10, 2017 at 1:35 AM Stijn De Weirdt
> >>> wrote:
> >>>
>  hi all,
> 
>  we are planning to purchse new OSD hardware, and we ar

Re: [ceph-users] ceph cluster experiencing major performance issues

2017-08-08 Thread Nick Fisk


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Mclean, Patrick
> Sent: 08 August 2017 20:13
> To: David Turner ; ceph-us...@ceph.com
> Cc: Colenbrander, Roelof ; Payno,
> Victor ; Yip, Rae 
> Subject: Re: [ceph-users] ceph cluster experiencing major performance
issues
> 
> On 08/08/17 10:50 AM, David Turner wrote:
> > Are you also seeing osds marking themselves down for a little bit and
> > then coming back up?  There are 2 very likely problems
> > causing/contributing to this.  The first is if you are using a lot of
> > snapshots.  Deleting snapshots is a very expensive operation for your
> > cluster and can cause a lot of slowness.  The second is PG subfolder
> > splitting.  This will show as blocked requests and osds marking
> > themselves down and coming back up a little later without any errors
> > in the log.  I linked a previous thread where someone was having these
> > problems where both causes were investigated.
> >
> > https://www.mail-archive.com/ceph-us...@lists.ceph.com/msg36923.html
> 
> We are not seeing OSDs marking themselves down a little bit and coming
> back as far as we can tell. We will do some more investigation in to this.
> 
> We are creating and deleting quite a few snapshots, is there anything we
can
> do to make this less expensive? We are going to attempt to create less
> snapshots in our systems, but unfortunately we have to create a fair
number
> due to our use case.

That's most likely your problem. Upgrade to 10.2.9 and set the
snap trim sleep option on your OSD's to somewhere around 0.1; it has a
massive effect on snapshot removal.
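A sketch of applying that at runtime (also add it to ceph.conf so it survives
OSD restarts):

    ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.1'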

> 
> Is slow snapshot deletion likely to cause a slow backlog of purged snaps?
In
> some cases we are seeing ~40k snaps still in cached_removed_snaps.
> 
> > If you have 0.94.9 or 10.2.5 or later, then you can split your PG
> > subfolders sanely while your osds are temporarily turned off using the
> > 'ceph-objectstore-tool apply-layout-settings'.  There are a lot of
> > ways to skin the cat of snap trimming, but it depends greatly on your
use
> case.
> 
> We are currently running 10.2.5, and are planning to update to 10.2.9 at
> some point soon. Our clients are using the 4.9 kernel RBD driver (which
sort
> of forces us to keep our snapshot count down below 510), we are currently
> testing the possibility of using the nbd-rbd driver as an alternative.
> 
> > On Mon, Aug 7, 2017 at 11:49 PM Mclean, Patrick
> > <patrick.mcl...@sony.com> wrote:
> >
> > High CPU utilization and inexplicably slow I/O requests
> >
> > We have been having similar performance issues across several ceph
> > clusters. When all the OSDs are up in the cluster, it can stay
HEALTH_OK
> > for a while, but eventually performance worsens and becomes (at
first
> > intermittently, but eventually continually) HEALTH_WARN due to slow
> I/O
> > request blocked for longer than 32 sec. These slow requests are
> > accompanied by "currently waiting for rw locks", but we have not
found
> > any network issue that normally is responsible for this warning.
> >
> > Examining the individual slow OSDs from `ceph health detail` has
been
> > unproductive; there don't seem to be any slow disks and if we stop
the
> > OSD the problem just moves somewhere else.
> >
> > We also think this trends with increased number of RBDs on the
clusters,
> > but not necessarily a ton of Ceph I/O. At the same time, user %CPU
time
> > spikes up to 95-100%, at first frequently and then consistently,
> > simultaneously across all cores. We are running 12 OSDs on a 2.2 GHz
> CPU
> > with 6 cores and 64GiB RAM per node.
> >
> > ceph1 ~ $ sudo ceph status
> > cluster ----
> >  health HEALTH_WARN
> > 547 requests are blocked > 32 sec
> >  monmap e1: 3 mons at
> >
> {cephmon1.XXX=XXX.XXX.XXX.XXX:/0,cephmon1.
> XXX=XXX.XXX.XXX.XX:/0,cephmon1.XXX
> =XXX.XXX.XXX.XXX:/0}
> > election epoch 16, quorum 0,1,2
> >
> cephmon1.XXX,cephmon1.X
> XX,cephmon1.XXX
> >  osdmap e577122: 72 osds: 68 up, 68 in
> > flags sortbitwise,require_jewel_osds
> >   pgmap v6799002: 4096 pgs, 4 pools, 13266 GB data, 11091
kobjects
> > 126 TB used, 368 TB / 494 TB avail
> > 4084 active+clean
> >   12 active+clean+scrubbing+deep
> >   client io 113 kB/s rd, 11486 B/s wr, 135 op/s rd, 7 op/s wr
> >
> > ceph1 ~ $ vmstat 5 5
> > procs ---memory-- ---swap-- -io -system--
> > --cpu-
> >  r  b   swpd   free   buff  cache   si   sobibo   in   cs us
sy
> > id wa st
> > 27  1  0 3112660 165544 3626169200   472  127401
22
> >

Re: [ceph-users] Kernel mounted RBD's hanging

2017-07-31 Thread Nick Fisk

> -Original Message-
> From: Ilya Dryomov [mailto:idryo...@gmail.com]
> Sent: 31 July 2017 11:36
> To: Nick Fisk 
> Cc: Ceph Users 
> Subject: Re: [ceph-users] Kernel mounted RBD's hanging
> 
> On Thu, Jul 13, 2017 at 12:54 PM, Ilya Dryomov  wrote:
> > On Wed, Jul 12, 2017 at 7:15 PM, Nick Fisk  wrote:
> >>> Hi Ilya,
> >>>
> >>> I have managed today to capture the kernel logs with debugging turned on 
> >>> and the ms+osd debug logs from the mentioned OSD.
> >>> However, this is from a few minutes after the stall starts, not
> >>> before. The very random nature of the stalls have made it difficult
> >>> to have debug logging on for extended periods of time. I hope there is 
> >>> something in here which might give a clue to what is
> happening, otherwise I will continue to try and capture debugs from before 
> the stall occurs.
> >>>
> >>> The kernel logs and OSD logs are attached as a zip.
> >>>
> >>> Sample of the hung requests
> >>> Wed 12 Jul 11:28:01 BST 2017
> >>> REQUESTS 8 homeless 0
> >>> 11457738osd37   17.be732844 [37,74,14]/37   [37,74,14]/37   
> >>> rbd_data.15d8670238e1f29.001a2068   0x4000241
> >>> 0'0 set-alloc-hint,write
> >>> 11457759osd37   17.9e3d4404 [37,74,14]/37   [37,74,14]/37   
> >>> rbd_data.15d8670238e1f29.00118149   0x4000241
> >>> 0'0 set-alloc-hint,write
> >>> 11457770osd37   17.86ec0d14 [37,72,74]/37   [37,72,74]/37   
> >>> rbd_data.15d8670238e1f29.0006665c   0x4000241
> >>> 0'0 set-alloc-hint,write
> >>> 11457818osd37   17.e80ed1c0 [37,75,14]/37   [37,75,14]/37   
> >>> rbd_data.158f204238e1f29.000d7f1e   0x4000141
> >>> 0'0 read
> >>> 11457822osd37   17.9db0a684 [37,74,14]/37   [37,74,14]/37   
> >>> rbd_data.15d8670238e1f29.211c   0x4000241
> >>> 0'0 set-alloc-hint,write
> >>> 11457916osd37   17.1848293b [37,3,73]/37[37,3,73]/37
> >>> rbd_data.158f204238e1f29.000d7f8e   0x4000241
> 0'0
> >>> set-alloc-hint,write
> >>> 11457967osd37   17.56b0f4c0 [37,75,14]/37   [37,75,14]/37   
> >>> rbd_data.158f204238e1f29.00047ffc   0x4000141
> 0'0
> >>> read
> >>> 11457970osd37   17.65ad6d40 [37,75,14]/37   [37,75,14]/37   
> >>> rbd_data.15d8670238e1f29.00060318   0x4000241
> >>> 0'0 set-alloc-hint,write
> >>>
> >>> Also a nice hung_task_timeout
> >>> [57154.424300] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> >>> disables this message.
> >>> [57154.424331] nfsdD0  1650  2 0x
> >>> [57154.424334] Call Trace:
> >>> [57154.424341]  __schedule+0x3c6/0x8c0 [57154.424344]
> >>> schedule+0x36/0x80 [57154.424346]
> >>> rwsem_down_write_failed+0x230/0x3a0
> >>> [57154.424389]  ? xfs_file_buffered_aio_write+0x68/0x270 [xfs]
> >>> [57154.424392]  call_rwsem_down_write_failed+0x17/0x30
> >>> [57154.424394]  ? call_rwsem_down_write_failed+0x17/0x30
> >>> [57154.424396]  down_write+0x2d/0x40 [57154.424422]
> >>> xfs_ilock+0xa7/0x110 [xfs] [57154.424446]  
> >>> xfs_file_buffered_aio_write+0x68/0x270 [xfs] [57154.424449]  ?
> >>> iput+0x8a/0x230 [57154.424450]  ? __check_object_size+0x100/0x19d
> >>> iput+[57154.424474]  xfs_file_write_iter+0x103/0x150 [xfs]
> >>> [57154.424482]  __do_readv_writev+0x2fb/0x390 [57154.424484]  
> >>> do_readv_writev+0x8d/0xe0 [57154.424487]  ?
> >>> security_file_open+0x8a/0x90 [57154.424488]  ? do_dentry_open+0x27a/0x310 
> >>> [57154.424511]  ?
> >>> xfs_extent_busy_ag_cmp+0x20/0x20 [xfs] [57154.424513]
> >>> vfs_writev+0x3c/0x50 [57154.424514]  ? vfs_writev+0x3c/0x50
> >>> [57154.424525]  nfsd_vfs_write+0xc6/0x3a0 [nfsd] [57154.424531]
> >>> nfsd_write+0x144/0x200 [nfsd] [57154.424538]
> >>> nfsd3_proc_write+0xaa/0x140 [nfsd] [57154.424549]
> >>> nfsd_dispatch+0xc8/0x260 [nfsd] [57154.424566]
> >>> svc_process_common+0x374/0x6b0 [sunrpc] [57154.424575]
> >>> svc_process+0xfe/0x1b0 [sunrpc] [57154.424581]  nfsd+0xe9/0x150 [nfsd] 

Re: [ceph-users] RBD cache being filled up in small increases instead of 4MB

2017-07-15 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Gregory Farnum
> Sent: 15 July 2017 00:09
> To: Ruben Rodriguez 
> Cc: ceph-users 
> Subject: Re: [ceph-users] RBD cache being filled up in small increases instead
> of 4MB
> 
> On Fri, Jul 14, 2017 at 3:43 PM, Ruben Rodriguez  wrote:
> >
> > I'm having an issue with small sequential reads (such as searching
> > through source code files, etc), and I found that multiple small reads
> > withing a 4MB boundary would fetch the same object from the OSD
> > multiple times, as it gets inserted into the RBD cache partially.
> >
> > How to reproduce: rbd image accessed from a Qemu vm using virtio-scsi,
> > writethrough cache on. Monitor with perf dump on the rbd client. The
> > image is filled up with zeroes in advance. Rbd readahead is off.
> >
> > 1 - Small read from a previously unread section of the disk:
> > dd if=/dev/sdb ibs=512 count=1 skip=41943040 iflag=skip_bytes
> > Notes: dd cannot read less than 512 bytes. The skip is arbitrary to
> > avoid the beginning of the disk, which would have been read at boot.
> >
> > Expected outcomes: perf dump should show a +1 increase on values rd,
> > cache_ops_miss and op_r. This happens correctly.
> > It should show a 4194304 increase in data_read as a whole object is
> > put into the cache. Instead it increases by 4096. (not sure why 4096, btw).
> >
> > 2 - Small read from less than 4MB distance (in the example, +5000b).
> > dd if=/dev/sdb ibs=512 count=1 skip=41948040 iflag=skip_bytes Expected
> > outcomes: perf dump should show a +1 increase on cache_ops_hit.
> > Instead cache_ops_miss increases.
> > It should show a 4194304 increase in data_read as a whole object is
> > put into the cache. Instead it increases by 4096.
> > op_r should not increase. Instead it increases by one, indicating that
> > the object was fetched again.
> >
> > My tests show that this could be causing a 6 to 20-fold performance
> > loss in small sequential reads.
> >
> > Is it by design that the RBD cache only inserts the portion requested
> > by the client instead of the whole last object fetched? Could it be a
> > tunable in any of my layers (fs, block device, qemu, rbd...) that is
> > preventing this?
> 
> I don't know the exact readahead default values in that stack, but there's no
> general reason to think RBD (or any Ceph component) will read a whole
> object at a time. In this case, you're asking for 512 bytes and it appears to
> have turned that into a 4KB read (probably the virtual block size in use?),
> which seems pretty reasonable — if you were asking for 512 bytes out of
> every 4MB and it was reading 4MB each time, you'd probably be wondering
> why you were only getting 1/8192 the expected bandwidth. ;) -Greg

I think the general readahead logic might be a bit more advanced in the Linux 
kernel than when using readahead from the librbd client. The kernel will watch how 
successful each readahead is and scale it as necessary. You might want to try 
upping read_ahead_kb for the block device in the VM. Something between 4MB 
and 32MB works well for RBD's, but make sure you have a 4.x kernel, as some fixes 
to the readahead max size were introduced there and I'm not sure if they ever got 
backported.

Unless you tell the librbd client not to disable readahead after reading the first x 
number of bytes (rbd readahead disable after bytes = 0), it will stop reading 
ahead and will only cache exactly what is requested by the client.
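For illustration, a hedged sketch of both knobs (device name and sizes are examples;
the sysfs setting is per-device and does not persist across reboots):

    # inside the VM: raise kernel readahead on the virtio-scsi disk
    echo 16384 > /sys/block/sdb/queue/read_ahead_kb    # 16 MB

    # on the librbd client (e.g. the qemu host), in ceph.conf:
    [client]
    rbd readahead disable after bytes = 0
    rbd readahead max bytes = 4194304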

> 
> >
> > Regards,
> > --
> > Ruben Rodriguez | Senior Systems Administrator, Free Software
> > Foundation GPG Key: 05EF 1D2F FE61 747D 1FC8  27C3 7FAC 7D26 472F
> 4409
> > https://fsf.org | https://gnu.org
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph mount rbd

2017-07-14 Thread Nick Fisk


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Jason Dillaman
> Sent: 14 July 2017 16:40
> To: li...@marcelofrota.info
> Cc: ceph-users 
> Subject: Re: [ceph-users] Ceph mount rbd
> 
> On Fri, Jul 14, 2017 at 9:44 AM,   wrote:
> > Gonzalo,
> >
> >
> >
> > You are right, i told so much about my enviroment actual and maybe i
> > didn't know explain my problem the better form, with ceph in the
> > moment, mutiple hosts clients can mount and write datas in my system
> > and this is one problem, because i could have filesystem corruption.
> >
> >
> >
> > Example, today, if runing the comand in two machines in the same time,
> > it will work.
> >
> >
> >
> > mount /dev/rbd0 /mnt/veeamrepo
> >
> > cd /mnt/veeamrepo ; touch testfile.txt
> >
> >
> >
> > I need ensure, only one machine will can execute this.
> >
> 
> A user could do the same thing with any number of remote block devices (i.e. 
> I could map an iSCSI target multiple times). As I said
> before, you can use the "exclusive" option available since kernel 4.12, roll 
> your own solution using the advisory locks available from
> the rbd CLI, or just use CephFS if you want to be able to access a file 
> system on multiple hosts.

Pacemaker will also prevent an RBD from being mounted multiple times, if you want to 
manage the fencing outside of Ceph.
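For reference, a sketch of the advisory-lock route Jason mentions (the lock and
locker ids below are examples; advisory locks are purely cooperative, so whatever
maps and mounts the image has to check them itself). With a 4.12+ kernel the
non-cooperative option is passed at map time, e.g. 'rbd map -o exclusive
rbd/veeamrepo':

    rbd lock add rbd/veeamrepo myhost                    # take an advisory lock before mapping
    rbd lock ls rbd/veeamrepo                            # check for an existing holder first
    rbd lock remove rbd/veeamrepo myhost client.4123     # release; the locker id comes from 'lock ls'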

> 
> >
> > Thanks a lot,
> >
> > Marcelo
> >
> >
> > Em 14/07/2017, Gonzalo Aguilar Delgado 
> > escreveu:
> >
> >
> >> Hi,
> >>
> >> Why you would like to maintain copies by yourself. You replicate on
> >> ceph and then on different files inside ceph? Let ceph take care of 
> >> counting.
> >> Create a pool with 3 or more copies and let ceph take care of what's
> >> stored and where.
> >>
> >> Best regards,
> >>
> >>
> >> El 13/07/17 a las 17:06, li...@marcelofrota.info escribió:
> >> >
> >> > I will explain More about my system actual, in the moment i have 2
> >> > machines using drbd in mode master/slave and i running the
> >> > aplication in machine master, but existing 2 questions importants
> >> > in my enviroment with drbd actualy :
> >> >
> >> > 1 - If machine one is master and mounting partitions, the slave
> >> > don't can mount the system, Unless it happens one problem in
> >> > machine master, this is one mode, to prevent write in filesystem
> >> > incorrect
> >> >
> >> > 2 - When i write data in machine master in drbd, the drbd write
> >> > datas in slave machine Automatically, with this, if one problem
> >> > happens in node master, the machine slave have coppy the data.
> >> >
> >> > In the moment, in my enviroment testing with ceph, using the
> >> > version
> >> > 4.10 of kernel and i mount the system in two machines in the same
> >> > time, in production enviroment, i could serious problem with this
> >> > comportament.
> >> >
> >> > How can i use the ceph and Ensure that I could get these 2
> >> > behaviors kept in a new environment with Ceph?
> >> >
> >> > Thanks a lot,
> >> >
> >> > Marcelo
> >> >
> >> >
> >> > Em 28/06/2017, Jason Dillaman  escreveu:
> >> > > ... additionally, the forthcoming 4.12 kernel release will
> >> > > support non-cooperative exclusive locking. By default, since 4.9,
> >> > > when the exclusive-lock feature is enabled, only a single client
> >> > > can write to
> >> > the
> >> > > block device at a time -- but they will cooperatively pass the
> >> > > lock
> >> > back
> >> > > and forth upon write request. With the new "rbd map" option, you
> >> > > can
> >> > map a
> >> > > image on exactly one host and prevent other hosts from mapping
> >> > > the
> >> > image.
> >> > > If that host should die, the exclusive-lock will automatically
> >> > > become available to other hosts for mapping.
> >> > >
> >> > > Of course, I always have to ask the use-case behind mapping the
> >> > > same
> >> > image
> >> > > on multiple hosts. Perhaps CephFS would be a better fit if you
> >> > > are
> >> > trying
> >> > > to serve out a filesystem?
> >> > >
> >> > > On Wed, Jun 28, 2017 at 6:25 PM, Maged Mokhtar
> >> >  wrote:
> >> > >
> >> > > > On 2017-06-28 22:55, li...@marcelofrota.info wrote:
> >> > > >
> >> > > > Hi People,
> >> > > >
> >> > > > I am testing the new enviroment, with ceph + rbd with ubuntu
> >> > 16.04, and i
> >> > > > have one question.
> >> > > >
> >> > > > I have my cluster ceph and mount the using the comands to ceph
> >> > > > in
> >> > my linux
> >> > > > enviroment :
> >> > > >
> >> > > > rbd create veeamrepo --size 20480 rbd --image veeamrepo info
> >> > > > modprobe rbd rbd map veeamrepo rbd feature disable veeamrepo
> >> > > > exclusive-lock object-map fast-diff deep-flatten mkdir
> >> > > > /mnt/veeamrepo mount /dev/rbd0 /mnt/veeamrepo
> >> > > >
> >> > > > The comands work fine, but i have one problem, in the moment, i
> >> > can mount
> >> > > > the /mnt/veeamrepo in the same time in 2 machines, and this is
> >> > > > a
> >> > bad option
> >> > > > for me in the moment, because this could generate 

Re: [ceph-users] Kernel mounted RBD's hanging

2017-07-12 Thread Nick Fisk


> -Original Message-
> From: Nick Fisk [mailto:n...@fisk.me.uk]
> Sent: 12 July 2017 13:47
> To: 'Ilya Dryomov' 
> Cc: 'Ceph Users' 
> Subject: RE: [ceph-users] Kernel mounted RBD's hanging
> 
> > -Original Message-
> > From: Nick Fisk [mailto:n...@fisk.me.uk]
> > Sent: 08 July 2017 21:50
> > To: 'Ilya Dryomov' 
> > Cc: 'Ceph Users' 
> > Subject: RE: [ceph-users] Kernel mounted RBD's hanging
> >
> > > -----Original Message-
> > > From: Ilya Dryomov [mailto:idryo...@gmail.com]
> > > Sent: 07 July 2017 11:32
> > > To: Nick Fisk 
> > > Cc: Ceph Users 
> > > Subject: Re: [ceph-users] Kernel mounted RBD's hanging
> > >
> > > On Fri, Jul 7, 2017 at 12:10 PM, Nick Fisk  wrote:
> > > > Managed to catch another one, osd.75 again, not sure if that is an
> > > indication of anything or just a co-incidence. osd.75 is one of 8
> > > OSD's in a cache tier, so all IO will be funnelled through them.
> > > >
> > > >
> > > >
> > > > Also found this in the log of osd.75 at the same time, but the
> > > > client IP is not
> > > the same as the node which experienced the hang.
> > >
> > > Can you bump debug_ms and debug_osd to 30 on osd75?  I doubt it's an
> > > issue with that particular OSD, but if it goes down the same way
> > > again, I'd have something to look at.  Make sure logrotate is
> > > configured and working before doing that though... ;)
> > >
> > > Thanks,
> > >
> > > Ilya
> >
> > So, osd.75 was a coincidence, several other hangs have had outstanding
> > requests to other OSD's. I haven't been able to get the debug logs of
> > the OSD during a hang yet because of this. Although I think the crc problem 
> > may now be fixed, by upgrading all clients to 4.11.1+.
> >
> > Here is a series of osdc dumps every minute during one of the hangs
> > with a different target OSD. The osdc dumps on another node show IO
> > being processed normally whilst the other node hangs, so the cluster
> > is definitely handling IO fine whilst the other node hangs. And as I am 
> > using cache tiering with proxying, all IO will be going through
> just 8 OSD's. The host has 3 RBD's mounted and all 3 hang.
> >
> > Latest hang:
> > Sat  8 Jul 18:49:01 BST 2017
> > REQUESTS 4 homeless 0
> > 174662831   osd25   17.77737285 [25,74,14]/25   [25,74,14]/25   
> > rbd_data.15d8670238e1f29.000cf9f8   0x4000241
> > 0'0 set-alloc-hint,write
> > 174662863   osd25   17.7b91a345 [25,74,14]/25   [25,74,14]/25   
> > rbd_data.1555406238e1f29.0002571c   0x4000241
> > 0'0 set-alloc-hint,write
> > 174662887   osd25   17.6c2eaa93 [25,75,14]/25   [25,75,14]/25   
> > rbd_data.158f204238e1f29.0008   0x4000241
> > 0'0 set-alloc-hint,write
> > 174662925   osd25   17.32271445 [25,74,14]/25   [25,74,14]/25   
> > rbd_data.1555406238e1f29.0001   0x4000241
> > 0'0 set-alloc-hint,write
> > LINGER REQUESTS
> > 18446462598732840990osd74   17.145baa0f [74,72,14]/74   
> > [74,72,14]/74   rbd_header.158f204238e1f29  0x208   WC/0
> > 18446462598732840991osd74   17.7b4e2a06 [74,72,25]/74   
> > [74,72,25]/74   rbd_header.1555406238e1f29  0x209   WC/0
> > 18446462598732840992osd74   17.eea94d58 [74,73,25]/74   
> > [74,73,25]/74   rbd_header.15d8670238e1f29  0x208   WC/0
> > Sat  8 Jul 18:50:01 BST 2017
> > REQUESTS 5 homeless 0
> > 174662831   osd25   17.77737285 [25,74,14]/25   [25,74,14]/25   
> > rbd_data.15d8670238e1f29.000cf9f8   0x4000241
> > 0'0 set-alloc-hint,write
> > 174662863   osd25   17.7b91a345 [25,74,14]/25   [25,74,14]/25   
> > rbd_data.1555406238e1f29.0002571c   0x4000241
> > 0'0 set-alloc-hint,write
> > 174662887   osd25   17.6c2eaa93 [25,75,14]/25   [25,75,14]/25   
> > rbd_data.158f204238e1f29.0008   0x4000241
> > 0'0 set-alloc-hint,write
> > 174662925   osd25   17.32271445 [25,74,14]/25   [25,74,14]/25   
> > rbd_data.1555406238e1f29.0001   0x4000241
> > 0'0 set-alloc-hint,write
> > 174663129   osd25   17.32271445 [25,74,14]/25   [25,74,14]/25   
>

Re: [ceph-users] Kernel mounted RBD's hanging

2017-07-08 Thread Nick Fisk
> -Original Message-
> From: Ilya Dryomov [mailto:idryo...@gmail.com]
> Sent: 07 July 2017 11:32
> To: Nick Fisk 
> Cc: Ceph Users 
> Subject: Re: [ceph-users] Kernel mounted RBD's hanging
> 
> On Fri, Jul 7, 2017 at 12:10 PM, Nick Fisk  wrote:
> > Managed to catch another one, osd.75 again, not sure if that is an
> indication of anything or just a co-incidence. osd.75 is one of 8 OSD's in a
> cache tier, so all IO will be funnelled through them.
> >
> >
> >
> > Also found this in the log of osd.75 at the same time, but the client IP is 
> > not
> the same as the node which experienced the hang.
> 
> Can you bump debug_ms and debug_osd to 30 on osd75?  I doubt it's an
> issue with that particular OSD, but if it goes down the same way again, I'd
> have something to look at.  Make sure logrotate is configured and working
> before doing that though... ;)
> 
> Thanks,
> 
> Ilya

So, osd.75 was a coincidence, several other hangs have had outstanding requests 
to other OSD's. I haven't been able to get the debug logs of the OSD during a 
hang yet because of this. Although I think the crc problem may now be fixed, by 
upgrading all clients to 4.11.1+.

Here is a series of osdc dumps every minute during one of the hangs with a 
different target OSD. The osdc dumps on another node show IO being processed 
normally whilst the other node hangs, so the cluster is definitely handling IO 
fine whilst the other node hangs. And as I am using cache tiering with 
proxying, all IO will be going through just 8 OSD's. The host has 3 RBD's 
mounted and all 3 hang.
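For anyone wanting to capture the same thing, these dumps come from the
krbd/libceph debugfs interface; a sketch, assuming debugfs is mounted in the
usual place:

    watch -n 60 'date; cat /sys/kernel/debug/ceph/*/osdc'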

Latest hang:
Sat  8 Jul 18:49:01 BST 2017
REQUESTS 4 homeless 0
174662831   osd25   17.77737285 [25,74,14]/25   [25,74,14]/25   
rbd_data.15d8670238e1f29.000cf9f8   0x4000241   0'0 
set-alloc-hint,write
174662863   osd25   17.7b91a345 [25,74,14]/25   [25,74,14]/25   
rbd_data.1555406238e1f29.0002571c   0x4000241   0'0 
set-alloc-hint,write
174662887   osd25   17.6c2eaa93 [25,75,14]/25   [25,75,14]/25   
rbd_data.158f204238e1f29.0008   0x4000241   0'0 
set-alloc-hint,write
174662925   osd25   17.32271445 [25,74,14]/25   [25,74,14]/25   
rbd_data.1555406238e1f29.0001   0x4000241   0'0 
set-alloc-hint,write
LINGER REQUESTS
18446462598732840990osd74   17.145baa0f [74,72,14]/74   [74,72,14]/74   
rbd_header.158f204238e1f29  0x208   WC/0
18446462598732840991osd74   17.7b4e2a06 [74,72,25]/74   [74,72,25]/74   
rbd_header.1555406238e1f29  0x209   WC/0
18446462598732840992osd74   17.eea94d58 [74,73,25]/74   [74,73,25]/74   
rbd_header.15d8670238e1f29  0x208   WC/0
Sat  8 Jul 18:50:01 BST 2017
REQUESTS 5 homeless 0
174662831   osd25   17.77737285 [25,74,14]/25   [25,74,14]/25   
rbd_data.15d8670238e1f29.000cf9f8   0x4000241   0'0 
set-alloc-hint,write
174662863   osd25   17.7b91a345 [25,74,14]/25   [25,74,14]/25   
rbd_data.1555406238e1f29.0002571c   0x4000241   0'0 
set-alloc-hint,write
174662887   osd25   17.6c2eaa93 [25,75,14]/25   [25,75,14]/25   
rbd_data.158f204238e1f29.0008   0x4000241   0'0 
set-alloc-hint,write
174662925   osd25   17.32271445 [25,74,14]/25   [25,74,14]/25   
rbd_data.1555406238e1f29.0001   0x4000241   0'0 
set-alloc-hint,write
174663129   osd25   17.32271445 [25,74,14]/25   [25,74,14]/25   
rbd_data.1555406238e1f29.0001   0x4000241   0'0 
set-alloc-hint,write
LINGER REQUESTS
18446462598732840990osd74   17.145baa0f [74,72,14]/74   [74,72,14]/74   
rbd_header.158f204238e1f29  0x208   WC/0
18446462598732840991osd74   17.7b4e2a06 [74,72,25]/74   [74,72,25]/74   
rbd_header.1555406238e1f29  0x209   WC/0
18446462598732840992osd74   17.eea94d58 [74,73,25]/74   [74,73,25]/74   
rbd_header.15d8670238e1f29  0x208   WC/0
Sat  8 Jul 18:51:01 BST 2017
REQUESTS 5 homeless 0
174662831   osd25   17.77737285 [25,74,14]/25   [25,74,14]/25   
rbd_data.15d8670238e1f29.000cf9f8   0x4000241   0'0 
set-alloc-hint,write
174662863   osd25   17.7b91a345 [25,74,14]/25   [25,74,14]/25   
rbd_data.1555406238e1f29.0002571c   0x4000241   0'0 
set-alloc-hint,write
174662887   osd25   17.6c2eaa93 [25,75,14]/25   [25,75,14]/25   
rbd_data.158f204238e1f29.0008   0x4000241   0'0 
set-alloc-hint,write
174662925   osd25   17.32271445 [25,74,14]/25   [25,74,14]/25   
rbd_data.1555406238e

Re: [ceph-users] Kernel mounted RBD's hanging

2017-07-07 Thread Nick Fisk
> -Original Message-
> From: Ilya Dryomov [mailto:idryo...@gmail.com]
> Sent: 01 July 2017 13:19
> To: Nick Fisk 
> Cc: Ceph Users 
> Subject: Re: [ceph-users] Kernel mounted RBD's hanging
> 
> On Sat, Jul 1, 2017 at 9:29 AM, Nick Fisk  wrote:
> >> -Original Message-
> >> From: Ilya Dryomov [mailto:idryo...@gmail.com]
> >> Sent: 30 June 2017 14:06
> >> To: Nick Fisk 
> >> Cc: Ceph Users 
> >> Subject: Re: [ceph-users] Kernel mounted RBD's hanging
> >>
> >> On Fri, Jun 30, 2017 at 2:14 PM, Nick Fisk  wrote:
> >> >
> >> >
> >> >> -Original Message-
> >> >> From: Ilya Dryomov [mailto:idryo...@gmail.com]
> >> >> Sent: 29 June 2017 18:54
> >> >> To: Nick Fisk 
> >> >> Cc: Ceph Users 
> >> >> Subject: Re: [ceph-users] Kernel mounted RBD's hanging
> >> >>
> >> >> On Thu, Jun 29, 2017 at 6:22 PM, Nick Fisk  wrote:
> >> >> >> -Original Message-----
> >> >> >> From: Ilya Dryomov [mailto:idryo...@gmail.com]
> >> >> >> Sent: 29 June 2017 16:58
> >> >> >> To: Nick Fisk 
> >> >> >> Cc: Ceph Users 
> >> >> >> Subject: Re: [ceph-users] Kernel mounted RBD's hanging
> >> >> >>
> >> >> >> On Thu, Jun 29, 2017 at 4:30 PM, Nick Fisk  wrote:
> >> >> >> > Hi All,
> >> >> >> >
> >> >> >> > Putting out a call for help to see if anyone can shed some
> >> >> >> > light on
> >> this.
> >> >> >> >
> >> >> >> > Configuration:
> >> >> >> > Ceph cluster presenting RBD's->XFS->NFS->ESXi Running 10.2.7
> >> >> >> > on the OSD's and 4.11 kernel on the NFS gateways in a
> >> >> >> > pacemaker cluster Both OSD's and clients are go into a pair
> >> >> >> > of switches, single L2 domain (no sign from pacemaker that
> >> >> >> > there is network connectivity
> >> >> >> > issues)
> >> >> >> >
> >> >> >> > Symptoms:
> >> >> >> > - All RBD's on a single client randomly hang for 30s to
> >> >> >> > several minutes, confirmed by pacemaker and ESXi hosts
> >> >> >> > complaining
> >> >> >>
> >> >> >> Hi Nick,
> >> >> >>
> >> >> >> What is a "single client" here?
> >> >> >
> >> >> > I mean a node of the pacemaker cluster. So all RBD's on the same
> >> >> pacemaker node hang.
> >> >> >
> >> >> >>
> >> >> >> > - Cluster load is minimal when this happens most times
> >> >> >>
> >> >> >> Can you post gateway syslog and point at when this happened?
> >> >> >> Corresponding pacemaker excerpts won't hurt either.
> >> >> >
> >> >> > Jun 28 16:35:38 MS-CEPH-Proxy1 lrmd[2026]:  warning:
> >> >> > p_export_ceph-
> >> >> ds1_monitor_6 process (PID 17754) timed out
> >> >> > Jun 28 16:35:43 MS-CEPH-Proxy1 lrmd[2026]: crit: p_export_ceph-
> >> >> ds1_monitor_6 process (PID 17754) will not die!
> >> >> > Jun 28 16:43:51 MS-CEPH-Proxy1 lrmd[2026]:  warning:
> >> >> > p_export_ceph-ds1_monitor_6:17754 - timed out after 3ms
> >> Jun
> >> >> 28 16:43:52 MS-CEPH-Proxy1 IPaddr(p_vip_ceph-ds1)[28482]: INFO:
> >> >> ifconfig
> >> >> ens224:0 down
> >> >> > Jun 28 16:43:52 MS-CEPH-Proxy1 lrmd[2026]:   notice: p_vip_ceph-
> >> >> ds1_stop_0:28482:stderr [ SIOCDELRT: No such process ]
> >> >> > Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]:   notice: Operation
> >> >> p_vip_ceph-ds1_stop_0: ok (node=MS-CEPH-Proxy1, call=471, rc=0,
> >> >> cib- update=318, confirmed=true)
> >> >> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]:
> >> >> INFO: Un-exporting file system ...
> >> >> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]:
> >> >> > INFO: unexporting 10.3.20.0/24:/mnt/Ceph-DS1 Jun 28 16:43:52
> >> >> 

Re: [ceph-users] Kernel mounted RBD's hanging

2017-07-01 Thread Nick Fisk
> -Original Message-
> From: Ilya Dryomov [mailto:idryo...@gmail.com]
> Sent: 30 June 2017 14:06
> To: Nick Fisk 
> Cc: Ceph Users 
> Subject: Re: [ceph-users] Kernel mounted RBD's hanging
> 
> On Fri, Jun 30, 2017 at 2:14 PM, Nick Fisk  wrote:
> >
> >
> >> -Original Message-
> >> From: Ilya Dryomov [mailto:idryo...@gmail.com]
> >> Sent: 29 June 2017 18:54
> >> To: Nick Fisk 
> >> Cc: Ceph Users 
> >> Subject: Re: [ceph-users] Kernel mounted RBD's hanging
> >>
> >> On Thu, Jun 29, 2017 at 6:22 PM, Nick Fisk  wrote:
> >> >> -----Original Message-
> >> >> From: Ilya Dryomov [mailto:idryo...@gmail.com]
> >> >> Sent: 29 June 2017 16:58
> >> >> To: Nick Fisk 
> >> >> Cc: Ceph Users 
> >> >> Subject: Re: [ceph-users] Kernel mounted RBD's hanging
> >> >>
> >> >> On Thu, Jun 29, 2017 at 4:30 PM, Nick Fisk  wrote:
> >> >> > Hi All,
> >> >> >
> >> >> > Putting out a call for help to see if anyone can shed some light on
> this.
> >> >> >
> >> >> > Configuration:
> >> >> > Ceph cluster presenting RBD's->XFS->NFS->ESXi Running 10.2.7 on
> >> >> > the OSD's and 4.11 kernel on the NFS gateways in a pacemaker
> >> >> > cluster Both OSD's and clients are go into a pair of switches,
> >> >> > single L2 domain (no sign from pacemaker that there is network
> >> >> > connectivity
> >> >> > issues)
> >> >> >
> >> >> > Symptoms:
> >> >> > - All RBD's on a single client randomly hang for 30s to several
> >> >> > minutes, confirmed by pacemaker and ESXi hosts complaining
> >> >>
> >> >> Hi Nick,
> >> >>
> >> >> What is a "single client" here?
> >> >
> >> > I mean a node of the pacemaker cluster. So all RBD's on the same
> >> pacemaker node hang.
> >> >
> >> >>
> >> >> > - Cluster load is minimal when this happens most times
> >> >>
> >> >> Can you post gateway syslog and point at when this happened?
> >> >> Corresponding pacemaker excerpts won't hurt either.
> >> >
> >> > Jun 28 16:35:38 MS-CEPH-Proxy1 lrmd[2026]:  warning: p_export_ceph-
> >> ds1_monitor_6 process (PID 17754) timed out
> >> > Jun 28 16:35:43 MS-CEPH-Proxy1 lrmd[2026]: crit: p_export_ceph-
> >> ds1_monitor_6 process (PID 17754) will not die!
> >> > Jun 28 16:43:51 MS-CEPH-Proxy1 lrmd[2026]:  warning:
> >> > p_export_ceph-ds1_monitor_6:17754 - timed out after 3ms
> Jun
> >> 28 16:43:52 MS-CEPH-Proxy1 IPaddr(p_vip_ceph-ds1)[28482]: INFO:
> >> ifconfig
> >> ens224:0 down
> >> > Jun 28 16:43:52 MS-CEPH-Proxy1 lrmd[2026]:   notice: p_vip_ceph-
> >> ds1_stop_0:28482:stderr [ SIOCDELRT: No such process ]
> >> > Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]:   notice: Operation
> >> p_vip_ceph-ds1_stop_0: ok (node=MS-CEPH-Proxy1, call=471, rc=0, cib-
> >> update=318, confirmed=true)
> >> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]:
> >> INFO: Un-exporting file system ...
> >> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]:
> >> > INFO: unexporting 10.3.20.0/24:/mnt/Ceph-DS1 Jun 28 16:43:52
> >> > MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO: Unlocked
> >> > NFS
> >> export /mnt/Ceph-DS1 Jun 28 16:43:52 MS-CEPH-Proxy1
> >> exportfs(p_export_ceph-ds1)[28499]: INFO: Un-exported file system(s)
> >> > Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]:   notice: Operation
> >> p_export_ceph-ds1_stop_0: ok (node=MS-CEPH-Proxy1, call=473, rc=0,
> >> cib- update=319, confirmed=true)
> >> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28549]:
> >> INFO: Exporting file system(s) ...
> >> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28549]:
> >> > INFO: exporting 10.3.20.0/24:/mnt/Ceph-DS1 Jun 28 16:43:52 MS-CEPH-
> >> Proxy1 exportfs(p_export_ceph-ds1)[28549]: INFO: directory
> >> /mnt/Ceph-DS1 exported
> >> > Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]:   notice: Operation
> >> p_export_ceph-ds1_start_0: ok (node=MS-CEPH-Proxy1, call=474, rc=0,
> >> cib- up

Re: [ceph-users] Kernel mounted RBD's hanging

2017-06-30 Thread Nick Fisk


> -Original Message-
> From: Ilya Dryomov [mailto:idryo...@gmail.com]
> Sent: 29 June 2017 18:54
> To: Nick Fisk 
> Cc: Ceph Users 
> Subject: Re: [ceph-users] Kernel mounted RBD's hanging
> 
> On Thu, Jun 29, 2017 at 6:22 PM, Nick Fisk  wrote:
> >> -Original Message-
> >> From: Ilya Dryomov [mailto:idryo...@gmail.com]
> >> Sent: 29 June 2017 16:58
> >> To: Nick Fisk 
> >> Cc: Ceph Users 
> >> Subject: Re: [ceph-users] Kernel mounted RBD's hanging
> >>
> >> On Thu, Jun 29, 2017 at 4:30 PM, Nick Fisk  wrote:
> >> > Hi All,
> >> >
> >> > Putting out a call for help to see if anyone can shed some light on this.
> >> >
> >> > Configuration:
> >> > Ceph cluster presenting RBD's->XFS->NFS->ESXi Running 10.2.7 on the
> >> > OSD's and 4.11 kernel on the NFS gateways in a pacemaker cluster
> >> > Both OSD's and clients are go into a pair of switches, single L2
> >> > domain (no sign from pacemaker that there is network connectivity
> >> > issues)
> >> >
> >> > Symptoms:
> >> > - All RBD's on a single client randomly hang for 30s to several
> >> > minutes, confirmed by pacemaker and ESXi hosts complaining
> >>
> >> Hi Nick,
> >>
> >> What is a "single client" here?
> >
> > I mean a node of the pacemaker cluster. So all RBD's on the same
> pacemaker node hang.
> >
> >>
> >> > - Cluster load is minimal when this happens most times
> >>
> >> Can you post gateway syslog and point at when this happened?
> >> Corresponding pacemaker excerpts won't hurt either.
> >
> > Jun 28 16:35:38 MS-CEPH-Proxy1 lrmd[2026]:  warning: p_export_ceph-
> ds1_monitor_6 process (PID 17754) timed out
> > Jun 28 16:35:43 MS-CEPH-Proxy1 lrmd[2026]: crit: p_export_ceph-
> ds1_monitor_6 process (PID 17754) will not die!
> > Jun 28 16:43:51 MS-CEPH-Proxy1 lrmd[2026]:  warning:
> > p_export_ceph-ds1_monitor_6:17754 - timed out after 3ms Jun
> 28 16:43:52 MS-CEPH-Proxy1 IPaddr(p_vip_ceph-ds1)[28482]: INFO: ifconfig
> ens224:0 down
> > Jun 28 16:43:52 MS-CEPH-Proxy1 lrmd[2026]:   notice: p_vip_ceph-
> ds1_stop_0:28482:stderr [ SIOCDELRT: No such process ]
> > Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]:   notice: Operation
> p_vip_ceph-ds1_stop_0: ok (node=MS-CEPH-Proxy1, call=471, rc=0, cib-
> update=318, confirmed=true)
> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]:
> INFO: Un-exporting file system ...
> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]:
> > INFO: unexporting 10.3.20.0/24:/mnt/Ceph-DS1 Jun 28 16:43:52
> > MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO: Unlocked NFS
> export /mnt/Ceph-DS1 Jun 28 16:43:52 MS-CEPH-Proxy1
> exportfs(p_export_ceph-ds1)[28499]: INFO: Un-exported file system(s)
> > Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]:   notice: Operation
> p_export_ceph-ds1_stop_0: ok (node=MS-CEPH-Proxy1, call=473, rc=0, cib-
> update=319, confirmed=true)
> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28549]:
> INFO: Exporting file system(s) ...
> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28549]:
> > INFO: exporting 10.3.20.0/24:/mnt/Ceph-DS1 Jun 28 16:43:52 MS-CEPH-
> Proxy1 exportfs(p_export_ceph-ds1)[28549]: INFO: directory /mnt/Ceph-DS1
> exported
> > Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]:   notice: Operation
> p_export_ceph-ds1_start_0: ok (node=MS-CEPH-Proxy1, call=474, rc=0, cib-
> update=320, confirmed=true)
> >
> > If I enable the read/write checks for the FS resource, they also timeout at
> the same time.
> 
> What about syslog that the above corresponds to?

I get exactly the same "_monitor" timeout message.

Is there anything logging wise I can do with the kernel client to log when an 
IO is taking a long time. Sort of like the slow requests in Ceph, but client 
side?
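
One rough way to get something like client-side slow request logging with the 
kernel client is to poll the libceph debugfs state. The sketch below is only 
an illustration and assumes debugfs is mounted at /sys/kernel/debug and that 
the kernel client exposes an 'osdc' file listing in-flight OSD requests; the 
exact line format varies by kernel version, so it simply reports any request 
line that stays present for longer than a threshold.

#!/usr/bin/env python3
# Sketch: flag kernel-RBD requests that stay in flight too long.
# Assumes debugfs is mounted and each libceph client dir contains an 'osdc'
# file (one line per in-flight OSD request); line format varies by kernel.
import glob, time

THRESHOLD = 5.0   # seconds before a request is reported as "slow"
first_seen = {}   # (path, request line) -> timestamp first observed

while True:
    now = time.time()
    current = set()
    for path in glob.glob('/sys/kernel/debug/ceph/*/osdc'):
        try:
            with open(path) as f:
                current.update((path, l.strip()) for l in f if l.strip())
        except OSError:
            continue  # client went away between glob() and open()
    for key in current:
        first_seen.setdefault(key, now)
        age = now - first_seen[key]
        if age > THRESHOLD:
            print('%s in flight for %.1fs: %s' % (key[0], age, key[1]))
    for key in list(first_seen):
        if key not in current:
            del first_seen[key]   # request completed, forget it
    time.sleep(1)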

> 
> Thanks,
> 
> Ilya

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kernel mounted RBD's hanging

2017-06-30 Thread Nick Fisk
From: Alex Gorbachev [mailto:a...@iss-integration.com] 
Sent: 30 June 2017 03:54
To: Ceph Users ; n...@fisk.me.uk
Subject: Re: [ceph-users] Kernel mounted RBD's hanging

 

 

On Thu, Jun 29, 2017 at 10:30 AM Nick Fisk <n...@fisk.me.uk> wrote:

Hi All,

Putting out a call for help to see if anyone can shed some light on this.

Configuration:
Ceph cluster presenting RBD's->XFS->NFS->ESXi
Running 10.2.7 on the OSD's and 4.11 kernel on the NFS gateways in a
pacemaker cluster
Both OSD's and clients go into a pair of switches, in a single L2 domain (no
sign from pacemaker that there are network connectivity issues)

Symptoms:
- All RBD's on a single client randomly hang for 30s to several minutes,
confirmed by pacemaker and ESXi hosts complaining
- Cluster load is minimal when this happens most times
- All other clients with RBD's are not affected (Same RADOS pool), so its
seems more of a client issue than cluster issue
- It looks like pacemaker tries to also stop RBD+FS resource, but this also
hangs
- Eventually pacemaker succeeds in stopping resources and immediately
restarts them, IO returns to normal
- No errors, slow requests, or any other non normal Ceph status is reported
on the cluster or ceph.log
- Client logs show nothing apart from pacemaker

Things I've tried:
- Different kernels (potentially happened less with older kernels, but can't
be 100% sure)
- Disabling scrubbing and anything else that could be causing high load
- Enabling Kernel RBD debugging (Problem maybe happens a couple of times a
day, debug logging was not practical as I can't reproduce on demand)

Anyone have any ideas?

 

Nick, are you using any network aggregation, LACP?  Can you drop to the simplest 
possible configuration to make sure there's nothing on the network switch side?

 

Hi Alex,

 

The OSD nodes are in an active/backup bond and the active NIC on each one goes 
into the same switch. The NFS gateways are currently VM’s, but again the 
hypervisor is using a NIC on the same switch. The cluster and public networks 
are VLANs on the same NIC and I don’t get any alerts from monitoring/pacemaker 
to suggest there are comms issues. But I will look into getting some ping logs 
done to see if they reveal anything.
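
In case it is useful, the ping logging can be as simple as a timestamped probe 
from each gateway to the OSD/mon addresses that only records the outliers. 
This is just a sketch using the system ping binary; the host list and the 
5ms/1s thresholds are made up for illustration, not taken from this setup.

#!/usr/bin/env python3
# Sketch: log only slow or lost pings from an NFS gateway to the Ceph nodes,
# so hang windows can be correlated with network blips afterwards.
import datetime, subprocess, time

HOSTS = ['10.3.20.11', '10.3.20.12', '10.3.20.13']   # illustrative addresses
WARN_MS = 5.0

def ping_once(host, timeout_s=1):
    """Return round-trip time in ms, or None if the ping failed/timed out."""
    try:
        out = subprocess.run(['ping', '-c', '1', '-W', str(timeout_s), host],
                             capture_output=True, text=True,
                             timeout=timeout_s + 2)
    except subprocess.TimeoutExpired:
        return None
    for token in out.stdout.split():
        if token.startswith('time='):
            return float(token.split('=')[1])
    return None

while True:
    for host in HOSTS:
        rtt = ping_once(host)
        if rtt is None or rtt > WARN_MS:
            print('%s %s %s' % (datetime.datetime.now().isoformat(), host,
                                'LOST' if rtt is None else '%.1fms' % rtt))
    time.sleep(1)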

 

 

Do you check the ceph.log for any anomalies?

 

Yep, completely clean

 

Any occurrences on OSD nodes, anything in their OSD logs or syslogs?

 

Not that I can see. I’m using cache tiering, so all IO travels through a few 
OSD’s. I guess this might make it easier to try and see what’s going on. But 
the random nature of it means it’s not always easy to catch.

 

Any odd page cache settings on the clients?

 

The only customizations on the clients are readahead, some TCP tunings and min 
free kbytes.
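
For reference, those client-side knobs live in procfs/sysfs, so they are easy 
to snapshot and diff between the two gateway nodes when chasing this sort of 
thing. The exact sysctls tuned here aren't listed, so the set below is just an 
illustrative sketch:

#!/usr/bin/env python3
# Sketch: dump the client-side tunables mentioned above (readahead, TCP
# buffers, min_free_kbytes) so they can be compared across gateway nodes.
import glob

paths = ['/proc/sys/vm/min_free_kbytes',
         '/proc/sys/net/ipv4/tcp_rmem',
         '/proc/sys/net/ipv4/tcp_wmem']
paths += glob.glob('/sys/block/rbd*/queue/read_ahead_kb')   # mapped RBDs

for path in sorted(paths):
    try:
        with open(path) as f:
            print('%s = %s' % (path, f.read().strip()))
    except OSError as exc:
        print('%s unreadable (%s)' % (path, exc))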

 

Alex

 



Thanks,
Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 

--

Alex Gorbachev

Storcium

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kernel mounted RBD's hanging

2017-06-29 Thread Nick Fisk
> -Original Message-
> From: Ilya Dryomov [mailto:idryo...@gmail.com]
> Sent: 29 June 2017 16:58
> To: Nick Fisk 
> Cc: Ceph Users 
> Subject: Re: [ceph-users] Kernel mounted RBD's hanging
> 
> On Thu, Jun 29, 2017 at 4:30 PM, Nick Fisk  wrote:
> > Hi All,
> >
> > Putting out a call for help to see if anyone can shed some light on this.
> >
> > Configuration:
> > Ceph cluster presenting RBD's->XFS->NFS->ESXi Running 10.2.7 on the
> > OSD's and 4.11 kernel on the NFS gateways in a pacemaker cluster Both
> > OSD's and clients are go into a pair of switches, single L2 domain (no
> > sign from pacemaker that there is network connectivity issues)
> >
> > Symptoms:
> > - All RBD's on a single client randomly hang for 30s to several
> > minutes, confirmed by pacemaker and ESXi hosts complaining
> 
> Hi Nick,
> 
> What is a "single client" here?

I mean a node of the pacemaker cluster. So all RBD's on the same pacemaker node 
hang.

> 
> > - Cluster load is minimal when this happens most times
> 
> Can you post gateway syslog and point at when this happened?
> Corresponding pacemaker excerpts won't hurt either.

Jun 28 16:35:38 MS-CEPH-Proxy1 lrmd[2026]:  warning: 
p_export_ceph-ds1_monitor_6 process (PID 17754) timed out
Jun 28 16:35:43 MS-CEPH-Proxy1 lrmd[2026]: crit: 
p_export_ceph-ds1_monitor_6 process (PID 17754) will not die!
Jun 28 16:43:51 MS-CEPH-Proxy1 lrmd[2026]:  warning: 
p_export_ceph-ds1_monitor_6:17754 - timed out after 3ms
Jun 28 16:43:52 MS-CEPH-Proxy1 IPaddr(p_vip_ceph-ds1)[28482]: INFO: ifconfig 
ens224:0 down
Jun 28 16:43:52 MS-CEPH-Proxy1 lrmd[2026]:   notice: 
p_vip_ceph-ds1_stop_0:28482:stderr [ SIOCDELRT: No such process ]
Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]:   notice: Operation 
p_vip_ceph-ds1_stop_0: ok (node=MS-CEPH-Proxy1, call=471, rc=0, cib-update=318, 
confirmed=true)
Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO: 
Un-exporting file system ...
Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO: 
unexporting 10.3.20.0/24:/mnt/Ceph-DS1
Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO: 
Unlocked NFS export /mnt/Ceph-DS1
Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO: 
Un-exported file system(s)
Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]:   notice: Operation 
p_export_ceph-ds1_stop_0: ok (node=MS-CEPH-Proxy1, call=473, rc=0, 
cib-update=319, confirmed=true)
Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28549]: INFO: 
Exporting file system(s) ...
Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28549]: INFO: 
exporting 10.3.20.0/24:/mnt/Ceph-DS1
Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28549]: INFO: 
directory /mnt/Ceph-DS1 exported
Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]:   notice: Operation 
p_export_ceph-ds1_start_0: ok (node=MS-CEPH-Proxy1, call=474, rc=0, 
cib-update=320, confirmed=true)

If I enable the read/write checks for the FS resource, they also timeout at the 
same time.

> 
> > - All other clients with RBD's are not affected (Same RADOS pool), so
> > its seems more of a client issue than cluster issue
> > - It looks like pacemaker tries to also stop RBD+FS resource, but this
> > also hangs
> > - Eventually pacemaker succeeds in stopping resources and immediately
> > restarts them, IO returns to normal
> > - No errors, slow requests, or any other non normal Ceph status is
> > reported on the cluster or ceph.log
> > - Client logs show nothing apart from pacemaker
> >
> > Things I've tried:
> > - Different kernels (potentially happened less with older kernels, but
> > can't be 100% sure)
> 
> But still happened?  Do you have a list of all the kernels you've tried?

4.5 and 4.11. 

> 
> > - Disabling scrubbing and anything else that could be causing high
> > load
> > - Enabling Kernel RBD debugging (Problem maybe happens a couple of
> > times a day, debug logging was not practical as I can't reproduce on
> > demand)
> 
> When did it start occuring?  Can you think of any configuration changes that
> might have been the trigger or is this a new setup?

It has always done this from what I can tell. The majority of the time IO 
resumed before ESXi went All Paths Down, so it wasn't on my list of priorities 
to fix. But recently the hangs are lasting a lot longer. I need to go back to 
the 4.5 kernel, as I don't remember it happening as often or being as 
disruptive before we upgraded to 4.11.

> 
> Thanks,
> 
> Ilya

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Kernel mounted RBD's hanging

2017-06-29 Thread Nick Fisk
Hi All,

Putting out a call for help to see if anyone can shed some light on this.

Configuration:
Ceph cluster presenting RBD's->XFS->NFS->ESXi
Running 10.2.7 on the OSD's and 4.11 kernel on the NFS gateways in a
pacemaker cluster
Both OSD's and clients go into a pair of switches, in a single L2 domain (no
sign from pacemaker that there are network connectivity issues)

Symptoms:
- All RBD's on a single client randomly hang for 30s to several minutes,
confirmed by pacemaker and ESXi hosts complaining
- Cluster load is minimal when this happens most times
- All other clients with RBD's are not affected (Same RADOS pool), so its
seems more of a client issue than cluster issue
- It looks like pacemaker tries to also stop RBD+FS resource, but this also
hangs
- Eventually pacemaker succeeds in stopping resources and immediately
restarts them, IO returns to normal
- No errors, slow requests, or any other non normal Ceph status is reported
on the cluster or ceph.log
- Client logs show nothing apart from pacemaker

Things I've tried:
- Different kernels (potentially happened less with older kernels, but can't
be 100% sure)
- Disabling scrubbing and anything else that could be causing high load
- Enabling Kernel RBD debugging (Problem maybe happens a couple of times a
day, debug logging was not practical as I can't reproduce on demand)

Anyone have any ideas?

Thanks,
Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph random read IOPS

2017-06-26 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Willem Jan Withagen
> Sent: 26 June 2017 14:35
> To: Christian Wuerdig 
> Cc: Ceph Users 
> Subject: Re: [ceph-users] Ceph random read IOPS
> 
> On 26-6-2017 09:01, Christian Wuerdig wrote:
> > Well, preferring faster clock CPUs for SSD scenarios has been floated
> > several times over the last few months on this list. And realistic or
> > not, Nick's and Kostas' setup are similar enough (testing single disk)
> > that it's a distinct possibility.
> > Anyway, as mentioned measuring the performance counters would
> probably
> > provide more insight.
> 
> I read the advise as:
>   prefer GHz over cores.
> 
> And especially since there is a sort of balance between either GHz or
cores,
> that can be an expensive one. Getting both means you have to pay
relatively
> substantial more money.
> 
> And for an average Ceph server with plenty OSDs, I personally just don't
buy
> that. There you'd have to look at the total throughput of the the system,
and
> latency is only one of the many factors.
> 
> Let alone in a cluster with several hosts (and or racks). There the
latency is
> dictated by the network. So a bad choice of network card or switch will
out
> do any extra cycles that your CPU can burn.
> 
> I think that just testing 1 OSD is testing artifacts, and has very little
to do with
> running an actual ceph cluster.
> 
> So if one would like to test this, the test setup should be something
> like: 3 hosts with something like 3 disks per host, min_disk=2  and a nice
> workload.
> Then turn the GHz-knob and see what happens with client latency and
> throughput.

Did similar tests last summer. 5 nodes with 12x 7.2k disks each, connected
via 10G. NVME journal. 3x replica pool.

First test was with C-states left to auto and frequency scaling leaving the
cores at lowest frequency of 900mhz. The cluster will quite happily do a
couple of thousand IO's without generating enough CPU load to boost the 4
cores up to max C-state or frequency.

With small background IO going on in background, a QD=1 sequential 4kb write
was done with the following results:

write: io=115268KB, bw=1670.1KB/s, iops=417, runt= 68986msec
slat (usec): min=2, max=414, avg= 4.41, stdev= 3.81
clat (usec): min=966, max=27116, avg=2386.84, stdev=571.57
 lat (usec): min=970, max=27120, avg=2391.25, stdev=571.69
clat percentiles (usec):
 |  1.00th=[ 1480],  5.00th=[ 1688], 10.00th=[ 1912], 20.00th=[ 2128],
 | 30.00th=[ 2192], 40.00th=[ 2288], 50.00th=[ 2352], 60.00th=[ 2448],
 | 70.00th=[ 2576], 80.00th=[ 2704], 90.00th=[ 2832], 95.00th=[ 2960],
 | 99.00th=[ 3312], 99.50th=[ 3536], 99.90th=[ 6112], 99.95th=[ 9536],
 | 99.99th=[22400]

So just under 2.5ms write latency.

I don't have the results from the separate C-states/frequency scaling, but
adjusting either got me a boost. Forcing to C1 and max frequency of 3.6Ghz
got me:

write: io=105900KB, bw=5715.7KB/s, iops=1428, runt= 18528msec
slat (usec): min=2, max=106, avg= 3.50, stdev= 1.31
clat (usec): min=491, max=32099, avg=694.16, stdev=491.91
 lat (usec): min=494, max=32102, avg=697.66, stdev=492.04
clat percentiles (usec):
 |  1.00th=[  540],  5.00th=[  572], 10.00th=[  588], 20.00th=[  604],
 | 30.00th=[  620], 40.00th=[  636], 50.00th=[  652], 60.00th=[  668],
 | 70.00th=[  692], 80.00th=[  716], 90.00th=[  764], 95.00th=[  820],
 | 99.00th=[ 1448], 99.50th=[ 2320], 99.90th=[ 7584], 99.95th=[11712],
 | 99.99th=[24448]

Quite a bit faster. Although these are best case figures, if any substantial
workload is run, the average tends to hover around 1ms latency.

Nick

> 
> --WjW
> 
> > On Sun, Jun 25, 2017 at 4:53 AM, Willem Jan Withagen  > > wrote:
> >
> >
> >
> > On 24 Jun 2017, at 14:17, Maged Mokhtar
> > wrote the following:
> >
> >> My understanding was this test is targeting latency more than
> >> IOPS. This is probably why its was run using QD=1. It also makes
> >> sense that cpu freq will be more important than cores.
> >>
> >
> > But then it is not generic enough to be used as an advise!
> > It is just a line in 3D-space.
> > As there are so many
> >
> > --WjW
> >
> >> On 2017-06-24 12:52, Willem Jan Withagen wrote:
> >>
> >>> On 24-6-2017 05:30, Christian Wuerdig wrote:
>  The general advice floating around is that your want CPUs with
high
>  clock speeds rather than more cores to reduce latency and
>  increase IOPS
>  for SSD setups (see also
>  http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/)
>  So
>  something like a E5-2667V4 might bring better results in that
>  situation.
>  Also there was some talk about disabling the processor C stat

Re: [ceph-users] Ceph random read IOPS

2017-06-24 Thread Nick Fisk
Apologies for the top post, I can't seem to break indents on my phone.

Anyway, the point of that test was, as Maged suggests, to show the effect of 
serial CPU speed on latency. IO is effectively serialised by the PG lock, so 
trying to reduce the time spent in this area is key. Fast CPUs, a fast network 
and fast journals are the key here.

This is particularly important for databases, where the small log area, which 
may only occupy a small number of PGs, can cause contention. The same applies 
to the XFS journal.

Higher queue depths will start to show similar behaviour if you go high enough 
and start waiting for PGs to unlock.

Further tests on proper hardware and 3x replication over network have shown 
average latency figures of around 600us for qd1.
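
To put those numbers in context: at QD=1 only one IO is ever in flight, so the 
achievable IOPS is simply the reciprocal of the per-IO latency. A quick worked 
example with the latencies quoted in this thread:

# QD=1: IOPS = 1 / latency, nothing more.
for latency_us in (2400, 1000, 600):
    print('%4d us per IO -> ~%4.0f IOPS' % (latency_us, 1e6 / latency_us))
# 2400 us -> ~417 IOPS, 1000 us -> ~1000 IOPS, 600 us -> ~1667 IOPS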


From: Maged Mokhtar 
Sent: 24 Jun 2017 1:17 p.m.
To: Willem Jan Withagen
Cc: Ceph Users
Subject: Re: [ceph-users] Ceph random read IOPS

> My understanding was this test is targeting latency more than IOPS. This is 
> probably why its was run using QD=1. It also makes sense that cpu freq will 
> be more important than cores. 
>
>  
>
>
> On 2017-06-24 12:52, Willem Jan Withagen wrote:
>>
>> On 24-6-2017 05:30, Christian Wuerdig wrote:
>>>
>>> The general advice floating around is that your want CPUs with high
>>> clock speeds rather than more cores to reduce latency and increase IOPS
>>> for SSD setups (see also
>>> http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/) So
>>> something like a E5-2667V4 might bring better results in that situation.
>>> Also there was some talk about disabling the processor C states in order
>>> to bring latency down (something like this should be easy to test:
>>> https://stackoverflow.com/a/22482722/220986)
>>
>>
>> I would be very careful to call this a general advice...
>>
>> Although the article is interesting, it is rather single sided.
>>
>> The only thing is shows that there is a lineair relation between
>> clockspeed and write or read speeds???
>> The article is rather vague on how and what is actually tested.
>>
>> By just running a single OSD with no replication a lot of the
>> functionality is left out of the equation.
>> Nobody is running just 1 osD on a box in a normal cluster host.
>>
>> Not using a serious SSD is another source of noise on the conclusion.
>> More Queue depth can/will certainly have impact on concurrency.
>>
>> I would call this an observation, and nothing more.
>>
>> --WjW
>>>
>>>
>>> On Sat, Jun 24, 2017 at 1:28 AM, Kostas Paraskevopoulos
> >> <reverend...@gmail.com> wrote:
>>>
>>> Hello,
>>>
>>> We are in the process of evaluating the performance of a testing
>>> cluster (3 nodes) with ceph jewel. Our setup consists of:
>>> 3 monitors (VMs)
>>> 2 physical servers each connected with 1 JBOD running Ubuntu Server
>>> 16.04
>>>
>>> Each server has 32 threads @2.1GHz and 128GB RAM.
>>> The disk distribution per server is:
>>> 38 * HUS726020ALS210 (SAS rotational)
>>> 2 * HUSMH8010BSS200 (SAS SSD for journals)
>>> 2 * ST1920FM0043 (SAS SSD for data)
>>> 1 * INTEL SSDPEDME012T4 (NVME measured with fio ~300K iops)
>>>
>>> Since we don't currently have a 10Gbit switch, we test the performance
>>> with the cluster in a degraded state, the noout flag set and we mount
>>> rbd images on the powered on osd node. We confirmed that the network
>>> is not saturated during the tests.
>>>
>>> We ran tests on the NVME disk and the pool created on this disk where
>>> we hoped to get the most performance without getting limited by the
>>> hardware specs since we have more disks than CPU threads.
>>>
>>> The nvme disk was at first partitioned with one partition and the
>>> journal on the same disk. The performance on random 4K reads was
>>> topped at 50K iops. We then removed the osd and partitioned with 4
>>> data partitions and 4 journals on the same disk. The performance
>>> didn't increase significantly. Also, since we run read tests, the
>>> journals shouldn't cause performance issues.
>>>
>>> We then ran 4 fio processes in parallel on the same rbd mounted image
>>> and the total iops reached 100K. More parallel fio processes didn't
>>> increase the measured iops.
>>>
>>> Our ceph.conf is pretty basic (debug is set to 0/0 for everything) and
>>> the crushmap just defines the different buckets/rules for the disk
>>> separation (rotational, ssd, nvme) in order to create the required
>>> pools
>>>
>>> Is the performance of 100.000 iops for random 4K read normal for a
>>> disk that on the same benchmark runs at more than 300K iops on the
>>> same hardware or are we missing something?
>>>
>>> Best regards,
>>> Kostas
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com 
>>> http://lists.ceph.com/listinfo.cgi/cep

Re: [ceph-users] VMware + CEPH Integration

2017-06-22 Thread Nick Fisk
> -Original Message-
> From: Adrian Saul [mailto:adrian.s...@tpgtelecom.com.au]
> Sent: 19 June 2017 06:54
> To: n...@fisk.me.uk; 'Alex Gorbachev' 
> Cc: 'ceph-users' 
> Subject: RE: [ceph-users] VMware + CEPH Integration
> 
> > Hi Alex,
> >
> > Have you experienced any problems with timeouts in the monitor action
> > in pacemaker? Although largely stable, every now and again in our
> > cluster the FS and Exportfs resources timeout in pacemaker. There's no
> > mention of any slow requests or any peering..etc from the ceph logs so it's
> a bit of a mystery.
> 
> Yes - we have that in our setup which is very similar.  Usually  I find it 
> related
> to RBD device latency  due to scrubbing or similar but even when tuning
> some of that down we still get it randomly.
> 
> The most annoying part is that once it comes up, having to use  "resource
> cleanup" to try and remove the failed usually has more impact than the
> actual error.

Are you using Stonith? Pacemaker should be able to recover from any sort of 
failure as long as it can bring the cluster into a known state.

I'm still struggling to get to the bottom of it in our environment. When it 
happens, every RBD on the same client host seems to hang, but all other hosts 
are fine. This seems to suggest it's not a Ceph cluster issue/performance, as 
this would affect the majority of RBD's and not just ones on a single client.

> Confidentiality: This email and any attachments are confidential and may be
> subject to copyright, legal or some other professional privilege. They are
> intended solely for the attention and use of the named addressee(s). They
> may only be copied, distributed or disclosed with the consent of the
> copyright owner. If you have received this email by mistake or by breach of
> the confidentiality clause, please notify the sender immediately by return
> email and delete or destroy all copies of the email. Any confidentiality,
> privilege or copyright is not waived or lost because this email has been sent
> to you by mistake.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] VMware + CEPH Integration

2017-06-17 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Alex Gorbachev
> Sent: 16 June 2017 01:48
> To: Osama Hasebou 
> Cc: ceph-users 
> Subject: Re: [ceph-users] VMware + CEPH Integration
> 
> On Thu, Jun 15, 2017 at 5:29 AM, Osama Hasebou 
> wrote:
> > Hi Everyone,
> >
> > We would like to start testing using VMware with CEPH storage. Can
> > people share their experience with production ready ideas they tried
> > and if they were successful?
> >
> > I have been reading lately that either NFS or iSCSI are possible with
> > some server acting as a gateway in between Ceph and VMware
> environment
> > but NFS is better.
> 
> We have very good results running our Storcium system with NFS for
> VMWare, the essence of storage delivery here is two ceph clients running
> kernel NFS servers in a Pacemaker/Corosync cluster.  We utilize NFS ACLs
to
> restrict access and consume RBD as XFS filesystems.

Hi Alex,

Have you experienced any problems with timeouts in the monitor action in
pacemaker? Although largely stable, every now and again in our cluster the
FS and Exportfs resources time out in pacemaker. There's no mention of any
slow requests or any peering, etc. in the ceph logs, so it's a bit of a
mystery.

Nick

> 
> Best regards,
> Alex Gorbachev
> Storcium
> 
> 
> >
> > Thank you.
> >
> > Regards,
> > Ossi
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 2x replica with NVMe

2017-06-08 Thread Nick Fisk
Bluestore will make 2x replicas “safer” to use in theory. Until Bluestore is 
in use in the wild, I don’t think anyone can give any guarantees. 

 

From: i...@witeq.com [mailto:i...@witeq.com] 
Sent: 08 June 2017 14:32
To: nick 
Cc: Vy Nguyen Tan ; ceph-users 

Subject: Re: [ceph-users] 2x replica with NVMe

 

I'm thinking to delay this project until Luminous release to have Bluestore 
support.

 

So are you telling me that checksum capability will be present in Bluestore and 
therefore considering using NVMe with 2x replica for production data will be 
possibile?

 

  _  

From: "nick" 
To: "Vy Nguyen Tan" , i...@witeq.com
Cc: "ceph-users" 
Sent: Thursday, June 8, 2017 3:19:20 PM
Subject: RE: [ceph-users] 2x replica with NVMe

 

There are two main concerns with using 2x replicas, recovery speed and coming 
across inconsistent objects.

 

With spinning disks, the ratio of their size to their access speed means 
recovery can take a long time, which increases the chance that additional 
failures may happen during the recovery process. NVME will recover a lot 
faster, so this risk is greatly reduced, which means that using 2x replicas 
may be possible.

 

However, with Filestore there are no checksums and so there is no way to 
determine in the event of inconsistent objects, which one is corrupt. So even 
with NVME, I would not feel 100% confident using 2x replicas. With Bluestore 
this problem will go away.

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Vy 
Nguyen Tan
Sent: 08 June 2017 13:47
To: i...@witeq.com
Cc: ceph-users 
Subject: Re: [ceph-users] 2x replica with NVMe

 

Hi,

 

I think that the replica 2x on HDD/SSD are the same. You should read quote from 
Wido bellow:

 

""Hi,


As a Ceph consultant I get numerous calls throughout the year to help people 
with getting their broken Ceph clusters back online.

The causes of downtime vary vastly, but one of the biggest causes is that 
people use replication 2x. size = 2, min_size = 1.

In 2016 the amount of cases I have where data was lost due to these settings 
grew exponentially.

Usually a disk failed, recovery kicks in and while recovery is happening a 
second disk fails. Causing PGs to become incomplete.

There have been to many times where I had to use xfs_repair on broken disks and 
use ceph-objectstore-tool to export/import PGs.

I really don't like these cases, mainly because they can be prevented easily by 
using size = 3 and min_size = 2 for all pools.

With size = 2 you go into the danger zone as soon as a single disk/daemon 
fails. With size = 3 you always have two additional copies left thus keeping 
your data safe(r).

If you are running CephFS, at least consider running the 'metadata' pool with 
size = 3 to keep the MDS happy.

Please, let this be a big warning to everybody who is running with size = 2. 
The downtime and problems caused by missing objects/replicas are usually big 
and it takes days to recover from those. But very often data is lost and/or 
corrupted which causes even more problems.

I can't stress this enough. Running with size = 2 in production is a SERIOUS 
hazard and should not be done imho.

To anyone out there running with size = 2, please reconsider this!

Thanks,

Wido""

 

On Thu, Jun 8, 2017 at 5:32 PM, <i...@witeq.com> wrote:

Hi all,

 

i'm going to build an all-flash ceph cluster, looking around the existing 
documentation i see lots of guides and and use case scenarios from various 
vendor testing Ceph with replica 2x.

 

Now, i'm an old school Ceph user, I always considered 2x replica really 
dangerous for production data, especially when both OSDs can't decide which 
replica is the good one.

Why all NVMe storage vendor and partners use only 2x replica? 

They claim it's safe because NVMe is better in handling errors, but i usually 
don't trust marketing claims :)

Is it true? Can someone confirm that NVMe is different compared to HDD and 
therefore replica 2 can be considered safe to be put in production?

 

Many Thanks

Giordano


___
ceph-users mailing list
ceph-users@lists.ceph.com  
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

 

 




 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 2x replica with NVMe

2017-06-08 Thread Nick Fisk
There are two main concerns with using 2x replicas, recovery speed and coming 
across inconsistent objects.

 

With spinning disks, the ratio of their size to their access speed means 
recovery can take a long time, which increases the chance that additional 
failures may happen during the recovery process. NVME will recover a lot 
faster, so this risk is greatly reduced, which means that using 2x replicas 
may be possible.

 

However, with Filestore there are no checksums and so there is no way to 
determine in the event of inconsistent objects, which one is corrupt. So even 
with NVME, I would not feel 100% confident using 2x replicas. With Bluestore 
this problem will go away.

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Vy 
Nguyen Tan
Sent: 08 June 2017 13:47
To: i...@witeq.com
Cc: ceph-users 
Subject: Re: [ceph-users] 2x replica with NVMe

 

Hi,

 

I think that the replica 2x on HDD/SSD are the same. You should read quote from 
Wido bellow:

 

""Hi,


As a Ceph consultant I get numerous calls throughout the year to help people 
with getting their broken Ceph clusters back online.

The causes of downtime vary vastly, but one of the biggest causes is that 
people use replication 2x. size = 2, min_size = 1.

In 2016 the amount of cases I have where data was lost due to these settings 
grew exponentially.

Usually a disk failed, recovery kicks in and while recovery is happening a 
second disk fails. Causing PGs to become incomplete.

There have been to many times where I had to use xfs_repair on broken disks and 
use ceph-objectstore-tool to export/import PGs.

I really don't like these cases, mainly because they can be prevented easily by 
using size = 3 and min_size = 2 for all pools.

With size = 2 you go into the danger zone as soon as a single disk/daemon 
fails. With size = 3 you always have two additional copies left thus keeping 
your data safe(r).

If you are running CephFS, at least consider running the 'metadata' pool with 
size = 3 to keep the MDS happy.

Please, let this be a big warning to everybody who is running with size = 2. 
The downtime and problems caused by missing objects/replicas are usually big 
and it takes days to recover from those. But very often data is lost and/or 
corrupted which causes even more problems.

I can't stress this enough. Running with size = 2 in production is a SERIOUS 
hazard and should not be done imho.

To anyone out there running with size = 2, please reconsider this!

Thanks,

Wido""

 

On Thu, Jun 8, 2017 at 5:32 PM, <i...@witeq.com> wrote:

Hi all,

 

i'm going to build an all-flash ceph cluster, looking around the existing 
documentation i see lots of guides and and use case scenarios from various 
vendor testing Ceph with replica 2x.

 

Now, i'm an old school Ceph user, I always considered 2x replica really 
dangerous for production data, especially when both OSDs can't decide which 
replica is the good one.

Why all NVMe storage vendor and partners use only 2x replica? 

They claim it's safe because NVMe is better in handling errors, but i usually 
don't trust marketing claims :)

Is it true? Can someone confirm that NVMe is different compared to HDD and 
therefore replica 2 can be considered safe to be put in production?

 

Many Thanks

Giordano


___
ceph-users mailing list
ceph-users@lists.ceph.com  
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Changing SSD Landscape

2017-05-18 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Dan 
> van der Ster
> Sent: 18 May 2017 09:30
> To: Christian Balzer 
> Cc: ceph-users 
> Subject: Re: [ceph-users] Changing SSD Landscape
> 
> On Thu, May 18, 2017 at 3:11 AM, Christian Balzer  wrote:
> > On Wed, 17 May 2017 18:02:06 -0700 Ben Hines wrote:
> >
> >> Well, ceph journals are of course going away with the imminent bluestore.
> > Not really, in many senses.
> >
> 
> But we should expect far fewer writes to pass through the RocksDB and its 
> WAL, right? So perhaps lower endurance flash will be
> usable.

Depends. I flagged up an issue in Bluestore where client latency writing to 
spinners was tied to the underlying disk's latency. Sage has introduced a new 
deferred write feature which does a similar double-write strategy to Filestore: 
first into the WAL, where it gets coalesced, and then written out to the disk. 
The deferred writes are tuneable, as in you can say only defer writes up to 
128KB, etc. But if you want the same write latency you see in Filestore, then 
you will encounter increased SSD wear to match it. 

> 
> BTW, you asked about Samsung parts earlier. We are running these SM863's in a 
> block storage cluster:
> 
> Model Family: Samsung based SSDs
> Device Model: SAMSUNG MZ7KM240HAGR-0E005
> Firmware Version: GXM1003Q
> 
>   9 Power_On_Hours  0x0032   098   098   000Old_age
> Always   -   9971
> 177 Wear_Leveling_Count 0x0013   094   094   005Pre-fail
> Always   -   2195
> 241 Total_LBAs_Written  0x0032   099   099   000Old_age
> Always   -   701300549904
> 242 Total_LBAs_Read 0x0032   099   099   000Old_age
> Always   -   20421265
> 251 NAND_Writes 0x0032   100   100   000Old_age
> Always   -   1148921417736
> 
> The problem is that I don't know how to see how many writes have gone through 
> these drives.
> Total_LBAs_Written appears to be bogus -- it's based on time. It matches 
> exactly the 3.6DWPD spec'd for that model:
>   3.6*240GB*9971 hours = 358.95TB
>   701300549904 LBAs * 512Bytes/LBA = 359.06TB
> 
> If we trust Wear_Leveling_Count then we're only dropping 6% in a year
> -- these should be good.
> 
> But maybe they're EOL anyway?
> 
> Cheers, Dan
> 
> >> Are small SSDs still useful for something with Bluestore?
> >>
> > Of course, the WAL and other bits for the rocksdb, read up on it.
> >
> > On top of that is the potential to improve things further with things
> > like bcache.
> >
> >> For speccing out a cluster today that is a many 6+ months away from
> >> being required, which I am going to be doing, i was thinking all-SSD
> >> would be the way to go. (or is all-spinner performant with
> >> Bluestore?) Too early to make that call?
> >>
> > Your call and funeral with regards to all spinners (depending on your
> > needs).
> > Bluestore at the very best of circumstances could double your IOPS,
> > but there are other factors at play and most people who NEED SSD
> > journals now would want something with SSDs in Bluestore as well.
> >
> > If you're planning to actually deploy a (entirely) Bluestore cluster
> > in production with mission critical data before next year, you're a
> > lot braver than me.
> > An early adoption scheme with Bluestore nodes being in their own
> > failure domain (rack) would be the best I could see myself doing in my
> > generic cluster.
> > For the 2 mission critical production clusters, they are (will be)
> > frozen most likely.
> >
> > Christian
> >
> >> -Ben
> >>
> >> On Wed, May 17, 2017 at 5:30 PM, Christian Balzer  wrote:
> >>
> >> >
> >> > Hello,
> >> >
> >> > On Wed, 17 May 2017 11:28:17 +0200 Eneko Lacunza wrote:
> >> >
> >> > > Hi Nick,
> >> > >
> >> > > El 17/05/17 a las 11:12, Nick Fisk escribió:
> >> > > > There seems to be a shift in enterprise SSD products to larger
> >> > > > less
> >> > write intensive products and generally costing more than what
> >> > > > the existing P/S 3600/3700 ranges were. For example the new
> >> > > > Intel NVME
> >> > P4600 range seems to start at 2TB. Although I mention Intel
> >> > > > products, this seems to be the general outlook across all
> >> > manufacturers. This presents some problems for acquiring SSD's for
> 

Re: [ceph-users] Changing SSD Landscape

2017-05-17 Thread Nick Fisk
Hi Dan,

> -Original Message-
> From: Dan van der Ster [mailto:d...@vanderster.com]
> Sent: 17 May 2017 10:29
> To: Nick Fisk 
> Cc: ceph-users 
> Subject: Re: [ceph-users] Changing SSD Landscape
> 
> I am currently pricing out some DCS3520's, for OSDs. Word is that the price 
> is going up, but I don't have specifics, yet.
> 
> I'm curious, does your real usage show that the 3500 series don't offer 
> enough endurance?

We've written about 700-800TB to each P3700 in about 10 months and their 
official specs show that they should be good for about 7PBW. We plan to try and 
keep these nodes running for about 5 years, so roughly about right I would 
imagine.
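
As a back-of-the-envelope check of that, using the figures above (roughly 
750TB in 10 months against a ~7PBW rating over a 5 year life):

# Rough endurance projection for the P3700 journals.
written_tb   = 750     # ~700-800TB written so far
months_used  = 10
target_years = 5
rated_pbw    = 7.0     # approximate rated endurance in PB written

projected_pb = written_tb / months_used * target_years * 12 / 1000.0
print('projected: ~%.1f PB written over %d years (rated ~%.0f PBW)'
      % (projected_pb, target_years, rated_pbw))
# -> ~4.5 PB, comfortably inside the rating at the current write rate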

Looking at the 3520's, I think by the time we have enough endurance, we would 
be talking about ones at the high end of the capacity scale.

> 
> Here's one of our DCS3700's after 2.5 years of RBD + a bit of S3:
> 
> Model Family: Intel 730 and DC S35x0/3610/3700 Series SSDs
> Device Model: INTEL SSDSC2BA200G3
> Firmware Version: 5DV10270
> User Capacity:200,049,647,616 bytes [200 GB]
> 
>   9 Power_On_Hours  0x0032   100   100   000Old_age
> Always   -   22580
> 226 Workld_Media_Wear_Indic 0x0032   100   100   000Old_age
> Always   -   3471
> 228 Workload_Minutes0x0032   100   100   000Old_age
> Always   -   1354810
> 232 Available_Reservd_Space 0x0033   099   099   010Pre-fail
> Always   -   0
> 233 Media_Wearout_Indicator 0x0032   097   097   000Old_age
> Always   -   0
> 241 Host_Writes_32MiB   0x0032   100   100   000Old_age
> Always   -   8236969
> 242 Host_Reads_32MiB0x0032   100   100   000Old_age
> Always   -   7400
> 
> Still loads of endurance left.
> 
> Anyway, in a couple months we'll start testing the Optane drives. They are 
> small and perhaps ideal journals, or?

Ok, interesting. Is this the P4800 model?

> 
> -- Dan
> 
> 
> 
> On Wed, May 17, 2017 at 11:12 AM, Nick Fisk  wrote:
> > Hi All,
> >
> > There seems to be a shift in enterprise SSD products to larger less
> > write intensive products and generally costing more than what the
> > existing P/S 3600/3700 ranges were. For example the new Intel NVME
> > P4600 range seems to start at 2TB. Although I mention Intel products,
> > this seems to be the general outlook across all manufacturers. This 
> > presents some problems for acquiring SSD's for Ceph
> journal/WAL use if your cluster is largely write only and wouldn't benefit 
> from using the extra capacity brought by these SSD's to use
> as cache.
> >
> > Is anybody in the same situation and is struggling to find good P3700 400G 
> > replacements?
> >
> > Nick
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Changing SSD Landscape

2017-05-17 Thread Nick Fisk
Hi All,

There seems to be a shift in enterprise SSD products to larger less write 
intensive products and generally costing more than what
the existing P/S 3600/3700 ranges were. For example the new Intel NVME P4600 
range seems to start at 2TB. Although I mention Intel
products, this seems to be the general outlook across all manufacturers. This 
presents some problems for acquiring SSD's for Ceph
journal/WAL use if your cluster is largely write only and wouldn't benefit from 
using the extra capacity brought by these SSD's to
use as cache.

Is anybody in the same situation and is struggling to find good P3700 400G 
replacements?

Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel power tuning - 30% throughput performance increase

2017-05-03 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Blair Bethwaite
> Sent: 03 May 2017 09:53
> To: Dan van der Ster 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Intel power tuning - 30% throughput performance 
> increase
> 
> On 3 May 2017 at 18:38, Dan van der Ster  wrote:
> > Seems to work for me, or?
> 
> Yeah now that I read the code more I see it is opening and manipulating 
> /dev/cpu_dma_latency in response to that option, so the
> TODO comment seems to be outdated. I verified tuned latency-performance _is_ 
> doing this properly on our RHEL7.3 nodes (maybe I
> first tested this on 7.2 or just missed something then). In any case, I think 
> we're agreed that Ceph should recommend that
profile.

I did some testing on this last year; by forcing the C-state to C1 I found that 
small writes benefited the most. 4kb writes to a 3x replica pool went from 
about 2ms write latency down to about 600us. I also measured power usage of 
the server and the increase was only a couple of percent.
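
For anyone wanting to reproduce that without tuned, the usual mechanism is the 
PM QoS interface: open /dev/cpu_dma_latency, write the maximum tolerable 
wakeup latency in microseconds as a 32-bit value, and keep the file descriptor 
open for as long as the constraint should apply. A minimal sketch (the 10us 
value is illustrative; pick something at or below your CPU's C1 exit latency):

#!/usr/bin/env python3
# Sketch: hold a PM QoS latency constraint via /dev/cpu_dma_latency (this is
# essentially what tuned's latency-performance profile does).  Needs root,
# and the constraint only lasts while the file descriptor stays open.
import os, struct, time

MAX_LATENCY_US = 10   # illustrative; small values keep cores in shallow C-states

fd = os.open('/dev/cpu_dma_latency', os.O_WRONLY)
try:
    os.write(fd, struct.pack('=I', MAX_LATENCY_US))   # 32-bit microsecond value
    print('holding cpu_dma_latency at %dus, Ctrl-C to release' % MAX_LATENCY_US)
    while True:
        time.sleep(60)
except KeyboardInterrupt:
    pass
finally:
    os.close(fd)   # constraint is dropped as soon as the fd is closed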

Ideally, I'm hoping for the day when the Linux scheduler is power aware so that 
it always assigns threads to cores that are already awake (i.e. not in deep 
C-states); that way we will hopefully get the benefits of both worlds without 
having to force all cores to run at max.

> 
> --
> Cheers,
> ~Blairo
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Maintaining write performance under a steady intake of small objects

2017-05-01 Thread Nick Fisk
Hi Patrick,

 

Is there any chance that you can graph the XFS stats to see if there is an 
increase in inode/dentry cache misses as the ingest performance drops off? At 
least that might confirm the issue.
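
In case it helps, the global XFS counters can be sampled from /proc/fs/xfs/stat 
and graphed as deltas; the 'ig' (inode cache) and 'dir' (directory operation) 
lines are the interesting ones for this theory. Field meanings differ slightly 
between kernel versions, so treat this as a rough sketch and graph the deltas 
rather than trusting any single column:

#!/usr/bin/env python3
# Sketch: print per-interval deltas of the XFS inode-cache ('ig') and
# directory ('dir') counters, to see if inode/dentry work grows as ingest
# performance drops off.
import time

WATCH = ('ig', 'dir')

def sample():
    counters = {}
    with open('/proc/fs/xfs/stat') as f:
        for line in f:
            parts = line.split()
            if parts and parts[0] in WATCH:
                counters[parts[0]] = [int(x) for x in parts[1:]]
    return counters

prev = sample()
while True:
    time.sleep(10)
    cur = sample()
    for key in WATCH:
        delta = [c - p for c, p in zip(cur.get(key, []), prev.get(key, []))]
        print('%s %-3s delta/10s: %s' % (time.strftime('%H:%M:%S'), key, delta))
    prev = cur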

 

The only other thing I can think of would be to try running the OSD’s on top 
of something like a bcache set. As your workload is very heavily write-only, 
running the cache in read-only mode (writearound) might mean that inodes and 
dentries have a good chance of being cached on SSD. If you have some free 
capacity on your journals to use for bcache, it might be worth a shot. I have 
done something very similar on a single node recently to try and combat 
excessive dentry/inode lookups: a 200GB cache for 12x 8TB OSD’s. Performance is 
better, but I can’t say exactly how much is down to caching of the general 
data vs the inodes, etc.
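
If you do try the bcache route, the hit statistics are exposed in sysfs, which 
makes it fairly easy to see whether the cache is actually absorbing the 
metadata reads. A sketch (the stat files present can vary a little between 
kernel versions):

#!/usr/bin/env python3
# Sketch: report bcache hit statistics for each bcache device, to judge how
# much of the read traffic (hopefully inode/dentry lookups) the SSD absorbs.
import glob, os

for stats_dir in glob.glob('/sys/block/bcache*/bcache/stats_total'):
    dev = stats_dir.split('/')[3]          # e.g. 'bcache0'
    values = {}
    for name in ('cache_hits', 'cache_misses', 'cache_hit_ratio'):
        try:
            with open(os.path.join(stats_dir, name)) as f:
                values[name] = f.read().strip()
        except OSError:
            values[name] = 'n/a'
    print(dev, values)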

 

Kernel 4.10+ supports bcache partitions, so it’s a lot easier to use with Ceph.

 

Nick

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Patrick Dinnen
Sent: 01 May 2017 19:07
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Maintaining write performance under a steady intake of 
small objects

 

Hello Ceph-users,

Florian has been helping with some issues on our proof-of-concept cluster, 
where we've been experiencing these issues. Thanks for the replies so far. I 
wanted to jump in with some extra details.

All of our testing has been with scrubbing turned off, to remove that as a 
factor.

Our use case requires a Ceph cluster to indefinitely store ~10 billion files 
20-60KB in size. We’ll begin with 4 billion files migrated from a legacy 
storage system. Ongoing writes will be handled by ~10 client machines and come 
in at a fairly steady 10-20 million files/day. Every file (excluding the legacy 
4 billion) will be read once by a single client within hours of it’s initial 
write to the cluster. Future file read requests will come from a single server 
and with a long-tail distribution, with popular files read thousands of times a 
year but most read never or virtually never.

Our “production” design has 6 nodes and 24 OSDs (expandable to 48 OSDs), with 
SSD journals at a 1:4 ratio with HDDs. Each node looks like this:

*   2 x E5-2660 8-core Xeons

*   64GB RAM DDR-3 PC1600

*   10Gb ceph-internal network (SFP+) 

*   LSI 9210-8i controller (IT mode)

*   4 x OSD 8TB HDDs, mix of two types

*   Seagate ST8000DM002

*   HGST HDN728080ALE604

*   Mount options = xfs (rw,noatime,attr2,inode64,noquota) 

*   1 x SSD journal Intel 200GB DC S3700

 

Running Kraken 11.2.0 on Ubuntu 16.04. All testing has been done with a 
replication level 2. We’re using rados bench to shotgun a lot of files into our 
test pools. Specifically following these two steps: 

ceph osd pool create poolofhopes 2048 2048 replicated "" replicated_ruleset 
5

rados -p poolofhopes bench -t 32 -b 2 3000 write --no-cleanup

 

We leave the bench running for days at a time and watch the objects in cluster 
count. We see performance that starts off decent and degrades over time. 
There’s a very brief initial surge in write performance after which things 
settle into the downward trending pattern.

 

1st hour - 2 million objects/hour

20th hour - 1.9 million objects/hour 

40th hour - 1.7 million objects/hour


This performance is not encouraging for us. We need to be writing 40 million 
objects per day (20 million files, duplicated twice). The rates we’re seeing at 
the 40th hour of our bench would be sufficient to achieve that. Those write 
rates are still falling though and we’re only at a fraction of the number of 
objects in cluster that we need to handle. So, the trends in performance 
suggest we shouldn’t count on having the write performance we need for too 
long.


If we repeat the process of creating a new pool and running the bench the same 
pattern holds, good initial performance that gradually degrades.

 

  https://postimg.org/image/ovymk7n2d/

[caption:90 million objects written to a brand new, pre-split pool 
(poolofhopes). There are already 330 million objects on the cluster in other 
pools.]

 

Our working theory is that the degradation over time may be related to inode or 
dentry lookups that miss cache and lead to additional disk reads and seek 
activity. There’s a suggestion that filestore directory splitting may 
exacerbate that problem, as additional/longer disk seeks occur related to 
what’s in which XFS allocation group. We have found pre-split pools useful in 
one major way: they avoid periods of near-zero write performance that we have 
put down to the active splitting of directories (the "thundering herd" effect). 
The overall downward curve seems to remain the same whether we pre-split or not.
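
For anyone comparing notes, the point at which filestore starts splitting can 
be estimated from the PG count and the split/merge settings. The commonly 
cited rule is that a leaf directory splits once it holds more than 
filestore_split_multiple * abs(filestore_merge_threshold) * 16 objects; the 
defaults below are assumptions and should be checked against the running 
config (ceph daemon osd.N config show):

# Rough sketch of the directory-split arithmetic for the test pool.
filestore_split_multiple  = 2     # assumed default
filestore_merge_threshold = 10    # assumed default
split_at = filestore_split_multiple * abs(filestore_merge_threshold) * 16

pg_num   = 2048               # poolofhopes
replicas = 2
osds     = 24
objects  = 90 * 1000 * 1000   # objects written in the bench run shown above

print('leaf directory splits at ~%d objects' % split_at)          # ~320
print('objects per PG: ~%d' % (objects / pg_num))                 # ~43900
print('PG copies per OSD: ~%d' % (pg_num * replicas / osds))      # ~170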

 

The thundering herd seems to be kept in check by an appropriate pre-split. 
Bluestore may or may not be a solution, but uncertainty and stability within 
our fairly tight t

Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-20 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jogi 
> Hofmüller
> Sent: 20 April 2017 13:51
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] slow requests and short OSD failures in small 
> cluster
> 
> Hi,
> 
> Am Dienstag, den 18.04.2017, 18:34 + schrieb Peter Maloney:
> 
> > The 'slower with every snapshot even after CoW totally flattens it'
> > issue I just find easy to test, and I didn't test it on hammer or
> > earlier, and others confirmed it, but didn't keep track of the
> > versions. Just make an rbd image, map it (probably... but my tests
> > were with qemu librbd), do fio randwrite tests with sync and direct on
> > the device (no need for a fs, or anything), and then make a few snaps
> > and watch it go way slower.
> >
> > How about we make this thread a collection of versions then. And I'll
> > redo my test on Thursday maybe.
> 
> I did some tests now and provide the results and observations here:
> 
> This is the fio config file I used:
> 
> 
> [global]
> ioengine=rbd
> clientname=admin
> pool=images
> rbdname=benchmark
> invalidate=0# mandatory
> rw=randwrite
> bs=4k
> 
> [rbd_iodepth32]
> iodepth=32
> 
> 
> Results from fio on image 'benchmark' without any snapshots:
> 
> rbd_iodepth32: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd,
> iodepth=32
> fio-2.16
> Starting 1 process
> rbd engine: RBD version: 0.1.10
> Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/3620KB/0KB /s] [0/905/0 iops] [eta 
> 00m:00s]
> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=14192: Thu Apr 20
> 13:11:27 2017
>   write: io=8192.0MB, bw=1596.2KB/s, iops=399, runt=5252799msec
> slat (usec): min=1, max=6708, avg=173.27, stdev=97.65
> clat (msec): min=9, max=14505, avg=79.97, stdev=456.86
>  lat (msec): min=9, max=14505, avg=80.15, stdev=456.86
> clat percentiles (msec):
>  |  1.00th=[   26],  5.00th=[   28], 10.00th=[   28], 20.00th=[   30],
>  | 30.00th=[   31], 40.00th=[   32], 50.00th=[   33], 60.00th=[   35],
>  | 70.00th=[   37], 80.00th=[   39], 90.00th=[   43], 95.00th=[   47],
>  | 99.00th=[ 1516], 99.50th=[ 3621], 99.90th=[ 7046], 99.95th=[ 8094],
>  | 99.99th=[10159]
> lat (msec) : 10=0.01%, 20=0.29%, 50=96.17%, 100=1.49%, 250=0.31%
> lat (msec) : 500=0.21%, 750=0.15%, 1000=0.14%, 2000=0.38%,
> >=2000=0.85%
>   cpu  : usr=31.95%, sys=58.32%, ctx=5392823, majf=0, minf=0
>   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
> >=64=0.0%
>  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
>  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
> >=64=0.0%
>  issued: total=r=0/w=2097152/d=0, short=r=0/w=0/d=0,
> drop=r=0/w=0/d=0
>  latency   : target=0, window=0, percentile=100.00%, depth=32
> 
> Run status group 0 (all jobs):
>   WRITE: io=8192.0MB, aggrb=1596KB/s, minb=1596KB/s, maxb=1596KB/s, 
> mint=5252799msec, maxt=5252799msec
> 
> Disk stats (read/write):
>   vdb: ios=6/20, merge=0/29, ticks=76/12168, in_queue=12244, util=0.23% sudo 
> fio rbd.fio  2023.87s user 3216.33s system 99% cpu
> 1:27:31.92 total
> 
> Now I created three snapshots of image 'benchmark'. Cluster became 
> iresponsive (slow requests stared to appear), a new run of fio
> never got passed 0.0%.
> 
> Removed all three snapshots. Cluster became responsive again, fio started to 
> work like before (left it running during snapshot
> removal).
> 
> Created one snapshot of 'benchmark' while fio was running. Cluster became 
> iresponsive after few minutes, fio got nothing done as
> soon as the snapshot was made.
> 
> Stopped here ;)

You are generating a write amplification of roughly 2000x: every 4kB write IO 
will trigger a 4MB read and a 4MB write while the snapshot copy-on-write is in 
progress. If your cluster can't handle that IO then you will see extremely poor 
performance. Is your real-life workload actually doing random 4kB writes at 
qd=32? If it is, you will either want to use RBDs made up of smaller objects to 
try and lessen the overhead, or probably forget about using snapshots, unless 
there is some sort of sparse bitmap-based COW feature on the horizon???
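
(Purely as an illustration of the object-size angle: the image name below is 
made up, and the exact flags vary slightly by release, so check rbd help create 
first. A test image with 1MB objects instead of the default 4MB ones cuts the 
per-write CoW copy from 4MB to 1MB:)

rbd create images/benchmark-small --size 10240 --order 20
# newer releases also accept the more readable form:
# rbd create images/benchmark-small --size 10240 --object-size 1M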

> 
> Regards,
> --
> J.Hofmüller
> 
>mur.sat -- a space art project
>http://sat.mur.at/

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mon not starting after upgrading to 10.2.7

2017-04-12 Thread Nick Fisk
Thanks Dan,

I've just managed to fix it. It looks like the upgrade process required some 
extra RAM; the mon node was heavily swapping, so I think it was just stalled 
rather than broken. Once it came back up, RAM usage dropped by a lot.

Nick

> -Original Message-
> From: Dan van der Ster [mailto:d...@vanderster.com]
> Sent: 12 April 2017 10:53
> To: Nick Fisk 
> Cc: ceph-users 
> Subject: Re: [ceph-users] Mon not starting after upgrading to 10.2.7
> 
> Can't help, but just wanted to say that the upgrade worked for us:
> 
> # ceph health
> HEALTH_OK
> # ceph tell mon.* version
> mon.p01001532077488: ceph version 10.2.7
> (50e863e0f4bc8f4b9e31156de690d765af245185)
> mon.p01001532149022: ceph version 10.2.7
> (50e863e0f4bc8f4b9e31156de690d765af245185)
> mon.p01001532184554: ceph version 10.2.7
> (50e863e0f4bc8f4b9e31156de690d765af245185)
> 
> -- dan
> 
> On Wed, Apr 12, 2017 at 11:50 AM, Nick Fisk  wrote:
> > Hi,
> >
> > I just upgraded one of my mons to 10.2.7 and it is now failing to
> > start properly. What's really odd is all the mon specific commands are
> > now missing from the admin socket.
> >
> > ceph --admin-daemon /var/run/ceph/ceph-mon.gp-ceph-mon2.asok help {
> > "config diff": "dump diff of current config and default config",
> > "config get": "config get : get the config value",
> > "config set": "config set   [ ...]: set a config
> > variable",
> > "config show": "dump current config settings",
> > "get_command_descriptions": "list available commands",
> > "git_version": "get git sha1",
> > "help": "list available commands",
> > "log dump": "dump recent log entries to log file",
> > "log flush": "flush log entries to log file",
> > "log reopen": "reopen log file",
> > "perf dump": "dump perfcounters value",
> > "perf reset": "perf reset : perf reset all or one
> > perfcounter name",
> > "perf schema": "dump perfcounters schema",
> > "version": "get ceph version"
> > }
> >
> > And from the log, with logging set to 20/20
> > 2017-04-12 10:47:35.667631 7fba2cf4d700  0 set uid:gid to 64045:64045
> > (ceph:ceph)
> > 2017-04-12 10:47:35.667681 7fba2cf4d700  0 ceph version 10.2.7
> > (50e863e0f4bc8f4b9e31156de690d765af245185), process ceph-mon, pid
> 7187
> > 2017-04-12 10:47:35.668335 7fba2cf4d700  0 pidfile_write: ignore empty
> > --pid-file
> > 2017-04-12 10:47:35.721480 7fba2cf4d700  5 asok(0x55a03144a2c0) init
> > /var/run/ceph/ceph-mon.gp-ceph-mon2.asok
> > 2017-04-12 10:47:35.721540 7fba2cf4d700  5 asok(0x55a03144a2c0)
> > bind_and_listen /var/run/ceph/ceph-mon.gp-ceph-mon2.asok
> > 2017-04-12 10:47:35.721692 7fba2cf4d700 20 asok(0x55a03144a2c0) unlink
> > stale file /var/run/ceph/ceph-mon.gp-ceph-mon2.asok
> > 2017-04-12 10:47:35.721729 7fba2cf4d700  5 asok(0x55a03144a2c0)
> > register_command 0 hook 0x55a03142a0a8
> > 2017-04-12 10:47:35.721742 7fba2cf4d700  5 asok(0x55a03144a2c0)
> > register_command version hook 0x55a03142a0a8
> > 2017-04-12 10:47:35.721752 7fba2cf4d700  5 asok(0x55a03144a2c0)
> > register_command git_version hook 0x55a03142a0a8
> > 2017-04-12 10:47:35.721762 7fba2cf4d700  5 asok(0x55a03144a2c0)
> > register_command help hook 0x55a03142e1d0
> > 2017-04-12 10:47:35.721771 7fba2cf4d700  5 asok(0x55a03144a2c0)
> > register_command get_command_descriptions hook 0x55a03142e1e0
> > 2017-04-12 10:47:35.721902 7fba2778e700  5 asok(0x55a03144a2c0) entry
> > start
> > 2017-04-12 10:47:35.764691 7fba2cf4d700 10 load: jerasure load: lrc load:
> > isa
> >
> >
> > Any ideas?
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Mon not starting after upgrading to 10.2.7

2017-04-12 Thread Nick Fisk
Hi,

I just upgraded one of my mons to 10.2.7 and it is now failing to start
properly. What's really odd is all the mon specific commands are now missing
from the admin socket.

ceph --admin-daemon /var/run/ceph/ceph-mon.gp-ceph-mon2.asok help
{
"config diff": "dump diff of current config and default config",
"config get": "config get : get the config value",
"config set": "config set   [ ...]: set a config
variable",
"config show": "dump current config settings",
"get_command_descriptions": "list available commands",
"git_version": "get git sha1",
"help": "list available commands",
"log dump": "dump recent log entries to log file",
"log flush": "flush log entries to log file",
"log reopen": "reopen log file",
"perf dump": "dump perfcounters value",
"perf reset": "perf reset : perf reset all or one perfcounter
name",
"perf schema": "dump perfcounters schema",
"version": "get ceph version"
}

And from the log, with logging set to 20/20
2017-04-12 10:47:35.667631 7fba2cf4d700  0 set uid:gid to 64045:64045
(ceph:ceph)
2017-04-12 10:47:35.667681 7fba2cf4d700  0 ceph version 10.2.7
(50e863e0f4bc8f4b9e31156de690d765af245185), process ceph-mon, pid 7187
2017-04-12 10:47:35.668335 7fba2cf4d700  0 pidfile_write: ignore empty
--pid-file
2017-04-12 10:47:35.721480 7fba2cf4d700  5 asok(0x55a03144a2c0) init
/var/run/ceph/ceph-mon.gp-ceph-mon2.asok
2017-04-12 10:47:35.721540 7fba2cf4d700  5 asok(0x55a03144a2c0)
bind_and_listen /var/run/ceph/ceph-mon.gp-ceph-mon2.asok
2017-04-12 10:47:35.721692 7fba2cf4d700 20 asok(0x55a03144a2c0) unlink stale
file /var/run/ceph/ceph-mon.gp-ceph-mon2.asok
2017-04-12 10:47:35.721729 7fba2cf4d700  5 asok(0x55a03144a2c0)
register_command 0 hook 0x55a03142a0a8
2017-04-12 10:47:35.721742 7fba2cf4d700  5 asok(0x55a03144a2c0)
register_command version hook 0x55a03142a0a8
2017-04-12 10:47:35.721752 7fba2cf4d700  5 asok(0x55a03144a2c0)
register_command git_version hook 0x55a03142a0a8
2017-04-12 10:47:35.721762 7fba2cf4d700  5 asok(0x55a03144a2c0)
register_command help hook 0x55a03142e1d0
2017-04-12 10:47:35.721771 7fba2cf4d700  5 asok(0x55a03144a2c0)
register_command get_command_descriptions hook 0x55a03142e1e0
2017-04-12 10:47:35.721902 7fba2778e700  5 asok(0x55a03144a2c0) entry start
2017-04-12 10:47:35.764691 7fba2cf4d700 10 load: jerasure load: lrc load:
isa


Any ideas?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Preconditioning an RBD image

2017-04-10 Thread Nick Fisk
Hi Peter,

Thanks for those graphs and the thorough explanation. I guess a lot of your 
performance increase is due to the fact that a fair amount of your workload is 
cacheable?

I got my new node online late last week with 12x8TB drives and a 200GB bcache 
partition. Coupled with my very uncacheable workload, I was expecting the hit 
ratio to be very low. However, surprisingly it's hovering around 50%!!

Further investigation of the stats, however, suggests that sequential bypasses 
might be counting towards that figure.

cat /sys/block/bcache2/bcache/stats_day/cache_hit_ratio
48
cat /sys/block/bcache2/bcache/stats_day/cache_bypass_hits
3000830
cat /sys/block/bcache2/bcache/stats_day/cache_hits
444625

So I guess that a lot of the OSD metadata is being cached, which is actually 
what I wanted.
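
(As a quick sanity check of whether the bypass traffic is skewing that number, 
the ratio can be recomputed from the raw counters, using the same sysfs paths 
as the stats above:)

for d in /sys/block/bcache*/bcache/stats_day; do
    hits=$(cat $d/cache_hits); misses=$(cat $d/cache_misses)
    # hit ratio with the sequential bypass traffic left out of the calculation
    echo "$d: $((100 * hits / (hits + misses)))% hits, $(cat $d/cache_bypass_hits) bypass hits"
done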

I've also done some initial looking at the disk stats and the disks look like 
they are doing about half as much work as the existing nodes, which is 
encouraging.

I will try and get some good graphs to share over the course of this week.

Nick


> -Original Message-
> From: Peter Maloney [mailto:peter.malo...@brockmann-consult.de]
> Sent: 06 April 2017 16:04
> To: n...@fisk.me.uk; 'ceph-users' 
> Subject: Re: [ceph-users] Preconditioning an RBD image
> 
> On 03/25/17 23:01, Nick Fisk wrote:
> >
> >> I think I owe you another graph later when I put all my VMs on there
> >> (probably finally fixed my rbd snapshot hanging VM issue ...worked
> >> around it by disabling exclusive-lock,object-map,fast-diff). The
> >> bandwidth hungry ones (which hung the most often) were moved shortly
> >> after the bcache change, and it's hard to explain how it affects the
> >> graphs... easier to see with iostat while changing it and having a mix of
> cache and not than ganglia afterwards.
> > Please do, I can't resist a nice graph. What I would be really interested 
> > in is
> answers to these questions, if you can:
> >
> > 1. Has your per disk bandwidth gone up, due to removing random writes.
> > Ie. I struggle to get more than about 50MB/s writes per disk due to
> > extra random IO per request 2. Any feeling on how it helps with
> dentry/inode lookups. As mentioned above, I'm using 8TB disks and cold
> data has extra penalty for reads/writes as it has to lookup the FS data first 
> 3.
> I assume with 4.9 kernel you don't have the bcache fix which allows
> partitions. What method are you using to create OSDs?
> > 4. As mentioned above any stats around percentage of MB/s that is
> > hitting your cache device vs journal (assuming journal is 100% of IO).
> > This is to calculate extra wear
> >
> > Thanks,
> > Nick
> 
> So it's graph time...
> 
> Here's basically what you saw before, but I made it stacked (so 900 on the
> %util means like 18/27 of the disks in the whole cluster are at avg 50% in the
> sample period for that one pixel width of the graph) (remove gtype=stack
> and it won't be stacked, or http://www.brockmann-
> consult.de/ganglia/?c=ceph and find the aggregate report form and fill it out
> yourself ... I manually added date (cs and
> ce) copied from another url since that form doesn't have it, and only makes
> last x time periods. You can also find more metrics in the drop downs on that
> page. sda,sdb have always been the SSDs, disk metrics are 30s averages from
> iostat)
> 
> With no bcache until a bit at the end, plus some load from migrating to
> bcache possibly in there (didn't record dates on that).
> 
> %util -
> http://www.brockmann-
> consult.de/ganglia/graph.php?hreg[]=ceph.*&mreg[]=sd[c-
> z]_util&glegend=show&aggregate=1&_=1491205396888&cs=11%2F1%2F2016
> +21%3A18&ce=12%2F15%2F2016+4%3A21&z=xlarge&gtype=stack&x=1000
> await -
> http://www.brockmann-
> consult.de/ganglia/graph.php?hreg[]=ceph.*&mreg[]=sd[c-
> z]_await&glegend=show&aggregate=1&_=1491205396888&cs=11%2F1%2F20
> 16+21%3A18&ce=12%2F15%2F2016+4%3A21&z=xlarge&gtype=stack&x=1000
> wMBps -
> http://www.brockmann-
> consult.de/ganglia/graph.php?hreg[]=ceph.*&mreg[]=sd[c-
> z]_wMBps&glegend=show&aggregate=1&_=1491205396888&cs=11%2F1%2F
> 2016+21%3A18&ce=12%2F15%2F2016+4%3A21&z=xlarge&gtype=stack&x=30
> 0
> 
> And here is since most VMs were on ceph (more than the before graphs),
> with some osd-reweight-by-utilization started since a few days ago (but scrub
> disabled during that) making the last part look higher. And the last VMs were
> moved today, also seen on the graph, plus some extra backup load some
> time later.
> 
> %util -
> http://www.brockmann-
> consult.de/ganglia/graph.php?h

Re: [ceph-users] rbd iscsi gateway question

2017-04-06 Thread Nick Fisk
> -Original Message-
> From: David Disseldorp [mailto:dd...@suse.de]
> Sent: 06 April 2017 14:06
> To: Nick Fisk 
> Cc: 'Maged Mokhtar' ; 'Brady Deetz'
> ; 'ceph-users' 
> Subject: Re: [ceph-users] rbd iscsi gateway question
> 
> X-Assp-URIBL failed: 'suse.de'(black.uribl.com )
> X-Assp-Spam-Level: *
> X-Assp-Envelope-From: dd...@suse.de
> X-Assp-Intended-For: n...@fisk.me.uk
> X-Assp-ID: ASSP.fisk.me.uk (49148-08075)
> X-Assp-Version: 1.9.1.4(1.0.00)
> 
> Hi,
> 
> On Thu, 6 Apr 2017 13:31:00 +0100, Nick Fisk wrote:
> 
> > > I believe there
> > > was a request to include it mainstream kernel but it did not happen,
> > > probably waiting for TCMU solution which will be better/cleaner design.
> 
> Indeed, we're proceeding with TCMU as a future upstream acceptable
> implementation.
> 
> > Yes, should have mentioned this, if you are using the suse kernel,
> > they have a fix for this spiral of death problem.
> 
> I'm not to sure what you're referring to WRT the spiral of death, but we did
> patch some LIO issues encountered when a command was aborted while
> outstanding at the LIO backstore layer.
> These specific fixes are carried in the mainline kernel, and can be tested
> using the AbortTaskSimpleAsync libiscsi test.

Awesome, glad this has finally been fixed. The death spiral was referring to 
using it with ESXi: both the initiator and target effectively hang forever, and 
if you didn't catch it soon enough you sometimes ended up having to kill all 
the VMs and reboot hosts.

Do you know what kernel version these changes would have first gone into? I 
thought I looked back into this last summer and it was still showing the same 
behavior.

> 
> Cheers, David

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd iscsi gateway question

2017-04-06 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Maged Mokhtar
> Sent: 06 April 2017 12:21
> To: Brady Deetz ; ceph-users 
> Subject: Re: [ceph-users] rbd iscsi gateway question
> 
> The io hang (it is actually a pause not hang) is done by Ceph only in case
of a
> simultaneous failure of 2 hosts or 2 osds on separate hosts. A single
host/osd
> being out will not cause this.  In PetaSAN project www.petasan.org we use
> LIO/krbd. We have done a lot of tests on VMWare, in case of io failure,
the io
> will block for approx 30s on the VMWare ESX (default timeout, but can be
> configured)  then it will resume on the other MPIO path.
> 
> We are using a custom LIO/kernel upstreamed from SLE 12 used in their
> enterprise storage offering, it supports direct rbd backstore. I believe
there
> was a request to include it mainstream kernel but it did not happen,
> probably waiting for TCMU solution which will be better/cleaner design.

Yes, I should have mentioned this: if you are using the SUSE kernel, they have
a fix for this spiral-of-death problem. Any other distribution or vanilla
kernel will hang if a Ceph IO takes longer than about 5-10s. It's the path
failure handling which is the problem: LIO tries to abort the IO, but RBD
doesn't support this yet.

> 
> Cheers /maged
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd iscsi gateway question

2017-04-06 Thread Nick Fisk
I assume Brady is referring to the death spiral LIO gets into with some 
initiators, including VMware, if an IO takes longer than about 10s. I haven't 
heard of anything, and can't see any changes, so I would assume this issue 
still remains.

 

I would look at either SCST or NFS for now.

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Adrian 
Saul
Sent: 06 April 2017 05:32
To: Brady Deetz ; ceph-users 
Subject: Re: [ceph-users] rbd iscsi gateway question

 

 

I am not sure if there is a hard and fast rule you are after, but pretty much 
anything that would cause ceph transactions to be blocked (flapping OSD, 
network loss, hung host) has the potential to block RBD IO which would cause 
your iSCSI LUNs to become unresponsive for that period.

 

For the most part though, once that condition clears things keep working, so 
its not like a hang where you need to reboot to clear it.  Some situations we 
have hit with our setup:

 

*   Failed OSDs (dead disks) – no issues
*   Cluster rebalancing – ok if throttled back to keep service times down
*   Network packet loss (bad fibre) – painful, broken communication 
everywhere, caused a krbd hang needing a reboot
*   RBD Snapshot deletion – disk latency through roof, cluster unresponsive 
for minutes at a time, won’t do again.

 

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Brady 
Deetz
Sent: Thursday, 6 April 2017 12:58 PM
To: ceph-users
Subject: [ceph-users] rbd iscsi gateway question

 

I apologize if this is a duplicate of something recent, but I'm not finding 
much. Does the issue still exist where dropping an OSD results in a LUN's I/O 
hanging?

 

I'm attempting to determine if I have to move off of VMWare in order to safely 
use Ceph as my VM storage.

Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake. 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about unfound objects

2017-03-30 Thread Nick Fisk
That's interesting: the only time I have experienced unfound objects has also 
been related to snapshots, and highly likely snap trimming. I had a number of 
OSDs start flapping under the load of snap trimming, and 2 of them on the same 
host died with an assert.

 

From memory the unfound objects were relating to objects that had been trimmed, 
so I could just delete them. I assume that when the PG is remapped/recovered, 
as the objects have already been removed on the other OSDs, it tries to roll 
back the transaction and fails, hence it wants the now-down OSDs to try and 
roll back???
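
(For anyone searching the archives later, the usual sequence for chasing these 
is roughly the following; the PG id is made up, and check the docs for your 
release before running the last one.)

ceph health detail | grep unfound
ceph pg 2.5 list_missing        # lists the unfound objects in that PG
ceph pg 2.5 query               # "might_have_unfound" shows which OSDs are still being probed
# only once you are certain the objects are expendable (e.g. already-trimmed snap data):
# ceph pg 2.5 mark_unfound_lost delete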

 

From: Steve Taylor [mailto:steve.tay...@storagecraft.com] 
Sent: 30 March 2017 20:07
To: n...@fisk.me.uk; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Question about unfound objects

 

One other thing to note with this experience is that we do a LOT of RBD snap 
trimming, like hundreds of millions of objects per day added to our snap_trimqs 
globally. All of the unfound objects in these cases were found on other OSDs in 
the cluster with identical contents, but associated with different snapshots. 
In other words, the file contents matched exactly, but the xattrs differed and 
the filenames indicated that the objects belonged to different snapshots.

 

Some of the unfound objects belonged to head, so I don't necessarily believe 
that they were in the process of being trimmed, but I imagine there is some 
possibility that this issue is related to snap trimming or deleting snapshots. 
Just more information...

 

On Thu, 2017-03-30 at 17:13 +, Steve Taylor wrote:

Good suggestion, Nick. I actually did that at the time. The "ceph osd map" 
wasn't all that interesting because the OSDs had been outed and their PGs had 
been mapped to new OSDs. Everything appeared to be in order with the PGs being 
mapped to the right number of new OSDs. The PG mappings looked fine, but the 
objects just didn't exist anywhere except on the OSDs that had been marked out.

 

The PG queries were a little more useful, but still didn't really help in the 
end. In all cases (unfound objects from 2 OSDs in each of 2 occurrences), the 
PGs showed 5 or so OSDs where they thought the unfound objects might be, one of 
which was an OSD that had been marked out. In both cases we even waited until 
backfilling completed to see if perhaps the missing objects would turn up 
somewhere else, but none ever did.

 

In the first instance we were simply able to reattach the 2 OSDs to the cluster 
with 0 weight and recover the unfound objects. The second instance involved 
drive problems and was a little bit trickier. The drives had experienced errors 
and the XFS filesystems had both become corrupt and wouldn't even mount. We 
didn't have any spare drives large enough, so I ended up using dd, ignoring 
errors, to copy the disks to RBDs in a different Ceph cluster. I then kernel 
mapped the RBDs on the host with the failed drives, ran XFS repairs on them, 
mouted them to the OSD directories, started the OSDs, and put them back in the 
cluster with 0 weight. I was lucky enough that those objects were available and 
they were recovered. Of course I immediately removed those OSDs once the 
unfound objects cleared up.

 

That's the other intersting aspect of this problem. This cluster had 4TB HGST 
drives for its OSDs, but we had to expand it fairly urgently and didn't have 
enough drives. We added two new servers, each with 16 4TB drives and 16 8TB 
HGST He8 drives. In both instances the problems we encountered were with the 
8TB drives. We have since acquired more 4TB drives and have replaced all of the 
8TB drives in the cluster. We have a total of 8 production clusters globally 
and have been running Ceph in production for 2 years. These two occurences 
recently are the only times we've seen these types of issues, and it was 
exclusive to the 8TB OSDs. I'm not sure how that would cause such a problem, 
but it's an interesting data point.

 

On Thu, 2017-03-30 at 17:33 +0100, Nick Fisk wrote:

Hi Steve,

 

If you can recreate or if you can remember the object name, it might be worth 
trying to run “ceph osd map” on the objects and see where it thinks they map 
to. And/or maybe pg query might show something?

 

Nick

 


Re: [ceph-users] Question about unfound objects

2017-03-30 Thread Nick Fisk
Hi Steve,

 

If you can recreate or if you can remember the object name, it might be worth 
trying to run "ceph osd map" on the objects and see
where it thinks they map to. And/or maybe pg query might show something?
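
Something along these lines, purely to show the shape of it (pool, object and
PG here are made up):

ceph osd map rbd rbd_data.75c4b5238e1f29.0000000000000425
ceph pg 3.1ff query | less    # look at "recovery_state" and "might_have_unfound"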

 

Nick

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Steve 
Taylor
Sent: 30 March 2017 16:24
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Question about unfound objects

 

We've had a couple of puzzling experiences recently with unfound
objects, and I wonder if anyone can shed some light.

This happened with Hammer 0.94.7 on a cluster with 1,309 OSDs. Our use
case is exclusively RBD in this cluster, so it's naturally replicated.
The rbd pool size is 3, min_size is 2. The crush map is flat, so each
host is a failure domain. The OSD hosts are 4U Supermicro chassis with
32 OSDs each. Drive failures have caused the OSD count to be 1,309
instead of 1,312.

Twice in the last few weeks we've experienced issues where the cluster
was HEALTH_OK but was frequently getting some blocked requests. In each
of the two occurrences we investigated and discovered that the blocked
requests resulted from two drives in the same host that were
misbehaving (different set of 2 drives in each occurrence). We decided
to remove the misbehaving OSDs and let things backfill to see if that
would address the issue. Removing the drives resulted in a small number
of unfound objects, which was surprising. We were able to add the OSDs
back with 0 weight and recover the unfound objects in both cases, but
removing two OSDs from a single failure domain shouldn't have resulted
in unfound objects in an otherwise healthy cluster, correct?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New hardware for OSDs

2017-03-28 Thread Nick Fisk
Hi Christian,

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Christian Balzer
> Sent: 28 March 2017 00:59
> To: ceph-users@lists.ceph.com
> Cc: Nick Fisk 
> Subject: Re: [ceph-users] New hardware for OSDs
> 
> 
> Hello,
> 
> On Mon, 27 Mar 2017 16:09:09 +0100 Nick Fisk wrote:
> 
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > Behalf Of Wido den Hollander
> > > Sent: 27 March 2017 12:35
> > > To: ceph-users@lists.ceph.com; Christian Balzer 
> > > Subject: Re: [ceph-users] New hardware for OSDs
> > >
> > >
> > > > Op 27 maart 2017 om 13:22 schreef Christian Balzer :
> > > >
> > > >
> > > >
> > > > Hello,
> > > >
> > > > On Mon, 27 Mar 2017 12:27:40 +0200 Mattia Belluco wrote:
> > > >
> > > > > Hello all,
> > > > > we are currently in the process of buying new hardware to expand
> > > > > an existing Ceph cluster that already has 1200 osds.
> > > >
> > > > That's quite sizable, is the expansion driven by the need for more
> > > > space (big data?) or to increase IOPS (or both)?
> > > >
> > > > > We are currently using 24 * 4 TB SAS drives per osd with an SSD
> > > > > journal shared among 4 osds. For the upcoming expansion we were
> > > > > thinking of switching to either 6 or 8 TB hard drives (9 or 12
> > > > > per
> > > > > host) in order to drive down space and cost requirements.
> > > > >
> > > > > Has anyone any experience in mid-sized/large-sized deployment
> > > > > using such hard drives? Our main concern is the rebalance time
> > > > > but we might be overlooking some other aspects.
> > > > >
> > > >
> > > > If you researched the ML archives, you should already know to stay
> > > > well away from SMR HDDs.
> > > >
> > >
> > > Amen! Just don't. Stay away from SMR with Ceph.
> > >
> > > > Both HGST and Seagate have large Enterprise HDDs that have
> > > > journals/caches (MediaCache in HGST speak IIRC) that drastically
> > > > improve write IOPS compared to plain HDDs.
> > > > Even with SSD journals you will want to consider those, as these
> > > > new HDDs will see at least twice the action than your current ones.
> > > >
> >
> > I've got a mixture of WD Red Pro 6TB and HGST He8 8TB drives. Recovery
> > for ~70% full disks takes around 3-4 hours, this is for a cluster
> > containing 60 OSD's. I'm usually seeing recovery speeds up around 1GB/s
> or more.
> >
> Good data point.
> 
> How busy is your cluster at those times, client I/O impact?

It's normally around 20-30% busy through most of the day, with no real impact
to client IO. It's backup data, so buffered IO coming in via a WAN circuit.

> 
> > Depends on your workload, mine is for archiving/backups so big disks
> > are a must. I wouldn't recommend using them for more active workloads
> > unless you are planning a beefy cache tier or some other sort of caching
> solution.
> >
> > The He8 (and He10) drives also use a fair bit less power due to less
> > friction, but I think this only applies to the sata model. My 12x3.5
> > 8TB node with CPU...etc uses ~140W at idle. Hoping to get this down
> > further with a new Xeon-D design on next expansion phase.
> >
> > The only thing I will say about big disks is beware of cold FS
> > inodes/dentry's and PG splitting. The former isn't a problem if you
> > will only be actively accessing a small portion of your data, but I
> > see increases in latency if I access cold data even with VFS cache
pressure
> set to 1.
> > Currently investigating using bcache under the OSD to try and cache
this.
> >
> 
> I've seen this kind of behavior on my (non-Ceph) mailbox servers.
> As in, the maximum SLAB space may not be large enough to hold all inodes
> or the pagecache will eat into it over time when not constantly
referenced,
> despite cache pressure settings.
> 
> > PG splitting becomes a problem when the disks start to fill up,
> > playing with the split/merge thresholds may help, but you have to be
> > careful you don't end up with massive splits when they do finally
> > happen, as otherwise OSD's start timing out.
> >
> Getting this right (and predictable

Re: [ceph-users] New hardware for OSDs

2017-03-27 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Wido den Hollander
> Sent: 27 March 2017 12:35
> To: ceph-users@lists.ceph.com; Christian Balzer 
> Subject: Re: [ceph-users] New hardware for OSDs
> 
> 
> > Op 27 maart 2017 om 13:22 schreef Christian Balzer :
> >
> >
> >
> > Hello,
> >
> > On Mon, 27 Mar 2017 12:27:40 +0200 Mattia Belluco wrote:
> >
> > > Hello all,
> > > we are currently in the process of buying new hardware to expand an
> > > existing Ceph cluster that already has 1200 osds.
> >
> > That's quite sizable, is the expansion driven by the need for more
> > space (big data?) or to increase IOPS (or both)?
> >
> > > We are currently using 24 * 4 TB SAS drives per osd with an SSD
> > > journal shared among 4 osds. For the upcoming expansion we were
> > > thinking of switching to either 6 or 8 TB hard drives (9 or 12 per
> > > host) in order to drive down space and cost requirements.
> > >
> > > Has anyone any experience in mid-sized/large-sized deployment using
> > > such hard drives? Our main concern is the rebalance time but we
> > > might be overlooking some other aspects.
> > >
> >
> > If you researched the ML archives, you should already know to stay
> > well away from SMR HDDs.
> >
> 
> Amen! Just don't. Stay away from SMR with Ceph.
> 
> > Both HGST and Seagate have large Enterprise HDDs that have
> > journals/caches (MediaCache in HGST speak IIRC) that drastically
> > improve write IOPS compared to plain HDDs.
> > Even with SSD journals you will want to consider those, as these new
> > HDDs will see at least twice the action than your current ones.
> >

I've got a mixture of WD Red Pro 6TB and HGST He8 8TB drives. Recovery for
~70% full disks takes around 3-4 hours; this is for a cluster containing 60
OSDs. I'm usually seeing recovery speeds of around 1GB/s or more.

Depends on your workload, mine is for archiving/backups so big disks are a
must. I wouldn't recommend using them for more active workloads unless you
are planning a beefy cache tier or some other sort of caching solution.

The He8 (and He10) drives also use a fair bit less power due to less
friction, but I think this only applies to the sata model. My 12x3.5 8TB
node with CPU...etc uses ~140W at idle. Hoping to get this down further with
a new Xeon-D design on next expansion phase.

The only thing I will say about big disks is beware of cold FS inodes/dentries
and PG splitting. The former isn't a problem if you will only be actively
accessing a small portion of your data, but I see increases in latency if I
access cold data, even with VFS cache pressure set to 1. Currently
investigating using bcache under the OSD to try and cache this.
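
(For reference, a couple of quick checks along these lines show whether that
metadata is actually staying resident; the slab names assume XFS-backed OSDs:)

sysctl vm.vfs_cache_pressure             # set to 1 here, as mentioned above
slabtop -o | egrep 'dentry|xfs_inode'    # how much FS metadata is held in RAM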

PG splitting becomes a problem when the disks start to fill up. Playing with
the split/merge thresholds may help, but you have to be careful you don't end
up with massive splits when they do finally happen, as otherwise OSDs start
timing out.
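
(The thresholds in question are the filestore split/merge ones. A sketch of the
sort of change meant here, with purely illustrative values:)

[osd]
filestore merge threshold = 40
filestore split multiple = 8
# a subdirectory splits at roughly split_multiple * abs(merge_threshold) * 16
# objects, so these values delay splitting until ~5120 objects per directory
# instead of the default ~320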

> 
> I also have good experiences with bcache on NVM-E device in Ceph clusters.
> A single Intel P3600/P3700 which is the caching device for bcache.
> 
> > Rebalance time is a concern of course, especially if your cluster like
> > most HDD based ones has these things throttled down to not impede
> > actual client I/O.
> >
> > To get a rough idea, take a look at:
> > https://www.memset.com/tools/raid-calculator/
> >
> > For Ceph with replication 3 and the typical PG distribution, assume
> > 100 disks and the RAID6 with hotspares numbers are relevant.
> > For rebuild speed, consult your experience, you must have had a few
> > failures. ^o^
> >
> > For example with a recovery speed of 100MB/s, a 1TB disk (used data
> > with Ceph actually) looks decent at 1:16000 DLO/y.
> > At 5TB though it enters scary land
> >
> 
> Yes, those recoveries will take a long time. Let's say your 6TB drive is
filled for
> 80% you need to rebalance 4.8TB
> 
> 4.8TB / 100MB/sec = 13 hours rebuild time
> 
> 13 hours is a long time. And you will probably not have 100MB/sec
> sustained, I think that 50MB/sec is much more realistic.

Are we talking backfill or recovery here? Recovery will go at the combined
speed of all the disks in the cluster. If the OP's cluster is already at
1200 OSDs, a single disk will be a tiny percentage per OSD to recover. Yes,
backfill will probably crawl along at 50MB/s, but is that a problem?
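
(For reference, the throttles normally used to keep backfill down at that sort
of gentle rate; the values below are purely illustrative:)

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'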

> 
> That means that a single disk failure will take >24 hours to recover from
a
> rebuild.
> 
> I don't like very big disks that much. Not in RAID, not in Ceph.
> 
> Wido
> 
> > Christian
> >
> > > We currently use the cluster as storage for openstack services:
> > > Glance, Cinder and VMs' ephemeral disks.
> > >
> > > Thanks in advance for any advice.
> > >
> > > Mattia
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> >
> >
> > --
> > Christian BalzerNetwork/S

Re: [ceph-users] Preconditioning an RBD image

2017-03-25 Thread Nick Fisk
Thanks for your response Peter, comments in line
> -Original Message-
> From: Peter Maloney [mailto:peter.malo...@brockmann-consult.de]
> Sent: 23 March 2017 22:45
> To: n...@fisk.me.uk; 'ceph-users' 
> Subject: Re: [ceph-users] Preconditioning an RBD image
> 
> Hi Nick,
> 
> I didn't test with a colocated journal. I figure ceph knows what it's doing 
> with
> the journal device, and it has no filesystem, so there's no xfs journal, file
> metadata, etc. to cache due to small random sync writes.

Sure, I guess I was more interested in the simplicity of not having lots of 
partitions everywhere. Although with co-located journals, I guess you run the 
risk of large sequential journal IOs not being cached, effectively halving the 
speed of the disk.

> 
> I tested the bcache and journals on some SAS SSDs (rados bench was ok but
> real clients were really low bandwidth), and journals on NVMe (P3700) and
> bcache on some SAS SSDs, and also tested both on the NVMe. I think the
> performance is slightly better with it all on the NVMe (hdds being the
> bottleneck... tests in VMs show the same, but rados bench looks a tiny bit
> better). The bcache partition is shared by the osds, and the journals are
> separate partitions.
> 
> I'm not sure it's really triple overhead. bcache doesn't write all your data 
> to
> the writeback cache... just as much small sync writes as long as the cache
> doesn't fill up, or get too busy (based on await). And the bcache device
> flushes very slowly to the hdd, not overloading it (unless cache is full). And
> when I make it do it faster, it seems to do it more quickly than without
> bcache (like it does it more sequentially, or without sync; but I didn't 
> really
> measure... just looked at, eg. 400MB dirty data, and then it flushes in 20
> seconds). And if you overwrite the same data a few times (like a filesystem
> journal, or some fs metadata), you'd think it wouldn't have to write it more
> than once to the hdd in the end. Maybe that means something small like
> leveldb isn't written often to the hdd.

Yes, that makes sense, thanks for the explanation. I guess it depends on your 
IO profile. I've recently been hit with slowdowns when trying to copy large 
amounts of data into the cluster, mainly relating to random overheads on the 
disks. So this bcache thing looks interesting. I also suffer from slow 
dentry/inode lookups, despite VFS cache pressure being set to 1, so again, 
being able to cache this makes sense.

I'm guessing lots of small writes + leveldb updates are going to get written 
Journal->Bcache->Disk. I guess I'm just a bit nervous about putting too much 
wear on my NVMes by trying this. Do you have any stats showing the journal 
partition vs HDD/bcache to see if there is much amplification?
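
(Something like the below is the sort of side-by-side comparison meant here;
the device names are placeholders for the NVMe journal, the bcache cache device
and the backing HDDs:)

iostat -xmy 60 /dev/nvme0n1 /dev/sdb /dev/sd[c-n]   # compare the wMB/s column across devices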

> 
> And it's not just a write cache. The default is 10% writeback, which means the
> rest is read cache. And it keeps read stats so it knows which data is the most
> popular. My nodes right now show 33-44% cache hits (cache is too small I
> think). And bcache reorders writes on the cache device so they are
> sequential, and can write to both at the same time so it can actually go 
> faster
> than a pure ssd in specific situations (mixed sequential and random, only
> until the cache fills).
> 
> I think I owe you another graph later when I put all my VMs on there
> (probably finally fixed my rbd snapshot hanging VM issue ...worked around it
> by disabling exclusive-lock,object-map,fast-diff). The bandwidth hungry ones
> (which hung the most often) were moved shortly after the bcache change,
> and it's hard to explain how it affects the graphs... easier to see with 
> iostat
> while changing it and having a mix of cache and not than ganglia afterwards.

Please do, I can't resist a nice graph. What I would be really interested in is 
answers to these questions, if you can:

1. Has your per disk bandwidth gone up, due to removing random writes. Ie. I 
struggle to get more than about 50MB/s writes per disk due to extra random IO 
per request
2. Any feeling on how it helps with dentry/inode lookups. As mentioned above, 
I'm using 8TB disks and cold data has extra penalty for reads/writes as it has 
to lookup the FS data first
3. I assume with 4.9 kernel you don't have the bcache fix which allows 
partitions. What method are you using to create OSDs?
4. As mentioned above any stats around percentage of MB/s that is hitting your 
cache device vs journal (assuming journal is 100% of IO). This is to calculate 
extra wear

Thanks,
Nick

> 
> Peter
> 
> On 03/23/17 21:18, Nick Fisk wrote:
> Hi Peter,
> 
> Interesting graph. Out of interest, when you use bcache, do you then just
> leave the jou

Re: [ceph-users] cephfs cache tiering - hitset

2017-03-23 Thread Nick Fisk
 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mike 
Lovell
Sent: 20 March 2017 22:31
To: n...@fisk.me.uk
Cc: Webert de Souza Lima ; ceph-users 

Subject: Re: [ceph-users] cephfs cache tiering - hitset

 

 

 

On Mon, Mar 20, 2017 at 4:20 PM, Nick Fisk mailto:n...@fisk.me.uk> > wrote:

Just a few corrections, hope you don't mind

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Mike Lovell
> Sent: 20 March 2017 20:30
> To: Webert de Souza Lima  <mailto:webert.b...@gmail.com> >
> Cc: ceph-users mailto:ceph-users@lists.ceph.com> >
> Subject: Re: [ceph-users] cephfs cache tiering - hitset
>
> i'm not an expert but here is my understanding of it. a hit_set keeps track of
> whether or not an object was accessed during the timespan of the hit_set.
> for example, if you have a hit_set_period of 600, then the hit_set covers a
> period of 10 minutes. the hit_set_count defines how many of the hit_sets to
> keep a record of. setting this to a value of 12 with the 10 minute
> hit_set_period would mean that there is a record of objects accessed over a
> 2 hour period. the min_read_recency_for_promote, and its newer
> min_write_recency_for_promote sibling, define how many of these hit_sets
> and object must be in before and object is promoted from the storage pool
> into the cache pool. if this were set to 6 with the previous examples, it 
> means
> that the cache tier will promote an object if that object has been accessed at
> least once in 6 of the 12 10-minute periods. it doesn't matter how many
> times the object was used in each period and so 6 requests in one 10-minute
> hit_set will not cause a promotion. it would be any number of access in 6
> separate 10-minute periods over the 2 hours.

Sort of, the recency looks at the last N most recent hitsets. So if set to 6, 
then the object would have to be in all of the last 6 hitsets. Because of this, 
during testing I found setting recency above 2 or 3 made the behavior quite 
binary. If an object was hot enough, it would probably be in every hitset; if 
it was only warm it would never be in enough hitsets in a row. I did experiment 
with X-out-of-N promotion logic, i.e. must be in 3 hitsets out of 10, 
non-sequential. If you could find the right number to configure, you could get 
improved cache behavior, but if not, there was a large chance it would be worse.

For promotion I think having more hitsets probably doesn't add much value, but 
they may help when it comes to determining what to flush.

 

that's good to know. i just made an assumption without actually digging in to 
the code. do you recommend keeping the number of hitsets equal to the max of 
either min_read_recency_for_promote and min_write_recency_for_promote? how are 
the hitsets checked during flush and/or eviction?

 

 

Possibly, I’ve not really looked into how effective the hitsets are for 
determining what to flush. But hitset overhead is minimal, so I normally just 
stick with 10 hitsets and don’t even think about it anymore.
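
(For anyone wanting to experiment, these all live on the cache pool; the pool
name here is made up, and the values simply mirror the figures discussed above:)

ceph osd pool set cachepool hit_set_type bloom
ceph osd pool set cachepool hit_set_count 10
ceph osd pool set cachepool hit_set_period 600
ceph osd pool set cachepool min_read_recency_for_promote 2
ceph osd pool set cachepool min_write_recency_for_promote 2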

 

 

mike

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Preconditioning an RBD image

2017-03-23 Thread Nick Fisk
Hi Peter,

 

Interesting graph. Out of interest, when you use bcache, do you then just
leave the journal collocated on the combined bcache device and rely on the
writeback to provide journal performance, or do you still create a separate
partition on whatever SSD/NVME you use, effectively giving triple write
overhead?

 

Nick

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Peter Maloney
Sent: 22 March 2017 10:06
To: Alex Gorbachev ; ceph-users

Subject: Re: [ceph-users] Preconditioning an RBD image

 

Does iostat (eg.  iostat -xmy 1 /dev/sd[a-z]) show high util% or await
during these problems?

Ceph filestore requires lots of metadata writing (directory splitting for
example), xattrs, leveldb, etc. which are small sync writes that HDDs are
bad at (100-300 iops), and SSDs are good at (cheapo would be 6k iops, and
not so crazy DC/NVMe would be 20-200k iops and more). So in theory, these
things are mitigated by using an SSD, like bcache on your osd device. You
could also try something like that, at least to test.

I have tested with bcache in writeback mode and found hugely obvious
differences seen by iostat, for example here's my before and after (heavier
load due to converting week 49-50 or so, and the highest spikes being the
scrub infinite loop bug in 10.2.3): 

http://www.brockmann-consult.de/ganglia/graph.php?cs=10%2F25%2F2016+10%3A27

&ce=03%2F09%2F2017+17%3A26&z=xlarge&hreg[]=ceph.*&mreg[]=sd[c-z]_await&glege
nd=show&aggregate=1&x=100

But when you share a cache device, you get a single point of failure (and
bcache, like all software, can be assumed to have bugs too). And I recommend
vanilla kernel 4.9 or later which has many bcache fixes, or Ubuntu's 4.4
kernel which has the specific fixes I checked for.

On 03/21/17 23:22, Alex Gorbachev wrote:

I wanted to share the recent experience, in which a few RBD volumes,
formatted as XFS and exported via Ubuntu NFS-kernel-server performed poorly,
even generated an "out of space" warnings on a nearly empty filesystem.  I
tried a variety of hacks and fixes to no effect, until things started
magically working just after some dd write testing. 

 

The only explanation I can come up with is that preconditioning, or
thickening, the images with this benchmarking is what caused the
improvement.

 

Ceph is Hammer 0.94.7 running on Ubuntu 14.04, kernel 4.10 on OSD nodes and
4.4 on NFS nodes.

 

Regards,

Alex

Storcium

-- 

-- 

Alex Gorbachev

Storcium






___
ceph-users mailing list
ceph-users@lists.ceph.com  
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 

 

-- 
 

Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.malo...@brockmann-consult.de
 
Internet: http://www.brockmann-consult.de

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs cache tiering - hitset

2017-03-20 Thread Nick Fisk
Just a few corrections, hope you don't mind

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Mike Lovell
> Sent: 20 March 2017 20:30
> To: Webert de Souza Lima 
> Cc: ceph-users 
> Subject: Re: [ceph-users] cephfs cache tiering - hitset
> 
> i'm not an expert but here is my understanding of it. a hit_set keeps track of
> whether or not an object was accessed during the timespan of the hit_set.
> for example, if you have a hit_set_period of 600, then the hit_set covers a
> period of 10 minutes. the hit_set_count defines how many of the hit_sets to
> keep a record of. setting this to a value of 12 with the 10 minute
> hit_set_period would mean that there is a record of objects accessed over a
> 2 hour period. the min_read_recency_for_promote, and its newer
> min_write_recency_for_promote sibling, define how many of these hit_sets
> and object must be in before and object is promoted from the storage pool
> into the cache pool. if this were set to 6 with the previous examples, it 
> means
> that the cache tier will promote an object if that object has been accessed at
> least once in 6 of the 12 10-minute periods. it doesn't matter how many
> times the object was used in each period and so 6 requests in one 10-minute
> hit_set will not cause a promotion. it would be any number of access in 6
> separate 10-minute periods over the 2 hours.

Sort of, the recency looks at the last N most recent hitsets. So if set to 6, 
then the object would have to be in all of the last 6 hitsets. Because of this, 
during testing I found setting recency above 2 or 3 made the behavior quite 
binary. If an object was hot enough, it would probably be in every hitset; if 
it was only warm it would never be in enough hitsets in a row. I did experiment 
with X-out-of-N promotion logic, i.e. must be in 3 hitsets out of 10, 
non-sequential. If you could find the right number to configure, you could get 
improved cache behavior, but if not, there was a large chance it would be worse.

For promotion I think having more hitsets probably doesn't add much value, but 
they may help when it comes to determining what to flush.

> 
> this is just an example and might not fit well for your use case. the systems 
> i
> run have a lower hit_set_period, higher hit_set_count, and higher recency
> options. that means that the osds use some more memory (each hit_set
> takes space but i think they use the same amount of space regardless of
> period) but hit_set covers a smaller amount of time. the longer the period,
> the more likely a given object is in the hit_set. without knowing your access
> patterns, it would be hard to recommend settings. the overhead of a
> promotion can be substantial and so i'd probably go with settings that only
> promote after many requests to an object.

Also, in Jewel there is a promotion throttle which will limit promotions to 4MB/s
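
(From memory the knobs behind that throttle are the ones below; the names and
defaults are worth verifying against your release, and the values here are only
illustrative, with the bytes figure matching the 4MB/s mentioned above:)

[osd]
osd tier promote max bytes sec = 4194304      # ~4MB/s
osd tier promote max objects sec = 25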

> 
> one thing to note is that the recency options only seemed to work for me in
> jewel. i haven't tried infernalis. the older versions of hammer didn't seem to
> use the min_read_recency_for_promote properly and 0.94.6 definitely had a
> bug that could corrupt data when min_read_recency_for_promote was more
> than 1. even though that was fixed in 0.94.7, i was hesitant to increase it 
> will
> still on hammer. min_write_recency_for_promote wasn't added till after
> hammer.
> 
> hopefully that helps.
> mike
> 
> On Fri, Mar 17, 2017 at 2:02 PM, Webert de Souza Lima
>  wrote:
> Hello everyone,
> 
> I`m deploying a ceph cluster with cephfs and I`d like to tune ceph cache
> tiering, and I`m
> a little bit confused of the
> settings hit_set_count, hit_set_period and min_read_recency_for_promote.
> The docs are very lean and I can`f find any more detailed explanation
> anywhere.
> 
> Could someone provide me a better understandment of this?
> 
> Thanks in advance!
> 
> ___
> ceph-users mailing list
> mailto:ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

