Re: [ceph-users] Slow requests during OSD maintenance

2018-07-17 Thread Konstantin Shalygin
2. What is the best way to remove an OSD node from the cluster during maintenance? ceph osd set noout is not the way to go, since no OSDs are out during yum update and the node is still part of the cluster and will handle I/O. I think the best way is the combination of "ceph osd set noout" +
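A minimal sketch of the flag combination commonly used for short host maintenance (standard Ceph CLI flags; which extra flags people add, e.g. norebalance, varies by site):

  ceph osd set noout          # don't mark stopped OSDs "out", so no backfill starts
  ceph osd set norebalance    # optional: avoid data movement while the host is down
  # ... yum update / reboot the OSD node, wait for its OSDs to rejoin ...
  ceph osd unset norebalance
  ceph osd unset noout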

Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-17 Thread Linh Vu
I think the P4600 should be fine, although 2TB is probably way overkill for 15 OSDs. Our older nodes use the P3700 400GB for 16 OSDs. I have yet to see the WAL and DB getting filled up at 2GB/10GB each. Our newer nodes use the Intel Optane 900P 480GB, that's actually faster than the P4600

Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-17 Thread Oliver Schulz
Thanks, Linh! A question regarding choice of NVMe - do you think an Intel P4510 or P4600 would do well for WAL+DB? I'm thinking about using a single 2 TB NVMe for 15 OSDs. Would you recommend a different model? Is there any experience on how many 4k IOPS one should have for WAL+DB per OSD? We

Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-17 Thread Oliver Schulz
On 18.07.2018 00:43, Gregory Farnum wrote: > But you could also do workaround like letting it choose (K+M)/2 racks > and putting two shards in each rack. Oh yes, you are more susceptible to top-of-rack switch failures in this case or whatever. It's just one option — many people

[ceph-users] Exact scope of OSD heartbeating?

2018-07-17 Thread Anthony D'Atri
The documentation here: http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/ says "Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons every 6 seconds" and " If a neighboring

Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-17 Thread Linh Vu
On our NLSAS OSD nodes, there is 1x NVMe PCIe card for all the WALs and DBs (we accept that the risk of 1 card failing is low, and our failure domain is host anyway). Each OSD (16 per host) gets 2GB of WAL and 10GB of DB. On our Flash (SSD but not NVMe) OSD nodes, there are 8 OSDs per node,
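For reference, a hedged ceph.conf sketch of how fixed per-OSD WAL/DB sizes like the 2GB/10GB above can be expressed; the option names are standard BlueStore settings, but they only take effect when an OSD is (re)created, and whether your deployment tooling (ceph-disk/ceph-volume) sizes partitions from them is deployment-specific:

  [osd]
  bluestore_block_wal_size = 2147483648    # 2 GB WAL per OSD (bytes)
  bluestore_block_db_size  = 10737418240   # 10 GB DB per OSD (bytes)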

Re: [ceph-users] v12.2.7 Luminous released

2018-07-17 Thread Linh Vu
Thanks for all your hard work in putting out the fixes so quickly! :) We have a cluster on 12.2.5 with Bluestore and EC pool but for CephFS, not RGW. In the release notes, it says RGW is at risk, especially the garbage collection, and the recommendation is to either pause IO or disable RGW

Re: [ceph-users] resize wal/db

2018-07-17 Thread Shunde Zhang
Hi Igor, Thanks for the reply. I am using NVMe SSD for wal/db. Do you have any instruction on how to create a partition for wal/db on it? For example, the types of the existing partitions are unknown. If I create a new one, it defaults to Linux: Disk /dev/nvme0n1: 500.1 GB, 500107862016 bytes,
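A hedged sketch of one way to carve WAL/DB partitions out of the NVMe with sgdisk and hand them to a new OSD (device names and partition numbers are placeholders; the "Linux" type code shown by fdisk is generally harmless when the partitions are passed explicitly to ceph-volume):

  sgdisk --new=1:0:+2G  --change-name=1:ceph-wal-0 /dev/nvme0n1   # 2 GiB WAL
  sgdisk --new=2:0:+10G --change-name=2:ceph-db-0  /dev/nvme0n1   # 10 GiB DB
  partprobe /dev/nvme0n1
  ceph-volume lvm create --bluestore --data /dev/sdX \
      --block.wal /dev/nvme0n1p1 --block.db /dev/nvme0n1p2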

Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Tom W
Hi Bryan, Try both of the commands that timed out again, but with the --verbose flag, and see if we can get anything from that. Tom From: Bryan Banister Sent: 17 July 2018 23:51 To: Tom W ; ceph-users@lists.ceph.com Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked

Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Bryan Banister
Hey Tom, Ok, yeah, I've used those steps before myself for other operations like cluster software updates. I'll try them. As for the query, not sure how I missed that, and I think I've used it before! Unfortunately it just hangs, similar to the other daemon operation: root@rook-tools:/# ceph pg

Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Tom W
Hi Bryan, Try "ceph pg xx.xx query", with xx.xx of course being the PG number. With some luck this will give you the state of that individual pg and which OSDs or issues may be blocking the peering from completing, which can be used as a clue perhaps as to the cause. If you can find a point of

Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Bryan Banister
Thanks Tom, Yes, we can try pausing I/O to give the cluster time to recover. I assume that you're talking about using `ceph osd set pause` for this? We did finally get some health output, which seems to indicate everything is basically stuck: 2018-07-17 21:00:00.000107 mon.rook-ceph-mon7
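For completeness, the pause flag Bryan refers to is set and cleared like the other cluster flags, and it stops all client reads and writes while set:

  ceph osd set pause
  ceph osd unset pause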

Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Tom W
Hi Bryan, That's unusual, and not something I can really begin to unravel. As some other pointers, perhaps run a PG query on some of the inactive and peering PGs for any potentially useful output? I suspect from what you've put that most PGs are simply in a down and peering state, and it

Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Bryan Banister
Hi Tom, I tried to check out the ops in flight as you suggested but this seems to just hang: root@rook-ceph-osd-carg-kubelet-osd02-m9rhx:/# ceph --admin-daemon /var/lib/rook/osd238/rook-osd.238.asok daemon osd.238 dump_ops_in_flight Nothing returns and I don't get a prompt back. The cluster is
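As a hedged aside, the admin-socket command is normally given in one of these two forms (the socket path below is the one from the post); on a healthy OSD either should return JSON immediately:

  ceph daemon osd.238 dump_ops_in_flight                                   # on the OSD's host/pod
  ceph --admin-daemon /var/lib/rook/osd238/rook-osd.238.asok dump_ops_in_flight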

Re: [ceph-users] v12.2.7 Luminous released

2018-07-17 Thread Cassiano Pilipavicius
FYI, I have updated some OSDs from 12.2.6 that were suffering from the CRC error, and 12.2.7 fixed the issue! I installed some new OSDs on 12/07 without being aware of the issue, and in my small clusters, I just noticed the problem when I was trying to copy some RBD images to another

Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Tom W
Hi Bryan, OSDs may not truly be up; this flag merely prevents them from being marked as down even if they are unresponsive. It may be worth unsetting nodown as soon as you are confident, but unsetting it before anything changes will just return to the previous state. Perhaps not harmful, but I have

Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Bryan Banister
Hi Tom, Decided to try your suggestion of the 'nodown' setting and this indeed has gotten all of the OSDs up and they haven't failed out like before. However, the PGs are in bad states and Ceph doesn't seem interested in starting recovery over the last 30 minutes since the latest health message

Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Tom W
Prior to the OSD being marked as down by the cluster, do you notice the PGs becoming inactive on it? Using a flag such as nodown may prevent OSDs flapping if it helps reduce the IO load to see if things stabilise out, but be wary of this flag as I believe PGs using the OSD as the primary will not
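The flag itself is set and cleared like the other cluster flags; a small sketch:

  ceph osd set nodown     # cluster stops marking unresponsive OSDs down
  ceph -s                 # watch whether OSDs stay up and PGs start peering
  ceph osd unset nodown   # remove once things stabilise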

Re: [ceph-users] v12.2.7 Luminous released

2018-07-17 Thread Sage Weil
On Tue, 17 Jul 2018, Stefan Kooman wrote: > Quoting Abhishek Lekshmanan (abhis...@suse.com): > > > *NOTE* The v12.2.5 release has a potential data corruption issue with > > erasure coded pools. If you ran v12.2.5 with erasure coding, please see ^^^ > > below. > > < snip > >

Re: [ceph-users] v12.2.7 Luminous released

2018-07-17 Thread Stefan Kooman
Quoting Abhishek Lekshmanan (abhis...@suse.com): > *NOTE* The v12.2.5 release has a potential data corruption issue with > erasure coded pools. If you ran v12.2.5 with erasure coding, please see > below. < snip > > Upgrading from v12.2.5 or v12.2.6 > - > > If

Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Bryan Banister
I didn't find anything obvious in the release notes about this issue we seem to have, but I don't really understand it. We have seen logs indicating some kind of heartbeat issue with OSDs, but we don't believe there are any issues with the networking between the nodes, which are mostly idle as

Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Bryan Banister
Hi Tom, We're apparently running ceph version 12.2.5 on a Rook based cluster. We have EC pools on large 8TB HDDs and metadata on bluestore OSDs on NVMe drives. I'll look at the release notes. Thanks! -Bryan From: Tom W [mailto:to...@ukfast.co.uk] Sent: Tuesday, July 17, 2018 12:05 PM To:

Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Tom W
Hi Bryan, What version of Ceph are you currently running on, and do you run any erasure coded pools or bluestore OSDs? Might be worth having a quick glance over the recent changelogs: http://docs.ceph.com/docs/master/releases/luminous/ Tom From: ceph-users

[ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Bryan Banister
Hi all, We're still very new to managing Ceph and seem to have a cluster that is in an endless loop of failing OSDs, then marking them down, then booting them again. Here are some example logs: 2018-07-17 16:48:28.976673 mon.rook-ceph-mon7 [INF] osd.83 failed

[ceph-users] Recovery from 12.2.5 (corruption) -> 12.2.6 (hair on fire) -> 13.2.0 (some objects inaccessible and CephFS damaged)

2018-07-17 Thread Troy Ablan
I was on 12.2.5 for a couple weeks and started randomly seeing corruption, moved to 12.2.6 via yum update on Sunday, and all hell broke loose. I panicked and moved to Mimic, and when that didn't solve the problem, only then did I start to root around in mailing lists archives. It appears I can't

Re: [ceph-users] Is Ceph the right tool for storing lots of small files?

2018-07-17 Thread Gregory Farnum
On Mon, Jul 16, 2018 at 11:41 PM Christian Wimmer < christian.wim...@jaumo.com> wrote: > Hi all, > > I am trying to use Ceph with RGW to store lots (>300M) of small files (80% > 2-15kB, 20% up to 500kB). > After some testing, I wonder if Ceph is the right tool for that. > > Does anybody of you

Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-17 Thread Gregory Farnum
On Tue, Jul 17, 2018 at 3:40 AM Jake Grimmett wrote: > > I'd be interested to hear more from Greg about why cache pools are best > avoided... > While performance has improved over many releases, cache pools still don't do well on most workloads that most people use them for. As a result we've

[ceph-users] luminous librbd::image::OpenRequest: failed to retreive immutable metadata

2018-07-17 Thread Dan van der Ster
Hi, This mail is for the search engines. An old "Won't Fix" ticket is still quite relevant: http://tracker.ceph.com/issues/16211 When you upgrade an old rbd cluster to luminous, there is a good chance you will have several rbd images with unreadable header objects. E.g. # rbd info -p volumes

[ceph-users] v12.2.7 Luminous released

2018-07-17 Thread Abhishek Lekshmanan
This is the seventh bugfix release of Luminous v12.2.x long term stable release series. This release contains several fixes for regressions in the v12.2.6 and v12.2.5 releases. We recommend that all users upgrade. *NOTE* The v12.2.6 release has serious known regressions, while 12.2.6 wasn't

[ceph-users] multisite and link speed

2018-07-17 Thread Robert Stanford
I have ceph clusters in a zone configured as active/passive, or primary/backup. If the network link between the two clusters is slower than the speed of data coming in to the active cluster, what will eventually happen? Will data pool on the active cluster until memory runs out?

Re: [ceph-users] resize wal/db

2018-07-17 Thread Igor Fedotov
For now you can expand that space up to the actual volume size using ceph-bluestore-tool commands (bluefs-bdev-expand and set-label-key), which is a bit tricky though. And I'm currently working on a solution within ceph-bluestore-tool to simplify both expansion and migration. Thanks, Igor
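A hedged sketch of the expansion step Igor mentions, assuming the underlying DB volume/partition has already been grown (OSD id 42 and the path are placeholders; the set-label-key step is omitted here as its details vary):

  systemctl stop ceph-osd@42
  ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-42
  systemctl start ceph-osd@42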

Re: [ceph-users] resize wal/db

2018-07-17 Thread Nicolas Huillard
On Tuesday, 17 July 2018 at 16:20 +0300, Igor Fedotov wrote: > Right, but the procedure described in the blog can be pretty easily > adjusted to do a resize. Sure, but if I remember correctly, Ceph itself cannot use the increased size: you'll end up with a larger device with unused additional

Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-17 Thread Oliver Schulz
Dear Linh, another question, if I may: How do you handle Bluestore WAL and DB, and how much SSD space do you allocate for them? Cheers, Oliver On 17.07.2018 08:55, Linh Vu wrote: Hi Oliver, We have several CephFS on EC pool deployments, one has been in production for a while, the others

Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-17 Thread Oliver Schulz
Thanks a lot, Linh! On 17.07.2018 08:55, Linh Vu wrote: Hi Oliver, We have several CephFS on EC pool deployments, one has been in production for a while, the others about to go into production pending all the Bluestore+EC fixes in 12.2.7. Firstly, as John and Greg have said, you don't need an SSD cache pool at

Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-17 Thread Oliver Schulz
Hi Greg, On 17.07.2018 03:01, Gregory Farnum wrote: Since Luminous, you can use an erasure coded pool (on bluestore) directly as a CephFS data pool, no cache pool needed. More than that, we'd really prefer you didn't use cache pools for anything. Just Say No. :) Thanks for the

Re: [ceph-users] tcmalloc performance still relevant?

2018-07-17 Thread Uwe Sauter
I asked a similar question about 2 weeks ago, subject "jemalloc / Bluestore". Have a look at the archives. Regards, Uwe On 17.07.2018 at 15:27, Robert Stanford wrote: > Looking here: > https://ceph.com/geen-categorie/the-ceph-and-tcmalloc-performance-story/ > > I see that it was a

Re: [ceph-users] Read/write statistics per RBD image

2018-07-17 Thread Jason Dillaman
Yes, you just need to enable the "admin socket" in your ceph.conf and then use "ceph --admin-daemon /path/to/image/admin/socket.asok perf dump". On Tue, Jul 17, 2018 at 8:53 AM Mateusz Skala (UST, POL) < mateusz.sk...@ust-global.com> wrote: > Hi, > > It is possible to get statistics of issued
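A hedged sketch of what that looks like (the socket path pattern is the standard metavariable expansion; the concrete .asok filename below is a placeholder, as the real one depends on the client's pid):

  # ceph.conf on the client running the RBD workload
  [client]
  admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok

  # then, against the socket created by that client process:
  ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.94586478.asok perf dump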

Re: [ceph-users] checking rbd volumes modification times

2018-07-17 Thread Jason Dillaman
That's not possible right now, but work is in-progress to add that to a future release of Ceph [1]. [1] https://github.com/ceph/ceph/pull/21114 On Mon, Jul 16, 2018 at 7:12 PM Andrei Mikhailovsky wrote: > Dear cephers, > > Could someone tell me how to check the rbd volumes modification times

[ceph-users] tcmalloc performance still relevant?

2018-07-17 Thread Robert Stanford
Looking here: https://ceph.com/geen-categorie/the-ceph-and-tcmalloc-performance-story/ I see that it was a good idea to change to JEMalloc. Is this still the case, with up to date Linux and current Ceph?

Re: [ceph-users] resize wal/db

2018-07-17 Thread Igor Fedotov
Right, but the procedure described in the blog can be pretty easily adjusted to do a resize. Thanks, Igor On 7/17/2018 11:10 AM, Eugen Block wrote: Hi, There is no way to resize the DB while the OSD is running. There is a bit shorter "unofficial" but risky way than redeploying the OSD though. But you'll

Re: [ceph-users] ls operation is too slow in cephfs

2018-07-17 Thread John Spray
On Tue, Jul 17, 2018 at 8:26 AM Surya Bala wrote: > > Hi folks, > > We have a production cluster with 8 nodes and each node has 60 disks of size > 6TB each. We are using cephfs and the FUSE client with a global mount point. We are > doing rsync from our old server to this cluster; rsync is slow compared

[ceph-users] Read/write statistics per RBD image

2018-07-17 Thread Mateusz Skala (UST, POL)
Hi, Is it possible to get statistics of issued reads/writes to a specific RBD image? Best would be statistics like in /proc/diskstats in Linux. Regards Mateusz

Re: [ceph-users] ls operation is too slow in cephfs

2018-07-17 Thread Brenno Augusto Falavinha Martinez
Hi, could you share your MDS hardware specifications? Regards, Brenno Martinez SUPCD / CDENA / CDNIA (41) 3593-8423 - Original Message - From: "Daniel Baumann" To: "Ceph Users" Sent: Tuesday, 17 July 2018 6:45:25 Subject: Re: [ceph-users] ls operation is too slow in

Re: [ceph-users] Why the change from ceph-disk to ceph-volume and lvm? (and just not stick with direct disk access)

2018-07-17 Thread Marc Roos
I still wanted to thank you for the nicely detailed arguments regarding this; it is much appreciated. It really gives me the broader perspective I was lacking. -Original Message- From: Warren Wang [mailto:warren.w...@walmart.com] Sent: Monday, 11 June 2018 17:30 To: Konstantin

Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-17 Thread Jake Grimmett
Hi Oliver, We put Cephfs directly on an 8:2 EC cluster, (10 nodes, 450 OSD), but put metadata on a replicated pool using NVMe drives (1 per node, 5 nodes). We get great performance with large files, but as Linh indicated, IOPS with small files could be better. I did consider adding a replicated

[ceph-users] Slow requests during OSD maintenance

2018-07-17 Thread sinan
Hi, On one of our OSD nodes I performed a "yum update" with the Ceph repositories disabled, so only the OS packages were being updated. During, and especially at the end of, the yum update, the cluster started to have slow/blocked requests and all VMs with a Ceph storage backend had high I/O load. After

Re: [ceph-users] ls operation is too slow in cephfs

2018-07-17 Thread Surya Bala
Previously we had multi-active MDS. But at that time we got slow/stuck requests when multiple clients were accessing the cluster, so we decided to have a single active MDS with all the others as standby. When we got this issue, MDS trimming was going on. When we checked the last ops: { "ops": [ {

Re: [ceph-users] ls operation is too slow in cephfs

2018-07-17 Thread Daniel Baumann
On 07/17/2018 11:43 AM, Marc Roos wrote: > I had similar thing with doing the ls. Increasing the cache limit helped > with our test cluster same here; additionally we also had to use more than one MDS to get good performance (currently 3 MDS plus 2 stand-by per FS). Regards, Daniel

Re: [ceph-users] ls operation is too slow in cephfs

2018-07-17 Thread Marc Roos
I had a similar thing with doing the ls. Increasing the cache limit helped with our test cluster: mds_cache_memory_limit = 80 -Original Message- From: Surya Bala [mailto:sooriya.ba...@gmail.com] Sent: Tuesday, 17 July 2018 11:39 To: Anton Aleksandrov Cc:
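The quoted value is cut off above; as an illustrative example only (8 GiB is my assumption, the option takes bytes), the limit can be set in ceph.conf or injected at runtime:

  [mds]
  mds_cache_memory_limit = 8589934592    # 8 GiB, value in bytes

  # runtime, for a hypothetical MDS named "a":
  ceph tell mds.a injectargs '--mds_cache_memory_limit=8589934592'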

Re: [ceph-users] ls operation is too slow in cephfs

2018-07-17 Thread Surya Bala
Thanks for the reply, Anton. CPU core count - 40, RAM - 250GB. We have a single active MDS, Ceph version Luminous 12.2.4. The default PG number is 64 and we are not changing the PG count while creating pools. We have 8 servers in total, each with 60 OSDs of 6TB size. The 8 servers are split into 2 per region. Crush map
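As a hedged aside, 64 PGs spread over that many OSDs is far below the usual rule of thumb of roughly 100 PGs per OSD; a back-of-the-envelope check (replica count 3 is my assumption, adjust for EC and per-pool weighting):

  osds=480; target_per_osd=100; size=3          # 8 nodes x 60 OSDs
  echo $(( osds * target_per_osd / size ))      # ~16000 -> round to a power of two, e.g. 16384, across all pools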

Re: [ceph-users] Mon scrub errors

2018-07-17 Thread Elias Abacioglu
Hi, Sorry for bumping an old thread, but I have the same issue. Is it the mon.0 that I should reset? 2018-07-02 11:39:51.223731 [ERR] mon.2 ScrubResult(keys {monmap=48,osd_metadata=52} crc {monmap=3659766006,osd_metadata=1928984927}) 2018-07-02 11:39:51.223701 [ERR] mon.0 ScrubResult(keys

Re: [ceph-users] Delete pool nicely

2018-07-17 Thread Simon Ironside
On 22/05/18 18:28, David Turner wrote: From my experience, that would cause you some trouble as it would throw the entire pool into the deletion queue to be processed as it cleans up the disks and everything. I would suggest using a pool listing from `rados -p .rgw.buckets ls` and iterate
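A minimal sketch of that approach, assuming the goal is to drain objects gradually instead of issuing a single pool delete (the sleep is an arbitrary throttle):

  rados -p .rgw.buckets ls > objects.txt
  while read -r obj; do
      rados -p .rgw.buckets rm "$obj"
      sleep 0.01
  done < objects.txt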

Re: [ceph-users] ls operation is too slow in cephfs

2018-07-17 Thread Anton Aleksandrov
You need to give us more details about your OSD setup and hardware specification of nodes (CPU core count, RAM amount) On 2018.07.17. 10:25, Surya Bala wrote: Hi folks, We have production cluster with 8 nodes and each node has 60 disks of size 6TB each. We are using cephfs and FUSE client

Re: [ceph-users] resize wal/db

2018-07-17 Thread Eugen Block
Hi, There is no way to resize the DB while the OSD is running. There is a bit shorter "unofficial" but risky way than redeploying the OSD though. But you'll need to tag the specific OSD out for a while in any case. You will also need either additional free partition(s) or the initial deployment had to be

[ceph-users] ls operation is too slow in cephfs

2018-07-17 Thread Surya Bala
Hi folks, We have a production cluster with 8 nodes and each node has 60 disks of size 6TB each. We are using cephfs and the FUSE client with a global mount point. We are doing rsync from our old server to this cluster; rsync is slow compared to a normal server. When we do 'ls' inside some folder, which has

[ceph-users] Is Ceph the right tool for storing lots of small files?

2018-07-17 Thread Christian Wimmer
Hi all, I am trying to use Ceph with RGW to store lots (>300M) of small files (80% 2-15kB, 20% up to 500kB). After some testing, I wonder if Ceph is the right tool for that. Does anybody of you have experience with this use case? Things I came across: - EC pools: default stripe-width is 4kB.
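For reference, the stripe parameters mentioned can be inspected on the existing profile and, if needed, set on a new one (the profile name and k/m values below are illustrative only):

  ceph osd erasure-code-profile get default
  ceph osd erasure-code-profile set smallobj k=2 m=1 stripe_unit=4K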