Re: [ceph-users] RDMA/RoCE enablement failed with (113) No route to host
I was informed today that the Ceph environment I've been working on is no
longer available. Unfortunately this happened before I could try any of your
suggestions, Roman. Thank you for all the attention and advice.
--
Michael Green

> On Dec 20, 2018, at 08:21, Roman Penyaev wrote:
>
>> On 2018-12-19 22:01, Marc Roos wrote:
>> I would be interested in learning about the performance increase it has
>> compared to 10Gbit. I have the ConnectX-3 Pro, but I am not using RDMA
>> because support is not available by default.
>
> Not too much. The following is a comparison on latest master using the fio
> engine, which measures bare Ceph messenger performance (no disk IO):
> https://github.com/ceph/ceph/pull/24678
>
> Mellanox MT27710 Family [ConnectX-4 Lx] 25gb/s:
>
> bs     iodepth=8, async+posix            iodepth=8, async+rdma
> ---------------------------------------------------------------------
> 4k     IOPS=30.0k BW=121MiB/s  0.257ms   IOPS=47.9k BW=187MiB/s  0.166ms
> 8k     IOPS=30.8k BW=240MiB/s  0.259ms   IOPS=46.3k BW=362MiB/s  0.172ms
> 16k    IOPS=25.1k BW=392MiB/s  0.318ms   IOPS=45.2k BW=706MiB/s  0.176ms
> 32k    IOPS=23.1k BW=722MiB/s  0.345ms   IOPS=37.5k BW=1173MiB/s 0.212ms
> 64k    IOPS=18.0k BW=1187MiB/s 0.420ms   IOPS=41.0k BW=2624MiB/s 0.189ms
> 128k   IOPS=12.1k BW=1518MiB/s 0.657ms   IOPS=20.9k BW=2613MiB/s 0.381ms
> 256k   IOPS=3530  BW=883MiB/s  2.265ms   IOPS=4624  BW=1156MiB/s 1.729ms
> 512k   IOPS=2084  BW=1042MiB/s 3.387ms   IOPS=2406  BW=1203MiB/s 3.32ms
> 1m     IOPS=1119  BW=1119MiB/s 7.145ms   IOPS=1277  BW=1277MiB/s 6.26ms
> 2m     IOPS=551   BW=1101MiB/s 14.51ms   IOPS=631   BW=1263MiB/s 12.66ms
> 4m     IOPS=272   BW=1085MiB/s 29.45ms   IOPS=318   BW=1268MiB/s 25.17ms
>
> bs     iodepth=128, async+posix          iodepth=128, async+rdma
> ---------------------------------------------------------------------
> 4k     IOPS=75.9k BW=297MiB/s  1.683ms   IOPS=83.4k BW=326MiB/s  1.535ms
> 8k     IOPS=64.3k BW=502MiB/s  1.989ms   IOPS=70.3k BW=549MiB/s  1.819ms
> 16k    IOPS=53.9k BW=841MiB/s  2.376ms   IOPS=57.8k BW=903MiB/s  2.214ms
> 32k    IOPS=42.2k BW=1318MiB/s 3.034ms   IOPS=59.4k BW=1855MiB/s 2.154ms
> 64k    IOPS=30.0k BW=1934MiB/s 4.135ms   IOPS=42.3k BW=2645MiB/s 3.023ms
> 128k   IOPS=18.1k BW=2268MiB/s 7.052ms   IOPS=21.2k BW=2651MiB/s 6.031ms
> 256k   IOPS=5186  BW=1294MiB/s 24.71ms   IOPS=5253  BW=1312MiB/s 24.39ms
> 512k   IOPS=2897  BW=1444MiB/s 44.19ms   IOPS=2944  BW=1469MiB/s 43.48ms
> 1m     IOPS=1306  BW=1297MiB/s 97.98ms   IOPS=1421  BW=1415MiB/s 90.27ms
> 2m     IOPS=612   BW=1199MiB/s 208.6ms   IOPS=862   BW=1705MiB/s 148.9ms
> 4m     IOPS=316   BW=1235MiB/s 409.1ms   IOPS=416   BW=1664MiB/s 307.4ms
>
> 1. As you can see, there is no big difference between posix and rdma.
>
> 2. Even though a 25gb/s card is used, we barely reach 20gb/s. I also have
>    results on 100gb/s QLogic cards, and there is no difference, because
>    the bottleneck is not the network. This is especially visible on loads
>    with a bigger iodepth: bandwidth does not change significantly, so even
>    if you increase the number of requests in flight, you hit the limit of
>    how fast those requests can be processed.
>
> 3. Keep in mind this is only messenger performance; on real Ceph loads you
>    will get less, because of the whole IO stack involved.
>
> --
> Roman

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
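As a quick sanity check on tables like the ones above, bandwidth follows directly from IOPS and block size (BW = IOPS × bs). A minimal sketch, using the 4k async+rdma row from the iodepth=8 table:

```python
# Sanity-check helper for fio-style results: BW = IOPS * block size.
# The sample values are taken from the iodepth=8 table in this thread.
def bw_mib_s(iops: float, bs_kib: float) -> float:
    """Bandwidth in MiB/s implied by an IOPS figure at a given block size."""
    return iops * bs_kib / 1024

# 4k async+rdma: 47.9k IOPS at 4 KiB is ~187 MiB/s, matching the table.
print(round(bw_mib_s(47_900, 4)))  # 187
```

The same arithmetic shows why large-block rows converge to roughly the same bandwidth: past ~1 MiB the messenger, not the block size, is the limit.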
Re: [ceph-users] cephfs file block size: must it be so big?
On Fri, Dec 14, 2018 at 6:44 PM Bryan Henderson wrote:
>
> > Going back through the logs though it looks like the main reason we do a
> > 4MiB block size is so that we have a chance of reporting actual cluster
> > sizes to 32-bit systems,
>
> I believe you're talking about a different block size (there are so many
> of them).
>
> The 'statvfs' system call (the essence of a 'df' command) can return its
> space sizes in any units it wants, and tells you that unit. The unit has
> variously been called block size and fragment size. In Cephfs, it is
> hardcoded as 4 MiB so that 32-bit fields can represent large storage
> sizes. I'm not aware that anyone attempts to use that value for anything
> but interpreting statvfs results. Not saying they don't, though.
>
> What I'm looking at, in contrast, is the block size returned by a 'stat'
> system call on a particular file. In Cephfs, it's the stripe unit size for
> the file, which is an aspect of the file's layout. In the default layout,
> the stripe unit size is 4 MiB.

You are of course correct; sorry for the confusion. It looks like this was
introduced in (user space) commit 0457783f6eb0c41951b6d56a568eccaeccec8e6d,
which swapped it from the previous hard-coded 4096, probably in the
expectation that there might be still-small stripe units that were
nevertheless useful to do IO in terms of.

You might want to try to be more sophisticated than just having a mount
option to override the reported block size, perhaps forcing the reported
size within some reasonable limits while trying to keep some relationship
between it and the stripe size? If someone deploys an erasure-coded pool
under CephFS, they definitely want to be doing IO in the stripe size if
possible, rather than 4 or 8 KiB.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
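The two block sizes being distinguished above can be seen side by side from Python. A minimal sketch; the paths are placeholders, and on a CephFS client you would point them at the mount point and a file on it:

```python
import os

# statvfs: the "df" view. f_frsize is the unit in which the space totals
# below are reported; CephFS hardcodes it to 4 MiB. "/" is a placeholder
# path; on a CephFS client this would be the mount point.
sv = os.statvfs("/")
print("reporting unit (f_frsize):", sv.f_frsize)
print("total bytes:", sv.f_frsize * sv.f_blocks)

# stat: the per-file view. st_blksize is the preferred IO size for the
# file; on CephFS it reflects the file's stripe unit (4 MiB by default).
st = os.stat(".")
print("preferred IO size (st_blksize):", st.st_blksize)
```

On a local filesystem both numbers are typically small (e.g. 4096); on CephFS the contrast with the 4 MiB values is what this thread is about.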
Re: [ceph-users] Ceph OOM Killer Luminous
Can you provide the complete OOM message from the dmesg log?

On Sat, Dec 22, 2018 at 7:53 AM Pardhiv Karri wrote:
>
> Thank you for the quick response, Dyweni!
>
> We are using FileStore, as this cluster was upgraded from
> Hammer --> Jewel --> Luminous 12.2.8. 16x2TB HDD per node for all nodes.
> The R730xd has 128GB and the R740xd has 96GB of RAM. Everything else is
> the same.
>
> Thanks,
> Pardhiv Karri
>
> On Fri, Dec 21, 2018 at 1:43 PM Dyweni - Ceph-Users <6exbab4fy...@dyweni.com> wrote:
>>
>> Hi,
>>
>> You could be running out of memory due to the default Bluestore cache
>> sizes.
>>
>> How many disks/OSDs in the R730xd versus the R740xd? How much memory in
>> each server type? How many are HDD versus SSD? Are you running Bluestore?
>>
>> OSDs in Luminous which run Bluestore allocate memory to use as a "cache",
>> since the kernel-provided page cache is not available to Bluestore.
>> Bluestore, by default, will use 1GB of memory for each HDD, and 3GB of
>> memory for each SSD. OSDs do not allocate all that memory up front, but
>> grow into it as it is used. This cache is in addition to any other memory
>> the OSD uses.
>>
>> Check out the bluestore_cache_* values (these are specified in bytes) in
>> the manual cache sizing section of the docs
>> (http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/).
>> Note that the automatic cache sizing feature wasn't added until 12.2.9.
>>
>> As an example, I have OSDs running on 32bit/armhf nodes. These nodes have
>> 2GB of memory. I run 1 Bluestore OSD on each node. In my ceph.conf file,
>> I have 'bluestore cache size = 536870912' and 'bluestore cache kv max =
>> 268435456'. I see approx. 1.35-1.4 GB used by each OSD.
>>
>> On 2018-12-21 15:19, Pardhiv Karri wrote:
>>
>> Hi,
>>
>> We have a Luminous cluster which was upgraded from Hammer --> Jewel -->
>> Luminous 12.2.8 recently. Post upgrade we are seeing an issue with a few
>> nodes where they are running out of memory and dying. In the logs we are
>> seeing the OOM killer. We didn't have this issue before the upgrade. The
>> only difference is that the nodes without any issue are R730xd and the
>> ones with the memory leak are R740xd. The hardware vendor doesn't see
>> anything wrong with the hardware. From the Ceph end we are not seeing any
>> issue when it comes to running the cluster; the only issue is the memory
>> leak. Right now we are actively rebooting the nodes in a timely manner to
>> avoid crashes. On one R740xd node we set all the OSDs to 0.0 and there is
>> no memory leak there. Any pointers to fix the issue would be helpful.
>>
>> Thanks,
>> Pardhiv Karri
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> Pardhiv Karri
> "Rise and Rise again until LAMBS become LIONS"
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph OOM Killer Luminous
Thank you for the quick response, Dyweni!

We are using FileStore, as this cluster was upgraded from
Hammer --> Jewel --> Luminous 12.2.8. 16x2TB HDD per node for all nodes.
The R730xd has 128GB and the R740xd has 96GB of RAM. Everything else is the
same.

Thanks,
Pardhiv Karri

On Fri, Dec 21, 2018 at 1:43 PM Dyweni - Ceph-Users <6exbab4fy...@dyweni.com> wrote:
> Hi,
>
> You could be running out of memory due to the default Bluestore cache
> sizes.
>
> How many disks/OSDs in the R730xd versus the R740xd? How much memory in
> each server type? How many are HDD versus SSD? Are you running Bluestore?
>
> OSDs in Luminous which run Bluestore allocate memory to use as a "cache",
> since the kernel-provided page cache is not available to Bluestore.
> Bluestore, by default, will use 1GB of memory for each HDD, and 3GB of
> memory for each SSD. OSDs do not allocate all that memory up front, but
> grow into it as it is used. This cache is in addition to any other memory
> the OSD uses.
>
> Check out the bluestore_cache_* values (these are specified in bytes) in
> the manual cache sizing section of the docs
> (http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/).
> Note that the automatic cache sizing feature wasn't added until 12.2.9.
>
> As an example, I have OSDs running on 32bit/armhf nodes. These nodes have
> 2GB of memory. I run 1 Bluestore OSD on each node. In my ceph.conf file, I
> have 'bluestore cache size = 536870912' and 'bluestore cache kv max =
> 268435456'. I see approx. 1.35-1.4 GB used by each OSD.
>
> On 2018-12-21 15:19, Pardhiv Karri wrote:
>
> Hi,
>
> We have a Luminous cluster which was upgraded from Hammer --> Jewel -->
> Luminous 12.2.8 recently. Post upgrade we are seeing an issue with a few
> nodes where they are running out of memory and dying. In the logs we are
> seeing the OOM killer. We didn't have this issue before the upgrade. The
> only difference is that the nodes without any issue are R730xd and the
> ones with the memory leak are R740xd. The hardware vendor doesn't see
> anything wrong with the hardware. From the Ceph end we are not seeing any
> issue when it comes to running the cluster; the only issue is the memory
> leak. Right now we are actively rebooting the nodes in a timely manner to
> avoid crashes. On one R740xd node we set all the OSDs to 0.0 and there is
> no memory leak there. Any pointers to fix the issue would be helpful.
>
> Thanks,
> Pardhiv Karri
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Pardhiv Karri
"Rise and Rise again until LAMBS become LIONS"
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Bluestore nvme DB/WAL size
> It'll cause problems if your only NVMe drive dies - you'll lose all the
> DB partitions and all the OSDs will fail

The severity of this depends a lot on the size of the cluster. If there are
only, say, 4 nodes total, for sure the loss of a quarter of the OSDs will be
somewhere between painful and fatal, especially if the subtree limit does
not forestall rebalancing, and if EC is being used vs replication. From a
pain angle, though, this is no worse than if the server itself smokes.

It's easy to say "don't do that", but sometimes one doesn't have a choice:

* Unit economics can confound provisioning of larger/more external metadata
  devices. I'm sure Vlad isn't using spinners because he hates SSDs.

* Devices have to go somewhere. It's not uncommon to have a server using 2
  PCIe slots for NICs (1) and another for an HBA, leaving as few as 1 or 0
  free. Sometimes the potential for a second PCIe riser is canceled by the
  need to provision a rear drive cage for OS/boot drives to maximize
  front-panel bay availability.

* Cannibalizing one or more front drive bays for metadata drives can be
  problematic:
  - Usable cluster capacity is decreased, along with unit economics
  - Dogfooding or draconian corporate policy (Herbert! Herbert!) can
    prohibit this. I've personally been prohibited in the past from the
    obvious choice of a simple open-market LFF-to-SFF adapter because it
    wasn't officially "supported" and would use components without a
    corporate SKU.

The 4% guidance was 1% until not all that long ago. Guidance on calculating
adequate sizing based on application and workload would be nice to have.
I've been told that an object storage (RGW) use case can readily get away
with less, because L2/L3/etc. are both rarely accessed and the first to be
overflowed onto slower storage, and that block (RBD) workloads have
different access patterns that are more impacted by overflow of higher
levels.

As RBD pools are increasingly deployed on SSD/NVMe devices, the case for
colocating their metadata is strong, and obviates having to worry about
sizing before deployment.

(1) One of many reasons to seriously consider not having a separate
replication network
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
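The 4% (formerly 1%) figure above is a rule of thumb, not a value Ceph enforces. A minimal sketch of what it implies for partition sizing, with the percentages treated purely as illustrative inputs:

```python
# Sizing sketch for the "DB partition as a percentage of OSD capacity"
# guidance discussed above. The 4% / 1% figures are rules of thumb from
# the thread, not Ceph-enforced constants.
def db_partition_gib(osd_capacity_tib: float, pct: float = 4.0) -> float:
    """DB partition size in GiB for one OSD of the given capacity (TiB)."""
    return osd_capacity_tib * 1024 * pct / 100

# A 12 TiB HDD OSD at the 4% guidance wants a ~492 GiB DB partition;
# at the old 1% guidance, ~123 GiB.
print(round(db_partition_gib(12)))        # 492
print(round(db_partition_gib(12, 1.0)))   # 123
```

Multiply by the number of OSDs sharing one NVMe device to see how quickly a single metadata drive fills up, which is exactly the failure-domain trade-off the quoted warning is about.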
Re: [ceph-users] Ceph OOM Killer Luminous
Hi,

You could be running out of memory due to the default Bluestore cache
sizes.

How many disks/OSDs in the R730xd versus the R740xd? How much memory in
each server type? How many are HDD versus SSD? Are you running Bluestore?

OSDs in Luminous which run Bluestore allocate memory to use as a "cache",
since the kernel-provided page cache is not available to Bluestore.
Bluestore, by default, will use 1GB of memory for each HDD, and 3GB of
memory for each SSD. OSDs do not allocate all that memory up front, but
grow into it as it is used. This cache is in addition to any other memory
the OSD uses.

Check out the bluestore_cache_* values (these are specified in bytes) in
the manual cache sizing section of the docs
(http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/).
Note that the automatic cache sizing feature wasn't added until 12.2.9.

As an example, I have OSDs running on 32bit/armhf nodes. These nodes have
2GB of memory. I run 1 Bluestore OSD on each node. In my ceph.conf file, I
have 'bluestore cache size = 536870912' and 'bluestore cache kv max =
268435456'. I see approx. 1.35-1.4 GB used by each OSD.

On 2018-12-21 15:19, Pardhiv Karri wrote:
> Hi,
>
> We have a Luminous cluster which was upgraded from Hammer --> Jewel -->
> Luminous 12.2.8 recently. Post upgrade we are seeing an issue with a few
> nodes where they are running out of memory and dying. In the logs we are
> seeing the OOM killer. We didn't have this issue before the upgrade. The
> only difference is that the nodes without any issue are R730xd and the
> ones with the memory leak are R740xd. The hardware vendor doesn't see
> anything wrong with the hardware. From the Ceph end we are not seeing any
> issue when it comes to running the cluster; the only issue is the memory
> leak. Right now we are actively rebooting the nodes in a timely manner to
> avoid crashes. On one R740xd node we set all the OSDs to 0.0 and there is
> no memory leak there. Any pointers to fix the issue would be helpful.
>
> Thanks,
> Pardhiv Karri
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
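The default cache figures quoted above (1GB per HDD OSD, 3GB per SSD OSD) make it easy to estimate the per-node cache footprint. A minimal sketch using those defaults:

```python
# Rough per-node Bluestore cache estimate, using the default cache sizes
# quoted above: 1 GiB per HDD-backed OSD, 3 GiB per SSD-backed OSD.
# This counts cache only; each OSD process uses additional memory on top.
GIB = 1024 ** 3

def bluestore_cache_bytes(n_hdd: int, n_ssd: int) -> int:
    """Total default Bluestore cache across all OSDs on one node."""
    return n_hdd * 1 * GIB + n_ssd * 3 * GIB

# A node with 16 HDD OSDs (as in this thread) grows into ~16 GiB of cache.
print(bluestore_cache_bytes(16, 0) // GIB)  # 16
```

On a 96GB R740xd with 16 OSDs this cache alone is a noticeable fraction of RAM, which is why the default `bluestore_cache_*` values are worth checking when nodes start hitting the OOM killer.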
[ceph-users] Ceph OOM Killer Luminous
Hi,

We have a Luminous cluster which was upgraded from Hammer --> Jewel -->
Luminous 12.2.8 recently. Post upgrade we are seeing an issue with a few
nodes where they are running out of memory and dying. In the logs we are
seeing the OOM killer. We didn't have this issue before the upgrade. The
only difference is that the nodes without any issue are R730xd and the ones
with the memory leak are R740xd. The hardware vendor doesn't see anything
wrong with the hardware. From the Ceph end we are not seeing any issue when
it comes to running the cluster; the only issue is the memory leak. Right
now we are actively rebooting the nodes in a timely manner to avoid
crashes. On one R740xd node we set all the OSDs to 0.0 and there is no
memory leak there. Any pointers to fix the issue would be helpful.

Thanks,
Pardhiv Karri
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph Cluster to OSD Utilization not in Sync
Thank you, Dyweni, for the quick response. We have 2 Hammer clusters which
are due for upgrade to Luminous next month and 1 Luminous 12.2.8 cluster.
We will try this on Luminous, and if it works we will apply the same once
the Hammer clusters are upgraded, rather than adjusting the weights.

Thanks,
Pardhiv Karri

On Fri, Dec 21, 2018 at 1:05 PM Dyweni - Ceph-Users <6exbab4fy...@dyweni.com> wrote:
> Hi,
>
> If you are running Ceph Luminous or later, use the Ceph Manager Daemon's
> Balancer module (http://docs.ceph.com/docs/luminous/mgr/balancer/).
>
> Otherwise, tweak the OSD weights (not the OSD CRUSH weights) until you
> achieve uniformity. (You should be able to get under 1 STDDEV.) I would
> adjust in small amounts so as not to overload your cluster.
>
> Example:
>
> ceph osd reweight osd.X y.yyy
>
> On 2018-12-21 14:56, Pardhiv Karri wrote:
>
> Hi,
>
> We have Ceph clusters which are greater than 1PB. We are using the tree
> algorithm. The issue is with data placement. If the cluster utilization
> percentage is at 65%, then some of the OSDs are already above 87%. We had
> to change the near_full ratio to 0.90 to circumvent warnings and to get
> the health back to the OK state.
>
> How can we keep OSD utilization in sync with cluster utilization (both
> percentages close enough), as we want to utilize the cluster to the max
> (above 80%) without unnecessarily adding too many nodes/OSDs? Right now we
> are losing close to 400TB of disk space unused because some OSDs are above
> 87% and some are below 50%. If the OSDs above 87% reach 95%, then the
> cluster will have issues. What is the best way to mitigate this issue?
>
> Thanks,
> Pardhiv Karri
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Pardhiv Karri
"Rise and Rise again until LAMBS become LIONS"
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph Cluster to OSD Utilization not in Sync
Hi,

If you are running Ceph Luminous or later, use the Ceph Manager Daemon's
Balancer module (http://docs.ceph.com/docs/luminous/mgr/balancer/).

Otherwise, tweak the OSD weights (not the OSD CRUSH weights) until you
achieve uniformity. (You should be able to get under 1 STDDEV.) I would
adjust in small amounts so as not to overload your cluster.

Example:

ceph osd reweight osd.X y.yyy

On 2018-12-21 14:56, Pardhiv Karri wrote:
> Hi,
>
> We have Ceph clusters which are greater than 1PB. We are using the tree
> algorithm. The issue is with data placement. If the cluster utilization
> percentage is at 65%, then some of the OSDs are already above 87%. We had
> to change the near_full ratio to 0.90 to circumvent warnings and to get
> the health back to the OK state.
>
> How can we keep OSD utilization in sync with cluster utilization (both
> percentages close enough), as we want to utilize the cluster to the max
> (above 80%) without unnecessarily adding too many nodes/OSDs? Right now we
> are losing close to 400TB of disk space unused because some OSDs are above
> 87% and some are below 50%. If the OSDs above 87% reach 95%, then the
> cluster will have issues. What is the best way to mitigate this issue?
>
> Thanks,
> Pardhiv Karri
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Ceph Cluster to OSD Utilization not in Sync
Hi,

We have Ceph clusters which are greater than 1PB. We are using the tree
algorithm. The issue is with data placement. If the cluster utilization
percentage is at 65%, then some of the OSDs are already above 87%. We had
to change the near_full ratio to 0.90 to circumvent warnings and to get the
health back to the OK state.

How can we keep OSD utilization in sync with cluster utilization (both
percentages close enough), as we want to utilize the cluster to the max
(above 80%) without unnecessarily adding too many nodes/OSDs? Right now we
are losing close to 400TB of disk space unused because some OSDs are above
87% and some are below 50%. If the OSDs above 87% reach 95%, then the
cluster will have issues. What is the best way to mitigate this issue?

Thanks,
Pardhiv Karri
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
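The manual-reweight approach suggested in this thread ("under 1 STDDEV", small adjustments) can be sketched as a tiny helper. This is illustrative only, not a Ceph tool; the step size and the one-standard-deviation threshold are assumptions for the example:

```python
# Illustrative sketch of the manual reweighting approach from this thread:
# nudge the reweight of outlier OSDs down in small steps. An OSD is an
# outlier if its utilization is more than one standard deviation above the
# mean. The 0.05 step is an arbitrary "small amount".
from statistics import mean, stdev

def propose_reweights(util: dict[str, float], step: float = 0.05) -> dict[str, float]:
    """Return new reweight values (starting from 1.0) for over-full OSDs."""
    avg, sd = mean(util.values()), stdev(util.values())
    return {
        osd: round(1.0 - step, 3)
        for osd, u in util.items()
        if u > avg + sd
    }

# Only osd.2 sits more than 1 STDDEV above the mean here, so one would run
# 'ceph osd reweight osd.2 0.95' and re-measure before adjusting further.
print(propose_reweights({"osd.0": 0.55, "osd.1": 0.60, "osd.2": 0.88}))
```

On Luminous and later, the mgr Balancer module automates this loop, which is why it is the first recommendation in the reply.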
[ceph-users] Your email to ceph-uses mailing list: Signature check failures.
Hi Cary,

I ran across your email on the ceph-users mailing list, 'Signature check
failures.'. I've just run across the same issue on my end. Also a Gentoo
user here, running Ceph 12.2.5... 32bit/armhf and 64bit/x86_64. Was your
environment mixed, or strictly x86_64?

What is interesting is that my 32bit/armhf (built with USE="ssl -nss") OSDs
have no problem talking to themselves, to my 64bit/x86_64 (built with
USE="-ssl nss") OSDs, or to my 64bit/x86_64 (built with USE="-ssl nss")
clients. Trying to bring up new 64bit/x86_64 (built with USE="ssl -nss")
OSDs, I get this same error with a simple 'rbd ls -l'.

My OpenSSL version is 1.0.2p. Do you remember which version of OpenSSL you
were building against? 'genlop -e openssl' will show you.

The locally calculated signature most times looks really short, so I'm
wondering if we're hitting some kind of variable size issue... maybe an
overflow too?

Would appreciate any insight you could give.

Thanks!
Dyweni
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Possible data damage: 1 pg inconsistent
Christoph, do you have any links to the bug?

On Fri, Dec 21, 2018 at 11:07 AM Christoph Adomeit <christoph.adom...@gatworks.de> wrote:
> Hi,
>
> same here, but also for pgs in cephfs pools.
>
> As far as I know there is a known bug where, under memory pressure, some
> reads return zero, and this leads to the error message.
>
> I have set nodeep-scrub and I am waiting for 12.2.11.
>
> Thanks
> Christoph
>
> On Fri, Dec 21, 2018 at 03:23:21PM +0100, Hervé Ballans wrote:
> > Hi Frank,
> >
> > I encounter exactly the same issue with the same disks as yours. Every
> > day, after a batch of deep scrubbing operations, there are generally
> > between 1 and 3 inconsistent pgs, and that on different OSDs.
> >
> > It could confirm a problem on these disks, but:
> >
> > - it concerns only the pgs of the rbd pool, not those of the cephfs
> > pools (the same disk model is used)
> >
> > - I encountered this when I was running 12.2.5, not when I upgraded to
> > 12.2.8, but the problem appeared again after upgrading to 12.2.10
> >
> > - On my side, smartctl and dmesg do not show any media errors, so I'm
> > pretty sure that the physical media is not the problem...
> >
> > Small precision: each disk is configured as RAID0 on a PERC740P; is
> > this also the case for you, or are your disks in JBOD mode?
> >
> > Another question: in your case, is the OSD that is involved in the
> > inconsistent pgs always the same one, or is it a new one every time?
> >
> > For information, currently, the manual 'ceph pg repair' command works
> > well each time...
> >
> > Context: Luminous 12.2.10, Bluestore OSDs with data block on SATA disks
> > and WAL/DB on NVMe, rbd configuration replica 3/2
> >
> > Cheers,
> > rv
> >
> > Few outputs:
> >
> > $ sudo ceph -s
> >   cluster:
> >     id: 838506b7-e0c6-4022-9e17-2d1cf9458be6
> >     health: HEALTH_ERR
> >             3 scrub errors
> >             Possible data damage: 3 pgs inconsistent
> >
> >   services:
> >     mon: 3 daemons, quorum inf-ceph-mon0,inf-ceph-mon1,inf-ceph-mon2
> >     mgr: inf-ceph-mon0(active), standbys: inf-ceph-mon1, inf-ceph-mon2
> >     mds: cephfs_home-2/2/2 up
> >          {0=inf-ceph-mon1=up:active,1=inf-ceph-mon0=up:active}, 1 up:standby
> >     osd: 126 osds: 126 up, 126 in
> >
> >   data:
> >     pools:   3 pools, 4224 pgs
> >     objects: 23.35M objects, 20.9TiB
> >     usage:   64.9TiB used, 136TiB / 201TiB avail
> >     pgs:     4221 active+clean
> >              3    active+clean+inconsistent
> >
> >   io:
> >     client: 2.62KiB/s rd, 2.25MiB/s wr, 0op/s rd, 118op/s wr
> >
> > $ sudo ceph health detail
> > HEALTH_ERR 3 scrub errors; Possible data damage: 3 pgs inconsistent
> > OSD_SCRUB_ERRORS 3 scrub errors
> > PG_DAMAGED Possible data damage: 3 pgs inconsistent
> >     pg 9.27 is active+clean+inconsistent, acting [78,107,96]
> >     pg 9.260 is active+clean+inconsistent, acting [84,113,62]
> >     pg 9.6b9 is active+clean+inconsistent, acting [79,107,80]
> > $ sudo rados list-inconsistent-obj 9.27 --format=json-pretty | grep error
> >     "errors": [],
> >     "union_shard_errors": [
> >         "read_error"
> >     "errors": [
> >         "read_error"
> >     "errors": [],
> >     "errors": [],
> > $ sudo rados list-inconsistent-obj 9.260 --format=json-pretty | grep error
> >     "errors": [],
> >     "union_shard_errors": [
> >         "read_error"
> >     "errors": [],
> >     "errors": [],
> >     "errors": [
> >         "read_error"
> > $ sudo rados list-inconsistent-obj 9.6b9 --format=json-pretty | grep error
> >     "errors": [],
> >     "union_shard_errors": [
> >         "read_error"
> >     "errors": [
> >         "read_error"
> >     "errors": [],
> >     "errors": [],
> > $ sudo ceph pg repair 9.27
> > instructing pg 9.27 on osd.78 to repair
> > $ sudo ceph pg repair 9.260
> > instructing pg 9.260 on osd.84 to repair
> > $ sudo ceph pg repair 9.6b9
> > instructing pg 9.6b9 on osd.79 to repair
> > $ sudo ceph -s
> >   cluster:
> >     id: 838506b7-e0c6-4022-9e17-2d1cf9458be6
> >     health: HEALTH_OK
> >
> >   services:
> >     mon: 3 daemons, quorum inf-ceph-mon0,inf-ceph-mon1,inf-ceph-mon2
> >     mgr: inf-ceph-mon0(active), standbys: inf-ceph-mon1, inf-ceph-mon2
> >     mds: cephfs_home-2/2/2 up
> >          {0=inf-ceph-mon1=up:active,1=inf-ceph-mon0=up:active}, 1 up:standby
> >     osd: 126 osds: 126 up, 126 in
> >
> >   data:
> >     pools:   3 pools, 4224 pgs
> >     objects: 23.35M objects, 20.9TiB
> >     usage:   64.9TiB used, 136TiB / 201TiB avail
> >     pgs:     4224 active+clean
> >
> >   io:
Re: [ceph-users] Possible data damage: 1 pg inconsistent
Hi,

same here, but also for pgs in cephfs pools.

As far as I know there is a known bug where, under memory pressure, some
reads return zero, and this leads to the error message.

I have set nodeep-scrub and I am waiting for 12.2.11.

Thanks
Christoph

On Fri, Dec 21, 2018 at 03:23:21PM +0100, Hervé Ballans wrote:
> Hi Frank,
>
> I encounter exactly the same issue with the same disks as yours. Every
> day, after a batch of deep scrubbing operations, there are generally
> between 1 and 3 inconsistent pgs, and that on different OSDs.
>
> It could confirm a problem on these disks, but:
>
> - it concerns only the pgs of the rbd pool, not those of the cephfs pools
> (the same disk model is used)
>
> - I encountered this when I was running 12.2.5, not when I upgraded to
> 12.2.8, but the problem appeared again after upgrading to 12.2.10
>
> - On my side, smartctl and dmesg do not show any media errors, so I'm
> pretty sure that the physical media is not the problem...
>
> Small precision: each disk is configured as RAID0 on a PERC740P; is this
> also the case for you, or are your disks in JBOD mode?
>
> Another question: in your case, is the OSD that is involved in the
> inconsistent pgs always the same one, or is it a new one every time?
>
> For information, currently, the manual 'ceph pg repair' command works
> well each time...
>
> Context: Luminous 12.2.10, Bluestore OSDs with data block on SATA disks
> and WAL/DB on NVMe, rbd configuration replica 3/2
>
> Cheers,
> rv
>
> Few outputs:
>
> $ sudo ceph -s
>   cluster:
>     id: 838506b7-e0c6-4022-9e17-2d1cf9458be6
>     health: HEALTH_ERR
>             3 scrub errors
>             Possible data damage: 3 pgs inconsistent
>
>   services:
>     mon: 3 daemons, quorum inf-ceph-mon0,inf-ceph-mon1,inf-ceph-mon2
>     mgr: inf-ceph-mon0(active), standbys: inf-ceph-mon1, inf-ceph-mon2
>     mds: cephfs_home-2/2/2 up
>          {0=inf-ceph-mon1=up:active,1=inf-ceph-mon0=up:active}, 1 up:standby
>     osd: 126 osds: 126 up, 126 in
>
>   data:
>     pools:   3 pools, 4224 pgs
>     objects: 23.35M objects, 20.9TiB
>     usage:   64.9TiB used, 136TiB / 201TiB avail
>     pgs:     4221 active+clean
>              3    active+clean+inconsistent
>
>   io:
>     client: 2.62KiB/s rd, 2.25MiB/s wr, 0op/s rd, 118op/s wr
>
> $ sudo ceph health detail
> HEALTH_ERR 3 scrub errors; Possible data damage: 3 pgs inconsistent
> OSD_SCRUB_ERRORS 3 scrub errors
> PG_DAMAGED Possible data damage: 3 pgs inconsistent
>     pg 9.27 is active+clean+inconsistent, acting [78,107,96]
>     pg 9.260 is active+clean+inconsistent, acting [84,113,62]
>     pg 9.6b9 is active+clean+inconsistent, acting [79,107,80]
> $ sudo rados list-inconsistent-obj 9.27 --format=json-pretty | grep error
>     "errors": [],
>     "union_shard_errors": [
>         "read_error"
>     "errors": [
>         "read_error"
>     "errors": [],
>     "errors": [],
> $ sudo rados list-inconsistent-obj 9.260 --format=json-pretty | grep error
>     "errors": [],
>     "union_shard_errors": [
>         "read_error"
>     "errors": [],
>     "errors": [],
>     "errors": [
>         "read_error"
> $ sudo rados list-inconsistent-obj 9.6b9 --format=json-pretty | grep error
>     "errors": [],
>     "union_shard_errors": [
>         "read_error"
>     "errors": [
>         "read_error"
>     "errors": [],
>     "errors": [],
> $ sudo ceph pg repair 9.27
> instructing pg 9.27 on osd.78 to repair
> $ sudo ceph pg repair 9.260
> instructing pg 9.260 on osd.84 to repair
> $ sudo ceph pg repair 9.6b9
> instructing pg 9.6b9 on osd.79 to repair
> $ sudo ceph -s
>   cluster:
>     id: 838506b7-e0c6-4022-9e17-2d1cf9458be6
>     health: HEALTH_OK
>
>   services:
>     mon: 3 daemons, quorum inf-ceph-mon0,inf-ceph-mon1,inf-ceph-mon2
>     mgr: inf-ceph-mon0(active), standbys: inf-ceph-mon1, inf-ceph-mon2
>     mds: cephfs_home-2/2/2 up
>          {0=inf-ceph-mon1=up:active,1=inf-ceph-mon0=up:active}, 1 up:standby
>     osd: 126 osds: 126 up, 126 in
>
>   data:
>     pools:   3 pools, 4224 pgs
>     objects: 23.35M objects, 20.9TiB
>     usage:   64.9TiB used, 136TiB / 201TiB avail
>     pgs:     4224 active+clean
>
>   io:
>     client: 195KiB/s rd, 7.19MiB/s wr, 17op/s rd, 127op/s wr
>
> Le 19/12/2018 à 04:48, Frank Ritchie a écrit :
> > Hi all,
> >
> > I have been receiving alerts for:
> >
> > Possible data damage: 1 pg inconsistent
> >
> > almost daily for a few weeks now. When I check:
> >
> > rados list-inconsistent-obj $PG --format=json-pretty
> >
> > I will always see a read_error. When I run a deep scrub
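The grep output quoted in this thread is hard to read because the shard structure is flattened. A small illustrative helper (not a Ceph tool) that digests the `rados list-inconsistent-obj ... --format=json-pretty` JSON and reports which OSD's shard carries the read_error; the sample object/shard values below are made up for the example:

```python
# Summarize 'rados list-inconsistent-obj <pg> --format=json-pretty' output:
# for each inconsistent object, report the OSDs whose shard has read_error.
import json

def shards_with_read_error(report_json: str) -> list[tuple[str, int]]:
    """Return (object name, osd id) pairs whose shard reported read_error."""
    report = json.loads(report_json)
    hits = []
    for inc in report.get("inconsistents", []):
        name = inc["object"]["name"]
        for shard in inc.get("shards", []):
            if "read_error" in shard.get("errors", []):
                hits.append((name, shard["osd"]))
    return hits

# Hypothetical report: one object in pg 9.27 whose shard on osd.78 failed.
sample = """{
  "epoch": 100,
  "inconsistents": [
    {"object": {"name": "rbd_data.1"},
     "union_shard_errors": ["read_error"],
     "shards": [
       {"osd": 78, "errors": ["read_error"]},
       {"osd": 107, "errors": []},
       {"osd": 96, "errors": []}
     ]}
  ]}"""
print(shards_with_read_error(sample))  # [('rbd_data.1', 78)]
```

Knowing which OSD repeatedly produces the read_error is exactly what helps answer Hervé's question of whether the same OSD is involved every time.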
[ceph-users] CephFS MDS optimal setup on Google Cloud
Hello,

I'm doing benchmarks for metadata operations on CephFS, HDFS, and HopsFS on
Google Cloud. In my current setup, I'm using 32 vCPU machines with 29 GB of
memory, and I have 1 MDS, 1 MON, and 3 OSDs. The MDS and the MON are
co-located on one VM, while each of the OSDs is on a separate VM with 1 SSD
disk attached. I'm using the default configuration for the MDS and OSDs.

I'm running 300 clients on 10 machines (16 vCPU); each client creates a
CephFileSystem using the CephFS Hadoop plugin, and then writes empty files
for 30 seconds followed by reading the empty files for another 30 seconds.
The aggregated throughput is around 2000 file create operations/sec and
1 file read operations/sec. However, the MDS is not fully utilizing the
32 cores on the machine; is there any configuration I should consider to
fully utilize the machine?

Also, I noticed that running more than 20-30 clients (on different threads)
per machine degrades the aggregated read throughput. Is there a limitation
in CephFileSystem and libceph on the number of clients created per machine?

Another issue: are the MDS operations single-threaded, as pointed out here
"https://www.slideshare.net/XiaoxiChen3/cephfs-jewel-mds-performance-benchmark"?
Regarding the MDS global lock, is it a single lock per MDS, or is it a
global distributed lock across all MDSs?

Regards,
Mahmoud
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
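The per-client workload described above (create empty files for a fixed window, then read them back, and report ops/sec) can be sketched in a few lines. This runs against a plain local filesystem rather than the CephFS Hadoop plugin, and the short 0.2s window is a placeholder for the 30s phases in the benchmark:

```python
# Minimal sketch of one benchmark client: timed create phase, then a timed
# read-back (stat) phase over the files just created. Local-filesystem
# stand-in for the CephFS Hadoop plugin workload described in the thread.
import os
import tempfile
import time

def timed_phase(duration_s: float, op) -> int:
    """Run op(i) repeatedly for duration_s seconds; return completed ops."""
    count, deadline = 0, time.monotonic() + duration_s
    while time.monotonic() < deadline:
        op(count)
        count += 1
    return count

with tempfile.TemporaryDirectory() as root:
    creates = timed_phase(0.2, lambda i: open(os.path.join(root, f"f{i}"), "w").close())
    reads = timed_phase(0.2, lambda i: os.stat(os.path.join(root, f"f{i % creates}")))
    print(f"create ops/s: {creates / 0.2:.0f}, read ops/s: {reads / 0.2:.0f}")
```

Since the files are empty, both phases are pure metadata traffic, which is what makes this workload an MDS stress test rather than an OSD one.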
Re: [ceph-users] Possible data damage: 1 pg inconsistent
Hi Frank,

I encounter exactly the same issue with the same disks as yours. Every day, after a batch of deep scrub operations, there are generally between 1 and 3 inconsistent pgs, and always on different OSDs. That could confirm a problem with these disks, but:
- it concerns only the pgs of the rbd pool, not those of the cephfs pools (the same disk model is used)
- I encountered this while running 12.2.5, not after upgrading to 12.2.8, but the problem appeared again after upgrading to 12.2.10
- on my side, smartctl and dmesg do not show any media errors, so I'm pretty sure the physical media is not the cause...

One precision: each disk is configured as RAID0 on a PERC740P; is this also the case for you, or are your disks in JBOD mode? Another question: in your case, is the OSD involved in the inconsistent pgs always the same one, or a new one every time? For information, so far running 'ceph pg repair' manually has worked well each time...

Context: Luminous 12.2.10, Bluestore OSDs with data block on SATA disks and WAL/DB on NVMe, rbd pool with replica 3/2

Cheers,
rv

A few outputs:

$ sudo ceph -s
  cluster:
    id:     838506b7-e0c6-4022-9e17-2d1cf9458be6
    health: HEALTH_ERR
            3 scrub errors
            Possible data damage: 3 pgs inconsistent

  services:
    mon: 3 daemons, quorum inf-ceph-mon0,inf-ceph-mon1,inf-ceph-mon2
    mgr: inf-ceph-mon0(active), standbys: inf-ceph-mon1, inf-ceph-mon2
    mds: cephfs_home-2/2/2 up {0=inf-ceph-mon1=up:active,1=inf-ceph-mon0=up:active}, 1 up:standby
    osd: 126 osds: 126 up, 126 in

  data:
    pools:   3 pools, 4224 pgs
    objects: 23.35M objects, 20.9TiB
    usage:   64.9TiB used, 136TiB / 201TiB avail
    pgs:     4221 active+clean
             3    active+clean+inconsistent

  io:
    client: 2.62KiB/s rd, 2.25MiB/s wr, 0op/s rd, 118op/s wr

$ sudo ceph health detail
HEALTH_ERR 3 scrub errors; Possible data damage: 3 pgs inconsistent
OSD_SCRUB_ERRORS 3 scrub errors
PG_DAMAGED Possible data damage: 3 pgs inconsistent
    pg 9.27 is active+clean+inconsistent, acting [78,107,96]
    pg 9.260 is active+clean+inconsistent, acting [84,113,62]
    pg 9.6b9 is active+clean+inconsistent, acting [79,107,80]

$ sudo rados list-inconsistent-obj 9.27 --format=json-pretty | grep error
    "errors": [],
    "union_shard_errors": [
        "read_error"
            "errors": [
                "read_error"
            "errors": [],
            "errors": [],

$ sudo rados list-inconsistent-obj 9.260 --format=json-pretty | grep error
    "errors": [],
    "union_shard_errors": [
        "read_error"
            "errors": [],
            "errors": [],
            "errors": [
                "read_error"

$ sudo rados list-inconsistent-obj 9.6b9 --format=json-pretty | grep error
    "errors": [],
    "union_shard_errors": [
        "read_error"
            "errors": [
                "read_error"
            "errors": [],
            "errors": [],

$ sudo ceph pg repair 9.27
instructing pg 9.27 on osd.78 to repair
$ sudo ceph pg repair 9.260
instructing pg 9.260 on osd.84 to repair
$ sudo ceph pg repair 9.6b9
instructing pg 9.6b9 on osd.79 to repair

$ sudo ceph -s
  cluster:
    id:     838506b7-e0c6-4022-9e17-2d1cf9458be6
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum inf-ceph-mon0,inf-ceph-mon1,inf-ceph-mon2
    mgr: inf-ceph-mon0(active), standbys: inf-ceph-mon1, inf-ceph-mon2
    mds: cephfs_home-2/2/2 up {0=inf-ceph-mon1=up:active,1=inf-ceph-mon0=up:active}, 1 up:standby
    osd: 126 osds: 126 up, 126 in

  data:
    pools:   3 pools, 4224 pgs
    objects: 23.35M objects, 20.9TiB
    usage:   64.9TiB used, 136TiB / 201TiB avail
    pgs:     4224 active+clean

  io:
    client: 195KiB/s rd, 7.19MiB/s wr, 17op/s rd, 127op/s wr

On 19/12/2018 at 04:48, Frank Ritchie wrote:

Hi all,

I have been receiving alerts for:

Possible data damage: 1 pg inconsistent

almost daily for a few weeks now. When I check:

rados list-inconsistent-obj $PG --format=json-pretty

I will always see a read_error.
When I run a deep scrub on the PG I will see:

head candidate had a read error

When I check dmesg on the OSD node I see:

blk_update_request: critical medium error, dev sdX, sector 123

I will also see a few uncorrected read errors in smartctl.

Info:
Ceph: ceph version 12.2.4-30.el7cp
OSD: Toshiba 1.8TB SAS 10K
120 OSDs total

Has anyone else seen these alerts occur almost daily? Can the errors possibly be due to deep scrubbing too aggressively? I realize these errors indicate potentially failing drives, but I can't replace a drive daily.

thx
Frank
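The `grep error` filtering in the outputs above flattens the JSON and loses which shard/OSD actually hit the read_error. A small parser over the full `rados list-inconsistent-obj` output makes that explicit. The sample document below is a hand-written stand-in shaped like the real output (top-level `inconsistents`, each with `object`, `union_shard_errors`, and per-OSD `shards`), not captured from this cluster; the object name is invented.

```python
import json

def summarize_inconsistent(doc: dict) -> list[tuple[str, int, list[str]]]:
    """For each inconsistent object, return (object name, osd id, shard errors)
    for every shard that reported an error."""
    rows = []
    for inc in doc.get("inconsistents", []):
        name = inc["object"]["name"]
        for shard in inc["shards"]:
            if shard.get("errors"):
                rows.append((name, shard["osd"], shard["errors"]))
    return rows

# Minimal stand-in for `rados list-inconsistent-obj 9.27 --format=json-pretty`:
sample = json.loads("""
{
  "epoch": 12345,
  "inconsistents": [
    {
      "object": {"name": "rbd_data.abc.0000000000000042"},
      "errors": [],
      "union_shard_errors": ["read_error"],
      "shards": [
        {"osd": 78, "errors": ["read_error"]},
        {"osd": 96, "errors": []},
        {"osd": 107, "errors": []}
      ]
    }
  ]
}
""")

for name, osd, errors in summarize_inconsistent(sample):
    print(f"{name}: osd.{osd} -> {', '.join(errors)}")
# -> rbd_data.abc.0000000000000042: osd.78 -> read_error
```

Feeding it the real JSON (e.g. via `subprocess` around the rados command) would tell you at a glance whether the failing shard keeps landing on the same OSD, which is exactly rv's question above.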
Re: [ceph-users] Bluestore nvme DB/WAL size
I'm in a similar situation, currently running filestore with spinners and journals on NVMe partitions that are about 1% of the size of the OSD. If I migrate to bluestore, I'll still only have that 1% available. Per the docs, if my block.db device fills up, the metadata will spill back onto the block device, which will incur an understandable performance penalty. The question is: will there be more of a performance hit in that scenario than if the block.db were on the spinner and just the WAL on the NVMe?

On Fri, Dec 21, 2018 at 9:01 AM Janne Johansson wrote:

> On Thu, Dec 20, 2018 at 22:45, Vladimir Brik wrote:
> > Hello
> > I am considering using logical volumes of an NVMe drive as DB or WAL
> > devices for OSDs on spinning disks.
> > The documentation recommends against DB devices smaller than 4% of slow
> > disk size. Our servers have 16x 10TB HDDs and a single 1.5TB NVMe, so
> > dividing it equally will result in each OSD getting a ~90GB DB NVMe
> > volume, which is a lot less than 4%. Will this cause problems down the road?
>
> Well, apart from the reply you already got on "one nvme fails all the
> HDDs it is WAL/DB for", the recommendations are about getting the best
> out of them, especially for the DB I suppose.
>
> If one can size stuff up beforehand, then following the recommendations
> is a good choice, but I think you should test using it for WALs for
> instance, and bench it against another host with data, WAL and DB on the
> HDD, and see if it helps a lot in your expected use case.
>
> --
> May the most significant bit of your life be positive.
Re: [ceph-users] Bluestore nvme DB/WAL size
> I am considering using logical volumes of an NVMe drive as DB or WAL
> devices for OSDs on spinning disks.
> The documentation recommends against DB devices smaller than 4% of slow
> disk size. Our servers have 16x 10TB HDDs and a single 1.5TB NVMe, so
> dividing it equally will result in each OSD getting a ~90GB DB NVMe
> volume, which is a lot less than 4%. Will this cause problems down the road?

That is one NVMe in front of 145TB of HDD data in this host. You'll be in agony when that NVMe dies.

k
Re: [ceph-users] Bluestore nvme DB/WAL size
On Thu, Dec 20, 2018 at 22:45, Vladimir Brik wrote:

> Hello
> I am considering using logical volumes of an NVMe drive as DB or WAL
> devices for OSDs on spinning disks.
> The documentation recommends against DB devices smaller than 4% of slow
> disk size. Our servers have 16x 10TB HDDs and a single 1.5TB NVMe, so
> dividing it equally will result in each OSD getting a ~90GB DB NVMe
> volume, which is a lot less than 4%. Will this cause problems down the road?

Well, apart from the reply you already got on "one nvme fails all the HDDs it is WAL/DB for", the recommendations are about getting the best out of them, especially for the DB I suppose.

If one can size stuff up beforehand, then following the recommendations is a good choice, but I think you should test using it for WALs for instance, and bench it against another host with data, WAL and DB on the HDD, and see if it helps a lot in your expected use case.

--
May the most significant bit of your life be positive.
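The sizing gap being discussed in this thread is simple arithmetic; a quick sketch makes the shortfall against the 4% rule of thumb concrete (pure calculation, nothing Ceph-specific, decimal TB/GB assumed throughout):

```python
def db_sizing(hdd_tb: float, n_hdds: int, nvme_tb: float) -> dict:
    """Compare the per-OSD block.db size you get from splitting one NVMe
    equally across all HDDs against the ~4%-of-slow-device guideline."""
    per_osd_gb = nvme_tb * 1000 / n_hdds      # equal split of the NVMe
    recommended_gb = hdd_tb * 1000 * 0.04     # 4% of each HDD
    return {
        "per_osd_db_gb": per_osd_gb,
        "recommended_db_gb": recommended_gb,
        "shortfall_factor": recommended_gb / per_osd_gb,
    }

# Vladimir's setup: 16x 10TB HDDs sharing a single 1.5TB NVMe.
s = db_sizing(hdd_tb=10, n_hdds=16, nvme_tb=1.5)
print(s)  # per-OSD ~93.75 GB vs ~400 GB recommended: roughly 4.3x short
```

So each OSD gets a bit under a quarter of the recommended DB space, which is why spillover onto the slow device (and the blast radius of losing the shared NVMe) dominate this thread.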