Re: [ceph-users] Possible data damage: 1 pg inconsistent

2018-12-18 Thread Serkan Çoban
>I will also see a few uncorrected read errors in smartctl. Uncorrected read errors in smartctl are a reason for us to replace the drive. On Wed, Dec 19, 2018 at 6:48 AM Frank Ritchie wrote: > > Hi all, > > I have been receiving alerts for: > > Possible data damage: 1 pg inconsistent > > almost
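
A quick way to check a suspect drive for such errors with smartctl (a minimal sketch; /dev/sdX is a placeholder and the reported attribute names vary by drive vendor and by ATA vs. SAS):

  # ATA drives: Reported_Uncorrect / Current_Pending_Sector / Reallocated_Sector_Ct
  # SAS drives: the "uncorrected errors" columns of the error counter log
  smartctl -a /dev/sdX | grep -iE 'uncorrect|pending|reallocat'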

[ceph-users] Possible data damage: 1 pg inconsistent

2018-12-18 Thread Frank Ritchie
Hi all, I have been receiving alerts for: Possible data damage: 1 pg inconsistent almost daily for a few weeks now. When I check: rados list-inconsistent-obj $PG --format=json-pretty I will always see a read_error. When I run a deep scrub on the PG I will see: head candidate had a read error
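
A sketch of the usual triage sequence for an inconsistent PG, assuming the read_error turns out to be a disk problem ($PG is the placeholder used above):

  # which object and which copy reported the read_error:
  rados list-inconsistent-obj $PG --format=json-pretty
  # which OSDs hold the PG (to find the disk to check with smartctl):
  ceph pg map $PG
  # after checking/replacing the failing disk, rewrite the bad copy:
  ceph pg repair $PG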

Re: [ceph-users] Removing orphaned radosgw bucket indexes from pool

2018-12-18 Thread J. Eric Ivancich
On 11/29/18 6:58 PM, Bryan Stillwell wrote: > Wido, > > I've been looking into this large omap objects problem on a couple of our > clusters today and came across your script during my research. > > The script has been running for a few hours now and I'm already over 100,000 > 'orphaned'
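
A rough sketch of how such "orphaned" index objects can be spotted by hand, assuming a default zone whose index pool is default.rgw.buckets.index (the pool name is an assumption about the deployment):

  # index objects are named .dir.<bucket marker>[.<shard>]:
  rados -p default.rgw.buckets.index ls > index-objects.txt
  # bucket instances RGW still knows about (their markers should match):
  radosgw-admin metadata list bucket.instance > bucket-instances.txt
  # index objects whose marker appears in no bucket instance are orphan candidates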

Re: [ceph-users] Omap issues - metadata creating too many

2018-12-18 Thread J. Eric Ivancich
On 12/17/18 9:18 AM, Josef Zelenka wrote: > Hi everyone, I'm running a Luminous 12.2.5 cluster with 6 hosts on > Ubuntu 16.04 - 12 HDDs for data each, plus 2 SSD metadata OSDs (three > nodes have an additional SSD I added to have more space to rebalance the > metadata). Currently, the cluster is

Re: [ceph-users] RDMA/RoCE enablement failed with (113) No route to host

2018-12-18 Thread Mohamad Gebai
Last I heard (read) was that the RDMA implementation is somewhat experimental. Search for "troubleshooting ceph rdma performance" on this mailing list for more info. (Adding Roman in CC who has been working on this recently.) Mohamad On 12/18/18 11:42 AM, Michael Green wrote: > I don't know.  >

Re: [ceph-users] RDMA/RoCE enablement failed with (113) No route to host

2018-12-18 Thread Michael Green
I don't know. The Ceph documentation for Mimic doesn't appear to go into much detail on RDMA in general, but it's still mentioned in the Ceph docs here and there. Some examples: Change log - http://docs.ceph.com/docs/master/releases/mimic/
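
For reference, the async messenger's RDMA backend is enabled through ceph.conf; a minimal sketch, assuming the verbs device is mlx5_0 (the device name, and whether the build has RDMA support compiled in, are assumptions):

  [global]
  ms_type = async+rdma
  ms_async_rdma_device_name = mlx5_0
  # RoCE setups usually also need the local GID pinned per daemon, e.g.:
  # ms_async_rdma_local_gid = <GID of the port/VLAN to use>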

Re: [ceph-users] RBD snapshot atomicity guarantees?

2018-12-18 Thread Hector Martin
On 18/12/2018 20:29, Oliver Freyermuth wrote: Potentially, if granted arbitrary command execution by the guest agent, you could check (there might be a better interface than parsing meminfo...): cat /proc/meminfo | grep -i dirty, which shows e.g. "Dirty: 19476 kB". You could guess from that
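
A sketch of how that check could be run through the guest agent rather than a shell in the guest, assuming qemu-guest-agent is installed in the VM and the domain is named vm1 (domain name and grep path are assumptions):

  virsh qemu-agent-command vm1 '{"execute":"guest-exec","arguments":{"path":"/usr/bin/grep","arg":["-i","dirty","/proc/meminfo"],"capture-output":true}}'
  # returns {"return":{"pid":N}}; the (base64-encoded) output is then fetched with:
  virsh qemu-agent-command vm1 '{"execute":"guest-exec-status","arguments":{"pid":N}}'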

Re: [ceph-users] RDMA/RoCE enablement failed with (113) No route to host

2018-12-18 Thread Виталий Филиппов
Is RDMA officially supported? I'm asking because I recently tried to use DPDK and it seems it's broken... i.e. the code is there, but it does not compile until I fix the cmake scripts, and after fixing the build, OSDs just get segfaults and die after processing something like 40-50 incoming packets.

Re: [ceph-users] IRC channels now require registered and identified users

2018-12-18 Thread Joao Eduardo Luis
On 12/18/2018 11:22 AM, Joao Eduardo Luis wrote: > On 12/18/2018 11:18 AM, Dan van der Ster wrote: >> Hi Joao, >> >> Has that broken the Slack connection? I can't tell if it's broken or >> just quiet... last message on #ceph-devel was today at 1:13am. > > Just quiet, it seems. Just tested it and

Re: [ceph-users] RBD snapshot atomicity guarantees?

2018-12-18 Thread Oliver Freyermuth
On 18.12.18 at 11:48, Hector Martin wrote: > On 18/12/2018 18:28, Oliver Freyermuth wrote: >> We have yet to observe these hangs, we are running this with ~5 VMs with ~10 >> disks for about half a year now with daily snapshots. But all of these VMs >> have very "low" I/O, >> since we put

Re: [ceph-users] IRC channels now require registered and identified users

2018-12-18 Thread Joao Eduardo Luis
On 12/18/2018 11:18 AM, Dan van der Ster wrote: > Hi Joao, > > Has that broken the Slack connection? I can't tell if it's broken or > just quiet... last message on #ceph-devel was today at 1:13am. Just quiet, it seems. Just tested it and the bridge is still working. -Joao

Re: [ceph-users] IRC channels now require registered and identified users

2018-12-18 Thread Dan van der Ster
Hi Joao, Has that broken the Slack connection? I can't tell if it's broken or just quiet... last message on #ceph-devel was today at 1:13am. -- Dan On Tue, Dec 18, 2018 at 12:11 PM Joao Eduardo Luis wrote: > > All, > > > Earlier this week our IRC channels were set to require users to be >

[ceph-users] IRC channels now require registered and identified users

2018-12-18 Thread Joao Eduardo Luis
All, Earlier this week our IRC channels were set to require users to be registered and identified before being allowed to join a channel. This looked like the most reasonable option to combat the onslaught of spam bots we've been getting in the last weeks/months. As of today, this is in effect
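
For anyone not yet registered, a typical NickServ sequence from an IRC client looks like this (exact syntax varies between IRC networks and services implementations, so treat this as a sketch):

  /msg NickServ REGISTER <password> <email>
  # confirm with the verification command sent to <email>, then on each connect:
  /msg NickServ IDENTIFY <password>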

Re: [ceph-users] RBD snapshot atomicity guarantees?

2018-12-18 Thread Hector Martin
On 18/12/2018 18:28, Oliver Freyermuth wrote: We have yet to observe these hangs, we are running this with ~5 VMs with ~10 disks for about half a year now with daily snapshots. But all of these VMs have very "low" I/O, since we put anything I/O intensive on bare metal (but with automated

Re: [ceph-users] Luminous (12.2.8 on CentOS), recover or recreate incomplete PG

2018-12-18 Thread Dan van der Ster
Hi Fulvio! Are you able to query that pg -- which osd is it waiting for? Also, since you're prepared for data loss anyway, you might have success setting osd_find_best_info_ignore_history_les=true on the relevant osds (set it in the conf, restart those osds). -- dan On Tue, Dec 18, 2018 at
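
A sketch of the two steps described above (pgid and OSD id are placeholders; the option is meant to be temporary and removed again once the PG has recovered):

  ceph pg <pgid> query            # look at recovery_state / peering_blocked_by
  # in ceph.conf on the relevant OSD host:
  [osd.<id>]
  osd_find_best_info_ignore_history_les = true
  # then restart that OSD:
  systemctl restart ceph-osd@<id>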

[ceph-users] Luminous (12.2.8 on CentOS), recover or recreate incomplete PG

2018-12-18 Thread Fulvio Galeazzi
Hallo Cephers, I am stuck with an incomplete PG and am seeking help. At some point I had a bad configuration for gnocchi which caused a flood of tiny objects to the backend Ceph rados pool. While cleaning things up, the load on the OSD disks was such that 3 of them "committed suicide"

Re: [ceph-users] RBD snapshot atomicity guarantees?

2018-12-18 Thread ceph
For what it's worth, we have been using snapshots on a daily basis for a couple of thousand RBD volumes for some time. So far so good, we have not caught any issues. On 12/18/2018 10:28 AM, Oliver Freyermuth wrote: > Dear Hector, > > we are using the very same approach on CentOS 7 (freeze + thaw), but >

Re: [ceph-users] Create second pool with different disk size

2018-12-18 Thread Konstantin Shalygin
Assign a "custom" device class to the new OSD's. This would align pretty with how the existing OSD's are assigned to different crush rules, but I'm not sure if this is the correct way to do it, or if custom device classes is actually supported? This is correct way. Device-classes work like

Re: [ceph-users] RBD snapshot atomicity guarantees?

2018-12-18 Thread Oliver Freyermuth
Dear Hector, we are using the very same approach on CentOS 7 (freeze + thaw), but preceded by an fstrim. With virtio-scsi, using fstrim propagates the discards from within the VM to Ceph RBD (if qemu is configured accordingly), and a lot of space is saved. We have yet to observe these hangs,
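
For reference, a libvirt disk definition that lets fstrim in the guest reach RBD looks roughly like this (pool/image and monitor names are placeholders; virtio-scsi together with discard='unmap' is the relevant part):

  <controller type='scsi' model='virtio-scsi'/>
  <disk type='network' device='disk'>
    <driver name='qemu' type='raw' discard='unmap'/>
    <source protocol='rbd' name='rbdpool/vm1-disk0'>
      <host name='mon1.example.com' port='6789'/>
    </source>
    <target dev='sda' bus='scsi'/>
  </disk>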

[ceph-users] RBD snapshot atomicity guarantees?

2018-12-18 Thread Hector Martin
Hi list, I'm running libvirt qemu guests on RBD, and currently taking backups by issuing a domfsfreeze, taking a snapshot, and then issuing a domfsthaw. This seems to be a common approach. This is safe, but it's impactful: the guest has frozen I/O for the duration of the snapshot. This is
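
A minimal sketch of that sequence from the hypervisor (domain, pool, and image names are placeholders):

  virsh domfsfreeze vm1
  rbd snap create rbdpool/vm1-disk0@backup-$(date +%Y%m%d)
  virsh domfsthaw vm1
  # guest I/O is frozen only between the first and last command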

[ceph-users] Create second pool with different disk size

2018-12-18 Thread Troels Hansen
I'm having an issue where the existing Ceph cluster consists mostly of 10 TB 3.5" HDDs, plus some smaller 1 TB SSDs, in different pools. This works perfectly. Storage pools are assigned a crush rule with either ssd or hdd set as the storage class. Now I want to assign a bunch of smaller 2.5"

[ceph-users] Priority of repair vs rebalancing?

2018-12-18 Thread jesper
Hi. In our ceph cluster we hit one OSD at 95% full while others in the same pool only hit 40% (total usage is ~55%). Thus I ran: sudo ceph osd reweight-by-utilization 110 0.05 12 which initiated some data movement... but right after, ceph status reported: jk@bison:~/adm-git$ sudo ceph
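
For reference, the same parameters can be tried as a dry run first; the arguments are the overload threshold in percent of the mean utilization, the maximum weight change per OSD, and the maximum number of OSDs to touch:

  # show what would change, without applying it:
  ceph osd test-reweight-by-utilization 110 0.05 12
  # apply it (as above):
  ceph osd reweight-by-utilization 110 0.05 12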