[ceph-users] Luminous and calamari

2018-02-15 Thread Laszlo Budai
Hi, I've just started up the dashboard component of the ceph mgr. It looks OK, but from what I can see, and from what I was able to find in the docs, the dashboard is only for monitoring. Is there any plugin that allows management of the ceph resources (pool create/delete)? Thanks, Laszlo
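A minimal sketch, assuming a Luminous cluster with an admin keyring (the pool name and PG counts are only examples, not from the thread): the dashboard module is enabled through ceph-mgr but is read-only, while pool create/delete remains a CLI operation.

    # enable the read-only dashboard plugin on a Luminous ceph-mgr
    ceph mgr module enable dashboard
    # pool management is still done from the CLI; deletion may also require
    # mon_allow_pool_delete=true on the monitors
    ceph osd pool create testpool 128 128 replicated
    ceph osd pool delete testpool testpool --yes-i-really-really-mean-it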

Re: [ceph-users] Changing the failure domain

2017-09-01 Thread Laszlo Budai
ost people find that they can use 3-5 before the disks are active enough to come close to impacting customer traffic. That would lead me to think you have a dying drive that you're reading from/writing to in sectors that are bad or at least slower. On Fri, Sep 1, 2017, 6:13 AM Laszlo B

Re: [ceph-users] Changing the failure domain

2017-09-01 Thread Laszlo Budai
iling drives) which could easily cause things to block. Also checking if your disks or journals are maxed out with iostat could shine some light on any mitigating factor. On Thu, Aug 31, 2017 at 9:01 AM Laszlo Budai <las...@componentsoft.eu> wrote: Dear all! In our Ha

[ceph-users] Changing the failure domain

2017-08-31 Thread Laszlo Budai
Dear all! In our Hammer cluster we are planning to switch our failure domain from host to chassis. We have performed some simulations, and regardless of the settings we used, some slow requests have appeared every time. We had the following settings: "osd_max_backfills": "1", "
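A common way to keep such a failure-domain change from generating slow requests is to throttle recovery before editing the crush map; a minimal sketch using Hammer-era options (the values are illustrative, not the ones from this thread):

    # throttle backfill/recovery on all OSDs before moving buckets to chassis
    ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'
    # then apply the crush change and watch the rebalance
    ceph -w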

Re: [ceph-users] expanding cluster with minimal impact

2017-08-08 Thread Laszlo Budai
at 8:12 PM, Laszlo Budai wrote: Dear all, I need to expand a ceph cluster with minimal impact. Reading previous threads on this topic from the list I've found the ceph-gentle-reweight script (https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-gentle-reweight) created by Dan van der

Re: [ceph-users] expanding cluster with minimal impact

2017-08-08 Thread Laszlo Budai
reduce the extra data movement we were seeing with smaller weight increases. Maybe something to try out next time? Bryan From: ceph-users on behalf of Dan van der Ster Date: Friday, August 4, 2017 at 1:59 AM To: Laszlo Budai Cc: ceph-users Subject: Re: [ceph-users] expanding cluster with mini

[ceph-users] expanding cluster with minimal impact

2017-08-03 Thread Laszlo Budai
Dear all, I need to expand a ceph cluster with minimal impact. Reading previous threads on this topic from the list I've found the ceph-gentle-reweight script (https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-gentle-reweight) created by Dan van der Ster (Thank you Dan for sharin
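The linked ceph-gentle-reweight script automates raising crush weights in small steps; the same idea can be sketched by hand as below (OSD id, target weight and step size are placeholders, and the linked script is the more complete implementation):

    # raise the crush weight of a new OSD in small steps, waiting for HEALTH_OK in between
    osd=72; target=1.81; step=0.05; w=0
    while awk "BEGIN{exit !($w < $target)}"; do
        w=$(awk "BEGIN{w=$w+$step; if (w>$target) w=$target; print w}")
        ceph osd crush reweight osd.$osd $w
        until ceph health | grep -q HEALTH_OK; do sleep 60; done
    done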

[ceph-users] RBD Snapshot space accounting ...

2017-07-26 Thread Laszlo Budai
Dear all, Where can I read more about how the space used by a snapshot of an RBD image is calculated? Or can someone explain it here? I can see that before the snapshot is created, the size of the image is let's say 100M as reported by the rbd du command, while after taking the snapshot, I ca
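One way to observe the accounting is simply to compare rbd du output before a snapshot, right after it, and again after writing to the image; a small sketch (pool, image and snapshot names are examples):

    rbd du volumes/myimage                 # usage of the image before the snapshot
    rbd snap create volumes/myimage@before
    rbd du volumes/myimage                 # now lists the snapshot and the image HEAD separately
    # after new writes to the image, rbd du shows how far the HEAD has diverged from the snapshot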

Re: [ceph-users] best practices for expanding hammer cluster

2017-07-19 Thread Laszlo Budai
olled the impact of the recovery/refilling operation on your clients' data traffic? What setting have you used to avoid slow requests? Kind regards, Laszlo On 19.07.2017 17:40, Richard Hesketh wrote: On 19/07/17 15:14, Laszlo Budai wrote: Hi David, Thank you for that reference about CRU

Re: [ceph-users] best practices for expanding hammer cluster

2017-07-19 Thread Laszlo Budai
e the hosts into them. Sage explains a lot of the crush map here. https://www.slideshare.net/mobile/sageweil1/a-crash-course-in-crush On Wed, Jul 19, 2017, 2:43 AM Laszlo Budai <las...@componentsoft.eu> wrote: Hi David, thank you for pointing this out. Google wasn't

Re: [ceph-users] best practices for expanding hammer cluster

2017-07-18 Thread Laszlo Budai
https://www.spinics.net/lists/ceph-users/msg37252.html On Tue, Jul 18, 2017, 9:07 AM Laszlo Budai <las...@componentsoft.eu> wrote: Dear all, we are planning to add new hosts to our existing hammer clusters, and I'm looking for best practices recommendations. cur

[ceph-users] best practices for expanding hammer cluster

2017-07-18 Thread Laszlo Budai
Dear all, we are planning to add new hosts to our existing hammer clusters, and I'm looking for best practices recommendations. Currently we have 2 clusters with 72 OSDs and 6 nodes each. We want to add 3 more nodes (36 OSDs) to each cluster, and we have some questions about what would be the
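One frequently recommended pattern for adding whole nodes is to let the new OSDs join with a crush weight of 0 and then raise the weights gradually (as in the reweight sketch earlier in this archive); a sketch of the ceph.conf part, set on the new nodes before their OSDs are created:

    [osd]
    osd crush initial weight = 0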

Re: [ceph-users] cluster network question

2017-07-17 Thread Laszlo Budai
mds services on the network will do nothing. On Fri, Jul 14, 2017, 11:39 AM Laszlo Budai <las...@componentsoft.eu> wrote: Dear all, I'm reading the docs at http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/ regarding the cluster network and I wond

[ceph-users] cluster network question

2017-07-14 Thread Laszlo Budai
Dear all, I'm reading the docs at http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/ regarding the cluster network and I wonder which nodes are connected to the dedicated cluster network? The diagram on the mentioned page only shows the OSDs connected to the cluster netwo
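Only the OSDs use the cluster network (replication, recovery and OSD heartbeats); monitors, MDS and clients talk to the cluster over the public network, and OSDs must be reachable on both. A minimal ceph.conf sketch with illustrative subnets:

    [global]
    public network  = 192.168.1.0/24
    cluster network = 192.168.2.0/24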

Re: [ceph-users] Crushmap from Rack aware to Node aware

2017-06-02 Thread Laszlo Budai
position where you need to rush to the datacenter to fix the hardware problems ASAP. On Fri, Jun 2, 2017, 5:14 AM Laszlo Budai <las...@componentsoft.eu> wrote: Hi David, If I understand correctly your suggestion is the following: If we have for instance 12 servers gr

Re: [ceph-users] Crushmap from Rack aware to Node aware

2017-06-02 Thread Laszlo Budai
Hi David, If I understand correctly your suggestion is the following: If we have for instance 12 servers grouped into 3 racks (4/rack) then you would build a crush map saying that you have 6 racks (virtual ones), and 2 servers in each of them, right? In this case if we are setting the failure
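The "virtual rack" layout described above can be built with the usual crush bucket commands; a sketch for one of the six virtual racks (bucket and host names are examples):

    ceph osd crush add-bucket vrack1 rack
    ceph osd crush move vrack1 root=default
    ceph osd crush move node01 rack=vrack1
    ceph osd crush move node02 rack=vrack1
    # repeat for vrack2..vrack6, then point the rule's chooseleaf type at "rack"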

Re: [ceph-users] strange remap on host failure

2017-05-30 Thread Laszlo Budai
on by default? Yesterday we were able to reproduce the issue on a test cluster. Hammer has performed the same way, but Jewel has worked properly. Upgrading to jewel is planned, but it was not decided yet when to happen. Thank you, Laszlo On 30.05.2017 23:17, Gregory Farnum wrote: On Mon, May

Re: [ceph-users] strange remap on host failure

2017-05-30 Thread Laszlo Budai
2017 at 6:17 AM, Gregory Farnum wrote: On Mon, May 29, 2017 at 4:58 AM, Laszlo Budai wrote: Hello all, We have a ceph cluster with 72 OSDs distributed on 6 hosts, in 3 chassis. In our crush map we are distributing the PGs on chassis (complete crush map below): # rules rule repli

Re: [ceph-users] strange remap on host failure

2017-05-30 Thread Laszlo Budai
the crush map, an osd being marked out changes the crush map, an osd being removed from the cluster changes the crush map... The crush map changes all the time even if you aren't modifying it directly. On Tue, May 30, 2017 at 2:08 PM Laszlo Budai <las...@componentsoft.eu> wro

Re: [ceph-users] strange remap on host failure

2017-05-30 Thread Laszlo Budai
ee if any of the PGs are showing that they are running on multiple OSDs inside of the same failure domain. On Tue, May 30, 2017 at 12:34 PM Laszlo Budai <las...@componentsoft.eu> wrote: Hello David, Thank you for your message. Indeed we were exp

Re: [ceph-users] strange remap on host failure

2017-05-30 Thread Laszlo Budai
s can work if you replace failed storage quickly. On Mon, May 29, 2017, 12:07 PM Laszlo Budai <las...@componentsoft.eu> wrote: Dear all, How should ceph react in case of a host failure when from a total of 72 OSDs 12 are out? Is it normal that for the remapping of the PG

[ceph-users] Ceph recovery

2017-05-29 Thread Laszlo Budai
Hello, can someone give me some directions on how the ceph recovery works? Let's suppose we have a ceph cluster with several nodes grouped in 3 racks (2 nodes/rack). The crush map is configured to distribute PGs on OSDs from different racks. What happens if a node fails? Where can I read a des

Re: [ceph-users] strange remap on host failure

2017-05-29 Thread Laszlo Budai
29.05.2017 14:58, Laszlo Budai wrote: Hello all, We have a ceph cluster with 72 OSDs distributed on 6 hosts, in 3 chassis. In our crush map we are distributing the PGs on chassis (complete crush map below): # rules rule replicated_ruleset { ruleset 0 type replicated

[ceph-users] strange remap on host failure

2017-05-29 Thread Laszlo Budai
Hello all, We have a ceph cluster with 72 OSDs distributed on 6 hosts, in 3 chassis. In our crush map we are distributing the PGs on chassis (complete crush map below): # rules rule replicated_ruleset { ruleset 0 type replicated min_size 1 max_size 10
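For context, a chassis-level replicated rule of the kind described here generally looks like the following in a decompiled crush map; this is a generic sketch matching the truncated snippet above, not necessarily the poster's exact rule:

    rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type chassis
        step emit
    }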

Re: [ceph-users] failed lossy con, dropping message

2017-04-13 Thread Laszlo Budai
17 AM Laszlo Budai <las...@componentsoft.eu> wrote: Hello Greg, Thank you for the answer. I'm still in doubt with the "lossy". What does it mean in this context? I can think of different variants: 1. The designer of the protocol from start is consid

Re: [ceph-users] failed lossy con, dropping message

2017-04-13 Thread Laszlo Budai
e connection. Maybe both are wrong and the truth is a third variant ... :) This is what I would like to understand. Kind regards, Laszlo On 13.04.2017 00:36, Gregory Farnum wrote: On Wed, Apr 12, 2017 at 3:00 AM, Laszlo Budai wrote: Hello, yesterday one of our compute nodes has record

Re: [ceph-users] failed lossy con, dropping message

2017-04-12 Thread Laszlo Budai
On 12.04.2017 22:19, Alex Gorbachev wrote: Hi Laszlo, On Wed, Apr 12, 2017 at 6:26 AM Laszlo Budai <las...@componentsoft.eu> wrote: Hello, yesterday one of our compute nodes has recorded the following message for one of the ceph connections: submit_message osd_op(

[ceph-users] failed lossy con, dropping message

2017-04-12 Thread Laszlo Budai
Hello, yesterday one of our compute nodes has recorded the following message for one of the ceph connections: submit_message osd_op(client.28817736.0:690186 rbd_data.15c046b11ab57b7.00c4 [read 2097152~380928] 3.6f81364a ack+read+known_if_redirected e3617) v5 remote, 10.12.68.71:68

Re: [ceph-users] null characters at the end of the file on hard reboot of VM

2017-04-09 Thread Laszlo Budai
mstances aren't the same, but the patterns of behaviour are similar enough that I wanted to raise awareness. k8 On Sat, Apr 8, 2017 at 6:39 AM Laszlo Budai <las...@componentsoft.eu> wrote: Hello Peter, Thank you for your answer. In our setup we have the virtu

Re: [ceph-users] null characters at the end of the file on hard reboot of VM

2017-04-07 Thread Laszlo Budai
keystone_token_cache_size": "1", "rgw_bucket_quota_cache_size": "1", I did some tests and the problem has appeared when I was using ext4 in the VM, but not in the case of xfs. I did an other test when I was calling a sync at the end of the while loop,

[ceph-users] null characters at the end of the file on hard reboot of VM

2017-04-07 Thread Laszlo Budai
Hello, we have observed that there are null characters written into the open files when hard rebooting a VM. Is this a known issue? Our VM is using ceph (0.94.10) storage. We have a script like this: while sleep 1; do date >> somefile ; done if we hard reset the VM while the above line is runnin
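A follow-up in this thread notes that adding a sync after each write avoided the problem on ext4; the two test variants side by side (the file name is just an example):

    # variant 1, as in the original report: data may be lost or zero-filled on hard reset
    while sleep 1; do date >> somefile; done
    # variant 2: flush the guest page cache after every write
    while sleep 1; do date >> somefile; sync; done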

Re: [ceph-users] Librbd logging

2017-04-07 Thread Laszlo Budai
|finalize)" log entries 3) use the asok file during one of these events to dump the objecter requests [1] http://docs.ceph.com/docs/jewel/rbd/rbd-replay/ [2] http://tracker.ceph.com/issues/14629 On Tue, Apr 4, 2017 at 7:36 AM, Laszlo Budai wrote: Hello cephers, I have a situation whe

[ceph-users] write to ceph hangs

2017-04-05 Thread Laszlo Budai
Hello, We have an issue when writing to ceph. From time to time the write operation seems to hang for a few seconds. We've seen the https://bugzilla.redhat.com/show_bug.cgi?id=1389503, and there it is said that when the qemu process would reach the max open files limit, then "the guest OS shou

[ceph-users] Librbd logging

2017-04-04 Thread Laszlo Budai
Hello cephers, I have a situation where from time to time the write operation to the ceph storage hangs for 3-5 seconds. For testing we have a simple line like: while sleep 1; do date >> logfile; done & with this we can see that rarely there are 3 seconds or more differences between the consecuti
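A quick way to spot the multi-second stalls in such a log is to record epoch seconds and compare consecutive entries; a sketch (the 3-second threshold and file name are only examples):

    # log whole-second timestamps ...
    while sleep 1; do date +%s >> logfile; done &
    # ... then report every gap of 3 seconds or more between consecutive writes
    awk 'NR > 1 && $1 - prev >= 3 { print "gap of " $1 - prev "s before epoch " $1 } { prev = $1 }' logfile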

[ceph-users] ceph pg dump - last_scrub last_deep_scrub

2017-03-24 Thread Laszlo Budai
Hello, can someone tell me the meaning of the last_scrub and last_deep_scrub values in the ceph pg dump output? I could not find it on Google or in the documentation. For example I can see here the last_scrub being 61092'4385, and the last_deep_scrub=61086'4379 pg_stat objects mip
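The values are eversion stamps of the form epoch'version (the OSD map epoch and PG version recorded when the last scrub / deep scrub completed); the wall-clock times are in the last_scrub_stamp / last_deep_scrub_stamp fields, which are easier to read per PG from a pg query than from the full dump (the pg id below is an example):

    ceph pg 3.367 query | grep -E '"last_(deep_)?scrub(_stamp)?"'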

[ceph-users] pgs stale during patching

2017-03-21 Thread Laszlo Budai
Hello, we have been patching our ceph cluster 0.94.7 to 0.94.10. We were updating one node at a time, and after each OSD node has been rebooted we were waiting for the cluster health status to be OK. In the docs we have "stale - The placement group status has not been updated by a ceph-osd, in
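When rebooting one node at a time it helps to set noout for the duration of the reboot so the cluster does not start backfilling while that node's OSDs are down; a minimal sketch:

    ceph osd set noout
    # patch and reboot the node, wait for its OSDs to rejoin
    ceph osd unset noout
    ceph health    # wait for HEALTH_OK before moving to the next node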

Re: [ceph-users] ceph 0.94.10 ceph-objectstore-tool segfault

2017-03-17 Thread Laszlo Budai
Hi all, I've found that the problem was due to a missing /etc/ceph/ceph.client.admin.keyring file on the storage node where I was trying to do the import-rados operation. Kind regards, Laszlo On 15.03.2017 20:22, Laszlo Budai wrote: Hello, I'm trying to do an import-rados operatio

Re: [ceph-users] pgs stuck inactive

2017-03-17 Thread Laszlo Budai
h/$cluster-$name.$pid.log Then run the ceph-objectstore-tool again taking careful note of what file is created in /var/log/ceph/ and upload that. On Thu, Mar 16, 2017 at 5:21 PM, Laszlo Budai wrote: My mistake, I've run it on a wrong system ... I've attached the terminal output

Re: [ceph-users] pgs stuck inactive

2017-03-16 Thread Laszlo Budai
My mistake, I've run it on a wrong system ... I've attached the terminal output. I've run this on a test system where I was getting the same segfault when trying import-rados. Kind regards, Laszlo On 16.03.2017 07:41, Laszlo Budai wrote: [root@storage2 ~]# gdb -ex 'r

Re: [ceph-users] pgs stuck inactive

2017-03-15 Thread Laszlo Budai
the debuginfo for ceph (how this works depends on your distro) and run the following? # gdb -ex 'r' -ex 't a a bt full' -ex 'q' --args ceph-objectstore-tool import-rados volumes pg.3.367.export.OSD.35 On Thu, Mar 16, 2017 at 12:02 AM, Laszlo Budai wrote: Hello, the

[ceph-users] ceph 0.94.10 ceph-objectstore-tool segfault

2017-03-15 Thread Laszlo Budai
Hello, I'm trying to do an import-rados operation, but the ceph-objectstore-tool crashes with segfault: [root@storage1 ~]# ceph-objectstore-tool import-rados images pg6.6exp-osd1 *** Caught signal (Segmentation fault) ** in thread 7f84e0b24880 ceph version 0.94.10 (b1e0532418e4631af01acbc0ced

Re: [ceph-users] pgs stuck inactive

2017-03-15 Thread Laszlo Budai
ory on the disk). Use force_create_pg to recreate the pg empty. Use ceph-objectstore-tool to do a rados import on the exported pg copy. On Wed, Mar 15, 2017 at 12:00 PM, Laszlo Budai wrote: Hello, I have tried to recover the pg using the following steps: Preparation: 1. set noout 2. stop

Re: [ceph-users] pgs stuck inactive

2017-03-15 Thread Laszlo Budai
on the disk). Use force_create_pg to recreate the pg empty. Use ceph-objectstore-tool to do a rados import on the exported pg copy. On Wed, Mar 15, 2017 at 12:00 PM, Laszlo Budai wrote: Hello, I have tried to recover the pg using the following steps: Preparation: 1. set noout 2. stop osd.2

Re: [ceph-users] pgs stuck inactive

2017-03-15 Thread Laszlo Budai
Hello, So, I've done the following steps: 1. set noout 2. stop osd2 3. ceph-objectstore-tool remove 4. start osd2 5. repeat steps 2-4 on osd 28 and 35 then I've run the ceph pg force_create_pg 3.367. This has left the PG in creating state: # ceph -s cluster 6713d1b8-83da-11e6-aa79-525400d98
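For reference, step 3 ("ceph-objectstore-tool remove") in the list above is normally run against the stopped OSD's filestore and journal paths; a sketch for osd.2 and pg 3.367, assuming the default Hammer filestore layout and that an export of the pg was taken first:

    # only with the OSD stopped; removes the local (incomplete) copy of pg 3.367 from osd.2
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 \
        --journal-path /var/lib/ceph/osd/ceph-2/journal \
        --pgid 3.367 --op remove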

Re: [ceph-users] pgs stuck inactive

2017-03-14 Thread Laszlo Budai
e. What else can I try? Thank you, Laszlo On 12.03.2017 13:06, Brad Hubbard wrote: On Sun, Mar 12, 2017 at 7:51 PM, Laszlo Budai wrote: Hello, I have already done the export with ceph_objectstore_tool. I just have to decide which OSDs to keep. Can you tell me why the directory structur

Re: [ceph-users] osd_disk_thread_ioprio_priority help

2017-03-12 Thread Laszlo Budai
ions which would help to improve the cluster's responsiveness during deep scrub operations. Kind regards, Laszlo On 12.03.2017 21:21, Florian Haas wrote: On Sat, Mar 11, 2017 at 4:24 PM, Laszlo Budai wrote: Can someone explain the meaning of osd_disk_thread_ioprio_priority. I'm [...

Re: [ceph-users] pgs stuck inactive

2017-03-12 Thread Laszlo Budai
'll read it. So far, searching for the architecture of an OSD, I could not find the gory details about these directories. Kind regards, Laszlo On 12.03.2017 02:12, Brad Hubbard wrote: On Sat, Mar 11, 2017 at 7:43 PM, Laszlo Budai wrote: Hello, Thank you for your answer. indeed the min_size

Re: [ceph-users] osd_disk_thread_ioprio_priority help

2017-03-11 Thread Laszlo Budai
On 11.03.2017 16:25, Nick Fisk wrote: -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Laszlo Budai Sent: 11 March 2017 13:51 To: ceph-users Subject: [ceph-users] osd_disk_thread_ioprio_priority help Hello, Can someone explain the meaning

[ceph-users] osd_disk_thread_ioprio_priority help

2017-03-11 Thread Laszlo Budai
Hello, Can someone explain the meaning of osd_disk_thread_ioprio_priority. I'm reading the definition from this page: https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/1.3/html/configuration_guide/osd_configuration_reference it says: "It sets the ioprio_set(2) I/O scheduling p
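These two options only take effect when the OSD data disks use the CFQ I/O scheduler; they set the ioprio class and priority of the OSD disk (scrub) thread. A sketch of the common "deprioritise scrubbing" setting (the device name is an example):

    cat /sys/block/sdb/queue/scheduler    # confirm cfq is the active scheduler
    ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle --osd_disk_thread_ioprio_priority 7'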

Re: [ceph-users] pgs stuck inactive

2017-03-11 Thread Laszlo Budai
/msg17820.html If you want to abandon the pg see http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012778.html for a possible solution. http://ceph.com/community/incomplete-pgs-oh-my/ may also give some ideas. On Fri, Mar 10, 2017 at 9:44 PM, Laszlo Budai wrote: The OSDs are al

Re: [ceph-users] pgs stuck inactive

2017-03-10 Thread Laszlo Budai
are marked DNE and seem to be uncontactable. This seems to be more than a network issue (unless the outage is still happening). http://docs.ceph.com/docs/master/rados/operations/pg-states/?highlight=incomplete On Fri, Mar 10, 2017 at 6:09 PM, Laszlo Budai wrote: Hello, I was informed that

Re: [ceph-users] pgs stuck inactive

2017-03-10 Thread Laszlo Budai
pdate": "0'0", "current_last_stamp": "0.00", "current_info": { "begin": "0.00", "end": "0.00", "versio

[ceph-users] pgs stuck inactive

2017-03-09 Thread Laszlo Budai
Hello, After a major network outage our ceph cluster ended up with an inactive PG: # ceph health detail HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean; 1 requests are blocked > 32 sec; 1 osds have slow requests pg 3.367 is stuck inactive for 912263.766607, current state

Re: [ceph-users] re enable scrubbing

2017-03-08 Thread Laszlo Budai
t the above is slowed down enough that everything is scrubbed within this long scrub interval, but might need adjustment for a more normal setting here: # 60 days ... default is 7 days osd deep scrub interval = 5259488 And more inline answers below On 03/08/17 10:46, Laszlo Budai wrote: Hello

[ceph-users] re enable scrubbing

2017-03-08 Thread Laszlo Budai
Hello, is there any risk of cluster overload when scrubbing is re-enabled after having been disabled for a certain amount of time? I am thinking of the following scenario: 1. scrub/deep scrub are disabled. 2. after a while (a few days) we re-enable them. How will the cluster perform? Will it run a
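After a long noscrub/nodeep-scrub period every PG is overdue at once, so a common approach is to cap scrub concurrency before lifting the flags; a sketch (the value is illustrative):

    ceph tell osd.* injectargs '--osd_max_scrubs 1'
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub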

[ceph-users] Can librbd operations increase iowait?

2017-02-28 Thread Laszlo Budai
Hello, I have a strange situation: On a host server we are running 5 VMs. The VMs have their disks provisioned by cinder from a ceph cluster and are attached by qemu-kvm using librbd. We have a very strange situation where the VMs apparently stop working for a few seconds (10-20), and a

Re: [ceph-users] librbd logging

2017-02-28 Thread Laszlo Budai
Hello, Thank you for the answer. I don't have the admin socket either :( the ceph subdirectory is missing in /var/run. What would be the steps to get the socket? Kind regards, Laszlo On 28.02.2017 05:32, Jason Dillaman wrote: On Mon, Feb 27, 2017 at 12:36 PM, Laszlo Budai wrote: Curr

[ceph-users] librbd logging

2017-02-27 Thread Laszlo Budai
Hello, I have these settings in my /etc/ceph/ceph.conf: [client] rbd cache = true rbd cache writethrough until flush = true admin socket = /var/run/ceph/guests/$cluster-$type.$id.$pid.$cctid.asok log file = /var/log/qemu/qemu-guest-$pid.log rbd concurrent management ops = 20 Currently
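With a [client] section like the one above, the admin socket and log file only appear if the directories exist and are writable by the qemu process; once a socket exists it can be queried directly. The directory ownership and the .asok file name below are assumptions that depend on the local setup:

    mkdir -p /var/run/ceph/guests /var/log/qemu
    chown libvirt-qemu:libvirt-qemu /var/run/ceph/guests /var/log/qemu
    # query a live librbd client socket (the actual name depends on $id/$pid/$cctid)
    ceph --admin-daemon /var/run/ceph/guests/ceph-client.cinder.24031.140233338556416.asok config show | grep rbd_cache
    ceph --admin-daemon /var/run/ceph/guests/ceph-client.cinder.24031.140233338556416.asok perf dump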