[ceph-users] pool/volume live migration

2019-02-08 Thread Luis Periquito
Hi,

A recurring topic is live migration and pool type change (moving from
EC to replicated or vice versa).

When I went to the OpenStack Open Infrastructure event (aka the Summit), Sage
mentioned support for live migration of volumes (and, as a result, of pools)
in Nautilus. Is this still the case, and is live migration expected to be
working by then?

thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Downsizing a cephfs pool

2019-02-08 Thread Marc Roos
 
There is a setting for the maximum number of PGs per OSD. I would raise it 
temporarily so you can work, create a new pool with 8 PGs, move the data 
over to the new pool, remove the old pool, and then unset the max-PG-per-OSD 
limit again.

PS. I always create pools starting with 8 PGs; once I know what I want in 
production I can always increase the PG count.
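
A minimal sketch of that sequence (assuming a Mimic-era cluster with the 
'ceph config' interface; on Luminous the option can be injected with 
injectargs instead, and the pool name and limit value here are only 
placeholders):

ceph config set mon mon_max_pg_per_osd 400    # temporarily raise the per-OSD PG limit
ceph osd pool create cephfs_data_new 8 8      # new pool with only 8 PGs
# ...copy the data over, repoint clients, delete the old pool...
ceph config rm mon mon_max_pg_per_osd         # drop the temporary override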



-Original Message-
From: Brian Topping [mailto:brian.topp...@gmail.com] 
Sent: 08 February 2019 05:30
To: Ceph Users
Subject: [ceph-users] Downsizing a cephfs pool

Hi all, I created a problem when moving data to Ceph and I would be 
grateful for some guidance before I do something dumb.


1.  I started with the 4x 6TB source disks that came together as a 
single XFS filesystem via software RAID. The goal is to have the same 
data on a cephfs volume, but with these four disks formatted for 
bluestore under Ceph.
2.  The only spare disks I had were 2TB, so I put 7x together. I sized 
the data and metadata pools for cephfs at 256 PGs each, but that was wrong.
3.  The copy went smoothly, so I zapped and added the original 4x 6TB 
disks to the cluster.
4.  I realized what I had done: when the 7x 2TB disks were removed, 
there would be far too many PGs per OSD.


I just read over https://stackoverflow.com/a/39637015/478209, but that 
addresses how to do this with a generic pool, not pools used by CephFS. 
It looks easy to copy the pools, but once copied and renamed, CephFS may 
not recognize them as the target and the data may be lost.

Do I need to create new pools and copy again using cpio? Is there a 
better way?

Thanks! Brian


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Downsizing a cephfs pool

2019-02-08 Thread Burkhard Linke

Hi,


you can move the data off to another pool, but you need to keep your 
_first_ data pool, since part of the filesystem metadata is stored in 
that pool. You cannot remove the first pool.
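
For what it's worth, a quick way to see which pools a filesystem uses (a 
sketch; assumes the filesystem is named 'cephfs'):

ceph fs ls                              # lists the metadata pool and all data pools
ceph fs rm_data_pool cephfs <pool>      # works for added data pools, refused for the first one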



Regards,

Burkhard

--
Dr. rer. nat. Burkhard Linke
Bioinformatics and Systems Biology
Justus-Liebig-University Giessen
35392 Giessen, Germany
Phone: (+49) (0)641 9935810

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Debugging 'slow requests' ...

2019-02-08 Thread Massimo Sgaravatto
Our Luminous ceph cluster has been working without problems for a while,
but in the last few days we have been suffering from continuous slow requests.

We have indeed done some changes in the infrastructure recently:

- Moved OSD nodes to a new switch
- Increased pg nums for a pool, to have about ~ 100 PGs/OSD (also because
we have to install new OSDs in the cluster). The output of 'ceph osd df' is
attached.

The problem could also be due to some 'bad' client, but in the logs I don't
see a clear correlation with specific clients or images for such blocked
requests.

I also tried updating to the latest Luminous release and the latest CentOS 7,
but this didn't help.



Attached you can find the details of one such slow operation, which took
about 266 secs (output from 'ceph daemon osd.11 dump_historic_ops').
As far as I can understand from these events:
{
"time": "2019-02-08 10:26:25.651728",
"event": "op_commit"
},
{
"time": "2019-02-08 10:26:25.651965",
"event": "op_applied"
},

  {
"time": "2019-02-08 10:26:25.653236",
"event": "sub_op_commit_rec from 33"
},
{
"time": "2019-02-08 10:30:51.890404",
"event": "sub_op_commit_rec from 23"
},

the problem seems to be the "sub_op_commit_rec from 23" event, which took
far too long.
So the issue is that the reply from OSD 23 took too long?


In the logs of the two OSDs (11 and 23) in that time frame (attached) I can't
find anything useful.
When the problem happened, the load and memory usage were not high on
the relevant nodes.
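
A few commands that can help narrow this kind of problem down (a sketch; 
assumes access to the OSD admin sockets, as used for the dump above):

ceph osd perf                           # per-OSD commit/apply latency, to see whether osd.23 stands out
ceph daemon osd.23 dump_ops_in_flight   # operations currently in progress on the suspect OSD
ceph daemon osd.23 dump_historic_ops    # the same kind of dump, taken on the other side of the sub-op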


Any help to debug the issue is really appreciated ! :-)

Thanks, Massimo
{
{
"description": "osd_op(client.171725953.0:368920029 8.208 
8:10448e0e:::rbd_data.c47f3c390c8495.00018b81:head [set-alloc-hint 
object_size 4194304 write_size 4194304,write 2744320~200704] snapc 0=[] 
ondisk+write+known_if_redirected e1203982)",
"initiated_at": "2019-02-08 10:26:25.614728",
"age": 331.598616,
"duration": 266.275721,
"type_data": {
"flag_point": "commit sent; apply or cleanup",
"client_info": {
"client": "client.171725953",
"client_addr": "192.168.61.66:0/4056439540",
"tid": 368920029
},
"events": [
{
"time": "2019-02-08 10:26:25.614728",
"event": "initiated"
},
{
"time": "2019-02-08 10:26:25.650783",
"event": "queued_for_pg"
},
{
"time": "2019-02-08 10:26:25.650819",
"event": "reached_pg"
},
{
"time": "2019-02-08 10:26:25.650860",
"event": "started"
},
{
"time": "2019-02-08 10:26:25.650919",
"event": "waiting for subops from 23,33"
},
{
"time": "2019-02-08 10:26:25.65",
"event": "commit_queued_for_journal_write"
},
{
"time": "2019-02-08 10:26:25.65",
"event": "commit_queued_for_journal_write"
},
{
"time": "2019-02-08 10:26:25.651141",
"event": "write_thread_in_journal_buffer"
},
{
"time": "2019-02-08 10:26:25.651694",
"event": "journaled_completion_queued"
},
{
"time": "2019-02-08 10:26:25.651728",
"event": "op_commit"
},
{
"time": "2019-02-08 10:26:25.651965",
"event": "op_applied"
},
{
"time": "2019-02-08 10:26:25.653236",
"event": "sub_op_commit_rec from 33"
},
{
"time": "2019-02-08 10:30:51.890404", 
"event": "sub_op_commit_rec from 23"
},
{
"time": "2019-02-08 10:30:51.890434",
"event": "commit_sent"
},
{
"time": "2019-02-08 10:30:51.890450",

[ceph-users] Bluestore increased disk usage

2019-02-08 Thread Jan Kasprzak
Hello, ceph users,

I moved my cluster to bluestore (Ceph Mimic), and now I see the increased
disk usage. From ceph -s:

pools:   8 pools, 3328 pgs
objects: 1.23 M objects, 4.6 TiB
usage:   23 TiB used, 444 TiB / 467 TiB avail

I use 3-way replication of my data, so I would expect the disk usage
to be around 14 TiB, which was true when I used filestore-based Luminous OSDs
before. Why is the disk usage now 23 TiB?

If I remember correctly (a big if!), the disk usage was about the same
when I originally moved the data to empty bluestore OSDs by changing the
crush rule, but it went up after I added more bluestore OSDs and the cluster
rebalanced itself.

Could it be some miscalculation of free space in bluestore? Also, could it be
related to the HEALTH_ERR backfill_toofull problem discussed here in the other
thread?
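
A quick way to see where the extra space is being accounted (a sketch):

ceph df detail      # per-pool usage and object counts
ceph osd df tree    # per-OSD utilisation, to spot imbalance after the rebalance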

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pool/volume live migration

2019-02-08 Thread Caspar Smit
Hi Luis,

According to slide 21 of Sage's presentation at FOSDEM it is coming in
Nautilus:

https://fosdem.org/2019/schedule/event/ceph_project_status_update/attachments/slides/3251/export/events/attachments/ceph_project_status_update/slides/3251/ceph_new_in_nautilus.pdf

Kind regards,
Caspar

On Fri, 8 Feb 2019 at 11:24, Luis Periquito wrote:

> Hi,
>
> a recurring topic is live migration and pool type change (moving from
> EC to replicated or vice versa).
>
> When I went to the OpenStack open infrastructure (aka summit) Sage
> mentioned about support of live migration of volumes (and as a result
> of pools) in Nautilus. Is this still the case and is expected to have
> live migration working by then?
>
> thanks,
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] change OSD IP it uses

2019-02-08 Thread Ashley Merrick
I just tried that; it had already been restarted, as I fully deleted the old
OSD and re-created it using the correct hostname after zapping the disk and
restarting the server itself.

Somewhere it still seems to have stored the external IPs of the other hosts
for just this OSD; after restarting, the log is still full of lines like
"no reply from externalip:6801 osd.21", which is an OSD on another node that
it is trying to reach via that node's external IP.

The three other OSDs I created on the same node after the first problematic
one are all fine and communicating via the internal network correctly, so I
know it's not a node-wide config issue.
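
For reference, the address the cluster currently has on record for an OSD can
be checked like this (a sketch; the OSD ids are just the ones from this
thread):

ceph osd find 22                  # reports the host/ip the cluster has for osd.22
ceph osd dump | grep '^osd.21'    # shows the addresses stored in the OSDMap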

Hope that makes sense!

,Ashley

On Fri, Feb 8, 2019 at 3:58 PM Wido den Hollander  wrote:

>
>
> On 2/8/19 8:38 AM, Ashley Merrick wrote:
> > So I was adding a new host using ceph-deploy, for the first OSD I
> > accidentally run it against the hostname of the external IP and not the
> > internal network.
> >
> > I stopped / deleted the OSD from the new host and then re-created the
> > OSD using the internal hostname along with the rest of the OSD's.
> >
> > They are all running fine however the one that has the same ID as the
> > original OSD I created is trying to communicate with the other OSD's
> > over the external interface, the OSD is working it seem's however unable
> > to control it via any ceph commands.
> >
> > "heartbeat_check: no reply from external IP of another host:6801 osd.21
> > ever on either front or back, first ping sent "
> >
> > Is it possible to update somewhere for this OSD to use the internet
> > network, or is there something else compared to the normal ceph osd
> > removal process I should do before re-adding it.
> >
>
> When you stop/start an OSD it will re-register itself with the cluster
> and update its IP.
>
> No need to remove it. Just a stop/start should be sufficient.
>
> Wido
>
> > Thanks
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Downsizing a cephfs pool

2019-02-08 Thread Brian Topping
Hi Marc, that's great advice, thanks! I'm always grateful for the knowledge. 

What about the issue with the pools containing a CephFS though? Is it something 
where I can just turn off the MDS, copy the pools and rename them back to the 
original name, then restart the MDS? 

Agreed about using smaller numbers. When I went to using seven disks, I was 
getting warnings about too few PGs per OSD. I'm sure this is something one 
learns to cope with via experience and I'm still picking that up. I had hoped 
not to get into a bind like this so quickly, but hey, here I am again :)

> On Feb 8, 2019, at 01:53, Marc Roos  wrote:
> 
> 
> There is a setting to set the max pg per osd. I would set that 
> temporarily so you can work, create a new pool with 8 pg's and move data 
> over to the new pool, remove the old pool, than unset this max pg per 
> osd.
> 
> PS. I am always creating pools starting 8 pg's and when I know I am at 
> what I want in production I can always increase the pg count.
> 
> 
> 
> -Original Message-
> From: Brian Topping [mailto:brian.topp...@gmail.com] 
> Sent: 08 February 2019 05:30
> To: Ceph Users
> Subject: [ceph-users] Downsizing a cephfs pool
> 
> Hi all, I created a problem when moving data to Ceph and I would be 
> grateful for some guidance before I do something dumb.
> 
> 
> 1.I started with the 4x 6TB source disks that came together as a 
> single XFS filesystem via software RAID. The goal is to have the same 
> data on a cephfs volume, but with these four disks formatted for 
> bluestore under Ceph.
> 2.The only spare disks I had were 2TB, so put 7x together. I sized 
> data and metadata for cephfs at 256 PG, but it was wrong.
> 3.The copy went smoothly, so I zapped and added the original 4x 6TB 
> disks to the cluster.
> 4.I realized what I did, that when the 7x2TB disks were removed, 
> there were going to be far too many PGs per OSD.
> 
> 
> I just read over https://stackoverflow.com/a/39637015/478209, but that 
> addresses how to do this with a generic pool, not pools used by CephFS. 
> It looks easy to copy the pools, but once copied and renamed, CephFS may 
> not recognize them as the target and the data may be lost.
> 
> Do I need to create new pools and copy again using cpio? Is there a 
> better way?
> 
> Thanks! Brian
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Downsizing a cephfs pool

2019-02-08 Thread Marc Roos
 

Yes, that is thus only a partial move, not the behaviour you would expect from 
a mv command. (I think this should be changed.)



-Original Message-
From: Burkhard Linke 
[mailto:burkhard.li...@computational.bio.uni-giessen.de] 
Sent: 08 February 2019 11:27
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Downsizing a cephfs pool

Hi,


you can move the data off to another pool, but you need to keep your 
_first_ data pool, since part of the filesystem metadata is stored in 
that pool. You cannot remove the first pool.


Regards,

Burkhard

--
Dr. rer. nat. Burkhard Linke
Bioinformatics and Systems Biology
Justus-Liebig-University Giessen
35392 Giessen, Germany
Phone: (+49) (0)641 9935810

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] best practices for EC pools

2019-02-08 Thread Caspar Smit
On Fri, 8 Feb 2019 at 11:31, Scheurer François <
francois.scheu...@everyware.ch> wrote:

> Dear Eugen Block
> Dear Alan Johnson
>
>
> Thank you for your answers.
>
> So we will use EC 3+2 on 6 nodes.
> Currently with only 4 osd's per node, then 8 and later 20.
>
>
> >Just to add, that a more general formula is that the number of nodes
> should be greater than or equal to k+m+m so N>=k+m+m for full recovery
>
> Understood.
> EC k+m assumes the case of loosing m nodes and that would require m
> 'spare' nodes to recover, so k+m+m in total.
> But the loss of a single node should allow a full recovery, shouldn'it ?
>
> Having 3+2 on 6 nodes should be able to:
> -survive the loss of max 2 nodes simultaneously
>

Yes and no: technically you can survive a 2-node failure, but EC requires K+1
shards (with failure-domain=host, K+1 nodes) to allow writes, so every IO
freezes when you lose the second node, until all affected PGs have recovered
to at least K+1.
So yes, you survive, but no, you can't use the cluster for a while during this;
if you want to keep using your cluster at all times you can only tolerate 1
node failure.
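
The K+1 behaviour comes from the pool's min_size, which defaults to K+1 for
erasure-coded pools. It can be inspected (and, with great care, tuned) per
pool; a sketch with a placeholder pool name:

ceph osd pool get ecpool min_size   # defaults to k+1 on EC pools
ceph osd pool get ecpool size       # k+m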


> -survive the loss of max 3 nodes, if the recovery has enough time to
> complete between failures
>

I think this kind of scenario shouldn't even be considered.


> -recover the loss of max 1 node
>
Only if there's enough free disk space left to hold all the data.

Kind regards,
Caspar


> >If the pools are empty I also wouldn't expect that, is restarting one OSD
> also that slow or is it just when you reboot the whole cluster?
> It also happens after rebooting a single node.
>
> In the mon logs we see a lot os such messages:
>
> 2019-02-06 23:07:46.003473 7f14d8ed6700  1 mon.ewos1-osd1-prod@0(leader).osd
> e116 prepare_failure osd.17 10.38.66.71:6803/76983 from osd.1
> 10.38.67.72:6800/75206 is reporting failure:1
> 2019-02-06 23:07:46.003486 7f14d8ed6700  0 log_channel(cluster) log [DBG]
> : osd.17 10.38.66.71:6803/76983 reported failed by osd.1
> 10.38.67.72:6800/75206
> 2019-02-06  23:07:57.948959
> 7f14d8ed6700  1 mon.ewos1-osd1-prod@0(leader).osd e116 prepare_failure
> osd.17 10.38.66.71:6803/76983 from osd.1 10.38.67.72:6800/75206 is
> reporting failure:0
> 2019-02-06 23:07:57.948971 7f14d8ed6700  0 log_channel(cluster) log [DBG]
> : osd.17 10.38.66.71:6803/76983 failure report canceled by osd.1
> 10.38.67.72:6800/75206
> 2019-02-06  23:08:54.632356
> 7f14d8ed6700  1 mon.ewos1-osd1-prod@0(leader).osd e116 prepare_failure
> osd.0 10.38.65.72:6800/72872 from osd.17 10.38.66.71:6803/76983 is
> reporting failure:1
> 2019-02-06 23:08:54.632374 7f14d8ed6700  0 log_channel(cluster) log [DBG]
> : osd.0 10.38.65.72:6800/72872 reported failed by osd.17
> 10.38.66.71:6803/76983
> 2019-02-06  23:10:21.333513
> 7f14d8ed6700  1 mon.ewos1-osd1-prod@0(leader).osd e116 prepare_failure
> osd.23 10.38.66.71:6807/79639 from osd.18 10.38.67.72:6806/79121 is
> reporting failure:1
> 2019-02-06 23:10:21.333527 7f14d8ed6700  0 log_channel(cluster) log [DBG]
> : osd.23 10.38.66.71:6807/79639 reported failed by osd.18
> 10.38.67.72:6806/79121
> 2019-02-06  23:10:57.660468
> 7f14d8ed6700  1 mon.ewos1-osd1-prod@0(leader).osd e116 prepare_failure
> osd.23 10.38.66.71:6807/79639 from osd.18 10.38.67.72:6806/79121 is
> reporting failure:0
> 2019-02-06 23:10:57.660481 7f14d8ed6700  0 log_channel(cluster) log [DBG]
> : osd.23 10.38.66.71:6807/79639 failure report canceled by osd.18
> 10.38.67.72:6806/79121
>
>
>
> Best Regards
> Francois Scheurer
>
>
>
>
>
> 
> From: ceph-users  on behalf of Alan
> Johnson 
> Sent: Thursday, February 7, 2019 8:11 PM
> To: Eugen Block; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] best practices for EC pools
>
> Just to add, that a more general formula is that the number of nodes
> should be greater than or equal to k+m+m so N>=k+m+m for full recovery
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Eugen Block
> Sent: Thursday, February 7, 2019 8:47 AM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] best practices for EC pools
>
> Hi Francois,
>
> > Is that correct that recovery will be forbidden by the crush rule if a
> > node is down?
>
> yes, that is correct, failure-domain=host means no two chunks of the same
> PG can be on the same host. So if your PG is divided into 6 chunks, they're
> all on different hosts, no recovery is possible at this point (for the
> EC-pool).
>
> > After rebooting all nodes we noticed that the recovery was slow, maybe
> > half an hour, but all pools are currently empty (new install).
> > This is odd...
>
> If the pools are empty I also wouldn't expect that, is restarting one OSD
> also that slow or is it just when you reboot the whole cluster?
>
> > Which k values are preferred on 6 nodes?
>
> It depends on the 

Re: [ceph-users] Downsizing a cephfs pool

2019-02-08 Thread Jan Kasprzak
Hello,

Brian Topping wrote:
: Hi all, I created a problem when moving data to Ceph and I would be grateful 
for some guidance before I do something dumb.
[...]
: Do I need to create new pools and copy again using cpio? Is there a better 
way?

I think I will be facing the same problem soon (moving my cluster
from ~64 1-2TB OSDs to about 16 12TB OSDs). Maybe this is the way to go:

https://ceph.com/geen-categorie/ceph-pool-migration/

(I have not tested that, though.)
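
The simplest (offline) variant of such a migration is a plain pool copy,
which needs full downtime and does not preserve snapshots (a sketch; pool
names are placeholders):

ceph osd pool create newpool 64        # create the target pool first
rados cppool oldpool newpool           # copy every object; clients must be stopped
# then repoint the application at the new pool before deleting the old one
# (keeping in mind the CephFS first-data-pool caveat discussed in this thread)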

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Downsizing a cephfs pool

2019-02-08 Thread Marc Roos
 

I think I would COPY and DELETE the data in chunks, not via the 'backend' 
but just via cephfs, so you are 100% sure nothing weird can happen. 
(MOVE does not work the way you might expect on a cephfs between different 
pools.) You can create and mount an extra data pool in cephfs. I have done 
this too, so you can mix rep3, erasure and a fast ssd pool on your cephfs. 

Adding a pool, something like this:
ceph osd pool set fs_data.ec21 allow_ec_overwrites true
ceph osd pool application enable fs_data.ec21 cephfs
ceph fs add_data_pool cephfs fs_data.ec21

Change a directory to use a different pool:
setfattr -n ceph.dir.layout.pool -v fs_data.ec21 folder
getfattr -n ceph.dir.layout.pool folder


-Original Message-
From: Brian Topping [mailto:brian.topp...@gmail.com] 
Sent: 08 February 2019 10:02
To: Marc Roos
Cc: ceph-users
Subject: Re: [ceph-users] Downsizing a cephfs pool

Hi Marc, that's great advice, thanks! I'm always grateful for the 
knowledge. 

What about the issue with the pools containing a CephFS though? Is it 
something where I can just turn off the MDS, copy the pools and rename 
them back to the original name, then restart the MDS? 

Agreed about using smaller numbers. When I went to using seven disks, I 
was getting warnings about too few PGs per OSD. I'm sure this is 
something one learns to cope with via experience and I'm still picking 
that up. I had hoped not to get into a bind like this so quickly, but hey, 
here I am again :)

> On Feb 8, 2019, at 01:53, Marc Roos  wrote:
> 
> 
> There is a setting to set the max pg per osd. I would set that 
> temporarily so you can work, create a new pool with 8 pg's and move 
> data over to the new pool, remove the old pool, than unset this max pg 

> per osd.
> 
> PS. I am always creating pools starting 8 pg's and when I know I am at 

> what I want in production I can always increase the pg count.
> 
> 
> 
> -Original Message-
> From: Brian Topping [mailto:brian.topp...@gmail.com]
> Sent: 08 February 2019 05:30
> To: Ceph Users
> Subject: [ceph-users] Downsizing a cephfs pool
> 
> Hi all, I created a problem when moving data to Ceph and I would be 
> grateful for some guidance before I do something dumb.
> 
> 
> 1.I started with the 4x 6TB source disks that came together as a 
> single XFS filesystem via software RAID. The goal is to have the same 
> data on a cephfs volume, but with these four disks formatted for 
> bluestore under Ceph.
> 2.The only spare disks I had were 2TB, so put 7x together. I sized 

> data and metadata for cephfs at 256 PG, but it was wrong.
> 3.The copy went smoothly, so I zapped and added the original 4x 
6TB 
> disks to the cluster.
> 4.I realized what I did, that when the 7x2TB disks were removed, 
> there were going to be far too many PGs per OSD.
> 
> 
> I just read over https://stackoverflow.com/a/39637015/478209, but that 

> addresses how to do this with a generic pool, not pools used by 
CephFS.
> It looks easy to copy the pools, but once copied and renamed, CephFS 
> may not recognize them as the target and the data may be lost.
> 
> Do I need to create new pools and copy again using cpio? Is there a 
> better way?
> 
> Thanks! Brian
> 
> 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] best practices for EC pools

2019-02-08 Thread Scheurer François
Dear Eugen Block 
Dear Alan Johnson


Thank you for your answers.

So we will use EC 3+2 on 6 nodes.
Currently with only 4 osd's per node, then 8 and later 20.


>Just to add, that a more general formula is that the number of nodes should be 
>greater than or equal to k+m+m so N>=k+m+m for full recovery

Understood.
EC k+m assumes the case of losing m nodes, and that would require m 'spare' 
nodes to recover, so k+m+m in total.
But the loss of a single node should allow a full recovery, shouldn't it?

Having 3+2 on 6 nodes should be able to:
-survive the loss of max 2 nodes simultaneously
-survive the loss of max 3 nodes, if the recovery has enough time to complete 
between failures
-recover the loss of max 1 node


>If the pools are empty I also wouldn't expect that, is restarting one OSD also 
>that slow or is it just when you reboot the whole cluster?
It also happens after rebooting a single node.

In the mon logs we see a lot of such messages:

2019-02-06 23:07:46.003473 7f14d8ed6700  1 mon.ewos1-osd1-prod@0(leader).osd 
e116 prepare_failure osd.17 10.38.66.71:6803/76983 from osd.1 
10.38.67.72:6800/75206 is reporting failure:1
2019-02-06 23:07:46.003486 7f14d8ed6700  0 log_channel(cluster) log [DBG] : 
osd.17 10.38.66.71:6803/76983 reported failed by osd.1 10.38.67.72:6800/75206
2019-02-06 23:07:57.948959 7f14d8ed6700  1 mon.ewos1-osd1-prod@0(leader).osd 
e116 prepare_failure osd.17 10.38.66.71:6803/76983 from osd.1 
10.38.67.72:6800/75206 is reporting failure:0
2019-02-06 23:07:57.948971 7f14d8ed6700  0 log_channel(cluster) log [DBG] : 
osd.17 10.38.66.71:6803/76983 failure report canceled by osd.1 
10.38.67.72:6800/75206
2019-02-06 23:08:54.632356 7f14d8ed6700  1 mon.ewos1-osd1-prod@0(leader).osd 
e116 prepare_failure osd.0 10.38.65.72:6800/72872 from osd.17 
10.38.66.71:6803/76983 is reporting failure:1
2019-02-06 23:08:54.632374 7f14d8ed6700  0 log_channel(cluster) log [DBG] : 
osd.0 10.38.65.72:6800/72872 reported failed by osd.17 10.38.66.71:6803/76983
2019-02-06 23:10:21.333513 7f14d8ed6700  1 mon.ewos1-osd1-prod@0(leader).osd 
e116 prepare_failure osd.23 10.38.66.71:6807/79639 from osd.18 
10.38.67.72:6806/79121 is reporting failure:1
2019-02-06 23:10:21.333527 7f14d8ed6700  0 log_channel(cluster) log [DBG] : 
osd.23 10.38.66.71:6807/79639 reported failed by osd.18 10.38.67.72:6806/79121
2019-02-06 23:10:57.660468 7f14d8ed6700  1 mon.ewos1-osd1-prod@0(leader).osd 
e116 prepare_failure osd.23 10.38.66.71:6807/79639 from osd.18 
10.38.67.72:6806/79121 is reporting failure:0
2019-02-06 23:10:57.660481 7f14d8ed6700  0 log_channel(cluster) log [DBG] : 
osd.23 10.38.66.71:6807/79639 failure report canceled by osd.18 
10.38.67.72:6806/79121



Best Regards
Francois Scheurer






From: ceph-users  on behalf of Alan Johnson 

Sent: Thursday, February 7, 2019 8:11 PM
To: Eugen Block; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] best practices for EC pools

Just to add, that a more general formula is that the number of nodes should be 
greater than or equal to k+m+m so N>=k+m+m for full recovery

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Eugen 
Block
Sent: Thursday, February 7, 2019 8:47 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] best practices for EC pools

Hi Francois,

> Is that correct that recovery will be forbidden by the crush rule if a
> node is down?

yes, that is correct, failure-domain=host means no two chunks of the same PG 
can be on the same host. So if your PG is divided into 6 chunks, they're all on 
different hosts, no recovery is possible at this point (for the EC-pool).

> After rebooting all nodes we noticed that the recovery was slow, maybe
> half an hour, but all pools are currently empty (new install).
> This is odd...

If the pools are empty I also wouldn't expect that, is restarting one OSD also 
that slow or is it just when you reboot the whole cluster?

> Which k values are preferred on 6 nodes?

It depends on the failures you expect and how many concurrent failures you need 
to cover.
I think I would keep failure-domain=host (with only 4 OSDs per host).
As for the k and m values, 3+2 would make sense, I guess. That profile would 
leave one host for recovery and two OSDs of one PG acting set could fail 
without data loss, so as resilient as the 4+2 profile. This is one approach, so 
please don't read this as *the* solution for your environment.
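
In command form, such a profile and pool would be created roughly like this (a 
sketch; the profile/pool names and PG count are placeholders):

ceph osd erasure-code-profile set ec32 k=3 m=2 crush-failure-domain=host
ceph osd pool create ecpool 128 128 erasure ec32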

Regards,
Eugen


Zitat von Scheurer François :

> Dear All
>
>
> We created an erasure coded pool with k=4 m=2 with failure-domain=host
> but have only 6 osd nodes.
> Is that correct that recovery will be forbidden by the crush rule if a
> node is down?
>
> After rebooting all nodes we noticed that the recovery was slow, maybe
> half an hour, but all pools are currently empty (new install).
> This is odd...
>
> Can it be related to the k+m being equal to the number of nodes?
> (4+2=6) step set_choose_tries 100 was 

Re: [ceph-users] change OSD IP it uses

2019-02-08 Thread Hector Martin
On 08/02/2019 17.05, Ashley Merrick wrote:
> Just somewhere it still seems to have stored the external IP's of the
> other hosts for just this OSD, after restarting it's still full of log
> lines like :  no reply from externalip:6801 osd.21, which is a OSD on
> another node and trying to connect via the external IP of that node.

Does your ceph.conf have the right network settings? Compare it with the
other nodes. Also check that your network interfaces and routes are
correctly configured on the problem node, of course.

-- 
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] change OSD IP it uses

2019-02-08 Thread Hector Martin
On 08/02/2019 20.54, Ashley Merrick wrote:
> Yes that is all fine, the other 3 OSD's on the node work fine as expected,
> 
> When I did the orginal OSD via ceph-deploy i used the external hostname
> at the end of the command instead of the internal hostname, I then
> deleted the OSD and zap'd the disk and re-added using the internal
> hostname + the other 3 OSD's.
> 
> The other 3 are using the internal IP fine, the first OSD is not.
> 
> The config and everything else is fine as I can reboot any of the other
> 3 OSD's and they work fine, just somewhere the osd.22 is still storing
> the orginal hostname/ip it was given via ceph-deploy even after a rm /
> disk zap

The OSDMap stores the OSD IP, though I *think* it's supposed to update
itself when the OSD's IP changes.

If this is a new OSD and you don't care about the data (or can just let
it rebalance away), just follow the instructions for "add/remove OSDs" here:

http://docs.ceph.com/docs/mimic/rados/operations/add-or-rm-osds/

Make sure when the OSD is gone it really is gone (nothing in 'ceph osd
tree' or 'ceph osd ls'), e.g. 'ceph osd purge <id>
--yes-i-really-mean-it' and make sure there isn't a spurious entry for
it in ceph.conf, then re-deploy. Once you do that there is no possible
other place for the OSD to somehow remember its old IP.


-- 
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] change OSD IP it uses

2019-02-08 Thread Ashley Merrick
I just tried that; nothing is showing in ceph osd ls or ceph osd tree.

I ran the purge command and wiped the disk.

However, after re-creating the OSD it's still trying to connect via the
external IP. I've looked for an option to specify the OSD ID in ceph-deploy,
to try and use another ID, but there does not seem to be one.

Any other ideas?

Ashley

On Fri, Feb 8, 2019 at 8:14 PM Hector Martin  wrote:

> On 08/02/2019 20.54, Ashley Merrick wrote:
> > Yes that is all fine, the other 3 OSD's on the node work fine as
> expected,
> >
> > When I did the orginal OSD via ceph-deploy i used the external hostname
> > at the end of the command instead of the internal hostname, I then
> > deleted the OSD and zap'd the disk and re-added using the internal
> > hostname + the other 3 OSD's.
> >
> > The other 3 are using the internal IP fine, the first OSD is not.
> >
> > The config and everything else is fine as I can reboot any of the other
> > 3 OSD's and they work fine, just somewhere the osd.22 is still storing
> > the orginal hostname/ip it was given via ceph-deploy even after a rm /
> > disk zap
>
> The OSDMap stores the OSD IP, though I *think* it's supposed to update
> itself when the OSD's IP changes.
>
> If this is a new OSD and you don't care about the data (or can just let
> it rebalance away), just follow the instructions for "add/remove OSDs"
> here:
>
> http://docs.ceph.com/docs/mimic/rados/operations/add-or-rm-osds/
>
> Make sure when the OSD is gone it really is gone (nothing in 'ceph osd
> tree' or 'ceph osd ls'), e.g. 'ceph osd purge <id>
> --yes-i-really-mean-it' and make sure there isn't a spurious entry for
> it in ceph.conf, then re-deploy. Once you do that there is no possible
> other place for the OSD to somehow remember its old IP.
>
>
> --
> Hector Martin (hec...@marcansoft.com)
> Public Key: https://mrcn.st/pub
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] change OSD IP it uses

2019-02-08 Thread Sage Weil
The IP that an OSD (or other non-monitor daemon) uses normally depends on 
what IP is used by the local host to reach the monitor(s).  If you want 
your OSDs to be on a different network, generally the way to do 
that is to move the monitors to that network too.

You can also try the public_network option, which lets you specify a CIDR 
network (e.g., "10.0.0.0/8").  The OSD (or other daemon) will look at all 
IPs bound to the local host and pick one that is contained by that network 
(or by one of the networks in the list--you can list multiple networks).

Recreating the OSD isn't necessary or related; the daemon picks a new 
IP and port each time it is started.
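
In ceph.conf terms that would look roughly like this (a sketch; the CIDRs are 
examples only, not taken from this cluster):

[global]
    public_network  = 10.0.0.0/24     # network the monitors/clients are reachable on
    cluster_network = 10.0.1.0/24     # optional separate replication network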

HTH,
sage

On Fri, 8 Feb 2019, Ashley Merrick wrote:

> I just tried that, nothing showing in ceph osd ls or ceph osd tree.
> 
> Run the purge command, wiped the disk.
> 
> However after re-creating the OSD it's still trying to connect via the
> external IP, I've looked to see if there is an option to specify the osd ID
> in ceph-deploy to try and use another ID but does not seem to be an option.
> 
> Any other ideas?
> 
> Ashley
> 
> On Fri, Feb 8, 2019 at 8:14 PM Hector Martin  wrote:
> 
> > On 08/02/2019 20.54, Ashley Merrick wrote:
> > > Yes that is all fine, the other 3 OSD's on the node work fine as
> > expected,
> > >
> > > When I did the orginal OSD via ceph-deploy i used the external hostname
> > > at the end of the command instead of the internal hostname, I then
> > > deleted the OSD and zap'd the disk and re-added using the internal
> > > hostname + the other 3 OSD's.
> > >
> > > The other 3 are using the internal IP fine, the first OSD is not.
> > >
> > > The config and everything else is fine as I can reboot any of the other
> > > 3 OSD's and they work fine, just somewhere the osd.22 is still storing
> > > the orginal hostname/ip it was given via ceph-deploy even after a rm /
> > > disk zap
> >
> > The OSDMap stores the OSD IP, though I *think* it's supposed to update
> > itself when the OSD's IP changes.
> >
> > If this is a new OSD and you don't care about the data (or can just let
> > it rebalance away), just follow the instructions for "add/remove OSDs"
> > here:
> >
> > http://docs.ceph.com/docs/mimic/rados/operations/add-or-rm-osds/
> >
> > Make sure when the OSD is gone it really is gone (nothing in 'ceph osd
> > tree' or 'ceph osd ls'), e.g. 'ceph osd purge <id>
> > --yes-i-really-mean-it' and make sure there isn't a spurious entry for
> > it in ceph.conf, then re-deploy. Once you do that there is no possible
> > other place for the OSD to somehow remember its old IP.
> >
> >
> > --
> > Hector Martin (hec...@marcansoft.com)
> > Public Key: https://mrcn.st/pub
> >
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Downsizing a cephfs pool

2019-02-08 Thread Brian Topping
Thanks Marc and Burkhard. I think what I am learning is that it's best to copy 
between filesystems with cpio, if it isn't outright impossible to do it any 
other way, due to the "fs metadata in first pool" problem.

FWIW, the Mimic docs still describe how to create a differently named cluster 
on the same hardware. But then I see 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021560.html 
saying that behavior is deprecated and problematic. 

A hard lesson, but no data was lost. I will set up two machines and a new 
cluster with the larger drives tomorrow.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pool/volume live migration

2019-02-08 Thread Jason Dillaman
Indeed, it is forthcoming in the Nautilus release.

You would initiate a "rbd migration prepare 
" to transparently link the dst-image-spec to the
src-image-spec. Any active Nautilus clients against the image will
then re-open the dst-image-spec for all IO operations. Read requests
that cannot be fulfilled by the new dst-image-spec will be forwarded
to the original src-image-spec (similar to how parent/child cloning
behaves). Write requests to the dst-image-spec will force a deep-copy
of all impacted src-image-spec backing data objects (including
snapshot history) to the associated dst-image-spec backing data
object.  At any point a storage admin can run "rbd migration execute"
to deep-copy all src-image-spec data blocks to the dst-image-spec.
Once the migration is complete, you would just run "rbd migration
commit" to remove src-image-spec.

Note: at some point prior to "rbd migration commit", you will need to
take minimal downtime to switch OpenStack volume registration from the
old image to the new image if you are changing pools.

On Fri, Feb 8, 2019 at 5:33 AM Caspar Smit  wrote:
>
> Hi Luis,
>
> According to slide 21 of Sage's presentation at FOSDEM it is coming in 
> Nautilus:
>
> https://fosdem.org/2019/schedule/event/ceph_project_status_update/attachments/slides/3251/export/events/attachments/ceph_project_status_update/slides/3251/ceph_new_in_nautilus.pdf
>
> Kind regards,
> Caspar
>
> On Fri, 8 Feb 2019 at 11:24, Luis Periquito wrote:
>>
>> Hi,
>>
>> a recurring topic is live migration and pool type change (moving from
>> EC to replicated or vice versa).
>>
>> When I went to the OpenStack open infrastructure (aka summit) Sage
>> mentioned about support of live migration of volumes (and as a result
>> of pools) in Nautilus. Is this still the case and is expected to have
>> live migration working by then?
>>
>> thanks,
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] best practices for EC pools

2019-02-08 Thread Scheurer François
Thank you Caspar for your corrections!



> EC requires K+1 nodes to allow writes, so every IO freezes (until all 
> affected PG's are recovered to at least K+1)


I was not aware of this. This is quite important to know, many thanks.



-survive the loss of max 3 nodes, if the recovery has enough time to complete 
between failures

>I think this kind of scenario shouldn't even be considered.


OK, the cluster will also freeze in this case, as you mentioned, so it does not 
really survive.

(Maybe adding a new node would still make it possible to unfreeze it, from a 
theoretical point of view.)




Best Regards

Francois Scheurer




From: Caspar Smit 
Sent: Friday, February 8, 2019 11:47 AM
To: Scheurer François
Cc: Alan Johnson; Eugen Block; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] best practices for EC pools

On Fri, 8 Feb 2019 at 11:31, Scheurer François 
<francois.scheu...@everyware.ch> wrote:
Dear Eugen Block
Dear Alan Johnson


Thank you for your answers.

So we will use EC 3+2 on 6 nodes.
Currently with only 4 osd's per node, then 8 and later 20.


>Just to add, that a more general formula is that the number of nodes should be 
>greater than or equal to k+m+m so N>=k+m+m for full recovery

Understood.
EC k+m assumes the case of loosing m nodes and that would require m 'spare' 
nodes to recover, so k+m+m in total.
But the loss of a single node should allow a full recovery, shouldn'it ?

Having 3+2 on 6 nodes should be able to:
-survive the loss of max 2 nodes simultaneously

Yes and No, technically you can survive a 2 node failure but EC requires K+1 
nodes to allow writes, so every IO freezes (until all affected PG's are 
recovered to at least K+1) when losing the second node.
So yes you survive, but no you can't use the cluster for a while during this, 
so if you want to keep using your cluster at all times you can only have 1 node 
failure.

-survive the loss of max 3 nodes, if the recovery has enough time to complete 
between failures

I think this kind of scenario shouldn't even be considered.

-recover the loss of max 1 node

Only if there's enough free disk space left to hold all the data.

Kind regards,
Caspar


>If the pools are empty I also wouldn't expect that, is restarting one OSD also 
>that slow or is it just when you reboot the whole cluster?
It also happens after rebooting a single node.

In the mon logs we see a lot os such messages:

2019-02-06 23:07:46.003473 7f14d8ed6700  1 mon.ewos1-osd1-prod@0(leader).osd 
e116 prepare_failure osd.17 
10.38.66.71:6803/76983 from osd.1 
10.38.67.72:6800/75206 is reporting failure:1
2019-02-06 23:07:46.003486 7f14d8ed6700  0 log_channel(cluster) log [DBG] : 
osd.17 10.38.66.71:6803/76983 reported failed by 
osd.1 10.38.67.72:6800/75206
2019-02-06 23:07:57.948959 
7f14d8ed6700  1 mon.ewos1-osd1-prod@0(leader).osd e116 prepare_failure osd.17 
10.38.66.71:6803/76983 from osd.1 
10.38.67.72:6800/75206 is reporting failure:0
2019-02-06 23:07:57.948971 7f14d8ed6700  0 log_channel(cluster) log [DBG] : 
osd.17 10.38.66.71:6803/76983 failure report 
canceled by osd.1 10.38.67.72:6800/75206
2019-02-06 23:08:54.632356 
7f14d8ed6700  1 mon.ewos1-osd1-prod@0(leader).osd e116 prepare_failure osd.0 
10.38.65.72:6800/72872 from osd.17 
10.38.66.71:6803/76983 is reporting failure:1
2019-02-06 23:08:54.632374 7f14d8ed6700  0 log_channel(cluster) log [DBG] : 
osd.0 10.38.65.72:6800/72872 reported failed by 
osd.17 10.38.66.71:6803/76983
2019-02-06 23:10:21.333513 
7f14d8ed6700  1 mon.ewos1-osd1-prod@0(leader).osd e116 prepare_failure osd.23 
10.38.66.71:6807/79639 from osd.18 
10.38.67.72:6806/79121 is reporting failure:1
2019-02-06 23:10:21.333527 7f14d8ed6700  0 log_channel(cluster) log [DBG] : 
osd.23 10.38.66.71:6807/79639 reported failed by 
osd.18 10.38.67.72:6806/79121
2019-02-06 23:10:57.660468 
7f14d8ed6700  1 mon.ewos1-osd1-prod@0(leader).osd e116 prepare_failure osd.23 
10.38.66.71:6807/79639 from osd.18 
10.38.67.72:6806/79121 is reporting failure:0
2019-02-06 23:10:57.660481 7f14d8ed6700  0 log_channel(cluster) log [DBG] : 
osd.23 10.38.66.71:6807/79639 failure report 
canceled by osd.18 10.38.67.72:6806/79121



Best Regards
Francois Scheurer






From: ceph-users 
mailto:ceph-users-boun...@lists.ceph.com>> 
on behalf of 

Re: [ceph-users] change OSD IP it uses

2019-02-08 Thread Ashley Merrick
Yes, that is all fine; the other 3 OSDs on the node work as expected.

When I did the original OSD via ceph-deploy I used the external hostname at
the end of the command instead of the internal hostname. I then deleted the
OSD, zapped the disk, and re-added it using the internal hostname, plus the
other 3 OSDs.

The other 3 are using the internal IP fine; the first OSD is not.

The config and everything else is fine, as I can reboot any of the other 3
OSDs and they work fine. Just somewhere osd.22 is still storing the
original hostname/IP it was given via ceph-deploy, even after an rm / disk zap.

,Ashley

On Fri, Feb 8, 2019 at 7:30 PM Hector Martin  wrote:

> On 08/02/2019 17.05, Ashley Merrick wrote:
> > Just somewhere it still seems to have stored the external IP's of the
> > other hosts for just this OSD, after restarting it's still full of log
> > lines like :  no reply from externalip:6801 osd.21, which is a OSD on
> > another node and trying to connect via the external IP of that node.
>
> Does your ceph.conf have the right network settings? Compare it with the
> other nodes. Also check that your network interfaces and routes are
> correctly configured on the problem node, of course.
>
> --
> Hector Martin (hec...@marcansoft.com)
> Public Key: https://mrcn.st/pub
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Downsizing a cephfs pool

2019-02-08 Thread Hector Martin
On 08/02/2019 19.29, Marc Roos wrote:
>  
> 
> Yes that is thus a partial move, not the behaviour you expect from a mv 
> command. (I think this should be changed)

CephFS lets you put *data* in separate pools, but not *metadata*. Also,
I think you can't remove the original/default data pool.

The FSMap seems to store pools by ID, not by name, so renaming the pools
won't work.
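
This can be seen in the FSMap dump, where the pools appear as numeric IDs
rather than names (a sketch):

ceph fs dump | grep -E 'metadata_pool|data_pools'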

This past thread has an untested procedure for migrating CephFS pools:

https://www.spinics.net/lists/ceph-users/msg29536.html

-- 
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-08 Thread Alexandre DERUMIER
I'm just seeing 

StupidAllocator::_aligned_len 
and 
btree::btree_iterator, mempoo 

on 1 osd, both 10%.

here the dump_mempools

{
"mempool": {
"by_pool": {
"bloom_filter": {
"items": 0,
"bytes": 0
},
"bluestore_alloc": {
"items": 210243456,
"bytes": 210243456
},
"bluestore_cache_data": {
"items": 54,
"bytes": 643072
},
"bluestore_cache_onode": {
"items": 105637,
"bytes": 70988064
},
"bluestore_cache_other": {
"items": 48661920,
"bytes": 1539544228
},
"bluestore_fsck": {
"items": 0,
"bytes": 0
},
"bluestore_txc": {
"items": 12,
"bytes": 8928
},
"bluestore_writing_deferred": {
"items": 406,
"bytes": 4792868
},
"bluestore_writing": {
"items": 66,
"bytes": 1085440
},
"bluefs": {
"items": 1882,
"bytes": 93600
},
"buffer_anon": {
"items": 138986,
"bytes": 24983701
},
  "buffer_meta": {
"items": 544,
"bytes": 34816
},
"osd": {
"items": 243,
"bytes": 3089016
},
"osd_mapbl": {
"items": 36,
"bytes": 179308
},
"osd_pglog": {
"items": 952564,
"bytes": 372459684
},
"osdmap": {
"items": 3639,
"bytes": 224664
},
"osdmap_mapping": {
"items": 0,
"bytes": 0
},
"pgmap": {
"items": 0,
"bytes": 0
},
"mds_co": {
"items": 0,
"bytes": 0
},
"unittest_1": {
"items": 0,
"bytes": 0
},
"unittest_2": {
"items": 0,
"bytes": 0
}
},
"total": {
"items": 260109445,
"bytes": 2228370845
}
}
}


and the perf dump

root@ceph5-2:~# ceph daemon osd.4 perf dump
{
"AsyncMessenger::Worker-0": {
"msgr_recv_messages": 22948570,
"msgr_send_messages": 22561570,
"msgr_recv_bytes": 333085080271,
"msgr_send_bytes": 261798871204,
"msgr_created_connections": 6152,
"msgr_active_connections": 2701,
"msgr_running_total_time": 1055.197867330,
"msgr_running_send_time": 352.764480121,
"msgr_running_recv_time": 499.206831955,
"msgr_running_fast_dispatch_time": 130.982201607
},
"AsyncMessenger::Worker-1": {
"msgr_recv_messages": 18801593,
"msgr_send_messages": 18430264,
"msgr_recv_bytes": 306871760934,
"msgr_send_bytes": 192789048666,
"msgr_created_connections": 5773,
"msgr_active_connections": 2721,
"msgr_running_total_time": 816.821076305,
"msgr_running_send_time": 261.353228926,
"msgr_running_recv_time": 394.035587911,
"msgr_running_fast_dispatch_time": 104.012155720
},
"AsyncMessenger::Worker-2": {
"msgr_recv_messages": 18463400,
"msgr_send_messages": 18105856,
"msgr_recv_bytes": 187425453590,
"msgr_send_bytes": 220735102555,
"msgr_created_connections": 5897,
"msgr_active_connections": 2605,
"msgr_running_total_time": 807.186854324,
"msgr_running_send_time": 296.834435839,
"msgr_running_recv_time": 351.364389691,
"msgr_running_fast_dispatch_time": 101.215776792
},
"bluefs": {
"gift_bytes": 0,
"reclaim_bytes": 0,
"db_total_bytes": 256050724864,
"db_used_bytes": 12413042688,
"wal_total_bytes": 0,
"wal_used_bytes": 0,
"slow_total_bytes": 0,
"slow_used_bytes": 0,
"num_files": 209,
"log_bytes": 10383360,
"log_compactions": 14,
"logged_bytes": 336498688,
"files_written_wal": 2,
"files_written_sst": 4499,
"bytes_written_wal": 417989099783,
"bytes_written_sst": 213188750209
},
"bluestore": {
"kv_flush_lat": {
"avgcount": 26371957,
"sum": 26.734038497,
"avgtime": 0.01013
},
"kv_commit_lat": {
"avgcount": 26371957,
"sum": 3397.491150603,
"avgtime": 0.000128829
},
"kv_lat": {
"avgcount": 26371957,
"sum": 3424.225189100,

[ceph-users] MDS crash (Mimic 13.2.2 / 13.2.4 ) elist.h: 39: FAILED assert(!is_on_list())

2019-02-08 Thread Jake Grimmett
Dear All,

Unfortunately the MDS has crashed on our Mimic cluster...

First symptoms were rsync giving:
"No space left on device (28)"
when trying to rename or delete

This prompted me to try restarting the MDS, as it reported laggy.

Restarting the MDS, shows this as error in the log before the crash:

elist.h: 39: FAILED assert(!is_on_list())

A full MDS log showing the crash is here:

http://p.ip.fi/iWlz

I've tried upgrading the cluster to 13.2.4, but the MDS still crashes...

The cluster has 10 nodes and 254 OSDs, uses EC for the data pool and 3x
replication for the metadata pool. We have a single active MDS, with two
standby MDSs.

We have ~2PB of cephfs data here, all of which is currently
inaccessible, so any and all advice is gratefully received :)

best regards,

Jake
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pool/volume live migration

2019-02-08 Thread Jason Dillaman
On Fri, Feb 8, 2019 at 11:43 AM Luis Periquito  wrote:
>
> This is indeed for an OpenStack cloud - it didn't require any level of
> performance (so was created on an EC pool) and now it does :(
>
> So the idea would be:

0 - upgrade OSDs and librbd clients to Nautilus

> 1- create a new pool

Are you using EC via cache tier over a replicated pool or an RBD image
with an EC data pool?

> 2- change cinder to use the new pool
>
> for each volume
>   3- stop the usage of the volume (stop the instance?)
>   4- "live migrate" the volume to the new pool

Yes, execute the "rbd migration prepare" step here and manually update
the Cinder database to point the instance to the new pool (if the pool
name changed). I cannot remember if Nova caches the Cinder volume
connector details, so you might also need to detach/re-attach the
volume if that's the case (or tweak the Nova database entries as
well).

>   5- start up the instance again

6 - run "rbd migration execute" and "rbd migration commit" at your convenience.

>
>
> Does that sound right?
>
> thanks,
>
> On Fri, Feb 8, 2019 at 4:25 PM Jason Dillaman  wrote:
> >
> > Correction: at least for the initial version of live-migration, you
> > need to temporarily stop clients that are using the image, execute
> > "rbd migration prepare", and then restart the clients against the new
> > destination image. The "prepare" step will fail if it detects that the
> > source image is in-use.
> >
> > On Fri, Feb 8, 2019 at 9:00 AM Jason Dillaman  wrote:
> > >
> > > Indeed, it is forthcoming in the Nautilus release.
> > >
> > > You would initiate a "rbd migration prepare 
> > > " to transparently link the dst-image-spec to the
> > > src-image-spec. Any active Nautilus clients against the image will
> > > then re-open the dst-image-spec for all IO operations. Read requests
> > > that cannot be fulfilled by the new dst-image-spec will be forwarded
> > > to the original src-image-spec (similar to how parent/child cloning
> > > behaves). Write requests to the dst-image-spec will force a deep-copy
> > > of all impacted src-image-spec backing data objects (including
> > > snapshot history) to the associated dst-image-spec backing data
> > > object.  At any point a storage admin can run "rbd migration execute"
> > > to deep-copy all src-image-spec data blocks to the dst-image-spec.
> > > Once the migration is complete, you would just run "rbd migration
> > > commit" to remove src-image-spec.
> > >
> > > Note: at some point prior to "rbd migration commit", you will need to
> > > take minimal downtime to switch OpenStack volume registration from the
> > > old image to the new image if you are changing pools.
> > >
> > > On Fri, Feb 8, 2019 at 5:33 AM Caspar Smit  wrote:
> > > >
> > > > Hi Luis,
> > > >
> > > > According to slide 21 of Sage's presentation at FOSDEM it is coming in 
> > > > Nautilus:
> > > >
> > > > https://fosdem.org/2019/schedule/event/ceph_project_status_update/attachments/slides/3251/export/events/attachments/ceph_project_status_update/slides/3251/ceph_new_in_nautilus.pdf
> > > >
> > > > Kind regards,
> > > > Caspar
> > > >
> > > >> On Fri, 8 Feb 2019 at 11:24, Luis Periquito wrote:
> > > >>
> > > >> Hi,
> > > >>
> > > >> a recurring topic is live migration and pool type change (moving from
> > > >> EC to replicated or vice versa).
> > > >>
> > > >> When I went to the OpenStack open infrastructure (aka summit) Sage
> > > >> mentioned about support of live migration of volumes (and as a result
> > > >> of pools) in Nautilus. Is this still the case and is expected to have
> > > >> live migration working by then?
> > > >>
> > > >> thanks,
> > > >> ___
> > > >> ceph-users mailing list
> > > >> ceph-users@lists.ceph.com
> > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > >
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > >
> > >
> > > --
> > > Jason
> >
> >
> >
> > --
> > Jason
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-08 Thread Alexandre DERUMIER
another mempool dump after 1h run. (latency ok)

Biggest difference:

before restart
-
"bluestore_cache_other": { 
"items": 48661920, 
"bytes": 1539544228 
}, 
"bluestore_cache_data": { 
"items": 54, 
"bytes": 643072 
}, 
(the other caches seem quite low too; bluestore_cache_other takes almost all 
the memory)


After restart
-
"bluestore_cache_other": {
 "items": 12432298,
  "bytes": 500834899
},
"bluestore_cache_data": {
 "items": 40084,
 "bytes": 1056235520
},


full mempool dump after restart
---

{
"mempool": {
"by_pool": {
"bloom_filter": {
"items": 0,
"bytes": 0
},
"bluestore_alloc": {
"items": 165053952,
"bytes": 165053952
},
"bluestore_cache_data": {
"items": 40084,
"bytes": 1056235520
},
"bluestore_cache_onode": {
"items": 5,
"bytes": 14935200
},
"bluestore_cache_other": {
"items": 12432298,
"bytes": 500834899
},
"bluestore_fsck": {
"items": 0,
"bytes": 0
},
"bluestore_txc": {
"items": 11,
"bytes": 8184
},
"bluestore_writing_deferred": {
"items": 5047,
"bytes": 22673736
},
"bluestore_writing": {
"items": 91,
"bytes": 1662976
},
"bluefs": {
"items": 1907,
"bytes": 95600
},
"buffer_anon": {
"items": 19664,
"bytes": 25486050
},
"buffer_meta": {
"items": 46189,
"bytes": 2956096
},
"osd": {
"items": 243,
"bytes": 3089016
},
"osd_mapbl": {
"items": 17,
"bytes": 214366
},
"osd_pglog": {
"items": 889673,
"bytes": 367160400
},
"osdmap": {
"items": 3803,
"bytes": 224552
},
"osdmap_mapping": {
"items": 0,
"bytes": 0
},
"pgmap": {
"items": 0,
"bytes": 0
},
"mds_co": {
"items": 0,
"bytes": 0
},
"unittest_1": {
"items": 0,
"bytes": 0
},
"unittest_2": {
"items": 0,
"bytes": 0
}
},
"total": {
"items": 178515204,
"bytes": 2160630547
}
}
}

- Original Message -
From: "aderumier" 
To: "Igor Fedotov" 
Cc: "Stefan Priebe, Profihost AG" , "Mark Nelson" 
, "Sage Weil" , "ceph-users" 
, "ceph-devel" 
Sent: Friday, 8 February 2019 16:14:54
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

I'm just seeing 

StupidAllocator::_aligned_len 
and 
btree::btree_iterator, mempoo 

on 1 osd, both 10%. 

here the dump_mempools 

{ 
"mempool": { 
"by_pool": { 
"bloom_filter": { 
"items": 0, 
"bytes": 0 
}, 
"bluestore_alloc": { 
"items": 210243456, 
"bytes": 210243456 
}, 
"bluestore_cache_data": { 
"items": 54, 
"bytes": 643072 
}, 
"bluestore_cache_onode": { 
"items": 105637, 
"bytes": 70988064 
}, 
"bluestore_cache_other": { 
"items": 48661920, 
"bytes": 1539544228 
}, 
"bluestore_fsck": { 
"items": 0, 
"bytes": 0 
}, 
"bluestore_txc": { 
"items": 12, 
"bytes": 8928 
}, 
"bluestore_writing_deferred": { 
"items": 406, 
"bytes": 4792868 
}, 
"bluestore_writing": { 
"items": 66, 
"bytes": 1085440 
}, 
"bluefs": { 
"items": 1882, 
"bytes": 93600 
}, 
"buffer_anon": { 
"items": 138986, 
"bytes": 24983701 
}, 
"buffer_meta": { 
"items": 544, 
"bytes": 34816 
}, 
"osd": { 
"items": 243, 
"bytes": 3089016 
}, 
"osd_mapbl": { 
"items": 36, 
"bytes": 179308 
}, 
"osd_pglog": { 
"items": 952564, 
"bytes": 372459684 
}, 
"osdmap": { 
"items": 3639, 
"bytes": 224664 
}, 
"osdmap_mapping": { 
"items": 0, 
"bytes": 0 
}, 
"pgmap": { 
"items": 0, 
"bytes": 0 
}, 
"mds_co": { 
"items": 0, 
"bytes": 0 
}, 
"unittest_1": { 
"items": 0, 
"bytes": 0 
}, 
"unittest_2": { 
"items": 0, 
"bytes": 0 
} 
}, 
"total": { 
"items": 260109445, 
"bytes": 2228370845 
} 
} 
} 


and the perf dump 

root@ceph5-2:~# ceph daemon osd.4 perf dump 
{ 
"AsyncMessenger::Worker-0": { 
"msgr_recv_messages": 22948570, 
"msgr_send_messages": 22561570, 
"msgr_recv_bytes": 333085080271, 
"msgr_send_bytes": 261798871204, 
"msgr_created_connections": 6152, 
"msgr_active_connections": 2701, 
"msgr_running_total_time": 1055.197867330, 
"msgr_running_send_time": 

Re: [ceph-users] pool/volume live migration

2019-02-08 Thread Luis Periquito
This is indeed for an OpenStack cloud - it didn't require any level of
performance (so was created on an EC pool) and now it does :(

So the idea would be:
1- create a new pool
2- change cinder to use the new pool

for each volume
  3- stop the usage of the volume (stop the instance?)
  4- "live migrate" the volume to the new pool
  5- start up the instance again


Does that sound right?
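
For what it's worth, a rough sketch of steps 3-5 for one volume (the pool and
image names are invented, and I'm assuming the execute/commit subcommands take
the destination image spec; check `rbd help migration` on a Nautilus build):

    # with the instance stopped, so nothing still has the image open:
    rbd migration prepare ec-pool/volume-1234 repl-pool/volume-1234
    rbd migration execute repl-pool/volume-1234   # deep-copy the data
    rbd migration commit repl-pool/volume-1234    # drop the source image
    # re-point the cinder volume at the new pool, then start the instance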

thanks,

On Fri, Feb 8, 2019 at 4:25 PM Jason Dillaman  wrote:
>
> Correction: at least for the initial version of live-migration, you
> need to temporarily stop clients that are using the image, execute
> "rbd migration prepare", and then restart the clients against the new
> destination image. The "prepare" step will fail if it detects that the
> source image is in-use.
>
> On Fri, Feb 8, 2019 at 9:00 AM Jason Dillaman  wrote:
> >
> > Indeed, it is forthcoming in the Nautilus release.
> >
> > You would initiate a "rbd migration prepare <src-image-spec>
> > <dst-image-spec>" to transparently link the dst-image-spec to the
> > src-image-spec. Any active Nautilus clients against the image will
> > then re-open the dst-image-spec for all IO operations. Read requests
> > that cannot be fulfilled by the new dst-image-spec will be forwarded
> > to the original src-image-spec (similar to how parent/child cloning
> > behaves). Write requests to the dst-image-spec will force a deep-copy
> > of all impacted src-image-spec backing data objects (including
> > snapshot history) to the associated dst-image-spec backing data
> > object.  At any point a storage admin can run "rbd migration execute"
> > to deep-copy all src-image-spec data blocks to the dst-image-spec.
> > Once the migration is complete, you would just run "rbd migration
> > commit" to remove src-image-spec.
> >
> > Note: at some point prior to "rbd migration commit", you will need to
> > take minimal downtime to switch OpenStack volume registration from the
> > old image to the new image if you are changing pools.
> >
> > On Fri, Feb 8, 2019 at 5:33 AM Caspar Smit  wrote:
> > >
> > > Hi Luis,
> > >
> > > According to slide 21 of Sage's presentation at FOSDEM it is coming in 
> > > Nautilus:
> > >
> > > https://fosdem.org/2019/schedule/event/ceph_project_status_update/attachments/slides/3251/export/events/attachments/ceph_project_status_update/slides/3251/ceph_new_in_nautilus.pdf
> > >
> > > Kind regards,
> > > Caspar
> > >
> > > Op vr 8 feb. 2019 om 11:24 schreef Luis Periquito :
> > >>
> > >> Hi,
> > >>
> > >> a recurring topic is live migration and pool type change (moving from
> > >> EC to replicated or vice versa).
> > >>
> > >> When I went to the OpenStack open infrastructure (aka summit) Sage
> > >> mentioned about support of live migration of volumes (and as a result
> > >> of pools) in Nautilus. Is this still the case and is expected to have
> > >> live migration working by then?
> > >>
> > >> thanks,
> > >> ___
> > >> ceph-users mailing list
> > >> ceph-users@lists.ceph.com
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
> > --
> > Jason
>
>
>
> --
> Jason
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-08 Thread Alexandre DERUMIER
>>hmm, so fragmentation grows eventually and drops on OSD restarts, isn't 
>>it? 
yes
>>The same for other OSDs? 
yes



>>Wondering if you have OSD mempool monitoring (dump_mempools command 
>>output on admin socket) reports? Do you have any historic data? 

not currently (I only have perf dump), I'll add them in my monitoring stats.


>>If not may I have current output and say a couple more samples with 
>>8-12 hours interval? 

I'll do it next week.
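
(For the record, something like this on the OSD host would grab the requested
samples; osd.4 and the output path are just examples:)

    # take a dump_mempools snapshot via the admin socket every 8 hours
    while true; do
        ceph daemon osd.4 dump_mempools > /tmp/osd.4-mempools-$(date +%Y%m%d-%H%M).json
        sleep 28800
    done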

Thanks again for helping.


- Original Message -
From: "Igor Fedotov" 
To: "aderumier" 
Cc: "Stefan Priebe, Profihost AG" , "Mark Nelson" 
, "Sage Weil" , "ceph-users" 
, "ceph-devel" 
Sent: Tuesday 5 February 2019 18:56:51
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>> but I don't see l_bluestore_fragmentation counter. 
>>> (but I have bluestore_fragmentation_micros) 
> ok, this is the same 
> 
> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
> "How fragmented bluestore free space is (free extents / max possible number 
> of free extents) * 1000"); 
> 
> 
> Here a graph on last month, with bluestore_fragmentation_micros and latency, 
> 
> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 

hmm, so fragmentation grows eventually and drops on OSD restarts, isn't 
it? The same for other OSDs? 

This proves some issue with the allocator - generally fragmentation 
might grow but it shouldn't reset on restart. Looks like some intervals 
aren't properly merged in run-time. 

On the other side I'm not completely sure that latency degradation is 
caused by that - fragmentation growth is relatively small - I don't see 
how this might impact performance that high. 

Wondering if you have OSD mempool monitoring (dump_mempools command 
output on admin socket) reports? Do you have any historic data? 

If not may I have current output and say a couple more samples with 
8-12 hours interval? 


Wrt to backporting bitmap allocator to mimic - we haven't had such plans 
before that but I'll discuss this at BlueStore meeting shortly. 


Thanks, 

Igor 

> - Original Message - 
> From: "Alexandre Derumier"  
> To: "Igor Fedotov"  
> Cc: "Stefan Priebe, Profihost AG" , "Mark Nelson" 
> , "Sage Weil" , "ceph-users" 
> , "ceph-devel"  
> Sent: Monday 4 February 2019 16:04:38 
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until 
> restart 
> 
> Thanks Igor, 
> 
>>> Could you please collect BlueStore performance counters right after OSD 
>>> startup and once you get high latency. 
>>> 
>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
> I'm already monitoring with 
> "ceph daemon osd.x perf dump ", (I have 2months history will all counters) 
> 
> but I don't see l_bluestore_fragmentation counter. 
> 
> (but I have bluestore_fragmentation_micros) 
> 
> 
>>> Also if you're able to rebuild the code I can probably make a simple 
>>> patch to track latency and some other internal allocator's paramter to 
>>> make sure it's degraded and learn more details. 
> Sorry, It's a critical production cluster, I can't test on it :( 
> But I have a test cluster, maybe I can try to put some load on it, and try to 
> reproduce. 
> 
> 
> 
>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>> and try the difference... 
> Any plan to backport it to mimic ? (But I can wait for Nautilus) 
> perf results of new bitmap allocator seem very promising from what I've seen 
> in PR. 
> 
> 
> 
> - Original Message - 
> From: "Igor Fedotov"  
> To: "Alexandre Derumier" , "Stefan Priebe, Profihost AG" 
> , "Mark Nelson"  
> Cc: "Sage Weil" , "ceph-users" 
> , "ceph-devel"  
> Sent: Monday 4 February 2019 15:51:30 
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until 
> restart 
> 
> Hi Alexandre, 
> 
> looks like a bug in StupidAllocator. 
> 
> Could you please collect BlueStore performance counters right after OSD 
> startup and once you get high latency. 
> 
> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
> 
> Also if you're able to rebuild the code I can probably make a simple 
> patch to track latency and some other internal allocator's paramter to 
> make sure it's degraded and learn more details. 
> 
> 
> More vigorous fix would be to backport bitmap allocator from Nautilus 
> and try the difference... 
> 
> 
> Thanks, 
> 
> Igor 
> 
> 
> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>> Hi again, 
>> 
>> I speak too fast, the problem has occured again, so it's not tcmalloc cache 
>> size related. 
>> 
>> 
>> I have notice something using a simple "perf top", 
>> 
>> each time I have this problem (I have seen exactly 4 times the same 
>> behaviour), 
>> 
>> when latency is bad, perf top give me : 
>> 
>> StupidAllocator::_aligned_len 
>> and 
>> btree::btree_iterator> long, unsigned long, std::less, mempoo 
>> l::pool_allocator<(mempool::pool_index_t)1, 

Re: [ceph-users] pool/volume live migration

2019-02-08 Thread Jason Dillaman
Correction: at least for the initial version of live-migration, you
need to temporarily stop clients that are using the image, execute
"rbd migration prepare", and then restart the clients against the new
destination image. The "prepare" step will fail if it detects that the
source image is in-use.
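
As a small illustration of that constraint (image names here are invented):

    # "rbd migration prepare" refuses to run while anything still has the
    # image open, so check for watchers first:
    rbd status mypool/myimage          # should report no watchers
    rbd migration prepare mypool/myimage newpool/myimage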

On Fri, Feb 8, 2019 at 9:00 AM Jason Dillaman  wrote:
>
> Indeed, it is forthcoming in the Nautilus release.
>
> You would initiate a "rbd migration prepare <src-image-spec>
> <dst-image-spec>" to transparently link the dst-image-spec to the
> src-image-spec. Any active Nautilus clients against the image will
> then re-open the dst-image-spec for all IO operations. Read requests
> that cannot be fulfilled by the new dst-image-spec will be forwarded
> to the original src-image-spec (similar to how parent/child cloning
> behaves). Write requests to the dst-image-spec will force a deep-copy
> of all impacted src-image-spec backing data objects (including
> snapshot history) to the associated dst-image-spec backing data
> object.  At any point a storage admin can run "rbd migration execute"
> to deep-copy all src-image-spec data blocks to the dst-image-spec.
> Once the migration is complete, you would just run "rbd migration
> commit" to remove src-image-spec.
>
> Note: at some point prior to "rbd migration commit", you will need to
> take minimal downtime to switch OpenStack volume registration from the
> old image to the new image if you are changing pools.
>
> On Fri, Feb 8, 2019 at 5:33 AM Caspar Smit  wrote:
> >
> > Hi Luis,
> >
> > According to slide 21 of Sage's presentation at FOSDEM it is coming in 
> > Nautilus:
> >
> > https://fosdem.org/2019/schedule/event/ceph_project_status_update/attachments/slides/3251/export/events/attachments/ceph_project_status_update/slides/3251/ceph_new_in_nautilus.pdf
> >
> > Kind regards,
> > Caspar
> >
> > Op vr 8 feb. 2019 om 11:24 schreef Luis Periquito :
> >>
> >> Hi,
> >>
> >> a recurring topic is live migration and pool type change (moving from
> >> EC to replicated or vice versa).
> >>
> >> When I went to the OpenStack open infrastructure (aka summit) Sage
> >> mentioned about support of live migration of volumes (and as a result
> >> of pools) in Nautilus. Is this still the case and is expected to have
> >> live migration working by then?
> >>
> >> thanks,
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Jason



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] change OSD IP it uses

2019-02-08 Thread Ashley Merrick
All fixed. It was partly the above, and partly me just missing something.

Thanks all for your help!

,Ash

On Fri, Feb 8, 2019 at 10:46 PM Sage Weil  wrote:

> The IP that an OSD (or other non-monitor daemon) uses normally depends on
> what IP is used by the local host to reach the monitor(s).  If you want
> your OSDs to be on a different network, generally the way to do
> that is to move the monitors to that network too.
>
> You can also try the public_network option, which lets you specify a CIDR
> network (e.g., "10.0.0.0/8").  The OSD (or other daemon) will look at all
> IPs bound to the local host and pick one that is contained by that network
> (or by one of the networks in the list--you can list multiple networks).
>
> Recreating the OSD isn't necessary or related; the daemon picks a new
> IP and port each time it is started.
>
> HTH,
> sage
>
> On Fri, 8 Feb 2019, Ashley Merrick wrote:
>
> > I just tried that, nothing showing in ceph osd ls or ceph osd tree.
> >
> > Run the purge command, wiped the disk.
> >
> > However after re-creating the OSD it's still trying to connect via the
> > external IP, I've looked to see if there is an option to specify the osd
> ID
> > in ceph-deploy to try and use another ID but does not seem to be an
> option.
> >
> > Any other ideas?
> >
> > Ashley
> >
> > On Fri, Feb 8, 2019 at 8:14 PM Hector Martin 
> wrote:
> >
> > > On 08/02/2019 20.54, Ashley Merrick wrote:
> > > > Yes that is all fine, the other 3 OSD's on the node work fine as
> > > expected,
> > > >
> > > > When I did the orginal OSD via ceph-deploy i used the external
> hostname
> > > > at the end of the command instead of the internal hostname, I then
> > > > deleted the OSD and zap'd the disk and re-added using the internal
> > > > hostname + the other 3 OSD's.
> > > >
> > > > The other 3 are using the internal IP fine, the first OSD is not.
> > > >
> > > > The config and everything else is fine as I can reboot any of the
> other
> > > > 3 OSD's and they work fine, just somewhere the osd.22 is still
> storing
> > > > the orginal hostname/ip it was given via ceph-deploy even after a rm
> /
> > > > disk zap
> > >
> > > The OSDMap stores the OSD IP, though I *think* it's supposed to update
> > > itself when the OSD's IP changes.
> > >
> > > If this is a new OSD and you don't care about the data (or can just let
> > > it rebalance away), just follow the instructions for "add/remove OSDs"
> > > here:
> > >
> > > http://docs.ceph.com/docs/mimic/rados/operations/add-or-rm-osds/
> > >
> > > Make sure when the OSD is gone it really is gone (nothing in 'ceph osd
> > > tree' or 'ceph osd ls'), e.g. 'ceph osd purge <id>
> > > --yes-i-really-mean-it' and make sure there isn't a spurious entry for
> > > it in ceph.conf, then re-deploy. Once you do that there is no possible
> > > other place for the OSD to somehow remember its old IP.
> > >
> > >
> > > --
> > > Hector Martin (hec...@marcansoft.com)
> > > Public Key: https://mrcn.st/pub
> > >
> >
> e
>
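
A minimal sketch of the public_network approach Sage describes above (the
subnet below is an example; use the internal network the monitors sit on):

    # ceph.conf on the OSD host, [global] (or [osd]) section
    public network = 192.168.0.0/24
    # then restart the daemon so it re-binds to an address in that network
    systemctl restart ceph-osd@22
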
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Controlling CephFS hard link "primary name" for recursive stat

2019-02-08 Thread Hector Martin
Hi list,

As I understand it, CephFS implements hard links as effectively "smart
soft links", where one link is the primary for the inode and the others
effectively reference it. When it comes to directories, the size for a
hardlinked file is only accounted for in recursive stats for the
"primary" link. This is good (no double-accounting).

I'd like to be able to control *which* of those hard links is the
primary, post-facto, to control what directory their size is accounted
under. I want to write a tool that takes some rules as to which
directories should be "preferred" for containing the master link, and
corrects it if necessary (by recursively stating everything and looking
for files with the same inode number to enumerate all links).

To swap out a primary link with another I came up with this sequence:

link("old_primary", "tmp1")
symlink("tmp1", "tmp2")
rename("tmp2", "old_primary") // old_primary replaced with another inode
stat("/otherdir/new_primary") // new_primary hopefully takes over stray
rename("tmp1", "old_primary)  // put things back the way they were

The idea is that, since renames of hardlinks over themselves are a no-op
in POSIX and won't work, I need to use an intermediate symlink step to
ensure continuity of access to the old file; this isn't 100% transparent
but it beats e.g. removing old_primary and re-linking new_primary over
it (which would cause old_primary to vanish for a short time, which is
undesirable). Hopefully the stat() ensures that the new_primary is what
takes over the stray inode. This seems to work in practice; if there is
a better way, I'd like to hear it.
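
In shell terms the swap is roughly the following (same file names as above;
whether the stat() really makes new_primary take over is exactly the open
question):

    ln old_primary tmp1            # link(): second hard link to the inode
    ln -s tmp1 tmp2                # symlink(): temporary indirection
    mv tmp2 old_primary            # rename(): old_primary is now a symlink, access continues
    stat /otherdir/new_primary     # nudge the MDS toward the other link
    mv tmp1 old_primary            # rename(): old_primary is a hard link again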

Figuring out which link is the primary is a bigger issue. Only
directories report recursive stats where this matters, not files
themselves. On a directory with hardlinked files, if ceph.dir.rfiles >
sum(ceph.dir.rfiles for each subdir) + count(files with nlinks == 1)
then some hardlinked files are primary; I could attempt to use this
formula and then just do the above dance for every hardlinked file to
move the primaries off, but this seems fragile and likely to break in
certain situations (or do needless work). Any other ideas?
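
(For reference, the recursive stats mentioned above are readable as virtual
xattrs, and nlink/inode pairs come from stat; the mount path below is made up:)

    getfattr -n ceph.dir.rfiles   /mnt/cephfs/some/dir
    getfattr -n ceph.dir.rentries /mnt/cephfs/some/dir
    stat -c '%h %i %n' /mnt/cephfs/some/dir/*   # nlink, inode, name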

Thanks,
-- 
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multicast communication compuverde

2019-02-08 Thread Robin H. Johnson
On Wed, Feb 06, 2019 at 11:49:28AM +0200, Maged Mokhtar wrote:
> It could be used for sending cluster maps or other configuration in a 
> push model, i believe corosync uses this by default. For use in sending 
> actual data during write ops, a primary osd can send to its replicas, 
> they do not have to process all traffic but can listen on specific group 
> address associated with that pg, which could be an increment from a base 
> multicast address defined. Some additional erasure codes and 
> acknowledgment messages need to be added to account for errors/dropped 
> packets.

> i doubt it will give a appreciable boost given most pools use 3
> replicas in total, additionally there could be issues to get multicast
> working correctly like setup igmp, so all in all in it could be a
> hassle.
A separate concern there is that there are too many combinations of OSDs
vs multicast limitations in switchgear. As a quick math testcase: 
Having 3 replicas with 512 OSDs split over 32 hosts gives ~30k unique
host combinations. 

At at IPv4 protocol layer, this does fit into the 232/8 network for SSM
scope or 239/8 LSA scope; in each of those 16.7M multicast addresses.

On the switchgear side, even the big Cisco gear, the limits are even
lower: 32K.
| Output interface lists are stored in the multicast expansion table
| (MET). The MET has room for up to 32,000 output interface lists.  The
| MET resources are shared by both Layer 3 multicast routes and by Layer 2
| multicast entries. The actual number of output interface lists available
| in hardware depends on the specific configuration. If the total number
| of multicast routes exceed 32,000, multicast packets might not be
| switched by the Integrated Switching Engine. They would be forwarded by
| the CPU subsystem at much slower speeds.
older switchgear was even lower :-(.

This would also be a switch from TCP to UDP, and redesign of other
pieces, including CephX security.

I'm not convinced of the overall gain at this scale for actual data.
For heartbeat and other cluster-wide stuff, yes, I do agree that
multicast might have benefits.

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


signature.asc
Description: PGP signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Debugging 'slow requests' ...

2019-02-08 Thread Brad Hubbard
Try capturing another log with debug_ms turned up. 1 or 5 should be Ok
to start with.
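
For example (osd.11 and osd.23 being the OSDs involved in the slow request
discussed below; the values and OSD ids are just illustrative):

    ceph tell osd.11 injectargs '--debug_ms 1'
    ceph tell osd.23 injectargs '--debug_ms 1'
    # ...reproduce the slow request, collect /var/log/ceph/ceph-osd.*.log, then:
    ceph tell osd.11 injectargs '--debug_ms 0'
    ceph tell osd.23 injectargs '--debug_ms 0'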

On Fri, Feb 8, 2019 at 8:37 PM Massimo Sgaravatto
 wrote:
>
> Our Luminous ceph cluster have been worked without problems for a while, but 
> in the last days we have been suffering from continuous slow requests.
>
> We have indeed done some changes in the infrastructure recently:
>
> - Moved OSD nodes to a new switch
> - Increased pg nums for a pool, to have about ~ 100 PGs/OSD (also because  we 
> have to install new OSDs in the cluster). The output of 'ceph osd df' is 
> attached.
>
> The problem could also be due to some ''bad' client, but in the log I don't 
> see a clear "correlation" with specific clients or images for such blocked 
> requests.
>
> I also tried to update to latest luminous release and latest CentOS7, but 
> this didn't help.
>
>
>
> Attached you can find the detail of one of such slow operations which took 
> about 266 secs (output from 'ceph daemon osd.11 dump_historic_ops').
> So as far as I can understand from these events:
> {
> "time": "2019-02-08 10:26:25.651728",
> "event": "op_commit"
> },
> {
> "time": "2019-02-08 10:26:25.651965",
> "event": "op_applied"
> },
>
>   {
> "time": "2019-02-08 10:26:25.653236",
> "event": "sub_op_commit_rec from 33"
> },
> {
> "time": "2019-02-08 10:30:51.890404",
> "event": "sub_op_commit_rec from 23"
> },
>
> the problem seems with the  "sub_op_commit_rec from 23" event which took too 
> much.
> So the problem is that the answer from OSD 23 took to much ?
>
>
> In the logs of the 2 OSD (11 and 23)in that time frame (attached) I can't 
> find anything useful.
> When the problem happened the load and the usage of memory was not high in 
> the relevant nodes.
>
>
> Any help to debug the issue is really appreciated ! :-)
>
> Thanks, Massimo
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Downsizing a cephfs pool

2019-02-08 Thread Brian Topping
Thanks Hector. So many things going through my head and I totally forgot to 
explore if just turning off the warnings (if only until I get more disks) was 
an option. 

This is 1000% more sensible for sure.

> On Feb 8, 2019, at 7:19 PM, Hector Martin  wrote:
> 
> My practical suggestion would be to do nothing for now (perhaps tweaking
> the config settings to shut up the warnings about PGs per OSD). Ceph
> will gain the ability to downsize pools soon, and in the meantime,
> anecdotally, I have a production cluster where we overshot the current
> recommendation by 10x due to confusing documentation at the time, and
> it's doing fine :-)
> 
> Stable multi-FS support is also coming, so really, multiple ways to fix
> your problem will probably materialize Real Soon Now, and in the
> meantime having more PGs than recommended isn't the end of the world.
> 
> (resending because the previous reply wound up off-list)
> 
> On 09/02/2019 10.39, Brian Topping wrote:
>> Thanks again to Jan, Burkhard, Marc and Hector for responses on this. To
>> review, I am removing OSDs from a small cluster and running up against
>> the “too many PGs per OSD problem due to lack of clarity. Here’s a
>> summary of what I have collected on it:
>> 
>> 1. The CephFS data pool can’t be changed, only added to. 
>> 2. CephFS metadata pool might be rebuildable
>>via https://www.spinics.net/lists/ceph-users/msg29536.html, but the
>>post is a couple of years old, and even then, the author stated that
>>he wouldn’t do this unless it was an emergency.
>> 3. Running multiple clusters on the same hardware is deprecated, so
>>there’s no way to make a new cluster with properly-sized pools and
>>cpio across.
>> 4. Running multiple filesystems on the same hardware is considered
>>experimental: 
>> http://docs.ceph.com/docs/master/cephfs/experimental-features/#multiple-filesystems-within-a-ceph-cluster.
>>It’s unclear what permanent changes this will effect on the cluster
>>that I’d like to use moving forward. This would be a second option
>>to mount and cpio across.
>> 5. Importing pools (ie `zpool export …`, `zpool import …`) from other
>>clusters is likely not supported, so even if I created a new cluster
>>on a different machine, getting the pools back in the original
>>cluster is fraught.
>> 6. There’s really no way to tell Ceph where to put pools, so when the
>>new drives are added to CRUSH, everything starts rebalancing unless
>>`max pg per osd` is set to some small number that is already
>>exceeded. But if I start copying data to the new pool, doesn’t it fail?
>> 7. Maybe the former problem can be avoided by changing the weights of
>>the OSDs...
>> 
>> 
>> All these options so far seem either a) dangerous or b) like I’m going
>> to have a less-than-pristine cluster to kick off the next ten years
>> with. Unless I am mistaken in that, the only options are to copy
>> everything at least once or twice more:
>> 
>> 1. Copy everything back off CephFS to a `mdadm` RAID 1 with two of the
>>6TB drives. Blow away the cluster and start over with the other two
>>drives, copy everything back to CephFS, then re-add the freed drive
>>used as a store. Might be done by the end of next week.
>> 2. Create a new, properly sized cluster on a second machine, copy
>>everything over ethernet, then move the drives and the
>>`/var/lib/ceph` and `/etc/ceph` back to the cluster seed.
>> 
>> 
>> I appreciate small clusters are not the target use case of Ceph, but
>> everyone has to start somewhere!
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> 
> -- 
> Hector Martin (hec...@marcansoft.com)
> Public Key: https://mrcn.st/pub

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Downsizing a cephfs pool

2019-02-08 Thread Brian Topping
Thanks again to Jan, Burkhard, Marc and Hector for responses on this. To 
review, I am removing OSDs from a small cluster and running up against the “too 
many PGs per OSD” problem due to lack of clarity. Here’s a summary of what I 
have collected on it:

1. The CephFS data pool can’t be changed, only added to.
2. CephFS metadata pool might be rebuildable via 
https://www.spinics.net/lists/ceph-users/msg29536.html, but the post is a 
couple of years old, and even then, the author stated that he wouldn’t do this 
unless it was an emergency.
3. Running multiple clusters on the same hardware is deprecated, so there’s no 
way to make a new cluster with properly-sized pools and cpio across.
4. Running multiple filesystems on the same hardware is considered experimental: 
http://docs.ceph.com/docs/master/cephfs/experimental-features/#multiple-filesystems-within-a-ceph-cluster.
It’s unclear what permanent changes this will effect on the cluster that I’d 
like to use moving forward. This would be a second option to mount and cpio 
across.
5. Importing pools (ie `zpool export …`, `zpool import …`) from other clusters 
is likely not supported, so even if I created a new cluster on a different 
machine, getting the pools back in the original cluster is fraught.
6. There’s really no way to tell Ceph where to put pools, so when the new drives 
are added to CRUSH, everything starts rebalancing unless `max pg per osd` is 
set to some small number that is already exceeded. But if I start copying data 
to the new pool, doesn’t it fail?
7. Maybe the former problem can be avoided by changing the weights of the OSDs...

All these options so far seem either a) dangerous or b) like I’m going to have 
a less-than-pristine cluster to kick off the next ten years with. Unless I am 
mistaken in that, the only options are to copy everything at least once or 
twice more:

1. Copy everything back off CephFS to a `mdadm` RAID 1 with two of the 6TB 
drives. Blow away the cluster and start over with the other two drives, copy 
everything back to CephFS, then re-add the freed drive used as a store. Might 
be done by the end of next week.
2. Create a new, properly sized cluster on a second machine, copy everything 
over ethernet, then move the drives and the `/var/lib/ceph` and `/etc/ceph` 
back to the cluster seed.

I appreciate small clusters are not the target use case of Ceph, but everyone 
has to start somewhere!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Downsizing a cephfs pool

2019-02-08 Thread Hector Martin
My practical suggestion would be to do nothing for now (perhaps tweaking
the config settings to shut up the warnings about PGs per OSD). Ceph
will gain the ability to downsize pools soon, and in the meantime,
anecdotally, I have a production cluster where we overshot the current
recommendation by 10x due to confusing documentation at the time, and
it's doing fine :-)
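
(For example, on Luminous the knob for that warning should be
mon_max_pg_per_osd; the value below is only an illustration:)

    # ceph.conf on the monitors, [global] section, then restart the mons
    # (or inject at runtime: ceph tell mon.<id> injectargs '--mon_max_pg_per_osd 400')
    mon max pg per osd = 400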

Stable multi-FS support is also coming, so really, multiple ways to fix
your problem will probably materialize Real Soon Now, and in the
meantime having more PGs than recommended isn't the end of the world.

(resending because the previous reply wound up off-list)

On 09/02/2019 10.39, Brian Topping wrote:
> Thanks again to Jan, Burkhard, Marc and Hector for responses on this. To
> review, I am removing OSDs from a small cluster and running up against
> the “too many PGs per OSD problem due to lack of clarity. Here’s a
> summary of what I have collected on it:
> 
>  1. The CephFS data pool can’t be changed, only added to. 
>  2. CephFS metadata pool might be rebuildable
> via https://www.spinics.net/lists/ceph-users/msg29536.html, but the
> post is a couple of years old, and even then, the author stated that
> he wouldn’t do this unless it was an emergency.
>  3. Running multiple clusters on the same hardware is deprecated, so
> there’s no way to make a new cluster with properly-sized pools and
> cpio across.
>  4. Running multiple filesystems on the same hardware is considered
> experimental: 
> http://docs.ceph.com/docs/master/cephfs/experimental-features/#multiple-filesystems-within-a-ceph-cluster.
> It’s unclear what permanent changes this will effect on the cluster
> that I’d like to use moving forward. This would be a second option
> to mount and cpio across.
>  5. Importing pools (ie `zpool export …`, `zpool import …`) from other
> clusters is likely not supported, so even if I created a new cluster
> on a different machine, getting the pools back in the original
> cluster is fraught.
>  6. There’s really no way to tell Ceph where to put pools, so when the
> new drives are added to CRUSH, everything starts rebalancing unless
> `max pg per osd` is set to some small number that is already
> exceeded. But if I start copying data to the new pool, doesn’t it fail?
>  7. Maybe the former problem can be avoided by changing the weights of
> the OSDs...
> 
> 
> All these options so far seem either a) dangerous or b) like I’m going
> to have a less-than-pristine cluster to kick off the next ten years
> with. Unless I am mistaken in that, the only options are to copy
> everything at least once or twice more:
> 
>  1. Copy everything back off CephFS to a `mdadm` RAID 1 with two of the
> 6TB drives. Blow away the cluster and start over with the other two
> drives, copy everything back to CephFS, then re-add the freed drive
> used as a store. Might be done by the end of next week.
>  2. Create a new, properly sized cluster on a second machine, copy
> everything over ethernet, then move the drives and the
> `/var/lib/ceph` and `/etc/ceph` back to the cluster seed.
> 
> 
> I appreciate small clusters are not the target use case of Ceph, but
> everyone has to start somewhere!
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com