A recurring topic is live migration and pool type changes (moving from
EC to replicated or vice versa).
When I went to the OpenStack Open Infrastructure Summit, Sage
mentioned support for live migration of volumes (and as a result
of pools) in Nautilus. Is this still the case and is
There is a setting for the maximum number of PGs per OSD. I would set that
temporarily so you can work: create a new pool with 8 PGs, move the data
over to the new pool, remove the old pool, then unset this max pg per
PS. I always create pools starting with 8 PGs, and when I know I am at
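For the archives, the workaround described above might look roughly like this. The pool names are made up, and the option name is an assumption (on Luminous it is mon_max_pg_per_osd; older releases used mon_pg_warn_max_per_osd), so verify against your release:

```shell
# Temporarily raise the per-OSD PG limit (option name assumed for Luminous+)
ceph tell mon.\* injectargs '--mon_max_pg_per_osd=400'

# Create the new pool with 8 PGs, move the data over, then drop the old pool
ceph osd pool create newpool 8 8
# ... copy the data to newpool ...
ceph osd pool rm oldpool oldpool --yes-i-really-really-mean-it

# Reset the limit afterwards (250 is the usual default)
ceph tell mon.\* injectargs '--mon_max_pg_per_osd=250'
```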
you can move the data off to another pool, but you need to keep your
_first_ data pool, since part of the filesystem metadata is stored in
that pool. You cannot remove the first pool.
Dr. rer. nat. Burkhard Linke
Bioinformatics and Systems Biology
Our Luminous Ceph cluster has been working without problems for a while,
but in the last few days we have been suffering from continuous slow requests.
We have indeed done some changes in the infrastructure recently:
- Moved OSD nodes to a new switch
- Increased the PG count for a pool, to have about ~
Hello, ceph users,
I moved my cluster to BlueStore (Ceph Mimic), and now I see increased
disk usage. From ceph -s:
pools: 8 pools, 3328 pgs
objects: 1.23 M objects, 4.6 TiB
usage: 23 TiB used, 444 TiB / 467 TiB avail
I use 3-way replication of my data, so I would
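Just to check the arithmetic here (my own back-of-the-envelope, not from the original post):

```shell
# 4.6 TiB of objects at 3x replication should consume roughly:
awk 'BEGIN { print 4.6 * 3 }'   # prints 13.8 (TiB raw), versus the 23 TiB reported
```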
According to slide 21 of Sage's presentation at FOSDEM it is coming in
I just tried that; it already had a restart done, as I fully deleted the old
OSD and re-created it using the correct hostname after zapping the disk and
restarting the server itself.
But somewhere it still seems to have stored the external IPs of the other
hosts for just this OSD, after restarting
Hi Mark, that’s great advice, thanks! I’m always grateful for the knowledge.
What about the issue with the pools containing a CephFS though? Is it something
where I can just turn off the MDS, copy the pools and rename them back to the
original name, then restart the MDS?
Agreed about using
Yes, that is thus a partial move, not the behaviour you expect from a mv
command. (I think this should be changed.)
From: Burkhard Linke
Sent: 08 February 2019 11:27
On Fri, 8 Feb 2019 at 11:31, Scheurer François <
> Dear Eugen Block
> Dear Alan Johnson
> Thank you for your answers.
> So we will use EC 3+2 on 6 nodes.
> Currently with only 4 OSDs per node, then 8 and later 20.
> >Just to add, that a more
Brian Topping wrote:
: Hi all, I created a problem when moving data to Ceph and I would be grateful
for some guidance before I do something dumb.
: Do I need to create new pools and copy again using cpio? Is there a better
I think I will be facing the same
I think I would COPY and DELETE the data in chunks, not via the 'backend'
but just via CephFS, so you are 100% sure nothing weird can happen.
(MOVE does not work the way you might think on CephFS between different pools.)
You can create and mount an extra data pool in cephfs. I have done this
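For reference, a sketch of how such an extra data pool is wired up (pool name, filesystem name and mount path are all hypothetical):

```shell
# Create a pool and attach it to the filesystem as an additional data pool
ceph osd pool create cephfs_data2 8
ceph fs add_data_pool cephfs cephfs_data2

# Point a directory's file layout at the new pool; only files created
# afterwards land there, existing files stay in their old pool
setfattr -n ceph.dir.layout.pool -v cephfs_data2 /mnt/cephfs/newdir
```

That last point is exactly why a rename alone is only a "partial move": the layout only affects newly created files.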
Dear Eugen Block
Dear Alan Johnson
Thank you for your answers.
So we will use EC 3+2 on 6 nodes.
Currently with only 4 OSDs per node, then 8 and later 20.
>Just to add, that a more general formula is that the number of nodes should be
>greater than or equal to k+m+m so N>=k+m+m for full
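A worked instance of that rule for the EC 3+2 profile discussed above:

```shell
# N >= k + m + m: nodes needed so the cluster can rebuild after node losses
# and still keep enough nodes available for writes
k=3; m=2
echo $((k + m + m))   # prints 7, i.e. 6 nodes is one short of that count
```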
On 08/02/2019 17.05, Ashley Merrick wrote:
> Just somewhere it still seems to have stored the external IPs of the
> other hosts for just this OSD; after restarting it's still full of log
> lines like: no reply from externalip:6801 osd.21, which is an OSD on
> another node and trying to connect
On 08/02/2019 20.54, Ashley Merrick wrote:
> Yes, that is all fine; the other 3 OSDs on the node work fine as expected.
> When I did the original OSD via ceph-deploy I used the external hostname
> at the end of the command instead of the internal hostname. I then
> deleted the OSD and zapped the
I just tried that; nothing is showing in ceph osd ls or ceph osd tree.
I ran the purge command and wiped the disk.
However, after re-creating the OSD it's still trying to connect via the
external IP. I've looked to see if there is an option to specify the OSD ID
in ceph-deploy to try and use another ID
The IP that an OSD (or other non-monitor daemon) uses normally depends on
what IP is used by the local host to reach the monitor(s). If you want
your OSDs to be on a different network, generally the way to do
that is to move the monitors to that network too.
You can also try the
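As a sketch, pinning daemons to networks is normally done in ceph.conf (the subnets below are entirely hypothetical; adjust to your environment):

```shell
# Append a network section to ceph.conf on each node
cat >> /etc/ceph/ceph.conf <<'EOF'
[global]
public_network  = 10.0.0.0/24   # network used by clients and monitors
cluster_network = 10.0.1.0/24   # optional separate replication network
EOF
```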
Thanks Marc and Burkhard. I think what I am learning is that it’s best to copy
between filesystems with cpio, since it may be impossible to do it any other
way due to the “fs metadata in first pool” problem.
FWIW, the mimic docs still describe how to create a differently named cluster
on the same hardware.
Indeed, it is forthcoming in the Nautilus release.
You would initiate an "rbd migration prepare" to transparently link the dst-image-spec to the
src-image-spec. Any active Nautilus clients against the image will
then re-open the dst-image-spec for all IO operations. Read requests
that cannot be
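For the archives, that workflow maps to three rbd subcommands (pool and image names hypothetical):

```shell
# Link the destination image to the source; clients then open the destination
rbd migration prepare ec_pool/volume-1 repl_pool/volume-1
# Copy the block data over in the background
rbd migration execute repl_pool/volume-1
# Remove the link to the source once the copy has finished
rbd migration commit repl_pool/volume-1
```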
Thank you Caspar for your corrections!
> EC requires K+1 nodes to allow writes, so every IO freezes (until all
> affected PGs are recovered to at least K+1)
I was not aware of this. This is quite important to know, many thanks.
- survive the loss of at most 3 nodes, if the recovery has enough
Yes, that is all fine; the other 3 OSDs on the node work fine as expected.
When I did the original OSD via ceph-deploy I used the external hostname at
the end of the command instead of the internal hostname. I then deleted the
OSD and zapped the disk and re-added using the internal hostname + the
On 08/02/2019 19.29, Marc Roos wrote:
> Yes that is thus a partial move, not the behaviour you expect from a mv
> command. (I think this should be changed)
CephFS lets you put *data* in separate pools, but not *metadata*. Also,
I think you can't remove the original/default data pool.
I'm just seeing
on 1 OSD, both 10%.
Here is the dump_mempools:
Unfortunately the MDS has crashed on our Mimic cluster...
First symptoms were rsync giving:
"No space left on device (28)"
when trying to rename or delete.
This prompted me to try restarting the MDS, as it was reported laggy.
Restarting the MDS shows this as an error in the log before the
On Fri, Feb 8, 2019 at 11:43 AM Luis Periquito wrote:
> This is indeed for an OpenStack cloud - it didn't require any level of
> performance (so was created on an EC pool) and now it does :(
> So the idea would be:
> 0 - upgrade OSDs and librbd clients to Nautilus
> 1- create a new pool
Another mempool dump after a 1 h run (latency OK).
(Other caches seem to be quite low too, like
This is indeed for an OpenStack cloud - it didn't require any level of
performance (so was created on an EC pool) and now it does :(
So the idea would be:
1- create a new pool
2- change cinder to use the new pool
for each volume
3- stop the usage of the volume (stop the instance?)
>>hmm, so fragmentation grows eventually and drops on OSD restarts, isn't
>>The same for other OSDs?
>>Wondering if you have OSD mempool monitoring (dump_mempools command
>>output on admin socket) reports? Do you have any historic data?
not currently (I only have perf dump),
Correction: at least for the initial version of live-migration, you
need to temporarily stop clients that are using the image, execute
"rbd migration prepare", and then restart the clients against the new
destination image. The "prepare" step will fail if it detects that the
source image is
All fixed; partly with the above, and partly me just missing something.
Thanks all for your help!
On Fri, Feb 8, 2019 at 10:46 PM Sage Weil wrote:
> The IP that an OSD (or other non-monitor daemon) uses normally depends on
> what IP is used by the local host to reach the monitor(s).
As I understand it, CephFS implements hard links as effectively "smart
soft links", where one link is the primary for the inode and the others
effectively reference it. When it comes to directories, the size for a
hardlinked file is only accounted for in recursive stats for the
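Those recursive stats can be inspected directly, by the way; CephFS exposes them as virtual xattrs (the mount path is hypothetical):

```shell
# Recursive size and file count of a directory, as maintained by the MDS
getfattr -n ceph.dir.rbytes /mnt/cephfs/somedir
getfattr -n ceph.dir.rfiles /mnt/cephfs/somedir
```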
On Wed, Feb 06, 2019 at 11:49:28AM +0200, Maged Mokhtar wrote:
> It could be used for sending cluster maps or other configuration in a
> push model, i believe corosync uses this by default. For use in sending
> actual data during write ops, a primary osd can send to its replicas,
> they do not
Try capturing another log with debug_ms turned up. 1 or 5 should be Ok
to start with.
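Something along these lines (daemon id osd.21 taken from earlier in the thread):

```shell
# Raise message-level debugging, reproduce the problem, then turn it back down
ceph tell osd.21 injectargs '--debug_ms 5'
# ... reproduce and capture /var/log/ceph/ceph-osd.21.log ...
ceph tell osd.21 injectargs '--debug_ms 0'
```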
On Fri, Feb 8, 2019 at 8:37 PM Massimo Sgaravatto
> Our Luminous Ceph cluster has been working without problems for a while, but
> in the last days we have been suffering from continuous slow
Thanks Hector. So many things are going through my head and I totally forgot to
explore whether just turning off the warnings (if only until I get more disks) was
This is 1000% more sensible for sure.
> On Feb 8, 2019, at 7:19 PM, Hector Martin wrote:
> My practical suggestion would be
Thanks again to Jan, Burkhard, Marc and Hector for responses on this. To
review, I am removing OSDs from a small cluster and running up against the “too
many PGs per OSD” problem due to lack of clarity. Here’s a summary of what I
have collected on it:
The CephFS data pool can’t be changed, only
My practical suggestion would be to do nothing for now (perhaps tweaking
the config settings to shut up the warnings about PGs per OSD). Ceph
will gain the ability to downsize pools soon, and in the meantime,
anecdotally, I have a production cluster where we overshot the current