Hello,
we have had some trouble with OSDs running full,
even after rebalancing. So at 100% usage and the ceph-osds not starting
anymore, we decided to delete some PG directories, after which
rebalancing finished.
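For the archives, a few commands that help to spot full OSDs before it
gets that far; a minimal sketch, assuming hammer or newer and an example
OSD id:

    ceph health detail
    ceph osd df tree
    ceph osd reweight 12 0.8   # temporarily shift data off a nearly full OSD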
However, after this we have the situation that one PG is not
becoming clean anymore.
We t
Hello,
another issue we have experienced with qemu VMs
(qemu 2.0.0) with ceph-0.80 on Ubuntu 14.04
managed by OpenNebula 4.10.1:
The VMs are completely frozen when rebalancing takes place;
they do not even respond to ping anymore.
Looking at the qemu processes they are in state "Sl".
Is this a
Hello list,
I have been wondering a bit about "ceph-deploy" and the development of ceph: I
see that many people in the community are pushing towards the use of
ceph-deploy, likely to ease the use of ceph.
However, I have run into issues multiple times using ceph-deploy, when
it failed or incorrectly set up p
fork of the Ceph cookbook. The
> Ceph cookbook doesn't use ceph-deploy, but it does use ceph-disk. Whenever
> I have problems with the ceph-disk command, I first go look at the cookbook
> to see how it's doing things.
>
>
>
> On Sun, Dec 21, 2014 at 10:37 AM,
Hello Ali Shah,
we are running VMs using OpenNebula with ceph as the backend. So far
with varying results: from time to time VMs are freezing, probably
panicking when the load on the ceph storage is too high due to rebalance
work.
We are experimenting with --osd-max-backfills 1, but it hasn't sol
Max, List,
Max Power [Tue, Dec 23, 2014 at 12:34:54PM +0100]:
> [...Recovering from full osd ...]
>
> Normally
> the osd process quits then and I cannot restart it (even after setting the
> replicas back). The only possibility is to manually delete complete PG folders
> after exploring them with
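A sketch of an alternative that sometimes avoids deleting PG directories:
temporarily raising the full ratio so the OSDs can start and rebalance
(pre-luminous syntax; use with care and lower it again afterwards):

    ceph pg set_full_ratio 0.97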
Hey Jiri,
also raise the pgp_num (pg != pgp - it's easy to overlook).
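Something along these lines; pool name and target count are examples:

    ceph osd pool set rbd pg_num 512
    ceph osd pool set rbd pgp_num 512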
Cheers,
Nico
Jiri Kanicky [Sun, Dec 28, 2014 at 01:52:39AM +1100]:
> Hi,
>
> I just built my CEPH cluster but am having problems with the health of
> the cluster.
>
> Here are few details:
> - I followed the ceph documentation.
Hey Christian,
Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]:
> [incomplete PG / RBD hanging, osd lost also not helping]
that is very interesting to hear, because we had a similar situation
with ceph 0.80.7 and had to re-create a pool, after I deleted 3 pg
directories to allow OSDs
Good evening,
for some time we have had the problem that ceph stores too much data on
a host with small disks. Originally we used weight 1 = 1 TB, but
we have reduced the weight for this particular host further to keep it
somehow alive.
Our setup currently consists of 3 hosts:
wein: 6x 136G (fest dis
Hey Lindsay,
Lindsay Mathieson [Wed, Dec 31, 2014 at 06:23:10AM +1000]:
> On Tue, 30 Dec 2014 05:07:31 PM Nico Schottelius wrote:
> > While writing this I noted that the relation / factor is exactly 5.5 times
> > wrong, so I *guess* that ceph treats all hosts with the same weight (
r to create a new pool along the
> old one to at least enable our clients to send data to ceph again.
>
> To tell the truth, I guess that will result in the end of our ceph
> project (running for already 9 months).
>
> Regards,
> Christian
>
> On 29.12.2014 at 15:59,
Hello Achim,
good to hear from someone else running this setup. We have changed the number
of backfills using
ceph tell osd.\* injectargs '--osd-max-backfills 1'
and it mostly seems to help with issues during rebalancing.
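To verify the injected value actually took effect, the admin socket can
be queried on each OSD host; a sketch, assuming osd.0 runs locally:

    ceph daemon osd.0 config get osd_max_backfills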
One unsolved problem we have is machines kernel panicking when I/O
t know why you were getting kernel panics. It's probably advisable to
> stick to the most recent mainline kernel when using kRBD.
>
> Cheers, Dan
>
> On 7 Jan 2015 20:45, Nico Schottelius wrote:
> Good evening,
>
> we also tried to rescue data *from* our old
Lionel, Christian,
we have exactly the same trouble as Christian,
namely
Christian Eichelmann [Fri, Jan 09, 2015 at 10:43:20AM +0100]:
> We still don't know what caused this specific error...
and
> ...there is currently no way to make ceph forget about the data of this pg
> and create it as
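For the archives: pre-luminous releases do have a command that is meant
to do roughly that, although it reportedly often stalls in "creating"
while OSDs still hold data for the PG; a sketch, the pgid is an example:

    ceph pg force_create_pg 2.5f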
more info
> about your deployment: ceph version, kernel versions, OS, filesystem
> btrfs/xfs.
>
> Thx Jiri
>
> ----- Reply message -----
> From: "Nico Schottelius"
> To:
> Subject: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete =
>
Good morning,
yesterday we had an unpleasant surprise that I would like to discuss:
Many (not all!) of our VMs were suddenly
dying (qemu process exiting) and when trying to restart them, inside the
qemu process we saw I/O errors on the disks and the OS was not able to
start (i.e. stopped in init
f a mapped krbd block device,
> correct? If that is the case, can you add "debug-rbd=20" and "debug
> objecter=20" to your ceph.conf and boot up your last remaining broken
> OSD?
>
> On Sun, Sep 10, 2017 at 8:23 AM, Nico Schottelius
> wrote:
>>
>>
ttings.
>
> On Sun, Sep 10, 2017 at 9:22 AM, Nico Schottelius
> wrote:
>>
>> Hello Jason,
>>
>> I think there is a slight misunderstanding:
>> There is only one *VM*, not one OSD left that we did not start.
>>
>> Or does librbd also read ceph.conf
> Regards,
> Lionel
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Nico Schottelius
>> Sent: Sunday, 10 September 2017 14:23
>> To: ceph-users
>> Cc: kamila.souck...@ungleich.ch
>>
Sarunas,
may I ask when this happened?
And did you move OSDs or mons after that export/import procedure?
I really wonder what the reason for this behaviour is, and whether we
are likely to experience it again.
Best,
Nico
Sarunas Burdulis writes:
> On 2017-09-10 08:23, Nico Schottel
Hey Mykola,
thanks for the hint, I will test this in a few hours when I'm back on a
regular Internet connection!
Best,
Nico
Mykola Golub writes:
> On Sun, Sep 10, 2017 at 03:56:21PM +0200, Nico Schottelius wrote:
>>
>> Just tried and there is not much more log in ceph -w
Log at
http://www.nico.schottelius.org/ceph.client.libvirt.41670.log.bz2
I wonder if anyone sees the real reason for the I/O errors in the log?
Best,
Nico
> Mykola Golub writes:
>
>> On Sun, Sep 10, 2017 at 03:56:21PM +0200, Nico Schottelius wrote:
>>>
>>> Just tried and there is
format error" isn't actually an issue -- but now that I know
>> about it, we can prevent it from happening in the future [1]
>>
>> [1] http://tracker.ceph.com/issues/21360
>>
>> On Mon, Sep 11, 2017 at 4:32 PM, Nico Schottelius
>> wrote:
>>
looks like your "client.libvirt" user lacks the permission to
> blacklist a dead client that had previously acquired the exclusive
> lock and failed to release it.
>
> Can you provide the results from "ceph auth get client.libvirt"? I
> suspect it only has 'cap
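For reference, the fix amounts to granting the rbd profile caps, which
include the blacklist permission (see the rbd/openstack documentation);
a sketch, the pool name is an example:

    ceph auth get client.libvirt
    ceph auth caps client.libvirt mon 'profile rbd' osd 'profile rbd pool=one'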
That indeed worked! Thanks a lot!
The remaining question from my side: did we do anything wrong in the
upgrade process, and if not, should it be documented somewhere how to
set up the permissions correctly on upgrade?
Or should the documentation on the side of the cloud infrastructure
software be
penstack/#setup-ceph-client-authentication
>
> On Mon, Sep 11, 2017 at 5:16 PM, Nico Schottelius
> wrote:
>>
>> That indeed worked! Thanks a lot!
>>
>> The remaining question from my side: did we do anything wrong in the
>> upgrade process and if not, should it be documen
ocs.ceph.com/docs/master/rbd/rbd-openstack/#setup-ceph-client-authentication
>>
>> On Mon, Sep 11, 2017 at 5:16 PM, Nico Schottelius
>> wrote:
>>>
>>> That indeed worked! Thanks a lot!
>>>
>>> The remaining question from my side: did we d
Good morning,
we have recently upgraded our kraken cluster to luminous and have since
noticed an odd behaviour: we cannot add a monitor anymore.
As soon as we start a new monitor (server2), ceph -s and ceph -w start to hang.
The situation became worse when one of our staff stopped an existing
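When ceph -s hangs, the admin socket on the monitor host usually still
answers; a minimal sketch to inspect the stuck mon, assuming default
socket paths:

    ceph daemon mon.server2 mon_status
    ceph daemon mon.server2 quorum_status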
lient
> connections because it's been out of quorum for too long, which is the
> correct behavior in general. I'd imagine that you've got clients trying to
> connect to the new monitor instead of the ones already in the quorum and
> not passing around correctly; this is a
have
ntpd running).
We are running everything on IPv6, but this should not be a problem,
should it?
Best,
Nico
Nico Schottelius writes:
> Hello Gregory,
>
> the logfile I produced has already debug mon = 20 set:
>
> [21:03:51] server1:~# grep "debug mon" /etc/ceph/c
ith the monitors that was solely related to a
> switch's MTU being too small.
>
> Maybe that could be the case? If not, I'll take a look at the logs as
> soon as possible.
>
> -Joao
>
>>
>> On Wed, Oct 4, 2017 at 1:04 PM Nico Schottelius
>> mailto:
        "name": "server2",
        "addr": "[2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0",
        "public_addr": "[2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0"
    },
    {
        "rank": 3,
        "name"
and now comes
the not-so-funny part: restarting the monitor makes the cluster hang again.
I will post another debug log in the next few hours, this time from the monitor on
server2.
Nico Schottelius writes:
> Not sure if I mentioned before: adding a new monitor also puts the whole
> cluster into
Good morning Joao,
thanks for your feedback! We do actually have three managers running:
  cluster:
    id:     26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
    health: HEALTH_WARN
            1/3 mons down, quorum server5,server3

  services:
    mon: 3 daemons, quorum server5,server3, out of quorum: s
Hello everyone,
is there any solution in sight for this problem? Currently our cluster
is stuck in a 2-monitor configuration, as every time we restart the one on
server2, it crashes after some minutes (and in the meantime the cluster is stuck).
Should we consider downgrading to kraken to fix that probl
nizing is
> progressing, albeit slowly.
>
> Can you please share the logs of the other monitors, especially of
> those crashing?
>
> -Joao
>
> On 10/18/2017 06:58 AM, Nico Schottelius wrote:
>>
>> Hello everyone,
>>
>> is there any solutio
Hey Joao,
thanks for the pointer! Do you have a timeline for the release of
v12.2.2?
Best,
Nico
--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
Hello,
we are running everything IPv6-only. You just need to set up the MTU on
your devices (NICs, switches) correctly; nothing ceph- or IPv6-specific is
required.
If you are using SLAAC (like we do), you can also announce the MTU via
RA.
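A sketch of the relevant radvd.conf fragment, assuming radvd, a 9000
byte MTU and an example prefix:

    interface eth0 {
        AdvSendAdvert on;
        AdvLinkMTU 9000;
        prefix 2a0a:e5c0::/64 {
        };
    };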
Best,
Nico
Jack writes:
> Or maybe you reach that ipv4
Hello,
our problems with ceph monitors continue in version 12.2.2:
adding a specific monitor causes all monitors to hang and not respond to
ceph -s or similar anymore.
Interestingly, when this monitor is up (mon.server2), the other two
monitors (mon.server3, mon.server5) randomly begin to consum
Hello,
we added about 7 new disks yesterday/today and our cluster became very
slow. While the rebalancing took place, 2 of the 7 newly added disks
died.
Our cluster is still recovering; however, we spotted that there are a lot
of unfound objects.
We lost osd.63 and osd.64, which seem not to be inv
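The usual commands to inspect unfound objects, a sketch (the pgid is an
example):

    ceph health detail            # lists PGs with unfound objects
    ceph pg 2.5 list_missing      # shows which objects are unfound
    # last resort, only after recovery cannot find them anywhere:
    ceph pg 2.5 mark_unfound_lost revert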
find all of the objects by the time it's done backfilling. With
> only losing 2 disks, I wouldn't worry about the missing objects not
> becoming found unless your pool size=2.
>
> On Mon, Jan 22, 2018 at 11:47 AM Nico Schottelius <
> nico.schottel...@ungleich.ch&g
th full data integrity again.
>
> On Mon, Jan 22, 2018 at 1:03 PM Nico Schottelius <
> nico.schottel...@ungleich.ch> wrote:
>
>>
>> Hey David,
>>
>> thanks for the fast answer. All our pools are running with size=3,
>> min_size=2 and the two disks were
rrors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0,
"num_objects_omap&q
"last_epoch_clean": 0,
> "parent": "0.0",
> "parent_split_bits": 0,
> "last_scrub": "0'0",
> "last_scrub_stamp": "0.00"
Hey Burkhard,
we did actually restart osd.61, which led to the current status.
Best,
Nico
Burkhard Linke writes:

> On 01/23/2018 08:54 AM, Nico Schottelius wrote:
>> Good morning,
>>
>> the osd.61 actually just crashed and the disk is still intact. However,
>>
Good evening list,
we are soon expanding our data center [0] to a new location [1].
We are mainly offering VPS / VM hosting, so rbd is our main interest.
We have a low-latency 10 Gbit/s link to our other location [2], and
we are wondering what the best practice for expanding is.
Naturally
Hey Wido,
> [...]
> Like I said, latency, latency, latency. That's what matters. Bandwidth
> usually isn't a real problem.
I imagined that.
> What latency do you have with a 8k ping between hosts?
As the link will only be set up this week, I cannot tell yet.
However, currently we have on a 65km lin
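For the archives, the 8k ping boils down to (standard iputils; host is a
placeholder):

    ping6 -c 10 -s 8192 <remote-host>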
Good morning,
after another disk failure, we currently have 7 inactive PGs [1], which
are stalling IO from the affected VMs.
It seems that ceph, when rebuilding, does not focus on repairing
the inactive PGs first, which surprised us quite a lot:
it does not repair the inactive ones first, but mixes i
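Since luminous there is at least a way to bump the priority of the PGs
one cares about; a sketch, the pgids are examples:

    ceph pg dump_stuck inactive
    ceph pg force-recovery 2.17 2.2a
    ceph pg force-backfill 2.17 2.2a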
Dear list,
for a few days we have been dissecting ceph-disk and ceph-volume to find
out what the appropriate way of creating partitions for ceph is.
For years already I have found ceph-disk (and especially ceph-deploy) very
error prone, and we at ungleich are considering rewriting both into a
ceph-block-do
Hello,
we have one pool in which about 10 disks failed last week (fortunately
mostly sequentially), and which now has some PGs that are left on only
one disk.
Is there a command to set one pool into "read-only" mode, or even
"recovery-io-only" mode, so that the only thing it is doing is
recover
Hello,
on a test cluster I issued a few seconds ago:
ceph auth caps client.admin mgr 'allow *'
instead of what I really wanted to do:
ceph auth caps client.admin mgr 'allow *' mon 'allow *' osd 'allow *' \
mds 'allow *'
Now any access to the cluster using client.admin correctly results in
cl
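Note for the archives: "ceph auth caps" replaces the entire cap set
instead of merging it, which is exactly how such a lockout happens. The
complete command would have been:

    ceph auth caps client.admin mon 'allow *' osd 'allow *' mds 'allow *' mgr 'allow *'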
3) Permission denied
[errno 13] error connecting to the cluster
... which kind of makes sense, as the mon. key does not have
capabilities for it. Then again, I wonder how monitors actually talk to
each other...
Michel Raabe writes:
> On 02/16/18 @ 18:21, Nico Schottelius wrote:
>> on a
It seems your monitor capabilities are different to mine:
root@server3:/opt/ungleich-tools# ceph -k
/var/lib/ceph/mon/ceph-server3/keyring -n mon. auth list
2018-02-16 20:34:59.257529 7fe0d5c6b700 0 librados: mon. authentication error
(13) Permission denied
[errno 13] error connecting to the c
A very interesting question, and I would add a follow-up question:
Is there an easy way to add external DB/WAL devices to an existing
OSD?
I suspect that it might be something along the lines of:
- stop the osd
- create a link in ...ceph/osd/ceph-XX/block.db to the target device
- (maybe run some
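Newer releases ship a ceph-bluestore-tool subcommand for exactly this; a
sketch, untested here, with example OSD id and device (check
ceph-bluestore-tool --help on your version):

    systemctl stop ceph-osd@12
    ceph-bluestore-tool bluefs-bdev-new-db \
        --path /var/lib/ceph/osd/ceph-12 --dev-target /dev/nvme0n1p1
    systemctl start ceph-osd@12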
Max,
I understand your frustration.
However, last time I checked, ceph was open source.
Some of you might not remember, but one major reason why open source is
great is that YOU CAN DO your own modifications.
If you need a change like iSCSI support and it isn't there,
it is probably best if yo
Good morning,
some days ago we created a new pool with 512 PGs and originally 5 OSDs.
We use the device class "ssd" and a crush rule that maps all data for
the pool "ssd" to the ssd device class OSDs.
While creating it, one of the SSDs failed and we are left with 4 OSDs:
[10:00:22] server2.place6
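For context: if the pool has size=3, 512 PGs on 4 OSDs is far above the
luminous default mon_max_pg_per_osd of 200, which leaves PGs stuck
activating. A sketch of the workaround described in the article linked
below; the value is an example, raise it with care:

    ceph tell mon.\* injectargs '--mon_max_pg_per_osd 400'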
018/01/placement-groups-with-ceph-luminous-stay-in-activating-state/
>
> 2018-03-17 12:15 GMT+03:00 Nico Schottelius :
>
>>
>> Good morning,
>>
>> some days ago we created a new pool with 512 pgs, and originally 5 osds.
>> We use the device class "
Hey Ansgar,
we have a similar "problem": in our case all servers are wiped on
reboot, as they boot their operating system from the network into an
initramfs.
While the OS configuration is done with cdist [0], we consider ceph osds
more dynamic data and just re-initialise all osds on boot using the
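A minimal sketch of such a re-initialisation with ceph-volume based OSDs
(luminous; assuming the OSD LVs survive the wipe):

    ceph-volume lvm activate --all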