The system is down again, saying it is missing the same stray7.
2017-10-25 11:24:29.736774 mds.0 [WRN] failed to reconnect caps for
missing inodes:
2017-10-25 11:24:29.736779 mds.0 [WRN] ino 100147160e6
2017-10-25 11:24:29.753665 mds.0 [ERR] dir 607 object missing on disk;
some files
Apologies, corrected second link:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-March/016663.html
On Wed, Oct 25, 2017 at 9:44 AM, Brian Andrus wrote:
> Please see the following mailing list topics that have covered this topic
> in detail:
>
> "2x
Please see the following mailing list topics that have covered this topic
in detail:
"2x replication: A BIG warning":
https://www.spinics.net/lists/ceph-users/msg32915.html
"replica questions":
https://www.spinics.net/lists/ceph-users/msg32915.html
On Wed, Oct 25, 2017 at 9:39 AM, Ian Bobbitt
I think you are searching for this:
osd scrub sleep
http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/
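A minimal sketch of using it (0.1 is just an example value), either in ceph.conf:
[osd]
    osd scrub sleep = 0.1
or injected at runtime (depending on the release this may need an OSD restart to take effect):
# ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'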
Denes.
On 10/25/2017 06:06 PM, Alejandro Comisario wrote:
any comment on this one ?
interesting what to do in this situation
On Wed, Jul 5, 2017 at 10:51 PM,
That helps a little bit, but overall the process would take years at this rate:
# for i in {1..3600}; do ceph df -f json-pretty |grep -A7 '".rgw.buckets"'
|grep objects; sleep 60; done
"objects": 1660775838
"objects": 1660775733
"objects":
Well, there were a few bugs logged around upgrades which hit a similar
assert, but those were supposedly fixed 2 years ago. Looks like Ubuntu
15.04 shipped Hammer (0.94.5), so presumably that's what you upgraded
from.
The current Jewel release is 10.2.10 - I don't know if the problem
you're seeing is
Thanks to all.
I took the OSDs down in the problem host, without shutting down the machine.
As predicted, our MB/s about doubled.
Using this bench/atop procedure, I found two other OSDs on another host
that are the next bottlenecks.
Is this the only good way to really test the performance of the
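For reference, the kind of per-OSD benchmark meant here is presumably the built-in write bench (osd.12 is just a placeholder id; by default it writes 1 GB in 4 MB blocks), run while watching atop or iostat on that OSD's host:
# ceph tell osd.12 bench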
Hi
3 Jewel hosts
All pools have min_size 2, size 3
All 3 RGWs are balanced by nginx
If I restart specific services it's OK
But when I reboot a host
Web dashboard through balancer gives back
502 - Web server received an invalid response while acting as a gateway or
proxy server.
While 2 hosts still
On Wed, Oct 25, 2017 at 2:32 PM, Bryan Stillwell wrote:
> That helps a little bit, but overall the process would take years at this
> rate:
>
>
>
> # for i in {1..3600}; do ceph df -f json-pretty |grep -A7 '".rgw.buckets"'
> |grep objects; sleep 60; done
>
>
It depends on what stage you are in:
in production, probably the best thing is to set up a monitoring tool
(collectd/graphite/prometheus/grafana) to monitor both ceph stats as well
as resource load. This will, among other things, show you if you have
slowing disks.
Before production you should
Hi Christian,
I've just upgraded to 10.2.10 and the problem still persists. Both: OSDs
not starting (the most problematic now) and the wrong report of degraded
objects:
20266198323226120/281736 objects degraded (7193329330730.229%)
Any ideas about how to resolve the problem with the
Thanks for the information.
I did:
# ceph daemon mds.ceph-0 scrub_path / repair recursive
Saw in the logs it finished
# ceph daemon mds.ceph-0 flush journal
Saw in the logs it finished
# ceph mds fail 0
# ceph mds repaired 0
And it went back to missing stray7 again. I added that back like we
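A sketch of what that "add it back" step can look like is recreating the empty stray directory object in the metadata pool; the pool name and object name below (607.00000000 for stray7, inode 0x607) are assumptions, not commands taken from the earlier thread:
# rados -p cephfs_metadata create 607.00000000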
> On 25 October 2017 at 10:39, koukou73gr wrote:
>
>
> On 2017-10-25 11:21, Wido den Hollander wrote:
> >
> >> On 25 October 2017 at 5:58, Christian Sarrasin wrote:
> >>
> >> The one thing I'm still wondering about is failure domains.
That log is showing that a snap remove request was made from a client
that couldn't acquire the lock to a client that currently owns the
lock. The client that currently owns the lock responded w/ an -ENOENT
error that the snapshot doesn't exist. Depending on the maintenance
operation requested,
Commands that start with "ceph daemon" take mds.<name> rather than a rank
(notes on terminology here:
http://docs.ceph.com/docs/master/cephfs/standby/). The name is how
you would refer to the daemon from systemd; by default it's often set to the
hostname where the daemon is running.
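For example, reusing the daemon name from the commands earlier in this thread:
# ceph daemon mds.ceph-0 scrub_path / repair recursive
whereas "ceph daemon mds.0 ..." addresses rank 0 rather than a named daemon and fails to find the admin socket.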
John
On Wed, Oct 25,
On 17-10-25 02:39 PM, Jason Dillaman wrote:
That log is showing that a snap remove request was made from a client
that couldn't acquire the lock to a client that currently owns the
lock. The client that currently owns the lock responded w/ an -ENOENT
error that the snapshot doesn't exist.
Thanks -- let me know. In the future, you may want to consider having
librbd create an admin socket so that you can change (certain)
settings or interact w/ the process w/o restarting it.
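A minimal sketch of what that can look like in the client's ceph.conf (the socket path uses the usual metavariables and is only an example):
[client]
    admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
With the socket in place, settings can then be changed at runtime, e.g.:
# ceph --admin-daemon /var/run/ceph/<socket>.asok config set debug_rbd 20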
On Wed, Oct 25, 2017 at 9:54 AM, Piotr Dałek wrote:
> On 17-10-25 03:30 PM, Jason
John, thank you so much. After doing the initial rados command you
mentioned, it is back up and running. It did complain about a bunch of
files having duplicate inodes, which frankly are not important, but I
will run those repair and scrub commands you mentioned and get it back
clean again.
Hmm, hard to say off the top of my head. If you could enable "debug
librbd = 20" logging on the buggy client that owns the lock, create a
new snapshot, and attempt to delete it, it would be interesting to
verify that the image is being properly refreshed.
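One way to capture that from the client side (a sketch, assuming the client reads ceph.conf; the log file location is only an example, and "debug rbd" is the usual option name for librbd logging):
[client]
    debug rbd = 20
    log file = /var/log/ceph/$name.$pid.log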
On Wed, Oct 25, 2017 at 9:23 AM, Piotr
On 17-10-25 03:30 PM, Jason Dillaman wrote:
Hmm, hard to say off the top of my head. If you could enable "debug
librbd = 20" logging on the buggy client that owns the lock, create a
new snapshot, and attempt to delete it, it would be interesting to
verify that the image is being properly
I do have a problem with running the commands you mentioned to repair
the mds:
# ceph daemon mds.0 scrub_path
admin_socket: exception getting command descriptions: [Errno 2] No such
file or directory
admin_socket: exception getting command descriptions: [Errno 2] No such
file or directory
On 10/25/2017 03:51 AM, Caspar Smit wrote:
Hi,
I've asked the exact same question a few days ago, same answer:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021708.html
I guess we'll have to bite the bullet on this one and take this into
account when designing.
This is
Hi Giang,
Can I ask you if you used the elrepo kernels? Because I tried these, but
they are not booting, I think because of the mpt2sas/mpt3sas drivers.
Regards,
Marc
-Original Message-
From: GiangCoi Mr [mailto:ltrgian...@gmail.com]
Sent: Wednesday, 25 October 2017 16:11
To:
any comment on this one ?
interesting what to do in this situation
On Wed, Jul 5, 2017 at 10:51 PM, Adrian Saul
wrote:
>
>
> During a recent snafu with a production cluster I disabled scrubbing and
> deep scrubbing in order to reduce load on the cluster while
https://github.com/ceph/ceph-iscsi-cli/issues/36
I have already asked for that; you can remove the check or download the
2.5 release, which doesn't check for the OS.
On 25/10/2017 at 17:02, Marc Roos wrote:
>
>
> Hi Giang,
>
> Can I ask you if you used the elrepo kernels? Because I tried
Yes, I used elrepo to upgrade the kernel; I can boot and show it, kernel 4.x. What
is the problem?
Sent from my iPhone
> On Oct 25, 2017, at 10:02 PM, Marc Roos wrote:
>
>
>
> Hi Giang,
>
> Can I ask you if you used the elrepo kernels? Because I tried these, but
>
Hey All,
is it possible to set permissions on buckets?
For example, if I have 2 users (user_a and user_b) and 2 buckets (bk_a and
bk_b),
I want to set permissions so user_a can only see bk_a and user_b can only
see bk_b.
I have been looking but can't see what I am after.
Any advice would be
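If it is plain RGW (S3) users you mean, note that by default an RGW user can only list and access buckets it owns, so giving each person their own user largely gets you there; a sketch (uids taken from your example):
# radosgw-admin user create --uid=user_a --display-name="User A"
# radosgw-admin user create --uid=user_b --display-name="User B"
Each user then creates and owns its own bucket; anything beyond that (sharing a bucket read-only, etc.) would need S3 ACLs or, on newer releases, bucket policies.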
Hi all.
I am researching Ceph for storage. I am using 3 VMs: ceph01, ceph02,
ceph03. All VMs are using CentOS 7.4 with a 4.x kernel (I upgraded). Now I
want to configure high-availability iSCSI with ceph-iscsi-cli.
I installed ceph-iscsi-cli on ceph01. But when I create an iSCSI gateway by
It does support CentOS and other distributions, but there is not yet
a way of checking for prerequisites, so the only way to automatically
detect that they are met is by using Red Hat.
Still, that is just a prerequisite check; you can delete it if you manually
ensure they are met. :)
Check the
So, do you have another solution to configure an iSCSI gateway for Ceph? Please
help me
Sent from my iPhone
> On Oct 25, 2017, at 10:17 PM, Jorge Pinilla López wrote:
>
> It does support for CentOS or other distributions but there is not avaible
> yet a way of checking for pre
Hi Jorge Pinilla López,
So, does that mean ceph-iscsi doesn't support CentOS now?
Sent from my iPhone
> On Oct 25, 2017, at 10:07 PM, Jorge Pinilla López wrote:
>
> https://github.com/ceph/ceph-iscsi-cli/issues/36
>
> I have already asked for that, you can remove the check or
We tried various options like the ones Ben mentioned to speed up the garbage
collection process and were unsuccessful. Luckily, we had the ability to
create a new cluster and move all the data that wasn't part of the POC which
created our problem.
One of the things we ran into was the
I believe you can append "skipchecks" to the "create ceph01
192.168.101.151" action. The tools still expect to have a kernel that
includes queue full timeout handling [1][2] which is awaiting upstream
review. That was added to support some low, non-configurable timeouts in
ESX environments when
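For reference, inside gwcli that would then look something like the following, run from the iscsi-target gateways node (treat the exact syntax as a sketch; depending on the version the flag may be a bare "skipchecks" or "skipchecks=true"):
> create ceph01 192.168.101.151 skipchecks=true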
Are you talking about RGW buckets with limited permissions for cephx
authentication? Or RGW buckets with limited permissions for RGW users?
On Wed, Oct 25, 2017 at 12:16 PM nigel davies wrote:
> Hay All
>
> is it possible to set permissions to buckets
>
> for example if i
When I do this, I reweight all of the OSDs I want to remove to 0 first, wait
for the rebalance, then proceed to remove the OSDs. Doing it your way, you
have to wait for the rebalance after removing each OSD one by one.
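A sketch of that order of operations with the standard commands (osd id 12 is just a placeholder; "ceph osd reweight 12 0" is the other common variant of the first step):
# ceph osd crush reweight osd.12 0
... wait for the rebalance to finish and the cluster to go back to HEALTH_OK ...
# ceph osd out 12
# ceph osd crush remove osd.12
# ceph auth del osd.12
# ceph osd rm 12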
Mike Kuriger
Sr. Unix Systems Engineer
818-434-6195
After upgrading from Ceph Hammer to Jewel, we are experiencing extremely long
OSD boot durations.
This long boot time is a huge concern for us, and we are looking for insight into
how we can speed up the boot time.
In Hammer, OSD boot time was approx 3 minutes. After upgrading to Jewel, boot
time
Hello everyone! :)
I have an interesting problem. For a few weeks, we've been testing Luminous
in a cluster made up of 8 servers with about 20 SSD disks almost evenly
distributed. It is running erasure coding.
Yesterday, we decided to bring the cluster to a minimum of 8 servers and 1
disk
Some of the options there won't do much for you as they'll only affect
newer object removals. I think the default number of gc objects is
just inadequate for your needs. You can try manually running
'radosgw-admin gc process' concurrently (2 or 3 processes to start
with), and see if it makes any dent
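A minimal way to kick off a few of those in parallel from a shell (3 processes, matching the suggestion above):
# for i in 1 2 3; do radosgw-admin gc process & done; wait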
I am following a guide at the mo.
But I believe it's RGW users
On 25 Oct 2017 5:29 pm, "David Turner" wrote:
> Are you talking about RGW buckets with limited permissions for cephx
> authentication? Or RGW buckets with limited permissions for RGW users?
>
> On Wed, Oct
Very interesting.
I've been toying around with Rook.io [1]. Did you know of this project, and
if so can you tell if ceph-helm and Rook.io have similar goals?
Regards,
Hans
[1] https://rook.io/
On 25 Oct 2017 21:09, "Sage Weil" wrote:
> There is a new repo under the ceph
Hello,
in the Luminous release notes it is stated that zstd is not supported by
bluestore due to performance reasons. I'm wondering why, when btrfs instead
states that zstd is as fast as lz4 but compresses as well as zlib.
Why then is zlib supported by bluestore? And why does btrfs / Facebook
behave
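For context, in Luminous the BlueStore compression algorithm is selected per pool or via config, e.g. (the pool name is just an example):
# ceph osd pool set mypool compression_algorithm zlib
# ceph osd pool set mypool compression_mode aggressive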
Hello,
I cannot tell what the previous version was since I used the one
installed on Ubuntu 15.04. Now 16.04.
But what I can tell is that I get errors from ceph osd and mon from time
to time. The mon problems are scary since I have to wipe the monitor
and then reinstall a new one. I cannot
I could not get it to boot on CentOS 7 by just installing it. I think
it is because it boots from mpt2sas, and that driver is replaced
with mpt3sas in >4.x kernels. I was even recreating the boot initrds,
but could not get it to run quickly.
-Original Message-
From:
There is a new repo under the ceph org, ceph-helm, which includes helm
charts for deploying ceph on kubernetes. The code is based on the ceph
charts from openstack-helm, but we've moved them into their own upstream
repo here so that they can be developed more quickly and independently
from
On Wed, 25 Oct 2017, Hans van den Bogert wrote:
> Very interesting.I've been toying around with Rook.io [1]. Did you know of
> this project, and if so can you tell if ceph-helm
> and Rook.io have similar goals?
Similar but a bit different.
Probably the main difference is that ceph-helm aims to
On Wed, 25 Oct 2017, Stefan Priebe - Profihost AG wrote:
> Hello,
>
> in the lumious release notes is stated that zstd is not supported by
> bluestor due to performance reason. I'm wondering why btrfs instead
> states that zstd is as fast as lz4 but compresses as good as zlib.
>
> Why is zlib
Hi Ronny,
From the documentation, I thought this was the proper way to resolve the
issue.
Dan
> On 24. okt. 2017 19:14, Daniel Davidson wrote:
>> Our ceph system is having a problem.
>>
>> A few days a go we had a pg that was marked as inconsistent, and today I
>> fixed it with a:
>>
>> #ceph
> On 25 October 2017 at 5:58, Christian Sarrasin wrote:
>
>
> I'm planning to migrate an existing Filestore cluster with (SATA)
> SSD-based journals fronting multiple HDD-hosted OSDs - should be a
> common enough setup. So I've been trying to parse various
On 2017-10-25 11:21, Wido den Hollander wrote:
>
>> On 25 October 2017 at 5:58, Christian Sarrasin wrote:
>>
>> The one thing I'm still wondering about is failure domains. With
>> Filestore and SSD-backed journals, an SSD failure would kill writes but
>> OSDs
On 24. okt. 2017 19:14, Daniel Davidson wrote:
Our ceph system is having a problem.
A few days ago we had a pg that was marked as inconsistent, and today I
fixed it with a:
# ceph pg repair 1.37c
then a file was stuck as missing so I did a:
# ceph pg 1.37c mark_unfound_lost delete
pg has 1
Hello,
I have a 10.2.5 Ceph cluster; there is an image with an exclusive lock that is
being held by a client. Some other client creates a snapshot on that image,
then (that client) goes away. Later, a third client attempts to remove that
snapshot using rbd snap rm, but fails to do so without error:
Hi,
I've asked the exact same question a few days ago, same answer:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021708.html
I guess we'll have to bite the bullet on this one and take this into
account when designing.
Kind regards,
Caspar
2017-10-25 10:39 GMT+02:00
On Mon, Oct 23, 2017 at 5:03 PM Keane Wolter wrote:
> Hi Gregory,
>
> I did set the cephx caps for the client to:
>
> caps: [mds] allow r, allow rw uid=100026 path=/user, allow rw path=/project
>
So you’ve got three different permission granting clauses here:
1) allows the
I'm on Luminous with this cluster.
I've seen that the cluster started cleaning up on Sunday, which made the
bucket size shrink again. I've changed the garbage collection settings to:
rgw_gc_max_objs = 67
rgw_gc_obj_min_wait = 1800
rgw_gc_processor_max_time = 1800
rgw_gc_processor_period = 1800
I found it is similar to this bug: http://tracker.ceph.com/issues/21388,
and fixed it with a rados command.
The pg inconsistent info is like the following; I wish it could be fixed in the future.
root@n10-075-019:/var/lib/ceph/osd/ceph-27/current/1.fcd_head# rados
list-inconsistent-obj 1.fcd --format=json-pretty