re-adding the list.
I'm glad to hear you got things back to a working state. One thing you
might want to check is the hit_set_history in the PG data. If the missing
hit sets are no longer in the history, then it is probably safe to go back
to the normal builds. That is, until you have to mark
After we upgraded from Jewel (10.2.10) to Luminous (12.2.5) we started seeing a
problem where the new ceph-mgr would sometimes hang indefinitely when doing
commands like 'ceph pg dump' on our largest cluster (~1,300 OSDs). The rest of
our clusters (10+) aren't seeing the same issue, but they
Hi Tom,
I used a slightly modified version of your script to generate a comparative
list to mine (echoing out the bucket name, id and actual_id), which has
returned substantially more indexes than mine, including a number that
don't show any indication of resharding having been run, or versioning
Hi,
today we had an issue with our 6-node Ceph cluster.
We had to shut down one node (Ceph-03) to replace a disk (because we did not
know the slot). We set the noout flag and did a graceful shutdown. All was O.K.
After the disk was replaced, the node comes up and our VMs had a big I/O
I had the same problem (or a problem with the same symptoms).
In my case the problem was wrong ownership of the log file.
You might want to check if you are having the same issue.
Cheers, Massimo
On Mon, Oct 15, 2018 at 6:00 AM Zhenshi Zhou wrote:
> Hi,
>
> I added some OSDs into
On Thu., Oct. 18, 2018 at 13:01, Matthew Vernon wrote:
>
> On 17/10/18 15:23, Paul Emmerich wrote:
>
> [apropos building Mimic on Debian 9]
>
> > apt-get install -y g++ libc6-dbg libc6 -t testing
> > apt-get install -y git build-essential cmake
>
> I wonder if you could avoid the "need a
Hi,
Ceph Version = 12.2.8
8TB spinner with 20G SSD partition
Perf dump shows the following:
"bluefs": {
"gift_bytes": 0,
"reclaim_bytes": 0,
"db_total_bytes": 21472731136,
"db_used_bytes": 3467640832,
"wal_total_bytes": 0,
"wal_used_bytes": 0,
Hmm, it would be useful to rebuild the index by rewriting an object.
But first I need to know all the object keys, and to know all the
keys I need list_objects ...
Maybe I can build a union set of the instances, then copy all of them onto
themselves.
Anyway, I want to find out more about why it
Hi David,
Thanks for the explanation!
I'll make a search on how much data each pool will use.
Thanks!
David Turner wrote on Thu., Oct. 18, 2018 at 9:26 PM:
> Not all pools need the same amount of PGs. When you get to so many pools
> you want to start calculating how much data each pool will have. If 1 of
>
On Mon, Oct 15, 2018 at 9:54 PM Dietmar Rieder wrote:
>
> On 10/15/18 1:17 PM, jes...@krogh.cc wrote:
> >> On 10/15/18 12:41 PM, Dietmar Rieder wrote:
> >>> No big difference here.
> >>> all CentOS 7.5 official kernel 3.10.0-862.11.6.el7.x86_64
> >>
> >> ...forgot to mention: all is luminous
Hi Massimo,
I checked the ownership of the file as well as the log directory.
The files are owned by ceph with permission 644, and the
log directory is owned by ceph with permission 'drwxrws--T'.
I suppose that ownership and permissions like these are enough for
ceph to write its logs.
On Thu, Oct 18, 2018 at 1:35 PM Bryan Stillwell wrote:
>
> Thanks Dan!
>
>
>
> It does look like we're hitting the ms_tcp_read_timeout. I changed it to 79
> seconds and I've had a couple dumps that were hung for ~2m40s
> (2*ms_tcp_read_timeout) and one that was hung for 8 minutes
>
On Wed, Oct 17, 2018 at 1:14 AM Yang Yang wrote:
>
> Hi,
> A few weeks ago I found radosgw index has been inconsistent with reality.
> Some objects I cannot list, but I can get them by key. Please see the details
> below:
>
> BACKGROUND:
> Ceph version 12.2.4
I could see something related to that bug might be happening, but we're not
seeing the "clock skew" or "signal: Hangup" messages in our logs.
One reason that this cluster might be running into this problem is that we
appear to have a script that is gathering stats for collectd which is running
On Thu, Oct 18, 2018 at 10:31 PM Bryan Stillwell wrote:
>
> I could see something related to that bug might be happening, but we're not
> seeing the "clock skew" or "signal: Hangup" messages in our logs.
>
>
>
> One reason that this cluster might be running into this problem is that we
> appear
On 10/18/2018 7:49 PM, Nick Fisk wrote:
Hi,
Ceph Version = 12.2.8
8TB spinner with 20G SSD partition
Perf dump shows the following:
"bluefs": {
"gift_bytes": 0,
"reclaim_bytes": 0,
"db_total_bytes": 21472731136,
"db_used_bytes": 3467640832,
Thanks Greg,
This did get resolved though I'm not 100% certain why!
For one of the suspect shards which caused a crash on backfill, I
attempted to delete the associated object via S3 late last week. I then
examined the filestore OSDs and the file shards were still present...
maybe for an hour
On Thu, Oct 18, 2018 at 6:17 PM Bryan Stillwell wrote:
>
> After we upgraded from Jewel (10.2.10) to Luminous (12.2.5) we started seeing
> a problem where the new ceph-mgr would sometimes hang indefinitely when doing
> commands like 'ceph pg dump' on our largest cluster (~1,300 OSDs). The rest
I left some of the 'ceph pg dump' commands running and twice they returned
results after 30 minutes, and three times it took 45 minutes. Is there
something that runs every 15 minutes that would let these commands finish?
Bryan
From: Bryan Stillwell
Date: Thursday, October 18, 2018 at 11:16
15 minutes seems like the ms tcp read timeout would be related.
Try shortening that and see if it works around the issue...
(We use ms tcp read timeout = 60 over here -- the 900s default seems
really long to keep idle connections open)
-- dan
On Thu, Oct 18, 2018 at 9:39 PM Bryan Stillwell
Thanks Dan!
It does look like we're hitting the ms_tcp_read_timeout. I changed it to 79
seconds and I've had a couple dumps that were hung for ~2m40s
(2*ms_tcp_read_timeout) and one that was hung for 8 minutes
(6*ms_tcp_read_timeout).
I agree that 15 minutes (900s) is a long timeout. Anyone
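The observed hang durations line up with integer multiples of the timeout; a quick sanity check using the 79-second value from the message above:

```python
# Hung 'ceph pg dump' calls appear to clear on integer multiples of
# ms_tcp_read_timeout (set to 79 seconds in the test above).
timeout = 79  # seconds

for n in (2, 6):
    secs = n * timeout
    print(f"{n} * {timeout}s = {secs}s (~{secs // 60}m{secs % 60:02d}s)")
```

2x79s is ~2m38s (the observed ~2m40s hangs) and 6x79s is ~7m54s (the ~8 minute one), which is consistent with the messenger timing out idle connections.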
After the RGW upgrade from Jewel to Luminous, one S3 user started to receive
errors from his PostgreSQL wal-e solution. The error is like this: "Server Side
Encryption with KMS managed key requires HTTP header
x-amz-server-side-encryption : aws:kms".
This can be resolved via simple patch of wal-e/wal-g. I
Hi!
I use ceph 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable),
and find that:
When expanding the whole cluster, I updated pg_num; all succeeded, but the
status is as below:
cluster:
id: 41ef913c-2351-4794-b9ac-dd340e3fbc75
health: HEALTH_WARN
3 pools have pg_num > pgp_num
Then
I want to ask whether you have had a similar experience upgrading RGW from
Jewel to Luminous. After upgrading the monitors and OSDs, I started two new
Luminous RGWs and put them behind the LB together with the Jewel ones. And
then interesting things started to happen. Some of our jobs started to fail with "
fatal error: An
On Thu, Oct 18, 2018 at 3:35 PM Florent B wrote:
>
> I'm not familiar with gdb, what do I need to do ? Install "-gdb" version
> of ceph-mds package ? Then ?
> Thank you
>
Install ceph with debug info, install gdb, then run 'gdb attach <pid>'
> On 18/10/2018 03:40, Yan, Zheng wrote:
> > On Thu, Oct 18,
Hi,
I copied some big files to radosgw with awscli, but I found that some copies
fail, like:
* aws s3 --endpoint=XXX cp ./bigfile s3://mybucket/bigfile*
*upload failed: ./bigfile to s3://mybucket/bigfile An error occurred
(InternalError) when calling the CompleteMultipartUpload operation
On 17/10/18 15:23, Paul Emmerich wrote:
[apropos building Mimic on Debian 9]
apt-get install -y g++ libc6-dbg libc6 -t testing
apt-get install -y git build-essential cmake
I wonder if you could avoid the "need a newer libc" issue by using
backported versions of cmake/g++ ?
Regards,
Not all pools need the same amount of PGs. When you get to so many pools
you want to start calculating how much data each pool will have. If 1 of
your pools will have 80% of your data in it, it should have 80% of your
PGs. The metadata pools for rgw likely won't need more than 8 or so PGs
each. If
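The sizing rule described above (PGs in proportion to each pool's share of the data, with a small floor for metadata pools) can be sketched as a tiny helper; the pool names, shares, and PG budget below are purely illustrative:

```python
# Distribute a total PG budget across pools in proportion to the data
# each pool is expected to hold, with a floor for near-empty pools
# (e.g. rgw metadata pools, which rarely need more than ~8 PGs).
def pgs_per_pool(total_pgs, data_shares, floor=8):
    """data_shares maps pool name -> fraction of total data (sums to ~1)."""
    return {pool: max(floor, round(total_pgs * share))
            for pool, share in data_shares.items()}

shares = {"rbd": 0.80, "rgw.buckets.data": 0.19, "rgw.meta": 0.01}
print(pgs_per_pool(1024, shares))
```

In practice you would then round each result to a nearby power of two before setting pg_num, but the proportions are the point here.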
What are your OSD node specs? CPU, RAM, quantity and size of OSD disks.
You might need to modify some bluestore settings to speed up the time it
takes to peer, or perhaps you are just underpowered for the number of OSD
disks you're running, and your servers and OSD daemons are going as
fast