Re: [ceph-users] OSDs crash after deleting unfound object in Luminous 12.2.8

2018-10-18 Thread Mike Lovell
re-adding the list.

i'm glad to hear you got things back to a working state. one thing you
might want to check is the hit_set_history in the pg data. if the missing
hit sets are no longer in the history, then it is probably safe to go back
to the normal builds. that is until you have to mark another hit set
missing. :)  i think the code that removes the hit set from the pg data runs
before that assert, so it's possible it still removed it from the history.
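
a quick way to eyeball that, assuming jq is available and <pgid> stands in
for the cache pool pg in question, is something like:

ceph pg <pgid> query | jq '.info.hit_set_history.history'

each entry still listed there should have a matching
hit_set_<pgid>_archive_<start>_<end> object in the cache pool.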

mike

On Thu, Oct 18, 2018 at 9:11 AM Lawrence Smith <
lawrence.sm...@uni-muenster.de> wrote:

> Hi Mike,
>
> Thanks a bunch for your writeup, that was exactly the problem and
> solution! All i did was comment out the assert and add an if(obc){ } after
> to make sure i don't run into a segfault, and now the cluster is healthy
> once again. I am not sure if ceph will register a mismatch in a byte count
> while scrubbing due to the missing object, but I don't think so.
>
> Anyway, I just wanted to thank you for your help!
>
> Best wishes,
>
> Lawrence
>
> On 10/13/2018 02:00 AM, Mike Lovell wrote:
>
> what was the object name that you marked lost? was it one of the cache
> tier hit_sets?
>
> the trace you have does seem to be failing when the OSD is trying to
> remove a hit set that is no longer needed. i ran into a similar problem
> which might have been why that bug you listed was created. maybe providing
> what i have since discovered about hit sets might help.
>
> the hit sets are what the cache tier uses to know which objects have been
> accessed in a given period of time. these hit sets are then stored in the
> object store using a generated object name. for the version you're
> running, the code for that generation is at
> https://github.com/ceph/ceph/blob/v12.2.8/src/osd/PrimaryLogPG.cc#L12667.
> it's basically "hit_set_<pgid>_archive_<start>_<end>" where the
> times are recorded in the hit set history. that hit set history is stored
> as part of the PG metadata. you can get a list of all of the hit sets the
> PG has by looking at 'ceph pg <pgid> query' and looking at the
> ['info']['hit_set_history']['history'] array. each entry in that array has
> the information on each hit set for the PG, and the times are what is used
> in generating the object name. there should be one ceph object for each
> hit set listed in that array.
>
> if you told the cluster to mark one of the hit set objects as lost, it's
> possible the OSD cannot get that object and is hitting the assert(obc) near
> the end of PrimaryLogPG::hit_set_trim in the same source file referenced
> above. you can potentially verify this by a couple of methods. i think if you
> set debug_osd to 20 it should log a line saying something like
> "hit_set_trim removing hit_set_<pgid>_archive_<start>_<end>." if that name
> matches one of the ones you marked lost, then this is almost certainly the
> cause. you can also do a find on the OSD directory, if you're using file
> store, and look for the right file name. something like 'find
> /var/lib/ceph/osd/ceph-<id>/current/<pgid>_head -name
> hit\*set\*\*archive\*' should work. include the \ to escape the * so
> bash doesn't interpret it. if you're using bluestore, i think you can use
> the ceph-objectstore-tool while the osd is stopped to get a list of
> objects. you'll probably want to only look in the .ceph-internal namespace
> since the hit sets are stored in that namespace.
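>
> as a rough sketch, assuming the osd is stopped and mounted at the usual
> path (<id> is a placeholder), that listing would look something like:
>
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> --op list | grep hit_set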
>
> there are a couple potential ways to get around this. what we did when we
> had the problem was run a custom build of the ceph-osd where we commented
> out the assert(obc); line in hit_set_trim. that build was only run for long
> enough to get the cluster back online and then to flush and evict the
> entire cache, remove the cache, restart using the normal ceph builds, and
> then recreate the cache.
>
> the other options are things that i don't know for sure if they'll work.
> if you're using file store, you might be able to just copy another hit set
> to the file name of the missing hit set object. this should be pretty
> benign and it's just going to remove the object in a moment anyway. also,
> i'm not entirely sure how to come up with what directory to put the object
> in if the osd has done any directory splitting. maybe someone on the list
> will know how to do this. there might be a way with the
> ceph-objectstore-tool to write in the object but i couldn't find one in my
> testing on hammer.
>
> the last option i can think of, is that if you can completely stop any
> traffic to the pools in question, it's possible the OSDs won't crash.
> hit_set_trim doesn't appear to get called if there is no client traffic
> reaching the osds and the hit sets aren't being updated. if you can stop
> anything from using the pools in question and guarantee nothing will come
> in, then it might be possible to keep the OSDs up long enough to flush
> everything from the cache tier, remove it, and recreate it. this option
> seems like a long shot and i don't know for sure it'll work. it just seemed
> to me like the OSDs would stay up 

[ceph-users] ceph-mgr hangs on larger clusters in Luminous

2018-10-18 Thread Bryan Stillwell
After we upgraded from Jewel (10.2.10) to Luminous (12.2.5) we started seeing a 
problem where the new ceph-mgr would sometimes hang indefinitely when doing 
commands like 'ceph pg dump' on our largest cluster (~1,300 OSDs).  The rest of 
our clusters (10+) aren't seeing the same issue, but they are all under 600 
OSDs each.  Restarting ceph-mgr seems to fix the issue for 12 hours or so, but 
usually overnight it'll get back into the state where the hang reappears.  At 
first I thought it was a hardware issue, but switching the primary ceph-mgr to 
another node didn't fix the problem.

I've increased the logging to 20/20 for debug_mgr, and while a working dump 
looks like this:

2018-10-18 09:26:16.256911 7f9dbf5e7700  4 mgr.server handle_command decoded 3
2018-10-18 09:26:16.256917 7f9dbf5e7700  4 mgr.server handle_command prefix=pg 
dump
2018-10-18 09:26:16.256937 7f9dbf5e7700 10 mgr.server _allowed_command  
client.admin capable
2018-10-18 09:26:16.256951 7f9dbf5e7700  0 log_channel(audit) log [DBG] : 
from='client.1414554763 10.2.4.2:0/2175076978' entity='client.admin' 
cmd=[{"prefix": "pg dump", "target": ["mgr", ""], "format": "json-pretty"}]: 
dispatch
2018-10-18 09:26:22.567583 7f9dbf5e7700  1 mgr.server reply handle_command (0) 
Success dumped all

A failed dump call doesn't show up at all.  The "mgr.server handle_command 
prefix=pg dump" log entry doesn't seem to even make it to the logs.

This problem also continued to appear after upgrading to 12.2.8.

Has anyone else seen this?

Thanks,
Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Resolving Large omap objects in RGW index pool

2018-10-18 Thread Chris Sarginson
Hi Tom,

I used a slightly modified version of your script to generate a comparative
list to mine (echoing out the bucket name, id and actual_id), which has
returned substantially more indexes than mine, including a number that
don't show any indication of resharding having been run, or versioning
being enabled, including some with only minor differences in their bucket_ids:

  5813 buckets_with_multiple_reindexes2.txt (my script)
  7999 buckets_with_multiple_reindexes3.txt (modified Tomasz script)

For example a bucket has 2 entries:

default.23404.6
default.23407.8

running "radosgw-admin bucket stats" against this bucket shows the current
id as default.23407.9

None of the indexes (including the active one) shows multiple shards, or
any resharding activities.

Using the command:
rados -p .rgw.buckets.index listomapvals .dir.${id}

Shows the other (lower) index ids as being empty, and the current one
containing the index data.
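
A quick way to compare them is to count the omap keys on each index object;
a minimal sketch, with my example instance ids above as placeholders:

for id in default.23404.6 default.23407.8 default.23407.9; do
  printf '%s: ' "$id"
  # note: resharded indexes store per-shard objects named .dir.${id}.<shard>
  rados -p .rgw.buckets.index listomapkeys .dir.${id} | wc -l
done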

I'm wondering if it is possible some of these are remnants from upgrades
(this cluster started as giant and has been upgraded through the LTS
releases to Luminous)?  Using radosgw-admin metadata get bucket.instance on
my sample bucket shows different "ver" information between them all:

old:
"ver": {


"tag": "__17wYsZGbXIhRKtx3goicMV",
"ver": 1
},
"mtime": "2014-03-24 15:45:03.00Z"

"ver": {
"tag": "_x5RWprsckrL3Bj8h7Mbwklt",
"ver": 1
},
"mtime": "2014-03-24 15:43:31.00Z"

active:
"ver": {
"tag": "_6sTOABOHCGTSZ-EEIZ29VSN",
"ver": 4
},
"mtime": "2017-08-10 15:06:38.940464Z",

This obviously still leaves me with the original issue noticed, which is
multiple instances of buckets that seem to have been repeatedly resharded
to the same number of shards as the currently active index.  From having a
search around the tracker it seems like this may be worth following -
"Aborted dynamic resharding should clean up created bucket index objs" :

https://tracker.ceph.com/issues/35953

Again, any other suggestions or ideas are greatly welcomed on this :)

Chris

On Wed, 17 Oct 2018 at 12:29 Tomasz Płaza  wrote:

> Hi,
>
> I have a similar issue, and created a simple bash file to delete old
> indexes (it is PoC and have not been tested on production):
>
> for bucket in `radosgw-admin metadata list bucket | jq -r '.[]' | sort`
> do
>   actual_id=`radosgw-admin bucket stats --bucket=${bucket} | jq -r '.id'`
>   for instance in `radosgw-admin metadata list bucket.instance | jq -r
> '.[]' | grep ${bucket}: | cut -d ':' -f 2`
>   do
> if [ "$actual_id" != "$instance" ]
> then
>   radosgw-admin bi purge --bucket=${bucket} --bucket-id=${instance}
>   radosgw-admin metadata rm bucket.instance:${bucket}:${instance}
> fi
>   done
> done
>
> I find it more readable than the mentioned one-liner. Any suggestions on this
> topic are greatly appreciated.
> Tom
>
> Hi,
>
> Having spent some time on the below issue, here are the steps I took to
> resolve the "Large omap objects" warning.  Hopefully this will help others
> who find themselves in this situation.
>
> I got the object ID and OSD ID implicated from the ceph cluster logfile on
> the mon.  I then proceeded to the implicated host containing the OSD, and
> extracted the implicated PG by running the following, and looking at which
> PG had started and completed a deep-scrub around the warning being logged:
>
> grep -C 200 Large /var/log/ceph/ceph-osd.*.log | egrep '(Large
> omap|deep-scrub)'
>
> If the bucket had not been sharded sufficiently (IE the cluster log showed
> a "Key Count" or "Size" over the thresholds), I ran through the manual
> sharding procedure (shown here:
> https://tracker.ceph.com/issues/24457#note-5)
>
> Once this was successfully sharded, or if the bucket was previously
> sufficiently sharded by Ceph prior to disabling the functionality I was
> able to use the following command (seemingly undocumented for Luminous
> http://docs.ceph.com/docs/mimic/man/8/radosgw-admin/#commands):
>
> radosgw-admin bi purge --bucket ${bucketname} --bucket-id ${old_bucket_id}
>
> I then issued a ceph pg deep-scrub against the PG that had contained the
> Large omap object.
>
> Once I had completed this procedure, my Large omap object warnings went
> away and the cluster returned to HEALTH_OK.
>
> However our radosgw bucket indexes pool now seems to be using
> substantially more space than previously.  Having looked initially at this
> bug, and in particular the first comment:
>
> http://tracker.ceph.com/issues/34307#note-1
>
> I was able to extract a number of bucket indexes that had apparently been
> resharded, and removed the legacy index using the radosgw-admin bi purge
> --bucket ${bucket} ${marker}.  I am still able  to perform a radosgw-admin
> metadata get bucket.instance:${bucket}:${marker} successfully, however now
> when I run rados -p .rgw.buckets.index ls | grep ${marker} nothing is
> returned.  Even after this, we were still seeing extremely 

[ceph-users] 12.2.8: 1 node comes up (noout set), from a 6 nodes cluster -> I/O stuck (rbd usage)

2018-10-18 Thread Denny Fuchs
Hi,

today we had an issue with our 6 node Ceph cluster.

We had to shut down one node (Ceph-03) to replace a disk (because we did not
know the slot). We set the noout flag and did a graceful shutdown. All was O.K.
After the disk was replaced, the node came up and our VMs had big I/O
latency.
We never saw this in the past, with the same procedure ...

* From our logs on Ceph-01:

2018-10-18 15:53:45.455743 mon.qh-a07-ceph-osd-03 mon.2 10.3.0.3:6789/0 1 : 
cluster [INF] mon.qh-a07-ceph-osd-03 calling monitor election
...
2018-10-18 15:53:45.503818 mon.qh-a07-ceph-osd-01 mon.0 10.3.0.1:6789/0 1663050 
: cluster [INF] mon.qh-a07-ceph-osd-01 is new leader, mons 
qh-a07-ceph-osd-01,qh-a07-ceph-osd-02,qh-a07-ceph-osd-03,qh-a07-ceph-osd-04,qh-a07-ceph-osd-05,qh

* First OSD comes up:

2018-10-18 15:53:55.207742 mon.qh-a07-ceph-osd-01 mon.0 10.3.0.1:6789/0 1663063 
: cluster [WRN] Health check update: 10 osds down (OSD_DOWN)
2018-10-18 15:53:55.207768 mon.qh-a07-ceph-osd-01 mon.0 10.3.0.1:6789/0 1663064 
: cluster [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (11 osds) down)
2018-10-18 15:53:55.240079 mon.qh-a07-ceph-osd-01 mon.0 10.3.0.1:6789/0 1663065 
: cluster [INF] osd.43 10.3.0.3:6812/7554 boot

* All OSDs were up:

2018-10-18 15:54:25.331692 mon.qh-a07-ceph-osd-01 mon.0 10.3.0.1:6789/0 1663134 
: cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2018-10-18 15:54:25.360151 mon.qh-a07-ceph-osd-01 mon.0 10.3.0.1:6789/0 1663135 
: cluster [INF] osd.12 10.3.0.3:6820/8537 boot

* These OSDs are a mix of HDD and SSD across different nodes

2018-10-18 15:54:27.073266 mon.qh-a07-ceph-osd-01 mon.0 10.3.0.1:6789/0 1663138 
: cluster [WRN] Health check update: Degraded data redundancy: 84012/4293867 
objects degraded (1.957%), 1316 pgs degraded, 487 pgs undersized (PG_DEGRADED)
2018-10-18 15:54:32.073644 mon.qh-a07-ceph-osd-01 mon.0 10.3.0.1:6789/0 1663142 
: cluster [WRN] Health check update: Degraded data redundancy: 4611/4293867 
objects degraded (0.107%), 1219 pgs degraded (PG_DEGRADED)
2018-10-18 15:54:36.841189 mon.qh-a07-ceph-osd-01 mon.0 10.3.0.1:6789/0 1663144 
: cluster [WRN] Health check failed: 1 slow requests are blocked > 32 sec. 
Implicated osds 16 (REQUEST_SLOW)
2018-10-18 15:54:37.074098 mon.qh-a07-ceph-osd-01 mon.0 10.3.0.1:6789/0 1663145 
: cluster [WRN] Health check update: Degraded data redundancy: 4541/4293867 
objects degraded (0.106%), 1216 pgs degraded (PG_DEGRADED)
2018-10-18 15:54:42.074510 mon.qh-a07-ceph-osd-01 mon.0 10.3.0.1:6789/0 1663149 
: cluster [WRN] Health check update: Degraded data redundancy: 4364/4293867 
objects degraded (0.102%), 1176 pgs degraded (PG_DEGRADED)
2018-10-18 15:54:42.074561 mon.qh-a07-ceph-osd-01 mon.0 10.3.0.1:6789/0 1663150 
: cluster [WRN] Health check update: 5 slow requests are blocked > 32 sec. 
Implicated osds 15,25,30,34 (REQUEST_SLOW)
2018-10-18 15:54:47.074886 mon.qh-a07-ceph-osd-01 mon.0 10.3.0.1:6789/0 1663152 
: cluster [WRN] Health check update: Degraded data redundancy: 4193/4293867 
objects degraded (0.098%), 1140 pgs degraded (PG_DEGRADED)
2018-10-18 15:54:47.074934 mon.qh-a07-ceph-osd-01 mon.0 10.3.0.1:6789/0 1663153 
: cluster [WRN] Health check update: 5 slow requests are blocked > 32 sec. 
Implicated osds 9,15,23,30 (REQUEST_SLOW)
2018-10-18 15:54:52.075274 mon.qh-a07-ceph-osd-01 mon.0 10.3.0.1:6789/0 1663156 
: cluster [WRN] Health check update: Degraded data redundancy: 4087/4293867 
objects degraded (0.095%), 1120 pgs degraded (PG_DEGRADED)
2018-10-18 15:54:52.075313 mon.qh-a07-ceph-osd-01 mon.0 10.3.0.1:6789/0 1663157 
: cluster [WRN] Health check update: 14 slow requests are blocked > 32 sec. 
Implicated osds 2,13,14,15,16,23 (REQUEST_SLOW)
2018-10-18 15:54:57.075635 mon.qh-a07-ceph-osd-01 mon.0 10.3.0.1:6789/0 1663158 
: cluster [WRN] Health check update: Degraded data redundancy: 3932/4293867 
objects degraded (0.092%), 1074 pgs degraded (PG_DEGRADED)
2018-10-18 15:54:57.075683 mon.qh-a07-ceph-osd-01 mon.0 10.3.0.1:6789/0 1663159 
: cluster [WRN] Health check update: 6 slow requests are blocked > 32 sec. 
Implicated osds 14,15 (REQUEST_SLOW)
2018-10-18 15:55:02.076071 mon.qh-a07-ceph-osd-01 mon.0 10.3.0.1:6789/0 1663163 
: cluster [WRN] Health check update: Degraded data redundancy: 3805/4293867 
objects degraded (0.089%), 1036 pgs degraded (PG_DEGRADED)
2018-10-18 15:55:02.076138 mon.qh-a07-ceph-osd-01 mon.0 10.3.0.1:6789/0 1663164 
: cluster [WRN] Health check update: 4 slow requests are blocked > 32 sec. 
Implicated osds 1,15,19 (REQUEST_SLOW)
2018-10-18 15:55:07.076562 mon.qh-a07-ceph-osd-01 mon.0 10.3.0.1:6789/0 1663168 
: cluster [WRN] Health check update: Degraded data redundancy: 3656/4293867 
objects degraded (0.085%), 988 pgs degraded (PG_DEGRADED)
2018-10-18 15:55:07.076633 mon.qh-a07-ceph-osd-01 mon.0 10.3.0.1:6789/0 1663169 
: cluster [WRN] Health check update: 4 slow requests are blocked > 32 sec. 
Implicated osds 15,22 (REQUEST_SLOW)
2018-10-18 15:55:12.077091 mon.qh-a07-ceph-osd-01 

Re: [ceph-users] Ceph osd logs

2018-10-18 Thread Massimo Sgaravatto
I had the same  problem (or a problem with the same symptoms)
In my case the problem was with wrong ownership of the log file
You might want to check if you are having the same issue

Cheers, Massimo

On Mon, Oct 15, 2018 at 6:00 AM Zhenshi Zhou  wrote:

> Hi,
>
> I added some OSDs into cluster(luminous) lately. The osds use
> bluestoreand everything goes fine. But there is no osd log in the
> log file. The log directory has only empty files.
>
> I check my settings, "ceph daemon osd.x config show", and I get
> "debug_osd": "1/5".
>
> How can I get the new osds' logs?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic and Debian 9

2018-10-18 Thread Paul Emmerich
On Thu, 18 Oct 2018 at 13:01, Matthew Vernon wrote:
>
> On 17/10/18 15:23, Paul Emmerich wrote:
>
> [apropos building Mimic on Debian 9]
>
> > apt-get install -y g++ libc6-dbg libc6 -t testing
> > apt-get install -y git build-essential cmake
>
> I wonder if you could avoid the "need a newer libc" issue by using
> backported versions of cmake/g++ ?

Yes, a backport of gcc would solve all problems. But that has been
discussed before and is unlikely to happen :(


Paul

>
> Regards,
>
> Matthew
>
>
>
> --
>  The Wellcome Sanger Institute is operated by Genome Research
>  Limited, a charity registered in England with number 1021457 and a
>  company registered in England with number 2742969, whose registered
>  office is 215 Euston Road, London, NW1 2BE.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] slow_used_bytes - SlowDB being used despite lots of space free in BlockDB on SSD?

2018-10-18 Thread Nick Fisk
Hi,

Ceph Version = 12.2.8   
8TB spinner with 20G SSD partition 

Perf dump shows the following:

"bluefs": {
"gift_bytes": 0,
"reclaim_bytes": 0,
"db_total_bytes": 21472731136,
"db_used_bytes": 3467640832,
"wal_total_bytes": 0,
"wal_used_bytes": 0,
"slow_total_bytes": 320063143936,
"slow_used_bytes": 4546625536,
"num_files": 124,
"log_bytes": 11833344,
"log_compactions": 4,
"logged_bytes": 316227584,
"files_written_wal": 2,
"files_written_sst": 4375,
"bytes_written_wal": 204427489105,
"bytes_written_sst": 248223463173

Am I reading that correctly, about 3.4GB used out of 20GB on the SSD, yet 4.5GB 
of DB is stored on the spinning disk?

Am I also understanding correctly that BlueFS has reserved 300G of space on the 
spinning disk?
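
For anyone wanting to pull just those counters, a minimal sketch (assuming jq
is installed and <id> is the OSD number):

ceph daemon osd.<id> perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_total_bytes, slow_used_bytes}'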

Found a previous bug tracker for something which looks exactly the same case, 
but should be fixed now:
https://tracker.ceph.com/issues/22264

Thanks,
Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Radosgw index has been inconsistent with reality

2018-10-18 Thread Yang Yang
Hmm, it's useful to rebuild the index by rewriting an object.
But first, I need to know all the object keys, and to know all
keys I need list_objects ...
Maybe I can make a union set of the keys from all instances, then copy all of
the objects onto themselves.
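
A minimal sketch of that copy-onto-itself step with awscli (bucket, key and
endpoint are placeholders; --metadata-directive REPLACE is needed so S3
accepts a copy where source and destination are the same key):

aws s3 cp s3://mybucket/somekey s3://mybucket/somekey \
    --metadata-directive REPLACE --endpoint-url http://rgw.example.com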

Anyway, I want to find out more about why it happens and how to avoid it.

Yehuda Sadeh-Weinraub wrote on Fri, Oct 19, 2018 at 2:25 AM:

> On Wed, Oct 17, 2018 at 1:14 AM Yang Yang  wrote:
> >
> > Hi,
> > A few weeks ago I found radosgw index has been inconsistent with
> reality. Some object I can not list, but I can get them by key. Please see
> the details below:
> >
> > BACKGROUND:
> > Ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b)
> luminous (stable)
> > Index pool is on ssd.
> > There is a very big bucket with more than 10 million object and
> 500TB data.
> > Ceph health is OK.
> > I use s3 api on radosgw.
> >
> > DESCRIBE:
> > When use s3 list_object() to list, some uploaded object can not be
> listed and some uploaded object have an old lastModified time.
> > But at the same time, we can get this object by an exact key. And if
> I put a new object into this bucket, it can be listed.
> > It seems that some indexes during a period of time have been lost.
> >
> > I try to run "radosgw-admin bucket check --bucket  --fix
> --check-objects" and I get nothing at all.
> >
> > SOME ELSE:
> > I found that one bucket will have many indexes, and we can use
> "radosgw-admin metadata list bucket.instance | grep "{bucket name}" to show
> them. But I can not found a doc to describe this feature. And we can use
> "radosgw-admin bucket stats --bucket {bucket_name}" to get id as the active
> instance id.
> > I use "rados listomapkeys" at active(or latest) index to get all
> object in a index, it is really lost. But when I use "rados listomapkeys"
> at another index which is not active as mentioned above, I found the lost
> object index.
> >
> > Resharding is within my consideration. Listomapkeys means do this
> action on all shards(more than 300).
> > In my understanding, a big bucket has one latest index and many old
> indexes. Every index has many shards. So listomapkeys on a index means
> listomapkeys on many shards.
> >
> > QUESTION:
> > Why my index lost?
> > How to recover?
>
> I don't really know what happened, haven't seen this exact issue
> before. You can try copying objects into themselves. That should
> recreate their bucket index entry.
>
> > Why radosgw has many index instances, how do radosgw use them and
> how to change active index?
>
> Could be related to an existing bug. You can unlink the bucket and
> then link a specific bucket instance version (to the user), however,
> I'm not sure I recommend going this path if it isn't necessary.
>
> Regards,
> Yehuda
> >
> >
> > Thanks,
> >
> > Inksink
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph pg/pgp number calculation

2018-10-18 Thread Zhenshi Zhou
Hi David,

Thanks for the explanation!
I'll make a search on how much data each pool will use.

Thanks!

David Turner wrote on Thu, Oct 18, 2018 at 9:26 PM:

> Not all pools need the same amount of PGs. When you get to so many pools
> you want to start calculating how much data each pool will have. If 1 of
> your pools will have 80% of your data in it, it should have 80% of your
> PGs. The metadata pools for rgw likely won't need more than 8 or so PGs
> each. If your rgw data pool is only going to have a little scratch data,
> then it won't need very many PGs either.
>
> On Tue, Oct 16, 2018, 3:35 AM Zhenshi Zhou  wrote:
>
>> Hi,
>>
>> I have a cluster serving rbd and cephfs storage for a period of
>> time. I added rgw in the cluster yesterday and wanted it to serve
>> object storage. Everything seems good.
>>
>> What I'm confused is how to calculate the pg/pgp number. As we
>> all know, the formula of calculating pgs is:
>>
>> Total PGs = ((Total_number_of_OSD * 100) / max_replication_count) /
>> pool_count
>>
>> Before I created rgw, the cluster had 3 pools(rbd, cephfs_data,
>> cephfs_meta).
>> But now it has 8 pools, which object service may use, including
>> '.rgw.root',
>> 'default.rgw.control', 'default.rgw.meta', 'default.rgw.log' and
>> 'defualt.rgw.buckets.index'.
>>
>> Should I calculate pg number again using new pool number as 8, or should
>> I
>> continue to use the old pg number?
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client - page cache being invaildated.

2018-10-18 Thread Yan, Zheng
On Mon, Oct 15, 2018 at 9:54 PM Dietmar Rieder
 wrote:
>
> On 10/15/18 1:17 PM, jes...@krogh.cc wrote:
> >> On 10/15/18 12:41 PM, Dietmar Rieder wrote:
> >>> No big difference here.
> >>> all CentOS 7.5 official kernel 3.10.0-862.11.6.el7.x86_64
> >>
> >> ...forgot to mention: all is luminous ceph-12.2.7
> >
> > Thanks for your time in testing, this is very valueable to me in the
> > debugging. 2 questions:
> >
> > Did you "sleep 900" in-between the execution?
> > Are you using the kernel client or the fuse client?
> >
> > If I run them "right after each other" .. then I get the same behaviour.
> >
>
> Hi, as I stated I'm using the kernel client, and yes I did the sleep 900
> between the two runs.
>
> ~Dietmar
>
Sorry for the delay

I suspect that the mds asked the client to trim its cache. Please run the
following commands on an idle client.

time  for i in $(seq 0 3); do echo "dd if=test.$i.0 of=/dev/null
bs=1M"; done  | parallel -j 4
echo module ceph +p > /sys/kernel/debug/dynamic_debug/control;
sleep 900;
echo module ceph -p > /sys/kernel/debug/dynamic_debug/control;
time for i in $(seq 0 3); do echo "dd if=test.$i.0 of=/dev/null
bs=1M"; done  | parallel -j 4

If you can reproduce this issue. please send kernel log to us.

Regards
Yan, Zheng


> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph osd logs

2018-10-18 Thread Zhenshi Zhou
Hi Massimo,

I checked the ownership of the file as well as the log directory.
The files are owned by ceph with permission 644. Besides, the
log directory's ownership is ceph and the permission is 'drwxrws--T'

I suppose that the ownership and file permission are enough for
ceph to write logs.
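
A couple of quick checks that might narrow it down (osd.x stands in for one
of the new OSDs; the log path below is just the usual default):

ceph daemon osd.x config get log_file
ceph daemon osd.x config get debug_osd
ls -l /var/log/ceph/ceph-osd.x.log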

Thanks

Massimo Sgaravatto wrote on Thu, Oct 18, 2018 at 11:40 PM:

> I had the same  problem (or a problem with the same symptoms)
> In my case the problem was with wrong ownership of the log file
> You might want to check if you are having the same issue
>
> Cheers, Massimo
>
> On Mon, Oct 15, 2018 at 6:00 AM Zhenshi Zhou  wrote:
>
>> Hi,
>>
>> I added some OSDs into cluster(luminous) lately. The osds use
>> bluestoreand everything goes fine. But there is no osd log in the
>> log file. The log directory has only empty files.
>>
>> I check my settings, "ceph daemon osd.x config show", and I get
>> "debug_osd": "1/5".
>>
>> How can I get the new osds' logs?
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mgr hangs on larger clusters in Luminous

2018-10-18 Thread Gregory Farnum
On Thu, Oct 18, 2018 at 1:35 PM Bryan Stillwell  wrote:
>
> Thanks Dan!
>
>
>
> It does look like we're hitting the ms_tcp_read_timeout.  I changed it to 79 
> seconds and I've had a couple dumps that were hung for ~2m40s 
> (2*ms_tcp_read_timeout) and one that was hung for 8 minutes 
> (6*ms_tcp_read_timeout).
>
>
>
> I agree that 15 minutes (900s) is a long timeout.  Anyone know the reasoning 
> for that decision?

I think we picked it because it was long enough to be very sure that a
connection wouldn't time out while it was waiting on some kind of slow
response, but short enough that it would actually go away.
In general, we don't expect it to be an "important" value since
connections shouldn't dangle unless one Ceph entity actually remains
alive that whole time and stops needing to talk to an entity it was
previously using, and establishing a connection takes a few
round-trips but otherwise costs little.

So eg it's not uncommon for an rbd client to hit these disconnects if
it stops using its disk for a while. But there's also very little cost
to keeping the session around.

I wouldn't worry much about turning it down quite a bit, but if it's
changing the behavior of ceph-mgr there's also a ceph-mgr bug that
needs to be resolved. I presume John's link is more useful for that.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Radosgw index has been inconsistent with reality

2018-10-18 Thread Yehuda Sadeh-Weinraub
On Wed, Oct 17, 2018 at 1:14 AM Yang Yang  wrote:
>
> Hi,
> A few weeks ago I found radosgw index has been inconsistent with reality. 
> Some object I can not list, but I can get them by key. Please see the details 
> below:
>
> BACKGROUND:
> Ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous 
> (stable)
> Index pool is on ssd.
> There is a very big bucket with more than 10 million object and 500TB 
> data.
> Ceph health is OK.
> I use s3 api on radosgw.
>
> DESCRIBE:
> When use s3 list_object() to list, some uploaded object can not be listed 
> and some uploaded object have an old lastModified time.
> But at the same time, we can get this object by an exact key. And if I 
> put a new object into this bucket, it can be listed.
> It seems that some indexes during a period of time have been lost.
>
> I try to run "radosgw-admin bucket check --bucket  --fix 
> --check-objects" and I get nothing at all.
>
> SOME ELSE:
> I found that one bucket will have many indexes, and we can use 
> "radosgw-admin metadata list bucket.instance | grep "{bucket name}" to show 
> them. But I can not found a doc to describe this feature. And we can use 
> "radosgw-admin bucket stats --bucket {bucket_name}" to get id as the active 
> instance id.
> I use "rados listomapkeys" at active(or latest) index to get all object 
> in a index, it is really lost. But when I use "rados listomapkeys" at another 
> index which is not active as mentioned above, I found the lost object index.
>
> Resharding is within my consideration. Listomapkeys means do this action 
> on all shards(more than 300).
> In my understanding, a big bucket has one latest index and many old 
> indexes. Every index has many shards. So listomapkeys on a index means 
> listomapkeys on many shards.
>
> QUESTION:
> Why my index lost?
> How to recover?

I don't really know what happened, haven't seen this exact issue
before. You can try copying objects into themselves. That should
recreate their bucket index entry.

> Why radosgw has many index instances, how do radosgw use them and how to 
> change active index?

Could be related to an existing bug. You can unlink the bucket and
then link a specific bucket instance version (to the user), however,
I'm not sure I recommend going this path if it isn't necessary.

Regards,
Yehuda
>
>
> Thanks,
>
> Inksink
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mgr hangs on larger clusters in Luminous

2018-10-18 Thread Bryan Stillwell
I could see something related to that bug might be happening, but we're not 
seeing the "clock skew" or "signal: Hangup" messages in our logs.

One reason that this cluster might be running into this problem is that we 
appear to have a script that is gathering stats for collectd which is running 
'ceph pg dump' every 16-17 seconds.  I guess you could say we're stress testing 
that code path fairly well...  :)

Bryan

On Thu, Oct 18, 2018 at 6:17 PM Bryan Stillwell 
mailto:bstillw...@godaddy.com>> wrote:

After we upgraded from Jewel (10.2.10) to Luminous (12.2.5) we started seeing a 
problem where the new ceph-mgr would sometimes hang indefinitely when doing 
commands like 'ceph pg dump' on our largest cluster (~1,300 OSDs).  The rest of 
our clusters (10+) aren't seeing the same issue, but they are all under 600 
OSDs each.  Restarting ceph-mgr seems to fix the issue for 12 hours or so, but 
usually overnight it'll get back into the state where the hang reappears.  At 
first I thought it was a hardware issue, but switching the primary ceph-mgr to 
another node didn't fix the problem.



I've increased the logging to 20/20 for debug_mgr, and while a working dump 
looks like this:



2018-10-18 09:26:16.256911 7f9dbf5e7700  4 mgr.server handle_command decoded 3

2018-10-18 09:26:16.256917 7f9dbf5e7700  4 mgr.server handle_command prefix=pg 
dump

2018-10-18 09:26:16.256937 7f9dbf5e7700 10 mgr.server _allowed_command  
client.admin capable

2018-10-18 09:26:16.256951 7f9dbf5e7700  0 log_channel(audit) log [DBG] : 
from='client.1414554763 10.2.4.2:0/2175076978' entity='client.admin' 
cmd=[{"prefix": "pg dump", "target": ["mgr", ""], "format": "json-pretty"}]: 
dispatch

2018-10-18 09:26:22.567583 7f9dbf5e7700  1 mgr.server reply handle_command (0) 
Success dumped all



A failed dump call doesn't show up at all.  The "mgr.server handle_command 
prefix=pg dump" log entry doesn't seem to even make it to the logs.

This could be a manifestation of
https://tracker.ceph.com/issues/23460, as the "pg dump" path is one of
the places where the pgmap and osdmap locks are taken together.

Deadlockyness aside, this code path could use some improvement so that
both locks aren't being held unnecessarily, and so that we aren't
holding up all other accesses to pgmap while doing a dump.

John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mgr hangs on larger clusters in Luminous

2018-10-18 Thread John Spray
On Thu, Oct 18, 2018 at 10:31 PM Bryan Stillwell  wrote:
>
> I could see something related to that bug might be happening, but we're not 
> seeing the "clock skew" or "signal: Hangup" messages in our logs.
>
>
>
> One reason that this cluster might be running into this problem is that we 
> appear to have a script that is gathering stats for collectd which is running 
> 'ceph pg dump' every 16-17 seconds.  I guess you could say we're stress 
> testing that code path fairly well...  :)

That would be one way of putting it!  Consider moving to the
prometheus module, which includes output about which PGs are in which
states (and does so without serializing every PG's full status...)
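
A minimal sketch of switching over, assuming the default mgr prometheus port
(9283):

ceph mgr module enable prometheus
curl http://<mgr-host>:9283/metrics | grep -i pg | head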

John

>
>
> Bryan
>
>
>
> On Thu, Oct 18, 2018 at 6:17 PM Bryan Stillwell  
> wrote:
>
>
>
> After we upgraded from Jewel (10.2.10) to Luminous (12.2.5) we started seeing 
> a problem where the new ceph-mgr would sometimes hang indefinitely when doing 
> commands like 'ceph pg dump' on our largest cluster (~1,300 OSDs).  The rest 
> of our clusters (10+) aren't seeing the same issue, but they are all under 
> 600 OSDs each.  Restarting ceph-mgr seems to fix the issue for 12 hours or 
> so, but usually overnight it'll get back into the state where the hang 
> reappears.  At first I thought it was a hardware issue, but switching the 
> primary ceph-mgr to another node didn't fix the problem.
>
>
>
>
>
>
>
> I've increased the logging to 20/20 for debug_mgr, and while a working dump 
> looks like this:
>
>
>
>
>
>
>
> 2018-10-18 09:26:16.256911 7f9dbf5e7700  4 mgr.server handle_command decoded 3
>
>
>
> 2018-10-18 09:26:16.256917 7f9dbf5e7700  4 mgr.server handle_command 
> prefix=pg dump
>
>
>
> 2018-10-18 09:26:16.256937 7f9dbf5e7700 10 mgr.server _allowed_command  
> client.admin capable
>
>
>
> 2018-10-18 09:26:16.256951 7f9dbf5e7700  0 log_channel(audit) log [DBG] : 
> from='client.1414554763 10.2.4.2:0/2175076978' entity='client.admin' 
> cmd=[{"prefix": "pg dump", "target": ["mgr", ""], "format": "json-pretty"}]: 
> dispatch
>
>
>
> 2018-10-18 09:26:22.567583 7f9dbf5e7700  1 mgr.server reply handle_command 
> (0) Success dumped all
>
>
>
>
>
>
>
> A failed dump call doesn't show up at all.  The "mgr.server handle_command 
> prefix=pg dump" log entry doesn't seem to even make it to the logs.
>
>
>
> This could be a manifestation of
>
> https://tracker.ceph.com/issues/23460, as the "pg dump" path is one of
>
> the places where the pgmap and osdmap locks are taken together.
>
>
>
> Deadlockyness aside, this code path could use some improvement so that
>
> both locks aren't being held unnecessarily, and so that we aren't
>
> holding up all other accesses to pgmap while doing a dump.
>
>
>
> John
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow_used_bytes - SlowDB being used despite lots of space free in BlockDB on SSD?

2018-10-18 Thread Igor Fedotov



On 10/18/2018 7:49 PM, Nick Fisk wrote:

Hi,

Ceph Version = 12.2.8
8TB spinner with 20G SSD partition

Perf dump shows the following:

"bluefs": {
 "gift_bytes": 0,
 "reclaim_bytes": 0,
 "db_total_bytes": 21472731136,
 "db_used_bytes": 3467640832,
 "wal_total_bytes": 0,
 "wal_used_bytes": 0,
 "slow_total_bytes": 320063143936,
 "slow_used_bytes": 4546625536,
 "num_files": 124,
 "log_bytes": 11833344,
 "log_compactions": 4,
 "logged_bytes": 316227584,
 "files_written_wal": 2,
 "files_written_sst": 4375,
 "bytes_written_wal": 204427489105,
 "bytes_written_sst": 248223463173

Am I reading that correctly, about 3.4GB used out of 20GB on the SSD, yet 4.5GB 
of DB is stored on the spinning disk?
Correct. Most probably the rationale for this is the layered scheme
RocksDB uses to keep its sst files. Each level has a maximum threshold
(determined by the level number, a base value and a corresponding multiplier -
see max_bytes_for_level_base & max_bytes_for_level_multiplier at
https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide).
If the next level (at its max size) doesn't fit into the space
available on the DB volume, it is spilled over entirely to the slow device.
IIRC level_base is about 250MB and the multiplier is 10, so the third level
needs ~25GB and hence doesn't fit into your DB volume.
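
As a rough illustration of that layering, taking the figures above as
assumptions (base ~256MB, x10 multiplier):

base_mb=256; mult=10; total=0
for lvl in 1 2 3 4; do
  size=$(( base_mb * mult ** (lvl - 1) ))
  total=$(( total + size ))
  echo "L${lvl}: ${size} MB, cumulative ${total} MB"
done
# cumulative: L1=256, L2=2816, L3=28416, L4=284416 (MB)
# a 20GB DB partition covers L1+L2 (~2.8GB) but not L3, hence the spillover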


In fact a DB volume of 20GB is VERY small for an 8TB OSD - just 0.25% of the
slow device. AFAIR the current recommendation is about 4%.




Am I also understanding correctly that BlueFS has reserved 300G of space on the 
spinning disk?

Right.

Found a previous bug tracker for something which looks exactly the same case, 
but should be fixed now:
https://tracker.ceph.com/issues/22264

Thanks,
Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fixing another remapped+incomplete EC 4+2 pg

2018-10-18 Thread Graham Allan

Thanks Greg,

This did get resolved though I'm not 100% certain why!

For one of the suspect shards which caused the crash on backfill, I
attempted to delete the associated object via s3, late last week. I then
examined the filestore OSDs and the file shards were still present...
at least for the hour or so following (after which I stopped looking).


I left the cluster set to nobackfill over the weekend, during which time
all osds kept running; then on Monday morning I re-enabled backfill. I
expected the osd to crash again, after which I could look into moving or
deleting the implicated backfill shards out of the way. Instead,
it happily backfilled its way to cleanliness.


I suppose it's possible the shards got deleted later in some kind of rgw 
gc operation, and this could have cleared the problem? Unfortunately I 
didn't look for them again before re-enabling backfill. I'm not sure if 
that's how s3 object deletion works - does it make any sense?


The only other thing I did late last week was notice that one of the 
active osds for the pg seemed very slow to respond - the drive was 
clearly failing. I was never getting any actual i/o errors at the user 
or osd level, though it did trigger a 24-hour deathwatch SMART warning a 
bit later.


I exported the pg shard from the failing osd, and re-imported it to 
another otherwise-evacuated osd. This was just for data safety; it seems 
really unlikely this could be causing the other osds in the pg to crash...
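
For reference, a sketch of that export/import with ceph-objectstore-tool
(both OSDs stopped; <id>, <newid> and <pgid> are placeholders, and filestore
OSDs with an external journal also need --journal-path):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
    --pgid <pgid> --op export --file /tmp/<pgid>.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<newid> \
    --op import --file /tmp/<pgid>.export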


Graham

On 10/15/2018 01:44 PM, Gregory Farnum wrote:



On Thu, Oct 11, 2018 at 3:22 PM Graham Allan 

As the osd crash implies, setting "nobackfill" appears to let all the
osds keep running and the pg stays active and can apparently serve data.

If I track down the object referenced below in the object store, I can
download it without error via s3... though as I can't generate a
matching etag, it may well be corrupt.

Still I do wonder if deleting this object - either via s3, or maybe
more
likely directly within filestore, might permit backfill to continue.


Yes, that is very likely! (...unless there are a bunch of other objects 
with the same issue.)


I'm not immediately familiar with the crash asserts you're seeing, but 
it certainly looks like somehow the object data didn't quite get stored 
correctly as the metadata understands it. Perhaps a write got 
lost/missed on m+1 of the PG shards, setting the 
osd_find_best_info_ignore_history_les caused it to try and recover from 
what it had rather than following normal recovery procedures, and now 
it's not working.

-Greg



--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mgr hangs on larger clusters in Luminous

2018-10-18 Thread John Spray
On Thu, Oct 18, 2018 at 6:17 PM Bryan Stillwell  wrote:
>
> After we upgraded from Jewel (10.2.10) to Luminous (12.2.5) we started seeing 
> a problem where the new ceph-mgr would sometimes hang indefinitely when doing 
> commands like 'ceph pg dump' on our largest cluster (~1,300 OSDs).  The rest 
> of our clusters (10+) aren't seeing the same issue, but they are all under 
> 600 OSDs each.  Restarting ceph-mgr seems to fix the issue for 12 hours or 
> so, but usually overnight it'll get back into the state where the hang 
> reappears.  At first I thought it was a hardware issue, but switching the 
> primary ceph-mgr to another node didn't fix the problem.
>
>
>
> I've increased the logging to 20/20 for debug_mgr, and while a working dump 
> looks like this:
>
>
>
> 2018-10-18 09:26:16.256911 7f9dbf5e7700  4 mgr.server handle_command decoded 3
>
> 2018-10-18 09:26:16.256917 7f9dbf5e7700  4 mgr.server handle_command 
> prefix=pg dump
>
> 2018-10-18 09:26:16.256937 7f9dbf5e7700 10 mgr.server _allowed_command  
> client.admin capable
>
> 2018-10-18 09:26:16.256951 7f9dbf5e7700  0 log_channel(audit) log [DBG] : 
> from='client.1414554763 10.2.4.2:0/2175076978' entity='client.admin' 
> cmd=[{"prefix": "pg dump", "target": ["mgr", ""], "format": "json-pretty"}]: 
> dispatch
>
> 2018-10-18 09:26:22.567583 7f9dbf5e7700  1 mgr.server reply handle_command 
> (0) Success dumped all
>
>
>
> A failed dump call doesn't show up at all.  The "mgr.server handle_command 
> prefix=pg dump" log entry doesn't seem to even make it to the logs.

This could be a manifestation of
https://tracker.ceph.com/issues/23460, as the "pg dump" path is one of
the places where the pgmap and osdmap locks are taken together.

Deadlockyness aside, this code path could use some improvement so that
both locks aren't being held unnecessarily, and so that we aren't
holding up all other accesses to pgmap while doing a dump.

John






>
>
> This problem also continued to appear after upgrading to 12.2.8.
>
>
>
> Has anyone else seen this?



>
>
>
> Thanks,
>
> Bryan
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mgr hangs on larger clusters in Luminous

2018-10-18 Thread Bryan Stillwell
I left some of the 'ceph pg dump' commands running and twice they returned 
results after 30 minutes, and three times it took 45 minutes.  Is there 
something that runs every 15 minutes that would let these commands finish?

Bryan

From: Bryan Stillwell 
Date: Thursday, October 18, 2018 at 11:16 AM
To: "ceph-users@lists.ceph.com" 
Subject: ceph-mgr hangs on larger clusters in Luminous

After we upgraded from Jewel (10.2.10) to Luminous (12.2.5) we started seeing a 
problem where the new ceph-mgr would sometimes hang indefinitely when doing 
commands like 'ceph pg dump' on our largest cluster (~1,300 OSDs).  The rest of 
our clusters (10+) aren't seeing the same issue, but they are all under 600 
OSDs each.  Restarting ceph-mgr seems to fix the issue for 12 hours or so, but 
usually overnight it'll get back into the state where the hang reappears.  At 
first I thought it was a hardware issue, but switching the primary ceph-mgr to 
another node didn't fix the problem.
 
I've increased the logging to 20/20 for debug_mgr, and while a working dump 
looks like this:
 
2018-10-18 09:26:16.256911 7f9dbf5e7700  4 mgr.server handle_command decoded 3
2018-10-18 09:26:16.256917 7f9dbf5e7700  4 mgr.server handle_command prefix=pg 
dump
2018-10-18 09:26:16.256937 7f9dbf5e7700 10 mgr.server _allowed_command  
client.admin capable
2018-10-18 09:26:16.256951 7f9dbf5e7700  0 log_channel(audit) log [DBG] : 
from='client.1414554763 10.2.4.2:0/2175076978' entity='client.admin' 
cmd=[{"prefix": "pg dump", "target": ["mgr", ""], "format": "json-pretty"}]: 
dispatch
2018-10-18 09:26:22.567583 7f9dbf5e7700  1 mgr.server reply handle_command (0) 
Success dumped all
 
A failed dump call doesn't show up at all.  The "mgr.server handle_command 
prefix=pg dump" log entry doesn't seem to even make it to the logs.
 
This problem also continued to appear after upgrading to 12.2.8.
 
Has anyone else seen this?
 
Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mgr hangs on larger clusters in Luminous

2018-10-18 Thread Dan van der Ster
15 minutes seems like the ms tcp read timeout would be related.

Try shortening that and see if it works around the issue...

(We use ms tcp read timeout = 60 over here -- the 900s default seems
really long to keep idle connections open)
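
A minimal sketch of doing that; the value itself is site-specific:

# ceph.conf, then restart the daemons involved:
[global]
    ms tcp read timeout = 60

# or live, per daemon, via the admin socket:
ceph daemon mgr.<id> config set ms_tcp_read_timeout 60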

-- dan


On Thu, Oct 18, 2018 at 9:39 PM Bryan Stillwell  wrote:
>
> I left some of the 'ceph pg dump' commands running and twice they returned 
> results after 30 minutes, and three times it took 45 minutes.  Is there 
> something that runs every 15 minutes that would let these commands finish?
>
> Bryan
>
> From: Bryan Stillwell 
> Date: Thursday, October 18, 2018 at 11:16 AM
> To: "ceph-users@lists.ceph.com" 
> Subject: ceph-mgr hangs on larger clusters in Luminous
>
> After we upgraded from Jewel (10.2.10) to Luminous (12.2.5) we started seeing 
> a problem where the new ceph-mgr would sometimes hang indefinitely when doing 
> commands like 'ceph pg dump' on our largest cluster (~1,300 OSDs).  The rest 
> of our clusters (10+) aren't seeing the same issue, but they are all under 
> 600 OSDs each.  Restarting ceph-mgr seems to fix the issue for 12 hours or 
> so, but usually overnight it'll get back into the state where the hang 
> reappears.  At first I thought it was a hardware issue, but switching the 
> primary ceph-mgr to another node didn't fix the problem.
>
> I've increased the logging to 20/20 for debug_mgr, and while a working dump 
> looks like this:
>
> 2018-10-18 09:26:16.256911 7f9dbf5e7700  4 mgr.server handle_command decoded 3
> 2018-10-18 09:26:16.256917 7f9dbf5e7700  4 mgr.server handle_command 
> prefix=pg dump
> 2018-10-18 09:26:16.256937 7f9dbf5e7700 10 mgr.server _allowed_command  
> client.admin capable
> 2018-10-18 09:26:16.256951 7f9dbf5e7700  0 log_channel(audit) log [DBG] : 
> from='client.1414554763 10.2.4.2:0/2175076978' entity='client.admin' 
> cmd=[{"prefix": "pg dump", "target": ["mgr", ""], "format": "json-pretty"}]: 
> dispatch
> 2018-10-18 09:26:22.567583 7f9dbf5e7700  1 mgr.server reply handle_command 
> (0) Success dumped all
>
> A failed dump call doesn't show up at all.  The "mgr.server handle_command 
> prefix=pg dump" log entry doesn't seem to even make it to the logs.
>
> This problem also continued to appear after upgrading to 12.2.8.
>
> Has anyone else seen this?
>
> Thanks,
> Bryan
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mgr hangs on larger clusters in Luminous

2018-10-18 Thread Bryan Stillwell
Thanks Dan!

It does look like we're hitting the ms_tcp_read_timeout.  I changed it to 79 
seconds and I've had a couple dumps that were hung for ~2m40s 
(2*ms_tcp_read_timeout) and one that was hung for 8 minutes 
(6*ms_tcp_read_timeout).

I agree that 15 minutes (900s) is a long timeout.  Anyone know the reasoning 
for that decision?

Bryan

From: Dan van der Ster 
Date: Thursday, October 18, 2018 at 2:03 PM
To: Bryan Stillwell 
Cc: ceph-users 
Subject: Re: [ceph-users] ceph-mgr hangs on larger clusters in Luminous

15 minutes seems like the ms tcp read timeout would be related.

Try shortening that and see if it works around the issue...

(We use ms tcp read timeout = 60 over here -- the 900s default seems
really long to keep idle connections open)

-- dan


On Thu, Oct 18, 2018 at 9:39 PM Bryan Stillwell 
mailto:bstillw...@godaddy.com>> wrote:

I left some of the 'ceph pg dump' commands running and twice they returned 
results after 30 minutes, and three times it took 45 minutes.  Is there 
something that runs every 15 minutes that would let these commands finish?

Bryan

From: Bryan Stillwell mailto:bstillw...@godaddy.com>>
Date: Thursday, October 18, 2018 at 11:16 AM
To: "ceph-users@lists.ceph.com" 
mailto:ceph-users@lists.ceph.com>>
Subject: ceph-mgr hangs on larger clusters in Luminous

After we upgraded from Jewel (10.2.10) to Luminous (12.2.5) we started seeing a 
problem where the new ceph-mgr would sometimes hang indefinitely when doing 
commands like 'ceph pg dump' on our largest cluster (~1,300 OSDs).  The rest of 
our clusters (10+) aren't seeing the same issue, but they are all under 600 
OSDs each.  Restarting ceph-mgr seems to fix the issue for 12 hours or so, but 
usually overnight it'll get back into the state where the hang reappears.  At 
first I thought it was a hardware issue, but switching the primary ceph-mgr to 
another node didn't fix the problem.

I've increased the logging to 20/20 for debug_mgr, and while a working dump 
looks like this:

2018-10-18 09:26:16.256911 7f9dbf5e7700  4 mgr.server handle_command decoded 3
2018-10-18 09:26:16.256917 7f9dbf5e7700  4 mgr.server handle_command prefix=pg 
dump
2018-10-18 09:26:16.256937 7f9dbf5e7700 10 mgr.server _allowed_command  
client.admin capable
2018-10-18 09:26:16.256951 7f9dbf5e7700  0 log_channel(audit) log [DBG] : 
from='client.1414554763 10.2.4.2:0/2175076978' entity='client.admin' 
cmd=[{"prefix": "pg dump", "target": ["mgr", ""], "format": "json-pretty"}]: 
dispatch
2018-10-18 09:26:22.567583 7f9dbf5e7700  1 mgr.server reply handle_command (0) 
Success dumped all

A failed dump call doesn't show up at all.  The "mgr.server handle_command 
prefix=pg dump" log entry doesn't seem to even make it to the logs.

This problem also continued to appear after upgrading to 12.2.8.

Has anyone else seen this?

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Disabling RGW Encryption support in Luminous

2018-10-18 Thread Konstantin Shalygin

After RGW upgrade from Jewel to Luminous, one S3 user started to receive
errors from his postgre wal-e solution. Error is like this: "Server Side
Encryption with KMS managed key requires HTTP header
x-amz-server-side-encryption : aws:kms".


This can be resolved via a simple patch of wal-e/wal-g. I already made
a patch for our pgsql guys.


I don't know whether this patch was upstreamed or not. If anyone still needs
it, I can make a GitHub PR for it.




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] why set pg_num do not update pgp_num

2018-10-18 Thread xiang . dai
Hi! 

I use ceph 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable), 
and find that: 

When expanding the whole cluster, I updated pg_num; all updates succeeded, but
the status was as below:
cluster: 
id: 41ef913c-2351-4794-b9ac-dd340e3fbc75 
health: HEALTH_WARN 
3 pools have pg_num > pgp_num 

Then I updated pgp_num too, and the warning disappeared.

What confuses me is that when I created the whole cluster the first time,
I used "ceph osd pool create pool_name pg_num", and pgp_num was automatically
set equal to pg_num.

But "ceph osd pool set pool_name pg_num" does not do this.

Why is it designed this way?

Why not auto-update pgp_num when pg_num is updated?
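
For reference, the two-step sequence that clears the warning (pool name and
counts are examples):

ceph osd pool set pool_name pg_num 256
ceph osd pool set pool_name pgp_num 256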

Thanks 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel to Luminous RGW upgrade issues

2018-10-18 Thread Konstantin Shalygin

I want to ask whether you had a similar experience with upgrading Jewel RGW to
Luminous. After upgrading monitors and OSDs, I started two new Luminous
RGWs and put them to LB together with Jewel ones. And than interesting
things started to happen. Some our jobs start to fail with "

fatal error: An error occurred (500) when calling the HeadObject
operation (reached max retries: 4): Internal Server Error" error.

After some testing, I noticed it happens only if an object is uploaded
via Luminous RGW and downloaded via Jewel RGW. Quick googling and
upgrade notes read doesn't explained is it normal behavior or
something is wrong with my RGW configuration. It's quite old cluster
(started as Firefly, I believe) and upgraded several times. My guess
is, Luminous is saving RGW data differently and that's why older RGW's
fail.

I would like to ask, did you had same experience and is it safe to
turn all operations on Luminous RGW only?


Of course you should upgrade all your Ceph RGW's to Luminous too.



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to debug problem in MDS ?

2018-10-18 Thread Yan, Zheng
On Thu, Oct 18, 2018 at 3:35 PM Florent B  wrote:
>
> I'm not familiar with gdb, what do I need to do ? Install "-gdb" version
> of ceph-mds package ? Then ?
> Thank you
>

install ceph with debug info, install gdb. run 'gdb attach <pid of ceph-mds>'
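
a minimal example session, assuming the active mds is the only ceph-mds
process on the host, to grab backtraces from all threads:

gdb -p $(pidof ceph-mds)
(gdb) thread apply all bt
(gdb) detach
(gdb) quit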

> On 18/10/2018 03:40, Yan, Zheng wrote:
> > On Thu, Oct 18, 2018 at 3:59 AM Florent B  wrote:
> >> Hi,
> >> I'm not running multiple active MDS (1 active & 7 standby).
> >> I know about debug_mds 20, is it the only log you need to see bugs ?
> >>
> >> On 16/10/2018 18:32, Sergey Malinin wrote:
> >>> Are you running multiple active MDS daemons?
> >>> On MDS host issue "ceph-daemon mds.X config set debug_mds 20" for maximum 
> >>> logging verbosity.
> >>>
>  On 16.10.2018, at 19:23, Florent B  wrote:
> 
>  Hi,
> 
>  A few months ago I sent a message to that list about a problem with a
>  Ceph + Dovecot setup.
> 
>  Bug disappeared and I didn't answer to the thread.
> 
>  Now the bug has come again (Luminous up-to-date cluster + Dovecot
>  up-to-date + Debian Stretch up-to-date).
> 
>  I know how to reproduce it, but it seems very closely tied to my user's
>  Dovecot data (a few GB) and to the file locking system (the bug occurs
>  when I set the locking method to "fcntl" or "flock" in Dovecot, but not
>  with "dotlock").
> 
>  It ends with an unresponsive MDS (100% CPU hang, switching to another MDS
>  but always staying at 100% CPU usage). I can't even use the admin socket
>  when the MDS is hung.
> 
> > For issues like this, gdb is the most convenient way to debug. After
> > finding where the buggy code is, set debug_mds=20, restart the mds and
> > check the log to find out how the bug was triggered.
> >
> > Regards
> > Yan, Zheng
> >
> >
>  I would like to know *exactly* which information you need to
>  investigate that bug (which commands, when, how to report large log
>  files...).
> 
>  Thank you.
> 
>  Florent
> 
> 


[ceph-users] RadosGW multipart completion is already in progress

2018-10-18 Thread Yang Yang
Hi,
I copy some big files to radosgw with awscli, but I have found that some
copies fail, like:

  aws s3 --endpoint=XXX cp ./bigfile s3://mybucket/bigfile

  upload failed: ./bigfile to s3://mybucket/bigfile An error occurred
  (InternalError) when calling the CompleteMultipartUpload operation (reached
  max retries: 4): This multipart completion is already in progress

BACKGROUND:
Ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous
(stable)

I found a similar issue, https://tracker.ceph.com/issues/22368, but it has
already been fixed.

Not every copy fails. I copied 2000 files and about 90 of them failed.

Is this a bug?

Thanks,
Inksink
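One hedged workaround sketch (not a confirmed fix): the "reached max retries:
4" part of the error suggests the client resent CompleteMultipartUpload while
the gateway was still processing the first attempt, so giving the client a
longer read timeout may reduce how often the duplicate request is sent. The
endpoint below is hypothetical:

  aws s3 --endpoint-url=http://rgw.example.com --cli-read-timeout 300 \
      cp ./bigfile s3://mybucket/bigfile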


Re: [ceph-users] Mimic and Debian 9

2018-10-18 Thread Matthew Vernon

On 17/10/18 15:23, Paul Emmerich wrote:

[apropos building Mimic on Debian 9]


apt-get install -y g++ libc6-dbg libc6 -t testing
apt-get install -y git build-essential cmake


I wonder if you could avoid the "need a newer libc" issue by using
backported versions of cmake/g++?


Regards,

Matthew
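A sketch of the backports idea above, assuming the required tool versions are
actually available in stretch-backports (which I have not verified):

  echo 'deb http://deb.debian.org/debian stretch-backports main' \
      > /etc/apt/sources.list.d/backports.list
  apt-get update
  # pull only the build tools from backports, leaving libc6 at the stretch version
  apt-get install -y -t stretch-backports cmake g++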





Re: [ceph-users] ceph pg/pgp number calculation

2018-10-18 Thread David Turner
Not all pools need the same number of PGs. Once you get to that many pools
you want to start calculating how much data each pool will hold. If one of
your pools will have 80% of your data in it, it should have 80% of your
PGs. The metadata pools for rgw likely won't need more than 8 or so PGs
each. If your rgw data pool is only going to hold a little scratch data,
then it won't need very many PGs either.
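As an illustrative sketch (pool names and PG values below are hypothetical),
you can look at how much data each pool actually holds and at its current PG
count before deciding where the PGs should go:

  ceph df                                           # per-pool object and byte counts
  ceph osd pool get default.rgw.buckets.data pg_num
  ceph osd pool get cephfs_data pg_num
  # give the bulk of the PGs to the pool that will hold the bulk of the data
  ceph osd pool set default.rgw.buckets.data pg_num 256   # example value only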

On Tue, Oct 16, 2018, 3:35 AM Zhenshi Zhou  wrote:

> Hi,
>
> I have a cluster that has been serving rbd and cephfs storage for a period of
> time. I added rgw to the cluster yesterday and want it to serve
> object storage. Everything seems good.
>
> What I'm confused about is how to calculate the pg/pgp numbers. As we
> all know, the formula for calculating PGs is:
>
> Total PGs = ((Total_number_of_OSD * 100) / max_replication_count) /
> pool_count
>
> Before I created rgw, the cluster had 3 pools (rbd, cephfs_data,
> cephfs_meta).
> But now it has 8 pools, some of which the object service may use, including
> '.rgw.root',
> 'default.rgw.control', 'default.rgw.meta', 'default.rgw.log' and
> 'default.rgw.buckets.index'.
>
> Should I recalculate the PG number using the new pool count of 8, or should I
> keep using the old PG number?


Re: [ceph-users] Troubleshooting hanging storage backend whenever there is any cluster change

2018-10-18 Thread David Turner
What are your OSD node stats? CPU, RAM, quantity and size of OSD disks.
You might need to modify some bluestore settings to speed up the time it
takes to peer, or perhaps you are simply underpowering the number of OSD
disks you are running and your servers and OSD daemons are already going as
fast as they can.
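A quick sketch of the sort of numbers that help here (generic commands, not
specific to this cluster):

  nproc; free -h               # CPU cores and RAM on the OSD node
  lsblk -d -o NAME,SIZE,ROTA   # number, size and type of disks (ROTA=1 means spinning)
  iostat -x 5                  # per-disk utilisation while peering/recovery is running
  ceph osd df tree             # PG count and utilisation per OSD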
On Sat, Oct 13, 2018 at 4:08 PM Stefan Priebe - Profihost AG <
s.pri...@profihost.ag> wrote:

> and a 3rd one:
>
> health: HEALTH_WARN
> 1 MDSs report slow metadata IOs
> 1 MDSs report slow requests
>
> 2018-10-13 21:44:08.150722 mds.cloud1-1473 [WRN] 7 slow requests, 1
> included below; oldest blocked for > 199.922552 secs
> 2018-10-13 21:44:08.150725 mds.cloud1-1473 [WRN] slow request 34.829662
> seconds old, received at 2018-10-13 21:43:33.321031:
> client_request(client.216121228:929114 lookup #0x1/.active.lock
> 2018-10-13 21:43:33.321594 caller_uid=0, caller_gid=0{}) currently
> failed to rdlock, waiting
>
> The relevant OSDs are bluestore again running at 100% I/O:
>
> iostat shows:
> sdi  77,00  0,00  580,00  97,00  511032,00  972,00  1512,57
> 14,88  22,05  24,57  6,97  1,48  100,00
>
> so it is reading at 500 MB/s, which completely saturates the OSD, and it
> does so for > 10 minutes.
>
> Greets,
> Stefan
>
> On 13.10.2018 at 21:29, Stefan Priebe - Profihost AG wrote:
> >
> > osd.19 is a bluestore OSD on a healthy 2TB SSD.
> >
> > Log of osd.19 is here:
> > https://pastebin.com/raw/6DWwhS0A
> >
> > On 13.10.2018 at 21:20, Stefan Priebe - Profihost AG wrote:
> >> Hi David,
> >>
> >> i think this should be the problem - form a new log from today:
> >>
> >> 2018-10-13 20:57:20.367326 mon.a [WRN] Health check update: 4 osds down
> >> (OSD_DOWN)
> >> ...
> >> 2018-10-13 20:57:41.268674 mon.a [WRN] Health check update: Reduced data
> >> availability: 3 pgs peering (PG_AVAILABILITY)
> >> ...
> >> 2018-10-13 20:58:08.684451 mon.a [WRN] Health check failed: 1 osds down
> >> (OSD_DOWN)
> >> ...
> >> 2018-10-13 20:58:22.841210 mon.a [WRN] Health check failed: Reduced data
> >> availability: 8 pgs inactive (PG_AVAILABILITY)
> >> 
> >> 2018-10-13 20:58:47.570017 mon.a [WRN] Health check update: Reduced data
> >> availability: 5 pgs inactive (PG_AVAILABILITY)
> >> ...
> >> 2018-10-13 20:58:49.142108 osd.19 [WRN] Monitor daemon marked osd.19
> >> down, but it is still running
> >> 2018-10-13 20:58:53.750164 mon.a [WRN] Health check update: Reduced data
> >> availability: 3 pgs inactive (PG_AVAILABILITY)
> >> ...
> >>
> >> so there is a timeframe of > 90s where PGs are inactive and unavailable;
> >> this would at least explain the stalled I/O to me?
> >>
> >> Greets,
> >> Stefan
> >>
> >>
> >> Am 12.10.2018 um 15:59 schrieb David Turner:
> >>> The PGs per OSD does not change unless the OSDs are marked out.  You
> >>> have noout set, so that doesn't change at all during this test.  All of
> >>> your PGs peered quickly at the beginning and then were
> active+undersized
> >>> the rest of the time, you never had any blocked requests, and you
> always
> >>> had 100MB/s+ client IO.  I didn't see anything wrong with your cluster
> >>> to indicate that your clients had any problems whatsoever accessing
> data.
> >>>
> >>> Can you confirm that you saw the same problems while you were running
> >>> those commands?  The next thing would seem that possibly a client isn't
> >>> getting an updated OSD map to indicate that the host and its OSDs are
> >>> down and it's stuck trying to communicate with host7.  That would
> >>> indicate a potential problem with the client being unable to
> communicate
> >>> with the Mons maybe?  Have you completely ruled out any network
> problems
> >>> between all nodes and all of the IPs in the cluster.  What does your
> >>> client log show during these times?
> >>>
> >>> On Fri, Oct 12, 2018 at 8:35 AM Nils Fahldieck - Profihost AG wrote:
> >>>
> >>> Hi, in our `ceph.conf` we have:
> >>>
> >>>   mon_max_pg_per_osd = 300
> >>>
> >>> While the host is offline (9 OSDs down):
> >>>
> >>>   4352 PGs * 3 / 62 OSDs ~ 210 PGs per OSD
> >>>
> >>> If all OSDs are online:
> >>>
> >>>   4352 PGs * 3 / 71 OSDs ~ 183 PGs per OSD
> >>>
> >>> ... so this doesn't seem to be the issue.
> >>>
> >>> If I understood you right, that's what you meant. If I got you wrong,
> >>> would you mind pointing me to one of those threads you mentioned?
> >>>
> >>> Thanks :)
> >>>
> >>> > On 12.10.2018 at 14:03, Burkhard Linke wrote:
> >>> > Hi,
> >>> >
> >>> >
> >>> > On 10/12/2018 01:55 PM, Nils Fahldieck - Profihost AG wrote:
> >>> >> I rebooted a Ceph host and logged `ceph status` & `ceph health
> >>> detail`
> >>> >> every 5 seconds. During this I encountered 'PG_AVAILABILITY
> >>> Reduced data
> >>> >> availability: pgs peering'. At the same time some VMs hung as
> >>> described
> >>> >> before.
> >>> >
> >>> > Just a wild guess... you have 71