Re: [ceph-users] WAL/DB size

2019-08-14 Thread Janne Johansson
On Thu, 15 Aug 2019 at 00:16, Anthony D'Atri wrote:

> Good points in both posts, but I think there’s still some unclarity.
>

...


> We’ve seen good explanations on the list of why only specific DB sizes,
> say 30GB, are actually used _for the DB_.
> If the WAL goes along with the DB, shouldn’t we also explicitly determine
> an appropriate size N for the WAL, and make the partition (30+N) GB?
> If so, how do we derive N?  Or is it a constant?
>
> Filestore was so much simpler, 10GB set+forget for the journal.  Not that
> I miss XFS, mind you.
>

But we did get a simple, hand-waving, best-effort guesstimate that went "WAL 1GB
is fine, yes," so there you have an N you can use for the
30+N or 60+N sizings.
I can't see how that N needs more science than the filestore N=10G you
showed. Not that I think journal=10G was wrong or anything.
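
For what it's worth, a minimal sketch of how that 30+N sizing could be applied
when provisioning an OSD with ceph-volume (device and VG names below are
hypothetical, and 30+1 GB is just the guesstimate from this thread):

# Carve one LV on the NVMe that holds both DB and WAL (30 GB DB + ~1 GB of WAL headroom).
DB_GB=30; WAL_GB=1
lvcreate -L "$((DB_GB + WAL_GB))G" -n osd0-db nvme-vg

# Point the OSD's block.db at it; with no separate --block.wal, the WAL lands on the same LV.
ceph-volume lvm create --bluestore --data /dev/sdb --block.db nvme-vg/osd0-db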

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph capacity versus pool replicated size discrepancy?

2019-08-14 Thread Konstantin Shalygin

On 8/14/19 6:19 PM, Kenneth Van Alstyne wrote:
Got it!  I can calculate individual clone usage using “rbd du”, but does
anything exist to show total clone usage across the pool? Otherwise it looks
like phantom space is just missing.


rbd du for each snapshot, I think...
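
If it helps, a rough sketch for adding that up across a whole pool (the JSON
field names here are assumptions from memory -- check the actual
"rbd du --format json" output on your release, and jq must be installed):

# Sum the "used" bytes reported by rbd du for every image (snapshots included) in a pool.
pool=rbd
rbd du -p "$pool" --format json | jq '[.images[].used_size] | add'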




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] reproducible rbd-nbd crashes

2019-08-14 Thread Mike Christie
On 08/14/2019 02:09 PM, Mike Christie wrote:
> On 08/14/2019 07:35 AM, Marc Schöchlin wrote:
>>>> 3. I wonder if we are hitting a bug with PF_MEMALLOC Ilya hit with krbd.
>>>> He removed that code from the krbd. I will ping him on that.
>>
>> Interesting. I activated Coredumps for that processes - probably we can
>> find something interesting here...
>>
> 
> Can you replicate the problem with timeout=0 on a 4.4 kernel (ceph
> version does not matter as long as it's known to hit the problem). When
> you start to see IO hang and it gets jammed up can you do:
> 
> dmesg -c; echo w >/proc/sysrq-trigger; dmesg -c >waiting-tasks.txt
> 
> and give me the waiting-tasks.txt so I can check if we are stuck in the
> kernel waiting for memory.

Don't waste your time. I found a way to replicate it now.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] WAL/DB size

2019-08-14 Thread Anthony D'Atri
Good points in both posts, but I think there’s still some unclarity.

Absolutely let’s talk about DB and WAL together.  By “bluestore goes on flash” 
I assume you mean WAL+DB?

“Simply allocate DB and WAL will appear there automatically”

Forgive me please if this is obvious, but I’d like to see a holistic 
explanation of WAL and DB sizing *together*, which I think would help folks put 
these concepts together and plan deployments with some sense of confidence.

We’ve seen good explanations on the list of why only specific DB sizes, say 
30GB, are actually used _for the DB_.
If the WAL goes along with the DB, shouldn’t we also explicitly determine an 
appropriate size N for the WAL, and make the partition (30+N) GB?
If so, how do we derive N?  Or is it a constant?

Filestore was so much simpler, 10GB set+forget for the journal.  Not that I 
miss XFS, mind you.


>> Actually a standalone WAL is required when you have either a very small fast
>> device (and don't want the db to use it) or three devices (different in
>> performance) behind an OSD (e.g. hdd, ssd, nvme). So the WAL is to be located
>> at the fastest one.
>> 
>> For the given use case you just have HDD and NVMe, and DB and WAL can
>> safely collocate. Which means you don't need to allocate a specific volume
>> for the WAL. Hence no need to answer the question of how much space is needed
>> for the WAL. Simply allocate the DB and the WAL will appear there automatically.
>> 
>> 
> Yes, I'm surprised how often people talk about the DB and WAL separately
> for no good reason.  In common setups bluestore goes on flash and the
> storage goes on the HDDs, simple.
> 
> In the event flash is 100s of GB and would be wasted, is there anything
> that needs to be done to set RocksDB to use the highest level?  600, I
> believe



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] reproducible rbd-nbd crashes

2019-08-14 Thread Mike Christie
On 08/14/2019 07:35 AM, Marc Schöchlin wrote:
>>> 3. I wonder if we are hitting a bug with PF_MEMALLOC Ilya hit with krbd.
>>> He removed that code from the krbd. I will ping him on that.
> 
> Interesting. I activated Coredumps for that processes - probably we can
> find something interesting here...
> 

Can you replicate the problem with timeout=0 on a 4.4 kernel (ceph
version does not matter as long as it's known to hit the problem). When
you start to see IO hang and it gets jammed up can you do:

dmesg -c; echo w >/proc/sysrq-trigger; dmesg -c >waiting-tasks.txt

and give me the waiting-tasks.txt so I can check if we are stuck in the
kernel waiting for memory.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph device list empty

2019-08-14 Thread Gary Molenkamp
I've had no luck tracing this down.  I've tried setting debugging and 
log channels to try to find what is failing, with no success.

With debug_mgr at 20/20, the logs will show:
        log_channel(audit) log [DBG] : from='client.10424012 -' 
entity='client.admin' cmd=[{"prefix": "device ls", "target": ["mgr", 
""]}]: dispatch
but I don't see anything further.

Interestingly, when using "ceph device ls-by-daemon" I see this in the logs:
0 log_channel(audit) log [DBG] : from='client.10345413 -' 
entity='client.admin' cmd=[{"prefix": "device ls-by-daemon", "who": 
"osd.0", "target": ["mgr", ""]}]: dispatch
-1 mgr.server reply reply (22) Invalid argument No handler found for 
'device ls-by-daemon'
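
For anyone else chasing the same symptom, a hedged sketch of mgr-side checks
(the devicehealth module name is my assumption for Nautilus, so verify it):

# Check that all mgr daemons are really running the Nautilus code -- a stale,
# not-yet-restarted mgr would explain "No handler found" for the newer device commands.
ceph versions

# Confirm the devicehealth mgr module is present/enabled; as far as I know it is
# what serves "ceph device ..." on Nautilus.
ceph mgr module ls | grep -i device
ceph mgr module enable devicehealth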


Gary.




On 2019-08-07 11:20 a.m., Gary Molenkamp wrote:
> I'm testing an upgrade to Nautilus on a development cluster and the
> command "ceph device ls" is returning an empty list.
>
> # ceph device ls
> DEVICE HOST:DEV DAEMONS LIFE EXPECTANCY
> #
>
> I have walked through the luminous upgrade documentation under
> https://docs.ceph.com/docs/master/releases/nautilus/#upgrading-from-mimic-or-luminous
> but I don't see anything pertaining to "activating" device support under
> Nautilus.
>
> The devices are visible to ceph-volume on the OSS nodes.  ie:
>
> osdev-stor1 ~]# ceph-volume lvm list
> == osd.0 ===
>     [block]
> /dev/ceph-f5eb16ec-7074-477b-8f83-ce87c5f74fa3/osd-block-c1de464f-d838-4558-ba75-1c268e538d6b
>
>     block device
> /dev/ceph-f5eb16ec-7074-477b-8f83-ce87c5f74fa3/osd-block-c1de464f-d838-4558-ba75-1c268e538d6b
>     block uuid dlbIm6-H5za-001b-C3mQ-EGks-yoed-zoQpoo
> 
>     devices   /dev/sdb
>
> == osd.2 ===
>     [block]
> /dev/ceph-37145a74-6b2b-4519-b72e-2defe11732aa/osd-block-e06c513b-5af3-4bf6-927f-1f0142c59e8a
>     block device
> /dev/ceph-37145a74-6b2b-4519-b72e-2defe11732aa/osd-block-e06c513b-5af3-4bf6-927f-1f0142c59e8a
>     block uuid egdvpm-3bXx-xmNO-ACzp-nxax-Wka2-81rfNT
> 
>     devices   /dev/sdc
>
> Is there a step I missed?
> Thanks.
>
> Gary.
>
>
>

-- 
Gary Molenkamp  Computer Science/Science Technology Services
Systems Administrator   University of Western Ontario
molen...@uwo.ca http://www.csd.uwo.ca
(519) 661-2111 x86882   (519) 661-3566

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New Cluster Failing to Start (Resolved)

2019-08-14 Thread DHilsbos
All;

We found the problem, we had the v2 ports incorrect in the monmap.
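
For anyone who hits the same thing, a hedged sketch of inspecting and rewriting
the monmap addresses (mon name and IP taken from the log below; the default
msgr2 port is 3300, and the exact monmaptool flags should be verified against
your release):

# With the mon stopped, dump its current monmap and inspect the recorded v1/v2 addresses.
ceph-mon -i s700034 --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap

# Remove the entry carrying the wrong v2 port, re-add it with the right one, then inject it back.
monmaptool --rm s700034 /tmp/monmap
monmaptool --addv s700034 '[v2:10.0.80.10:3300,v1:10.0.80.10:6789]' /tmp/monmap
ceph-mon -i s700034 --inject-monmap /tmp/monmap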

Thank you,

Dominic L. Hilsbos, MBA 
Director - Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com



-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
dhils...@performair.com
Sent: Wednesday, August 14, 2019 10:13 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] New Cluster Failing to Start

All;

We're working to deploy our first production Ceph cluster, and we've run into a 
snag.

The MONs start, but the "cluster" doesn't appear to come up.  "ceph -s" never 
returns.

These are the last lines in the event log of one of the mons:
2019-08-13 16:20:03.706 7f668108f180  0 starting mon.s700034 rank 0 at public 
addrs [v2:10.0.80.10:3330/0,v1:10.0.80.10:6789/0] at bind addrs 
[v2:10.0.80.10:3330/0,v1:10.0.80.10:6789/0] mon_data 
/var/lib/ceph/mon/ceph-s700034 fsid effc5134-e0cc-4628-a079-d67b60071f90
2019-08-13 16:20:03.709 7f668108f180  1 mon.s700034@-1(???) e0 preinit fsid 
effc5134-e0cc-4628-a079-d67b60071f90
2019-08-13 16:20:03.709 7f668108f180  1 mon.s700034@-1(???) e0  initial_members 
s700034,s700035,s700036, filtering seed monmap
2019-08-13 16:20:03.713 7f668108f180  0 mon.s700034@-1(probing) e0  my rank is 
now 0 (was -1)

Aside from the address and hostname, the others logs end with the same 
statements.

I'm not seeing the log entries that I would expect as each MON joins the 
cluster, nor am I seeing the "cluster" log files being generated (i.e. I'm used 
to seeing ceph.log, and ceph-audit.log on one of the MONs).

Each machine can ping the others.  Firewall rules are in place for ports 330 & 
6789.

Any idea what I'm missing?

Thank you,

Dominic L. Hilsbos, MBA 
Director - Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] WAL/DB size

2019-08-14 Thread Mark Nelson


On 8/14/19 1:06 PM, solarflow99 wrote:


Actually a standalone WAL is required when you have either a very small fast
device (and don't want the db to use it) or three devices (different in
performance) behind an OSD (e.g. hdd, ssd, nvme). So the WAL is to be
located at the fastest one.

For the given use case you just have HDD and NVMe, and DB and WAL can
safely collocate. Which means you don't need to allocate a specific volume
for the WAL. Hence no need to answer the question of how much space is
needed for the WAL. Simply allocate the DB and the WAL will appear there
automatically.


Yes, I'm surprised how often people talk about the DB and WAL 
separately for no good reason.  In common setups bluestore goes on 
flash and the storage goes on the HDDs, simple.


In the event flash is 100s of GB and would be wasted, is there 
anything that needs to be done to set RocksDB to use the highest 
level?  600, I believe






When you first set up the OSD you could manually tweak the level 
sizes/multipliers so that one of the level boundaries + WAL falls 
somewhat under the total allocated size of the DB device.  Keep in mind 
that there can be temporary space usage increases due to compaction.  
Ultimately though I think this is a bad approach. The better bet is the 
work that Igor and Adam are doing:



https://github.com/ceph/ceph/pull/28960

https://github.com/ceph/ceph/pull/29047
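
As a rough illustration of the manual tweak described above (values are
placeholders only, and the rest of the default bluestore_rocksdb_options
string should be preserved rather than replaced wholesale -- this is a sketch,
not a recommendation):

# Must be in effect before the OSD is created; shrink the level base/multiplier so that
# a level boundary (plus WAL) lands under the DB partition size.
ceph config set osd bluestore_rocksdb_options \
  "compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,max_bytes_for_level_base=268435456,max_bytes_for_level_multiplier=10"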


Mark

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS meltdown fallout: mds assert failure, kernel oopses

2019-08-14 Thread Jeff Layton
On Wed, 2019-08-14 at 19:29 +0200, Ilya Dryomov wrote:
> On Tue, Aug 13, 2019 at 1:06 PM Hector Martin  wrote:
> > I just had a minor CephFS meltdown caused by underprovisioned RAM on the
> > MDS servers. This is a CephFS with two ranks; I manually failed over the
> > first rank and the new MDS server ran out of RAM in the rejoin phase
> > (ceph-mds didn't get OOM-killed, but I think things slowed down enough
> > due to swapping out that something timed out). This happened 4 times,
> > with the rank bouncing between two MDS servers, until I brought up an
> > MDS on a bigger machine.
> > 
> > The new MDS managed to become active, but then crashed with an assert:
> > 
> > 2019-08-13 16:03:37.346 7fd4578b2700  1 mds.0.1164 clientreplay_done
> > 2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.mon02 Updating MDS map to
> > version 1239 from mon.1
> > 2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.0.1164 handle_mds_map i am
> > now mds.0.1164
> > 2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.0.1164 handle_mds_map state
> > change up:clientreplay --> up:active
> > 2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.0.1164 active_start
> > 2019-08-13 16:03:37.690 7fd45e2a7700  1 mds.0.1164 cluster recovered.
> > 2019-08-13 16:03:45.130 7fd45e2a7700  1 mds.mon02 Updating MDS map to
> > version 1240 from mon.1
> > 2019-08-13 16:03:46.162 7fd45e2a7700  1 mds.mon02 Updating MDS map to
> > version 1241 from mon.1
> > 2019-08-13 16:03:50.286 7fd4578b2700 -1
> > /build/ceph-13.2.6/src/mds/MDCache.cc: In function 'void
> > MDCache::remove_inode(CInode*)' thread 7fd4578b2700 time 2019-08-13
> > 16:03:50.279463
> > /build/ceph-13.2.6/src/mds/MDCache.cc: 361: FAILED
> > assert(o->get_num_ref() == 0)
> > 
> >   ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic
> > (stable)
> >   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x14e) [0x7fd46650eb5e]
> >   2: (()+0x2c4cb7) [0x7fd46650ecb7]
> >   3: (MDCache::remove_inode(CInode*)+0x59d) [0x55f423d6992d]
> >   4: (StrayManager::_purge_stray_logged(CDentry*, unsigned long,
> > LogSegment*)+0x1f2) [0x55f423dc7192]
> >   5: (MDSIOContextBase::complete(int)+0x11d) [0x55f423ed42bd]
> >   6: (MDSLogContextBase::complete(int)+0x40) [0x55f423ed4430]
> >   7: (Finisher::finisher_thread_entry()+0x135) [0x7fd46650d0a5]
> >   8: (()+0x76db) [0x7fd465dc26db]
> >   9: (clone()+0x3f) [0x7fd464fa888f]
> > 
> > Thankfully this didn't happen on a subsequent attempt, and I got the
> > filesystem happy again.
> > 
> > At this point, of the 4 kernel clients actively using the filesystem, 3
> > had gone into a strange state (can't SSH in, partial service). Here is a
> > kernel log from one of the hosts (the other two were similar):
> > https://mrcn.st/p/ezrhr1qR
> > 
> > After playing some service failover games and hard rebooting the three
> > affected client boxes everything seems to be fine. The remaining FS
> > client box had no kernel errors (other than blocked task warnings and
> > cephfs talking about reconnections and such) and seems to be fine.
> > 
> > I can't find these errors anywhere, so I'm guessing they're not known bugs?
> 
> Jeff, the oops seems to be a NULL dereference in ceph_lock_message().
> Please take a look.
> 

(sorry for duplicate mail -- the other one ended up in moderation)

Thanks Ilya,

That function is pretty straightforward. We don't do a whole lot of
pointer chasing in there, so I'm a little unclear on where this would
have crashed. Right offhand, that kernel is probably missing
1b52931ca9b5b87 (ceph: remove duplicated filelock ref increase), but
that seems unlikely to result in an oops.

Hector, if you have the debuginfo for this kernel installed on one of
these machines, could you run gdb against the ceph.ko module and then
do:

 gdb> list *(ceph_lock_message+0x212)

That may give me a better hint as to what went wrong.
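
For the mechanics of that, something like the following should do (a sketch --
paths and debuginfo packaging differ per distro):

# Locate the ceph kernel module and open it in gdb with its debug symbols, then resolve
# the faulting offset reported in the oops.
modpath="$(modinfo -n ceph)"
gdb "$modpath"
# at the (gdb) prompt:
#   list *(ceph_lock_message+0x212)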

Thanks,
-- 
Jeff Layton 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] WAL/DB size

2019-08-14 Thread solarflow99
> Actually a standalone WAL is required when you have either a very small fast
> device (and don't want the db to use it) or three devices (different in
> performance) behind an OSD (e.g. hdd, ssd, nvme). So the WAL is to be located
> at the fastest one.
>
> For the given use case you just have HDD and NVMe, and DB and WAL can
> safely collocate. Which means you don't need to allocate a specific volume
> for the WAL. Hence no need to answer the question of how much space is needed
> for the WAL. Simply allocate the DB and the WAL will appear there automatically.
>
>
Yes, I'm surprised how often people talk about the DB and WAL separately
for no good reason.  In common setups bluestore goes on flash and the
storage goes on the HDDs, simple.

In the event flash is 100s of GB and would be wasted, is there anything
that needs to be done to set RocksDB to use the highest level?  600, I
believe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS meltdown fallout: mds assert failure, kernel oopses

2019-08-14 Thread Ilya Dryomov
On Tue, Aug 13, 2019 at 1:06 PM Hector Martin  wrote:
>
> I just had a minor CephFS meltdown caused by underprovisioned RAM on the
> MDS servers. This is a CephFS with two ranks; I manually failed over the
> first rank and the new MDS server ran out of RAM in the rejoin phase
> (ceph-mds didn't get OOM-killed, but I think things slowed down enough
> due to swapping out that something timed out). This happened 4 times,
> with the rank bouncing between two MDS servers, until I brought up an
> MDS on a bigger machine.
>
> The new MDS managed to become active, but then crashed with an assert:
>
> 2019-08-13 16:03:37.346 7fd4578b2700  1 mds.0.1164 clientreplay_done
> 2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.mon02 Updating MDS map to
> version 1239 from mon.1
> 2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.0.1164 handle_mds_map i am
> now mds.0.1164
> 2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.0.1164 handle_mds_map state
> change up:clientreplay --> up:active
> 2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.0.1164 active_start
> 2019-08-13 16:03:37.690 7fd45e2a7700  1 mds.0.1164 cluster recovered.
> 2019-08-13 16:03:45.130 7fd45e2a7700  1 mds.mon02 Updating MDS map to
> version 1240 from mon.1
> 2019-08-13 16:03:46.162 7fd45e2a7700  1 mds.mon02 Updating MDS map to
> version 1241 from mon.1
> 2019-08-13 16:03:50.286 7fd4578b2700 -1
> /build/ceph-13.2.6/src/mds/MDCache.cc: In function 'void
> MDCache::remove_inode(CInode*)' thread 7fd4578b2700 time 2019-08-13
> 16:03:50.279463
> /build/ceph-13.2.6/src/mds/MDCache.cc: 361: FAILED
> assert(o->get_num_ref() == 0)
>
>   ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic
> (stable)
>   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x14e) [0x7fd46650eb5e]
>   2: (()+0x2c4cb7) [0x7fd46650ecb7]
>   3: (MDCache::remove_inode(CInode*)+0x59d) [0x55f423d6992d]
>   4: (StrayManager::_purge_stray_logged(CDentry*, unsigned long,
> LogSegment*)+0x1f2) [0x55f423dc7192]
>   5: (MDSIOContextBase::complete(int)+0x11d) [0x55f423ed42bd]
>   6: (MDSLogContextBase::complete(int)+0x40) [0x55f423ed4430]
>   7: (Finisher::finisher_thread_entry()+0x135) [0x7fd46650d0a5]
>   8: (()+0x76db) [0x7fd465dc26db]
>   9: (clone()+0x3f) [0x7fd464fa888f]
>
> Thankfully this didn't happen on a subsequent attempt, and I got the
> filesystem happy again.
>
> At this point, of the 4 kernel clients actively using the filesystem, 3
> had gone into a strange state (can't SSH in, partial service). Here is a
> kernel log from one of the hosts (the other two were similar):
> https://mrcn.st/p/ezrhr1qR
>
> After playing some service failover games and hard rebooting the three
> affected client boxes everything seems to be fine. The remaining FS
> client box had no kernel errors (other than blocked task warnings and
> cephfs talking about reconnections and such) and seems to be fine.
>
> I can't find these errors anywhere, so I'm guessing they're not known bugs?

Jeff, the oops seems to be a NULL dereference in ceph_lock_message().
Please take a look.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] New Cluster Failing to Start

2019-08-14 Thread DHilsbos
All;

We're working to deploy our first production Ceph cluster, and we've run into a 
snag.

The MONs start, but the "cluster" doesn't appear to come up.  "ceph -s" never 
returns.

These are the last lines in the event log of one of the mons:
2019-08-13 16:20:03.706 7f668108f180  0 starting mon.s700034 rank 0 at public 
addrs [v2:10.0.80.10:3330/0,v1:10.0.80.10:6789/0] at bind addrs 
[v2:10.0.80.10:3330/0,v1:10.0.80.10:6789/0] mon_data 
/var/lib/ceph/mon/ceph-s700034 fsid effc5134-e0cc-4628-a079-d67b60071f90
2019-08-13 16:20:03.709 7f668108f180  1 mon.s700034@-1(???) e0 preinit fsid 
effc5134-e0cc-4628-a079-d67b60071f90
2019-08-13 16:20:03.709 7f668108f180  1 mon.s700034@-1(???) e0  initial_members 
s700034,s700035,s700036, filtering seed monmap
2019-08-13 16:20:03.713 7f668108f180  0 mon.s700034@-1(probing) e0  my rank is 
now 0 (was -1)

Aside from the address and hostname, the others logs end with the same 
statements.

I'm not seeing the log entries that I would expect as each MON joins the 
cluster, nor am I seeing the "cluster" log files being generated (i.e. I'm used 
to seeing ceph.log, and ceph-audit.log on one of the MONs).

Each machine can ping the others.  Firewall rules are in place for ports 330 & 
6789.

Any idea what I'm missing?

Thank you,

Dominic L. Hilsbos, MBA 
Director - Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS corruption

2019-08-14 Thread ☣Adam
I was able to get this resolved, thanks again to Pierre Dittes!

The reason the recovery did not work the first time I tried it was
because I still had the filesystem mounted (or at least attempted to
have it mounted).  This was causing sessions to be active.  After
rebooting all the machines which were attempting to mount cephfs, the
recovery steps worked.

For posterity, here are the exact commands:
cephfs-journal-tool --rank cephfs:all event recover_dentries summary
cephfs-journal-tool --rank cephfs:all journal reset
cephfs-table-tool all reset session
cephfs-table-tool all reset snap
cephfs-table-tool all reset inodes

After that the MDS came up properly and I was able to mount the
filesystem again.
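
Before fully trusting the tree again, a hedged follow-up that may be worth
running (command addressing differs a bit between releases, so treat this as a
sketch for Mimic; the MDS name is the one from earlier in this thread):

# MDS daemon name as shown by "ceph fs status" / "ceph mds stat".
MDS=ge.hax0rbana.org

# List any metadata damage the MDS has already flagged.
ceph tell mds."$MDS" damage ls

# Forward-scrub the whole tree and repair what can be repaired (admin socket, run on the MDS host).
ceph daemon mds."$MDS" scrub_path / recursive repair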

Doing a backup of the journal reveals that there is still some
filesystem corruption, so I'm going to take Pierre's advice and copy all
the data off ceph, destroy and re-create the filesystem, and then put it
back.

The backup command and output showing there's still an error:
cephfs-journal-tool --rank cephfs:all journal export
ceph-mds-backup-`date '+%Y.%m.%d'`.bin
2019-08-14 11:10:24.237 7f7b2305b7c0 -1 Bad entry start ptr
(0x11440) at 0x11401956e
journal is 4630511616~103790
wrote 103790 bytes at offset 4630511616 to ceph-mds-backup-2019.08.14.bin



On 8/13/19 8:33 AM, Yan, Zheng wrote:
> The nautilus version (14.2.2) of ‘cephfs-data-scan scan_links’ can fix the
> snaptable. Hopefully it will fix your issue.
> 
> You don't need to upgrade the whole cluster. Just install nautilus on a
> temp machine or compile ceph from source.
> 
> 
> 
> On Tue, Aug 13, 2019 at 2:35 PM Adam  wrote:
>>
>> Pierre Dittes helped me with adding --rank=yourfsname:all and I ran the
>> following steps from the disaster recovery page: journal export, dentry
>> recovery, journal truncation, mds table wipes (session, snap and inode),
>> scan_extents, scan_inodes, scan_links, and cleanup.
>>
>> Now all three of my MDS servers are crashing due to a failed assert.
>> Logs with stacktrace are included (the other two servers have the same
>> stacktrace in their logs).
>>
>> Currently I can't mount cephfs (which makes sense since there aren't any
>> MDS services up for more than a few minutes before they crash).  Any
>> suggestions on next steps to troubleshoot/fix this?
>>
>> Hopefully there's some way to recover from this and I don't have to tell
>> my users that I lost all the data and we need to go back to the backups.
>>  It shouldn't be a huge problem if we do, but it'll lose a lot of
>> confidence in ceph and its ability to keep data safe.
>>
>> Thanks,
>> Adam
>>
>> On 8/8/19 3:31 PM, Adam wrote:
>>> I had a machine with insufficient memory and it seems to have corrupted
>>> data on my MDS.  The filesystem seems to be working fine, with the
>>> exception of accessing specific files.
>>>
>>> The ceph-mds logs include things like:
>>> mds.0.1596621 unhandled write error (2) No such file or directory, force
>>> readonly...
>>> dir 0x100fb03 object missing on disk; some files may be lost
>>> (/adam/programming/bash)
>>>
>>> I'm using mimic and trying to follow the instructions here:
>>> https://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
>>>
>>> The punchline is this:
>>> cephfs-journal-tool --rank all journal export backup.bin
>>> Error ((22) Invalid argument)
>>> 2019-08-08 20:02:39.847 7f06827537c0 -1 main: Couldn't determine MDS rank.
>>>
>>> I have a backup (outside of ceph) of all data which is inaccessible and
>>> I can back anything which is accessible if need be.  There's some more
>>> information below, but my main question is: what are my next steps?
>>>
>>> On a side note, I'd like to get involved with helping with documentation
>>> (man pages, the ceph website, usage text, etc). Where can I get started?
>>>
>>>
>>>
>>> Here's the context:
>>>
>>> cephfs-journal-tool event recover_dentries summary
>>> Error ((22) Invalid argument)
>>> 2019-08-08 19:50:04.798 7f21f4ffe7c0 -1 main: missing mandatory "--rank"
>>> argument
>>>
>>> Seems like a bug in the documentation since `--rank` is a "mandatory
>>> option" according to the help text.  It looks like the rank of this node
>>> for MDS is 0, based on `ceph health detail`, but using `--rank 0` or
>>> `--rank all` doesn't work either:
>>>
>>> ceph health detail
>>> HEALTH_ERR 1 MDSs report damaged metadata; 1 MDSs are read only
>>> MDS_DAMAGE 1 MDSs report damaged metadata
>>> mdsge.hax0rbana.org(mds.0): Metadata damage detected
>>> MDS_READ_ONLY 1 MDSs are read only
>>> mdsge.hax0rbana.org(mds.0): MDS in read-only mode
>>>
>>> cephfs-journal-tool --rank 0 event recover_dentries summary
>>> Error ((22) Invalid argument)
>>> 2019-08-08 19:54:45.583 7f5b37c4c7c0 -1 main: Couldn't determine MDS rank.
>>>
>>>
>>> The only place I've found this error message is in an unanswered
>>> stackoverflow question and in the source code here:
>>> https://github.com/ceph/ceph/blob/master/src/tools/cephfs/JournalTool.cc#L114
>>>
>>> It looks like that is trying to read a filesyst

Re: [ceph-users] Sudden loss of all SSD OSDs in a cluster, immedaite abort on restart [Mimic 13.2.6]

2019-08-14 Thread Troy Ablan

Paul,

Thanks for the reply.  All of these seemed to fail except for pulling 
the osdmap from the live cluster.


-Troy

-[~:#]- ceph-objectstore-tool --op get-osdmap --data-path 
/var/lib/ceph/osd/ceph-45/ --file osdmap45
terminate called after throwing an instance of 
'ceph::buffer::malformed_input'

  what():  buffer::malformed_input: unsupported bucket algorithm: -1
*** Caught signal (Aborted) **
 in thread 7f945ee04f00 thread_name:ceph-objectstor
 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic 
(stable)

 1: (()+0xf5d0) [0x7f94531935d0]
 2: (gsignal()+0x37) [0x7f9451d80207]
 3: (abort()+0x148) [0x7f9451d818f8]
 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f945268f7d5]
 5: (()+0x5e746) [0x7f945268d746]
 6: (()+0x5e773) [0x7f945268d773]
 7: (__cxa_rethrow()+0x49) [0x7f945268d9e9]
 8: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x18b8) 
[0x7f94553218d8]

 9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x4ad) [0x7f94550ff4ad]
 10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f9455101db1]
 11: (get_osdmap(ObjectStore*, unsigned int, OSDMap&, 
ceph::buffer::list&)+0x1d0) [0x55de1f9a6e60]

 12: (main()+0x5340) [0x55de1f8c8870]
 13: (__libc_start_main()+0xf5) [0x7f9451d6c3d5]
 14: (()+0x3adc10) [0x55de1f9a1c10]
Aborted

-[~:#]- ceph-objectstore-tool --op get-osdmap --data-path 
/var/lib/ceph/osd/ceph-46/ --file osdmap46
terminate called after throwing an instance of 
'ceph::buffer::malformed_input'

  what():  buffer::malformed_input: unsupported bucket algorithm: -1
*** Caught signal (Aborted) **
 in thread 7f9ce4135f00 thread_name:ceph-objectstor
 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic 
(stable)

 1: (()+0xf5d0) [0x7f9cd84c45d0]
 2: (gsignal()+0x37) [0x7f9cd70b1207]
 3: (abort()+0x148) [0x7f9cd70b28f8]
 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f9cd79c07d5]
 5: (()+0x5e746) [0x7f9cd79be746]
 6: (()+0x5e773) [0x7f9cd79be773]
 7: (__cxa_rethrow()+0x49) [0x7f9cd79be9e9]
 8: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x18b8) 
[0x7f9cda6528d8]

 9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x4ad) [0x7f9cda4304ad]
 10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f9cda432db1]
 11: (get_osdmap(ObjectStore*, unsigned int, OSDMap&, 
ceph::buffer::list&)+0x1d0) [0x55cea26c8e60]

 12: (main()+0x5340) [0x55cea25ea870]
 13: (__libc_start_main()+0xf5) [0x7f9cd709d3d5]
 14: (()+0x3adc10) [0x55cea26c3c10]
Aborted

-[~:#]- ceph osd getmap -o osdmap
got osdmap epoch 81298

-[~:#]- ceph-objectstore-tool --op set-osdmap --data-path 
/var/lib/ceph/osd/ceph-46/ --file osdmap

osdmap (#-1:92f679f2:::osdmap.81298:0#) does not exist.

-[~:#]- ceph-objectstore-tool --op set-osdmap --data-path 
/var/lib/ceph/osd/ceph-45/ --file osdmap

osdmap (#-1:92f679f2:::osdmap.81298:0#) does not exist.
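
Two hedged notes in case they help anyone following along: the map pulled from
the live cluster can be sanity-checked offline first, and some
ceph-objectstore-tool builds only (re)create a missing osdmap object when
forced -- the --force flag below is an assumption to verify against your
build's help output:

# Sanity-check the map retrieved with "ceph osd getmap" before injecting it anywhere.
osdmaptool --print osdmap | head -20

# Retry the injection, forcing creation of the missing osdmap object.
ceph-objectstore-tool --op set-osdmap --data-path /var/lib/ceph/osd/ceph-45/ --file osdmap --force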



On 8/14/19 2:54 AM, Paul Emmerich wrote:

Starting point to debug/fix this would be to extract the osdmap from
one of the dead OSDs:

ceph-objectstore-tool --op get-osdmap --data-path /var/lib/ceph/osd/...

Then try to run osdmaptool on that osdmap to see if it also crashes,
set some --debug options (don't know which one off the top of my
head).
Does it also crash? How does it differ from the map retrieved with
"ceph osd getmap"?

You can also set the osdmap with "--op set-osdmap", does it help to
set the osdmap retrieved by "ceph osd getmap"?

Paul


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question to developers about iscsi

2019-08-14 Thread Fyodor Ustinov
Hi!

As I understand it, the iSCSI gateway is part of Ceph.

Documentation says:
Note
The iSCSI management functionality of Ceph Dashboard depends on the latest 
version 3 of the ceph-iscsi project. Make sure that your operating system 
provides the correct version, otherwise the dashboard won’t enable the 
management features. 

Questions:
Where can I download a ready-to-install deb package of the version 3 ceph-iscsi 
project? An rpm package?
Why "version 3"? Why not "14.2.2"?

Or is ceph-iscsi not part of Ceph?

WBR,
Fyodor.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Canonical Livepatch broke CephFS client

2019-08-14 Thread Ilya Dryomov
On Wed, Aug 14, 2019 at 1:54 PM Tim Bishop  wrote:
>
> On Wed, Aug 14, 2019 at 12:44:15PM +0200, Ilya Dryomov wrote:
> > On Tue, Aug 13, 2019 at 10:56 PM Tim Bishop  wrote:
> > > This email is mostly a heads up for others who might be using
> > > Canonical's livepatch on Ubuntu on a CephFS client.
> > >
> > > I have an Ubuntu 18.04 client with the standard kernel currently at
> > > version linux-image-4.15.0-54-generic 4.15.0-54.58. CephFS is mounted
> > > with the kernel client. Cluster is running mimic 13.2.6. I've got
> > > livepatch running and this evening it did an update:
> > >
> > > Aug 13 17:33:55 myclient canonical-livepatch[2396]: Client.Check
> > > Aug 13 17:33:55 myclient canonical-livepatch[2396]: Checking with 
> > > livepatch service.
> > > Aug 13 17:33:55 myclient canonical-livepatch[2396]: updating last-check
> > > Aug 13 17:33:55 myclient canonical-livepatch[2396]: touched last check
> > > Aug 13 17:33:56 myclient canonical-livepatch[2396]: Applying update 54.1 
> > > for 4.15.0-54.58-generic
> > > Aug 13 17:33:56 myclient kernel: [3700923.970750] PKCS#7 signature not 
> > > signed with a trusted key
> > > Aug 13 17:33:59 myclient kernel: [3700927.069945] livepatch: enabling 
> > > patch 'lkp_Ubuntu_4_15_0_54_58_generic_54'
> > > Aug 13 17:33:59 myclient kernel: [3700927.154956] livepatch: 
> > > 'lkp_Ubuntu_4_15_0_54_58_generic_54': starting patching transition
> > > Aug 13 17:34:01 myclient kernel: [3700928.994487] livepatch: 
> > > 'lkp_Ubuntu_4_15_0_54_58_generic_54': patching complete
> > > Aug 13 17:34:09 myclient canonical-livepatch[2396]: Applied patch version 
> > > 54.1 to 4.15.0-54.58-generic
> > >
> > > And then immediately I saw:
> > >
> > > Aug 13 17:34:18 myclient kernel: [3700945.728684] libceph: mds0 
> > > 1.2.3.4:6800 socket closed (con state OPEN)
> > > Aug 13 17:34:18 myclient kernel: [3700946.040138] libceph: mds0 
> > > 1.2.3.4:6800 socket closed (con state OPEN)
> > > Aug 13 17:34:19 myclient kernel: [3700947.105692] libceph: mds0 
> > > 1.2.3.4:6800 socket closed (con state OPEN)
> > > Aug 13 17:34:20 myclient kernel: [3700948.033704] libceph: mds0 
> > > 1.2.3.4:6800 socket closed (con state OPEN)
> > >
> > > And on the MDS:
> > >
> > > 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367 Message 
> > > signature does not match contents.
> > > 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367Signature on 
> > > message:
> > > 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367sig: 
> > > 10517606059379971075
> > > 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367Locally 
> > > calculated signature:
> > > 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367 
> > > sig_check:4899837294009305543
> > > 2019-08-13 17:34:18.286 7ff165e75700  0 Signature failed.
> > > 2019-08-13 17:34:18.286 7ff165e75700  0 -- 1.2.3.4:6800/512468759 >> 
> > > 4.3.2.1:0/928333509 conn(0xe6b9500 :6800 >> 
> > > s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2 cs=1 l=0).process >> 
> > > Signature check failed
> > >
> > > Thankfully I was able to umount -f to unfreeze the client, but I have
> > > been unsuccessful remounting the file system using the kernel client.
> > > The fuse client worked fine as a workaround, but is slower.
> > >
> > > Taking a look at livepatch 54.1 I can see it touches Ceph code in the
> > > kernel:
> > >
> > > https://git.launchpad.net/~ubuntu-livepatch/+git/bionic-livepatches/commit/?id=3a3081c1e4c8e2e0f9f7a1ae4204eba5f38fbd29
> > >
> > > But the relevance of those changes isn't immediately clear to me. I
> > > expect after a reboot it'll be fine, but as yet untested.
> >
> > These changes are very relevant.  They introduce support for CEPHX_V2
> > protocol, where message signatures are computed slightly differently:
> > same algorithm but a different set of inputs.  The live-patched kernel
> > likely started signing using CEPHX_V2 without renegotiating.
>
> Ah - thanks for looking. Looks like something that wasn't a security
> issue so shouldn't have been included in the live patch.

Well, strictly speaking it is a security issue because the protocol was
rev'ed in response to two CVEs:

  https://nvd.nist.gov/vuln/detail/CVE-2018-1128
  https://nvd.nist.gov/vuln/detail/CVE-2018-1129

That said, it definitely doesn't qualify for live-patching, especially
when the resulting kernel image is not thoroughly tested.

>
> > This is a good example of how live-patching can go wrong.  A reboot
> > should definitely help.
>
> Yup, it certainly has its tradeoffs (not having to reboot so regularly
> is certainly a positive, though). I've replicated on a test machine and
> confirmed that a reboot does indeed fix the problem.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] strange backfill delay after outing one node

2019-08-14 Thread Simon Oosthoek
On 14/08/2019 10:44, Wido den Hollander wrote:
> 
> 
> On 8/14/19 9:48 AM, Simon Oosthoek wrote:
>> Is it a good idea to give the above commands or other commands to speed
>> up the backfilling? (e.g. like increasing "osd max backfills")
>>
> 
> Yes, as right now the OSDs aren't doing that many backfills. You still
> have a large queue of PGs which need to be backfilled.
> 
> $ ceph tell osd.* config set osd_max_backfills 5
> 
> The default is that only one (1) backfills runs at the same time per
> OSD. By setting it to 5 you speed up the process by increasing the
> concurrency. This will however add load to the system and thus reduce
> the available I/O for clients.
> 

Currently the main user is the backfilling, so that's ok for now ;-)

It seems to reduce the wait time by about 3-5 times, so this will help
us make the changes we need. We can always reduce the max_backfills
later when we have actual users on the system...
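
(And for the record, dialing it back down later is the same one-liner with the
default value:)

# Revert to the default of a single concurrent backfill per OSD once real clients are active.
ceph tell osd.* config set osd_max_backfills 1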

Cheers

/Simon
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] reproducible rbd-nbd crashes

2019-08-14 Thread Marc Schöchlin
Hello Mike,

see my inline comments.

On 14.08.19 at 02:09, Mike Christie wrote:
>>> -
>>> Previous tests crashed in a reproducible manner with "-P 1" (single io 
>>> gzip/gunzip) after a few minutes up to 45 minutes.
>>>
>>> Overview of my tests:
>>>
>>> - SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system, 
>>> 120s device timeout
>>>   -> 18 hour testrun was successful, no dmesg output
>>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s 
>>> device timeout
>>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io 
>>> errors, map/mount can be re-created without reboot
>>>   -> parallel krbd device usage with 99% io usage worked without a problem 
>>> while running the test
>>> - FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s 
>>> device timeout
>>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io 
>>> errors, map/mount can be re-created
>>>   -> parallel krbd device usage with 99% io usage worked without a problem 
>>> while running the test
>>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no 
>>> timeout
>>>   -> failed after < 10 minutes
>>>   -> system runs in a high system load, system is almost unusable, unable 
>>> to shutdown the system, hard reset of vm necessary, manual exclusive lock 
>>> removal is necessary before remapping the device

There is something new compared to yesterday: three days ago I downgraded a 
production system to client version 12.2.5.
Last night this machine also crashed. So it seems that rbd-nbd is broken in 
general with release 12.2.5 as well, and potentially before.

The new (updated) list:

- FAILED: kernel 4.15, ceph 12.2.5, 2TB ec-volume, ext4 file system, 120s 
device timeout
  -> crashed in production while snapshot trimming was running on that pool
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device 
timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, 
map/mount can be re-created without reboot
  -> parallel krbd device usage with 99% io usage worked without a problem 
while running the test
- FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s 
device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, 
map/mount can be re-created
  -> parallel krbd device usage with 99% io usage worked without a problem 
while running the test
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
  -> failed after < 10 minutes
  -> system runs in a high system load, system is almost unusable, unable to 
shutdown the system, hard reset of vm necessary, manual exclusive lock removal 
is necessary before remapping the device
- FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 120s 
device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, 
map/mount can be re-created
- FAILED: kernel 5.0, ceph 12.2.12, 2TB ec-volume, ext4 file system, 120s 
device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, 
map/mount can be re-created
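
For reference, the device timeout these tests refer to is the one passed at map
time -- roughly like this (pool/image names are made up, and on newer releases
the option is spelled --io-timeout):

# Map with a 120s NBD device timeout; omitting the option corresponds to the "no timeout" case.
rbd-nbd map --timeout 120 rbd/archiv-test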

>>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 
>>> 120s device timeout
>>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io 
>>> errors, map/mount can be re-created
>> How many CPUs and how much memory does the VM have?

Characteristics of the crashed VM:

  * Ubuntu 18.04, with kernel 4.15, Ceph Client 12.2.5
  * Services: NFS kernel Server, nothing else
  * Crash behavior:
  o daily Task for snapshot creation/deletion started at 19:00
  o a daily database backup started at 19:00, this created
  + 120 IOPS write, and 1 IOPS read
  + 22K sectors per second write, 0 sectors per second read
  + 97 MBIT inbound and 97 MBIT outbound network usage (nfs server)
  o we had slow requests at the time of the crash
  o rbd-nbd process terminated 25min later without segfault
  o the nfs usage created a 5 min load of 10 from start, 5K context 
switches/sec
  o memory usage (kernel+userspace) was 10% of the system
  o no swap usage
  * ceph.conf
[client]
rbd cache = true
rbd cache size = 67108864
rbd cache max dirty = 33554432
rbd cache target dirty = 25165824
rbd cache max dirty age = 3
rbd readahead max bytes = 4194304
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
  * 4 CPUs
  * 6 GB RAM
  * Non default Sysctl Settings
vm.swappiness = 1
fs.aio-max-nr = 262144
fs.file-max = 100
kernel.pid_max = 4194303
vm.zone_reclaim_mode = 0
kernel.randomize_va_space = 0
kernel.panic = 0
kernel.panic_on_oops = 0


>> I'm not sure which test it covers above, but for
>> test-with-timeout/ceph-client.archiv.log and dmesg-crash it looks like
>> the command that probably triggered the timeout got stuck in safe_write
>> o

Re: [ceph-users] Canonical Livepatch broke CephFS client

2019-08-14 Thread Tim Bishop
On Wed, Aug 14, 2019 at 12:44:15PM +0200, Ilya Dryomov wrote:
> On Tue, Aug 13, 2019 at 10:56 PM Tim Bishop  wrote:
> > This email is mostly a heads up for others who might be using
> > Canonical's livepatch on Ubuntu on a CephFS client.
> >
> > I have an Ubuntu 18.04 client with the standard kernel currently at
> > version linux-image-4.15.0-54-generic 4.15.0-54.58. CephFS is mounted
> > with the kernel client. Cluster is running mimic 13.2.6. I've got
> > livepatch running and this evening it did an update:
> >
> > Aug 13 17:33:55 myclient canonical-livepatch[2396]: Client.Check
> > Aug 13 17:33:55 myclient canonical-livepatch[2396]: Checking with livepatch 
> > service.
> > Aug 13 17:33:55 myclient canonical-livepatch[2396]: updating last-check
> > Aug 13 17:33:55 myclient canonical-livepatch[2396]: touched last check
> > Aug 13 17:33:56 myclient canonical-livepatch[2396]: Applying update 54.1 
> > for 4.15.0-54.58-generic
> > Aug 13 17:33:56 myclient kernel: [3700923.970750] PKCS#7 signature not 
> > signed with a trusted key
> > Aug 13 17:33:59 myclient kernel: [3700927.069945] livepatch: enabling patch 
> > 'lkp_Ubuntu_4_15_0_54_58_generic_54'
> > Aug 13 17:33:59 myclient kernel: [3700927.154956] livepatch: 
> > 'lkp_Ubuntu_4_15_0_54_58_generic_54': starting patching transition
> > Aug 13 17:34:01 myclient kernel: [3700928.994487] livepatch: 
> > 'lkp_Ubuntu_4_15_0_54_58_generic_54': patching complete
> > Aug 13 17:34:09 myclient canonical-livepatch[2396]: Applied patch version 
> > 54.1 to 4.15.0-54.58-generic
> >
> > And then immediately I saw:
> >
> > Aug 13 17:34:18 myclient kernel: [3700945.728684] libceph: mds0 
> > 1.2.3.4:6800 socket closed (con state OPEN)
> > Aug 13 17:34:18 myclient kernel: [3700946.040138] libceph: mds0 
> > 1.2.3.4:6800 socket closed (con state OPEN)
> > Aug 13 17:34:19 myclient kernel: [3700947.105692] libceph: mds0 
> > 1.2.3.4:6800 socket closed (con state OPEN)
> > Aug 13 17:34:20 myclient kernel: [3700948.033704] libceph: mds0 
> > 1.2.3.4:6800 socket closed (con state OPEN)
> >
> > And on the MDS:
> >
> > 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367 Message signature 
> > does not match contents.
> > 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367Signature on 
> > message:
> > 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367sig: 
> > 10517606059379971075
> > 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367Locally calculated 
> > signature:
> > 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367 
> > sig_check:4899837294009305543
> > 2019-08-13 17:34:18.286 7ff165e75700  0 Signature failed.
> > 2019-08-13 17:34:18.286 7ff165e75700  0 -- 1.2.3.4:6800/512468759 >> 
> > 4.3.2.1:0/928333509 conn(0xe6b9500 :6800 >> 
> > s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2 cs=1 l=0).process >> 
> > Signature check failed
> >
> > Thankfully I was able to umount -f to unfreeze the client, but I have
> > been unsuccessful remounting the file system using the kernel client.
> > The fuse client worked fine as a workaround, but is slower.
> >
> > Taking a look at livepatch 54.1 I can see it touches Ceph code in the
> > kernel:
> >
> > https://git.launchpad.net/~ubuntu-livepatch/+git/bionic-livepatches/commit/?id=3a3081c1e4c8e2e0f9f7a1ae4204eba5f38fbd29
> >
> > But the relevance of those changes isn't immediately clear to me. I
> > expect after a reboot it'll be fine, but as yet untested.
> 
> These changes are very relevant.  They introduce support for CEPHX_V2
> protocol, where message signatures are computed slightly differently:
> same algorithm but a different set of inputs.  The live-patched kernel
> likely started signing using CEPHX_V2 without renegotiating.

Ah - thanks for looking. Looks like something that wasn't a security
issue so shouldn't have been included in the live patch.

> This is a good example of how live-patching can go wrong.  A reboot
> should definitely help.

Yup, it certainly has its tradeoffs (not having to reboot so regularly
is certainly a positive, though). I've replicated on a test machine and
confirmed that a reboot does indeed fix the problem.

Thanks,

Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph capacity versus pool replicated size discrepancy?

2019-08-14 Thread Kenneth Van Alstyne
Got it!  I can calculate individual clone usage using “rbd du”, but does 
anything exist to show total clone usage across the pool?  Otherwise it looks 
like phantom space is just missing.

Thanks,

--
Kenneth Van Alstyne
Systems Architect
M: 228.547.8045
15052 Conference Center Dr, Chantilly, VA 20151
perspecta

On Aug 13, 2019, at 11:05 PM, Konstantin Shalygin <k0...@k0ste.ru> wrote:



Hey guys, this is probably a really silly question, but I’m trying to reconcile 
where all of my space has gone in one cluster that I am responsible for.

The cluster is made up of 36 2TB SSDs across 3 nodes (12 OSDs per node), all 
using FileStore on XFS.  We are running Ceph Luminous 12.2.8 on this particular 
cluster. The only pool where data is heavily stored is the “rbd” pool, of which 
7.09TiB is consumed.  With a replication of “3”, I would expect the raw 
used to be close to 21TiB, but it’s actually closer to 35TiB.  Some additional 
details are below.  Any thoughts?

[cluster] root@dashboard:~# ceph df
GLOBAL:
    SIZE    AVAIL   RAW USED %RAW USED
    62.8TiB 27.8TiB  35.1TiB     55.81
POOLS:
    NAME                       ID USED    %USED MAX AVAIL OBJECTS
    rbd                        0  7.09TiB 53.76   6.10TiB 3056783
    data                       3  29.4GiB  0.47   6.10TiB    7918
    metadata                   4  57.2MiB     0   6.10TiB      95
    .rgw.root                  5  1.09KiB     0   6.10TiB       4
    default.rgw.control        6       0B     0   6.10TiB       8
    default.rgw.meta           7       0B     0   6.10TiB       0
    default.rgw.log            8       0B     0   6.10TiB     207
    default.rgw.buckets.index  9       0B     0   6.10TiB       0
    default.rgw.buckets.data   10      0B     0   6.10TiB       0
    default.rgw.buckets.non-ec 11      0B     0   6.10TiB       0

[cluster] root@dashboard:~# ceph --version
ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)

[cluster] root@dashboard:~# ceph osd dump | grep 'replicated size'
pool 0 'rbd' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins 
pg_num 682 pgp_num 682 last_change 414873 flags hashpspool 
min_write_recency_for_promote 1 stripe_width 0 application rbd
pool 3 'data' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins 
pg_num 682 pgp_num 682 last_change 409614 flags hashpspool 
crash_replay_interval 45 min_write_recency_for_promote 1 stripe_width 0 
application cephfs
pool 4 'metadata' replicated size 3 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 682 pgp_num 682 last_change 409617 flags hashpspool 
min_write_recency_for_promote 1 stripe_width 0 application cephfs
pool 5 '.rgw.root' replicated size 3 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 409 pgp_num 409 last_change 409710 lfor 0/336229 flags 
hashpspool stripe_width 0 application rgw
pool 6 'default.rgw.control' replicated size 3 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 409 pgp_num 409 last_change 409711 lfor 0/336232 
flags hashpspool stripe_width 0 application rgw
pool 7 'default.rgw.meta' replicated size 3 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 409 pgp_num 409 last_change 409713 lfor 0/336235 flags 
hashpspool stripe_width 0 application rgw
pool 8 'default.rgw.log' replicated size 3 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 409 pgp_num 409 last_change 409712 lfor 0/336238 flags 
hashpspool stripe_width 0 application rgw
pool 9 'default.rgw.buckets.index' replicated size 3 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 409 pgp_num 409 last_change 409714 lfor 0/336241 
flags hashpspool stripe_width 0 application rgw
pool 10 'default.rgw.buckets.data' replicated size 3 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 409 pgp_num 409 last_change 409715 lfor 0/336244 
flags hashpspool stripe_width 0 application rgw
pool 11 'default.rgw.buckets.non-ec' replicated size 3 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 409 pgp_num 409 last_change 409716 lfor 0/336247 
flags hashpspool stripe_width 0 application rgw

[cluster] root@dashboard:~# ceph osd lspools
0 rbd,3 data,4 metadata,5 .rgw.root,6 default.rgw.control,7 default.rgw.meta,8 
default.rgw.log,9 default.rgw.buckets.index,10 default.rgw.buckets.data,11 
default.rgw.buckets.non-ec,

[cluster] root@dashboard:~# rados df
POOL_NAME                  USED    OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD

Re: [ceph-users] WAL/DB size

2019-08-14 Thread Igor Fedotov

Hi Wido & Hemant,

On 8/14/2019 11:36 AM, Wido den Hollander wrote:


On 8/14/19 9:33 AM, Hemant Sonawane wrote:

Hello guys,

Thank you so much for your responses, really appreciate it. But I would
like to mention one more thing which I forgot in my last email: I
am going to use this storage for OpenStack VMs. So will the answer
still be the same, that I should use 1GB for the WAL?


WAL 1GB is fine, yes.


I'd like to argue against this for a bit.

Actually a standalone WAL is required when you have either a very small fast 
device (and don't want the db to use it) or three devices (different in 
performance) behind an OSD (e.g. hdd, ssd, nvme). So the WAL is to be located 
at the fastest one.


For the given use case you just have HDD and NVMe, and DB and WAL can 
safely collocate. Which means you don't need to allocate a specific volume 
for the WAL. Hence no need to answer the question of how much space is needed 
for the WAL. Simply allocate the DB and the WAL will appear there automatically.




As this is an OpenStack/RBD only use-case I would say that 10GB of DB
per 1TB of disk storage is sufficient.


Given the RocksDB granularity already mentioned in this thread, we tend to 
prefer some fixed allocation sizes, with 30-60GB being close to optimal.


Anyway, I suggest using LVM for the DB/WAL volume, and maybe starting with a 
smaller size (e.g. 32GB per OSD), which leaves some extra spare space on 
your NVMes and allows you to add more space if needed. (Just to note: 
removing already allocated but still unused space from an existing OSD 
and gifting it to another/new OSD is a more troublesome task than adding 
some space from the spare volume.)
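
A sketch of that "add more space later" path, under the same assumptions (LV
names are made up; ceph-bluestore-tool's bluefs-bdev-expand exists in recent
Luminous+ releases, but verify on yours):

# Grow the DB logical volume using the spare space kept on the NVMe...
lvextend -L +16G nvme-vg/osd0-db

# ...then, with the OSD stopped, let BlueFS pick up the larger device.
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-0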



On Wed, 14 Aug 2019 at 05:54, Mark Nelson <mnel...@redhat.com> wrote:

 On 8/13/19 3:51 PM, Paul Emmerich wrote:

 > On Tue, Aug 13, 2019 at 10:04 PM Wido den Hollander <w...@42on.com> wrote:
 >> I just checked an RGW-only setup. 6TB drive, 58% full, 11.2GB of
 DB in
 >> use. No slow db in use.
 > random rgw-only setup here: 12TB drive, 77% full, 48GB metadata and
 > 10GB omap for index and whatever.
 >
 > That's 0.5% + 0.1%. And that's a setup that's using mostly erasure
 > coding and small-ish objects.
 >
 >
 >> I've talked with many people from the community and I don't see an
 >> agreement for the 4% rule.
 > agreed, 4% isn't a reasonable default.
 > I've seen setups with even 10% metadata usage, but these are weird
 > edge cases with very small objects on NVMe-only setups (obviously
 > without a separate DB device).
 >
 > Paul


 I agree, and I did quite a bit of the early space usage analysis.  I
 have a feeling that someone was trying to be well-meaning and make a
 simple ratio for users to target that was big enough to handle the
 majority of use cases.  The problem is that reality isn't that simple
 and one-size-fits all doesn't really work here.


 For RBD you can usually get away with far less than 4%.  A small
 fraction of that is often sufficient.  For tiny (say 4K) RGW objects
 (especially objects with very long names!) you potentially can end up
 using significantly more than 4%. Unfortunately there's no really good
 way for us to normalize this so long as RGW is using OMAP to store
 bucket indexes.  I think the best we can do long run is make it much
 clearer how space is being used on the block/db/wal devices and easier
 for users to shrink/grow the amount of "fast" disk they have on an OSD.
 Alternately we could put bucket indexes into rados objects instead of
 OMAP, but that would be a pretty big project (with it's own challenges
 but potentially also with rewards).


 Mark

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com 
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Thanks and Regards,

Hemant Sonawane


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Canonical Livepatch broke CephFS client

2019-08-14 Thread Ilya Dryomov
On Tue, Aug 13, 2019 at 10:56 PM Tim Bishop  wrote:
>
> Hi,
>
> This email is mostly a heads up for others who might be using
> Canonical's livepatch on Ubuntu on a CephFS client.
>
> I have an Ubuntu 18.04 client with the standard kernel currently at
> version linux-image-4.15.0-54-generic 4.15.0-54.58. CephFS is mounted
> with the kernel client. Cluster is running mimic 13.2.6. I've got
> livepatch running and this evening it did an update:
>
> Aug 13 17:33:55 myclient canonical-livepatch[2396]: Client.Check
> Aug 13 17:33:55 myclient canonical-livepatch[2396]: Checking with livepatch 
> service.
> Aug 13 17:33:55 myclient canonical-livepatch[2396]: updating last-check
> Aug 13 17:33:55 myclient canonical-livepatch[2396]: touched last check
> Aug 13 17:33:56 myclient canonical-livepatch[2396]: Applying update 54.1 for 
> 4.15.0-54.58-generic
> Aug 13 17:33:56 myclient kernel: [3700923.970750] PKCS#7 signature not signed 
> with a trusted key
> Aug 13 17:33:59 myclient kernel: [3700927.069945] livepatch: enabling patch 
> 'lkp_Ubuntu_4_15_0_54_58_generic_54'
> Aug 13 17:33:59 myclient kernel: [3700927.154956] livepatch: 
> 'lkp_Ubuntu_4_15_0_54_58_generic_54': starting patching transition
> Aug 13 17:34:01 myclient kernel: [3700928.994487] livepatch: 
> 'lkp_Ubuntu_4_15_0_54_58_generic_54': patching complete
> Aug 13 17:34:09 myclient canonical-livepatch[2396]: Applied patch version 
> 54.1 to 4.15.0-54.58-generic
>
> And then immediately I saw:
>
> Aug 13 17:34:18 myclient kernel: [3700945.728684] libceph: mds0 1.2.3.4:6800 
> socket closed (con state OPEN)
> Aug 13 17:34:18 myclient kernel: [3700946.040138] libceph: mds0 1.2.3.4:6800 
> socket closed (con state OPEN)
> Aug 13 17:34:19 myclient kernel: [3700947.105692] libceph: mds0 1.2.3.4:6800 
> socket closed (con state OPEN)
> Aug 13 17:34:20 myclient kernel: [3700948.033704] libceph: mds0 1.2.3.4:6800 
> socket closed (con state OPEN)
>
> And on the MDS:
>
> 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367 Message signature 
> does not match contents.
> 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367Signature on message:
> 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367sig: 
> 10517606059379971075
> 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367Locally calculated 
> signature:
> 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367 
> sig_check:4899837294009305543
> 2019-08-13 17:34:18.286 7ff165e75700  0 Signature failed.
> 2019-08-13 17:34:18.286 7ff165e75700  0 -- 1.2.3.4:6800/512468759 >> 
> 4.3.2.1:0/928333509 conn(0xe6b9500 :6800 >> 
> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2 cs=1 l=0).process >> 
> Signature check failed
>
> Thankfully I was able to umount -f to unfreeze the client, but I have
> been unsuccessful remounting the file system using the kernel client.
> The fuse client worked fine as a workaround, but is slower.
>
> Taking a look at livepatch 54.1 I can see it touches Ceph code in the
> kernel:
>
> https://git.launchpad.net/~ubuntu-livepatch/+git/bionic-livepatches/commit/?id=3a3081c1e4c8e2e0f9f7a1ae4204eba5f38fbd29
>
> But the relevance of those changes isn't immediately clear to me. I
> expect after a reboot it'll be fine, but as yet untested.

Hi Tim,

These changes are very relevant.  They introduce support for CEPHX_V2
protocol, where message signatures are computed slightly differently:
same algorithm but a different set of inputs.  The live-patched kernel
likely started signing using CEPHX_V2 without renegotiating.

This is a good example of how live-patching can go wrong.  A reboot
should definitely help.
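
For anyone checking a similar client, a minimal sketch (the mount arguments
are placeholders, and the livepatch CLI call is an assumption based on
Canonical's documentation):

  canonical-livepatch status        # confirm which patch version is applied
  umount -f /mnt/cephfs             # clear the wedged mount if it is still there
  # after a reboot the client and MDS negotiate CEPHX_V2 cleanly again:
  mount -t ceph 1.2.3.4:6789:/ /mnt/cephfs -o name=myclient,secretfile=/etc/ceph/myclient.secret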

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrub start-time and end-time

2019-08-14 Thread Thomas Byrne - UKRI STFC
Hi Torben,

> Is it allowed to have the scrub period cross midnight ? eg have start time at 
> 22:00 and end time 07:00 next morning.

Yes, I think that's the way it is mostly used, primarily to reduce the scrub 
impact during waking/working hours.

> I assume that if you only configure the one of them - it still behaves as if 
> it is unconfigured ??

The begin and end hours default to 0 and 24 hours respectively, so setting just 
one of them does have an effect. E.g. setting the end hour to 6 will mean scrubbing 
runs from midnight to 6AM, or setting the start hour to 16 will run scrubs from 
4PM to midnight.
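
For example, a window that crosses midnight could look like this (a sketch
assuming the Mimic+ centralized config; on older releases the same options go
into ceph.conf or via injectargs):

  ceph config set osd osd_scrub_begin_hour 22
  ceph config set osd osd_scrub_end_hour 7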

Cheers,
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sudden loss of all SSD OSDs in a cluster, immedaite abort on restart [Mimic 13.2.6]

2019-08-14 Thread Paul Emmerich
Starting point to debug/fix this would be to extract the osdmap from
one of the dead OSDs:

ceph-objectstore-tool --op get-osdmap --data-path /var/lib/ceph/osd/...

Then try to run osdmaptool on that osdmap to see if it also crashes,
set some --debug options (don't know which one off the top of my
head).
Does it also crash? How does it differ from the map retrieved with
"ceph osd getmap"?

You can also set the osdmap with "--op set-osdmap", does it help to
set the osdmap retrieved by "ceph osd getmap"?
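
A rough sketch of that workflow (OSD id, paths and file names are
placeholders; the OSD daemon must be stopped before running
ceph-objectstore-tool against its data path):

  ceph-objectstore-tool --op get-osdmap --data-path /var/lib/ceph/osd/ceph-46 --file /tmp/osdmap.extracted
  osdmaptool --print /tmp/osdmap.extracted     # does this crash too?
  ceph osd getmap -o /tmp/osdmap.current       # the monitors' view, for comparison
  ceph-objectstore-tool --op set-osdmap --data-path /var/lib/ceph/osd/ceph-46 --file /tmp/osdmap.current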

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Aug 14, 2019 at 7:59 AM Troy Ablan  wrote:
>
> I've opened a tracker issue at https://tracker.ceph.com/issues/41240
>
> Background: Cluster of 13 hosts, 5 of which contain 14 SSD OSDs between
> them.  409 HDDs in as well.
>
> The SSDs contain the RGW index and log pools, and some smaller pools
> The HDDs ccontain all other pools, including the RGW data pool
>
> The RGW instance contains just over 1 billion objects across about 65k
> buckets.  I don't know of any action on the cluster that would have
> caused this.  There have been no changes to the crush map in months, but
> HDDs were added a couple weeks ago and backfilling is still in progress
> but in the home stretch.
>
> I don't know what I can do at this point, though something points to the
> osdmap on these being wrong and/or corrupted?  Log excerpt from crash
> included below.  All of the OSD logs I checked look very similar.
>
>
>
>
> 2019-08-13 18:09:52.913 7f76484e9d80  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm
> /el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:3362] Recovered
> from manifest file:db/MANIFEST-245361 succeeded,manifest_file_number is
> 245361, next_file_number is 245364, last_sequence is 606668564
> 6, log_number is 0,prev_log_number is 0,max_column_family is
> 0,deleted_log_number is 245359
>
> 2019-08-13 18:09:52.913 7f76484e9d80  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm
> /el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:3370] Column family
> [default] (ID 0), log number is 245360
>
> 2019-08-13 18:09:52.918 7f76484e9d80  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1565719792920682, "job": 1, "event": "recovery_started",
> "log_files": [245362]}
> 2019-08-13 18:09:52.918 7f76484e9d80  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm
> /el7/BUILD/ceph-13.2.6/src/rocksdb/db/db_impl_open.cc:551] Recovering
> log #245362 mode 0
> 2019-08-13 18:09:52.919 7f76484e9d80  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm
> /el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:2863] Creating
> manifest 245364
>
> 2019-08-13 18:09:52.933 7f76484e9d80  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1565719792935329, "job": 1, "event": "recovery_finished"}
> 2019-08-13 18:09:52.951 7f76484e9d80  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm
> /el7/BUILD/ceph-13.2.6/src/rocksdb/db/db_impl_open.cc:1218] DB pointer
> 0x56445a6c8000
> 2019-08-13 18:09:52.951 7f76484e9d80  1
> bluestore(/var/lib/ceph/osd/ceph-46) _open_db opened rocksdb path db
> options
> compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=
> 1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152
> 2019-08-13 18:09:52.964 7f76484e9d80  1 freelist init
> 2019-08-13 18:09:52.976 7f76484e9d80  1
> bluestore(/var/lib/ceph/osd/ceph-46) _open_alloc opening allocation metadata
> 2019-08-13 18:09:53.119 7f76484e9d80  1
> bluestore(/var/lib/ceph/osd/ceph-46) _open_alloc loaded 926 GiB in 13292
> extents
> 2019-08-13 18:09:53.133 7f76484e9d80 -1 *** Caught signal (Aborted) **
>   in thread 7f76484e9d80 thread_name:ceph-osd
>
>   ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic
> (stable)
>   1: (()+0xf5d0) [0x7f763c4455d0]
>   2: (gsignal()+0x37) [0x7f763b466207]
>   3: (abort()+0x148) [0x7f763b4678f8]
>   4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f763bd757d5]
>   5: (()+0x5e746) [0x7f763bd73746]
>   6: (()+0x5e773) [0x7f763bd73773]
>   7: (__cxa_rethrow()+0x49) [0x7f763bd739e9]
>   8: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x18b8)
> [0x7f763fcb48d8]
>   9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x4ad) [0x7f763fa924ad]
>   10: (OSDMap::decode(ceph::buffe

Re: [ceph-users] WAL/DB size

2019-08-14 Thread Burkhard Linke

Hi,


please keep in mind that due to the rocksdb level concept, only certain 
db partition sizes are useful. Larger partitions are a waste of 
capacity, since rocksdb will only use whole level sizes.
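
As a rough illustration (assuming the default RocksDB tuning BlueStore ships
with, i.e. a 256 MB level base and a x10 size multiplier):

  L1 ~ 0.25 GB, L2 ~ 2.5 GB, L3 ~ 25 GB, L4 ~ 250 GB

so a DB partition is only fully used at roughly 3, 30 or 300 GB (the sum of
the levels that fit, plus some headroom for WAL and compaction); anything in
between simply sits idle while the next level spills onto the slow device.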



There has been a lot of discussion about this on the mailing list in the 
last months. A plain XY% of OSD size is just wrong and misleading.



Regards,

Burkhard


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] strange backfill delay after outing one node

2019-08-14 Thread Janne Johansson
Den ons 14 aug. 2019 kl 09:49 skrev Simon Oosthoek :

> Hi all,
>
> Yesterday I marked out all the osds on one node in our new cluster to
> reconfigure them with WAL/DB on their NVMe devices, but it is taking
> ages to rebalance.
>




> > ceph tell 'osd.*' injectargs '--osd-max-backfills 16'
> > ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
> Since the cluster is currently hardly loaded, backfilling can take up
> all the unused bandwidth as far as I'm concerned...
> Is it a good idea to give the above commands or other commands to speed
> up the backfilling? (e.g. like increasing "osd max backfills")
>
>
OSD max backfills is going to have a very large effect on recovery time, so
that would be the obvious knob to twist first. Check what it defaults to now,
raise it to 4, 8, 12, 16 in steps and see that it doesn't slow rebalancing
down too much.
Spindrives without any ssd/nvme journal/wal/db should perhaps have 1 or 2
at most,
ssds can take more than that and nvme even more before diminishing gains
occur.
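
A hedged sketch of stepping it up (osd.0 and the values are just examples;
check the current setting first and back off if client I/O suffers):

  ceph daemon osd.0 config get osd_max_backfills       # run on that OSD's host
  ceph tell 'osd.*' injectargs '--osd-max-backfills 4'
  ceph tell 'osd.*' injectargs '--osd-max-backfills 8'
  ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'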

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] strange backfill delay after outing one node

2019-08-14 Thread Wido den Hollander



On 8/14/19 9:48 AM, Simon Oosthoek wrote:
> Hi all,
> 
> Yesterday I marked out all the osds on one node in our new cluster to
> reconfigure them with WAL/DB on their NVMe devices, but it is taking
> ages to rebalance. The whole cluster (and thus the osds) is only ~1%
> full, therefore the full ratio is nowhere in sight.
> 
> We have 14 osd nodes with 12 disks each, one of them was marked out,
> Yesterday around noon. It is still not completed and all the while, the
> cluster is in ERROR state, even though this is a normal maintenance
> operation.
> 
> We are still experimenting with the cluster, and it is still operational
> while being in ERROR state, however it is slightly worrying when
> considering that it could take even (50x?) longer if the cluster has 50x
> the amount of data. And the OSD's are mostly flatlined in the dashboard
> graphs, so it could potentially do it much faster, I think.
> 
> below are a few outputs of ceph -s(w):
> 
> Yesterday afternoon (~16:00)
> # ceph -w
>   cluster:
> id: b489547c-ba50-4745-a914-23eb78e0e5dc
> health: HEALTH_ERR
> Degraded data redundancy (low space): 139 pgs backfill_toofull
> 
>   services:
> mon: 3 daemons, quorum cephmon3,cephmon1,cephmon2 (age 4h)
> mgr: cephmon1(active, since 4h), standbys: cephmon2, cephmon3
> mds: cephfs:1 {0=cephmds1=up:active} 1 up:standby
> osd: 168 osds: 168 up (since 3h), 156 in (since 3h); 1588 remapped pgs
> rgw: 1 daemon active (cephs3.rgw0)
> 
>   data:
> pools:   12 pools, 4116 pgs
> objects: 14.04M objects, 11 TiB
> usage:   20 TiB used, 1.7 PiB / 1.8 PiB avail
> pgs: 16188696/109408503 objects misplaced (14.797%)
>  2528 active+clean
>  1422 active+remapped+backfill_wait
>  139  active+remapped+backfill_wait+backfill_toofull
>  27   active+remapped+backfilling
> 
>   io:
> recovery: 205 MiB/s, 198 objects/s
> 
>   progress:
> Rebalancing after osd.47 marked out
>   [=.]
> Rebalancing after osd.5 marked out
>   [===...]
> Rebalancing after osd.132 marked out
>   [=.]
> Rebalancing after osd.90 marked out
>   [=.]
> Rebalancing after osd.76 marked out
>   [=.]
> Rebalancing after osd.157 marked out
>   [==]
> Rebalancing after osd.19 marked out
>   [=.]
> Rebalancing after osd.118 marked out
>   [..]
> Rebalancing after osd.146 marked out
>   [=.]
> Rebalancing after osd.104 marked out
>   [..]
> Rebalancing after osd.62 marked out
>   [===...]
> Rebalancing after osd.33 marked out
>   [==]
> 
> 
> This morning:
> # ceph -s
>   cluster:
> id: b489547c-ba50-4745-a914-23eb78e0e5dc
> health: HEALTH_ERR
> Degraded data redundancy (low space): 8 pgs backfill_toofull
> 
>   services:
> mon: 3 daemons, quorum cephmon3,cephmon1,cephmon2 (age 22h)
> mgr: cephmon1(active, since 22h), standbys: cephmon2, cephmon3
> mds: cephfs:1 {0=cephmds2=up:active} 1 up:standby
> osd: 168 osds: 168 up (since 22h), 156 in (since 21h); 189 remapped pgs
> rgw: 1 daemon active (cephs3.rgw0)
> 
>   data:
> pools:   12 pools, 4116 pgs
> objects: 14.11M objects, 11 TiB
> usage:   21 TiB used, 1.7 PiB / 1.8 PiB avail
> pgs: 4643284/110159565 objects misplaced (4.215%)
>  3927 active+clean
>  162  active+remapped+backfill_wait
>  19   active+remapped+backfilling
>  8active+remapped+backfill_wait+backfill_toofull
> 
>   io:
> client:   32 KiB/s rd, 0 B/s wr, 31 op/s rd, 21 op/s wr
> recovery: 198 MiB/s, 149 objects/s
> 

It seems it is still recovering, at 149 objects/second.
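
As a very rough estimate from the numbers above (assuming the rate stays
constant and every misplaced object needs one copy):

  4,643,284 misplaced objects / 149 objects/s ~ 31,000 s ~ 8.7 hours left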

>   progress:
> Rebalancing after osd.47 marked out
>   [=.]
> Rebalancing after osd.5 marked out
>   [=.]
> Rebalancing after osd.132 marked out
>   [=.]
> Rebalancing after osd.90 marked out
>   [=.]
> Rebalancing after osd.76 marked out
>   [=.]
> Rebalancing after osd.157 marked out
>   [=.]
> Rebalancing after osd.19 marked out
>   [=.]
> Rebalancing after osd.146 marked out
>   [=.]
> Rebalancing after osd.104 marked out
>   [=.]
> Rebalancing after osd.62 marked out
>   [=.]
> 
> 
> I found some hints, though I'm not sure it's right for us at this url:
> https://forum.proxmox.com/threads/increase-ceph-recovery-speed.36728/

Re: [ceph-users] WAL/DB size

2019-08-14 Thread Wido den Hollander


On 8/14/19 9:33 AM, Hemant Sonawane wrote:
> Hello guys,
> 
> Thank you so much for your responses really appreciate it. But I would
> like to mention one more thing which I forgot in my last email is that I
> am going to use this storage for openstack VM's. So still the answer
> will be the same that I should use 1GB for wal?
> 

WAL 1GB is fine, yes.

As this is an OpenStack/RBD only use-case I would say that 10GB of DB
per 1TB of disk storage is sufficient.
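
For what it's worth, a minimal sketch of provisioning that layout with
ceph-volume (device names and the DB size are placeholders; --db-devices and
--block-db-size need a reasonably recent ceph-volume):

  ceph-volume lvm batch --bluestore /dev/sda /dev/sdb /dev/sdc /dev/sdd \
      --db-devices /dev/nvme0n1 --block-db-size 30G
  # no separate --wal-devices: the WAL automatically lives inside the DB
  # partition when they share the same fast device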

> 
> On Wed, 14 Aug 2019 at 05:54, Mark Nelson  > wrote:
> 
> On 8/13/19 3:51 PM, Paul Emmerich wrote:
> 
> > On Tue, Aug 13, 2019 at 10:04 PM Wido den Hollander  > wrote:
> >> I just checked an RGW-only setup. 6TB drive, 58% full, 11.2GB of
> DB in
> >> use. No slow db in use.
> > random rgw-only setup here: 12TB drive, 77% full, 48GB metadata and
> > 10GB omap for index and whatever.
> >
> > That's 0.5% + 0.1%. And that's a setup that's using mostly erasure
> > coding and small-ish objects.
> >
> >
> >> I've talked with many people from the community and I don't see an
> >> agreement for the 4% rule.
> > agreed, 4% isn't a reasonable default.
> > I've seen setups with even 10% metadata usage, but these are weird
> > edge cases with very small objects on NVMe-only setups (obviously
> > without a separate DB device).
> >
> > Paul
> 
> 
> I agree, and I did quite a bit of the early space usage analysis.  I
> have a feeling that someone was trying to be well-meaning and make a
> simple ratio for users to target that was big enough to handle the
> majority of use cases.  The problem is that reality isn't that simple
> and one-size-fits all doesn't really work here.
> 
> 
> For RBD you can usually get away with far less than 4%.  A small
> fraction of that is often sufficient.  For tiny (say 4K) RGW objects 
> (especially objects with very long names!) you potentially can end up
> using significantly more than 4%. Unfortunately there's no really good
> way for us to normalize this so long as RGW is using OMAP to store
> bucket indexes.  I think the best we can do long run is make it much
> clearer how space is being used on the block/db/wal devices and easier
> for users to shrink/grow the amount of "fast" disk they have on an OSD.
> Alternately we could put bucket indexes into rados objects instead of
> OMAP, but that would be a pretty big project (with its own challenges
> but potentially also with rewards).
> 
> 
> Mark
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> -- 
> Thanks and Regards,
> 
> Hemant Sonawane
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs cannot mount with kernel client

2019-08-14 Thread Serkan Çoban
Hi, I just double-checked the stack trace and I can confirm it is the same
as the one in the tracker.
Compaction also worked for me, I can now mount cephfs without problems.
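
For reference, a minimal sketch of the workaround (monitor address, client
name and secret file are placeholders):

  cat /proc/buddyinfo                     # shows per-order free-page fragmentation
  echo 1 > /proc/sys/vm/compact_memory    # ask the kernel to compact movable pages
  mount -t ceph 1.2.3.4:6789:/ /mnt/cephfs -o name=myclient,secretfile=/etc/ceph/myclient.secret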
Thanks for help,

Serkan

On Tue, Aug 13, 2019 at 6:44 PM Ilya Dryomov  wrote:
>
> On Tue, Aug 13, 2019 at 4:30 PM Serkan Çoban  wrote:
> >
> > I am out of office right now, but I am pretty sure it was the same
> > stack trace as in tracker.
> > I will confirm tomorrow.
> > Any workarounds?
>
> Compaction
>
> # echo 1 >/proc/sys/vm/compact_memory
>
> might help if the memory in question is moveable.  If not, reboot and
> mount on a freshly booted node.
>
> I have raised the priority on the ticket.
>
> Thanks,
>
> Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] strange backfill delay after outing one node

2019-08-14 Thread Simon Oosthoek
Hi all,

Yesterday I marked out all the osds on one node in our new cluster to
reconfigure them with WAL/DB on their NVMe devices, but it is taking
ages to rebalance. The whole cluster (and thus the osds) is only ~1%
full, therefore the full ratio is nowhere in sight.

We have 14 osd nodes with 12 disks each, one of them was marked out,
Yesterday around noon. It is still not completed and all the while, the
cluster is in ERROR state, even though this is a normal maintenance
operation.

We are still experimenting with the cluster, and it is still operational
while being in ERROR state, however it is slightly worrying when
considering that it could take even (50x?) longer if the cluster has 50x
the amount of data. And the OSD's are mostly flatlined in the dashboard
graphs, so it could potentially do it much faster, I think.

below are a few outputs of ceph -s(w):

Yesterday afternoon (~16:00)
# ceph -w
  cluster:
id: b489547c-ba50-4745-a914-23eb78e0e5dc
health: HEALTH_ERR
Degraded data redundancy (low space): 139 pgs backfill_toofull

  services:
mon: 3 daemons, quorum cephmon3,cephmon1,cephmon2 (age 4h)
mgr: cephmon1(active, since 4h), standbys: cephmon2, cephmon3
mds: cephfs:1 {0=cephmds1=up:active} 1 up:standby
osd: 168 osds: 168 up (since 3h), 156 in (since 3h); 1588 remapped pgs
rgw: 1 daemon active (cephs3.rgw0)

  data:
pools:   12 pools, 4116 pgs
objects: 14.04M objects, 11 TiB
usage:   20 TiB used, 1.7 PiB / 1.8 PiB avail
pgs: 16188696/109408503 objects misplaced (14.797%)
 2528 active+clean
 1422 active+remapped+backfill_wait
 139  active+remapped+backfill_wait+backfill_toofull
 27   active+remapped+backfilling

  io:
recovery: 205 MiB/s, 198 objects/s

  progress:
Rebalancing after osd.47 marked out
  [=.]
Rebalancing after osd.5 marked out
  [===...]
Rebalancing after osd.132 marked out
  [=.]
Rebalancing after osd.90 marked out
  [=.]
Rebalancing after osd.76 marked out
  [=.]
Rebalancing after osd.157 marked out
  [==]
Rebalancing after osd.19 marked out
  [=.]
Rebalancing after osd.118 marked out
  [..]
Rebalancing after osd.146 marked out
  [=.]
Rebalancing after osd.104 marked out
  [..]
Rebalancing after osd.62 marked out
  [===...]
Rebalancing after osd.33 marked out
  [==]


This morning:
# ceph -s
  cluster:
id: b489547c-ba50-4745-a914-23eb78e0e5dc
health: HEALTH_ERR
Degraded data redundancy (low space): 8 pgs backfill_toofull

  services:
mon: 3 daemons, quorum cephmon3,cephmon1,cephmon2 (age 22h)
mgr: cephmon1(active, since 22h), standbys: cephmon2, cephmon3
mds: cephfs:1 {0=cephmds2=up:active} 1 up:standby
osd: 168 osds: 168 up (since 22h), 156 in (since 21h); 189 remapped pgs
rgw: 1 daemon active (cephs3.rgw0)

  data:
pools:   12 pools, 4116 pgs
objects: 14.11M objects, 11 TiB
usage:   21 TiB used, 1.7 PiB / 1.8 PiB avail
pgs: 4643284/110159565 objects misplaced (4.215%)
 3927 active+clean
 162  active+remapped+backfill_wait
 19   active+remapped+backfilling
 8active+remapped+backfill_wait+backfill_toofull

  io:
client:   32 KiB/s rd, 0 B/s wr, 31 op/s rd, 21 op/s wr
recovery: 198 MiB/s, 149 objects/s

  progress:
Rebalancing after osd.47 marked out
  [=.]
Rebalancing after osd.5 marked out
  [=.]
Rebalancing after osd.132 marked out
  [=.]
Rebalancing after osd.90 marked out
  [=.]
Rebalancing after osd.76 marked out
  [=.]
Rebalancing after osd.157 marked out
  [=.]
Rebalancing after osd.19 marked out
  [=.]
Rebalancing after osd.146 marked out
  [=.]
Rebalancing after osd.104 marked out
  [=.]
Rebalancing after osd.62 marked out
  [=.]


I found some hints, though I'm not sure it's right for us at this url:
https://forum.proxmox.com/threads/increase-ceph-recovery-speed.36728/
:
> ceph tell 'osd.*' injectargs '--osd-max-backfills 16'
> ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'

Since the cluster is currently hardly loaded, backfilling can take up
all the unused bandwidth as far as I'm concerned...

Is it a good idea to give the above commands or other commands to speed up
the backfilling? (e.g. like increasing "osd max backfills")

Re: [ceph-users] WAL/DB size

2019-08-14 Thread Hemant Sonawane
Hello guys,

Thank you so much for your responses really appreciate it. But I would like
to mention one more thing which I forgot in my last email is that I am
going to use this storage for openstack VM's. So still the answer will be
the same that I should use 1GB for wal?


On Wed, 14 Aug 2019 at 05:54, Mark Nelson  wrote:

> On 8/13/19 3:51 PM, Paul Emmerich wrote:
>
> > On Tue, Aug 13, 2019 at 10:04 PM Wido den Hollander 
> wrote:
> >> I just checked an RGW-only setup. 6TB drive, 58% full, 11.2GB of DB in
> >> use. No slow db in use.
> > random rgw-only setup here: 12TB drive, 77% full, 48GB metadata and
> > 10GB omap for index and whatever.
> >
> > That's 0.5% + 0.1%. And that's a setup that's using mostly erasure
> > coding and small-ish objects.
> >
> >
> >> I've talked with many people from the community and I don't see an
> >> agreement for the 4% rule.
> > agreed, 4% isn't a reasonable default.
> > I've seen setups with even 10% metadata usage, but these are weird
> > edge cases with very small objects on NVMe-only setups (obviously
> > without a separate DB device).
> >
> > Paul
>
>
> I agree, and I did quite a bit of the early space usage analysis.  I
> have a feeling that someone was trying to be well-meaning and make a
> simple ratio for users to target that was big enough to handle the
> majority of use cases.  The problem is that reality isn't that simple
> and one-size-fits all doesn't really work here.
>
>
> For RBD you can usually get away with far less than 4%.  A small
> fraction of that is often sufficient.  For tiny (say 4K) RGW objects
> (especially objects with very long names!) you potentially can end up
> using significantly more than 4%. Unfortunately there's no really good
> way for us to normalize this so long as RGW is using OMAP to store
> bucket indexes.  I think the best we can do long run is make it much
> clearer how space is being used on the block/db/wal devices and easier
> for users to shrink/grow the amount of "fast" disk they have on an OSD.
> Alternately we could put bucket indexes into rados objects instead of
> OMAP, but that would be a pretty big project (with its own challenges
> but potentially also with rewards).
>
>
> Mark
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Thanks and Regards,

Hemant Sonawane
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com