[ceph-users] How to change/enable/activate a different osd_memory_target value

2019-02-19 Thread Götz Reinicke
Hi,

we ran into some OSD node freezes due to out-of-memory conditions, which also ate 
up all the swap. Until we get more physical RAM I’d like to reduce the 
osd_memory_target, but I can’t find where and how to set it.

We have 24 BlueStore disks in 64 GB CentOS nodes with Luminous v12.2.11.

Thanks for hints and suggestions. Regards, Götz
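
For reference, a minimal sketch of where osd_memory_target goes. The option exists for 
BlueStore OSDs since 12.2.9, so it should be available on 12.2.11; the 2 GiB value below 
is only illustrative and has to fit 24 OSDs into 64 GB:

# ceph.conf on each OSD node, then restart the OSDs:
[osd]
osd_memory_target = 2147483648    # ~2 GiB per OSD, pick a value that fits your RAM

# a runtime change can be attempted as well, though whether it takes effect
# without a restart depends on the exact release:
ceph tell osd.* injectargs '--osd_memory_target=2147483648'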



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replicating CephFS between clusters

2019-02-19 Thread Alexandre DERUMIER
Hi,

I think CephFS snapshot mirroring is coming in Nautilus.

https://www.openstack.org/assets/presentation-media/2018.11.15-openstack-ceph-data-services.pdf
(slide 26)

But I don't know if it's already ready in master?



- Original Message -
From: "Vitaliy Filippov" 
To: "Marc Roos" , "Balazs.Soltesz" 
, "ceph-users" , "Wido 
den Hollander" 
Sent: Tuesday, 19 February 2019 23:24:44
Subject: Re: [ceph-users] Replicating CephFS between clusters

> Ah, yes, good question. I don't know if there is a true upper limit, but 
> leaving old snapshot around could hurt you when replaying journals and 
> such. 

Is it still so in Mimic? 

Should I live in fear if I keep old snapshots all the time (because I'm 
using them as "checkpoints")? :) 

-- 
With best regards, 
Vitaliy Filippov 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-19 Thread Konstantin Shalygin

On 2/19/19 11:46 PM, David Turner wrote:
I don't know that there's anything that can be done to resolve this 
yet without rebuilding the OSD.  Based on a Nautilus tool being able 
to resize the DB device, I'm assuming that Nautilus is also capable of 
migrating the DB/WAL between devices.  That functionality would allow 
anyone to migrate their DB back off of their spinner which is what's 
happening to you.  I don't believe that sort of tooling exists yet, 
though, without compiling the Nautilus Beta tooling for yourself.


I think you are wrong there: initially the bluestore tool could only expand 
wal/db devices [1]. With the latest releases of Mimic and Luminous this should 
work fine.


And only master has received the feature for expanding the main device [2].



[1] 
https://github.com/ceph/ceph/commit/2184e3077caa9de5f21cc901d26f6ecfb76de9e1


[2] 
https://github.com/ceph/ceph/commit/d07c10dfc02e4cdeda288bf39b8060b10da5bbf9
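

For reference, a minimal sketch of the DB/WAL expand workflow with ceph-bluestore-tool, 
assuming a Luminous/Mimic build that already ships bluefs-bdev-expand (osd.73 is just 
the example OSD from this thread; grow the underlying DB partition/LV first):

systemctl stop ceph-osd@73
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-73
systemctl start ceph-osd@73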


k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd: Can I only just update krbd module without updating kernel?

2019-02-19 Thread Konstantin Shalygin

Because of some reasons, I can't update the kernel to a higher version.
So I wonder if I can just update the krbd kernel module? Has anyone
done this before?


Of course you can. You "just" need to make a krbd patch from the upstream 
kernel and apply it to your kernel tree.


It's a lot of work and you may get stuck at some point, because krbd uses 
the Linux block layer. In practice it's not a good idea from either a 
technical or a business perspective.



k


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] krbd: Can I only just update krbd module without updating kernel?

2019-02-19 Thread Wei Zhao
Hi:
   Because of some reasons, I can't update the kernel to a higher version.
So I wonder if I can just update the krbd kernel module? Has anyone
done this before?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating a baremetal Ceph cluster into K8s + Rook

2019-02-19 Thread Brian Topping
> On Feb 19, 2019, at 3:30 PM, Vitaliy Filippov  wrote:
> 
> In our Russian-speaking Ceph chat we swear at "ceph inside kuber" people all the 
> time because they often do not understand what state their cluster is in at 
> all

Agreed 100%. This is a really good way to lock yourself out of your data (and 
maybe lose it), especially if you’re new to Kubernetes and using Rook to manage 
Ceph. 

Some months ago, I was on VMs running on Citrix. Everything is stable on 
Kubernetes and Ceph now, but it’s been a lot of work. I’d suggest starting with 
Kubernetes first, especially if you are going to do this on bare metal. I can 
give you some ideas about how to lay things out if you are running with limited 
hardware.

Brian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel P4600 3.2TB U.2 form factor NVMe firmware problems causing dead disks

2019-02-19 Thread Alexandre DERUMIER
I'm running some S4610s (SSDPE2KE064T8) with firmware VDV10140.

I haven't had any problem with them for 6 months.

But I remember that around September 2017 Supermicro warned me about a 
firmware bug on the S4600 (I don't know which firmware version).



- Original Message -
From: "David Turner" 
To: "ceph-users" 
Sent: Monday, 18 February 2019 16:44:18
Subject: [ceph-users] Intel P4600 3.2TB U.2 form factor NVMe firmware problems 
causing dead disks

We have 2 clusters of [1] these disks that have 2 Bluestore OSDs per disk 
(partitioned), 3 disks per node, 5 nodes per cluster. The clusters are 12.2.4 
running CephFS and RBDs. So in total we have 15 NVMe's per cluster and 30 
NVMe's in total. They were all built at the same time and were running firmware 
version QDV10130. On this firmware version we early on had 2 disks failures, a 
few months later we had 1 more, and then a month after that (just a few weeks 
ago) we had 7 disk failures in 1 week. 

The failures are such that the disk is no longer visible to the OS. This holds 
true beyond server reboots as well as placing the failed disks into a new 
server. With a firmware upgrade tool we got an error that pretty much said 
there's no way to get data back and to RMA the disk. We upgraded all of our 
remaining disks' firmware to QDV101D1 and haven't had any problems since then. 
Most of our failures happened while rebalancing the cluster after replacing 
dead disks and we tested rigorously around that use case after upgrading the 
firmware. This firmware version seems to have resolved whatever the problem 
was. 

We have about 100 more of these scattered among database servers and other 
servers that have never had this problem while running the QDV10130 firmware as 
well as firmwares between this one and the one we upgraded to. Bluestore on 
Ceph is the only use case we've had so far with this sort of failure. 

Has anyone else come across this issue before? Our current theory is that 
Bluestore is accessing the disk in a way that is triggering a bug in the older 
firmware version that isn't triggered by more traditional filesystems. We have 
a scheduled call with Intel to discuss this, but their preliminary searches 
into the bugfixes and known problems between firmware versions didn't indicate 
the bug that we triggered. It would be good to have some more information about 
what those differences for disk accessing might be to hopefully get a better 
answer from them as to what the problem is. 


[1] 
https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-3-2tb-2-5inch-3d1.html

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating a baremetal Ceph cluster into K8s + Rook

2019-02-19 Thread Vitaliy Filippov
In our Russian-speaking Ceph chat we swear at "ceph inside kuber" people all  
the time because they often do not understand what state their cluster  
is in at all


// Sorry to intervene :))

--
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replicating CephFS between clusters

2019-02-19 Thread Vitaliy Filippov

Ah, yes, good question. I don't know if there is a true upper limit, but
leaving old snapshot around could hurt you when replaying journals and  
such.


Is it still so in Mimic?

Should I live in fear if I keep old snapshots all the time (because I'm  
using them as "checkpoints")? :)


--
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel P4600 3.2TB U.2 form factor NVMe firmware problems causing dead disks

2019-02-19 Thread solarflow99
No, but I know that if the wear leveling isn't right then I wouldn't expect
them to last long. FW updates on SSDs are very important.


On Mon, Feb 18, 2019 at 7:44 AM David Turner  wrote:

> We have 2 clusters of [1] these disks that have 2 Bluestore OSDs per disk
> (partitioned), 3 disks per node, 5 nodes per cluster.  The clusters are
> 12.2.4 running CephFS and RBDs.  So in total we have 15 NVMe's per cluster
> and 30 NVMe's in total.  They were all built at the same time and were
> running firmware version QDV10130.  On this firmware version we early on
> had 2 disks failures, a few months later we had 1 more, and then a month
> after that (just a few weeks ago) we had 7 disk failures in 1 week.
>
> The failures are such that the disk is no longer visible to the OS.  This
> holds true beyond server reboots as well as placing the failed disks into a
> new server.  With a firmware upgrade tool we got an error that pretty much
> said there's no way to get data back and to RMA the disk.  We upgraded all
> of our remaining disks' firmware to QDV101D1 and haven't had any problems
> since then.  Most of our failures happened while rebalancing the cluster
> after replacing dead disks and we tested rigorously around that use case
> after upgrading the firmware.  This firmware version seems to have resolved
> whatever the problem was.
>
> We have about 100 more of these scattered among database servers and other
> servers that have never had this problem while running the
> QDV10130 firmware as well as firmwares between this one and the one we
> upgraded to.  Bluestore on Ceph is the only use case we've had so far with
> this sort of failure.
>
> Has anyone else come across this issue before?  Our current theory is that
> Bluestore is accessing the disk in a way that is triggering a bug in the
> older firmware version that isn't triggered by more traditional
> filesystems.  We have a scheduled call with Intel to discuss this, but
> their preliminary searches into the bugfixes and known problems between
> firmware versions didn't indicate the bug that we triggered.  It would be
> good to have some more information about what those differences for disk
> accessing might be to hopefully get a better answer from them as to what
> the problem is.
>
>
> [1]
> https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-3-2tb-2-5inch-3d1.html
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] faster switch to another mds

2019-02-19 Thread Fyodor Ustinov
Hi!

From the documentation:

mds beacon grace
Description: The interval without beacons before Ceph declares an MDS laggy 
(and possibly replace it).
Type:    Float
Default: 15

I do not understand: is 15 in seconds or in beacons?

And an additional misunderstanding: if we gently turn off the MDS (or MON), 
why does it not inform everyone interested before going down, e.g. "I am turned off, no 
need to wait, appoint a new active server"?
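
For reference: both values are in seconds. A beacon is sent every mds_beacon_interval 
seconds (4 by default) and the MDS is declared laggy after mds_beacon_grace seconds 
without one. A minimal sketch of tightening them (values purely illustrative; as noted 
in the quoted reply below, lowering them may or may not behave reliably):

# ceph.conf on the mons and MDS nodes
[global]
mds_beacon_interval = 2
mds_beacon_grace = 8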

- Original Message -
From: "David Turner" 
To: "Gregory Farnum" 
Cc: "Fyodor Ustinov" , "ceph-users" 
Sent: Tuesday, 19 February, 2019 20:57:49
Subject: Re: [ceph-users] faster switch to another mds

It's also been mentioned a few times that when MDS and MON are on the same
host that the downtime for MDS is longer when both daemons stop at about
the same time.  It's been suggested to stop the MDS daemon, wait for `ceph
mds stat` to reflect the change, and then restart the rest of the server.
HTH.

On Mon, Feb 11, 2019 at 3:55 PM Gregory Farnum  wrote:

> You can't tell from the client log here, but probably the MDS itself was
> failing over to a new instance during that interval. There's not much
> experience with it, but you could experiment with faster failover by
> reducing the mds beacon and grace times. This may or may not work
> reliably...
>
> On Sat, Feb 9, 2019 at 10:52 AM Fyodor Ustinov  wrote:
>
>> Hi!
>>
>> I have ceph cluster with 3 nodes with mon/mgr/mds servers.
>> I reboot one node and see this in client log:
>>
>> Feb 09 20:29:14 ceph-nfs1 kernel: libceph: mon2 10.5.105.40:6789 socket
>> closed (con state OPEN)
>> Feb 09 20:29:14 ceph-nfs1 kernel: libceph: mon2 10.5.105.40:6789 session
>> lost, hunting for new mon
>> Feb 09 20:29:14 ceph-nfs1 kernel: libceph: mon0 10.5.105.34:6789 session
>> established
>> Feb 09 20:29:22 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket
>> closed (con state OPEN)
>> Feb 09 20:29:23 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket
>> closed (con state CONNECTING)
>> Feb 09 20:29:24 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket
>> closed (con state CONNECTING)
>> Feb 09 20:29:24 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket
>> closed (con state CONNECTING)
>> Feb 09 20:29:53 ceph-nfs1 kernel: ceph: mds0 reconnect start
>> Feb 09 20:29:53 ceph-nfs1 kernel: ceph: mds0 reconnect success
>> Feb 09 20:30:05 ceph-nfs1 kernel: ceph: mds0 recovery completed
>>
>> As I understand it, the following has happened:
>> 1. Client detects - link with mon server broken and fast switches to
>> another mon (less that 1 seconds).
>> 2. Client detects - link with mds server broken, 3 times trying reconnect
>> (unsuccessful), waiting and reconnects to the same mds after 30 seconds
>> downtime.
>>
>> I have 2 questions:
>> 1. Why?
>> 2. How to reduce switching time to another mds?
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] faster switch to another mds

2019-02-19 Thread David Turner
It's also been mentioned a few times that when MDS and MON are on the same
host that the downtime for MDS is longer when both daemons stop at about
the same time.  It's been suggested to stop the MDS daemon, wait for `ceph
mds stat` to reflect the change, and then restart the rest of the server.
HTH.
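
A minimal sketch of that order of operations (the MDS systemd instance name is 
illustrative; it is usually the short hostname):

systemctl stop ceph-mds@$(hostname -s)   # stop only the MDS first
ceph mds stat                            # repeat until a standby has taken over
# ...then stop the MON and/or reboot the rest of the server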

On Mon, Feb 11, 2019 at 3:55 PM Gregory Farnum  wrote:

> You can't tell from the client log here, but probably the MDS itself was
> failing over to a new instance during that interval. There's not much
> experience with it, but you could experiment with faster failover by
> reducing the mds beacon and grace times. This may or may not work
> reliably...
>
> On Sat, Feb 9, 2019 at 10:52 AM Fyodor Ustinov  wrote:
>
>> Hi!
>>
>> I have ceph cluster with 3 nodes with mon/mgr/mds servers.
>> I reboot one node and see this in client log:
>>
>> Feb 09 20:29:14 ceph-nfs1 kernel: libceph: mon2 10.5.105.40:6789 socket
>> closed (con state OPEN)
>> Feb 09 20:29:14 ceph-nfs1 kernel: libceph: mon2 10.5.105.40:6789 session
>> lost, hunting for new mon
>> Feb 09 20:29:14 ceph-nfs1 kernel: libceph: mon0 10.5.105.34:6789 session
>> established
>> Feb 09 20:29:22 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket
>> closed (con state OPEN)
>> Feb 09 20:29:23 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket
>> closed (con state CONNECTING)
>> Feb 09 20:29:24 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket
>> closed (con state CONNECTING)
>> Feb 09 20:29:24 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket
>> closed (con state CONNECTING)
>> Feb 09 20:29:53 ceph-nfs1 kernel: ceph: mds0 reconnect start
>> Feb 09 20:29:53 ceph-nfs1 kernel: ceph: mds0 reconnect success
>> Feb 09 20:30:05 ceph-nfs1 kernel: ceph: mds0 recovery completed
>>
>> As I understand it, the following has happened:
>> 1. Client detects - link with mon server broken and fast switches to
>> another mon (less that 1 seconds).
>> 2. Client detects - link with mds server broken, 3 times trying reconnect
>> (unsuccessful), waiting and reconnects to the same mds after 30 seconds
>> downtime.
>>
>> I have 2 questions:
>> 1. Why?
>> 2. How to reduce switching time to another mds?
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS overwrite/truncate performance hit

2019-02-19 Thread David Turner
If your client needs to be able to handle the writes like that on its own,
RBDs might be the more appropriate use case.  You lose the ability to have
multiple clients accessing the data as easily as with CephFS, but you would
gain the features you're looking for.

On Tue, Feb 12, 2019 at 1:43 PM Gregory Farnum  wrote:

>
>
> On Tue, Feb 12, 2019 at 5:10 AM Hector Martin 
> wrote:
>
>> On 12/02/2019 06:01, Gregory Farnum wrote:
>> > Right. Truncates and renames require sending messages to the MDS, and
>> > the MDS committing to RADOS (aka its disk) the change in status, before
>> > they can be completed. Creating new files will generally use a
>> > preallocated inode so it's just a network round-trip to the MDS.
>>
>> I see. Is there a fundamental reason why these kinds of metadata
>> operations cannot be buffered in the client, or is this just the current
>> way they're implemented?
>>
>
> It's pretty fundamental, at least to the consistency guarantees we hold
> ourselves to. What happens if the client has buffered an update like that,
> performs writes to the data with those updates in mind, and then fails
> before they're flushed to the MDS? A local FS doesn't need to worry about a
> different node having a different lifetime, and can control the write order
> of its metadata and data updates on belated flush a lot more precisely than
> we can. :(
> -Greg
>
>
>>
>> e.g. on a local FS these kinds of writes can just stick around in the
>> block cache unflushed. And of course for CephFS I assume file extension
>> also requires updating the file size in the MDS, yet that doesn't block
>> while truncation does.
>>
>> > Going back to your first email, if you do an overwrite that is confined
>> > to a single stripe unit in RADOS (by default, a stripe unit is the size
>> > of your objects which is 4MB and it's aligned from 0), it is guaranteed
>> > to be atomic. CephFS can only tear writes across objects, and only if
>> > your client fails before the data has been flushed.
>>
>> Great! I've implemented this in a backwards-compatible way, so that gets
>> rid of this bottleneck. It's just a 128-byte flag file (formerly
>> variable length, now I just pad it to the full 128 bytes and rewrite it
>> in-place). This is good information to know for optimizing things :-)
>>
>> --
>> Hector Martin (hec...@marcansoft.com)
>> Public Key: https://mrcn.st/pub
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: client hangs

2019-02-19 Thread David Turner
You're attempting to use mismatching client name and keyring.  You want to
use matching name and keyring.  For your example, you would want to either
use `--keyring /etc/ceph/ceph.client.admin.keyring --name client.admin` or
`--keyring /etc/ceph/ceph.client.cephfs.keyring --name client.cephfs`.
Mixing and matching does not work.  Treat them like username and password.
You wouldn't try to log into your computer under your account with the
admin password.
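
For the ceph-fuse command quoted further down, a matching invocation would look 
something like this (a sketch reusing the monitor address and mount point from the 
thread; it assumes client.cephfs actually has the needed mon/mds/osd caps):

ceph-fuse --name client.cephfs \
  --keyring /etc/ceph/ceph.client.cephfs.keyring \
  -m 192.168.1.17:6789 /mnt/cephfs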

On Tue, Feb 19, 2019 at 12:58 PM Hennen, Christian <
christian.hen...@uni-trier.de> wrote:

> > sounds like network issue. are there firewall/NAT between nodes?
> No, there is currently no firewall in place. Nodes and clients are on the
> same network. MTUs match, ports are opened according to nmap.
>
> > try running ceph-fuse on the node that run mds, check if it works
> properly.
> When I try to run ceph-fuse on either a client or cephfiler1
> (MON,MGR,MDS,OSDs) I get
> - "operation not permitted" when using the client keyring
> - "invalid argument" when using the admin keyring
> - "ms_handle_refused" when using the admin keyring and connecting to
> 127.0.0.1:6789
>
> ceph-fuse --keyring /etc/ceph/ceph.client.admin.keyring --name
> client.cephfs -m 192.168.1.17:6789 /mnt/cephfs
>
> -Original Message-
> From: Yan, Zheng 
> Sent: Tuesday, 19 February 2019 11:31
> To: Hennen, Christian 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] CephFS: client hangs
>
> On Tue, Feb 19, 2019 at 5:10 PM Hennen, Christian <
> christian.hen...@uni-trier.de> wrote:
> >
> > Hi!
> >
> > >mon_max_pg_per_osd = 400
> > >
> > >In the ceph.conf and then restart all the services / or inject the
> > >config into the running admin
> >
> > I restarted each server (MONs and OSDs weren’t enough) and now the
> health warning is gone. Still no luck accessing CephFS though.
> >
> >
> > > MDS show a client got evicted. Nothing else looks abnormal.  Do new
> > > cephfs clients also get evicted quickly?
> >
> > Aside from the fact that evicted clients don’t show up in ceph –s, we
> observe other strange things:
> >
> > ·   Setting max_mds has no effect
> >
> > ·   Ceph osd blacklist ls sometimes lists cluster nodes
> >
>
> sounds like network issue. are there firewall/NAT between nodes?
>
> > The only client that is currently running is ‚master1‘. It also hosts a
> MON and a MGR. Its syslog (https://gitlab.uni-trier.de/snippets/78) shows
> messages like:
> >
> > Feb 13 06:40:33 master1 kernel: [56165.943008] libceph: wrong peer,
> > want 192.168.1.17:6800/-2045158358, got 192.168.1.17:6800/1699349984
> >
> > Feb 13 06:40:33 master1 kernel: [56165.943014] libceph: mds1
> > 192.168.1.17:6800 wrong peer at address
> >
> > The other day I did the update from 12.2.8 to 12.2.11, which can also be
> seen in the logs. Again, there appeared these messages. I assume that’s
> normal operations since ports can change and daemons have to find each
> other again? But what about Feb 13 in the morning? I didn’t do any restarts
> then.
> >
> > Also, clients are printing messages like the following on the console:
> >
> > [1026589.751040] ceph: handle_cap_import: mismatched seq/mseq: ino
> > (1994988.fffe) mds0 seq1 mseq 15 importer mds1 has
> > peer seq 2 mseq 15
> >
> > [1352658.876507] ceph: build_path did not end path lookup where
> > expected, namelen is 23, pos is 0
> >
> > Oh, and btw, the ceph nodes are running on Ubuntu 16.04, clients are on
> 14.04 with kernel 4.4.0-133.
> >
>
> try running ceph-fuse on the node that run mds, check if it works properly.
>
>
> > For reference:
> >
> > > Cluster details: https://gitlab.uni-trier.de/snippets/77
> >
> > > MDS log:
> > > https://gitlab.uni-trier.de/snippets/79?expanded=true=simple)
> >
> >
> > Kind regards
> > Christian Hennen
> >
> > Project Manager Infrastructural Services ZIMK University of Trier
> > Germany
> >
> > From: Ashley Merrick 
> > Sent: Monday, 18 February 2019 16:53
> > To: Hennen, Christian 
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] CephFS: client hangs
> >
> > Correct yes from my expirence OSD’s aswel.
> >
> > On Mon, 18 Feb 2019 at 11:51 PM, Hennen, Christian <
> christian.hen...@uni-trier.de> wrote:
> >
> > Hi!
> >
> > >mon_max_pg_per_osd = 400
> > >
> > >In the ceph.conf and then restart all the services / or inject the
> > >config into the running admin
> >
> > I restarted all MONs, but I assume the OSDs need to be restarted as well?
> >
> > > MDS show a client got evicted. Nothing else looks abnormal.  Do new
> > > cephfs clients also get evicted quickly?
> >
> > Yeah, it seems so. But strangely there is no indication of it in 'ceph
> > -s' or 'ceph health detail'. And they don't seem to be evicted
> > permanently? Right now, only 1 client is connected. The others are shut
> down since last week.
> > 'ceph osd blacklist ls' shows 0 entries.
> >
> >
> > Kind regards
> > Christian Hennen
> >
> > Project Manager Infrastructural Services ZIMK 

Re: [ceph-users] crush map has straw_calc_version=0 and legacy tunables on luminous

2019-02-19 Thread David Turner
[1] Here is a really cool set of slides from Ceph Day Berlin where Dan van
der Ster uses the mgr balancer module with upmap to gradually change the
tunables of a cluster without causing major client impact.  The down side
for you is that upmap requires all luminous or newer clients, but if you
upgrade your kernel clients to 1.13+, then you can enable upmap in the
cluster and utilize the balancer module to upgrade your cluster tunables.
As stated [2] here that those kernel versions still report as Jewel
clients, but only because they are missing some non-essential luminous
client features even they they are fully compatible with the upmap
features, and other required features.

As a side note on the balancer manager in upmap mode, it balances your
cluster in such a way that it attempts to evenly distribute all PGs for a
pool across all OSDs.  So if you have 3 different pools, the PG counts for
those pools should each be within 1 or 2 PGs on every OSD in your
cluster... it's really cool.  The slides discuss how to get your cluster to
that point as well, in case you have modified your weights or reweights at
all.


[1]
https://www.slideshare.net/Inktank_Ceph/ceph-day-berlin-mastering-ceph-operations-upmap-and-the-mgr-balancer
[2]
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-November/031206.html
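
A minimal sketch of the upmap balancer setup described above (command names as in 
Luminous; the min-compat step may need --yes-i-really-mean-it while the kernel clients 
still report as Jewel, per [2]):

ceph osd set-require-min-compat-client luminous
ceph mgr module enable balancer
ceph balancer mode upmap
ceph balancer on
ceph balancer status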

On Mon, Feb 4, 2019 at 6:31 PM Shain Miley  wrote:

> For future reference I found these 2 links which answer most of the
> questions:
>
> http://docs.ceph.com/docs/master/rados/operations/crush-map/
>
>
> https://www.openstack.org/assets/presentation-media/Advanced-Tuning-and-Operation-guide-for-Block-Storage-using-Ceph-Boston-2017-final.pdf
>
>
>
> We have about 250TB (x3) in our cluster so I am leaning toward not
> changing things at this point because it sounds like there will be a
> significant amount of data movement involved for not a lot in return.
>
>
>
> If anyone knows of a strong reason I should change the tunables profile
> away from what I have…then please let me know so I don’t end up running the
> cluster in a sub-optimal state for no reason.
>
>
>
> Thanks,
>
> Shain
>
>
>
> --
>
> Shain Miley | Manager of Systems and Infrastructure, Digital Media |
> smi...@npr.org | 202.513.3649
>
>
>
> *From: *ceph-users  on behalf of Shain
> Miley 
> *Date: *Monday, February 4, 2019 at 3:03 PM
> *To: *"ceph-users@lists.ceph.com" 
> *Subject: *[ceph-users] crush map has straw_calc_version=0 and legacy
> tunables on luminous
>
>
>
> Hello,
>
> I just upgraded our cluster to 12.2.11 and I have a few questions around
> straw_calc_version and tunables.
>
> Currently ceph status shows the following:
>
> crush map has straw_calc_version=0
>
> crush map has legacy tunables (require argonaut, min is firefly)
>
>
>
>1. Will setting tunables to optimal also change the staw_calc_version
>or do I need to set that separately?
>
>
>2. Right now I have a set of rbd kernel clients connecting using
>kernel version 4.4.  The ‘ceph daemon mon.id sessions’ command shows
>that this client is still connecting using the hammer feature set (and a
>few others on jewel as well):
>
>"MonSession(client.113933130 10.35.100.121:0/3425045489 is open allow
>*, features 0x7fddff8ee8cbffb (jewel))",  “MonSession(client.112250505
>10.35.100.99:0/4174610322 is open allow *, features 0x106b84a842a42
>(hammer))",
>
>My question is what is the minimum kernel version I would need to
>upgrade the 4.4 kernel server to in order to get to jewel or luminous?
>
>
>
>1. Will setting the tunables to optimal on luminous prevent jewel and
>hammer clients from connecting?  I want to make sure I don’t do anything
>will prevent my existing clients from connecting to the cluster.
>
>
>
>
> Thanks in advance,
>
> Shain
>
>
>
> --
>
> Shain Miley | Manager of Systems and Infrastructure, Digital Media |
> smi...@npr.org | 202.513.3649
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: client hangs

2019-02-19 Thread Hennen, Christian
> sounds like network issue. are there firewall/NAT between nodes?
No, there is currently no firewall in place. Nodes and clients are on the same 
network. MTUs match, ports are opened according to nmap.

> try running ceph-fuse on the node that run mds, check if it works properly.
When I try to run ceph-fuse on either a client or cephfiler1 (MON,MGR,MDS,OSDs) 
I get
- "operation not permitted" when using the client keyring
- "invalid argument" when using the admin keyring
- "ms_handle_refused" when using the admin keyring and connecting to 
127.0.0.1:6789

ceph-fuse --keyring /etc/ceph/ceph.client.admin.keyring --name client.cephfs -m 
192.168.1.17:6789 /mnt/cephfs

-Original Message-
From: Yan, Zheng  
Sent: Tuesday, 19 February 2019 11:31
To: Hennen, Christian 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] CephFS: client hangs

On Tue, Feb 19, 2019 at 5:10 PM Hennen, Christian 
 wrote:
>
> Hi!
>
> >mon_max_pg_per_osd = 400
> >
> >In the ceph.conf and then restart all the services / or inject the 
> >config into the running admin
>
> I restarted each server (MONs and OSDs weren’t enough) and now the health 
> warning is gone. Still no luck accessing CephFS though.
>
>
> > MDS show a client got evicted. Nothing else looks abnormal.  Do new 
> > cephfs clients also get evicted quickly?
>
> Aside from the fact that evicted clients don’t show up in ceph –s, we observe 
> other strange things:
>
> ·   Setting max_mds has no effect
>
> ·   Ceph osd blacklist ls sometimes lists cluster nodes
>

sounds like network issue. are there firewall/NAT between nodes?

> The only client that is currently running is ‚master1‘. It also hosts a MON 
> and a MGR. Its syslog (https://gitlab.uni-trier.de/snippets/78) shows 
> messages like:
>
> Feb 13 06:40:33 master1 kernel: [56165.943008] libceph: wrong peer, 
> want 192.168.1.17:6800/-2045158358, got 192.168.1.17:6800/1699349984
>
> Feb 13 06:40:33 master1 kernel: [56165.943014] libceph: mds1 
> 192.168.1.17:6800 wrong peer at address
>
> The other day I did the update from 12.2.8 to 12.2.11, which can also be seen 
> in the logs. Again, there appeared these messages. I assume that’s normal 
> operations since ports can change and daemons have to find each other again? 
> But what about Feb 13 in the morning? I didn’t do any restarts then.
>
> Also, clients are printing messages like the following on the console:
>
> [1026589.751040] ceph: handle_cap_import: mismatched seq/mseq: ino 
> (1994988.fffe) mds0 seq1 mseq 15 importer mds1 has 
> peer seq 2 mseq 15
>
> [1352658.876507] ceph: build_path did not end path lookup where 
> expected, namelen is 23, pos is 0
>
> Oh, and btw, the ceph nodes are running on Ubuntu 16.04, clients are on 14.04 
> with kernel 4.4.0-133.
>

try running ceph-fuse on the node that run mds, check if it works properly.


> For reference:
>
> > Cluster details: https://gitlab.uni-trier.de/snippets/77
>
> > MDS log: 
> > https://gitlab.uni-trier.de/snippets/79?expanded=true=simple)
>
>
> Kind regards
> Christian Hennen
>
> Project Manager Infrastructural Services ZIMK University of Trier 
> Germany
>
> From: Ashley Merrick 
> Sent: Monday, 18 February 2019 16:53
> To: Hennen, Christian 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] CephFS: client hangs
>
> Correct yes from my expirence OSD’s aswel.
>
> On Mon, 18 Feb 2019 at 11:51 PM, Hennen, Christian 
>  wrote:
>
> Hi!
>
> >mon_max_pg_per_osd = 400
> >
> >In the ceph.conf and then restart all the services / or inject the 
> >config into the running admin
>
> I restarted all MONs, but I assume the OSDs need to be restarted as well?
>
> > MDS show a client got evicted. Nothing else looks abnormal.  Do new 
> > cephfs clients also get evicted quickly?
>
> Yeah, it seems so. But strangely there is no indication of it in 'ceph 
> -s' or 'ceph health detail'. And they don't seem to be evicted 
> permanently? Right now, only 1 client is connected. The others are shut down 
> since last week.
> 'ceph osd blacklist ls' shows 0 entries.
>
>
> Kind regards
> Christian Hennen
>
> Project Manager Infrastructural Services ZIMK University of Trier 
> Germany
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replicating CephFS between clusters

2019-02-19 Thread Marc Roos


 >> 
 >>  >> 
 >>  >
 >>  >I'm not saying CephFS snapshots are 100% stable, but for certain
 >>  >use-cases they can be.
 >>  >
 >>  >Try to avoid:
 >>  >
 >>  >- Multiple CephFS in same cluster
 >>  >- Snapshot the root (/)
 >>  >- Having a lot of snapshots
 >> 
 >> How many is a lot? Having a lot of snapshots in total? Or having a 
lot 
 >> of snapshots on one dir? I was thinking of applying 7 snapshots on 
1500 
 >> directories.
 >> 
 >
 >Ah, yes, good question. I don't know if there is a true upper limit, 
but
 >leaving old snapshot around could hurt you when replaying journals and 
such.
 >
 >Therefor, if you create a snapshot, rsync and then remove it, it 
should
 >be fine.

I wanted to keep it for 7 days, then remove it and replace it with a new
snapshot.

 >
 >You were thinking about 1500*7 snapshots?

Yes indeed, (with the exception that the script first checks if the
directory has data in it)

 >
 >>  >Then you could use the cephfs recursive statistics to figure out 
which
 >>  >directories have changed and sync their data to another cluster.
 >>  >
 >>  >But there are some caveats, but it can work though!
 >>  >
 >>  >Wido
 >>  >
 >>  >>  
 >>  >> 
 >>  >> To be more precise, Id like to be able to replicate data in a
 >>  >> scheduled, atomic way to another cluster, so if the site hosting 
our
 >>  >> primary bitbucket cluster becomes unavailable for some reason, 
Im 
 >> able
 >>  >> to spin up another bitbucket cluster elsewhere.
 >>  >> 
 >>  >>  
 >> 
 >
 >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replicating CephFS between clusters

2019-02-19 Thread Wido den Hollander


On 2/19/19 6:28 PM, Marc Roos wrote:
> 
>  >> 
>  >
>  >I'm not saying CephFS snapshots are 100% stable, but for certain
>  >use-cases they can be.
>  >
>  >Try to avoid:
>  >
>  >- Multiple CephFS in same cluster
>  >- Snapshot the root (/)
>  >- Having a lot of snapshots
> 
> How many is a lot? Having a lot of snapshots in total? Or having a lot 
> of snapshots on one dir? I was thinking of applying 7 snapshots on 1500 
> directories.
> 

Ah, yes, good question. I don't know if there is a true upper limit, but
leaving old snapshots around could hurt you when replaying journals and such.

Therefore, if you create a snapshot, rsync and then remove it, it should
be fine.

You were thinking about 1500*7 snapshots?

Wido

>  >Then you could use the cephfs recursive statistics to figure out which
>  >directories have changed and sync their data to another cluster.
>  >
>  >But there are some caveats, but it can work though!
>  >
>  >Wido
>  >
>  >>  
>  >> 
>  >> To be more precise, Id like to be able to replicate data in a
>  >> scheduled, atomic way to another cluster, so if the site hosting our
>  >> primary bitbucket cluster becomes unavailable for some reason, Im 
> able
>  >> to spin up another bitbucket cluster elsewhere.
>  >> 
>  >>  
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replicating CephFS between clusters

2019-02-19 Thread Marc Roos


 >> 
 >
 >I'm not saying CephFS snapshots are 100% stable, but for certain
 >use-cases they can be.
 >
 >Try to avoid:
 >
 >- Multiple CephFS in same cluster
 >- Snapshot the root (/)
 >- Having a lot of snapshots

How many is a lot? Having a lot of snapshots in total? Or having a lot 
of snapshots on one dir? I was thinking of applying 7 snapshots on 1500 
directories.

 >Then you could use the cephfs recursive statistics to figure out which
 >directories have changed and sync their data to another cluster.
 >
 >But there are some caveats, but it can work though!
 >
 >Wido
 >
 >>  
 >> 
 >> To be more precise, Id like to be able to replicate data in a
 >> scheduled, atomic way to another cluster, so if the site hosting our
 >> primary bitbucket cluster becomes unavailable for some reason, Im 
able
 >> to spin up another bitbucket cluster elsewhere.
 >> 
 >>  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-19 Thread David Turner
With a RACK failure domain, you should be able to have an entire rack
powered down without noticing any major impact on the clients.  I regularly
take down OSDs and nodes for maintenance and upgrades without seeing any
problems with client IO.

On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy 
wrote:

> Hello - I have a couple of questions on ceph cluster stability, even
> we follow all recommendations as below:
> - Having separate replication n/w and data n/w
> - RACK is the failure domain
> - Using SSDs for journals (1:4ratio)
>
> Q1 - If one OSD down, cluster IO down drastically and customer Apps
> impacted.
> Q2 - what is stability ratio, like with above, is ceph cluster
> workable condition, if one osd down or one node down,etc.
>
> Thanks
> Swami
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating a baremetal Ceph cluster into K8s + Rook

2019-02-19 Thread David Turner
Have you ever seen an example of a Ceph cluster being run and managed by
Rook?  It's a really cool idea and takes care of containerizing mons, rgw,
mds, etc that I've been thinking about doing anyway.  Having those
containerized means that you can upgrade all of the mon services before
any of your other daemons are even aware of a new Ceph version, even if
they're running on the same server.  There are some recent upgrade bugs for
small clusters with mons and osds on the same node that would have been
mitigated with containerized Ceph versions.  For putting OSDs in
containers, have you ever needed to run a custom compiled version of Ceph
for a few OSDs to get past a bug that was causing you some troubles?  With
OSDs in containers, you could do that without worrying about that version
of Ceph being used by any other OSDs.

On top of all of that, I keep feeling like a dinosaur for not understanding
Kubernetes better and have been really excited since seeing Rook
orchestrating a Ceph cluster in K8s.  I spun up a few VMs to start testing
configuring a Kubernetes cluster.  The Rook Slack channel recommended using
kubeadm to set up K8s to manage Ceph.

On Mon, Feb 18, 2019 at 11:50 AM Marc Roos  wrote:

>
> Why not just keep it bare metal? Especially with future ceph
> upgrading/testing. I am having centos7 with luminous and am running
> libvirt on the nodes aswell. If you configure them with a tls/ssl
> connection, you can even nicely migrate a vm, from one host/ceph node to
> the other.
> Next thing I am testing with is mesos, to use the ceph nodes to run
> containers. I am still testing this on some vm's, but looks like you
> have to install only a few rpms (maybe around 300MB) and 2 extra
> services on the nodes to get this up and running aswell. (But keep in
> mind that the help on their mailing list is not so good as here ;))
>
>
>
> -Original Message-
> From: David Turner [mailto:drakonst...@gmail.com]
> Sent: 18 February 2019 17:31
> To: ceph-users
> Subject: [ceph-users] Migrating a baremetal Ceph cluster into K8s + Rook
>
> I'm getting some "new" (to me) hardware that I'm going to upgrade my
> home Ceph cluster with.  Currently it's running a Proxmox cluster
> (Debian) which precludes me from upgrading to Mimic.  I am thinking
> about taking the opportunity to convert most of my VMs into containers
> and migrate my cluster into a K8s + Rook configuration now that Ceph is
> [1] stable on Rook.
>
> I haven't ever configured a K8s cluster and am planning to test this out
> on VMs before moving to it with my live data.  Has anyone done a
> migration from a baremetal Ceph cluster into K8s + Rook?  Additionally
> what is a good way for a K8s beginner to get into managing a K8s
> cluster.  I see various places recommend either CoreOS or kubeadm for
> starting up a new K8s cluster but I don't know the pros/cons for either.
>
> As far as migrating the Ceph services into Rook, I would assume that the
> process would be pretty simple to add/create new mons, mds, etc into
> Rook with the baremetal cluster details.  Once those are active and
> working just start decommissioning the services on baremetal.  For me,
> the OSD migration should be similar since I don't have any multi-device
> OSDs so I only need to worry about migrating individual disks between
> nodes.
>
>
> [1]
> https://blog.rook.io/rook-v0-9-new-storage-backends-in-town-ab952523ec53
>
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replicating CephFS between clusters

2019-02-19 Thread Wido den Hollander


On 2/19/19 6:00 PM, Balazs Soltesz wrote:
> Hi all,
> 
>  
> 
> I’m experimenting with CephFS as storage to a bitbucket cluster.
> 
>  
> 
> One problems to tackle is replicating the filesystem contents between
> ceph clusters in different sites around the globe.
> 
> I’ve read about pool replication, but I’ve also read replicating pools
> under a CephFS is not advised. As far as I know CephFS snapshots are not
> quite production ready, which is a shame, because that might provide an
> atomic way of capturing state of the filesystem.
> 

I'm not saying CephFS snapshots are 100% stable, but for certain
use-cases they can be.

Try to avoid:

- Multiple CephFS in same cluster
- Snapshot the root (/)
- Having a lot of snapshots

Then you could use the cephfs recursive statistics to figure out which
directories have changed and sync their data to another cluster.

There are some caveats, but it can work though!

Wido
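
A minimal sketch of that snapshot/rsync cycle, with purely illustrative paths (a mounted 
CephFS as the source, any rsync target as the destination; ceph.dir.rctime is the 
recursive change-time vxattr from the recursive statistics mentioned above):

DIR=/mnt/cephfs/projects/foo          # illustrative source directory
DST=backuphost:/backup/projects/foo   # illustrative destination
SNAP=sync-$(date +%Y%m%d)

getfattr -n ceph.dir.rctime "$DIR"    # recursive ctime: has anything changed?

mkdir "$DIR/.snap/$SNAP"              # take a CephFS snapshot of this subtree
rsync -a --delete "$DIR/.snap/$SNAP/" "$DST/"
rmdir "$DIR/.snap/$SNAP"              # remove the snapshot once synced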

>  
> 
> To be more precise, I’d like to be able to replicate data in a
> scheduled, atomic way to another cluster, so if the site hosting our
> primary bitbucket cluster becomes unavailable for some reason, I’m able
> to spin up another bitbucket cluster elsewhere.
> 
>  
> 
> I’m hoping someone here could point me in the right direction with this
> issue.
> 
>  
> 
>  
> 
> *Balázs Soltész *| Software Engineer
> 1061 Budapest, Andrássí út 9.
> *LogMeInInc.com* 
> 
> 
> Learn moreat _LogMeInInc.com _.
> 
>  
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Replicating CephFS between clusters

2019-02-19 Thread Balazs Soltesz
Hi all,

I'm experimenting with CephFS as storage to a bitbucket cluster.

One problem to tackle is replicating the filesystem contents between ceph 
clusters in different sites around the globe.
I've read about pool replication, but I've also read replicating pools under a 
CephFS is not advised. As far as I know CephFS snapshots are not quite 
production ready, which is a shame, because that might provide an atomic way of 
capturing state of the filesystem.

To be more precise, I'd like to be able to replicate data in a scheduled, 
atomic way to another cluster, so if the site hosting our primary bitbucket 
cluster becomes unavailable for some reason, I'm able to spin up another 
bitbucket cluster elsewhere.

I'm hoping someone here could point me in the right direction with this issue.


Balázs Soltész | Software Engineer
1061 Budapest, Andrássí út 9.
LogMeInInc.com

Learn more at LogMeInInc.com.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-19 Thread David Turner
I don't know that there's anything that can be done to resolve this yet
without rebuilding the OSD.  Based on a Nautilus tool being able to resize
the DB device, I'm assuming that Nautilus is also capable of migrating the
DB/WAL between devices.  That functionality would allow anyone to migrate
their DB back off of their spinner which is what's happening to you.  I
don't believe that sort of tooling exists yet, though, without compiling
the Nautilus Beta tooling for yourself.

On Tue, Feb 19, 2019 at 12:03 AM Konstantin Shalygin  wrote:

> On 2/18/19 9:43 PM, David Turner wrote:
> > Do you have historical data from these OSDs to see when/if the DB used
> > on osd.73 ever filled up?  To account for this OSD using the slow
> > storage for DB, all we need to do is show that it filled up the fast
> > DB at least once.  If that happened, then something spilled over to
> > the slow storage and has been there ever since.
>
> Yes, I have. Also I checked my JIRA records what I was do at this times
> and marked this on timeline: [1]
>
> Another graph compared osd.(33|73) for a last year: [2]
>
>
> [1] https://ibb.co/F7smCxW
>
> [1] https://ibb.co/dKWWDzW
>
> k
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-19 Thread Alexandre DERUMIER
>>I think op_w_process_latency includes replication times, not 100% sure 
>>though. 
>>
>>So restarting other nodes might affect latencies at this specific OSD. 

Seems to be the case; I have compared with sub_op_latency.

I have changed my graph to clearly identify the OSDs where the latency is high. 


I have made some changes in my setup:
- 2 OSDs per NVMe (2x3TB per OSD), with 6GB memory (instead of 1 OSD of 6TB with 12GB 
memory).
- disabling transparent hugepages

For the last 24h, latencies have stayed low (between 0.7-1.2ms).

I'm also seeing that the total memory used (#free) is lower than before (48GB 
(8 OSDs x 6GB) vs 56GB (4 OSDs x 12GB)).

I'll send more stats tomorrow.

Alexandre
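
For the transparent hugepage part, a minimal sketch of what gets disabled (runtime only; 
making it persistent needs a kernel command-line option or an init script):

cat /sys/kernel/mm/transparent_hugepage/enabled    # check the current setting
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag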


- Original Message -
From: "Igor Fedotov" 
To: "Alexandre Derumier" , "Wido den Hollander" 

Cc: "ceph-users" , "ceph-devel" 

Sent: Tuesday, 19 February 2019 11:12:43
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

Hi Alexander, 

I think op_w_process_latency includes replication times, not 100% sure 
though. 

So restarting other nodes might affect latencies at this specific OSD. 


Thanks, 

Igor 

On 2/16/2019 11:29 AM, Alexandre DERUMIER wrote: 
>>> There are 10 OSDs in these systems with 96GB of memory in total. We are 
>>> runnigh with memory target on 6G right now to make sure there is no 
>>> leakage. If this runs fine for a longer period we will go to 8GB per OSD 
>>> so it will max out on 80GB leaving 16GB as spare. 
> Thanks Wido. I send results monday with my increased memory 
> 
> 
> 
> @Igor: 
> 
> I have also notice, that sometime when I have bad latency on an osd on node1 
> (restarted 12h ago for example). 
> (op_w_process_latency). 
> 
> If I restart osds on other nodes (last restart some days ago, so with bigger 
> latency), it's reducing latency on osd of node1 too. 
> 
> does "op_w_process_latency" counter include replication time ? 
> 
> - Original Message - 
> From: "Wido den Hollander"  
> To: "aderumier"  
> Cc: "Igor Fedotov" , "ceph-users" 
> , "ceph-devel"  
> Sent: Friday, 15 February 2019 14:59:30 
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until 
> restart 
> 
> On 2/15/19 2:54 PM, Alexandre DERUMIER wrote: 
 Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
 OSDs as well. Over time their latency increased until we started to 
 notice I/O-wait inside VMs. 
>> I'm also notice it in the vms. BTW, what it your nvme disk size ? 
> Samsung PM983 3.84TB SSDs in both clusters. 
> 
>> 
 A restart fixed it. We also increased memory target from 4G to 6G on 
 these OSDs as the memory would allow it. 
>> I have set memory to 6GB this morning, with 2 osds of 3TB for 6TB nvme. 
>> (my last test was 8gb with 1osd of 6TB, but that didn't help) 
> There are 10 OSDs in these systems with 96GB of memory in total. We are 
> runnigh with memory target on 6G right now to make sure there is no 
> leakage. If this runs fine for a longer period we will go to 8GB per OSD 
> so it will max out on 80GB leaving 16GB as spare. 
> 
> As these OSDs were all restarted earlier this week I can't tell how it 
> will hold up over a longer period. Monitoring (Zabbix) shows the latency 
> is fine at the moment. 
> 
> Wido 
> 
>> 
>> - Original Message - 
>> From: "Wido den Hollander"  
>> To: "Alexandre Derumier" , "Igor Fedotov" 
>>  
>> Cc: "ceph-users" , "ceph-devel" 
>>  
>> Sent: Friday, 15 February 2019 14:50:34 
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until 
>> restart 
>> 
>> On 2/15/19 2:31 PM, Alexandre DERUMIER wrote: 
>>> Thanks Igor. 
>>> 
>>> I'll try to create multiple osds by nvme disk (6TB) to see if behaviour is 
>>> different. 
>>> 
>>> I have other clusters (same ceph.conf), but with 1,6TB drives, and I don't 
>>> see this latency problem. 
>>> 
>>> 
>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>> OSDs as well. Over time their latency increased until we started to 
>> notice I/O-wait inside VMs. 
>> 
>> A restart fixed it. We also increased memory target from 4G to 6G on 
>> these OSDs as the memory would allow it. 
>> 
>> But we noticed this on two different 12.2.10/11 clusters. 
>> 
>> A restart made the latency drop. Not only the numbers, but the 
>> real-world latency as experienced by a VM as well. 
>> 
>> Wido 
>> 
>>> 
>>> 
>>> 
>>> 
>>> - Original Message - 
>>> From: "Igor Fedotov"  
>>> Cc: "ceph-users" , "ceph-devel" 
>>>  
>>> Sent: Friday, 15 February 2019 13:47:57 
>>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until 
>>> restart 
>>> 
>>> Hi Alexander, 
>>> 
>>> I've read through your reports, nothing obvious so far. 
>>> 
>>> I can only see several times average latency increase for OSD write ops 
>>> (in seconds) 
>>> 0.002040060 (first hour) vs. 
>>> 
>>> 0.002483516 (last 24 hours) vs. 
>>> 0.008382087 (last hour) 
>>> 
>>> subop_w_latency: 
>>> 0.000478934 (first hour) vs. 
>>> 0.000537956 (last 24 hours) vs. 

Re: [ceph-users] Ceph OSD: how to keep files after umount or reboot vs tempfs ?

2019-02-19 Thread PHARABOT Vincent
Ok thank you for confirmation Burkhard

I’m trying this

Vincent

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On behalf of 
Burkhard Linke
Sent: Tuesday, 19 February 2019 13:20
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph OSD: how to keep files after umount or reboot vs 
tempfs ?


Hi,
On 2/19/19 11:52 AM, PHARABOT Vincent wrote:
Hello Cephers,

I have an issue with OSD device mount on tmpfs with bluestore
For some occasion, I need to keep the files on the tiny bluestore fs 
(especially keyring and may be other useful files needed for osd to work) on a 
working OSD
Since osd partition is mount as tmpfs , these files are deleted once VM 
rebooted or even when umount

Is there a way to have a persistent storage for those files instead of tmpfs ?
I could copy them in another location and copy back once rebooted, but this 
seems very odd



Those files are generated on osd activation from information stored in LVM 
metadata. You do not need an extra external storage for the information any 
more.



Regards,

Burkhard Linke


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-ansible try to recreate existing osds in osds.yml

2019-02-19 Thread Jawad Ahmed
Hi all,

I have running cluster deployed with ceph-ansible.

Why does ceph-ansible try to recreate, and give errors on, existing OSDs mentioned
in osds.yml? Shouldn't it just skip the existing OSDs and find disks/volumes
which are empty?

some info:
I am using osd_scenario: lvm with bluestore

I mean, what if I want to do some overrides and run site.yml again to make
changes across the cluster (ceph.conf files)? What is the best way to do that?

Help would be appreciated.



-- 
Greetings,
Jawad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph OSD: how to keep files after umount or reboot vs tempfs ?

2019-02-19 Thread Burkhard Linke

Hi,

On 2/19/19 11:52 AM, PHARABOT Vincent wrote:


Hello Cephers,

I have an issue with OSD device mount on tmpfs with bluestore

For some occasion, I need to keep the files on the tiny bluestore fs 
(especially keyring and may be other useful files needed for osd to 
work) on a working OSD


Since osd partition is mount as tmpfs , these files are deleted once 
VM rebooted or even when umount


Is there a way to have a persistent storage for those files instead of 
tmpfs ?


I could copy them in another location and copy back once rebooted, but 
this seems very odd




Those files are generated on OSD activation from information stored in the 
LVM metadata. You do not need any extra external storage for this 
information any more.
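
A minimal sketch of how the tmpfs contents get recreated after a reboot (normally the 
ceph-volume systemd units do this automatically on boot; the id/fsid come from the 
list output):

ceph-volume lvm list                          # shows the osd id and osd fsid for each LV
ceph-volume lvm activate --all                # recreate the tmpfs dirs for all LVM-based OSDs
ceph-volume lvm activate <osd-id> <osd-fsid>  # or just one OSD, ids taken from the list output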



Regards,

Burkhard Linke


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph OSD: how to keep files after umount or reboot vs tempfs ?

2019-02-19 Thread PHARABOT Vincent
Hello Cephers,

I have an issue with the OSD device mount on tmpfs with bluestore.
On some occasions, I need to keep the files on the tiny bluestore fs 
(especially the keyring and maybe other useful files needed for the osd to work) on a 
working OSD.
Since the osd partition is mounted as tmpfs, these files are deleted once the VM 
is rebooted or even on umount.

Is there a way to have persistent storage for those files instead of tmpfs?
I could copy them to another location and copy them back after reboot, but this 
seems very odd.

May be I need to keep keyring and use ceph-volume command (lvm activate ?) to 
recover the files

Do you have any best practice for this use case ?

Thanks a lot for your help!

Vincent

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: client hangs

2019-02-19 Thread Yan, Zheng
On Tue, Feb 19, 2019 at 5:10 PM Hennen, Christian
 wrote:
>
> Hi!
>
> >mon_max_pg_per_osd = 400
> >
> >In the ceph.conf and then restart all the services / or inject the config
> >into the running admin
>
> I restarted each server (MONs and OSDs weren’t enough) and now the health 
> warning is gone. Still no luck accessing CephFS though.
>
>
> > MDS show a client got evicted. Nothing else looks abnormal.  Do new cephfs
> > clients also get evicted quickly?
>
> Aside from the fact that evicted clients don’t show up in ceph –s, we observe 
> other strange things:
>
> ·   Setting max_mds has no effect
>
> ·   Ceph osd blacklist ls sometimes lists cluster nodes
>

Sounds like a network issue. Are there firewalls/NAT between the nodes?

> The only client that is currently running is ‚master1‘. It also hosts a MON 
> and a MGR. Its syslog (https://gitlab.uni-trier.de/snippets/78) shows 
> messages like:
>
> Feb 13 06:40:33 master1 kernel: [56165.943008] libceph: wrong peer, want 
> 192.168.1.17:6800/-2045158358, got 192.168.1.17:6800/1699349984
>
> Feb 13 06:40:33 master1 kernel: [56165.943014] libceph: mds1 
> 192.168.1.17:6800 wrong peer at address
>
> The other day I did the update from 12.2.8 to 12.2.11, which can also be seen 
> in the logs. Again, there appeared these messages. I assume that’s normal 
> operations since ports can change and daemons have to find each other again? 
> But what about Feb 13 in the morning? I didn’t do any restarts then.
>
> Also, clients are printing messages like the following on the console:
>
> [1026589.751040] ceph: handle_cap_import: mismatched seq/mseq: ino 
> (1994988.fffe) mds0 seq1 mseq 15 importer mds1 has peer seq 2 
> mseq 15
>
> [1352658.876507] ceph: build_path did not end path lookup where expected, 
> namelen is 23, pos is 0
>
> Oh, and btw, the ceph nodes are running on Ubuntu 16.04, clients are on 14.04 
> with kernel 4.4.0-133.
>

Try running ceph-fuse on the node that runs the MDS and check if it works properly.
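
Something along these lines should do (assuming ceph.conf and the admin
keyring are present on that node; adjust the mount point to taste):

  mkdir -p /mnt/cephfs-test
  ceph-fuse /mnt/cephfs-test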


> For reference:
>
> > Cluster details: https://gitlab.uni-trier.de/snippets/77
>
> > MDS log: 
> > https://gitlab.uni-trier.de/snippets/79?expanded=true=simple)
>
>
> Kind regards
> Christian Hennen
>
> Project Manager Infrastructural Services ZIMK University of Trier
> Germany
>
> From: Ashley Merrick
> Sent: Monday, 18 February 2019 16:53
> To: Hennen, Christian
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] CephFS: client hangs
>
> Correct, yes, from my experience the OSDs as well.
>
> On Mon, 18 Feb 2019 at 11:51 PM, Hennen, Christian 
>  wrote:
>
> Hi!
>
> >mon_max_pg_per_osd = 400
> >
> >In the ceph.conf and then restart all the services / or inject the config
> >into the running admin
>
> I restarted all MONs, but I assume the OSDs need to be restarted as well?
>
> > MDS show a client got evicted. Nothing else looks abnormal.  Do new cephfs
> > clients also get evicted quickly?
>
> Yeah, it seems so. But strangely there is no indication of it in 'ceph -s' or
> 'ceph health detail'. And they don't seem to be evicted permanently? Right
> now, only 1 client is connected. The others are shut down since last week.
> 'ceph osd blacklist ls' shows 0 entries.
>
>
> Kind regards
> Christian Hennen
>
> Project Manager Infrastructural Services ZIMK University of Trier
> Germany
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-19 Thread Igor Fedotov

Hi Alexandre,

I think op_w_process_latency includes replication times, not 100% sure 
though.


So restarting other nodes might affect latencies at this specific OSD.
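
If you want to track just those counters over time on the OSD node, something
like this (via the admin socket, assuming default paths) should be enough:

  ceph daemon osd.0 perf dump | grep -A 3 '"op_w_process_latency"'
  ceph daemon osd.0 perf dump | grep -A 3 '"subop_w_latency"'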


Thanks,

Igor

On 2/16/2019 11:29 AM, Alexandre DERUMIER wrote:

There are 10 OSDs in these systems with 96GB of memory in total. We are
running with the memory target at 6G right now to make sure there is no
leakage. If this runs fine for a longer period we will go to 8GB per OSD
so it will max out on 80GB leaving 16GB as spare.

Thanks Wido. I'll send results on Monday with my increased memory.



@Igor:

I have also noticed that sometimes I get bad latency on an OSD on node1
(restarted 12h ago, for example)
(op_w_process_latency).

If I restart OSDs on other nodes (last restarted some days ago, so with bigger
latency), it reduces the latency on the OSD of node1 too.

Does the "op_w_process_latency" counter include replication time?

- Mail original -
De: "Wido den Hollander" 
À: "aderumier" 
Cc: "Igor Fedotov" , "ceph-users" , 
"ceph-devel" 
Envoyé: Vendredi 15 Février 2019 14:59:30
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

On 2/15/19 2:54 PM, Alexandre DERUMIER wrote:

Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe
OSDs as well. Over time their latency increased until we started to
notice I/O-wait inside VMs.

I also notice it in the VMs. BTW, what is your NVMe disk size?

Samsung PM983 3.84TB SSDs in both clusters.




A restart fixed it. We also increased memory target from 4G to 6G on
these OSDs as the memory would allow it.

I have set the memory to 6GB this morning, with 2 OSDs of 3TB for the 6TB NVMe.
(My last test was 8GB with 1 OSD of 6TB, but that didn't help.)

There are 10 OSDs in these systems with 96GB of memory in total. We are
running with the memory target at 6G right now to make sure there is no
leakage. If this runs fine for a longer period we will go to 8GB per OSD
so it will max out on 80GB leaving 16GB as spare.

As these OSDs were all restarted earlier this week I can't tell how it
will hold up over a longer period. Monitoring (Zabbix) shows the latency
is fine at the moment.

Wido



- Mail original -
De: "Wido den Hollander" 
À: "Alexandre Derumier" , "Igor Fedotov" 
Cc: "ceph-users" , "ceph-devel" 

Envoyé: Vendredi 15 Février 2019 14:50:34
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

On 2/15/19 2:31 PM, Alexandre DERUMIER wrote:

Thanks Igor.

I'll try to create multiple OSDs per NVMe disk (6TB) to see if the behaviour is
different.

I have other clusters (same ceph.conf), but with 1.6TB drives, and I don't see
this latency problem.



Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe
OSDs as well. Over time their latency increased until we started to
notice I/O-wait inside VMs.

A restart fixed it. We also increased memory target from 4G to 6G on
these OSDs as the memory would allow it.

But we noticed this on two different 12.2.10/11 clusters.

A restart made the latency drop. Not only the numbers, but the
real-world latency as experienced by a VM as well.

Wido






- Mail original -
De: "Igor Fedotov" 
Cc: "ceph-users" , "ceph-devel" 

Envoyé: Vendredi 15 Février 2019 13:47:57
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

Hi Alexander,

I've read through your reports, nothing obvious so far.

I can only see several times average latency increase for OSD write ops
(in seconds)
0.002040060 (first hour) vs.

0.002483516 (last 24 hours) vs.
0.008382087 (last hour)

subop_w_latency:
0.000478934 (first hour) vs.
0.000537956 (last 24 hours) vs.
0.003073475 (last hour)

and OSD read ops, osd_r_latency:

0.000408595 (first hour)
0.000709031 (24 hours)
0.004979540 (last hour)

What's interesting is that such latency differences aren't observed at
neither BlueStore level (any _lat params under "bluestore" section) nor
rocksdb one.

Which probably means that the issue is rather somewhere above BlueStore.

Suggest to proceed with perf dumps collection to see if the picture
stays the same.

W.r.t. memory usage you observed I see nothing suspicious so far - No
decrease in RSS report is a known artifact that seems to be safe.

Thanks,
Igor

On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote:

Hi Igor,

Thanks again for helping !



I have upgraded to the latest Mimic this weekend, and with the new memory
autotuning I have set osd_memory_target to 8G (my NVMes are 6TB).


I have done a lot of perf dump and mempool dump and ps of process to

see rss memory at different hours,

here the reports for osd.0:

http://odisoweb1.odiso.net/perfanalysis/


The OSD was started on 12-02-2019 at 08:00.

first report after 1h running
http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt


http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt

http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt



report after 24 before counter 

Re: [ceph-users] RGW: Reshard index of non-master zones in multi-site

2019-02-19 Thread Iain Buclaw
On Tue, 19 Feb 2019 at 10:05, Iain Buclaw  wrote:
>
> On Tue, 19 Feb 2019 at 09:59, Iain Buclaw  wrote:
> >
> > On Wed, 6 Feb 2019 at 09:28, Iain Buclaw  wrote:
> > >
> > > On Tue, 5 Feb 2019 at 10:04, Iain Buclaw  wrote:
> > > >
> > > > On Tue, 5 Feb 2019 at 09:46, Iain Buclaw  wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > Following the update of one secondary site from 12.2.8 to 12.2.11, the
> > > > > following warning have come up.
> > > > >
> > > > > HEALTH_WARN 1 large omap objects
> > > > > LARGE_OMAP_OBJECTS 1 large omap objects
> > > > > 1 large objects found in pool '.rgw.buckets.index'
> > > > > Search the cluster log for 'Large omap object found' for more 
> > > > > details.
> > > > >
> > > >
> > > > [...]
> > > >
> > > > > Is this the reason why resharding hasn't propagated?
> > > > >
> > > >
> > > > Furthermore, infact it looks like the index is broken on the 
> > > > secondaries.
> > > >
> > > > On the master:
> > > >
> > > > # radosgw-admin bi get --bucket=mybucket --object=myobject
> > > > {
> > > > "type": "plain",
> > > > "idx": "myobject",
> > > > "entry": {
> > > > "name": "myobject",
> > > > "instance": "",
> > > > "ver": {
> > > > "pool": 28,
> > > > "epoch": 8848
> > > > },
> > > > "locator": "",
> > > > "exists": "true",
> > > > "meta": {
> > > > "category": 1,
> > > > "size": 9200,
> > > > "mtime": "2018-03-27 21:12:56.612172Z",
> > > > "etag": "c365c324cda944d2c3b687c0785be735",
> > > > "owner": "mybucket",
> > > > "owner_display_name": "Bucket User",
> > > > "content_type": "application/octet-stream",
> > > > "accounted_size": 9194,
> > > > "user_data": ""
> > > > },
> > > > "tag": "0ef1a91a-4aee-427e-bdf8-30589abb2d3e.36603989.137292",
> > > > "flags": 0,
> > > > "pending_map": [],
> > > > "versioned_epoch": 0
> > > > }
> > > > }
> > > >
> > > >
> > > > On the secondaries:
> > > >
> > > > # radosgw-admin bi get --bucket=mybucket --object=myobject
> > > > ERROR: bi_get(): (2) No such file or directory
> > > >
> > > > How does one go about rectifying this mess?
> > > >
> > >
> > > Random blog in language I don't understand seems to allude to using
> > > radosgw-admin bi put to restore backed up indexes, but not under what
> > > circumstances you would use such a command.
> > >
> > > https://cloud.tencent.com/developer/article/1032854
> > >
> > > Would this be safe to run on secondaries?
> > >
> >
> > Removed the bucket on the secondaries and scheduled new sync.  However
> > this gets stuck at some point and radosgw is complaining about:
> >
> > data sync: WARNING: skipping data log entry for missing bucket
> > mybucket:0ef1a91a-4aee-427e-bdf8-30589abb2d3e.92151615.1:21
> >
> > Hopeless that RGW can't even do a simple job right, I removed the
> > problematic bucket on the master, but now there are now hundreds of
> > shard objects inside the index pool, all look to be orphaned, and
> > still the warnings for missing bucket continue to happen on the
> > secondaries.  In some cases there's an object on the secondary that
> > doesn't exist on the master.
> >
> > All the while, ceph is still complaining about large omap files.
> >
> > $ ceph daemon mon.ceph-mon-1 config get
> > osd_deep_scrub_large_omap_object_value_sum_threshold
> > {
> > "osd_deep_scrub_large_omap_object_value_sum_threshold": "1073741824"
> > }
> >
> > It seems implausible that the cluster is still complaining about this
> > when the largest omap contains 71405 entries.
> >
> >
> > I can't run bi purge or metadata rm on the unreferenced entries
> > because the bucket itself is no more.  Can I remove objects from the
> > index pool using 'rados rm' ?
> >
>
> Possibly related
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-November/031350.html
>

# ./radosgw-gc-bucket-indexes.sh master.rgw.buckets.index | wc -l
7511

# ./radosgw-gc-bucket-indexes.sh secondary1.rgw.buckets.index | wc -l
3509

# ./radosgw-gc-bucket-indexes.sh secondary2.rgw.buckets.index | wc -l
3801
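
For anyone who wants to reproduce the check by hand rather than via the script,
this is roughly the kind of comparison involved (pool name and marker taken
from the examples above):

  # index shard objects still sitting in the pool
  rados -p master.rgw.buckets.index ls | grep '^\.dir\.'

  # does RGW still know a bucket instance for a given marker?
  radosgw-admin metadata list bucket.instance | grep 0ef1a91a-4aee-427e-bdf8-30589abb2d3e.92151615.1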

I believe the correct phrase in Italian would be 'Che Pasticcio'.

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: client hangs

2019-02-19 Thread Hennen, Christian
Hi!

>mon_max_pg_per_osd = 400
>
>In the ceph.conf and then restart all the services / or inject the config 
>into the running admin

I restarted each server (MONs and OSDs weren’t enough) and now the health 
warning is gone. Still no luck accessing CephFS though.
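
For reference, injecting the value at runtime should look something like this,
although in our case only full server restarts made the warning disappear:

  ceph tell mon.* injectargs '--mon_max_pg_per_osd 400'
  ceph tell osd.* injectargs '--mon_max_pg_per_osd 400'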

> MDS show a client got evicted. Nothing else looks abnormal.  Do new cephfs 
> clients also get evicted quickly?

Aside from the fact that evicted clients don't show up in ceph -s, we observe 
other strange things:
*   Setting max_mds has no effect
*   Ceph osd blacklist ls sometimes lists cluster nodes

The only client that is currently running is 'master1'. It also hosts a MON and 
a MGR. Its syslog (https://gitlab.uni-trier.de/snippets/78) shows messages like:
Feb 13 06:40:33 master1 kernel: [56165.943008] libceph: wrong peer, want 
192.168.1.17:6800/-2045158358, got 192.168.1.17:6800/1699349984
Feb 13 06:40:33 master1 kernel: [56165.943014] libceph: mds1 192.168.1.17:6800 
wrong peer at address
The other day I did the update from 12.2.8 to 12.2.11, which can also be seen 
in the logs. Again, these messages appeared. I assume that's normal 
operation, since ports can change and daemons have to find each other again? 
But what about Feb 13 in the morning? I didn't do any restarts then.

Also, clients are printing messages like the following on the console:
[1026589.751040] ceph: handle_cap_import: mismatched seq/mseq: ino 
(1994988.fffe) mds0 seq1 mseq 15 importer mds1 has peer seq 2 
mseq 15
[1352658.876507] ceph: build_path did not end path lookup where expected, 
namelen is 23, pos is 0

Oh, and btw, the ceph nodes are running on Ubuntu 16.04, clients are on 14.04 
with kernel 4.4.0-133.

For reference:
> Cluster details: https://gitlab.uni-trier.de/snippets/77 
> MDS log: https://gitlab.uni-trier.de/snippets/79?expanded=true=simple)

Kind regards
Christian Hennen

Project Manager Infrastructural Services ZIMK University of Trier
Germany

From: Ashley Merrick
Sent: Monday, 18 February 2019 16:53
To: Hennen, Christian
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] CephFS: client hangs

Correct, yes, from my experience the OSDs as well.

On Mon, 18 Feb 2019 at 11:51 PM, Hennen, Christian 
<christian.hen...@uni-trier.de> wrote:
Hi!

>mon_max_pg_per_osd = 400
>
>In the ceph.conf and then restart all the services / or inject the config 
>into the running admin

I restarted all MONs, but I assume the OSDs need to be restarted as well?

> MDS show a client got evicted. Nothing else looks abnormal.  Do new cephfs 
> clients also get evicted quickly?

Yeah, it seems so. But strangely there is no indication of it in 'ceph -s' or 
'ceph health detail'. And they don't seem to be evicted permanently? Right 
now, only 1 client is connected. The others are shut down since last week. 
'ceph osd blacklist ls' shows 0 entries.


Kind regards
Christian Hennen

Project Manager Infrastructural Services ZIMK University of Trier
Germany

___
ceph-users mailing list
ceph-users@lists.ceph.com  
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


smime.p7s
Description: S/MIME cryptographic signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW: Reshard index of non-master zones in multi-site

2019-02-19 Thread Iain Buclaw
On Tue, 19 Feb 2019 at 09:59, Iain Buclaw  wrote:
>
> On Wed, 6 Feb 2019 at 09:28, Iain Buclaw  wrote:
> >
> > On Tue, 5 Feb 2019 at 10:04, Iain Buclaw  wrote:
> > >
> > > On Tue, 5 Feb 2019 at 09:46, Iain Buclaw  wrote:
> > > >
> > > > Hi,
> > > >
> > > > Following the update of one secondary site from 12.2.8 to 12.2.11, the
> > > > following warning have come up.
> > > >
> > > > HEALTH_WARN 1 large omap objects
> > > > LARGE_OMAP_OBJECTS 1 large omap objects
> > > > 1 large objects found in pool '.rgw.buckets.index'
> > > > Search the cluster log for 'Large omap object found' for more 
> > > > details.
> > > >
> > >
> > > [...]
> > >
> > > > Is this the reason why resharding hasn't propagated?
> > > >
> > >
> > > Furthermore, infact it looks like the index is broken on the secondaries.
> > >
> > > On the master:
> > >
> > > # radosgw-admin bi get --bucket=mybucket --object=myobject
> > > {
> > > "type": "plain",
> > > "idx": "myobject",
> > > "entry": {
> > > "name": "myobject",
> > > "instance": "",
> > > "ver": {
> > > "pool": 28,
> > > "epoch": 8848
> > > },
> > > "locator": "",
> > > "exists": "true",
> > > "meta": {
> > > "category": 1,
> > > "size": 9200,
> > > "mtime": "2018-03-27 21:12:56.612172Z",
> > > "etag": "c365c324cda944d2c3b687c0785be735",
> > > "owner": "mybucket",
> > > "owner_display_name": "Bucket User",
> > > "content_type": "application/octet-stream",
> > > "accounted_size": 9194,
> > > "user_data": ""
> > > },
> > > "tag": "0ef1a91a-4aee-427e-bdf8-30589abb2d3e.36603989.137292",
> > > "flags": 0,
> > > "pending_map": [],
> > > "versioned_epoch": 0
> > > }
> > > }
> > >
> > >
> > > On the secondaries:
> > >
> > > # radosgw-admin bi get --bucket=mybucket --object=myobject
> > > ERROR: bi_get(): (2) No such file or directory
> > >
> > > How does one go about rectifying this mess?
> > >
> >
> > Random blog in language I don't understand seems to allude to using
> > radosgw-admin bi put to restore backed up indexes, but not under what
> > circumstances you would use such a command.
> >
> > https://cloud.tencent.com/developer/article/1032854
> >
> > Would this be safe to run on secondaries?
> >
>
> Removed the bucket on the secondaries and scheduled new sync.  However
> this gets stuck at some point and radosgw is complaining about:
>
> data sync: WARNING: skipping data log entry for missing bucket
> mybucket:0ef1a91a-4aee-427e-bdf8-30589abb2d3e.92151615.1:21
>
> Hopeless that RGW can't even do a simple job right, I removed the
> problematic bucket on the master, but now there are now hundreds of
> shard objects inside the index pool, all look to be orphaned, and
> still the warnings for missing bucket continue to happen on the
> secondaries.  In some cases there's an object on the secondary that
> doesn't exist on the master.
>
> All the while, ceph is still complaining about large omap files.
>
> $ ceph daemon mon.ceph-mon-1 config get
> osd_deep_scrub_large_omap_object_value_sum_threshold
> {
> "osd_deep_scrub_large_omap_object_value_sum_threshold": "1073741824"
> }
>
> It seems implausible that the cluster is still complaining about this
> when the largest omap contains 71405 entries.
>
>
> I can't run bi purge or metadata rm on the unreferenced entries
> because the bucket itself is no more.  Can I remove objects from the
> index pool using 'rados rm' ?
>

Possibly related

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-November/031350.html

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW: Reshard index of non-master zones in multi-site

2019-02-19 Thread Iain Buclaw
On Wed, 6 Feb 2019 at 09:28, Iain Buclaw  wrote:
>
> On Tue, 5 Feb 2019 at 10:04, Iain Buclaw  wrote:
> >
> > On Tue, 5 Feb 2019 at 09:46, Iain Buclaw  wrote:
> > >
> > > Hi,
> > >
> > > Following the update of one secondary site from 12.2.8 to 12.2.11, the
> > > following warning have come up.
> > >
> > > HEALTH_WARN 1 large omap objects
> > > LARGE_OMAP_OBJECTS 1 large omap objects
> > > 1 large objects found in pool '.rgw.buckets.index'
> > > Search the cluster log for 'Large omap object found' for more details.
> > >
> >
> > [...]
> >
> > > Is this the reason why resharding hasn't propagated?
> > >
> >
> > Furthermore, infact it looks like the index is broken on the secondaries.
> >
> > On the master:
> >
> > # radosgw-admin bi get --bucket=mybucket --object=myobject
> > {
> > "type": "plain",
> > "idx": "myobject",
> > "entry": {
> > "name": "myobject",
> > "instance": "",
> > "ver": {
> > "pool": 28,
> > "epoch": 8848
> > },
> > "locator": "",
> > "exists": "true",
> > "meta": {
> > "category": 1,
> > "size": 9200,
> > "mtime": "2018-03-27 21:12:56.612172Z",
> > "etag": "c365c324cda944d2c3b687c0785be735",
> > "owner": "mybucket",
> > "owner_display_name": "Bucket User",
> > "content_type": "application/octet-stream",
> > "accounted_size": 9194,
> > "user_data": ""
> > },
> > "tag": "0ef1a91a-4aee-427e-bdf8-30589abb2d3e.36603989.137292",
> > "flags": 0,
> > "pending_map": [],
> > "versioned_epoch": 0
> > }
> > }
> >
> >
> > On the secondaries:
> >
> > # radosgw-admin bi get --bucket=mybucket --object=myobject
> > ERROR: bi_get(): (2) No such file or directory
> >
> > How does one go about rectifying this mess?
> >
>
> Random blog in language I don't understand seems to allude to using
> radosgw-admin bi put to restore backed up indexes, but not under what
> circumstances you would use such a command.
>
> https://cloud.tencent.com/developer/article/1032854
>
> Would this be safe to run on secondaries?
>

Removed the bucket on the secondaries and scheduled a new sync.  However
this gets stuck at some point and radosgw is complaining about:

data sync: WARNING: skipping data log entry for missing bucket
mybucket:0ef1a91a-4aee-427e-bdf8-30589abb2d3e.92151615.1:21

Hopeless that RGW can't even do a simple job right; I removed the
problematic bucket on the master, but now there are hundreds of
shard objects inside the index pool, all of which look to be orphaned, and
still the warnings about the missing bucket continue to happen on the
secondaries.  In some cases there's an object on the secondary that
doesn't exist on the master.

All the while, ceph is still complaining about large omap files.

$ ceph daemon mon.ceph-mon-1 config get
osd_deep_scrub_large_omap_object_value_sum_threshold
{
"osd_deep_scrub_large_omap_object_value_sum_threshold": "1073741824"
}

It seems implausible that the cluster is still complaining about this
when the largest omap contains 71405 entries.


I can't run bi purge or metadata rm on the unreferenced entries
because the bucket itself no longer exists.  Can I remove objects from the
index pool using 'rados rm'?
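
To make that concrete, I mean something along these lines, if it is considered
safe (object names guessed from the marker of the deleted bucket):

  rados -p .rgw.buckets.index ls | grep '^\.dir\.0ef1a91a-4aee-427e-bdf8-30589abb2d3e.92151615.1'
  rados -p .rgw.buckets.index rm .dir.0ef1a91a-4aee-427e-bdf8-30589abb2d3e.92151615.1.21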

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] IRC channels now require registered and identified users

2019-02-19 Thread Joao Eduardo Luis
On 02/18/2019 07:17 PM, David Turner wrote:
> Is this still broken in the 1-way direction where Slack users' comments
> do not show up in IRC?  That would explain why nothing I ever type (as
> either helping someone or asking a question) ever have anyone respond to
> them.

I noticed that yesterday as well, but since we disabled the chan mode
that (I thought) led to that, I don't think that was the cause.

Mike, did we ever get word from Patrick on how this was set up? Any idea
what we should do here?

  -Joao

> 
> On Tue, Dec 18, 2018 at 6:50 AM Joao Eduardo Luis wrote:
> 
> On 12/18/2018 11:22 AM, Joao Eduardo Luis wrote:
> > On 12/18/2018 11:18 AM, Dan van der Ster wrote:
> >> Hi Joao,
> >>
> >> Has that broken the Slack connection? I can't tell if its broken or
> >> just quiet... last message on #ceph-devel was today at 1:13am.
> >
> > Just quiet, it seems. Just tested it and the bridge is still working.
> 
> Okay, turns out the ceph-ircslackbot user is not identified, and that
> makes it unable to send messages to the channel. This means the bridge
> is working in one direction only (irc to slack), and will likely break
> when/if the user leaves the channel (as it won't be able to get back
> in).
> 
> I will figure out just how this works today. In the mean time, I've
> relaxed the requirement for registered/identified users so that the bot
> works again. It will be reactivated once this is addressed.
> 
>   -Joao
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Prevent rebalancing in the same host?

2019-02-19 Thread Christian Balzer
On Tue, 19 Feb 2019 09:21:21 +0100 Marco Gaiarin wrote:

> Little cluster, 3 nodes, 4 OSD per node.
> 
> An OSD died, and ceph start to rebalance data between the OSD of the
> same node (not completing it, leading to 'near os full' warning).
> 
> As exist:
>   mon osd down out subtree limit = host
> 
> to prevent host rebalancing, there's some way to prevent intra-host OSD
> rebalancing?
> 
> 
You pretty much answered your own question, as in a limit of "osd" would do
the trick, though not just for intra-host.

If you want no re-balancing ever, you could also permanently set noout
and nodown and live with the consequences and the permanent warning state.
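
That is, something like (off the top of my head):

  ceph osd set noout
  ceph osd set nodown

or, for the subtree limit route, in ceph.conf on the mons:

  mon osd down out subtree limit = osd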


But of course everybody will (rightly) tell you that you need enough
capacity to at the very least deal with a single OSD loss.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Prevent rebalancing in the same host?

2019-02-19 Thread Marco Gaiarin


Little cluster: 3 nodes, 4 OSDs per node.

An OSD died, and Ceph started to rebalance data between the OSDs of the
same node (not completing it, leading to a 'nearfull' warning).

As there is:
	mon osd down out subtree limit = host

to prevent host rebalancing, is there some way to prevent intra-host OSD
rebalancing?


Thanks.

-- 
dott. Marco Gaiarin GNUPG Key ID: 240A3D66
  Associazione ``La Nostra Famiglia''  http://www.lanostrafamiglia.it/
  Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento (PN)
  marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797

Donate your 5 PER MILLE to LA NOSTRA FAMIGLIA!
  http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
(tax code 00307430132, ONLUS category or HEALTH RESEARCH)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade Luminous to mimic on Ubuntu 18.04

2019-02-19 Thread Kurt Bauer


Known problem, see http://tracker.ceph.com/issues/24326

It would nevertheless be nice to know whether it's planned to get this fixed.
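
In the meantime, one possible workaround (untested on my side) is that the
mimic repository does carry bionic builds, so pointing apt at

  deb https://download.ceph.com/debian-mimic/ bionic main

may be an option, depending on whether you are ready to jump straight to mimic.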

br,
Kurt

Ketil Froyn wrote on 19.02.19 08:47:

I think there may be something wrong with the apt repository for
bionic, actually. Compare the packages available for Xenial:

https://download.ceph.com/debian-luminous/dists/xenial/main/binary-amd64/Packages

to the ones available for Bionic:

https://download.ceph.com/debian-luminous/dists/bionic/main/binary-amd64/Packages

The only package listed in the repository for bionic is ceph-deploy,
while there are lots for xenial. A quick summary:

$ curl -s https://download.ceph.com/debian-luminous/dists/bionic/main/binary-amd64/Packages | grep ^Package | wc -l
1
$ curl -s https://download.ceph.com/debian-luminous/dists/xenial/main/binary-amd64/Packages | grep ^Package | wc -l
63

Ketil

On Tue, 19 Feb 2019 at 02:10, David Turner  wrote:

Everybody is just confused that you don't have a newer version of Ceph 
available. Are you running `apt-get dist-upgrade` to upgrade ceph? Do you have 
any packages being held back? There is no reason that Ubuntu 18.04 shouldn't be 
able to upgrade to 12.2.11.

On Mon, Feb 18, 2019, 4:38 PM  wrote:

Hello people,

On 11 February 2019 12:47:36 CET, c...@elchaka.de wrote:

Hello Ashley,

On 9 February 2019 17:30:31 CET, Ashley Merrick wrote:

What does the output of apt-get update look like on one of the nodes?

You can just list the lines that mention CEPH


... .. .
Get:6 https://download.ceph.com/debian-luminous bionic InRelease [8393 B]
... .. .

The last available is 12.2.8.

Any advice or recommendations on how to proceed to be able to update to
mimic/(nautilus)?

- Mehmet

- Mehmet


Thanks

On Sun, 10 Feb 2019 at 12:28 AM,  wrote:

Hello Ashley,

Thank you for this fast response.

I can't prove this yet, but I am already using Ceph's own repo for Ubuntu
18.04 and this 12.2.7/8 is the latest available there...

- Mehmet

On 9 February 2019 17:21:32 CET, Ashley Merrick <singap...@amerrick.co.uk> wrote:

Around available versions, are you using the Ubuntu repos or the CEPH
18.04 repo?

The updates will always be slower to reach you if you're waiting for it to
hit the Ubuntu repo vs adding CEPH's own.


On Sun, 10 Feb 2019 at 12:19 AM,  wrote:

Hello m8s,

I'm curious how we should do an upgrade of our Ceph cluster on Ubuntu
16/18.04, as (at least on our 18.04 nodes) we only have 12.2.7 (or .8?).

For an upgrade to mimic we should first update to the latest version,
actually 12.2.11 (iirc). Which is not possible on 18.04.

Is there an update path from 12.2.7/8 to the actual mimic release, or
better, the upcoming nautilus?

Any advice?

- Mehmet





--

Kurt Bauer
Vienna University Computer Center - ACOnet - VIX
Universitaetsstrasse 7, A-1010 Vienna, Austria, Europe
Tel: ++431 4277  - 14070 (Fax: - 814070)  KB1970-RIPE

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com