Re: [ceph-users] Debugging 'slow requests' ...

2019-02-11 Thread Brad Hubbard
Glad to help!

On Tue, Feb 12, 2019 at 4:55 PM Massimo Sgaravatto
 wrote:
>
> Thanks a lot Brad !
>
> The problem is indeed in the network: we moved the OSD nodes back to the 
> "old" switches and the problem disappeared.
>
> Now we have to figure out what is wrong/misconfigured with the new switch: we 
> would try to replicate the problem, possibly without a ceph deployment ...
>
> Thanks again for your help !
>
> Cheers, Massimo
>
> On Sun, Feb 10, 2019 at 12:07 AM Brad Hubbard  wrote:
>>
>> The log ends at
>>
>> $ zcat ceph-osd.5.log.gz |tail -2
>> 2019-02-09 07:37:00.022534 7f5fce60d700  1 --
>> 192.168.61.202:6816/157436 >> - conn(0x56308edcf000 :6816
>> s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=296 -
>>
>> The last two messages are outbound to 192.168.222.204 and there are no
>> further messages between these two hosts (other than osd_pings) in the
>> log.
>>
>> $ zcat ceph-osd.5.log.gz |gawk
>> '!/osd_ping/&&/192.168.222.202/&&/192.168.222.204/&&/07:29:3/'|tail -5
>> 2019-02-09 07:29:34.267744 7f5fcee0e700  1 --
>> 192.168.222.202:6816/157436 <== osd.29 192.168.222.204:6804/4159520
>> 1946  rep_scrubmap(8.2bc e1205735 from shard 29) v2  40+0+1492
>> (3695125937 0 2050362985) 0x563090674d80 con 0x56308bf61000
>> 2019-02-09 07:29:34.375223 7f5faf4b4700  1 --
>> 192.168.222.202:6816/157436 --> 192.168.222.204:6804/4159520 --
>> replica_scrub(pg:
>> 8.2bc,from:0'0,to:0'0,epoch:1205833/1205735,start:8:3d4e6145:::rbd_data.35f46d19abe4ed.77a4:0,end:8:3d4e6916:::rbd_data.a6dc2425de9600.0006249c:0,chunky:1,deep:0,version:9,allow_preemption:1,priority=5)
>> v9 -- 0x56308bdf2000 con 0
>> 2019-02-09 07:29:34.378535 7f5fcee0e700  1 --
>> 192.168.222.202:6816/157436 <== osd.29 192.168.222.204:6804/4159520
>> 1947  rep_scrubmap(8.2bc e1205735 from shard 29) v2  40+0+1494
>> (3695125937 0 865217733) 0x563092d90900 con 0x56308bf61000
>> 2019-02-09 07:29:34.415868 7f5faf4b4700  1 --
>> 192.168.222.202:6816/157436 --> 192.168.222.204:6804/4159520 --
>> osd_repop(client.171725953.0:404377591 8.9b e1205833/1205735
>> 8:d90adab6:::rbd_data.c47f3c390c8495.0001934a:head v
>> 1205833'4767322) v2 -- 0x56308ca42400 con 0
>> 2019-02-09 07:29:34.486296 7f5faf4b4700  1 --
>> 192.168.222.202:6816/157436 --> 192.168.222.204:6804/4159520 --
>> replica_scrub(pg:
>> 8.2bc,from:0'0,to:0'0,epoch:1205833/1205735,start:8:3d4e6916:::rbd_data.a6dc2425de9600.0006249c:0,end:8:3d4e7434:::rbd_data.47c1b437840214.0003c594:0,chunky:1,deep:0,version:9,allow_preemption:1,priority=5)
>> v9 -- 0x56308e565340 con 0
>>
>> I'd be taking a good, hard look at the network, yes.
>>
>> On Sat, Feb 9, 2019 at 6:33 PM Massimo Sgaravatto
>>  wrote:
>> >
>> > Thanks for your feedback !
>> >
>> > I increased debug_ms to 1/5.
>> >
>> > This is another slow request (full output from 'ceph daemon osd.5 
>> > dump_historic_ops' for this event is attached):
>> >
>> >
>> > {
>> > "description": "osd_op(client.171725953.0:404377591 8.9b 
>> > 8:d90adab6:
>> > ::rbd_data.c47f3c390c8495.0001934a:head [set-alloc-hint 
>> > object_size 4194
>> > 304 write_size 4194304,write 1413120~122880] snapc 0=[] 
>> > ondisk+write+known_if_re
>> > directed e1205833)",
>> > "initiated_at": "2019-02-09 07:29:34.404655",
>> > "age": 387.914193,
>> > "duration": 340.224154,
>> > "type_data": {
>> > "flag_point": "commit sent; apply or cleanup",
>> > "client_info": {
>> > "client": "client.171725953",
>> > "client_addr": "192.168.61.66:0/4056439540",
>> > "tid": 404377591
>> > },
>> > "events": [
>> > {
>> > "time": "2019-02-09 07:29:34.404655",
>> > "event": "initiated"
>> > },
>> > 
>> > 
>> >{
>> > "time": "2019-02-09 07:29:34.416752",
>> > "event": "op_applied"
>> > },
>> > {
>> > "time": "2019-02-09 07:29:34.417200",
>> > "event": "sub_op_commit_rec from 14"
>> > },
>> > {
>> > "time": "2019-02-09 07:35:14.628724",
>> > "event": "sub_op_commit_rec from 29"
>> > },
>> >
>> > osd.5 has IP 192.168.222.202
>> > osd.14 has IP 192.168.222.203
>> > osd.29 has IP 192.168.222.204
>> >
>> >
>> > Grepping for that client id in the osd.5 log, as far as I can understand
>> > (please correct me if my debugging is completely wrong), the request to 5
>> > and 14 is sent at 07:29:34:
>> >
>> > 2019-02-09 07:29:34.415808 7f5faf4b4700  1 -- 192.168.222.202:6816/157436 
>> > --> 192.168.222.203:6811/158495 -- osd_repop(client.171725953.0:404377591 
>> > 8.9b e1205833/1205735 8:d90

Re: [ceph-users] Debugging 'slow requests' ...

2019-02-11 Thread Massimo Sgaravatto
Thanks a lot Brad !

The problem is indeed in the network: we moved the OSD nodes back to the
"old" switches and the problem disappeared.

Now we have to figure out what is wrong/misconfigured with the new switch:
we will try to replicate the problem, possibly without a Ceph deployment (see
the sketch below).
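
One way to exercise the links without Ceph (a rough sketch; host names and
durations are purely illustrative) is iperf3 for throughput plus a flood ping
for loss:

  # on one OSD node
  iperf3 -s
  # on another OSD node, push several parallel streams across the new switch
  iperf3 -c osd-node-2 -P 8 -t 300
  # in parallel, watch for packet loss and latency spikes (flood ping needs root)
  ping -f -c 100000 osd-node-2

If the slow requests correlate with drops or throughput collapse here, the
switch configuration (MTU, flow control, LACP hashing) would be the first
thing to check.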

Thanks again for your help !

Cheers, Massimo

On Sun, Feb 10, 2019 at 12:07 AM Brad Hubbard  wrote:

> The log ends at
>
> $ zcat ceph-osd.5.log.gz |tail -2
> 2019-02-09 07:37:00.022534 7f5fce60d700  1 --
> 192.168.61.202:6816/157436 >> - conn(0x56308edcf000 :6816
> s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=296 -
>
> The last two messages are outbound to 192.168.222.204 and there are no
> further messages between these two hosts (other than osd_pings) in the
> log.
>
> $ zcat ceph-osd.5.log.gz |gawk
> '!/osd_ping/&&/192.168.222.202/&&/192.168.222.204/&&/07:29:3/'|tail -5
> 2019-02-09 07:29:34.267744 7f5fcee0e700  1 --
> 192.168.222.202:6816/157436 <== osd.29 192.168.222.204:6804/4159520
> 1946  rep_scrubmap(8.2bc e1205735 from shard 29) v2  40+0+1492
> (3695125937 0 2050362985) 0x563090674d80 con 0x56308bf61000
> 2019-02-09 07:29:34.375223 7f5faf4b4700  1 --
> 192.168.222.202:6816/157436 --> 192.168.222.204:6804/4159520 --
> replica_scrub(pg:
>
> 8.2bc,from:0'0,to:0'0,epoch:1205833/1205735,start:8:3d4e6145:::rbd_data.35f46d19abe4ed.77a4:0,end:8:3d4e6916:::rbd_data.a6dc2425de9600.0006249c:0,chunky:1,deep:0,version:9,allow_preemption:1,priority=5)
> v9 -- 0x56308bdf2000 con 0
> 2019-02-09 07:29:34.378535 7f5fcee0e700  1 --
> 192.168.222.202:6816/157436 <== osd.29 192.168.222.204:6804/4159520
> 1947  rep_scrubmap(8.2bc e1205735 from shard 29) v2  40+0+1494
> (3695125937 0 865217733) 0x563092d90900 con 0x56308bf61000
> 2019-02-09 07:29:34.415868 7f5faf4b4700  1 --
> 192.168.222.202:6816/157436 --> 192.168.222.204:6804/4159520 --
> osd_repop(client.171725953.0:404377591 8.9b e1205833/1205735
> 8:d90adab6:::rbd_data.c47f3c390c8495.0001934a:head v
> 1205833'4767322) v2 -- 0x56308ca42400 con 0
> 2019-02-09 07:29:34.486296 7f5faf4b4700  1 --
> 192.168.222.202:6816/157436 --> 192.168.222.204:6804/4159520 --
> replica_scrub(pg:
>
> 8.2bc,from:0'0,to:0'0,epoch:1205833/1205735,start:8:3d4e6916:::rbd_data.a6dc2425de9600.0006249c:0,end:8:3d4e7434:::rbd_data.47c1b437840214.0003c594:0,chunky:1,deep:0,version:9,allow_preemption:1,priority=5)
> v9 -- 0x56308e565340 con 0
>
> I'd be taking a good, hard look at the network, yes.
>
> On Sat, Feb 9, 2019 at 6:33 PM Massimo Sgaravatto
>  wrote:
> >
> > Thanks for your feedback !
> >
> > I increased debug_ms to 1/5.
> >
> > This is another slow request (full output from 'ceph daemon osd.5
> dump_historic_ops' for this event is attached):
> >
> >
> > {
> > "description": "osd_op(client.171725953.0:404377591 8.9b
> 8:d90adab6:
> > ::rbd_data.c47f3c390c8495.0001934a:head [set-alloc-hint
> object_size 4194
> > 304 write_size 4194304,write 1413120~122880] snapc 0=[]
> ondisk+write+known_if_re
> > directed e1205833)",
> > "initiated_at": "2019-02-09 07:29:34.404655",
> > "age": 387.914193,
> > "duration": 340.224154,
> > "type_data": {
> > "flag_point": "commit sent; apply or cleanup",
> > "client_info": {
> > "client": "client.171725953",
> > "client_addr": "192.168.61.66:0/4056439540",
> > "tid": 404377591
> > },
> > "events": [
> > {
> > "time": "2019-02-09 07:29:34.404655",
> > "event": "initiated"
> > },
> > 
> > 
> >{
> > "time": "2019-02-09 07:29:34.416752",
> > "event": "op_applied"
> > },
> > {
> > "time": "2019-02-09 07:29:34.417200",
> > "event": "sub_op_commit_rec from 14"
> > },
> > {
> > "time": "2019-02-09 07:35:14.628724",
> > "event": "sub_op_commit_rec from 29"
> > },
> >
> > osd.5 has IP 192.168.222.202
> > osd.14 has IP 192.168.222.203
> > osd.29 has IP 192.168.222.204
> >
> >
> > Grepping for that client id in the osd.5 log, as far as I can understand
> > (please correct me if my debugging is completely wrong), the request to 5
> > and 14 is sent at 07:29:34:
> >
> > 2019-02-09 07:29:34.415808 7f5faf4b4700  1 --
> 192.168.222.202:6816/157436 --> 192.168.222.203:6811/158495 --
> osd_repop(client.171725953.0:404377591 8.9b e1205833/1205735 8:d90ada\
> > b6:::rbd_data.c47f3c390c8495.0001934a:head v 1205833'4767322) v2
> -- 0x56307bb61e00 con 0
> > 2019-02-09 07:29:34.415868 7f5faf4b4700  1 

Re: [ceph-users] will crush rule be used during object relocation in OSD failure ?

2019-02-11 Thread ST Wong (ITSC)
Hi all,

Tested 4 cases.  Cases 1-3 behaved as expected, while for case 4 the rebuild
did not take place in the surviving room, as Gregory mentioned might happen.
Repeating case 4 several times on both rooms gave the same result.  We're
running mimic 13.2.2.

E.g.

Room1
Host 1 osd: 2,5
Host 2 osd: 1,3

Room 2  <-- failed room
Host 3 osd: 0,4
Host 4 osd: 6,7


Before:
5.62  0  0  0  0  0  0  0  0  active+clean  2019-02-12 04:47:28.183375
  0'0  3643:2299  [0,7,5]  0  [0,7,5]  0
  0'0  2019-02-12 04:47:28.183218  0'0  2019-02-11 01:20:51.276922  0

After:
5.62  0  0  0  0  0  0  0  0  undersized+peered  2019-02-12 09:10:59.101096
  0'0  3647:2284  [5]  5  [5]  5
  0'0  2019-02-12 04:47:28.183218  0'0  2019-02-11 01:20:51.276922  0
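
For what it's worth, this mapping behaviour can also be reproduced offline with
crushtool (a sketch; file names are illustrative, and the --weight overrides
mark the failed room's OSDs 0,4,6,7 as out):

  ceph osd getcrushmap -o crushmap.bin
  # healthy mappings for the rule used by the pool
  crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-mappings
  # simulate the failed room by zero-weighting its OSDs
  crushtool -i crushmap.bin --test --rule 0 --num-rep 3 \
      --weight 0 0 --weight 4 0 --weight 6 0 --weight 7 0 \
      --show-mappings --show-bad-mappings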

Fyi.   Sorry for the belated report.

Thanks a lot.
/st


From: Gregory Farnum 
Sent: Monday, November 26, 2018 9:27 PM
To: ST Wong (ITSC) 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] will crush rule be used during object relocation in 
OSD failure ?

On Fri, Nov 23, 2018 at 11:01 AM ST Wong (ITSC) 
mailto:s...@itsc.cuhk.edu.hk>> wrote:

Hi all,



We've 8 osd hosts, 4 in room 1 and 4 in room2.

A pool with size = 3 using following crush map is created, to cater for room 
failure.


rule multiroom {
id 0
type replicated
min_size 2
max_size 4
step take default
step choose firstn 2 type room
step chooseleaf firstn 2 type host
step emit
}




We're expecting:

1. For each object, there are always 2 replicas in one room and 1 replica in the
other room, making size=3.  But we can't control which room has 1 or 2 replicas.

Right.


2. In case an OSD host fails, Ceph will assign remaining OSDs to
the same PG to hold the replicas that were on the failed OSD host.  Selection is
based on the crush rule of the pool, thus maintaining the same failure domain -
it won't put all replicas in the same room.

Yes, if a host fails the copies it held will be replaced by new copies in the 
same room.


3. In case the entire room holding 1 replica fails, the pool will
remain degraded but won't do any replica relocation.

Right.


4. In case the entire room holding 2 replicas fails, Ceph will make use of OSDs
in the surviving room to make 2 replicas.  The pool will not be writeable before
all objects have 2 copies (unless we make pool size=4?).  Then when recovery is
complete, the pool will remain in a degraded state until the failed room
recovers.

Hmm, I'm actually not sure if this will work out — because CRUSH is 
hierarchical, it will keep trying to select hosts from the dead room and will 
fill out the location vector's first two spots with -1. It could be that Ceph 
will skip all those "nonexistent" entries and just pick the two copies from 
slots 3 and 4, but it might not. You should test this carefully and report back!
-Greg

Is our understanding correct?  Thanks a lot.
Will do some simulation later to verify.

Regards,
/stwong
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Update / upgrade cluster with MDS from 12.2.7 to 12.2.11

2019-02-11 Thread Patrick Donnelly
On Mon, Feb 11, 2019 at 12:10 PM Götz Reinicke
 wrote:
> as 12.2.11 is out for some days and no panic mails showed up on the list I 
> was planing to update too.
>
> I know there are recommended orders in which to update/upgrade the cluster 
> but I don’t know how rpm packages are handling restarting services after a 
> yum update. E.g. when MDS and MONs are on the same server.

This should be fine. The MDS only uses a new executable file if you
explicitly restart it via systemd (or if the MDS fails and systemd
restarts it).

More info: when the MDS respawns in normal circumstances, it passes
the /proc/self/exe file to execve. An intended side-effect is that the
MDS will continue using the same executable file across execs.
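
A quick way to verify this on a running node (a sketch; assumes a single
ceph-mds process on the host):

  ls -l /proc/$(pidof ceph-mds)/exe

After the package has been updated on disk, the link keeps pointing at the old,
now-unlinked binary (usually shown with a '(deleted)' suffix) until the daemon
is restarted.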

> And regarding an MDS Cluster I like to ask, if the upgrading instructions 
> regarding only running one MDS during upgrading also applies for an update?
>
> http://docs.ceph.com/docs/mimic/cephfs/upgrading/

If you upgrade an MDS, it may update the compatibility bits in the
Monitor's MDSMap. Other MDSs will abort when they see this change. The
upgrade process is intended to help you avoid seeing those errors so you
don't inadvertently think something went wrong.

If you don't mind seeing those errors and you're using 1 active MDS,
then don't worry about it.

Good luck!

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS overwrite/truncate performance hit

2019-02-11 Thread Gregory Farnum
On Thu, Feb 7, 2019 at 3:31 AM Hector Martin  wrote:

> On 07/02/2019 19:47, Marc Roos wrote:
> >
> > Is this difference not related to caching? And you filling up some
> > cache/queue at some point? If you do a sync after each write, do you
> > still have the same results?
>
> No, the slow operations are slow from the very beginning. It's not about
> filling a buffer/cache somewhere. I'm guessing the slow operations
> trigger several synchronous writes to the underlying OSDs, while the
> fast ones don't. But I'd like to know more about why exactly there is
> this significant performance hit to truncation operations vs. normal
> writes.
>
> To give some more numbers:
>
> echo test | dd of=b conv=notrunc
>
> This completes extremely quickly (microseconds). The data obviously
> remains in the client cache at this point. This is what I want.
>
> echo test | dd of=b conv=notrunc,fdatasync
>
> This runs quickly until the fdatasync(), then that takes ~12ms, which is
> about what I'd expect for a synchronous write to the underlying HDDs. Or
> maybe that's two writes?


It's certainly one write, and may be two overlapping ones if you've
extended the file and need to persist its new size (via the MDS journal).


>


> echo test | dd of=b
>
> This takes ~10ms in the best case for the open() call (sometimes 30-40
> or even more), and 6-8ms for the write() call.
>
> echo test | dd of=b conv=fdatasync
>
> This takes ~10ms for the open() call, ~8ms for the write() call, and
> ~18ms for the fdatasync() call.
>
> So it seems like truncating/recreating an existing file introduces
> several disk I/Os worth of latency and forces synchronous behavior
> somewhere down the stack, while merely creating a new file or writing to
> an existing one without truncation does not.
>

Right. Truncates and renames require sending messages to the MDS, and the
MDS committing to RADOS (aka its disk) the change in status, before they
can be completed. Creating new files will generally use a preallocated
inode so it's just a network round-trip to the MDS.
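
A quick way to see where that latency lands (a sketch; the mount point is
illustrative) is to time the individual syscalls of the two dd variants:

  strace -T -e trace=open,openat,write,ftruncate,fdatasync \
      dd if=/dev/zero of=/mnt/cephfs/b bs=4k count=1 conv=notrunc,fdatasync
  strace -T -e trace=open,openat,write,ftruncate,fdatasync \
      dd if=/dev/zero of=/mnt/cephfs/b bs=4k count=1 conv=fdatasync

-T prints the time spent in each syscall, so the extra cost of the truncating
path shows up directly against the plain overwrite.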

Going back to your first email, if you do an overwrite that is confined to
a single stripe unit in RADOS (by default, a stripe unit is the size of
your objects which is 4MB and it's aligned from 0), it is guaranteed to be
atomic. CephFS can only tear writes across objects, and only if your client
fails before the data has been flushed.
-Greg

>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] faster switch to another mds

2019-02-11 Thread Gregory Farnum
You can't tell from the client log here, but probably the MDS itself was
failing over to a new instance during that interval. There's not much
experience with it, but you could experiment with faster failover by
reducing the mds beacon and grace times. This may or may not work
reliably...
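
For example (values purely illustrative; the defaults are a 4 second beacon
interval and a 15 second grace period), something along these lines could be
tried on a Mimic or newer cluster, or via ceph.conf on older releases:

  ceph config set mds mds_beacon_interval 2
  ceph config set global mds_beacon_grace 10

Lowering the grace too far risks spurious failovers under load, so change it
gradually and watch the MDS logs.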

On Sat, Feb 9, 2019 at 10:52 AM Fyodor Ustinov  wrote:

> Hi!
>
> I have ceph cluster with 3 nodes with mon/mgr/mds servers.
> I reboot one node and see this in client log:
>
> Feb 09 20:29:14 ceph-nfs1 kernel: libceph: mon2 10.5.105.40:6789 socket
> closed (con state OPEN)
> Feb 09 20:29:14 ceph-nfs1 kernel: libceph: mon2 10.5.105.40:6789 session
> lost, hunting for new mon
> Feb 09 20:29:14 ceph-nfs1 kernel: libceph: mon0 10.5.105.34:6789 session
> established
> Feb 09 20:29:22 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket
> closed (con state OPEN)
> Feb 09 20:29:23 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket
> closed (con state CONNECTING)
> Feb 09 20:29:24 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket
> closed (con state CONNECTING)
> Feb 09 20:29:24 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket
> closed (con state CONNECTING)
> Feb 09 20:29:53 ceph-nfs1 kernel: ceph: mds0 reconnect start
> Feb 09 20:29:53 ceph-nfs1 kernel: ceph: mds0 reconnect success
> Feb 09 20:30:05 ceph-nfs1 kernel: ceph: mds0 recovery completed
>
> As I understand it, the following has happened:
> 1. The client detects that the link with the mon server is broken and quickly
> switches to another mon (in less than 1 second).
> 2. The client detects that the link with the mds server is broken, tries to
> reconnect 3 times (unsuccessfully), then waits and reconnects to the same mds
> after 30 seconds of downtime.
>
> I have 2 questions:
> 1. Why?
> 2. How to reduce switching time to another mds?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Update / upgrade cluster with MDS from 12.2.7 to 12.2.11

2019-02-11 Thread Götz Reinicke
Hi,

as 12.2.11 has been out for some days and no panic mails showed up on the list,
I was planning to update too.

I know there are recommended orders in which to update/upgrade the cluster, but
I don't know how the rpm packages handle restarting services after a yum
update, e.g. when MDS and MONs are on the same server.

And regarding an MDS cluster, I would like to ask whether the upgrade
instructions about running only one MDS during an upgrade also apply to an
update?

http://docs.ceph.com/docs/mimic/cephfs/upgrading/

For me an update is e.g. from a point release, like 12.2.7 to 12.2.11, and an
upgrade might be from 12.2 to 12.3.

Thanks for feedback . Regards . Götz




smime.p7s
Description: S/MIME cryptographic signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pool/volume live migration

2019-02-11 Thread Jason Dillaman
On Mon, Feb 11, 2019 at 4:53 AM Luis Periquito  wrote:
>
> Hi Jason,
>
> that's been very helpful, but it got me thinking and looking.
>
> The pool name is both inside the libvirt.xml (and running KVM config)
> and it's cached in the Nova database. For it to change would require a
> detach/attach which may not be viable or easy, specially for boot
> volumes.
>
> What about:
> 1 - upgrade all binaries to Nautilus and ensure Ceph daemons are
> restarted and all are running Nautilus
> 2 - stop the OpenStack instances that are on that pool
> 3 - rename "openstack" pool to "old.openstack"
> 4 - create new pool "openstack"
> 5 - "rbd migration" prepare all the RBD volumes in the "old.openstack"
> 6 - start all instances again. This should ensure all instances are
> running with the Nautilius binaries
> 7 - run the "rbd migration execute" and "rbd migration commit" as
> performance allows
>
> when all is finished delete the "old.openstack" pool.
>
> As we're running an EC+CT environment will this still apply? Should we
> remove the CT before doing this or will Ceph be aware of the Cache
> Tier and do it transparently?
>
> I will test all this a few times before doing it in any significant
> environment. I will do my best to share the experience in the mailing
> list after...
> Is there anything we should be aware, and is there anything I could
> report back to the community to test/experiment/debug the solution?
>
> and thanks for all your help.

I think that should work. You won't be able to remove the cache tier
against the old pool since directly storing RBD images on an EC pool
is not possible. However, internally, Ceph should be using the unique
pool id (and not the pool name), so the cache tier will stick to the
original pool even after a rename.
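
For reference, a minimal sketch of steps 3-7 on the Ceph side (pool names,
image names and PG counts are illustrative; assumes Nautilus tooling):

  ceph osd pool rename openstack old.openstack
  ceph osd pool create openstack 512 512
  ceph osd pool application enable openstack rbd
  # per volume:
  rbd migration prepare old.openstack/volume-1234 openstack/volume-1234
  # later, as performance allows:
  rbd migration execute openstack/volume-1234
  rbd migration commit openstack/volume-1234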

> On Fri, Feb 8, 2019 at 5:20 PM Jason Dillaman  wrote:
> >
> > On Fri, Feb 8, 2019 at 11:43 AM Luis Periquito  wrote:
> > >
> > > This is indeed for an OpenStack cloud - it didn't require any level of
> > > performance (so was created on an EC pool) and now it does :(
> > >
> > > So the idea would be:
> >
> > 0 - upgrade OSDs and librbd clients to Nautilus
> >
> > > 1- create a new pool
> >
> > Are you using EC via cache tier over a replicated pool or an RBD image
> > with an EC data pool?
> >
> > > 2- change cinder to use the new pool
> > >
> > > for each volume
> > >   3- stop the usage of the volume (stop the instance?)
> > >   4- "live migrate" the volume to the new pool
> >
> > Yes, execute the "rbd migration prepare" step here and manually update
> > the Cinder database to point the instance to the new pool (if the pool
> > name changed). I cannot remember if Nova caches the Cinder volume
> > connector details, so you might also need to detach/re-attach the
> > volume if that's the case (or tweak the Nova database entries as
> > well).
> >
> > >   5- start up the instance again
> >
> > 6 - run "rbd migration execute" and "rbd migration commit" at your 
> > convenience.
> >
> > >
> > >
> > > Does that sound right?
> > >
> > > thanks,
> > >
> > > On Fri, Feb 8, 2019 at 4:25 PM Jason Dillaman  wrote:
> > > >
> > > > Correction: at least for the initial version of live-migration, you
> > > > need to temporarily stop clients that are using the image, execute
> > > > "rbd migration prepare", and then restart the clients against the new
> > > > destination image. The "prepare" step will fail if it detects that the
> > > > source image is in-use.
> > > >
> > > > On Fri, Feb 8, 2019 at 9:00 AM Jason Dillaman  
> > > > wrote:
> > > > >
> > > > > Indeed, it is forthcoming in the Nautilus release.
> > > > >
> > > > > You would initiate a "rbd migration prepare 
> > > > > " to transparently link the dst-image-spec to the
> > > > > src-image-spec. Any active Nautilus clients against the image will
> > > > > then re-open the dst-image-spec for all IO operations. Read requests
> > > > > that cannot be fulfilled by the new dst-image-spec will be forwarded
> > > > > to the original src-image-spec (similar to how parent/child cloning
> > > > > behaves). Write requests to the dst-image-spec will force a deep-copy
> > > > > of all impacted src-image-spec backing data objects (including
> > > > > snapshot history) to the associated dst-image-spec backing data
> > > > > object.  At any point a storage admin can run "rbd migration execute"
> > > > > to deep-copy all src-image-spec data blocks to the dst-image-spec.
> > > > > Once the migration is complete, you would just run "rbd migration
> > > > > commit" to remove src-image-spec.
> > > > >
> > > > > Note: at some point prior to "rbd migration commit", you will need to
> > > > > take minimal downtime to switch OpenStack volume registration from the
> > > > > old image to the new image if you are changing pools.
> > > > >
> > > > > On Fri, Feb 8, 2019 at 5:33 AM Caspar Smit  
> > > > > wrote:
> > > > > >
> > > > > > Hi Luis,
> > > > > >
> > > > > > According to slide 21 of Sage's presentation at FOSDEM it is coming 
> > > > > > i

Re: [ceph-users] MDS crash (Mimic 13.2.2 / 13.2.4 ) elist.h: 39: FAILED assert(!is_on_list())

2019-02-11 Thread Yan, Zheng
On Mon, Feb 11, 2019 at 8:01 PM Jake Grimmett  wrote:
>
> Hi Zheng,
>
> Many, many thanks for your help...
>
> Your suggestion of setting large values for mds_cache_size and
> mds_cache_memory_limit stopped our MDS crashing :)
>
> The values in ceph.conf are now:
>
> mds_cache_size = 8589934592
> mds_cache_memory_limit = 17179869184
>
> Should these values be left in our configuration?

No, you should change them back to their original values.

>
> again thanks for the assistance,
>
> Jake
>
> On 2/11/19 8:17 AM, Yan, Zheng wrote:
> > On Sat, Feb 9, 2019 at 12:36 AM Jake Grimmett  
> > wrote:
> >>
> >> Dear All,
> >>
> >> Unfortunately the MDS has crashed on our Mimic cluster...
> >>
> >> First symptoms were rsync giving:
> >> "No space left on device (28)"
> >> when trying to rename or delete
> >>
> >> This prompted me to try restarting the MDS, as it reported laggy.
> >>
> >> Restarting the MDS, shows this as error in the log before the crash:
> >>
> >> elist.h: 39: FAILED assert(!is_on_list())
> >>
> >> A full MDS log showing the crash is here:
> >>
> >> http://p.ip.fi/iWlz
> >>
> >> I've tried upgrading the cluster to 13.2.4, but the MDS still crashes...
> >>
> >> The cluster has 10 nodes, 254 OSD's, uses EC for the data, 3x
> >> replication for MDS. We have a single active MDS, with two failover MDS
> >>
> >> We have ~2PB of cephfs data here, all of which is currently
> >> inaccessible, all and any advice gratefully received :)
> >>
> >
> > Add mds_cache_size and mds_cache_memory_limit to ceph.conf and set
> > them to very large values before starting mds. If mds does not crash,
> > restore the mds_cache_size and mds_cache_memory_limit  to their
> > original values (by admin socket) after mds becomes active for 10
> > seconds
> >
> > If mds still crash, try compile ceph-mds with following patch
> >
> > diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
> > index d3461fba2e..c2731e824c 100644
> > --- a/src/mds/CDir.cc
> > +++ b/src/mds/CDir.cc
> > @@ -508,6 +508,8 @@ void CDir::remove_dentry(CDentry *dn)
> >// clean?
> >if (dn->is_dirty())
> >  dn->mark_clean();
> > +  if (inode->is_stray())
> > +dn->item_stray.remove_myself();
> >
> >if (dn->state_test(CDentry::STATE_BOTTOMLRU))
> >  cache->bottom_lru.lru_remove(dn);
> >
> >
> >> best regards,
> >>
> >> Jake
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS crash (Mimic 13.2.2 / 13.2.4 ) elist.h: 39: FAILED assert(!is_on_list())

2019-02-11 Thread Jake Grimmett
Hi Zheng,

Sorry - I've just re-read your email and saw your instruction to restore
the mds_cache_size and mds_cache_memory_limit to original values if the
MDS does not crash - I have now done this...

thanks again for your help,

best regards,

Jake

On 2/11/19 12:01 PM, Jake Grimmett wrote:
> Hi Zheng,
> 
> Many, many thanks for your help...
> 
> Your suggestion of setting large values for mds_cache_size and
> mds_cache_memory_limit stopped our MDS crashing :)
> 
> The values in ceph.conf are now:
> 
> mds_cache_size = 8589934592
> mds_cache_memory_limit = 17179869184
> 
> Should these values be left in our configuration?
> 
> again thanks for the assistance,
> 
> Jake
> 
> On 2/11/19 8:17 AM, Yan, Zheng wrote:
>> On Sat, Feb 9, 2019 at 12:36 AM Jake Grimmett  wrote:
>>>
>>> Dear All,
>>>
>>> Unfortunately the MDS has crashed on our Mimic cluster...
>>>
>>> First symptoms were rsync giving:
>>> "No space left on device (28)"
>>> when trying to rename or delete
>>>
>>> This prompted me to try restarting the MDS, as it reported laggy.
>>>
>>> Restarting the MDS, shows this as error in the log before the crash:
>>>
>>> elist.h: 39: FAILED assert(!is_on_list())
>>>
>>> A full MDS log showing the crash is here:
>>>
>>> http://p.ip.fi/iWlz
>>>
>>> I've tried upgrading the cluster to 13.2.4, but the MDS still crashes...
>>>
>>> The cluster has 10 nodes, 254 OSD's, uses EC for the data, 3x
>>> replication for MDS. We have a single active MDS, with two failover MDS
>>>
>>> We have ~2PB of cephfs data here, all of which is currently
>>> inaccessible, all and any advice gratefully received :)
>>>
>>
>> Add mds_cache_size and mds_cache_memory_limit to ceph.conf and set
>> them to very large values before starting mds. If mds does not crash,
>> restore the mds_cache_size and mds_cache_memory_limit  to their
>> original values (by admin socket) after mds becomes active for 10
>> seconds
>>
>> If mds still crash, try compile ceph-mds with following patch
>>
>> diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
>> index d3461fba2e..c2731e824c 100644
>> --- a/src/mds/CDir.cc
>> +++ b/src/mds/CDir.cc
>> @@ -508,6 +508,8 @@ void CDir::remove_dentry(CDentry *dn)
>>// clean?
>>if (dn->is_dirty())
>>  dn->mark_clean();
>> +  if (inode->is_stray())
>> +dn->item_stray.remove_myself();
>>
>>if (dn->state_test(CDentry::STATE_BOTTOMLRU))
>>  cache->bottom_lru.lru_remove(dn);
>>
>>
>>> best regards,
>>>
>>> Jake
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS crash (Mimic 13.2.2 / 13.2.4 ) elist.h: 39: FAILED assert(!is_on_list())

2019-02-11 Thread Jake Grimmett
Hi Zheng,

Many, many thanks for your help...

Your suggestion of setting large values for mds_cache_size and
mds_cache_memory_limit stopped our MDS crashing :)

The values in ceph.conf are now:

mds_cache_size = 8589934592
mds_cache_memory_limit = 17179869184

Should these values be left in our configuration?

again thanks for the assistance,

Jake

On 2/11/19 8:17 AM, Yan, Zheng wrote:
> On Sat, Feb 9, 2019 at 12:36 AM Jake Grimmett  wrote:
>>
>> Dear All,
>>
>> Unfortunately the MDS has crashed on our Mimic cluster...
>>
>> First symptoms were rsync giving:
>> "No space left on device (28)"
>> when trying to rename or delete
>>
>> This prompted me to try restarting the MDS, as it reported laggy.
>>
>> Restarting the MDS, shows this as error in the log before the crash:
>>
>> elist.h: 39: FAILED assert(!is_on_list())
>>
>> A full MDS log showing the crash is here:
>>
>> http://p.ip.fi/iWlz
>>
>> I've tried upgrading the cluster to 13.2.4, but the MDS still crashes...
>>
>> The cluster has 10 nodes, 254 OSD's, uses EC for the data, 3x
>> replication for MDS. We have a single active MDS, with two failover MDS
>>
>> We have ~2PB of cephfs data here, all of which is currently
>> inaccessible, all and any advice gratefully received :)
>>
> 
> Add mds_cache_size and mds_cache_memory_limit to ceph.conf and set
> them to very large values before starting mds. If mds does not crash,
> restore the mds_cache_size and mds_cache_memory_limit  to their
> original values (by admin socket) after mds becomes active for 10
> seconds
> 
> If mds still crash, try compile ceph-mds with following patch
> 
> diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
> index d3461fba2e..c2731e824c 100644
> --- a/src/mds/CDir.cc
> +++ b/src/mds/CDir.cc
> @@ -508,6 +508,8 @@ void CDir::remove_dentry(CDentry *dn)
>// clean?
>if (dn->is_dirty())
>  dn->mark_clean();
> +  if (inode->is_stray())
> +dn->item_stray.remove_myself();
> 
>if (dn->state_test(CDentry::STATE_BOTTOMLRU))
>  cache->bottom_lru.lru_remove(dn);
> 
> 
>> best regards,
>>
>> Jake
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade Luminous to mimic on Ubuntu 18.04

2019-02-11 Thread ceph
Hello Ashley,

Am 9. Februar 2019 17:30:31 MEZ schrieb Ashley Merrick 
:
>What does the output of apt-get update look like on one of the nodes?
>
>You can just list the lines that mention CEPH
>

... .. .
Get:6 Https://Download.ceph.com/debian-luminous bionic InRelease [8393 B]
... .. .

The last version available there is 12.2.8.

- Mehmet

>Thanks
>
>On Sun, 10 Feb 2019 at 12:28 AM,  wrote:
>
>> Hello Ashley,
>>
>> Thank you for this fast response.
>>
>> I can't prove this yet, but I am already using Ceph's own repo for Ubuntu
>> 18.04, and 12.2.7/8 is the latest available there...
>>
>> - Mehmet
>>
>> Am 9. Februar 2019 17:21:32 MEZ schrieb Ashley Merrick <
>> singap...@amerrick.co.uk>:
>> >Around available versions, are you using the Ubuntu repo’s or the
>CEPH
>> >18.04 repo.
>> >
>> >The updates will always be slower to reach you if your waiting for
>it
>> >to
>> >hit the Ubuntu repo vs adding CEPH’s own.
>> >
>> >
>> >On Sun, 10 Feb 2019 at 12:19 AM,  wrote:
>> >
>> >> Hello m8s,
>> >>
>> >> Im curious how we should do an Upgrade of our ceph Cluster on
>Ubuntu
>> >> 16/18.04. As (At least on our 18.04 nodes) we only have 12.2.7 (or
>> >.8?)
>> >>
>> >> For an Upgrade to mimic we should First Update to Last version,
>> >actualy
>> >> 12.2.11 (iirc).
>> >> Which is not possible on 18.04.
>> >>
>> >> Is there a Update path from 12.2.7/8 to actual mimic release or
>> >better the
>> >> upcoming nautilus?
>> >>
>> >> Any advice?
>> >>
>> >> - Mehmet___
>> >> ceph-users mailing list
>> >> ceph-users@lists.ceph.com
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-11 Thread Igor Fedotov


On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote:

another mempool dump after 1h run. (latency ok)

Biggest difference:

before restart
-
"bluestore_cache_other": {
"items": 48661920,
"bytes": 1539544228
},
"bluestore_cache_data": {
"items": 54,
"bytes": 643072
},
(other caches seem to be quite low too; it looks like bluestore_cache_other
takes all the memory)


After restart
-
"bluestore_cache_other": {
  "items": 12432298,
   "bytes": 500834899
},
"bluestore_cache_data": {
  "items": 40084,
  "bytes": 1056235520
},

This is fine as cache is warming after restart and some rebalancing 
between data and metadata  might occur.


What relates to allocator and most probably to fragmentation growth is :

"bluestore_alloc": {
"items": 165053952,
"bytes": 165053952
},

which had been higher before the reset (if I got these dumps' order 
properly)


"bluestore_alloc": {
"items": 210243456,
"bytes": 210243456
},

But as I mentioned, I'm not 100% sure this can cause such a huge
latency increase...


Do you have perf counters dump after the restart?

Could you collect some more dumps - for both mempool and perf counters?

So ideally I'd like to have:

1) mempool/perf counters dumps after the restart (1hour is OK)

2) mempool/perf counters dumps in 24+ hours after restart

3) reset perf counters  after 2), wait for 1 hour (and without OSD 
restart) and dump mempool/perf counters again.


So we'll be able to learn both allocator mem usage growth and operation 
latency distribution for the following periods:


a) 1st hour after restart

b) 25th hour.


Thanks,

Igor



full mempool dump after restart
---

{
 "mempool": {
 "by_pool": {
 "bloom_filter": {
 "items": 0,
 "bytes": 0
 },
 "bluestore_alloc": {
 "items": 165053952,
 "bytes": 165053952
 },
 "bluestore_cache_data": {
 "items": 40084,
 "bytes": 1056235520
 },
 "bluestore_cache_onode": {
 "items": 5,
 "bytes": 14935200
 },
 "bluestore_cache_other": {
 "items": 12432298,
 "bytes": 500834899
 },
 "bluestore_fsck": {
 "items": 0,
 "bytes": 0
 },
 "bluestore_txc": {
 "items": 11,
 "bytes": 8184
 },
 "bluestore_writing_deferred": {
 "items": 5047,
 "bytes": 22673736
 },
 "bluestore_writing": {
 "items": 91,
 "bytes": 1662976
 },
 "bluefs": {
 "items": 1907,
 "bytes": 95600
 },
 "buffer_anon": {
 "items": 19664,
 "bytes": 25486050
 },
 "buffer_meta": {
 "items": 46189,
 "bytes": 2956096
 },
 "osd": {
 "items": 243,
 "bytes": 3089016
 },
 "osd_mapbl": {
 "items": 17,
 "bytes": 214366
 },
 "osd_pglog": {
 "items": 889673,
 "bytes": 367160400
 },
 "osdmap": {
 "items": 3803,
 "bytes": 224552
 },
 "osdmap_mapping": {
 "items": 0,
 "bytes": 0
 },
 "pgmap": {
 "items": 0,
 "bytes": 0
 },
 "mds_co": {
 "items": 0,
 "bytes": 0
 },
 "unittest_1": {
 "items": 0,
 "bytes": 0
 },
 "unittest_2": {
 "items": 0,
 "bytes": 0
 }
 },
 "total": {
 "items": 178515204,
 "bytes": 2160630547
 }
 }
}

- Mail original -
De: "aderumier" 
À: "Igor Fedotov" 
Cc: "Stefan Priebe, Profihost AG" , "Mark Nelson" , "Sage Weil" 
, "ceph-users" , "ceph-devel" 
Envoyé: Vendredi 8 Février 2019 16:14:54
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

I'm just seeing

StupidAllocator::_aligned_len
and
btree::btree_iterator, mempoo

on 1 osd, both 10%.

here the dump_mempools

{
"mempool": {
"by_pool": {
"bloom_filter": {
"items": 0,
"bytes": 0
},
"bluestore_alloc": {
"items": 210243456,
"bytes": 210243456
},
"bluestore_cache_data": {
"items": 54,
"bytes": 643072
},
"bluestore_cache_onode": {
"items": 105637,
"bytes": 70988064
},
"bluestore_cache_other": {
"items": 48661920,
"bytes": 1539544228
},

[ceph-users] NAS solution for CephFS

2019-02-11 Thread Marvin Zhang
Hi,
As http://docs.ceph.com/docs/master/cephfs/nfs/ says, it's OK to
configure active/passive NFS-Ganesha on top of CephFS. My question is whether
we can use active/active NFS-Ganesha for CephFS.
As I see it, state consistency is the only thing we need to think about:
1. Lock support for active/active. Even though each NFS-Ganesha server
maintains its own lock state, the real lock/unlock calls go through
ceph_ll_getlk/ceph_ll_setlk, so the Ceph cluster will handle locking safely.
2. Delegation support for active/active. Similar to question 1;
ceph_ll_delegation will handle it safely.
3. NFS-Ganesha cache support for active/active. As
https://github.com/nfs-ganesha/nfs-ganesha/blob/next/src/config_samples/ceph.conf
describes, we can configure the cache size as 1 (see the sketch below).
4. Ceph-FSAL cache support for active/active. As with other CephFS clients,
there are no issues with cache consistency.
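
Regarding point 3, the cache-minimizing settings in that sample look roughly
like the following; I'm quoting from memory and the block/key names differ
between Ganesha releases, so treat this as an assumption and check the sample
shipped with your version:

  CACHEINODE {
      # keep Ganesha's own metadata cache minimal and let libcephfs do the caching
      Dir_Chunk = 0;
      NParts = 1;
      Cache_Size = 1;
  }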

Thanks,
Marvin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pool/volume live migration

2019-02-11 Thread Luis Periquito
Hi Jason,

that's been very helpful, but it got me thinking and looking.

The pool name is both inside the libvirt.xml (and the running KVM config)
and cached in the Nova database. Changing it would require a
detach/attach, which may not be viable or easy, especially for boot
volumes.

What about:
1 - upgrade all binaries to Nautilus and ensure Ceph daemons are
restarted and all are running Nautilus
2 - stop the OpenStack instances that are on that pool
3 - rename "openstack" pool to "old.openstack"
4 - create new pool "openstack"
5 - "rbd migration" prepare all the RBD volumes in the "old.openstack"
6 - start all instances again. This should ensure all instances are
running with the Nautilius binaries
7 - run the "rbd migration execute" and "rbd migration commit" as
performance allows

when all is finished delete the "old.openstack" pool.

As we're running an EC+CT environment will this still apply? Should we
remove the CT before doing this or will Ceph be aware of the Cache
Tier and do it transparently?

I will test all this a few times before doing it in any significant
environment. I will do my best to share the experience in the mailing
list after...
Is there anything we should be aware, and is there anything I could
report back to the community to test/experiment/debug the solution?

and thanks for all your help.

On Fri, Feb 8, 2019 at 5:20 PM Jason Dillaman  wrote:
>
> On Fri, Feb 8, 2019 at 11:43 AM Luis Periquito  wrote:
> >
> > This is indeed for an OpenStack cloud - it didn't require any level of
> > performance (so was created on an EC pool) and now it does :(
> >
> > So the idea would be:
>
> 0 - upgrade OSDs and librbd clients to Nautilus
>
> > 1- create a new pool
>
> Are you using EC via cache tier over a replicated pool or an RBD image
> with an EC data pool?
>
> > 2- change cinder to use the new pool
> >
> > for each volume
> >   3- stop the usage of the volume (stop the instance?)
> >   4- "live migrate" the volume to the new pool
>
> Yes, execute the "rbd migration prepare" step here and manually update
> the Cinder database to point the instance to the new pool (if the pool
> name changed). I cannot remember if Nova caches the Cinder volume
> connector details, so you might also need to detach/re-attach the
> volume if that's the case (or tweak the Nova database entries as
> well).
>
> >   5- start up the instance again
>
> 6 - run "rbd migration execute" and "rbd migration commit" at your 
> convenience.
>
> >
> >
> > Does that sound right?
> >
> > thanks,
> >
> > On Fri, Feb 8, 2019 at 4:25 PM Jason Dillaman  wrote:
> > >
> > > Correction: at least for the initial version of live-migration, you
> > > need to temporarily stop clients that are using the image, execute
> > > "rbd migration prepare", and then restart the clients against the new
> > > destination image. The "prepare" step will fail if it detects that the
> > > source image is in-use.
> > >
> > > On Fri, Feb 8, 2019 at 9:00 AM Jason Dillaman  wrote:
> > > >
> > > > Indeed, it is forthcoming in the Nautilus release.
> > > >
> > > > You would initiate a "rbd migration prepare 
> > > > " to transparently link the dst-image-spec to the
> > > > src-image-spec. Any active Nautilus clients against the image will
> > > > then re-open the dst-image-spec for all IO operations. Read requests
> > > > that cannot be fulfilled by the new dst-image-spec will be forwarded
> > > > to the original src-image-spec (similar to how parent/child cloning
> > > > behaves). Write requests to the dst-image-spec will force a deep-copy
> > > > of all impacted src-image-spec backing data objects (including
> > > > snapshot history) to the associated dst-image-spec backing data
> > > > object.  At any point a storage admin can run "rbd migration execute"
> > > > to deep-copy all src-image-spec data blocks to the dst-image-spec.
> > > > Once the migration is complete, you would just run "rbd migration
> > > > commit" to remove src-image-spec.
> > > >
> > > > Note: at some point prior to "rbd migration commit", you will need to
> > > > take minimal downtime to switch OpenStack volume registration from the
> > > > old image to the new image if you are changing pools.
> > > >
> > > > On Fri, Feb 8, 2019 at 5:33 AM Caspar Smit  
> > > > wrote:
> > > > >
> > > > > Hi Luis,
> > > > >
> > > > > According to slide 21 of Sage's presentation at FOSDEM it is coming 
> > > > > in Nautilus:
> > > > >
> > > > > https://fosdem.org/2019/schedule/event/ceph_project_status_update/attachments/slides/3251/export/events/attachments/ceph_project_status_update/slides/3251/ceph_new_in_nautilus.pdf
> > > > >
> > > > > Kind regards,
> > > > > Caspar
> > > > >
> > > > > Op vr 8 feb. 2019 om 11:24 schreef Luis Periquito 
> > > > > :
> > > > >>
> > > > >> Hi,
> > > > >>
> > > > >> a recurring topic is live migration and pool type change (moving from
> > > > >> EC to replicated or vice versa).
> > > > >>
> > > > >> When I went to the OpenStack open infrastructure (aka summit) Sa

Re: [ceph-users] Controlling CephFS hard link "primary name" for recursive stat

2019-02-11 Thread Yan, Zheng
On Sat, Feb 9, 2019 at 8:10 AM Hector Martin  wrote:
>
> Hi list,
>
> As I understand it, CephFS implements hard links as effectively "smart
> soft links", where one link is the primary for the inode and the others
> effectively reference it. When it comes to directories, the size for a
> hardlinked file is only accounted for in recursive stats for the
> "primary" link. This is good (no double-accounting).
>
> I'd like to be able to control *which* of those hard links is the
> primary, post-facto, to control what directory their size is accounted
> under. I want to write a tool that takes some rules as to which
> directories should be "preferred" for containing the master link, and
> corrects it if necessary (by recursively stating everything and looking
> for files with the same inode number to enumerate all links).
>
> To swap out a primary link with another I came up with this sequence:
>
> link("old_primary", "tmp1")
> symlink("tmp1", "tmp2")
> rename("tmp2", "old_primary") // old_primary replaced with another inode
> stat("/otherdir/new_primary") // new_primary hopefully takes over stray
> rename("tmp1", "old_primary)  // put things back the way they were
>
> The idea is that, since renames of hardlinks over themselves are a no-op
> in POSIX and won't work, I need to use an intermediate symlink step to
> ensure continuity of access to the old file; this isn't 100% transparent
> but it beats e.g. removing old_primary and re-linking new_primary over
> it (which would cause old_primary to vanish for a short time, which is
> undesirable). Hopefully the stat() ensures that the new_primary is what
> takes over the stray inode. This seems to work in practice; if there is
> a better way, I'd like to hear it.
>
> Figuring out which link is the primary is a bigger issue. Only
> directories report recursive stats where this matters, not files
> themselves. On a directory with hardlinked files, if ceph.dir.rfiles >
> sum(ceph.dir.rfiles for each subdir) + count(files with nlinks == 1)
> then some hardlinked files are primary; I could attempt to use this
> formula and then just do the above dance for every hardlinked file to
> move the primaries off, but this seems fragile and likely to break in
> certain situations (or do needless work). Any other ideas?
>
How about directly reading the backtrace? Something equivalent to:

rados -p cephfs1_data getxattr xxx. parent >/tmp/parent
ceph-dencoder import /tmp/parent type inode_backtrace_t decode dump_json
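
(For reference, the object to read is the file's first data object; its name is
derived from the inode number, presumably what the truncated "xxx." above stands
for. A sketch, path illustrative:

  printf '%x.%08x\n' $(stat -c %i /mnt/cephfs/path/to/file) 0

e.g. inode 1099511627776 gives 10000000000.00000000.)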


> Thanks,
> --
> Hector Martin (hec...@marcansoft.com)
> Public Key: https://mrcn.st/pub
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS crash (Mimic 13.2.2 / 13.2.4 ) elist.h: 39: FAILED assert(!is_on_list())

2019-02-11 Thread Yan, Zheng
On Sat, Feb 9, 2019 at 12:36 AM Jake Grimmett  wrote:
>
> Dear All,
>
> Unfortunately the MDS has crashed on our Mimic cluster...
>
> First symptoms were rsync giving:
> "No space left on device (28)"
> when trying to rename or delete
>
> This prompted me to try restarting the MDS, as it reported laggy.
>
> Restarting the MDS, shows this as error in the log before the crash:
>
> elist.h: 39: FAILED assert(!is_on_list())
>
> A full MDS log showing the crash is here:
>
> http://p.ip.fi/iWlz
>
> I've tried upgrading the cluster to 13.2.4, but the MDS still crashes...
>
> The cluster has 10 nodes, 254 OSD's, uses EC for the data, 3x
> replication for MDS. We have a single active MDS, with two failover MDS
>
> We have ~2PB of cephfs data here, all of which is currently
> inaccessible, all and any advice gratefully received :)
>

Add mds_cache_size and mds_cache_memory_limit to ceph.conf and set
them to very large values before starting the mds. If the mds does not crash,
restore mds_cache_size and mds_cache_memory_limit to their
original values (via the admin socket) after the mds has been active for 10
seconds.
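
For reference, restoring them through the admin socket could look like this
(the MDS name is illustrative, and the values shown are the usual defaults;
use whatever you had configured before):

  ceph daemon mds.a config set mds_cache_memory_limit 1073741824
  ceph daemon mds.a config set mds_cache_size 0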

If the mds still crashes, try compiling ceph-mds with the following patch

diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
index d3461fba2e..c2731e824c 100644
--- a/src/mds/CDir.cc
+++ b/src/mds/CDir.cc
@@ -508,6 +508,8 @@ void CDir::remove_dentry(CDentry *dn)
   // clean?
   if (dn->is_dirty())
 dn->mark_clean();
+  if (inode->is_stray())
+dn->item_stray.remove_myself();

   if (dn->state_test(CDentry::STATE_BOTTOMLRU))
 cache->bottom_lru.lru_remove(dn);


> best regards,
>
> Jake
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com