Re: [ceph-users] MDS: obscene buffer_anon memory use when scanning lots of files

2020-01-21 Thread Patrick Donnelly
On Tue, Jan 21, 2020 at 8:32 AM John Madden  wrote:
>
> On 14.2.5 but also present in Luminous, buffer_anon memory use spirals
> out of control when scanning many thousands of files. The use case is
> more or less "look up this file and if it exists append this chunk to
> it, otherwise create it with this chunk." The memory is recovered as
> soon as the workload stops, and at most only 20-100 files are ever
> open at one time.
>
> Cache gets oversized but that's more or less expected, it's pretty
> much always/immediately in some warn state, which makes me wonder if a
> much larger cache might help buffer_anon use, looking for advice
> there. This is on a deeply-hashed directory, but overall very little
> data (<20GB), lots of tiny files.
>
> As I typed this post the pool went from ~60GB to ~110GB. I've resorted
> to a cronjob that restarts the active MDS when it reaches swap just to
> keep the cluster alive.

This looks like it will be fixed by

https://tracker.ceph.com/issues/42943

That will be available in v14.2.7.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_WARN 1 MDSs report oversized cache

2019-12-05 Thread Patrick Donnelly
On Thu, Dec 5, 2019 at 9:45 AM Ranjan Ghosh  wrote:
> Ah, that seems to have fixed it. Hope it stays that way. I've raised it
> to 4 GB. Thanks to you both!

Just be aware the warning could come back. You just moved the goal posts.

The 1GB default is probably too low for most deployments; I have a PR
to increase it: https://github.com/ceph/ceph/pull/32042
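
If you do settle on a higher limit, a runtime change along these lines
should work (4 GiB here is just an example; pick whatever fits your RAM
budget):

ceph config set mds mds_cache_memory_limit 4294967296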

> Although I have to say that the message is IMHO *very* misleading: "1
> MDSs report oversized cache" sounds to me like the cache is too large
> (i.e. wasting RAM unnecessarily). Shouldn't the message rather be "1
> MDSs report *undersized* cache"? Weird.

No. It means the MDS cache is larger than its target, so the MDS
cannot trim its cache back under the limit. This can happen for many
reasons, but it is most often due to clients not releasing
capabilities, perhaps because of a bug.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] POOL_TARGET_SIZE_BYTES_OVERCOMMITTED and POOL_TARGET_SIZE_RATIO_OVERCOMMITTED

2019-11-23 Thread Patrick Donnelly
On Wed, Nov 20, 2019 at 12:29 AM Björn Hinz  wrote:
>
> Hello,
>
> I can also confirm the same problem described by Joe Ryner in 14.2.2. and 
> Oliver Freyermuth.
>
> My ceph version is 14.2.4
>
> -
> # ceph health detail
> HEALTH_WARN 1 subtrees have overcommitted pool target_size_bytes; 1 subtrees 
> have overcommitted pool target_size_ratio
> POOL_TARGET_SIZE_BYTES_OVERCOMMITTED 1 subtrees have overcommitted pool 
> target_size_bytes
> Pools ['volumes', 'backups', 'images', 'cephfs_cindercache', 'rbd', 
> 'vms'] overcommit available storage by 1.308x due to target_size_bytes 0 
> on pools []
> POOL_TARGET_SIZE_RATIO_OVERCOMMITTED 1 subtrees have overcommitted pool 
> target_size_ratio
> Pools ['volumes', 'backups', 'images', 'cephfs_cindercache', 'rbd', 
> 'vms'] overcommit available storage by 1.308x due to target_size_ratio 0.000 
> on pools []

Will be fixed in 14.2.5: https://tracker.ceph.com/issues/42260

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Revert a CephFS snapshot?

2019-11-14 Thread Patrick Donnelly
On Wed, Nov 13, 2019 at 6:36 PM Jerry Lee  wrote:
>
> On Thu, 14 Nov 2019 at 07:07, Patrick Donnelly  wrote:
> >
> > On Wed, Nov 13, 2019 at 2:30 AM Jerry Lee  wrote:
> > > Recently, I'm evaluating the snapshot feature of CephFS from the kernel
> > > client and everything works like a charm.  But, it seems that reverting
> > > a snapshot is not available currently.  Is there some reason or
> > > technical limitation that the feature is not provided?  Any insights
> > > or ideas are appreciated.
> >
> > Please provide more information about what you tried to do (commands
> > run) and how it surprised you.
>
> The thing I would like to do is to rollback a snapped directory to a
> previous version of snapshot.  It looks like the operation can be done
> by overwriting all the current version of files/directories from a
> previous snapshot via cp.  But cp may take lots of time when there are
> many files and directories in the target directory.  Is there any
> possibility to achieve the goal much faster from the CephFS internal
> via command like "ceph fs   snap rollback
> " (just a example)?  Thank you!

RADOS doesn't support rollback of snapshots, so it needs to be done
manually. The best tool for this is probably rsync of the .snap
directory with appropriate options, including deletion of files that
do not exist in the source (the snapshot).
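
A minimal sketch, assuming the snapshot is named "mysnap" and the
directory you want to roll back is /mnt/cephfs/dir:

rsync -a --delete /mnt/cephfs/dir/.snap/mysnap/ /mnt/cephfs/dir/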

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Revert a CephFS snapshot?

2019-11-13 Thread Patrick Donnelly
On Wed, Nov 13, 2019 at 2:30 AM Jerry Lee  wrote:
> Recently, I'm evaluating the snapshot feature of CephFS from the kernel
> client and everything works like a charm.  But, it seems that reverting
> a snapshot is not available currently.  Is there some reason or
> technical limitation that the feature is not provided?  Any insights
> or ideas are appreciated.

Please provide more information about what you tried to do (commands
run) and how it surprised you.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs 1 large omap objects

2019-10-30 Thread Patrick Donnelly
On Wed, Oct 30, 2019 at 9:28 AM Jake Grimmett  wrote:
>
> Hi Zheng,
>
> Many thanks for your helpful post, I've done the following:
>
> 1) set the threshold to 1024 * 1024:
>
> # ceph config set osd \
> osd_deep_scrub_large_omap_object_key_threshold 1048576
>
> 2) deep scrubbed all of the pgs on the two OSD that reported "Large omap
> object found." - these were all in pool 1, which has just four osd.
>
>
> Result: After 30 minutes, all deep-scrubs completed, and all "large omap
> objects" warnings disappeared.
>
> ...should we be worried about the size of these OMAP objects?

No. There are only a few of these objects, and they have not caused
problems in any other cluster so far.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problematic inode preventing ceph-mds from starting

2019-10-28 Thread Patrick Donnelly
On Fri, Oct 25, 2019 at 12:11 PM Pickett, Neale T  wrote:
> In the last week we have made a few changes to the down filesystem in an 
> attempt to fix what we thought was an inode problem:
>
>
> cephfs-data-scan scan_extents   # about 1 day with 64 processes
>
> cephfs-data-scan scan_inodes   # about 1 day with 64 processes
>
> cephfs-data-scan scan_links   # about 1 day

Did you reset the journals or perform any other disaster recovery
commands? This process likely introduced the duplicate inodes.

> After these three, we tried to start an MDS and it stayed up. We then ran:
>
> ceph tell mds.a scrub start / recursive repair
>
>
> The repair ran about 3 days, spewing logs to `ceph -w` about duplicated 
> inodes, until it stopped. All looked well until we began bringing production 
> services back online, at which point many error messages appeared, the mds 
> went back into damaged, and the fs back to degraded. At this point I removed 
> the objects you suggested, which brought everything back briefly.
>
> The latest crash is:
>
> -1> 2019-10-25 18:47:50.731 7fc1f3b56700 -1 
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/mds/MDCache.cc:
>  In function 'void MDCache::add_inode(CInode*)' thread 7fc1f3b56700 time 
> 2019-1...
>
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/mds/MDCache.cc:
>  258: FAILED ceph_assert(!p)

This error indicates a duplicate inode loaded into cache. Fixing this
probably requires significant intervention and (meta)data loss for
recent changes:

- Stop/unmount all clients. (Probably already the case if the rank is damaged!)

- Reset the MDS journal [1] and optionally recover any dentries first.
(This will hopefully resolve the ESubtreeMap errors you pasted.) Note
that some metadata may be lost through this command.

- `cephfs-data-scan scan_links` again. This should repair any
duplicate inodes (by dropping the older dentries).

- Then you can try marking the rank as repaired. (A command sketch for
the whole sequence follows below.)
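
A rough sketch of that sequence for rank 0, assuming the file system
is named "cephfs" (export a journal backup first, and follow the
disaster-recovery documentation linked in [1] for the full details):

cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
cephfs-journal-tool --rank=cephfs:0 journal reset
cephfs-data-scan scan_links
ceph mds repaired cephfs:0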

Good luck!

[1] 
https://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/#journal-truncation


--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] kernel cephfs - too many caps used by client

2019-10-24 Thread Patrick Donnelly
It's not clear to me what the problem is. Please try increasing the
debugging on your MDS and share a snippet (privately to me if you
wish). Other information would also be helpful like `ceph status` and
what kind of workloads these clients are running.
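
For example (a sketch; run on the MDS host and adjust the daemon name
and debug level as needed):

ceph daemon mds.X config set debug_mds 10
ceph status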

On Fri, Oct 18, 2019 at 7:22 PM Lei Liu  wrote:
>
> Only osds is v12.2.8, all of mds and mon used v12.2.12
>
> # ceph versions
> {
> "mon": {
> "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) 
> luminous (stable)": 3
> },
> "mgr": {
> "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) 
> luminous (stable)": 4
> },
> "osd": {
> "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) 
> luminous (stable)": 24,
> "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) 
> luminous (stable)": 203
> },
> "mds": {
> "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) 
> luminous (stable)": 5
> },
> "rgw": {
> "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) 
> luminous (stable)": 1
> },
> "overall": {
> "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) 
> luminous (stable)": 37,
> "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) 
> luminous (stable)": 203
> }
> }
>
> Lei Liu  wrote on Sat, Oct 19, 2019 at 10:09 AM:
>>
>> Thanks for your reply.
>>
>> Yes, Already set it.
>>
>>> [mds]
>>> mds_max_caps_per_client = 10485760 # default is 1048576
>>
>>
>> I think the current configuration is big enough per client. Do I need to 
>> continue to increase this value?
>>
>> Thanks.
>>
>> Patrick Donnelly  wrote on Sat, Oct 19, 2019 at 6:30 AM:
>>>
>>> Hello Lei,
>>>
>>> On Thu, Oct 17, 2019 at 8:43 PM Lei Liu  wrote:
>>> >
>>> > Hi cephers,
>>> >
>>> > We have some ceph clusters use cephfs in production(mount with kernel 
>>> > cephfs), but several of clients often keep a lot of caps(millions) 
>>> > unreleased.
>>> > I know this is due to the client's inability to complete the cache 
>>> > release, errors might have been encountered, but no logs.
>>> >
>>> > client kernel version is 3.10.0-957.21.3.el7.x86_64
>>> > ceph version is mostly v12.2.8
>>> >
>>> > ceph status shows:
>>> >
>>> > x clients failing to respond to cache pressure
>>> >
>>> > client kernel debug shows:
>>> >
>>> > # cat 
>>> > /sys/kernel/debug/ceph/a00cc99c-f9f9-4dd9-9281-43cd12310e41.client11291811/caps
>>> > total 23801585
>>> > avail 1074
>>> > used 23800511
>>> > reserved 0
>>> > min 1024
>>> >
>>> > mds config:
>>> > [mds]
>>> > mds_max_caps_per_client = 10485760
>>> > # 50G
>>> > mds_cache_memory_limit = 53687091200
>>> >
>>> > I want to know if some ceph configurations can solve this problem ?
>>>
>>> mds_max_caps_per_client is new in Luminous 12.2.12. See [1]. You need
>>> to upgrade.
>>>
>>> [1] https://tracker.ceph.com/issues/38130
>>>
>>> --
>>> Patrick Donnelly, Ph.D.
>>> He / Him / His
>>> Senior Software Engineer
>>> Red Hat Sunnyvale, CA
>>> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
>>>


-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problematic inode preventing ceph-mds from starting

2019-10-24 Thread Patrick Donnelly
>  19: (Thread::_entry_func(void*)+0x18) [0x7faf0a971fba]
>  20: (()+0x7dd5) [0x7faf07844dd5]
>  21: (clone()+0x6d) [0x7faf064f502d]
>
> I tried removing it, but it does not show up in the omapkeys for that inode:
>
> lima:/home/neale$ ceph -- rados -p cephfs_metadata listomapkeys 
> 1995e63.
> __about__.py_head
> __init__.py_head
> __pycache___head
> _compat.py_head
> _structures.py_head
> markers.py_head
> requirements.py_head
> specifiers.py_head
> utils.py_head
> version.py_head
> lima:/home/neale$ ceph -- rados -p cephfs_metadata rmomapkey 
> 1995e63. _compat.py_head
> lima:/home/neale$ ceph -- rados -p cephfs_metadata rmomapkey 
> 1995e63. compat.py_head
> lima:/home/neale$ ceph -- rados -p cephfs_metadata rmomapkey 
> 1995e63. file-does-not-exist_head
> lima:/home/neale$ ceph -- rados -p cephfs_metadata listomapkeys 
> 1995e63.
> __about__.py_head
> __init__.py_head
> __pycache___head
> _structures.py_head
> markers.py_head
> requirements.py_head
> specifiers.py_head
> utils.py_head
> version.py_head
>
> Predictably, this did nothing to solve our problem, and ceph-mds is still 
> dying during startup.
>
> Any suggestions?

Looks like the open file table is corrupt. It is not needed to start
the FS: you can simply delete all of the metadata pool objects
matching the format "mds%d_openfiles.%x". No data loss will occur.
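
For example, assuming the metadata pool is named cephfs_metadata (as
in your listing) and you only have rank 0, repeat for each object the
listing shows:

rados -p cephfs_metadata ls | grep openfiles
rados -p cephfs_metadata rm mds0_openfiles.0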

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] kernel cephfs - too many caps used by client

2019-10-18 Thread Patrick Donnelly
Hello Lei,

On Thu, Oct 17, 2019 at 8:43 PM Lei Liu  wrote:
>
> Hi cephers,
>
> We have some ceph clusters use cephfs in production(mount with kernel 
> cephfs), but several of clients often keep a lot of caps(millions) unreleased.
> I know this is due to the client's inability to complete the cache release, 
> errors might have been encountered, but no logs.
>
> client kernel version is 3.10.0-957.21.3.el7.x86_64
> ceph version is mostly v12.2.8
>
> ceph status shows:
>
> x clients failing to respond to cache pressure
>
> client kernel debug shows:
>
> # cat 
> /sys/kernel/debug/ceph/a00cc99c-f9f9-4dd9-9281-43cd12310e41.client11291811/caps
> total 23801585
> avail 1074
> used 23800511
> reserved 0
> min 1024
>
> mds config:
> [mds]
> mds_max_caps_per_client = 10485760
> # 50G
> mds_cache_memory_limit = 53687091200
>
> I want to know if some ceph configurations can solve this problem ?

mds_max_caps_per_client is new in Luminous 12.2.12. See [1]. You need
to upgrade.

[1] https://tracker.ceph.com/issues/38130

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Nfs-ganesha-devel] 2.7.3 with CEPH_FSAL Crashing

2019-10-09 Thread Patrick Donnelly
 > > > > const&) () from /lib64/libcephfs.so.2
>> >  > > > > #5  0x7f04cc81c41f in
>> > Client::ll_lookup_inode(inodeno_t, UserPerm
>> >  > > > > const&, Inode**) () from /lib64/libcephfs.so.2
>> >  > > > > #6  0x7f04ccadbf0e in create_handle
>> > (export_pub=0x1baff10,
>> >  > > > > desc=, pub_handle=0x7f0470fd4718,
>> >  > > > > attrs_out=0x7f0470fd4740) at
>> >  > > > >
>> > /usr/src/debug/nfs-ganesha-2.7.3/FSAL/FSAL_CEPH/export.c:256
>> >  > > > > #7  0x00523895 in mdcache_locate_host
>> > (fh_desc=0x7f0470fd4920,
>> >  > > > > export=export@entry=0x1bafbf0,
>> > entry=entry@entry=0x7f0470fd48b8,
>> >  > > > > attrs_out=attrs_out@entry=0x0)
>> >  > > > >  at
>> >  > > > >
>> > 
>> > /usr/src/debug/nfs-ganesha-2.7.3/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:1011
>> >  > > > > #8  0x0051d278 in mdcache_create_handle
>> > (exp_hdl=0x1bafbf0,
>> >  > > > > fh_desc=, handle=0x7f0470fd4900,
>> > attrs_out=0x0) at
>> >  > > > >
>> > 
>> > /usr/src/debug/nfs-ganesha-2.7.3/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1578
>> >  > > > > #9  0x0046d404 in nfs4_mds_putfh
>> >  > > > > (data=data@entry=0x7f0470fd4ea0) at
>> >  > > > >
>> > /usr/src/debug/nfs-ganesha-2.7.3/Protocols/NFS/nfs4_op_putfh.c:211
>> >  > > > > #10 0x0046d8e8 in nfs4_op_putfh
>> > (op=0x7f03effaf1d0,
>> >  > > > > data=0x7f0470fd4ea0, resp=0x7f03ec1de1f0) at
>> >  > > > >
>> > /usr/src/debug/nfs-ganesha-2.7.3/Protocols/NFS/nfs4_op_putfh.c:281
>> >  > > > > #11 0x0045d120 in nfs4_Compound (arg=<optimized out>,
>> >  > > > > req=<optimized out>, res=0x7f03ec1de9d0) at
>> >  > > > >
>> > /usr/src/debug/nfs-ganesha-2.7.3/Protocols/NFS/nfs4_Compound.c:942
>> >  > > > > #12 0x004512cd in nfs_rpc_process_request
>> >  > > > > (reqdata=0x7f03ee5ed4b0) at
>> >  > > > >
>> > /usr/src/debug/nfs-ganesha-2.7.3/MainNFSD/nfs_worker_thread.c:1328
>> >  > > > > #13 0x00450766 in nfs_rpc_decode_request
>> > (xprt=0x7f02180c2320,
>> >  > > > > xdrs=0x7f03ec568ab0) at
>> >  > > > >
>> > 
>> > /usr/src/debug/nfs-ganesha-2.7.3/MainNFSD/nfs_rpc_dispatcher_thread.c:1345
>> >  > > > > #14 0x7f04df45d07d in svc_rqst_xprt_task
>> > (wpe=0x7f02180c2538) at
>> >  > > > >
>> > /usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:769
>> >  > > > > #15 0x7f04df45d59a in svc_rqst_epoll_events
>> > (n_events=<optimized out>, sr_rec=0x4bb53e0) at
>> >  > > > >
>> > /usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:941
>> >  > > > > #16 svc_rqst_epoll_loop (sr_rec=<optimized out>) at
>> >  > > > >
>> > /usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:1014
>> >  > > > > #17 svc_rqst_run_task (wpe=0x4bb53e0) at
>> >  > > > >
>> > /usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:1050
>> >  > > > > #18 0x7f04df465123 in work_pool_thread
>> > (arg=0x7f044c0008c0) at
>> >  > > > >
>> > /usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/work_pool.c:181
>> >  > > > > #19 0x7f04dda05dd5 in start_thread () from
>> > /lib64/libpthread.so.0
>> >  > > > > #20 0x7f04dcb7dead in clone () from /lib64/libc.so.6
>> >  > > > >
>> >  > > > > Package versions:
>> >  > > > >
>> >  > > > > nfs-ganesha-2.7.3-0.1.el7.x86_64
>> >  > > > > nfs-ganesha-ceph-2.7.3-0.1.el7.x86_64
>> >  > > > > libcephfs2-14.2.1-0.el7.x86_64
>> >  > > > > librados2-14.2.1-0.el7.x86_64
>> >  > > > >
>> >  > > > > I notice in my Ceph log I have a bunch of slow requests
>> > around the time
>> >  > > > > it went down, I'm not sure if it's a symptom of Ganesha
>> > segfaulting or
>> >  > > > > if it was a contributing factor.
>> >  > > > >
>> >  > > > > Thanks,
>> >  > > > > David
>> >  > > > >
>> >  > > > >
>> >  > > > > ___
>> >  > > > > Nfs-ganesha-devel mailing list
>> >  > > > > nfs-ganesha-de...@lists.sourceforge.net
>> >  > > > >
>> > https://lists.sourceforge.net/lists/listinfo/nfs-ganesha-devel
>> >  > > > >
>> >  > >
>> >  > > ___
>> >  > > ceph-users mailing list
>> >  > > ceph-users@lists.ceph.com
>> >  > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > --
>> > Jeff Layton <jlay...@poochiereds.net>
>> >
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS Stability with lots of CAPS

2019-10-05 Thread Patrick Donnelly
On Wed, Oct 2, 2019 at 9:48 AM Stefan Kooman  wrote:
> According to [1] there are new parameters in place to have the MDS
> behave more stable. Quoting that blog post "One of the more recent
> issues we've discovered is that an MDS with a very large cache (64+GB)
> will hang during certain recovery events."
>
> For all of us that are not (yet) running Nautilus I wonder what the best
> course of action is to prevent unstable MDSs during recovery situations.
>
> Artificially limit the "mds_cache_memory_limit" to say 32 GB?

Reduce the MDS cache size.

Mimic backport will probably make next minor release:
https://github.com/ceph/ceph/pull/28452

> I wonder if the number of clients has an influence on an MDS being
> overwhelmed by release messages. Or are a handful of clients (with
> millions of CAPS) able to overload an MDS?

Just one client with millions of caps could cause issues.

> Is there a way, other than unmounting cephfs on clients, to decrease the
> amount of CAPS the MDS has handed out, before an upgrade to a newer Ceph
> release is undertaken when running luminous / Mimic?

Incrementally reduce the cache size using a script.

> I'm assuming you need to restart the MDS to make the
> "mds_cache_memory_limit" effective, is that correct?

No. It is respected at runtime.
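
A minimal sketch of such a script, assuming you want to step a 64 GB
limit down to 32 GB in 8 GiB increments and give the MDS time to trim
between steps (the step values and sleep are arbitrary):

for gb in 56 48 40 32; do
    ceph tell mds.* injectargs "--mds_cache_memory_limit $((gb * 1024 * 1024 * 1024))"
    sleep 600
done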

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds directory pinning, status display

2019-09-13 Thread Patrick Donnelly
On Fri, Sep 13, 2019 at 7:09 AM thoralf schulze  wrote:
>
> hi there,
>
> while debugging metadata servers reporting slow requests, we took a stab
> at pinning directories of a cephfs like so:
>
> setfattr -n ceph.dir.pin -v 1 /tubfs/kubernetes/
> setfattr -n ceph.dir.pin -v 0 /tubfs/profiles/
> setfattr -n ceph.dir.pin -v 0 /tubfs/homes
>
> on the active mds for rank 0, we can see all pinnings like expected:
>
> ceph daemon /var/run/[rank0].asok get subtrees | jq -c
> '.[]|select(.dir.path|contains("/"))|[.dir.path, .export_pin, .auth_first]'
> ["/kubernetes",1,1]
> ["/homes",0,0]
> ["/profiles",0,0]
>
> while the active mds for rank 1 reports back its own pinnings only:
>
> ceph daemon /var/run/[rank1].asok get subtrees | jq -c
> '.[]|select(.dir.path|contains("/"))|[.dir.path, .export_pin, .auth_first]'
> ["/kubernetes",1,1]
> ["/.ctdb",-1,1]
>
> is this to be expected? anecdotical data indicate that the pinning does
> work as intended.

Each MDS rank can only see subtrees that border the ones it is
authoritative for. Therefore, you need to gather the subtrees from all
ranks and merge them to see the entire distribution. This could be made
simpler by showing this information in the upcoming `ceph fs top`
display. I've created a tracker ticket:
https://tracker.ceph.com/issues/41824
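
In the meantime, a rough way to gather them is to run something like
this on each MDS host (the admin socket paths are an assumption;
adjust to your naming) and merge the output:

for sock in /var/run/ceph/ceph-mds.*.asok; do
    ceph daemon $sock get subtrees | jq -c \
        '.[]|select(.dir.path|contains("/"))|[.dir.path, .export_pin, .auth_first]'
done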

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mutliple CephFS Filesystems Nautilus (14.2.2)

2019-08-21 Thread Patrick Donnelly
On Wed, Aug 21, 2019 at 2:02 PM  wrote:
> How experimental is the multiple CephFS filesystems per cluster feature?  We 
> plan to use different sets of pools (meta / data) per filesystem.
>
> Are there any known issues?

No. It will likely work fine, but some things may change in a future
version that make upgrading more difficult.

> While we're on the subject, is it possible to assign a different active MDS 
> to each filesystem?

The monitors do the assignment. You cannot specify which file system
an MDS serves.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How does CephFS find a file?

2019-08-19 Thread Patrick Donnelly
On Mon, Aug 19, 2019 at 7:50 AM Robert LeBlanc  wrote:
> The MDS manages dentries as omap (simple key/value database) entries in the 
> metadata pool. Each dentry keeps a list of filenames and some metadata about 
> the file such as inode number and some other info such as size I presume 
> (can't find documentation outlining the binary format of the omap, just did 
> enough digging to find the inode location).

Each directory (actually: directory fragment) is a single object in
the metadata pool. They are indexed by inode number. Root is always
inode 1 and can be used as a starting point for finding any other
directory (since the file system hierarchy is a tree). (Note: some
special directories exist outside the file system tree, like the stray
directories.)

The value in the omap Robert refers to is the binary encoded inode. It
will include the inode number, file layout (!) [1], and size. All
three of these pieces of information are necessary to find a file's
data or write new data.

> The MDS can return the inode and size

and file layout*

> to the client and the client looks up the OSDs for the inode using the CRUSH 
> map and dividing the size by the stripe size to know how many objects to 
> fetch for the whole object.

The file layout and the inode number determine where a particular
block can be found. This is all encoded in the name of the object
within the data pool.

[1] https://docs.ceph.com/docs/master/cephfs/file-layouts/
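
As a concrete sketch: data objects are named "<inode in hex>.<block
index as 8-digit hex>", so with the default 4 MB object size the
second block of inode 0x10000000002 lives in object
10000000002.00000001. You can compute the name and ask where it maps
(the pool name here is just an example):

printf '%x.%08x\n' 0x10000000002 1
ceph osd map cephfs_data 10000000002.00000001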

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-08-06 Thread Patrick Donnelly
On Tue, Aug 6, 2019 at 7:57 AM Janek Bevendorff
 wrote:
>
>
> > 4k req/s is too fast for a create workload on one MDS. That must
> > include other operations like getattr.
>
> That is rsync going through millions of files checking which ones need
> updating. Right now there are not actually any create operations, since
> I restarted the copy job.

Your parallel rsync job is only getting 150 creates per second? What
was the previous throughput?

> > I wouldn't expect such extreme latency issues. Please share:
> >
> > ceph config dump
> > ceph daemon mds.X cache status
>
> Config dump: https://pastebin.com/1jTrjzA9
>
> Cache status:
>
> {
>  "pool": {
>  "items": 127688932,
>  "bytes": 20401092561
>  }
> }
>
>
> > and the two perf dumps one second apart again please.
> Perf dump 1: https://pastebin.com/US3y6JEJ
> Perf dump 2: https://pastebin.com/Mm02puje

The cache size looks correct here.

> > Also, you said you removed the aggressive recall changes. I assume you
> > didn't reset them to the defaults, right? Just the first suggested
> > change (10k/1.0)?
>
> Either seems to work.
>
> I added two more MDSs to split the workload and got a steady 150 reqs/s
> after that. Then I noticed that I still had a max segments settings from
> one of my earlier attempts at fixing the cache runaway issue and after
> removing that, I got 250-500 reqs/s, sometimes up to 1.5k (per MDS).

Okay, so you're getting a more normal throughput for parallel creates
on a single MDS.

> However, to generate the dumps for you, I changed my max_mds setting
> back to 1 and reqs/s went down to 80. After re-adding the two active
> MDSs again, I am back at higher numbers, although not quite as much as
> before. But I seem to remember that it took several minutes if not more
> until all MDSs received approximately equal load the last time.

Try pinning if possible in each parallel rsync job.
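
For example, if each rsync job works on its own top-level directory,
pinning them to different ranks could look like this (the paths are
hypothetical):

setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/rsync-job-a
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/rsync-job-b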

Here are tracker tickets to resolve the issues you encountered:

https://tracker.ceph.com/issues/41140
https://tracker.ceph.com/issues/41141

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-08-06 Thread Patrick Donnelly
On Tue, Aug 6, 2019 at 12:48 AM Janek Bevendorff
 wrote:
> > However, now my client processes are basically in constant I/O wait
> > state and the CephFS is slow for everybody. After I restarted the copy
> > job, I got around 4k reqs/s and then it went down to 100 reqs/s with
> > everybody waiting their turn. So yes, it does seem to help, but it
> > increases latency by a magnitude.

4k req/s is too fast for a create workload on one MDS. That must
include other operations like getattr.

> Addition: I reduced the number to 256K and the cache size started
> inflating instantly (with about 140 reqs/s). So I reset it to 512K and
> the cache size started reducing slowly, though with fewer reqs/s.
>
> So I guess it is solving the problem, but only by trading it off against
> severe latency issues (order of magnitude as we saw).

I wouldn't expect such extreme latency issues. Please share:

ceph config dump
ceph daemon mds.X cache status

and the two perf dumps one second apart again please.

Also, you said you removed the aggressive recall changes. I assume you
didn't reset them to the defaults, right? Just the first suggested
change (10k/1.0)?

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-08-05 Thread Patrick Donnelly
On Mon, Aug 5, 2019 at 12:21 AM Janek Bevendorff
 wrote:
>
> Hi,
>
> > You can also try increasing the aggressiveness of the MDS recall but
> > I'm surprised it's still a problem with the settings I gave you:
> >
> > ceph config set mds mds_recall_max_caps 15000
> > ceph config set mds mds_recall_max_decay_rate 0.75
>
> I finally had the chance to try the more aggressive recall settings, but
> they did not change anything. As soon as the client starts copying files
> again, the numbers go up an I get a health message that the client is
> failing to respond to cache pressure.
>
> After this week of idle time, the dns/inos numbers (what does dns stand
> for anyway?) settled at around 8000k. That's basically that "idle"
> number that it goes back to when the client stops copying files. Though,
> for some weird reason, this number gets (quite) a bit higher every time
> (last time it was around 960k). Of course, I wouldn't expect it to go
> back all the way to zero, because that would mean dropping the entire
> cache for no reason, but it's still quite high and the same after
> restarting the MDS and all clients, which doesn't make a lot of sense to
> me. After resuming the copy job, the number went up to 20M in just the
> time it takes to write this email. There must be a bug somewhere.
>
> > Can you share two captures of `ceph daemon mds.X perf dump` about 1
> > second apart.
>
> I attached the requested perf dumps.

Thanks, that helps. It looks like the problem is that the MDS is not
automatically trimming its cache fast enough. Please try bumping
mds_cache_trim_threshold:

ceph config set mds mds_cache_trim_threshold 512K

Increase it further if it's not aggressive enough. Please let us know
if that helps.

It shouldn't be necessary to do this so I'll make a tracker ticket
once we confirm that's the issue.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-07-25 Thread Patrick Donnelly
On Thu, Jul 25, 2019 at 12:49 PM Janek Bevendorff
 wrote:
>
>
> > Based on that message, it would appear you still have an inode limit
> > in place ("mds_cache_size"). Please unset that config option. Your
> > mds_cache_memory_limit is apparently ~19GB.
>
> No, I do not have an inode limit set. Only the memory limit.
>
>
> > There is another limit mds_max_caps_per_client (default 1M) which the
> > client is hitting. That's why the MDS is recalling caps from the
> > client and not because any cache memory limit is hit. It is not
> > recommended that you increase this.
> Okay, this setting isn't documented either and I did not change it,
> but it's also quite clear that it isn't working. My MDS hasn't crashed
> yet (without the recall settings it would have), but ceph fs status is
> reporting 14M inodes at this point and the number is slowly going up.

Can you share two captures of `ceph daemon mds.X perf dump` about 1
second apart.

You can also try increasing the aggressiveness of the MDS recall but
I'm surprised it's still a problem with the settings I gave you:

ceph config set mds mds_recall_max_caps 15000
ceph config set mds mds_recall_max_decay_rate 0.75

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-07-25 Thread Patrick Donnelly
On Thu, Jul 25, 2019 at 3:08 AM Janek Bevendorff
 wrote:
>
> The rsync job has been copying quite happily for two hours now. The good
> news is that the cache size isn't increasing unboundedly with each
> request anymore. The bad news is that it still is increasing afterall,
> though much slower. I am at 3M inodes now and it started off with 900k,
> settling at 1M initially. I had a peak just now of 3.7M, but it went
> back down to 3.2M shortly after that.
>
> According to the health status, the client has started failing to
> respond to cache pressure, so it's still not working as reliably as I
> would like it to. I am also getting this very peculiar message:
>
> MDS cache is too large (7GB/19GB); 52686 inodes in use by clients
>
> I guess the 53k inodes is the number that is actively in use right now
> (compared to the 3M for which the client generally holds caps). Is that
> so? Cache memory is still well within bounds, however. Perhaps the
> message is triggered by the recall settings and just a bit misleading?

Based on that message, it would appear you still have an inode limit
in place ("mds_cache_size"). Please unset that config option. Your
mds_cache_memory_limit is apparently ~19GB.

There is another limit mds_max_caps_per_client (default 1M) which the
client is hitting. That's why the MDS is recalling caps from the
client and not because any cache memory limit is hit. It is not
recommended that you increase this.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to power off a cephfs cluster cleanly

2019-07-25 Thread Patrick Donnelly
On Thu, Jul 25, 2019 at 7:48 AM Dan van der Ster  wrote:
>
> Hi all,
>
> In September we'll need to power down a CephFS cluster (currently
> mimic) for a several-hour electrical intervention.
>
> Having never done this before, I thought I'd check with the list.
> Here's our planned procedure:
>
> 1. umount /cephfs from all HPC clients.
> 2. ceph osd set noout
> 3. wait until there is zero IO on the cluster
> 4. stop all mds's (active + standby)

You can also use `ceph fs set <fs_name> down true`, which will flush all
metadata/journals, evict any lingering clients, and leave the file
system down until manually brought back up, even if there are standby
MDSs available.
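
A sketch, assuming the file system is named "cephfs":

ceph fs set cephfs down true     # before powering off; flushes journals and evicts clients
ceph fs set cephfs down false    # after the cluster is back up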

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-07-24 Thread Patrick Donnelly
+ other ceph-users

On Wed, Jul 24, 2019 at 10:26 AM Janek Bevendorff
 wrote:
>
> > what's the ceph.com mailing list? I wondered whether this list is dead but 
> > it's the list announced on the official ceph.com homepage, isn't it?
> There are two mailing lists announced on the website. If you go to
> https://ceph.com/resources/ you will find the
> subscribe/unsubscribe/archive links for the (much more active) ceph.com
> MLs. But if you click on "Mailing Lists & IRC page" you will get to a
> page where you can subscribe to this list, which is different. Very
> confusing.

It is confusing. This is supposed to be the new ML but I don't think
the migration has started yet.

> > What did you have the MDS cache size set to at the time?
> >
> > < and an inode count between
>
> I actually did not think I'd get a reply here. We are a bit further than
> this on the other mailing list. This is the thread:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-July/036095.html
>
> To sum it up: the ceph client prevents the MDS from freeing its cache,
> so inodes keep piling up until either the MDS becomes too slow (fixable
> by increasing the beacon grace time) or runs out of memory. The latter
> will happen eventually. In the end, my MDSs couldn't even rejoin because
> they hit the host's 128GB memory limit and crashed.

It's possible the MDS is not being aggressive enough with asking the
single (?) client to reduce its cache size. There were recent changes
[1] to the MDS to improve this. However, the defaults may not be
aggressive enough for your client's workload. Can you try:

ceph config set mds mds_recall_max_caps 1
ceph config set mds mds_recall_max_decay_rate 1.0

Also your other mailings made me think you may still be using the old
inode limit for the cache size. Are you using the new
mds_cache_memory_limit config option?

Finally, if this fixes your issue (please let us know!) and you decide
to try multiple active MDS, you should definitely use pinning as the
parallel create workload will greatly benefit from it.

[1] https://ceph.com/community/nautilus-cephfs/

--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating a cephfs data pool

2019-07-02 Thread Patrick Donnelly
On Fri, Jun 28, 2019 at 8:27 AM Jorge Garcia  wrote:
>
> This seems to be an issue that gets brought up repeatedly, but I haven't
> seen a definitive answer yet. So, at the risk of repeating a question
> that has already been asked:
>
> How do you migrate a cephfs data pool to a new data pool? The obvious
> case would be somebody that has set up a replicated pool for their
> cephfs data and then wants to convert it to an erasure code pool. Is
> there a simple way to do this, other than creating a whole new ceph
> cluster and copying the data using rsync?

For those interested, there's a ticket [1] to perform file layout
migrations in the MDS in an automated way. Not sure if it'll get done
for Octopus though.

[1] http://tracker.ceph.com/issues/40285

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS damaged and cannot recover

2019-06-19 Thread Patrick Donnelly
On Wed, Jun 19, 2019 at 9:19 AM Wei Jin  wrote:
>
> There are plenty of data in this cluster (2PB), please help us, thx.
> Before doing this dangerous
> operations(http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#disaster-recovery-experts)
> , any suggestions?
>
> Ceph version: 12.2.12
>
> ceph fs status:
>
> cephfs - 1057 clients
> ==
> +--+-+-+--+---+---+
> | Rank |  State  | MDS | Activity |  dns  |  inos |
> +--+-+-+--+---+---+
> |  0   |  failed | |  |   |   |
> |  1   | resolve | n31-023-214 |  |0  |0  |
> |  2   | resolve | n31-023-215 |  |0  |0  |
> |  3   | resolve | n31-023-218 |  |0  |0  |
> |  4   | resolve | n31-023-220 |  |0  |0  |
> |  5   | resolve | n31-023-217 |  |0  |0  |
> |  6   | resolve | n31-023-222 |  |0  |0  |
> |  7   | resolve | n31-023-216 |  |0  |0  |
> |  8   | resolve | n31-023-221 |  |0  |0  |
> |  9   | resolve | n31-023-223 |  |0  |0  |
> |  10  | resolve | n31-023-225 |  |0  |0  |
> |  11  | resolve | n31-023-224 |  |0  |0  |
> |  12  | resolve | n31-023-219 |  |0  |0  |
> |  13  | resolve | n31-023-229 |  |0  |0  |
> +--+-+-+--+---+---+
> +-+--+---+---+
> |   Pool  |   type   |  used | avail |
> +-+--+---+---+
> | cephfs_metadata | metadata | 2843M | 34.9T |
> |   cephfs_data   |   data   | 2580T |  731T |
> +-+--+---+---+
>
> +-+
> | Standby MDS |
> +-+
> | n31-023-227 |
> | n31-023-226 |
> | n31-023-228 |
> +-----+

Are there failovers occurring while all the ranks are in up:resolve?
MDS logs at high debug level would be helpful.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Meaning of Ceph MDS / Rank in "Stopped" state.

2019-06-03 Thread Patrick Donnelly
Hello Wesley,

On Wed, May 29, 2019 at 8:35 AM Wesley Dillingham
 wrote:
> On further thought, I'm now thinking this is telling me which rank is stopped 
> (2), not that two ranks are stopped.

Correct!

> I guess I am still curious about why this information is retained here

Time has claimed that secret.

> and can rank 2 be made active again?

Yes.

> If so, would this be cleaned up out of "stopped"?
>
> The state diagram here: http://docs.ceph.com/docs/master/cephfs/mds-states/
>
> seems to indicate that once a rank is "Stopped" it has no path to move out of 
> that state. Perhaps I am reading it wrong.

Well, we didn't document the transitions for rank "states" in this
diagram so we don't show that. The path out of "down:stopped" is to
increase max_mds so the rank is reactivated.
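
For example, assuming the file system is named "cephfs" and you want
three active ranks again:

ceph fs set cephfs max_mds 3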

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Quotas with Mimic (CephFS-FUSE) clients in a Luminous Cluster

2019-06-03 Thread Patrick Donnelly
On Mon, May 27, 2019 at 2:36 AM Oliver Freyermuth
 wrote:
>
> Dear Cephalopodians,
>
> in the process of migrating a cluster from Luminous (12.2.12) to Mimic 
> (13.2.5), we have upgraded the FUSE clients first (we took the chance during 
> a time of low activity),
> thinking that this should not cause any issues. All MDS+MON+OSDs are still on 
> Luminous, 12.2.12.
>
> However, it seems quotas have stopped working - with a (FUSE) Mimic client 
> (13.2.5), I see:
> $ getfattr --absolute-names --only-values -n ceph.quota.max_bytes 
> /cephfs/user/freyermu/
> /cephfs/user/freyermu/: ceph.quota.max_bytes: No such attribute
>
> A Luminous client (12.2.12) on the same cluster sees:
> $ getfattr --absolute-names --only-values -n ceph.quota.max_bytes 
> /cephfs/user/freyermu/
> 5
>
> It does not seem as if the attribute has been renamed (e.g. 
> https://github.com/ceph/ceph/blob/mimic/qa/tasks/cephfs/test_quota.py still 
> references it, same for the docs),
> and I have to assume the clients also do not enforce quota if they do not see 
> it.
>
> Is this a known incompatibility between Mimic clients and a Luminous cluster?
> The release notes of Mimic only mention that quota support was added to the 
> kernel client, but nothing else quota related catches my eye.

Unfortunately this wasn't adequately tested. But yes, Mimic ceph-fuse
clients will not be able to interact with quotas on older clusters.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS hangs in "heartbeat_map" deadlock

2019-05-31 Thread Patrick Donnelly
Hi Stefan,

Sorry I couldn't get back to you sooner.

On Mon, May 27, 2019 at 5:02 AM Stefan Kooman  wrote:
>
> Quoting Stefan Kooman (ste...@bit.nl):
> > Hi Patrick,
> >
> > Quoting Stefan Kooman (ste...@bit.nl):
> > > Quoting Stefan Kooman (ste...@bit.nl):
> > > > Quoting Patrick Donnelly (pdonn...@redhat.com):
> > > > > Thanks for the detailed notes. It looks like the MDS is stuck
> > > > > somewhere it's not even outputting any log messages. If possible, it'd
> > > > > be helpful to get a coredump (e.g. by sending SIGQUIT to the MDS) or,
> > > > > if you're comfortable with gdb, a backtrace of any threads that look
> > > > > suspicious (e.g. not waiting on a futex) including `info threads`.
> > >
> > > Today the issue reappeared (after being absent for ~ 3 weeks). This time
> > > the standby MDS could take over and would not get into a deadlock
> > > itself. We made gdb traces again, which you can find over here:
> > >
> > > https://8n1.org/14011/d444
> >
> > We are still seeing these crashes occur ~ every 3 weeks or so. Have you
> > find the time to look into the backtraces / gdb dumps?
>
> We have not seen this issue anymore for the past three months. We have
> updated the cluster to 12.2.11 in the meantime, but not sure if that is
> related. Hopefully it stays away.

Looks like you hit the infinite loop bug in OpTracker. It was fixed in
12.2.11: https://tracker.ceph.com/issues/37977

The problem was introduced in 12.2.8.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pool migration for cephfs?

2019-05-15 Thread Patrick Donnelly
On Wed, May 15, 2019 at 5:05 AM Lars Täuber  wrote:
> is there a way to migrate a cephfs to a new data pool like it is for rbd on 
> nautilus?
> https://ceph.com/geen-categorie/ceph-pool-migration/

No, this isn't possible.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Clients failing to respond to cache pressure

2019-05-09 Thread Patrick Donnelly
On Thu, May 9, 2019 at 3:21 AM Stolte, Felix  wrote:
>
> Thanks for the info Patrick. We are using ceph packages from ubuntu main 
> repo, so it will take some weeks until I can do the update. In the meantime 
> is there anything I can do manually to decrease the number of caps hold by 
> the backup nodes, like flushing the client cache or something like that? Is 
> it possible to mount cephfs without caching on specific mounts?

You can try dropping the cache on the clients
(/proc/sys/vm/drop_caches). Don't do all of them at once: that might
hang your MDS and cause it to be replaced by the monitors, which is one
of the reasons the recall changes were made. I'm not really sure how
quickly the MDS will chew through 5M cap releases.
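
Roughly, on one backup client at a time (echoing 2 drops only the
reclaimable dentry/inode caches, which is what holds the caps):

sync
echo 2 > /proc/sys/vm/drop_caches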

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Clients failing to respond to cache pressure

2019-05-08 Thread Patrick Donnelly
On Wed, May 8, 2019 at 4:10 AM Stolte, Felix  wrote:
>
> Hi folks,
>
> we are running a luminous cluster and using the cephfs for fileservices. We 
> use Tivoli Storage Manager to backup all data in the ceph filesystem to tape 
> for disaster recovery. Backup runs on two dedicated servers, which mounted 
> the cephfs via kernel mount. In order to complete the Backup in time we are 
> using 60 Backup Threads per Server. While backup is running, ceph health 
> often changes from “OK” to “2 clients failing to respond to cache pressure”. 
> After investigating and doing research in the mailing list I set the 
> following parameters:
>
> mds_cache_memory_limit = 34359738368 (32 GB) on MDS Server
>
> client_oc_size = 104857600 (100 MB, default is 200 MB) on Backup Servers
>
> All Servers running Ubuntu 18.04 with Kernel 4.15.0-47 and ceph 12.2.11. We 
> have 3 MDS Servers, 1 Active, 2 Standby. Changing to multiple active MDS 
> Servers is not an option, since we are planning to use snapshots. Cephfs 
> holds 78,815,975 files.
>
> Any advice on getting rid of the Warning would be very much appreciated. On a 
> sidenote: Although MDS Cache Memory is set to 32GB htop shows 60GB Memory 
> Usage for the ceph-mds process

With clients doing backup it's likely that they hold millions of caps.
This is not a good situation to be in. I recommend upgrading to
12.2.12 as we recently backported a fix for the MDS to limit the
number of caps held by clients to 1M. Additionally, trimming the cache
and recalling caps are now throttled. This may help a lot for your
workload.

Note that these fixes haven't been backported to Mimic yet.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inodes on /cephfs

2019-04-30 Thread Patrick Donnelly
On Tue, Apr 30, 2019 at 8:01 AM Oliver Freyermuth
 wrote:
>
> Dear Cephalopodians,
>
> we have a classic libvirtd / KVM based virtualization cluster using Ceph-RBD 
> (librbd) as backend and sharing the libvirtd configuration between the nodes 
> via CephFS
> (all on Mimic).
>
> To share the libvirtd configuration between the nodes, we have symlinked some 
> folders from /etc/libvirt to their counterparts on /cephfs,
> so all nodes see the same configuration.
> In general, this works very well (of course, there's a "gotcha": Libvirtd 
> needs reloading / restart for some changes to the XMLs, we have automated 
> that),
> but there is one issue caused by Yum's cleverness (that's on CentOS 7). 
> Whenever there's a libvirtd update, unattended upgrades fail, and we see:
>
>Transaction check error:
>  installing package libvirt-daemon-driver-network-4.5.0-10.el7_6.7.x86_64 
> needs 2 inodes on the /cephfs filesystem
>  installing package 
> libvirt-daemon-config-nwfilter-4.5.0-10.el7_6.7.x86_64 needs 18 inodes on the 
> /cephfs filesystem
>
> So it seems yum follows the symlinks and checks the available inodes on 
> /cephfs. Sadly, that reveals:
>[root@kvm001 libvirt]# LANG=C df -i /cephfs/
>Filesystem Inodes IUsed IFree IUse% Mounted on
>ceph-fuse  6868 0  100% /cephfs
>
> I think that's just because there is no real "limit" on the maximum inodes on 
> CephFS. However, returning 0 breaks some existing tools (notably, Yum).
>
> What do you think? Should CephFS return something different than 0 here to 
> not break existing tools?
> Or should the tools behave differently? But one might also argue that if the 
> total number of Inodes matches the used number of Inodes, the FS is indeed 
> "full".
> It's just unclear to me who to file a bug against ;-).
>
> Right now, I am just using:
> yum -y --setopt=diskspacecheck=0 update
> as a manual workaround, but this is naturally rather cumbersome.

This is fallout from [1]. See discussion on setting f_free to 0 here
[2]. In summary, userland tools are trying to be too clever by looking
at f_free. [I could be convinced to go back to f_free = ULONG_MAX if
there are other instances of this.]

[1] https://github.com/ceph/ceph/pull/23323
[2] https://github.com/ceph/ceph/pull/23323#issuecomment-409249911

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inline_data (was: CephFS and many small files)

2019-04-03 Thread Patrick Donnelly
On Tue, Apr 2, 2019 at 5:24 AM Clausen, Jörn  wrote:
>
> Hi!
>
> Am 29.03.2019 um 23:56 schrieb Paul Emmerich:
> > There's also some metadata overhead etc. You might want to consider
> > enabling inline data in cephfs to handle small files in a
> > store-efficient way (note that this feature is officially marked as
> > experimental, though).
> > http://docs.ceph.com/docs/master/cephfs/experimental-features/#inline-data
>
> Is there something missing from the documentation? I have turned on this
> feature:
>
> $ ceph fs dump | grep inline_data
> dumped fsmap epoch 1224
> inline_data enabled
>
> I have reduced the size of the bonnie-generated files to 1 byte. But
> this is the situation halfway into the test: (output slightly shortened)
>
> $ rados df
> POOL_NAME  USED OBJECTS CLONES   COPIES
> fs-data 3.2 MiB 3390041  0 10170123
> fs-metadata 772 MiB2249  0 6747
>
> total_objects3392290
> total_used   643 GiB
> total_avail  957 GiB
> total_space  1.6 TiB
>
> i.e. bonnie has created a little over 3 million files, for which the
> same number of objects was created in the data pool. So the raw usage is
> again at more than 500 GB.

Even for inline files, there is one object created in the data pool to
hold backtrace information (an xattr of the object) used for hard
links and disaster recovery.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS and many small files

2019-03-29 Thread Patrick Donnelly
Hi Jörn,

On Fri, Mar 29, 2019 at 5:20 AM Clausen, Jörn  wrote:
>
> Hi!
>
> In my ongoing quest to wrap my head around Ceph, I created a CephFS
> (data and metadata pool with replicated size 3, 128 pgs each).

What version?

> When I
> mount it on my test client, I see a usable space of ~500 GB, which I
> guess is okay for the raw capacity of 1.6 TiB I have in my OSDs.
>
> I run bonnie with
>
> -s 0G -n 20480:1k:1:8192
>
> i.e. I should end up with ~20 million files, each file 1k in size
> maximum. After about 8 million files (about 4.7 GBytes of actual use),
> my cluster runs out of space.

Meaning, you got ENOSPC?

> Is there something like a "block size" in CephFS? I've read
>
> http://docs.ceph.com/docs/master/cephfs/file-layouts/
>
> and thought maybe object_size is something I can tune, but I only get
>
> $ setfattr -n ceph.dir.layout.object_size -v 524288 bonnie
> setfattr: bonnie: Invalid argument

You can only set a layout on an empty directory. The layouts here are
not likely to be the cause.
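
If you do want to experiment with a smaller object size anyway, set
the layout on a fresh, empty directory before writing into it; new
files created below it inherit it. A sketch (the path is hypothetical;
the stripe unit has to be lowered first so it does not exceed the
object size):

mkdir /mnt/cephfs/bonnie-small
setfattr -n ceph.dir.layout.stripe_unit -v 524288 /mnt/cephfs/bonnie-small
setfattr -n ceph.dir.layout.object_size -v 524288 /mnt/cephfs/bonnie-small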

> Is this even the right approach? Or are "CephFS" and "many small files"
> such opposing concepts that it is simply not worth the effort?

You should not have had issues growing to that number of files. Please
post more information about your cluster including configuration
changes and `ceph osd df`.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How To Scale Ceph for Large Numbers of Clients?

2019-03-07 Thread Patrick Donnelly
On Thu, Mar 7, 2019 at 8:24 AM Zack Brenton  wrote:
>
> Hey Patrick,
>
> I understand your skepticism! I'm also confident that this is some kind of a 
> configuration issue; I'm not very familiar with all of Ceph's various 
> configuration options as Rook generally abstracts those away, so I appreciate 
> you taking the time to look into this.
>
> I've attached a screenshot of our internal Ceph MDS dashboard that includes 
> some data from one of my older load tests showing the memory and CPU usage of 
> each MDS pod, as well as the session count, handled client request rate, and 
> object r/w op rates. I'm confident that the `mds_cache_memory_limit` was 16GB 
> for this test, although I've been testing with different values and 
> unfortunately I don't have a historical record of those like I do for the 
> metrics included on our dashboard.

Is this with one active MDS and one standby-replay? The graph is odd
to me because the session count shows sessions on fs-b and fs-d but
not fs-c. Or maybe max_mds=2 and fs-d has no activity and fs-c is
standby-replay?

> Types of devices:
> We run our Ceph pods on 3 AWS i3.2xlarge nodes. We're running 3 OSDs, 3 Mons, 
> and 2 MDS pods (1 active, 1 standby-replay). Currently, each pod runs with 
> the following resources:
> - osds: 2 CPU, 6Gi RAM, 1.7Ti NVMe disk
> - mds:  3 CPU, 24Gi RAM
> - mons: 500m (.5) CPU, 1Gi RAM

Three OSDs are going to really struggle with the client load you're
putting on them. It doesn't surprise me that you are getting slow
request warnings on the MDS for this reason. When you were running
Luminous 12.2.9+ or Mimic 13.2.2+, were you seeing slow metadata I/O
warnings? Even if you did not, it's possible that the MDS is delayed
issuing caps to clients because it's waiting for another client to
flush writes and release conflicting caps.

Generally we recommend that the metadata pool be located on OSDs with
fast devices separate from the data pool. This avoids priority
inversion of MDS metadata I/O with data I/O. See [1] to configure the
metadata pool on a separate set of OSDs.

Also, you're not going to saturate a 1.9TB NVMe SSD with one OSD. You
must partition it and setup multiple OSDs. This ends up being positive
for you so that you can put the metadata pool on its own set of OSDs.

[1] https://ceph.com/community/new-luminous-crush-device-classes/
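As a rough sketch (the OSD ids, device class, rule and pool names are
assumptions for illustration only), moving the metadata pool onto its
own set of OSDs looks something like:

  # if the OSDs already have an auto-assigned class, clear it first:
  ceph osd crush rm-device-class osd.3 osd.4 osd.5
  ceph osd crush set-device-class nvme osd.3 osd.4 osd.5
  ceph osd crush rule create-replicated metadata-rule default host nvme
  ceph osd pool set myfs-metadata crush_rule metadata-rule

Once the pool's crush_rule changes, the metadata PGs backfill onto the
OSDs carrying that device class.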

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How To Scale Ceph for Large Numbers of Clients?

2019-03-06 Thread Patrick Donnelly
Hello Zack,

On Wed, Mar 6, 2019 at 1:18 PM Zack Brenton  wrote:
>
> Hello,
>
> We're running Ceph on Kubernetes 1.12 using the Rook operator 
> (https://rook.io), but we've been struggling to scale applications mounting 
> CephFS volumes above 600 pods / 300 nodes. All our instances use the kernel 
> client and run kernel `4.19.23-coreos-r1`.
>
> We've tried increasing the MDS memory limits, running multiple active MDS 
> pods, and running different versions of Ceph (up to the latest Luminous and 
> Mimic releases), but we run into MDS_SLOW_REQUEST errors at the same scale 
> regardless of the memory limits we set. See this GitHub issue for more info 
> on what we've tried up to this point: https://github.com/rook/rook/issues/2590
>
> I've written a simple load test that reads all the files in a given directory 
> on an interval. While running this test, I've noticed that the `mds_co.bytes` 
> value (from `ceph daemon mds.myfs-a dump_mempools | jq -c 
> '.mempool.by_pool.mds_co'`) increases each time files are read. Why is this 
> number increasing after the first iteration? If the same client is reading 
> the same cached files, why would the data in the cache change at all? What is 
> `mds_co.bytes` actually reporting?
>
> My most important question is this: How do I configure Ceph to be able to 
> scale to large numbers of clients?

Please post more information about your cluster: types of devices,
`ceph osd tree`, `ceph osd df`, and `ceph osd lspools`.

There's no reason why CephFS shouldn't be able to scale to that number
of clients. The issue is probably related configuration of the
pools/MDS. From your ticket, I have a *lot* of trouble believing the
MDS still at 3GB memory usage with that number of clients and
mds_cache_memory_limit=17179869184 (16GB).

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS_SLOW_METADATA_IO

2019-02-28 Thread Patrick Donnelly
On Thu, Feb 28, 2019 at 12:49 PM Stefan Kooman  wrote:
>
> Dear list,
>
> After upgrading to 12.2.11 the MDSes are reporting slow metadata IOs
> (MDS_SLOW_METADATA_IO). The metadata IOs would have been blocked for
> more that 5 seconds. We have one active, and one active standby MDS. All
> storage on SSD (Samsung PM863a / Intel DC4500). No other (OSD) slow ops
> reported. The MDSes are underutilized, only a handful of active clients
> and almost no load (fast hexacore CPU, 256 GB RAM, 20 Gb/s network). The
> cluster is also far from busy.
>
> I've dumped ops in flight on the MDSes but all ops that are printed are
> finished in a split second (duration: 0.000152), flag_point": "acquired
> locks".

I believe you're looking at the wrong "ops" dump. You want to check
"objecter_requests".

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] faster switch to another mds

2019-02-20 Thread Patrick Donnelly
On Tue, Feb 19, 2019 at 11:39 AM Fyodor Ustinov  wrote:
>
> Hi!
>
> From documentation:
>
> mds beacon grace
> Description:The interval without beacons before Ceph declares an MDS 
> laggy (and possibly replace it).
> Type:   Float
> Default:15
>
> I do not understand, 15 - are is seconds or beacons?

seconds

> And an additional misunderstanding - if we gently turn off the MDS (or MON), 
> why it does not inform everyone interested before death - "I am turned off, 
> no need to wait, appoint a new active server"

The MDS does inform the monitors if it has been shut down cleanly. If
you pull the plug or SIGKILL it, it does not. :)


-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understanding EC properties for CephFS / small files.

2019-02-18 Thread Patrick Donnelly
Hello Jesper,

On Sat, Feb 16, 2019 at 11:11 PM  wrote:
>
> Hi List.
>
> I'm trying to understand the nuts and bolts of EC / CephFS
> We're running an EC4+2 pool on top of 72 x 7.2K rpm 10TB drives. Pretty
> slow bulk / archive storage.
>
> # getfattr -n ceph.dir.layout /mnt/home/cluster/mysqlbackup
> getfattr: Removing leading '/' from absolute path names
> # file: mnt/home/cluster/mysqlbackup
> ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304
> pool=cephfs_data_ec42"
>
> This configuration is taken directly out of the online documentation:
> (Which may have been where it went all wrong from our perspective):

Correction: this is from the Ceph default for the file layout. The
default is that no file striping is performed and 4MB chunks are used
for file blocks. You may find this document instructive on how files
are striped (especially the ASCII art):

https://github.com/ceph/ceph/blob/master/doc/dev/file-striping.rst

> http://docs.ceph.com/docs/master/cephfs/file-layouts/
>
> Ok, this means that a 16MB file will be split at 4 chuncks of 4MB each
> with 2 erasure coding chuncks? I dont really understand the stripe_count
> element?

A 16 MB file would be split into 4 RADOS objects. Then those objects
would be distributed across OSDs according to the EC profile.

> And since erasure-coding works at the object level, striping individual
> objects across - here 4 replicas - it'll end up filling 16MB ? Or
> is there an internal optimization causing this not to be the case?
>
> Additionally, when reading the file, all 4 chunck need to be read to
> assemble the object. Causing (at a minumum) 4 IOPS per file.
>
> Now, my common file size is < 8MB and commonly 512KB files are on
> this pool.
>
> Will that cause a 512KB file to be padded to 4MB with 3 empty chuncks
> to fill the erasure coded profile and then 2 coding chuncks on top?
> In total 24MB for storing 512KB ?

No. Files do not always use the full 4MB chunk. The final chunk of the
file will be minimally sized. For example:

pdonnell@senta02 ~/mnt/tmp.ZS9VCMhBWg$ cp /bin/grep .
pdonnell@senta02 ~/mnt/tmp.ZS9VCMhBWg$ stat grep
  File: 'grep'
  Size: 211224  Blocks: 413IO Block: 4194304 regular file
Device: 2ch/44d Inode: 1099511627836  Links: 1
Access: (0750/-rwxr-x---)  Uid: ( 1163/pdonnell)   Gid: ( 1163/pdonnell)
Access: 2019-02-18 14:02:11.503875296 -0500
Modify: 2019-02-18 14:02:11.523375657 -0500
Change: 2019-02-18 14:02:11.523375657 -0500
 Birth: -
pdonnell@senta02 ~/mnt/tmp.ZS9VCMhBWg$ printf %x 1099511627836
13c
$ bin/rados -p cephfs.a.data stat 13c.
cephfs.a.data/13c. mtime 2019-02-18 14:02:11.00, size 211224

So the object holding "grep" still only uses ~200KB and not 4MB.


-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS - read latency.

2019-02-18 Thread Patrick Donnelly
On Sun, Feb 17, 2019 at 9:51 PM  wrote:
>
> > Probably not related to CephFS. Try to compare the latency you are
> > seeing to the op_r_latency reported by the OSDs.
> >
> > The fast_read option on the pool can also help a lot for this IO pattern.
>
> Magic, that actually cut the read-latency in half - making it more
> aligned with what to expect from the HW+network side:
>
> N   Min   MaxMedian   AvgStddev
> x 100  0.015687  0.221538  0.0252530.03259606   0.028827849
>
> 25ms as a median, 32ms average is still on the high side,
> but way, way better.

I'll use this opportunity to point out that serial archive programs
like tar are terrible for distributed file systems. It would be
awesome if someone multithreaded tar or extended it for asynchronous
I/O. If only I had more time (TM)...
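As a crude illustration only (the path and parallelism level are
arbitrary), keeping several reads in flight instead of touching one
file at a time can look like:

  find /mnt/cephfs/dir -type f -print0 | xargs -0 -n 16 -P 8 cat > /dev/null

It's no replacement for tar, but it shows why overlapping requests
matter so much on a networked file system.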

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Update / upgrade cluster with MDS from 12.2.7 to 12.2.11

2019-02-11 Thread Patrick Donnelly
On Mon, Feb 11, 2019 at 12:10 PM Götz Reinicke
 wrote:
> as 12.2.11 is out for some days and no panic mails showed up on the list I 
> was planing to update too.
>
> I know there are recommended orders in which to update/upgrade the cluster 
> but I don’t know how rpm packages are handling restarting services after a 
> yum update. E.g. when MDS and MONs are on the same server.

This should be fine. The MDS only uses a new executable file if you
explicitly restart it via systemd (or if the MDS fails and systemd
restarts it).

More info: when the MDS respawns in normal circumstances, it passes
the /proc/self/exe file to execve. An intended side-effect is that the
MDS will continue using the same executable file across execs.

> And regarding an MDS Cluster I like to ask, if the upgrading instructions 
> regarding only running one MDS during upgrading also applies for an update?
>
> http://docs.ceph.com/docs/mimic/cephfs/upgrading/

If you upgrade an MDS, it may update the compatibility bits in the
Monitor's MDSMap. Other MDSs will abort when they see this change. The
upgrade process intended to help you avoid seeing those errors so you
don't inadvertently think something went wrong.

If you don't mind seeing those errors and you're using 1 active MDS,
then don't worry about it.
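For reference, the single-active upgrade dance from that document
boils down to roughly this (the file system name "cephfs" and an
original max_mds of 2 are assumptions):

  ceph fs set cephfs max_mds 1
  ceph mds deactivate cephfs:1   # on Luminous, ranks >0 are stopped by hand
  # upgrade and restart the MDS daemons, one at a time
  ceph fs set cephfs max_mds 2   # restore the original value afterwards

With only one active MDS to begin with, there is nothing extra to do
beyond restarting the daemons.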

Good luck!

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-filesystem wthin a cluster

2019-01-17 Thread Patrick Donnelly
On Thu, Jan 17, 2019 at 2:44 AM Dan van der Ster  wrote:
>
> On Wed, Jan 16, 2019 at 11:17 PM Patrick Donnelly  wrote:
> >
> > On Wed, Jan 16, 2019 at 1:21 AM Marvin Zhang  wrote:
> > > Hi CephFS experts,
> > > From document, I know multi-fs within a cluster is still experiment 
> > > feature.
> > > 1. Is there any estimation about stability and performance for this 
> > > feature?
> >
> > Remaining blockers [1] need completed. No developer has yet taken on
> > this task. Perhaps by O release.
> >
> > > 2. It seems that each FS will consume at least 1 active MDS and
> > > different FS can't share MDS. Suppose I want to create 10 FS , I need
> > > at least 10 MDS. Is it right? Is ther any limit number for MDS within
> > > a cluster?
> >
> > No limit on number of MDS but there is a limit on the number of
> > actives (multimds).
>
> TIL...
> What is the max number of actives in a single FS?

https://github.com/ceph/ceph/blob/39f9e8db4dc7f8bfcb01a9ad20b8961c36138f4f/src/mds/mdstypes.h#L40

I don't think there's a particular reason for this limit. There may be
some parts of the code that expect fewer than 256 active MDS but that
could probably be easily changed.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to do multiple cephfs mounts.

2019-01-17 Thread Patrick Donnelly
On Thu, Jan 17, 2019 at 3:23 AM Marc Roos  wrote:
> Should I not be able to increase the io's by splitting the data writes
> over eg. 2 cephfs mounts? I am still getting similar overall
> performance. Is it even possible to increase performance by using
> multiple mounts?
>
> Using 2 kernel mounts on CentOS 7.6

It's unlikely this changes anything unless you also split the workload
into two. That may allow the kernel to do parallel requests?

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH_FSAL Nfs-ganesha

2019-01-15 Thread Patrick Donnelly
On Mon, Jan 14, 2019 at 7:11 AM Daniel Gryniewicz  wrote:
>
> Hi.  Welcome to the community.
>
> On 01/14/2019 07:56 AM, David C wrote:
> > Hi All
> >
> > I've been playing around with the nfs-ganesha 2.7 exporting a cephfs
> > filesystem, it seems to be working pretty well so far. A few questions:
> >
> > 1) The docs say " For each NFS-Ganesha export, FSAL_CEPH uses a
> > libcephfs client,..." [1]. For arguments sake, if I have ten top level
> > dirs in my Cephfs namespace, is there any value in creating a separate
> > export for each directory? Will that potentially give me better
> > performance than a single export of the entire namespace?
>
> I don't believe there are any advantages from the Ceph side.  From the
> Ganesha side, you configure permissions, client ACLs, squashing, and so
> on on a per-export basis, so you'll need different exports if you need
> different settings for each top level directory.  If they can all use
> the same settings, one export is probably better.

There may be a performance impact (good or bad) from having separate
exports for CephFS. Each export instantiates a separate instance of
the CephFS client, which has its own bookkeeping and set of
capabilities issued by the MDS. Also, each client instance has a
separate big lock (potentially a big deal for performance). If the
data for each export is disjoint (no hard links or shared inodes) and
the NFS server is expected to have a lot of load, breaking out the
exports can have a positive impact on performance. If there are hard
links, then the clients associated with the exports will potentially
fight over capabilities, which will add to request latency.
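Purely for illustration (the export ids, paths and cephx user below
are assumptions, not a recommendation), two disjoint exports in
ganesha.conf would look something like:

  cat >> /etc/ganesha/ganesha.conf <<'EOF'
  EXPORT {
      Export_ID = 1;
      Path = "/dir1";
      Pseudo = "/dir1";
      Access_Type = RW;
      FSAL { Name = CEPH; User_Id = "ganesha"; }
  }
  EXPORT {
      Export_ID = 2;
      Path = "/dir2";
      Pseudo = "/dir2";
      Access_Type = RW;
      FSAL { Name = CEPH; User_Id = "ganesha"; }
  }
  EOF

Each EXPORT block gets its own libcephfs instance, which is exactly
the trade-off described above.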

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] tuning ceph mds cache settings

2019-01-09 Thread Patrick Donnelly
Hello Jonathan,

On Wed, Jan 9, 2019 at 5:37 AM Jonathan Woytek  wrote:
> While working on examining performance under load at scale, I see a marked 
> performance improvement whenever I would restart certain mds daemons. I was 
> able to duplicate the performance improvement by issuing a "daemon mds.blah 
> cache drop". The performance bump lasts for quite a long time--far longer 
> than it takes for the cache to "fill" according to the stats.

What version of Ceph are you running? Can you expand on what this
performance improvement is?

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-maintainers] v13.2.4 Mimic released

2019-01-08 Thread Patrick Donnelly
On Mon, Jan 7, 2019 at 7:10 AM Alexandre DERUMIER  wrote:
>
> Hi,
>
> >>* Ceph v13.2.2 includes a wrong backport, which may cause mds to go into
> >>'damaged' state when upgrading Ceph cluster from previous version.
> >>The bug is fixed in v13.2.3. If you are already running v13.2.2,
> >>upgrading to v13.2.3 does not require special action.
>
> Any special action for upgrading from 13.2.1 ?

No special actions for CephFS are required for the upgrade.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS MDS optimal setup on Google Cloud

2019-01-07 Thread Patrick Donnelly
Hello Mahmoud,

On Fri, Dec 21, 2018 at 7:44 AM Mahmoud Ismail
 wrote:
> I'm doing benchmarks for metadata operations on CephFS, HDFS, and HopsFS on 
> Google Cloud. In my current setup, i'm using 32 vCPU machines with 29 GB 
> memory, and i have 1 MDS, 1 MON and 3 OSDs. The MDS and the MON nodes are 
> co-located on one vm, while each of the OSDs is on a separate vm with 1 SSD 
> disk attached. I'm using the default configuration for MDS, and OSDs.
>
> I'm running 300 clients on 10 machines (16 vCPU), each client creates a 
> CephFileSystem using the CephFS hadoop plugin, and then writes empty files 
> for 30 seconds followed by reading the empty files for another 30 seconds. 
> The aggregated throughput is around 2000 file create opertions/sec and 1 
> file read operations/sec. However, the MDS is not fully utilizing the 32 
> cores on the machine, is there any configuration that i should consider to 
> fully utilize the machine?.

The MDS is not yet very parallel; it can only utilize about 2.5 cores
in the best circumstances. Make sure you allocate plenty of RAM for
the MDS. 16GB or 32GB would be a good choice. See (and disregard the
warning on that page):
http://docs.ceph.com/docs/mimic/cephfs/cache-size-limits/

You may also try using multiple active metadata servers to increase
throughput. See: http://docs.ceph.com/docs/mimic/cephfs/multimds/
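A rough sketch of both knobs (the file system name and the exact
limit are assumptions; on Luminous you would set the cache limit in
ceph.conf rather than the config database):

  ceph config set mds mds_cache_memory_limit 17179869184   # 16GB
  ceph fs set cephfs max_mds 2

Each additional active rank takes over part of the namespace, so
create-heavy workloads in separate directories spread out the easiest.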

> Also, i noticed that running more than 20-30 clients (on different threads) 
> per machine degrade the aggregated throughput for read, is there a limitation 
> on CephFileSystem and libceph on the number of clients created per machine?

No. Can't give you any hints without more information about the test
setup. We also have not tested with the Hadoop plugin in years. There
may be limitations we're not presently aware of.

> Another issue,  Are the MDS operations single threaded as pointed here 
> "https://www.slideshare.net/XiaoxiChen3/cephfs-jewel-mds-performance-benchmark;?

Yes, this is still the case.

> Regarding the MDS global lock, is it it a single lock per MDS or is it a 
> global distributed lock for all MDSs?

per-MDS


-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon:failed in thread_name:safe_timer

2018-11-21 Thread Patrick Donnelly
On Tue, Nov 20, 2018 at 6:18 PM 楼锴毅  wrote:
> Hello
> Yesterday I upgraded my cluster to v12.2.9.But the mons still failed for the 
> same reason.And when I run 'ceph versions', it returned
> "
> "mds": {
> "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) 
> luminous (stable)": 1,
> "ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) 
> luminous (stable)": 4
> },
> "
> But actually I only have four MDS , and their versions are all v12.2.9 .I am 
> confused about it.

How did you restart the MDSs? If you used `ceph mds fail` then the
executable version (v12.2.8) will not change.
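In other words, to actually pick up the new binary you need to
restart the daemon process itself, e.g. (unit name is an example):

  systemctl restart ceph-mds@mds1
  ceph versions   # the mds section should now only list 12.2.9

`ceph mds fail` just makes the existing process respawn itself, so it
keeps running the old executable.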

Also, the monitor failure requires updating the monitor to v12.2.9.
What version are the mons?

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-20 Thread Patrick Donnelly
You either need to accept that reads/writes will land on different
data centers, ensure the primary OSD for a given pool is always in the
desired data center, or use some other non-Ceph solution, which will
have either expensive, eventual, or false consistency.
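For the second option, one blunt instrument is primary affinity
(osd.3 below is just an example of an OSD in the remote data center):

  ceph osd primary-affinity osd.3 0

That OSD then keeps its replicas but is avoided as primary. A CRUSH
rule that always picks the first replica from the preferred site is
the more precise, per-pool way to do it.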

On Fri, Nov 16, 2018, 10:07 AM Vlad Kopylov  wrote:
> This is what Jean suggested. I understand it and it works with primary.
> *But what I need is for all clients to access same files, not separate
> sets (like red blue green)*
>
> Thanks Konstantin.
>
> On Fri, Nov 16, 2018 at 3:43 AM Konstantin Shalygin 
> wrote:
>
>> On 11/16/18 11:57 AM, Vlad Kopylov wrote:
>> > Exactly. But write operations should go to all nodes.
>>
>> This can be set via primary affinity [1], when a ceph client reads or
>> writes data, it always contacts the primary OSD in the acting set.
>>
>>
>> If u want to totally segregate IO, you can use device classes:
>>
>> Just create osds with different classes:
>>
>> dc1
>>
>>host1
>>
>>  red osd.0 primary
>>
>>  blue osd.1
>>
>>  green osd.2
>>
>> dc2
>>
>>host2
>>
>>  red osd.3
>>
>>  blue osd.4 primary
>>
>>  green osd.5
>>
>> dc3
>>
>>host3
>>
>>  red osd.6
>>
>>  blue osd.7
>>
>>  green osd.8 primary
>>
>>
>> create 3 crush rules:
>>
>> ceph osd crush rule create-replicated red default host red
>>
>> ceph osd crush rule create-replicated blue default host blue
>>
>> ceph osd crush rule create-replicated green default host green
>>
>>
>> and 3 pools:
>>
>> ceph osd pool create red 64 64 replicated red
>>
>> ceph osd pool create blue 64 64 replicated blue
>>
>> ceph osd pool create green 64 64 replicated green
>>
>>
>> [1]
>>
>> http://docs.ceph.com/docs/master/rados/operations/crush-map/#primary-affinity
>> '
>>
>>
>>
>> k
>>
>> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon:failed in thread_name:safe_timer

2018-11-19 Thread Patrick Donnelly
On Mon, Nov 19, 2018 at 7:17 PM 楼锴毅  wrote:
> sorry to disturb , but recently when I use ceph(12.2.8),I found that the 
> leader monitor will always failed in thread_name:safe_timer.
> [...]

Try upgrading the mons to v12.2.9 (but see recent warnings concerning
upgrades to v12.2.9 for the OSDs):
https://tracker.ceph.com/issues/35848

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Removing MDS

2018-10-30 Thread Patrick Donnelly
On Tue, Oct 30, 2018 at 4:05 PM Rhian Resnick  wrote:
> We are running into issues deactivating mds ranks. Is there a way to safely 
> forcibly remove a rank?

No, there's no "safe" way to force the issue. The rank needs to come
back, flush its journal, and then complete its deactivation. To get
more help, you need to describe your environment, version of Ceph in
use, relevant log snippets, etc.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Don't upgrade to 13.2.2 if you use cephfs

2018-10-17 Thread Patrick Donnelly
On Wed, Oct 17, 2018 at 11:05 AM Alexandre DERUMIER  wrote:
>
> Hi,
>
> Is it possible to have more infos or announce about this problem ?
>
> I'm currently waiting to migrate from luminious to mimic, (I need new quota 
> feature for cephfs)
>
> is it safe to upgrade to 13.2.2 ?
>
> or better to wait to 13.2.3 ? or install 13.2.1 for now ?

Upgrading to 13.2.1 would be safe.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS hangs in "heartbeat_map" deadlock

2018-10-08 Thread Patrick Donnelly
On Thu, Oct 4, 2018 at 3:58 PM Stefan Kooman  wrote:
> A couple of hours later we hit the same issue. We restarted with
> debug_mds=20 and debug_journaler=20 on the standby-replay node. Eight
> hours later (an hour ago) we hit the same issue. We captured ~ 4.7 GB of
> logging I skipped to the end of the log file just before the
> "hearbeat_map" messages start:
>
> 2018-10-04 23:23:53.144644 7f415ebf4700 20 mds.0.locker  client.17079146 
> pending pAsLsXsFscr allowed pAsLsXsFscr wanted pFscr
> 2018-10-04 23:23:53.144645 7f415ebf4700 10 mds.0.locker eval done
> 2018-10-04 23:23:55.088542 7f415bbee700 10 mds.beacon.mds2 _send up:active 
> seq 5021
> 2018-10-04 23:23:59.088602 7f415bbee700 10 mds.beacon.mds2 _send up:active 
> seq 5022
> 2018-10-04 23:24:03.088688 7f415bbee700 10 mds.beacon.mds2 _send up:active 
> seq 5023
> 2018-10-04 23:24:07.088775 7f415bbee700 10 mds.beacon.mds2 _send up:active 
> seq 5024
> 2018-10-04 23:24:11.088867 7f415bbee700  1 heartbeat_map is_healthy 'MDSRank' 
> had timed out after 15
> 2018-10-04 23:24:11.088871 7f415bbee700  1 mds.beacon.mds2 _send skipping 
> beacon, heartbeat map not healthy
>
> As far as I can see just normal behaviour.
>
> The big question is: what is happening when the mds start logging the 
> hearbeat_map messages?
> Why does it log "heartbeat_map is_healthy", just to log .04 seconds later 
> it's not healthy?
>
> Ceph version: 12.2.8 on all nodes (mon, osd, mds)
> mds: one active / one standby-replay
>
> The system was not under any kind of resource pressure: plenty of CPU, RAM
> available. Metrics all look normal up to the moment things go into a deadlock
> (so it seems).

Thanks for the detailed notes. It looks like the MDS is stuck
somewhere it's not even outputting any log messages. If possible, it'd
be helpful to get a coredump (e.g. by sending SIGQUIT to the MDS) or,
if you're comfortable with gdb, a backtrace of any threads that look
suspicious (e.g. not waiting on a futex) including `info threads`.
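For example (assuming a single ceph-mds process on the box, and debug
symbols installed, e.g. the ceph-debuginfo package on RPM systems):

  gcore -o /tmp/mds-core $(pidof ceph-mds)   # core dump without killing it
  gdb /usr/bin/ceph-mds -p $(pidof ceph-mds)
  (gdb) info threads
  (gdb) thread apply all bt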
-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Don't upgrade to 13.2.2 if you use cephfs

2018-10-08 Thread Patrick Donnelly
+ceph-announce

On Sun, Oct 7, 2018 at 7:30 PM Yan, Zheng  wrote:
> There is a bug in v13.2.2 mds, which causes decoding purge queue to
> fail. If mds is already in damaged state, please downgrade mds to
> 13.2.1, then run 'ceph mds repaired fs_name:damaged_rank' .
>
> Sorry for all the trouble I caused.
> Yan, Zheng

This issue is being tracked here: http://tracker.ceph.com/issues/36346

The problem was caused by a backport of the wrong commit which
unfortunately was not caught. The backport was not done to Luminous;
only Mimic 13.2.2 is affected. New deployments on 13.2.2 are also
affected but do not require immediate action. A procedure for handling
upgrades of fresh deployments from 13.2.2 to 13.2.3 will be included
in the release notes for 13.2.3.
-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS performance.

2018-10-04 Thread Patrick Donnelly
On Thu, Oct 4, 2018 at 2:10 AM Ronny Aasen  wrote:
> in rbd there is a fancy striping solution, by using --stripe-unit and
> --stripe-count. This would get more spindles running ; perhaps consider
> using rbd instead of cephfs if it fits the workload.

CephFS also supports custom striping via layouts:
http://docs.ceph.com/docs/master/cephfs/file-layouts/
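A quick sketch on a new, empty directory, with example values chosen
so that object_size is a multiple of stripe_unit:

  setfattr -n ceph.dir.layout.stripe_unit -v 1048576 /mnt/cephfs/striped
  setfattr -n ceph.dir.layout.stripe_count -v 4 /mnt/cephfs/striped
  setfattr -n ceph.dir.layout.object_size -v 4194304 /mnt/cephfs/striped
  getfattr -n ceph.dir.layout /mnt/cephfs/striped

New files under that directory are then striped 1MB at a time across 4
objects at once, which spreads a single large read or write over more
OSDs.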

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] omap vs. xattr in librados

2018-09-11 Thread Patrick Donnelly
On Tue, Sep 11, 2018 at 12:43 PM, Benjamin Cherian
 wrote:
> On Tue, Sep 11, 2018 at 10:44 AM Gregory Farnum  wrote:
>>
>> 
>> In general, if the key-value storage is of unpredictable or non-trivial
>> size, you should use omap.
>>
>> At the bottom layer where the data is actually stored, they're likely to
>> be in the same places (if using BlueStore, they are the same — in FileStore,
>> a rados xattr *might* be in the local FS xattrs, or it might not). It is
>> somewhat more likely that something stored in an xattr will get pulled into
>> memory at the same time as the object's internal metadata, but that only
>> happens if it's quite small (think the xfs or ext4 xattr rules).
>
>
> Based on this description, if I'm planning on using Bluestore, there is no
> particular reason to ever prefer using xattrs over omap (outside of ease of
> use in the API), correct?

You may prefer xattrs on BlueStore if the metadata is small, or if
you may need to store it on an EC pool: omap is not supported on EC
pools.
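From the command line the two look like this (pool and object names
are arbitrary):

  rados -p mypool create obj1
  rados -p mypool setxattr obj1 mykey myvalue     # works on replicated and EC pools
  rados -p mypool setomapval obj1 mykey myvalue   # errors out on an EC pool
  rados -p mypool listxattr obj1
  rados -p mypool listomapvals obj1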

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Quota and ACL support

2018-08-27 Thread Patrick Donnelly
On Mon, Aug 27, 2018 at 12:51 AM, Oliver Freyermuth
 wrote:
> These features are critical for us, so right now we use the Fuse client. My 
> hope is CentOS 8 will use a recent enough kernel
> to get those features automatically, though.

Your cluster needs to be running Mimic and Linux v4.17+.

See also: https://github.com/ceph/ceph/pull/23728/files

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Secure way to wipe a Ceph cluster

2018-07-27 Thread Patrick Donnelly
Hello Christopher,

On Fri, Jul 27, 2018 at 12:00 AM, Christopher Kunz
 wrote:
> Hello all,
>
> as part of deprovisioning customers, we regularly have the task of
> wiping their Ceph clusters. Is there a certifiable, GDPR compliant way
> to do so without physically shredding the disks?

This should work and should be as fast as it can be:

wipefs -a /dev/sdX
shred /dev/sdX

Whether or not that's "GDPR compliant" will depend on external
certification, I guess.

(The issues might be that you can't guarantee all blocks in an SSD/HDD
are actually erased because the device firmware may retire bad blocks
and make them inaccessible. It may not be possible for the device to
physically destroy those blocks either even with SMART directives. You
may be stuck with an industrial shredder to be compliant if the rules
are stringent.)

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Insane CPU utilization in ceph.fuse

2018-07-23 Thread Patrick Donnelly
On Mon, Jul 23, 2018 at 5:48 AM, Daniel Carrasco  wrote:
> Hi, thanks for your response.
>
> Clients are about 6, and 4 of them are the most of time on standby. Only two
> are active servers that are serving the webpage. Also we've a varnish on
> front, so are not getting all the load (below 30% in PHP is not much).
> About the MDS cache, now I've the mds_cache_memory_limit at 8Mb.

What! Please post `ceph daemon mds. config diff`,  `... perf
dump`, and `... dump_mempools `  from the server the active MDS is on.

> I've tested
> also 512Mb, but the CPU usage is the same and the MDS RAM usage grows up to
> 15GB (on a 16Gb server it starts to swap and all fails). With 8Mb, at least
> the memory usage is stable on less than 6Gb (now is using about 1GB of RAM).

We've seen reports of possible memory leaks before and the potential
fixes for those were in 12.2.6. How fast does your MDS reach 15GB?
Your MDS cache size should be configured to 1-8GB (depending on your
preference) so it's disturbing to see you set it so low.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds daemon damaged

2018-07-12 Thread Patrick Donnelly
On Thu, Jul 12, 2018 at 3:55 PM, Patrick Donnelly  wrote:
>> Recommends fixing error by hand. Tried running deep scrub on pg 2.4, it
>> completes but still have the same issue above
>>
>> Final option is to attempt removing mds.ds27. If mds.ds29 was a standby and
>> has data it should become live. If it was not
>> I assume we will lose the filesystem at this point
>>
>> Why didn't the standby MDS failover?
>>
>> Just looking for any way to recover the cephfs, thanks!
>
> I think it's time to do a scrub on the PG containing that object.

Sorry, I didn't read the part of the email that said you did that :)
Did you confirm that the PG is active+clean after the deep scrub
finished? It looks like you're still scrubbing that PG.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds daemon damaged

2018-07-12 Thread Patrick Donnelly
On Thu, Jul 12, 2018 at 2:30 PM, Kevin  wrote:
> Sorry for the long posting but trying to cover everything
>
> I woke up to find my cephfs filesystem down. This was in the logs
>
> 2018-07-11 05:54:10.398171 osd.1 [ERR] 2.4 full-object read crc 0x6fc2f65a
> != expected 0x1c08241c on 2:292cf221:::200.:head

Since this error came from the OSD, you should look to resolve that
problem first. What you've done below is blow the journal away, which
hasn't helped you any because (a) your journal is now probably lost
without a lot of manual intervention and (b) the "new" journal is
still written to the same bad backing device/file, so it's probably
still unusable, as you found out.

> I had one standby MDS, but as far as I can tell it did not fail over. This
> was in the logs

If a rank becomes damaged, standbys will not take over. You must mark
it repaired first.

> (insufficient standby MDS daemons available)
>
> Currently my ceph looks like this
>   cluster:
> id: ..
> health: HEALTH_ERR
> 1 filesystem is degraded
> 1 mds daemon damaged
>
>   services:
> mon: 6 daemons, quorum ds26,ds27,ds2b,ds2a,ds28,ds29
> mgr: ids27(active)
> mds: test-cephfs-1-0/1/1 up , 3 up:standby, 1 damaged
> osd: 5 osds: 5 up, 5 in
>
>   data:
> pools:   3 pools, 202 pgs
> objects: 1013k objects, 4018 GB
> usage:   12085 GB used, 6544 GB / 18630 GB avail
> pgs: 201 active+clean
>  1   active+clean+scrubbing+deep
>
>   io:
> client:   0 B/s rd, 0 op/s rd, 0 op/s wr
>
> I started trying to get the damaged MDS back online
>
> Based on this page
> http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#disaster-recovery-experts
>
> # cephfs-journal-tool journal export backup.bin
> 2018-07-12 13:35:15.675964 7f3e1389bf00 -1 Header 200. is unreadable
> 2018-07-12 13:35:15.675977 7f3e1389bf00 -1 journal_export: Journal not
> readable, attempt object-by-object dump with `rados`
> Error ((5) Input/output error)
>
> # cephfs-journal-tool event recover_dentries summary
> Events by type:
> 2018-07-12 13:36:03.000590 7fc398a18f00 -1 Header 200. is
> unreadableErrors: 0
>
> cephfs-journal-tool journal reset - (I think this command might have worked)
>
> Next up, tried to reset the filesystem
>
> ceph fs reset test-cephfs-1 --yes-i-really-mean-it
>
> Each time same errors
>
> 2018-07-12 11:56:35.760449 mon.ds26 [INF] Health check cleared: MDS_DAMAGE
> (was: 1 mds daemon damaged)
> 2018-07-12 11:56:35.856737 mon.ds26 [INF] Standby daemon mds.ds27 assigned
> to filesystem test-cephfs-1 as rank 0
> 2018-07-12 11:56:35.947801 mds.ds27 [ERR] Error recovering journal 0x200:
> (5) Input/output error
> 2018-07-12 11:56:36.900807 mon.ds26 [ERR] Health check failed: 1 mds daemon
> damaged (MDS_DAMAGE)
> 2018-07-12 11:56:35.945544 osd.0 [ERR] 2.4 full-object read crc 0x6fc2f65a
> != expected 0x1c08241c on 2:292cf221:::200.:head
> 2018-07-12 12:00:00.000142 mon.ds26 [ERR] overall HEALTH_ERR 1 filesystem is
> degraded; 1 mds daemon damaged
>
> Tried to 'fail' mds.ds27
> # ceph mds fail ds27
> # failed mds gid 1929168
>
> Command worked, but each time I run the reset command the same errors above
> appear
>
> Online searches say the object read error has to be removed. But there's no
> object listed. This web page is the closest to the issue
> http://tracker.ceph.com/issues/20863
>
> Recommends fixing error by hand. Tried running deep scrub on pg 2.4, it
> completes but still have the same issue above
>
> Final option is to attempt removing mds.ds27. If mds.ds29 was a standby and
> has data it should become live. If it was not
> I assume we will lose the filesystem at this point
>
> Why didn't the standby MDS failover?
>
> Just looking for any way to recover the cephfs, thanks!

I think it's time to do a scrub on the PG containing that object.
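A possible sequence, assuming the affected PG really is 2.4 as in the
log above:

  ceph pg deep-scrub 2.4
  rados list-inconsistent-obj 2.4 --format=json-pretty   # after the scrub finishes
  ceph pg repair 2.4
  ceph health detail

Repair rewrites the bad copy from a healthy replica, so only run it
once you're satisfied the other replicas are good.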

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Mimic (13.2.0) Release Notes Bug on CephFS Snapshot Upgrades

2018-06-07 Thread Patrick Donnelly
There was a bug [1] in the release notes [2] which had incorrect
commands for upgrading the snapshot format of an existing CephFS file
system which has had snapshots enabled at some point. The correction
is here [3]:

diff --git a/doc/releases/mimic.rst b/doc/releases/mimic.rst
index 137d56311c..3a3345bbc0 100644
--- a/doc/releases/mimic.rst
+++ b/doc/releases/mimic.rst
@@ -346,8 +346,8 @@ These changes occurred between the Luminous and
Mimic releases.
 previous max_mds" step in above URL to fail. To re-enable the feature,
 either delete all old snapshots or scrub the whole filesystem:

-  - ``ceph daemon  scrub_path /``
-  - ``ceph daemon  scrub_path '~mdsdir'``
+  - ``ceph daemon  scrub_path / force recursive repair``
+  - ``ceph daemon  scrub_path '~mdsdir' force
recursive repair``

   - Support has been added in Mimic for quotas in the Linux kernel
client as of v4.17.


The release notes on the blog have already been updated.

If you executed the wrong commands already, it should be sufficient to
run the correct commands once more to fix the file system.

[1] https://tracker.ceph.com/issues/24435
[2] https://ceph.com/releases/v13-2-0-mimic-released/
[3] https://github.com/ceph/ceph/pull/22445/files

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Patrick Donnelly
On Fri, May 25, 2018 at 6:46 AM, Oliver Freyermuth
<freyerm...@physik.uni-bonn.de> wrote:
>> It might be possible to allow rename(2) to proceed in cases where
>> nlink==1, but the behavior will probably seem inconsistent (some files get
>> EXDEV, some don't).
>
> I believe even this would be extremely helpful, performance-wise. At least in 
> our case, hardlinks are seldomly used,
> it's more about data movement between user, group and scratch areas.
> For files with nlinks>1, it's more or less expected a copy has to be 
> performed when crossing quota boundaries (I think).

It may be possible to allow the rename in the MDS and check quotas
there. I've filed a tracker ticket here:
http://tracker.ceph.com/issues/24305


-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] (yet another) multi active mds advise needed

2018-05-18 Thread Patrick Donnelly
Hello Webert,

On Fri, May 18, 2018 at 1:10 PM, Webert de Souza Lima
<webert.b...@gmail.com> wrote:
> Hi,
>
> We're migrating from a Jewel / filestore based cephfs archicture to a
> Luminous / buestore based one.
>
> One MUST HAVE is multiple Active MDS daemons. I'm still lacking knowledge of
> how it actually works.
> After reading the docs and ML we learned that they work by sort of dividing
> the responsibilities, each with his own and only directory subtree. (please
> correct me if I'm wrong).

Each MDS may have multiple subtrees they are authoritative for. Each
MDS may also replicate metadata from another MDS as a form of load
balancing.

> Question 1: I'd like to know if it is viable to have 4 MDS daemons, being 3
> Active and 1 Standby (or Standby-Replay if that's still possible with
> multi-mds).

Standby-replay daemons are not available to take over for ranks other
than the one they follow. So, you would want to have a standby-replay
daemon for each rank, or just have normal standbys. The right choice
will likely depend on the size of your MDS (cache size) and available
hardware.

> Basically, what we have is 2 subtrees used by dovecot: INDEX and MAIL.
> Their tree is almost identical but INDEX stores all dovecot metadata with
> heavy IO going on and MAIL stores actual email files, with much more writes
> than reads.
>
> I don't know by now which one could bottleneck the MDS servers most so I
> wonder if I can take metrics on MDS usage per pool when it's deployed.
> Question 2: If the metadata workloads are very different I wonder if I can
> isolate them, like pinning MDS servers X and Y to one of the directories.

It's best if you see if the normal balancer (especially in v12.2.6
[1]) can handle the load for you without trying to micromanage things
via pins. You can use pinning to isolate metadata load from other
ranks as a stop-gap measure.

[1] https://github.com/ceph/ceph/pull/21412
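If you do end up pinning, it is a one-liner per directory (the paths
and ranks are assumptions based on your layout):

  setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/INDEX
  setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/MAIL

A value of -1 removes the pin and hands the subtree back to the
balancer.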

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Too many active mds servers

2018-05-15 Thread Patrick Donnelly
Hello Thomas,

On Tue, May 15, 2018 at 2:35 PM, Thomas Bennett <tho...@ska.ac.za> wrote:
> Hi,
>
> I'm running Luminous 12.2.5 and I'm testing cephfs.
>
> However, I seem to have too many active mds servers on my test cluster.
>
> How do I set one of my mds servers to become standby?
>
> I've run ceph fs set cephfs max_mds 2 which set the max_mds from 3 to 2 but
> has no effect on my running configuration.

http://docs.ceph.com/docs/luminous/cephfs/multimds/#decreasing-the-number-of-ranks

Note: the behavior is changing in Mimic to be automatic after reducing max_mds.
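On Luminous the sequence is roughly (the file system name is an
assumption):

  ceph fs set cephfs max_mds 2
  ceph mds deactivate cephfs:2   # stop the highest rank
  ceph status                    # the daemon should drop back to standby

On Mimic and later, lowering max_mds alone is enough.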

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-10 Thread Patrick Donnelly
On Thu, May 10, 2018 at 12:00 PM, Brady Deetz <bde...@gmail.com> wrote:
> [ceph-admin@mds0 ~]$ ps aux | grep ceph-mds
> ceph1841  3.5 94.3 133703308 124425384 ? Ssl  Apr04 1808:32
> /usr/bin/ceph-mds -f --cluster ceph --id mds0 --setuser ceph --setgroup ceph
>
>
> [ceph-admin@mds0 ~]$ sudo ceph daemon mds.mds0 cache status
> {
> "pool": {
> "items": 173261056,
> "bytes": 76504108600
> }
> }
>
> So, 80GB is my configured limit for the cache and it appears the mds is
> following that limit. But, the mds process is using over 100GB RAM in my
> 128GB host. I thought I was playing it safe by configuring at 80. What other
> things consume a lot of RAM for this process?
>
> Let me know if I need to create a new thread.

The cache size measurement is imprecise pre-12.2.5 [1]. You should upgrade ASAP.

[1] https://tracker.ceph.com/issues/22972

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-10 Thread Patrick Donnelly
Hello Brady,

On Thu, May 10, 2018 at 7:35 AM, Brady Deetz <bde...@gmail.com> wrote:
> I am now seeing the exact same issues you are reporting. A heap release did
> nothing for me.

I'm not sure it's the same issue...

> [root@mds0 ~]# ceph daemon mds.mds0 config get mds_cache_memory_limit
> {
> "mds_cache_memory_limit": "80530636800"
> }

80G right? What was the memory use from `ps aux | grep ceph-mds`?

> [root@mds0 ~]# ceph daemon mds.mds0 perf dump
> {
> ...
> "inode_max": 2147483647,
> "inodes": 35853368,
> "inodes_top": 23669670,
> "inodes_bottom": 12165298,
> "inodes_pin_tail": 18400,
> "inodes_pinned": 2039553,
> "inodes_expired": 142389542,
> "inodes_with_caps": 831824,
> "caps": 881384,

Your cap count is 2% of the inodes in cache; the pinned inodes are 5%
of the total. Your cache should be getting trimmed, assuming the cache
size (as measured by the MDS; there are fixes in 12.2.5 which improve
its precision) is larger than your configured limit.

If the cache size is larger than the limit (use `cache status` admin
socket command) then we'd be interested in seeing a few seconds of the
MDS debug log with higher debugging set (`config set debug_mds 20`).
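Concretely, something along these lines (daemon id taken from your
paste):

  ceph daemon mds.mds0 cache status
  ceph daemon mds.mds0 config set debug_mds 20
  # wait a few seconds, grab /var/log/ceph/ceph-mds.mds0.log, then:
  ceph daemon mds.mds0 config set debug_mds 1/5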

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-04-30 Thread Patrick Donnelly
Hello Sean,

On Mon, Apr 30, 2018 at 2:32 PM, Sean Sullivan <lookcr...@gmail.com> wrote:
> I was creating a new user and mount point. On another hardware node I
> mounted CephFS as admin to mount as root. I created /aufstest and then
> unmounted. From there it seems that both of my mds nodes crashed for some
> reason and I can't start them any more.
>
> https://pastebin.com/1ZgkL9fa -- my mds log
>
> I have never had this happen in my tests so now I have live data here. If
> anyone can lend a hand or point me in the right direction while
> troubleshooting that would be a godsend!

Thanks for keeping the list apprised of your efforts. Since this is so
easily reproduced for you, I would suggest that you next get higher
debug logs (debug_mds=20/debug_ms=1) from the MDS. And, since this is
a segmentation fault, a backtrace with debug symbols from gdb would
also be helpful.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-MDS Failover

2018-04-27 Thread Patrick Donnelly
On Thu, Apr 26, 2018 at 7:04 PM, Scottix <scot...@gmail.com> wrote:
> Ok let me try to explain this better, we are doing this back and forth and
> its not going anywhere. I'll just be as genuine as I can and explain the
> issue.
>
> What we are testing is a critical failure scenario and actually more of a
> real world scenario. Basically just what happens when it is 1AM and the shit
> hits the fan, half of your servers are down and 1 of the 3 MDS boxes are
> still alive.
> There is one very important fact that happens with CephFS and when the
> single Active MDS server fails. It is guaranteed 100% all IO is blocked. No
> split-brain, no corrupted data, 100% guaranteed ever since we started using
> CephFS
>
>
> Now with multi_mds, I understand this changes the logic and I understand how
> difficult and how hard this problem is, trust me I would not be able to
> tackle this. Basically I need to answer the question; what happens when 1 of
> 2 multi_mds fails with no standbys ready to come save them?
> What I have tested is not the same of a single active MDS; this absolutely
> changes the logic of what happens and how we troubleshoot. The CephFS is
> still alive and it does allow operations and does allow resources to go
> through. How, why and what is affected are very relevant questions if this
> is what the failure looks like since it is not 100% blocking.

Okay, so now I understand what your real question is: what is the
state of CephFS when one or more ranks have failed but no standbys
exist to take over? The answer is that there may be partial
availability from the up:active ranks, which may hand out capabilities
for the subtrees they manage, or no availability if that's not
possible because they cannot obtain the necessary locks. No metadata
is lost. No inconsistency is created between clients. Full
availability will be restored when the lost ranks come back online.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-MDS Failover

2018-04-26 Thread Patrick Donnelly
On Thu, Apr 26, 2018 at 4:40 PM, Scottix <scot...@gmail.com> wrote:
>> Of course -- the mons can't tell the difference!
> That is really unfortunate, it would be nice to know if the filesystem has
> been degraded and to what degree.

If a rank is laggy/crashed, the file system as a whole is generally
unavailable. The span between partial outage and full is small and not
worth quantifying.

>> You must have standbys for high availability. This is the docs.
> Ok but what if you have your standby go down and a master go down. This
> could happen in the real world and is a valid error scenario.
>Also there is
> a period between when the standby becomes active what happens in-between
> that time?

The standby MDS goes through a series of states where it recovers the
lost state and connections with clients. Finally, it goes active.

>> It depends(tm) on how the metadata is distributed and what locks are
> held by each MDS.
> Your saying depending on which mds had a lock on a resource it will block
> that particular POSIX operation? Can you clarify a little bit?
>
>> Standbys are not optional in any production cluster.
> Of course in production I would hope people have standbys but in theory
> there is no enforcement in Ceph for this other than a warning. So when you
> say not optional that is not exactly true it will still run.

It's self-defeating to expect CephFS to enforce having standbys --
presumably by throwing an error or becoming unavailable -- when the
standbys exist to make the system available.

There's nothing to enforce. A warning is sufficient for the operator
that (a) they didn't configure any standbys or (b) MDS daemon
processes/boxes are going away and not coming back as standbys (i.e.
the pool of MDS daemons is decreasing with each failover)

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-MDS Failover

2018-04-26 Thread Patrick Donnelly
On Thu, Apr 26, 2018 at 3:16 PM, Scottix <scot...@gmail.com> wrote:
> Updated to 12.2.5
>
> We are starting to test multi_mds cephfs and we are going through some
> failure scenarios in our test cluster.
>
> We are simulating a power failure to one machine and we are getting mixed
> results of what happens to the file system.
>
> This is the status of the mds once we simulate the power loss considering
> there are no more standbys.
>
> mds: cephfs-2/2/2 up
> {0=CephDeploy100=up:active,1=TigoMDS100=up:active(laggy or crashed)}
>
> 1. It is a little unclear if it is laggy or really is down, using this line
> alone.

Of course -- the mons can't tell the difference!

> 2. The first time we lost total access to ceph folder and just blocked i/o

You must have standbys for high availability. This is the docs.

> 3. One time we were still able to access ceph folder and everything seems to
> be running.

It depends(tm) on how the metadata is distributed and what locks are
held by each MDS.

Standbys are not optional in any production cluster.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs luminous 12.2.4 - multi-active MDSes with manual pinning

2018-04-24 Thread Patrick Donnelly
Hello Linh,

On Tue, Apr 24, 2018 at 12:34 AM, Linh Vu <v...@unimelb.edu.au> wrote:
> However, on our production cluster, with more powerful MDSes (10 cores
> 3.4GHz, 256GB RAM, much faster networking), I get this in the logs
> constantly:
>
> 2018-04-24 16:29:21.998261 7f02d1af9700  0 mds.1.migrator nicely exporting
> to mds.0 [dir 0x110cd91.1110* /home/ [2,head] auth{0=1017} v=5632699
> cv=5632651/5632651 dir_auth=1 state=1611923458|complete|auxsubtree f(v84
> 55=0+55) n(v245771 rc2018-04-24 16:28:32.830971 b233439385711
> 423085=383063+40022) hs=55+0,ss=0+0 dirty=1 | child=1 frozen=0 subtree=1
> replicated=1 dirty=1 authpin=0 0x55691ccf1c00]
>
> To clarify, /home is pinned to mds.1, so there is no reason it should export
> this to mds.0, and the loads on both MDSes (req/s, network load, CPU load)
> are fairly low, lower than those on the test MDS VMs.

As Dan said, this is simply a spurious log message. Nothing is being
exported. This will be fixed in 12.2.6 as part of several fixes to the
load balancer:

https://github.com/ceph/ceph/pull/21412/commits/cace918dd044b979cd0d54b16a6296094c8a9f90

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster degraded after Ceph Upgrade 12.2.1 => 12.2.2

2018-04-12 Thread Patrick Donnelly
On Thu, Apr 12, 2018 at 5:05 AM, Mark Schouten <m...@tuxis.nl> wrote:
> On Wed, 2018-04-11 at 17:10 -0700, Patrick Donnelly wrote:
>> No longer recommended. See:
>> http://docs.ceph.com/docs/master/cephfs/upgrading/#upgrading-the-mds-
>> cluster
>
> Shouldn't docs.ceph.com/docs/luminous/cephfs/upgrading include that
> too?

The backport is in-progress: https://github.com/ceph/ceph/pull/21352

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster degraded after Ceph Upgrade 12.2.1 => 12.2.2

2018-04-11 Thread Patrick Donnelly
Hello Ronny,

On Wed, Apr 11, 2018 at 10:25 AM, Ronny Aasen <ronny+ceph-us...@aasen.cx> wrote:
> mds: restart mds's one at the time. you will notice the standby mds taking
> over for the mds that was restarted. do both.

No longer recommended. See:
http://docs.ceph.com/docs/master/cephfs/upgrading/#upgrading-the-mds-cluster

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs snapshot format upgrade

2018-04-10 Thread Patrick Donnelly
On Tue, Apr 10, 2018 at 5:54 AM, John Spray <jsp...@redhat.com> wrote:
> On Tue, Apr 10, 2018 at 1:44 PM, Yan, Zheng <uker...@gmail.com> wrote:
>> Hello
>>
>> To simplify snapshot handling in multiple active mds setup, we changed
>> format of snaprealm in mimic dev.
>> https://github.com/ceph/ceph/pull/16779.
>>
>> The new version mds can handle old format snaprealm in single active
>> setup. It also can convert old format snaprealm to the new format when
>> snaprealm is modified. The problem is that new version mds can not
>> properly handle old format snaprealm in multiple active setup. It may
>> crash when it encounter old format snaprealm. For existing filesystem
>> with snapshots, upgrading mds to mimic seems to be no problem at first
>> glance. But if user later enables multiple active mds,  mds may
>> crashes continuously. No easy way to switch back to single acitve mds.
>>
>> I don't have clear idea how to handle this situation. I can think of a
>> few options.
>>
>> 1. Forbid multiple active before all old snapshots are deleted or
>> before all snaprealms are converted to new format. Format conversion
>> requires traversing while whole filesystem tree.  Not easy to
>> implement.
>
> This has been a general problem with metadata format changes: we can
> never know if all the metadata in a filesystem has been brought up to
> a particular version.  Scrubbing (where scrub does the updates) should
> be the answer, but we don't have the mechanism for recording/ensuring
> the scrub has really happened.
>
> Maybe we need the MDS to be able to report a complete whole-filesystem
> scrub to the monitor, and record a field like
> "latest_scrubbed_version" in FSMap, so that we can be sure that all
> the filesystem metadata has been brought up to a certain version
> before enabling certain features?  So we'd then have a
> "latest_scrubbed_version >= mimic" test before enabling multiple
> active daemons.
>
> For this particular situation, we'd also need to protect against
> people who had enabled multimds and snapshots experimentally, with an
> MDS startup check like:
>  ((ever_allowed_features & CEPH_MDSMAP_ALLOW_SNAPS) &&
> (allows_multimds() || in.size() >1)) && latest_scrubbed_version <
> mimic

This sounds like the right approach to me. The mons should also be
capable of performing the same test and raising a health error
indicating that pre-Mimic MDSs must be started and the number of
actives reduced to 1.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs hardlink snapshot

2018-04-05 Thread Patrick Donnelly
Hi Marc,

On Wed, Apr 4, 2018 at 11:21 PM, Marc Roos <m.r...@f1-outsourcing.eu> wrote:
>
> 'Hard links do not interact well with snapshots' is this still an issue?
> Because I am using rsync and hardlinking. And it would be nice if I can
> snapshot the directory, instead of having to copy it.

Hardlink handling for snapshots will be in Mimic.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-fuse segfaults

2018-04-02 Thread Patrick Donnelly
Probably fixed by this: http://tracker.ceph.com/issues/17206

You need to upgrade your version of ceph-fuse.

On Mon, Apr 2, 2018 at 12:56 AM, Zhang Qiang <dotslash...@gmail.com> wrote:
> Hi,
>
> I'm using ceph-fuse 10.2.3 on CentOS 7.3.1611. ceph-fuse always
> segfaults after running for some time.
>
> *** Caught signal (Segmentation fault) **
>  in thread 7f455d832700 thread_name:ceph-fuse
>  ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>  1: (()+0x2a442a) [0x7f457208e42a]
>  2: (()+0xf5e0) [0x7f4570b895e0]
>  3: (Client::get_root_ino()+0x10) [0x7f4571f86a20]
>  4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x18d)
> [0x7f4571f844bd]
>  5: (()+0x19ae21) [0x7f4571f84e21]
>  6: (()+0x164b5) [0x7f457199e4b5]
>  7: (()+0x16bdb) [0x7f457199ebdb]
>  8: (()+0x13471) [0x7f457199b471]
>  9: (()+0x7e25) [0x7f4570b81e25]
>  10: (clone()+0x6d) [0x7f456fa6934d]
>
> Detailed events dump:
> https://drive.google.com/file/d/0B_4ESJRu7BZFcHZmdkYtVG5CTGQ3UVFod0NxQloxS0ZCZmQ0/view?usp=sharing
> Let me know if more info is needed.
>
> Thanks.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is it possible to suggest the active MDS to move to a datacenter ?

2018-03-29 Thread Patrick Donnelly
On Thu, Mar 29, 2018 at 1:02 PM, Nicolas Huillard <nhuill...@dolomede.fr> wrote:
> Hi,
>
> I manage my 2 datacenters with Pacemaker and Booth. One of them is the
> publicly-known one, thanks to Booth.
> Whatever the "public datacenter", Ceph is a single storage cluster.
> Since most of the cephfs traffic come from this "public datacenter",
> I'd like to suggest or force the active MDS to move to the same
> datacenter, hoping to reduce trafic on the inter-datacenter link, and
> reduce cephfs metadata operations latency.
>
> Is it possible to forcefully move the active MDS using external
> triggers?

No and it probably wouldn't be beneficial. The MDS still needs to talk
to the metadata/data pools and increasing the latency between the MDS
and the OSDs will probably do more harm.

One possibility for helping your situation is to put NFS-Ganesha in
the public datacenter as a gateway to CephFS. This may help with your
performance by (a) sharing a larger cache among multiple clients and
(b) reducing capability conflicts between clients thereby resulting in
less metadata traffic with the MDS. Be aware an HA solution doesn't
yet exist for NFS-Ganesha+CephFS outside of OpenStack Queens
deployments.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-03-27 Thread Patrick Donnelly
Hello Alexandre,

On Thu, Mar 22, 2018 at 2:29 AM, Alexandre DERUMIER <aderum...@odiso.com> wrote:
> Hi,
>
> I've been running cephfs for 2 months now,
>
> and my active mds memory usage is around 20G now (still growing).
>
> ceph 1521539 10.8 31.2 20929836 20534868 ?   Ssl  janv.26 8573:34 
> /usr/bin/ceph-mds -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
> USER PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
>
>
> this is on luminous 12.2.2
>
> only tuning done is:
>
> mds_cache_memory_limit = 5368709120
>
>
> (5GB). I know it's a soft limit, but 20G seems quite huge vs 5GB
>
>
> Is it normal ?

No, that's definitely not normal!


> # ceph daemon mds.2 perf dump mds
> {
> "mds": {
> "request": 1444009197,
> "reply": 1443999870,
> "reply_latency": {
> "avgcount": 1443999870,
> "sum": 1657849.656122933,
> "avgtime": 0.001148095
> },
> "forward": 0,
> "dir_fetch": 51740910,
> "dir_commit": 9069568,
> "dir_split": 64367,
> "dir_merge": 58016,
> "inode_max": 2147483647,
> "inodes": 2042975,
> "inodes_top": 152783,
> "inodes_bottom": 138781,
> "inodes_pin_tail": 1751411,
> "inodes_pinned": 1824714,
> "inodes_expired": 7258145573,
> "inodes_with_caps": 1812018,
> "caps": 2538233,
> "subtrees": 2,
> "traverse": 1591668547,
> "traverse_hit": 1259482170,
> "traverse_forward": 0,
> "traverse_discover": 0,
> "traverse_dir_fetch": 30827836,
> "traverse_remote_ino": 7510,
>     "traverse_lock": 86236,
> "load_cent": 144401980319,
> "q": 49,
> "exported": 0,
> "exported_inodes": 0,
> "imported": 0,
> "imported_inodes": 0
> }
> }

Can you also share `ceph daemon mds.2 cache status`, the full `ceph
daemon mds.2 perf dump`, and `ceph status`?

Note [1] will be in 12.2.5 and may help with your issue.

[1] https://github.com/ceph/ceph/pull/20527

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs and number of clients

2018-03-20 Thread Patrick Donnelly
On Tue, Mar 20, 2018 at 3:27 AM, James Poole <james.po...@fasthosts.com> wrote:
> I have a query regarding cephfs and preferred number of clients. We are
> currently using luminous cephfs to support storage for a number of web
> servers. We have one file system split into folders, example:
>
> /vol1
> /vol2
> /vol3
> /vol4
>
> At the moment the root of the cephfs filesystem is mounted to each web
> server. The query is: would there be a benefit to having separate mount
> points for each folder, like above?

Performance benefit? No. Data isolation benefit? Sure.
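
If you do want the isolation, a rough sketch (assuming the default filesystem
name "cephfs" and placeholder client/mount names) is to give each web server a
key restricted to its own folder and mount only that folder:

    # key that can only reach /vol1, then a mount of just that subtree
    ceph fs authorize cephfs client.vol1 /vol1 rw
    mount -t ceph mon1:6789:/vol1 /mnt/vol1 -o name=vol1,secretfile=/etc/ceph/vol1.secret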

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rctime not tracking inode ctime

2018-03-14 Thread Patrick Donnelly
On Wed, Mar 14, 2018 at 9:22 AM, Dan van der Ster <d...@vanderster.com> wrote:
> Hi all,
>
> On our luminous v12.2.4 ceph-fuse clients / mds the rctime is not
> tracking the latest inode ctime, but only the latest directory ctimes.
>
> Initial empty dir:
>
> # getfattr -d -m ceph . | egrep 'bytes|ctime'
> ceph.dir.rbytes="0"
> ceph.dir.rctime="1521043742.09466372697"
>
> Create a file, rctime is updated:
>
> # touch a
> # getfattr -d -m ceph . | egrep 'bytes|ctime'
> ceph.dir.rbytes="0"
> ceph.dir.rctime="1521043831.0921836283"
>
> Modify a file, rbytes is updated but not rctime:
>
> # echo hello > a
> # getfattr -d -m ceph . | egrep 'bytes|ctime'
> ceph.dir.rbytes="6"
> ceph.dir.rctime="1521043831.0921836283"
>
> Modify the dir, rctime is updated:
>
> # touch b
> # getfattr -d -m ceph . | egrep 'bytes|ctime'
> ceph.dir.rbytes="6"
> ceph.dir.rctime="1521043861.09597651370"
>
> Do others see the same rctime behaviour? Is this how it's supposed to work?

It appears rctime is meant to reflect changes to directory inodes.
Traditionally, modifying a file (truncate, write) does not involve
metadata changes to a directory inode.

Whether that is the intended behavior is a good question. Perhaps it
should be changed?

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Updating standby mds from 12.2.2 to 12.2.4 caused up:active 12.2.2 mds's to suicide

2018-03-14 Thread Patrick Donnelly
On Wed, Mar 14, 2018 at 5:48 AM, Lars Marowsky-Bree <l...@suse.com> wrote:
> On 2018-02-28T02:38:34, Patrick Donnelly <pdonn...@redhat.com> wrote:
>
>> I think it will be necessary to reduce the actives to 1 (max_mds -> 1;
>> deactivate other ranks), shutdown standbys, upgrade the single active,
>> then upgrade/start the standbys.
>>
>> Unfortunately this didn't get flagged in upgrade testing. Thanks for
>> the report Dan.
>
> This means that - when the single active is being updated - there's a
> time when there's no MDS active, right?

Yes. But the real outcome is not "no MDS [is] active" but "some or all
metadata I/O will pause" -- and there is no avoiding that. During an
MDS upgrade, a standby must take over the MDS being shutdown (and
upgraded).  During takeover, metadata I/O will briefly pause as the
rank is unavailable. (Specifically, no other rank can obtains locks or
communicate with the "failed" rank; so metadata I/O will necessarily
pause until a standby takes over.) Single active vs. multiple active
upgrade makes little difference in this outcome.

> Is another approach theoretically feasible? Have the updated MDS only go
> into the incompatible mode once there's a quorum of new ones available,
> or something?

I believe so, yes. That option wasn't explored for this patch because
it was just disambiguating the compatibility flags and the full
side-effects weren't realized.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Don't use ceph mds set max_mds

2018-03-07 Thread Patrick Donnelly
On Wed, Mar 7, 2018 at 5:29 AM, John Spray <jsp...@redhat.com> wrote:
> On Wed, Mar 7, 2018 at 10:11 AM, Dan van der Ster <d...@vanderster.com> wrote:
>> Hi all,
>>
>> What is the purpose of
>>
>>    ceph mds set max_mds <int>
>>
>> ?
>>
>> We just used that by mistake on a cephfs cluster when attempting to
>> decrease from 2 to 1 active mds's.
>>
>> The correct command to do this is of course
>>
>>   ceph fs set <fs_name> max_mds <int>
>>
>> So, is `ceph mds set max_mds` useful for something? If not, should it
>> be removed from the CLI?
>
> It's the legacy version of the command from before we had multiple
> filesystems.  Those commands are marked as obsolete internally so that
> they're not included in the --help output, but they're still handled
> (applied to the "default" filesystem) if called.
>
> The multi-fs stuff went in for Jewel, so maybe we should think about
> removing the old commands in Mimic: any thoughts Patrick?

These commands have already been removed (obsoleted) in master/Mimic.
You can no longer use them. In Luminous, the commands are deprecated
(basically, omitted from --help).

See also: https://tracker.ceph.com/issues/20596

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-Fuse and mount namespaces

2018-02-28 Thread Patrick Donnelly
On Tue, Feb 27, 2018 at 3:27 PM, Oliver Freyermuth
<freyerm...@physik.uni-bonn.de> wrote:
> As you can see:
> - Name collision for admin socket, since the helper is already running.

You can change the admin socket path using the `admin socket` config
variable. Use metavariables [1] to make the path unique.
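
For example, a ceph.conf sketch using metavariables so every helper gets its
own socket (the exact path is only illustrative):

    [client]
        admin socket = /var/run/ceph/$cluster-$name.$pid.asok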

> - A second helper for the same mountpoint was fired up!

This is expected. If you want a single ceph-fuse mount then you need
to persist the mount in the host namespace somewhere (using bind
mounts) so you can reuse it. However, mind what David Turner said
regarding using a single ceph-fuse client for multiple containers.
Right now parallel requests are not handled well in the client so it
can be slow for multiple applications (or containers). Another option
is to use a kernel mount which would be more performant and also allow
parallel requests.
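
A rough sketch of the bind-mount approach (the paths and container layout are
assumptions, not a recommendation over the kernel client):

    # on the host: mount CephFS once with ceph-fuse
    ceph-fuse --id admin /mnt/cephfs
    # expose the existing mount inside the container's filesystem tree
    mkdir -p /srv/container-root/mnt/cephfs
    mount --bind /mnt/cephfs /srv/container-root/mnt/cephfs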

> - On a side-note, once I exit the container (and hence close the mount 
> namespace), the "old" helper is finally freed.

Once the last mount point is unmounted, FUSE will destroy the userspace helper.

[1] 
http://docs.ceph.com/docs/master/rados/configuration/ceph-conf/?highlight=configuration#metavariables

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Updating standby mds from 12.2.2 to 12.2.4 caused up:active 12.2.2 mds's to suicide

2018-02-28 Thread Patrick Donnelly
On Wed, Feb 28, 2018 at 2:07 AM, Dan van der Ster <d...@vanderster.com> wrote:
> (Sorry to spam)
>
> I guess it's related to this fix to the layout v2 feature id:
> https://github.com/ceph/ceph/pull/18782/files
>
> -#define MDS_FEATURE_INCOMPAT_FILE_LAYOUT_V2 CompatSet::Feature(8,
> "file layout v2")
> +#define MDS_FEATURE_INCOMPAT_FILE_LAYOUT_V2 CompatSet::Feature(9,
> "file layout v2")

Yes, this looks to be the issue.

> Is there a way to update from 12.2.2 without causing the other active
> MDS's to suicide?

I think it will be necessary to reduce the actives to 1 (max_mds -> 1;
deactivate other ranks), shutdown standbys, upgrade the single active,
then upgrade/start the standbys.
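
Roughly, for a Luminous cluster (filesystem name and daemon ids are
placeholders):

    ceph fs set <fs_name> max_mds 1
    ceph mds deactivate <fs_name>:1      # repeat for each rank > 0
    systemctl stop ceph-mds@<standby>    # on every standby host
    # upgrade and restart the remaining active, then upgrade and start the standbys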

Unfortunately this didn't get flagged in upgrade testing. Thanks for
the report Dan.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Storage usage of CephFS-MDS

2018-02-26 Thread Patrick Donnelly
On Mon, Feb 26, 2018 at 7:59 AM, Patrick Donnelly <pdonn...@redhat.com> wrote:
> It seems in the above test you're using about 1KB per inode (file).
> Using that you can extrapolate how much space the data pool needs

s/data pool/metadata pool/

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Storage usage of CephFS-MDS

2018-02-26 Thread Patrick Donnelly
On Sun, Feb 25, 2018 at 10:26 AM, Oliver Freyermuth
<freyerm...@physik.uni-bonn.de> wrote:
> Looking with:
> ceph daemon osd.2 perf dump
> I get:
> "bluefs": {
> "gift_bytes": 0,
> "reclaim_bytes": 0,
> "db_total_bytes": 84760592384,
> "db_used_bytes": 78920024064,
> "wal_total_bytes": 0,
> "wal_used_bytes": 0,
> "slow_total_bytes": 0,
> "slow_used_bytes": 0,
> so it seems this is almost exclusively RocksDB usage.
>
> Is this expected?

Yes. The directory entries are stored in the omap of the objects. This
will be stored in the RocksDB backend of Bluestore.

> Is there a recommendation on how much MDS storage is needed for a CephFS with 
> 450 TB?

It seems in the above test you're using about 1KB per inode (file).
Using that you can extrapolate how much space the data pool needs
based on your file system usage. (If all you're doing is filling the
file system with empty files, of course you're going to need an
unusually large metadata pool.)
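
As a rough back-of-the-envelope sketch, assuming that ~1KB/inode figure holds
for your workload:

    100,000,000 files x ~1 KB/inode  ~= 100 GB of metadata (per replica)
    100 GB x 3 replicas              ~= 300 GB raw in the metadata pool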

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS very unstable with many small files

2018-02-26 Thread Patrick Donnelly
On Sun, Feb 25, 2018 at 3:49 PM, Oliver Freyermuth
<freyerm...@physik.uni-bonn.de> wrote:
> Am 25.02.2018 um 21:50 schrieb John Spray:
>> On Sun, Feb 25, 2018 at 4:45 PM, Oliver Freyermuth
>>> Now, with about 100,000,000 objects written, we are in a disaster situation.
>>> First off, the MDS could not restart anymore - it required >40 GB of 
>>> memory, which (together with the 2 OSDs on the MDS host) exceeded RAM and 
>>> swap.
>>> So it tried to recover and OOMed quickly after. Replay was reasonably fast, 
>>> but join took many minutes:
>>> 2018-02-25 04:16:02.299107 7fe20ce1f700  1 mds.0.17657 rejoin_start
>>> 2018-02-25 04:19:00.618514 7fe20ce1f700  1 mds.0.17657 rejoin_joint_start
>>> and finally, 5 minutes later, OOM.
>>>
>>> I stopped half of the stress-test tar's, which did not help - then I 
>>> rebooted half of the clients, which did help and let the MDS recover just 
>>> fine.
>>> So it seems the client caps have been too many for the MDS to handle. I'm 
>>> unsure why "tar" would cause so many open file handles.
>>> Is there anything that can be configured to prevent this from happening?
>>
>> Clients will generally hold onto capabilities for files they've
>> written out -- this is pretty sub-optimal for many workloads where
>> files are written out but not likely to be accessed again in the near
>> future.  While clients hold these capabilities, the MDS cannot drop
>> things from its own cache.
>>
>> The way this is *meant* to work is that the MDS hits its cache size
>> limit, and sends a message to clients asking them to drop some files
>> from their local cache, and consequently release those capabilities.
>> However, this has historically been a tricky area with ceph-fuse
>> clients (there are some hacks for detecting kernel version and using
>> different mechanisms for different versions of fuse), and it's
>> possible that on your clients this mechanism is simply not working,
>> leading to a severely oversized MDS cache.
>>
>> The MDS should have been showing health alerts in "ceph status" about
>> this, but I suppose it's possible that it wasn't surviving long enough
>> to hit the timeout (60s) that we apply for warning about misbehaving
>> clients?  It would be good to check the cluster log to see if you were
>> getting any health messages along the lines of "Client xyz failing to
>> respond to cache pressure".
>
> This explains the high memory usage indeed.
> I can also confirm seeing those health alerts, now that I check the logs.
> The systems have been (servers and clients) all exclusively CentOS 7.4,
> so kernels are rather old, but I would have hoped things have been backported
> by RedHat.
>
> Is there anything one can do to limit client's cache sizes?

You said the clients are ceph-fuse running 12.2.3? Then they should have:

http://tracker.ceph.com/issues/22339

(Please double check you're not running older clients on accident.)

I have run small file tests with ~128 clients without issue. Generally
if there is an issue it is because clients are not releasing their
capabilities properly (due to invalidation bugs which should be caught
by the above backport) or the MDS memory usage exceeds RAM. If the
clients are not releasing their capabilities, you should see the
errors John described in the cluster log.

You said in the original post that the `mds cache memory limit = 4GB`.
If that's the case, you really shouldn't be exceeding 40GB of RAM!
It's possible you have found a bug of some kind. I suggest tracking
the MDS cache statistics (which includes the inode count in cache) by
collecting a `perf dump` via the admin socket. Then you can begin to
find out what's consuming all of the MDS memory.
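
A minimal sampling sketch (the daemon id is a placeholder; the counters are the
same ones that show up in the perf dumps quoted elsewhere on this list):

    while sleep 60; do
        ceph daemon mds.<name> perf dump | \
            jq -c '{rss: .mds_mem.rss, inodes: .mds.inodes, caps: .mds.caps}'
    done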

Additionally, I concur with John on digging into why the MDS is
missing heartbeats by collecting debug logs (`debug mds = 15`) at that
time. It may also shed light on the issue.

Thanks for performing the test and letting us know the results.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balanced MDS, all as active and recomended client settings.

2018-02-23 Thread Patrick Donnelly
On Fri, Feb 23, 2018 at 12:54 AM, Daniel Carrasco <d.carra...@i2tic.com> wrote:
>  client_permissions = false

Yes, this will potentially reduce checks against the MDS.

>   client_quota = false

This option no longer exists since Luminous; quota enforcement is no
longer optional. However, if you don't have any quotas then there is
no added load on the client/mds.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balanced MDS, all as active and recomended client settings.

2018-02-22 Thread Patrick Donnelly
On Wed, Feb 21, 2018 at 11:17 PM, Daniel Carrasco <d.carra...@i2tic.com> wrote:
> I also want to find out if there is any way to cache file metadata on the client,
> to lower the MDS load. I suppose that files are cached, but the client checks
> with the MDS whether files have changed. On my server, files are read-only most
> of the time, so MDS metadata could also be cached for a while.

The MDS issues capabilities that allow clients to coherently cache metadata.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balanced MDS, all as active and recomended client settings.

2018-02-21 Thread Patrick Donnelly
Hello Daniel,

On Wed, Feb 21, 2018 at 10:26 AM, Daniel Carrasco <d.carra...@i2tic.com> wrote:
> Is it possible to distribute the MDS load better across both nodes?

We are aware of bugs with the balancer which are being worked on. You
can also manually create a partition if the workload can benefit:

https://ceph.com/community/new-luminous-cephfs-subtree-pinning/
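
If you do pin manually, it is a single extended attribute on the directory
(the path and rank here are only placeholders):

    setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/some/subtree     # pin to rank 1
    setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/some/subtree    # back to normal balancing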

> Is it possible to set all nodes as active without problems?

No. I recommend you read the docs carefully:

http://docs.ceph.com/docs/master/cephfs/multimds/

> My last question is whether someone can recommend a good client configuration,
> like cache size, and maybe something to lower the metadata servers' load.

>>
>> ##
>> [mds]
>>  mds_cache_size = 25
>>  mds_cache_memory_limit = 792723456

You should only specify one of those. See also:

http://docs.ceph.com/docs/master/cephfs/cache-size-limits/
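
For example, keep only the memory-based limit from your snippet and drop
mds_cache_size entirely:

    [mds]
     mds_cache_memory_limit = 792723456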

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] client with uid

2018-02-06 Thread Patrick Donnelly
On Mon, Feb 5, 2018 at 9:08 AM, Keane Wolter <wolt...@umich.edu> wrote:
> Hi Patrick,
>
> Thanks for the info. Looking at the fuse options in the man page, I should
> be able to pass "-o uid=$(id -u)" at the end of the ceph-fuse command.
> However, when I do, it returns with an unknown option for fuse and
> segfaults. Any pointers would be greatly appreciated. This is the result I
> get:

I'm not familiar with that uid= option; you'll have to redirect that
question to the FUSE devs. (However, I don't think it does what you want
it to. It says it only hard-codes the st_uid field returned by stat.)

> daemoneye@wolterk:~$ ceph-fuse --id=kwolter_test1 -r /user/kwolter/
> /home/daemoneye/ceph/ --client-die-on-failed-remount=false -o uid=$(id -u)
> ceph-fuse[25156]: starting ceph client
> fuse: unknown option `uid=1000'
> ceph-fuse[25156]: fuse failed to start
> *** Caught signal (Segmentation fault) **
>  in thread 7efc7da86100 thread_name:ceph-fuse
>  ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous
> (stable)
>  1: (()+0x6a8784) [0x5583372d8784]
>  2: (()+0x12180) [0x7efc7bb4f180]
>  3: (Client::_ll_drop_pins()+0x67) [0x558336e5dea7]
>  4: (Client::unmount()+0x943) [0x558336e67323]
>  5: (main()+0x7ed) [0x558336e02b0d]
>  6: (__libc_start_main()+0xea) [0x7efc7a892f2a]
>  7: (_start()+0x2a) [0x558336e0b73a]
> ceph-fuse [25154]: (33) Numerical argument out of domain
> daemoneye@wolterk:~$

I wasn't able to reproduce this.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] client with uid

2018-01-25 Thread Patrick Donnelly
On Wed, Jan 24, 2018 at 7:47 AM, Keane Wolter <wolt...@umich.edu> wrote:
> Hello all,
>
> I was looking at the Client Config Reference page
> (http://docs.ceph.com/docs/master/cephfs/client-config-ref/) and there was
> mention of a flag --client_with_uid. The way I read it is that you can
> specify the UID of a user on a cephfs and the user mounting the filesystem
> will act as the same UID. I am using the flags --client_mount_uid and
> --client_mount_gid set equal to my UID and GID values on the cephfs when
> running ceph-fuse. Is this the correct action for the flags or am I
> misunderstanding the flags?

These options are no longer used (with the exception of some bugs
[1,2]). The uid/gid should be provided by FUSE so you don't need to do
anything. If you're using the client library, you provide the uid/gid
via the UserPerm struct to each operation.

[1] http://tracker.ceph.com/issues/22802
[2] http://tracker.ceph.com/issues/22801


-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] After Luminous upgrade: ceph-fuse clients failing to respond to cache pressure

2018-01-18 Thread Patrick Donnelly
Hi Andras,

On Thu, Jan 18, 2018 at 3:38 AM, Andras Pataki
<apat...@flatironinstitute.org> wrote:
> Hi John,
>
> Some other symptoms of the problem:  when the MDS has been running for a few
> days, it starts looking really busy.  At this time, listing directories
> becomes really slow.  An "ls -l" on a directory with about 250 entries takes
> about 2.5 seconds.  All the metadata is on OSDs with NVMe backing stores.
> Interestingly enough the memory usage seems pretty low (compared to the
> allowed cache limit).
>
>
> PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+
> COMMAND
> 1604408 ceph  20   0 3710304 2.387g  18360 S 100.0  0.9 757:06.92
> /usr/bin/ceph-mds -f --cluster ceph --id cephmon00 --setuser ceph --setgroup
> ceph
>
> Once I bounce it (fail it over), the CPU usage goes down to the 10-25%
> range.  The same ls -l after the bounce takes about 0.5 seconds.  I
> remounted the filesystem before each test to ensure there isn't anything
> cached.
>
> PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+
> COMMAND
>   00 ceph  20   0 6537052 5.864g  18500 S  17.6  2.3   9:23.55
> /usr/bin/ceph-mds -f --cluster ceph --id cephmon02 --setuser ceph --setgroup
> ceph
>
> Also, I have a crawler that crawls the file system periodically.  Normally
> the full crawl runs for about 24 hours, but with the MDS slowing down, it
> has now been running for more than 2 days and isn't close to finishing.
>
> The MDS related settings we are running with are:
>
> mds_cache_memory_limit = 17179869184
> mds_cache_reservation = 0.10

Debug logs from the MDS at that time would be helpful with `debug mds
= 20` and `debug ms = 1`. Feel free to create a tracker ticket and use
ceph-post-file [1] to share logs.

[1] http://docs.ceph.com/docs/hammer/man/8/ceph-post-file/
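
One way to raise the levels temporarily without a restart, using the daemon id
from your output (revert afterwards; level 20 logging is very verbose, and the
log path below assumes the defaults):

    ceph tell mds.cephmon00 injectargs '--debug_mds 20 --debug_ms 1'
    # reproduce the slow "ls -l", then grab /var/log/ceph/ceph-mds.cephmon00.log
    ceph tell mds.cephmon00 injectargs '--debug_mds 1/5 --debug_ms 0/5'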

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS cache size limits

2018-01-05 Thread Patrick Donnelly
On Fri, Jan 5, 2018 at 3:54 AM, Stefan Kooman <ste...@bit.nl> wrote:
> Quoting Patrick Donnelly (pdonn...@redhat.com):
>>
>> It's expected but not desired: http://tracker.ceph.com/issues/21402
>>
>> The memory usage tracking is off by a constant factor. I'd suggest
>> just lowering the limit so it's about where it should be for your
>> system.
>
> Thanks for the info. Yeah, we did exactly that (observe and adjust
> setting accordingly). Is this something worth
> mentioning in the documentation? Especially when this "factor" is a
> constant? Over time (with issue 21402 being worked on) things will
> change. Ceph operators will want to make use of as much cache as
> possible without overcommitting (MDS won't notice until there is no more
> memory left, restarts, and loses all its cache :/).

Yup: http://tracker.ceph.com/issues/22599

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS cache size limits

2018-01-04 Thread Patrick Donnelly
Hello Stefan,

On Thu, Jan 4, 2018 at 1:45 AM, Stefan Kooman <ste...@bit.nl> wrote:
> I have a question about the "mds_cache_memory_limit" parameter and MDS
> memory usage. We currently have set mds_cache_memory_limit=150G.
> The MDS server itself (and its active-standby) have 256 GB of RAM.
> Eventually the MDS process will consume ~ 87.5% of available memory.
> At that point it will trim its cache, confirmed with:
>
> while sleep 1; do ceph daemon mds.mds1 perf dump | jq '.mds_mem.rss'; ceph
> daemon mds.mds1 dump_mempools | jq -c '.mds_co'; done
>
> 1 cephfs kernel client (4.13.0-21-generic), Ceph 12.2.2.
>
> Anyways, it will consume roughly 1.5 times the amount of memory it is
> allowed to use according to mds_cache_memory_limit. Is this expected
> behaviour?

It's expected but not desired: http://tracker.ceph.com/issues/21402

The memory usage tracking is off by a constant factor. I'd suggest
just lowering the limit so it's about where it should be for your
system.
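
As a rough sketch using the numbers in this thread (observed usage ~1.5x the
limit): to keep the MDS near its current 150G footprint you would set the limit
to about 150G / 1.5, i.e. roughly 100G:

    mds_cache_memory_limit = 107374182400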

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs mds millions of caps

2017-12-14 Thread Patrick Donnelly
On Thu, Dec 14, 2017 at 4:44 PM, Webert de Souza Lima
<webert.b...@gmail.com> wrote:
> Hi Patrick,
>
> On Thu, Dec 14, 2017 at 7:52 PM, Patrick Donnelly <pdonn...@redhat.com>
> wrote:
>>
>>
>> It's likely you're a victim of a kernel backport that removed a dentry
>> invalidation mechanism for FUSE mounts. The result is that ceph-fuse
>> can't trim dentries.
>
>
> even though I'm not using FUSE? I'm using kernel mounts.
>
>
>>
>> I suggest setting that config manually to false on all of your clients
>
>
> Ok how do I do that?

I missed that you were using the kernel client. I agree with Zheng's analysis.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs mds millions of caps

2017-12-14 Thread Patrick Donnelly
On Thu, Dec 14, 2017 at 9:18 AM, Webert de Souza Lima
<webert.b...@gmail.com> wrote:
> So, questions: does that really matter? What are possible impacts? What
> could have caused these 2 hosts to hold so many capabilities?
> One of the hosts is for test purposes, traffic is close to zero. The other
> host wasn't using cephfs at all. All services stopped.

It's likely you're a victim of a kernel backport that removed a dentry
invalidation mechanism for FUSE mounts. The result is that ceph-fuse
can't trim dentries. We have a patch to turn off that particular
mechanism by default:

https://github.com/ceph/ceph/pull/17925

I suggest setting that config manually to false on all of your clients
and ensure each client can remount itself to trim dentries (i.e. it's
being run as root or with sufficient capabiltities) which is a
fallback mechanism.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS log jam prevention

2017-12-05 Thread Patrick Donnelly
On Tue, Dec 5, 2017 at 8:07 AM, Reed Dier <reed.d...@focusvq.com> wrote:
> Been trying to do a fairly large rsync onto a 3x replicated, filestore HDD
> backed CephFS pool.
>
> Luminous 12.2.1 for all daemons, kernel CephFS driver, Ubuntu 16.04 running
> mix of 4.8 and 4.10 kernels, 2x10GbE networking between all daemons and
> clients.

You should try a newer kernel client if possible since the MDS is
having trouble trimming its cache.

> HEALTH_ERR 1 MDSs report oversized cache; 1 MDSs have many clients failing
> to respond to cache pressure; 1 MDSs behind on trimming; noout,nodeep-scrub
> flag(s) set; application not enabled on 1 pool(s); 242 slow requests are
> blocked > 32 sec; 769378 stuck requests are blocked > 4096 sec
> MDS_CACHE_OVERSIZED 1 MDSs report oversized cache
> mdsdb(mds.0): MDS cache is too large (23GB/8GB); 1018 inodes in use by
> clients, 1 stray files
> MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to cache
> pressure
> mdsdb(mds.0): Many clients (37) failing to respond to cache
> pressure; client_count: 37
> MDS_TRIM 1 MDSs behind on trimming
> mdsdb(mds.0): Behind on trimming (36252/30); max_segments: 30,
> num_segments: 36252

See also: http://tracker.ceph.com/issues/21975

You can try doubling (several times if necessary) the MDS configs
`mds_log_max_segments` and `mds_log_max_expiring` to make it more
aggressively trim its journal. (That may not help since your OSD
requests are slow.)
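
A sketch of doubling both from their defaults at runtime, using the daemon id
from your health output (injectargs changes do not persist across restarts, so
mirror them in ceph.conf if they help):

    ceph tell mds.mdsdb injectargs '--mds_log_max_segments 60 --mds_log_max_expiring 40'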

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] strange error on link() for nfs over cephfs

2017-11-29 Thread Patrick Donnelly
On Wed, Nov 29, 2017 at 3:44 AM, Jens-U. Mozdzen <jmozd...@nde.ag> wrote:
> Hi *,
>
> we recently have switched to using CephFS (with Luminous 12.2.1). On one
> node, we're kernel-mounting the CephFS (kernel 4.4.75, openSUSE version) and
> export it via kernel nfsd. As we're transitioning right now, a number of
> machines still auto-mount users home directories from that nfsd.

You need to try a newer kernel as there have been many fixes since 4.4
which probably have not been backported to your distribution's kernel.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] luminous ceph-fuse crashes with "failed to remount for kernel dentry trimming"

2017-11-27 Thread Patrick Donnelly
Hello Andras,

On Mon, Nov 27, 2017 at 2:31 PM, Andras Pataki
<apat...@flatironinstitute.org> wrote:
> After upgrading to the Luminous 12.2.1 ceph-fuse client, we've seen clients
> on various nodes randomly crash at the assert
> FAILED assert(0 == "failed to remount for kernel dentry trimming")
>
> with the stack:
>
>  ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x110) [0x5584ad80]
>  2: (C_Client_Remount::finish(int)+0xcf) [0x557e7fff]
>  3: (Context::complete(int)+0x9) [0x557e3dc9]
>  4: (Finisher::finisher_thread_entry()+0x198) [0x55849d18]
>  5: (()+0x7e25) [0x760a3e25]
>  6: (clone()+0x6d) [0x74f8234d]

What kernel version are you using? We have seen instances of this
error recently. It may be related to [1]. Are you running out of
memory on these machines?

[1] http://tracker.ceph.com/issues/17517

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

