Re: [ceph-users] cephfs kernel client - page cache being invalidated.
> If a CephFS client receives a cap release request and it is able to
> perform it (no processes accessing the file at the moment), the client
> cleans up its internal state and allows the MDS to release the cap.
> This cleanup also involves removing file data from the page cache.
>
> If your MDS was running with a too small cache size, it had to revoke
> caps over and over to adhere to its cache size, and the clients had to
> clean up their caches over and over, too.

Well.. it could just mark the data "eligible for future cleanup" - if the client has no use for the available memory, then this just thrashes the local client memory cache for a file that comes back into use a few minutes later. Based on your description, this is what we have been seeing. Bumping MDS memory has pushed our problem away and our setup works fine, but the above behaviour still seems very suboptimal. Of course, if the file changes, feel free to actively prune it - but otherwise, why? It will get no hits in the client LRU cache and be evicted by the client automatically anyway.

I feel this is messing with something that has worked well for a few decades now, but I may just be missing the fine-grained details.

> Hope this helps.

Definitely - thanks.

--
Jesper

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cephfs kernel client - page cache being invalidated.
Hi,

On 03.11.18 10:31, jes...@krogh.cc wrote:
>> I suspect that mds asked client to trim its cache. Please run
>> following commands on an idle client.
>
> In the mean time - we migrated to the RH Ceph version and delivered the
> MDS both SSDs and more memory, and the problem went away.
>
> It still puzzles my mind a bit - why is there a connection between the
> "client page cache" and the MDS server performance/etc.? The only
> argument I can find is that if the MDS cannot cache metadata, then it
> needs to go back and fetch it from the Ceph metadata pool, and it then
> exposes the data as "new" to the clients, despite it being the same.
> If that is the case, then I would say there is significant room for
> performance optimization here.

CephFS is a distributed system, so there is bookkeeping for every file in use by any CephFS client. These entities are 'capabilities'; they also implement features like distributed locking.

The MDS has to cache every capability it has assigned to a CephFS client, in addition to the cache for inode information and other data. The cache size is limited to control the memory consumption of the MDS process.

If an MDS runs out of cache, it tries to revoke capabilities assigned to CephFS clients to free memory for new capabilities. This revoke process runs asynchronously from MDS to CephFS client, similar to NFS delegation.

If a CephFS client receives a cap release request and is able to perform it (no processes accessing the file at the moment), the client cleans up its internal state and allows the MDS to release the cap. This cleanup also involves removing file data from the page cache.

If your MDS was running with a too small cache size, it had to revoke caps over and over to adhere to its cache size, and the clients had to clean up their caches over and over, too.

You did not mention any details about the MDS settings, especially the cache size. I assume you increased the cache size after adding more memory, since the problem seems to be solved now.
It actually is not solved, but only mitigated. If your working set size increases or the number of clients increases, the MDS has to manage more caps and will have to revoke caps more often. You will probably reach an equilibrium at some point.

The MDS is the most memory-hungry part of Ceph, and it has often caught people by surprise. We had the same problem in our setup; even worse, the nightly backup is also thrashing the MDS cache.

The best way to monitor the MDS is the 'ceph daemonperf mds.XYZ' command on the MDS host. It gives you the current performance counters, including the inode and caps counts. Our MDS is configured with a 40 GB cache size and currently has 15 million inodes cached and is managing 3.1 million capabilities.

TL;DR: the MDS needs huge amounts of memory for its internal bookkeeping.

Hope this helps.

Regards,
Burkhard
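[Editor's note: the cache size Burkhard describes maps onto the luminous-era `mds_cache_memory_limit` option, which takes bytes. A minimal sketch of raising it to the 40 GB mentioned above - the daemon name `mds.a` is a placeholder, and the runtime change should also be persisted under `[mds]` in ceph.conf:]

```shell
# mds_cache_memory_limit takes bytes; convert the 40 GB figure:
LIMIT=$((40 * 1024 * 1024 * 1024))
echo "$LIMIT"   # prints 42949672960

# Apply at runtime on a live cluster (placeholder daemon name mds.a);
# also set mds_cache_memory_limit in ceph.conf so it survives restarts:
# ceph tell mds.a injectargs "--mds_cache_memory_limit=$LIMIT"
```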
Re: [ceph-users] cephfs kernel client - page cache being invalidated.
> I suspect that mds asked client to trim its cache. Please run
> following commands on an idle client.

In the mean time - we migrated to the RH Ceph version and delivered the MDS both SSDs and more memory, and the problem went away.

It still puzzles my mind a bit - why is there a connection between the "client page cache" and the MDS server performance/etc.? The only argument I can find is that if the MDS cannot cache metadata, then it needs to go back and fetch it from the Ceph metadata pool, and it then exposes the data as "new" to the clients, despite it being the same. If that is the case, then I would say there is significant room for performance optimization here.

> If you can reproduce this issue, please send the kernel log to us.

Will do if/when it reappears.
Re: [ceph-users] cephfs kernel client - page cache being invalidated.
On Mon, Oct 15, 2018 at 9:54 PM Dietmar Rieder wrote:
>
> On 10/15/18 1:17 PM, jes...@krogh.cc wrote:
> >> On 10/15/18 12:41 PM, Dietmar Rieder wrote:
> >>> No big difference here.
> >>> all CentOS 7.5 official kernel 3.10.0-862.11.6.el7.x86_64
> >>
> >> ...forgot to mention: all is luminous ceph-12.2.7
> >
> > Thanks for your time in testing, this is very valuable to me in the
> > debugging. 2 questions:
> >
> > Did you "sleep 900" in-between the executions?
> > Are you using the kernel client or the fuse client?
> >
> > If I run them "right after each other" .. then I get the same behaviour.
>
> Hi, as I stated I'm using the kernel client, and yes I did the sleep 900
> between the two runs.
>
> ~Dietmar

Sorry for the delay.

I suspect that the mds asked the client to trim its cache. Please run the following commands on an idle client:

time for i in $(seq 0 3); do echo "dd if=test.$i.0 of=/dev/null bs=1M"; done | parallel -j 4
echo module ceph +p > /sys/kernel/debug/dynamic_debug/control
sleep 900
echo module ceph -p > /sys/kernel/debug/dynamic_debug/control
time for i in $(seq 0 3); do echo "dd if=test.$i.0 of=/dev/null bs=1M"; done | parallel -j 4

If you can reproduce this issue, please send the kernel log to us.

Regards
Yan, Zheng
Re: [ceph-users] cephfs kernel client - page cache being invalidated.
On 10/15/18 1:17 PM, jes...@krogh.cc wrote:
>> On 10/15/18 12:41 PM, Dietmar Rieder wrote:
>>> No big difference here.
>>> all CentOS 7.5 official kernel 3.10.0-862.11.6.el7.x86_64
>>
>> ...forgot to mention: all is luminous ceph-12.2.7
>
> Thanks for your time in testing, this is very valuable to me in the
> debugging. 2 questions:
>
> Did you "sleep 900" in-between the executions?
> Are you using the kernel client or the fuse client?
>
> If I run them "right after each other" .. then I get the same behaviour.

Hi, as I stated I'm using the kernel client, and yes I did the sleep 900 between the two runs.

~Dietmar
Re: [ceph-users] cephfs kernel client - page cache being invalidated.
> On 10/15/18 12:41 PM, Dietmar Rieder wrote:
>> No big difference here.
>> all CentOS 7.5 official kernel 3.10.0-862.11.6.el7.x86_64
>
> ...forgot to mention: all is luminous ceph-12.2.7

Thanks for your time in testing, this is very valuable to me in the debugging. 2 questions:

Did you "sleep 900" in-between the executions?
Are you using the kernel client or the fuse client?

If I run them "right after each other" .. then I get the same behaviour.

--
Jesper
Re: [ceph-users] cephfs kernel client - page cache being invalidated.
On 10/15/18 12:41 PM, Dietmar Rieder wrote:
> [...]
>
> No big difference here.
> all CentOS 7.5 official kernel 3.10.0-862.11.6.el7.x86_64

...forgot to mention: all is luminous ceph-12.2.7

~Dietmar
Re: [ceph-users] cephfs kernel client - page cache being invalidated.
On 10/15/18 12:02 PM, jes...@krogh.cc wrote:
> [...]
>
> Can I ask one of you to run the same "test" (or similar) .. and report
> back if you can reproduce it?

here my test on an EC (6+3) pool using the cephfs kernel client:

7061+1 records in
7061+1 records out
7404496985 bytes (7.4 GB) copied, 3.62754 s, 2.0 GB/s
7450+1 records in
7450+1 records out
7812246720 bytes (7.8 GB) copied, 4.11908 s, 1.9 GB/s
7761+1 records in
7761+1 records out
8138636188 bytes (8.1 GB) copied, 4.34788 s, 1.9 GB/s
8212+1 records in
8212+1 records out
8611295220 bytes (8.6 GB) copied, 4.53371 s, 1.9 GB/s

real    0m4.936s
user    0m0.275s
sys     0m16.828s

7061+1 records in
7061+1 records out
7404496985 bytes (7.4 GB) copied, 3.19726 s, 2.3 GB/s
7761+1 records in
7761+1 records out
8138636188 bytes (8.1 GB) copied, 3.31881 s, 2.5 GB/s
7450+1 records in
7450+1 records out
7812246720 bytes (7.8 GB) copied, 3.36354 s, 2.3 GB/s
8212+1 records in
8212+1 records out
8611295220 bytes (8.6 GB) copied, 3.74418 s, 2.3 GB/s

No big difference here.
all CentOS 7.5 official kernel 3.10.0-862.11.6.el7.x86_64

HTH
Dietmar
Re: [ceph-users] cephfs kernel client - page cache being invalidated.
>> On Sun, Oct 14, 2018 at 8:21 PM wrote:
>> how many cephfs mounts access the file? Is it possible that some
>> program opens that file in RW mode (even if they just read the file)?
>
> The nature of the program is that it is "prepped" by one set of commands
> and queried by another, thus the RW case is extremely unlikely.
> I can change permission bits to revoke the w-bit for the users, they
> don't need it anyway... it is just the same service users that generate
> the data and query it today.

Just to remove the suspicion of other clients fiddling with the files, I did a more structured test. I have 4 x 10GB files from fio benchmarking, 40GB total, hosted on:

1) CephFS /ceph/cluster/home/jk
2) NFS /z/home/jk

First I read them .. then sleep 900 seconds .. then read again (just with dd):

jk@sild12:/ceph/cluster/home/jk$ time for i in $(seq 0 3); do echo "dd if=test.$i.0 of=/dev/null bs=1M"; done | parallel -j 4 ; sleep 900; time for i in $(seq 0 3); do echo "dd if=test.$i.0 of=/dev/null bs=1M"; done | parallel -j 4
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 2.56413 s, 4.2 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 2.82234 s, 3.8 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 2.9361 s, 3.7 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 3.10397 s, 3.5 GB/s

real    0m3.449s
user    0m0.217s
sys     0m11.497s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 315.439 s, 34.0 MB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 338.661 s, 31.7 MB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 354.725 s, 30.3 MB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 356.126 s, 30.2 MB/s

real    5m56.634s
user    0m0.260s
sys     0m16.515s
jk@sild12:/ceph/cluster/home/jk$

Then NFS:

jk@sild12:~$ time for i in $(seq 0 3); do echo "dd if=test.$i.0 of=/dev/null bs=1M"; done | parallel -j 4 ; sleep 900; time for i in $(seq 0 3); do echo "dd if=test.$i.0 of=/dev/null bs=1M"; done | parallel -j 4
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 1.60267 s, 6.7 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 2.18602 s, 4.9 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 2.47564 s, 4.3 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 2.54674 s, 4.2 GB/s

real    0m2.855s
user    0m0.185s
sys     0m8.888s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 1.68613 s, 6.4 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 1.6983 s, 6.3 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 2.20059 s, 4.9 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 2.58077 s, 4.2 GB/s

real    0m2.980s
user    0m0.173s
sys     0m8.239s
jk@sild12:~$

Can I ask one of you to run the same "test" (or similar) .. and report back if you can reproduce it?

Thoughts/comments/suggestions are highly appreciated. Should I try with the fuse client?

--
Jesper
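[Editor's note: the read / sleep / re-read procedure above can be wrapped in a small self-contained script. A sketch, with the directory, file count, and 900 s idle window as parameters; GNU parallel is replaced by plain background jobs so there is no extra dependency:]

```shell
#!/bin/bash
# Read a set of test.<i>.0 files twice with an idle gap in between,
# timing both passes; a large slowdown on the second pass indicates
# the client page cache was dropped during the gap.
DIR=${1:-.}        # directory holding the test files
N=${2:-4}          # number of test.<i>.0 files
GAP=${3:-900}      # idle seconds between the two passes

read_all() {
    i=0
    while [ "$i" -lt "$N" ]; do
        # Read each file in parallel, discarding the data:
        dd if="$DIR/test.$i.0" of=/dev/null bs=1M 2>/dev/null &
        i=$((i + 1))
    done
    wait    # let all background dd readers finish
}

time read_all
sleep "$GAP"
time read_all
```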
Re: [ceph-users] cephfs kernel client - page cache being invalidated.
> On Sun, Oct 14, 2018 at 8:21 PM wrote:
> how many cephfs mounts access the file? Is it possible that some
> program opens that file in RW mode (even if they just read the file)?

The nature of the program is that it is "prepped" by one set of commands and queried by another, thus the RW case is extremely unlikely.

I can change permission bits to revoke the w-bit for the users, they don't need it anyway... it is just the same service users that generate the data and query it today.

Can ceph tell the actual amount of clients? .. We have 55-60 hosts, where most of them mount the catalog.

--
Jesper
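[Editor's note: on the client-count question - the MDS admin socket can list its sessions. A sketch, assuming a daemon named `mds.a` (placeholder) and approximating the count by grepping for the "id" field each session entry carries:]

```shell
# Count client sessions on the MDS. Each entry in the JSON that
# 'session ls' returns has an "id" field, so counting matching lines
# approximates the number of connected clients:
count_sessions() {
  grep -c '"id"'
}

# Run on the MDS host against the admin socket (placeholder name):
# ceph daemon mds.a session ls | count_sessions

# Demo with a canned two-session listing:
printf '[{"id": 4234},\n {"id": 4321}]\n' | count_sessions   # prints: 2
```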
Re: [ceph-users] cephfs kernel client - page cache being invalidated.
On Sun, Oct 14, 2018 at 8:21 PM wrote:
>
> Hi
>
> We have a dataset of ~300 GB on CephFS which is being used for
> computations over and over again .. being refreshed daily or similar.
>
> When hosting it on NFS after refresh, they are transferred, but from
> there - they would be sitting in the kernel page cache of the client
> until they are refreshed serverside.
>
> On CephFS it looks "similar" but "different". Where the "steady state"
> operation over NFS would give a client/server traffic of < 1MB/s ..
> CephFS constantly pulls 50-100MB/s over the network. This has
> implications for the clients that end up spending unnecessary time
> waiting for IO in the execution.
>
> This is in a setting where the CephFS client mem looks like this:
>
> $ free -h
>               total        used        free      shared  buff/cache   available
> Mem:           377G         17G        340G        1.2G         19G        354G
> Swap:          8.8G        430M        8.4G
>
> If I just repeatedly run (within a few minutes) something that is using
> the files, then it is fully served out of the client page cache
> (2GB'ish / s) .. but it looks like it is being evicted way faster than
> in the NFS setting?
>
> This is not scientific .. but the CMD is a "cat /file/on/ceph > /dev/null"
> type on a total of 24GB data in 300'ish files.
>
> $ free -h; time CMD ; sleep 1800; free -h; time CMD ; free -h; sleep 3600; time CMD ;
>
>               total        used        free      shared  buff/cache   available
> Mem:           377G         16G        312G        1.2G         48G        355G
> Swap:          8.8G        430M        8.4G
>
> real    0m8.997s
> user    0m2.036s
> sys     0m6.915s
>               total        used        free      shared  buff/cache   available
> Mem:           377G         17G        277G        1.2G         82G        354G
> Swap:          8.8G        430M        8.4G
>
> real    3m25.904s
> user    0m2.794s
> sys     0m9.028s
>               total        used        free      shared  buff/cache   available
> Mem:           377G         17G        283G        1.2G         76G        353G
> Swap:          8.8G        430M        8.4G
>
> real    6m18.358s
> user    0m2.847s
> sys     0m10.651s
>
> Munin graphs of the system confirm that there has been zero memory
> pressure over the period.
>
> Is there anything in the CephFS case that can cause the page cache to be
> invalidated? Could less aggressive "read-ahead" play a role?
>
> Other thoughts on what the root cause of the different behaviour could be?
>
> Clients are using a 4.15 kernel.. Anyone aware of newer patches in this
> area that could have an impact?
>
> Jesper

How many cephfs mounts access the file? Is it possible that some program opens that file in RW mode (even if they just read the file)?

Yan, Zheng
Re: [ceph-users] cephfs kernel client - page cache being invalidated.
> Actual amount of memory used by the VFS cache is available through 'grep
> Cached /proc/meminfo'. slabtop provides information about the cache
> of inodes, dentries, and IO memory buffers (buffer_head).

Thanks, that was also what I got out of it. And why I reported "free" output in the first place, as it also shows available and "cached" memory.

--
Jesper
Re: [ceph-users] cephfs kernel client - page cache being invalidated.
Actual amount of memory used by the VFS cache is available through 'grep Cached /proc/meminfo'. slabtop provides information about the cache of inodes, dentries, and IO memory buffers (buffer_head).

> On 14.10.2018, at 17:28, jes...@krogh.cc wrote:
>
>> Try looking in /proc/slabinfo / slabtop during your tests.
>
> I need a bit of guidance here.. Does the slabinfo cover the VFS page
> cache? .. I cannot seem to find any traces (sorting by size on
> machines with a huge cache does not really give anything). Perhaps
> I'm holding the screwdriver wrong?
>
> --
> Jesper
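[Editor's note: a small sketch of the two inspection commands mentioned above, on a mainline Linux box - the "Cached" field of /proc/meminfo is the page-cache figure that repeated dd reads hit, while slabtop covers the dentry/inode slabs (not the page cache itself, which is why sorting slabtop by size showed nothing):]

```shell
# Page-cache size from /proc/meminfo; "Cached" excludes buffers and
# swap cache, and is the bulk of what cached file reads are served from:
awk '/^Cached:/ {print $2, $3}' /proc/meminfo

# Dentry/inode slab usage: one-shot (non-interactive) slabtop snapshot,
# sorted by cache size; accurate numbers need root, hence commented out:
# slabtop -o -s c | head -n 15
```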
Re: [ceph-users] cephfs kernel client - page cache being invalidated.
> Try looking in /proc/slabinfo / slabtop during your tests.

I need a bit of guidance here.. Does the slabinfo cover the VFS page cache? .. I cannot seem to find any traces (sorting by size on machines with a huge cache does not really give anything). Perhaps I'm holding the screwdriver wrong?

--
Jesper
Re: [ceph-users] cephfs kernel client - page cache being invalidated.
Try looking in /proc/slabinfo / slabtop during your tests.

> On 14.10.2018, at 15:21, jes...@krogh.cc wrote:
>
> [...]
Re: [ceph-users] cephfs kernel client - page cache being invalidated.
On 14 Oct 2018, at 15.26, John Hearns wrote:
>
> This is a general question for the ceph list.
> Should Jesper be looking at these vm tunables?
> vm.dirty_ratio
> vm.dirty_expire_centisecs
>
> What effect do they have when using Cephfs?

This situation is read-only, thus there is no dirty data in the page cache. The above should be irrelevant.

Jesper
Re: [ceph-users] cephfs kernel client - page cache being invalidated.
This is a general question for the ceph list.
Should Jesper be looking at these vm tunables?
vm.dirty_ratio
vm.dirty_expire_centisecs

What effect do they have when using Cephfs?

On Sun, 14 Oct 2018 at 14:24, John Hearns wrote:
> Hej Jesper.
> Sorry I do not have a direct answer to your question.
> When looking at memory usage, I often use this command:
>
> watch cat /proc/meminfo
>
> [...]
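[Editor's note: the write-side tunables John asks about can be inspected directly from procfs; the second name is assumed to be vm.dirty_expire_centisecs. As Jesper notes elsewhere in the thread, they govern dirty (written) pages only, so a read-only workload is unaffected:]

```shell
# Percentage of memory that may hold dirty pages before writers
# are forced to flush synchronously:
cat /proc/sys/vm/dirty_ratio

# Age (in centiseconds) after which dirty data is written back:
cat /proc/sys/vm/dirty_expire_centisecs
```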
Re: [ceph-users] cephfs kernel client - page cache being invalidated.
Hej Jesper.
Sorry I do not have a direct answer to your question. When looking at memory usage, I often use this command:

watch cat /proc/meminfo

On Sun, 14 Oct 2018 at 13:22, wrote:
> Hi
>
> [...]