Re: [ceph-users] luminous ceph-fuse with quotas breaks 'mount' and 'df'
I think this is caused by the way ceph-fuse calculates the fsblkcnt_t
values. The total and used space are right-shifted by CEPH_BLOCK_SHIFT
(22), so if the quota of a directory is less than 4 MB, the total size
after this calculation is 0. The 'df' command then does not report this
mount point, but the 'mount' command still does.

Regards,
Zhi Zhang (David)
Contact: zhang.david2...@gmail.com
         zhangz.da...@outlook.com

On Sat, Aug 18, 2018 at 4:29 AM Chad William Seys wrote:
>
> Looks like Greg may be onto something!
>
> If the quota is 1000 (bytes), the mount point appears in 'df':
> ceph-fuse 8.0M 0 8.0M 0% /srv/smb/winbak
> and 'mount':
> ceph-fuse on /srv/smb/winbak type fuse.ceph-fuse
> (rw,relatime,user_id=0,group_id=0,allow_other)
>
> If quota is 100, the mount point no longer appears in 'df', but does
> appear in 'mount'.
>
> I wasn't able to get it to disappear from 'mount' even at quota 1 byte.
>
> Below is a debug session as suggested by John. I used quota 100 on
> the mount point. (Segfault occurred when I ctrl-C to kill process after
> initial mount.)
>
> Thanks!
> Chad.
>
>
>
> 2018-08-17 14:34:54.952636 7f0e300b5140 0 ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable), process ceph-fuse, pid 30502
> ceph-fuse[30502]: starting ceph client
> 2018-08-17 14:34:54.958910 7f0e300b5140 -1 init, newargv = 0x556a3f060120 newargc=9
> 2018-08-17 14:34:54.961492 7f0e298a6700 10 client.0 ms_handle_connect on 128.104.164.197:6789/0
> 2018-08-17 14:34:54.965175 7f0e300b5140 10 client.18814183 Subscribing to map 'mdsmap'
> 2018-08-17 14:34:54.965198 7f0e300b5140 20 client.18814183 trim_cache size 0 max 16384
> 2018-08-17 14:34:54.966721 7f0e298a6700 1 client.18814183 handle_mds_map epoch 2336272
> 2018-08-17 14:34:54.966788 7f0e300b5140 20 client.18814183 populate_metadata read hostname 'tardis'
> 2018-08-17 14:34:54.966824 7f0e300b5140 10 client.18814183 did not get mds through better means, so chose random mds 0
> 2018-08-17 14:34:54.966826 7f0e300b5140 20 client.18814183 mds is 0
> 2018-08-17 14:34:54.966828 7f0e300b5140 10 client.18814183 _open_mds_session mds.0
> 2018-08-17 14:34:54.966858 7f0e300b5140 10 client.18814183 waiting for session to mds.0 to open
> 2018-08-17 14:34:54.969991 7f0e298a6700 10 client.18814183 ms_handle_connect on 10.128.198.59:6800/2643422990
> 2018-08-17 14:34:55.033974 7f0e298a6700 10 client.18814183 handle_client_session client_session(open) v1 from mds.0
> 2018-08-17 14:34:55.034030 7f0e298a6700 10 client.18814183 renew_caps mds.0
> 2018-08-17 14:34:55.034196 7f0e298a6700 10 client.18814183 connect_mds_targets for mds.0
> 2018-08-17 14:34:55.034269 7f0e300b5140 10 client.18814183 did not get mds through better means, so chose random mds 0
> 2018-08-17 14:34:55.034276 7f0e300b5140 20 client.18814183 mds is 0
> 2018-08-17 14:34:55.034280 7f0e300b5140 10 client.18814183 send_request rebuilding request 1 for mds.0
> 2018-08-17 14:34:55.034285 7f0e300b5140 20 client.18814183 encode_cap_releases enter (req: 0x556a3ee79200, mds: 0)
> 2018-08-17 14:34:55.034287 7f0e300b5140 20 client.18814183 send_request set sent_stamp to 2018-08-17 14:34:55.034287
> 2018-08-17 14:34:55.034291 7f0e300b5140 10 client.18814183 send_request client_request(unknown.0:1 getattr pAsLsXsFs #0x1/backups/winbak 2018-08-17 14:34:54.966812 caller_uid=0, caller_gid=0{}) v4 to mds.0
> 2018-08-17 14:34:55.034331 7f0e300b5140 20 client.18814183 awaiting reply|forward|kick on 0x7ffd3c6a1970
> 2018-08-17 14:34:55.035114 7f0e298a6700 10 client.18814183 handle_client_session client_session(renewcaps seq 1) v1 from mds.0
> 2018-08-17 14:34:55.035612 7f0e298a6700 20 client.18814183 handle_client_reply got a reply. Safe:1 tid 1
> 2018-08-17 14:34:55.035664 7f0e298a6700 10 client.18814183 insert_trace from 2018-08-17 14:34:55.034287 mds.0 is_target=1 is_dentry=0
> 2018-08-17 14:34:55.035707 7f0e298a6700 10 client.18814183 features 0x3ffddff8eea4fffb
> 2018-08-17 14:34:55.035744 7f0e298a6700 10 client.18814183 update_snap_trace len 48
> 2018-08-17 14:34:55.035783 7f0e298a6700 20 client.18814183 get_snap_realm 0x1 0x556a3ee5ea90 0 -> 1
> 2018-08-17 14:34:55.035857 7f0e298a6700 10 client.18814183 update_snap_trace snaprealm(0x1 nref=1 c=0 seq=0 parent=0x0 my_snaps=[] cached_snapc=0=[]) seq 1 > 0
> 2018-08-17 14:34:55.035901 7f0e298a6700 10 client.18814183 invalidate_snaprealm_and_children snaprealm(0x1 nref=2 c=0 seq=1 parent=0x0 my_snaps=[] cached_snapc=0=[])
> 2018-08-17 14:34:55.035962 7f0e298a6700 15 client.18814183 update_snap_trace snaprealm(0x1 nref=2 c=0 seq=1 parent=0x0 my_snaps=[] cached_snapc=0=[]) self
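
The right-shift arithmetic described in the reply at the top of this thread can be illustrated with a short sketch. This is not the ceph-fuse source: only the shift value 22 (CEPH_BLOCK_SHIFT) is taken from the message, while the function name quota_to_blocks and the sample quotas are hypothetical and chosen for illustration.

```python
# Illustrative sketch only, not ceph-fuse code.
# CEPH_BLOCK_SHIFT = 22 comes from the message above; the function name
# and the sample quota values are assumptions made for this example.
CEPH_BLOCK_SHIFT = 22
BLOCK_SIZE = 1 << CEPH_BLOCK_SHIFT  # 4 MiB per block


def quota_to_blocks(quota_bytes: int) -> int:
    """Convert a byte quota to a block count with a right shift (rounds down)."""
    return quota_bytes >> CEPH_BLOCK_SHIFT


if __name__ == "__main__":
    for quota in (100, 1000, BLOCK_SIZE - 1, BLOCK_SIZE, 10 * BLOCK_SIZE):
        print(f"quota={quota:>10} bytes -> {quota_to_blocks(quota)} block(s)")
```

Under this shift, any quota smaller than 4 MiB maps to a block count of zero, which is the condition the reply points to.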
Re: [ceph-users] Bluestore: inaccurate disk usage statistics problem?
Hi Sage,

Thanks for the quick reply. I read the code, and our test also confirmed
that disk space was wasted due to min_alloc_size.

We are very much looking forward to the "inline" data feature for small
objects. We will also look into this feature and hopefully work with the
community on it.

Regards,
Zhi Zhang (David)
Contact: zhang.david2...@gmail.com
         zhangz.da...@outlook.com

On Wed, Dec 27, 2017 at 6:36 AM, Sage Weil <s...@newdream.net> wrote:
> On Tue, 26 Dec 2017, Zhi Zhang wrote:
>> Hi,
>>
>> We recently started to test bluestore with huge amount of small files
>> (only dozens of bytes per file). We have 22 OSDs in a test cluster
>> using ceph-12.2.1 with 2 replicas and each OSD disk is 2TB size. After
>> we wrote about 150 million files through cephfs, we found each OSD
>> disk usage reported by "ceph osd df" was more than 40%, which meant
>> more than 800GB was used for each disk, but the actual total file size
>> was only about 5.2 GB, which was reported by "ceph df" and also
>> calculated by ourselves.
>>
>> The test is ongoing. I wonder whether the cluster would report OSD
>> full after we wrote about 300 million files, however the actual total
>> file size would be far far less than the disk usage. I will update the
>> result when the test is done.
>>
>> My question is, whether the disk usage statistics in bluestore is
>> inaccurate, or the padding, alignment stuff or something else in
>> bluestore wastes the disk space?
>
> Bluestore isn't making any attempt to optimize for small files, so a
> one byte file will consume min_alloc_size (64kb on HDD, 16kb on SSD,
> IIRC).
>
> It probably wouldn't be too difficult to add an "inline" data for small
> objects feature that puts small objects in rocksdb...
>
> sage
>
>>
>> Thanks!
>>
>> $ ceph osd df
>> ID CLASS  WEIGHT REWEIGHT   SIZE    USE  AVAIL  %USE  VAR PGS
>>  0   hdd 1.49728      1.0  1862G   853G  1009G 45.82 1.00 110
>>  1   hdd 1.69193      1.0  1862G   807G  1054G 43.37 0.94 105
>>  2   hdd 1.81929      1.0  1862G   811G  1051G 43.57 0.95 116
>>  3   hdd 2.00700      1.0  1862G   839G  1023G 45.04 0.98 122
>>  4   hdd 2.06334      1.0  1862G   886G   976G 47.58 1.03 130
>>  5   hdd 1.99051      1.0  1862G   856G  1006G 45.95 1.00 118
>>  6   hdd 1.67519      1.0  1862G   881G   981G 47.32 1.03 114
>>  7   hdd 1.81929      1.0  1862G   874G   988G 46.94 1.02 120
>>  8   hdd 2.08881      1.0  1862G   885G   976G 47.56 1.03 130
>>  9   hdd 1.64265      1.0  1862G   852G  1010G 45.78 0.99 106
>> 10   hdd 1.81929      1.0  1862G   873G   989G 46.88 1.02 109
>> 11   hdd 2.20041      1.0  1862G   915G   947G 49.13 1.07 131
>> 12   hdd 1.45694      1.0  1862G   874G   988G 46.94 1.02 110
>> 13   hdd 2.03847      1.0  1862G   821G  1041G 44.08 0.96 113
>> 14   hdd 1.53812      1.0  1862G   810G  1052G 43.50 0.95 112
>> 15   hdd 1.52914      1.0  1862G   874G   988G 46.94 1.02 111
>> 16   hdd 1.99176      1.0  1862G   810G  1052G 43.51 0.95 114
>> 17   hdd 1.81929      1.0  1862G   841G  1021G 45.16 0.98 119
>> 18   hdd 1.70901      1.0  1862G   831G  1031G 44.61 0.97 113
>> 19   hdd 1.67519      1.0  1862G   875G   987G 47.02 1.02 115
>> 20   hdd 2.03847      1.0  1862G   864G   998G 46.39 1.01 115
>> 21   hdd 2.18794      1.0  1862G   920G   942G 49.39 1.07 127
>> TOTAL                     40984G 18861G 22122G 46.02
>>
>> $ ceph df
>> GLOBAL:
>>     SIZE   AVAIL  RAW USED %RAW USED
>>     40984G 22122G 18861G   46.02
>> POOLS:
>>     NAME            ID USED  %USED MAX AVAIL OBJECTS
>>     cephfs_metadata 5  160M  0     6964G     77342
>>     cephfs_data     6  5193M 0.04  6964G     151292669
>>
>>
>> Regards,
>> Zhi Zhang (David)
>> Contact: zhang.david2...@gmail.com
>>          zhangz.da...@outlook.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
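
Sage's point about min_alloc_size can be checked against the numbers in this thread with a back-of-the-envelope calculation. The sketch below assumes that every object occupies exactly one allocation unit of 64 KiB (the HDD value Sage quotes from memory) and ignores metadata and RocksDB overhead; the function name estimate_raw_usage_gib is hypothetical, and the object count, replica count, and OSD count are taken from the messages above.

```python
# Rough estimate only. Assumptions (not confirmed by the thread):
#   - every object consumes exactly one min_alloc_size unit
#   - min_alloc_size is 64 KiB on these HDD OSDs
#   - metadata and RocksDB overhead are ignored
GIB = 1 << 30


def estimate_raw_usage_gib(objects: int, min_alloc_size: int, replicas: int) -> float:
    """Raw space in GiB if each object replica takes one allocation unit."""
    return objects * min_alloc_size * replicas / GIB


if __name__ == "__main__":
    objects = 151_292_669        # OBJECTS in cephfs_data, from "ceph df"
    min_alloc_size = 64 * 1024   # 64 KiB, assumed HDD default
    replicas = 2                 # from the original post
    osds = 22                    # from the original post

    total = estimate_raw_usage_gib(objects, min_alloc_size, replicas)
    print(f"estimated raw usage: {total:.0f} GiB")
    print(f"estimated per OSD:   {total / osds:.0f} GiB")
```

Under these assumptions the estimate is roughly 18,500 GiB in total and roughly 840 GiB per OSD, which is close to the 18861G RAW USED and the 807G to 920G per-OSD usage shown in the command output.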
[ceph-users] Bluestore: inaccurate disk usage statistics problem?
Hi,

We recently started to test bluestore with a huge number of small files
(only dozens of bytes per file). We have 22 OSDs in a test cluster
using ceph-12.2.1 with 2 replicas, and each OSD disk is 2 TB in size.
After we wrote about 150 million files through cephfs, we found that the
disk usage of each OSD reported by "ceph osd df" was more than 40%, which
meant more than 800 GB was used on each disk, but the actual total file
size was only about 5.2 GB, as reported by "ceph df" and also calculated
by ourselves.

The test is ongoing. I wonder whether the cluster will report OSD full
after we write about 300 million files, even though the actual total
file size would be far less than the disk usage. I will post the result
when the test is done.

My question is whether the disk usage statistics in bluestore are
inaccurate, or whether padding, alignment, or something else in
bluestore wastes the disk space.

Thanks!

$ ceph osd df
ID CLASS  WEIGHT REWEIGHT   SIZE    USE  AVAIL  %USE  VAR PGS
 0   hdd 1.49728      1.0  1862G   853G  1009G 45.82 1.00 110
 1   hdd 1.69193      1.0  1862G   807G  1054G 43.37 0.94 105
 2   hdd 1.81929      1.0  1862G   811G  1051G 43.57 0.95 116
 3   hdd 2.00700      1.0  1862G   839G  1023G 45.04 0.98 122
 4   hdd 2.06334      1.0  1862G   886G   976G 47.58 1.03 130
 5   hdd 1.99051      1.0  1862G   856G  1006G 45.95 1.00 118
 6   hdd 1.67519      1.0  1862G   881G   981G 47.32 1.03 114
 7   hdd 1.81929      1.0  1862G   874G   988G 46.94 1.02 120
 8   hdd 2.08881      1.0  1862G   885G   976G 47.56 1.03 130
 9   hdd 1.64265      1.0  1862G   852G  1010G 45.78 0.99 106
10   hdd 1.81929      1.0  1862G   873G   989G 46.88 1.02 109
11   hdd 2.20041      1.0  1862G   915G   947G 49.13 1.07 131
12   hdd 1.45694      1.0  1862G   874G   988G 46.94 1.02 110
13   hdd 2.03847      1.0  1862G   821G  1041G 44.08 0.96 113
14   hdd 1.53812      1.0  1862G   810G  1052G 43.50 0.95 112
15   hdd 1.52914      1.0  1862G   874G   988G 46.94 1.02 111
16   hdd 1.99176      1.0  1862G   810G  1052G 43.51 0.95 114
17   hdd 1.81929      1.0  1862G   841G  1021G 45.16 0.98 119
18   hdd 1.70901      1.0  1862G   831G  1031G 44.61 0.97 113
19   hdd 1.67519      1.0  1862G   875G   987G 47.02 1.02 115
20   hdd 2.03847      1.0  1862G   864G   998G 46.39 1.01 115
21   hdd 2.18794      1.0  1862G   920G   942G 49.39 1.07 127
TOTAL                     40984G 18861G 22122G 46.02

$ ceph df
GLOBAL:
    SIZE   AVAIL  RAW USED %RAW USED
    40984G 22122G 18861G   46.02
POOLS:
    NAME            ID USED  %USED MAX AVAIL OBJECTS
    cephfs_metadata 5  160M  0     6964G     77342
    cephfs_data     6  5193M 0.04  6964G     151292669

Regards,
Zhi Zhang (David)
Contact: zhang.david2...@gmail.com
         zhangz.da...@outlook.com
Re: [ceph-users] Read IO to object while new data still in journal
If the data has not yet been written to filestore and, as you mentioned,
is still in the journal, your subsequent read op will be blocked until
the data is written to filestore.

This is because, while this data is being written, the related object
context holds ondisk_write_lock. The lock is released in a callback
after the data is in filestore. While ondisk_write_lock is held, any
read op on this data is blocked.

Regards,
Zhi Zhang (David)
Contact: zhang.david2...@gmail.com
         zhangz.da...@outlook.com

On Thu, Dec 31, 2015 at 10:33 AM, min fang <louisfang2...@gmail.com> wrote:
> yes, the question here is, librbd use the committed callback, as my
> understanding, when this callback returned, librbd write will be looked as
> completed. So I can issue a read IO even if the data is not readable. In
> this case, i would like to know what data will be returned for the read IO?
>
> 2015-12-31 10:29 GMT+08:00 Dong Wu <archer.wud...@gmail.com>:
>>
>> there are two callbacks: committed and applied, committed means write
>> to all replica's journal, applied means write to all replica's file
>> system. so when applied callback return to client, it means data can
>> be read.
>>
>> 2015-12-31 10:15 GMT+08:00 min fang <louisfang2...@gmail.com>:
>> > Hi, as my understanding, write IO will committed data to journal
>> > firstly, then give a safe callback to ceph client. So it is possible
>> > that data still in journal when I send a read IO to the same area.
>> > So what data will be returned if the new data still in journal?
>> >
>> > Thanks.
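
The blocking behaviour described in this reply can be modelled with a small toy example. This is not Ceph code: apart from the name ondisk_write_lock, which comes from the message, the class ObjectContext, its methods, and the apply_delay parameter are hypothetical and only mimic the sequence "take the lock at write time, release it in a callback once the data is applied".

```python
import threading
import time


class ObjectContext:
    """Toy model: a write holds a lock until the data has been applied."""

    def __init__(self, data: bytes = b"old"):
        self._data = data
        self._ondisk_write_lock = threading.Lock()

    def write(self, new_data: bytes, apply_delay: float = 0.1) -> None:
        # Take the lock before acknowledging the write ("committed to journal").
        self._ondisk_write_lock.acquire()

        def apply_to_store() -> None:
            time.sleep(apply_delay)            # simulated delay before apply
            self._data = new_data              # data is now in the backing store
            self._ondisk_write_lock.release()  # callback releases the lock

        threading.Thread(target=apply_to_store).start()
        # Returning here stands in for the "committed" acknowledgement.

    def read(self) -> bytes:
        # A read waits for the lock, so it cannot run before the apply finishes.
        with self._ondisk_write_lock:
            return self._data


if __name__ == "__main__":
    obj = ObjectContext()
    obj.write(b"new")
    print(obj.read())  # prints b'new': the read waited for the apply step
```

In this model a read issued right after the write acknowledgement waits for the apply step and then returns the new data, which matches the guarantee discussed in the thread.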
Re: [ceph-users] Read IO to object while new data still in journal
Regards,
Zhi Zhang (David)
Contact: zhang.david2...@gmail.com
         zhangz.da...@outlook.com

On Thu, Dec 31, 2015 at 11:08 AM, min fang <louisfang2...@gmail.com> wrote:
> thanks, so ceph can guarantee after write commit call back, read IO can get
> the new written data, right?

yep :-)

>
> 2015-12-31 10:55 GMT+08:00 Zhi Zhang <zhang.david2...@gmail.com>:
>>
>> If the data has not been written to filestore, as you mentioned, it is
>> still in journal, your following read op will be blocked until the
>> data is written to filestore.
>>
>> This is because when writing this data, the related object context
>> will hold ondisk_write_lock. This lock will be released in a callback
>> after data is in filestore. When ondisk_write_lock is held, read op to
>> this data will be blocked.
>>
>>
>> Regards,
>> Zhi Zhang (David)
>> Contact: zhang.david2...@gmail.com
>>          zhangz.da...@outlook.com
>>
>>
>> On Thu, Dec 31, 2015 at 10:33 AM, min fang <louisfang2...@gmail.com>
>> wrote:
>> > yes, the question here is, librbd use the committed callback, as my
>> > understanding, when this callback returned, librbd write will be
>> > looked as completed. So I can issue a read IO even if the data is not
>> > readable. In this case, i would like to know what data will be
>> > returned for the read IO?
>> >
>> > 2015-12-31 10:29 GMT+08:00 Dong Wu <archer.wud...@gmail.com>:
>> >>
>> >> there are two callbacks: committed and applied, committed means write
>> >> to all replica's journal, applied means write to all replica's file
>> >> system. so when applied callback return to client, it means data can
>> >> be read.
>> >>
>> >> 2015-12-31 10:15 GMT+08:00 min fang <louisfang2...@gmail.com>:
>> >> > Hi, as my understanding, write IO will committed data to journal
>> >> > firstly, then give a safe callback to ceph client. So it is
>> >> > possible that data still in journal when I send a read IO to the
>> >> > same area. So what data will be returned if the new data still in
>> >> > journal?
>> >> >
>> >> > Thanks.