Re: [ceph-users] luminous ceph-fuse with quotas breaks 'mount' and 'df'

2018-08-19 Thread Zhi Zhang
I think this is caused by the way ceph-fuse computes the fsblkcnt_t values.
The total and used space are right-shifted by CEPH_BLOCK_SHIFT (22), so if a
directory's quota is less than 4 MB, the total size after this calculation
becomes 0. The 'df' command then won't report this mount point, but the
'mount' command still will.
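
To make the effect concrete, here is a small illustrative sketch (not the
actual Client::statfs code, and the helper name is made up; the constant
CEPH_BLOCK_SHIFT really is 22, i.e. 4 MiB blocks):

#include <cstdint>
#include <cstdio>
#include <sys/statvfs.h>

static const int CEPH_BLOCK_SHIFT = 22;   // 4 MiB "blocks"

// Illustrative only: mirrors how ceph-fuse fills statvfs when a directory
// quota is in effect (fields simplified).
void fill_statvfs_for_quota(uint64_t quota_bytes, uint64_t used_bytes,
                            struct statvfs *st) {
    st->f_frsize = 1UL << CEPH_BLOCK_SHIFT;               // 4 MiB block size
    st->f_blocks = quota_bytes >> CEPH_BLOCK_SHIFT;       // total blocks
    st->f_bfree  = st->f_bavail =
        (quota_bytes - used_bytes) >> CEPH_BLOCK_SHIFT;   // free blocks
}

int main() {
    struct statvfs st = {};
    fill_statvfs_for_quota(100, 0, &st);   // 100-byte quota
    // f_blocks == 0, so 'df' skips the filesystem as having no capacity,
    // while the kernel mount table (and 'mount') still lists it.
    printf("f_blocks = %lu\n", (unsigned long)st.f_blocks);
    return 0;
}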

Regards,
Zhi Zhang (David)
Contact: zhang.david2...@gmail.com
  zhangz.da...@outlook.com

On Sat, Aug 18, 2018 at 4:29 AM Chad William Seys wrote:
>
> Looks like Greg may be onto something!
>
> If the quota is 1000 (bytes), the mount point appears in 'df':
> ceph-fuse   8.0M 0  8.0M   0% /srv/smb/winbak
> and 'mount':
> ceph-fuse on /srv/smb/winbak type fuse.ceph-fuse
> (rw,relatime,user_id=0,group_id=0,allow_other)
>
> If the quota is 100 (bytes), the mount point no longer appears in 'df', but
> it does still appear in 'mount'.
>
> I wasn't able to get it to disappear from 'mount' even at a quota of 1 byte.
>
> Below is a debug session as suggested by John. I used a quota of 100 on the
> mount point. (A segfault occurred when I hit Ctrl-C to kill the process
> after the initial mount.)
>
> Thanks!
> Chad.
>
>
>
> 2018-08-17 14:34:54.952636 7f0e300b5140  0 ceph version 12.2.7
> (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable), process
> ceph-fuse, pid 30502
> ceph-fuse[30502]: starting ceph client
> 2018-08-17 14:34:54.958910 7f0e300b5140 -1 init, newargv =
> 0x556a3f060120 newargc=9
> 2018-08-17 14:34:54.961492 7f0e298a6700 10 client.0 ms_handle_connect on
> 128.104.164.197:6789/0
> 2018-08-17 14:34:54.965175 7f0e300b5140 10 client.18814183 Subscribing
> to map 'mdsmap'
> 2018-08-17 14:34:54.965198 7f0e300b5140 20 client.18814183 trim_cache
> size 0 max 16384
> 2018-08-17 14:34:54.966721 7f0e298a6700  1 client.18814183
> handle_mds_map epoch 2336272
> 2018-08-17 14:34:54.966788 7f0e300b5140 20 client.18814183
> populate_metadata read hostname 'tardis'
> 2018-08-17 14:34:54.966824 7f0e300b5140 10 client.18814183 did not get
> mds through better means, so chose random mds 0
> 2018-08-17 14:34:54.966826 7f0e300b5140 20 client.18814183 mds is 0
> 2018-08-17 14:34:54.966828 7f0e300b5140 10 client.18814183
> _open_mds_session mds.0
> 2018-08-17 14:34:54.966858 7f0e300b5140 10 client.18814183 waiting for
> session to mds.0 to open
> 2018-08-17 14:34:54.969991 7f0e298a6700 10 client.18814183
> ms_handle_connect on 10.128.198.59:6800/2643422990
> 2018-08-17 14:34:55.033974 7f0e298a6700 10 client.18814183
> handle_client_session client_session(open) v1 from mds.0
> 2018-08-17 14:34:55.034030 7f0e298a6700 10 client.18814183 renew_caps mds.0
> 2018-08-17 14:34:55.034196 7f0e298a6700 10 client.18814183
> connect_mds_targets for mds.0
> 2018-08-17 14:34:55.034269 7f0e300b5140 10 client.18814183 did not get
> mds through better means, so chose random mds 0
> 2018-08-17 14:34:55.034276 7f0e300b5140 20 client.18814183 mds is 0
> 2018-08-17 14:34:55.034280 7f0e300b5140 10 client.18814183 send_request
> rebuilding request 1 for mds.0
> 2018-08-17 14:34:55.034285 7f0e300b5140 20 client.18814183
> encode_cap_releases enter (req: 0x556a3ee79200, mds: 0)
> 2018-08-17 14:34:55.034287 7f0e300b5140 20 client.18814183 send_request
> set sent_stamp to 2018-08-17 14:34:55.034287
> 2018-08-17 14:34:55.034291 7f0e300b5140 10 client.18814183 send_request
> client_request(unknown.0:1 getattr pAsLsXsFs #0x1/backups/winbak
> 2018-08-17 14:34:54.966812 caller_uid=0, caller_gid=0{}) v4 to mds.0
> 2018-08-17 14:34:55.034331 7f0e300b5140 20 client.18814183 awaiting
> reply|forward|kick on 0x7ffd3c6a1970
> 2018-08-17 14:34:55.035114 7f0e298a6700 10 client.18814183
> handle_client_session client_session(renewcaps seq 1) v1 from mds.0
> 2018-08-17 14:34:55.035612 7f0e298a6700 20 client.18814183
> handle_client_reply got a reply. Safe:1 tid 1
> 2018-08-17 14:34:55.035664 7f0e298a6700 10 client.18814183 insert_trace
> from 2018-08-17 14:34:55.034287 mds.0 is_target=1 is_dentry=0
> 2018-08-17 14:34:55.035707 7f0e298a6700 10 client.18814183  features
> 0x3ffddff8eea4fffb
> 2018-08-17 14:34:55.035744 7f0e298a6700 10 client.18814183
> update_snap_trace len 48
> 2018-08-17 14:34:55.035783 7f0e298a6700 20 client.18814183
> get_snap_realm 0x1 0x556a3ee5ea90 0 -> 1
> 2018-08-17 14:34:55.035857 7f0e298a6700 10 client.18814183
> update_snap_trace snaprealm(0x1 nref=1 c=0 seq=0 parent=0x0 my_snaps=[]
> cached_snapc=0=[]) seq 1 > 0
> 2018-08-17 14:34:55.035901 7f0e298a6700 10 client.18814183
> invalidate_snaprealm_and_children snaprealm(0x1 nref=2 c=0 seq=1
> parent=0x0 my_snaps=[] cached_snapc=0=[])
> 2018-08-17 14:34:55.035962 7f0e298a6700 15 client.18814183
> update_snap_trace snaprealm(0x1 nref=2 c=0 seq=1 parent=0x0 my_snaps=[]
> cached_snapc=0=[]) self

Re: [ceph-users] Bluestore: inaccurate disk usage statistics problem?

2017-12-26 Thread Zhi Zhang
Hi Sage,

Thanks for the quick reply. I read the code, and our test also confirmed
that disk space is wasted due to min_alloc_size.
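
As a rough cross-check, assuming the 64 KiB HDD min_alloc_size that Sage
mentions below:

    151,292,669 objects * 64 KiB * 2 replicas ~= 18.0 TiB

which is in the same ballpark as the ~18861G RAW USED reported by "ceph df"
once metadata and rocksdb overhead are included.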

We very much look forward to the "inline data" feature for small objects. We
will also look into this feature and hopefully work with the community on it.

Regards,
Zhi Zhang (David)
Contact: zhang.david2...@gmail.com
  zhangz.da...@outlook.com


On Wed, Dec 27, 2017 at 6:36 AM, Sage Weil <s...@newdream.net> wrote:
> On Tue, 26 Dec 2017, Zhi Zhang wrote:
>> Hi,
>>
>> We recently started testing bluestore with a huge number of small files
>> (only dozens of bytes per file). We have 22 OSDs in a test cluster
>> running ceph-12.2.1 with 2 replicas, and each OSD disk is 2 TB. After
>> we wrote about 150 million files through cephfs, we found that each OSD's
>> disk usage reported by "ceph osd df" was more than 40%, meaning
>> more than 800 GB was used on each disk, yet the actual total file size
>> was only about 5.2 GB, as reported by "ceph df" and also confirmed by
>> our own calculation.
>>
>> The test is ongoing. I wonder whether the cluster will report OSDs as
>> full after we write about 300 million files, even though the actual total
>> file size will be far less than the disk usage. I will update the
>> results when the test is done.
>>
>> My question is: are the disk usage statistics in bluestore inaccurate,
>> or does the padding, alignment, or something else in bluestore waste
>> the disk space?
>
> Bluestore isn't making any attempt to optimize for small files, so a
> one-byte file will consume min_alloc_size (64 KB on HDD, 16 KB on SSD,
> IIRC).
>
> It probably wouldn't be too difficult to add an "inline data" feature for
> small objects that puts them in rocksdb...
>
> sage
>
>>
>> Thanks!
>>
>> $ ceph osd df
>> ID CLASS WEIGHT  REWEIGHT SIZE   USE   AVAIL  %USE  VAR  PGS
>>  0   hdd 1.49728  1.0  1862G   853G  1009G 45.82 1.00 110
>>  1   hdd 1.69193  1.0  1862G   807G  1054G 43.37 0.94 105
>>  2   hdd 1.81929  1.0  1862G   811G  1051G 43.57 0.95 116
>>  3   hdd 2.00700  1.0  1862G   839G  1023G 45.04 0.98 122
>>  4   hdd 2.06334  1.0  1862G   886G   976G 47.58 1.03 130
>>  5   hdd 1.99051  1.0  1862G   856G  1006G 45.95 1.00 118
>>  6   hdd 1.67519  1.0  1862G   881G   981G 47.32 1.03 114
>>  7   hdd 1.81929  1.0  1862G   874G   988G 46.94 1.02 120
>>  8   hdd 2.08881  1.0  1862G   885G   976G 47.56 1.03 130
>>  9   hdd 1.64265  1.0  1862G   852G  1010G 45.78 0.99 106
>> 10   hdd 1.81929  1.0  1862G   873G   989G 46.88 1.02 109
>> 11   hdd 2.20041  1.0  1862G   915G   947G 49.13 1.07 131
>> 12   hdd 1.45694  1.0  1862G   874G   988G 46.94 1.02 110
>> 13   hdd 2.03847  1.0  1862G   821G  1041G 44.08 0.96 113
>> 14   hdd 1.53812  1.0  1862G   810G  1052G 43.50 0.95 112
>> 15   hdd 1.52914  1.0  1862G   874G   988G 46.94 1.02 111
>> 16   hdd 1.99176  1.0  1862G   810G  1052G 43.51 0.95 114
>> 17   hdd 1.81929  1.0  1862G   841G  1021G 45.16 0.98 119
>> 18   hdd 1.70901  1.0  1862G   831G  1031G 44.61 0.97 113
>> 19   hdd 1.67519  1.0  1862G   875G   987G 47.02 1.02 115
>> 20   hdd 2.03847  1.0  1862G   864G   998G 46.39 1.01 115
>> 21   hdd 2.18794  1.0  1862G   920G   942G 49.39 1.07 127
>> TOTAL 40984G 18861G 22122G 46.02
>>
>> $ ceph df
>> GLOBAL:
>> SIZE   AVAIL  RAW USED %RAW USED
>> 40984G 22122G   18861G 46.02
>> POOLS:
>> NAME            ID USED  %USED MAX AVAIL OBJECTS
>> cephfs_metadata 5   160M     0     6964G     77342
>> cephfs_data     6  5193M  0.04     6964G 151292669
>>
>>
>> Regards,
>> Zhi Zhang (David)
>> Contact: zhang.david2...@gmail.com
>>   zhangz.da...@outlook.com


[ceph-users] Bluestore: inaccurate disk usage statistics problem?

2017-12-25 Thread Zhi Zhang
Hi,

We recently started testing bluestore with a huge number of small files
(only dozens of bytes per file). We have 22 OSDs in a test cluster
running ceph-12.2.1 with 2 replicas, and each OSD disk is 2 TB. After
we wrote about 150 million files through cephfs, we found that each OSD's
disk usage reported by "ceph osd df" was more than 40%, meaning
more than 800 GB was used on each disk, yet the actual total file size
was only about 5.2 GB, as reported by "ceph df" and also confirmed by
our own calculation.

The test is ongoing. I wonder whether the cluster will report OSDs as
full after we write about 300 million files, even though the actual total
file size will be far less than the disk usage. I will update the
results when the test is done.

My question is: are the disk usage statistics in bluestore inaccurate,
or does the padding, alignment, or something else in bluestore waste
the disk space?

Thanks!

$ ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE   AVAIL  %USE  VAR  PGS
 0   hdd 1.49728  1.0  1862G   853G  1009G 45.82 1.00 110
 1   hdd 1.69193  1.0  1862G   807G  1054G 43.37 0.94 105
 2   hdd 1.81929  1.0  1862G   811G  1051G 43.57 0.95 116
 3   hdd 2.00700  1.0  1862G   839G  1023G 45.04 0.98 122
 4   hdd 2.06334  1.0  1862G   886G   976G 47.58 1.03 130
 5   hdd 1.99051  1.0  1862G   856G  1006G 45.95 1.00 118
 6   hdd 1.67519  1.0  1862G   881G   981G 47.32 1.03 114
 7   hdd 1.81929  1.0  1862G   874G   988G 46.94 1.02 120
 8   hdd 2.08881  1.0  1862G   885G   976G 47.56 1.03 130
 9   hdd 1.64265  1.0  1862G   852G  1010G 45.78 0.99 106
10   hdd 1.81929  1.0  1862G   873G   989G 46.88 1.02 109
11   hdd 2.20041  1.0  1862G   915G   947G 49.13 1.07 131
12   hdd 1.45694  1.0  1862G   874G   988G 46.94 1.02 110
13   hdd 2.03847  1.0  1862G   821G  1041G 44.08 0.96 113
14   hdd 1.53812  1.0  1862G   810G  1052G 43.50 0.95 112
15   hdd 1.52914  1.0  1862G   874G   988G 46.94 1.02 111
16   hdd 1.99176  1.0  1862G   810G  1052G 43.51 0.95 114
17   hdd 1.81929  1.0  1862G   841G  1021G 45.16 0.98 119
18   hdd 1.70901  1.0  1862G   831G  1031G 44.61 0.97 113
19   hdd 1.67519  1.0  1862G   875G   987G 47.02 1.02 115
20   hdd 2.03847  1.0  1862G   864G   998G 46.39 1.01 115
21   hdd 2.18794  1.0  1862G   920G   942G 49.39 1.07 127
TOTAL 40984G 18861G 22122G 46.02

$ ceph df
GLOBAL:
SIZE   AVAIL  RAW USED %RAW USED
40984G 22122G   18861G 46.02
POOLS:
NAME            ID USED  %USED MAX AVAIL OBJECTS
cephfs_metadata 5   160M     0     6964G     77342
cephfs_data     6  5193M  0.04     6964G 151292669


Regards,
Zhi Zhang (David)
Contact: zhang.david2...@gmail.com
  zhangz.da...@outlook.com


Re: [ceph-users] Read IO to object while new data still in journal

2015-12-30 Thread Zhi Zhang
If the data has not been written to the filestore yet, i.e. it is still in
the journal as you mentioned, your subsequent read op will be blocked until
the data has been written to the filestore.

This is because, while this data is being written, the related object context
holds ondisk_write_lock. That lock is released in a callback once the data is
in the filestore, and as long as ondisk_write_lock is held, read ops to this
data are blocked.
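
A simplified sketch of that gating (this is not the actual ObjectContext
code, just the pattern: a writer marks the object unstable until the
on-applied callback fires, and readers wait while any unstable write is
outstanding):

#include <condition_variable>
#include <mutex>

class ObjectWriteGate {
    std::mutex m;
    std::condition_variable cv;
    int unstable_writes = 0;          // journaled but not yet applied

public:
    void ondisk_write_lock() {        // taken when the write is queued
        std::lock_guard<std::mutex> l(m);
        ++unstable_writes;
    }
    void ondisk_write_unlock() {      // called from the on-applied callback
        std::lock_guard<std::mutex> l(m);
        if (--unstable_writes == 0)
            cv.notify_all();
    }
    void wait_for_readable() {        // read op: block until data hits filestore
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [this] { return unstable_writes == 0; });
    }
};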


Regards,
Zhi Zhang (David)
Contact: zhang.david2...@gmail.com
  zhangz.da...@outlook.com


On Thu, Dec 31, 2015 at 10:33 AM, min fang <louisfang2...@gmail.com> wrote:
> yes, the question here is: librbd uses the committed callback, and as I
> understand it, when this callback returns, the librbd write is considered
> completed. So I can issue a read IO even if the data is not readable yet. In
> this case, I would like to know what data will be returned for that read IO?
>
> 2015-12-31 10:29 GMT+08:00 Dong Wu <archer.wud...@gmail.com>:
>>
>> there are two callbacks: committed and applied. Committed means the write
>> has reached all replicas' journals; applied means it has reached all
>> replicas' file systems. So when the applied callback returns to the client,
>> the data can be read.
>>
>> 2015-12-31 10:15 GMT+08:00 min fang <louisfang2...@gmail.com>:
>> > Hi, as I understand it, a write IO commits data to the journal first,
>> > then gives a safe callback to the ceph client. So it is possible that the
>> > data is still in the journal when I send a read IO to the same area. What
>> > data will be returned if the new data is still in the journal?
>> >
>> > Thanks.
>> >


Re: [ceph-users] Read IO to object while new data still in journal

2015-12-30 Thread Zhi Zhang
Regards,
Zhi Zhang (David)
Contact: zhang.david2...@gmail.com
  zhangz.da...@outlook.com


On Thu, Dec 31, 2015 at 11:08 AM, min fang <louisfang2...@gmail.com> wrote:
> thanks, so ceph can guarantee that after the write commit callback, a read
> IO will get the newly written data, right?

yep :-)
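
For completeness, a minimal librados sketch of the two completion points being
discussed, using the C++ API of that era (pool and object names are
placeholders; the comments give the rough mapping, not exact internals):

#include <rados/librados.hpp>
#include <iostream>

int main() {
    librados::Rados cluster;
    if (cluster.init("admin") < 0 || cluster.conf_read_file(nullptr) < 0 ||
        cluster.connect() < 0) {
        std::cerr << "failed to connect to cluster" << std::endl;
        return 1;
    }

    librados::IoCtx ioctx;
    if (cluster.ioctx_create("rbd", ioctx) < 0) {      // placeholder pool
        std::cerr << "failed to open pool" << std::endl;
        cluster.shutdown();
        return 1;
    }

    librados::bufferlist bl;
    bl.append("new data");

    librados::AioCompletion *c = cluster.aio_create_completion();
    ioctx.aio_write("test-object", c, bl, bl.length(), 0);

    c->wait_for_complete();   // roughly: applied on the replicas, data readable
    c->wait_for_safe();       // roughly: committed to the replicas' journals

    // Per the discussion above, a read issued after the commit callback
    // returns the new data; the OSD blocks a conflicting read until the
    // write has reached the filestore.

    c->release();
    ioctx.close();
    cluster.shutdown();
    return 0;
}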

>
> 2015-12-31 10:55 GMT+08:00 Zhi Zhang <zhang.david2...@gmail.com>:
>>
>> If the data has not been written to the filestore yet, i.e. it is still in
>> the journal as you mentioned, your subsequent read op will be blocked until
>> the data has been written to the filestore.
>>
>> This is because, while this data is being written, the related object
>> context holds ondisk_write_lock. That lock is released in a callback once
>> the data is in the filestore, and as long as ondisk_write_lock is held,
>> read ops to this data are blocked.
>>
>>
>> Regards,
>> Zhi Zhang (David)
>> Contact: zhang.david2...@gmail.com
>>   zhangz.da...@outlook.com
>>
>>
>> On Thu, Dec 31, 2015 at 10:33 AM, min fang <louisfang2...@gmail.com> wrote:
>> > yes, the question here is: librbd uses the committed callback, and as I
>> > understand it, when this callback returns, the librbd write is considered
>> > completed. So I can issue a read IO even if the data is not readable yet.
>> > In this case, I would like to know what data will be returned for that
>> > read IO?
>> >
>> > 2015-12-31 10:29 GMT+08:00 Dong Wu <archer.wud...@gmail.com>:
>> >>
>> >> there are two callbacks: committed and applied. Committed means the
>> >> write has reached all replicas' journals; applied means it has reached
>> >> all replicas' file systems. So when the applied callback returns to the
>> >> client, the data can be read.
>> >>
>> >> 2015-12-31 10:15 GMT+08:00 min fang <louisfang2...@gmail.com>:
>> >> > Hi, as I understand it, a write IO commits data to the journal first,
>> >> > then gives a safe callback to the ceph client. So it is possible that
>> >> > the data is still in the journal when I send a read IO to the same
>> >> > area. What data will be returned if the new data is still in the
>> >> > journal?
>> >> >
>> >> > Thanks.
>> >> >