Re: [ceph-users] CephFS: delayed objects deletion ?
On Mon, Mar 16, 2015 at 5:08 PM, Florent B flor...@coppint.com wrote:
Since then I deleted the pool. But I now have another problem, in fact the opposite of the previous: now I never deleted files in clients, data objects and metadata are still in pools, but the directory is empty for clients (it is another directory, other pool, etc. from the previous problem). Here are logs from the MDS when I restart it, about one of the files:

2015-03-16 09:57:48.626254 7f4177694700 12 mds.0.cache.dir(1a95e05) link_primary_inode [dentry #1/staging/api/easyrsa/vars [2,head] auth NULL (dversion lock) v=22 inode=0 | dirty=1 0x6ca5a20] [inode 1a95e11 [2,head] #1a95e11 auth v22 s=0 n(v0 1=1+0) (iversion lock) cr={29050627=0-1966080@1} 0x53c32c8]
2015-03-16 09:57:48.626258 7f4177694700 10 mds.0.journal EMetaBlob.replay added [inode 1a95e11 [2,head] /staging/api/easyrsa/vars auth v22 s=0 n(v0 1=1+0) (iversion lock) cr={29050627=0-1966080@1} 0x53c32c8]
2015-03-16 09:57:48.626260 7f4177694700 10 mds.0.cache.ino(1a95e11) mark_dirty_parent
2015-03-16 09:57:48.626261 7f4177694700 10 mds.0.journal EMetaBlob.replay noting opened inode [inode 1a95e11 [2,head] /staging/api/easyrsa/vars auth v22 dirtyparent s=0 n(v0 1=1+0) (iversion lock) cr={29050627=0-1966080@1} | dirtyparent=1 dirty=1 0x53c32c8]
2015-03-16 09:57:48.626264 7f4177694700 10 mds.0.journal EMetaBlob.replay sessionmap v 21580500 -(1|2) == table 21580499 prealloc [] used 1a95e11
2015-03-16 09:57:48.626265 7f4177694700 20 mds.0.journal (session prealloc [1a95e11~3dd])
2015-03-16 09:57:48.626843 7f4177694700 10 mds.0.journal EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head] /staging/api/easyrsa/vars auth v42 dirtyparent s=8089 n(v0 b8089 1=1+0) (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
2015-03-16 09:57:48.629319 7f4177694700 10 mds.0.journal EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head] /staging/api/easyrsa/vars auth v99 dirtyparent s=8089 n(v0 b8089 1=1+0) (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
2015-03-16 09:57:48.629357 7f4177694700 10 mds.0.journal EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head] /staging/api/easyrsa/vars auth v101 dirtyparent s=8089 n(v0 b8089 1=1+0) (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
2015-03-16 09:57:48.636559 7f4177694700 10 mds.0.journal EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head] /staging/api/easyrsa/vars auth v164 dirtyparent s=8089 n(v0 b8089 1=1+0) (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
2015-03-16 09:57:48.636597 7f4177694700 10 mds.0.journal EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head] /staging/api/easyrsa/vars auth v166 dirtyparent s=8089 n(v0 b8089 1=1+0) (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
2015-03-16 09:57:48.644280 7f4177694700 10 mds.0.journal EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head] /staging/api/easyrsa/vars auth v227 dirtyparent s=8089 n(v0 b8089 1=1+0) (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
2015-03-16 09:57:48.644318 7f4177694700 10 mds.0.journal EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head] /staging/api/easyrsa/vars auth v229 dirtyparent s=8089 n(v0 b8089 1=1+0) (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
2015-03-16 09:57:51.911267 7f417c9a1700 15 mds.0.cache chose lock states on [inode 1a95e11 [2,head] /staging/api/easyrsa/vars auth v229 dirtyparent s=8089 n(v0 b8089 1=1+0) (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
2015-03-16 09:57:51.916816 7f417c9a1700 20 mds.0.locker check_inode_max_size no-op on [inode 1a95e11 [2,head] /staging/api/easyrsa/vars auth v229 dirtyparent s=8089 n(v0 b8089 1=1+0) (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
2015-03-16 09:57:51.958925 7f417c9a1700 7 mds.0.cache inode [inode 1a95e11 [2,head] /staging/api/easyrsa/vars auth v229 dirtyparent s=8089 n(v0 b8089 1=1+0) (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
2015-03-16 09:57:56.561404 7f417c9a1700 10 mds.0.cache unlisting unwanted/capless inode [inode 1a95e11 [2,head] /staging/api/easyrsa/vars auth v229 dirtyparent s=8089 n(v0 b8089 1=1+0) (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]

this log message is not for deleted files. Could you try again and upload the log file and output of rados -p data ls to somewhere.
Regards
Yan, Zheng

What is going on ?

On 03/16/2015 02:18 AM, Yan, Zheng wrote:
I don't know what was wrong. Could you use rados -p data ls to check which objects still exist. Then restart the MDS with debug_mds=20 and search the log for the names of the remaining objects.

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] ceph.conf
Hi all, I have seen that new versions of Ceph on new OSes like RHEL7 and CentOS 7 don't seem to need entries like mon.node1 and osd.0 etc. anymore. Can anybody tell me if that is really the case, or do I still need to write a config like this:

[osd.0]
host = sagitario
addr = 192.168.1.67

[mon.leo]
host = leo
mon addr = 192.168.1.81:6789

Jesus Chavez
SYSTEMS ENGINEER-C.SALES
jesch...@cisco.com
Phone: +52 55 5267 3146
Mobile: +51 1 5538883255
CCIE - 44433
Cisco.com http://www.cisco.com/
Think before you print. This email may contain confidential and privileged material for the sole use of the intended recipient. Any review, use, distribution or disclosure by others is strictly prohibited. If you are not the intended recipient (or authorized to receive for the recipient), please contact the sender by reply email and delete all copies of this message. Please click here http://www.cisco.com/web/about/doing_business/legal/cri/index.html for Company Registration Information.
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
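For comparison, a rough sketch of the kind of minimal ceph.conf that newer deployments (for example ones generated by ceph-deploy) end up with; the fsid is a placeholder and the monitor name/address are simply reused from the example above:

  [global]
  fsid = <your cluster fsid>
  mon_initial_members = leo
  mon_host = 192.168.1.81
  auth_cluster_required = cephx
  auth_service_required = cephx
  auth_client_required = cephx

With this style of config the monitors are found via mon_host and the OSDs via the cluster map, so per-daemon [osd.N] / [mon.X] sections are only needed for host- or daemon-specific overrides.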
Re: [ceph-users] PHP Rados failed in read operation if object size is large (say more than 10 MB )
On 03/16/2015 01:55 PM, Gaurang Vyas wrote:
running on ubuntu with nginx + php-fpm

<?php
$rados = rados_create('admin');
rados_conf_read_file($rados, '/etc/ceph/ceph.conf');
rados_conf_set($rados, 'keyring', '/etc/ceph/ceph.client.admin.keyring');

$temp = rados_conf_get($rados, 'rados_osd_op_timeout');
echo 'osd '; echo $temp;
$temp = rados_conf_get($rados, 'client_mount_timeout');
echo 'client '; echo $temp;
$temp = rados_conf_get($rados, 'rados_mon_op_timeout');
echo 'mon '; echo $temp;

$err = rados_connect($rados);
$ioRados = rados_ioctx_create($rados, 'dev_whereis');

$pieceSize = rados_stat($ioRados, 'TEMP_object');
var_dump($pieceSize);
$piece = rados_read($ioRados, 'TEMP_object', $pieceSize['psize'], 0);

So what is the error exactly? Are you running phprados from the master branch on Github?

echo $piece;
?>

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Wido den Hollander
42on B.V.
Ceph trainer and consultant
Phone: +31 (0)20 700 9902
Skype: contact42on
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
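In case it helps to narrow this down: a quick sanity check from the command line (reusing the pool and object names from the post) shows whether the object itself can be read in full outside of phprados before suspecting the binding:

  rados -p dev_whereis stat TEMP_object
  rados -p dev_whereis get TEMP_object /tmp/TEMP_object
  ls -l /tmp/TEMP_object

If the full-size get works, the problem is more likely on the PHP side (for example hitting PHP's memory_limit with a >10 MB string, though that is only a guess) than in RADOS itself.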
Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !
On 16/03/2015, at 12.23, Alexandre DERUMIER aderum...@odiso.com wrote:
We use Proxmox, so I think it uses librbd ?
As it's me who made the proxmox rbd plugin, I can confirm that yes, it's librbd ;)
Is the ceph cluster on dedicated nodes ? or are the vms running on the same nodes as the osd daemons ?
My cluster has Ceph OSDs+MONs on separate PVE nodes, no VMs.
And I should point out that not all VMs on that pool crashed, only some of them (a large majority), and on the same host some crashed and others did not.
Is the vm crashed, like no more qemu process ? or is it the guest os which is crashed ?
Hmm, it's a long time ago now; as I remember the VM status was "stopped", resume didn't work, so they were started again ASAP :)
(do you use virtio, virtio-scsi or ide for your guest ?)
virtio
/Steffen
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] PHP Rados failed in read operation if object size is large (say more than 10 MB )
running on ubuntu with nginx + php-fpm

<?php
$rados = rados_create('admin');
rados_conf_read_file($rados, '/etc/ceph/ceph.conf');
rados_conf_set($rados, 'keyring', '/etc/ceph/ceph.client.admin.keyring');

$temp = rados_conf_get($rados, 'rados_osd_op_timeout');
echo 'osd '; echo $temp;
$temp = rados_conf_get($rados, 'client_mount_timeout');
echo 'client '; echo $temp;
$temp = rados_conf_get($rados, 'rados_mon_op_timeout');
echo 'mon '; echo $temp;

$err = rados_connect($rados);
$ioRados = rados_ioctx_create($rados, 'dev_whereis');

$pieceSize = rados_stat($ioRados, 'TEMP_object');
var_dump($pieceSize);
$piece = rados_read($ioRados, 'TEMP_object', $pieceSize['psize'], 0);
echo $piece;
?>

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !
That full system slows down, OK, but brutal stop...

This is strange, that could be:
- a qemu crash, maybe a bug in rbd block storage (if you use librbd)
- the oom-killer on your host (any logs ?)
what is your qemu version ?

- Mail original -
De: Florent Bautista flor...@coppint.com
À: ceph-users ceph-users@lists.ceph.com
Envoyé: Lundi 16 Mars 2015 10:11:43
Objet: Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

Of course, but it does not explain why the VMs stopped... That full system slows down, OK, but brutal stop...

On 03/14/2015 07:00 PM, Andrija Panic wrote:
Changing the PG number causes a LOOOT of data rebalancing (in my case it was 80%), which I learned the hard way...

On 14 March 2015 at 18:49, Gabri Mate mailingl...@modernbiztonsag.org wrote:
I had the same issue a few days ago. I was increasing the pg_num of one pool from 512 to 1024 and all the VMs in that pool stopped. I came to the conclusion that doubling the pg_num caused such a high load in ceph that the VMs were blocked. The next time I will test with small increments.

On 12:38 Sat 14 Mar , Florent B wrote:
Hi all, I have a Giant cluster in production. Today one of my RBD pools had the "too few pgs" warning. So I changed pg_num and pgp_num. And at this moment, some of the VMs stored on this pool were stopped (on some hosts, not all, it depends, no logic). All was running fine for months... Have you ever seen this ? What could have caused this ? Thank you.

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Andrija Panić
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
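For reference, the "small increments" approach mentioned above comes down to raising pg_num in small steps and letting the cluster settle in between; pgp_num then follows so the new PGs are actually rebalanced. A rough sketch (pool name and step size are placeholders, not recommendations):

  ceph osd pool get <pool> pg_num
  ceph osd pool set <pool> pg_num 2112
  ceph osd pool set <pool> pgp_num 2112
  ceph -s          # wait for HEALTH_OK / backfill to finish before the next step

Repeating this until the target is reached spreads the rebalancing out over time instead of triggering it all at once.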
Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !
We use Proxmox, so I think it uses librbd ?
As it's me who made the proxmox rbd plugin, I can confirm that yes, it's librbd ;)
Is the ceph cluster on dedicated nodes ? or are the vms running on the same nodes as the osd daemons ?
And I should point out that not all VMs on that pool crashed, only some of them (a large majority), and on the same host some crashed and others did not.
Is the vm crashed, like no more qemu process ? or is it the guest os which is crashed ?
(do you use virtio, virtio-scsi or ide for your guest ?)

- Mail original -
De: Florent Bautista flor...@coppint.com
À: aderumier aderum...@odiso.com
Cc: ceph-users ceph-users@lists.ceph.com
Envoyé: Lundi 16 Mars 2015 11:14:45
Objet: Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

On 03/16/2015 11:03 AM, Alexandre DERUMIER wrote:
This is strange, that could be:
- a qemu crash, maybe a bug in rbd block storage (if you use librbd)
- the oom-killer on your host (any logs ?)
what is your qemu version ?

Now, we have version 2.1.3. Some VMs that stopped were running for a long time, but some others had only 4 days of uptime. And I should point out that not all VMs on that pool crashed, only some of them (a large majority), and on the same host some crashed and others did not. We use Proxmox, so I think it uses librbd ?
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !
On 16/03/2015, at 11.14, Florent B flor...@coppint.com wrote:
On 03/16/2015 11:03 AM, Alexandre DERUMIER wrote:
This is strange, that could be:
- a qemu crash, maybe a bug in rbd block storage (if you use librbd)
- the oom-killer on your host (any logs ?)
what is your qemu version ?
Now, we have version 2.1.3. Some VMs that stopped were running for a long time, but some others had only 4 days of uptime. And I should point out that not all VMs on that pool crashed, only some of them (a large majority), and on the same host some crashed and others did not. We use Proxmox, so I think it uses librbd ?

I had the same issue once when bumping up pg_num: the majority of my Proxmox VMs stopped. I believe this might be due to heavy rebalancing causing timeouts when the VMs try to do I/O operations, thus generating kernel panics. Next time around I want to go in smaller increments of pg_num and hopefully avoid this. I follow the need for more PGs when having more OSDs, but how come PGs get too few when adding more objects/data to a pool?
/Steffen
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !
May I know your ceph version? The latest version of firefly, 0.80.9, has patches to avoid excessive data migration during reweighting of OSDs. You may need to set a tunable in order to make this patch active.

This is a bugfix release for firefly. It fixes a performance regression in librbd, an important CRUSH misbehavior (see below), and several RGW bugs. We have also backported support for flock/fcntl locks to ceph-fuse and libcephfs. We recommend that all Firefly users upgrade. For more detailed information, see http://docs.ceph.com/docs/master/_downloads/v0.80.9.txt

Adjusting CRUSH maps
* This point release fixes several issues with CRUSH that trigger excessive data migration when adjusting OSD weights. These are most obvious when a very small weight change (e.g., a change from 0 to .01) triggers a large amount of movement, but the same set of bugs can also lead to excessive (though less noticeable) movement in other cases. However, because the bug may already have affected your cluster, fixing it may trigger movement *back* to the more correct location. For this reason, you must manually opt-in to the fixed behavior. In order to set the new tunable to correct the behavior:

  ceph osd crush set-tunable straw_calc_version 1

Note that this change will have no immediate effect. However, from this point forward, any 'straw' bucket in your CRUSH map that is adjusted will get non-buggy internal weights, and that transition may trigger some rebalancing. You can estimate how much rebalancing will eventually be necessary on your cluster with:

  ceph osd getcrushmap -o /tmp/cm
  crushtool -i /tmp/cm --num-rep 3 --test --show-mappings > /tmp/a 2>&1
  crushtool -i /tmp/cm --set-straw-calc-version 1 -o /tmp/cm2
  crushtool -i /tmp/cm2 --reweight -o /tmp/cm2
  crushtool -i /tmp/cm2 --num-rep 3 --test --show-mappings > /tmp/b 2>&1
  wc -l /tmp/a                         # num total mappings
  diff -u /tmp/a /tmp/b | grep -c ^+   # num changed mappings

Divide the total number of lines in /tmp/a with the number of lines changed. We've found that most clusters are under 10%. You can force all of this rebalancing to happen at once with:

  ceph osd crush reweight-all

Otherwise, it will happen at some unknown point in the future when CRUSH weights are next adjusted.

Notable Changes
---
* ceph-fuse: flock, fcntl lock support (Yan, Zheng, Greg Farnum)
* crush: fix straw bucket weight calculation, add straw_calc_version tunable (#10095 Sage Weil)
* crush: fix tree bucket (Rongzu Zhu)
* crush: fix underflow of tree weights (Loic Dachary, Sage Weil)
* crushtool: add --reweight (Sage Weil)
* librbd: complete pending operations before losing image (#10299 Jason Dillaman)
* librbd: fix read caching performance regression (#9854 Jason Dillaman)
* librbd: gracefully handle deleted/renamed pools (#10270 Jason Dillaman)
* mon: fix dump of chooseleaf_vary_r tunable (Sage Weil)
* osd: fix PG ref leak in snaptrimmer on peering (#10421 Kefu Chai)
* osd: handle no-op write with snapshot (#10262 Sage Weil)
* radosgw-admi

On 03/16/2015 12:37 PM, Alexandre DERUMIER wrote:
VMs are running on the same nodes than OSD
Are you sure that you didn't hit some kind of out of memory? pg rebalance can be memory hungry (depends how many osds you have).
2 OSD per host, and 5 hosts in this cluster. hosts h
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !
I always keep my pg number a power of 2. So I'd go from 2048 to 4096. I'm not sure if this is the safest way, but it's worked for me.

Michael Kuriger
Sr. Unix Systems Engineer • mk7...@yp.com • 818-649-7235

From: Chu Duc Minh chu.ducm...@gmail.com
Date: Monday, March 16, 2015 at 7:49 AM
To: Florent B flor...@coppint.com
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

I'm using the latest Giant and have the same issue. When I increase the pg_num of a pool from 2048 to 2148, my VMs are still ok. When I increase it from 2148 to 2400, some VMs die (the qemu-kvm process dies). My physical servers (hosting the VMs) run kernel 3.13 and use librbd. I think it's a bug in librbd with the crushmap. (I set crush_tunables3 on my ceph cluster, does it make sense?) Do you know a way to safely increase pg_num? (I don't think increasing pg_num by 100 each time is a safe, good way.)
Regards,

On Mon, Mar 16, 2015 at 8:50 PM, Florent B flor...@coppint.com wrote:
We are on Giant.

On 03/16/2015 02:03 PM, Azad Aliyar wrote:
May I know your ceph version? The latest version of firefly, 0.80.9, has patches to avoid excessive data migration during reweighting of OSDs. You may need to set a tunable in order to make this patch active. [...]

On 03/16/2015 12:37 PM, Alexandre DERUMIER wrote:
VMs are running on the same nodes than OSD
Are you sure that you didn't hit some kind of out of memory? pg rebalance can be memory hungry (depends how many osds you have).
2 OSD per host, and
Re: [ceph-users] Calamari - Data
Sumit, You may have better luck on the ceph-calamari mailing list. Anyway - calamari uses graphite to handle metrics, and graphite does indeed write them to files.
John

On 11/03/2015 05:09, Sumit Gaur wrote:
Hi, I have a basic architecture related question. I know Calamari collects system usage data (diamond collector) using performance counters. I need to know if all the system performance data that calamari shows remains in memory or whether it uses files to store it.
Thanks
sumit
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
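To make the storage side of John's answer concrete: graphite keeps each metric in a whisper (.wsp) file on disk, so the data calamari graphs is not only held in memory. A sketch of where to look on the Calamari/graphite host (these are common graphite default paths and may well differ on a given install):

  ls /var/lib/graphite/whisper/
  find /var/lib/graphite/whisper -name '*.wsp' | head
  whisper-fetch.py --pretty /var/lib/graphite/whisper/<path-to-metric>.wsp

whisper-fetch.py ships with the whisper package and dumps the stored datapoints for a single metric file.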
Re: [ceph-users] CephFS: authorizations ?
On 13/03/2015 11:51, Florent B wrote: Hi all, My question is about user management in CephFS. Is it possible to restrict a CephX user to access some subdirectories ? Not yet. The syntax for setting a path= part in the authorization caps for a cephx user exists, but the code for enforcing it isn't done yet. John ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
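For reference, the cap syntax John mentions (which exists at this point but is not yet enforced by the MDS) looks roughly like the following; the client name, pool, and path are placeholders:

  ceph auth get-or-create client.restricted \
    mon 'allow r' \
    osd 'allow rw pool=data' \
    mds 'allow rw path=/some/subdir'

Once the MDS-side enforcement is implemented, caps of this shape are what would restrict a cephx user to a subtree.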
Re: [ceph-users] CephFS: delayed objects deletion ?
On 16/03/2015 16:30, Florent B wrote: Thank you John :) Hammer is not released yet, is it ? Is it 'safe' to upgrade a production cluster to 0.93 ? I keep forgetting that -- yes, I should have added ...when it's released :-) John ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS: delayed objects deletion ?
On 14/03/2015 09:22, Florent B wrote: Hi, What do you call old MDS ? I'm on Giant release, it is not very old... With CephFS we have a special definition of old that is anything that doesn't have the very latest bug fixes ;-) There have definitely been fixes to stray file handling[1] between giant and hammer. Since with giant you're using a version that is neither latest nor LTS, I'd suggest you upgrade to hammer. Hammer also includes some new perf counters related to strays[2] that will allow you to see how the purging is (or isn't) progressing. If you can reproduce this on hammer, then please capture ceph daemon mds.daemon id session ls and ceph mds tell mds.daemon id dumpcache /tmp/cache.txt, in addition to the procedure to reproduce. Ideally logs with debug mds = 10 as well. Cheers, John 1. http://tracker.ceph.com/issues/10387 http://tracker.ceph.com/issues/10164 2. http://tracker.ceph.com/issues/10388 And I tried restarting both but it didn't solve my problem. Will it be OK in Hammer ? On 03/13/2015 04:27 AM, Yan, Zheng wrote: On Fri, Mar 13, 2015 at 1:17 AM, Florent B flor...@coppint.com wrote: Hi all, I test CephFS again on Giant release. I use ceph-fuse. After deleting a large directory (few hours ago), I can see that my pool still contains 217 GB of objects. Even if my root directory on CephFS is empty. And metadata pool is 46 MB. Is it expected ? If not, how to debug this ? Old mds does not work well in this area. Try umounting clients and restarting MDS. Regards Yan, Zheng Thank you. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
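Spelled out as commands, the capture John asks for would look something like this (the MDS daemon id "a" is a placeholder; the first two commands are the ones quoted above, the last two raise the debug level and dump the perf counters, including the new stray-related ones whose exact names vary by version):

  ceph daemon mds.a session ls
  ceph mds tell mds.a dumpcache /tmp/cache.txt
  ceph daemon mds.a config set debug_mds 10
  ceph daemon mds.a perf dump | grep -i stray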
[ceph-users] OS file Cache, Ceph RBD cache and Network files systems
Hi Cephers,
Our university is going to deploy ceph. The goal is to store data for research laboratories (non-HPC). To do this, we plan to use Ceph with RBD (mounted as a block device) on an NFS (or CIFS) server (the ceph client), which then exports to workstations in the laboratories. According to our tests, the OS (ubuntu or centos...) that maps the RBD block applies the usual file system write cache (vm.dirty_ratio, etc ...). In that case, the NFS server will acknowledge writes to the workstations before it has finished writing the data to the Ceph cluster, and this happens regardless of whether the RBD cache is enabled or not in the [client] section of the config.

My questions:
1. Is the activation of the RBD cache only useful when combined with virtual machines (where QEMU can access an image as a virtual block device directly via librbd) ?
2. Is it common to use Ceph with RBD to share network file systems ?
3. And if so, what are the recommendations concerning the OS cache ?

Thanks a lot.
Stephane.
--
Université de Lorraine
Stéphane DUGRAVOT - Direction du numérique - Infrastructure
Jabber : stephane.dugra...@univ-lorraine.fr
Tél.: +33 3 83 68 20 98
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
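As a concrete illustration of the two caching layers being discussed: the librbd cache is configured in ceph.conf on the client machine (only relevant when librbd is used, e.g. by QEMU; a kernel-mapped RBD does not use these settings), while the write-back behaviour observed on the NFS server is the ordinary OS page cache controlled by the dirty-page sysctls. A sketch with placeholder values, not recommendations:

  # /etc/ceph/ceph.conf on a librbd client
  [client]
  rbd cache = true
  rbd cache writethrough until flush = true

  # OS page-cache behaviour on the NFS server (applies to whatever filesystem sits on the RBD)
  sysctl vm.dirty_ratio=10
  sysctl vm.dirty_background_ratio=5

Lowering the dirty-page thresholds makes the NFS server flush to the Ceph cluster sooner, at the cost of some write performance.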
Re: [ceph-users] Ceph release timeline
Great work ! David Moreau Simard On 2015-03-15 06:29 PM, Loic Dachary wrote: Hi Ceph, In an attempt to clarify what Ceph release is stable, LTS or development. a new page was added to the documentation: http://ceph.com/docs/master/releases/ It is a matrix where each cell is a release number linked to the release notes from http://ceph.com/docs/master/release-notes/. One line per month and one column per release. Cheers ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !
@Michael Kuriger: when ceph/librbd operates normally, I know that doubling the pg_num is the safe way. But when it has a problem, I think doubling it can make many, many VMs die (maybe >= 50%?).

On Mon, Mar 16, 2015 at 9:53 PM, Michael Kuriger mk7...@yp.com wrote:
I always keep my pg number a power of 2. So I'd go from 2048 to 4096. I'm not sure if this is the safest way, but it's worked for me.

Michael Kuriger
Sr. Unix Systems Engineer * mk7...@yp.com | 818-649-7235

From: Chu Duc Minh chu.ducm...@gmail.com
Date: Monday, March 16, 2015 at 7:49 AM
To: Florent B flor...@coppint.com
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

I'm using the latest Giant and have the same issue. When I increase the pg_num of a pool from 2048 to 2148, my VMs are still ok. When I increase it from 2148 to 2400, some VMs die (the qemu-kvm process dies). My physical servers (hosting the VMs) run kernel 3.13 and use librbd. I think it's a bug in librbd with the crushmap. (I set crush_tunables3 on my ceph cluster, does it make sense?) Do you know a way to safely increase pg_num? (I don't think increasing pg_num by 100 each time is a safe, good way.)
Regards,

On Mon, Mar 16, 2015 at 8:50 PM, Florent B flor...@coppint.com wrote:
We are on Giant.

On 03/16/2015 02:03 PM, Azad Aliyar wrote:
May I know your ceph version? The latest version of firefly, 0.80.9, has patches to avoid excessive data migration during reweighting of OSDs. You may need to set a tunable in order to make this patch active. [...]

On 03/16/2015 12:37 PM,
Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !
I'm using the latest Giant and have the same issue. When I increase the pg_num of a pool from 2048 to 2148, my VMs are still ok. When I increase it from 2148 to 2400, some VMs die (the qemu-kvm process dies). My physical servers (hosting the VMs) run kernel 3.13 and use librbd. I think it's a bug in librbd with the crushmap. (I set crush_tunables3 on my ceph cluster, does it make sense?) Do you know a way to safely increase pg_num? (I don't think increasing pg_num by 100 each time is a safe, good way.)
Regards,

On Mon, Mar 16, 2015 at 8:50 PM, Florent B flor...@coppint.com wrote:
We are on Giant.

On 03/16/2015 02:03 PM, Azad Aliyar wrote:
May I know your ceph version? The latest version of firefly, 0.80.9, has patches to avoid excessive data migration during reweighting of OSDs. You may need to set a tunable in order to make this patch active. [...]

On 03/16/2015 12:37 PM, Alexandre DERUMIER wrote:
VMs are running on the same nodes than OSD
Are you sure that you didn't hit some kind of out of memory? pg rebalance can be memory hungry (depends how many osds you have).
2 OSD per host, and 5 hosts in this cluster. hosts h
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Mapping users to different rgw pools
Yes, the placement target feature is logically separate from multi-zone setups. Placement targets are configured in the region, though, which somewhat muddies the issue. Placement targets are a useful feature for multi-zone, so different zones in a cluster don't share the same disks. The federation setup is the only place I've seen any discussion about the topic, and even that is just a brief mention. I didn't see any documentation directly talking about setting up placement targets, even in the federation guides. It looks like you'll need to edit the default region to add the placement targets, but you won't need to set up zones. As far as I can tell, you'll have to piece together what you need from the federation setup and some experimentation. I highly recommend a test VM that you can experiment on before attempting anything in production.

On Sun, Mar 15, 2015 at 11:53 PM, Sreenath BH bhsreen...@gmail.com wrote:
Thanks. Is this possible outside of a multi-zone setup (with only one zone)? For example, I want to have pools with different replication factors (or erasure codings) and map users to these pools.
-Sreenath

On 3/13/15, Craig Lewis cle...@centraldesktop.com wrote:
Yes, RadosGW has the concept of Placement Targets and Placement Pools. You can create a target and point it at a set of RADOS pools. Those pools can be configured to use different storage strategies by creating different crushmap rules and assigning those rules to the pools. RGW users can be assigned a default placement target. When they create a bucket, they can either specify the target or use their default one. All objects in a bucket are stored according to the bucket's placement target. I haven't seen a good guide for making use of these features. The best guide I know of is the Federation guide (http://ceph.com/docs/giant/radosgw/federated-config/), but it only briefly mentions placement targets.

On Thu, Mar 12, 2015 at 11:48 PM, Sreenath BH bhsreen...@gmail.com wrote:
Hi all, Can one Rados gateway support more than one pool for storing objects? And as a follow-up question, is there a way to map different users to separate rgw pools so that their objects get stored in different pools?
thanks, Sreenath
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
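To make the "piece it together" part a bit more concrete, here is a rough sketch of the steps involved, based on the federation guide; the target name "cold-storage", the pool names, and the user id are made up, and the exact JSON fields can differ between versions:

  radosgw-admin region get > region.json
  # edit region.json: add {"name": "cold-storage", "tags": []} to "placement_targets"
  radosgw-admin region set < region.json
  radosgw-admin zone get > zone.json
  # edit zone.json: add a matching entry to "placement_pools", e.g.
  #   {"key": "cold-storage", "val": {"index_pool": ".rgw.buckets.index.cold",
  #                                   "data_pool": ".rgw.buckets.cold"}}
  radosgw-admin zone set < zone.json
  radosgw-admin regionmap update
  # per-user default placement is set by editing the user's metadata:
  radosgw-admin metadata get user:someuser > user.json   # edit "default_placement"
  radosgw-admin metadata put user:someuser < user.json

The matching RADOS pools have to exist (and can each use their own CRUSH rule / replication factor), and radosgw usually needs a restart to pick up the new region map.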
Re: [ceph-users] osd laggy algorithm
On Wed, Mar 11, 2015 at 8:40 AM, Artem Savinov asavi...@asdco.ru wrote:
hello. By default ceph puts an osd into the "down" status after receiving 3 reports about the failed node. Reports are sent every "osd heartbeat grace" seconds, but with the settings mon_osd_adjust_heartbeat_grace = true and mon_osd_adjust_down_out_interval = true the timeouts for marking nodes down may vary. Tell me please: what algorithm is used to change the timeouts for marking nodes down/out, and which parameters affect it? thanks.

The monitors keep track of which detected failures are incorrect (based on reports from the marked-down/out OSDs) and build up an expectation about how often the failures are correct based on an exponential backoff of the data points. You can look at the code in OSDMonitor.cc if you're interested, but basically they apply that expectation to modify the down interval and the down-out interval to a value large enough that they believe the OSD is really down (assuming these config options are set). It's not terribly interesting. :)
-Greg
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
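For completeness: whether the adjustment options are enabled, and the base values they scale, can be checked on a running monitor via the admin socket (mon.a is a placeholder for your monitor's name):

  ceph daemon mon.a config get mon_osd_adjust_heartbeat_grace
  ceph daemon mon.a config get mon_osd_adjust_down_out_interval
  ceph daemon mon.a config get osd_heartbeat_grace
  ceph daemon mon.a config get mon_osd_down_out_interval

The dynamically adjusted values themselves are computed from the laggy statistics Greg describes, so they do not show up as plain config options.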
Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?
On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk n...@fisk.me.uk wrote: I’m not sure if it’s something I’m doing wrong or just experiencing an oddity, but when my cache tier flushes dirty blocks out to the base tier, the writes seem to hit the OSD’s straight away instead of coalescing in the journals, is this correct? For example if I create a RBD on a standard 3 way replica pool and run fio via librbd 128k writes, I see the journals take all the io’s until I hit my filestore_min_sync_interval and then I see it start writing to the underlying disks. Doing the same on a full cache tier (to force flushing) I immediately see the base disks at a very high utilisation. The journals also have some write IO at the same time. The only other odd thing I can see via iostat is that most of the time whilst I’m running Fio, is that I can see the underlying disks doing very small write IO’s of around 16kb with an occasional big burst of activity. I know erasure coding+cache tier is slower than just plain replicated pools, but even with various high queue depths I’m struggling to get much above 100-150 iops compared to a 3 way replica pool which can easily achieve 1000-1500. The base tier is comprised of 40 disks. It seems quite a marked difference and I’m wondering if this strange journal behaviour is the cause. Does anyone have any ideas? If you're running a full cache pool, then on every operation touching an object which isn't in the cache pool it will try and evict an object. That's probably what you're seeing. Cache pool in general are only a wise idea if you have a very skewed distribution of data hotness and the entire hot zone can fit in cache at once. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
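A practical footnote to Greg's point: how early the cache tier starts flushing and evicting (rather than only once it is completely full) is controlled by per-pool settings on the cache pool. A sketch with a placeholder pool name and example values, not recommendations:

  ceph osd pool set hot-pool target_max_bytes 1000000000000
  ceph osd pool set hot-pool cache_target_dirty_ratio 0.4
  ceph osd pool set hot-pool cache_target_full_ratio 0.8

Keeping the working set comfortably below cache_target_full_ratio should avoid the evict-on-every-miss behaviour of a completely full cache pool described above.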
Re: [ceph-users] PGs stuck unclean active+remapped after an osd marked out
On Wed, Mar 11, 2015 at 3:49 PM, Francois Lafont flafdiv...@free.fr wrote: Hi, I was always in the same situation: I couldn't remove an OSD without have some PGs definitely stuck to the active+remapped state. But I remembered I read on IRC that, before to mark out an OSD, it could be sometimes a good idea to reweight it to 0. So, instead of doing [1]: ceph osd out 3 I have tried [2]: ceph osd crush reweight osd.3 0 # waiting for the rebalancing... ceph osd out 3 and it worked. Then I could remove my osd with the online documentation: http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual Now, the osd is removed and my cluster is HEALTH_OK. \o/ Now, my question is: why my cluster was definitely stuck to active+remapped with [1] but was not with [2]? Personally, I have absolutely no explanation. If you have an explanation, I'd love to know it. If I remember/guess correctly, if you mark an OSD out it won't necessarily change the weight of the bucket above it (ie, the host), whereas if you change the weight of the OSD then the host bucket's weight changes. That makes for different mappings, and since you only have a couple of OSDs per host (normally: hurray!) and not many hosts (normally: sadness) then marking one OSD out makes things harder for the CRUSH algorithm. -Greg Should the reweight command be present in the online documentation? http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual If yes, I can make a pull request on the doc with pleasure. ;) Regards. -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
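For later readers, the sequence that worked here, written out end to end (osd.3 as in the example above; the last four steps are the standard manual-removal steps from the linked documentation, and the exact service command depends on the distro):

  ceph osd crush reweight osd.3 0
  ceph -w                          # wait until rebalancing finishes
  ceph osd out 3
  sudo stop ceph-osd id=3          # or: service ceph stop osd.3
  ceph osd crush remove osd.3
  ceph auth del osd.3
  ceph osd rm 3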
Re: [ceph-users] client-ceph [can not connect from client][connect protocol feature mismatch]
Thanks a lot Stephane and Kamil, your reply was really helpful. I needed a different version of the ceph client on my client machine. Initially my java application using librados was throwing a connection timeout. Then I tried querying ceph from the command line (ceph --id ...), which was giving the error:

2015-03-05 13:37:16.816322 7f5191deb700 -- 10.8.25.112:0/2487 10.138.23.241:6789/0 pipe(0x12489f0 sd=3 pgs=0 cs=0 l=0).connect protocol feature mismatch, my 1ffa peer 42041ffa missing 4204

From the hints given in your mail I tried:

  wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' | sudo apt-key add -
  wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/autobuild.asc' | sudo apt-key add -
  echo deb http://ceph.com/packages/ceph-extras/debian $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph-extras.list
  echo deb http://ceph.com/debian-firefly/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list
  sudo apt-get install ceph-common

to verify:

  ceph --id brts --keyring=/etc/ceph/ceph.client.brts.keyring health
  HEALTH_OK

Thanks for the reply.
-Sonal

On Fri, Mar 6, 2015 at 5:50 AM, Stéphane DUGRAVOT stephane.dugra...@univ-lorraine.fr wrote:
Hi Sonal,
You can refer to this doc to identify your problem. Your error code is 4204, so:
- 4000 upgrade to kernel 3.9
- 200 CEPH_FEATURE_CRUSH_TUNABLES2
- 4 CEPH_FEATURE_CRUSH_TUNABLES
http://ceph.com/planet/feature-set-mismatch-error-on-ceph-kernel-client/
Stephane.
--
Hi, I am a newbie to ceph and the ceph-users group. Recently I have been working on a ceph client. It worked in all the environments, but when I tested on production it is not able to connect to ceph. Following are the operating system details and the error. If someone has seen this problem before, any help is really appreciated.

OS:
lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.04.2 LTS
Release: 12.04
Codename: precise

2015-03-05 13:37:16.816322 7f5191deb700 -- 10.8.25.112:0/2487 10.138.23.241:6789/0 pipe(0x12489f0 sd=3 pgs=0 cs=0 l=0).connect protocol feature mismatch, my 1ffa peer 42041ffa missing 4204
2015-03-05 13:37:17.635776 7f5191deb700 -- 10.8.25.112:0/2487 10.138.23.241:6789/0 pipe(0x12489f0 sd=3 pgs=0 cs=0 l=0).connect protocol feature mismatch, my 1ffa peer 42041ffa missing 4204

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
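A related note for anyone who cannot upgrade the client: this kind of feature mismatch can also be made to go away from the cluster side by relaxing the CRUSH tunables to a profile the old client understands. That triggers data movement, so it is a trade-off rather than a fix; a sketch:

  ceph osd crush show-tunables      # see what profile the cluster currently uses
  ceph osd crush tunables bobtail   # or 'legacy' for very old clients; expect rebalancing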
Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?
On Mon, 16 Mar 2015 16:09:12 -0700 Gregory Farnum wrote: Nothing here particularly surprises me. I don't remember all the details of the filestore's rate limiting off the top of my head, but it goes to great lengths to try and avoid letting the journal get too far ahead of the backing store. Disabling the filestore flusher and increasing the sync intervals without also increasing the filestore_wbthrottle_* limits is not going to work well for you. -Greg While very true and what I recalled (backing store being kicked off early) from earlier mails, I think having every last configuration parameter documented in a way that doesn't reduce people to guesswork would be very helpful. For example filestore_wbthrottle_xfs_inodes_start_flusher which defaults to 500. Assuming that this means to start flushing once 500 inodes have accumulated, how would Ceph even know how many inodes are needed for the data present? Lastly with these parameters, there is xfs and btrfs incarnations, no ext4. Do the xfs parameters also apply to ext4? Christian On Mon, Mar 16, 2015 at 3:58 PM, Nick Fisk n...@fisk.me.uk wrote: -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum Sent: 16 March 2015 17:33 To: Nick Fisk Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync? On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk n...@fisk.me.uk wrote: I’m not sure if it’s something I’m doing wrong or just experiencing an oddity, but when my cache tier flushes dirty blocks out to the base tier, the writes seem to hit the OSD’s straight away instead of coalescing in the journals, is this correct? For example if I create a RBD on a standard 3 way replica pool and run fio via librbd 128k writes, I see the journals take all the io’s until I hit my filestore_min_sync_interval and then I see it start writing to the underlying disks. Doing the same on a full cache tier (to force flushing) I immediately see the base disks at a very high utilisation. The journals also have some write IO at the same time. The only other odd thing I can see via iostat is that most of the time whilst I’m running Fio, is that I can see the underlying disks doing very small write IO’s of around 16kb with an occasional big burst of activity. I know erasure coding+cache tier is slower than just plain replicated pools, but even with various high queue depths I’m struggling to get much above 100-150 iops compared to a 3 way replica pool which can easily achieve 1000- 1500. The base tier is comprised of 40 disks. It seems quite a marked difference and I’m wondering if this strange journal behaviour is the cause. Does anyone have any ideas? If you're running a full cache pool, then on every operation touching an object which isn't in the cache pool it will try and evict an object. That's probably what you're seeing. Cache pool in general are only a wise idea if you have a very skewed distribution of data hotness and the entire hot zone can fit in cache at once. -Greg Hi Greg, It's not the caching behaviour that I confused about, it’s the journal behaviour on the base disks during flushing. I've been doing some more tests and can do something reproducible which seems strange to me. 
First off 10MB of 4kb writes: time ceph tell osd.1 bench 1000 4096 { bytes_written: 1000, blocksize: 4096, bytes_per_sec: 16009426.00} real0m0.760s user0m0.063s sys 0m0.022s Now split this into 2x5mb writes: time ceph tell osd.1 bench 500 4096 time ceph tell osd.1 bench 500 4096 { bytes_written: 500, blocksize: 4096, bytes_per_sec: 10580846.00} real0m0.595s user0m0.065s sys 0m0.018s { bytes_written: 500, blocksize: 4096, bytes_per_sec: 9944252.00} real0m4.412s user0m0.053s sys 0m0.071s 2nd bench takes a lot longer even though both should easily fit in the 5GB journal. Looking at iostat, I think I can see that no writes happen to the journal whilst the writes from the 1st bench are being flushed. Is this the expected behaviour? I would have thought as long as there is space available in the journal it shouldn't block on new writes. Also I see in iostat writes to the underlying disk happening at a QD of 1 and 16kb IO's for a number of seconds, with a large blip or activity just before the flush finishes. Is this the correct behaviour? I would have thought if this tell osd bench is doing sequential IO then the journal should be able to flush 5-10mb of data in a fraction a second. Ceph.conf [osd] filestore max sync interval = 30 filestore min sync interval = 20 filestore flusher = false osd_journal_size = 5120 osd_crush_location_hook = /usr/local/bin/crush-location
Re: [ceph-users] RadosGW Direct Upload Limitation
- Original Message - From: Craig Lewis cle...@centraldesktop.com To: Gregory Farnum g...@gregs42.com Cc: ceph-users@lists.ceph.com Sent: Monday, March 16, 2015 11:48:15 AM Subject: Re: [ceph-users] RadosGW Direct Upload Limitation Maybe, but I'm not sure if Yehuda would want to take it upstream or not. This limit is present because it's part of the S3 spec. For larger objects you should use multi-part upload, which can get much bigger. -Greg Note that the multi-part upload has a lower limit of 4MiB per part, and the direct upload has an upper limit of 5GiB. The limit is 10MB, but it does not apply to the last part, so basically you could upload any object size with it. I would still recommend using the plain upload for smaller object sizes, it is faster, and the resulting object might be more efficient (for really small sizes). Yehuda So you have to use both methods - direct upload for small files, and multi-part upload for big files. Your best bet is to use the Amazon S3 libraries. They have functions that take care of it for you. I'd like to see this mentioned in the Ceph documentation someplace. When I first encountered the issue, I couldn't find a limit in the RadosGW documentation anywhere. I only found the 5GiB limit in the Amazon API documentation, which lead me to test on RadosGW. Now that I know it was done to preserve Amazon compatibility, I don't want to override the value anymore. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS unexplained writes
The information you're giving sounds a little contradictory, but my guess is that you're seeing the impacts of object promotion and flushing. You can sample the operations the OSDs are doing at any given time by running ops_in_progress (or similar, I forget exact phrasing) command on the OSD admin socket. I'm not sure if rados df is going to report cache movement activity or not. That though would mostly be written to the SSDs, not the hard drives — although the hard drives could still get metadata updates written when objects are flushed. What data exactly are you seeing that's leading you to believe writes are happening against these drives? What is the exact CephFS and cache pool configuration? -Greg On Mon, Mar 16, 2015 at 2:36 PM, Erik Logtenberg e...@logtenberg.eu wrote: Hi, I forgot to mention: while I am seeing these writes in iotop and /proc/diskstats for the hdd's, I am -not- seeing any writes in rados df for the pool residing on these disks. There is only one pool active on the hdd's and according to rados df it is getting zero writes when I'm just reading big files from cephfs. So apparently the osd's are doing some non-trivial amount of writing on their own behalf. What could it be? Thanks, Erik. On 03/16/2015 10:26 PM, Erik Logtenberg wrote: Hi, I am getting relatively bad performance from cephfs. I use a replicated cache pool on ssd in front of an erasure coded pool on rotating media. When reading big files (streaming video), I see a lot of disk i/o, especially writes. I have no clue what could cause these writes. The writes are going to the hdd's and they stop when I stop reading. I mounted everything with noatime and nodiratime so it shouldn't be that. On a related note, the Cephfs metadata is stored on ssd too, so metadata-related changes shouldn't hit the hdd's anyway I think. Any thoughts? How can I get more information about what ceph is doing? Using iotop I only see that the osd processes are busy but it doesn't give many hints as to what they are doing. Thanks, Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
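For reference, the admin socket command Greg is thinking of is, as far as I can tell, dump_ops_in_flight. A quick sketch of sampling one OSD (osd.0 is a placeholder):

  ceph daemon osd.0 dump_ops_in_flight
  ceph daemon osd.0 perf dump

On recent releases the perf dump output also contains tiering-related counters (e.g. tier_promote / tier_flush), which can help confirm whether promotions and flushes are behind the write activity seen on the hard drives.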
Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?
-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum
Sent: 16 March 2015 17:33
To: Nick Fisk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk n...@fisk.me.uk wrote:
I'm not sure if it's something I'm doing wrong or just experiencing an oddity, but when my cache tier flushes dirty blocks out to the base tier, the writes seem to hit the OSDs straight away instead of coalescing in the journals. Is this correct? For example, if I create an RBD on a standard 3-way replica pool and run fio via librbd with 128k writes, I see the journals take all the IOs until I hit my filestore_min_sync_interval and then I see it start writing to the underlying disks. Doing the same on a full cache tier (to force flushing) I immediately see the base disks at a very high utilisation. The journals also have some write IO at the same time. The only other odd thing I can see via iostat is that, for most of the time whilst I'm running fio, the underlying disks are doing very small write IOs of around 16kb with an occasional big burst of activity. I know erasure coding + cache tier is slower than just plain replicated pools, but even with various high queue depths I'm struggling to get much above 100-150 iops, compared to a 3-way replica pool which can easily achieve 1000-1500. The base tier is comprised of 40 disks. It seems quite a marked difference and I'm wondering if this strange journal behaviour is the cause. Does anyone have any ideas?

If you're running a full cache pool, then on every operation touching an object which isn't in the cache pool it will try and evict an object. That's probably what you're seeing. Cache pools in general are only a wise idea if you have a very skewed distribution of data hotness and the entire hot zone can fit in cache at once.
-Greg

Hi Greg,

It's not the caching behaviour that I'm confused about, it's the journal behaviour on the base disks during flushing. I've been doing some more tests and have something reproducible which seems strange to me.

First off, 10MB of 4kb writes:
time ceph tell osd.1 bench 1000 4096
{ bytes_written: 1000, blocksize: 4096, bytes_per_sec: 16009426.00}
real    0m0.760s
user    0m0.063s
sys     0m0.022s

Now split this into 2 x 5MB writes:
time ceph tell osd.1 bench 500 4096
time ceph tell osd.1 bench 500 4096
{ bytes_written: 500, blocksize: 4096, bytes_per_sec: 10580846.00}
real    0m0.595s
user    0m0.065s
sys     0m0.018s
{ bytes_written: 500, blocksize: 4096, bytes_per_sec: 9944252.00}
real    0m4.412s
user    0m0.053s
sys     0m0.071s

The 2nd bench takes a lot longer, even though both should easily fit in the 5GB journal. Looking at iostat, I think I can see that no writes happen to the journal whilst the writes from the 1st bench are being flushed. Is this the expected behaviour? I would have thought that as long as there is space available in the journal it shouldn't block on new writes.

Also, I see in iostat writes to the underlying disk happening at a QD of 1 and 16kb IOs for a number of seconds, with a large blip of activity just before the flush finishes. Is this the correct behaviour? I would have thought that if this tell osd bench is doing sequential IO then the journal should be able to flush 5-10MB of data in a fraction of a second.

Ceph.conf:
[osd]
filestore max sync interval = 30
filestore min sync interval = 20
filestore flusher = false
osd_journal_size = 5120
osd_crush_location_hook = /usr/local/bin/crush-location
osd_op_threads = 5
filestore_op_threads = 4

iostat during period where writes seem to be blocked (journal=sda, disk=sdd):
Device:  rrqm/s  wrqm/s  r/s   w/s    rkB/s  wkB/s   avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda      0.00    0.00    0.00  0.00   0.00   0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00
sdb      0.00    0.00    0.00  2.00   0.00   4.00    4.00      0.00      0.00   0.00     0.00     0.00   0.00
sdc      0.00    0.00    0.00  0.00   0.00   0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00
sdd      0.00    0.00    0.00  76.00  0.00   760.00  20.00     0.99      13.11  0.00     13.11    13.05  99.20

iostat during what I believe to be the actual flush:
Device:  rrqm/s  wrqm/s  r/s   w/s    rkB/s  wkB/s   avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda      0.00    0.00    0.00  0.00   0.00   0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00
sdb      0.00    0.00    0.00  2.00   0.00   4.00    4.00      0.00      0.00   0.00     0.00     0.00   0.00
sdc      0.00    0.00    0.00  0.00   0.00   0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00
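One way to confirm whether the journal really goes idle during the flush is to watch the journal and data devices side by side while pulling the OSD's internal counters during the second bench. A rough sketch (osd.1, /dev/sda and /dev/sdd are the devices from the config above; the perf counter layout varies a little between releases):

    # One-second samples of the journal SSD vs the backing disk while the bench runs
    iostat -x 1 /dev/sda /dev/sdd

    # Filestore/journal counters from the OSD admin socket; the filestore and
    # journal sections show queued ops/bytes and whether a sync is committing
    ceph daemon osd.1 perf dump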
[ceph-users] CephFS unexplained writes
Hi, I am getting relatively bad performance from cephfs. I use a replicated cache pool on ssd in front of an erasure coded pool on rotating media. When reading big files (streaming video), I see a lot of disk i/o, especially writes. I have no clue what could cause these writes. The writes are going to the hdd's and they stop when I stop reading. I mounted everything with noatime and nodiratime so it shouldn't be that. On a related note, the Cephfs metadata is stored on ssd too, so metadata-related changes shouldn't hit the hdd's anyway I think. Any thoughts? How can I get more information about what ceph is doing? Using iotop I only see that the osd processes are busy but it doesn't give many hints as to what they are doing. Thanks, Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS unexplained writes
Hi, I forgot to mention: while I am seeing these writes in iotop and /proc/diskstats for the hdd's, I am -not- seeing any writes in rados df for the pool residing on these disks. There is only one pool active on the hdd's and according to rados df it is getting zero writes when I'm just reading big files from cephfs. So apparently the osd's are doing some non-trivial amount of writing on their own behalf. What could it be? Thanks, Erik. On 03/16/2015 10:26 PM, Erik Logtenberg wrote: Hi, I am getting relatively bad performance from cephfs. I use a replicated cache pool on ssd in front of an erasure coded pool on rotating media. When reading big files (streaming video), I see a lot of disk i/o, especially writes. I have no clue what could cause these writes. The writes are going to the hdd's and they stop when I stop reading. I mounted everything with noatime and nodiratime so it shouldn't be that. On a related note, the Cephfs metadata is stored on ssd too, so metadata-related changes shouldn't hit the hdd's anyway I think. Any thoughts? How can I get more information about what ceph is doing? Using iotop I only see that the osd processes are busy but it doesn't give many hints as to what they are doing. Thanks, Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?
Nothing here particularly surprises me. I don't remember all the details of the filestore's rate limiting off the top of my head, but it goes to great lengths to try and avoid letting the journal get too far ahead of the backing store. Disabling the filestore flusher and increasing the sync intervals without also increasing the filestore_wbthrottle_* limits is not going to work well for you.
-Greg

On Mon, Mar 16, 2015 at 3:58 PM, Nick Fisk n...@fisk.me.uk wrote:

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum
Sent: 16 March 2015 17:33
To: Nick Fisk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk n...@fisk.me.uk wrote:
I'm not sure if it's something I'm doing wrong or just experiencing an oddity, but when my cache tier flushes dirty blocks out to the base tier, the writes seem to hit the OSDs straight away instead of coalescing in the journals. Is this correct? For example, if I create an RBD on a standard 3-way replica pool and run fio via librbd with 128k writes, I see the journals take all the IOs until I hit my filestore_min_sync_interval and then I see it start writing to the underlying disks. Doing the same on a full cache tier (to force flushing) I immediately see the base disks at a very high utilisation. The journals also have some write IO at the same time. The only other odd thing I can see via iostat is that, for most of the time whilst I'm running fio, the underlying disks are doing very small write IOs of around 16kb with an occasional big burst of activity. I know erasure coding + cache tier is slower than just plain replicated pools, but even with various high queue depths I'm struggling to get much above 100-150 iops, compared to a 3-way replica pool which can easily achieve 1000-1500. The base tier is comprised of 40 disks. It seems quite a marked difference and I'm wondering if this strange journal behaviour is the cause. Does anyone have any ideas?

If you're running a full cache pool, then on every operation touching an object which isn't in the cache pool it will try and evict an object. That's probably what you're seeing. Cache pools in general are only a wise idea if you have a very skewed distribution of data hotness and the entire hot zone can fit in cache at once.
-Greg

Hi Greg,

It's not the caching behaviour that I'm confused about, it's the journal behaviour on the base disks during flushing. I've been doing some more tests and have something reproducible which seems strange to me.

First off, 10MB of 4kb writes:
time ceph tell osd.1 bench 1000 4096
{ bytes_written: 1000, blocksize: 4096, bytes_per_sec: 16009426.00}
real    0m0.760s
user    0m0.063s
sys     0m0.022s

Now split this into 2 x 5MB writes:
time ceph tell osd.1 bench 500 4096
time ceph tell osd.1 bench 500 4096
{ bytes_written: 500, blocksize: 4096, bytes_per_sec: 10580846.00}
real    0m0.595s
user    0m0.065s
sys     0m0.018s
{ bytes_written: 500, blocksize: 4096, bytes_per_sec: 9944252.00}
real    0m4.412s
user    0m0.053s
sys     0m0.071s

The 2nd bench takes a lot longer, even though both should easily fit in the 5GB journal. Looking at iostat, I think I can see that no writes happen to the journal whilst the writes from the 1st bench are being flushed. Is this the expected behaviour? I would have thought that as long as there is space available in the journal it shouldn't block on new writes.

Also, I see in iostat writes to the underlying disk happening at a QD of 1 and 16kb IOs for a number of seconds, with a large blip of activity just before the flush finishes. Is this the correct behaviour? I would have thought that if this tell osd bench is doing sequential IO then the journal should be able to flush 5-10MB of data in a fraction of a second.

Ceph.conf:
[osd]
filestore max sync interval = 30
filestore min sync interval = 20
filestore flusher = false
osd_journal_size = 5120
osd_crush_location_hook = /usr/local/bin/crush-location
osd_op_threads = 5
filestore_op_threads = 4

iostat during period where writes seem to be blocked (journal=sda, disk=sdd):
Device:  rrqm/s  wrqm/s  r/s   w/s    rkB/s  wkB/s   avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda      0.00    0.00    0.00  0.00   0.00   0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00
sdb      0.00    0.00    0.00  2.00   0.00   4.00    4.00      0.00      0.00   0.00     0.00     0.00   0.00
sdc      0.00    0.00    0.00  0.00   0.00   0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00
sdd      0.00    0.00    0.00  76.00  0.00   760.00  20.00     0.99      13.11  0.00     13.11    13.05  99.20

iostat during
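To act on Greg's point about the write-back throttle, the relevant knobs can be inspected and raised on a running OSD. A minimal sketch (the xfs-variant option names assume an XFS filestore, and the values are purely illustrative, not recommendations):

    # See the current write-back throttle settings on one OSD
    ceph daemon osd.1 config show | grep filestore_wbthrottle

    # Raise the thresholds at runtime for testing (example values only)
    ceph tell osd.* injectargs '--filestore_wbthrottle_xfs_bytes_start_flusher 41943040'
    ceph tell osd.* injectargs '--filestore_wbthrottle_xfs_ios_start_flusher 5000'

To make such a change permanent it would go in the [osd] section of ceph.conf alongside the sync interval settings shown above.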
Re: [ceph-users] Mapping users to different rgw pools
Thanks. Is this possible outside of a multi-zone setup (with only one zone)? For example, I want to have pools with different replication factors (or erasure coding profiles) and map users to these pools.
-Sreenath

On 3/13/15, Craig Lewis cle...@centraldesktop.com wrote:
Yes, RadosGW has the concept of Placement Targets and Placement Pools. You can create a target, and point it at a set of RADOS pools. Those pools can be configured to use different storage strategies by creating different crushmap rules, and assigning those rules to the pool.

RGW users can be assigned a default placement target. When they create a bucket, they can either specify the target, or use their default one. All objects in a bucket are stored according to the bucket's placement target.

I haven't seen a good guide for making use of these features. The best guide I know of is the Federation guide (http://ceph.com/docs/giant/radosgw/federated-config/), but it only briefly mentions placement targets.

On Thu, Mar 12, 2015 at 11:48 PM, Sreenath BH bhsreen...@gmail.com wrote:
Hi all,
Can one Rados gateway support more than one pool for storing objects? And as a follow-up question, is there a way to map different users to separate rgw pools so that their objects get stored in different pools?
thanks,
Sreenath
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
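For a single-zone cluster, the placement targets Craig describes are edited directly in the region and zone definitions. A rough sketch of the workflow on a Firefly/Giant gateway (the target name "fast-placement", the user id and the pool names are placeholders; the exact JSON fields are best checked against the federation guide linked above):

    # Dump the current definitions
    radosgw-admin region get > region.json
    radosgw-admin zone get > zone.json

    # Edit region.json to add a "fast-placement" entry under placement_targets,
    # and zone.json to add a matching placement_pools entry pointing at your
    # index/data pools (which can sit on their own CRUSH ruleset), then:
    radosgw-admin region set < region.json
    radosgw-admin zone set < zone.json
    radosgw-admin regionmap update      # then restart the gateway

    # Point a user at the new target by editing default_placement in their metadata
    radosgw-admin metadata get user:johndoe > user.json
    radosgw-admin metadata put user:johndoe < user.json   # after setting "default_placement": "fast-placement"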
Re: [ceph-users] RadosGW Direct Upload Limitation
Maybe, but I'm not sure if Yehuda would want to take it upstream or not. This limit is present because it's part of the S3 spec. For larger objects you should use multi-part upload, which can get much bigger.
-Greg

Note that the multi-part upload has a lower limit of 4MiB per part, and the direct upload has an upper limit of 5GiB. So you have to use both methods - direct upload for small files, and multi-part upload for big files. Your best bet is to use the Amazon S3 libraries. They have functions that take care of it for you.

I'd like to see this mentioned in the Ceph documentation someplace. When I first encountered the issue, I couldn't find a limit in the RadosGW documentation anywhere. I only found the 5GiB limit in the Amazon API documentation, which led me to test on RadosGW. Now that I know it was done to preserve Amazon compatibility, I don't want to override the value anymore.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
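As an illustration of the "let the client library handle it" advice, a recent s3cmd pointed at the RGW endpoint switches to multipart automatically, so a file larger than 5GiB goes through without hitting the single-PUT limit (the bucket name and chunk size are only examples):

    # Uploads in 100MB parts via the S3 multipart API instead of one huge PUT
    s3cmd put --multipart-chunk-size-mb=100 bigfile.iso s3://mybucket/bigfile.iso

Most S3 SDKs (boto, the AWS SDKs, etc.) expose the same behaviour through their "upload file" helpers, which split and reassemble the parts for you.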
Re: [ceph-users] PGs stuck unclean active+remapped after an osd marked out
If I remember/guess correctly, if you mark an OSD out it won't necessarily change the weight of the bucket above it (ie, the host), whereas if you change the weight of the OSD then the host bucket's weight changes. -Greg That sounds right. Marking an OSD out is a ceph osd reweight, not a ceph osd crush reweight. Experimentally confirmed. I have an OSD out right now, and the host's crush weight is the same as the other hosts' crush weight. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
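A quick way to see the difference being described, assuming osd.12 stands in for the OSD in question:

    # Marking an OSD out only zeroes its reweight column; the host bucket
    # above it keeps its CRUSH weight
    ceph osd out 12          # roughly equivalent to: ceph osd reweight 12 0
    ceph osd tree

    # Changing the CRUSH weight, by contrast, propagates up to the host bucket
    ceph osd crush reweight osd.12 0
    ceph osd tree

Because the two operations change different numbers, they produce different PG mappings, which is why an "out" OSD and a zero-crush-weight OSD do not rebalance the same way.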
Re: [ceph-users] query about mapping of Swift/S3 APIs to Ceph cluster APIs
On Sat, Mar 14, 2015 at 3:04 AM, pragya jain prag_2...@yahoo.co.in wrote:
Hello all!
I have been working on the Ceph object storage architecture for the last few months. I am unable to find a document which describes how the Ceph object storage APIs (Swift/S3 APIs) are mapped to the Ceph storage cluster APIs (librados APIs) to store the data in the Ceph storage cluster. As the documents say: Radosgw, a gateway interface for Ceph object storage users, accepts user requests to store or retrieve data in the form of Swift or S3 APIs and converts the user's request into a RADOS request.
Please help me in knowing:
1. how does Radosgw convert a user request to a RADOS request?
2. how are HTTP requests mapped to RADOS requests?

The RadosGW daemon takes care of that. It's an application that sits on top of RADOS.

For HTTP, there are a couple of ways. The older way has Apache accepting the HTTP request, then forwarding it to the RadosGW daemon using FastCGI. Newer versions support RadosGW handling the HTTP directly.

For the full details, you'll want to check out the source code at https://github.com/ceph/ceph

If you're not interested enough to read the source code (I wasn't :-) ), set up a test cluster. Create a user, bucket, and object, and look at the contents of the rados pools.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
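A minimal sketch of that "create something and look at the pools" experiment, assuming default Firefly-era pool names and placeholder user/bucket names:

    # Create a test user (prints the S3 access/secret keys to configure a client with)
    radosgw-admin user create --uid=testuser --display-name="Test User"

    # With any S3 client pointed at the gateway, make a bucket and upload an object, e.g.
    s3cmd mb s3://testbucket
    s3cmd put hello.txt s3://testbucket/

    # Then inspect what actually landed in RADOS
    rados ls -p .rgw                 # bucket entrypoint/metadata objects
    rados ls -p .users.uid           # user metadata
    rados ls -p .rgw.buckets | head  # the object data, prefixed by the bucket marker

Seeing the RADOS object names next to the S3 names makes the mapping RadosGW performs fairly concrete.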
Re: [ceph-users] PGs stuck unclean active+remapped after an osd marked out
Hi,

Gregory Farnum wrote:
If I remember/guess correctly, if you mark an OSD out it won't necessarily change the weight of the bucket above it (ie, the host), whereas if you change the weight of the OSD then the host bucket's weight changes.

I can just say that, indeed, I have noticed exactly what you describe in the output of ceph osd tree.

That makes for different mappings, and since you only have a couple of OSDs per host (normally: hurray!)

Er, er... no, I have 10 OSDs in the first OSD node and 11 OSDs in the second OSD node (see my first message).

and not many hosts (normally: sadness)

Yes, I have only 2 OSD nodes (and 3 monitors).

then marking one OSD out makes things harder for the CRUSH algorithm.

Ah, OK. So my cluster is too small for Ceph. ;)

Thanks for your answer Greg, I will follow the pull-request with attention.

--
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RadosGW Direct Upload Limitation
On Mon, Mar 16, 2015 at 11:14 AM, Georgios Dimitrakakis gior...@acmac.uoc.gr wrote:
Hi all!
I have recently updated to CEPH version 0.80.9 (latest Firefly release) which presumably supports direct upload. I've tried to upload a file using this functionality and it seems that it is working for files up to 5GB. For files above 5GB there is an error. I believe that this is because of a hardcoded limit:
#define RGW_MAX_PUT_SIZE (5ULL*1024*1024*1024)
Is there a way to increase that limit other than compiling CEPH from source?

No.

Could we somehow put it as a configuration parameter?

Maybe, but I'm not sure if Yehuda would want to take it upstream or not. This limit is present because it's part of the S3 spec. For larger objects you should use multi-part upload, which can get much bigger.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Shadow files
Out of curiosity, what's the frequency of the peaks and troughs?

RadosGW has configs on how long it should wait after deleting before garbage collecting, how long between GC runs, and how many objects it can GC per run. The defaults are 2 hours, 1 hour, and 32 respectively. Search http://docs.ceph.com/docs/master/radosgw/config-ref/ for rgw gc.

If your peaks and troughs have a frequency of less than 1 hour, then GC is going to delay and alias the disk usage w.r.t. the object count. If you have millions of objects, you probably need to tweak those values. If RGW is only GCing 32 objects an hour, it's never going to catch up.

Now that I think about it, I bet I'm having issues here too. I delete more than (32*24) objects per day...

On Sun, Mar 15, 2015 at 4:41 PM, Ben b@benjackson.email wrote:
It is either a problem with CEPH, Civetweb or something else in our configuration. But deletes in user buckets are still leaving a high number of old shadow files. Since we have millions and millions of objects, it is hard to reconcile what should and shouldn't exist.
Looking at our cluster usage, there are no troughs, it is just a rising peak. But when looking at users' data usage, we can see peaks and troughs as you would expect as data is deleted and added.
Our ceph version is 0.80.9.
Any ideas, please?

On 2015-03-13 02:25, Yehuda Sadeh-Weinraub wrote:
----- Original Message -----
From: Ben b@benjackson.email
To: ceph-us...@ceph.com
Sent: Wednesday, March 11, 2015 8:46:25 PM
Subject: Re: [ceph-users] Shadow files
Anyone got any info on this? Is it safe to delete shadow files?

It depends. Shadow files are badly named objects that represent part of the objects' data. They are only safe to remove if you know that the corresponding objects no longer exist.
Yehuda

On 2015-03-11 10:03, Ben wrote:
We have a large number of shadow files in our cluster that aren't being deleted automatically as data is deleted. Is it safe to delete these files? Is there something we need to be aware of when deleting them? Is there a script that we can run that will delete these safely? Is there something wrong with our cluster that it isn't deleting these files when it should be?
We are using civetweb with radosgw, with a tengine ssl proxy in front of it.
Any advice please
Thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
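For anyone who wants to check where GC stands on their own gateway, a small sketch (the daemon/socket name is deployment-specific, and note Greg's correction below about what rgw gc max objs actually controls):

    # What is currently queued for garbage collection, and a manual GC pass
    radosgw-admin gc list --include-all | head
    radosgw-admin gc process

    # Current GC settings, if the gateway's admin socket is enabled
    ceph daemon client.radosgw.gateway config show | grep rgw_gc

    # The ceph.conf knobs Craig refers to, shown with their defaults
    [client.radosgw.gateway]
        rgw gc obj min wait = 7200        # 2 hours before a deleted object is eligible
        rgw gc processor period = 3600    # 1 hour between GC runs
        rgw gc max objs = 32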
Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !
VMs are running on the same nodes as the OSDs

Are you sure that you didn't hit some kind of out-of-memory condition? PG rebalancing can be memory hungry (depending on how many OSDs you have). Do you see oom-killer in your host logs?

----- Original Message -----
From: Florent Bautista flor...@coppint.com
To: aderumier aderum...@odiso.com
Cc: ceph-users ceph-users@lists.ceph.com
Sent: Monday, 16 March 2015 12:35:11
Subject: Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

On 03/16/2015 12:23 PM, Alexandre DERUMIER wrote:

We use Proxmox, so I think it uses librbd ?

As I'm the one who made the Proxmox rbd plugin, I can confirm that yes, it's librbd ;)

Is the Ceph cluster on dedicated nodes, or are the VMs running on the same nodes as the OSD daemons?

VMs are running on the same nodes as the OSDs. And I should point out that not all VMs on that pool crashed, only some of them (a large majority), and on the same host some crashed and others did not.

Did the VM crash, as in no more qemu process, or is it the guest OS that crashed? (Do you use virtio, virtio-scsi or ide for your guests?)

I don't really know what crashed, I think the qemu process, but I'm not sure. We use virtio.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
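A quick way to rule the OOM killer in or out on the hypervisor/OSD nodes, checked around the time the pg_num change was made (log paths differ between distributions):

    # Kernel ring buffer and syslog entries left behind by the OOM killer
    dmesg | grep -iE 'out of memory|oom|killed process'
    grep -i 'oom' /var/log/syslog /var/log/messages 2>/dev/null

    # Rough check of current memory headroom on each node before retrying
    free -m

If qemu processes show up in "Killed process ..." entries, that would point to memory pressure from the rebalance rather than an RBD/librbd problem.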
Re: [ceph-users] Shadow files
On Mon, Mar 16, 2015 at 12:12 PM, Craig Lewis cle...@centraldesktop.com wrote:
Out of curiosity, what's the frequency of the peaks and troughs?
RadosGW has configs on how long it should wait after deleting before garbage collecting, how long between GC runs, and how many objects it can GC per run. The defaults are 2 hours, 1 hour, and 32 respectively. Search http://docs.ceph.com/docs/master/radosgw/config-ref/ for rgw gc.
If your peaks and troughs have a frequency of less than 1 hour, then GC is going to delay and alias the disk usage w.r.t. the object count. If you have millions of objects, you probably need to tweak those values. If RGW is only GCing 32 objects an hour, it's never going to catch up.
Now that I think about it, I bet I'm having issues here too. I delete more than (32*24) objects per day...

Uh, that's not quite what rgw_gc_max_objs means. That param configures how the garbage collection data objects and internal classes are sharded, and each grouping will only delete one object at a time. So it controls the parallelism, but not the total number of objects!

Also, Yehuda says that changing this can be a bit dangerous because it currently needs to be consistent across any program doing or generating GC work.
-Greg

On Sun, Mar 15, 2015 at 4:41 PM, Ben b@benjackson.email wrote:
It is either a problem with CEPH, Civetweb or something else in our configuration. But deletes in user buckets are still leaving a high number of old shadow files. Since we have millions and millions of objects, it is hard to reconcile what should and shouldn't exist.
Looking at our cluster usage, there are no troughs, it is just a rising peak. But when looking at users' data usage, we can see peaks and troughs as you would expect as data is deleted and added.
Our ceph version is 0.80.9.
Any ideas, please?

On 2015-03-13 02:25, Yehuda Sadeh-Weinraub wrote:
----- Original Message -----
From: Ben b@benjackson.email
To: ceph-us...@ceph.com
Sent: Wednesday, March 11, 2015 8:46:25 PM
Subject: Re: [ceph-users] Shadow files
Anyone got any info on this? Is it safe to delete shadow files?

It depends. Shadow files are badly named objects that represent part of the objects' data. They are only safe to remove if you know that the corresponding objects no longer exist.
Yehuda

On 2015-03-11 10:03, Ben wrote:
We have a large number of shadow files in our cluster that aren't being deleted automatically as data is deleted. Is it safe to delete these files? Is there something we need to be aware of when deleting them? Is there a script that we can run that will delete these safely? Is there something wrong with our cluster that it isn't deleting these files when it should be?
We are using civetweb with radosgw, with a tengine ssl proxy in front of it.
Any advice please
Thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Fw: query about mapping of Swift/S3 APIs to Ceph cluster APIs
Please, somebody answer my queries.
-Regards
Pragya Jain
Department of Computer Science
University of Delhi
Delhi, India

On Saturday, 14 March 2015 3:34 PM, pragya jain prag_2...@yahoo.co.in wrote:
Hello all!
I have been working on the Ceph object storage architecture for the last few months. I am unable to find a document which describes how the Ceph object storage APIs (Swift/S3 APIs) are mapped to the Ceph storage cluster APIs (librados APIs) to store the data in the Ceph storage cluster. As the documents say: Radosgw, a gateway interface for Ceph object storage users, accepts user requests to store or retrieve data in the form of Swift or S3 APIs and converts the user's request into a RADOS request.
Please help me in knowing:
1. how does Radosgw convert a user request to a RADOS request?
2. how are HTTP requests mapped to RADOS requests?
Thank you
-Regards
Pragya Jain
Department of Computer Science
University of Delhi
Delhi, India
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] query about region and zone creation while configuring RADOSGW
Hello all!
I am working on the Ceph object storage architecture. I have some queries:
In the case of configuring a federated system, we need to create regions containing one or more zones, the cluster must have a master region, and each region must have a master zone. But in the case of a simple gateway configuration, is there a need to create at least a region and a zone to store the data?
Please, somebody reply to my query.
Thank you
-Regards
Pragya Jain
Department of Computer Science
University of Delhi
Delhi, India
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
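One way to see what a non-federated gateway is actually using is simply to ask it. On a default install these commands typically show an implicit "default" region and zone with the standard .rgw.* pools, though the exact behaviour varies a bit between releases:

    radosgw-admin region get
    radosgw-admin zone get

If those return sensible defaults, objects can be stored without ever creating a region or zone explicitly; defining them by hand only becomes necessary for federation.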