Re: [ceph-users] CephFS: delayed objects deletion ?

2015-03-16 Thread Yan, Zheng
On Mon, Mar 16, 2015 at 5:08 PM, Florent B flor...@coppint.com wrote:
 Since then I deleted the pool.

 But now I have another problem, in fact the opposite of the previous one:
 this time I never deleted the files on the clients, the data objects and
 metadata are still in the pools, but the directory appears empty to the
 clients (it is a different directory, different pool, etc. from the
 previous problem).

 Here are logs from MDS when I restart it about one of the files :

 2015-03-16 09:57:48.626254 7f4177694700 12 mds.0.cache.dir(1a95e05)
 link_primary_inode [dentry #1/staging/api/easyrsa/vars [2,head] auth
 NULL (dversion lock) v=22 inode=0 | dirty=1 0x6ca5a20] [inode
 1a95e11 [2,head] #1a95e11 auth v22 s=0 n(v0 1=1+0) (iversion
 lock) cr={29050627=0-1966080@1} 0x53c32c8]
 2015-03-16 09:57:48.626258 7f4177694700 10 mds.0.journal
 EMetaBlob.replay added [inode 1a95e11 [2,head]
 /staging/api/easyrsa/vars auth v22 s=0 n(v0 1=1+0) (iversion lock)
 cr={29050627=0-1966080@1} 0x53c32c8]
 2015-03-16 09:57:48.626260 7f4177694700 10 mds.0.cache.ino(1a95e11)
 mark_dirty_parent
 2015-03-16 09:57:48.626261 7f4177694700 10 mds.0.journal
 EMetaBlob.replay noting opened inode [inode 1a95e11 [2,head]
 /staging/api/easyrsa/vars auth v22 dirtyparent s=0 n(v0 1=1+0) (iversion
 lock) cr={29050627=0-1966080@1} | dirtyparent=1 dirty=1 0x53c32c8]
 2015-03-16 09:57:48.626264 7f4177694700 10 mds.0.journal
 EMetaBlob.replay sessionmap v 21580500 -(1|2) == table 21580499 prealloc
 [] used 1a95e11
 2015-03-16 09:57:48.626265 7f4177694700 20 mds.0.journal  (session
 prealloc [1a95e11~3dd])
 2015-03-16 09:57:48.626843 7f4177694700 10 mds.0.journal
 EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head]
 /staging/api/easyrsa/vars auth v42 dirtyparent s=8089 n(v0 b8089 1=1+0)
 (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
 2015-03-16 09:57:48.629319 7f4177694700 10 mds.0.journal
 EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head]
 /staging/api/easyrsa/vars auth v99 dirtyparent s=8089 n(v0 b8089 1=1+0)
 (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
 2015-03-16 09:57:48.629357 7f4177694700 10 mds.0.journal
 EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head]
 /staging/api/easyrsa/vars auth v101 dirtyparent s=8089 n(v0 b8089 1=1+0)
 (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
 2015-03-16 09:57:48.636559 7f4177694700 10 mds.0.journal
 EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head]
 /staging/api/easyrsa/vars auth v164 dirtyparent s=8089 n(v0 b8089 1=1+0)
 (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
 2015-03-16 09:57:48.636597 7f4177694700 10 mds.0.journal
 EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head]
 /staging/api/easyrsa/vars auth v166 dirtyparent s=8089 n(v0 b8089 1=1+0)
 (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
 2015-03-16 09:57:48.644280 7f4177694700 10 mds.0.journal
 EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head]
 /staging/api/easyrsa/vars auth v227 dirtyparent s=8089 n(v0 b8089 1=1+0)
 (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
 2015-03-16 09:57:48.644318 7f4177694700 10 mds.0.journal
 EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head]
 /staging/api/easyrsa/vars auth v229 dirtyparent s=8089 n(v0 b8089 1=1+0)
 (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
 2015-03-16 09:57:51.911267 7f417c9a1700 15 mds.0.cache  chose lock
 states on [inode 1a95e11 [2,head] /staging/api/easyrsa/vars auth
 v229 dirtyparent s=8089 n(v0 b8089 1=1+0) (iversion lock) |
 dirtyparent=1 dirty=1 0x53c32c8]
 2015-03-16 09:57:51.916816 7f417c9a1700 20 mds.0.locker
 check_inode_max_size no-op on [inode 1a95e11 [2,head]
 /staging/api/easyrsa/vars auth v229 dirtyparent s=8089 n(v0 b8089 1=1+0)
 (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
 2015-03-16 09:57:51.958925 7f417c9a1700  7 mds.0.cache inode [inode
 1a95e11 [2,head] /staging/api/easyrsa/vars auth v229 dirtyparent
 s=8089 n(v0 b8089 1=1+0) (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
 2015-03-16 09:57:56.561404 7f417c9a1700 10 mds.0.cache  unlisting
 unwanted/capless inode [inode 1a95e11 [2,head]
 /staging/api/easyrsa/vars auth v229 dirtyparent s=8089 n(v0 b8089 1=1+0)
 (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]



These log messages are not about deleted files. Could you try again and upload the
log file and the output of "rados -p data ls" somewhere?
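
For reference, something like this should capture both (a sketch; the pool name
"data" and MDS rank 0 are assumptions):

  rados -p data ls > /tmp/data-objects.txt
  ceph tell mds.0 injectargs '--debug_mds 20'   # or set "debug mds = 20" in ceph.conf and restart the MDS
  # the log then ends up in /var/log/ceph/ceph-mds.*.log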

Regards
Yan, Zheng


 What is going on ?

 On 03/16/2015 02:18 AM, Yan, Zheng wrote:
 I don't know what went wrong. Could you use "rados -p data ls" to check
 which objects still exist, then restart the MDS with debug_mds=20
 and search the log for the names of the remaining objects.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph.conf

2015-03-16 Thread Jesus Chavez (jeschave)
Hi all, I have seen that new versions of Ceph on new OSes like RHEL 7 and CentOS 7 
don't need information like mon.node1 and osd.0 etc. anymore. Can anybody 
tell me if that is for real, or do I still need to write config like this:

[osd.0]
  host = sagitario
  addr = 192.168.1.67
[mon.leo]
  host = leo
  mon addr = 192.168.1.81:6789
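
For comparison, a minimal ceph-deploy style ceph.conf usually carries only
cluster-wide settings like the sketch below (all values are placeholders); the
daemons are then located from /var/lib/ceph rather than from per-daemon
sections:

[global]
fsid = a7f64266-0894-4f1e-a635-d0aeaca0e993
mon_initial_members = leo
mon_host = 192.168.1.81
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx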




Jesus Chavez
SYSTEMS ENGINEER-C.SALES

jesch...@cisco.com
Phone: +52 55 5267 3146
Mobile: +51 1 5538883255

CCIE - 44433


Cisco.com - http://www.cisco.com/













___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PHP Rados failed in read operation if object size is large (say more than 10 MB )

2015-03-16 Thread Wido den Hollander
On 03/16/2015 01:55 PM, Gaurang Vyas wrote:
 running on ubuntu with nginx + php-fpm
 
 <?php
 $rados = rados_create('admin');
 
 rados_conf_read_file($rados, '/etc/ceph/ceph.conf');
 rados_conf_set($rados, 'keyring', '/etc/ceph/ceph.client.admin.keyring');
 
 $temp = rados_conf_get($rados, 'rados_osd_op_timeout');
 echo ' osd ';
 echo $temp;
 $temp = rados_conf_get($rados, 'client_mount_timeout');
 echo ' client ';
 echo $temp;
 $temp = rados_conf_get($rados, 'rados_mon_op_timeout');
 echo ' mon ';
 echo $temp;
 
 $err = rados_connect($rados);
 $ioRados = rados_ioctx_create($rados, 'dev_whereis');
 
 $pieceSize = rados_stat($ioRados, 'TEMP_object');
 var_dump($pieceSize);
 
 $piece = rados_read($ioRados, 'TEMP_object', $pieceSize['psize'], 0);
 

So what is the error exactly? Are you running phprados from the master
branch on Github?

 echo $piece;
 ?>
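
One way to narrow this down is to check whether the same object can be read in
full outside PHP, e.g. with the rados CLI (pool and object names taken from the
snippet above):

  rados -p dev_whereis stat TEMP_object
  rados -p dev_whereis get TEMP_object /tmp/TEMP_object   # if this works, the problem is likely in phprados, not RADOS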
 
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

2015-03-16 Thread Steffen W Sørensen

On 16/03/2015, at 12.23, Alexandre DERUMIER aderum...@odiso.com wrote:

 We use Proxmox, so I think it uses librbd ? 
 
 As It's me that I made the proxmox rbd plugin, I can confirm that yes, it's 
 librbd ;)
 Is the ceph cluster on dedicated nodes ? or vms are running on same nodes 
 than osd daemons ?
My cluster has Ceph OSDs+MONs on separate PVE nodes, no VMs.

 
 
 And I precise that not all VMs on that pool crashed, only some of them 
 (a large majority), and on a same host, some crashed and others not. 
 
 Is the vm crashed, like no more qemu process ?
 or is it the guest os which is crashed ?
Hmm, it's been a long time now; I remember the VM status was stopped and resume didn't 
work, so they were started again ASAP :)

 (do you use virtio, virtio-scsi or ide for your guest ?)
virtio

/Steffen



signature.asc
Description: Message signed with OpenPGP using GPGMail
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PHP Rados failed in read operation if object size is large (say more than 10 MB )

2015-03-16 Thread Gaurang Vyas
running on ubuntu with nginx + php-fpm

?php
$rados = rados_create('admin');


rados_conf_read_file($rados, '/etc/ceph/ceph.conf');
rados_conf_set($rados, 'keyring','/etc/ceph/ceph.client.admin.keyring');

$temp = rados_conf_get($rados, rados_osd_op_timeout);
echo  osd ;
echo $temp;
$temp = rados_conf_get($rados, client_mount_timeout);
echo  clinet   ;
echo $temp;
$temp = rados_conf_get($rados, rados_mon_op_timeout);
echo   mon   ;
echo $temp;

$err = rados_connect($rados);
$ioRados = rados_ioctx_create($rados,'dev_whereis');

$pieceSize = rados_stat($ioRados,'TEMP_object');
var_dump($pieceSize);

$piece = rados_read($ioRados, 'TEMP_object',$pieceSize['psize'] ,0);

echo $piece;
?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

2015-03-16 Thread Alexandre DERUMIER
That full system slows down, OK, but brutal stop... 

This is strange, that could be:

- qemu crash, maybe a bug in rbd block storage (if you use librbd)
- oom-killer on your host (any logs ?)

what is your qemu version ?


- Original Message -
From: Florent Bautista flor...@coppint.com
To: ceph-users ceph-users@lists.ceph.com
Sent: Monday, March 16, 2015 10:11:43
Subject: Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

Of course but it does not explain why VMs stopped... 
That full system slows down, OK, but brutal stop... 

On 03/14/2015 07:00 PM, Andrija Panic wrote: 



Changing the PG number causes a LOT of data rebalancing (in my case it was 80%), 
which I learned the hard way... 

On 14 March 2015 at 18:49, Gabri Mate  mailingl...@modernbiztonsag.org  
wrote: 

I had the same issue a few days ago. I was increasing the pg_num of one 
pool from 512 to 1024 and all the VMs in that pool stopped. I came to 
the conclusion that doubling the pg_num caused such a high load in ceph 
that the VMs were blocked. The next time I will test with small 
increments. 


On 12:38 Sat 14 Mar , Florent B wrote: 
 Hi all, 
 
 I have a Giant cluster in production. 
 
 Today one of my RBD pools had the "too few pgs" warning, so I changed 
 pg_num & pgp_num. 
 
 And at this moment, some of the VMs stored on this pool were stopped (on 
 some hosts, not all; it depends, no obvious logic). 
 
 All was running fine for months... 
 
 Have you ever seen this ? 
 What could have caused this ? 
 
 Thank you. 
 
 
 ___ 
 ceph-users mailing list 
 ceph-users@lists.ceph.com 
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 






-- 

Andrija Panić 


___
ceph-users mailing list ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 



___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

2015-03-16 Thread Alexandre DERUMIER
We use Proxmox, so I think it uses librbd ? 

As I'm the one who made the Proxmox RBD plugin, I can confirm that yes, it's 
librbd ;)

Is the ceph cluster on dedicated nodes, or are the VMs running on the same nodes as 
the osd daemons ?


And I precise that not all VMs on that pool crashed, only some of them 
(a large majority), and on a same host, some crashed and others not. 

Did the VM crash, as in no more qemu process ?
Or is it the guest OS which crashed ? (do you use virtio, virtio-scsi or ide 
for your guest ?)





- Original Message -
From: Florent Bautista flor...@coppint.com
To: aderumier aderum...@odiso.com
Cc: ceph-users ceph-users@lists.ceph.com
Sent: Monday, March 16, 2015 11:14:45
Subject: Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

On 03/16/2015 11:03 AM, Alexandre DERUMIER wrote: 
 This is strange, that could be: 
 
 - qemu crash, maybe a bug in rbd block storage (if you use librbd) 
 - oom-killer on you host (any logs ?) 
 
 what is your qemu version ? 
 

Now, we have version 2.1.3. 

Some VMs that stopped had been running for a long time, but some others had 
only 4 days of uptime. 
 
And to be precise, not all VMs on that pool crashed, only some of them 
(a large majority), and on the same host, some crashed and others did not. 

We use Proxmox, so I think it uses librbd ? 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

2015-03-16 Thread Steffen W Sørensen

On 16/03/2015, at 11.14, Florent B flor...@coppint.com wrote:

 On 03/16/2015 11:03 AM, Alexandre DERUMIER wrote:
 This is strange, that could be:
 
 - qemu crash, maybe a bug in rbd block storage (if you use librbd)
 - oom-killer on you host (any logs ?)
 
 what is your qemu version ?
 
 
 Now, we have version 2.1.3.
 
 Some VMs that stopped were running for a long time, but some other had
 only 4 days uptime.
 
 And I precise that not all VMs on that pool crashed, only some of them
 (a large majority), and on a same host, some crashed and others not.
 
 We use Proxmox, so I think it uses librbd ?
I had the same issue once when bumping up pg_num: the majority of my Proxmox 
VMs stopped. I believe this might be due to heavy rebalancing causing timeouts 
when the VMs try to do IO ops, thus generating kernel panics.

Next time around I want to go with smaller increments of pg_num and hopefully avoid 
this (see the sketch below).
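
For reference, a way to script such small increments might look like this (a
sketch; the pool name, the step values and the 60-second poll are all
illustrative):

  for pg in 2176 2304 2432 2560; do
      ceph osd pool set rbd pg_num $pg
      while ! ceph health | grep -q HEALTH_OK; do sleep 60; done
      ceph osd pool set rbd pgp_num $pg
      while ! ceph health | grep -q HEALTH_OK; do sleep 60; done
  done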

I follow the need for more PGs when you have more OSDs, but how come PGs get too 
few when adding more objects/data to a pool?

/Steffen


signature.asc
Description: Message signed with OpenPGP using GPGMail
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

2015-03-16 Thread Azad Aliyar
May I know your Ceph version? The latest version of Firefly, 0.80.9, has
patches to avoid excessive data migration during reweighting of OSDs. You may
need to set a tunable in order to make this patch active.

This is a bugfix release for firefly.  It fixes a performance regression
in librbd, an important CRUSH misbehavior (see below), and several RGW
bugs.  We have also backported support for flock/fcntl locks to ceph-fuse
and libcephfs.

We recommend that all Firefly users upgrade.

For more detailed information, see
  http://docs.ceph.com/docs/master/_downloads/v0.80.9.txt

Adjusting CRUSH maps


* This point release fixes several issues with CRUSH that trigger
  excessive data migration when adjusting OSD weights.  These are most
  obvious when a very small weight change (e.g., a change from 0 to
  .01) triggers a large amount of movement, but the same set of bugs
  can also lead to excessive (though less noticeable) movement in
  other cases.

  However, because the bug may already have affected your cluster,
  fixing it may trigger movement *back* to the more correct location.
  For this reason, you must manually opt-in to the fixed behavior.

  In order to set the new tunable to correct the behavior::

 ceph osd crush set-tunable straw_calc_version 1

  Note that this change will have no immediate effect.  However, from
  this point forward, any 'straw' bucket in your CRUSH map that is
  adjusted will get non-buggy internal weights, and that transition
  may trigger some rebalancing.

  You can estimate how much rebalancing will eventually be necessary
  on your cluster with::

 ceph osd getcrushmap -o /tmp/cm
 crushtool -i /tmp/cm --num-rep 3 --test --show-mappings > /tmp/a 2>&1
 crushtool -i /tmp/cm --set-straw-calc-version 1 -o /tmp/cm2
 crushtool -i /tmp/cm2 --reweight -o /tmp/cm2
 crushtool -i /tmp/cm2 --num-rep 3 --test --show-mappings > /tmp/b 2>&1
 wc -l /tmp/a  # num total mappings
 diff -u /tmp/a /tmp/b | grep -c ^+   # num changed mappings

   Divide the total number of lines in /tmp/a with the number of lines
   changed.  We've found that most clusters are under 10%.

   You can force all of this rebalancing to happen at once with::

 ceph osd crush reweight-all

   Otherwise, it will happen at some unknown point in the future when
   CRUSH weights are next adjusted.

Notable Changes
---

* ceph-fuse: flock, fcntl lock support (Yan, Zheng, Greg Farnum)
* crush: fix straw bucket weight calculation, add straw_calc_version
  tunable (#10095 Sage Weil)
* crush: fix tree bucket (Rongzu Zhu)
* crush: fix underflow of tree weights (Loic Dachary, Sage Weil)
* crushtool: add --reweight (Sage Weil)
* librbd: complete pending operations before losing image (#10299 Jason
  Dillaman)
* librbd: fix read caching performance regression (#9854 Jason Dillaman)
* librbd: gracefully handle deleted/renamed pools (#10270 Jason Dillaman)
* mon: fix dump of chooseleaf_vary_r tunable (Sage Weil)
* osd: fix PG ref leak in snaptrimmer on peering (#10421 Kefu Chai)
* osd: handle no-op write with snapshot (#10262 Sage Weil)
* radosgw-admi




On 03/16/2015 12:37 PM, Alexandre DERUMIER wrote:
 VMs are running on the same nodes than OSD
 Are you sure that you didn't some kind of out of memory.
 pg rebalance can be memory hungry. (depend how many osd you have).

2 OSD per host, and 5 hosts in this cluster.
hosts h
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

2015-03-16 Thread Michael Kuriger
I always keep my pg number a power of 2.  So I’d go from 2048 to 4096.  I’m not 
sure if this is the safest way, but it’s worked for me.





Michael Kuriger
Sr. Unix Systems Engineer
mk7...@yp.com | 818-649-7235


From: Chu Duc Minh chu.ducm...@gmail.com
Date: Monday, March 16, 2015 at 7:49 AM
To: Florent B flor...@coppint.com
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

I'm using the latest Giant and have the same issue. When i increase PG_num of a 
pool from 2048 to 2148, my VMs is still ok. When i increase from 2148 to 2400, 
some VMs die (Qemu-kvm process die).
My physical servers (host VMs) running kernel 3.13 and use librbd.
I think it's a bug in librbd with crushmap.
(I set crush_tunables3 on my ceph cluster, does it make sense?)

Do you know a way to safely increase PG_num? (I don't think increase PG_num 100 
each times is a safe  good way)

Regards,

On Mon, Mar 16, 2015 at 8:50 PM, Florent B flor...@coppint.com wrote:
We are on Giant.

On 03/16/2015 02:03 PM, Azad Aliyar wrote:

 May I know your ceph version.?. The latest version of firefly 80.9 has
 patches to avoid excessive data migrations during rewighting osds. You
 may need set a tunable inorder make this patch active.

 This is a bugfix release for firefly.  It fixes a performance regression
 in librbd, an important CRUSH misbehavior (see below), and several RGW
 bugs.  We have also backported support for flock/fcntl locks to ceph-fuse
 and libcephfs.

 We recommend that all Firefly users upgrade.

 For more detailed information, see
   
 http://docs.ceph.com/docs/master/_downloads/v0.80.9.txt

 Adjusting CRUSH maps
 

 * This point release fixes several issues with CRUSH that trigger
   excessive data migration when adjusting OSD weights.  These are most
   obvious when a very small weight change (e.g., a change from 0 to
   .01) triggers a large amount of movement, but the same set of bugs
   can also lead to excessive (though less noticeable) movement in
   other cases.

   However, because the bug may already have affected your cluster,
   fixing it may trigger movement *back* to the more correct location.
   For this reason, you must manually opt-in to the fixed behavior.

   In order to set the new tunable to correct the behavior::

  ceph osd crush set-tunable straw_calc_version 1

   Note that this change will have no immediate effect.  However, from
   this point forward, any 'straw' bucket in your CRUSH map that is
   adjusted will get non-buggy internal weights, and that transition
   may trigger some rebalancing.

   You can estimate how much rebalancing will eventually be necessary
   on your cluster with::

  ceph osd getcrushmap -o /tmp/cm
  crushtool -i /tmp/cm --num-rep 3 --test --show-mappings > /tmp/a 2>&1
  crushtool -i /tmp/cm --set-straw-calc-version 1 -o /tmp/cm2
  crushtool -i /tmp/cm2 --reweight -o /tmp/cm2
  crushtool -i /tmp/cm2 --num-rep 3 --test --show-mappings > /tmp/b 2>&1
  wc -l /tmp/a  # num total mappings
  diff -u /tmp/a /tmp/b | grep -c ^+   # num changed mappings

Divide the total number of lines in /tmp/a with the number of lines
changed.  We've found that most clusters are under 10%.

You can force all of this rebalancing to happen at once with::

  ceph osd crush reweight-all

Otherwise, it will happen at some unknown point in the future when
CRUSH weights are next adjusted.

 Notable Changes
 ---

 * ceph-fuse: flock, fcntl lock support (Yan, Zheng, Greg Farnum)
 * crush: fix straw bucket weight calculation, add straw_calc_version
   tunable (#10095 Sage Weil)
 * crush: fix tree bucket (Rongzu Zhu)
 * crush: fix underflow of tree weights (Loic Dachary, Sage Weil)
 * crushtool: add --reweight (Sage Weil)
 * librbd: complete pending operations before losing image (#10299 Jason
   Dillaman)
 * librbd: fix read caching performance regression (#9854 Jason Dillaman)
 * librbd: gracefully handle deleted/renamed pools (#10270 Jason Dillaman)
 * mon: fix dump of chooseleaf_vary_r tunable (Sage Weil)
 * osd: fix PG ref leak in snaptrimmer on peering (#10421 Kefu Chai)
 * osd: handle no-op write with snapshot (#10262 Sage Weil)
 * radosgw-admi




 On 03/16/2015 12:37 PM, Alexandre DERUMIER wrote:
  VMs are running on the same nodes than OSD
  Are you sure that you didn't some kind of out of memory.
  pg rebalance can be memory hungry. (depend how many osd you have).

 2 OSD per host, and 

Re: [ceph-users] Calamari - Data

2015-03-16 Thread John Spray

Sumit,

You may have better luck on the ceph-calamari mailing list.  Anyway - 
calamari uses graphite to handle metrics, and graphite does indeed write 
them to files.
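
If you want to check on disk, the Diamond/Graphite metrics end up as whisper
files; on a typical Calamari server they live under a path like the one below
(the exact path is an assumption and varies by distro):

  find /var/lib/graphite/whisper -name '*.wsp' | head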


John

On 11/03/2015 05:09, Sumit Gaur wrote:

Hi
I have a basic architecture-related question. I know Calamari collects 
system usage data (via the Diamond collector) using performance counters. I 
need to know whether all the system performance data that Calamari shows 
stays in memory, or whether it uses files to store it.

Thanks
sumit


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: authorizations ?

2015-03-16 Thread John Spray

On 13/03/2015 11:51, Florent B wrote:

Hi all,

My question is about user management in CephFS.

Is it possible to restrict a CephX user to access some subdirectories ?
Not yet.  The syntax for setting a path= part in the authorization 
caps for a cephx user exists, but the code for enforcing it isn't done yet.
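
For the record, the (currently unenforced) cap syntax looks roughly like this
(a sketch; client name, path and pool are examples):

  ceph auth get-or-create client.restricted \
      mds 'allow rw path=/home/restricted' \
      mon 'allow r' \
      osd 'allow rw pool=data'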


John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: delayed objects deletion ?

2015-03-16 Thread John Spray

On 16/03/2015 16:30, Florent B wrote:
Thank you John :) Hammer is not released yet, is it ? Is it 'safe' to 
upgrade a production cluster to 0.93 ? 
I keep forgetting that -- yes, I should have added ...when it's 
released :-)


John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: delayed objects deletion ?

2015-03-16 Thread John Spray

On 14/03/2015 09:22, Florent B wrote:

Hi,

What do you call old MDS ? I'm on Giant release, it is not very old...
With CephFS we have a special definition of old that is anything that 
doesn't have the very latest bug fixes ;-)


There have definitely been fixes to stray file handling[1] between giant 
and hammer.  Since with giant you're using a version that is neither 
latest nor LTS, I'd suggest you upgrade to hammer.  Hammer also includes 
some new perf counters related to strays[2] that will allow you to see 
how the purging is (or isn't) progressing.


If you can reproduce this on hammer, then please capture "ceph daemon 
mds.<daemon id> session ls" and "ceph mds tell mds.<daemon id> dumpcache 
/tmp/cache.txt", in addition to the procedure to reproduce.  Ideally 
logs with "debug mds = 10" as well.
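
Something like the following should gather all three (a sketch, assuming the
MDS id is "a"):

  ceph daemon mds.a session ls > /tmp/sessions.json
  ceph mds tell mds.a dumpcache /tmp/cache.txt
  ceph tell mds.a injectargs '--debug_mds 10'   # or put "debug mds = 10" in ceph.conf and restart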


Cheers,
John

1.
http://tracker.ceph.com/issues/10387
http://tracker.ceph.com/issues/10164

2.
http://tracker.ceph.com/issues/10388


And I tried restarting both but it didn't solve my problem.

Will it be OK in Hammer ?

On 03/13/2015 04:27 AM, Yan, Zheng wrote:

On Fri, Mar 13, 2015 at 1:17 AM, Florent B flor...@coppint.com wrote:

Hi all,

I test CephFS again on Giant release.

I use ceph-fuse.

After deleting a large directory (few hours ago), I can see that my pool
still contains 217 GB of objects.

Even if my root directory on CephFS is empty.

And metadata pool is 46 MB.

Is it expected ? If not, how to debug this ?

Old mds does not work well in this area. Try umounting clients and
restarting MDS.

Regards
Yan, Zheng



Thank you.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OS file Cache, Ceph RBD cache and Network files systems

2015-03-16 Thread Stéphane DUGRAVOT
Hi Cephers, 

Our university may deploy Ceph. The goal is to store data for research 
laboratories (non-HPC). To do this, we plan to use Ceph with RBD (a mapped block 
device) on an NFS (or CIFS) server (the Ceph client) that exports to workstations in 
the laboratories. According to our tests, the OS (Ubuntu or CentOS...) that maps the 
RBD block device applies its file system write cache (vm.dirty_ratio, etc.). In that 
case, the NFS server will always acknowledge writes to the workstations even though it 
has not finished writing the data to the Ceph cluster - and this regardless of whether 
the RBD cache is enabled or not in the [client] section of the config. 

My questions: 


1. Is enabling the RBD cache only useful when it involves virtual machines 
(where QEMU can access an image as a virtual block device directly via librbd)? 
(See the config sketch below.) 
2. Is it common to use Ceph with RBD to share network file systems? 
3. And if so, what are the recommendations concerning the OS cache? 
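
On question 1: the rbd cache settings only apply to librbd clients (e.g. QEMU);
the kernel rbd driver used by "rbd map" goes through the normal page cache
instead. A sketch of the client-side settings (values are examples, not a
recommendation):

  [client]
      rbd cache = true
      rbd cache writethrough until flush = true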

Thanks a lot. 
Stephane. 

-- 
Université de Lorraine 
Stéphane DUGRAVOT - Direction du numérique - Infrastructure 
Jabber : stephane.dugra...@univ-lorraine.fr 
Tél.: +33 3 83 68 20 98 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph release timeline

2015-03-16 Thread David Moreau Simard
Great work !

David Moreau Simard

On 2015-03-15 06:29 PM, Loic Dachary wrote:
 Hi Ceph,

 In an attempt to clarify which Ceph release is stable, LTS or development, a 
 new page was added to the documentation: 
 http://ceph.com/docs/master/releases/ It is a matrix where each cell is a 
 release number linked to the release notes from 
 http://ceph.com/docs/master/release-notes/. One line per month and one column 
 per release.

 Cheers



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

2015-03-16 Thread Chu Duc Minh
@Michael Kuriger: when ceph/librbd operates normally, I know that doubling the
pg_num is the safe way. But when it has a problem, I think doubling it can make
many, many VMs die (maybe >= 50%?)


On Mon, Mar 16, 2015 at 9:53 PM, Michael Kuriger mk7...@yp.com wrote:

   I always keep my pg number a power of 2.  So I’d go from 2048 to 4096.
 I’m not sure if this is the safest way, but it’s worked for me.






 Michael Kuriger

 Sr. Unix Systems Engineer

  mk7...@yp.com | 818-649-7235

   From: Chu Duc Minh chu.ducm...@gmail.com
 Date: Monday, March 16, 2015 at 7:49 AM
 To: Florent B flor...@coppint.com
 Cc: ceph-users@lists.ceph.com ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

I'm using the latest Giant and have the same issue. When i increase
 PG_num of a pool from 2048 to 2148, my VMs is still ok. When i increase
 from 2148 to 2400, some VMs die (Qemu-kvm process die).
  My physical servers (host VMs) running kernel 3.13 and use librbd.
  I think it's a bug in librbd with crushmap.
  (I set crush_tunables3 on my ceph cluster, does it make sense?)

 Do you know a way to safely increase PG_num? (I don't think increase
 PG_num 100 each times is a safe  good way)

  Regards,

 On Mon, Mar 16, 2015 at 8:50 PM, Florent B flor...@coppint.com wrote:

 We are on Giant.

 On 03/16/2015 02:03 PM, Azad Aliyar wrote:
 
  May I know your ceph version.?. The latest version of firefly 80.9 has
  patches to avoid excessive data migrations during rewighting osds. You
  may need set a tunable inorder make this patch active.
 
  This is a bugfix release for firefly.  It fixes a performance regression
  in librbd, an important CRUSH misbehavior (see below), and several RGW
  bugs.  We have also backported support for flock/fcntl locks to
 ceph-fuse
  and libcephfs.
 
  We recommend that all Firefly users upgrade.
 
  For more detailed information, see
http://docs.ceph.com/docs/master/_downloads/v0.80.9.txt
 
  Adjusting CRUSH maps
  
 
  * This point release fixes several issues with CRUSH that trigger
excessive data migration when adjusting OSD weights.  These are most
obvious when a very small weight change (e.g., a change from 0 to
.01) triggers a large amount of movement, but the same set of bugs
can also lead to excessive (though less noticeable) movement in
other cases.
 
However, because the bug may already have affected your cluster,
fixing it may trigger movement *back* to the more correct location.
For this reason, you must manually opt-in to the fixed behavior.
 
In order to set the new tunable to correct the behavior::
 
   ceph osd crush set-tunable straw_calc_version 1
 
Note that this change will have no immediate effect.  However, from
this point forward, any 'straw' bucket in your CRUSH map that is
adjusted will get non-buggy internal weights, and that transition
may trigger some rebalancing.
 
You can estimate how much rebalancing will eventually be necessary
on your cluster with::
 
   ceph osd getcrushmap -o /tmp/cm
   crushtool -i /tmp/cm --num-rep 3 --test --show-mappings > /tmp/a 2>&1
   crushtool -i /tmp/cm --set-straw-calc-version 1 -o /tmp/cm2
   crushtool -i /tmp/cm2 --reweight -o /tmp/cm2
   crushtool -i /tmp/cm2 --num-rep 3 --test --show-mappings > /tmp/b 2>&1
   wc -l /tmp/a  # num total mappings
   diff -u /tmp/a /tmp/b | grep -c ^+   # num changed mappings
 
 Divide the total number of lines in /tmp/a with the number of lines
 changed.  We've found that most clusters are under 10%.
 
 You can force all of this rebalancing to happen at once with::
 
   ceph osd crush reweight-all
 
 Otherwise, it will happen at some unknown point in the future when
 CRUSH weights are next adjusted.
 
  Notable Changes
  ---
 
  * ceph-fuse: flock, fcntl lock support (Yan, Zheng, Greg Farnum)
  * crush: fix straw bucket weight calculation, add straw_calc_version
tunable (#10095 Sage Weil)
  * crush: fix tree bucket (Rongzu Zhu)
  * crush: fix underflow of tree weights (Loic Dachary, Sage Weil)
  * crushtool: add --reweight (Sage Weil)
  * librbd: complete pending operations before losing image (#10299 Jason
Dillaman)
  * librbd: fix read caching performance regression (#9854 Jason Dillaman)
  * librbd: gracefully handle deleted/renamed pools (#10270 Jason
 Dillaman)
  * mon: fix dump of chooseleaf_vary_r tunable (Sage Weil)
  * osd: fix PG ref leak in snaptrimmer on peering (#10421 Kefu Chai)
  * osd: handle no-op write with snapshot (#10262 Sage Weil)
  * radosgw-admi
 
 
 
 
  On 03/16/2015 12:37 PM, 

Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

2015-03-16 Thread Chu Duc Minh
I'm using the latest Giant and have the same issue. When I increase the pg_num
of a pool from 2048 to 2148, my VMs are still OK. When I increase from 2148
to 2400, some VMs die (the qemu-kvm processes die).
My physical servers (hosting the VMs) run kernel 3.13 and use librbd.
I think it's a bug in librbd with the crushmap.
(I set crush_tunables3 on my ceph cluster, does it make sense?)

Do you know a way to safely increase pg_num? (I don't think increasing pg_num
by 100 each time is a safe & good way)

Regards,

On Mon, Mar 16, 2015 at 8:50 PM, Florent B flor...@coppint.com wrote:

 We are on Giant.

 On 03/16/2015 02:03 PM, Azad Aliyar wrote:
 
  May I know your ceph version.?. The latest version of firefly 80.9 has
  patches to avoid excessive data migrations during rewighting osds. You
  may need set a tunable inorder make this patch active.
 
  This is a bugfix release for firefly.  It fixes a performance regression
  in librbd, an important CRUSH misbehavior (see below), and several RGW
  bugs.  We have also backported support for flock/fcntl locks to ceph-fuse
  and libcephfs.
 
  We recommend that all Firefly users upgrade.
 
  For more detailed information, see
http://docs.ceph.com/docs/master/_downloads/v0.80.9.txt
 
  Adjusting CRUSH maps
  
 
  * This point release fixes several issues with CRUSH that trigger
excessive data migration when adjusting OSD weights.  These are most
obvious when a very small weight change (e.g., a change from 0 to
.01) triggers a large amount of movement, but the same set of bugs
can also lead to excessive (though less noticeable) movement in
other cases.
 
However, because the bug may already have affected your cluster,
fixing it may trigger movement *back* to the more correct location.
For this reason, you must manually opt-in to the fixed behavior.
 
In order to set the new tunable to correct the behavior::
 
   ceph osd crush set-tunable straw_calc_version 1
 
Note that this change will have no immediate effect.  However, from
this point forward, any 'straw' bucket in your CRUSH map that is
adjusted will get non-buggy internal weights, and that transition
may trigger some rebalancing.
 
You can estimate how much rebalancing will eventually be necessary
on your cluster with::
 
   ceph osd getcrushmap -o /tmp/cm
   crushtool -i /tmp/cm --num-rep 3 --test --show-mappings > /tmp/a 2>&1
   crushtool -i /tmp/cm --set-straw-calc-version 1 -o /tmp/cm2
   crushtool -i /tmp/cm2 --reweight -o /tmp/cm2
   crushtool -i /tmp/cm2 --num-rep 3 --test --show-mappings > /tmp/b 2>&1
   wc -l /tmp/a  # num total mappings
   diff -u /tmp/a /tmp/b | grep -c ^+   # num changed mappings
 
 Divide the total number of lines in /tmp/a with the number of lines
 changed.  We've found that most clusters are under 10%.
 
 You can force all of this rebalancing to happen at once with::
 
   ceph osd crush reweight-all
 
 Otherwise, it will happen at some unknown point in the future when
 CRUSH weights are next adjusted.
 
  Notable Changes
  ---
 
  * ceph-fuse: flock, fcntl lock support (Yan, Zheng, Greg Farnum)
  * crush: fix straw bucket weight calculation, add straw_calc_version
tunable (#10095 Sage Weil)
  * crush: fix tree bucket (Rongzu Zhu)
  * crush: fix underflow of tree weights (Loic Dachary, Sage Weil)
  * crushtool: add --reweight (Sage Weil)
  * librbd: complete pending operations before losing image (#10299 Jason
Dillaman)
  * librbd: fix read caching performance regression (#9854 Jason Dillaman)
  * librbd: gracefully handle deleted/renamed pools (#10270 Jason Dillaman)
  * mon: fix dump of chooseleaf_vary_r tunable (Sage Weil)
  * osd: fix PG ref leak in snaptrimmer on peering (#10421 Kefu Chai)
  * osd: handle no-op write with snapshot (#10262 Sage Weil)
  * radosgw-admi
 
 
 
 
  On 03/16/2015 12:37 PM, Alexandre DERUMIER wrote:
   VMs are running on the same nodes than OSD
   Are you sure that you didn't some kind of out of memory.
   pg rebalance can be memory hungry. (depend how many osd you have).
 
  2 OSD per host, and 5 hosts in this cluster.
  hosts h
 

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mapping users to different rgw pools

2015-03-16 Thread Craig Lewis
Yes, the placement target feature is logically separate from multi-zone
setups.  Placement targets are configured in the region though, which
somewhat muddies the issue.

Placement targets are a useful feature for multi-zone setups, so different zones in
a cluster don't share the same disks.  The federation setup is the only place
I've seen any discussion of the topic, and even that is just a brief
mention.  I didn't see any documentation directly talking about setting up
placement targets, even in the federation guides.

It looks like you'll need to edit the default region to add the placement
targets, but you won't need to set up zones.  As far as I can tell, you'll
have to piece together what you need from the federation setup and some
experimentation.  I highly recommend a test VM that you can experiment on
before attempting anything in production.
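
As a rough sketch of the moving parts (names are examples, and the JSON editing
steps are abbreviated):

  radosgw-admin region get > region.json
  # add {"name": "special-placement", "tags": []} under "placement_targets" in region.json
  radosgw-admin region set --infile region.json
  radosgw-admin zone get > zone.json
  # add a matching "special-placement" entry under "placement_pools", pointing at your pools
  radosgw-admin zone set --infile zone.json
  radosgw-admin regionmap update
  # per-user default: edit "default_placement" in the user's metadata
  radosgw-admin metadata get user:someuser > user.json
  radosgw-admin metadata put user:someuser < user.json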




On Sun, Mar 15, 2015 at 11:53 PM, Sreenath BH bhsreen...@gmail.com wrote:

 Thanks.

 Is this possible outside of multi-zone setup. (With only one Zone)?

 For example, I want to have pools with different replication
 factors(or erasure codings) and map users to these pools.

 -Sreenath


 On 3/13/15, Craig Lewis cle...@centraldesktop.com wrote:
  Yes, RadosGW has the concept of Placement Targets and Placement Pools.
 You
  can create a target, and point it a set of RADOS pools.  Those pools can
 be
  configured to use different storage strategies by creating different
  crushmap rules, and assigning those rules to the pool.
 
  RGW users can be assigned a default placement target.  When they create a
  bucket, they can either specify the target, or use their default one.
 All
  objects in a bucket are stored according to the bucket's placement
 target.
 
 
  I haven't seen a good guide for making use of these features.  The best
  guide I know of is the Federation guide (
  http://ceph.com/docs/giant/radosgw/federated-config/), but it only
 briefly
  mentions placement targets.
 
 
 
  On Thu, Mar 12, 2015 at 11:48 PM, Sreenath BH bhsreen...@gmail.com
 wrote:
 
  Hi all,
 
  Can one Radow gateway support more than one pool for storing objects?
 
  And as a follow-up question, is there a way to map different users to
  separate rgw pools so that their obejcts get stored in different
  pools?
 
  thanks,
  Sreenath
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd laggy algorithm

2015-03-16 Thread Gregory Farnum
On Wed, Mar 11, 2015 at 8:40 AM, Artem Savinov asavi...@asdco.ru wrote:
 hello.
 By default ceph marks an osd node down after receiving 3
 reports about the failed node. Reports are sent every "osd heartbeat grace"
 seconds, but with the settings mon_osd_adjust_heartbeat_grace = true and
 mon_osd_adjust_down_out_interval = true the timeout for marking nodes down
 may vary. Please tell me: what algorithm changes the timeout for
 nodes going into the down/out status, and which parameters are
 affected?
 thanks.

The monitors keep track of which detected failures are incorrect
(based on reports from the marked-down/out OSDs) and build up an
expectation about how often the failures are correct based on an
exponential backoff of the data points. You can look at the code in
OSDMonitor.cc if you're interested, but basically they apply that
expectation to modify the down interval and the down-out interval to a
value large enough that they believe the OSD is really down (assuming
these config options are set). It's not terribly interesting. :)
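
For reference, the knobs involved can be inspected on a running monitor (the
mon id is a placeholder):

  ceph daemon mon.a config show | grep -E 'osd_heartbeat_grace|mon_osd_adjust|mon_osd_down_out_interval'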
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-16 Thread Gregory Farnum
On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk n...@fisk.me.uk wrote:

 I’m not sure if it’s something I’m doing wrong or just experiencing an 
 oddity, but when my cache tier flushes dirty blocks out to the base tier, the 
 writes seem to hit the OSD’s straight away instead of coalescing in the 
 journals, is this correct?

 For example if I create a RBD on a standard 3 way replica pool and run fio 
 via librbd 128k writes, I see the journals take all the io’s until I hit my 
 filestore_min_sync_interval and then I see it start writing to the underlying 
 disks.

 Doing the same on a full cache tier (to force flushing)  I immediately see 
 the base disks at a very high utilisation. The journals also have some write 
 IO at the same time. The only other odd thing I can see via iostat is that 
 most of the time whilst I’m running Fio, is that I can see the underlying 
 disks doing very small write IO’s of around 16kb with an occasional big burst 
 of activity.

 I know erasure coding+cache tier is slower than just plain replicated pools, 
 but even with various high queue depths I’m struggling to get much above 
 100-150 iops compared to a 3 way replica pool which can easily achieve 
 1000-1500. The base tier is comprised of 40 disks. It seems quite a marked 
 difference and I’m wondering if this strange journal behaviour is the cause.

 Does anyone have any ideas?

If you're running a full cache pool, then on every operation touching
an object which isn't in the cache pool it will try and evict an
object. That's probably what you're seeing.

Cache pools in general are only a wise idea if you have a very skewed
distribution of data "hotness" and the entire hot zone can fit in
cache at once.
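
One common mitigation is to make sure the cache tier starts flushing and
evicting before it is completely full, e.g. (pool name and thresholds are
illustrative):

  ceph osd pool set hot-pool cache_target_dirty_ratio 0.4
  ceph osd pool set hot-pool cache_target_full_ratio 0.8
  ceph osd pool set hot-pool target_max_bytes 1099511627776   # 1 TiB; size this to the SSD tier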
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs stuck unclean active+remapped after an osd marked out

2015-03-16 Thread Gregory Farnum
On Wed, Mar 11, 2015 at 3:49 PM, Francois Lafont flafdiv...@free.fr wrote:
 Hi,

 I was always in the same situation: I couldn't remove an OSD without
 have some PGs definitely stuck to the active+remapped state.

 But I remembered I read on IRC that, before to mark out an OSD, it
 could be sometimes a good idea to reweight it to 0. So, instead of
 doing [1]:

 ceph osd out 3

 I have tried [2]:

 ceph osd crush reweight osd.3 0 # waiting for the rebalancing...
 ceph osd out 3

 and it worked. Then I could remove my osd with the online documentation:
 http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual

 Now, the osd is removed and my cluster is HEALTH_OK. \o/

 Now, my question is: why my cluster was definitely stuck to active+remapped
 with [1] but was not with [2]? Personally, I have absolutely no explanation.
 If you have an explanation, I'd love to know it.

If I remember/guess correctly, if you mark an OSD out it won't
necessarily change the weight of the bucket above it (ie, the host),
whereas if you change the weight of the OSD then the host bucket's
weight changes. That makes for different mappings, and since you only
have a couple of OSDs per host (normally: hurray!) and not many hosts
(normally: sadness) then marking one OSD out makes things harder for
the CRUSH algorithm.
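
A quick way to see the difference is to compare the two weight columns before
and after each command (sketch):

  ceph osd tree   # "crush reweight" changes the WEIGHT column (and the host bucket total),
                  # while "out"/"reweight" only changes the REWEIGHT column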
-Greg


 Should the reweight command be present in the online documentation?
 http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
 If yes, I can make a pull request on the doc with pleasure. ;)

 Regards.

 --
 François Lafont
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] client-ceph [can not connect from client][connect protocol feature mismatch]

2015-03-16 Thread Sonal Dubey
Thanks a lot Stephane and Kamil,

Your reply was really helpful. I needed a different version of the ceph client
on my client machine. Initially my Java application using librados was
throwing a connection timeout. Then I tried querying ceph from the command
line (ceph --id ...), which was giving the error -



2015-03-05 13:37:16.816322 7f5191deb700 -- 10.8.25.112:0/2487 >>
10.138.23.241:6789/0 pipe(0x12489f0 sd=3 pgs=0 cs=0 l=0).connect protocol
feature mismatch, my 1ffa < peer 42041ffa missing 4204


From the hints given in your mail I tried -

wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' | sudo apt-key add -
wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/autobuild.asc' | sudo apt-key add -
echo deb http://ceph.com/packages/ceph-extras/debian $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph-extras.list
echo deb http://ceph.com/debian-firefly/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list
sudo apt-get install ceph-common

to verify:
ceph --id brts --keyring=/etc/ceph/ceph.client.brts.keyring health
HEALTH_OK

Thanks for the reply.

-Sonal


On Fri, Mar 6, 2015 at 5:50 AM, Stéphane DUGRAVOT 
stephane.dugra...@univ-lorraine.fr wrote:

 Hi Sonal,
 You can refer to this doc to identify your problem.
 Your error code is 4204, so

- 4000 upgrade to kernel 3.9
-  200 CEPH_FEATURE_CRUSH_TUNABLES2
- 4 CEPH_FEATURE_CRUSH_TUNABLES


-
http://ceph.com/planet/feature-set-mismatch-error-on-ceph-kernel-client/
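
If upgrading the client kernel is not an option, the usual cluster-side
workaround is to relax the CRUSH tunables instead (note this can trigger data
movement; the profile name is an example):

  ceph osd crush show-tunables
  ceph osd crush tunables legacy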

 Stephane.

 --

 Hi,

 I am newbie for ceph, and ceph-user group. Recently I have been working on
 a ceph client. It worked on all the environments while when i tested on the
 production, it is not able to connect to ceph.

 Following are the operating system details and error. If someone has seen
 this problem before, any help is really appreciated.

 OS -

 lsb_release -a
 No LSB modules are available.
 Distributor ID: Ubuntu
 Description: Ubuntu 12.04.2 LTS
 Release: 12.04
 Codename: precise

 2015-03-05 13:37:16.816322 7f5191deb700 -- 10.8.25.112:0/2487 
 10.138.23.241:6789/0 pipe(0x12489f0 sd=3 pgs=0 cs=0 l=0).connect protocol
 feature mismatch, my 1ffa  peer 42041ffa missing 4204
 2015-03-05 13:37:17.635776 7f5191deb700 -- 10.8.25.112:0/2487 
 10.138.23.241:6789/0 pipe(0x12489f0 sd=3 pgs=0 cs=0 l=0).connect protocol
 feature mismatch, my 1ffa  peer 42041ffa missing 4204

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-16 Thread Christian Balzer
On Mon, 16 Mar 2015 16:09:12 -0700 Gregory Farnum wrote:

 Nothing here particularly surprises me. I don't remember all the
 details of the filestore's rate limiting off the top of my head, but
 it goes to great lengths to try and avoid letting the journal get too
 far ahead of the backing store. Disabling the filestore flusher and
 increasing the sync intervals without also increasing the
 filestore_wbthrottle_* limits is not going to work well for you.
 -Greg
 
While very true and what I recalled (backing store being kicked off early)
from earlier mails, I think having every last configuration parameter
documented in a way that doesn't reduce people to guesswork would be very
helpful.

For example filestore_wbthrottle_xfs_inodes_start_flusher which defaults
to 500. 
Assuming that this means to start flushing once 500 inodes have
accumulated, how would Ceph even know how many inodes are needed for the
data present?

Lastly, these parameters come in xfs and btrfs incarnations, but no
ext4. 
Do the xfs parameters also apply to ext4?
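
For what it's worth, the values actually in effect can be dumped from a
running OSD (the osd id is a placeholder):

  ceph daemon osd.0 config show | grep filestore_wbthrottle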

Christian

 On Mon, Mar 16, 2015 at 3:58 PM, Nick Fisk n...@fisk.me.uk wrote:
 
 
 
 
  -Original Message-
  From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
  Of Gregory Farnum
  Sent: 16 March 2015 17:33
  To: Nick Fisk
  Cc: ceph-users@lists.ceph.com
  Subject: Re: [ceph-users] Cache Tier Flush = immediate base tier
  journal sync?
 
  On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk n...@fisk.me.uk wrote:
  
   I’m not sure if it’s something I’m doing wrong or just experiencing
   an
  oddity, but when my cache tier flushes dirty blocks out to the base
  tier, the writes seem to hit the OSD’s straight away instead of
  coalescing in the journals, is this correct?
  
   For example if I create a RBD on a standard 3 way replica pool and
   run fio
  via librbd 128k writes, I see the journals take all the io’s until I
  hit my filestore_min_sync_interval and then I see it start writing to
  the underlying disks.
  
   Doing the same on a full cache tier (to force flushing)  I
   immediately see the
  base disks at a very high utilisation. The journals also have some
  write IO at the same time. The only other odd thing I can see via
  iostat is that most of the time whilst I’m running Fio, is that I can
  see the underlying disks doing very small write IO’s of around 16kb
  with an occasional big burst of activity.
  
   I know erasure coding+cache tier is slower than just plain
   replicated pools,
  but even with various high queue depths I’m struggling to get much
  above 100-150 iops compared to a 3 way replica pool which can easily
  achieve 1000- 1500. The base tier is comprised of 40 disks. It seems
  quite a marked difference and I’m wondering if this strange journal
  behaviour is the cause.
  
   Does anyone have any ideas?
 
  If you're running a full cache pool, then on every operation touching
  an object which isn't in the cache pool it will try and evict an
  object. That's probably what you're seeing.
 
  Cache pool in general are only a wise idea if you have a very skewed
  distribution of data hotness and the entire hot zone can fit in
  cache at once.
  -Greg
 
  Hi Greg,
 
  It's not the caching behaviour that I confused about, it’s the journal
  behaviour on the base disks during flushing. I've been doing some more
  tests and can do something reproducible which seems strange to me.
 
  First off 10MB of 4kb writes:
  time ceph tell osd.1 bench 10000000 4096
  { "bytes_written": 10000000,
    "blocksize": 4096,
    "bytes_per_sec": 16009426.00}
 
  real    0m0.760s
  user    0m0.063s
  sys     0m0.022s
 
  Now split this into 2x5MB writes:
  time ceph tell osd.1 bench 5000000 4096 && time ceph tell osd.1 bench 5000000 4096
  { "bytes_written": 5000000,
    "blocksize": 4096,
    "bytes_per_sec": 10580846.00}
 
  real    0m0.595s
  user    0m0.065s
  sys     0m0.018s
  { "bytes_written": 5000000,
    "blocksize": 4096,
    "bytes_per_sec": 9944252.00}
 
  real    0m4.412s
  user    0m0.053s
  sys     0m0.071s
 
  2nd bench takes a lot longer even though both should easily fit in the
  5GB journal. Looking at iostat, I think I can see that no writes
  happen to the journal whilst the writes from the 1st bench are being
  flushed. Is this the expected behaviour? I would have thought as long
  as there is space available in the journal it shouldn't block on new
  writes. Also I see in iostat writes to the underlying disk happening
  at a QD of 1 and 16kb IO's for a number of seconds, with a large blip
  or activity just before the flush finishes. Is this the correct
  behaviour? I would have thought if this tell osd bench is doing
  sequential IO then the journal should be able to flush 5-10mb of data
  in a fraction a second.
 
  Ceph.conf
  [osd]
  filestore max sync interval = 30
  filestore min sync interval = 20
  filestore flusher = false
  osd_journal_size = 5120
  osd_crush_location_hook = /usr/local/bin/crush-location
  

Re: [ceph-users] RadosGW Direct Upload Limitation

2015-03-16 Thread Yehuda Sadeh-Weinraub


- Original Message -
 From: Craig Lewis cle...@centraldesktop.com
 To: Gregory Farnum g...@gregs42.com
 Cc: ceph-users@lists.ceph.com
 Sent: Monday, March 16, 2015 11:48:15 AM
 Subject: Re: [ceph-users] RadosGW Direct Upload Limitation
 
 
 
 
 Maybe, but I'm not sure if Yehuda would want to take it upstream or
 not. This limit is present because it's part of the S3 spec. For
 larger objects you should use multi-part upload, which can get much
 bigger.
 -Greg
 
 
 Note that the multi-part upload has a lower limit of 4MiB per part, and the
 direct upload has an upper limit of 5GiB.

The limit is 10MB, but it does not apply to the last part, so basically you 
could upload any object size with it. I would still recommend using the plain 
upload for smaller object sizes, it is faster, and the resulting object might 
be more efficient (for really small sizes).

Yehuda

 
 So you have to use both methods - direct upload for small files, and
 multi-part upload for big files.
 
 Your best bet is to use the Amazon S3 libraries. They have functions that
 take care of it for you.
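
For example, most S3 tools switch to multipart automatically above a
configurable chunk size, e.g. (the flag value is illustrative and needs a
reasonably recent s3cmd):

  s3cmd --multipart-chunk-size-mb=15 put bigfile.bin s3://mybucket/bigfile.bin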
 
 
 I'd like to see this mentioned in the Ceph documentation someplace. When I
 first encountered the issue, I couldn't find a limit in the RadosGW
 documentation anywhere. I only found the 5GiB limit in the Amazon API
 documentation, which lead me to test on RadosGW. Now that I know it was done
 to preserve Amazon compatibility, I don't want to override the value
 anymore.
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS unexplained writes

2015-03-16 Thread Gregory Farnum
The information you're giving sounds a little contradictory, but my
guess is that you're seeing the impacts of object promotion and
flushing. You can sample the operations the OSDs are doing at any
given time by running ops_in_progress (or similar, I forget exact
phrasing) command on the OSD admin socket. I'm not sure if rados df
is going to report cache movement activity or not.
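
For reference, the admin socket command is probably dump_ops_in_flight, e.g.
(the osd id is a placeholder):

  ceph daemon osd.0 dump_ops_in_flight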

That though would mostly be written to the SSDs, not the hard drives —
although the hard drives could still get metadata updates written when
objects are flushed. What data exactly are you seeing that's leading
you to believe writes are happening against these drives? What is the
exact CephFS and cache pool configuration?
-Greg

On Mon, Mar 16, 2015 at 2:36 PM, Erik Logtenberg e...@logtenberg.eu wrote:
 Hi,

 I forgot to mention: while I am seeing these writes in iotop and
 /proc/diskstats for the hdd's, I am -not- seeing any writes in rados
 df for the pool residing on these disks. There is only one pool active
 on the hdd's and according to rados df it is getting zero writes when
 I'm just reading big files from cephfs.

 So apparently the osd's are doing some non-trivial amount of writing on
 their own behalf. What could it be?

 Thanks,

 Erik.


 On 03/16/2015 10:26 PM, Erik Logtenberg wrote:
 Hi,

 I am getting relatively bad performance from cephfs. I use a replicated
 cache pool on ssd in front of an erasure coded pool on rotating media.

 When reading big files (streaming video), I see a lot of disk i/o,
 especially writes. I have no clue what could cause these writes. The
 writes are going to the hdd's and they stop when I stop reading.

 I mounted everything with noatime and nodiratime so it shouldn't be
 that. On a related note, the Cephfs metadata is stored on ssd too, so
 metadata-related changes shouldn't hit the hdd's anyway I think.

 Any thoughts? How can I get more information about what ceph is doing?
 Using iotop I only see that the osd processes are busy but it doesn't
 give many hints as to what they are doing.

 Thanks,

 Erik.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-16 Thread Nick Fisk




 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Gregory Farnum
 Sent: 16 March 2015 17:33
 To: Nick Fisk
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Cache Tier Flush = immediate base tier journal
 sync?
 
 On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk n...@fisk.me.uk wrote:
 
  I’m not sure if it’s something I’m doing wrong or just experiencing an
 oddity, but when my cache tier flushes dirty blocks out to the base tier, the
 writes seem to hit the OSD’s straight away instead of coalescing in the
 journals, is this correct?
 
  For example if I create a RBD on a standard 3 way replica pool and run fio
 via librbd 128k writes, I see the journals take all the io’s until I hit my
 filestore_min_sync_interval and then I see it start writing to the underlying
 disks.
 
  Doing the same on a full cache tier (to force flushing), I immediately see the
 base disks at a very high utilisation. The journals also have some write IO at
 the same time. The only other odd thing I can see via iostat is that most of
 the time whilst I'm running fio, the underlying disks are doing very small
 write IO's of around 16kb with an occasional big burst of activity.
 
  I know erasure coding+cache tier is slower than just plain replicated pools,
 but even with various high queue depths I’m struggling to get much above
 100-150 iops compared to a 3 way replica pool which can easily achieve 1000-
 1500. The base tier is comprised of 40 disks. It seems quite a marked
 difference and I’m wondering if this strange journal behaviour is the cause.
 
  Does anyone have any ideas?
 
 If you're running a full cache pool, then on every operation touching an
 object which isn't in the cache pool it will try and evict an object. That's
 probably what you're seeing.
 
 Cache pool in general are only a wise idea if you have a very skewed
 distribution of data hotness and the entire hot zone can fit in cache at
 once.
 -Greg

Hi Greg,

It's not the caching behaviour that I'm confused about, it's the journal 
behaviour on the base disks during flushing. I've been doing some more tests 
and have found something reproducible which seems strange to me. 

First off 10MB of 4kb writes:
time ceph tell osd.1 bench 10000000 4096
{ bytes_written: 10000000,
  blocksize: 4096,
  bytes_per_sec: 16009426.00}

real    0m0.760s
user    0m0.063s
sys     0m0.022s

Now split this into 2x5MB writes:
time ceph tell osd.1 bench 5000000 4096 && time ceph tell osd.1 bench 5000000 4096
{ bytes_written: 5000000,
  blocksize: 4096,
  bytes_per_sec: 10580846.00}

real    0m0.595s
user    0m0.065s
sys     0m0.018s
{ bytes_written: 5000000,
  blocksize: 4096,
  bytes_per_sec: 9944252.00}

real    0m4.412s
user    0m0.053s
sys     0m0.071s

2nd bench takes a lot longer even though both should easily fit in the 5GB 
journal. Looking at iostat, I think I can see that no writes happen to the 
journal whilst the writes from the 1st bench are being flushed. Is this the 
expected behaviour? I would have thought as long as there is space available in 
the journal it shouldn't block on new writes. Also I see in iostat writes to 
the underlying disk happening at a QD of 1 and 16kb IO's for a number of 
seconds, with a large blip of activity just before the flush finishes. Is this 
the correct behaviour? I would have thought that if this tell osd bench is 
doing sequential IO then the journal should be able to flush 5-10MB of data 
in a fraction of a second.

Ceph.conf
[osd]
filestore max sync interval = 30
filestore min sync interval = 20
filestore flusher = false
osd_journal_size = 5120
osd_crush_location_hook = /usr/local/bin/crush-location
osd_op_threads = 5
filestore_op_threads = 4


iostat during period where writes seem to be blocked (journal=sda disk=sdd)

Device:         rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00    0.00   0.00   2.00    0.00    4.00     4.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00    0.00   0.00  76.00    0.00  760.00    20.00     0.99   13.11    0.00   13.11  13.05  99.20

iostat during what I believe to be the actual flush

Device:         rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00    0.00   0.00   2.00    0.00    4.00     4.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

[ceph-users] CephFS unexplained writes

2015-03-16 Thread Erik Logtenberg
Hi,

I am getting relatively bad performance from cephfs. I use a replicated
cache pool on ssd in front of an erasure coded pool on rotating media.

When reading big files (streaming video), I see a lot of disk i/o,
especially writes. I have no clue what could cause these writes. The
writes are going to the hdd's and they stop when I stop reading.

I mounted everything with noatime and nodiratime so it shouldn't be
that. On a related note, the Cephfs metadata is stored on ssd too, so
metadata-related changes shouldn't hit the hdd's anyway I think.

Any thoughts? How can I get more information about what ceph is doing?
Using iotop I only see that the osd processes are busy but it doesn't
give many hints as to what they are doing.

Thanks,

Erik.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS unexplained writes

2015-03-16 Thread Erik Logtenberg
Hi,

I forgot to mention: while I am seeing these writes in iotop and
/proc/diskstats for the hdd's, I am -not- seeing any writes in rados
df for the pool residing on these disks. There is only one pool active
on the hdd's and according to rados df it is getting zero writes when
I'm just reading big files from cephfs.

So apparently the osd's are doing some non-trivial amount of writing on
their own behalf. What could it be?

Thanks,

Erik.


On 03/16/2015 10:26 PM, Erik Logtenberg wrote:
 Hi,
 
 I am getting relatively bad performance from cephfs. I use a replicated
 cache pool on ssd in front of an erasure coded pool on rotating media.
 
 When reading big files (streaming video), I see a lot of disk i/o,
 especially writes. I have no clue what could cause these writes. The
 writes are going to the hdd's and they stop when I stop reading.
 
 I mounted everything with noatime and nodiratime so it shouldn't be
 that. On a related note, the Cephfs metadata is stored on ssd too, so
 metadata-related changes shouldn't hit the hdd's anyway I think.
 
 Any thoughts? How can I get more information about what ceph is doing?
 Using iotop I only see that the osd processes are busy but it doesn't
 give many hints as to what they are doing.
 
 Thanks,
 
 Erik.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-16 Thread Gregory Farnum
Nothing here particularly surprises me. I don't remember all the
details of the filestore's rate limiting off the top of my head, but
it goes to great lengths to try and avoid letting the journal get too
far ahead of the backing store. Disabling the filestore flusher and
increasing the sync intervals without also increasing the
filestore_wbthrottle_* limits is not going to work well for you.
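(For reference, the knobs in question look something like the lines below; the
option names are the filestore write-back throttle settings, and the values are
only illustrative, not recommendations:)

[osd]
filestore_wbthrottle_xfs_bytes_start_flusher = 41943040
filestore_wbthrottle_xfs_bytes_hard_limit = 419430400
filestore_wbthrottle_xfs_ios_start_flusher = 500
filestore_wbthrottle_xfs_ios_hard_limit = 5000
filestore_wbthrottle_xfs_inodes_start_flusher = 500
filestore_wbthrottle_xfs_inodes_hard_limit = 5000

(There is a matching filestore_wbthrottle_btrfs_* set for btrfs-backed OSDs;
the values actually in effect can be checked with the config show command on
the OSD admin socket.)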
-Greg

On Mon, Mar 16, 2015 at 3:58 PM, Nick Fisk n...@fisk.me.uk wrote:




 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Gregory Farnum
 Sent: 16 March 2015 17:33
 To: Nick Fisk
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Cache Tier Flush = immediate base tier journal
 sync?

 On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk n...@fisk.me.uk wrote:
 
  I’m not sure if it’s something I’m doing wrong or just experiencing an
 oddity, but when my cache tier flushes dirty blocks out to the base tier, the
 writes seem to hit the OSD’s straight away instead of coalescing in the
 journals, is this correct?
 
  For example if I create a RBD on a standard 3 way replica pool and run fio
 via librbd 128k writes, I see the journals take all the io’s until I hit my
 filestore_min_sync_interval and then I see it start writing to the underlying
 disks.
 
  Doing the same on a full cache tier (to force flushing), I immediately see the
 base disks at a very high utilisation. The journals also have some write IO at
 the same time. The only other odd thing I can see via iostat is that most of
 the time whilst I'm running fio, the underlying disks are doing very small
 write IO's of around 16kb with an occasional big burst of activity.
 
  I know erasure coding+cache tier is slower than just plain replicated 
  pools,
 but even with various high queue depths I’m struggling to get much above
 100-150 iops compared to a 3 way replica pool which can easily achieve 1000-
 1500. The base tier is comprised of 40 disks. It seems quite a marked
 difference and I’m wondering if this strange journal behaviour is the cause.
 
  Does anyone have any ideas?

 If you're running a full cache pool, then on every operation touching an
 object which isn't in the cache pool it will try and evict an object. That's
 probably what you're seeing.

 Cache pool in general are only a wise idea if you have a very skewed
 distribution of data hotness and the entire hot zone can fit in cache at
 once.
 -Greg

 Hi Greg,

 It's not the caching behaviour that I'm confused about, it's the journal 
 behaviour on the base disks during flushing. I've been doing some more tests 
 and have found something reproducible which seems strange to me.

 First off 10MB of 4kb writes:
 time ceph tell osd.1 bench 10000000 4096
 { bytes_written: 10000000,
   blocksize: 4096,
   bytes_per_sec: 16009426.00}

 real    0m0.760s
 user    0m0.063s
 sys     0m0.022s

 Now split this into 2x5MB writes:
 time ceph tell osd.1 bench 5000000 4096 && time ceph tell osd.1 bench 5000000 4096
 { bytes_written: 5000000,
   blocksize: 4096,
   bytes_per_sec: 10580846.00}

 real    0m0.595s
 user    0m0.065s
 sys     0m0.018s
 { bytes_written: 5000000,
   blocksize: 4096,
   bytes_per_sec: 9944252.00}

 real    0m4.412s
 user    0m0.053s
 sys     0m0.071s

 2nd bench takes a lot longer even though both should easily fit in the 5GB 
 journal. Looking at iostat, I think I can see that no writes happen to the 
 journal whilst the writes from the 1st bench are being flushed. Is this the 
 expected behaviour? I would have thought as long as there is space available 
 in the journal it shouldn't block on new writes. Also I see in iostat writes 
 to the underlying disk happening at a QD of 1 and 16kb IO's for a number of 
 seconds, with a large blip of activity just before the flush finishes. Is 
 this the correct behaviour? I would have thought that if this tell osd bench 
 is doing sequential IO then the journal should be able to flush 5-10MB of data 
 in a fraction of a second.

 Ceph.conf
 [osd]
 filestore max sync interval = 30
 filestore min sync interval = 20
 filestore flusher = false
 osd_journal_size = 5120
 osd_crush_location_hook = /usr/local/bin/crush-location
 osd_op_threads = 5
 filestore_op_threads = 4


 iostat during period where writes seem to be blocked (journal=sda disk=sdd)

 Device:         rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
 sda               0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
 sdb               0.00    0.00   0.00   2.00    0.00    4.00     4.00     0.00    0.00    0.00    0.00   0.00   0.00
 sdc               0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
 sdd               0.00    0.00   0.00  76.00    0.00  760.00    20.00     0.99   13.11    0.00   13.11  13.05  99.20

 iostat during 

Re: [ceph-users] Mapping users to different rgw pools

2015-03-16 Thread Sreenath BH
Thanks.

Is this possible outside of a multi-zone setup (with only one zone)?

For example, I want to have pools with different replication
factors (or erasure coding profiles) and map users to these pools.
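
As a rough sketch of the RADOS side (pool names, profile name and PG counts
here are made up):

ceph osd pool create rgw-buckets-r2 128 128 replicated
ceph osd pool set rgw-buckets-r2 size 2
ceph osd erasure-code-profile set ec42 k=4 m=2
ceph osd pool create rgw-buckets-ec 128 128 erasure ec42

Mapping users onto pools like these would then go through the rgw placement
targets Craig describes below.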

-Sreenath


On 3/13/15, Craig Lewis cle...@centraldesktop.com wrote:
 Yes, RadosGW has the concept of Placement Targets and Placement Pools.  You
 can create a target, and point it at a set of RADOS pools.  Those pools can be
 configured to use different storage strategies by creating different
 crushmap rules, and assigning those rules to the pool.

 RGW users can be assigned a default placement target.  When they create a
 bucket, they can either specify the target, or use their default one.  All
 objects in a bucket are stored according to the bucket's placement target.
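
 A minimal sketch of what that looks like (target and pool names made up; the
 exact fields vary a little between versions).  In the region JSON, as edited
 via radosgw-admin region get / region set:

   "placement_targets": [
     { "name": "default-placement", "tags": [] },
     { "name": "fast-placement", "tags": [] }
   ],
   "default_placement": "default-placement",

 and in the zone JSON (radosgw-admin zone get / zone set):

   "placement_pools": [
     { "key": "default-placement",
       "val": { "index_pool": ".rgw.buckets.index",
                "data_pool": ".rgw.buckets" } },
     { "key": "fast-placement",
       "val": { "index_pool": ".rgw.buckets.index",
                "data_pool": ".rgw.buckets.fast" } }
   ],

 A user's default target can then be changed by editing default_placement in
 the output of radosgw-admin metadata get user:<uid> and writing it back with
 radosgw-admin metadata put user:<uid>.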


 I haven't seen a good guide for making use of these features.  The best
 guide I know of is the Federation guide (
 http://ceph.com/docs/giant/radosgw/federated-config/), but it only briefly
 mentions placement targets.



 On Thu, Mar 12, 2015 at 11:48 PM, Sreenath BH bhsreen...@gmail.com wrote:

 Hi all,

 Can one Radow gateway support more than one pool for storing objects?

 And as a follow-up question, is there a way to map different users to
 separate rgw pools so that their obejcts get stored in different
 pools?

 thanks,
 Sreenath


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW Direct Upload Limitation

2015-03-16 Thread Craig Lewis


 Maybe, but I'm not sure if Yehuda would want to take it upstream or
 not. This limit is present because it's part of the S3 spec. For
 larger objects you should use multi-part upload, which can get much
 bigger.
 -Greg


Note that the multi-part upload has a lower limit of 4MiB per part, and the
direct upload has an upper limit of 5GiB.

So you have to use both methods - direct upload for small files, and
multi-part upload for big files.

Your best bet is to use the Amazon S3 libraries.  They have functions that
take care of it for you.


I'd like to see this mentioned in the Ceph documentation someplace.  When I
first encountered the issue, I couldn't find a limit in the RadosGW
documentation anywhere.  I only found the 5GiB limit in the Amazon API
documentation, which led me to test on RadosGW.  Now that I know it was
done to preserve Amazon compatibility, I don't want to override the value
anymore.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs stuck unclean active+remapped after an osd marked out

2015-03-16 Thread Craig Lewis


 If I remember/guess correctly, if you mark an OSD out it won't
 necessarily change the weight of the bucket above it (ie, the host),
 whereas if you change the weight of the OSD then the host bucket's
 weight changes.
 -Greg



That sounds right.  Marking an OSD out is a ceph osd reweight, not a ceph
osd crush reweight.

Experimentally confirmed.  I have an OSD out right now, and the host's
crush weight is the same as the other hosts' crush weight.
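
For example (OSD id and weights made up):

ceph osd reweight 12 0             # what marking osd.12 out effectively does
ceph osd crush reweight osd.12 0   # changes the crush map, so the host bucket weight drops too
ceph osd tree                      # compare the weight and reweight columns after each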
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] query about mapping of Swift/S3 APIs to Ceph cluster APIs

2015-03-16 Thread Craig Lewis
On Sat, Mar 14, 2015 at 3:04 AM, pragya jain prag_2...@yahoo.co.in wrote:

 Hello all!

 I have been working on the Ceph object storage architecture for the last few
 months.

 I am unable to find a document which describes how the Ceph object storage
 APIs (Swift/S3 APIs) are mapped to the Ceph storage cluster APIs (librados
 APIs) to store the data in the Ceph storage cluster.

 As the documents say: Radosgw, a gateway interface for Ceph object storage
 users, accepts user requests to store or retrieve data in the form of Swift
 or S3 API calls and converts them into RADOS requests.

 Please help me in knowing:
 1. how does Radosgw convert a user request into a RADOS request?
 2. how are HTTP requests mapped to RADOS requests?


The RadosGW daemon takes care of that.  It's an application that sits on
top of RADOS.

For HTTP, there are a couple of ways.  The older way has Apache accepting the
HTTP request, then forwarding it to the RadosGW daemon using FastCGI.
Newer versions support RadosGW handling the HTTP directly.
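
For the civetweb route, the relevant piece of ceph.conf looks something like
this (section name and port are just an example):

[client.radosgw.gateway]
rgw frontends = "civetweb port=7480"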

For the full details, you'll want to check out the source code at
https://github.com/ceph/ceph

If you're not interested enough to read the source code (I wasn't :-) ),
setup a test cluster.  Create a user, bucket, and object, and look at the
contents of the rados pools.
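
For example (user id made up, pool names per the Firefly-era defaults):

radosgw-admin user create --uid=testuser --display-name="Test User"
# upload an object to a bucket with any S3/Swift client, then:
rados lspools
rados -p .rgw ls             # bucket entrypoint/metadata objects
rados -p .rgw.buckets ls     # the object data itself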
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs stuck unclean active+remapped after an osd marked out

2015-03-16 Thread Francois Lafont
Hi,

Gregory Farnum wrote:

 If I remember/guess correctly, if you mark an OSD out it won't
 necessarily change the weight of the bucket above it (ie, the host),
 whereas if you change the weight of the OSD then the host bucket's
 weight changes.

I can just say that, indeed, I have noticed exactly what you describe
in the output of ceph osd tree.

 That makes for different mappings, and since you only
 have a couple of OSDs per host (normally: hurray!)

er, er... no, I have 10 OSDs in the first OSD node and 11 OSDs in the
second OSD node (see my first message).

 and not many hosts (normally: sadness)

Yes, I have only 2 OSD nodes (and 3 monitors).

 then marking one OSD out makes things harder for the CRUSH algorithm.

Ah, OK. So my cluster is too small for Ceph. ;)
Thanks for your answer Greg, I will follow the pull-request with attention.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW Direct Upload Limitation

2015-03-16 Thread Gregory Farnum
On Mon, Mar 16, 2015 at 11:14 AM, Georgios Dimitrakakis
gior...@acmac.uoc.gr wrote:
 Hi all!

 I have recently updated to CEPH version 0.80.9 (latest Firefly release)
 which presumably
 supports direct upload.

 I've tried to upload a file using this functionality and it seems to be
 working for files up to 5GB. For files above 5GB there is an error. I believe
 that this is because of a hardcoded limit:

 #define RGW_MAX_PUT_SIZE(5ULL*1024*1024*1024)


 Is there a way to increase that limit other than compiling CEPH from source?

No.


 Could we somehow put it as a configuration parameter?

Maybe, but I'm not sure if Yehuda would want to take it upstream or
not. This limit is present because it's part of the S3 spec. For
larger objects you should use multi-part upload, which can get much
bigger.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shadow files

2015-03-16 Thread Craig Lewis
Out of curiosity, what's the frequency of the peaks and troughs?

RadosGW has configs on how long it should wait after deleting before
garbage collecting, how long between GC runs, and how many objects it can
GC in per run.

The defaults are 2 hours, 1 hour, and 32 respectively.  Search
http://docs.ceph.com/docs/master/radosgw/config-ref/ for rgw gc.

If your peaks and troughs have a frequency less than 1 hour, then GC is
going to delay and alias the disk usage w.r.t. the object count.

If you have millions of objects, you probably need to tweak those values.
If RGW is only GCing 32 objects an hour, it's never going to catch up.
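
For reference, the options in question, shown with what I believe are the
defaults (section name is just an example):

[client.radosgw.gateway]
rgw gc obj min wait = 7200        # 2 hours before a deleted object is eligible for GC
rgw gc processor period = 3600    # a GC cycle starts every hour
rgw gc processor max time = 3600  # maximum runtime of a single GC cycle
rgw gc max objs = 32              # number of GC shards (see Greg's caveat later in the thread)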


Now that I think about it, I bet I'm having issues here too.  I delete more
than (32*24) objects per day...



On Sun, Mar 15, 2015 at 4:41 PM, Ben b@benjackson.email wrote:

 It is either a problem with CEPH, Civetweb or something else in our
 configuration.
 But deletes in user buckets are still leaving a high number of old shadow
 files. Since we have millions and millions of objects, it is hard to
 reconcile what should and shouldn't exist.

 Looking at our cluster usage, there are no troughs, it is just a rising
 peak.
 But when looking at users data usage, we can see peaks and troughs as you
 would expect as data is deleted and added.

 Our ceph version 0.80.9

 Please ideas?

 On 2015-03-13 02:25, Yehuda Sadeh-Weinraub wrote:

 - Original Message -

 From: Ben b@benjackson.email
 To: ceph-us...@ceph.com
 Sent: Wednesday, March 11, 2015 8:46:25 PM
 Subject: Re: [ceph-users] Shadow files

 Anyone got any info on this?

 Is it safe to delete shadow files?


 It depends. Shadow files are badly named objects that represent part
 of the objects data. They are only safe to remove if you know that the
 corresponding objects no longer exist.
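
 For what it's worth, the pending GC queue can be inspected with radosgw-admin
 before removing anything by hand (assuming a Firefly-era build):

 radosgw-admin gc list --include-all   # objects currently queued for garbage collection
 radosgw-admin gc process              # run a GC pass manually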

 Yehuda


 On 2015-03-11 10:03, Ben wrote:
  We have a large number of shadow files in our cluster that aren't
  being deleted automatically as data is deleted.
 
  Is it safe to delete these files?
  Is there something we need to be aware of when deleting them?
  Is there a script that we can run that will delete these safely?
 
  Is there something wrong with our cluster that it isn't deleting these
  files when it should be?
 
  We are using civetweb with radosgw, with tengine ssl proxy infront of
  it
 
  Any advice please
  Thanks

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

2015-03-16 Thread Alexandre DERUMIER
VMs are running on the same nodes as the OSDs

Are you sure that you didn't hit some kind of out-of-memory condition?
PG rebalancing can be memory hungry (depending on how many OSDs you have).

Do you see the oom-killer in your host logs?
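
For example (log locations depend on the distribution):

dmesg | egrep -i "oom|out of memory"
egrep -i "oom-killer|killed process" /var/log/syslog /var/log/messages 2>/dev/null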


- Original Message -
From: Florent Bautista flor...@coppint.com
To: aderumier aderum...@odiso.com
Cc: ceph-users ceph-users@lists.ceph.com
Sent: Monday, 16 March 2015 12:35:11
Subject: Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

On 03/16/2015 12:23 PM, Alexandre DERUMIER wrote: 
 We use Proxmox, so I think it uses librbd ? 
 As I'm the one who made the Proxmox RBD plugin, I can confirm that yes, it's 
 librbd ;) 
 
 Is the Ceph cluster on dedicated nodes, or are the VMs running on the same 
 nodes as the OSD daemons ? 
 

VMs are running on the same nodes as the OSDs 

 And to be precise: not all VMs on that pool crashed, only some of them 
 (a large majority), and on the same host, some crashed and others did not. 
 Did the VM crash, as in no more qemu process ? 
 Or is it the guest OS that crashed ? (Do you use virtio, virtio-scsi or 
 IDE for your guests ?) 
 
 

I don't really know what crashed; I think the qemu process, but I'm not sure. 
We use virtio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shadow files

2015-03-16 Thread Gregory Farnum
On Mon, Mar 16, 2015 at 12:12 PM, Craig Lewis cle...@centraldesktop.com wrote:
 Out of curiosity, what's the frequency of the peaks and troughs?

 RadosGW has configs on how long it should wait after deleting before garbage
 collecting, how long between GC runs, and how many objects it can GC in per
 run.

 The defaults are 2 hours, 1 hour, and 32 respectively.  Search
 http://docs.ceph.com/docs/master/radosgw/config-ref/ for rgw gc.

 If your peaks and troughs have a frequency less than 1 hour, then GC is
 going to delay and alias the disk usage w.r.t. the object count.

 If you have millions of objects, you probably need to tweak those values.
 If RGW is only GCing 32 objects an hour, it's never going to catch up.


 Now that I think about it, I bet I'm having issues here too.  I delete more
 than (32*24) objects per day...

Uh, that's not quite what rgw_gc_max_objs means. That param configures
how the garbage collection data objects and internal classes are sharded,
and each grouping will only delete one object at a time. So it
controls the parallelism, but not the total number of objects!

Also, Yehuda says that changing this can be a bit dangerous because it
currently needs to be consistent across any program doing or
generating GC work.
-Greg




 On Sun, Mar 15, 2015 at 4:41 PM, Ben b@benjackson.email wrote:

 It is either a problem with CEPH, Civetweb or something else in our
 configuration.
 But deletes in user buckets are still leaving a high number of old shadow
 files. Since we have millions and millions of objects, it is hard to
 reconcile what should and shouldn't exist.

 Looking at our cluster usage, there are no troughs, it is just a rising
 peak.
 But when looking at users data usage, we can see peaks and troughs as you
 would expect as data is deleted and added.

 Our ceph version 0.80.9

 Please ideas?

 On 2015-03-13 02:25, Yehuda Sadeh-Weinraub wrote:

 - Original Message -

 From: Ben b@benjackson.email
 To: ceph-us...@ceph.com
 Sent: Wednesday, March 11, 2015 8:46:25 PM
 Subject: Re: [ceph-users] Shadow files

 Anyone got any info on this?

 Is it safe to delete shadow files?


 It depends. Shadow files are badly named objects that represent part
 of the objects data. They are only safe to remove if you know that the
 corresponding objects no longer exist.

 Yehuda


 On 2015-03-11 10:03, Ben wrote:
  We have a large number of shadow files in our cluster that aren't
  being deleted automatically as data is deleted.
 
  Is it safe to delete these files?
  Is there something we need to be aware of when deleting them?
  Is there a script that we can run that will delete these safely?
 
  Is there something wrong with our cluster that it isn't deleting these
  files when it should be?
 
  We are using civetweb with radosgw, with tengine ssl proxy infront of
  it
 
  Any advice please
  Thanks

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fw: query about mapping of Swift/S3 APIs to Ceph cluster APIs

2015-03-16 Thread pragya jain
Please, could somebody answer my queries?

Regards
Pragya Jain
Department of Computer Science
University of Delhi
Delhi, India

  On Saturday, 14 March 2015 3:34 PM, pragya jain prag_2...@yahoo.co.in wrote:

Hello all!

I have been working on the Ceph object storage architecture for the last few
months.
I am unable to find a document which describes how the Ceph object storage
APIs (Swift/S3 APIs) are mapped to the Ceph storage cluster APIs (librados
APIs) to store the data in the Ceph storage cluster.
As the documents say: Radosgw, a gateway interface for Ceph object storage
users, accepts user requests to store or retrieve data in the form of Swift
or S3 API calls and converts them into RADOS requests.

Please help me in knowing:
1. how does Radosgw convert a user request into a RADOS request?
2. how are HTTP requests mapped to RADOS requests?

Thank you

Regards
Pragya Jain
Department of Computer Science
University of Delhi
Delhi, India
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] query about region and zone creation while configuring RADOSGW

2015-03-16 Thread pragya jain
Hello all!

I am working on the Ceph object storage architecture, and I have some queries:

When configuring a federated system, we need to create regions containing one
or more zones; the cluster must have a master region, and each region must
have a master zone.

But in the case of a simple gateway configuration, is there a need to create
at least a region and a zone to store the data?

Please, could somebody reply to my query?

Thank you

Regards
Pragya Jain
Department of Computer Science
University of Delhi
Delhi, India
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com