Re: [ceph-users] CephFS: delayed objects deletion ?

2015-03-16 Thread Yan, Zheng
On Mon, Mar 16, 2015 at 5:08 PM, Florent B flor...@coppint.com wrote:
 Since then I deleted the pool.

 But now I have another problem, in fact the opposite of the previous one:
 this time I never deleted the files on the clients, the data objects and
 metadata are still in the pools, but the directory appears empty to the
 clients (it is a different directory, different pool, etc. from the
 previous problem).

 Here are logs from MDS when I restart it about one of the files :

 2015-03-16 09:57:48.626254 7f4177694700 12 mds.0.cache.dir(1a95e05)
 link_primary_inode [dentry #1/staging/api/easyrsa/vars [2,head] auth
 NULL (dversion lock) v=22 inode=0 | dirty=1 0x6ca5a20] [inode
 1a95e11 [2,head] #1a95e11 auth v22 s=0 n(v0 1=1+0) (iversion
 lock) cr={29050627=0-1966080@1} 0x53c32c8]
 2015-03-16 09:57:48.626258 7f4177694700 10 mds.0.journal
 EMetaBlob.replay added [inode 1a95e11 [2,head]
 /staging/api/easyrsa/vars auth v22 s=0 n(v0 1=1+0) (iversion lock)
 cr={29050627=0-1966080@1} 0x53c32c8]
 2015-03-16 09:57:48.626260 7f4177694700 10 mds.0.cache.ino(1a95e11)
 mark_dirty_parent
 2015-03-16 09:57:48.626261 7f4177694700 10 mds.0.journal
 EMetaBlob.replay noting opened inode [inode 1a95e11 [2,head]
 /staging/api/easyrsa/vars auth v22 dirtyparent s=0 n(v0 1=1+0) (iversion
 lock) cr={29050627=0-1966080@1} | dirtyparent=1 dirty=1 0x53c32c8]
 2015-03-16 09:57:48.626264 7f4177694700 10 mds.0.journal
 EMetaBlob.replay sessionmap v 21580500 -(1|2) == table 21580499 prealloc
 [] used 1a95e11
 2015-03-16 09:57:48.626265 7f4177694700 20 mds.0.journal  (session
 prealloc [1a95e11~3dd])
 2015-03-16 09:57:48.626843 7f4177694700 10 mds.0.journal
 EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head]
 /staging/api/easyrsa/vars auth v42 dirtyparent s=8089 n(v0 b8089 1=1+0)
 (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
 2015-03-16 09:57:48.629319 7f4177694700 10 mds.0.journal
 EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head]
 /staging/api/easyrsa/vars auth v99 dirtyparent s=8089 n(v0 b8089 1=1+0)
 (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
 2015-03-16 09:57:48.629357 7f4177694700 10 mds.0.journal
 EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head]
 /staging/api/easyrsa/vars auth v101 dirtyparent s=8089 n(v0 b8089 1=1+0)
 (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
 2015-03-16 09:57:48.636559 7f4177694700 10 mds.0.journal
 EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head]
 /staging/api/easyrsa/vars auth v164 dirtyparent s=8089 n(v0 b8089 1=1+0)
 (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
 2015-03-16 09:57:48.636597 7f4177694700 10 mds.0.journal
 EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head]
 /staging/api/easyrsa/vars auth v166 dirtyparent s=8089 n(v0 b8089 1=1+0)
 (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
 2015-03-16 09:57:48.644280 7f4177694700 10 mds.0.journal
 EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head]
 /staging/api/easyrsa/vars auth v227 dirtyparent s=8089 n(v0 b8089 1=1+0)
 (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
 2015-03-16 09:57:48.644318 7f4177694700 10 mds.0.journal
 EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head]
 /staging/api/easyrsa/vars auth v229 dirtyparent s=8089 n(v0 b8089 1=1+0)
 (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
 2015-03-16 09:57:51.911267 7f417c9a1700 15 mds.0.cache  chose lock
 states on [inode 1a95e11 [2,head] /staging/api/easyrsa/vars auth
 v229 dirtyparent s=8089 n(v0 b8089 1=1+0) (iversion lock) |
 dirtyparent=1 dirty=1 0x53c32c8]
 2015-03-16 09:57:51.916816 7f417c9a1700 20 mds.0.locker
 check_inode_max_size no-op on [inode 1a95e11 [2,head]
 /staging/api/easyrsa/vars auth v229 dirtyparent s=8089 n(v0 b8089 1=1+0)
 (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
 2015-03-16 09:57:51.958925 7f417c9a1700  7 mds.0.cache inode [inode
 1a95e11 [2,head] /staging/api/easyrsa/vars auth v229 dirtyparent
 s=8089 n(v0 b8089 1=1+0) (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
 2015-03-16 09:57:56.561404 7f417c9a1700 10 mds.0.cache  unlisting
 unwanted/capless inode [inode 1a95e11 [2,head]
 /staging/api/easyrsa/vars auth v229 dirtyparent s=8089 n(v0 b8089 1=1+0)
 (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]



These log messages are not about deleted files. Could you try again and upload the
log file and the output of "rados -p data ls" somewhere?
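
For reference, something like this should capture both (a sketch; the pool name
"data" and MDS rank 0 are assumptions):

  rados -p data ls > /tmp/data-objects.txt
  ceph tell mds.0 injectargs '--debug_mds 20'   # or set "debug mds = 20" in ceph.conf and restart the MDS
  # the log then ends up in /var/log/ceph/ceph-mds.*.log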

Regards
Yan, Zheng


 What is going on ?

 On 03/16/2015 02:18 AM, Yan, Zheng wrote:
 I don't know what went wrong. Could you use "rados -p data ls" to check
 which objects still exist, then restart the MDS with debug_mds=20
 and search the log for the names of the remaining objects.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph.conf

2015-03-16 Thread Jesus Chavez (jeschave)
Hi all, I have seen that new versions of Ceph on new OSes like RHEL 7 and CentOS 7 
don't need information like mon.node1 and osd.0 etc. anymore. Can anybody 
tell me if that is for real, or do I still need to write config like this:

[osd.0]
  host = sagitario
  addr = 192.168.1.67
[mon.leo]
  host = leo
  mon addr = 192.168.1.81:6789
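
For comparison, a minimal ceph-deploy style ceph.conf usually carries only
cluster-wide settings like the sketch below (all values are placeholders); the
daemons are then located from /var/lib/ceph rather than from per-daemon
sections:

[global]
fsid = a7f64266-0894-4f1e-a635-d0aeaca0e993
mon_initial_members = leo
mon_host = 192.168.1.81
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx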




Jesus Chavez
SYSTEMS ENGINEER-C.SALES

jesch...@cisco.com
Phone: +52 55 5267 3146
Mobile: +51 1 5538883255

CCIE - 44433


Cisco.com - http://www.cisco.com/













___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PHP Rados failed in read operation if object size is large (say more than 10 MB )

2015-03-16 Thread Wido den Hollander
On 03/16/2015 01:55 PM, Gaurang Vyas wrote:
 running on ubuntu with nginx + php-fpm
 
 <?php
 $rados = rados_create('admin');
 
 rados_conf_read_file($rados, '/etc/ceph/ceph.conf');
 rados_conf_set($rados, 'keyring', '/etc/ceph/ceph.client.admin.keyring');
 
 $temp = rados_conf_get($rados, 'rados_osd_op_timeout');
 echo ' osd ';
 echo $temp;
 $temp = rados_conf_get($rados, 'client_mount_timeout');
 echo ' client ';
 echo $temp;
 $temp = rados_conf_get($rados, 'rados_mon_op_timeout');
 echo ' mon ';
 echo $temp;
 
 $err = rados_connect($rados);
 $ioRados = rados_ioctx_create($rados, 'dev_whereis');
 
 $pieceSize = rados_stat($ioRados, 'TEMP_object');
 var_dump($pieceSize);
 
 $piece = rados_read($ioRados, 'TEMP_object', $pieceSize['psize'], 0);
 

So what is the error exactly? Are you running phprados from the master
branch on Github?

 echo $piece;
 ?>
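
One way to narrow this down is to check whether the same object can be read in
full outside PHP, e.g. with the rados CLI (pool and object names taken from the
snippet above):

  rados -p dev_whereis stat TEMP_object
  rados -p dev_whereis get TEMP_object /tmp/TEMP_object   # if this works, the problem is likely in phprados, not RADOS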
 
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

2015-03-16 Thread Steffen W Sørensen

On 16/03/2015, at 12.23, Alexandre DERUMIER aderum...@odiso.com wrote:

 We use Proxmox, so I think it uses librbd ? 
 
 As It's me that I made the proxmox rbd plugin, I can confirm that yes, it's 
 librbd ;)
 Is the ceph cluster on dedicated nodes ? or vms are running on same nodes 
 than osd daemons ?
My cluster has Ceph OSDs+MONs on separate PVE nodes, no VMs.

 
 
 And I precise that not all VMs on that pool crashed, only some of them 
 (a large majority), and on a same host, some crashed and others not. 
 
 Is the vm crashed, like no more qemu process ?
 or is it the guest os which is crashed ?
Hmm, it's been a long time now; I remember the VM status was stopped and resume didn't 
work, so they were started again ASAP :)

 (do you use virtio, virtio-scsi or ide for your guest ?)
virtio

/Steffen



signature.asc
Description: Message signed with OpenPGP using GPGMail
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PHP Rados failed in read operation if object size is large (say more than 10 MB )

2015-03-16 Thread Gaurang Vyas
running on ubuntu with nginx + php-fpm

?php
$rados = rados_create('admin');


rados_conf_read_file($rados, '/etc/ceph/ceph.conf');
rados_conf_set($rados, 'keyring','/etc/ceph/ceph.client.admin.keyring');

$temp = rados_conf_get($rados, rados_osd_op_timeout);
echo  osd ;
echo $temp;
$temp = rados_conf_get($rados, client_mount_timeout);
echo  clinet   ;
echo $temp;
$temp = rados_conf_get($rados, rados_mon_op_timeout);
echo   mon   ;
echo $temp;

$err = rados_connect($rados);
$ioRados = rados_ioctx_create($rados,'dev_whereis');

$pieceSize = rados_stat($ioRados,'TEMP_object');
var_dump($pieceSize);

$piece = rados_read($ioRados, 'TEMP_object',$pieceSize['psize'] ,0);

echo $piece;
?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

2015-03-16 Thread Alexandre DERUMIER
That full system slows down, OK, but brutal stop... 

This is strange, that could be:

- qemu crash, maybe a bug in rbd block storage (if you use librbd)
- oom-killer on your host (any logs ?)

what is your qemu version ?


- Original Message -
From: Florent Bautista flor...@coppint.com
To: ceph-users ceph-users@lists.ceph.com
Sent: Monday, March 16, 2015 10:11:43
Subject: Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

Of course but it does not explain why VMs stopped... 
That full system slows down, OK, but brutal stop... 

On 03/14/2015 07:00 PM, Andrija Panic wrote: 



Changing the PG number causes a LOT of data rebalancing (in my case it was 80%), 
which I learned the hard way... 

On 14 March 2015 at 18:49, Gabri Mate  mailingl...@modernbiztonsag.org  
wrote: 

I had the same issue a few days ago. I was increasing the pg_num of one 
pool from 512 to 1024 and all the VMs in that pool stopped. I came to 
the conclusion that doubling the pg_num caused such a high load in ceph 
that the VMs were blocked. The next time I will test with small 
increments. 


On 12:38 Sat 14 Mar , Florent B wrote: 
 Hi all, 
 
 I have a Giant cluster in production. 
 
 Today one of my RBD pools had the "too few pgs" warning, so I changed 
 pg_num & pgp_num. 
 
 And at this moment, some of the VMs stored on this pool were stopped (on 
 some hosts, not all; it depends, no obvious logic). 
 
 All was running fine for months... 
 
 Have you ever seen this ? 
 What could have caused this ? 
 
 Thank you. 
 
 
 ___ 
 ceph-users mailing list 
 ceph-users@lists.ceph.com 
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 






-- 

Andrija Panić 


___
ceph-users mailing list ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 



___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

2015-03-16 Thread Alexandre DERUMIER
We use Proxmox, so I think it uses librbd ? 

As I'm the one who made the Proxmox RBD plugin, I can confirm that yes, it's 
librbd ;)

Is the ceph cluster on dedicated nodes, or are the VMs running on the same nodes as 
the osd daemons ?


And I precise that not all VMs on that pool crashed, only some of them 
(a large majority), and on a same host, some crashed and others not. 

Did the VM crash, as in no more qemu process ?
Or is it the guest OS which crashed ? (do you use virtio, virtio-scsi or ide 
for your guest ?)





- Original Message -
From: Florent Bautista flor...@coppint.com
To: aderumier aderum...@odiso.com
Cc: ceph-users ceph-users@lists.ceph.com
Sent: Monday, March 16, 2015 11:14:45
Subject: Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

On 03/16/2015 11:03 AM, Alexandre DERUMIER wrote: 
 This is strange, that could be: 
 
 - qemu crash, maybe a bug in rbd block storage (if you use librbd) 
 - oom-killer on you host (any logs ?) 
 
 what is your qemu version ? 
 

Now, we have version 2.1.3. 

Some VMs that stopped had been running for a long time, but some others had 
only 4 days of uptime. 
 
And to be precise, not all VMs on that pool crashed, only some of them 
(a large majority), and on the same host, some crashed and others did not. 

We use Proxmox, so I think it uses librbd ? 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

2015-03-16 Thread Steffen W Sørensen

On 16/03/2015, at 11.14, Florent B flor...@coppint.com wrote:

 On 03/16/2015 11:03 AM, Alexandre DERUMIER wrote:
 This is strange, that could be:
 
 - qemu crash, maybe a bug in rbd block storage (if you use librbd)
 - oom-killer on you host (any logs ?)
 
 what is your qemu version ?
 
 
 Now, we have version 2.1.3.
 
 Some VMs that stopped were running for a long time, but some other had
 only 4 days uptime.
 
 And I precise that not all VMs on that pool crashed, only some of them
 (a large majority), and on a same host, some crashed and others not.
 
 We use Proxmox, so I think it uses librbd ?
I had the same issue once when bumping up pg_num: the majority of my Proxmox 
VMs stopped. I believe this might be due to heavy rebalancing causing timeouts 
when the VMs try to do IO ops, thus generating kernel panics.

Next time around I want to go with smaller increments of pg_num and hopefully avoid 
this (see the sketch below).
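
For reference, a way to script such small increments might look like this (a
sketch; the pool name, the step values and the 60-second poll are all
illustrative):

  for pg in 2176 2304 2432 2560; do
      ceph osd pool set rbd pg_num $pg
      while ! ceph health | grep -q HEALTH_OK; do sleep 60; done
      ceph osd pool set rbd pgp_num $pg
      while ! ceph health | grep -q HEALTH_OK; do sleep 60; done
  done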

I follow the need for more PGs when you have more OSDs, but how come PGs get too 
few when adding more objects/data to a pool?

/Steffen


signature.asc
Description: Message signed with OpenPGP using GPGMail
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

2015-03-16 Thread Azad Aliyar
May I know your Ceph version? The latest version of Firefly, 0.80.9, has
patches to avoid excessive data migration during reweighting of OSDs. You may
need to set a tunable in order to make this patch active.

This is a bugfix release for firefly.  It fixes a performance regression
in librbd, an important CRUSH misbehavior (see below), and several RGW
bugs.  We have also backported support for flock/fcntl locks to ceph-fuse
and libcephfs.

We recommend that all Firefly users upgrade.

For more detailed information, see
  http://docs.ceph.com/docs/master/_downloads/v0.80.9.txt

Adjusting CRUSH maps


* This point release fixes several issues with CRUSH that trigger
  excessive data migration when adjusting OSD weights.  These are most
  obvious when a very small weight change (e.g., a change from 0 to
  .01) triggers a large amount of movement, but the same set of bugs
  can also lead to excessive (though less noticeable) movement in
  other cases.

  However, because the bug may already have affected your cluster,
  fixing it may trigger movement *back* to the more correct location.
  For this reason, you must manually opt-in to the fixed behavior.

  In order to set the new tunable to correct the behavior::

 ceph osd crush set-tunable straw_calc_version 1

  Note that this change will have no immediate effect.  However, from
  this point forward, any 'straw' bucket in your CRUSH map that is
  adjusted will get non-buggy internal weights, and that transition
  may trigger some rebalancing.

  You can estimate how much rebalancing will eventually be necessary
  on your cluster with::

 ceph osd getcrushmap -o /tmp/cm
 crushtool -i /tmp/cm --num-rep 3 --test --show-mappings > /tmp/a 2>&1
 crushtool -i /tmp/cm --set-straw-calc-version 1 -o /tmp/cm2
 crushtool -i /tmp/cm2 --reweight -o /tmp/cm2
 crushtool -i /tmp/cm2 --num-rep 3 --test --show-mappings > /tmp/b 2>&1
 wc -l /tmp/a  # num total mappings
 diff -u /tmp/a /tmp/b | grep -c ^+   # num changed mappings

   Divide the total number of lines in /tmp/a with the number of lines
   changed.  We've found that most clusters are under 10%.

   You can force all of this rebalancing to happen at once with::

 ceph osd crush reweight-all

   Otherwise, it will happen at some unknown point in the future when
   CRUSH weights are next adjusted.

Notable Changes
---

* ceph-fuse: flock, fcntl lock support (Yan, Zheng, Greg Farnum)
* crush: fix straw bucket weight calculation, add straw_calc_version
  tunable (#10095 Sage Weil)
* crush: fix tree bucket (Rongzu Zhu)
* crush: fix underflow of tree weights (Loic Dachary, Sage Weil)
* crushtool: add --reweight (Sage Weil)
* librbd: complete pending operations before losing image (#10299 Jason
  Dillaman)
* librbd: fix read caching performance regression (#9854 Jason Dillaman)
* librbd: gracefully handle deleted/renamed pools (#10270 Jason Dillaman)
* mon: fix dump of chooseleaf_vary_r tunable (Sage Weil)
* osd: fix PG ref leak in snaptrimmer on peering (#10421 Kefu Chai)
* osd: handle no-op write with snapshot (#10262 Sage Weil)
* radosgw-admi




On 03/16/2015 12:37 PM, Alexandre DERUMIER wrote:
 VMs are running on the same nodes than OSD
 Are you sure that you didn't some kind of out of memory.
 pg rebalance can be memory hungry. (depend how many osd you have).

2 OSD per host, and 5 hosts in this cluster.
hosts h
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

2015-03-16 Thread Michael Kuriger
I always keep my pg number a power of 2.  So I’d go from 2048 to 4096.  I’m not 
sure if this is the safest way, but it’s worked for me.





Michael Kuriger
Sr. Unix Systems Engineer
mk7...@yp.com | 818-649-7235


From: Chu Duc Minh chu.ducm...@gmail.com
Date: Monday, March 16, 2015 at 7:49 AM
To: Florent B flor...@coppint.com
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

I'm using the latest Giant and have the same issue. When i increase PG_num of a 
pool from 2048 to 2148, my VMs is still ok. When i increase from 2148 to 2400, 
some VMs die (Qemu-kvm process die).
My physical servers (host VMs) running kernel 3.13 and use librbd.
I think it's a bug in librbd with crushmap.
(I set crush_tunables3 on my ceph cluster, does it make sense?)

Do you know a way to safely increase PG_num? (I don't think increase PG_num 100 
each times is a safe  good way)

Regards,

On Mon, Mar 16, 2015 at 8:50 PM, Florent B flor...@coppint.com wrote:
We are on Giant.

On 03/16/2015 02:03 PM, Azad Aliyar wrote:

 May I know your ceph version.?. The latest version of firefly 80.9 has
 patches to avoid excessive data migrations during rewighting osds. You
 may need set a tunable inorder make this patch active.

 This is a bugfix release for firefly.  It fixes a performance regression
 in librbd, an important CRUSH misbehavior (see below), and several RGW
 bugs.  We have also backported support for flock/fcntl locks to ceph-fuse
 and libcephfs.

 We recommend that all Firefly users upgrade.

 For more detailed information, see
   
 http://docs.ceph.com/docs/master/_downloads/v0.80.9.txt

 Adjusting CRUSH maps
 

 * This point release fixes several issues with CRUSH that trigger
   excessive data migration when adjusting OSD weights.  These are most
   obvious when a very small weight change (e.g., a change from 0 to
   .01) triggers a large amount of movement, but the same set of bugs
   can also lead to excessive (though less noticeable) movement in
   other cases.

   However, because the bug may already have affected your cluster,
   fixing it may trigger movement *back* to the more correct location.
   For this reason, you must manually opt-in to the fixed behavior.

   In order to set the new tunable to correct the behavior::

  ceph osd crush set-tunable straw_calc_version 1

   Note that this change will have no immediate effect.  However, from
   this point forward, any 'straw' bucket in your CRUSH map that is
   adjusted will get non-buggy internal weights, and that transition
   may trigger some rebalancing.

   You can estimate how much rebalancing will eventually be necessary
   on your cluster with::

  ceph osd getcrushmap -o /tmp/cm
  crushtool -i /tmp/cm --num-rep 3 --test --show-mappings > /tmp/a 2>&1
  crushtool -i /tmp/cm --set-straw-calc-version 1 -o /tmp/cm2
  crushtool -i /tmp/cm2 --reweight -o /tmp/cm2
  crushtool -i /tmp/cm2 --num-rep 3 --test --show-mappings > /tmp/b 2>&1
  wc -l /tmp/a  # num total mappings
  diff -u /tmp/a /tmp/b | grep -c ^+   # num changed mappings

Divide the total number of lines in /tmp/a with the number of lines
changed.  We've found that most clusters are under 10%.

You can force all of this rebalancing to happen at once with::

  ceph osd crush reweight-all

Otherwise, it will happen at some unknown point in the future when
CRUSH weights are next adjusted.

 Notable Changes
 ---

 * ceph-fuse: flock, fcntl lock support (Yan, Zheng, Greg Farnum)
 * crush: fix straw bucket weight calculation, add straw_calc_version
   tunable (#10095 Sage Weil)
 * crush: fix tree bucket (Rongzu Zhu)
 * crush: fix underflow of tree weights (Loic Dachary, Sage Weil)
 * crushtool: add --reweight (Sage Weil)
 * librbd: complete pending operations before losing image (#10299 Jason
   Dillaman)
 * librbd: fix read caching performance regression (#9854 Jason Dillaman)
 * librbd: gracefully handle deleted/renamed pools (#10270 Jason Dillaman)
 * mon: fix dump of chooseleaf_vary_r tunable (Sage Weil)
 * osd: fix PG ref leak in snaptrimmer on peering (#10421 Kefu Chai)
 * osd: handle no-op write with snapshot (#10262 Sage Weil)
 * radosgw-admi




 On 03/16/2015 12:37 PM, Alexandre DERUMIER wrote:
  VMs are running on the same nodes than OSD
  Are you sure that you didn't some kind of out of memory.
  pg rebalance can be memory hungry. (depend how many osd you have).

 2 OSD per host, and 

Re: [ceph-users] Calamari - Data

2015-03-16 Thread John Spray

Sumit,

You may have better luck on the ceph-calamari mailing list.  Anyway - 
calamari uses graphite to handle metrics, and graphite does indeed write 
them to files.
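
If you want to check on disk, the Diamond/Graphite metrics end up as whisper
files; on a typical Calamari server they live under a path like the one below
(the exact path is an assumption and varies by distro):

  find /var/lib/graphite/whisper -name '*.wsp' | head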


John

On 11/03/2015 05:09, Sumit Gaur wrote:

Hi
I have a basic architecture-related question. I know Calamari collects 
system usage data (via the Diamond collector) using performance counters. I 
need to know whether all the system performance data that Calamari shows 
stays in memory, or whether it uses files to store it.

Thanks
sumit


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: authorizations ?

2015-03-16 Thread John Spray

On 13/03/2015 11:51, Florent B wrote:

Hi all,

My question is about user management in CephFS.

Is it possible to restrict a CephX user to access some subdirectories ?
Not yet.  The syntax for setting a path= part in the authorization 
caps for a cephx user exists, but the code for enforcing it isn't done yet.
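
For the record, the (currently unenforced) cap syntax looks roughly like this
(a sketch; client name, path and pool are examples):

  ceph auth get-or-create client.restricted \
      mds 'allow rw path=/home/restricted' \
      mon 'allow r' \
      osd 'allow rw pool=data'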


John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: delayed objects deletion ?

2015-03-16 Thread John Spray

On 16/03/2015 16:30, Florent B wrote:
Thank you John :) Hammer is not released yet, is it ? Is it 'safe' to 
upgrade a production cluster to 0.93 ? 
I keep forgetting that -- yes, I should have added ...when it's 
released :-)


John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: delayed objects deletion ?

2015-03-16 Thread John Spray

On 14/03/2015 09:22, Florent B wrote:

Hi,

What do you call old MDS ? I'm on Giant release, it is not very old...
With CephFS we have a special definition of old that is anything that 
doesn't have the very latest bug fixes ;-)


There have definitely been fixes to stray file handling[1] between giant 
and hammer.  Since with giant you're using a version that is neither 
latest nor LTS, I'd suggest you upgrade to hammer.  Hammer also includes 
some new perf counters related to strays[2] that will allow you to see 
how the purging is (or isn't) progressing.


If you can reproduce this on hammer, then please capture "ceph daemon 
mds.<daemon id> session ls" and "ceph mds tell mds.<daemon id> dumpcache 
/tmp/cache.txt", in addition to the procedure to reproduce.  Ideally 
logs with "debug mds = 10" as well.
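
Something like the following should gather all three (a sketch, assuming the
MDS id is "a"):

  ceph daemon mds.a session ls > /tmp/sessions.json
  ceph mds tell mds.a dumpcache /tmp/cache.txt
  ceph tell mds.a injectargs '--debug_mds 10'   # or put "debug mds = 10" in ceph.conf and restart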


Cheers,
John

1.
http://tracker.ceph.com/issues/10387
http://tracker.ceph.com/issues/10164

2.
http://tracker.ceph.com/issues/10388


And I tried restarting both but it didn't solve my problem.

Will it be OK in Hammer ?

On 03/13/2015 04:27 AM, Yan, Zheng wrote:

On Fri, Mar 13, 2015 at 1:17 AM, Florent B flor...@coppint.com wrote:

Hi all,

I test CephFS again on Giant release.

I use ceph-fuse.

After deleting a large directory (few hours ago), I can see that my pool
still contains 217 GB of objects.

Even if my root directory on CephFS is empty.

And metadata pool is 46 MB.

Is it expected ? If not, how to debug this ?

Old mds does not work well in this area. Try umounting clients and
restarting MDS.

Regards
Yan, Zheng



Thank you.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OS file Cache, Ceph RBD cache and Network files systems

2015-03-16 Thread Stéphane DUGRAVOT
Hi Cephers, 

Our university may deploy Ceph. The goal is to store data for research 
laboratories (non-HPC). To do this, we plan to use Ceph with RBD (a mapped block 
device) on an NFS (or CIFS) server (the Ceph client) that exports to workstations in 
the laboratories. According to our tests, the OS (Ubuntu or CentOS...) that maps the 
RBD block device applies its file system write cache (vm.dirty_ratio, etc.). In that 
case, the NFS server will always acknowledge writes to the workstations even though it 
has not finished writing the data to the Ceph cluster - and this regardless of whether 
the RBD cache is enabled or not in the [client] section of the config. 

My questions: 


1. Is enabling the RBD cache only useful when it involves virtual machines 
(where QEMU can access an image as a virtual block device directly via librbd)? 
(See the config sketch below.) 
2. Is it common to use Ceph with RBD to share network file systems? 
3. And if so, what are the recommendations concerning the OS cache? 
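
On question 1: the rbd cache settings only apply to librbd clients (e.g. QEMU);
the kernel rbd driver used by "rbd map" goes through the normal page cache
instead. A sketch of the client-side settings (values are examples, not a
recommendation):

  [client]
      rbd cache = true
      rbd cache writethrough until flush = true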

Thanks a lot. 
Stephane. 

-- 
Université de Lorraine 
Stéphane DUGRAVOT - Direction du numérique - Infrastructure 
Jabber : stephane.dugra...@univ-lorraine.fr 
Tél.: +33 3 83 68 20 98 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph release timeline

2015-03-16 Thread David Moreau Simard
Great work !

David Moreau Simard

On 2015-03-15 06:29 PM, Loic Dachary wrote:
 Hi Ceph,

 In an attempt to clarify which Ceph release is stable, LTS or development, a 
 new page was added to the documentation: 
 http://ceph.com/docs/master/releases/ It is a matrix where each cell is a 
 release number linked to the release notes from 
 http://ceph.com/docs/master/release-notes/. One line per month and one column 
 per release.

 Cheers



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

2015-03-16 Thread Chu Duc Minh
@Michael Kuriger: when ceph/librbd operates normally, I know that doubling the
pg_num is the safe way. But when it has a problem, I think doubling it can make
many, many VMs die (maybe >= 50%?)


On Mon, Mar 16, 2015 at 9:53 PM, Michael Kuriger mk7...@yp.com wrote:

   I always keep my pg number a power of 2.  So I’d go from 2048 to 4096.
 I’m not sure if this is the safest way, but it’s worked for me.






 Michael Kuriger

 Sr. Unix Systems Engineer

  mk7...@yp.com | 818-649-7235

   From: Chu Duc Minh chu.ducm...@gmail.com
 Date: Monday, March 16, 2015 at 7:49 AM
 To: Florent B flor...@coppint.com
 Cc: ceph-users@lists.ceph.com ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

I'm using the latest Giant and have the same issue. When i increase
 PG_num of a pool from 2048 to 2148, my VMs is still ok. When i increase
 from 2148 to 2400, some VMs die (Qemu-kvm process die).
  My physical servers (host VMs) running kernel 3.13 and use librbd.
  I think it's a bug in librbd with crushmap.
  (I set crush_tunables3 on my ceph cluster, does it make sense?)

 Do you know a way to safely increase PG_num? (I don't think increase
 PG_num 100 each times is a safe  good way)

  Regards,

 On Mon, Mar 16, 2015 at 8:50 PM, Florent B flor...@coppint.com wrote:

 We are on Giant.

 On 03/16/2015 02:03 PM, Azad Aliyar wrote:
 
  May I know your ceph version.?. The latest version of firefly 80.9 has
  patches to avoid excessive data migrations during rewighting osds. You
  may need set a tunable inorder make this patch active.
 
  This is a bugfix release for firefly.  It fixes a performance regression
  in librbd, an important CRUSH misbehavior (see below), and several RGW
  bugs.  We have also backported support for flock/fcntl locks to
 ceph-fuse
  and libcephfs.
 
  We recommend that all Firefly users upgrade.
 
  For more detailed information, see
http://docs.ceph.com/docs/master/_downloads/v0.80.9.txt
 
  Adjusting CRUSH maps
  
 
  * This point release fixes several issues with CRUSH that trigger
excessive data migration when adjusting OSD weights.  These are most
obvious when a very small weight change (e.g., a change from 0 to
.01) triggers a large amount of movement, but the same set of bugs
can also lead to excessive (though less noticeable) movement in
other cases.
 
However, because the bug may already have affected your cluster,
fixing it may trigger movement *back* to the more correct location.
For this reason, you must manually opt-in to the fixed behavior.
 
In order to set the new tunable to correct the behavior::
 
   ceph osd crush set-tunable straw_calc_version 1
 
Note that this change will have no immediate effect.  However, from
this point forward, any 'straw' bucket in your CRUSH map that is
adjusted will get non-buggy internal weights, and that transition
may trigger some rebalancing.
 
You can estimate how much rebalancing will eventually be necessary
on your cluster with::
 
   ceph osd getcrushmap -o /tmp/cm
   crushtool -i /tmp/cm --num-rep 3 --test --show-mappings > /tmp/a 2>&1
   crushtool -i /tmp/cm --set-straw-calc-version 1 -o /tmp/cm2
   crushtool -i /tmp/cm2 --reweight -o /tmp/cm2
   crushtool -i /tmp/cm2 --num-rep 3 --test --show-mappings > /tmp/b 2>&1
   wc -l /tmp/a  # num total mappings
   diff -u /tmp/a /tmp/b | grep -c ^+   # num changed mappings
 
 Divide the total number of lines in /tmp/a with the number of lines
 changed.  We've found that most clusters are under 10%.
 
 You can force all of this rebalancing to happen at once with::
 
   ceph osd crush reweight-all
 
 Otherwise, it will happen at some unknown point in the future when
 CRUSH weights are next adjusted.
 
  Notable Changes
  ---
 
  * ceph-fuse: flock, fcntl lock support (Yan, Zheng, Greg Farnum)
  * crush: fix straw bucket weight calculation, add straw_calc_version
tunable (#10095 Sage Weil)
  * crush: fix tree bucket (Rongzu Zhu)
  * crush: fix underflow of tree weights (Loic Dachary, Sage Weil)
  * crushtool: add --reweight (Sage Weil)
  * librbd: complete pending operations before losing image (#10299 Jason
Dillaman)
  * librbd: fix read caching performance regression (#9854 Jason Dillaman)
  * librbd: gracefully handle deleted/renamed pools (#10270 Jason
 Dillaman)
  * mon: fix dump of chooseleaf_vary_r tunable (Sage Weil)
  * osd: fix PG ref leak in snaptrimmer on peering (#10421 Kefu Chai)
  * osd: handle no-op write with snapshot (#10262 Sage Weil)
  * radosgw-admi
 
 
 
 
  On 03/16/2015 12:37 PM, 

Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

2015-03-16 Thread Chu Duc Minh
I'm using the latest Giant and have the same issue. When I increase the pg_num
of a pool from 2048 to 2148, my VMs are still OK. When I increase from 2148
to 2400, some VMs die (the qemu-kvm processes die).
My physical servers (hosting the VMs) run kernel 3.13 and use librbd.
I think it's a bug in librbd with the crushmap.
(I set crush_tunables3 on my ceph cluster, does it make sense?)

Do you know a way to safely increase pg_num? (I don't think increasing pg_num
by 100 each time is a safe & good way)

Regards,

On Mon, Mar 16, 2015 at 8:50 PM, Florent B flor...@coppint.com wrote:

 We are on Giant.

 On 03/16/2015 02:03 PM, Azad Aliyar wrote:
 
  May I know your ceph version.?. The latest version of firefly 80.9 has
  patches to avoid excessive data migrations during rewighting osds. You
  may need set a tunable inorder make this patch active.
 
  This is a bugfix release for firefly.  It fixes a performance regression
  in librbd, an important CRUSH misbehavior (see below), and several RGW
  bugs.  We have also backported support for flock/fcntl locks to ceph-fuse
  and libcephfs.
 
  We recommend that all Firefly users upgrade.
 
  For more detailed information, see
http://docs.ceph.com/docs/master/_downloads/v0.80.9.txt
 
  Adjusting CRUSH maps
  
 
  * This point release fixes several issues with CRUSH that trigger
excessive data migration when adjusting OSD weights.  These are most
obvious when a very small weight change (e.g., a change from 0 to
.01) triggers a large amount of movement, but the same set of bugs
can also lead to excessive (though less noticeable) movement in
other cases.
 
However, because the bug may already have affected your cluster,
fixing it may trigger movement *back* to the more correct location.
For this reason, you must manually opt-in to the fixed behavior.
 
In order to set the new tunable to correct the behavior::
 
   ceph osd crush set-tunable straw_calc_version 1
 
Note that this change will have no immediate effect.  However, from
this point forward, any 'straw' bucket in your CRUSH map that is
adjusted will get non-buggy internal weights, and that transition
may trigger some rebalancing.
 
You can estimate how much rebalancing will eventually be necessary
on your cluster with::
 
   ceph osd getcrushmap -o /tmp/cm
   crushtool -i /tmp/cm --num-rep 3 --test --show-mappings > /tmp/a 2>&1
   crushtool -i /tmp/cm --set-straw-calc-version 1 -o /tmp/cm2
   crushtool -i /tmp/cm2 --reweight -o /tmp/cm2
   crushtool -i /tmp/cm2 --num-rep 3 --test --show-mappings > /tmp/b 2>&1
   wc -l /tmp/a  # num total mappings
   diff -u /tmp/a /tmp/b | grep -c ^+   # num changed mappings
 
 Divide the total number of lines in /tmp/a with the number of lines
 changed.  We've found that most clusters are under 10%.
 
 You can force all of this rebalancing to happen at once with::
 
   ceph osd crush reweight-all
 
 Otherwise, it will happen at some unknown point in the future when
 CRUSH weights are next adjusted.
 
  Notable Changes
  ---
 
  * ceph-fuse: flock, fcntl lock support (Yan, Zheng, Greg Farnum)
  * crush: fix straw bucket weight calculation, add straw_calc_version
tunable (#10095 Sage Weil)
  * crush: fix tree bucket (Rongzu Zhu)
  * crush: fix underflow of tree weights (Loic Dachary, Sage Weil)
  * crushtool: add --reweight (Sage Weil)
  * librbd: complete pending operations before losing image (#10299 Jason
Dillaman)
  * librbd: fix read caching performance regression (#9854 Jason Dillaman)
  * librbd: gracefully handle deleted/renamed pools (#10270 Jason Dillaman)
  * mon: fix dump of chooseleaf_vary_r tunable (Sage Weil)
  * osd: fix PG ref leak in snaptrimmer on peering (#10421 Kefu Chai)
  * osd: handle no-op write with snapshot (#10262 Sage Weil)
  * radosgw-admi
 
 
 
 
  On 03/16/2015 12:37 PM, Alexandre DERUMIER wrote:
   VMs are running on the same nodes than OSD
   Are you sure that you didn't some kind of out of memory.
   pg rebalance can be memory hungry. (depend how many osd you have).
 
  2 OSD per host, and 5 hosts in this cluster.
  hosts h
 

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mapping users to different rgw pools

2015-03-16 Thread Craig Lewis
Yes, the placement target feature is logically separate from multi-zone
setups.  Placement targets are configured in the region though, which
somewhat muddies the issue.

Placement targets are a useful feature for multi-zone setups, so different zones in
a cluster don't share the same disks.  The federation setup is the only place
I've seen any discussion of the topic, and even that is just a brief
mention.  I didn't see any documentation directly talking about setting up
placement targets, even in the federation guides.

It looks like you'll need to edit the default region to add the placement
targets, but you won't need to set up zones.  As far as I can tell, you'll
have to piece together what you need from the federation setup and some
experimentation.  I highly recommend a test VM that you can experiment on
before attempting anything in production.
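
As a rough sketch of the moving parts (names are examples, and the JSON editing
steps are abbreviated):

  radosgw-admin region get > region.json
  # add {"name": "special-placement", "tags": []} under "placement_targets" in region.json
  radosgw-admin region set --infile region.json
  radosgw-admin zone get > zone.json
  # add a matching "special-placement" entry under "placement_pools", pointing at your pools
  radosgw-admin zone set --infile zone.json
  radosgw-admin regionmap update
  # per-user default: edit "default_placement" in the user's metadata
  radosgw-admin metadata get user:someuser > user.json
  radosgw-admin metadata put user:someuser < user.json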




On Sun, Mar 15, 2015 at 11:53 PM, Sreenath BH bhsreen...@gmail.com wrote:

 Thanks.

 Is this possible outside of multi-zone setup. (With only one Zone)?

 For example, I want to have pools with different replication
 factors(or erasure codings) and map users to these pools.

 -Sreenath


 On 3/13/15, Craig Lewis cle...@centraldesktop.com wrote:
  Yes, RadosGW has the concept of Placement Targets and Placement Pools.
 You
  can create a target, and point it a set of RADOS pools.  Those pools can
 be
  configured to use different storage strategies by creating different
  crushmap rules, and assigning those rules to the pool.
 
  RGW users can be assigned a default placement target.  When they create a
  bucket, they can either specify the target, or use their default one.
 All
  objects in a bucket are stored according to the bucket's placement
 target.
 
 
  I haven't seen a good guide for making use of these features.  The best
  guide I know of is the Federation guide (
  http://ceph.com/docs/giant/radosgw/federated-config/), but it only
 briefly
  mentions placement targets.
 
 
 
  On Thu, Mar 12, 2015 at 11:48 PM, Sreenath BH bhsreen...@gmail.com
 wrote:
 
  Hi all,
 
  Can one Radow gateway support more than one pool for storing objects?
 
  And as a follow-up question, is there a way to map different users to
  separate rgw pools so that their obejcts get stored in different
  pools?
 
  thanks,
  Sreenath
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd laggy algorithm

2015-03-16 Thread Gregory Farnum
On Wed, Mar 11, 2015 at 8:40 AM, Artem Savinov asavi...@asdco.ru wrote:
 hello.
 By default ceph marks an osd node down after receiving 3
 reports about the failed node. Reports are sent every "osd heartbeat grace"
 seconds, but with the settings mon_osd_adjust_heartbeat_grace = true and
 mon_osd_adjust_down_out_interval = true the timeout for marking nodes down
 may vary. Please tell me: what algorithm changes the timeout for
 nodes going into the down/out status, and which parameters are
 affected?
 thanks.

The monitors keep track of which detected failures are incorrect
(based on reports from the marked-down/out OSDs) and build up an
expectation about how often the failures are correct based on an
exponential backoff of the data points. You can look at the code in
OSDMonitor.cc if you're interested, but basically they apply that
expectation to modify the down interval and the down-out interval to a
value large enough that they believe the OSD is really down (assuming
these config options are set). It's not terribly interesting. :)
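
For reference, the knobs involved can be inspected on a running monitor (the
mon id is a placeholder):

  ceph daemon mon.a config show | grep -E 'osd_heartbeat_grace|mon_osd_adjust|mon_osd_down_out_interval'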
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-16 Thread Gregory Farnum
On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk n...@fisk.me.uk wrote:

 I’m not sure if it’s something I’m doing wrong or just experiencing an 
 oddity, but when my cache tier flushes dirty blocks out to the base tier, the 
 writes seem to hit the OSD’s straight away instead of coalescing in the 
 journals, is this correct?

 For example if I create a RBD on a standard 3 way replica pool and run fio 
 via librbd 128k writes, I see the journals take all the io’s until I hit my 
 filestore_min_sync_interval and then I see it start writing to the underlying 
 disks.

 Doing the same on a full cache tier (to force flushing)  I immediately see 
 the base disks at a very high utilisation. The journals also have some write 
 IO at the same time. The only other odd thing I can see via iostat is that 
 most of the time whilst I’m running Fio, is that I can see the underlying 
 disks doing very small write IO’s of around 16kb with an occasional big burst 
 of activity.

 I know erasure coding+cache tier is slower than just plain replicated pools, 
 but even with various high queue depths I’m struggling to get much above 
 100-150 iops compared to a 3 way replica pool which can easily achieve 
 1000-1500. The base tier is comprised of 40 disks. It seems quite a marked 
 difference and I’m wondering if this strange journal behaviour is the cause.

 Does anyone have any ideas?

If you're running a full cache pool, then on every operation touching
an object which isn't in the cache pool it will try and evict an
object. That's probably what you're seeing.

Cache pools in general are only a wise idea if you have a very skewed
distribution of data "hotness" and the entire hot zone can fit in
cache at once.
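
One common mitigation is to make sure the cache tier starts flushing and
evicting before it is completely full, e.g. (pool name and thresholds are
illustrative):

  ceph osd pool set hot-pool cache_target_dirty_ratio 0.4
  ceph osd pool set hot-pool cache_target_full_ratio 0.8
  ceph osd pool set hot-pool target_max_bytes 1099511627776   # 1 TiB; size this to the SSD tier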
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs stuck unclean active+remapped after an osd marked out

2015-03-16 Thread Gregory Farnum
On Wed, Mar 11, 2015 at 3:49 PM, Francois Lafont flafdiv...@free.fr wrote:
 Hi,

 I was always in the same situation: I couldn't remove an OSD without
 have some PGs definitely stuck to the active+remapped state.

 But I remembered I read on IRC that, before to mark out an OSD, it
 could be sometimes a good idea to reweight it to 0. So, instead of
 doing [1]:

 ceph osd out 3

 I have tried [2]:

 ceph osd crush reweight osd.3 0 # waiting for the rebalancing...
 ceph osd out 3

 and it worked. Then I could remove my osd with the online documentation:
 http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual

 Now, the osd is removed and my cluster is HEALTH_OK. \o/

 Now, my question is: why my cluster was definitely stuck to active+remapped
 with [1] but was not with [2]? Personally, I have absolutely no explanation.
 If you have an explanation, I'd love to know it.

If I remember/guess correctly, if you mark an OSD out it won't
necessarily change the weight of the bucket above it (ie, the host),
whereas if you change the weight of the OSD then the host bucket's
weight changes. That makes for different mappings, and since you only
have a couple of OSDs per host (normally: hurray!) and not many hosts
(normally: sadness) then marking one OSD out makes things harder for
the CRUSH algorithm.
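
A quick way to see the difference is to compare the two weight columns before
and after each command (sketch):

  ceph osd tree   # "crush reweight" changes the WEIGHT column (and the host bucket total),
                  # while "out"/"reweight" only changes the REWEIGHT column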
-Greg


 Should the reweight command be present in the online documentation?
 http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
 If yes, I can make a pull request on the doc with pleasure. ;)

 Regards.

 --
 François Lafont
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] client-ceph [can not connect from client][connect protocol feature mismatch]

2015-03-16 Thread Sonal Dubey
Thanks a lot Stephane and Kamil,

Your reply was really helpful. I needed a different version of the ceph client
on my client machine. Initially my Java application using librados was
throwing a connection timeout. Then I tried querying ceph from the command
line (ceph --id ...), which was giving the error -



2015-03-05 13:37:16.816322 7f5191deb700 -- 10.8.25.112:0/2487 >>
10.138.23.241:6789/0 pipe(0x12489f0 sd=3 pgs=0 cs=0 l=0).connect protocol
feature mismatch, my 1ffa < peer 42041ffa missing 4204


From the hints given in your mail I tried -

wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' | sudo apt-key add -
wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/autobuild.asc' | sudo apt-key add -
echo deb http://ceph.com/packages/ceph-extras/debian $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph-extras.list
echo deb http://ceph.com/debian-firefly/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list
sudo apt-get install ceph-common

to verify:
ceph --id brts --keyring=/etc/ceph/ceph.client.brts.keyring health
HEALTH_OK

Thanks for the reply.

-Sonal


On Fri, Mar 6, 2015 at 5:50 AM, Stéphane DUGRAVOT 
stephane.dugra...@univ-lorraine.fr wrote:

 Hi Sonal,
 You can refer to this doc to identify your problem.
 Your error code is 4204, so

- 4000 upgrade to kernel 3.9
-  200 CEPH_FEATURE_CRUSH_TUNABLES2
- 4 CEPH_FEATURE_CRUSH_TUNABLES


-
http://ceph.com/planet/feature-set-mismatch-error-on-ceph-kernel-client/
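
If upgrading the client kernel is not an option, the usual cluster-side
workaround is to relax the CRUSH tunables instead (note this can trigger data
movement; the profile name is an example):

  ceph osd crush show-tunables
  ceph osd crush tunables legacy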

 Stephane.

 --

 Hi,

 I am newbie for ceph, and ceph-user group. Recently I have been working on
 a ceph client. It worked on all the environments while when i tested on the
 production, it is not able to connect to ceph.

 Following are the operating system details and error. If someone has seen
 this problem before, any help is really appreciated.

 OS -

 lsb_release -a
 No LSB modules are available.
 Distributor ID: Ubuntu
 Description: Ubuntu 12.04.2 LTS
 Release: 12.04
 Codename: precise

 2015-03-05 13:37:16.816322 7f5191deb700 -- 10.8.25.112:0/2487 
 10.138.23.241:6789/0 pipe(0x12489f0 sd=3 pgs=0 cs=0 l=0).connect protocol
 feature mismatch, my 1ffa  peer 42041ffa missing 4204
 2015-03-05 13:37:17.635776 7f5191deb700 -- 10.8.25.112:0/2487 
 10.138.23.241:6789/0 pipe(0x12489f0 sd=3 pgs=0 cs=0 l=0).connect protocol
 feature mismatch, my 1ffa  peer 42041ffa missing 4204

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-16 Thread Christian Balzer
On Mon, 16 Mar 2015 16:09:12 -0700 Gregory Farnum wrote:

 Nothing here particularly surprises me. I don't remember all the
 details of the filestore's rate limiting off the top of my head, but
 it goes to great lengths to try and avoid letting the journal get too
 far ahead of the backing store. Disabling the filestore flusher and
 increasing the sync intervals without also increasing the
 filestore_wbthrottle_* limits is not going to work well for you.
 -Greg
 
While very true and what I recalled (backing store being kicked off early)
from earlier mails, I think having every last configuration parameter
documented in a way that doesn't reduce people to guesswork would be very
helpful.

For example filestore_wbthrottle_xfs_inodes_start_flusher which defaults
to 500. 
Assuming that this means to start flushing once 500 inodes have
accumulated, how would Ceph even know how many inodes are needed for the
data present?

Lastly, these parameters come in xfs and btrfs incarnations, but no
ext4. 
Do the xfs parameters also apply to ext4?
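
For what it's worth, the values actually in effect can be dumped from a
running OSD (the osd id is a placeholder):

  ceph daemon osd.0 config show | grep filestore_wbthrottle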

Christian

 On Mon, Mar 16, 2015 at 3:58 PM, Nick Fisk n...@fisk.me.uk wrote:
 
 
 
 
  -Original Message-
  From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
  Of Gregory Farnum
  Sent: 16 March 2015 17:33
  To: Nick Fisk
  Cc: ceph-users@lists.ceph.com
  Subject: Re: [ceph-users] Cache Tier Flush = immediate base tier
  journal sync?
 
  On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk n...@fisk.me.uk wrote:
  
   I’m not sure if it’s something I’m doing wrong or just experiencing
   an
  oddity, but when my cache tier flushes dirty blocks out to the base
  tier, the writes seem to hit the OSD’s straight away instead of
  coalescing in the journals, is this correct?
  
   For example if I create a RBD on a standard 3 way replica pool and
   run fio
  via librbd 128k writes, I see the journals take all the io’s until I
  hit my filestore_min_sync_interval and then I see it start writing to
  the underlying disks.
  
   Doing the same on a full cache tier (to force flushing)  I
   immediately see the
  base disks at a very high utilisation. The journals also have some
  write IO at the same time. The only other odd thing I can see via
  iostat is that most of the time whilst I’m running Fio, is that I can
  see the underlying disks doing very small write IO’s of around 16kb
  with an occasional big burst of activity.
  
   I know erasure coding+cache tier is slower than just plain
   replicated pools,
  but even with various high queue depths I’m struggling to get much
  above 100-150 iops compared to a 3 way replica pool which can easily
  achieve 1000- 1500. The base tier is comprised of 40 disks. It seems
  quite a marked difference and I’m wondering if this strange journal
  behaviour is the cause.
  
   Does anyone have any ideas?
 
  If you're running a full cache pool, then on every operation touching
  an object which isn't in the cache pool it will try and evict an
  object. That's probably what you're seeing.
 
  Cache pool in general are only a wise idea if you have a very skewed
  distribution of data hotness and the entire hot zone can fit in
  cache at once.
  -Greg
 
  Hi Greg,
 
  It's not the caching behaviour that I confused about, it’s the journal
  behaviour on the base disks during flushing. I've been doing some more
  tests and can do something reproducible which seems strange to me.
 
  First off 10MB of 4kb writes:
  time ceph tell osd.1 bench 10000000 4096
  { "bytes_written": 10000000,
    "blocksize": 4096,
    "bytes_per_sec": 16009426.00}
 
  real    0m0.760s
  user    0m0.063s
  sys     0m0.022s
 
  Now split this into 2x5MB writes:
  time ceph tell osd.1 bench 5000000 4096 && time ceph tell osd.1 bench 5000000 4096
  { "bytes_written": 5000000,
    "blocksize": 4096,
    "bytes_per_sec": 10580846.00}
 
  real    0m0.595s
  user    0m0.065s
  sys     0m0.018s
  { "bytes_written": 5000000,
    "blocksize": 4096,
    "bytes_per_sec": 9944252.00}
 
  real    0m4.412s
  user    0m0.053s
  sys     0m0.071s
 
  2nd bench takes a lot longer even though both should easily fit in the
  5GB journal. Looking at iostat, I think I can see that no writes
  happen to the journal whilst the writes from the 1st bench are being
  flushed. Is this the expected behaviour? I would have thought as long
  as there is space available in the journal it shouldn't block on new
  writes. Also I see in iostat writes to the underlying disk happening
  at a QD of 1 and 16kb IO's for a number of seconds, with a large blip
  or activity just before the flush finishes. Is this the correct
  behaviour? I would have thought if this tell osd bench is doing
  sequential IO then the journal should be able to flush 5-10mb of data
  in a fraction a second.
 
  Ceph.conf
  [osd]
  filestore max sync interval = 30
  filestore min sync interval = 20
  filestore flusher = false
  osd_journal_size = 5120
  osd_crush_location_hook = /usr/local/bin/crush-location
  

Re: [ceph-users] RadosGW Direct Upload Limitation

2015-03-16 Thread Yehuda Sadeh-Weinraub


- Original Message -
 From: Craig Lewis cle...@centraldesktop.com
 To: Gregory Farnum g...@gregs42.com
 Cc: ceph-users@lists.ceph.com
 Sent: Monday, March 16, 2015 11:48:15 AM
 Subject: Re: [ceph-users] RadosGW Direct Upload Limitation
 
 
 
 
 Maybe, but I'm not sure if Yehuda would want to take it upstream or
 not. This limit is present because it's part of the S3 spec. For
 larger objects you should use multi-part upload, which can get much
 bigger.
 -Greg
 
 
 Note that the multi-part upload has a lower limit of 4MiB per part, and the
 direct upload has an upper limit of 5GiB.

The limit is 10MB, but it does not apply to the last part, so basically you 
could upload any object size with it. I would still recommend using the plain 
upload for smaller object sizes, it is faster, and the resulting object might 
be more efficient (for really small sizes).

Yehuda

 
 So you have to use both methods - direct upload for small files, and
 multi-part upload for big files.
 
 Your best bet is to use the Amazon S3 libraries. They have functions that
 take care of it for you.
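
For example, most S3 tools switch to multipart automatically above a
configurable chunk size, e.g. (the flag value is illustrative and needs a
reasonably recent s3cmd):

  s3cmd --multipart-chunk-size-mb=15 put bigfile.bin s3://mybucket/bigfile.bin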
 
 
 I'd like to see this mentioned in the Ceph documentation someplace. When I
 first encountered the issue, I couldn't find a limit in the RadosGW
 documentation anywhere. I only found the 5GiB limit in the Amazon API
 documentation, which lead me to test on RadosGW. Now that I know it was done
 to preserve Amazon compatibility, I don't want to override the value
 anymore.
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS unexplained writes

2015-03-16 Thread Gregory Farnum
The information you're giving sounds a little contradictory, but my
guess is that you're seeing the impacts of object promotion and
flushing. You can sample the operations the OSDs are doing at any
given time by running ops_in_progress (or similar, I forget exact
phrasing) command on the OSD admin socket. I'm not sure if rados df
is going to report cache movement activity or not.
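
For reference, the admin socket command is probably dump_ops_in_flight, e.g.
(the osd id is a placeholder):

  ceph daemon osd.0 dump_ops_in_flight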

That though would mostly be written to the SSDs, not the hard drives —
although the hard drives could still get metadata updates written when
objects are flushed. What data exactly are you seeing that's leading
you to believe writes are happening against these drives? What is the
exact CephFS and cache pool configuration?
-Greg

On Mon, Mar 16, 2015 at 2:36 PM, Erik Logtenberg e...@logtenberg.eu wrote:
 Hi,

 I forgot to mention: while I am seeing these writes in iotop and
 /proc/diskstats for the hdd's, I am -not- seeing any writes in rados
 df for the pool residing on these disks. There is only one pool active
 on the hdd's and according to rados df it is getting zero writes when
 I'm just reading big files from cephfs.

 So apparently the osd's are doing some non-trivial amount of writing on
 their own behalf. What could it be?

 Thanks,

 Erik.


 On 03/16/2015 10:26 PM, Erik Logtenberg wrote:
 Hi,

 I am getting relatively bad performance from cephfs. I use a replicated
 cache pool on ssd in front of an erasure coded pool on rotating media.

 When reading big files (streaming video), I see a lot of disk i/o,
 especially writes. I have no clue what could cause these writes. The
 writes are going to the hdd's and they stop when I stop reading.

 I mounted everything with noatime and nodiratime so it shouldn't be
 that. On a related note, the Cephfs metadata is stored on ssd too, so
 metadata-related changes shouldn't hit the hdd's anyway I think.

 Any thoughts? How can I get more information about what ceph is doing?
 Using iotop I only see that the osd processes are busy but it doesn't
 give many hints as to what they are doing.

 Thanks,

 Erik.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-16 Thread Nick Fisk




 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Gregory Farnum
 Sent: 16 March 2015 17:33
 To: Nick Fisk
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Cache Tier Flush = immediate base tier journal
 sync?
 
 On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk n...@fisk.me.uk wrote:
 
  I’m not sure if it’s something I’m doing wrong or just experiencing an
 oddity, but when my cache tier flushes dirty blocks out to the base tier, the
 writes seem to hit the OSD’s straight away instead of coalescing in the
 journals, is this correct?
 
  For example if I create a RBD on a standard 3 way replica pool and run fio
 via librbd 128k writes, I see the journals take all the io’s until I hit my
 filestore_min_sync_interval and then I see it start writing to the underlying
 disks.
 
  Doing the same on a full cache tier (to force flushing), I immediately see the
 base disks at a very high utilisation. The journals also have some write IO at
 the same time. The only other odd thing I can see via iostat is that most of
 the time whilst I'm running fio, the underlying disks are doing very small
 write IO's of around 16kb with an occasional big burst of activity.
 
  I know erasure coding+cache tier is slower than just plain replicated pools,
 but even with various high queue depths I’m struggling to get much above
 100-150 iops compared to a 3 way replica pool which can easily achieve 1000-
 1500. The base tier is comprised of 40 disks. It seems quite a marked
 difference and I’m wondering if this strange journal behaviour is the cause.
 
  Does anyone have any ideas?
 
 If you're running a full cache pool, then on every operation touching an
 object which isn't in the cache pool it will try and evict an object. That's
 probably what you're seeing.
 
 Cache pool in general are only a wise idea if you have a very skewed
 distribution of data hotness and the entire hot zone can fit in cache at
 once.
 -Greg

Hi Greg,

It's not the caching behaviour that I'm confused about, it's the journal 
behaviour on the base disks during flushing. I've been doing some more tests 
and have found something reproducible which seems strange to me. 

First off 10MB of 4kb writes:
time ceph tell osd.1 bench 10000000 4096
{ bytes_written: 10000000,
  blocksize: 4096,
  bytes_per_sec: 16009426.00}

real    0m0.760s
user    0m0.063s
sys     0m0.022s

Now split this into 2x5MB writes:
time ceph tell osd.1 bench 5000000 4096 && time ceph tell osd.1 bench 5000000 4096
{ bytes_written: 5000000,
  blocksize: 4096,
  bytes_per_sec: 10580846.00}

real    0m0.595s
user    0m0.065s
sys     0m0.018s
{ bytes_written: 5000000,
  blocksize: 4096,
  bytes_per_sec: 9944252.00}

real    0m4.412s
user    0m0.053s
sys     0m0.071s

2nd bench takes a lot longer even though both should easily fit in the 5GB 
journal. Looking at iostat, I think I can see that no writes happen to the 
journal whilst the writes from the 1st bench are being flushed. Is this the 
expected behaviour? I would have thought as long as there is space available in 
the journal it shouldn't block on new writes. Also I see in iostat writes to 
the underlying disk happening at a QD of 1 and 16kb IO's for a number of 
seconds, with a large blip of activity just before the flush finishes. Is this 
the correct behaviour? I would have thought that if this tell osd bench is 
doing sequential IO then the journal should be able to flush 5-10MB of data 
in a fraction of a second.

Ceph.conf
[osd]
filestore max sync interval = 30
filestore min sync interval = 20
filestore flusher = false
osd_journal_size = 5120
osd_crush_location_hook = /usr/local/bin/crush-location
osd_op_threads = 5
filestore_op_threads = 4


iostat during period where writes seem to be blocked (journal=sda disk=sdd)

Device:         rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00    0.00   0.00   2.00    0.00    4.00     4.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00    0.00   0.00  76.00    0.00  760.00    20.00     0.99   13.11    0.00   13.11  13.05  99.20

iostat during what I believe to be the actual flush

Device:         rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00    0.00   0.00   2.00    0.00    4.00     4.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

[ceph-users] CephFS unexplained writes

2015-03-16 Thread Erik Logtenberg
Hi,

I am getting relatively bad performance from cephfs. I use a replicated
cache pool on ssd in front of an erasure coded pool on rotating media.

When reading big files (streaming video), I see a lot of disk i/o,
especially writes. I have no clue what could cause these writes. The
writes are going to the hdd's and they stop when I stop reading.

I mounted everything with noatime and nodiratime so it shouldn't be
that. On a related note, the Cephfs metadata is stored on ssd too, so
metadata-related changes shouldn't hit the hdd's anyway I think.

Any thoughts? How can I get more information about what ceph is doing?
Using iotop I only see that the osd processes are busy but it doesn't
give many hints as to what they are doing.

Thanks,

Erik.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS unexplained writes

2015-03-16 Thread Erik Logtenberg
Hi,

I forgot to mention: while I am seeing these writes in iotop and
/proc/diskstats for the hdd's, I am -not- seeing any writes in rados
df for the pool residing on these disks. There is only one pool active
on the hdd's and according to rados df it is getting zero writes when
I'm just reading big files from cephfs.

So apparently the osd's are doing some non-trivial amount of writing on
their own behalf. What could it be?

Thanks,

Erik.


On 03/16/2015 10:26 PM, Erik Logtenberg wrote:
 Hi,
 
 I am getting relatively bad performance from cephfs. I use a replicated
 cache pool on ssd in front of an erasure coded pool on rotating media.
 
 When reading big files (streaming video), I see a lot of disk i/o,
 especially writes. I have no clue what could cause these writes. The
 writes are going to the hdd's and they stop when I stop reading.
 
 I mounted everything with noatime and nodiratime so it shouldn't be
 that. On a related note, the Cephfs metadata is stored on ssd too, so
 metadata-related changes shouldn't hit the hdd's anyway I think.
 
 Any thoughts? How can I get more information about what ceph is doing?
 Using iotop I only see that the osd processes are busy but it doesn't
 give many hints as to what they are doing.
 
 Thanks,
 
 Erik.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-16 Thread Gregory Farnum
Nothing here particularly surprises me. I don't remember all the
details of the filestore's rate limiting off the top of my head, but
it goes to great lengths to try and avoid letting the journal get too
far ahead of the backing store. Disabling the filestore flusher and
increasing the sync intervals without also increasing the
filestore_wbthrottle_* limits is not going to work well for you.
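(For reference, the knobs in question look something like the lines below; the
option names are the filestore write-back throttle settings, and the values are
only illustrative, not recommendations:)

[osd]
filestore_wbthrottle_xfs_bytes_start_flusher = 41943040
filestore_wbthrottle_xfs_bytes_hard_limit = 419430400
filestore_wbthrottle_xfs_ios_start_flusher = 500
filestore_wbthrottle_xfs_ios_hard_limit = 5000
filestore_wbthrottle_xfs_inodes_start_flusher = 500
filestore_wbthrottle_xfs_inodes_hard_limit = 5000

(There is a matching filestore_wbthrottle_btrfs_* set for btrfs-backed OSDs;
the values actually in effect can be checked with the config show command on
the OSD admin socket.)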
-Greg

On Mon, Mar 16, 2015 at 3:58 PM, Nick Fisk n...@fisk.me.uk wrote:




 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Gregory Farnum
 Sent: 16 March 2015 17:33
 To: Nick Fisk
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Cache Tier Flush = immediate base tier journal
 sync?

 On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk n...@fisk.me.uk wrote:
 
  I’m not sure if it’s something I’m doing wrong or just experiencing an
 oddity, but when my cache tier flushes dirty blocks out to the base tier, the
 writes seem to hit the OSD’s straight away instead of coalescing in the
 journals, is this correct?
 
  For example if I create a RBD on a standard 3 way replica pool and run fio
 via librbd 128k writes, I see the journals take all the io’s until I hit my
 filestore_min_sync_interval and then I see it start writing to the underlying
 disks.
 
  Doing the same on a full cache tier (to force flushing), I immediately see the
 base disks at a very high utilisation. The journals also have some write IO at
 the same time. The only other odd thing I can see via iostat is that most of
 the time whilst I'm running fio, the underlying disks are doing very small
 write IO's of around 16kb with an occasional big burst of activity.
 
  I know erasure coding+cache tier is slower than just plain replicated 
  pools,
 but even with various high queue depths I’m struggling to get much above
 100-150 iops compared to a 3 way replica pool which can easily achieve 1000-
 1500. The base tier is comprised of 40 disks. It seems quite a marked
 difference and I’m wondering if this strange journal behaviour is the cause.
 
  Does anyone have any ideas?

 If you're running a full cache pool, then on every operation touching an
 object which isn't in the cache pool it will try and evict an object. That's
 probably what you're seeing.

 Cache pool in general are only a wise idea if you have a very skewed
 distribution of data hotness and the entire hot zone can fit in cache at
 once.
 -Greg

 Hi Greg,

 It's not the caching behaviour that I'm confused about, it's the journal 
 behaviour on the base disks during flushing. I've been doing some more tests 
 and have found something reproducible which seems strange to me.

 First off 10MB of 4kb writes:
 time ceph tell osd.1 bench 10000000 4096
 { bytes_written: 10000000,
   blocksize: 4096,
   bytes_per_sec: 16009426.00}

 real    0m0.760s
 user    0m0.063s
 sys     0m0.022s

 Now split this into 2x5MB writes:
 time ceph tell osd.1 bench 5000000 4096 && time ceph tell osd.1 bench 5000000 4096
 { bytes_written: 5000000,
   blocksize: 4096,
   bytes_per_sec: 10580846.00}

 real    0m0.595s
 user    0m0.065s
 sys     0m0.018s
 { bytes_written: 5000000,
   blocksize: 4096,
   bytes_per_sec: 9944252.00}

 real    0m4.412s
 user    0m0.053s
 sys     0m0.071s

 2nd bench takes a lot longer even though both should easily fit in the 5GB 
 journal. Looking at iostat, I think I can see that no writes happen to the 
 journal whilst the writes from the 1st bench are being flushed. Is this the 
 expected behaviour? I would have thought as long as there is space available 
 in the journal it shouldn't block on new writes. Also I see in iostat writes 
 to the underlying disk happening at a QD of 1 and 16kb IO's for a number of 
 seconds, with a large blip of activity just before the flush finishes. Is 
 this the correct behaviour? I would have thought that if this tell osd bench 
 is doing sequential IO then the journal should be able to flush 5-10MB of data 
 in a fraction of a second.

 Ceph.conf
 [osd]
 filestore max sync interval = 30
 filestore min sync interval = 20
 filestore flusher = false
 osd_journal_size = 5120
 osd_crush_location_hook = /usr/local/bin/crush-location
 osd_op_threads = 5
 filestore_op_threads = 4


 iostat during period where writes seem to be blocked (journal=sda disk=sdd)

 Device:         rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
 sda               0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
 sdb               0.00    0.00   0.00   2.00    0.00    4.00     4.00     0.00    0.00    0.00    0.00   0.00   0.00
 sdc               0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
 sdd               0.00    0.00   0.00  76.00    0.00  760.00    20.00     0.99   13.11    0.00   13.11  13.05  99.20

 iostat during 

Re: [ceph-users] Mapping users to different rgw pools

2015-03-16 Thread Sreenath BH
Thanks.

Is this possible outside of a multi-zone setup (with only one zone)?

For example, I want to have pools with different replication
factors (or erasure coding profiles) and map users to these pools.
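
As a rough sketch of the RADOS side (pool names, profile name and PG counts
here are made up):

ceph osd pool create rgw-buckets-r2 128 128 replicated
ceph osd pool set rgw-buckets-r2 size 2
ceph osd erasure-code-profile set ec42 k=4 m=2
ceph osd pool create rgw-buckets-ec 128 128 erasure ec42

Mapping users onto pools like these would then go through the rgw placement
targets Craig describes below.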

-Sreenath


On 3/13/15, Craig Lewis cle...@centraldesktop.com wrote:
 Yes, RadosGW has the concept of Placement Targets and Placement Pools.  You
 can create a target, and point it at a set of RADOS pools.  Those pools can be
 configured to use different storage strategies by creating different
 crushmap rules, and assigning those rules to the pool.

 RGW users can be assigned a default placement target.  When they create a
 bucket, they can either specify the target, or use their default one.  All
 objects in a bucket are stored according to the bucket's placement target.
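
 A minimal sketch of what that looks like (target and pool names made up; the
 exact fields vary a little between versions).  In the region JSON, as edited
 via radosgw-admin region get / region set:

   "placement_targets": [
     { "name": "default-placement", "tags": [] },
     { "name": "fast-placement", "tags": [] }
   ],
   "default_placement": "default-placement",

 and in the zone JSON (radosgw-admin zone get / zone set):

   "placement_pools": [
     { "key": "default-placement",
       "val": { "index_pool": ".rgw.buckets.index",
                "data_pool": ".rgw.buckets" } },
     { "key": "fast-placement",
       "val": { "index_pool": ".rgw.buckets.index",
                "data_pool": ".rgw.buckets.fast" } }
   ],

 A user's default target can then be changed by editing default_placement in
 the output of radosgw-admin metadata get user:<uid> and writing it back with
 radosgw-admin metadata put user:<uid>.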


 I haven't seen a good guide for making use of these features.  The best
 guide I know of is the Federation guide (
 http://ceph.com/docs/giant/radosgw/federated-config/), but it only briefly
 mentions placement targets.



 On Thu, Mar 12, 2015 at 11:48 PM, Sreenath BH bhsreen...@gmail.com wrote:

 Hi all,

 Can one Radow gateway support more than one pool for storing objects?

 And as a follow-up question, is there a way to map different users to
 separate rgw pools so that their obejcts get stored in different
 pools?

 thanks,
 Sreenath


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW Direct Upload Limitation

2015-03-16 Thread Craig Lewis


 Maybe, but I'm not sure if Yehuda would want to take it upstream or
 not. This limit is present because it's part of the S3 spec. For
 larger objects you should use multi-part upload, which can get much
 bigger.
 -Greg


Note that the multi-part upload has a lower limit of 4MiB per part, and the
direct upload has an upper limit of 5GiB.

So you have to use both methods - direct upload for small files, and
multi-part upload for big files.

Your best bet is to use the Amazon S3 libraries.  They have functions that
take care of it for you.


I'd like to see this mentioned in the Ceph documentation someplace.  When I
first encountered the issue, I couldn't find a limit in the RadosGW
documentation anywhere.  I only found the 5GiB limit in the Amazon API
documentation, which led me to test on RadosGW.  Now that I know it was
done to preserve Amazon compatibility, I don't want to override the value
anymore.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs stuck unclean active+remapped after an osd marked out

2015-03-16 Thread Craig Lewis


 If I remember/guess correctly, if you mark an OSD out it won't
 necessarily change the weight of the bucket above it (ie, the host),
 whereas if you change the weight of the OSD then the host bucket's
 weight changes.
 -Greg



That sounds right.  Marking an OSD out is a ceph osd reweight, not a ceph
osd crush reweight.

Experimentally confirmed.  I have an OSD out right now, and the host's
crush weight is the same as the other hosts' crush weight.
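
For example (OSD id and weights made up):

ceph osd reweight 12 0             # what marking osd.12 out effectively does
ceph osd crush reweight osd.12 0   # changes the crush map, so the host bucket weight drops too
ceph osd tree                      # compare the weight and reweight columns after each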
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] query about mapping of Swift/S3 APIs to Ceph cluster APIs

2015-03-16 Thread Craig Lewis
On Sat, Mar 14, 2015 at 3:04 AM, pragya jain prag_2...@yahoo.co.in wrote:

 Hello all!

 I have been working on the Ceph object storage architecture for the last few
 months.

 I am unable to find a document which describes how the Ceph object storage
 APIs (Swift/S3 APIs) are mapped to the Ceph storage cluster APIs (librados
 APIs) to store the data in the Ceph storage cluster.

 As the documents say: Radosgw, a gateway interface for Ceph object storage
 users, accepts user requests to store or retrieve data in the form of Swift
 or S3 API calls and converts them into RADOS requests.

 Please help me in knowing:
 1. how does Radosgw convert a user request into a RADOS request?
 2. how are HTTP requests mapped to RADOS requests?


The RadosGW daemon takes care of that.  It's an application that sits on
top of RADOS.

For HTTP, there are a couple of ways.  The older way has Apache accepting the
HTTP request, then forwarding it to the RadosGW daemon using FastCGI.
Newer versions support RadosGW handling the HTTP directly.
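
For the civetweb route, the relevant piece of ceph.conf looks something like
this (section name and port are just an example):

[client.radosgw.gateway]
rgw frontends = "civetweb port=7480"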

For the full details, you'll want to check out the source code at
https://github.com/ceph/ceph

If you're not interested enough to read the source code (I wasn't :-) ),
setup a test cluster.  Create a user, bucket, and object, and look at the
contents of the rados pools.
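
For example (user id made up, pool names per the Firefly-era defaults):

radosgw-admin user create --uid=testuser --display-name="Test User"
# upload an object to a bucket with any S3/Swift client, then:
rados lspools
rados -p .rgw ls             # bucket entrypoint/metadata objects
rados -p .rgw.buckets ls     # the object data itself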
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs stuck unclean active+remapped after an osd marked out

2015-03-16 Thread Francois Lafont
Hi,

Gregory Farnum wrote:

 If I remember/guess correctly, if you mark an OSD out it won't
 necessarily change the weight of the bucket above it (ie, the host),
 whereas if you change the weight of the OSD then the host bucket's
 weight changes.

I can just say that, indeed, I have noticed exactly what you describe
in the output of ceph osd tree.

 That makes for different mappings, and since you only
 have a couple of OSDs per host (normally: hurray!)

er, er... no, I have 10 OSDs in the first OSD node and 11 OSDs in the
second OSD node (see my first message).

 and not many hosts (normally: sadness)

Yes, I have only 2 OSD nodes (and 3 monitors).

 then marking one OSD out makes things harder for the CRUSH algorithm.

Ah, OK. So my cluster is too small for Ceph. ;)
Thanks for your answer Greg, I will follow the pull-request with attention.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW Direct Upload Limitation

2015-03-16 Thread Gregory Farnum
On Mon, Mar 16, 2015 at 11:14 AM, Georgios Dimitrakakis
gior...@acmac.uoc.gr wrote:
 Hi all!

 I have recently updated to CEPH version 0.80.9 (latest Firefly release)
 which presumably
 supports direct upload.

 I've tried to upload a file using this functionality and it seems to be
 working for files up to 5GB. For files above 5GB there is an error. I believe
 that this is because of a hardcoded limit:

 #define RGW_MAX_PUT_SIZE(5ULL*1024*1024*1024)


 Is there a way to increase that limit other than compiling CEPH from source?

No.


 Could we somehow put it as a configuration parameter?

Maybe, but I'm not sure if Yehuda would want to take it upstream or
not. This limit is present because it's part of the S3 spec. For
larger objects you should use multi-part upload, which can get much
bigger.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shadow files

2015-03-16 Thread Craig Lewis
Out of curiosity, what's the frequency of the peaks and troughs?

RadosGW has configs on how long it should wait after deleting before
garbage collecting, how long between GC runs, and how many objects it can
GC in per run.

The defaults are 2 hours, 1 hour, and 32 respectively.  Search
http://docs.ceph.com/docs/master/radosgw/config-ref/ for rgw gc.

If your peaks and troughs have a frequency less than 1 hour, then GC is
going to delay and alias the disk usage w.r.t. the object count.

If you have millions of objects, you probably need to tweak those values.
If RGW is only GCing 32 objects an hour, it's never going to catch up.
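
For reference, the options in question, shown with what I believe are the
defaults (section name is just an example):

[client.radosgw.gateway]
rgw gc obj min wait = 7200        # 2 hours before a deleted object is eligible for GC
rgw gc processor period = 3600    # a GC cycle starts every hour
rgw gc processor max time = 3600  # maximum runtime of a single GC cycle
rgw gc max objs = 32              # number of GC shards (see Greg's caveat later in the thread)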


Now that I think about it, I bet I'm having issues here too.  I delete more
than (32*24) objects per day...



On Sun, Mar 15, 2015 at 4:41 PM, Ben b@benjackson.email wrote:

 It is either a problem with CEPH, Civetweb or something else in our
 configuration.
 But deletes in user buckets are still leaving a high number of old shadow
 files. Since we have millions and millions of objects, it is hard to
 reconcile what should and shouldn't exist.

 Looking at our cluster usage, there are no troughs, it is just a rising
 peak.
 But when looking at users data usage, we can see peaks and troughs as you
 would expect as data is deleted and added.

 Our ceph version 0.80.9

 Please ideas?

 On 2015-03-13 02:25, Yehuda Sadeh-Weinraub wrote:

 - Original Message -

 From: Ben b@benjackson.email
 To: ceph-us...@ceph.com
 Sent: Wednesday, March 11, 2015 8:46:25 PM
 Subject: Re: [ceph-users] Shadow files

 Anyone got any info on this?

 Is it safe to delete shadow files?


 It depends. Shadow files are badly named objects that represent part
 of the objects data. They are only safe to remove if you know that the
 corresponding objects no longer exist.
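
 For what it's worth, the pending GC queue can be inspected with radosgw-admin
 before removing anything by hand (assuming a Firefly-era build):

 radosgw-admin gc list --include-all   # objects currently queued for garbage collection
 radosgw-admin gc process              # run a GC pass manually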

 Yehuda


 On 2015-03-11 10:03, Ben wrote:
  We have a large number of shadow files in our cluster that aren't
  being deleted automatically as data is deleted.
 
  Is it safe to delete these files?
  Is there something we need to be aware of when deleting them?
  Is there a script that we can run that will delete these safely?
 
  Is there something wrong with our cluster that it isn't deleting these
  files when it should be?
 
  We are using civetweb with radosgw, with tengine ssl proxy infront of
  it
 
  Any advice please
  Thanks

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

2015-03-16 Thread Alexandre DERUMIER
VMs are running on the same nodes as the OSDs

Are you sure that you didn't hit some kind of out-of-memory condition?
PG rebalancing can be memory hungry (depending on how many OSDs you have).

Do you see the oom-killer in your host logs?
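
For example (log locations depend on the distribution):

dmesg | egrep -i "oom|out of memory"
egrep -i "oom-killer|killed process" /var/log/syslog /var/log/messages 2>/dev/null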


- Original Message -
From: Florent Bautista flor...@coppint.com
To: aderumier aderum...@odiso.com
Cc: ceph-users ceph-users@lists.ceph.com
Sent: Monday, 16 March 2015 12:35:11
Subject: Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

On 03/16/2015 12:23 PM, Alexandre DERUMIER wrote: 
 We use Proxmox, so I think it uses librbd ? 
 As I'm the one who made the Proxmox RBD plugin, I can confirm that yes, it's 
 librbd ;) 
 
 Is the Ceph cluster on dedicated nodes, or are the VMs running on the same 
 nodes as the OSD daemons ? 
 

VMs are running on the same nodes as the OSDs 

 And to be precise: not all VMs on that pool crashed, only some of them 
 (a large majority), and on the same host, some crashed and others did not. 
 Did the VM crash, as in no more qemu process ? 
 Or is it the guest OS that crashed ? (Do you use virtio, virtio-scsi or 
 IDE for your guests ?) 
 
 

I don't really know what crashed; I think the qemu process, but I'm not sure. 
We use virtio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shadow files

2015-03-16 Thread Gregory Farnum
On Mon, Mar 16, 2015 at 12:12 PM, Craig Lewis cle...@centraldesktop.com wrote:
 Out of curiosity, what's the frequency of the peaks and troughs?

 RadosGW has configs on how long it should wait after deleting before garbage
 collecting, how long between GC runs, and how many objects it can GC in per
 run.

 The defaults are 2 hours, 1 hour, and 32 respectively.  Search
 http://docs.ceph.com/docs/master/radosgw/config-ref/ for rgw gc.

 If your peaks and troughs have a frequency less than 1 hour, then GC is
 going to delay and alias the disk usage w.r.t. the object count.

 If you have millions of objects, you probably need to tweak those values.
 If RGW is only GCing 32 objects an hour, it's never going to catch up.


 Now that I think about it, I bet I'm having issues here too.  I delete more
 than (32*24) objects per day...

Uh, that's not quite what rgw_gc_max_objs means. That param configures
how the garbage collection data objects and internal classes are sharded,
and each grouping will only delete one object at a time. So it
controls the parallelism, but not the total number of objects!

Also, Yehuda says that changing this can be a bit dangerous because it
currently needs to be consistent across any program doing or
generating GC work.
-Greg




 On Sun, Mar 15, 2015 at 4:41 PM, Ben b@benjackson.email wrote:

 It is either a problem with CEPH, Civetweb or something else in our
 configuration.
 But deletes in user buckets are still leaving a high number of old shadow
 files. Since we have millions and millions of objects, it is hard to
 reconcile what should and shouldn't exist.

 Looking at our cluster usage, there are no troughs, it is just a rising
 peak.
 But when looking at users data usage, we can see peaks and troughs as you
 would expect as data is deleted and added.

 Our ceph version 0.80.9

 Please ideas?

 On 2015-03-13 02:25, Yehuda Sadeh-Weinraub wrote:

 - Original Message -

 From: Ben b@benjackson.email
 To: ceph-us...@ceph.com
 Sent: Wednesday, March 11, 2015 8:46:25 PM
 Subject: Re: [ceph-users] Shadow files

 Anyone got any info on this?

 Is it safe to delete shadow files?


 It depends. Shadow files are badly named objects that represent part
 of the objects data. They are only safe to remove if you know that the
 corresponding objects no longer exist.

 Yehuda


 On 2015-03-11 10:03, Ben wrote:
  We have a large number of shadow files in our cluster that aren't
  being deleted automatically as data is deleted.
 
  Is it safe to delete these files?
  Is there something we need to be aware of when deleting them?
  Is there a script that we can run that will delete these safely?
 
  Is there something wrong with our cluster that it isn't deleting these
  files when it should be?
 
  We are using civetweb with radosgw, with tengine ssl proxy infront of
  it
 
  Any advice please
  Thanks

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fw: query about mapping of Swift/S3 APIs to Ceph cluster APIs

2015-03-16 Thread pragya jain
Please, could somebody answer my queries?

Regards
Pragya Jain
Department of Computer Science
University of Delhi
Delhi, India

  On Saturday, 14 March 2015 3:34 PM, pragya jain prag_2...@yahoo.co.in wrote:

Hello all!

I have been working on the Ceph object storage architecture for the last few
months.
I am unable to find a document which describes how the Ceph object storage
APIs (Swift/S3 APIs) are mapped to the Ceph storage cluster APIs (librados
APIs) to store the data in the Ceph storage cluster.
As the documents say: Radosgw, a gateway interface for Ceph object storage
users, accepts user requests to store or retrieve data in the form of Swift
or S3 API calls and converts them into RADOS requests.

Please help me in knowing:
1. how does Radosgw convert a user request into a RADOS request?
2. how are HTTP requests mapped to RADOS requests?

Thank you

Regards
Pragya Jain
Department of Computer Science
University of Delhi
Delhi, India
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] query about region and zone creation while configuring RADOSGW

2015-03-16 Thread pragya jain
Hello all!

I am working on the Ceph object storage architecture, and I have some queries:

When configuring a federated system, we need to create regions containing one
or more zones; the cluster must have a master region, and each region must
have a master zone.

But in the case of a simple gateway configuration, is there a need to create
at least a region and a zone to store the data?

Please, could somebody reply to my query?

Thank you

Regards
Pragya Jain
Department of Computer Science
University of Delhi
Delhi, India
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com