Re: [ceph-users] jbod + SMART : how to identify failing disks ?

2014-11-18 Thread SCHAER Frederic
Wow. Thanks. Not very operations friendly though… Wouldn't it be OK to just pull the disk we think is the bad one, check the serial number, and if it's not the right one, replug it and let the udev rules do their job and re-insert the disk into the ceph cluster? (provided XFS doesn't freeze for good when we
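
To identify the failing disk without pulling anything, the serial number can be read in place and the drive bay LED blinked; a minimal sketch, assuming a plain JBOD with SES-capable enclosures (device names illustrative):

    # read the drive's serial number via SMART
    smartctl -i /dev/sdb | grep -i serial
    # blink the bay LED so the disk can be found in the chassis (ledmon package)
    ledctl locate=/dev/sdb
    # turn the locate LED off again
    ledctl locate_off=/dev/sdb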

[ceph-users] Giant upgrade - stability issues

2014-11-18 Thread Andrei Mikhailovsky
Hello cephers, I need your help and suggestions on what is going on with my cluster. A few weeks ago I upgraded from Firefly to Giant. I've previously written about having issues with Giant where, in a two-week period, the cluster's IO froze three times after ceph marked two osds down. I have in

Re: [ceph-users] CephFS unresponsive at scale (2M files,

2014-11-18 Thread Thomas Lemarchand
Hi Kevin, every MDS tunable (I think) is listed on this page, with a short description: http://ceph.com/docs/master/cephfs/mds-config-ref/ Can you tell us how your cluster behaves after the mds-cache-size change? What is your MDS RAM consumption, before and after? Thanks! -- Thomas
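
For reference, the tunable discussed here is an ordinary config option; a minimal sketch of setting it, using the value of 1000000 that is tried later in the thread (daemon id illustrative; runtime injection depends on your version):

    # ceph.conf
    [mds]
        # default was 100000 inodes; raising it increases MDS RAM usage
        mds cache size = 1000000

    # or inject into a running MDS without a restart
    ceph tell mds.0 injectargs '--mds-cache-size 1000000'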

Re: [ceph-users] incorrect pool size, wrong ruleset?

2014-11-18 Thread houmles
Does nobody know where the problem could be? On Wed, Nov 12, 2014 at 10:41:36PM +0100, houmles wrote: Hi, I have 2 hosts with 8 2TB drives in each. I want to have 2 replicas between both hosts and then 2 replicas between the osds on each host. That way, even when I lose one host I still have 2
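
The layout being asked for is usually expressed as a pool of size 4 plus a CRUSH rule that first picks both hosts and then two OSDs inside each; a hedged sketch of such a rule (bucket and rule names illustrative, not the poster's actual map):

    rule two_per_host {
        ruleset 1
        type replicated
        min_size 4
        max_size 4
        step take default
        step choose firstn 2 type host
        step chooseleaf firstn 2 type osd
        step emit
    }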

Re: [ceph-users] incorrect pool size, wrong ruleset?

2014-11-18 Thread houmles
What do you mean by osd level? The pool has size 4 and min_size 1. On Tue, Nov 18, 2014 at 10:32:11AM +, Anand Bhat wrote: What are the settings for min_size and size at the OSD level in your Ceph configuration? It looks like size is set to 2, which halves your total storage as two copies of the

Re: [ceph-users] Troubleshooting an erasure coded pool with a cache tier

2014-11-18 Thread Nick Fisk
Has anyone tried applying this fix to see if it makes any difference? https://github.com/ceph/ceph/pull/2374 I might be in a position in a few days to build a test cluster to test myself, but was wondering if anyone else has had any luck with it? Nick -Original Message- From:

[ceph-users] Dependency issues in fresh ceph/CentOS 7 install

2014-11-18 Thread Massimiliano Cuttini
Dear all, I try to install ceph but I get errors: #ceph-deploy install node1 [] [ceph_deploy.install][DEBUG ] Installing stable version *firefly* on cluster ceph hosts node1 [ceph_deploy.install][DEBUG ] Detecting platform for host node1 ... []

Re: [ceph-users] OSD commits suicide

2014-11-18 Thread Craig Lewis
That would probably have helped. The XFS deadlocks would only occur when there was relatively little free memory. Kernel 3.18 is supposed to have a fix for that, but I haven't tried it yet. Looking at my actual usage, I don't even need 64k inodes. 64k inodes should make things a bit faster

Re: [ceph-users] osd crashed while there was no space

2014-11-18 Thread Craig Lewis
You shouldn't let the cluster get so full that losing a few OSDs will make you go toofull. Letting the cluster get to 100% full is such a bad idea that you should make sure it doesn't happen. Ceph is supposed to stop moving data to an OSD once that OSD hits osd_backfill_full_ratio, which
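
The ratios referred to are plain config options; a sketch of the defaults of that era as they could appear in ceph.conf (verify the values against your release):

    [global]
        # warn when any OSD passes 85% used
        mon osd nearfull ratio = .85
        # mark the cluster full and block client writes at 95% used
        mon osd full ratio = .95
    [osd]
        # refuse to start new backfills into an OSD above 85% used
        osd backfill full ratio = 0.85

On a running cluster of that vintage the mon ratios were typically adjusted with "ceph pg set_full_ratio 0.95" and "ceph pg set_nearfull_ratio 0.85" rather than by editing ceph.conf.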

Re: [ceph-users] OSD commits suicide

2014-11-18 Thread Andrey Korolyov
On Tue, Nov 18, 2014 at 10:04 PM, Craig Lewis cle...@centraldesktop.com wrote: That would probably have helped. The XFS deadlocks would only occur when there was relatively little free memory. Kernel 3.18 is supposed to have a fix for that, but I haven't tried it yet. Looking at my actual

Re: [ceph-users] Giant upgrade - stability issues

2014-11-18 Thread Samuel Just
Ok, why is ceph marking osds down? Post your ceph.log from one of the problematic periods. -Sam On Tue, Nov 18, 2014 at 1:35 AM, Andrei Mikhailovsky and...@arhont.com wrote: Hello cephers, I need your help and suggestion on what is going on with my cluster. A few weeks ago i've upgraded from

Re: [ceph-users] Poor RBD performance as LIO iSCSI target

2014-11-18 Thread David Moreau Simard
Thanks guys. I looked at http://tracker.ceph.com/issues/8818 and chatted with dis on #ceph-devel. I ran a LOT of tests on a LOT of combinations of kernels (sometimes with tunables legacy). I haven't found a magical combination in which the following test does not hang: fio --name=writefile
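
The test in question is a large sequential write pushed through fio at the iSCSI-exported RBD; a representative invocation of that shape (device path, size and queue depth illustrative, not the exact command from the thread):

    fio --name=writefile --size=100G --filename=/dev/sdX \
        --bs=1M --rw=write --direct=1 --ioengine=libaio --iodepth=32 \
        --end_fsync=1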

[ceph-users] Bonding woes

2014-11-18 Thread Roland Giesler
Hi people, I have two identical servers (both Sun X2100 M2's) that form part of a cluster of 3 machines (other machines will be added later). I want to bond two gigabit Ethernet ports on each of these, which works perfectly on the one, but not on the other. How can this be? The one machine (named S2)
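
When two hosts behave differently, diffing their bonding configuration against a known-good minimal stanza is a reasonable first step; a Debian-style sketch for two GbE ports (interface names, address and mode are illustrative):

    auto bond0
    iface bond0 inet static
        address 192.168.1.10
        netmask 255.255.255.0
        bond-slaves eth0 eth1
        bond-mode active-backup
        bond-miimon 100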

Re: [ceph-users] Poor RBD performance as LIO iSCSI target

2014-11-18 Thread Nick Fisk
Hi David, Have you tried on a normal replicated pool with no cache? I've seen a number of threads recently where caching is causing various things to block/hang. It would be interesting to see if this still happens without the caching layer, at least it would rule it out. Also is there any sign

Re: [ceph-users] Stackforge Puppet Module

2014-11-18 Thread Nick Fisk
Hi David, just to let you know I finally managed to get to the bottom of this. In repo.pp one of the authors has a non-ASCII character in his name, and for whatever reason this was tripping up my Puppet environment. After removing the following line: # Author: François Charlier

Re: [ceph-users] Giant upgrade - stability issues

2014-11-18 Thread Andrei Mikhailovsky
Sam, the logs are rather large. Where should I post them? Thanks - Original Message - From: Samuel Just sam.j...@inktank.com To: Andrei Mikhailovsky and...@arhont.com Cc: ceph-users@lists.ceph.com Sent: Tuesday, 18 November, 2014 7:54:56 PM Subject: Re: [ceph-users] Giant

Re: [ceph-users] Stackforge Puppet Module

2014-11-18 Thread David Moreau Simard
Great find Nick. I've discussed it on IRC and it does look like a real issue: https://github.com/enovance/edeploy-roles/blob/master/puppet-master.install#L48-L52 I've pushed the fix for review: https://review.openstack.org/#/c/135421/ -- David Moreau Simard On Nov 18, 2014, at 3:32 PM, Nick

Re: [ceph-users] Giant upgrade - stability issues

2014-11-18 Thread Samuel Just
pastebin or something, probably. -Sam On Tue, Nov 18, 2014 at 12:34 PM, Andrei Mikhailovsky and...@arhont.com wrote: Sam, the logs are rather large in size. Where should I post it to? Thanks From: Samuel Just sam.j...@inktank.com To: Andrei Mikhailovsky

Re: [ceph-users] CephFS unresponsive at scale (2M files,

2014-11-18 Thread Kevin Sumner
Hi Thomas, I looked over the mds config reference a bit yesterday, but mds cache size seems to be the most relevant tunable. As suggested, I upped mds-cache-size to 1 million yesterday and started the load generator. During load generation, we’re seeing similar behavior on the filesystem and

[ceph-users] Concurrency in ceph

2014-11-18 Thread hp cre
Hello everyone, I'm new to ceph but have been working with proprietary clustered filesystems for quite some time. I mostly understand how ceph works, but I have a couple of questions which have been asked here before, though I didn't understand the answers. In the closed source world, we use clustered

Re: [ceph-users] Concurrency in ceph

2014-11-18 Thread Gregory Farnum
On Tue, Nov 18, 2014 at 1:26 PM, hp cre hpc...@gmail.com wrote: Hello everyone, I'm new to ceph but been working with proprietary clustered filesystem for quite some time. I almost understand how ceph works, but have a couple of questions which have been asked before here, but i didn't

Re: [ceph-users] rados mkpool fails, but not ceph osd pool create

2014-11-18 Thread Gregory Farnum
On Tue, Nov 11, 2014 at 11:43 PM, Gauvain Pocentek gauvain.pocen...@objectif-libre.com wrote: Hi all, I'm facing a problem on a ceph deployment. rados mkpool always fails: # rados -n client.admin mkpool test error creating pool test: (2) No such file or directory rados lspool and rmpool
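
For reference, the two commands being compared in this thread (pool name and PG counts illustrative):

    # fails on this deployment with (2) No such file or directory
    rados -n client.admin mkpool test
    # works as expected
    ceph osd pool create test 128 128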

Re: [ceph-users] Dependency issues in fresh ceph/CentOS 7 install

2014-11-18 Thread Massimiliano Cuttini
I solved it by installing the EPEL repo via yum. I think somebody should write down in the documentation that EPEL is mandatory. On 18/11/2014 14:29, Massimiliano Cuttini wrote: Dear all, I try to install ceph but I get errors: #ceph-deploy install node1 []
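
The workaround amounts to enabling EPEL on the target node before retrying; a minimal sketch for CentOS 7 (assuming epel-release is available from the distribution's extras repo):

    # on the node being installed (node1):
    yum install -y epel-release
    # then, from the admin node, retry:
    ceph-deploy install node1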

Re: [ceph-users] Concurrency in ceph

2014-11-18 Thread hp cre
OK, thanks Greg. But what OpenStack does, AFAIU, is use rbd devices directly, one for each VM instance, right? And that's how it supports live migrations on KVM, etc., right? OpenStack and similar cloud frameworks don't need to create VM instances on filesystems, am I correct? On 18 Nov 2014

Re: [ceph-users] Dependency issues in fresh ceph/CentOS 7 install

2014-11-18 Thread Travis Rhoden
Hi Massimiliano, I just recreated this bug myself. Ceph-deploy is supposed to install EPEL automatically on the platforms that need it. I just confirmed that it is not doing so, and will be opening up a bug in the Ceph tracker. I'll paste it here when I do so you can follow it. Thanks for the

Re: [ceph-users] Concurrency in ceph

2014-11-18 Thread Campbell, Bill
I can't speak for OpenStack, but OpenNebula uses Libvirt/QEMU/KVM to access an RBD directly for each virtual instance deployed, live-migration included (as each RBD is in and of itself a separate block device, not a file system). I would imagine OpenStack works in a similar fashion. -
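
For illustration, the direct attachment described here usually takes the form of a libvirt disk definition like the following (pool/image name, monitor address and secret UUID are placeholders):

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw'/>
      <source protocol='rbd' name='rbd/vm-disk-1'>
        <host name='10.10.5.31' port='6789'/>
      </source>
      <auth username='libvirt'>
        <secret type='ceph' uuid='00000000-0000-0000-0000-000000000000'/>
      </auth>
      <target dev='vda' bus='virtio'/>
    </disk>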

Re: [ceph-users] Concurrency in ceph

2014-11-18 Thread Gregory Farnum
On Tue, Nov 18, 2014 at 1:43 PM, hp cre hpc...@gmail.com wrote: Ok thanks Greg. But what openstack does, AFAIU, is use rbd devices directly, one for each Vm instance, right? And that's how it supports live migrations on KVM, etc.. Right? Openstack and similar cloud frameworks don't need to

Re: [ceph-users] Concurrency in ceph

2014-11-18 Thread hp cre
Yes Openstack also uses libvirt/qemu/kvm, thanks. On 18 Nov 2014 23:50, Campbell, Bill bcampb...@axcess-financial.com wrote: I can't speak for OpenStack, but OpenNebula uses Libvirt/QEMU/KVM to access an RBD directly for each virtual instance deployed, live-migration included (as each RBD is

Re: [ceph-users] mds continuously crashing on Firefly

2014-11-18 Thread Gregory Farnum
On Thu, Nov 13, 2014 at 9:34 AM, Lincoln Bryant linco...@uchicago.edu wrote: Hi all, Just providing an update to this -- I started the mds daemon on a new server and rebooted a box with a hung CephFS mount (from the first crash) and the problem seems to have gone away. I'm still not sure

Re: [ceph-users] Dependency issues in fresh ceph/CentOS 7 install

2014-11-18 Thread Travis Rhoden
I've captured this at http://tracker.ceph.com/issues/10133 On Tue, Nov 18, 2014 at 4:48 PM, Travis Rhoden trho...@gmail.com wrote: Hi Massimiliano, I just recreated this bug myself. Ceph-deploy is supposed to install EPEL automatically on the platforms that need it. I just confirmed that

Re: [ceph-users] Dependency issues in fresh ceph/CentOS 7 install

2014-11-18 Thread Massimiliano Cuttini
Then... very good! :) OK, the next bad thing is that I have installed GIANT on the admin node. However, ceph-deploy ignores the admin node installation and installs FIREFLY. Now I have the Giant ceph-deploy on my admin node and my first OSD node on FIREFLY. It seems odd to me. Is it fine or I

[ceph-users] Replacing Ceph mons understanding initial members

2014-11-18 Thread Scottix
We currently have a 3 node system with 3 monitor nodes. I created them in the initial setup and the ceph.conf mon initial members = Ceph200, Ceph201, Ceph202 mon host = 10.10.5.31,10.10.5.32,10.10.5.33 We are in the process of expanding and installing dedicated mon servers. I know I can run:
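
A hedged sketch of the usual ceph-deploy workflow for swapping in dedicated monitors (hostnames illustrative; mon initial members only matters for forming the very first quorum, but mon host should track the live monitors):

    # add the new dedicated mons one at a time and wait for quorum each time
    ceph-deploy mon create Ceph210
    ceph quorum_status --format json-pretty
    # once the new mons are in quorum, retire an old one
    ceph-deploy mon destroy Ceph200
    # finally update "mon host" (and "mon initial members") in ceph.conf and push it
    ceph-deploy --overwrite-conf config push Ceph201 Ceph202 Ceph210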

Re: [ceph-users] Log reading/how do I tell what an OSD is trying to connect to

2014-11-18 Thread Gregory Farnum
It's a little strange, but with just the one-sided log it looks as though the OSD is setting up a bunch of connections and then deliberately tearing them down again within a second or two (i.e., this is not a direct messenger bug, but it might be an OSD one, or it might be something else). Is it

Re: [ceph-users] Unclear about CRUSH map and more than one step emit in rule

2014-11-18 Thread Gregory Farnum
On Sun, Nov 16, 2014 at 4:17 PM, Anthony Alba ascanio.al...@gmail.com wrote: The step emit documentation states "Outputs the current value and empties the stack. Typically used at the end of a rule, but may also be used to pick from different trees in the same rule." What use case is there

Re: [ceph-users] mds cluster degraded

2014-11-18 Thread Gregory Farnum
Hmm, last time we saw this it meant that the MDS log had gotten corrupted somehow and was a little short (in that case due to the OSDs filling up). What do you mean by rebuilt the OSDs? -Greg On Mon, Nov 17, 2014 at 12:52 PM, JIten Shah jshah2...@me.com wrote: After i rebuilt the OSD’s, the MDS

Re: [ceph-users] Giant upgrade - stability issues

2014-11-18 Thread Andrei Mikhailovsky
Sam, Pastebin or similar will not take tens of megabytes worth of logs. If we are talking about the debug_ms 10 setting, I've got about 7GB worth of logs generated every half hour or so. Not really sure what to do with that much data. Anything more constructive? Thanks - Original

[ceph-users] Bug or by design?

2014-11-18 Thread Robert LeBlanc
I was going to submit this as a bug, but thought I would put it here for discussion first. I have a feeling that it could be behavior by design. ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578) I'm using a cache pool and was playing around with the size and min_size on the pool to
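
The knobs being exercised are ordinary per-pool settings; a sketch of the commands involved (pool name illustrative):

    ceph osd pool set cache-pool size 3
    ceph osd pool set cache-pool min_size 2
    ceph osd pool get cache-pool min_size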

Re: [ceph-users] Cache tiering and cephfs

2014-11-18 Thread Gregory Farnum
I believe the reason we don't allow you to do this right now is that there was not a good way of coordinating the transition (so that everybody starts routing traffic through the cache pool at the same time), which could lead to data inconsistencies. Looks like the OSDs handle this appropriately

Re: [ceph-users] Poor RBD performance as LIO iSCSI target

2014-11-18 Thread David Moreau Simard
Testing without the cache tiering is the next test I want to do when I have time. When it's hanging, there is no activity at all on the cluster: nothing in ceph -w, nothing in ceph osd pool stats. I'll provide an update when I have a chance to test without tiering. -- David Moreau Simard

Re: [ceph-users] incorrect pool size, wrong ruleset?

2014-11-18 Thread Gregory Farnum
On Wed, Nov 12, 2014 at 1:41 PM, houmles houm...@gmail.com wrote: Hi, I have 2 hosts with 8 2TB drive in each. I want to have 2 replicas between both hosts and then 2 replicas between osds on each host. That way even when I lost one host I still have 2 replicas. Currently I have this

Re: [ceph-users] Bug or by design?

2014-11-18 Thread Robert LeBlanc
On Nov 18, 2014 4:48 PM, Gregory Farnum g...@gregs42.com wrote: On Tue, Nov 18, 2014 at 3:38 PM, Robert LeBlanc rob...@leblancnet.us wrote: I was going to submit this as a bug, but thought I would put it here for discussion first. I have a feeling that it could be behavior by design.

Re: [ceph-users] Poor RBD performance as LIO iSCSI target

2014-11-18 Thread Ramakrishna Nishtala (rnishtal)
Hi Dave, did you say iSCSI only? The tracker issue does not say, though. I am on Giant, with both the client and ceph on RHEL 7, and it seems to work OK, unless I am missing something here. RBD on bare metal with kmod-rbd and caching disabled. [root@compute4 ~]# time fio --name=writefile --size=100G

Re: [ceph-users] osd crashed while there was no space

2014-11-18 Thread han vincent
Hmm, the problem is I had not modified any config; everything is at the defaults. As you said, all the IO should be stopped by the mon_osd_full_ratio or osd_failsafe_full_ratio settings. In my test, when the osd was near full, the IO from rest bench stopped, but the backfill IO did not stop.
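
To confirm what a running OSD actually has for these ratios, the admin socket can be queried; a sketch (socket path and OSD id illustrative):

    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep full_ratio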

Re: [ceph-users] Log reading/how do I tell what an OSD is trying to connect to

2014-11-18 Thread Scott Laird
I think I just solved at least part of the problem. Because of the somewhat peculiar way that I have Docker configured, Docker instances on another system were being assigned my OSD's IP address, running for a couple of seconds, and then failing (for unrelated reasons). Effectively, there was