[ceph-users] add writeback to Bluestore thanks to lvm-writecache

2019-08-13 Thread Olivier Bonvalet
Hi, we use OSDs with data on HDD and DB/WAL on NVMe. But for now, the BlueStore DB and WAL only store metadata, NOT data. Right? So, when we migrated from: A) Filestore + HDD with hardware write cache + journal on SSD, to: B) BlueStore + HDD without hardware write cache + DB/WAL on NVMe
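A minimal sketch of the setup under discussion (an LVM writecache on NVMe in front of the HDD data LV, plus a separate DB partition), assuming LVM >= 2.03 and purely hypothetical VG/LV/device names:

  # HDD and NVMe in one VG (names are examples only)
  vgcreate vgdata /dev/sdb /dev/nvme0n1p1
  lvcreate -n osd-block -l 100%PVS vgdata /dev/sdb         # data LV on the HDD
  lvcreate -n osd-wcache -L 50G vgdata /dev/nvme0n1p1      # cache LV on the NVMe
  lvconvert --type writecache --cachevol osd-wcache vgdata/osd-block
  # then create the OSD with a separate DB device
  ceph-volume lvm create --bluestore --data vgdata/osd-block --block.db /dev/nvme0n1p2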

Re: [ceph-users] osd_memory_target exceeding on Luminous OSD BlueStore

2019-04-09 Thread Olivier Bonvalet
is using. If there is a memory leak, the autotuner can only do so much. At some point it will reduce the caches to fit within cache_min and leave it there. Mark

Re: [ceph-users] osd_memory_target exceeding on Luminous OSD BlueStore

2019-04-09 Thread Olivier Bonvalet
matically released. (You can check the heap freelist with `ceph tell osd.0 heap stats`.) As a workaround we run this hourly: ceph tell mon.* heap release ; ceph tell osd.* heap release ; ceph tell mds.* heap release
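A sketch of that hourly workaround as a cron script (the path and shell are assumptions):

  #!/bin/sh
  # /etc/cron.hourly/ceph-heap-release : ask tcmalloc to return freed memory to the OS
  ceph tell 'mon.*' heap release
  ceph tell 'osd.*' heap release
  ceph tell 'mds.*' heap release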

Re: [ceph-users] osd_memory_target exceeding on Luminous OSD BlueStore

2019-04-09 Thread Olivier Bonvalet
released. (You can check the heap freelist with `ceph tell osd.0 heap stats`.) As a workaround we run this hourly: ceph tell mon.* heap release ; ceph tell osd.* heap release ; ceph tell mds.* heap release -- Dan. On Sat, Apr 6,

[ceph-users] osd_memory_target exceeding on Luminous OSD BlueStore

2019-04-06 Thread Olivier Bonvalet
Hi, on a Luminous 12.2.11 deployment, my BlueStore OSDs exceed the osd_memory_target: daevel-ob@ssdr712h:~$ ps auxw | grep ceph-osd ceph 3646 17.1 12.0 6828916 5893136 ? Ssl mars29 1903:42 /usr/bin/ceph-osd -f --cluster ceph --id 143 --setuser ceph --setgroup ceph ceph 3991
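To compare the configured target with what the process actually uses, something like this (a sketch; osd.143 is the OSD from the ps output above, queried on its host via the admin socket):

  ceph daemon osd.143 config get osd_memory_target   # configured limit, in bytes
  ps -o pid,rss,cmd -C ceph-osd                       # RSS in KiB per OSD on this host
  ceph daemon osd.143 dump_mempools                   # where the memory actually goes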

[ceph-users] Bluestore & snapshots weight

2018-10-28 Thread Olivier Bonvalet
Hi, with Filestore, to estimate the weight of snapshots we use a simple find script on each OSD: nice find "$OSDROOT/$OSDDIR/current/" \ -type f -not -name '*_head_*' -not -name '*_snapdir_*' \ -printf '%P\n' Then we aggregate by image prefix, and obtain an estimation of each
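A sketch of one way to do that aggregation, assuming the usual Filestore object file naming (rbd\udata.<image-prefix>.<object-number>__...):

  nice find "$OSDROOT/$OSDDIR/current/" \
       -type f -not -name '*_head_*' -not -name '*_snapdir_*' -printf '%f %s\n' \
    | awk '{ split($1, a, "."); sz[a[2]] += $2 }
           END { for (p in sz) printf "%s %.1f GiB\n", p, sz[p]/2^30 }' \
    | sort -k2 -rn

Summing the same prefix across all OSDs and dividing by the replica count then gives a rough per-image estimate.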

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Olivier Bonvalet
detail"? > Also, which version of Ceph are you running? > Paul > > Am Fr., 21. Sep. 2018 um 19:28 Uhr schrieb Olivier Bonvalet > : > > > > So I've totally disable cache-tiering and overlay. Now OSD 68 & 69 > > are > > fine, no more blocked. > > >

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Olivier Bonvalet
On Friday 21 September 2018 at 16:51 +0200, Maks Kowalik wrote: According to the query output you pasted, shards 1 and 2 are broken. But, on the other hand, the EC profile (4+2) should make it possible to recover from 2 shards lost simultaneously... On Fri., 21 Sep 2018 at 16:29, Olivier Bonvalet

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Olivier Bonvalet
your output in https://pastebin.com/zrwu5X0w). Can you verify if that block device is in use and healthy, or is it corrupt? Quoting Maks Kowalik: Could you please paste the output of "pg 37.9c query"? On Fri., 21 Sep 201

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Olivier Bonvalet
rbd_data.f66c92ae8944a.000f2596, rbd_header.f66c92ae8944a. And "cache-flush-evict-all" still hangs. I also switched the cache tier to "readproxy", to avoid using this cache. But it's still blocked.

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Olivier Bonvalet
, it's still blocked. On Friday 21 September 2018 at 02:14 +0200, Olivier Bonvalet wrote: Hello, on a Luminous cluster, I have a PG incomplete and I can't find how to fix that. It's an EC pool (4+2): pg 37.9c is incomplete, acting [32,50

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Olivier Bonvalet
ems during recovery. Since only OSDs 68 and 69 are mentioned, I was wondering if your cache tier also has size 2. Quoting Olivier Bonvalet: Hi, the cache tier on this pool has 26GB of data (for 5.7TB of data on the EC

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Olivier Bonvalet
rting those OSDs (68, 69) helps, too. Or there could be an issue with the cache tier; what do those logs say? Regards, Eugen. Quoting Olivier Bonvalet: Hello, on a Luminous cluster, I have a PG incomplete and I

[ceph-users] PG stuck incomplete

2018-09-20 Thread Olivier Bonvalet
Hello, on a Luminous cluster, I have a PG incomplete and I can't find how to fix that. It's an EC pool (4+2): pg 37.9c is incomplete, acting [32,50,59,1,0,75] (reducing pool bkp-sb-raid6 min_size from 4 may help; search ceph.com/docs for 'incomplete') Of course, we can't reduce min_size
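A sketch of the usual first diagnostics for such a PG (commands as in Luminous):

  ceph pg 37.9c query | less          # look at recovery_state / peering blockers
  ceph pg dump_stuck inactive
  ceph osd pool get bkp-sb-raid6 min_size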

Re: [ceph-users] Optane 900P device class automatically set to SSD not NVME

2018-08-13 Thread Olivier Bonvalet
On a recent Luminous cluster, with nvme*n1 devices, the class is automatically set as "nvme" on "Intel SSD DC P3520 Series": ~# ceph osd tree ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF -1 2.15996 root default -9 0.71999
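If the class is mis-detected, it can be overridden by hand (a sketch; osd.12 is a hypothetical ID):

  ceph osd crush rm-device-class osd.12
  ceph osd crush set-device-class nvme osd.12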

Re: [ceph-users] ghost PG : "i don't have pgid xx"

2018-06-05 Thread Olivier Bonvalet
Luminous (you got > 200 PGs per OSD): try to increase mon_max_pg_per_osd on the monitors to 300 or so to temporarily resolve this. Paul. 2018-06-05 9:40 GMT+02:00, Olivier Bonvalet: Some more information: the cluster was just upgraded from Jewel
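A sketch of that workaround (raising the monitor-side PG limit at runtime, then persisting it):

  ceph tell 'mon.*' injectargs '--mon_max_pg_per_osd=300'
  # and in ceph.conf on the monitors, to survive restarts:
  #   [mon]
  #   mon max pg per osd = 300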

Re: [ceph-users] ghost PG : "i don't have pgid xx"

2018-06-05 Thread Olivier Bonvalet
21 286462'438402 2018-05-20 18:06:12.443141 286462'438402 2018-05-20 18:06:12.443141 0. On Tuesday 5 June 2018 at 09:25 +0200, Olivier Bonvalet wrote: Hi, I have a cluster in a "stale" state: a lot of RBDs have been blocked for ~10 hours

[ceph-users] ghost PG : "i don't have pgid xx"

2018-06-05 Thread Olivier Bonvalet
Hi, I have a cluster in a "stale" state: a lot of RBDs have been blocked for ~10 hours. In the status I see PGs in stale or down state, but those PGs don't seem to exist anymore: root! stor00-sbg:~# ceph health detail | egrep '(stale|down)' HEALTH_ERR noout,noscrub,nodeep-scrub flag(s) set; 1

[ceph-users] Re : general protection fault: 0000 [#1] SMP

2017-10-12 Thread Olivier Bonvalet
On Thursday 12 October 2017 at 09:12 +0200, Ilya Dryomov wrote: It's a crash in memcpy() in skb_copy_ubufs(). It's not in ceph, but ceph-induced, it looks like. I don't remember seeing anything similar in the context of krbd. This is a Xen dom0 kernel, right? What did the workload

[ceph-users] general protection fault: 0000 [#1] SMP

2017-10-11 Thread Olivier Bonvalet
Hi, I had a "general protection fault: " with Ceph RBD kernel client. Not sure how to read the call, is it Ceph related ? Oct 11 16:15:11 lorunde kernel: [311418.891238] general protection fault: [#1] SMP Oct 11 16:15:11 lorunde kernel: [311418.891855] Modules linked in: cpuid

[ceph-users] Re : Re : Re : bad crc/signature errors

2017-10-06 Thread Olivier Bonvalet
On Thursday 5 October 2017 at 21:52 +0200, Ilya Dryomov wrote: On Thu, Oct 5, 2017 at 6:05 PM, Olivier Bonvalet <ceph.l...@daevel.fr> wrote: On Thursday 5 October 2017 at 17:03 +0200, Ilya Dryomov wrote: When did you start seeing these

[ceph-users] Re : Re : bad crc/signature errors

2017-10-05 Thread Olivier Bonvalet
On Thursday 5 October 2017 at 17:03 +0200, Ilya Dryomov wrote: When did you start seeing these errors? Can you correlate that to a ceph or kernel upgrade? If not, and if you don't see other issues, I'd write it off as faulty hardware. Well... I have one hypervisor (Xen 4.6 and kernel

[ceph-users] Re : Re : bad crc/signature errors

2017-10-05 Thread Olivier Bonvalet
On Thursday 5 October 2017 at 11:10 +0200, Ilya Dryomov wrote: On Thu, Oct 5, 2017 at 9:03 AM, Olivier Bonvalet <ceph.l...@daevel.fr> wrote: I also see that, but on 4.9.52 and 4.13.3 kernels. I also have some kernel panics, but don't know if it's related

[ceph-users] Re : bad crc/signature errors

2017-10-05 Thread Olivier Bonvalet
On Thursday 5 October 2017 at 11:47 +0200, Ilya Dryomov wrote: The stable pages bug manifests as multiple sporadic connection resets, because in that case CRCs computed by the kernel don't always match the data that gets sent out. When the mismatch is detected on the OSD side, OSDs

[ceph-users] Re : bad crc/signature errors

2017-10-05 Thread Olivier Bonvalet
I also see that, but on 4.9.52 and 4.13.3 kernels. I also have some kernel panics, but don't know if they're related (the RBDs are mapped on Xen hosts). On Thursday 5 October 2017 at 05:53, Adrian Saul wrote: We see the same messages and are similarly on a 4.4 KRBD version that is affected by

Re: [ceph-users] ceph.com IPv6 down

2015-09-23 Thread Olivier Bonvalet
On Wednesday 23 September 2015 at 13:41 +0200, Wido den Hollander wrote: Hmm, that is weird. It works for me here from the Netherlands via IPv6. You're right, I checked from other providers and it works. So, a problem between Free (France) and Dreamhost?

[ceph-users] ceph.com IPv6 down

2015-09-23 Thread Olivier Bonvalet
Hi, for several hours http://ceph.com/ has not been responding over IPv6. It pings, and we can open a TCP socket, but nothing more: ~$ nc -w30 -v -6 ceph.com 80 Connection to ceph.com 80 port [tcp/http] succeeded! GET / HTTP/1.0 Host: ceph.com But a HEAD query works: ~$

Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
On Friday 18 September 2015 at 12:04 +0200, Jan Schermer wrote: On 18 Sep 2015, at 11:28, Christian Balzer <ch...@gol.com> wrote: On Fri, 18 Sep 2015 11:07:49 +0200, Olivier Bonvalet wrote: On Friday 18 September 2

Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
On Friday 18 September 2015 at 17:04 +0900, Christian Balzer wrote: Hello, On Fri, 18 Sep 2015 09:37:24 +0200, Olivier Bonvalet wrote: Hi, sorry for the missing information. I was trying to avoid putting in too much inappropriate

Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
re I touch anything has become a routine now and that problem is gone. Jan. On 18 Sep 2015, at 10:53, Olivier Bonvalet <ceph.l...@daevel.fr> wrote: mmm, good point. I don't see CPU or IO problems on the mons, but in the logs

Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
the HDD pool. At the same time, are there any tips for tuning the journal in the case of HDD OSDs, with a (potentially big) SSD journal and a hardware RAID card which handles write-back? Thanks for your help. Olivier. On Friday 18 September 2015 at 02:35 +0200, Olivier Bonvalet wrote: Hi, I have a cluster

Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
On Friday 18 September 2015 at 14:14 +0200, Paweł Sadowski wrote: It might be worth checking how many threads you have in your system (ps -eL | wc -l). By default there is a limit of 32k (sysctl -q kernel.pid_max). There is/was a bug in fork() (https://lkml.org/lkml/2015/2/3/345)
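A sketch of that check, and of raising the limit if the thread count really is close to it:

  ps -eL | wc -l                     # total threads on the host
  sysctl -q kernel.pid_max           # current limit (32768 by default)
  sysctl -w kernel.pid_max=4194303   # raise it; persist via /etc/sysctl.d/ if it helps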

Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
near 0 too - bandwidth usage is also near 0. The whole cluster seems to be waiting for something... but I don't see what. On Friday 18 September 2015 at 02:35 +0200, Olivier Bonvalet wrote: Hi,

Re: [ceph-users] debian repositories path change?

2015-09-18 Thread Olivier Bonvalet
Hi, not sure if it's related, but there are recent changes because of a security issue: http://ceph.com/releases/important-security-notice-regarding-signing-key-and-binary-downloads-of-ceph/ On Friday 18 September 2015 at 08:45 -0500, Brian Kroth wrote: Hi all, we've had the following

Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
seems to be waiting for something... but I don't see what. On Friday 18 September 2015 at 02:35 +0200, Olivier Bonvalet wrote: Hi, I have a cluster with a lot of blocked operations each time I try

Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
investigate... Jan. On 18 Sep 2015, at 09:37, Olivier Bonvalet <ceph.l...@daevel.fr> wrote: Hi, sorry for the missing information. I was trying to avoid putting in too much inappropriate info ;)

[ceph-users] Lot of blocked operations

2015-09-17 Thread Olivier Bonvalet
Hi, I have a cluster with a lot of blocked operations each time I try to move data (by slightly reweighting an OSD). It's a full-SSD cluster, with a 10GbE network. In the logs, when I have a blocked OSD, on the main OSD I can see this: 2015-09-18 01:55:16.981396 7f89e8cb8700 0 log [WRN] : 2 slow
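When the blocking only appears while data moves, a common mitigation is to throttle backfill and recovery; a sketch with Hammer-era option names:

  ceph tell 'osd.*' injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'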

Re: [ceph-users] Lot of blocked operations

2015-09-17 Thread Olivier Bonvalet
Some additional information: - I have 4 SSDs per node. - CPU usage is near 0. - IO wait is near 0 too. - Bandwidth usage is also near 0. The whole cluster seems to be waiting for something... but I don't see what. On Friday 18 September 2015 at 02:35 +0200, Olivier Bonvalet wrote: Hi,

Re: [ceph-users] Firefly 0.80.10 ready to upgrade to?

2015-07-21 Thread Olivier Bonvalet
On Monday 13 July 2015 at 11:31 +0100, Gregory Farnum wrote: On Mon, Jul 13, 2015 at 11:25 AM, Kostis Fardelas <dante1...@gmail.com> wrote: Hello, it seems that new packages for firefly have been uploaded to the repo. However, I can't find any details in the Ceph release notes. There is only

Re: [ceph-users] Firefly 0.80.10 ready to upgrade to?

2015-07-21 Thread Olivier Bonvalet
On Tuesday 21 July 2015 at 07:06 -0700, Sage Weil wrote: On Tue, 21 Jul 2015, Olivier Bonvalet wrote: On Monday 13 July 2015 at 11:31 +0100, Gregory Farnum wrote: On Mon, Jul 13, 2015 at 11:25 AM, Kostis Fardelas <dante1...@gmail.com> wrote: Hello, it seems that new packages

[ceph-users] More writes on filestore than on journal ?

2015-03-23 Thread Olivier Bonvalet
Hi, I'm still trying to find out why there are many more write operations on the filestore since Emperor/Firefly than with Dumpling. So, I added monitoring of all perf counter values from the OSDs. From what I see: «filestore.ops» reports an average of 78 operations per second. But block device monitoring
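A sketch of how the two numbers can be compared on one OSD host (osd.0 and sdb are hypothetical; jq just extracts the counter):

  ceph daemon osd.0 perf dump | jq '.filestore.ops'   # cumulative filestore ops
  iostat -x 10 /dev/sdb                               # w/s actually seen by the block device

Sampling the counter twice and dividing by the interval gives an op rate to set against w/s.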

Re: [ceph-users] More writes on blockdevice than on filestore ?

2015-03-23 Thread Olivier Bonvalet
Erg... I sent too fast. Bad title; please read «More writes on block device than on filestore». On Monday 23 March 2015 at 14:21 +0100, Olivier Bonvalet wrote: Hi, I'm still trying to find out why there are many more write operations on the filestore since Emperor/Firefly than with Dumpling. So, I

Re: [ceph-users] More writes on filestore than on journal ?

2015-03-23 Thread Olivier Bonvalet
Hi, on Monday 23 March 2015 at 07:29 -0700, Gregory Farnum wrote: On Mon, Mar 23, 2015 at 6:21 AM, Olivier Bonvalet <ceph.l...@daevel.fr> wrote: Hi, I'm still trying to find out why there are many more write operations on the filestore since Emperor/Firefly than with Dumpling. Do you have any

Re: [ceph-users] Perf problem after upgrade from dumpling to firefly

2015-03-04 Thread Olivier Bonvalet
? - Original message - From: Olivier Bonvalet <ceph.l...@daevel.fr> To: aderumier <aderum...@odiso.com> Cc: ceph-users <ceph-users@lists.ceph.com> Sent: Wednesday 4 March 2015 16:42:13 Subject: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly. Only writes ;) On Wednesday

Re: [ceph-users] Perf problem after upgrade from dumpling to firefly

2015-03-04 Thread Olivier Bonvalet
Only writes ;) On Wednesday 4 March 2015 at 16:19 +0100, Alexandre DERUMIER wrote: The change is only on the OSDs (and not on the OSD journal). Do you see twice the IOPS for both reads and writes? If only reads, maybe a read-ahead bug could explain this. - Original message - From: Olivier Bonvalet

[ceph-users] Perf problem after upgrade from dumpling to firefly

2015-03-04 Thread Olivier Bonvalet
Hi, last Saturday I upgraded my production cluster from dumpling to emperor (since we were successfully using it on a test cluster). A couple of hours later, we had failing OSDs: some of them were marked as down by Ceph, probably because of IO starvation. I set the cluster to «noout», start

Re: [ceph-users] Perf problem after upgrade from dumpling to firefly

2015-03-04 Thread Olivier Bonvalet
: Olivier Bonvalet <ceph.l...@daevel.fr> To: ceph-users <ceph-users@lists.ceph.com> Sent: Wednesday 4 March 2015 12:10:30 Subject: [ceph-users] Perf problem after upgrade from dumpling to firefly. Hi, last Saturday I upgraded my production cluster from dumpling to emperor (since we were successfully

Re: [ceph-users] Perf problem after upgrade from dumpling to firefly

2015-03-04 Thread Olivier Bonvalet
is permanent: I have twice the IO/s on the HDDs since firefly. Oh, permanent, that's strange. (If you don't see more traffic coming from clients, I don't understand...) Do you also see twice the IOs/ops in the ceph -w stats? Is the ceph health OK? - Original message - From: Olivier Bonvalet ceph.l

Re: [ceph-users] v0.80.8 and librbd performance

2015-03-03 Thread Olivier Bonvalet
Is the kernel client affected by the problem? On Tuesday 3 March 2015 at 15:19 -0800, Sage Weil wrote: Hi, this is just a heads up that we've identified a performance regression in v0.80.8 from previous firefly releases. A v0.80.9 is working its way through QA and should be out in a few

Re: [ceph-users] v0.80.8 and librbd performance

2015-03-03 Thread Olivier Bonvalet
On Tuesday 3 March 2015 at 16:32 -0800, Sage Weil wrote: On Wed, 4 Mar 2015, Olivier Bonvalet wrote: Is the kernel client affected by the problem? Nope. The kernel client is unaffected; the issue is in librbd. sage. Ok, thanks for the clarification. So I have to dig

Re: [ceph-users] Data still in OSD directories after removing

2014-05-21 Thread Olivier Bonvalet
.238e1f29.*' -delete Thanks for any advice, Olivier. PS: not sure if this kind of problem is for the user or the dev mailing list. On Tuesday 20 May 2014 at 11:32 +0200, Olivier Bonvalet wrote: Hi, short: I removed a 1TB RBD image, but I still see files for it on the OSDs. long: 1) I did: rbd

Re: [ceph-users] Data still in OSD directories after removing

2014-05-21 Thread Olivier Bonvalet
On Wednesday 21 May 2014 at 08:20 -0700, Sage Weil wrote: You should definitely not do this! :) Of course ;) You're certain that that is the correct prefix for the rbd image you removed? Do you see the objects listed when you do 'rados -p rbd ls - | grep prefix'? I'm pretty sure, yes

[ceph-users] Data still in OSD directories after removing

2014-05-20 Thread Olivier Bonvalet
Hi, short: I removed a 1TB RBD image, but I still see files for it on the OSDs. long: 1) I did: rbd snap purge $pool/$img, but since it overloaded the cluster, I stopped it (CTRL+C); 2) later, rbd snap purge $pool/$img; 3) then, rbd rm $pool/$img. Now, on the disk I can find files of this v1
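To check whether RADOS still knows about those objects (rather than just the Filestore files), the block-name prefix of the removed image can be grepped for; a sketch using the prefix mentioned later in this thread:

  rados -p $pool ls | grep '^rb.0.15c26.238e1f29' | head
  # if nothing is listed, the leftover files are orphans on the Filestore side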

Re: [ceph-users] performance and disk usage of snapshots

2013-09-28 Thread Olivier Bonvalet
Hi, on Tuesday 24 September 2013 at 18:37 +0200, Corin Langosch wrote: Hi there, do snapshots have an impact on write performance? I assume on each write all snapshots have to get updated (COW), so the more snapshots exist, the worse write performance will get? Not exactly: the first

Re: [ceph-users] Ceph space problem, garbage collector ?

2013-09-11 Thread Olivier Bonvalet
Hi, do you need more information about that? Thanks, Olivier. On Tuesday 10 September 2013 at 11:19 -0700, Samuel Just wrote: Can you post the rest of your crush map? -Sam. On Tue, Sep 10, 2013 at 5:52 AM, Olivier Bonvalet <ceph.l...@daevel.fr> wrote: I also checked that all files in that PG

Re: [ceph-users] Ceph space problem, garbage collector ?

2013-09-11 Thread Olivier Bonvalet
On Wednesday 11 September 2013 at 11:00 +0200, Olivier Bonvalet wrote: Hi, do you need more information about that? Thanks, Olivier. On Tuesday 10 September 2013 at 11:19 -0700, Samuel Just wrote: Can you post the rest of your crush map? -Sam. On Tue, Sep 10, 2013 at 5:52 AM, Olivier

Re: [ceph-users] Ceph space problem, garbage collector ?

2013-09-11 Thread Olivier Bonvalet
I removed some garbage about hosts faude / rurkh / murmillia (they were temporarily added because the cluster was full). So the clean CRUSH map: # begin crush map tunable choose_local_tries 0 tunable choose_local_fallback_tries 0 tunable choose_total_tries 50 # devices device 0 device0 device 1

[ceph-users] Ceph space problem, garbage collector ?

2013-09-10 Thread Olivier Bonvalet
Hi, I have a space problem on a production cluster, as if there is unused data that is not freed: ceph df and rados df report 613GB of data, but disk usage is 2640GB (with 3 replicas). It should be near 1839GB. I have 5 hosts, 3 with SAS storage and 2 with SSD storage. I use crush rules to put pools

Re: [ceph-users] Ceph space problem, garbage collector ?

2013-09-10 Thread Olivier Bonvalet
: http://pastebin.com/u73mTvjs On Tuesday 10 September 2013 at 10:31 +0200, Olivier Bonvalet wrote: Hi, I have a space problem on a production cluster, as if there is unused data that is not freed: ceph df and rados df report 613GB of data, but disk usage is 2640GB (with 3 replicas). It should

Re: [ceph-users] Ceph space problem, garbage collector ?

2013-09-10 Thread Olivier Bonvalet
--pool ssd3copies ls rados.ssd3copies.dump). On Tuesday 10 September 2013 at 13:46 +0200, Olivier Bonvalet wrote: Some additional information: if I look at one PG only, for example 6.31f, ceph pg dump reports a size of 616GB: # ceph pg dump | grep ^6\\. | awk '{ SUM+=($6/1024/1024
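A completed form of that truncated pipeline, kept as a sketch (it assumes, as in the quoted command, that column 6 of ceph pg dump is the per-PG byte count and that pool 6 is the one being measured):

  ceph pg dump 2>/dev/null | grep '^6\.' \
    | awk '{ sum += $6 } END { printf "%.0f GB\n", sum/1024/1024/1024 }'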

Re: [ceph-users] Ceph space problem, garbage collector ?

2013-09-10 Thread Olivier Bonvalet
On Tuesday 10 September 2013 at 11:19 -0700, Samuel Just wrote: Can you post the rest of your crush map? -Sam. Yes: # begin crush map tunable choose_local_tries 0 tunable choose_local_fallback_tries 0 tunable choose_total_tries 50 # devices device 0 osd.0 device 1 osd.1 device 2 osd.2 device

Re: [ceph-users] Ceph space problem, garbage collector ?

2013-09-10 Thread Olivier Bonvalet
.46 up 1 47 2.72 osd.47 up 1 48 2.72 osd.48 up 1 On Tuesday 10 September 2013 at 21:01 +0200, Olivier Bonvalet wrote: I removed some garbage

Re: [ceph-users] Ceph space problem, garbage collector ?

2013-09-10 Thread Olivier Bonvalet
I removed some garbage about hosts faude / rurkh / murmillia (they were temporarily added because the cluster was full). So the clean CRUSH map: # begin crush map tunable choose_local_tries 0 tunable choose_local_fallback_tries 0 tunable choose_total_tries 50 # devices device 0 device0 device 1

Re: [ceph-users] Ceph + Xen - RBD io hang

2013-08-27 Thread Olivier Bonvalet
Hi, I use Ceph 0.61.8 and Xen 4.2.2 (Debian) in production, and can't use kernel 3.10.* on the dom0, which hangs very soon. But it's visible in the kernel logs of the dom0, not the domU. Anyway, you should probably retry with kernel 3.9.11 for the dom0 (I also use 3.10.9 in the domU). Olivier. On Tuesday 27

[ceph-users] osd/OSD.cc: 4844: FAILED assert(_get_map_bl(epoch, bl)) (ceph 0.61.7)

2013-08-19 Thread Olivier Bonvalet
Hi, I have an OSD which crashes every time I try to start it (see logs below). Is it a known problem? And is there a way to fix it? root! taman:/var/log/ceph# grep -v ' pipe' osd.65.log 2013-08-19 11:07:48.478558 7f6fe367a780 0 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff),

Re: [ceph-users] osd/OSD.cc: 4844: FAILED assert(_get_map_bl(epoch, bl)) (ceph 0.61.7)

2013-08-19 Thread Olivier Bonvalet
On Monday 19 August 2013 at 12:27 +0200, Olivier Bonvalet wrote: Hi, I have an OSD which crashes every time I try to start it (see logs below). Is it a known problem? And is there a way to fix it? root! taman:/var/log/ceph# grep -v ' pipe' osd.65.log 2013-08-19 11:07:48.478558 7f6fe367a780

Re: [ceph-users] Replace all monitors

2013-08-10 Thread Olivier Bonvalet
On Thursday 8 August 2013 at 18:04 -0700, Sage Weil wrote: On Fri, 9 Aug 2013, Olivier Bonvalet wrote: On Thursday 8 August 2013 at 09:43 -0700, Sage Weil wrote: On Thu, 8 Aug 2013, Olivier Bonvalet wrote: Hi, right now I have 5 monitors which share slow SSDs with several OSD journals

[ceph-users] Replace all monitors

2013-08-08 Thread Olivier Bonvalet
Hi, right now I have 5 monitors which share slow SSDs with several OSD journals. As a result, each data migration operation (reweight, recovery, etc.) is very slow and the cluster is nearly down. So I have to change that. I'm looking to replace these 5 monitors with 3 new monitors, which still share
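A sketch of replacing monitors one at a time so quorum is never lost (names and paths are hypothetical):

  ceph mon remove oldmon5                    # drop one old monitor, quorum stays intact
  ceph mon getmap -o /tmp/monmap             # bootstrap the new one from the current map
  ceph-mon -i newmon1 --mkfs --monmap /tmp/monmap --keyring /path/to/mon.keyring
  service ceph start mon.newmon1             # repeat until the 3 new monitors are in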

Re: [ceph-users] Replace all monitors

2013-08-08 Thread Olivier Bonvalet
On Thursday 8 August 2013 at 09:43 -0700, Sage Weil wrote: On Thu, 8 Aug 2013, Olivier Bonvalet wrote: Hi, right now I have 5 monitors which share slow SSDs with several OSD journals. As a result, each data migration operation (reweight, recovery, etc.) is very slow and the cluster is nearly

Re: [ceph-users] kernel BUG at net/ceph/osd_client.c:2103

2013-08-05 Thread Olivier Bonvalet
Message- From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users- boun...@lists.ceph.com] On Behalf Of Olivier Bonvalet Sent: Monday, 5 August 2013 11:07 AM To: ceph-users@lists.ceph.com Subject: [ceph-users] kernel BUG at net/ceph/osd_client.c:2103 Hi, I've just

Re: [ceph-users] kernel BUG at net/ceph/osd_client.c:2103

2013-08-04 Thread Olivier Bonvalet
Sorry, the dev list is probably a better place for that one. On Monday 5 August 2013 at 03:07 +0200, Olivier Bonvalet wrote: Hi, I've just upgraded a Xen dom0 (Debian Wheezy with Xen 4.2.2) from Linux 3.9.11 to Linux 3.10.5, and now I have kernel panics after launching some VMs which use RBD

Re: [ceph-users] kernel BUG at net/ceph/osd_client.c:2103

2013-08-04 Thread Olivier Bonvalet
that addresses the bug, would you be able to test it? Thanks! sage. On Mon, 5 Aug 2013, Olivier Bonvalet wrote: Sorry, the dev list is probably a better place for that one. On Monday 5 August 2013 at 03:07 +0200, Olivier Bonvalet wrote: Hi, I've just upgraded a Xen dom0 (Debian

Re: [ceph-users] VMs freez after slow requests

2013-06-03 Thread Olivier Bonvalet
On Monday 3 June 2013 at 08:04 -0700, Gregory Farnum wrote: On Sunday, June 2, 2013, Dominik Mostowiec wrote: Hi, I'm trying to start a postgres cluster on VMs with a second disk mounted from ceph (rbd - kvm). I started some writes (pgbench initialisation)

Re: [ceph-users] scrub error: found clone without head

2013-05-31 Thread Olivier Bonvalet
you send the filenames in the pg directories for those 4 pgs? -Sam On Thu, May 23, 2013 at 3:27 PM, Olivier Bonvalet ceph.l...@daevel.fr wrote: No : pg 3.7c is active+clean+inconsistent, acting [24,13,39] pg 3.6b is active+clean+inconsistent, acting [28,23,5] pg 3.d is active+clean

Re: [ceph-users] scrub error: found clone without head

2013-05-31 Thread Olivier Bonvalet
Note that I still have scrub errors, but rados doesn't see those objects: root! brontes:~# rados -p hdd3copies ls | grep '^rb.0.15c26.238e1f29' root! brontes:~# On Friday 31 May 2013 at 15:36 +0200, Olivier Bonvalet wrote: Hi, sorry for the late answer: trying to fix that, I tried

Re: [ceph-users] [solved] scrub error: found clone without head

2013-05-31 Thread Olivier Bonvalet
Ok, so: - after a second rbd rm XXX, the image was gone - and rados ls doesn't see any objects from that image - so I tried to move those files, and scrub is now OK! So for me it's fixed. Thanks. On Friday 31 May 2013 at 16:34 +0200, Olivier Bonvalet wrote: Note that I still have scrub

[ceph-users] Edge effect with multiple RBD kernel clients per host ?

2013-05-25 Thread Olivier Bonvalet
Hi, I seem to have a bad edge effect in my setup; I don't know if it's an RBD problem or a Xen problem. So, I have one Ceph cluster, in which I set up 2 different storage pools: one on SSD and one on SAS. With appropriate CRUSH rules, those pools are completely separated; only the MONs are common. Then,

Re: [ceph-users] scrub error: found clone without head

2013-05-23 Thread Olivier Bonvalet
Not yet. I'm keeping it for now. On Wednesday 22 May 2013 at 15:50 -0700, Samuel Just wrote: rb.0.15c26.238e1f29: has that rbd volume been removed? -Sam. On Wed, May 22, 2013 at 12:18 PM, Olivier Bonvalet <ceph.l...@daevel.fr> wrote: 0.61-11-g3b94f03 (0.61-1.1), but the bug occurred

Re: [ceph-users] scrub error: found clone without head

2013-05-23 Thread Olivier Bonvalet
wrote: Do all of the affected PGs share osd.28 as the primary? I think the only recovery is probably to manually remove the orphaned clones. -Sam. On Thu, May 23, 2013 at 5:00 AM, Olivier Bonvalet <ceph.l...@daevel.fr> wrote: Not yet. I'm keeping it for now. On Wednesday 22 May 2013 at 15:50

Re: [ceph-users] scrub error: found clone without head

2013-05-22 Thread Olivier Bonvalet
On Monday 20 May 2013 at 00:06 +0200, Olivier Bonvalet wrote: On Tuesday 7 May 2013 at 15:51 +0300, Dzianis Kahanovich wrote: I have 4 scrub errors (3 PGs with "found clone without head"), on one OSD. Not repairing. How can I repair it without re-creating the OSD? Right now it is easy to clean+create

Re: [ceph-users] scrub error: found clone without head

2013-05-20 Thread Olivier Bonvalet
-provocation for a forced reinstall). Now (at least for my summer outdoors) I keep v0.62 (3 nodes) with every pool at size=3 min_size=2 (was size=2 min_size=1). But try to do nothing first and try to install the latest version. And keep your vote on issue #4937 to push the developers. Olivier Bonvalet wrote

Re: [ceph-users] scrub error: found clone without head

2013-05-19 Thread Olivier Bonvalet
On Tuesday 7 May 2013 at 15:51 +0300, Dzianis Kahanovich wrote: I have 4 scrub errors (3 PGs with "found clone without head"), on one OSD. Not repairing. How can I repair it without re-creating the OSD? Right now it is easy to clean+create an OSD, but in theory, in case there are multiple OSDs, it may cause

Re: [ceph-users] PG down incomplete

2013-05-17 Thread Olivier Bonvalet
osd tree would help us understand the status of your cluster a bit better. On Thu, May 16, 2013 at 10:32 PM, Olivier Bonvalet <ceph.l...@daevel.fr> wrote: On Wednesday 15 May 2013 at 00:15 +0200, Olivier Bonvalet wrote: Hi, I have some PGs in state down and/or incomplete on my cluster

Re: [ceph-users] PG down incomplete

2013-05-17 Thread Olivier Bonvalet
-rebalancing If you have down OSDs that don't get marked out, that would certainly cause problems. Have you tried restarting the failed OSDs? What do the logs look like for osd.15 and osd.25? On Fri, May 17, 2013 at 1:31 AM, Olivier Bonvalet ceph.l...@daevel.fr wrote: Hi, thanks

Re: [ceph-users] PG down incomplete

2013-05-16 Thread Olivier Bonvalet
On Wednesday 15 May 2013 at 00:15 +0200, Olivier Bonvalet wrote: Hi, I have some PGs in state down and/or incomplete on my cluster, because I lost 2 OSDs and a pool had only 2 replicas. So of course that data is lost. My problem now is that I can't retrieve a HEALTH_OK status

[ceph-users] PG down incomplete

2013-05-14 Thread Olivier Bonvalet
Hi, I have some PGs in state down and/or incomplete on my cluster, because I lost 2 OSDs and a pool had only 2 replicas. So of course that data is lost. My problem now is that I can't retrieve a HEALTH_OK status: if I try to remove, read or overwrite the corresponding RBD images, nearly all

Re: [ceph-users] rbd snap rm overload my cluster (during backups)

2013-04-21 Thread Olivier Bonvalet
for Cuttlefish and it was in some of the dev releases)? Snapshot deletes are a little more expensive than we'd like, but I'm surprised they're doing this badly for you. :/ -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Sun, Apr 21, 2013 at 2:16 AM, Olivier Bonvalet olivier.bonva

Re: [ceph-users] Scrub shutdown the OSD process

2013-04-20 Thread Olivier Bonvalet
On Wednesday 17 April 2013 at 20:52 +0200, Olivier Bonvalet wrote: What I don't understand is why the OSD process crashes, instead of marking that PG corrupted; and is that PG really corrupted, or is this just an OSD bug? Once again, a bit more information: by searching for information

Re: [ceph-users] Scrub shutdown the OSD process

2013-04-17 Thread Olivier Bonvalet
On Tuesday 16 April 2013 at 08:56 +0200, Olivier Bonvalet wrote: On Monday 15 April 2013 at 10:57 -0700, Gregory Farnum wrote: On Mon, Apr 15, 2013 at 10:19 AM, Olivier Bonvalet <ceph.l...@daevel.fr> wrote: On Monday 15 April 2013 at 10:16 -0700, Gregory Farnum wrote: Are you

Re: [ceph-users] Scrub shutdown the OSD process

2013-04-16 Thread Olivier Bonvalet
On Monday 15 April 2013 at 10:57 -0700, Gregory Farnum wrote: On Mon, Apr 15, 2013 at 10:19 AM, Olivier Bonvalet <ceph.l...@daevel.fr> wrote: On Monday 15 April 2013 at 10:16 -0700, Gregory Farnum wrote: Are you saying you saw this problem more than once, and so you completely wiped

Re: [ceph-users] Performance problems

2013-04-16 Thread Olivier Bonvalet
On Friday 12 April 2013 at 19:45 +0200, Olivier Bonvalet wrote: On Friday 12 April 2013 at 10:04 -0500, Mark Nelson wrote: On 04/11/2013 07:25 PM, Ziemowit Pierzycki wrote: No, I'm not using RDMA in this configuration since this will eventually get deployed to production

Re: [ceph-users] RBD snapshots are not «readable», because of LVM ?

2013-04-16 Thread Olivier Bonvalet
-only by nature. Maybe LVM expects a writable PV. Use format 2 images and create snapshot children, and all should be good. On 04/15/2013 04:48 PM, Olivier Bonvalet wrote: Hi, I'm trying to map an RBD snapshot, which contains an LVM PV. I can do the «map»: rbd map hdd3copies

[ceph-users] Scrub shutdown the OSD process

2013-04-15 Thread Olivier Bonvalet
Hi, I have an OSD process which is regularly shut down by scrub, if I understand this trace correctly: 0 2013-04-15 09:29:53.708141 7f5a8e3cc700 -1 *** Caught signal (Aborted) ** in thread 7f5a8e3cc700 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55) 1:

[ceph-users] RBD snapshots are not «readable», because of LVM ?

2013-04-15 Thread Olivier Bonvalet
Hi, I'm trying to map an RBD snapshot, which contains an LVM PV. I can do the «map»: rbd map hdd3copies/jason@20130415-065314 --id alg Then pvscan works: pvscan | grep rbd PV /dev/rbd58 VG vg-jason lvm2 [19,94 GiB / 1,44 GiB free] But enabling the LV doesn't work: #
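Snapshots are mapped read-only, which is why LVM refuses to activate the VG; the follow-up in this thread suggests cloning the snapshot instead. A sketch (requires a format-2 image; the clone name is hypothetical):

  rbd snap protect hdd3copies/jason@20130415-065314
  rbd clone hdd3copies/jason@20130415-065314 hdd3copies/jason-restore
  rbd map hdd3copies/jason-restore --id alg
  vgchange -ay vg-jason      # the PV is now writable, so LVM can activate the LVs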

Re: [ceph-users] Scrub shutdown the OSD process

2013-04-15 Thread Olivier Bonvalet
On Monday 15 April 2013 at 10:57 -0700, Gregory Farnum wrote: On Mon, Apr 15, 2013 at 10:19 AM, Olivier Bonvalet <ceph.l...@daevel.fr> wrote: On Monday 15 April 2013 at 10:16 -0700, Gregory Farnum wrote: Are you saying you saw this problem more than once, and so you completely wiped

Re: [ceph-users] Number of ODS per host

2013-03-06 Thread Olivier Bonvalet
Hi, I think it depends on your total number of OSDs. Take my case: I have 8 OSDs per host, and 5 hosts. If one host crashes, I lose 20% of the cluster, and have a huge amount of data to rebalance. For fault tolerance, it was not a good idea. On Tuesday 5 March 2013 at 02:39 -0800, waed