Re: [ceph-users] help! pg inactive and slow requests after filestore to bluestore migration, version 12.2.12

2019-12-16 Thread Stefan Kooman
Quoting Jelle de Jong (jelledej...@powercraft.nl): > > It took three days to recover and during this time clients were not > responsive. > > How can I migrate to bluestore without inactive pgs or slow request. I got > several more filestore clusters and I would like to know how to migrate > witho

Re: [ceph-users] help! pg inactive and slow requests after filestore to bluestore migration, version 12.2.12

2019-12-12 Thread Bryan Stillwell
Jelle, Try putting just the WAL on the Optane NVMe. I'm guessing your DB is too big to fit within 5GB. We used a 5GB journal on our nodes as well, but when we switched to BlueStore (using ceph-volume lvm batch) it created 37GiB logical volumes (200GB SSD / 5 or 400GB SSD / 10) for our DBs. A
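The sizing Bryan describes falls out of simple arithmetic: ceph-volume lvm batch splits the fast device evenly among the OSDs it creates, and 200 GB / 5 (or 400 GB / 10) expressed in binary units is ~37 GiB. A quick illustration of that arithmetic (not ceph-volume's actual code):

```shell
# Even split of an SSD across OSD DB logical volumes, reported in whole GiB.
# Shows why a 200GB SSD / 5 OSDs yields the 37GiB LVs mentioned above.
db_lv_size_gib() {  # usage: db_lv_size_gib <ssd_bytes> <num_osds>
    echo $(( $1 / $2 / 1024 / 1024 / 1024 ))
}
db_lv_size_gib 200000000000 5    # 200GB SSD, 5 OSDs  -> 37
db_lv_size_gib 400000000000 10   # 400GB SSD, 10 OSDs -> 37
```

Either way, the resulting DB volume is far larger than the 5GB journal-sized partition, which is why the DB likely doesn't fit.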

[ceph-users] help! pg inactive and slow requests after filestore to bluestore migration, version 12.2.12

2019-12-12 Thread Jelle de Jong
Hello everybody, I got a three node ceph cluster made of E3-1220v3, 24GB RAM, 6 HDD OSDs with a 32GB Intel Optane NVMe journal, 10Gb networking. I wanted to move to bluestore due to the dropping of support for filestore; our cluster was working fine with filestore and we could take complete nodes out

[ceph-users] help! pg inactive and slow requests after filestore to bluestore migration, version 12.2.12

2019-12-06 Thread Jelle de Jong
Hello everybody, I got a three node ceph cluster made of E3-1220v3, 24GB RAM, 6 HDD OSDs with a 32GB Intel Optane NVMe journal, 10Gb networking. I wanted to move to bluestore due to the dropping of support for filestore; our cluster was working fine with filestore and we could tak

[ceph-users] help! pg inactive and slow requests after filestore to bluestore migration, version 12.2.12

2019-12-06 Thread Jelle de Jong
Hello everybody, I got a three node ceph cluster made of E3-1220v3, 24GB RAM, 6 HDD OSDs with a 32GB Intel Optane NVMe journal, 10Gb networking. I wanted to move to bluestore due to the dropping of support for filestore; our cluster was working fine with filestore and we could take complete nodes out

[ceph-users] Help on diag needed : heartbeat_failed

2019-11-26 Thread Vincent Godin
We encounter a strange behavior on our Mimic 13.2.6 cluster. At any time, and without any load, some OSDs become unreachable from only some hosts. It lasts 10 minutes and then the problem vanishes. It's not always the same OSDs and the same hosts. There is no network failure on any of the hosts (because onl

[ceph-users] Help with debug_osd logs

2019-11-12 Thread 陈旭
Hi guys, I deployed an EFK cluster and use Ceph as block storage in Kubernetes, but RBD write IOPS sometimes drops to zero and stays there for a few minutes. I set debug_osd to 20/20 and checked the OSD logs. I found that when IOPS drops to zero, there are logs like "get_health_metrics reporting 1 slow o

Re: [ceph-users] Help understanding EC object reads

2019-09-16 Thread Thomas Byrne - UKRI STFC
rnum > Sent: 09 September 2019 23:25 > To: Byrne, Thomas (STFC,RAL,SC) > Cc: ceph-users > Subject: Re: [ceph-users] Help understanding EC object reads > > On Thu, Aug 29, 2019 at 4:57 AM Thomas Byrne - UKRI STFC > wrote: > > > > Hi all, > > > > I’m investiga

Re: [ceph-users] Help understanding EC object reads

2019-09-09 Thread Gregory Farnum
On Thu, Aug 29, 2019 at 4:57 AM Thomas Byrne - UKRI STFC wrote: > > Hi all, > > I’m investigating an issue with our (non-Ceph) caching layers of our large EC > cluster. It seems to be turning users requests for whole objects into lots of > small byte range requests reaching the OSDs, but I’m not

[ceph-users] Help understanding EC object reads

2019-08-29 Thread Thomas Byrne - UKRI STFC
Hi all, I'm investigating an issue with our (non-Ceph) caching layers of our large EC cluster. It seems to be turning users requests for whole objects into lots of small byte range requests reaching the OSDs, but I'm not sure how inefficient this behaviour is in reality. My limited understandi

Re: [ceph-users] Help Ceph Cluster Down

2019-01-07 Thread Caspar Smit
Arun, This is what I already suggested in my first reply. Kind regards, Caspar On Sat, Jan 5, 2019 at 06:52, Arun POONIA < arun.poo...@nuagenetworks.net> wrote: > Hi Kevin, > > You are right. Increasing number of PGs per OSD resolved the issue. I will > probably add this config in /etc/ceph/ceph

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Arun POONIA
Hi Kevin, You are right. Increasing the number of PGs per OSD resolved the issue. I will probably add this config to the /etc/ceph/ceph.conf file of the ceph mons and OSDs so it applies on host boot. Thanks Arun On Fri, Jan 4, 2019 at 3:46 PM Kevin Olbrich wrote: > Hi Arun, > > actually deleting was no goo

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Kevin Olbrich
Hi Arun, actually deleting was no good idea; that's why I wrote that the OSDs should be "out". You have down PGs; that's because the data is on OSDs that are unavailable but known by the cluster. This can be checked by using "ceph pg 0.5 query" (change the PG name). Because your PG count is so much ove

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Arun POONIA
Hi Kevin, I tried deleting the newly added server from the Ceph cluster and it looks like Ceph is not recovering. I agree about unfound data, but it doesn't say anything about unfound data. It says inactive/down for PGs and I can't bring them up. [root@fre101 ~]# ceph health detail 2019-01-04 15:17:05.711641 7f27b0f3

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Kevin Olbrich
I don't think this will help you. Unfound means the cluster is unable to find the data anywhere (it's lost). It would be sufficient to shut down the new host - the OSDs will then be out. You can also force-heal the cluster, something like "do your best possible": ceph pg 2.5 mark_unfound_lost re

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Arun POONIA
Hi Kevin, Can I remove the newly added server from the cluster and see if it heals the cluster? When I check hard disk IOPS on the new server, they are very low compared to the existing cluster servers. Indeed this is a critical cluster, but I don't have the expertise to make it flawless. Thanks Arun On Fri, Jan 4, 20

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Kevin Olbrich
If you really created and destroyed OSDs before the cluster healed itself, this data will be permanently lost (not found / inactive). Also, your PG count is so much oversized that the calculation for peering will most likely break, because this was never tested. If this is a critical cluster, I would sta
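Kevin's "oversized" point can be sanity-checked with quick arithmetic: every PG costs one placement per replica, so the per-OSD load is the sum of pg_num × size over all pools, divided by the OSD count. The pool numbers below are made up for illustration:

```shell
# Per-OSD PG load, the figure behind warnings like
# "too many PGs per OSD (2968 > max 200)".
pgs_per_osd() {  # usage: pgs_per_osd <sum_of_pg_num_times_size> <num_osds>
    echo $(( $1 / $2 ))
}
# e.g. four hypothetical pools of 4096 PGs at replica size 3 on only 12 OSDs:
pgs_per_osd $(( 4 * 4096 * 3 )) 12   # -> 4096, far beyond the 200 default
```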

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Arun POONIA
Can anyone comment on this issue, please? I can't seem to bring my cluster back to healthy. On Fri, Jan 4, 2019 at 6:26 AM Arun POONIA wrote: > Hi Caspar, > > Number of IOPs are also quite low. It used be around 1K Plus on one of > Pool (VMs) now its like close to 10-30 . > > Thanks > Arun > > On Fri, Ja

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Arun POONIA
Hi Caspar, The number of IOPS is also quite low. It used to be around 1K plus on one of the pools (VMs); now it's close to 10-30. Thanks Arun On Fri, Jan 4, 2019 at 5:41 AM Arun POONIA wrote: > Hi Caspar, > > Yes and No, numbers are going up and down. If I run ceph -s command I can > see it decreases

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Arun POONIA
Hi Caspar, Yes and no, the numbers are going up and down. If I run the ceph -s command I can see it decrease one time and later increase again. I see there are so many blocked/slow requests. Almost all the OSDs have slow requests. Around 12% of PGs are inactive; not sure how to activate them again. [ro

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Caspar Smit
Are the numbers still decreasing? This one for instance: "3883 PGs pending on creation" Caspar On Fri, Jan 4, 2019 at 14:23, Arun POONIA < arun.poo...@nuagenetworks.net> wrote: > Hi Caspar, > > Yes, cluster was working fine with number of PGs per OSD warning up until > now. I am not sure how t

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Arun POONIA
Hi Caspar, Yes, the cluster was working fine with the number-of-PGs-per-OSD warning up until now. I am not sure how to recover from stale down/inactive PGs. If you happen to know about this, can you let me know? Current state: [root@fre101 ~]# ceph -s 2019-01-04 05:22:05.942349 7f314f613700 -1 asok(0x7f3

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Caspar Smit
Hi Arun, How did you end up with a 'working' cluster with so many PGs per OSD? "too many PGs per OSD (2968 > max 200)" To (temporarily) allow this many PGs per OSD you could try this: change these values in the global section of your ceph.conf: mon max pg per osd = 200 osd max pg per osd ha
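Caspar's suggestion as a ceph.conf sketch. Note that the second option name is cut off in the archive; "osd max pg per osd hard ratio" is the likely continuation, and the values below are illustrative (the limit must exceed the reported 2968 to silence the guardrail; this masks the oversized PG count rather than fixing it):

```ini
# Illustrative only: loosening the Luminous-era PG-per-OSD guardrails.
[global]
mon max pg per osd = 3000
osd max pg per osd hard ratio = 1.5
```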

Re: [ceph-users] Help Ceph Cluster Down

2019-01-03 Thread Arun POONIA
Hi Chris, Indeed that's what happened. I didn't set the noout flag either, and I zapped the disk on the new server every time. In my cluster status fre201 is the only new server. Current status after enabling 3 OSDs on host fre201: [root@fre201 ~]# ceph osd tree ID CLASS WEIGHT TYPE NAME STATUS REWE

Re: [ceph-users] Help Ceph Cluster Down

2019-01-03 Thread Chris
If you added OSDs and then deleted them repeatedly without waiting for replication to finish as the cluster attempted to re-balance across them, it's highly likely that you are permanently missing PGs (especially if the disks were zapped each time). If those 3 down OSDs can be revived there is

[ceph-users] Help Ceph Cluster Down

2019-01-03 Thread Arun POONIA
Hi, I recently tried adding a new node (OSD) to the Ceph cluster using the ceph-deploy tool. Since I was experimenting with the tool, I ended up deleting OSD nodes on the new server a couple of times. Now, since ceph OSDs are running on the new server, cluster PGs seem to be inactive (10-15%) and they are not recoveri

Re: [ceph-users] Help with setting device-class rule on pool without causing data to move

2019-01-03 Thread David C
Thanks, Sage! That did the trick. Wido, seems like an interesting approach but I wasn't brave enough to attempt it! Eric, I suppose this does the same thing that the crushtool reclassify feature does? Thank you both for your suggestions. For posterity: - I grabbed some 14.0.1 packages, extrac

Re: [ceph-users] Help with setting device-class rule on pool without causing data to move

2018-12-31 Thread Eric Goirand
Hi David, CERN has provided a python script to swap the correct bucket IDs (default <-> hdd); you can find it here: https://github.com/cernceph/ceph-scripts/blob/master/tools/device-class-id-swap.py The principle is the following: - extract the CRUSH map - run the script on it => it create
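The extract/edit/recompile cycle Eric describes can be sketched with the stock tools; the exact invocation of the CERN script is an assumption (check its README), and the resulting map should be diffed and tested offline before injecting it into a production cluster:

```shell
# Sketch: swap bucket IDs in the CRUSH map without triggering data movement.
ceph osd getcrushmap -o crush.bin            # extract the binary CRUSH map
crushtool -d crush.bin -o crush.txt          # decompile to editable text
# run the CERN script on the text map (invocation assumed, not verified):
python device-class-id-swap.py crush.txt > crush-new.txt
crushtool -c crush-new.txt -o crush-new.bin  # recompile
ceph osd setcrushmap -i crush-new.bin        # inject the adjusted map
```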

Re: [ceph-users] Help with setting device-class rule on pool without causing data to move

2018-12-30 Thread Sage Weil
On Sun, 30 Dec 2018, David C wrote: > Hi All > > I'm trying to set the existing pools in a Luminous cluster to use the hdd > device-class but without moving data around. If I just create a new rule > using the hdd class and set my pools to use that new rule it will cause a > huge amount of data mo

[ceph-users] Help with setting device-class rule on pool without causing data to move

2018-12-30 Thread David C
Hi All I'm trying to set the existing pools in a Luminous cluster to use the hdd device-class but without moving data around. If I just create a new rule using the hdd class and set my pools to use that new rule it will cause a huge amount of data movement even though the pgs are all already on HD

Re: [ceph-users] Help with crushmap

2018-12-02 Thread Vasiliy Tolstov
Sun, 2 Dec 2018, 20:38, Paul Emmerich paul.emmer...@croit.io: > 10 copies for a replicated setup seems... excessive. > I'm trying to create a golang package for a simple key-val store that uses a ceph crushmap to distribute data. For each namespace we attach a ceph crushmap rule. > _

Re: [ceph-users] Help with crushmap

2018-12-02 Thread Paul Emmerich
10 copies for a replicated setup seems... excessive. The rules are quite simple, for example rule 1 could be: take default choose firstn 5 type datacenter # picks 5 datacenters chooseleaf firstn 2 type host # 2 different hosts in each datacenter emit rule 2 is the same but type region and first
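Rule 1 from Paul's description, rendered in compiled-CRUSH-map text syntax (a sketch: the rule name and id are invented, and min_size/max_size follow the 10-copy requirement from the original question):

```
rule dc_5x2 {
    id 1
    type replicated
    min_size 10
    max_size 10
    step take default
    step choose firstn 5 type datacenter   # picks 5 datacenters
    step chooseleaf firstn 2 type host     # 2 different hosts in each
    step emit
}
```

Rule 2 would swap `datacenter` for `region` and the firstn counts accordingly, as Paul notes.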

[ceph-users] Help with crushmap

2018-12-02 Thread Vasiliy Tolstov
Hi, I need help with a crushmap. I have 3 regions - r1 r2 r3 - and 5 DCs - dc1 dc2 dc3 dc4 dc5: dc1 dc2 dc3 in r1, dc4 in r2, dc5 in r3. Each DC has 3 nodes with 2 disks. I need to have 3 rules: rule1 to have 2 copies on two nodes in each DC - 10 copies total, failure domain DC; rule2 to have 2 copies on two nodes

Re: [ceph-users] Help! OSDs across the cluster just crashed

2018-10-03 Thread Brett Chancellor
That turned out to be exactly the issue (and boy was it fun clearing PGs out on 71 OSDs). I think it's caused by a combination of two factors. 1. This cluster has way too many placement groups per OSD (just north of 800). It was fine when we first created all the pools, but upgrades (most recently t

Re: [ceph-users] Help! OSDs across the cluster just crashed

2018-10-03 Thread Gregory Farnum
Yeah, don't run these commands blind. They are changing the local metadata of the PG in ways that may make it inconsistent with the overall cluster and result in lost data. Brett, it seems this issue has come up several times in the field but we haven't been able to reproduce it locally or get eno

Re: [ceph-users] Help! OSDs across the cluster just crashed

2018-10-02 Thread Vasu Kulkarni
Can you file a tracker for your issues (http://tracker.ceph.com/projects/ceph/issues/new)? Email, once it's lengthy, is not great for tracking the issue. Ideally, full details of the environment (OS/Ceph versions, before/after, workload info, tool used for upgrade) are important if one has to recreate it. There a

Re: [ceph-users] Help! OSDs across the cluster just crashed

2018-10-02 Thread Goktug Yildirim
Hi, Sorry to hear that. I've been battling with mine for 2 weeks :/ I've corrected my OSDs with the following commands. My OSD logs (/var/log/ceph/ceph-OSDx.log) have a line including log(EER) with the PG number beside it, just before the crash dump. ceph-objectstore-tool --data-path /var/lib/ceph/os

[ceph-users] Help! OSDs across the cluster just crashed

2018-10-02 Thread Brett Chancellor
Help. I have a 60 node cluster and most of the OSDs decided to crash themselves at the same time. They won't restart; the messages look like... --- begin dump of recent events --- 0> 2018-10-02 21:19:16.990369 7f57ab5b7d80 -1 *** Caught signal (Aborted) ** in thread 7f57ab5b7d80 thread_name:c

Re: [ceph-users] help me turn off "many more objects that average"

2018-09-12 Thread Chad William Seys
Hi Paul, Yes, all monitors have been restarted. Chad. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] help me turn off "many more objects that average"

2018-09-12 Thread Paul Emmerich
Did you restart the mons or inject the option? Paul 2018-09-12 17:40 GMT+02:00 Chad William Seys : > Hi all, > I'm having trouble turning off the warning "1 pools have many more objects > per pg than average". > > I've tried a lot of variations on the below, my current ceph.conf: > > #... > [mo

[ceph-users] help me turn off "many more objects that average"

2018-09-12 Thread Chad William Seys
Hi all, I'm having trouble turning off the warning "1 pools have many more objects per pg than average". I've tried a lot of variations on the below, my current ceph.conf: #... [mon] #... mon_pg_warn_max_object_skew = 0 All of my monitors have been restarted. Seems like I'm missing someth
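For reference, the setting from the post as a ceph.conf fragment, with one caveat worth checking (a suggestion, not a confirmed fix): on Luminous and newer this warning is generated by ceph-mgr rather than the monitors, so a value set only under [mon] may be ignored; moving it to [global] (or the mgr's config section) and restarting the active mgr as well is worth a try.

```ini
# Silence "pools have many more objects per pg than average".
# A skew of 0 disables the check; set it where the mgr will read it.
[global]
mon_pg_warn_max_object_skew = 0
```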

Re: [ceph-users] help needed

2018-09-07 Thread Muhammad Junaid
>> > >> On Thu, Sep 6, 2018 at 4:50 PM Marc Roos > wrote: > >>> > >>> > >>> > >>> > >>> Do not use Samsung 850 PRO for journal > >>> Just use LSI logic HBA (eg. SAS2308) > >>> > >>> > >>> -

Re: [ceph-users] help needed

2018-09-06 Thread Darius Kasparavičius
> >>> >>> -----Original Message- >>> From: Muhammad Junaid [mailto:junaid.fsd...@gmail.com] >>> Sent: donderdag 6 september 2018 13:18 >>> To: ceph-users@lists.ceph.com >>> Subject: [ceph-users] help needed >>> >>> Hi

Re: [ceph-users] help needed

2018-09-06 Thread Nick Fisk
To: Muhammad Junaid Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] help needed The official ceph documentation recommendations for a db partition for a 4TB bluestore osd would be 160GB each. Samsung Evo Pro is not an Enterprise class SSD. A quick search of the ML will allow which

Re: [ceph-users] help needed

2018-09-06 Thread David Turner
HBA (eg. SAS2308) >> >> >> -Original Message- >> From: Muhammad Junaid [mailto:junaid.fsd...@gmail.com] >> Sent: donderdag 6 september 2018 13:18 >> To: ceph-users@lists.ceph.com >> Subject: [ceph-users] help needed >> >> Hi there >> >>

Re: [ceph-users] help needed

2018-09-06 Thread Muhammad Junaid
sung 850 PRO for journal > Just use LSI logic HBA (eg. SAS2308) > > > -Original Message- > From: Muhammad Junaid [mailto:junaid.fsd...@gmail.com] > Sent: donderdag 6 september 2018 13:18 > To: ceph-users@lists.ceph.com > Subject: [ceph-users] help needed > > Hi

Re: [ceph-users] help needed

2018-09-06 Thread Marc Roos
Do not use Samsung 850 PRO for journal Just use LSI logic HBA (eg. SAS2308) -Original Message- From: Muhammad Junaid [mailto:junaid.fsd...@gmail.com] Sent: donderdag 6 september 2018 13:18 To: ceph-users@lists.ceph.com Subject: [ceph-users] help needed Hi there Hope, every one

[ceph-users] help needed

2018-09-06 Thread Muhammad Junaid
Hi there Hope everyone is well. I need urgent help with a ceph cluster design. We are planning a 3 OSD node cluster in the beginning. Details are as under: Servers: 3 * DELL R720xd OS Drives: 2 2.5" SSD OSD Drives: 10 3.5" SAS 7200rpm 3/4 TB Journal Drives: 2 SSDs Samsung 850 PRO 256GB each

Re: [ceph-users] Help Basically..

2018-09-02 Thread David Turner
Agreed on not zapping the disks until your cluster is healthy again. Marking them out and seeing how healthy you can get in the meantime is a good idea. On Sun, Sep 2, 2018, 1:18 PM Ronny Aasen wrote: > On 02.09.2018 17:12, Lee wrote: > > Should I just out the OSD's first or completely zap them and

Re: [ceph-users] Help Basically..

2018-09-02 Thread Ronny Aasen
On 02.09.2018 17:12, Lee wrote: Should I just out the OSD's first or completely zap them and recreate? Or delete and let the cluster repair itself? On the second node when it started back up I had problems with the Journals for ID 5 and 7 they were also recreated all the rest are still the or

Re: [ceph-users] Help Basically..

2018-09-02 Thread Lee
Ok, rather than going gung-ho at this.. 1. I have set out 31,24,21,18,15,14,13,6 and 7,5 (10 is a new OSD), which gives me: ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY -1 23.65970 root default -5 8.18990 host data33-a4 13 0.90999 osd.13 up0

Re: [ceph-users] Help Basically..

2018-09-02 Thread Lee
Should I just out the OSD's first or completely zap them and recreate? Or delete and let the cluster repair itself? On the second node when it started back up I had problems with the Journals for ID 5 and 7 they were also recreated all the rest are still the originals. I know that some PG's are o

Re: [ceph-users] Help Basically..

2018-09-02 Thread David Turner
The problem is with never getting a successful run of `ceph-osd --flush-journal` on the old SSD journal drive. All of the OSDs that used the dead journal need to be removed from the cluster, wiped, and added back in. The data on them is not 100% consistent because the old journal died. Any word tha

Re: [ceph-users] Help Basically..

2018-09-02 Thread Lee
I followed: $ journal_uuid=$(sudo cat /var/lib/ceph/osd/ceph-0/journal_uuid) $ sudo sgdisk --new=1:0:+20480M --change-name=1:'ceph journal' --partition-guid=1:$journal_uuid --typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/sdk Then $ sudo ceph-osd --mkjournal -i 20 $ sudo serv
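Lee's steps, laid out one command per line for readability (the final start command is cut off at "serv" in the archive; the completion below is a guess based on the 0.94.x sysvinit style, not a verbatim quote):

```shell
# Recreate a lost journal partition and point the OSD at it.
# $journal_uuid preserves the UUID the OSD already expects; the typecode is
# the 'ceph journal' GPT type used by ceph-disk.
journal_uuid=$(sudo cat /var/lib/ceph/osd/ceph-0/journal_uuid)
sudo sgdisk --new=1:0:+20480M --change-name=1:'ceph journal' \
     --partition-guid=1:$journal_uuid \
     --typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/sdk
sudo ceph-osd --mkjournal -i 20      # write a fresh journal for osd.20
sudo service ceph start osd.20       # assumed completion of the cut-off line
```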

Re: [ceph-users] Help Basically..

2018-09-02 Thread Lee
> > > Hi David, > > Yes heath detail outputs all the errors etc and recovery / backfill is > going on, just taking time 25% misplaced and 1.5 degraded. > > I can list out the pools and see sizes etc.. > > My main problem is I have no client IO from a read perspective, I cannot > start vms I'm opens

Re: [ceph-users] Help Basically..

2018-09-02 Thread Lee
Hi David, Yes, health detail outputs all the errors etc. and recovery / backfill is going on, just taking time: 25% misplaced and 1.5% degraded. I can list out the pools and see sizes etc. My main problem is I have no client IO from a read perspective; I cannot start VMs in OpenStack, and ceph -w st

Re: [ceph-users] Help Basically..

2018-09-02 Thread David Turner
When the first node went offline with a dead SSD journal, all of the data on the OSDs was useless. Unless you could flush the journals, you can't guarantee that a write the cluster thinks happened actually made it to the disk. The proper procedure here is to remove those OSDs and add them again as

Re: [ceph-users] Help Basically..

2018-09-02 Thread David C
Does "ceph health detail" work? Have you manually confirmed the OSDs on the nodes are working? What was the replica size of the pools? Are you seeing any progress with the recovery? On Sun, Sep 2, 2018 at 9:42 AM Lee wrote: > Running 0.94.5 as part of a Openstack enviroment, our ceph setup is

[ceph-users] Help Basically..

2018-09-02 Thread Lee
Running 0.94.5 as part of an OpenStack environment; our ceph setup is 3x OSD nodes, 3x MON nodes. Yesterday we had an aircon outage in our hosting environment: 1 OSD node failed (offline with the journal SSD dead), leaving 2 nodes running correctly; 2 hours later a second OSD node failed complaining

Re: [ceph-users] Help needed for debugging slow_requests

2018-08-15 Thread Konstantin Shalygin
Now here's the thing: Some weeks ago Proxmox upgraded from kernel 4.13 to 4.15. Since then I'm getting slow requests that cause blocked IO inside the VMs that are running on the cluster (but not necessarily on the host with the OSD causing the slow request). If I boot back into 4.13 then Ceph

[ceph-users] Help needed for debugging slow_requests

2018-08-13 Thread Uwe Sauter
Dear community, TL;DR: The cluster runs well with kernel 4.13 but produces slow_requests with kernel 4.15. How to debug? I'm running a combined Ceph / KVM cluster consisting of 6 hosts of 2 different kinds (details at the end). The main difference between those hosts is CPU generation (Westmere /

Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")

2018-07-30 Thread Jake Grimmett
Hi All, there might be a problem on Scientific Linux 7.5 too: after upgrading directly from 12.2.5 to 13.2.1 [root@cephr01 ~]# ceph-detect-init Traceback (most recent call last): File "/usr/bin/ceph-detect-init", line 9, in load_entry_point('ceph-detect-init==1.0.1', 'console_scripts',

Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")

2018-07-30 Thread Kenneth Waegeman
I'll just give it a test :) On 30/07/18 10:54, Nathan Cutler wrote: for all others on this list, it might also be helpful to know which setups are likely affected. Does this only occur for Filestore disks, i.e. if ceph-volume has taken over taking care of these? Does it happen on every RHEL 7.

Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")

2018-07-30 Thread Nathan Cutler
for all others on this list, it might also be helpful to know which setups are likely affected. Does this only occur for Filestore disks, i.e. if ceph-volume has taken over taking care of these? Does it happen on every RHEL 7.5 system? It affects all OSDs managed by ceph-disk on all RHEL syste

Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")

2018-07-30 Thread Oliver Freyermuth
e=release) > ceph_detect_init.exc.UnsupportedPlatform: Platform is not supported.: rhel > 7.5 > > > Gesendet: Sonntag, 29. Juli 2018 um 20:33 Uhr > Von: "Nathan Cutler" > An: ceph.nov...@habmalnefrage.de, "Vasu Kulkarni" > Cc: ceph-users , "Ceph D

Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")

2018-07-30 Thread ceph . novice
atform: Platform is not supported.: rhel 7.5 Gesendet: Sonntag, 29. Juli 2018 um 20:33 Uhr Von: "Nathan Cutler" An: ceph.nov...@habmalnefrage.de, "Vasu Kulkarni" Cc: ceph-users , "Ceph Development" Betreff: Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2

Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")

2018-07-29 Thread Nathan Cutler
303 Nathan On 07/29/2018 11:16 AM, ceph.nov...@habmalnefrage.de wrote: > Gesendet: Sonntag, 29. Juli 2018 um 03:15 Uhr Von: "Vasu Kulkarni" An: ceph.nov...@habmalnefrage.de Cc: "Sage Weil" , ceph-users , "Ceph Development" Betreff: Re: [ceph-users] HELP

Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")

2018-07-29 Thread ceph . novice
age Weil" , ceph-users , "Ceph Development" Betreff: Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released") On Sat, Jul 28, 2018 at 6:02 PM, wrote: > Have you guys changed something with the systemctl startup of the OSDs? I think there is some ki

Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")

2018-07-28 Thread Vasu Kulkarni
/ 8.8 TiB avail > pgs: 1390 active+clean > > io: > client: 11 KiB/s rd, 10 op/s rd, 0 op/s wr > > Any hints? > > -- > > > Gesendet: Samstag, 28. Juli 2018 um 23:35 Uhr > Von: ceph

Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")

2018-07-28 Thread ceph . novice
rage.de An: "Sage Weil" Cc: ceph-users@lists.ceph.com, ceph-de...@vger.kernel.org Betreff: Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released") Hi Sage. Sure. Any specific OSD(s) log(s)? Or just any? Gesendet: Samstag, 28. Juli 2018 um 16:49 Uhr Von: "Sage

Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")

2018-07-28 Thread ceph . novice
Hi Sage. Sure. Any specific OSD(s) log(s)? Or just any? Gesendet: Samstag, 28. Juli 2018 um 16:49 Uhr Von: "Sage Weil" An: ceph.nov...@habmalnefrage.de, ceph-users@lists.ceph.com, ceph-de...@vger.kernel.org Betreff: Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic r

[ceph-users] Help needed to recover from cache tier OSD crash

2018-07-28 Thread Dmitry
Hello all, would someone please help with recovering from a recent failure of all cache tier pool OSDs? My Ceph cluster has a usual replica-2 pool with a writeback cache tier of two 500GB SSD OSDs over it (also replica 2). Both cache OSDs were created with the standard ceph-deploy tool, and have 2

Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")

2018-07-28 Thread Sage Weil
Can you include more of your osd log file? On July 28, 2018 9:46:16 AM CDT, ceph.nov...@habmalnefrage.de wrote: >Dear users and developers. >  >I've updated our dev-cluster from v13.2.0 to v13.2.1 yesterday and >since then everything is badly broken. >I've restarted all Ceph components via "system

[ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")

2018-07-28 Thread ceph . novice
Dear users and developers.   I've updated our dev-cluster from v13.2.0 to v13.2.1 yesterday and since then everything is badly broken. I've restarted all Ceph components via "systemctl" and also rebooted the servers SDS21 and SDS24; nothing changes. This cluster started as Kraken, was updated to

Re: [ceph-users] Help! Luminous 12.2.5 CephFS - MDS crashed and now won't start (failing at MDCache::add_inode)

2018-06-25 Thread Linh Vu
s on behalf of Linh Vu Sent: Monday, 25 June 2018 7:06:45 PM To: ceph-users Subject: [ceph-users] Help! Luminous 12.2.5 CephFS - MDS crashed and now won't start (failing at MDCache::add_inode) Hi all, We have a Luminous 12.2.5 cluster, running entirely just CephFS with 1 active and 1 s

[ceph-users] Help! Luminous 12.2.5 CephFS - MDS crashed and now won't start (failing at MDCache::add_inode)

2018-06-25 Thread Linh Vu
Hi all, We have a Luminous 12.2.5 cluster, running entirely just CephFS with 1 active and 1 standby MDS. The active MDS crashed and now won't start again with this same error: ### 0> 2018-06-25 16:11:21.136203 7f01c2749700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_6

Re: [ceph-users] Help/advice with crush rules

2018-05-21 Thread Gregory Farnum
On Mon, May 21, 2018 at 11:19 AM Andras Pataki < apat...@flatironinstitute.org> wrote: > Hi Greg, > > Thanks for the detailed explanation - the examples make a lot of sense. > > One followup question regarding a two level crush rule like: > > > step take default > step choose 3 type=rack > step ch

Re: [ceph-users] Help/advice with crush rules

2018-05-21 Thread Andras Pataki
Hi Greg, Thanks for the detailed explanation - the examples make a lot of sense. One followup question regarding a two level crush rule like: step take default step choose 3 type=rack step chooseleaf 3 type=host step emit If the erasure code has 9 chunks, this lines up exactly without any pro
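Andras' two-level rule in CRUSH text syntax, for concreteness (a sketch: the rule name and id are invented, and `indep` is the usual choose mode for erasure-coded rules, though the quoted mail writes the steps generically). Three racks times three hosts lines up exactly with a k+m = 9 erasure code, one chunk per chosen host:

```
rule ec_3rack_3host {
    id 2
    type erasure
    min_size 9
    max_size 9
    step take default
    step choose indep 3 type rack       # 3 racks
    step chooseleaf indep 3 type host   # 3 hosts in each rack -> 9 chunks
    step emit
}
```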

Re: [ceph-users] Help/advice with crush rules

2018-05-18 Thread Gregory Farnum
On Thu, May 17, 2018 at 9:05 AM Andras Pataki wrote: > I've been trying to wrap my head around crush rules, and I need some > help/advice. I'm thinking of using erasure coding instead of > replication, and trying to understand the possibilities for planning for > failure cases. > > For a simplif

[ceph-users] Help/advice with crush rules

2018-05-17 Thread Andras Pataki
I've been trying to wrap my head around crush rules, and I need some help/advice.  I'm thinking of using erasure coding instead of replication, and trying to understand the possibilities for planning for failure cases. For a simplified example, consider a 2 level topology, OSDs live on hosts,

Re: [ceph-users] Help Configuring Replication

2018-04-23 Thread Christopher Meadors
That seems to have worked. Thanks much! And yes, I realize my setup is less than ideal, but I'm planning on migrating from another storage system, and this is the hardware I have to work with. I'll definitely keep your recommendations in mind when I start to grow the cluster. On 04/23/2018 1

Re: [ceph-users] Help Configuring Replication

2018-04-23 Thread Paul Emmerich
Hi, this doesn't sound like a good idea: two hosts is usually a poor configuration for Ceph. Also, fewer disks on more servers is typically better than lots of disks in few servers. But to answer your question: you could use a crush rule like this: min_size 4 max_size 4 step take default step ch
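A sketch of the rule Paul outlines for 4 copies split 2-and-2 across the two hosts (the quoted mail is cut off after "step ch"; the continuation below is the usual way to finish such a rule, not a verbatim quote, and the name and id are invented):

```
rule replicated_2x2 {
    id 1
    type replicated
    min_size 4
    max_size 4
    step take default
    step choose firstn 2 type host      # both hosts
    step chooseleaf firstn 2 type osd   # 2 OSDs on each host
    step emit
}
```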

[ceph-users] Help Configuring Replication

2018-04-23 Thread Christopher Meadors
I'm starting to get a small Ceph cluster running. I'm to the point where I've created a pool, and stored some test data in it, but I'm having trouble configuring the level of replication that I want. The goal is to have two OSD host nodes, each with 20 OSDs. The target replication will be: o

Re: [ceph-users] Help with Bluestore WAL

2018-02-21 Thread David Turner
The WAL is a required part of the OSD. If you remove it, then the OSD is missing a crucial part of itself and will be unable to start until the WAL is back online. If the SSD were to fail, then all OSDs using it would need to be removed and recreated on the cluster. On Tue, Feb 20, 2018,

Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-21 Thread David Turner
ter" from Ceph Days Germany earlier this month for > other things to watch out for: > > > > https://ceph.com/cephdays/germany/ > > > > Bryan > > > > *From: *ceph-users on behalf of Bryan > Banister > *Date: *Tuesday, February 20, 2018 at 2:53 PM

Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-21 Thread Bryan Stillwell
er this month for other things to watch out for: https://ceph.com/cephdays/germany/ Bryan From: ceph-users on behalf of Bryan Banister Date: Tuesday, February 20, 2018 at 2:53 PM To: David Turner Cc: Ceph Users Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2 HI David [

Re: [ceph-users] Help with Bluestore WAL

2018-02-20 Thread Konstantin Shalygin
Hi, We were recently testing Luminous with BlueStore. We have a 6 node cluster with 12 HDDs and 1 SSD each; we used ceph-volume with LVM to create all the OSDs and attached them to SSD WALs (LVM). We created individual 10GBx12 LVs on the single SSD, one for each WAL, so all the OSD WALs are on the single SSD.

Re: [ceph-users] Help with Bluestore WAL

2018-02-20 Thread Balakumar Munusawmy
Hi, We were recently testing Luminous with BlueStore. We have a 6 node cluster with 12 HDDs and 1 SSD each; we used ceph-volume with LVM to create all the OSDs and attached them to SSD WALs (LVM). We created individual 10GBx12 LVs on the single SSD, one for each WAL, so all the OSD WALs are on the single SSD. P
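Carving per-OSD WAL logical volumes out of one SSD can be sketched like this (the volume group name, device paths, and sizes are assumptions, mirroring the 10GBx12 layout described above):

```
pvcreate /dev/sdm                       # the shared SSD
vgcreate ceph-wal /dev/sdm
for i in $(seq 0 11); do
    lvcreate -L 10G -n wal-$i ceph-wal  # one 10 GiB WAL LV per OSD
done
# Attach one LV per OSD at creation time, e.g. for the first HDD:
ceph-volume lvm create --data /dev/sda --block.wal ceph-wal/wal-0
```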

Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-20 Thread Bryan Banister
.com>> Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2 Note: External Email That sounds like a good next step. Start with OSDs involved in the longest blocked requests. Wait a couple minutes after the osd marks itself back up an
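Marking an OSD down (not out) makes it re-assert itself: it immediately tells the monitors it is still alive, re-peers its PGs, and requests blocked on stale peering state often clear. A minimal sketch, assuming osd.7 holds the oldest blocked request (the id is hypothetical), run against a live cluster:

```
ceph health detail | grep -i blocked   # find the OSDs behind the longest blocked requests
ceph osd down 7                        # force osd.7 to re-assert and re-peer
ceph -w                                # watch it come back before touching the next one
```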

Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-16 Thread David Turner
marking OSDs with stuck requests down to see if that > will re-assert them? > > > > Thanks!! > > -Bryan > > > > *From:* David Turner [mailto:drakonst...@gmail.com] > *Sent:* Friday, February 16, 2018 2:51 PM > > > *To:* Bryan Banister > *Cc:* Bryan Stillwe

Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-16 Thread Bryan Banister
Cc: Bryan Stillwell ; Janne Johansson ; Ceph Users Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2 Note: External Email The questions I definitely know the answer to first, and then we'll continue from there. If an OSD is blocking peerin

Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-16 Thread David Turner
"scrubber.seed": 0, > > "scrubber.waiting_on": 0, > > "scrubber.waiting_on_whom": [] > > } > > }, > > { > > "name": "Started", > >

Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-16 Thread Bryan Banister
": "Started", "enter_time": "2018-02-13 14:33:17.491148" } ], Sorry for all the hand holding, but how do I determine if I need to set an OSD as ‘down’ to fix the issues, and how does it go about re-asserting itself? I again tried lo

Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-16 Thread David Turner
00 > > > > At this point we do not know to proceed with recovery efforts. I tried > looking at the ceph docs and mail list archives but wasn’t able to > determine the right path forward here. > > > > Any help is appreciated, > > -Bryan > > > > >

Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-16 Thread Bryan Banister
@godaddy.com] Sent: Tuesday, February 13, 2018 2:27 PM To: Bryan Banister ; Janne Johansson Cc: Ceph Users Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2 Note: External Email It may work fine, but I would suggest limiting the number of ope

Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-13 Thread Bryan Stillwell
It may work fine, but I would suggest limiting the number of operations going on at the same time. Bryan From: Bryan Banister Date: Tuesday, February 13, 2018 at 1:16 PM To: Bryan Stillwell , Janne Johansson Cc: Ceph Users Subject: RE: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-13 Thread Bryan Banister
y.com] Sent: Tuesday, February 13, 2018 12:43 PM To: Bryan Banister ; Janne Johansson Cc: Ceph Users Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2 Note: External Email - Bryan, Based off the information you've provided

Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-13 Thread Bryan Stillwell
print $1 "\t" $7 }' |sort -n -k2 You'll see that within a pool the PG sizes are fairly close to the same size, but in your cluster the PGs are fairly large (~200GB would be my guess). Bryan From: ceph-users on behalf of Bryan Banister Date: Monday, February 12, 2018 at 2:19
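The truncated pipeline above is tallying per-PG sizes out of `ceph pg dump`. A self-contained sketch of the same awk/sort idea against synthetic input (the column positions and byte values are assumptions; on a real cluster you would feed it `ceph pg dump` instead):

```shell
# Synthetic stand-in for `ceph pg dump` output: PG id in column 1, bytes in column 7.
cat > /tmp/pg_sample.txt <<'EOF'
1.0 100 0 0 0 0 214748364800
1.1 100 0 0 0 0 107374182400
2.0 100 0 0 0 0 322122547200
EOF
# Print "pgid <TAB> bytes" and sort numerically by size; the fattest PGs land at the bottom.
awk '{ print $1 "\t" $7 }' /tmp/pg_sample.txt | sort -n -k2
```

Within one pool the sizes should cluster tightly; a wide spread (or very large PGs, ~200GB as guessed above) is the sign that more PGs are needed.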

Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-12 Thread Bryan Banister
[mailto:icepic...@gmail.com] Sent: Wednesday, January 31, 2018 9:34 AM To: Bryan Banister Cc: Ceph Users Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2 Note: External Email 2018-01-31 15:58 GMT+01:00 Bryan Banister mailto:bbanis...@jumptrading.com

Re: [ceph-users] Help ! how to recover from total monitor failure in lumnious

2018-02-02 Thread Frank Li
Thanks, I’m downloading it right now -- Efficiency is Intelligent Laziness From: "ceph.nov...@habmalnefrage.de" Date: Friday, February 2, 2018 at 12:37 PM To: "ceph.nov...@habmalnefrage.de" Cc: Frank Li , "ceph-users@lists.ceph.com" Subject: Aw: Re: [ceph-use
