Quoting Jelle de Jong (jelledej...@powercraft.nl):
>
> It took three days to recover and during this time clients were not
> responsive.
>
> How can I migrate to bluestore without inactive pgs or slow requests? I got
> several more filestore clusters and I would like to know how to migrate
> witho
Jelle,
Try putting just the WAL on the Optane NVMe. I'm guessing your DB is too big
to fit within 5GB. We used a 5GB journal on our nodes as well, but when we
switched to BlueStore (using ceph-volume lvm batch) it created 37GiB logical
volumes (200GB SSD / 5 or 400GB SSD / 10) for our DBs.
A
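For reference, a rough sketch of what that WAL-only layout could look like with ceph-volume lvm batch; the device names are made up and the flags are the Nautilus-era ones, so check "ceph-volume lvm batch --help" on your version first:

# dry run: 6 HDD OSDs, DB left on each HDD, only the WAL carved from the Optane
ceph-volume lvm batch --report --bluestore \
    /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf \
    --wal-devices /dev/nvme0n1
# drop --report (and confirm the plan) to actually create the OSDs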
Hello everybody,
I've got a three-node ceph cluster made of E3-1220v3, 24GB RAM, 6 HDD OSDs
with a 32GB Intel Optane NVMe journal, and 10Gb networking.
I wanted to move to bluestore due to the dropping of filestore support; our
cluster was working fine with filestore and we could take complete nodes
out
Hello everybody,
[fix confusing typo]
I've got a three-node ceph cluster made of E3-1220v3, 24GB RAM, 6 HDD OSDs
with a 32GB Intel Optane NVMe journal, and 10Gb networking.
I wanted to move to bluestore due to the dropping of filestore support; our
cluster was working fine with filestore and we could tak
Hello everybody,
I've got a three-node ceph cluster made of E3-1220v3, 24GB RAM, 6 HDD OSDs
with a 32GB Intel Optane NVMe journal, and 10Gb networking.
I wanted to move to bluestore due to the dropping of file store support; our
cluster was working fine with bluestore and we could take complete nodes
out
We encounter a strange behavior on our Mimic 13.2.6 cluster. At any
time, and without any load, some OSDs become unreachable from only
some hosts. It lasts 10 minutes and then the problem vanishes.
It's not always the same OSDs or the same hosts. There is no network
failure on any of the hosts (because onl
Hi guys, I deployed an EFK cluster and use Ceph as block storage in
Kubernetes, but RBD write IOPS sometimes drop to zero and stay there for a few
minutes. I set debug_osd to 20/20 and checked the OSD logs.
I found that when IOPS drop to zero, there are logs like "get_health_metrics reporting
1 slow o
rnum
> Sent: 09 September 2019 23:25
> To: Byrne, Thomas (STFC,RAL,SC)
> Cc: ceph-users
> Subject: Re: [ceph-users] Help understanding EC object reads
>
> On Thu, Aug 29, 2019 at 4:57 AM Thomas Byrne - UKRI STFC
> wrote:
> >
> > Hi all,
> >
> > I’m investiga
On Thu, Aug 29, 2019 at 4:57 AM Thomas Byrne - UKRI STFC
wrote:
>
> Hi all,
>
> > I’m investigating an issue with the (non-Ceph) caching layers of our large EC
> > cluster. It seems to be turning users' requests for whole objects into lots of
> > small byte range requests reaching the OSDs, but I’m not
Hi all,
I'm investigating an issue with the (non-Ceph) caching layers of our large EC
cluster. It seems to be turning users' requests for whole objects into lots of
small byte-range requests reaching the OSDs, but I'm not sure how inefficient
this behaviour is in reality.
My limited understandi
Arun,
This is what I already suggested in my first reply.
Kind regards,
Caspar
On Sat, 5 Jan 2019 at 06:52, Arun POONIA <
arun.poo...@nuagenetworks.net> wrote:
> Hi Kevin,
>
> You are right. Increasing number of PGs per OSD resolved the issue. I will
> probably add this config in /etc/ceph/ceph
Hi Kevin,
You are right. Increasing number of PGs per OSD resolved the issue. I will
probably add this config in /etc/ceph/ceph.conf file of ceph mon and OSDs
so it applies on host boot.
Thanks
Arun
On Fri, Jan 4, 2019 at 3:46 PM Kevin Olbrich wrote:
> Hi Arun,
>
> actually deleting was no goo
Hi Arun,
Actually, deleting was not a good idea; that's why I wrote that the OSDs
should be "out".
You have down PGs because the data is on OSDs that are
unavailable but known by the cluster.
This can be checked by using "ceph pg 0.5 query" (change PG name).
Because your PG count is so much ove
Hi Kevin,
I tried deleting the newly added server from the Ceph cluster and it looks like
Ceph is not recovering. I agree about unfound data, but it doesn't report
unfound data; it says inactive/down for PGs and I can't bring them up.
[root@fre101 ~]# ceph health detail
2019-01-04 15:17:05.711641 7f27b0f3
I don't think this will help you. Unfound means, the cluster is unable
to find the data anywhere (it's lost).
It would be sufficient to shut down the new host - the OSDs will then be out.
You can also force-heal the cluster, something like "do your best possible":
ceph pg 2.5 mark_unfound_lost re
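For anyone searching the archives later, a quick sketch of the commands being referred to here (the PG id 2.5 is just a placeholder):

ceph health detail                    # shows which PGs have unfound objects
ceph pg 2.5 query                     # peering state, which OSDs are being probed
ceph pg 2.5 list_unfound              # the objects the cluster cannot locate
ceph pg 2.5 mark_unfound_lost revert  # or "delete" -- the "do your best possible" step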
Hi Kevin,
Can I remove the newly added server from the cluster and see if it heals?
When I check the hard disk IOPS on the new server, they are very low compared
to the existing cluster servers.
Indeed this is a critical cluster but I don't have expertise to make it
flawless.
Thanks
Arun
On Fri, Jan 4, 20
If you really created and destroyed OSDs before the cluster healed
itself, this data will be permanently lost (not found / inactive).
Also, your PG count is so oversized that the calculation for peering
will most likely break, because this was never tested.
If this is a critical cluster, I would sta
Can anyone comment on this issue please? I can't seem to bring my cluster
back to healthy.
On Fri, Jan 4, 2019 at 6:26 AM Arun POONIA
wrote:
> Hi Caspar,
>
> The number of IOPS is also quite low. It used to be around 1K+ on one of
> the pools (VMs); now it's close to 10-30.
>
> Thanks
> Arun
>
> On Fri, Ja
Hi Caspar,
The number of IOPS is also quite low. It used to be around 1K+ on one of the
pools (VMs); now it's close to 10-30.
Thanks
Arun
On Fri, Jan 4, 2019 at 5:41 AM Arun POONIA
wrote:
> Hi Caspar,
>
> Yes and No, numbers are going up and down. If I run ceph -s command I can
> see it decreases
Hi Caspar,
Yes and no, the numbers are going up and down. If I run the ceph -s command I
can see them decrease one time and increase again later. I see there are so
many blocked/slow requests; almost all the OSDs have slow requests. Around
12% of PGs are inactive and I'm not sure how to activate them again.
[ro
Are the numbers still decreasing?
This one for instance:
"3883 PGs pending on creation"
Caspar
On Fri, 4 Jan 2019 at 14:23, Arun POONIA <
arun.poo...@nuagenetworks.net> wrote:
> Hi Caspar,
>
> Yes, cluster was working fine with number of PGs per OSD warning up until
> now. I am not sure how t
Hi Caspar,
Yes, the cluster was working fine with the PGs-per-OSD warning up until
now. I am not sure how to recover from stale down/inactive PGs. If you
happen to know about this, can you let me know?
Current State:
[root@fre101 ~]# ceph -s
2019-01-04 05:22:05.942349 7f314f613700 -1 asok(0x7f3
Hi Arun,
How did you end up with a 'working' cluster with so many pgs per OSD?
"too many PGs per OSD (2968 > max 200)"
To (temporarily) allow this many PGs per OSD you could try this:
Change these values in the global section in your ceph.conf:
mon max pg per osd = 200
osd max pg per osd ha
Hi Chris,
Indeed that's what happened. I didn't set the noout flag either, and I
zapped the disks on the new server every time. In my cluster status, fre201
is the only new server.
Current Status after enabling 3 OSDs on fre201 host.
[root@fre201 ~]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWE
If you added OSDs and then deleted them repeatedly without waiting for
replication to finish as the cluster attempted to re-balance across them,
it's highly likely that you are permanently missing PGs (especially if the
disks were zapped each time).
If those 3 down OSDs can be revived there is
Hi,
Recently I tried adding a new node (OSD) to the Ceph cluster using the
ceph-deploy tool. I was experimenting with the tool and ended up deleting the
OSDs on the new server a couple of times.
Now that the Ceph OSDs are running on the new server, cluster PGs seem to be
inactive (10-15%) and they are not recoveri
Thanks, Sage! That did the trick.
Wido, seems like an interesting approach but I wasn't brave enough to
attempt it!
Eric, I suppose this does the same thing that the crushtool reclassify
feature does?
Thank you both for your suggestions.
For posterity:
- I grabbed some 14.0.1 packages, extrac
Hi David,
CERN has provided a Python script to swap the correct bucket IDs
(default <-> hdd); you can find it here:
https://github.com/cernceph/ceph-scripts/blob/master/tools/device-class-id-swap.py
The principle is the following:
- extract the CRUSH map
- run the script on it => it create
On Sun, 30 Dec 2018, David C wrote:
> Hi All
>
> I'm trying to set the existing pools in a Luminous cluster to use the hdd
> device-class but without moving data around. If I just create a new rule
> using the hdd class and set my pools to use that new rule it will cause a
> huge amount of data mo
Hi All
I'm trying to set the existing pools in a Luminous cluster to use the hdd
device-class but without moving data around. If I just create a new rule
using the hdd class and set my pools to use that new rule it will cause a
huge amount of data movement even though the pgs are all already on HD
Sun, 2 Dec 2018, 20:38, Paul Emmerich paul.emmer...@croit.io:
> 10 copies for a replicated setup seems... excessive.
>
I'm trying to create a golang package for a simple key-value store that uses
a ceph crushmap to distribute data.
For each namespace a ceph crushmap rule is attached.
>
10 copies for a replicated setup seems... excessive.
The rules are quite simple, for example rule 1 could be:
take default
choose firstn 5 type datacenter # picks 5 datacenters
chooseleaf firstn 2 type host # 2 different hosts in each datacenter
emit
rule 2 is the same but type region and first
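Written out in full crushmap syntax, rule 1 would look roughly like this (the rule id and name are arbitrary, not from an actual map):

rule replicated_5dc {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step choose firstn 5 type datacenter   # picks 5 datacenters
    step chooseleaf firstn 2 type host     # 2 different hosts in each datacenter
    step emit
}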
Hi, I need help with a crushmap.
I have
3 regions - r1 r2 r3
5 dc - dc1 dc2 dc3 dc4 dc5
dc1 dc2 dc3 in r1
dc4 in r2
dc5 in r3
Each dc has 3 nodes with 2 disks
I need to have 3 rules
rule1 to have 2 copies on two nodes in each dc - 10 copies total failure
domain dc
rule2 to have 2 copies on two nodes
That turned out to be exactly the issue (And boy was it fun clearing pgs
out on 71 OSDs). I think it's caused by a combination of two factors.
1. This cluster has way too many placement groups per OSD (just north of
800). It was fine when we first created all the pools, but upgrades (most
recently t
Yeah, don't run these commands blind. They are changing the local metadata
of the PG in ways that may make it inconsistent with the overall cluster
and result in lost data.
Brett, it seems this issue has come up several times in the field but we
haven't been able to reproduce it locally or get eno
Can you file a tracker for your
issue (http://tracker.ceph.com/projects/ceph/issues/new)? Email, once it
gets lengthy, is not great for tracking an issue. Ideally, full details of the
environment (os/ceph versions, before/after, workload info, tool used
for upgrade) are important if one has to recreate it. There a
Hi,
Sorry to hear that. I’ve been battling with mine for 2 weeks :/
I've corrected my OSDs with the following commands. My OSD logs
(/var/log/ceph/ceph-OSDx.log) have a line including log(ERR) with the PG number
alongside it, just before the crash dump.
ceph-objectstore-tool --data-path /var/lib/ceph/os
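That kind of intervention looks roughly like the following (the OSD and PG ids are placeholders; stop the OSD first and export the PG before removing it):

systemctl stop ceph-osd@12
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
    --pgid 1.2a --op export --file /root/pg.1.2a.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
    --pgid 1.2a --op remove --force
systemctl start ceph-osd@12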
Help. I have a 60 node cluster and most of the OSDs decided to crash
themselves at the same time. They won't restart; the messages look like...
--- begin dump of recent events ---
0> 2018-10-02 21:19:16.990369 7f57ab5b7d80 -1 *** Caught signal
(Aborted) **
in thread 7f57ab5b7d80 thread_name:c
Hi Paul,
Yes, all monitors have been restarted.
Chad.
Did you restart the mons or inject the option?
Paul
2018-09-12 17:40 GMT+02:00 Chad William Seys :
> Hi all,
> I'm having trouble turning off the warning "1 pools have many more objects
> per pg than average".
>
> I've tried a lot of variations on the below, my current ceph.conf:
>
> #...
> [mo
Hi all,
I'm having trouble turning off the warning "1 pools have many more
objects per pg than average".
I've tried a lot of variations on the below, my current ceph.conf:
#...
[mon]
#...
mon_pg_warn_max_object_skew = 0
All of my monitors have been restarted.
Seems like I'm missing someth
> >> On Thu, Sep 6, 2018 at 4:50 PM Marc Roos wrote:
> >>>
> >>> Do not use Samsung 850 PRO for journal
> >>> Just use LSI logic HBA (eg. SAS2308)
>>>
>>> -----Original Message-----
>>> From: Muhammad Junaid [mailto:junaid.fsd...@gmail.com]
>>> Sent: Thursday, 6 September 2018 13:18
>>> To: ceph-users@lists.ceph.com
>>> Subject: [ceph-users] help needed
>>>
>>> Hi
To: Muhammad Junaid
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] help needed
The official ceph documentation recommendations for a db partition for a 4TB
bluestore osd would be 160GB each.
Samsung Evo Pro is not an Enterprise class SSD. A quick search of the ML will
allow which
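(For what it's worth, that 160GB figure lines up with the old sizing guideline of roughly 4% of the data device: 4 TB x 0.04 = 160 GB.)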
HBA (eg. SAS2308)
>>
>> -----Original Message-----
>> From: Muhammad Junaid [mailto:junaid.fsd...@gmail.com]
>> Sent: Thursday, 6 September 2018 13:18
>> To: ceph-users@lists.ceph.com
>> Subject: [ceph-users] help needed
>>
>> Hi there
sung 850 PRO for journal
> Just use LSI logic HBA (eg. SAS2308)
>
> -----Original Message-----
> From: Muhammad Junaid [mailto:junaid.fsd...@gmail.com]
> Sent: Thursday, 6 September 2018 13:18
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] help needed
>
> Hi
Do not use Samsung 850 PRO for journal
Just use LSI logic HBA (eg. SAS2308)
-----Original Message-----
From: Muhammad Junaid [mailto:junaid.fsd...@gmail.com]
Sent: Thursday, 6 September 2018 13:18
To: ceph-users@lists.ceph.com
Subject: [ceph-users] help needed
Hi there
Hope, every one
Hi there
Hope everyone is fine. I need urgent help with a ceph cluster design.
We are planning a 3 OSD node cluster in the beginning. Details are as under:
Servers: 3 * DELL R720xd
OS Drives: 2 2.5" SSD
OSD Drives: 10 3.5" SAS 7200rpm 3/4 TB
Journal Drives: 2 SSD's Samsung 850 PRO 256GB each
Agreed on not zapping the disks until your cluster is healthy again. Marking
them out and seeing how healthy you can get in the meantime is a good idea.
On Sun, Sep 2, 2018, 1:18 PM Ronny Aasen wrote:
> On 02.09.2018 17:12, Lee wrote:
> > Should I just out the OSD's first or completely zap them and
On 02.09.2018 17:12, Lee wrote:
Should I just out the OSDs first, or completely zap them and recreate?
Or delete them and let the cluster repair itself?
On the second node, when it started back up, I had problems with the
journals for IDs 5 and 7; they were also recreated. All the rest are
still the or
Ok, rather than going gung-ho at this..
1. I have set out: 31, 24, 21, 18, 15, 14, 13, 6 and 7, 5 (10 is a new OSD)
Which gives me
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 23.65970 root default
-5 8.18990 host data33-a4
13 0.90999 osd.13 up0
Should I just out the OSDs first, or completely zap them and recreate? Or
delete them and let the cluster repair itself?
On the second node, when it started back up, I had problems with the journals
for IDs 5 and 7; they were also recreated. All the rest are still the
originals.
I know that some PGs are o
The problem is with never getting a successful run of `ceph-osd
--flush-journal` on the old SSD journal drive. All of the OSDs that used
the dead journal need to be removed from the cluster, wiped, and added back
in. The data on them is not 100% consistent because the old journal died.
Any word tha
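For reference, the clean journal-move sequence being described would look roughly like this on a Hammer-era filestore OSD (osd.20 as an example; it only works while the old journal device is still readable):

service ceph stop osd.20            # or: systemctl stop ceph-osd@20 on newer systems
ceph-osd -i 20 --flush-journal      # drain the journal to the data disk
# ...swap in / repartition the new journal device here...
ceph-osd -i 20 --mkjournal          # initialise the new journal
service ceph start osd.20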
I followed:
$ journal_uuid=$(sudo cat /var/lib/ceph/osd/ceph-0/journal_uuid)
$ sudo sgdisk --new=1:0:+20480M --change-name=1:'ceph journal'
--partition-guid=1:$journal_uuid
--typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/sdk
Then
$ sudo ceph-osd --mkjournal -i 20
$ sudo serv
>
>
> Hi David,
>
> Yes, health detail outputs all the errors etc and recovery / backfill is
> going on, just taking time; 25% misplaced and 1.5% degraded.
>
> I can list out the pools and see sizes etc..
>
> My main problem is I have no client IO from a read perspective, I cannot
> start VMs in OpenS
Hi David,
Yes, health detail outputs all the errors etc and recovery / backfill is
going on, just taking time; 25% misplaced and 1.5% degraded.
I can list out the pools and see sizes etc..
My main problem is I have no client IO from a read perspective, I cannot
start VMs in OpenStack, and ceph -w st
When the first node went offline with a dead SSD journal, all of the data
on the OSDs was useless. Unless you could flush the journals, you can't
guarantee that a write the cluster thinks happened actually made it to the
disk. The proper procedure here is to remove those OSDs and add them again
as
Does "ceph health detail" work?
Have you manually confirmed the OSDs on the nodes are working?
What was the replica size of the pools?
Are you seeing any progress with the recovery?
On Sun, Sep 2, 2018 at 9:42 AM Lee wrote:
> Running 0.94.5 as part of a Openstack enviroment, our ceph setup is
Running 0.94.5 as part of an OpenStack environment; our Ceph setup is 3x OSD
nodes and 3x MON nodes. Yesterday we had an aircon outage in our hosting
environment: 1 OSD node failed (offline with the journal SSD dead), leaving
us with 2 nodes running correctly. 2 hours later a second OSD node failed,
complaining
Now here's the thing:
Some weeks ago Proxmox upgraded from kernel 4.13 to 4.15. Since then I'm
getting slow requests that
cause blocked IO inside the VMs that are running on the cluster (but not
necessarily on the host
with the OSD causing the slow request).
If I boot back into 4.13 then Ceph
Dear community,
TL;DR: The cluster runs well with kernel 4.13 but produces slow_requests with
kernel 4.15. How to debug?
I'm running a combined Ceph / KVM cluster consisting of 6 hosts of 2 different
kinds (details at the end).
The main difference between those hosts is CPU generation (Westmere /
Hi All,
there might be a problem on Scientific Linux 7.5 too:
after upgrading directly from 12.2.5 to 13.2.1
[root@cephr01 ~]# ceph-detect-init
Traceback (most recent call last):
File "/usr/bin/ceph-detect-init", line 9, in
load_entry_point('ceph-detect-init==1.0.1', 'console_scripts',
I'll just go and test it :)
On 30/07/18 10:54, Nathan Cutler wrote:
for all others on this list, it might also be helpful to know which
setups are likely affected.
Does this only occur for Filestore disks, i.e. if ceph-volume has
taken over taking care of these?
Does it happen on every RHEL 7.
for all others on this list, it might also be helpful to know which setups are
likely affected.
Does this only occur for Filestore disks, i.e. if ceph-volume has taken over
taking care of these?
Does it happen on every RHEL 7.5 system?
It affects all OSDs managed by ceph-disk on all RHEL syste
e=release)
> ceph_detect_init.exc.UnsupportedPlatform: Platform is not supported.: rhel
> 7.5
>
>
> Sent: Sunday, 29 July 2018 at 20:33
> From: "Nathan Cutler"
> To: ceph.nov...@habmalnefrage.de, "Vasu Kulkarni"
> Cc: ceph-users , "Ceph D
atform: Platform is not supported.: rhel 7.5
Sent: Sunday, 29 July 2018 at 20:33
From: "Nathan Cutler"
To: ceph.nov...@habmalnefrage.de, "Vasu Kulkarni"
Cc: ceph-users , "Ceph Development"
Subject: Re: [ceph-users] HELP! --> CLUSTER DOWN (was "v13.2
303
Nathan
On 07/29/2018 11:16 AM, ceph.nov...@habmalnefrage.de wrote:
>
Sent: Sunday, 29 July 2018 at 03:15
From: "Vasu Kulkarni"
To: ceph.nov...@habmalnefrage.de
Cc: "Sage Weil" , ceph-users , "Ceph
Development"
Subject: Re: [ceph-users] HELP
age Weil" , ceph-users ,
"Ceph Development"
Subject: Re: [ceph-users] HELP! --> CLUSTER DOWN (was "v13.2.1 Mimic released")
On Sat, Jul 28, 2018 at 6:02 PM, wrote:
> Have you guys changed something with the systemctl startup of the OSDs?
I think there is some ki
/ 8.8 TiB avail
> pgs: 1390 active+clean
>
> io:
> client: 11 KiB/s rd, 10 op/s rd, 0 op/s wr
>
> Any hints?
>
> --
>
>
> Sent: Saturday, 28 July 2018 at 23:35
> From: ceph
rage.de
An: "Sage Weil"
Cc: ceph-users@lists.ceph.com, ceph-de...@vger.kernel.org
Betreff: Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")
Hi Sage.
Sure. Any specific OSD(s) log(s)? Or just any?
Sent: Saturday, 28 July 2018 at 16:49
From: "Sage
Hi Sage.
Sure. Any specific OSD(s) log(s)? Or just any?
Sent: Saturday, 28 July 2018 at 16:49
From: "Sage Weil"
To: ceph.nov...@habmalnefrage.de, ceph-users@lists.ceph.com,
ceph-de...@vger.kernel.org
Subject: Re: [ceph-users] HELP! --> CLUSTER DOWN (was "v13.2.1 Mimic r
Hello all,
would someone please help with recovering from a recent failure of all cache
tier pool OSDs?
My Ceph cluster has a usual replica-2 pool with a writeback cache tier of two
500GB SSD OSDs over it (also replica 2).
Both cache OSDs were created with the standard ceph-deploy tool, and have 2
Can you include more of your osd log file?
On July 28, 2018 9:46:16 AM CDT, ceph.nov...@habmalnefrage.de wrote:
>Dear users and developers.
>
>I've updated our dev-cluster from v13.2.0 to v13.2.1 yesterday and
>since then everything is badly broken.
>I've restarted all Ceph components via "system
Dear users and developers.
I've updated our dev-cluster from v13.2.0 to v13.2.1 yesterday and since then
everything is badly broken.
I've restarted all Ceph components via "systemctl" and also rebooted the servers
SDS21 and SDS24; nothing changes.
This cluster started as Kraken, was updated to
s on behalf of Linh Vu
Sent: Monday, 25 June 2018 7:06:45 PM
To: ceph-users
Subject: [ceph-users] Help! Luminous 12.2.5 CephFS - MDS crashed and now won't
start (failing at MDCache::add_inode)
Hi all,
We have a Luminous 12.2.5 cluster, running entirely just CephFS with 1 active
and 1 s
Hi all,
We have a Luminous 12.2.5 cluster, running entirely just CephFS with 1 active
and 1 standby MDS. The active MDS crashed and now won't start again with this
same error:
###
0> 2018-06-25 16:11:21.136203 7f01c2749700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_6
On Mon, May 21, 2018 at 11:19 AM Andras Pataki <
apat...@flatironinstitute.org> wrote:
> Hi Greg,
>
> Thanks for the detailed explanation - the examples make a lot of sense.
>
> One followup question regarding a two level crush rule like:
>
>
> step take default
> step choose 3 type=rack
> step ch
Hi Greg,
Thanks for the detailed explanation - the examples make a lot of sense.
One followup question regarding a two level crush rule like:
step take default
step choose 3 type=rack
step chooseleaf 3 type=host
step emit
If the erasure code has 9 chunks, this lines up exactly without any
pro
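(In actual crushmap syntax, an erasure-coded rule of that shape would be written roughly as below; the rule name, ids and the tries line are illustrative, not taken from the cluster being discussed.)

rule ec_rack_host {
    id 2
    type erasure
    min_size 3
    max_size 9
    step set_chooseleaf_tries 5
    step take default
    step choose indep 3 type rack       # 3 racks
    step chooseleaf indep 3 type host   # 3 chunks on distinct hosts per rack
    step emit
}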
On Thu, May 17, 2018 at 9:05 AM Andras Pataki
wrote:
> I've been trying to wrap my head around crush rules, and I need some
> help/advice. I'm thinking of using erasure coding instead of
> replication, and trying to understand the possibilities for planning for
> failure cases.
>
> For a simplif
I've been trying to wrap my head around crush rules, and I need some
help/advice. I'm thinking of using erasure coding instead of
replication, and trying to understand the possibilities for planning for
failure cases.
For a simplified example, consider a 2 level topology, OSDs live on
hosts,
That seems to have worked. Thanks much!
And yes, I realize my setup is less than ideal, but I'm planning on
migrating from another storage system, and this is the hardware I have
to work with. I'll definitely keep your recommendations in mind when I
start to grow the cluster.
On 04/23/2018 1
Hi,
this doesn't sound like a good idea: two hosts is usually a poor
configuration for Ceph.
Also, fewer disks on more servers is typically better than lots of disks in
few servers.
But to answer your question: you could use a crush rule like this:
min_size 4
max_size 4
step take default
step ch
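A guess at how the truncated rule above continues, written out in full (purely illustrative: 2 hosts, 2 OSDs chosen per host, 4 copies total):

rule replicated_2host_4copies {
    id 1
    type replicated
    min_size 4
    max_size 4
    step take default
    step choose firstn 2 type host      # both hosts
    step chooseleaf firstn 2 type osd   # 2 OSDs on each host
    step emit
}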
I'm starting to get a small Ceph cluster running. I'm to the point where
I've created a pool, and stored some test data in it, but I'm having
trouble configuring the level of replication that I want.
The goal is to have two OSD host nodes, each with 20 OSDs. The target
replication will be:
o
The WAL is a required part of the OSD. If you remove it, then the OSD
is missing a crucial part of itself and it will be unable to start until
the WAL is back online. If the SSD were to fail, then all OSDs using it
would need to be removed and recreated on the cluster.
On Tue, Feb 20, 2018,
ter" from Ceph Days Germany earlier this month for
> other things to watch out for:
>
>
>
> https://ceph.com/cephdays/germany/
>
>
>
> Bryan
>
>
>
> *From: *ceph-users on behalf of Bryan
> Banister
> *Date: *Tuesday, February 20, 2018 at 2:53 PM
er this month for other things
to watch out for:
https://ceph.com/cephdays/germany/
Bryan
From: ceph-users on behalf of Bryan
Banister
Date: Tuesday, February 20, 2018 at 2:53 PM
To: David Turner
Cc: Ceph Users
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2
HI David [
Hi,
We were recently testing Luminous with BlueStore. We have a 6-node cluster
with 12 HDDs and 1 SSD each; we used ceph-volume with LVM to create all the OSDs
and attached an SSD WAL (LVM). We created individual 10GB x 12 LVs on a single
SSD for each WAL, so all of the OSD WALs are on the single SSD. P
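For readers wondering how such a layout is carved out, a sketch with made-up device and VG/LV names (the ceph-volume call is then run once per data disk):

pvcreate /dev/nvme0n1
vgcreate ceph-wal /dev/nvme0n1
for i in $(seq 0 11); do
    lvcreate -L 10G -n wal-$i ceph-wal    # one 10G WAL LV per OSD
done
# then, for each data disk:
# ceph-volume lvm create --bluestore --data /dev/sdb --block.wal ceph-wal/wal-0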
.com>>
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2
That sounds like a good next step. Start with OSDs involved in the longest
blocked requests. Wait a couple minutes after the osd marks itself back up an
arking OSDs with stuck requests down to see if that
> will re-assert them?
>
> Thanks!!
> -Bryan
>
> From: David Turner [mailto:drakonst...@gmail.com]
> Sent: Friday, February 16, 2018 2:51 PM
> To: Bryan Banister
> Cc: Bryan Stillwe
Cc: Bryan Stillwell ; Janne Johansson
; Ceph Users
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2
I'll answer the questions I definitely know the answer to first, and then we'll continue
from there. If an OSD is blocking peerin
"scrubber.seed": 0,
>
> "scrubber.waiting_on": 0,
>
> "scrubber.waiting_on_whom": []
>
> }
>
> },
>
> {
>
> "name": "Started",
>
>
"name": "Started",
"enter_time": "2018-02-13 14:33:17.491148"
}
],
Sorry for all the hand holding, but how do I determine if I need to set an OSD
as ‘down’ to fix the issues, and how does it go about re-asserting itself?
I again tried lo
00
>
> At this point we do not know how to proceed with recovery efforts. I tried
> looking at the ceph docs and mail list archives but wasn't able to
> determine the right path forward here.
>
> Any help is appreciated,
> -Bryan
@godaddy.com]
Sent: Tuesday, February 13, 2018 2:27 PM
To: Bryan Banister ; Janne Johansson
Cc: Ceph Users
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2
It may work fine, but I would suggest limiting the number of ope
It may work fine, but I would suggest limiting the number of operations going
on at the same time.
Bryan
From: Bryan Banister
Date: Tuesday, February 13, 2018 at 1:16 PM
To: Bryan Stillwell , Janne Johansson
Cc: Ceph Users
Subject: RE: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2
y.com]
Sent: Tuesday, February 13, 2018 12:43 PM
To: Bryan Banister ; Janne Johansson
Cc: Ceph Users
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2
Bryan,
Based off the information you've provided
print $1 "\t" $7 }' | sort -n -k2
You'll see that within a pool the PG sizes are fairly close to the same size,
but in your cluster the PGs are fairly large (~200GB would be my guess).
Bryan
From: ceph-users on behalf of Bryan
Banister
Date: Monday, February 12, 2018 at 2:19
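The truncated one-liner above presumably runs against "ceph pg dump" output, where column 7 is the PG size in bytes; a sketch of the full form:

ceph pg dump 2>/dev/null | awk '/^[0-9]+\./ { print $1 "\t" $7 }' | sort -n -k2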
[mailto:icepic...@gmail.com]
Sent: Wednesday, January 31, 2018 9:34 AM
To: Bryan Banister
Cc: Ceph Users
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2
2018-01-31 15:58 GMT+01:00 Bryan Banister
mailto:bbanis...@jumptrading.com
Thanks, I’m downloading it right now
--
Efficiency is Intelligent Laziness
From: "ceph.nov...@habmalnefrage.de"
Date: Friday, February 2, 2018 at 12:37 PM
To: "ceph.nov...@habmalnefrage.de"
Cc: Frank Li , "ceph-users@lists.ceph.com"
Subject: Aw: Re: [ceph-use