Re: [ceph-users] CephFS: effects of using hard links

2019-03-20 Thread Dan van der Ster
On Tue, Mar 19, 2019 at 9:43 AM Erwin Bogaard wrote: > > Hi, > > > > For a number of application we use, there is a lot of file duplication. This > wastes precious storage space, which I would like to avoid. > > When using a local disk, I can use a hard link to let all duplicate files > point

Re: [ceph-users] ceph-volume lvm batch OSD replacement

2019-03-19 Thread Dan van der Ster
On Tue, Mar 19, 2019 at 1:05 PM Alfredo Deza wrote: > > On Tue, Mar 19, 2019 at 7:26 AM Dan van der Ster wrote: > > > > On Tue, Mar 19, 2019 at 12:17 PM Alfredo Deza wrote: > > > > > > On Tue, Mar 19, 2019 at 7:00 AM Alfredo Deza wrote: > > > >

Re: [ceph-users] ceph-volume lvm batch OSD replacement

2019-03-19 Thread Dan van der Ster
On Tue, Mar 19, 2019 at 12:17 PM Alfredo Deza wrote: > > On Tue, Mar 19, 2019 at 7:00 AM Alfredo Deza wrote: > > > > On Tue, Mar 19, 2019 at 6:47 AM Dan van der Ster > > wrote: > > > > > > Hi all, > > > > > > We've just hit our fir

[ceph-users] ceph-volume lvm batch OSD replacement

2019-03-19 Thread Dan van der Ster
Hi all, We've just hit our first OSD replacement on a host created with `ceph-volume lvm batch` with mixed hdds+ssds. The hdd /dev/sdq was prepared like this: # ceph-volume lvm batch /dev/sd[m-r] /dev/sdac --yes Then /dev/sdq failed and was then zapped like this: # ceph-volume lvm zap
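A minimal sketch of the replacement flow discussed in this thread, assuming the failed HDD keeps its db/wal LV on the shared SSD; the device name, OSD id, and VG/LV names are placeholders, not taken from the thread:

    # Zap the failed data device (as in the original post):
    ceph-volume lvm zap /dev/sdq --destroy

    # Free the OSD id so the replacement can reuse it:
    ceph osd destroy 123 --yes-i-really-mean-it

    # Re-prepare just the replaced drive, pointing --block.db at the
    # existing db LV on the shared SSD (placeholder VG/LV names):
    ceph-volume lvm prepare --osd-id 123 --data /dev/sdq \
        --block.db ceph-db-vg/db-lv-sdq
    ceph-volume lvm activate --all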

[ceph-users] Safe to remove objects from default.rgw.meta ?

2019-03-12 Thread Dan van der Ster
Hi all, We have an S3 cluster with >10 million objects in default.rgw.meta. # radosgw-admin zone get | jq .metadata_heap "default.rgw.meta" In these old tickets I realized that this setting is obsolete, and those objects are probably useless: http://tracker.ceph.com/issues/17256

Re: [ceph-users] Safe to remove objects from default.rgw.meta ?

2019-03-12 Thread Dan van der Ster
one set --rgw-zone=default --infile=zone.json and now I can safely remove the default.rgw.meta pool. -- Dan On Tue, Mar 12, 2019 at 3:17 PM Dan van der Ster wrote: > > Hi all, > > We have an S3 cluster with >10 million objects in default.rgw.meta. > > # radosgw-
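A sketch of the sequence described above, with one added caution: in a stock Jewel+ layout, default.rgw.meta can also hold live user/bucket metadata in namespaces, so only remove the pool after the zone's other pool settings confirm nothing else lives there (zone name is the default one from the thread):

    radosgw-admin zone get --rgw-zone=default > zone.json
    # edit zone.json and set:  "metadata_heap": ""
    radosgw-admin zone set --rgw-zone=default --infile=zone.json
    # restart the radosgw instances, verify the zone's other pools, then:
    ceph osd pool rm default.rgw.meta default.rgw.meta --yes-i-really-really-mean-it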

Re: [ceph-users] ceph-volume lvm batch OSD replacement

2019-03-21 Thread Dan van der Ster
> ceph-volume lvm activate successful for osd ID: 3 > --> ceph-volume lvm create successful for: /dev/sda > Yes that's it! Worked for me too. Thanks! Dan > This is a Nautilus test cluster, but I remember having this on a > Luminous cluster, too. I hope this helps. > > Regard

Re: [ceph-users] cephfs manila snapshots best practices

2019-03-21 Thread Dan van der Ster
On Thu, Mar 21, 2019 at 1:50 PM Tom Barron wrote: > > On 20/03/19 16:33 +0100, Dan van der Ster wrote: > >Hi all, > > > >We're currently upgrading our cephfs (managed by OpenStack Manila) > >clusters to Mimic, and want to start enabling snapshots of the file > >

Re: [ceph-users] cephfs manila snapshots best practices

2019-03-22 Thread Dan van der Ster
ir quota by CephFS :/ > > > Paul > On Wed, Mar 20, 2019 at 4:34 PM Dan van der Ster > wrote: > > > > Hi all, > > > > We're currently upgrading our cephfs (managed by OpenStack Manila) > > clusters to Mimic, and want to start enabling snapshots of the file &

Re: [ceph-users] v12.2.11 Luminous released

2019-02-07 Thread Dan van der Ster
On Fri, Feb 1, 2019 at 10:18 PM Neha Ojha wrote: > > On Fri, Feb 1, 2019 at 1:09 PM Robert Sander > wrote: > > > > Am 01.02.19 um 19:06 schrieb Neha Ojha: > > > > > If you would have hit the bug, you should have seen failures like > > > https://tracker.ceph.com/issues/36686. > > > Yes,

Re: [ceph-users] Need help with upmap feature on luminous

2019-02-06 Thread Dan van der Ster
Note that there are some improved upmap balancer heuristics in development here: https://github.com/ceph/ceph/pull/26187 -- dan On Tue, Feb 5, 2019 at 10:18 PM Kári Bertilsson wrote: > > Hello > > I previously enabled upmap and used automatic balancing with "ceph balancer > on". I got very

Re: [ceph-users] ceph mon_data_size_warn limits for large cluster

2019-02-06 Thread Dan van der Ster
Hi, With HEALTH_OK a mon data dir should be under 2GB even for such a large cluster. During backfilling scenarios, the mons keep old maps and grow quite quickly. So if you have balancing, pg splitting, etc. ongoing for a while, the mon stores will eventually trigger that 15GB alarm. But the

Re: [ceph-users] ceph mon_data_size_warn limits for large cluster

2019-02-18 Thread Dan van der Ster
On Thu, Feb 14, 2019 at 2:31 PM Sage Weil wrote: > > On Thu, 7 Feb 2019, Dan van der Ster wrote: > > On Thu, Feb 7, 2019 at 12:17 PM M Ranga Swami Reddy > > wrote: > > > > > > Hi Dan, > > > >During backfilling scenarios, the mons keep old ma

Re: [ceph-users] ceph mon_data_size_warn limits for large cluster

2019-02-18 Thread Dan van der Ster
osd bench, etc)? > > On Fri, Feb 15, 2019 at 3:13 PM M Ranga Swami Reddy > wrote: > > > > today I again hit the warn with 30G also... > > > > On Thu, Feb 14, 2019 at 7:39 PM Sage Weil wrote: > > > > > > On Thu, 7 Feb 2019, Dan van der Ster wrote:

Re: [ceph-users] HDD OSD 100% busy reading OMAP keys RGW

2019-02-14 Thread Dan van der Ster
On Thu, Feb 14, 2019 at 11:13 AM Wido den Hollander wrote: > > On 2/14/19 10:20 AM, Dan van der Ster wrote: > > On Thu., Feb. 14, 2019, 6:17 a.m. Wido den Hollander >> > >> Hi, > >> > >> On a cluster running RGW only I'm running into BlueStor

Re: [ceph-users] HDD OSD 100% busy reading OMAP keys RGW

2019-02-14 Thread Dan van der Ster
On Thu, Feb 14, 2019 at 12:07 PM Wido den Hollander wrote: > > > > On 2/14/19 11:26 AM, Dan van der Ster wrote: > > On Thu, Feb 14, 2019 at 11:13 AM Wido den Hollander wrote: > >> > >> On 2/14/19 10:20 AM, Dan van der Ster wrote: > >>> On T

Re: [ceph-users] Ceph Nautilus Release T-shirt Design

2019-02-15 Thread Dan van der Ster
On Fri, Feb 15, 2019 at 12:01 PM Willem Jan Withagen wrote: > > On 15/02/2019 11:56, Dan van der Ster wrote: > > On Fri, Feb 15, 2019 at 11:40 AM Willem Jan Withagen > > wrote: > >> > >> On 15/02/2019 10:39, Ilya Dryomov wrote: > >>> On

Re: [ceph-users] Ceph Nautilus Release T-shirt Design

2019-02-15 Thread Dan van der Ster
On Fri, Feb 15, 2019 at 11:40 AM Willem Jan Withagen wrote: > > On 15/02/2019 10:39, Ilya Dryomov wrote: > > On Fri, Feb 15, 2019 at 12:05 AM Mike Perez wrote: > >> > >> Hi Marc, > >> > >> You can see previous designs on the Ceph store: > >> > >> https://www.proforma.com/sdscommunitystore > > >

Re: [ceph-users] ceph mon_data_size_warn limits for large cluster

2019-02-07 Thread Dan van der Ster
of dB showing > 15G, do I need to run the compact commands > to do the trimming? Compaction isn't necessary -- you should only need to restart all peons, then the leader. A few minutes later the DBs should start trimming. -- dan > > Thanks > Swami > > On Wed, Feb 6, 2019 at 6:2
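A sketch of that restart sequence, assuming three mons managed by systemd; hostnames are placeholders:

    ceph mon stat                      # note which mon is the leader
    systemctl restart ceph-mon@mon-b   # peon
    systemctl restart ceph-mon@mon-c   # peon
    systemctl restart ceph-mon@mon-a   # leader last
    du -sh /var/lib/ceph/mon/ceph-*/store.db   # watch the store shrink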

Re: [ceph-users] ceph mon_data_size_warn limits for large cluster

2019-02-07 Thread Dan van der Ster
e to restart the > ceph-mon, instead prefer to do the compact on non-leader mons. > Is this ok? > Compaction doesn't solve this particular problem, because the maps have not yet been deleted by the ceph-mon process. -- dan > Thanks > Swami > > On Thu, Feb 7, 2019 at 6:30 PM Dan

Re: [ceph-users] Lumunious 12.2.10 update send to 12.2.11

2019-02-05 Thread Dan van der Ster
No idea, but maybe this commit which landed in v12.2.11 is relevant: commit 187bc76957dcd8a46a839707dea3c26b3285bd8f Author: runsisi Date: Mon Nov 12 20:01:32 2018 +0800 librbd: fix missing unblock_writes if shrink is not allowed Fixes: http://tracker.ceph.com/issues/36778

Re: [ceph-users] how to trigger offline filestore merge

2019-04-09 Thread Dan van der Ster
; sleep 5 ; done After running that for awhile the PG filestore structure has merged down and now listing the pool and backfilling are back to normal. Thanks! Dan On Tue, Apr 9, 2019 at 7:05 PM Dan van der Ster wrote: > > Hi all, > > We have a slight issue while trying to migrate

[ceph-users] Save the date: Ceph Day for Research @ CERN -- Sept 16, 2019

2019-04-15 Thread Dan van der Ster
for proposals will be available by mid-May. All the Best, Dan van der Ster CERN IT Department Ceph Governing Board, Academic Liaison [1] Sept 16 is the day after CERN Open Days, where there will be plenty to visit on our campus if you arrive a couple of days before https://home.cern/news/news/cern/cern

Re: [ceph-users] obj_size_info_mismatch error handling

2019-06-03 Thread Dan van der Ster
Hi Reed and Brad, Did you ever learn more about this problem? We currently have a few inconsistencies arriving with the same env (cephfs, v13.2.5) and symptoms. PG Repair doesn't fix the inconsistency, nor does Brad's omap workaround earlier in the thread. In our case, we can fix by cp'ing the

Re: [ceph-users] typical snapmapper size

2019-06-07 Thread Dan van der Ster
On Thu, Jun 6, 2019 at 8:00 PM Sage Weil wrote: > > Hello RBD users, > > Would you mind running this command on a random OSD on your RBD-oriented > cluster? > > ceph-objectstore-tool \ > --data-path /var/lib/ceph/osd/ceph-NNN \ > >

[ceph-users] v12.2.12 mds FAILED assert(session->get_nref() == 1)

2019-06-07 Thread Dan van der Ster
Hi all, Just a quick heads up, and maybe a check if anyone else is affected. After upgrading our MDS's from v12.2.11 to v12.2.12, we started getting crashes with /builddir/build/BUILD/ceph-12.2.12/src/mds/MDSRank.cc: 1304: FAILED assert(session->get_nref() == 1) I opened a ticket here

Re: [ceph-users] Quotas with Mimic (CephFS-FUSE) clients in a Luminous Cluster

2019-05-27 Thread Dan van der Ster
Hi Oliver, We saw the same issue after upgrading to mimic. IIRC we could make the max_bytes xattr visible by touching an empty file in the dir (thereby updating the dir inode). e.g. touch /cephfs/user/freyermu/.quota; rm /cephfs/user/freyermu/.quota Does that work? -- dan On Mon, May 27,
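To verify the workaround, the quota xattr can be read back with getfattr (path taken from the mail; ceph.quota.max_bytes is the standard CephFS quota attribute):

    getfattr -n ceph.quota.max_bytes /cephfs/user/freyermu
    # for completeness, setting a 100 GB quota looks like:
    setfattr -n ceph.quota.max_bytes -v 100000000000 /cephfs/user/freyermu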

Re: [ceph-users] Quotas with Mimic (CephFS-FUSE) clients in a Luminous Cluster

2019-05-27 Thread Dan van der Ster
On Mon, May 27, 2019 at 11:54 AM Oliver Freyermuth wrote: > > Dear Dan, > > thanks for the quick reply! > > Am 27.05.19 um 11:44 schrieb Dan van der Ster: > > Hi Oliver, > > > > We saw the same issue after upgrading to mimic. > > > > IIRC we co

Re: [ceph-users] [events] Ceph Day CERN September 17 - CFP now open!

2019-05-27 Thread Dan van der Ster
Tuesday Sept 17 is indeed the correct day! We had to move it by one day to get a bigger room... sorry for the confusion. -- dan

Re: [ceph-users] understanding the bluestore blob, chunk and compression params

2019-06-20 Thread Dan van der Ster
bluestore_compression_mode=force on the osd. -- dan [1] http://docs.ceph.com/docs/master/rados/operations/pools/#set-pool-values On Thu, Jun 20, 2019 at 4:33 PM Dan van der Ster wrote: > > Hi all, > > I'm trying to compress an rbd pool via backfilling the existing data, > and the allocate

Re: [ceph-users] understanding the bluestore blob, chunk and compression params

2019-06-20 Thread Dan van der Ster
...) Now I'll try to observe any performance impact of increased min_blob_size... Do you recall if there were some benchmarks done to pick those current defaults? Thanks! Dan -- Dan > > > Thanks, > > Igor > > On 6/20/2019 5:33 PM, Dan van der Ster wrote: > > Hi all,

Re: [ceph-users] understanding the bluestore blob, chunk and compression params

2019-06-20 Thread Dan van der Ster
e more details (preferably backed with logs) on this... > > On 6/20/2019 6:23 PM, Dan van der Ster wrote: > > P.S. I know this has been discussed before, but the > > compression_(mode|algorithm) pool options [1] seem completely broken > > -- With the pool mode set t

Re: [ceph-users] understanding the bluestore blob, chunk and compression params

2019-06-21 Thread Dan van der Ster
http://tracker.ceph.com/issues/40480 On Thu, Jun 20, 2019 at 9:12 PM Dan van der Ster wrote: > > I will try to reproduce with logs and create a tracker once I find the > smoking gun... > > It's very strange -- I had the osd mode set to 'passive', and pool > option set to 'f

[ceph-users] understanding the bluestore blob, chunk and compression params

2019-06-20 Thread Dan van der Ster
Hi all, I'm trying to compress an rbd pool via backfilling the existing data, and the allocated space doesn't match what I expect. Here is the test: I marked osd.130 out and waited for it to erase all its data. Then I set (on the pool) compression_mode=force and compression_algorithm=zstd. Then
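The pool-level settings from this test, plus the OSD-level override the follow-ups settle on; the pool name is a placeholder, and on pre-Mimic clusters the OSD option would go in ceph.conf rather than `ceph config set`:

    ceph osd pool set rbd_test compression_mode force
    ceph osd pool set rbd_test compression_algorithm zstd
    # OSD-side override mentioned in the replies:
    ceph config set osd bluestore_compression_mode force
    # inspect the result on one OSD (run on the OSD's host):
    ceph daemon osd.130 perf dump | grep -i compressed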

Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-17 Thread Dan van der Ster
for some reasons we could > not react quickly. We accepted the risk of the bucket becoming slow, but > had not thought of further risks ... > > On 17.06.19 10:15, Dan van der Ster wrote: > > Nice to hear this was resolved in the end. > > > > Coming back

Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-17 Thread Dan van der Ster
Nice to hear this was resolved in the end. Coming back to the beginning -- is it clear to anyone what was the root cause and how other users can avoid this from happening? Maybe some better default configs to warn users earlier about too-large omaps? Cheers, Dan On Thu, Jun 13, 2019 at 7:36 PM
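On the "warn earlier" point, the relevant knobs are the deep-scrub large-omap thresholds, plus resharding for RGW bucket indexes; the values and bucket name below are examples, not recommendations (use ceph.conf on releases without `ceph config set`):

    ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 200000
    ceph config set osd osd_deep_scrub_large_omap_object_value_size_threshold 1073741824
    # for RGW, reshard oversized bucket indexes:
    radosgw-admin reshard add --bucket mybucket --num-shards 64
    radosgw-admin reshard process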

Re: [ceph-users] Major ceph disaster

2019-05-13 Thread Dan van der Ster
Presumably the 2 OSDs you marked as lost were hosting those incomplete PGs? It would be useful to double-check that with `ceph pg query` and `ceph pg dump`. (If so, this is why the ignore_history_les option isn't helping; you don't have the minimum 3 stripes up for those 3+1 PGs.) If
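A sketch of that check (the PG id is a placeholder):

    ceph health detail | grep incomplete
    ceph pg 1.2f query | jq '.up, .acting, .state'
    ceph pg dump_stuck inactive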

Re: [ceph-users] Major ceph disaster

2019-05-14 Thread Dan van der Ster
On Tue, May 14, 2019 at 10:02 AM Kevin Flöh wrote: > > On 13.05.19 10:51 nachm., Lionel Bouton wrote: > > Le 13/05/2019 à 16:20, Kevin Flöh a écrit : > >> Dear ceph experts, > >> > >> [...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...] > >> Here is what happened: One osd

Re: [ceph-users] Major ceph disaster

2019-05-14 Thread Dan van der Ster
On Tue, May 14, 2019 at 10:59 AM Kevin Flöh wrote: > > > On 14.05.19 10:08 vorm., Dan van der Ster wrote: > > On Tue, May 14, 2019 at 10:02 AM Kevin Flöh wrote: > > On 13.05.19 10:51 nachm., Lionel Bouton wrote: > > Le 13/05/2019 à 16:20, Kevin Flöh a écrit : > >

Re: [ceph-users] Erasure code profiles and crush rules. Missing link...?

2019-05-22 Thread Dan van der Ster
On Wed, May 22, 2019 at 3:03 PM Rainer Krienke wrote: > > Hello, > > I created an erasure code profile named ecprofile-42 with the following > parameters: > > $ ceph osd erasure-code-profile set ecprofile-42 plugin=jerasure k=4 m=2 > > Next I created a new pool using the ec profile from above: >

Re: [ceph-users] Crush rule for "ssd first" but without knowing how much

2019-05-23 Thread Dan van der Ster
Did I understand correctly: you have a crush tree with both ssd and hdd devices, and you want to direct PGs to the ssds, until they reach some fullness threshold, and only then start directing PGs to the hdds? I can't think of a crush rule alone to achieve that. But something you could do is add

Re: [ceph-users] Major ceph disaster

2019-05-23 Thread Dan van der Ster
What's the full ceph status? Normally recovery_wait just means that the relevant OSDs are busy recovering/backfilling another PG. On Thu, May 23, 2019 at 10:53 AM Kevin Flöh wrote: > > Hi, > > we have set the PGs to recover and now they are stuck in > active+recovery_wait+degraded and

Re: [ceph-users] Major ceph disaster

2019-05-23 Thread Dan van der Ster
> > io: client: 211KiB/s rd, 46.0KiB/s wr, 158op/s rd, 0op/s wr > > On 23.05.19 10:54 vorm., Dan van der Ster wrote: > > What's the full ceph status? > > Normally recovery_wait just means that the relevant osd's are busy > > recovering/backfilling anothe

Re: [ceph-users] Major ceph disaster

2019-05-14 Thread Dan van der Ster
> "2(0)", > "4(1)", > "23(2)", > "24(0)", > "72(1)", > "79(3)" > ], > "down_osds

Re: [ceph-users] Data distribution question

2019-04-30 Thread Dan van der Ster
On Tue, 30 Apr 2019, 19:32 Igor Podlesny, wrote: > On Wed, 1 May 2019 at 00:24, Dan van der Ster wrote: > > > > The upmap balancer in v12.2.12 works really well... Perfectly uniform on > our clusters. > > > > .. Dan > > mode upmap ? > yes, mgr balan
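For completeness, enabling the mgr balancer in upmap mode looks like this (requires all clients to be luminous or newer):

    ceph osd set-require-min-compat-client luminous
    ceph balancer mode upmap
    ceph balancer on
    ceph balancer status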

Re: [ceph-users] Data distribution question

2019-04-30 Thread Dan van der Ster
On Tue, Apr 30, 2019 at 8:26 PM Igor Podlesny wrote: > > On Wed, 1 May 2019 at 01:01, Dan van der Ster wrote: > >> > The upmap balancer in v12.2.12 works really well... Perfectly uniform on > >> > our clusters. > >> > >> mode upmap ?

Re: [ceph-users] Data distribution question

2019-04-30 Thread Dan van der Ster
On Tue, Apr 30, 2019 at 9:01 PM Igor Podlesny wrote: > > On Wed, 1 May 2019 at 01:26, Igor Podlesny wrote: > > On Wed, 1 May 2019 at 01:01, Dan van der Ster wrote: > > >> > The upmap balancer in v12.2.12 works really well... Perfectly uniform > > >> >

Re: [ceph-users] Data distribution question

2019-04-30 Thread Dan van der Ster
Removing pools won't make a difference. Read up to slide 22 here: https://www.slideshare.net/mobile/Inktank_Ceph/ceph-day-berlin-mastering-ceph-operations-upmap-and-the-mgr-balancer .. Dan (Apologies for terseness, I'm mobile) On Tue, 30 Apr 2019, 20:02 Shain Miley, wrote: > Here is the

Re: [ceph-users] Data distribution question

2019-04-30 Thread Dan van der Ster
The upmap balancer in v12.2.12 works really well... Perfectly uniform on our clusters. .. Dan On Tue, 30 Apr 2019, 19:22 Kenneth Van Alstyne, wrote: > Unfortunately it looks like he’s still on Luminous, but if upgrading is an > option, the options are indeed significantly better. If I recall

Re: [ceph-users] co-located cephfs client deadlock

2019-05-02 Thread Dan van der Ster
On Mon, Apr 1, 2019 at 1:46 PM Yan, Zheng wrote: > > On Mon, Apr 1, 2019 at 6:45 PM Dan van der Ster wrote: > > > > Hi all, > > > > We have been benchmarking a hyperconverged cephfs cluster (kernel > > clients + osd on same machines) for awhile. Over the wee

Re: [ceph-users] co-located cephfs client deadlock

2019-05-02 Thread Dan van der Ster
lved by restarting > the osd that it is reading from? > > > > > -----Original Message- > From: Dan van der Ster [mailto:d...@vanderster.com] > Sent: donderdag 2 mei 2019 8:51 > To: Yan, Zheng > Cc: ceph-users; pablo.llo...@cern.ch > Subject: Re: [ceph-users] co-located

Re: [ceph-users] Erasure Coding performance for IO < stripe_width

2019-07-08 Thread Dan van der Ster
Hi Lars, Is there a specific bench result you're concerned about? I would think that small write perf could be kept reasonable thanks to bluestore's deferred writes. FWIW, our bench results (all flash cluster) didn't show a massive performance difference between 3 replica and 4+2 EC. I agree

Re: [ceph-users] Were fixed CephFS lock ups when it's running on nodes with OSDs?

2019-04-23 Thread Dan van der Ster
On Mon, 22 Apr 2019, 22:20 Gregory Farnum, wrote: > On Sat, Apr 20, 2019 at 9:29 AM Igor Podlesny wrote: > > > > I remember seeing reports in regards but it's being a while now. > > Can anyone tell? > > No, this hasn't changed. It's unlikely it ever will; I think NFS > resolved the issue but it

[ceph-users] how to trigger offline filestore merge

2019-04-09 Thread Dan van der Ster
Hi all, We have a slight issue while trying to migrate a pool from filestore to bluestore. This pool used to have 20 million objects in filestore -- it now has 50,000. During its life, the filestore pgs were internally split several times, but never merged. Now the pg _head dirs have mostly
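A hedged sketch of one way to force the merge offline (the apply-layout-settings route; not necessarily the exact loop referenced in the follow-up, and the OSD id and pool name are placeholders):

    # raise the merge threshold so the current object count falls below it,
    # e.g. in ceph.conf:  filestore_merge_threshold = 40
    #                     filestore_split_multiple  = 8
    systemctl stop ceph-osd@123
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-123 \
        --op apply-layout-settings --pool mypool
    systemctl start ceph-osd@123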

Re: [ceph-users] cephfs full, 2/3 Raw capacity used

2019-08-26 Thread Dan van der Ster
Thanks. The version and balancer config look good. So you can try `ceph osd reweight osd.10 0.8` to see if it helps to get you out of this. -- dan On Mon, Aug 26, 2019 at 11:35 AM Simon Oosthoek wrote: > > On 26-08-19 11:16, Dan van der Ster wrote: > > Hi, > > > > Which

Re: [ceph-users] cephfs full, 2/3 Raw capacity used

2019-08-26 Thread Dan van der Ster
Hi, Which version of ceph are you using? Which balancer mode? The balancer score isn't a percent-error or anything humanly usable. `ceph osd df tree` can better show you exactly which osds are over/under utilized and by how much. You might be able to manually fix things by using `ceph osd

Re: [ceph-users] loaded dup inode (but no mds crash)

2019-07-29 Thread Dan van der Ster
On Mon, Jul 29, 2019 at 2:52 PM Yan, Zheng wrote: > > On Fri, Jul 26, 2019 at 4:45 PM Dan van der Ster wrote: > > > > Hi all, > > > > Last night we had 60 ERRs like this: > > > > 2019-07-26 00:56:44.479240 7efc6cca1700 0 mds.2.cache.dir(0x617) >

Re: [ceph-users] loaded dup inode (but no mds crash)

2019-07-29 Thread Dan van der Ster
On Mon, Jul 29, 2019 at 3:47 PM Yan, Zheng wrote: > > On Mon, Jul 29, 2019 at 9:13 PM Dan van der Ster wrote: > > > > On Mon, Jul 29, 2019 at 2:52 PM Yan, Zheng wrote: > > > > > > On Fri, Jul 26, 2019 at 4:45 PM Dan van der Ster > > > wrote: >

[ceph-users] loaded dup inode (but no mds crash)

2019-07-26 Thread Dan van der Ster
Hi all, Last night we had 60 ERRs like this: 2019-07-26 00:56:44.479240 7efc6cca1700 0 mds.2.cache.dir(0x617) _fetched badness: got (but i already had) [inode 0x10006289992 [...2,head] ~mds2/stray1/10006289992 auth v14438219972 dirtyparent s=116637332 nl=8 n(v0 rc2019-07-26 00:56:17.199090

[ceph-users] how to power off a cephfs cluster cleanly

2019-07-25 Thread Dan van der Ster
Hi all, In September we'll need to power down a CephFS cluster (currently mimic) for a several-hour electrical intervention. Having never done this before, I thought I'd check with the list. Here's our planned procedure: 1. umount /cephfs on all HPC clients. 2. ceph osd set noout 3. wait
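A sketch of the flag set commonly used around a full power-down; the procedure above is truncated here, so the flags beyond noout are an assumption rather than a quote:

    ceph osd set noout
    ceph osd set norebalance
    ceph osd set norecover
    ceph osd set nobackfill
    ceph osd set nodown
    ceph osd set pause
    # ... power off OSD hosts, then MDS/MGR, then mons last; reverse on power-up,
    # then clear the flags:
    for f in pause nodown nobackfill norecover norebalance noout; do ceph osd unset $f; done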

Re: [ceph-users] ceph mdss keep on crashing after update to 14.2.3

2019-09-19 Thread Dan van der Ster
You were running v14.2.2 before? It seems that the ceph_assert you're hitting was indeed added between v14.2.2 and v14.2.3 in this commit https://github.com/ceph/ceph/commit/12f8b813b0118b13e0cdac15b19ba8a7e127730b There's a comment in the tracker for that commit which says the original fix

Re: [ceph-users] problem with degraded PG

2019-06-14 Thread Dan van der Ster
Ahh I was thinking of chooseleaf_vary_r, which you already have. So probably not related to tunables. What is your `ceph osd tree` ? By the way, 12.2.9 has an unrelated bug (details http://tracker.ceph.com/issues/36686) AFAIU you will just need to update to v12.2.11 or v12.2.12 for that fix. --

Re: [ceph-users] problem with degraded PG

2019-06-14 Thread Dan van der Ster
Hi, This looks like a tunables issue. What is the output of `ceph osd crush show-tunables` -- Dan On Fri, Jun 14, 2019 at 11:19 AM Luk wrote: > > Hello, > > Maybe somone was fighting with this kind of stuck in ceph already. > This is production cluster, can't/don't want to make wrong

Re: [ceph-users] Erasure Coding performance for IO < stripe_width

2019-07-08 Thread Dan van der Ster
On Mon, Jul 8, 2019 at 1:02 PM Lars Marowsky-Bree wrote: > > On 2019-07-08T12:25:30, Dan van der Ster wrote: > > > Is there a specific bench result you're concerned about? > > We're seeing ~5800 IOPS, ~23 MiB/s on 4 KiB IO (stripe_width 8192) on a > pool that could do 3

Re: [ceph-users] v13.2.7 osds crash in build_incremental_map_msg

2019-12-04 Thread Dan van der Ster
e only some > specific unsafe scenarios? > > Best regards, > > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > ____ > From: ceph-users on behalf of Dan van der > Ster > Sent: 03 December

[ceph-users] v13.2.7 osds crash in build_incremental_map_msg

2019-12-03 Thread Dan van der Ster
Hi all, We're midway through an update from 13.2.6 to 13.2.7 and started getting OSDs crashing regularly like this [1]. Does anyone obviously know what the issue is? (Maybe https://github.com/ceph/ceph/pull/26448/files ?) Or is it some temporary problem while we still have v13.2.6 and v13.2.7

Re: [ceph-users] v13.2.7 osds crash in build_incremental_map_msg

2019-12-03 Thread Dan van der Ster
I created https://tracker.ceph.com/issues/43106 and we're downgrading our osds back to 13.2.6. -- dan On Tue, Dec 3, 2019 at 4:09 PM Dan van der Ster wrote: > > Hi all, > > We're midway through an update from 13.2.6 to 13.2.7 and started > getting OSDs crashing regularly like t

Re: [ceph-users] OSD's hang after network blip

2020-01-16 Thread Dan van der Ster
Hi Nick, We saw the exact same problem yesterday after a network outage -- a few of our down OSDs were stuck down until we restarted their processes. -- Dan On Wed, Jan 15, 2020 at 3:37 PM Nick Fisk wrote: > Hi All, > > Running 14.2.5, currently experiencing some network blips isolated to a

Re: [ceph-users] OSD's hang after network blip

2020-01-16 Thread Dan van der Ster
Fisk wrote: > On Thursday, January 16, 2020 09:15 GMT, Dan van der Ster < > d...@vanderster.com> wrote: > > > Hi Nick, > > > > We saw the exact same problem yesterday after a network outage -- a few > of > > our down OSDs were stuck down until we

Re: [ceph-users] MDS: obscene buffer_anon memory use when scanning lots of files

2020-01-21 Thread Dan van der Ster
On Wed, Jan 22, 2020 at 12:24 AM Patrick Donnelly wrote: > On Tue, Jan 21, 2020 at 8:32 AM John Madden wrote: > > > > On 14.2.5 but also present in Luminous, buffer_anon memory use spirals > > out of control when scanning many thousands of files. The use case is > > more or less "look up this

Re: [ceph-users] Acting sets sometimes may violate crush rule ?

2020-01-13 Thread Dan van der Ster
Hi, One way this can happen is if you change the crush rule of a pool after the balancer has been running awhile. This is because the balancer upmaps are only validated when they are initially created. ceph osd dump | grep upmap Does it explain your issue? .. Dan On Tue, 14 Jan 2020, 04:17
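A sketch of the check and cleanup (the PG id is a placeholder):

    ceph osd dump | grep upmap          # list the pg_upmap_items entries
    ceph osd rm-pg-upmap-items 1.2f     # drop an entry that violates the new rule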
