Re: [ceph-users] Consumer-grade SSD in Ceph
I don't think “usually” is good enough in a production setup. Sent from myMail for iOS Thursday, 19 December 2019, 12.09 +0100 from Виталий Филиппов : >Usually it doesn't, it only harms performance and probably SSD lifetime >too > >> I would not run Ceph on SSDs without power-loss protection. It >> delivers a potential data-loss scenario ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Consumer-grade SSD in Ceph
I would not run Ceph on SSDs without power-loss protection. It delivers a potential data-loss scenario. Jesper Sent from myMail for iOS Thursday, 19 December 2019, 08.32 +0100 from Виталий Филиппов : >https://yourcmc.ru/wiki/Ceph_performance > >https://docs.google.com/spreadsheets/d/1E9-eXjzsKboiCCX-0u0r5fAjjufLKayaut_FOPxYZjc > >On 19 December 2019 at 0:41:02 GMT+03:00, Sinan Polat < si...@turka.nl > wrote: >>Hi, >> >>I am aware that >>https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/ >> holds a list with benchmarks of quite a few different SSD models. >>Unfortunately it doesn't have benchmarks for recent SSD models. >> >>A client is planning to expand a running cluster (Luminous, FileStore, SSD only, Replicated). I/O utilization is close to 0, but capacity-wise the cluster is almost nearfull. To save costs the cluster will be expanded with consumer-grade SSDs, but I am unable to find benchmarks of recent SSD models. >> >>Does anyone have experience with Samsung 860 EVO, 860 PRO and Crucial MX500 in a Ceph cluster? >> >>Thanks! >>Sinan >-- >With best regards, >Vitaliy Filippov >___ >ceph-users mailing list >ceph-users@lists.ceph.com >http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
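For reference, the methodology behind that sebastien-han.fr page boils down to a single-threaded O_DSYNC write test, roughly like the sketch below (the device name is a placeholder - this writes to the raw device and destroys data on it):

$ sudo fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 \
    --runtime=60 --time_based --group_reporting

Drives with power-loss protection typically sustain tens of thousands of IOPS here, because they can acknowledge sync writes from capacitor-backed cache; consumer drives without it often drop to a few hundred.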
Re: [ceph-users] HA and data recovery of CEPH
Hi Nathan. Is that true? The time it takes to reassign the primary PG introduces “downtime” by design, right? Seen from a writing client's perspective. Jesper Sent from myMail for iOS Friday, 29 November 2019, 06.24 +0100 from pen...@portsip.com : >Hi Nathan, > >Thanks for the help. >My colleague will provide more details. > >BR >On Fri, Nov 29, 2019 at 12:57 PM Nathan Fish < lordci...@gmail.com > wrote: >>If correctly configured, your cluster should have zero downtime from a >>single OSD or node failure. What is your crush map? Are you using >>replica or EC? If your 'min_size' is not smaller than 'size', then you >>will lose availability. >> >>On Thu, Nov 28, 2019 at 10:50 PM Peng Bo < pen...@portsip.com > wrote: >>> >>> Hi all, >>> >>> We are working on using Ceph to build our HA system; the purpose is that the system should keep providing service even when a Ceph node or OSD is lost. >>> >>> Currently, as we have observed, once a node/OSD is down the Ceph cluster needs about 40 seconds to sync data, and our system can't provide service during that. >>> >>> My questions: >>> >>> Is there any way we can reduce the data sync time? >>> How can we keep Ceph available once a node/OSD is down? >>> >>> >>> BR >>> >>> -- >>> The modern Unified Communications provider >>> >>> https://www.portsip.com >>> ___ >>> ceph-users mailing list >>> ceph-users@lists.ceph.com >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > >-- >The modern Unified Communications provider > >https://www.portsip.com >___ >ceph-users mailing list >ceph-users@lists.ceph.com >http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
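The ~40 second window described above is mostly failure detection, not data sync. The knobs below are real ceph.conf options, but the lowered values are a hedged sketch rather than tested recommendations - faster down-marking also means more false positives on a busy network:

[osd]
# default 20 s before peers report an unresponsive OSD as failed
osd heartbeat grace = 10
# default 6 s between heartbeats
osd heartbeat interval = 3

Note this only shortens detection; the few seconds of peering once the OSD is actually marked down remain, and writes to the affected PGs block (rather than fail) for that period.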
Re: [ceph-users] NVMe disk - size
Is c) the bcache solution? Real-life experience: unless you are really beating an enterprise SSD with writes, they last very, very long - and even when failure happens, you can typically see it coming in the SMART wear levels months before. I would go for c), but if possible add one more NVMe to each host - we have a 9-HDD + 3-SSD scenario here. Jesper Sent from myMail for iOS Monday, 18 November 2019, 07.49 +0100 from kristof.cou...@gmail.com : >Hi all, > >Thanks for the feedback. >Though, just to be sure: > >1. There is no 30GB limit, if I understand correctly, for the RocksDB size. If metadata crosses that barrier, the L4 part will spill over to the primary device? Or will it just move the RocksDB completely? Or will it just stop and indicate it's full? >2. Since the WAL will also be written to that device, I assume a few additional GBs are still useful... > >With my setup (13x 14TB + 2 NVMe of 1.6TB / host, 10 hosts) I have multiple possible scenarios: >- Assigning 35GB of NVMe space per OSD (30GB for DB, 5 spare) would result in only 455GB being used (13 x 35GB). This is a pity, since I have 3.2TB of NVMe disk space... > >Options line-up: > >Option a: Not using the NVMe for block.db storage, but as RGW metadata pool. >Advantages: >- Impact of 1 defect NVMe is limited. >- Fast storage for the metadata pool. >Disadvantage: >- RocksDB for each OSD is on the primary disk, resulting in slower performance of each OSD. > >Option b: Hardware mirror of the NVMe drive >Advantages: >- Impact of 1 defect NVMe is limited >- Fast KV lookup for each OSD >Disadvantages: >- I/O to NVMe is serialized for all OSDs on 1 host. Though the NVMes are fast, I imagine that there still is an impact. >- 1 TB of NVMe is not used / host > >Option c: Split the NVMes across the OSDs >Advantages: >- Fast RocksDB access - up to L3 (assuming spillover does its job) >Disadvantages: >- 1 defect NVMe impacts max 7 OSDs (1 NVMe assigned to 7 or 6 OSD daemons per host) >- 2.7TB of NVMe space not used per host > >Option d: 1 NVMe disk for OSDs, 1 for RGW metadata pool >Advantages: >- Fast RocksDB access - up to L3 >- Fast RGW metadata pool (though limited to 5.3TB; raw pool size will be 16TB, divided by 3 due to replication). I assume this already gives some possibilities >Disadvantages: >- 1 defect NVMe might impact a complete host (all OSDs might be using it for the RocksDB storage) >- 1 TB of NVMe is not used > >Tough menu to choose from, each with its possibilities... The initial idea was to assign 200GB of NVMe space per OSD, but this would result in a lot of unused space. I don't know if there is anything on the roadmap to adapt the RocksDB sizing to make better use of the available NVMe disk space. >With all the information, I would assume that the best option would be option a. Since we will be using erasure coding for the RGW data pool (k=6, m=3), the impact of a defect NVMe would be too significant. The other alternative would be option b, but then again we would be dealing with HW RAID, which is against all Ceph design rules. > >Any other options or (dis)advantages I missed? Or any other opinions to choose another option? > >Regards, > >Kristof >On Fri 15 Nov 2019 at 18:22, < vita...@yourcmc.ru > wrote: >>Use 30 GB for all OSDs. Other values are pointless, because >>https://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing >> >>You can use the rest of free NVMe space for bcache - it's much better >>than just allocating it for block.db.
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
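For completeness, carving the NVMe into per-OSD DB volumes at OSD creation time looks roughly like this (device names are hypothetical; ceph-volume sizes block.db to whatever partition/LV it is given):

$ sudo ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1

Spillover behaviour then answers Kristof's question 1: when the DB outgrows its dedicated device, the overflow lands on the primary (data) device rather than the OSD stopping.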
Re: [ceph-users] cephfs: apache locks up after parallel reloads on multiple nodes
Thursday, 12 September 2019, 17.16 +0200 from Paul Emmerich : >Yeah, CephFS is much closer to POSIX semantics for a filesystem than >NFS. There's an experimental relaxed mode called LazyIO but I'm not >sure if it's applicable here. > >You can debug this by dumping slow requests from the MDS servers via >the admin socket Is LazyIO supported by the kernel client? If so, from which kernel version? Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph for "home lab" / hobbyist use?
Saturday, 7 September 2019, 15.25 +0200 from wil...@gmail.com : >On a related note, I came across this hardware while searching around >on this topic: https://ambedded.com/ambedded_com/ARM Interesting to see the cost of those. 8 LFF drives in 1U is pretty dense. Anyone using similar concepts in enterprise environments? Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] optane + 4x SSDs for VM disk images?
>> Could performance of Optane + 4x SSDs per node ever exceed that of >> pure Optane disks? > > No. With Ceph, the results for Optane and just for good server SSDs are > almost the same. One thing is that you can run more OSDs per Optane > than per usual SSD. However, the latency you get from both is almost > the same, as most of it comes from Ceph itself, not from the underlying > storage. This also results in Optanes being useless for > block.db/block.wal if your SSDs aren't shitty desktop ones. > > And as usual I'm posting the link to my article > https://yourcmc.ru/wiki/Ceph_performance :) You write that they are not reporting QD=1 single-threaded numbers, but in Tables 10 and 11 the average latencies are reported, which is "close to the same", so they can get: Read latency: 0.32ms (thereby 3125 IOPS) Write latency: 1.1ms (thereby 909 IOPS) Really nice write-up and very true - should be a must-read for anyone starting out with Ceph. -- Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
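The QD=1 latency-to-IOPS conversion above (IOPS = 1000 ms / latency in ms) can be reproduced directly with rados bench by pinning concurrency to one thread; the pool name is just an example test pool:

$ rados bench -p scbench -t 1 -b 4096 30 write --no-cleanup
$ rados bench -p scbench -t 1 30 rand

At -t 1 the reported average latency is the interesting number, and the IOPS figure is simply its inverse.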
Re: [ceph-users] OSD caching on EC-pools (heavy cross OSD communication on cached reads)
Makes sense - it makes the case for EC pools smaller, though. Jesper Sent from myMail for iOS Sunday, 9 June 2019, 17.48 +0200 from paul.emmer...@croit.io : >Caching is handled in BlueStore itself; erasure coding happens on a higher >layer. > > >Paul > >-- >Paul Emmerich > >Looking for help with your Ceph cluster? Contact us at https://croit.io > >croit GmbH >Freseniusstr. 31h >81247 München >www.croit.io >Tel: +49 89 1896585 90 > >On Sun, Jun 9, 2019 at 8:43 AM < jes...@krogh.cc > wrote: >>Hi. >> >>I just changed some of my data on CephFS to go to the EC pool instead >>of the 3x replicated pool. The data is "write rare / read heavy" data >>being served to an HPC cluster. >> >>To my surprise it looks like the OSD memory caching is done at the >>"split object level", not at the "assembled object level"; as a >>consequence - even though the dataset is fully memory cached - it >>actually delivers very "heavy" cross-OSD network traffic to >>assemble the objects back. >> >>Since (as far as I understand) no changes can go to the underlying >>object without going through the primary PG - caching could be >>done more effectively at that level. >> >>The caching on the 3x replica does not retrieve all 3 copies to compare >>and verify on a read request (or at least I cannot see any network >>traffic supporting that this is the case). >> >>Is the above configurable? Or would that be a feature/performance request? >> >>Jesper >> >>___ >>ceph-users mailing list >>ceph-users@lists.ceph.com >>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] OSD caching on EC-pools (heavy cross OSD communication on cached reads)
Hi. I just changed some of my data on CephFS to go to the EC pool instead of the 3x replicated pool. The data is "write rare / read heavy" data being served to an HPC cluster. To my surprise it looks like the OSD memory caching is done at the "split object level", not at the "assembled object level"; as a consequence - even though the dataset is fully memory cached - it actually delivers very "heavy" cross-OSD network traffic to assemble the objects back. Since (as far as I understand) no changes can go to the underlying object without going through the primary PG - caching could be done more effectively at that level. The caching on the 3x replica does not retrieve all 3 copies to compare and verify on a read request (or at least I cannot see any network traffic supporting that this is the case). Is the above configurable? Or would that be a feature/performance request? Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OSD RAM recommendations
> I'm a bit confused by the RAM recommendations for OSD servers. I have > also seen conflicting information in the lists (1 GB RAM per OSD, 1 GB > RAM per TB, 3-5 GB RAM per OSD, etc.). I guess I'm a lot better with a > concrete example: I think it depends on the usage pattern - the more the better. When configured for it, the OSD daemon will use the memory as a disk cache for reads - I have a similar setup, 7 hosts x 12 x 10TB disks - with 512GB RAM each. This serves an "active dataset" to an HPC cluster, where it is hugely beneficial to be able to cache the "hot data", which is 1.5TB-ish. If your "hot" dataset is smaller, then less will do as well. Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
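Concretely, the per-OSD cache is set in ceph.conf; the values below are illustrative rather than recommendations (on BlueStore the defaults are roughly 1 GB for HDD and 3 GB for SSD OSDs):

[osd]
# Luminous/BlueStore: per-OSD read cache, in bytes (here 8 GB)
bluestore_cache_size_hdd = 8589934592
bluestore_cache_size_ssd = 8589934592
# Later releases consolidate this into a single osd_memory_target knob.

Budget host RAM as (cache per OSD + a few GB of daemon overhead) x OSD count, plus headroom for recovery.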
Re: [ceph-users] Single threaded IOPS on SSD pool.
> Hi, > > On 5/6/19 at 16:53, vita...@yourcmc.ru wrote: >>> Ok, average network latency from VM to OSDs ~0.4ms. >> >> It's rather bad; you can improve the latency by 0.3ms just by >> upgrading the network. >> >>> Single threaded performance ~500-600 IOPS - or average latency of 1.6ms >>> Is that comparable to what others are seeing? >> >> Good "reference" numbers are 0.5ms for reads (~2000 iops) and 1ms for >> writes (~1000 iops). >> >> I confirm that the most powerful thing to do is disabling CPU >> powersave (governor=performance + cpupower -D 0). You usually get 2x >> single thread iops at once. > > We have a small cluster with 4 OSD hosts, each with 1 SSD INTEL > SSDSC2KB019T8 (D3-S4510 1.8T), connected with a 10G network (shared with > VMs, not a busy cluster). Volumes are replica 3: > > Network latency from one node to the other 3: > 10 packets transmitted, 10 received, 0% packet loss, time 9166ms > rtt min/avg/max/mdev = 0.042/0.064/0.088/0.013 ms > > 10 packets transmitted, 10 received, 0% packet loss, time 9190ms > rtt min/avg/max/mdev = 0.047/0.072/0.110/0.017 ms > > 10 packets transmitted, 10 received, 0% packet loss, time 9219ms > rtt min/avg/max/mdev = 0.061/0.078/0.099/0.011 ms What NIC / switching components are in play here? I simply cannot get latencies this far down. Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
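The powersave change referenced above maps to two commands (from the cpupower utility shipped in linux-tools; run on every OSD host and client):

$ sudo cpupower frequency-set -g performance   # lock the frequency governor
$ sudo cpupower idle-set -D 0                  # disable deep C-states
$ cpupower monitor                             # verify clocks and C-state residency

Note that idle-set does not persist across reboots; it typically needs a systemd unit or rc.local entry to stick.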
[ceph-users] Single threaded IOPS on SSD pool.
Hi. This is more an inquiry to figure out how our current setup compares to other setups. I have a 3x replicated SSD pool with RBD images. When running fio on /tmp I'm interested in seeing how many IOPS a single thread can get - as Ceph scales up very nicely with concurrency. Currently 34 OSDs of ~896GB Intel D3-S4510s each over 7 OSD hosts.

jk@iguana:/tmp$ for i in 01 02 03 04 05 06 07; do ping -c 10 ceph-osd$i; done |egrep '(statistics|rtt)'
--- ceph-osd01.nzcorp.net ping statistics ---
rtt min/avg/max/mdev = 0.316/0.381/0.483/0.056 ms
--- ceph-osd02.nzcorp.net ping statistics ---
rtt min/avg/max/mdev = 0.293/0.415/0.625/0.100 ms
--- ceph-osd03.nzcorp.net ping statistics ---
rtt min/avg/max/mdev = 0.319/0.395/0.558/0.074 ms
--- ceph-osd04.nzcorp.net ping statistics ---
rtt min/avg/max/mdev = 0.224/0.352/0.492/0.077 ms
--- ceph-osd05.nzcorp.net ping statistics ---
rtt min/avg/max/mdev = 0.257/0.360/0.444/0.059 ms
--- ceph-osd06.nzcorp.net ping statistics ---
rtt min/avg/max/mdev = 0.209/0.334/0.442/0.062 ms
--- ceph-osd07.nzcorp.net ping statistics ---
rtt min/avg/max/mdev = 0.259/0.401/0.517/0.069 ms

Ok, average network latency from VM to OSDs ~0.4ms.

$ fio fio-job-randr.ini
test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.2.10
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [2145KB/0KB/0KB /s] [536/0/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=29519: Wed Jun 5 08:40:51 2019
Description : [fio random 4k reads]
read : io=143352KB, bw=2389.2KB/s, iops=597, runt= 60001msec
slat (usec): min=8, max=1925, avg=30.24, stdev=13.56
clat (usec): min=7, max=321039, avg=1636.47, stdev=4346.52
lat (usec): min=102, max=321074, avg=1667.58, stdev=4346.57
clat percentiles (usec):
| 1.00th=[ 157], 5.00th=[ 844], 10.00th=[ 924], 20.00th=[ 1012],
| 30.00th=[ 1096], 40.00th=[ 1160], 50.00th=[ 1224], 60.00th=[ 1304],
| 70.00th=[ 1400], 80.00th=[ 1528], 90.00th=[ 1768], 95.00th=[ 2128],
| 99.00th=[11328], 99.50th=[18304], 99.90th=[51456], 99.95th=[94720],
| 99.99th=[216064]
bw (KB /s): min=0, max= 3089, per=99.39%, avg=2374.50, stdev=472.15
lat (usec) : 10=0.01%, 100=0.01%, 250=2.95%, 500=0.03%, 750=0.27%
lat (usec) : 1000=14.96%
lat (msec) : 2=75.87%, 4=2.99%, 10=1.78%, 20=0.73%, 50=0.30%
lat (msec) : 100=0.07%, 250=0.03%, 500=0.01%
cpu : usr=0.76%, sys=3.29%, ctx=38871, majf=0, minf=11
IO depths: 1=108.2%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued: total=r=35838/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
READ: io=143352KB, aggrb=2389KB/s, minb=2389KB/s, maxb=2389KB/s, mint=60001msec, maxt=60001msec

Disk stats (read/write):
vda: ios=38631/51, merge=0/3, ticks=62668/40, in_queue=62700, util=96.77%

And the fio file:

$ cat fio-job-randr.ini
[global]
readwrite=randread
blocksize=4k
ioengine=libaio
numjobs=1
thread=0
direct=1
iodepth=1
group_reporting=1
ramp_time=5
norandommap=1
description=fio random 4k reads
time_based=1
runtime=60
randrepeat=0
[test]
size=1g

Single threaded performance ~500-600 IOPS - or an average latency of 1.6ms. Is that comparable to what others are seeing? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Noob question - ceph-mgr crash on arm
0/ 5 client 1/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 1 reserver 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/10 civetweb 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle 0/ 0 refs 1/ 5 xio 1/ 5 compressor 1/ 5 bluestore 1/ 5 bluefs 1/ 3 bdev 1/ 5 kstore 4/ 5 rocksdb 4/ 5 leveldb 4/ 5 memdb 1/ 5 kinetic 1/ 5 fuse 1/ 5 mgr 1/ 5 mgrc 1/ 5 dpdk 1/ 5 eventtrace -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 1 max_new 1000 log_file /var/log/ceph/ceph-mgr.odroid-c.log --- end dump of recent events --- Kind regards Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] fscache and cephfs
Does that work together nicely? Anyone using it? With NVMe drives being fairly cheap, it could stack pretty nicely. Jesper Sent from myMail for iOS ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
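For context, the kernel CephFS client can use fscache when built with CONFIG_CEPH_FSCACHE and when cachefilesd is running against (say) an NVMe-backed directory; the mount then just gains the fsc flag (monitor address and secret file are placeholders):

$ sudo mount -t ceph mon1:6789:/ /mnt/cephfs \
    -o name=admin,secretfile=/etc/ceph/admin.secret,fsc

This is a sketch of the mechanism rather than a tested recipe - cache coherency still follows CephFS capabilities, so only read-mostly workloads benefit.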
[ceph-users] Stalls on new RBD images.
Hi. I'm fishing a bit here. What we see is that with new VM/RBD/SSD-backed images, performance can be lousy until they have been "fully written" for the first time. It looks as if they are thin-provisioned, and the subsequent growing of the images in Ceph delivers a performance hit. Does anyone else have something similar in their setup - and how do you deal with it? KVM-based virtualization, Ceph Luminous. Any suggestions/hints welcome. Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
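RBD images are indeed thin-provisioned: the backing RADOS objects are only created on first write. One hedged workaround is to pre-allocate the image with a full sequential write pass before handing it to the VM (the image name and size here are hypothetical):

$ rbd bench --io-type write --io-pattern seq --io-size 4M --io-total 50G rbd/vm-disk-1

This trades capacity for first-write latency, so it only makes sense where the thin-provisioning savings don't matter.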
Re: [ceph-users] How to reduce HDD OSD flapping due to rocksdb compacting event?
> On 4/10/19 9:07 AM, Charles Alva wrote: >> Hi Ceph Users, >> >> Is there a way to minimize the rocksdb compaction events so that they >> won't use all the spinning disk IO utilization, and avoid the OSD being marked >> as down due to failing to send heartbeats to others? >> >> Right now we have frequent high IO disk utilization every 20-25 >> minutes, where the rocksdb reaches level 4 with 67GB of data to compact. >> > > How big is the disk? RocksDB will need to compact at some point and it > seems that the HDD can't keep up. > > I've seen this with many customers and in those cases we offloaded the > WAL+DB to an SSD. I guess the SSD needs to be pretty durable to handle that? Is there a "migration path" to offload this, or does the OSD need to be destroyed and re-created? Thanks. Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
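On Luminous the practical answer is destroy and re-create. From Nautilus onwards, ceph-bluestore-tool can attach a new DB device and migrate in place; an untested sketch (device paths and OSD id are hypothetical):

$ sudo systemctl stop ceph-osd@12
$ sudo ceph-bluestore-tool bluefs-bdev-new-db \
    --path /var/lib/ceph/osd/ceph-12 --dev-target /dev/nvme0n1p3
$ sudo ceph-bluestore-tool bluefs-bdev-migrate \
    --path /var/lib/ceph/osd/ceph-12 \
    --devs-source /var/lib/ceph/osd/ceph-12/block \
    --dev-target /var/lib/ceph/osd/ceph-12/block.db
$ sudo systemctl start ceph-osd@12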
[ceph-users] VM management setup
Hi. Knowing this is a bit off-topic, but seeking recommendations and advice anyway. We're seeking a "management" solution for VMs - currently in the 40-50 VM range - and would like to have better access in managing them and potentially migrating them across multiple hosts, setting up block devices, etc., etc. This is only to be used internally in a department where a bunch of engineering people will manage it; no customers and that kind of thing. Up until now we have been using virt-manager with KVM - and have been quite satisfied while we were in the "few VMs" range, but it seems like time to move on. Thus we're looking for something "simple" that can help manage a ceph+kvm based setup - the simpler and more to the point the better. Any recommendations? .. found a lot of names already .. OpenStack CloudStack Proxmox .. But recommendations are truly welcome. Thanks. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] fio test rbd - single thread - qd1
> `cpupower idle-set -D 0` will help you a lot, yes. > > However it seems that not only the bluestore makes it slow. >= 50% of the > latency is introduced by the OSD itself. I'm just trying to understand > WHAT parts of it are doing so much work. For example in my current case > (with cpupower idle-set -D 0 of course) when I was testing a single OSD on > a very good drive (Intel NVMe, capable of 4+ single-thread sync write > iops) it was delivering me only 950-1000 iops. It's roughly 1 ms latency, > and only 50% of it comes from bluestore (you can see it in `ceph daemon osd.x > perf dump`)! I've even tuned bluestore a little, so that now I'm getting > ~1200 iops from it. It means that the bluestore's latency dropped by 33% > (it was around 1/1000 = 500 us, now it is 1/1200 = ~330 us). But still the > overall improvement is only 20% - everything else is eaten by the OSD > itself. Thanks for the insight - that means that the SSD numbers for read/write performance are roughly OK - I guess. It still puzzles me why the bluestore caching does not benefit the read side. Is the cache not an LRU cache on the block device, or is it actually used for something else? Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
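The per-OSD breakdown referenced there comes from the admin socket; a quick way to pull the headline latencies (jq is only for readability, and counter names can vary slightly between releases):

$ sudo ceph daemon osd.0 perf dump | jq '{op_r: .osd.op_r_latency, op_w: .osd.op_w_latency}'

Each counter reports avgcount and sum, so sum/avgcount gives the mean latency in seconds; comparing the osd section against the bluestore section shows how the time splits between the two layers.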
Re: [ceph-users] fio test rbd - single thread - qd1
> One thing you can check is the CPU performance (cpu governor in > particular). > On such light loads I've seen CPUs sitting in low performance mode (slower > clocks), giving MUCH worse performance results than when tried with heavier > loads. Try "cpupower monitor" on OSD nodes in a loop and observe the core > frequencies. > Thanks for the suggestion. They all seem to be powered up .. other suggestions/reflections are truly welcome.. Thanks. Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] fio test rbd - single thread - qd1
Hi All. I'm trying to get heads and tails of where we can stretch our Ceph cluster into what applications. Parallelism works excellently, but baseline throughput is - perhaps - not what I would expect it to be. Luminous cluster running BlueStore - all OSD daemons have 16GB of cache. Fio files attached - 4KB random read and 4KB random write - the test file is "only" 1GB. In this I ONLY care about raw IOPS numbers. I have 2 pools, both 3x replicated .. one backed with S4510 SSDs (14x1TB) and one with HDDs (84x10TB). Network latency from rbd mount to one of the osd hosts:

--- ceph-osd01.nzcorp.net ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9189ms
rtt min/avg/max/mdev = 0.084/0.108/0.146/0.022 ms

SSD: randr:
# grep iops read*json | grep -v 0.00 | perl -ane'print $F[-1] . "\n"' | cut -d\, -f1 | ministat -n
    N      Min      Max    Median      Avg     Stddev
x  38  1727.07  2033.66   1954.71  1949.4789  46.592401

randw:
# grep iops write*json | grep -v 0.00 | perl -ane'print $F[-1] . "\n"' | cut -d\, -f1 | ministat -n
    N      Min      Max    Median      Avg     Stddev
x  36   400.05   455.26    436.58  433.91417  12.468187

The double (or triple) network penalty of course kicks in and delivers a lower throughput here. Are these performance numbers in the ballpark of what we'd expect? With a 1GB test file .. I would really expect this to be memory cached in the OSD/bluestore cache and thus deliver read IOPS closer to the theoretical max: 1s/0.108ms => 9.2K IOPS. Again on the write side - all OSDs are backed by battery-backed write cache, thus writes should go directly into the memory of the controller .. .. still slower than reads - due to having to visit 3 hosts .. but not this low? Suggestions for improvements? Are other people seeing similar results? For the HDD tests I get similar - surprisingly slow - numbers:

# grep iops write*json | grep -v 0.00 | perl -ane'print $F[-1] . "\n"' | cut -d\, -f1 | ministat -n
    N      Min      Max    Median      Avg     Stddev
x  38    36.91    118.8     69.14  72.926842   21.75198

This should have the same performance characteristics as the SSDs, as the writes should be hitting BBWC.

# grep iops read*json | grep -v 0.00 | perl -ane'print $F[-1] . "\n"' | cut -d\, -f1 | ministat -n
    N      Min      Max    Median      Avg     Stddev
x  39    26.18   181.51     48.16  50.574872   24.01572

Same here - should be cached in the bluestore cache, as it is 16GB x 84 OSDs .. with a 1GB test file. Any thoughts - suggestions - insights? Jesper fio-single-thread-randr.ini Description: Binary data fio-single-thread-randw.ini Description: Binary data ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to just delete PGs stuck incomplete on EC pool
Did they break, or did something go wrong trying to replace them? Jesper Sent from myMail for iOS Saturday, 2 March 2019, 14.34 +0100 from Daniel K : >I bought the wrong drives trying to be cheap. They were 2TB WD Blue 5400rpm 2.5 inch laptop drives. > >They've been replaced now with HGST 10K 1.8TB SAS drives. > > > >On Sat, Mar 2, 2019, 12:04 AM < jes...@krogh.cc > wrote: >> >> >>Saturday, 2 March 2019, 04.20 +0100 from satha...@gmail.com < satha...@gmail.com >: >>>56 OSD, 6-node 12.2.5 cluster on Proxmox >>> >>>We had multiple drives fail (about 30%) within a few days of each other, likely faster than the cluster could recover. >> >>How did so many drives break? >> >>Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to just delete PGs stuck incomplete on EC pool
Saturday, 2 March 2019, 04.20 +0100 from satha...@gmail.com : >56 OSD, 6-node 12.2.5 cluster on Proxmox > >We had multiple drives fail (about 30%) within a few days of each other, likely faster than the cluster could recover. How did so many drives break? Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Understanding EC properties for CephFS / small files.
Hi Paul. Thanks for your comments. > For your examples: > > 16 MB file -> 4x 4 MB objects -> 4x 4x 1 MB data chunks, 4x 2x 1 MB > coding chunks > > 512 kB file -> 1x 512 kB object -> 4x 128 kB data chunks, 2x 128 kB > coding chunks > > > You'll run into different problems once the erasure coded chunks end > up being smaller than 64kb each, due to bluestore min allocation sizes > and general metadata overhead, making erasure coding a bad fit for very > small files. Thanks for the clarification, which makes this a "very bad fit" for CephFS:

# find . -type f -print0 | xargs -0 stat | grep Size | perl -ane '/Size: (\d+)/; print $1 . "\n";' | ministat -n
        N    Min            Max    Median        Avg      Stddev
x 12651568      0  1.0840049e+11      9036  2217611.6    32397960

This gives me 6.3M files < 9036 bytes in size, which will be stored as 6 x 64KB at the bluestore level if I understand it correctly. We come from an XFS world where the default block size is 4K, so the above situation worked quite nicely. I guess I would probably be way better off with an RBD with XFS on top to solve this case using Ceph. Is it fair to summarize your input as: In an EC4+2 configuration, the minimal used space is 256KB+128KB (coding) regardless of file size. In an EC8+3 configuration, the minimal used space is 512KB+192KB (coding) regardless of file size. And for the access side: all access to files in an EC pool requires, at a minimum, IO requests to k shards before the first bytes can be returned; with fast_read it becomes k+m, but returns when k have responded. Any experience with inlining data on the MDS - that would obviously help here, I guess. Thanks. -- Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
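For reference, inline data is the experimental CephFS feature that stores small file contents in the metadata pool alongside the inode; enabling it is a single command (the filesystem name "cephfs" is an assumption, and some releases want an extra confirmation flag):

$ ceph fs set cephfs inline_data true

Only files below a small threshold (a few KB) are inlined, so it addresses exactly the tail of tiny files above - but given its experimental status in Luminous, it deserves testing before production use.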
Re: [ceph-users] CephFS - read latency.
> Probably not related to CephFS. Try to compare the latency you are > seeing to the op_r_latency reported by the OSDs. > > The fast_read option on the pool can also help a lot for this IO pattern. Magic, that actually cut the read latency in half - making it more aligned with what to expect from the HW+network side:

     N       Min       Max     Median         Avg       Stddev
x  100  0.015687  0.221538   0.025253  0.03259606  0.028827849

25ms as a median, 32ms average is still on the high side, but way, way better. Thanks. -- Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
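For anyone searching later, the pool-level switch mentioned above is (pool name taken from the layout shown elsewhere in this thread):

$ ceph osd pool set cephfs_data_ec42 fast_read 1

With fast_read the primary issues reads to all k+m shards and reconstructs the object from the first k replies, trading extra backend reads for lower tail latency; it applies to erasure-coded pools.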
Re: [ceph-users] Understanding EC properties for CephFS / small files.
> I'm trying to understand the nuts and bolts of EC / CephFS > We're running an EC4+2 pool on top of 72 x 7.2K rpm 10TB drives. Pretty > slow bulk / archive storage. Ok, did some more searching and found this: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021642.html. Which to some degree confirms my understanding; I'd still like to get even more insight, though. Gregory Farnum comes with this comment: "Unfortunately any logic like this would need to be handled in your application layer. Raw RADOS does not do object sharding or aggregation on its own. CERN did contribute the libradosstriper, which will break down your multi-gigabyte objects into more typical sizes, but a generic system for packing many small objects into larger ones is tough; the choices depend so much on likely access patterns and such. I would definitely recommend working out something like that, though!" An idea about how to advance this stuff: I can see that this would be "very hard" to do at the object level given the Ceph concepts, but a suggestion would be to do it at the CephFS/MDS level. A basic thing that would "often" work would be a special type of "packed" object at the "directory level", where multiple files go into the same CephFS object. For common access patterns people are reading through entire catalogs in the first place, which would also limit IO on the overall system for tree traversals (think tar czvf linux.kernel.tar.gz of a git checkout). I have no idea how CephFS deals with concurrent updates around entities, but in this situation concurrency would be dealt with at the packed-object level. It would be harder to "pack files across catalogs", since that is not the native way for the MDS to keep track of things. A third way would be to more "aggressively" inline data on the MDS. How mature / well tested / efficient is that feature? http://docs.ceph.com/docs/master/cephfs/experimental-features/ The unfortunate consequence of bumping the 2KB size upwards to the point where EC pools become efficient would be that we end up hitting the MDS way harder than we do today. 2KB seems like a safe limit. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Understanding EC properties for CephFS / small files.
Hi List. I'm trying to understand the nuts and bolts of EC / CephFS. We're running an EC4+2 pool on top of 72 x 7.2K rpm 10TB drives. Pretty slow bulk / archive storage.

# getfattr -n ceph.dir.layout /mnt/home/cluster/mysqlbackup
getfattr: Removing leading '/' from absolute path names
# file: mnt/home/cluster/mysqlbackup
ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data_ec42"

This configuration is taken directly out of the online documentation (which may have been where it went all wrong from our perspective): http://docs.ceph.com/docs/master/cephfs/file-layouts/ Ok, this means that a 16MB file will be split into 4 chunks of 4MB each, with 2 erasure-coding chunks? I don't really understand the stripe_count element. And since erasure coding works at the object level, striping individual objects across - here 4 chunks - it'll end up filling 16MB? Or is there an internal optimization causing this not to be the case? Additionally, when reading the file, all 4 chunks need to be read to assemble the object, causing (at a minimum) 4 IOPS per file. Now, my common file size is < 8MB, and commonly 512KB files are on this pool. Will that cause a 512KB file to be padded to 4MB with 3 empty chunks to fill the erasure-coded profile, and then 2 coding chunks on top? In total 24MB for storing 512KB? And when reading it, will I hit 4 random IOs to read 512KB, or can it optimize around not reading "empty" chunks? If this is true, then I would be both performance- and space/cost-wise way better off with 3x replication. Or is it less bad than what I get to here? If the math is true, then we can begin to calculate chunk sizes and EC profiles for when EC begins to deliver benefits. In terms of IO it seems like I'll always suffer a 1:4 ratio on IOPS in a reading scenario on a 4+2 EC pool, compared to 3x replication. Side note: I'm trying to get bacula (tape backup) to read off my archive to tape at a "reasonable time/speed". Thanks in advance. -- Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
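Worth noting: layouts are per-directory attributes, so a directory tree of small files can be pointed at a replicated pool while the bulk data stays on EC. A hedged example (the pool name cephfs_data is an assumption for the 3x replicated data pool, and it must already be added to the filesystem):

$ setfattr -n ceph.dir.layout.pool -v cephfs_data /mnt/home/cluster/smallfiles

New files under that directory then go to the replicated pool; existing files keep their old layout until rewritten.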
Re: [ceph-users] PG_AVAILABILITY with one osd down?
> Hello, > your log extract shows that: > > 2019-02-15 21:40:08 OSD.29 DOWN > 2019-02-15 21:40:09 PG_AVAILABILITY warning start > 2019-02-15 21:40:15 PG_AVAILABILITY warning cleared > > 2019-02-15 21:44:06 OSD.29 UP > 2019-02-15 21:44:08 PG_AVAILABILITY warning start > 2019-02-15 21:44:15 PG_AVAILABILITY warning cleared > > What you saw is the natural consequence of OSD state change. Those two > periods of limited PG availability (6s each) are related to peering > that happens shortly after an OSD goes down or up. > Basically, the placement groups stored on that OSD need peering, so > the incoming connections are directed to other (alive) OSDs. And, yes, > during those few seconds the data are not accessible. Thanks, bear with my questions; I'm pretty new to Ceph. What will clients (CephFS, Object) experience? Will they just block until the period has passed and then get through, or? Which means that I'll get 72 x 6 seconds of unavailability when doing a rolling restart of my OSDs during upgrades and such? Or is a controlled restart different from a crash? -- Jesper. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
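For rolling maintenance the usual pattern is to suppress data rebalancing rather than peering - peering per restarted OSD is unavoidable, but it is brief, and clients block and retry rather than erroring out:

$ ceph osd set noout          # don't mark restarting OSDs out (no backfill)
$ sudo systemctl restart ceph-osd@29
$ ceph -s                     # wait for PGs to return to active+clean
$ ceph osd unset noout

This avoids the data movement that a real failure triggers, though each OSD restart still causes a few seconds of peering on its PGs.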
[ceph-users] PG_AVAILABILITY with one osd down?
Yesterday I saw this one.. it puzzles me:

2019-02-15 21:00:00.000126 mon.torsk1 mon.0 10.194.132.88:6789/0 604164 : cluster [INF] overall HEALTH_OK
2019-02-15 21:39:55.793934 mon.torsk1 mon.0 10.194.132.88:6789/0 604304 : cluster [WRN] Health check failed: 2 slow requests are blocked > 32 sec. Implicated osds 58 (REQUEST_SLOW)
2019-02-15 21:40:00.887766 mon.torsk1 mon.0 10.194.132.88:6789/0 604305 : cluster [WRN] Health check update: 6 slow requests are blocked > 32 sec. Implicated osds 9,19,52,58,68 (REQUEST_SLOW)
2019-02-15 21:40:06.973901 mon.torsk1 mon.0 10.194.132.88:6789/0 604306 : cluster [WRN] Health check update: 14 slow requests are blocked > 32 sec. Implicated osds 3,9,19,29,32,52,55,58,68,69 (REQUEST_SLOW)
2019-02-15 21:40:08.466266 mon.torsk1 mon.0 10.194.132.88:6789/0 604307 : cluster [INF] osd.29 failed (root=default,host=bison) (6 reporters from different host after 33.862482 >= grace 29.247323)
2019-02-15 21:40:08.473703 mon.torsk1 mon.0 10.194.132.88:6789/0 604308 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-02-15 21:40:09.489494 mon.torsk1 mon.0 10.194.132.88:6789/0 604310 : cluster [WRN] Health check failed: Reduced data availability: 6 pgs peering (PG_AVAILABILITY)
2019-02-15 21:40:11.008906 mon.torsk1 mon.0 10.194.132.88:6789/0 604312 : cluster [WRN] Health check failed: Degraded data redundancy: 3828291/700353996 objects degraded (0.547%), 77 pgs degraded (PG_DEGRADED)
2019-02-15 21:40:13.474777 mon.torsk1 mon.0 10.194.132.88:6789/0 604313 : cluster [WRN] Health check update: 9 slow requests are blocked > 32 sec. Implicated osds 3,9,32,55,58,69 (REQUEST_SLOW)
2019-02-15 21:40:15.060165 mon.torsk1 mon.0 10.194.132.88:6789/0 604314 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 17 pgs peering)
2019-02-15 21:40:17.128185 mon.torsk1 mon.0 10.194.132.88:6789/0 604315 : cluster [WRN] Health check update: Degraded data redundancy: 9897139/700354131 objects degraded (1.413%), 200 pgs degraded (PG_DEGRADED)
2019-02-15 21:40:17.128219 mon.torsk1 mon.0 10.194.132.88:6789/0 604316 : cluster [INF] Health check cleared: REQUEST_SLOW (was: 2 slow requests are blocked > 32 sec. Implicated osds 32,55)
2019-02-15 21:40:22.137090 mon.torsk1 mon.0 10.194.132.88:6789/0 604317 : cluster [WRN] Health check update: Degraded data redundancy: 9897140/700354194 objects degraded (1.413%), 200 pgs degraded (PG_DEGRADED)
2019-02-15 21:40:27.249354 mon.torsk1 mon.0 10.194.132.88:6789/0 604318 : cluster [WRN] Health check update: Degraded data redundancy: 9897142/700354287 objects degraded (1.413%), 200 pgs degraded (PG_DEGRADED)
2019-02-15 21:40:33.335147 mon.torsk1 mon.0 10.194.132.88:6789/0 604322 : cluster [WRN] Health check update: Degraded data redundancy: 9897143/700354356 objects degraded (1.413%), 200 pgs degraded (PG_DEGRADED)
... shortened ..
2019-02-15 21:43:48.496536 mon.torsk1 mon.0 10.194.132.88:6789/0 604366 : cluster [WRN] Health check update: Degraded data redundancy: 9897168/700356693 objects degraded (1.413%), 200 pgs degraded, 201 pgs undersized (PG_DEGRADED)
2019-02-15 21:43:53.496924 mon.torsk1 mon.0 10.194.132.88:6789/0 604367 : cluster [WRN] Health check update: Degraded data redundancy: 9897170/700356804 objects degraded (1.413%), 200 pgs degraded, 201 pgs undersized (PG_DEGRADED)
2019-02-15 21:43:58.497313 mon.torsk1 mon.0 10.194.132.88:6789/0 604368 : cluster [WRN] Health check update: Degraded data redundancy: 9897172/700356879 objects degraded (1.413%), 200 pgs degraded, 201 pgs undersized (PG_DEGRADED)
2019-02-15 21:44:03.497696 mon.torsk1 mon.0 10.194.132.88:6789/0 604369 : cluster [WRN] Health check update: Degraded data redundancy: 9897174/700356996 objects degraded (1.413%), 200 pgs degraded, 201 pgs undersized (PG_DEGRADED)
2019-02-15 21:44:06.939331 mon.torsk1 mon.0 10.194.132.88:6789/0 604372 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-02-15 21:44:06.965401 mon.torsk1 mon.0 10.194.132.88:6789/0 604373 : cluster [INF] osd.29 10.194.133.58:6844/305358 boot
2019-02-15 21:44:08.498060 mon.torsk1 mon.0 10.194.132.88:6789/0 604376 : cluster [WRN] Health check update: Degraded data redundancy: 9897174/700357056 objects degraded (1.413%), 200 pgs degraded, 201 pgs undersized (PG_DEGRADED)
2019-02-15 21:44:08.996099 mon.torsk1 mon.0 10.194.132.88:6789/0 604377 : cluster [WRN] Health check failed: Reduced data availability: 12 pgs peering (PG_AVAILABILITY)
2019-02-15 21:44:13.498472 mon.torsk1 mon.0 10.194.132.88:6789/0 604378 : cluster [WRN] Health check update: Degraded data redundancy: 55/700357161 objects degraded (0.000%), 33 pgs degraded (PG_DEGRADED)
2019-02-15 21:44:15.081437 mon.torsk1 mon.0 10.194.132.88:6789/0 604379 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 12 pgs peering)
2019-02-15 21:44:18.498808 mon.torsk1 mon.0 10.194.132.88:6789/0 604380 : cluster [WRN] Health check update: Degraded data redundancy: 14/700357230 objects degraded (0.
[ceph-users] CephFS - read latency.
Hi. I've got a bunch of "small" files moved onto CephFS as archive/bulk storage, and now I have the backup (to tape) to spool over them. A sample of the single-threaded backup client delivers this very consistent pattern:

$ sudo strace -T -p 7307 2>&1 | grep -A 7 -B 3 open
write(111, "\377\377\377\377", 4) = 4 <0.11>
openat(AT_FDCWD, "/ceph/cluster/rsyncbackups/fileshare.txt", O_RDONLY) = 38 <0.30>
write(111, "\0\0\0\021197418 2 67201568", 21) = 21 <0.36>
read(38, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\33\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 <0.049733>
write(111, "\0\1\0\0CLC\0\0\0\0\2\0\0\0\0\0\0\0\33\0\0\0\0\0\0\0\0\0\0\0\0"..., 65540) = 65540 <0.37>
read(38, " $$ $$\16\33\16 \16\33"..., 65536) = 65536 <0.000199>
write(111, "\0\1\0\0 $$ $$\16\33\16 $$"..., 65540) = 65540 <0.26>
read(38, "$ \33 \16\33\25 \33\33\33 \33\33\33 \25\0\26\2\16NVDOLOVB"..., 65536) = 65536 <0.35>
write(111, "\0\1\0\0$ \33 \16\33\25 \33\33\33 \33\33\33 \25\0\26\2\16NVDO"..., 65540) = 65540 <0.24>

The pattern is very consistent, thus it is not one PG or one OSD being contended.

$ sudo strace -T -p 7307 2>&1 | grep -A 3 open |grep read
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 11968 <0.070917>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 23232 <0.039789>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0P\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 <0.053598>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 28240 <0.105046>
read(41, "NZCA_FS_CLCGENOMICS, 1, 1\nNZCA_F"..., 65536) = 73 <0.061966>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 <0.050943>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\30\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 <0.031217>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 7392 <0.052612>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 288 <0.075930>
read(41, "1316919290-DASPHYNBAAPe2218b"..., 65536) = 940 <0.040609>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 22400 <0.038423>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 11984 <0.039051>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 9040 <0.054161>
read(41, "NZCA_FS_CLCGENOMICS, 1, 1\nNZCA_F"..., 65536) = 73 <0.040654>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 22352 <0.031236>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0N\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 <0.123424>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 49984 <0.052249>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 28176 <0.052742>
read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 288 <0.092039>

Or to sum:

sudo strace -T -p 23748 2>&1 | grep -A 3 open | grep read | perl -ane'/<(\d+\.\d+)>/; print $1 . "\n";' | head -n 1000 | ministat
     N      Min       Max    Median         Avg       Stddev
x 1000  3.2e-05  2.141551  0.054313  0.065834359  0.091480339

As can be seen, the "initial" read averages 65.8ms - which - if the file size is say 1MB and the rest of the time is 0 - caps read performance at roughly 20MB/s .. at that pace, the journey through double-digit TB is long, even with 72 OSDs backing.
Spec: Ceph Luminous 12.2.5 - Bluestore, 6 OSD nodes, 10TB HDDs, 4+2 EC pool, 10GbitE. Locally the drives deliver latencies of approximately 6-8ms for a random read. Any suggestion on where to find out where the remaining 50ms is being spent would be truly helpful. Large files "just work", as read-ahead does a nice job of getting performance up. -- Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
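One way to see where those milliseconds go is the OSD admin socket, which keeps a ring buffer of recently completed ops with per-stage timestamps (queued, reached_pg, started, done); the OSD id below is arbitrary:

$ sudo ceph daemon osd.3 dump_historic_ops | less

Comparing the event timestamps of a slow read shows whether the time is spent queued in the OSD, waiting for the PG, or in the disk read itself.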
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
> That's a useful conclusion to take back. Last question - we have our SSD pool set to 3x replication; Micron states that NVMe is good at 2x - is this "taste and safety", or are there any general thoughts about SSD robustness in a Ceph setup? Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
> On 07/02/2019 17:07, jes...@krogh.cc wrote: > Thanks for your explanation. In your case, you have low concurrency > requirements, so focusing on latency rather than total iops is your > goal. Your current setup gives 1.9 ms latency for writes and 0.6 ms for > reads. These are considered good; it is difficult to go below 1 ms for > writes. As Wido pointed out, to get latency down you need to ensure you have > C-States restricted in your cpu settings (or just the C1 state), you have no low > frequencies in your P-States, and get a CPU with high GHz frequency rather > than more cores (Nick Fisk has a good presentation on this); also avoid > dual socket and NUMA. Also, if money is no issue, you will get a bit > better latency with a 40G or 100G network. Thanks a lot. I'm heading towards the conclusion that if I went all in and got new HW + NVMe drives, then I'd "only" be about 3x better off than where I am today (compared to the Micron paper). That's a useful conclusion to take back. -- Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
Hi Maged. Thanks for your reply. > 6k is a low max write iops value.. even for a single client. For a cluster > of 3 nodes, we see from 10k to 60k write iops depending on hardware. > > can you increase your threads to 64 or 128 via the -t parameter I can absolutely get it higher by increasing the parallelism. But I may have failed to explain my purpose - I'm interested in how close I can get with RBD to putting local SSD/NVMe in the servers. Thus putting parallel scenarios into the tests that I would never see in production does not really help my understanding. I think a concurrency level of 16 is at the top of what I would expect our PostgreSQL databases to do in real life. > can you run fio with sync=1 on your disks. > > can you try with the noop scheduler > > what is the %utilization on the disks and cpu ? > > can you have more than 1 disk per node I'll have a look at that. Thanks for the suggestions. Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
Thanks for the confirmation, Marc. Can you put in a bit more hardware/network detail? Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
> On 2/7/19 8:41 AM, Brett Chancellor wrote: >> This seems right. You are doing a single benchmark from a single client. >> Your limiting factor will be the network latency. For most networks this >> is between 0.2 and 0.3ms. If you're trying to test the potential of >> your cluster, you'll need multiple workers and clients. >> > > Indeed. To add to this, you will need fast (high clockspeed!) CPUs in > order to get the latency down. The CPUs will need tuning as well, like > their power profiles and C-States. Thanks for the insight. I'm aware, and my current CPUs are pretty old - but I'm also in the process of learning how to make the right decisions when expanding. If all my time ends up being spent at the client end, then buying NVMe drives does not help me at all, nor do better CPUs in the OSDs. > You won't get the 1:1 performance from the SSDs on your RBD block devices. I'm fully aware of that - Ceph / RBD / etc. comes with an awesome feature package, and that flexibility delivers overhead and eats into it. But it helps to establish "upper bounds" and work my way to good from there. Thanks. Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
> On Thu, 7 Feb 2019 08:17:20 +0100 jes...@krogh.cc wrote: >> Hi List >> >> We are in the process of moving to the next use case for our ceph cluster >> (bulk, cheap, slow, erasure-coded, CephFS) storage was the first - and >> that works fine. >> >> We're currently on luminous / bluestore; if upgrading is deemed to >> change what we're seeing then please let us know. >> >> We have 6 OSD hosts, each with one 1TB S4510 SSD, connected >> through a H700 MegaRAID Perc BBWC, EachDiskRaid0 - and scheduler set to >> deadline, nomerges = 1, rotational = 0. >> > I'd make sure that the endurance of these SSDs is in line with your > expected usage. They are - at the moment :-) and Ceph allows me to change my mind without interfering with the applications running on top - nice! >> Each disk "should" give approximately 36K IOPS random write and double >> that random read. >> > Only locally; latency is your enemy. > > Tell us more about your network. It is a Dell N4032, N4064 switch stack on 10GBase-T. All hosts are on the same subnet; the NICs are Intel X540s. No jumbo framing and not much tuning - all kernels are on 4.15 (Ubuntu). Pings from a client to two of the OSD hosts:

--- flodhest.nzcorp.net ping statistics ---
50 packets transmitted, 50 received, 0% packet loss, time 50157ms
rtt min/avg/max/mdev = 0.075/0.105/0.158/0.021 ms

--- bison.nzcorp.net ping statistics ---
50 packets transmitted, 50 received, 0% packet loss, time 50139ms
rtt min/avg/max/mdev = 0.078/0.137/0.275/0.032 ms

> rados bench is not the sharpest tool in the shed for this. > As it needs to allocate stuff to begin with, amongst other things. Suggest longer test runs? >> This is also quite far from expected. I have 12GB of memory on the OSD >> daemon for caching on each host - close to an idle cluster - thus 50GB+ for >> caching with a working set of < 6GB .. this should - in this case - >> not really be bound by the underlying SSD. > Did you adjust the bluestore parameters (whatever they are this week or > for your version) to actually use that memory? According to top, it is picking up the caching memory. We have this block:

bluestore_cache_kv_max = 214748364800
bluestore_cache_kv_ratio = 0.4
bluestore_cache_meta_ratio = 0.1
bluestore_cache_size_hdd = 13958643712
bluestore_cache_size_ssd = 13958643712
bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,compact_on_mount=false

I actually think most of the above was applied with the 10TB hard drives in mind, not the SSDs .. but I have no idea if they do "bad things" for us. > Don't use iostat, use atop. > Small IOPS are extremely CPU intensive, so atop will give you an insight > as to what might be busy besides the actual storage device. Thanks, will do so. More suggestions are welcome. Doing some math: say network latency was the only cost driver, and assume one round-trip per IO per thread. 16 threads - 0.15ms per round-trip - gives 1000 ms/s/thread / 0.15ms/IO => ~6,666 IOPS per thread * 16 threads => ~106K IOPS. Ok, that's at least an upper bound on expectations in this scenario, and I am at 28207 - thus ~4x off - and have still not accounted for any OSD or rbd userspace time in the equation. Can I directly get service time out of the osd-daemon? It would be nice to see how many ms are spent at that end, from an OSD perspective.
-- Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
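To answer the service-time question: yes, both cluster-wide and per-daemon views exist. ceph osd perf shows per-OSD commit/apply latencies at a glance, and the admin socket exposes the detailed op latency counters (jq only for readability):

$ ceph osd perf                                             # ms-level latency per OSD
$ sudo ceph daemon osd.0 perf dump | jq .osd.op_w_latency   # sum/avgcount, in seconds

Subtracting the OSD-side latency from the client-observed latency gives a rough split between network+client overhead and OSD time.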
[ceph-users] rados block on SSD - performance - how to tune and get insight?
Hi List. We are in the process of moving to the next use case for our ceph cluster (bulk, cheap, slow, erasure-coded, CephFS) storage was the first - and that works fine. We're currently on luminous / bluestore; if upgrading is deemed to change what we're seeing then please let us know. We have 6 OSD hosts, each with one 1TB S4510 SSD, connected through a H700 MegaRAID Perc BBWC, EachDiskRaid0 - and scheduler set to deadline, nomerges = 1, rotational = 0. Each disk "should" give approximately 36K IOPS random write and double that random read. The pool is set up with 3x replication. We would like a "scale-out" setup of well-performing SSD block devices - potentially to host databases and things like that. I read through this nice document [0]; I know the HW is radically different from mine, but I still think I'm in the very low end of what 6 x S4510 should be capable of doing. Since it is IOPS I care about, I have lowered the block size to 4096 -- a 4M block size nicely saturates the NICs in both directions.

$ sudo rados bench -p scbench -b 4096 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_torsk2_11207
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 5857 5841 22.8155 22.8164 0.00238437 0.00273434
2 15 11768 11753 22.9533 23.0938 0.0028559 0.00271944
3 16 17264 17248 22.4564 21.4648 0.0024 0.00278101
4 16 22857 22841 22.3037 21.8477 0.002716 0.00280023
5 16 28462 28446 22.2213 21.8945 0.00220186 0.002811
6 16 34216 34200 22.2635 22.4766 0.00234315 0.00280552
7 16 39616 39600 22.0962 21.0938 0.00290661 0.00282718
8 16 45510 45494 22.2118 23.0234 0.0033541 0.00281253
9 16 50995 50979 22.1243 21.4258 0.00267282 0.00282371
10 16 56745 56729 22.1577 22.4609 0.00252583 0.0028193
Total time run: 10.002668
Total writes made: 56745
Write size: 4096
Object size: 4096
Bandwidth (MB/sec): 22.1601
Stddev Bandwidth: 0.712297
Max bandwidth (MB/sec): 23.0938
Min bandwidth (MB/sec): 21.0938
Average IOPS: 5672
Stddev IOPS: 182
Max IOPS: 5912
Min IOPS: 5400
Average Latency(s): 0.00281953
Stddev Latency(s): 0.00190771
Max latency(s): 0.0834767
Min latency(s): 0.00120945

Min latency is fine -- but a max latency of 83ms? Average IOPS @ 5672?

$ sudo rados bench -p scbench 10 rand
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 15 23329 23314 91.0537 91.0703 0.000349856 0.000679074
2 16 48555 48539 94.7884 98.5352 0.000499159 0.000652067
3 16 76193 76177 99.1747 107.961 0.000443877 0.000622775
4 15 103923 103908 101.459 108.324 0.000678589 0.000609182
5 15 132720 132705 103.663 112.488 0.000741734 0.000595998
6 15 161811 161796 105.323 113.637 0.000333166 0.000586323
7 15 190196 190181 106.115 110.879 0.000612227 0.000582014
8 15 221155 221140 107.966 120.934 0.000471219 0.000571944
9 16 251143 251127 108.984 117.137 0.000267528 0.000566659
Total time run: 10.000640
Total reads made: 282097
Read size: 4096
Object size: 4096
Bandwidth (MB/sec): 110.187
Average IOPS: 28207
Stddev IOPS: 2357
Max IOPS: 30959
Min IOPS: 23314
Average Latency(s): 0.000560402
Max latency(s): 0.109804
Min latency(s): 0.000212671

This is also quite far from expected. I have 12GB of memory on the OSD daemon for caching on each host - close to an idle cluster - thus 50GB+ for caching with a working set of < 6GB .. this should - in this case - not really be bound by the underlying SSD.
But if it were: IOPS/disk * num disks / replication => 95K * 6 / 3 => 190K, or 6x off? No measurable service time in iostat when running the tests, thus I have come to the conclusion that it has to be either the client side, the network path, or the OSD daemon that delivers the increasing latency / decreased IOPS. Are there any suggestions on how to get more insight into that? Has anyone replicated numbers close to what Micron is reporting on NVMe? Thanks a lot. [0] https://www.micron.com/-/media/client/global/documents/products/other-documents/micron_9200_max_ceph_12,-d-,2,-d-,8_luminous_bluestore_reference_architecture.pdf?la=en ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
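For RBD-level numbers (rados bench measures the bare object layer), fio's rbd engine can drive an image directly from userspace; pool and image names here are assumptions:

[global]
ioengine=rbd
clientname=admin
pool=scbench
rbdname=fio-test
direct=1
[randwrite-4k]
rw=randwrite
bs=4k
iodepth=16
runtime=60
time_based=1

This takes the VM and kernel block layer out of the picture, which helps separate client-side overhead from cluster-side latency.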
Re: [ceph-users] Spec for Ceph Mon+Mgr?
> : We're currently co-locating our mons with the head node of our Hadoop
> : installation. That may be giving us some problems, we don't know yet, but
> : thus I'm speculating about moving them to dedicated hardware.

Would it be ok to run them on KVM VMs - of course not backed by ceph?

Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Spec for Ceph Mon+Mgr?
Hi. We're currently co-locating our mons with the head node of our Hadoop installation. That may be giving us some problems - we don't know yet - and thus I'm speculating about moving them to dedicated hardware. It is hard to get specifications "small" enough .. the mon spec is the kind of thing we usually virtualize our way out of .. which seems very wrong here. Are other people just co-locating it with something random, or what are others typically using in a small ceph cluster (< 100 OSDs .. 7 OSD hosts)? Thanks. Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS - Small file - single thread - read performance.
Hi Everyone. Thanks for the testing, everyone - I think my system works as intended. When reading from another client - hitting the cache of the OSD hosts - I also get down to 7-8ms. As mentioned, this is probably as expected. I need to figure out how to increase parallelism somewhat - or convince users not to create those ridiculous amounts of small files. -- Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] CephFS - Small file - single thread - read performance.
Hi. We have the intention of using CephFS for some of our shares, which we'd like to spool to tape as part of a normal backup schedule. CephFS works nicely for large files, but for "small" ones .. < 0.1MB .. there seems to be an "overhead" of 20-40ms per file. I tested like this:

root@abe:/nfs/home/jk# time cat /ceph/cluster/rsyncbackups/13kbfile > /dev/null
real    0m0.034s
user    0m0.001s
sys     0m0.000s

And from the local page cache right after:

root@abe:/nfs/home/jk# time cat /ceph/cluster/rsyncbackups/13kbfile > /dev/null
real    0m0.002s
user    0m0.002s
sys     0m0.000s

Giving a ~30ms overhead on a single file. This is about 3x higher than on our local filesystems (xfs) based on the same spindles. CephFS metadata is on SSD - everything else on big, slow HDDs (in both cases). Is this what everyone else sees? Thanks -- Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
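To get a more robust number than a single cat, the open/read/close overhead can be averaged over many files - a rough sketch, where the path and file count are just placeholders:

$ time find /ceph/cluster/somedir -type f -size -100k | head -1000 | xargs cat > /dev/null

Dividing the wall time by the file count gives the average per-file overhead, which smooths out outliers from any single run.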
[ceph-users] Ceph community - how to make it even stronger
Hi All. I was reading up - especially the thread on upgrading to mimic and stable releases - and it caused me to reflect a bit on our ceph journey so far. We started approximately 6 months ago - with CephFS as the dominant use case in our HPC setup - starting at 400TB usable capacity and, as it matures, going towards 1PB - mixed slow and SSD.

Some of the first confusions were:
- bluestore vs. filestore - what was the recommendation, actually?
- Figuring out which kernel clients are usable with CephFS - and which kernels to use on the other end?
- Tuning of the MDS?
- Imbalance of OSD nodes rendering the cluster down - how to balance?
- Triggering kernel bugs in the kernel client during OSD_FULL?

This mailing list has been very responsive to the questions, thanks for that. But - compared to other open source projects we're lacking a bit of infrastructure and guidance here. I did check:
- http://tracker.ceph.com/projects/ceph/wiki/Wiki => which does not seem to be operational.
- http://docs.ceph.com/docs/mimic/start/get-involved/

Gmane is probably not coming back - we have been waiting 2 years now; can we easily get the mailing list archives indexed otherwise? I feel that the wealth of knowledge being built up around operating ceph is not really captured to make the next user's journey better and easier. I would love to help out - hey, I end up spending the time anyway - but some guidance on how to do it would help. I would suggest:

1) Dump a 1-3 monthly status email on the project to the respective mailing lists => major releases, conferences, etc.
2) Get the wiki active - one of the main things I want to know about when messing with storage is what is working for other people - just a page where people can dump an aggregated output of their ceph cluster and write 2-5 lines about the use case for it.
3) Either get the community more active on the documentation - advocate for it - or start up more documentation on the wiki => a FAQ would be a nice place to start.

There may be an awful lot of things I've missed in this write-up - but please follow up. If some of the core ceph people already have thoughts / ideas / guidance, please share, so we can collaboratively make it better. Lastly - thanks for the great support on the mailing list - so far - the intent is only to try to make ceph even better. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Balancing cluster with large disks - 10TB HHD
>> I would still like to have a log somewhere to grep and inspect what
>> balancer/upmap actually does - and when - in my cluster. Or some ceph
>> commands that deliver some monitoring capabilities .. any suggestions?
> Yes, on ceph-mgr log, when log level is DEBUG.

Tried the docs .. something like "ceph tell mds ..." does not seem to work. http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/

> You can get your cluster upmap's via `ceph osd dump | grep upmap`.

Got it -- but I really need the README .. it shows the map ..
...
pg_upmap_items 6.0 [40,20]
pg_upmap_items 6.1 [59,57,47,48]
pg_upmap_items 6.2 [59,55,75,9]
pg_upmap_items 6.3 [22,13,40,39]
pg_upmap_items 6.4 [23,9]
pg_upmap_items 6.5 [25,17]
pg_upmap_items 6.6 [45,46,59,56]
pg_upmap_items 6.8 [60,54,16,68]
pg_upmap_items 6.9 [61,69]
pg_upmap_items 6.a [51,48]
pg_upmap_items 6.b [43,71,41,29]
pg_upmap_items 6.c [22,13]
..
But .. I don't have any pgs that should only have 2 replicas.. nor any with 4 .. how should this be interpreted? Thanks. -- Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
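On the interpretation: the lists are not replica sets but remap pairs - pg_upmap_items stores (from-osd, to-osd) tuples, so the output should be read two numbers at a time:

pg_upmap_items 6.0 [40,20]          # pg 6.0: move the replica on osd.40 to osd.20
pg_upmap_items 6.1 [59,57,47,48]    # pg 6.1: move osd.59 -> osd.57 and osd.47 -> osd.48

An entry with four numbers is therefore two remap exceptions for that pg, not a four-replica pg.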
Re: [ceph-users] Balancing cluster with large disks - 10TB HHD
Hi. .. Just an update - this looks awesome.. and in an 8x5 company, Christmas is a good period to rebalance a cluster :-)

>> I'll try it out again - last I tried it complained about older clients -
>> it should be better now.
> upmap is supported since kernel 4.13.
>
>> Second - should the reweights be set back to 1 then?
> Yes, also:
>
> 1. `ceph osd crush tunables optimal`

Done

> 2. All your buckets should be straw2, but in case `ceph osd crush
> set-all-straw-buckets-to-straw2`

Done

> 3. Your hosts imbalanced: elefant & capone have only eight 10TB's,
> another hosts - 12. So I recommend replace 8TB's spinners to 10TB or
> just shuffle it between hosts, like 2x8TB+10x10Tb.

Yes, we initially thought we could go with 3 osd-hosts .. but then found out that EC pools required more - and then added the rest.

> 4. Revert all your reweights.

Done

> 5. Balancer do his work: `ceph balancer mode upmap`, `ceph balancer on`.

So far - works awesome --

$ sudo qms/server_documentation/ceph/ceph-osd-data-distribution hdd
hdd
    N   Min    Max    Median  Avg        Stddev
x  72   50.82  55.65  52.88   52.916944  1.0002586

As compared to the best I got with reweighting:

$ sudo qms/server_documentation/ceph/ceph-osd-data-distribution hdd
hdd
    N   Min    Max    Median  Avg        Stddev
x  72   45.36  54.98  52.63   52.131944  2.0746672

It took about 24 hours to rebalance - and moved quite some TBs around. I would still like to have a log somewhere to grep and inspect what balancer/upmap actually does - and when - in my cluster. Or some ceph commands that deliver some monitoring capabilities .. any suggestions? -- Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
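For ad-hoc monitoring without an internal script like the one above, the same distribution numbers can be pulled straight out of ceph - a sketch, assuming the Luminous `ceph osd df` column layout (CLASS in column 2, %USE in column 8):

$ sudo ceph osd df | awk '$2=="hdd" {s+=$8; ss+=$8*$8; n++} END {m=s/n; printf "n=%d avg=%.2f stddev=%.2f\n", n, m, sqrt(ss/n-m*m)}'

Running it before and after enabling the balancer shows the stddev shrinking as upmap does its work.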
Re: [ceph-users] Balancing cluster with large disks - 10TB HHD
> Have a look at this thread on the mailing list:
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg46506.html

Ok, done.. how do I see that it actually works? Second - should the reweights be set back to 1 then? Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
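A few commands that help verify the balancer is doing something - a sketch, plan name hypothetical, per the Luminous mgr balancer module:

$ sudo ceph balancer status                                          # mode, whether it is active, queued plans
$ sudo ceph balancer eval                                            # current cluster score (lower is better)
$ sudo ceph balancer optimize myplan && sudo ceph balancer show myplan   # preview the changes a plan would make

Watching the eval score drop over successive runs is the simplest signal that upmap is actually improving the distribution.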
Re: [ceph-users] Balancing cluster with large disks - 10TB HHD
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> On mik, 2018-12-26 at 13:14 +0100, jes...@krogh.cc wrote:
>> Thanks for the insight and links.
>>
>> > As I can see you are on Luminous. Since Luminous Balancer plugin is
>> > available [1], you should use it instead reweight's in place, especially
>> > in upmap mode [2]
>>
>> I'll try it out again - last I tried it complained about older clients -
>> it should be better now.
>
> require_min_compat_client luminous is required for you to take advantage
> of upmap.

$ sudo ceph osd set-require-min-compat-client luminous
Error EPERM: cannot set require_min_compat_client to luminous: 54 connected client(s) look like jewel (missing 0x800); add --yes-i-really-mean-it to do it anyway

We've standardized on the 4.15 kernel client on all CephFS clients; those are the 54. Would it be safe to ignore the above warning? Otherwise - which kernel do I need to go to? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
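For what it's worth, the feature bits a kernel advertises lag behind what it actually implements - 4.13+ kernels support upmap even though they still present themselves as jewel. To see exactly what the cluster thinks is connected (available since Luminous):

$ sudo ceph features

which groups mons, osds and clients by the release their feature bits map to. If all the "jewel" entries are 4.13+ kernel clients, passing --yes-i-really-mean-it is generally considered safe; genuinely older clients would lose the ability to decode the osdmap once upmap entries exist.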
Re: [ceph-users] Balancing cluster with large disks - 10TB HHD
Thanks for the insight and links.

> As I can see you are on Luminous. Since Luminous Balancer plugin is
> available [1], you should use it instead reweight's in place, especially
> in upmap mode [2]

I'll try it out again - last I tried it complained about older clients - it should be better now.

> Also, maybe I can catch other crush mistakes; can I see `ceph osd
> crush show-tunables`, `ceph osd crush rule dump`, `ceph osd pool ls
> detail`?

Here:

$ sudo ceph osd crush show-tunables
{
    "choose_local_tries": 0,
    "choose_local_fallback_tries": 0,
    "choose_total_tries": 50,
    "chooseleaf_descend_once": 1,
    "chooseleaf_vary_r": 1,
    "chooseleaf_stable": 0,
    "straw_calc_version": 1,
    "allowed_bucket_algs": 54,
    "profile": "hammer",
    "optimal_tunables": 0,
    "legacy_tunables": 0,
    "minimum_required_version": "hammer",
    "require_feature_tunables": 1,
    "require_feature_tunables2": 1,
    "has_v2_rules": 1,
    "require_feature_tunables3": 1,
    "has_v3_rules": 0,
    "has_v4_buckets": 1,
    "require_feature_tunables5": 0,
    "has_v5_rules": 0
}

$ sudo ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_ruleset_hdd",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            { "op": "take", "item": -1, "item_name": "default~hdd" },
            { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    },
    {
        "rule_id": 1,
        "rule_name": "replicated_ruleset_hdd_fast",
        "ruleset": 1,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            { "op": "take", "item": -28, "item_name": "default~hdd_fast" },
            { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    },
    {
        "rule_id": 2,
        "rule_name": "replicated_ruleset_ssd",
        "ruleset": 2,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            { "op": "take", "item": -21, "item_name": "default~ssd" },
            { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    },
    {
        "rule_id": 3,
        "rule_name": "cephfs_data_ec42",
        "ruleset": 3,
        "type": 3,
        "min_size": 3,
        "max_size": 6,
        "steps": [
            { "op": "set_chooseleaf_tries", "num": 5 },
            { "op": "set_choose_tries", "num": 100 },
            { "op": "take", "item": -1, "item_name": "default~hdd" },
            { "op": "chooseleaf_indep", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    }
]

$ sudo ceph osd pool ls detail
pool 6 'kube' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 41045 flags hashpspool stripe_width 0 application rbd removed_snaps [1~3]
pool 15 'default.rgw.buckets.data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 41045 flags hashpspool stripe_width 0 application rgw
pool 17 'default.rgw.users.keys' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 last_change 41045 lfor 0/36590 flags hashpspool stripe_width 0 application rgw
pool 18 'default.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 last_change 41045 lfor 0/36595 flags hashpspool stripe_width 0 application rgw
pool 19 'default.rgw.users.uid' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 last_change 41045 lfor 0/36608 flags hashpspool stripe_width 0 application rgw
pool 20 'rbd' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 41045 flags hashpspool stripe_width 0 application rbd
pool 26 'default.rgw.data.root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 41045 flags hashpspool stripe_width 0 application rgw
pool 27 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 41045 flags hashpspool stripe_width 0 application rgw
pool 28 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0 ob
Re: [ceph-users] Balancing cluster with large disks - 10TB HHD
> Please, paste your `ceph osd df tree` and `ceph osd dump | head -n 12`.

$ sudo ceph osd df tree
ID  CLASS     WEIGHT     REWEIGHT SIZE  USE    AVAIL  %USE  VAR  PGS TYPE NAME
 -8           639.98883  -        639T  327T   312T   51.24 1.00   - root default
-10           111.73999  -        111T  58509G 55915G 51.13 1.00   -     host bison
 78 hdd_fast    0.90900  1.0      930G  1123M  929G    0.12 0.00   0         osd.78
 79 hdd_fast    0.81799  1.0      837G  1123M  836G    0.13 0.00   0         osd.79
 20 hdd         9.09499  0.95000  9313G 4980G  4333G  53.47 1.04 204         osd.20
 28 hdd         9.09499  1.0      9313G 4612G  4700G  49.53 0.97 200         osd.28
 29 hdd         9.09499  1.0      9313G 4848G  4465G  52.05 1.02 211         osd.29
 33 hdd         9.09499  1.0      9313G 4759G  4553G  51.10 1.00 207         osd.33
 34 hdd         9.09499  1.0      9313G 4613G  4699G  49.54 0.97 195         osd.34
 35 hdd         9.09499  0.89250  9313G 4954G  4359G  53.19 1.04 206         osd.35
 36 hdd         9.09499  1.0      9313G 4724G  4588G  50.73 0.99 200         osd.36
 37 hdd         9.09499  1.0      9313G 5013G  4300G  53.83 1.05 214         osd.37
 38 hdd         9.09499  0.92110  9313G 4962G  4350G  53.28 1.04 206         osd.38
 39 hdd         9.09499  1.0      9313G 4960G  4353G  53.26 1.04 214         osd.39
 40 hdd         9.09499  1.0      9313G 5022G  4291G  53.92 1.05 216         osd.40
 41 hdd         9.09499  0.88235  9313G 5037G  4276G  54.09 1.06 203         osd.41
  7 ssd         0.87299  1.0      893G  18906M 875G    2.07 0.04 124         osd.7
 -7           102.74084  -        102T  54402G 50805G 51.71 1.01   -     host bonnie
  0 hdd         7.27699  0.87642  7451G 4191G  3259G  56.25 1.10 175         osd.0
  1 hdd         7.27699  0.86200  7451G 3837G  3614G  51.49 1.01 163         osd.1
  2 hdd         7.27699  0.74664  7451G 3920G  3531G  52.61 1.03 169         osd.2
 11 hdd         7.27699  0.77840  7451G 3983G  3467G  53.46 1.04 169         osd.11
 13 hdd         9.09499  0.76595  9313G 4894G  4419G  52.55 1.03 201         osd.13
 14 hdd         9.09499  1.0      9313G 4350G  4963G  46.71 0.91 189         osd.14
 16 hdd         9.09499  0.92635  9313G 4879G  4434G  52.39 1.02 204         osd.16
 18 hdd         9.09499  0.67932  9313G 4634G  4678G  49.76 0.97 190         osd.18
 22 hdd         9.09499  0.93053  9313G 5085G  4228G  54.60 1.07 218         osd.22
 31 hdd         9.09499  0.88536  9313G 5152G  4160G  55.33 1.08 221         osd.31
 42 hdd         9.09499  0.84232  9313G 4796G  4516G  51.51 1.01 199         osd.42
 43 hdd         9.09499  0.87662  9313G 4656G  4657G  50.00 0.98 191         osd.43
  6 ssd         0.87299  1.0      894G  20643M 874G    2.25 0.04 134         osd.6
 -6           102.74100  -        102T  53627G 51580G 50.97 0.99   -     host capone
  3 hdd         7.27699  0.84938  7451G 4028G  3422G  54.07 1.06 171         osd.3
  4 hdd         7.27699  0.83890  7451G 3909G  3542G  52.46 1.02 167         osd.4
  5 hdd         7.27699  1.0      7451G 3389G  4061G  45.49 0.89 151         osd.5
  9 hdd         7.27699  1.0      7451G 3710G  3740G  49.80 0.97 161         osd.9
 15 hdd         9.09499  1.0      9313G 4952G  4360G  53.18 1.04 206         osd.15
 17 hdd         9.09499  0.95000  9313G 4865G  4448G  52.24 1.02 202         osd.17
 23 hdd         9.09499  1.0      9313G 4984G  4329G  53.52 1.04 223         osd.23
 24 hdd         9.09499  1.0      9313G 4847G  4466G  52.05 1.02 202         osd.24
 25 hdd         9.09499  0.89929  9313G 4909G  4404G  52.71 1.03 205         osd.25
 30 hdd         9.09499  0.92787  9313G 4740G  4573G  50.90 0.99 202         osd.30
 74 hdd         9.09499  0.93146  9313G 4709G  4603G  50.57 0.99 199         osd.74
 75 hdd         9.09499  1.0      9313G 4559G  4753G  48.96 0.96 194         osd.75
  8 ssd         0.87299  1.0      893G  19593M 874G    2.14 0.04 129         osd.8
-16           102.74100  -        102T  53985G 51222G 51.31 1.00   -     host elefant
 19 hdd         7.27699  1.0      7451G 3665G  3786G  49.19 0.96 152         osd.19
 21 hdd         7.27699  0.89539  7451G 4102G  3349G  55.05 1.07 169         osd.21
 64 hdd         7.27699  0.89275  7451G 3956G  3494G  53.10 1.04 171         osd.64
 65 hdd         7.27699  0.92513  7451G 3976G  3475G  53.36 1.04 171         osd.65
 66 hdd         9.09499  1.0      9313G 4674G  4638G  50.20 0.98 199         osd.66
 67 hdd         9.09499  1.0      9313G 4737G  4575G  50.87 0.99 201         osd.67
 68 hdd         9.09499  0.89973  9313G 4946G  4366G  53.11 1.04 211         osd.68
 69 hdd         9.09499  1.0      9313G 4648G  4665G  49.91 0.97 204         osd.69
 70 hdd         9.09499  0.89526  9313G 4907G  4405G  52.69 1.03 209         osd.70
 71 hdd         9.09499  0.84923  9313G 4690G  4622G  50.37 0.98 198         osd.71
 72 hdd         9.09499  0.87547  9313G 4976G  4336G  53.43 1.04 211         osd.72
 73 hdd         9.09499  1.0      9313G 4683G  4630G  50.29 0.98 200         osd.73
 10 ssd         0.87299  1.0      893G  19158M 875G    2.09 0.04 126         osd.10
[ceph-users] Balancing cluster with large disks - 10TB HHD
Hi. We hit an OSD_FULL last week on our cluster - with an average utilization of less than 50% .. thus hugely imbalanced. This has driven us to adjust pg counts upwards and reweight the osds more aggressively. Question: What do people see as an "acceptable" variance across OSDs?

    N   Min    Max    Median  Avg        Stddev
x  72   45.49  56.25  52.35   51.878889  2.1764343

72 x 10TB drives. It seems hard to get it further down -- and churn will most likely make it hard for us to stay at this level. Currently we have ~158 PGs/OSD .. which by my math gives 63GB/pg if they were fully utilizing the disk - which leads me to think that somewhat smaller pgs would give the balancing an easier job. Would it be ok to go closer to 300 PGs/OSD? - would it be sane? I can see that the default max is 300, but I have a hard time finding out if this is "recommendable" or just a "tunable".

* We've now seen OSD_FULL trigger irrecoverable kernel bugs in the CephFS kernel client on our 4.15 kernels - multiple times - a forced reboot is the only way out. We're on the Ubuntu kernels .. I haven't done the diff to upstream (yet), and I don't intend to run our production cluster disk-full anywhere in the near future to test it. Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
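To put numbers on the granularity argument: a 10TB drive at 158 PGs holds roughly 10000GB / 158, i.e. about 63GB per PG, so the balancer can only shift data in ~63GB steps - about 0.6% of a drive per move. At 300 PGs/OSD that step shrinks to roughly 33GB, which is why more (reasonably sized) PGs tend to let upmap converge on a tighter distribution; the trade-off is more peering work and more memory per OSD.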
[ceph-users] Priority of repair vs rebalancing?
Hi. In our ceph cluster we hit one OSD at 95% full while others in the same pool only hit 40% .. (total usage is ~55%). Thus I went into a:

sudo ceph osd reweight-by-utilization 110 0.05 12

which initiated some data movement.. but right after, ceph status reported:

jk@bison:~/adm-git$ sudo ceph -s
  cluster:
    id: dbc33946-ba1f-477c-84df-c63a3c9c91a6
    health: HEALTH_WARN
            49924979/660322545 objects misplaced (7.561%)
            Degraded data redundancy: 26/660322545 objects degraded (0.000%), 2 pgs degraded
  services:
    mon: 3 daemons, quorum torsk1,torsk2,bison
    mgr: bison(active), standbys: torsk1
    mds: cephfs-1/1/2 up {0=zebra01=up:active}, 1 up:standby-replay
    osd: 78 osds: 78 up, 78 in; 255 remapped pgs
    rgw: 9 daemons active
  data:
    pools: 16 pools, 2184 pgs
    objects: 141M objects, 125 TB
    usage: 298 TB used, 340 TB / 638 TB avail
    pgs: 26/660322545 objects degraded (0.000%)
         49924979/660322545 objects misplaced (7.561%)
         1927 active+clean
          187 active+remapped+backfilling
           68 active+remapped+backfill_wait
            2 active+recovery_wait+degraded
  io:
    client: 761 kB/s rd, 1284 kB/s wr, 85 op/s rd, 79 op/s wr
    recovery: 623 MB/s, 665 objects/s

Any idea how those 26 objects got degraded in the process? Just in-flight writes? Any means to prioritize the 26 degraded objects over the 49M objects that merely need to be moved? Thanks. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
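Since Luminous there is a knob for exactly this - a sketch, the pg ids below are hypothetical:

$ sudo ceph health detail | grep degraded     # find the two degraded pgs
$ sudo ceph pg force-recovery 6.1a 6.2f       # bump them ahead of the backfill queue

force-recovery (and its sibling force-backfill) reprioritizes the named pgs over ordinary rebalancing. The 26 degraded objects are most likely just writes that landed while the pgs were being remapped, and they normally recover quickly on their own.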
Re: [ceph-users] cephfs kernel client - page cache being invaildated.
> If a CephFS client receives a cap release request and it is able to
> perform it (no processes accessing the file at the moment), the client
> cleans up its internal state and allows the MDS to release the cap.
> This cleanup also involves removing file data from the page cache.
>
> If your MDS was running with a too small cache size, it had to revoke
> caps over and over to adhere to its cache size, and the clients had to
> clean up their cache over and over, too.

Well.. it could just mark it "eligible for future cleanup" - if the client has no other use for the available memory, then this is just thrashing the local client memory cache for a file that goes back into use a few minutes later - and based on your description, this is what we have been seeing. Bumping MDS memory has pushed our problem away and our setup works fine, but the above behaviour still seems very suboptimal. Of course, if the file changes, feel free to actively prune - but why otherwise? Then it will get no hits in the client LRU cache and be automatically evicted by the client anyway. I feel this is messing with something that has worked well for a few decades now, but I may just be missing the fine-grained details.

> Hope this helps.

Definitely - thanks. -- Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Low traffic Ceph cluster with consumer SSD.
On 25 Nov 2018, at 15.17, Vitaliy Filippov wrote:
>
> All disks (HDDs and SSDs) have cache and may lose non-transactional writes
> that are in-flight. However, any adequate disk handles fsync's (i.e SATA
> FLUSH CACHE commands). So transactional writes should never be lost, and in
> Ceph ALL writes are transactional - Ceph issues fsync's all the time. Another
> example is DBMS-es - they also issue an fsync when you COMMIT.

https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf

This may have changed since 2013, but the common understanding is that the cache needs to be disabled to ensure that flushes are persistent, and that disabling the cache on an SSD is either not honored by the firmware or plummets write performance. Which is why enterprise disks have power-loss protection in the form of capacitors. Again, any links/info telling otherwise are very welcome. Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Low traffic Ceph cluster with consumer SSD.
>> the real risk is the lack of power loss protection. Data can be
>> corrupted on unclean shutdowns
>
> it's not! lack of "advanced power loss protection" only means lower iops
> with fsync, but not the possibility of data corruption
>
> "advanced power loss protection" is basically the synonym for
> "non-volatile cache"

A few years ago it was pretty common knowledge that if a drive didn't have capacitors - and thus power-loss protection - then an unexpected power-off could lead to data-loss situations. Perhaps I'm not up to date with recent developments. Is this a solved problem today in consumer-grade SSDs? .. Any links to insight/testing/etc. would be welcome. https://arstechnica.com/civis/viewtopic.php?f=11&t=1383499 - at least does not support that viewpoint. Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Low traffic Ceph cluster with consumer SSD.
> On 24 Nov 2018, at 18.09, Anton Aleksandrov wrote
> We plan to have data on a dedicated disk in each node and my question is about
> WAL/DB for Bluestore. How bad would it be to place it on a system consumer SSD?
> How big is the risk that everything will get "slower than using a spinning HDD
> for the same purpose"? And how big is the risk that our nodes will die
> because of SSD lifespan?

The real risk is the lack of power-loss protection. Data can be corrupted on unclean shutdowns. Disabling the cache may help. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] CephFS kernel client versions - pg-upmap
Hi. I tried to enable the "new smart balancing" - the backend is on RH luminous, clients are on the Ubuntu 4.15 kernel. As per: http://docs.ceph.com/docs/mimic/rados/operations/upmap/

$ sudo ceph osd set-require-min-compat-client luminous
Error EPERM: cannot set require_min_compat_client to luminous: 1 connected client(s) look like firefly (missing 0xe010020); 1 connected client(s) look like firefly (missing 0xe01); 1 connected client(s) look like hammer (missing 0xe20); 55 connected client(s) look like jewel (missing 0x800); add --yes-i-really-mean-it to do it anyway

Ok, so a 4.15 kernel connects as a "hammer" (<1.0) client? Is there a huge gap in upstreaming kernel clients to kernel.org, or what am I misreading here? Hammer is 2015-ish - 4.15 is January 2018-ish. Is kernel client development lagging behind? Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
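The mon can show exactly which feature bits each session presents - a sketch, mon id hypothetical:

$ sudo ceph daemon mon.torsk1 sessions

Each session entry includes the client address and its feature bitmap. The release name in the warning reflects feature bits, not client age: a kernel maps to whichever named release last changed the bits it happens to implement, which is why a 2018 kernel can show up as "hammer" or "jewel" while still supporting upmap.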
Re: [ceph-users] cephfs kernel client - page cache being invaildated.
> I suspect that mds asked client to trim its cache. Please run
> following commands on an idle client.

In the meantime, we migrated to the RH Ceph version and gave the MDS both SSDs and more memory, and the problem went away. It still puzzles my mind a bit - why is there a connection between the client page cache and MDS server performance? The only argument I can find is that if the MDS cannot cache metadata, then it needs to go back and fetch it from the Ceph metadata pool, and it then exposes data as "new" to the clients despite it being the same. - if that is the case, then I would say there is significant room for performance optimization here.

> If you can reproduce this issue. please send kernel log to us.

Will do if/when it reappears. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cephfs kernel client - page cache being invaildated.
> On 10/15/18 12:41 PM, Dietmar Rieder wrote:
>> No big difference here.
>> all CentOS 7.5 official kernel 3.10.0-862.11.6.el7.x86_64
>
> ...forgot to mention: all is luminous ceph-12.2.7

Thanks for your time in testing; this is very valuable to me for the debugging. Two questions: Did you "sleep 900" between the executions? Are you using the kernel client or the fuse client? If I run them "right after each other" .. then I get the same behaviour. -- Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cephfs kernel client - page cache being invaildated.
>> On Sun, Oct 14, 2018 at 8:21 PM wrote:
>> how many cephfs mounts access the file? Is it possible that some
>> program opens that file in RW mode (even if they just read the file)?
>
> The nature of the program is that it is "prepped" by one set of commands
> and queried by another, thus the RW case is extremely unlikely.
> I can change permission bits to revoke the w-bit for the user; they
> don't need it anyway... it is just the same service users that generate
> the data and query it today.

Just to remove the suspicion of other clients fiddling with the files, I did a more structured test. I have 4 x 10GB files from fio benchmarking, 40GB total, hosted on:

1) CephFS /ceph/cluster/home/jk
2) NFS /z/home/jk

First I read them .. then sleep 900 seconds .. then read again (just with dd):

jk@sild12:/ceph/cluster/home/jk$ time for i in $(seq 0 3); do echo "dd if=test.$i.0 of=/dev/null bs=1M"; done | parallel -j 4 ; sleep 900; time for i in $(seq 0 3); do echo "dd if=test.$i.0 of=/dev/null bs=1M"; done | parallel -j 4
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 2.56413 s, 4.2 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 2.82234 s, 3.8 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 2.9361 s, 3.7 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 3.10397 s, 3.5 GB/s

real    0m3.449s
user    0m0.217s
sys     0m11.497s

10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 315.439 s, 34.0 MB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 338.661 s, 31.7 MB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 354.725 s, 30.3 MB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 356.126 s, 30.2 MB/s

real    5m56.634s
user    0m0.260s
sys     0m16.515s

jk@sild12:/ceph/cluster/home/jk$

Then NFS:

jk@sild12:~$ time for i in $(seq 0 3); do echo "dd if=test.$i.0 of=/dev/null bs=1M"; done | parallel -j 4 ; sleep 900; time for i in $(seq 0 3); do echo "dd if=test.$i.0 of=/dev/null bs=1M"; done | parallel -j 4
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 1.60267 s, 6.7 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 2.18602 s, 4.9 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 2.47564 s, 4.3 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 2.54674 s, 4.2 GB/s

real    0m2.855s
user    0m0.185s
sys     0m8.888s

10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 1.68613 s, 6.4 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 1.6983 s, 6.3 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 2.20059 s, 4.9 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 2.58077 s, 4.2 GB/s

real    0m2.980s
user    0m0.173s
sys     0m8.239s

jk@sild12:~$

Can I ask one of you to run the same "test" (or similar) .. and report back if you can reproduce it? Thoughts/comments/suggestions are highly appreciated. Should I try with the fuse client? -- Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
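A quick way to confirm whether the kernel actually dropped the pages during the sleep - assuming the vmtouch utility is installed - is to check residency before and after:

$ vmtouch -v /ceph/cluster/home/jk/test.0.0    # prints how many of the file's pages are resident in the page cache

If vmtouch shows ~0% resident after the 900s sleep on CephFS but ~100% on NFS, that pins the difference on cap-driven cache invalidation rather than on raw read performance.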
Re: [ceph-users] cephfs kernel client - page cache being invaildated.
> On Sun, Oct 14, 2018 at 8:21 PM wrote:
> how many cephfs mounts access the file? Is it possible that some
> program opens that file in RW mode (even if they just read the file)?

The nature of the program is that it is "prepped" by one set of commands and queried by another, thus the RW case is extremely unlikely. I can change permission bits to revoke the w-bit for the user; they don't need it anyway... it is just the same service users that generate the data and query it today. Can ceph tell the actual number of clients? .. We have 55-60 hosts, and most of them mount the catalog. -- Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
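The MDS can list its client sessions directly - a sketch, the MDS name is hypothetical:

$ sudo ceph daemon mds.zebra01 session ls | grep -c '"id"'   # rough count of client sessions
$ sudo ceph daemon mds.zebra01 session ls                    # full detail: client addresses, metadata, caps held

That shows every CephFS client the MDS currently knows about, which should confirm whether all 55-60 hosts hold sessions.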
Re: [ceph-users] cephfs kernel client - page cache being invaildated.
> Actual amount of memory used by VFS cache is available through 'grep
> Cached /proc/meminfo'. slabtop provides information about cache
> of inodes, dentries, and IO memory buffers (buffer_head).

Thanks, that was also what I got out of it, and why I reported "free" output in the first place, as it also shows available and "cached" memory. -- Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cephfs kernel client - page cache being invaildated.
> Try looking in /proc/slabinfo / slabtop during your tests.

I need a bit of guidance here.. Does slabinfo cover the VFS page cache? .. I cannot seem to find any traces of it (sorting by size on machines with a huge cache does not really show anything). Perhaps I'm holding the screwdriver wrong? -- Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cephfs kernel client - page cache being invaildated.
On 14 Oct 2018, at 15.26, John Hearns wrote:
>
> This is a general question for the ceph list.
> Should Jesper be looking at these vm tunables?
> vm.dirty_ratio
> vm.dirty_centisecs
>
> What effect do they have when using Cephfs?

This situation is read-only, thus there is no dirty data in the page cache; the above should be irrelevant. Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] cephfs kernel client - page cache being invaildated.
Hi

We have a dataset of ~300 GB on CephFS which is being used for computations over and over again, refreshed daily or similar. When hosting it on NFS, the files are transferred once after a refresh, and from there on they sit in the kernel page cache of the client until they are refreshed server-side. On CephFS it looks "similar" but "different". Where the "steady state" operation over NFS gives client/server traffic of < 1MB/s, CephFS constantly pulls 50-100MB/s over the network. This has implications for the clients, which end up spending unnecessary time waiting for IO during execution. This is in a setting where the CephFS client memory looks like this:

$ free -h
       total  used  free  shared  buff/cache  available
Mem:    377G   17G  340G    1.2G         19G       354G
Swap:   8.8G  430M  8.4G

If I repeatedly run something that uses the files (within a few minutes), then it is fully served out of the client page cache (2GB/s-ish) .. but it looks like it is being evicted way faster than in the NFS setting. This is not scientific, but the CMD is a "cat /file/on/ceph > /dev/null" type on a total of 24GB of data in 300-ish files.

$ free -h; time CMD ; sleep 1800; free -h; time CMD ; free -h; sleep 3600; time CMD
       total  used  free  shared  buff/cache  available
Mem:    377G   16G  312G    1.2G         48G       355G
Swap:   8.8G  430M  8.4G

real    0m8.997s
user    0m2.036s
sys     0m6.915s

       total  used  free  shared  buff/cache  available
Mem:    377G   17G  277G    1.2G         82G       354G
Swap:   8.8G  430M  8.4G

real    3m25.904s
user    0m2.794s
sys     0m9.028s

       total  used  free  shared  buff/cache  available
Mem:    377G   17G  283G    1.2G         76G       353G
Swap:   8.8G  430M  8.4G

real    6m18.358s
user    0m2.847s
sys     0m10.651s

Munin graphs of the system confirm that there has been zero memory pressure over the period. Is there anything in the CephFS case that can cause the page cache to be invalidated? Could less aggressive "read-ahead" play a role? Other thoughts on what the root cause of the different behaviour could be? Clients are using the 4.15 kernel.. Anyone aware of newer patches in this area that could have an impact? Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
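On the read-ahead angle: the kernel client's readahead is controlled at mount time - a sketch, mount source and auth names hypothetical:

$ sudo mount -t ceph mon1:6789:/ /ceph -o name=admin,secretfile=/etc/ceph/admin.secret,rasize=67108864

rasize sets the maximum readahead window in bytes (the default is 8MB). It will not stop cache invalidation, but it does change how quickly a cold dataset is re-read once the caps have been dropped.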
[ceph-users] CephFS performance.
Hi All. First, thanks for the good discussion and strong answers I've gotten so far. The current cluster setup is 4 OSD hosts x 10 x 12TB 7.2K RPM drives, all on 10GbitE, with metadata on rotating drives - 3x replication - 256GB memory and 32+ cores in the OSD hosts. Behind a Perc (each-disk-RAID0) with BBWC.

Planned changes:
- get 1-2 more OSD hosts
- experiment with EC pools for CephFS
- move the MDS onto a separate host and the metadata onto SSDs

I'm still struggling to get "non-cached" performance up to "hardware" speed - whatever that means. I do "fio" benchmarks using 10GB files, 16 threads, 4M block size -- at which I can "almost" sustainedly fill the 10GbitE NIC. In this configuration I would have expected it to be "way above" 10Gbit speed, and thus have the NIC fully filled rather than "almost" filled - could that be the metadata activity? .. but on "big files" and read, that should not be much - right? The above is actually ok for production, thus .. not a big issue, just information. Single-threaded performance is where we're still struggling.

Cold HDD (read from disk on the NFS server end) / NFS performance:
jk@zebra01:~$ pipebench < /nfs/16GB.file > /dev/null
Summary: Piped 15.86 GB in 00h00m27.53s: 589.88 MB/second

Local page cache (just to show the profiling tool isn't the limitation):
jk@zebra03:~$ pipebench < /nfs/16GB.file > /dev/null
Summary: Piped 29.24 GB in 00h00m09.15s: 3.19 GB/second
jk@zebra03:~$

Now from the Ceph system:
jk@zebra01:~$ pipebench < /ceph/bigfile.file > /dev/null
Summary: Piped 36.79 GB in 00h03m47.66s: 165.49 MB/second

Can block/stripe-size be tuned? Does it make sense? Does read-ahead on the CephFS kernel client need tuning? What performance are other people seeing? Other thoughts - recommendations? On some of the shares we're storing pretty large files (GB size) and need the backup to move them to tape - so it is preferred to be capable of filling an LTO6 drive's write speed to capacity with a single thread. 40-ish 7.2K RPM drives - should - add up to more than the above.. right? This is the only current load being put on the cluster - plus 100MB/s recovery traffic. Thanks. Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
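On the block/stripe-size question: CephFS exposes file layouts through virtual xattrs, and they can be tuned per directory for files created afterwards - a sketch, the directory path is hypothetical:

$ getfattr -n ceph.file.layout /ceph/bigfile.file              # show current stripe_unit/stripe_count/object_size
$ setfattr -n ceph.dir.layout.stripe_count -v 8 /ceph/newdata  # stripe new files in this dir across 8 objects at a time

A higher stripe_count lets a single-threaded sequential read fan out across more OSDs at once; note the layout only applies to newly created files, so existing files must be rewritten to pick it up.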
Re: [ceph-users] Bluestore vs. Filestore
> Your use case sounds it might profit from the rados cache tier
> feature. It's a rarely used feature because it only works in very
> specific circumstances. But your scenario sounds like it might work.
> Definitely worth giving it a try. Also, dm-cache with LVM *might*
> help.
> But if your active working set is really just 400GB: Bluestore cache
> should handle this just fine. Don't worry about "unequal"
> distribution, every 4mb chunk of every file will go to a random OSD.

I tried it out - and will do so more - but the initial tests didn't really convince me; I'll keep trying.

> One very powerful and simple optimization is moving the metadata pool
> to SSD only. Even if it's just 3 small but fast SSDs; that can make a
> huge difference to how fast your filesystem "feels".

They are ordered and will hopefully arrive very soon. Can I:

1) Add disks
2) Create pool
3) Stop all MDS's
4) rados cppool
5) Start MDS's

.. Yes, that's a cluster-down on CephFS, but it shouldn't take long. Or is there a better guide? -- Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
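For the record, a cppool is usually not needed for this: if the SSDs get their own device class and crush rule, the metadata pool can simply be retargeted and Ceph migrates the data live - a sketch, assuming the pool is named cephfs_metadata and an SSD device-class rule (e.g. replicated_ruleset_ssd) exists:

$ sudo ceph osd pool set cephfs_metadata crush_rule replicated_ruleset_ssd

The PGs backfill onto the SSD OSDs in the background with no MDS downtime. rados cppool also has known issues with omap data - which is exactly what CephFS metadata is - so the crush-rule route is the safer one.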
Re: [ceph-users] Bluestore vs. Filestore
> On 02.10.2018 19:28, jes...@krogh.cc wrote:
> In the cephfs world there is no central server that holds the cache. Each
> cephfs client reads data directly from the osd's.

I can accept this argument, but nevertheless .. if I used Filestore, it would work.

> This also means no
> single point of failure, and you can scale out performance by spreading
> metadata tree information over multiple MDS servers. and scale out
> storage and throughput with added osd nodes.
>
> so if the cephfs client cache is not sufficient, you can look at the
> bluestore cache. http://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/#cache-size

I have been there, but it seems to "not work" - I think the need to slice per OSD and statically allocate memory per OSD breaks the efficiency (but I cannot prove it).

> or you can look at adding a ssd layer over the spinning disks. with eg
> bcache. I assume you are using a ssd/nvram for bluestore db already

My current bluestore(s) are backed by 10TB 7.2K RPM drives, although behind BBWC. Can you elaborate on the "assumption"? We're not doing that, and I'd like to explore it.

> you should also look at tuning the cephfs metadata servers.
> make sure the metadata pool is on fast ssd osd's . and tune the mds
> cache to the mds server's ram, so you cache as much metadata as possible.

Yes, we're in the process of doing that. I believe we're seeing the MDS suffer when we saturate a few disks in the setup - they are sharing spindles. Thus we'll move the metadata onto SSD as per the recommendations. -- Jesper ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
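For reference, the per-OSD cache the thread refers to is set in ceph.conf - a sketch with illustrative sizes; note the memory is statically carved out per OSD daemon, so total consumption is roughly num_osds x cache size:

[osd]
# Luminous bluestore cache knobs (bytes); _hdd applies to OSDs on rotational devices
bluestore_cache_size_hdd = 8589934592   # 8GB instead of the 1GB default
bluestore_cache_kv_ratio = 0.2          # bias the split away from rocksdb toward onode/data cache

Unlike the page cache, this memory is never given back under pressure - which is the static-allocation inefficiency mentioned above.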
[ceph-users] Bluestore vs. Filestore
Hi. Based on some recommendations we have set up our CephFS installation using bluestore*. We're trying to get a strong replacement for a "huge" xfs+NFS server - 100TB-ish in size. The current setup is a sizeable Linux host with 512GB of memory + one large Dell MD1200 or MD1220 - 100TB + a Linux kernel NFS server. Since our "hot" dataset is < 400GB, we can actually serve the hot data directly out of the host page cache and never really touch the "slow" underlying drives - except when new bulk data is written, where a Perc with BBWC absorbs the writes.

In the CephFS + Bluestore world, Ceph "deliberately" bypasses the host OS page cache, so even though we have 4-5 x 256GB memory** in the OSD hosts, it is really hard to create a synthetic test where the hot data does not end up being read from the underlying disks. Yes, the client-side page cache works very well, but in our scenario we have 30+ hosts pulling the same data over NFS.

Is bluestore just a "bad fit", where Filestore "should" do the right thing? Is the recommendation to make an SSD "overlay" on the slow drives? Thoughts? Jesper

* Bluestore should be the new and shiny future - right?
** Total mem 1TB+

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Bluestore OSDs stay down
Hi,

I have been very impressed with the BlueStore test environment I made, which is built on Ubuntu 16.04 using the Ceph development master repository. But now I have run into some self-inflicted problems. Yesterday I accidentally updated the OSDs while they were being heavily used. Then the OSDs started to go down one by one, and when they all had, I ended up with pgs in practically every possible state.

:~# ceph health
2016-09-30 09:50:39.044987 7f27e4ee2700 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
2016-09-30 09:50:39.052592 7f27e4ee2700 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
HEALTH_ERR 243 pgs are stuck inactive for more than 300 seconds; 130 pgs backfill_wait; 488 pgs degraded; 55 pgs down; 49 pgs incomplete; 63 pgs peering; 2 pgs recovering; 281 pgs recovery_wait; 600 pgs stale; 243 pgs stuck inactive; 357 pgs stuck unclean; 488 pgs undersized; recovery 1240205/1848822 objects degraded (67.081%); recovery 397635/1848822 objects misplaced (21.507%); recovery 57149/616274 unfound (9.273%); mds cluster is degraded; 8/8 in osds are down

As mentioned, all OSDs are now down and refuse to come back up. From the osd log file I see this error message:

/srv/autobuild-ceph/gitbuilder.git/build/out~/ceph-11.0.0-2808-g76e120c/src/os/bluestore/StupidAllocator.cc: 317: FAILED assert(rm.empty())

Of course the data was not important in this test environment, and the easiest would probably be to start over, but I am considering building a production environment on Bluestore as soon as it becomes stable, so for the sport of it I would like to see if I can actually recover the OSDs - just to get some deeper insight into Ceph recovery. I have been through: http://docs.ceph.com/docs/jewel/rados/troubleshooting/troubleshooting-osd/ without any luck. What would be the next steps to try? Thanks! /Jesper

2016-09-30 08:51:23.464389 7f17985528c0 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
2016-09-30 08:51:23.464409 7f17985528c0 0 set uid:gid to 64045:64045 (ceph:ceph)
2016-09-30 08:51:23.464430 7f17985528c0 0 ceph version v11.0.0-2808-g76e120c (76e120c705b77d2d2cef1b94cacdd11c14460a3f), process ceph-osd, pid 16590
2016-09-30 08:51:23.464479 7f17985528c0 5 object store type is bluestore
2016-09-30 08:51:23.464504 7f17985528c0 -1 WARNING: experimental feature 'bluestore' is enabled
Please be aware that this feature is experimental, untested, unsupported, and may result in data corruption, data loss, and/or irreparable damage to your cluster. Do not use feature with important data.
2016-09-30 08:51:23.466344 7f17985528c0 0 pidfile_write: ignore empty --pid-file
2016-09-30 08:51:23.468345 7f17985528c0 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
2016-09-30 08:51:23.473859 7f17985528c0 10 ErasureCodePluginSelectJerasure: load: jerasure_sse4
2016-09-30 08:51:23.475249 7f17985528c0 10 load: jerasure load: lrc load: isa
2016-09-30 08:51:23.475657 7f17985528c0 2 osd.0 0 mounting /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal
2016-09-30 08:51:23.475669 7f17985528c0 1 bluestore(/var/lib/ceph/osd/ceph-0) mount path /var/lib/ceph/osd/ceph-0
2016-09-30 08:51:23.475707 7f17985528c0 1 bdev create path /var/lib/ceph/osd/ceph-0/block type kernel
2016-09-30 08:51:23.476182 7f17985528c0 1 bdev(/var/lib/ceph/osd/ceph-0/block) open path /var/lib/ceph/osd/ceph-0/block
2016-09-30 08:51:23.476545 7f17985528c0 1 bdev(/var/lib/ceph/osd/ceph-0/block) open size 4000681103360 (0x3a37b2d1000, 3725 GB) block_size 4096 (4096 B) non-rotational
2016-09-30 08:51:23.476814 7f17985528c0 1 bdev create path /var/lib/ceph/osd/ceph-0/block type kernel
2016-09-30 08:51:23.477256 7f17985528c0 1 bdev(/var/lib/ceph/osd/ceph-0/block) open path /var/lib/ceph/osd/ceph-0/block
2016-09-30 08:51:23.477501 7f17985528c0 1 bdev(/var/lib/ceph/osd/ceph-0/block) open size 4000681103360 (0x3a37b2d1000, 3725 GB) block_size 4096 (4096 B) non-rotational
2016-09-30 08:51:23.477509 7f17985528c0 1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-0/block size 3725 GB
2016-09-30 08:51:23.477533 7f17985528c0 1 bluefs mount
2016-09-30 08:51:23.553415 7f17985528c0 -1 /srv/autobuild-ceph/gitbuilder.git/build/out~/ceph-11.0.0-2808-g76e120c/src/os/bluestore/StupidAllocator.cc: In function 'virtual void StupidAllocator::init_rm_free(uint64_t, uint64_t)' thread 7f17985528c0 time 2016-09-30 08:51:23.550071
/srv/autobuild-ceph/gitbuilder.git/build/out~/ceph-11.0.0-2808-g76e120c/src/os/bluestore/StupidAllocator.cc: 317: FAILED assert(rm.empty())
ceph version v11.0.0-2808-g76e120c (76e120c705b77d2d2cef1b94cacdd11c14460a3f)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55d4c31a6240]
2: (StupidAllocat
Re: [ceph-users] Journal symlink broken / Ceph 0.94.5 / CentOS 6.7
Hi Loic,

Problem solved! So I just started to reinstall with the last known working conf - centos 6.6 (remember, I updated to 6.7). During the install it complains about some "bios raid metadata" on one of the disks (/dev/sdc) and wants to hide it. Last time I just added some boot parameter to ignore this, so it would let me install the OS on this disk. So I thought... hmm, maybe this is the root cause of the problems? Abort re-install, back into the 6.7 install, and apply this fix to my /dev/sdc;

https://kezhong.wordpress.com/2011/06/14/how-to-remove-bios-raid-metadata-from-disk-on-fedora/

Deleted both journals and osd's and re-created - and wupti - things are working :-) !!! So I guess having an old disk with "bios raid metadata" on it will disturb ceph. Maybe ceph should include a check for this "bios raid metadata"? Someone besides me might decide to use old raid disks for their ceph setup ;-)

Thank you very much for your kind help!

Cheers, Jesper

*****

On 18/12/2015 22:09, Jesper Thorhauge wrote:
> Hi Loic,
>
> Getting closer!
>
> lrwxrwxrwx 1 root root 10 Dec 18 19:43 1e9d527f-0866-4284-b77c-c1cb04c5a168 -> ../../sdc4
> lrwxrwxrwx 1 root root 10 Dec 18 19:43 c34d4694-b486-450d-b57f-da24255f0072 -> ../../sdc3
> lrwxrwxrwx 1 root root 10 Dec 18 19:42 c83b5aa5-fe77-42f6-9415-25ca0266fb7f -> ../../sdb1
> lrwxrwxrwx 1 root root 10 Dec 18 19:42 e85f4d92-c8f1-4591-bd2a-aa43b80f58f6 -> ../../sda1
>
> So symlinks are now working! Activating an OSD is a different story :-(
>
> "ceph-disk -vv activate /dev/sda1" gives me;
>
> INFO:ceph-disk:Running command: /sbin/blkid -p -s TYPE -ovalue -- /dev/sda1
> INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs
> INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs
> DEBUG:ceph-disk:Mounting /dev/sda1 on /var/lib/ceph/tmp/mnt.A99cDp with options noatime,inode64
> INFO:ceph-disk:Running command: /bin/mount -t xfs -o noatime,inode64 -- /dev/sda1 /var/lib/ceph/tmp/mnt.A99cDp
> DEBUG:ceph-disk:Cluster uuid is 07b5c90b-6cae-40c0-93b2-31e0ebad7315
> INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
> DEBUG:ceph-disk:Cluster name is ceph
> DEBUG:ceph-disk:OSD uuid is e85f4d92-c8f1-4591-bd2a-aa43b80f58f6
> DEBUG:ceph-disk:OSD id is 6
> DEBUG:ceph-disk:Initializing OSD...
> INFO:ceph-disk:Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/tmp/mnt.A99cDp/activate.monmap
> got monmap epoch 6
> INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster ceph --mkfs --mkkey -i 6 --monmap /var/lib/ceph/tmp/mnt.A99cDp/activate.monmap --osd-data /var/lib/ceph/tmp/mnt.A99cDp --osd-journal /var/lib/ceph/tmp/mnt.A99cDp/journal --osd-uuid e85f4d92-c8f1-4591-bd2a-aa43b80f58f6 --keyring /var/lib/ceph/tmp/mnt.A99cDp/keyring
> HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
> 2015-12-18 21:58:12.489357 7f266d7b0800 -1 journal check: ondisk fsid ---- doesn't match expected e85f4d92-c8f1-4591-bd2a-aa43b80f58f6, invalid (someone else's?) journal
> HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
> HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
> HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
> 2015-12-18 21:58:12.680566 7f266d7b0800 -1 filestore(/var/lib/ceph/tmp/mnt.A99cDp) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
> 2015-12-18 21:58:12.865810 7f266d7b0800 -1 created object store /var/lib/ceph/tmp/mnt.A99cDp journal /var/lib/ceph/tmp/mnt.A99cDp/journal for osd.6 fsid 07b5c90b-6cae-40c0-93b2-31e0ebad7315
> 2015-12-18 21:58:12.865844 7f266d7b0800 -1 auth: error reading file: /var/lib/ceph/tmp/mnt.A99cDp/keyring: can't open /var/lib/ceph/tmp/mnt.A99cDp/keyring: (2) No such file or directory
> 2015-12-18 21:58:12.865910 7f266d7b0800 -1 created new key in keyring /var/lib/ceph/tmp/mnt.A99cDp/keyring
> INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup init
> DEBUG:ceph-disk:Marking with init system sysvinit
> DEBUG:ceph-disk:Authorizing OSD key...
> INFO:ceph-disk:Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring auth add osd.6 -i /var/lib/ceph/tmp/mnt.A99cDp/keyring osd allow * mon allow profile osd
> Error EINVAL: entity osd.6 exists but key does not match
>
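For anyone hitting the same: the fix in the linked post boils down to erasing the stale BIOS RAID signature from the disk - a rough sketch, and destructive, so double-check the device name first:

$ sudo dmraid -r            # list disks carrying RAID metadata
$ sudo dmraid -rE /dev/sdc  # erase that metadata from the disk

(wipefs -a /dev/sdc is a more modern alternative that clears all signatures from a device.)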
Re: [ceph-users] Journal symlink broken / Ceph 0.94.5 / CentOS 6.7
Hi Loic,

Getting closer!

lrwxrwxrwx 1 root root 10 Dec 18 19:43 1e9d527f-0866-4284-b77c-c1cb04c5a168 -> ../../sdc4
lrwxrwxrwx 1 root root 10 Dec 18 19:43 c34d4694-b486-450d-b57f-da24255f0072 -> ../../sdc3
lrwxrwxrwx 1 root root 10 Dec 18 19:42 c83b5aa5-fe77-42f6-9415-25ca0266fb7f -> ../../sdb1
lrwxrwxrwx 1 root root 10 Dec 18 19:42 e85f4d92-c8f1-4591-bd2a-aa43b80f58f6 -> ../../sda1

So symlinks are now working! Activating an OSD is a different story :-(

"ceph-disk -vv activate /dev/sda1" gives me;

INFO:ceph-disk:Running command: /sbin/blkid -p -s TYPE -ovalue -- /dev/sda1
INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs
INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs
DEBUG:ceph-disk:Mounting /dev/sda1 on /var/lib/ceph/tmp/mnt.A99cDp with options noatime,inode64
INFO:ceph-disk:Running command: /bin/mount -t xfs -o noatime,inode64 -- /dev/sda1 /var/lib/ceph/tmp/mnt.A99cDp
DEBUG:ceph-disk:Cluster uuid is 07b5c90b-6cae-40c0-93b2-31e0ebad7315
INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
DEBUG:ceph-disk:Cluster name is ceph
DEBUG:ceph-disk:OSD uuid is e85f4d92-c8f1-4591-bd2a-aa43b80f58f6
DEBUG:ceph-disk:OSD id is 6
DEBUG:ceph-disk:Initializing OSD...
INFO:ceph-disk:Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/tmp/mnt.A99cDp/activate.monmap
got monmap epoch 6
INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster ceph --mkfs --mkkey -i 6 --monmap /var/lib/ceph/tmp/mnt.A99cDp/activate.monmap --osd-data /var/lib/ceph/tmp/mnt.A99cDp --osd-journal /var/lib/ceph/tmp/mnt.A99cDp/journal --osd-uuid e85f4d92-c8f1-4591-bd2a-aa43b80f58f6 --keyring /var/lib/ceph/tmp/mnt.A99cDp/keyring
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
2015-12-18 21:58:12.489357 7f266d7b0800 -1 journal check: ondisk fsid ---- doesn't match expected e85f4d92-c8f1-4591-bd2a-aa43b80f58f6, invalid (someone else's?) journal
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
2015-12-18 21:58:12.680566 7f266d7b0800 -1 filestore(/var/lib/ceph/tmp/mnt.A99cDp) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2015-12-18 21:58:12.865810 7f266d7b0800 -1 created object store /var/lib/ceph/tmp/mnt.A99cDp journal /var/lib/ceph/tmp/mnt.A99cDp/journal for osd.6 fsid 07b5c90b-6cae-40c0-93b2-31e0ebad7315
2015-12-18 21:58:12.865844 7f266d7b0800 -1 auth: error reading file: /var/lib/ceph/tmp/mnt.A99cDp/keyring: can't open /var/lib/ceph/tmp/mnt.A99cDp/keyring: (2) No such file or directory
2015-12-18 21:58:12.865910 7f266d7b0800 -1 created new key in keyring /var/lib/ceph/tmp/mnt.A99cDp/keyring
INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup init
DEBUG:ceph-disk:Marking with init system sysvinit
DEBUG:ceph-disk:Authorizing OSD key...
INFO:ceph-disk:Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring auth add osd.6 -i /var/lib/ceph/tmp/mnt.A99cDp/keyring osd allow * mon allow profile osd
Error EINVAL: entity osd.6 exists but key does not match
ERROR:ceph-disk:Failed to activate
DEBUG:ceph-disk:Unmounting /var/lib/ceph/tmp/mnt.A99cDp
INFO:ceph-disk:Running command: /bin/umount -- /var/lib/ceph/tmp/mnt.A99cDp
Traceback (most recent call last):
  File "/usr/sbin/ceph-disk", line 2994, in <module>
    main()
  File "/usr/sbin/ceph-disk", line 2972, in main
    args.func(args)
  File "/usr/sbin/ceph-disk", line 2178, in main_activate
    init=args.mark_init,
  File "/usr/sbin/ceph-disk", line 1954, in mount_activate
    (osd_id, cluster) = activate(path, activate_key_template, init)
  File "/usr/sbin/ceph-disk", line 2153, in activate
    keyring=keyring,
  File "/usr/sbin/ceph-disk", line 1756, in auth_key
    'mon', 'allow profile osd',
  File "/usr/sbin/ceph-disk", line 323, in command_check_call
    return subprocess.check_call(arguments)
  File "/usr/lib64/python2.6/subprocess.py", line 505, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/ceph', '--cluster', 'ceph', '--name', 'client.bootstrap-osd', '--keyring', '/var/lib/ceph/bootstrap-osd/ceph.keyring', 'auth', 'add', 'osd.6', '-i', '/var/lib/ceph/tmp/mnt.A99cDp/keyring', 'osd', 'allow *', 'mon', 'allow profile osd'
Re: [ceph-users] Journal symlink broken / Ceph 0.94.5 / CentOS 6.7
Hi Loic,

Damn, the updated udev didn't fix the problem :-( The rc.local workaround is also complaining;

INFO:ceph-disk:Running command: /usr/bin/ceph-osd -i 0 --get-journal-uuid --osd-journal /dev/sdc3
libust[2648/2648]: Warning: HOME environment variable not set. Disabling LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305)
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
DEBUG:ceph-disk:Journal /dev/sdc3 has OSD UUID ----
INFO:ceph-disk:Running command: /sbin/blkid -p -s TYPE -ovalue -- /dev/disk/by-partuuid/----
error: /dev/disk/by-partuuid/----: No such file or directory
ceph-disk: Cannot discover filesystem type: device /dev/disk/by-partuuid/----: Command '/sbin/blkid' returned non-zero exit status 2
INFO:ceph-disk:Running command: /usr/bin/ceph-osd -i 0 --get-journal-uuid --osd-journal /dev/sdc4
libust[2687/2687]: Warning: HOME environment variable not set. Disabling LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305)
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
DEBUG:ceph-disk:Journal /dev/sdc4 has OSD UUID ----
INFO:ceph-disk:Running command: /sbin/blkid -p -s TYPE -ovalue -- /dev/disk/by-partuuid/----
error: /dev/disk/by-partuuid/----: No such file or directory
ceph-disk: Cannot discover filesystem type: device /dev/disk/by-partuuid/----: Command '/sbin/blkid' returned non-zero exit status 2

/dev/sdc1 and /dev/sdc2 contain the boot loader and OS, so driver-wise I guess things are working :-) But "HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device" seems to be the underlying issue. Any thoughts?

/Jesper

*

Hi Loic,

Searched around for possible udev bugs, and then tried to run "yum update". Udev did have a fresh update with the following version diffs;

udev-147-2.63.el6_7.1.x86_64 --> udev-147-2.63.el6_7.1.x86_64

From what I can see this update fixes stuff related to symbolic links / external devices. /dev/sdc sits on external eSATA. So...

https://rhn.redhat.com/errata/RHBA-2015-1382.html

Will reboot tonight and get back :-)

/jesper

***

I guess that's the problem you need to solve: why /dev/sdc does not generate udev events (different driver than /dev/sda maybe?). Once it does, Ceph should work. A workaround could be to add something like:

ceph-disk-udev 3 sdc3 sdc
ceph-disk-udev 4 sdc4 sdc

in /etc/rc.local.

On 17/12/2015 12:01, Jesper Thorhauge wrote:
> Nope, the previous post contained all that was in the boot.log :-(
>
> /Jesper
>
> **
>
> - Den 17. dec 2015, kl. 11:53, Loic Dachary skrev:
>
> On 17/12/2015 11:33, Jesper Thorhauge wrote:
>> Hi Loic,
>>
>> Sounds like something does go wrong when /dev/sdc3 shows up. Is there any way
>> I can debug this further? Log files? Modify the .rules file...?
>
> Do you see traces of what happens when /dev/sdc3 shows up in boot.log ?
>
>>
>> /Jesper
>>
>> The non-symlink files in /dev/disk/by-partuuid come to existence because of:
>>
>> * system boots
>> * udev rule calls ceph-disk-udev via 95-ceph-osd.rules on /dev/sda1
>> * ceph-disk-udev creates the symlink
>>   /dev/disk/by-partuuid/c83b5aa5-fe77-42f6-9415-25ca0266fb7f -> ../../sdb1
>> * ceph-disk activate mounts /dev/sda1 and finds a symlink to the journal,
>>   journal -> /dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168, which
>>   does not yet exist because the /dev/sdc udev rules have not been run yet
>> * ceph-osd opens the journal in write mode and that creates the file
>>   /dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168 as a regular file
>> * the file is empty and the osd fails to activate with the error you see
>>   (EINVAL because the file is empty)
>>
>> This is ok, supported and expected since there is no way to know which disk
>> will show up first.
>>
>> When /dev/sdc shows up, the same logic will be triggered:
>>
>> * udev rule calls ceph-disk-udev via 95-ceph-osd.rules on /dev/sdc3
>> * ceph-disk-udev creates the symlink
>>   /dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168 -> ../../sdc3
>>   (overriding the file because ln -sf)
>> * ceph-disk activate-journal /dev/sdc3 finds that
>>   c83b5aa5-fe77-42f6-9415-25ca0266fb7f is the data partition for that journal
>>   and mounts /dev/disk/by-partuuid/c83b5aa5-fe77-42f6-9415-25ca0266fb
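(A note for readers hitting the same missing /dev/disk/by-partuuid entries: until the udev question is resolved, the links can be rebuilt by hand from the GPT metadata. A rough sketch, using the partition numbers and GUIDs from this thread:)

    # Read each partition's unique GUID straight from the GPT
    sgdisk -i 3 /dev/sdc    # prints: Partition unique GUID: C34D4694-...
    sgdisk -i 4 /dev/sdc    # prints: Partition unique GUID: 1E9D527F-...

    # Recreate the symlinks udev should have made (GUIDs lower-cased)
    ln -sf ../../sdc3 /dev/disk/by-partuuid/c34d4694-b486-450d-b57f-da24255f0072
    ln -sf ../../sdc4 /dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168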
Re: [ceph-users] Journal symlink broken / Ceph 0.94.5 / CentOS 6.7
Hi Loic,

Searched around for possible udev bugs, and then tried to run "yum update". Udev did have a fresh update with the following version diffs;

udev-147-2.63.el6_7.1.x86_64 --> udev-147-2.63.el6_7.1.x86_64

From what I can see this update fixes stuff related to symbolic links / external devices. /dev/sdc sits on external eSATA. So...

https://rhn.redhat.com/errata/RHBA-2015-1382.html

Will reboot tonight and get back :-)

/jesper

***

I guess that's the problem you need to solve: why /dev/sdc does not generate udev events (different driver than /dev/sda maybe?). Once it does, Ceph should work. A workaround could be to add something like:

ceph-disk-udev 3 sdc3 sdc
ceph-disk-udev 4 sdc4 sdc

in /etc/rc.local.

On 17/12/2015 12:01, Jesper Thorhauge wrote:
> Nope, the previous post contained all that was in the boot.log :-(
>
> /Jesper
>
> **
>
> - Den 17. dec 2015, kl. 11:53, Loic Dachary skrev:
>
> On 17/12/2015 11:33, Jesper Thorhauge wrote:
>> Hi Loic,
>>
>> Sounds like something does go wrong when /dev/sdc3 shows up. Is there any way
>> I can debug this further? Log files? Modify the .rules file...?
>
> Do you see traces of what happens when /dev/sdc3 shows up in boot.log ?
>
>>
>> /Jesper
>>
>> The non-symlink files in /dev/disk/by-partuuid come to existence because of:
>>
>> * system boots
>> * udev rule calls ceph-disk-udev via 95-ceph-osd.rules on /dev/sda1
>> * ceph-disk-udev creates the symlink
>>   /dev/disk/by-partuuid/c83b5aa5-fe77-42f6-9415-25ca0266fb7f -> ../../sdb1
>> * ceph-disk activate mounts /dev/sda1 and finds a symlink to the journal,
>>   journal -> /dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168, which
>>   does not yet exist because the /dev/sdc udev rules have not been run yet
>> * ceph-osd opens the journal in write mode and that creates the file
>>   /dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168 as a regular file
>> * the file is empty and the osd fails to activate with the error you see
>>   (EINVAL because the file is empty)
>>
>> This is ok, supported and expected since there is no way to know which disk
>> will show up first.
>>
>> When /dev/sdc shows up, the same logic will be triggered:
>>
>> * udev rule calls ceph-disk-udev via 95-ceph-osd.rules on /dev/sdc3
>> * ceph-disk-udev creates the symlink
>>   /dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168 -> ../../sdc3
>>   (overriding the file because ln -sf)
>> * ceph-disk activate-journal /dev/sdc3 finds that
>>   c83b5aa5-fe77-42f6-9415-25ca0266fb7f is the data partition for that journal
>>   and mounts /dev/disk/by-partuuid/c83b5aa5-fe77-42f6-9415-25ca0266fb7f
>> * ceph-osd opens the journal and all is well
>>
>> Except something goes wrong in your case, presumably because ceph-disk-udev
>> is not called when /dev/sdc3 shows up ?
>>
>> On 17/12/2015 08:29, Jesper Thorhauge wrote:
>>> Hi Loic,
>>>
>>> osd's are on /dev/sda and /dev/sdb, journals are on /dev/sdc (sdc3 / sdc4).
>>>
>>> sgdisk for sda shows;
>>>
>>> Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown)
>>> Partition unique GUID: E85F4D92-C8F1-4591-BD2A-AA43B80F58F6
>>> First sector: 2048 (at 1024.0 KiB)
>>> Last sector: 1953525134 (at 931.5 GiB)
>>> Partition size: 1953523087 sectors (931.5 GiB)
>>> Attribute flags:
>>> Partition name: 'ceph data'
>>>
>>> for sdb
>>>
>>> Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown)
>>> Partition unique GUID: C83B5AA5-FE77-42F6-9415-25CA0266FB7F
>>> First sector: 2048 (at 1024.0 KiB)
>>> Last sector: 1953525134 (at 931.5 GiB)
>>> Partition size: 1953523087 sectors (931.5 GiB)
>>> Attribute flags:
>>> Partition name: 'ceph data'
>>>
>>> for /dev/sdc3
>>>
>>> Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
>>> Partition unique GUID: C34D4694-B486-450D-B57F-DA24255F0072
>>> First sector: 935813120 (at 446.2 GiB)
>>> Last sector: 956293119 (at 456.0 GiB)
>>> Partition size: 20480000 sectors (9.8 GiB)
>>> Attribute flags:
>>> Partition name: 'ceph journal'
>>>
>>> for /dev/sdc4
>>>
>>> Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
>>>
Re: [ceph-users] Journal symlink broken / Ceph 0.94.5 / CentOS 6.7
Nope, the previous post contained all that was in the boot.log :-(

/Jesper

**

- Den 17. dec 2015, kl. 11:53, Loic Dachary skrev:

On 17/12/2015 11:33, Jesper Thorhauge wrote:
> Hi Loic,
>
> Sounds like something does go wrong when /dev/sdc3 shows up. Is there any way
> I can debug this further? Log files? Modify the .rules file...?

Do you see traces of what happens when /dev/sdc3 shows up in boot.log ?

>
> /Jesper
>
> The non-symlink files in /dev/disk/by-partuuid come to existence because of:
>
> * system boots
> * udev rule calls ceph-disk-udev via 95-ceph-osd.rules on /dev/sda1
> * ceph-disk-udev creates the symlink
>   /dev/disk/by-partuuid/c83b5aa5-fe77-42f6-9415-25ca0266fb7f -> ../../sdb1
> * ceph-disk activate mounts /dev/sda1 and finds a symlink to the journal,
>   journal -> /dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168, which
>   does not yet exist because the /dev/sdc udev rules have not been run yet
> * ceph-osd opens the journal in write mode and that creates the file
>   /dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168 as a regular file
> * the file is empty and the osd fails to activate with the error you see
>   (EINVAL because the file is empty)
>
> This is ok, supported and expected since there is no way to know which disk
> will show up first.
>
> When /dev/sdc shows up, the same logic will be triggered:
>
> * udev rule calls ceph-disk-udev via 95-ceph-osd.rules on /dev/sdc3
> * ceph-disk-udev creates the symlink
>   /dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168 -> ../../sdc3
>   (overriding the file because ln -sf)
> * ceph-disk activate-journal /dev/sdc3 finds that
>   c83b5aa5-fe77-42f6-9415-25ca0266fb7f is the data partition for that journal
>   and mounts /dev/disk/by-partuuid/c83b5aa5-fe77-42f6-9415-25ca0266fb7f
> * ceph-osd opens the journal and all is well
>
> Except something goes wrong in your case, presumably because ceph-disk-udev
> is not called when /dev/sdc3 shows up ?
>
> On 17/12/2015 08:29, Jesper Thorhauge wrote:
>> Hi Loic,
>>
>> osd's are on /dev/sda and /dev/sdb, journals are on /dev/sdc (sdc3 / sdc4).
>>
>> sgdisk for sda shows;
>>
>> Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown)
>> Partition unique GUID: E85F4D92-C8F1-4591-BD2A-AA43B80F58F6
>> First sector: 2048 (at 1024.0 KiB)
>> Last sector: 1953525134 (at 931.5 GiB)
>> Partition size: 1953523087 sectors (931.5 GiB)
>> Attribute flags:
>> Partition name: 'ceph data'
>>
>> for sdb
>>
>> Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown)
>> Partition unique GUID: C83B5AA5-FE77-42F6-9415-25CA0266FB7F
>> First sector: 2048 (at 1024.0 KiB)
>> Last sector: 1953525134 (at 931.5 GiB)
>> Partition size: 1953523087 sectors (931.5 GiB)
>> Attribute flags:
>> Partition name: 'ceph data'
>>
>> for /dev/sdc3
>>
>> Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
>> Partition unique GUID: C34D4694-B486-450D-B57F-DA24255F0072
>> First sector: 935813120 (at 446.2 GiB)
>> Last sector: 956293119 (at 456.0 GiB)
>> Partition size: 20480000 sectors (9.8 GiB)
>> Attribute flags:
>> Partition name: 'ceph journal'
>>
>> for /dev/sdc4
>>
>> Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
>> Partition unique GUID: 1E9D527F-0866-4284-B77C-C1CB04C5A168
>> First sector: 956293120 (at 456.0 GiB)
>> Last sector: 976773119 (at 465.8 GiB)
>> Partition size: 20480000 sectors (9.8 GiB)
>> Attribute flags:
>> Partition name: 'ceph journal'
>>
>> 60-ceph-partuuid-workaround.rules is located in /lib/udev/rules.d, so it
>> seems correct to me.
>>
>> after a reboot, /dev/disk/by-partuuid is;
>>
>> -rw-r--r-- 1 root root  0 Dec 16 07:35 1e9d527f-0866-4284-b77c-c1cb04c5a168
>> -rw-r--r-- 1 root root  0 Dec 16 07:35 c34d4694-b486-450d-b57f-da24255f0072
>> lrwxrwxrwx 1 root root 10 Dec 16 07:35 c83b5aa5-fe77-42f6-9415-25ca0266fb7f -> ../../sdb1
>> lrwxrwxrwx 1 root root 10 Dec 16 07:35 e85f4d92-c8f1-4591-bd2a-aa43b80f58f6 -> ../../sda1
>>
>> I don't know how to verify the symlink of the journal file - can you guide me
>> on that one?
>>
>> Thanks :-) !
>>
>> /Jesper
>>
>> **
>>
>> Hi,
>>
>> On 17/12/2015 07:53, Jesper Thorhauge wrote:
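To the "how can I debug this further" question above: udev itself can show what it does (or would do) when sdc3 appears. A sketch, hedged for the EL6-era udev-147 tooling (option spellings and the devpath form may differ slightly on other versions):

    # Watch kernel and udev events live while replaying the "add" event for sdc3
    udevadm monitor --kernel --udev &
    echo add > /sys/block/sdc/sdc3/uevent

    # Dry-run the rules engine to see whether 95-ceph-osd.rules matches;
    # older udevadm may want the devpath without the /sys prefix
    udevadm test /block/sdc/sdc3 2>&1 | grep -i ceph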
Re: [ceph-users] Journal symlink broken / Ceph 0.94.5 / CentOS 6.7
Hi Loic,

Sounds like something does go wrong when /dev/sdc3 shows up. Is there any way I can debug this further? Log files? Modify the .rules file...?

/Jesper

The non-symlink files in /dev/disk/by-partuuid come to existence because of:

* system boots
* udev rule calls ceph-disk-udev via 95-ceph-osd.rules on /dev/sda1
* ceph-disk-udev creates the symlink
  /dev/disk/by-partuuid/c83b5aa5-fe77-42f6-9415-25ca0266fb7f -> ../../sdb1
* ceph-disk activate mounts /dev/sda1 and finds a symlink to the journal,
  journal -> /dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168, which
  does not yet exist because the /dev/sdc udev rules have not been run yet
* ceph-osd opens the journal in write mode and that creates the file
  /dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168 as a regular file
* the file is empty and the osd fails to activate with the error you see
  (EINVAL because the file is empty)

This is ok, supported and expected since there is no way to know which disk will show up first.

When /dev/sdc shows up, the same logic will be triggered:

* udev rule calls ceph-disk-udev via 95-ceph-osd.rules on /dev/sdc3
* ceph-disk-udev creates the symlink
  /dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168 -> ../../sdc3
  (overriding the file because ln -sf)
* ceph-disk activate-journal /dev/sdc3 finds that
  c83b5aa5-fe77-42f6-9415-25ca0266fb7f is the data partition for that journal
  and mounts /dev/disk/by-partuuid/c83b5aa5-fe77-42f6-9415-25ca0266fb7f
* ceph-osd opens the journal and all is well

Except something goes wrong in your case, presumably because ceph-disk-udev is not called when /dev/sdc3 shows up ?

On 17/12/2015 08:29, Jesper Thorhauge wrote:
> Hi Loic,
>
> osd's are on /dev/sda and /dev/sdb, journals are on /dev/sdc (sdc3 / sdc4).
>
> sgdisk for sda shows;
>
> Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown)
> Partition unique GUID: E85F4D92-C8F1-4591-BD2A-AA43B80F58F6
> First sector: 2048 (at 1024.0 KiB)
> Last sector: 1953525134 (at 931.5 GiB)
> Partition size: 1953523087 sectors (931.5 GiB)
> Attribute flags:
> Partition name: 'ceph data'
>
> for sdb
>
> Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown)
> Partition unique GUID: C83B5AA5-FE77-42F6-9415-25CA0266FB7F
> First sector: 2048 (at 1024.0 KiB)
> Last sector: 1953525134 (at 931.5 GiB)
> Partition size: 1953523087 sectors (931.5 GiB)
> Attribute flags:
> Partition name: 'ceph data'
>
> for /dev/sdc3
>
> Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
> Partition unique GUID: C34D4694-B486-450D-B57F-DA24255F0072
> First sector: 935813120 (at 446.2 GiB)
> Last sector: 956293119 (at 456.0 GiB)
> Partition size: 20480000 sectors (9.8 GiB)
> Attribute flags:
> Partition name: 'ceph journal'
>
> for /dev/sdc4
>
> Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
> Partition unique GUID: 1E9D527F-0866-4284-B77C-C1CB04C5A168
> First sector: 956293120 (at 456.0 GiB)
> Last sector: 976773119 (at 465.8 GiB)
> Partition size: 20480000 sectors (9.8 GiB)
> Attribute flags:
> Partition name: 'ceph journal'
>
> 60-ceph-partuuid-workaround.rules is located in /lib/udev/rules.d, so it
> seems correct to me.
>
> after a reboot, /dev/disk/by-partuuid is;
>
> -rw-r--r-- 1 root root  0 Dec 16 07:35 1e9d527f-0866-4284-b77c-c1cb04c5a168
> -rw-r--r-- 1 root root  0 Dec 16 07:35 c34d4694-b486-450d-b57f-da24255f0072
> lrwxrwxrwx 1 root root 10 Dec 16 07:35 c83b5aa5-fe77-42f6-9415-25ca0266fb7f -> ../../sdb1
> lrwxrwxrwx 1 root root 10 Dec 16 07:35 e85f4d92-c8f1-4591-bd2a-aa43b80f58f6 -> ../../sda1
>
> I don't know how to verify the symlink of the journal file - can you guide me
> on that one?
>
> Thanks :-) !
>
> /Jesper
>
> **
>
> Hi,
>
> On 17/12/2015 07:53, Jesper Thorhauge wrote:
>> Hi,
>>
>> Some more information showing in the boot.log;
>>
>> 2015-12-16 07:35:33.289830 7f1b990ad800 -1 filestore(/var/lib/ceph/tmp/mnt.aWZTcE) mkjournal error creating journal on /var/lib/ceph/tmp/mnt.aWZTcE/journal: (22) Invalid argument
>> 2015-12-16 07:35:33.289842 7f1b990ad800 -1 OSD::mkfs: ObjectStore::mkfs failed with error -22
>> 2015-12-16 07:35:33.289883 7f1b990ad800 -1 ** ERROR: error creating empty object store in /var/lib/ceph/tmp/mnt.aWZTcE: (22) Invalid argument
>> ERROR:ceph-disk:Failed to activate
>> ceph-disk: Command '['/usr/bin/ceph-osd', '--cluster', 'ceph', '--mkfs',
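Given the flow described above, one way to separate "udev never runs for sdc" from "the helper itself is broken" is to run the per-partition helper by hand, as the rc.local workaround does, and inspect the result. A sketch, assuming the stock Hammer-era ceph-disk-udev on el6:

    # Same calls the udev rule would make for the two journal partitions
    /usr/sbin/ceph-disk-udev 3 sdc3 sdc
    /usr/sbin/ceph-disk-udev 4 sdc4 sdc

    # Both journal GUIDs should now be symlinks into ../../sdc3 and ../../sdc4,
    # not zero-length regular files
    ls -l /dev/disk/by-partuuid/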
Re: [ceph-users] Journal symlink broken / Ceph 0.94.5 / CentOS 6.7
Hi Loic,

Yep, 95-ceph-osd.rules contains exactly that...

***

And 95-ceph-osd.rules contains the following ?

# Check gpt partion for ceph tags and activate
ACTION=="add", SUBSYSTEM=="block", \
  ENV{DEVTYPE}=="partition", \
  ENV{ID_PART_TABLE_TYPE}=="gpt", \
  RUN+="/usr/sbin/ceph-disk-udev $number $name $parent"

On 17/12/2015 08:29, Jesper Thorhauge wrote:
> Hi Loic,
>
> osd's are on /dev/sda and /dev/sdb, journals are on /dev/sdc (sdc3 / sdc4).
>
> sgdisk for sda shows;
>
> Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown)
> Partition unique GUID: E85F4D92-C8F1-4591-BD2A-AA43B80F58F6
> First sector: 2048 (at 1024.0 KiB)
> Last sector: 1953525134 (at 931.5 GiB)
> Partition size: 1953523087 sectors (931.5 GiB)
> Attribute flags:
> Partition name: 'ceph data'
>
> for sdb
>
> Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown)
> Partition unique GUID: C83B5AA5-FE77-42F6-9415-25CA0266FB7F
> First sector: 2048 (at 1024.0 KiB)
> Last sector: 1953525134 (at 931.5 GiB)
> Partition size: 1953523087 sectors (931.5 GiB)
> Attribute flags:
> Partition name: 'ceph data'
>
> for /dev/sdc3
>
> Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
> Partition unique GUID: C34D4694-B486-450D-B57F-DA24255F0072
> First sector: 935813120 (at 446.2 GiB)
> Last sector: 956293119 (at 456.0 GiB)
> Partition size: 20480000 sectors (9.8 GiB)
> Attribute flags:
> Partition name: 'ceph journal'
>
> for /dev/sdc4
>
> Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
> Partition unique GUID: 1E9D527F-0866-4284-B77C-C1CB04C5A168
> First sector: 956293120 (at 456.0 GiB)
> Last sector: 976773119 (at 465.8 GiB)
> Partition size: 20480000 sectors (9.8 GiB)
> Attribute flags:
> Partition name: 'ceph journal'
>
> 60-ceph-partuuid-workaround.rules is located in /lib/udev/rules.d, so it
> seems correct to me.
>
> after a reboot, /dev/disk/by-partuuid is;
>
> -rw-r--r-- 1 root root  0 Dec 16 07:35 1e9d527f-0866-4284-b77c-c1cb04c5a168
> -rw-r--r-- 1 root root  0 Dec 16 07:35 c34d4694-b486-450d-b57f-da24255f0072
> lrwxrwxrwx 1 root root 10 Dec 16 07:35 c83b5aa5-fe77-42f6-9415-25ca0266fb7f -> ../../sdb1
> lrwxrwxrwx 1 root root 10 Dec 16 07:35 e85f4d92-c8f1-4591-bd2a-aa43b80f58f6 -> ../../sda1
>
> I don't know how to verify the symlink of the journal file - can you guide me
> on that one?
>
> Thanks :-) !
>
> /Jesper
>
> **
>
> Hi,
>
> On 17/12/2015 07:53, Jesper Thorhauge wrote:
>> Hi,
>>
>> Some more information showing in the boot.log;
>>
>> 2015-12-16 07:35:33.289830 7f1b990ad800 -1 filestore(/var/lib/ceph/tmp/mnt.aWZTcE) mkjournal error creating journal on /var/lib/ceph/tmp/mnt.aWZTcE/journal: (22) Invalid argument
>> 2015-12-16 07:35:33.289842 7f1b990ad800 -1 OSD::mkfs: ObjectStore::mkfs failed with error -22
>> 2015-12-16 07:35:33.289883 7f1b990ad800 -1 ** ERROR: error creating empty object store in /var/lib/ceph/tmp/mnt.aWZTcE: (22) Invalid argument
>> ERROR:ceph-disk:Failed to activate
>> ceph-disk: Command '['/usr/bin/ceph-osd', '--cluster', 'ceph', '--mkfs',
>> '--mkkey', '-i', '7', '--monmap',
>> '/var/lib/ceph/tmp/mnt.aWZTcE/activate.monmap', '--osd-data',
>> '/var/lib/ceph/tmp/mnt.aWZTcE', '--osd-journal',
>> '/var/lib/ceph/tmp/mnt.aWZTcE/journal', '--osd-uuid',
>> 'c83b5aa5-fe77-42f6-9415-25ca0266fb7f', '--keyring',
>> '/var/lib/ceph/tmp/mnt.aWZTcE/keyring']' returned non-zero exit status 1
>> ceph-disk: Error: One or more partitions failed to activate
>>
>> Maybe related to the "(22) Invalid argument" part..?

> After a reboot the symlinks are reconstructed and if they are still
> incorrect, it means there is an inconsistency somewhere else. To debug the
> problem, could you mount /dev/sda1 and verify the symlink of the journal file ?
> Then verify the content of /dev/disk/by-partuuid. And also display the
> partition information with sgdisk -i 1 /dev/sda and sgdisk -i 2 /dev/sda. Are
> you collocating your journal with the data, on the same disk ? Or are they on
> two different disks ?
>
> git log --no-merges --oneline tags/v0.94.3..tags/v0.94.5 udev
>
> shows nothing, meaning there has been no change to udev rules.
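If the rule is present but apparently never fires for sdc, udev can be asked to replay the add events on demand. A hedged sketch; these udevadm options exist on EL6, but spellings vary across udev versions:

    # Replay "add" events for the sdc partitions and wait for processing
    udevadm trigger --action=add --subsystem-match=block --sysname-match='sdc*'
    udevadm settle

    # The journal GUIDs should now appear as symlinks
    ls -l /dev/disk/by-partuuid/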
Re: [ceph-users] Journal symlink broken / Ceph 0.94.5 / CentOS 6.7
Hi Loic,

osd's are on /dev/sda and /dev/sdb, journals are on /dev/sdc (sdc3 / sdc4).

sgdisk for sda shows;

Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown)
Partition unique GUID: E85F4D92-C8F1-4591-BD2A-AA43B80F58F6
First sector: 2048 (at 1024.0 KiB)
Last sector: 1953525134 (at 931.5 GiB)
Partition size: 1953523087 sectors (931.5 GiB)
Attribute flags:
Partition name: 'ceph data'

for sdb

Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown)
Partition unique GUID: C83B5AA5-FE77-42F6-9415-25CA0266FB7F
First sector: 2048 (at 1024.0 KiB)
Last sector: 1953525134 (at 931.5 GiB)
Partition size: 1953523087 sectors (931.5 GiB)
Attribute flags:
Partition name: 'ceph data'

for /dev/sdc3

Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
Partition unique GUID: C34D4694-B486-450D-B57F-DA24255F0072
First sector: 935813120 (at 446.2 GiB)
Last sector: 956293119 (at 456.0 GiB)
Partition size: 20480000 sectors (9.8 GiB)
Attribute flags:
Partition name: 'ceph journal'

for /dev/sdc4

Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
Partition unique GUID: 1E9D527F-0866-4284-B77C-C1CB04C5A168
First sector: 956293120 (at 456.0 GiB)
Last sector: 976773119 (at 465.8 GiB)
Partition size: 20480000 sectors (9.8 GiB)
Attribute flags:
Partition name: 'ceph journal'

60-ceph-partuuid-workaround.rules is located in /lib/udev/rules.d, so it seems correct to me.

after a reboot, /dev/disk/by-partuuid is;

-rw-r--r-- 1 root root  0 Dec 16 07:35 1e9d527f-0866-4284-b77c-c1cb04c5a168
-rw-r--r-- 1 root root  0 Dec 16 07:35 c34d4694-b486-450d-b57f-da24255f0072
lrwxrwxrwx 1 root root 10 Dec 16 07:35 c83b5aa5-fe77-42f6-9415-25ca0266fb7f -> ../../sdb1
lrwxrwxrwx 1 root root 10 Dec 16 07:35 e85f4d92-c8f1-4591-bd2a-aa43b80f58f6 -> ../../sda1

I don't know how to verify the symlink of the journal file - can you guide me on that one?

Thanks :-) !

/Jesper

**

Hi,

On 17/12/2015 07:53, Jesper Thorhauge wrote:
> Hi,
>
> Some more information showing in the boot.log;
>
> 2015-12-16 07:35:33.289830 7f1b990ad800 -1 filestore(/var/lib/ceph/tmp/mnt.aWZTcE) mkjournal error creating journal on /var/lib/ceph/tmp/mnt.aWZTcE/journal: (22) Invalid argument
> 2015-12-16 07:35:33.289842 7f1b990ad800 -1 OSD::mkfs: ObjectStore::mkfs failed with error -22
> 2015-12-16 07:35:33.289883 7f1b990ad800 -1 ** ERROR: error creating empty object store in /var/lib/ceph/tmp/mnt.aWZTcE: (22) Invalid argument
> ERROR:ceph-disk:Failed to activate
> ceph-disk: Command '['/usr/bin/ceph-osd', '--cluster', 'ceph', '--mkfs', '--mkkey', '-i', '7', '--monmap', '/var/lib/ceph/tmp/mnt.aWZTcE/activate.monmap', '--osd-data', '/var/lib/ceph/tmp/mnt.aWZTcE', '--osd-journal', '/var/lib/ceph/tmp/mnt.aWZTcE/journal', '--osd-uuid', 'c83b5aa5-fe77-42f6-9415-25ca0266fb7f', '--keyring', '/var/lib/ceph/tmp/mnt.aWZTcE/keyring']' returned non-zero exit status 1
> ceph-disk: Error: One or more partitions failed to activate
>
> Maybe related to the "(22) Invalid argument" part..?

After a reboot the symlinks are reconstructed and if they are still incorrect, it means there is an inconsistency somewhere else. To debug the problem, could you mount /dev/sda1 and verify the symlink of the journal file ? Then verify the content of /dev/disk/by-partuuid. And also display the partition information with sgdisk -i 1 /dev/sda and sgdisk -i 2 /dev/sda. Are you collocating your journal with the data, on the same disk ? Or are they on two different disks ?

git log --no-merges --oneline tags/v0.94.3..tags/v0.94.5 udev

shows nothing, meaning there has been no change to udev rules. There is one change related to the installation of the udev rules https://github.com/ceph/ceph/commit/4eb58ad2027148561d94bb43346b464b55d041a6. Could you double check 60-ceph-partuuid-workaround.rules is installed where it should ?

Cheers

>
> /Jesper
>
> *
>
> Hi,
>
> I have done several reboots, and it did not lead to healthy symlinks :-(
>
> /Jesper
>
>
>
> Hi,
>
> On 16/12/2015 07:39, Jesper Thorhauge wrote:
>> Hi,
>>
>> A fresh server install on one of my nodes (and yum update) left me with
>> CentOS 6.7 / Ceph 0.94.5. All the other nodes are running Ceph 0.94.2.
>>
>> "ceph-disk prepare /dev/sda /dev/sdc" seems to work as expected, but
>> "ceph-disk activate /dev/sda1" fails. I have traced the problem to
>> "/dev/disk/by-partuuid", where the journal symlinks are broken;
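On the "how do I verify the symlink of the journal file" question above: the journal link lives inside the OSD data partition, so it can be checked with a mount and readlink. A sketch; /mnt/osd0 is just an illustrative mountpoint:

    mkdir -p /mnt/osd0
    mount /dev/sda1 /mnt/osd0
    ls -l /mnt/osd0/journal         # expected: journal -> /dev/disk/by-partuuid/<uuid>
    readlink -f /mnt/osd0/journal   # expected to resolve to /dev/sdc3 or /dev/sdc4
    umount /mnt/osd0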
Re: [ceph-users] Journal symlink broken / Ceph 0.94.5 / CentOS 6.7
Hi,

Some more information showing in the boot.log;

2015-12-16 07:35:33.289830 7f1b990ad800 -1 filestore(/var/lib/ceph/tmp/mnt.aWZTcE) mkjournal error creating journal on /var/lib/ceph/tmp/mnt.aWZTcE/journal: (22) Invalid argument
2015-12-16 07:35:33.289842 7f1b990ad800 -1 OSD::mkfs: ObjectStore::mkfs failed with error -22
2015-12-16 07:35:33.289883 7f1b990ad800 -1 ** ERROR: error creating empty object store in /var/lib/ceph/tmp/mnt.aWZTcE: (22) Invalid argument
ERROR:ceph-disk:Failed to activate
ceph-disk: Command '['/usr/bin/ceph-osd', '--cluster', 'ceph', '--mkfs', '--mkkey', '-i', '7', '--monmap', '/var/lib/ceph/tmp/mnt.aWZTcE/activate.monmap', '--osd-data', '/var/lib/ceph/tmp/mnt.aWZTcE', '--osd-journal', '/var/lib/ceph/tmp/mnt.aWZTcE/journal', '--osd-uuid', 'c83b5aa5-fe77-42f6-9415-25ca0266fb7f', '--keyring', '/var/lib/ceph/tmp/mnt.aWZTcE/keyring']' returned non-zero exit status 1
ceph-disk: Error: One or more partitions failed to activate

Maybe related to the "(22) Invalid argument" part..?

/Jesper

*

Hi,

I have done several reboots, and it did not lead to healthy symlinks :-(

/Jesper

Hi,

On 16/12/2015 07:39, Jesper Thorhauge wrote:
> Hi,
>
> A fresh server install on one of my nodes (and yum update) left me with
> CentOS 6.7 / Ceph 0.94.5. All the other nodes are running Ceph 0.94.2.
>
> "ceph-disk prepare /dev/sda /dev/sdc" seems to work as expected, but
> "ceph-disk activate /dev/sda1" fails. I have traced the problem to
> "/dev/disk/by-partuuid", where the journal symlinks are broken;
>
> -rw-r--r-- 1 root root  0 Dec 16 07:35 1e9d527f-0866-4284-b77c-c1cb04c5a168
> -rw-r--r-- 1 root root  0 Dec 16 07:35 c34d4694-b486-450d-b57f-da24255f0072
> lrwxrwxrwx 1 root root 10 Dec 16 07:35 c83b5aa5-fe77-42f6-9415-25ca0266fb7f -> ../../sdb1
> lrwxrwxrwx 1 root root 10 Dec 16 07:35 e85f4d92-c8f1-4591-bd2a-aa43b80f58f6 -> ../../sda1
>
> Re-creating them manually won't survive a reboot. Is this a problem with the
> udev rules in Ceph 0.94.3+?

This usually is a symptom of something else going wrong (i.e. it is possible to confuse the kernel into creating the wrong symbolic links). The correct symlinks should be set when you reboot.

> Hope that somebody can help me :-)

Please let us know if rebooting leads to healthy symlinks.

Cheers

>
> Thanks!
>
> Best regards,
> Jesper
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

--
Loïc Dachary, Artisan Logiciel Libre

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
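The mkjournal EINVAL above is consistent with the journal path resolving to an empty regular file rather than a block device. A quick hedged check (GNU coreutils stat; -L follows the symlink):

    stat -L -c '%F %s' /dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168
    # "block special file"         -> healthy journal target
    # "regular empty file", size 0 -> reproduces the (22) Invalid argument above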
Re: [ceph-users] Journal symlink broken / Ceph 0.94.5 / CentOS 6.7
Hi,

I have done several reboots, and it did not lead to healthy symlinks :-(

/Jesper

Hi,

On 16/12/2015 07:39, Jesper Thorhauge wrote:
> Hi,
>
> A fresh server install on one of my nodes (and yum update) left me with
> CentOS 6.7 / Ceph 0.94.5. All the other nodes are running Ceph 0.94.2.
>
> "ceph-disk prepare /dev/sda /dev/sdc" seems to work as expected, but
> "ceph-disk activate /dev/sda1" fails. I have traced the problem to
> "/dev/disk/by-partuuid", where the journal symlinks are broken;
>
> -rw-r--r-- 1 root root  0 Dec 16 07:35 1e9d527f-0866-4284-b77c-c1cb04c5a168
> -rw-r--r-- 1 root root  0 Dec 16 07:35 c34d4694-b486-450d-b57f-da24255f0072
> lrwxrwxrwx 1 root root 10 Dec 16 07:35 c83b5aa5-fe77-42f6-9415-25ca0266fb7f -> ../../sdb1
> lrwxrwxrwx 1 root root 10 Dec 16 07:35 e85f4d92-c8f1-4591-bd2a-aa43b80f58f6 -> ../../sda1
>
> Re-creating them manually won't survive a reboot. Is this a problem with the
> udev rules in Ceph 0.94.3+?

This usually is a symptom of something else going wrong (i.e. it is possible to confuse the kernel into creating the wrong symbolic links). The correct symlinks should be set when you reboot.

> Hope that somebody can help me :-)

Please let us know if rebooting leads to healthy symlinks.

Cheers

>
> Thanks!
>
> Best regards,
> Jesper
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

--
Loïc Dachary, Artisan Logiciel Libre

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
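One way to see whether udev would even create the journal links on this box, independent of what is currently sitting in /dev/disk/by-partuuid (a sketch; these udevadm query options predate udev-147, but exact output format varies):

    # List the symlinks udev intends to create for each partition
    udevadm info --query=symlink --name=/dev/sdc3
    udevadm info --query=symlink --name=/dev/sda1
    # A healthy journal partition should list disk/by-partuuid/<its GUID>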
[ceph-users] Journal symlink broken / Ceph 0.94.5 / CentOS 6.7
Hi,

A fresh server install on one of my nodes (and yum update) left me with CentOS 6.7 / Ceph 0.94.5. All the other nodes are running Ceph 0.94.2.

"ceph-disk prepare /dev/sda /dev/sdc" seems to work as expected, but "ceph-disk activate /dev/sda1" fails. I have traced the problem to "/dev/disk/by-partuuid", where the journal symlinks are broken;

-rw-r--r-- 1 root root  0 Dec 16 07:35 1e9d527f-0866-4284-b77c-c1cb04c5a168
-rw-r--r-- 1 root root  0 Dec 16 07:35 c34d4694-b486-450d-b57f-da24255f0072
lrwxrwxrwx 1 root root 10 Dec 16 07:35 c83b5aa5-fe77-42f6-9415-25ca0266fb7f -> ../../sdb1
lrwxrwxrwx 1 root root 10 Dec 16 07:35 e85f4d92-c8f1-4591-bd2a-aa43b80f58f6 -> ../../sda1

Re-creating them manually won't survive a reboot. Is this a problem with the udev rules in Ceph 0.94.3+?

Hope that somebody can help me :-)

Thanks!

Best regards,
Jesper

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
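(For completeness, a quick hedged check that the udev machinery Ceph 0.94.x relies on is installed at all; paths per the stock el6 packaging:)

    # Both rule files should be present and owned by the ceph package
    ls -l /lib/udev/rules.d/60-ceph-partuuid-workaround.rules \
          /lib/udev/rules.d/95-ceph-osd.rules
    rpm -qf /lib/udev/rules.d/95-ceph-osd.rules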