Re: [ceph-users] Consumer-grade SSD in Ceph

2019-12-19 Thread jesper

I don't think “usually” is good enough in a production setup.



Sent from myMail for iOS


Thursday, 19 December 2019, 12.09 +0100 from Виталий Филиппов:
>Usually it doesn't, it only harms performance and probably SSD lifetime too.
>
>> I would not be running Ceph on SSDs without power-loss protection. It
>> delivers a potential data-loss scenario.


Re: [ceph-users] Consumer-grade SSD in Ceph

2019-12-19 Thread jesper

I would not be running Ceph on SSDs without power-loss protection. It delivers a
potential data-loss scenario.
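
(For reference, a rough way to check whether a given SSD handles Ceph's sync-heavy write path is the classic fio O_DSYNC test - a sketch only, the target file path is an example; drives without power-loss-protected caches typically collapse to a few hundred IOPS here:)

$ fio --name=synctest --filename=/mnt/testdrive/fio.bin --size=1G \
      --direct=1 --sync=1 --rw=write --bs=4k --iodepth=1 --numjobs=1 \
      --runtime=60 --time_based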

Jesper



Sent from myMail for iOS


Thursday, 19 December 2019, 08.32 +0100 from Виталий Филиппов:
>https://yourcmc.ru/wiki/Ceph_performance
>
>https://docs.google.com/spreadsheets/d/1E9-eXjzsKboiCCX-0u0r5fAjjufLKayaut_FOPxYZjc
>
>On 19 December 2019 at 00:41:02 GMT+03:00, Sinan Polat < si...@turka.nl > wrote:
>>Hi,
>>
>>I am aware that
>>https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>holds a list with benchmarks of quite a few different SSD models.
>>Unfortunately it doesn't have benchmarks for recent SSD models.
>>
>>A client is planning to expand a running cluster (Luminous, FileStore, SSD
>>only, replicated). I/O utilization is close to 0, but capacity-wise the
>>cluster is almost nearfull. To save costs the cluster will be expanded with
>>consumer-grade SSDs, but I am unable to find benchmarks of recent SSD models.
>>
>>Does anyone have experience with Samsung 860 EVO, 860 PRO and Crucial MX500 in
>>a Ceph cluster?
>>
>>Thanks!
>>Sinan
>-- 
>With best regards,
>Vitaliy Filippov


Re: [ceph-users] HA and data recovery of CEPH

2019-11-28 Thread jesper

Hi Nathan

Is that true?

The time it takes to reallocate the primary PG delivers “downtime” by design, right?
Seen from a writing client's perspective.

Jesper



Sent from myMail for iOS


Friday, 29 November 2019, 06.24 +0100 from pen...@portsip.com:
>Hi Nathan, 
>
>Thanks for the help.
>My colleague will provide more details.
>
>BR
>On Fri, Nov 29, 2019 at 12:57 PM Nathan Fish < lordci...@gmail.com > wrote:
>>If correctly configured, your cluster should have zero downtime from a
>>single OSD or node failure. What is your crush map? Are you using
>>replica or EC? If your 'min_size' is not smaller than 'size', then you
>>will lose availability.
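>>
>>(A minimal sketch of checking that, assuming a replicated pool named "rbd" - the pool name is an example:)
>>
>>$ ceph osd pool get rbd size
>>$ ceph osd pool get rbd min_size
>>$ ceph osd pool set rbd min_size 2   # keep min_size < size so I/O continues with one replica down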
>>
>>On Thu, Nov 28, 2019 at 10:50 PM Peng Bo < pen...@portsip.com > wrote:
>>>
>>> Hi all,
>>>
>>> We are working on using Ceph to build our HA system; the purpose is that the
>>> system should always provide service even if a Ceph node is down or an OSD is
>>> lost.
>>>
>>> Currently, as we have observed, once a node/OSD is down the Ceph cluster needs
>>> about 40 seconds to sync data, and our system can't provide service
>>> during that time.
>>>
>>> My questions:
>>>
>>> Is there any way we can reduce the data sync time?
>>> How can we keep Ceph available once a node/OSD is down?
>>>
>>>
>>> BR
>>>
>>> --
>>> The modern Unified Communications provider
>>>
>>>  https://www.portsip.com
>
>
>-- 
>The modern Unified Communications provider
>
>https://www.portsip.com


Re: [ceph-users] NVMe disk - size

2019-11-17 Thread jesper

Is c) the bcache solution?

Real-life experience: unless you are really beating an enterprise SSD with
writes, they last very, very long, and even when a failure does happen you can
typically see it coming in the SMART wear levels months in advance.

I would go for c), but if possible add one more NVMe to each host - we have a
9-HDD + 3-SSD scenario here.
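
(As a side note, the SMART wear level mentioned above can be watched with smartctl - a sketch; device names and attribute names vary by vendor and by SATA vs NVMe:)

$ smartctl -a /dev/nvme0 | grep -i "percentage used"
$ smartctl -a /dev/sda | grep -i -E "wear|media_wearout"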

Jesper



Sent from myMail for iOS


Monday, 18 November 2019, 07.49 +0100 from kristof.cou...@gmail.com:
>Hi all,
>
>Thanks for the feedback.
>Though, just to be sure:
>
>1. There is no 30GB limit, if I understand correctly, for the RocksDB size. If
>metadata crosses that barrier, will the L4 part spill over to the primary
>device? Or will it just move the RocksDB completely? Or will it just stop and
>indicate it's full?
>2. Since the WAL will also be written to that device, I assume a few
>additional GBs are still useful...
>
>With my setup (13x 14TB + 2x 1.6TB NVMe per host, 10 hosts) I have multiple
>possible scenarios:
>- Assigning 35GB of space on the NVMe disk (30GB for DB, 5GB spare) would result
>in only 455GB being used (13 x 35GB). This is a pity, since I have 3.2TB of
>NVMe disk space...
>
>Options line-up:
>
>Option a : Not using the NVMe for block.db storage, but as RGW metadata pool.
>Advantages:
>- Impact of 1 defect NVMe is limited.
>- Fast storage for the metadata pool.
>Disadvantage:
>- RocksDB for each OSD is on the primary disk, resulting in slower performance 
>of each OSD.
>
>Option b:  Hardware mirror of the NVMe drive
>Advantages:
>- Impact of 1 defect NVMe is limited
>- Fast KV lookup for each OSD
>Disadvantage:
>- I/O to NVMe is serialized for all OSDs on 1 host. Though the NVMe are fast, 
>I imagine that there still is an impact.
>- 1 TB of NVMe is not used / host
>
>Option c:  Split the NVMes across the OSDs
>Advantages:
>- Fast RocksDB access - up to L3 (assuming spillover does its job)
>Disadvantages:
>- 1 defective NVMe impacts at most 7 OSDs (1 NVMe assigned to 7 or 6 OSD daemons per
>host)
>- 2.7TB of NVMe space not used per host
>
>Option d:  1 NVMe disk for OSDs, 1 for the RGW metadata pool
>Advantages:
>- Fast RocksDB access - up to L3
>- Fast RGW metadata pool (though limited to 5.3TB: the raw pool size will be 16TB,
>divided by 3 due to replication). I assume this already gives some possibilities.
>Disadvantages:
>- 1 defective NVMe might impact a complete host (all OSDs might be using it for
>the RocksDB storage)
>- 1 TB of NVMe is not used
>
>A tough menu to choose from, each option with its possibilities... The initial idea was
>to assign 200GB of the NVMe space per OSD, but this would result in a
>lot of unused space. I don't know if there is anything on the roadmap to adapt
>the RocksDB sizing to make better use of the available NVMe disk space.
>With all this information, I would assume that the best option would be option
>A. Since we will be using erasure coding for the RGW data pool (k=6, m=3),
>the impact of a defective NVMe would be too significant. The other alternative
>would be option b, but then again we would be dealing with HW RAID, which is
>against all Ceph design rules.
>
>Any other options or (dis)advantages I missed? Or any other opinions to choose 
>another option?
>
>Regards,
>
>Kristof
>On Fri, 15 Nov 2019 at 18:22, < vita...@yourcmc.ru > wrote:
>>Use 30 GB for all OSDs. Other values are pointless, because 
>>https://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing
>>
>>You can use the rest of free NVMe space for bcache - it's much better 
>>than just allocating it for block.db.
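
(For reference, a sketch of what a fixed ~30 GB block.db placement per OSD could look like with ceph-volume - device and partition names are examples only:)

$ ceph-volume lvm create --data /dev/sdb --block.db /dev/nvme0n1p1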


Re: [ceph-users] cephfs: apache locks up after parallel reloads on multiple nodes

2019-09-12 Thread jesper



Thursday, 12 September 2019, 17.16 +0200 from Paul Emmerich:
>Yeah, CephFS is much closer to POSIX semantics for a filesystem than
>NFS. There's an experimental relaxed mode called LazyIO but I'm not
>sure if it's applicable here.
>
>You can debug this by dumping slow requests from the MDS servers via
>the admin socket
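>
>(A rough sketch of that via the admin socket - the MDS name is a placeholder:)
>
>$ ceph daemon mds.<name> dump_ops_in_flight
>$ ceph daemon mds.<name> dump_historic_ops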

Is LazyIO supported by the kernel client? If so, from which kernel version?

Jesper


Re: [ceph-users] Ceph for "home lab" / hobbyist use?

2019-09-07 Thread jesper

Saturday, 7 September 2019, 15.25 +0200 from wil...@gmail.com:


>On a related note, I came across this hardware while searching around
>on this topic:  https://ambedded.com/ambedded_com/ARM

Interesting to see the cost of those. 8 LFF drives in 1U is pretty dense. 
Anyone using similar concepts in enterprise environments?

Jesper


Re: [ceph-users] optane + 4x SSDs for VM disk images?

2019-08-12 Thread jesper
>> Could performance of Optane + 4x SSDs per node ever exceed that of
>> pure Optane disks?
>
> No. With Ceph, the results for Optane and just for good server SSDs are
> almost the same. One thing is that you can run more OSDs per an Optane
> than per a usual SSD. However, the latency you get from both is almost
> the same as most of it comes from Ceph itself, not from the underlying
> storage. This also results in Optanes being useless for
> block.db/block.wal if your SSDs aren't shitty desktop ones.
>
> And as usual I'm posting the link to my article
> https://yourcmc.ru/wiki/Ceph_performance :)

You write that they are not reporting QD=1 single-threaded numbers,
but in Tables 10 and 11 the average latencies are reported, which
is "close to the same", so one can derive:

Read latency: 0.32ms (thereby 3125 IOPS)
Write latency: 1.1ms (thereby 909 IOPS)

Really nice writeup and very true - should be a must-read for anyone
starting out with Ceph.

-- 
Jesper



Re: [ceph-users] OSD caching on EC-pools (heavy cross OSD communication on cached reads)

2019-06-09 Thread jesper

Makes sense - it makes the case for EC pools smaller, though.

Jesper



Sent from myMail for iOS


Sunday, 9 June 2019, 17.48 +0200 from paul.emmer...@croit.io  
:
>Caching is handled in BlueStore itself, erasure coding happens on a higher 
>layer.
>
>
>Paul
>
>-- 
>Paul Emmerich
>
>Looking for help with your Ceph cluster? Contact us at  https://croit.io
>
>croit GmbH
>Freseniusstr. 31h
>81247 München
>www.croit.io
>Tel:  +49 89 1896585 90
>
>On Sun, Jun 9, 2019 at 8:43 AM < jes...@krogh.cc > wrote:
>>Hi.
>>
>>I just changed some of my data on CephFS to go to the EC pool instead
>>of the 3x replicated pool. The data is "write rare / read heavy" data
>>being served to an HPC cluster.
>>
>>To my surprise it looks like the OSD memory caching is done at the
>>"split object level" not at the "assembled object level", as a
>>consequence - even though the dataset is fully memory cached it
>>actually delivers very "heavy" cross-OSD network traffic to
>>assemble the objects back.
>>
>>Since (as far as I understand) no changes can go to the underlying
>>object without going through the primary PG, caching could be
>>done more effectively at that level.
>>
>>The caching on the 3x replica does not retrieve all 3 copies to compare
>>and verify on a read request (or I at least cannot see any network
>>traffic supporting that it should be the case).
>>
>>Is the above configurable? Or would that be a feature/performance request?
>>
>>Jesper
>>


[ceph-users] OSD caching on EC-pools (heavy cross OSD communication on cached reads)

2019-06-09 Thread jesper
Hi.

I just changed some of my data on CephFS to go to the EC pool instead
of the 3x replicated pool. The data is "write rare / read heavy" data
being served to an HPC cluster.

To my surprise it looks like the OSD memory caching is done at the
"split object level" not at the "assembled object level", as a
consequence - even though the dataset is fully memory cached it
actually delivers very "heavy" cross-OSD network traffic to
assemble the objects back.

Since (as far as I understand) no changes can go to the underlying
object without going through the primary PG, caching could be
done more effectively at that level.

The caching on the 3x replica does not retrieve all 3 copies to compare
and verify on a read request (or I at least cannot see any network
traffic supporting that it should be the case).

Is the above configurable? Or would that be a feature/performance request?
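
(Not directly the caching question, but for a read-heavy EC pool the per-pool fast_read option - which issues reads to all shards and returns once k have answered - is easy to toggle; a sketch, using the pool name from this setup:)

$ ceph osd pool set cephfs_data_ec42 fast_read 1
$ ceph osd pool get cephfs_data_ec42 fast_read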

Jesper



Re: [ceph-users] OSD RAM recommendations

2019-06-07 Thread jesper
> I'm a bit confused by the RAM recommendations for OSD servers. I have
> also seen conflicting information in the lists (1 GB RAM per OSD, 1 GB
> RAM per TB, 3-5 GB RAM per OSD, etc.). I guess I'm a lot better with a
> concrete example:

I think it depends on the usage pattern - the more the better.
When configured for it, the OSD daemon will use the memory as a disk cache for
reads - I have a similar setup, 7 hosts x 12 x 10TB disks, with 512GB RAM each.
This serves an "active dataset" to an HPC cluster, where it is hugely
beneficial to be able to cache the "hot data", which is 1.5TB-ish.

If your "hot" dataset is smaller, then less will do as well.

Jesper




Re: [ceph-users] Single threaded IOPS on SSD pool.

2019-06-06 Thread jesper
> Hi,
>
> El 5/6/19 a las 16:53, vita...@yourcmc.ru escribió:
>>> Ok, average network latency from VM to OSD's ~0.4ms.
>>
>> It's rather bad, you can improve the latency by 0.3ms just by
>> upgrading the network.
>>
>>> Single threaded performance ~500-600 IOPS - or average latency of 1.6ms
>>> Is that comparable to what other are seeing?
>>
>> Good "reference" numbers are 0.5ms for reads (~2000 iops) and 1ms for
>> writes (~1000 iops).
>>
>> I confirm that the most powerful thing to do is disabling CPU
>> powersave (governor=performance + cpupower -D 0). You usually get 2x
>> single thread iops at once.
>
> We have a small cluster with 4 OSD host, each with 1 SSD INTEL
> SSDSC2KB019T8 (D3-S4510 1.8T), connected with a 10G network (shared with
> VMs, not a busy cluster). Volumes are replica 3:
>
> Network latency from one node to the other 3:
> 10 packets transmitted, 10 received, 0% packet loss, time 9166ms
> rtt min/avg/max/mdev = 0.042/0.064/0.088/0.013 ms
>
> 10 packets transmitted, 10 received, 0% packet loss, time 9190ms
> rtt min/avg/max/mdev = 0.047/0.072/0.110/0.017 ms
>
> 10 packets transmitted, 10 received, 0% packet loss, time 9219ms
> rtt min/avg/max/mdev = 0.061/0.078/0.099/0.011 ms

What NIC / switching components are in play here? I simply cannot get
latencies this far down.

Jesper



[ceph-users] Single threaded IOPS on SSD pool.

2019-06-05 Thread jesper
Hi.

This is more an inquiry to figure out how our current setup compares
to other setups. I have a 3x replicated SSD pool with RBD images.
When running fio on /tmp I'm interested in seeing how many IOPS a
single thread can get - as Ceph scales up very nicely with concurrency.

Currently 34 OSDs of ~896GB Intel D3-S4510 each, across 7 OSD hosts.

jk@iguana:/tmp$ for i in 01 02 03 04 05 06 07; do ping -c 10 ceph-osd$i;
done  |egrep '(statistics|rtt)'
--- ceph-osd01.nzcorp.net ping statistics ---
rtt min/avg/max/mdev = 0.316/0.381/0.483/0.056 ms
--- ceph-osd02.nzcorp.net ping statistics ---
rtt min/avg/max/mdev = 0.293/0.415/0.625/0.100 ms
--- ceph-osd03.nzcorp.net ping statistics ---
rtt min/avg/max/mdev = 0.319/0.395/0.558/0.074 ms
--- ceph-osd04.nzcorp.net ping statistics ---
rtt min/avg/max/mdev = 0.224/0.352/0.492/0.077 ms
--- ceph-osd05.nzcorp.net ping statistics ---
rtt min/avg/max/mdev = 0.257/0.360/0.444/0.059 ms
--- ceph-osd06.nzcorp.net ping statistics ---
rtt min/avg/max/mdev = 0.209/0.334/0.442/0.062 ms
--- ceph-osd07.nzcorp.net ping statistics ---
rtt min/avg/max/mdev = 0.259/0.401/0.517/0.069 ms

Ok, average network latency from VM to OSD's ~0.4ms.

$ fio fio-job-randr.ini
test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.2.10
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [2145KB/0KB/0KB /s] [536/0/0 iops]
[eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=29519: Wed Jun  5 08:40:51 2019
  Description  : [fio random 4k reads]
  read : io=143352KB, bw=2389.2KB/s, iops=597, runt= 60001msec
slat (usec): min=8, max=1925, avg=30.24, stdev=13.56
clat (usec): min=7, max=321039, avg=1636.47, stdev=4346.52
 lat (usec): min=102, max=321074, avg=1667.58, stdev=4346.57
clat percentiles (usec):
 |  1.00th=[  157],  5.00th=[  844], 10.00th=[  924], 20.00th=[ 1012],
 | 30.00th=[ 1096], 40.00th=[ 1160], 50.00th=[ 1224], 60.00th=[ 1304],
 | 70.00th=[ 1400], 80.00th=[ 1528], 90.00th=[ 1768], 95.00th=[ 2128],
 | 99.00th=[11328], 99.50th=[18304], 99.90th=[51456], 99.95th=[94720],
 | 99.99th=[216064]
bw (KB  /s): min=0, max= 3089, per=99.39%, avg=2374.50, stdev=472.15
lat (usec) : 10=0.01%, 100=0.01%, 250=2.95%, 500=0.03%, 750=0.27%
lat (usec) : 1000=14.96%
lat (msec) : 2=75.87%, 4=2.99%, 10=1.78%, 20=0.73%, 50=0.30%
lat (msec) : 100=0.07%, 250=0.03%, 500=0.01%
  cpu  : usr=0.76%, sys=3.29%, ctx=38871, majf=0, minf=11
  IO depths: 1=108.2%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
 issued: total=r=35838/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
 latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=143352KB, aggrb=2389KB/s, minb=2389KB/s, maxb=2389KB/s,
mint=60001msec, maxt=60001msec

Disk stats (read/write):
  vda: ios=38631/51, merge=0/3, ticks=62668/40, in_queue=62700, util=96.77%


And fio-file:
$ cat fio-job-randr.ini
[global]
readwrite=randread
blocksize=4k
ioengine=libaio
numjobs=1
thread=0
direct=1
iodepth=1
group_reporting=1
ramp_time=5
norandommap=1
description=fio random 4k reads
time_based=1
runtime=60
randrepeat=0

[test]
size=1g


Single-threaded performance is ~500-600 IOPS - or an average latency of 1.6ms.
Is that comparable to what others are seeing?




[ceph-users] Noob question - ceph-mgr crash on arm

2019-05-20 Thread Jesper Taxbøl
  0/ 5 client
  1/ 5 osd
  0/ 5 optracker
  0/ 5 objclass
  1/ 3 filestore
  1/ 3 journal
  0/ 5 ms
  1/ 5 mon
  0/10 monc
  1/ 5 paxos
  0/ 5 tp
  1/ 5 auth
  1/ 5 crypto
  1/ 1 finisher
  1/ 1 reserver
  1/ 5 heartbeatmap
  1/ 5 perfcounter
  1/ 5 rgw
  1/10 civetweb
  1/ 5 javaclient
  1/ 5 asok
  1/ 1 throttle
  0/ 0 refs
  1/ 5 xio
  1/ 5 compressor
  1/ 5 bluestore
  1/ 5 bluefs
  1/ 3 bdev
  1/ 5 kstore
  4/ 5 rocksdb
  4/ 5 leveldb
  4/ 5 memdb
  1/ 5 kinetic
  1/ 5 fuse
  1/ 5 mgr
  1/ 5 mgrc
  1/ 5 dpdk
  1/ 5 eventtrace
 -2/-2 (syslog threshold)
 -1/-1 (stderr threshold)
 max_recent 1
 max_new 1000
 log_file /var/log/ceph/ceph-mgr.odroid-c.log
--- end dump of recent events ---



Kind regards

Jesper


[ceph-users] fscache and cephfs

2019-05-17 Thread jesper


Do the two work together nicely? Is anyone using it?

With NVMe drives being fairly cheap, it could stack up pretty nicely.

Jesper


Sent from myMail for iOS


[ceph-users] Stalls on new RBD images.

2019-05-08 Thread jesper
Hi.

I'm fishing a bit here.

What we see is that with new VM/RBD/SSD-backed images, performance can be
lousy until they have been "fully written" the first time. It is as if they are
thin-provisioned and the subsequent growing of the images in Ceph delivers a
performance hit.

Does anyone else see something similar in their setup - and how do you deal
with it?

KVM-based virtualization, Ceph Luminous.

Any suggestions/hints are welcome.

Jesper



Re: [ceph-users] How to reduce HDD OSD flapping due to rocksdb compacting event?

2019-04-10 Thread jesper
> On 4/10/19 9:07 AM, Charles Alva wrote:
>> Hi Ceph Users,
>>
>> Is there a way around to minimize rocksdb compacting event so that it
>> won't use all the spinning disk IO utilization and avoid it being marked
>> as down due to fail to send heartbeat to others?
>>
>> Right now we have frequent high IO disk utilization for every 20-25
>> minutes where the rocksdb reaches level 4 with 67GB data to compact.
>>
>
> How big is the disk? RocksDB will need to compact at some point and it
> seems that the HDD can't keep up.
>
> I've seen this with many customers and in those cases we offloaded the
> WAL+DB to an SSD.

I guess the SSD needs to be pretty durable to handle that?

Is there a "migration path" to offload this, or is it necessary to destroy
and re-create the OSD?
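
(On the migration path: newer releases ship a ceph-bluestore-tool subcommand that can attach a new DB device to an existing OSD in place - a rough sketch, assuming a release that supports it; the OSD id, paths and devices are examples, and the OSD must be stopped first:)

$ systemctl stop ceph-osd@12
$ ceph-bluestore-tool bluefs-bdev-new-db --path /var/lib/ceph/osd/ceph-12 --dev-target /dev/nvme0n1p3
$ systemctl start ceph-osd@12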

Thanks.

Jesper




[ceph-users] VM management setup

2019-04-05 Thread jesper
Hi. Knowing this is a bit off-topic, but seeking recommendations
and advice anyway.

We're seeking a "management" solution for VMs - currently at 40-50
VMs - and would like better tooling for managing them and potentially
migrating them across multiple hosts, setting up block devices, etc.

This is only to be used internally in a department where a bunch of
engineering people will manage it; no customers and that kind of thing.

Up until now we have been using virt-manager with KVM, and have been
quite satisfied while we were at a "few VMs", but it seems like it is
time to move on.

Thus we're looking for something "simple" that can help manage a Ceph+KVM
based setup - the simpler and more to the point, the better.

Any recommendations?

.. found a lot of names already ..
OpenStack
CloudStack
Proxmox
..

But recommendations are truly welcome.

Thanks.



Re: [ceph-users] fio test rbd - single thread - qd1

2019-03-20 Thread jesper
> `cpupower idle-set -D 0` will help you a lot, yes.
>
> However it seems that not only the bluestore makes it slow. >= 50% of the
> latency is introduced by the OSD itself. I'm just trying to understand
> WHAT parts of it are doing so much work. For example in my current case
> (with cpupower idle-set -D 0 of course) when I was testing a single OSD on
> a very good drive (Intel NVMe, capable of 4+ single-thread sync write
> iops) it was delivering me only 950-1000 iops. It's roughly 1 ms latency,
> and only 50% of it comes from bluestore (you can see it `ceph daemon osd.x
> perf dump`)! I've even tuned bluestore a little, so that now I'm getting
> ~1200 iops from it. It means that the bluestore's latency dropped by 33%
> (it was around 1/1000 = 500 us, now it is 1/1200 = ~330 us). But still the
> overall improvement is only 20% - everything else is eaten by the OSD
> itself.


Thanks for the insight - that means the SSD numbers for read/write
performance are roughly OK, I guess.

It still puzzles me why the BlueStore caching does not benefit
the read side.

Is the cache not an LRU cache over the block device, or is it actually used for
something else?

Jesper



Re: [ceph-users] fio test rbd - single thread - qd1

2019-03-19 Thread jesper
> One thing you can check is the CPU performance (cpu governor in
> particular).
> On such light loads I've seen CPUs sitting in low performance mode (slower
> clocks), giving MUCH worse performance results than when tried with
> heavier
> loads. Try "cpupower monitor" on OSD nodes in a loop and observe the core
> frequencies.
>

Thanks for the suggestion. They seem to all be powered up. Other
suggestions/reflections are truly welcome. Thanks.
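
(For reference, the settings being suggested boil down to something like the following - a sketch; whether disabling deeper idle states is appropriate depends on the platform:)

$ cpupower frequency-set -g performance
$ cpupower idle-set -D 0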

Jesper



[ceph-users] fio test rbd - single thread - qd1

2019-03-19 Thread jesper
Hi All.

I'm trying to get my head around which applications we can stretch our Ceph
cluster to cover. Parallelism works excellently, but baseline throughput
is - perhaps - not what I would expect it to be.

Luminous cluster running BlueStore - all OSD daemons have 16GB of cache.

Fio files attached - 4KB random read and 4KB random write - the test file is
"only" 1GB.
In this I ONLY care about raw IOPS numbers.

I have 2 pools, both 3x replicated: one backed by SSDs (14 x 1TB S4510s)
and one by HDDs (84 x 10TB).

Network latency from rbd mount to one of the osd-hosts.
--- ceph-osd01.nzcorp.net ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9189ms
rtt min/avg/max/mdev = 0.084/0.108/0.146/0.022 ms

SSD:
randr:
# grep iops read*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
N   Min   MaxMedian   AvgStddev
x  38   1727.07   2033.66   1954.71 1949.4789 46.592401
randw:
# grep iops write*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
N   Min   MaxMedian   AvgStddev
x  36400.05455.26436.58 433.91417 12.468187

The double (or triple) network penalty of course kicks in and delivers a
lower throughput here.
Are these performance numbers in the ballpark of what we'd expect?

With a 1GB test file I would really expect this to be memory-cached in
the OSD/BlueStore cache, and thus deliver read IOPS closer to the theoretical
max: 1s / 0.108ms => 9.2K IOPS.

Again on the write side - all OSDs are backed by a battery-backed write
cache, thus writes should go directly into the memory of the controller.
Still slower than reads, due to having to visit 3 hosts, but not this low?

Suggestions for improvements? Are other people seeing similar results?

For the HDD tests I get similar - surprisingly slow numbers:
# grep iops write*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
N   Min   MaxMedian   AvgStddev
x  38 36.91 118.8 69.14 72.926842  21.75198

This should have the same performance characteristics as the SSD's as the
writes should be hitting BBWC.

# grep iops read*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
N   Min   MaxMedian   AvgStddev
x  39 26.18181.51 48.16 50.574872  24.01572

Same here - it should be cached in the BlueStore cache, as it is 16GB x 84
OSDs, with a 1GB test file.

Any thoughts - suggestions - insights ?

Jesper

fio-single-thread-randr.ini
Description: Binary data


fio-single-thread-randw.ini
Description: Binary data


Re: [ceph-users] How to just delete PGs stuck incomplete on EC pool

2019-03-02 Thread jesper

Did they break, or did something go wrong trying to replace them?

Jesper



Sent from myMail for iOS


Saturday, 2 March 2019, 14.34 +0100 from Daniel K  :
>I bought the wrong drives trying to be cheap. They were 2TB WD Blue 5400rpm 
>2.5 inch laptop drives.
>
>They've been replace now with HGST 10K 1.8TB SAS drives.
>
>
>
>On Sat, Mar 2, 2019, 12:04 AM  < jes...@krogh.cc > wrote:
>>
>>
>>Saturday, 2 March 2019, 04.20 +0100 from  satha...@gmail.com < 
>>satha...@gmail.com >:
>>>56 OSD, 6-node 12.2.5 cluster on Proxmox
>>>
>>>We had multiple drives fail(about 30%) within a few days of each other, 
>>>likely faster than the cluster could recover.
>>
>>How did so many drives break?
>>
>>Jesper


Re: [ceph-users] How to just delete PGs stuck incomplete on EC pool

2019-03-01 Thread jesper



Saturday, 2 March 2019, 04.20 +0100 from satha...@gmail.com:
>56 OSD, 6-node 12.2.5 cluster on Proxmox
>
>We had multiple drives fail(about 30%) within a few days of each other, likely 
>faster than the cluster could recover.

How did so many drives break?

Jesper


Re: [ceph-users] Understanding EC properties for CephFS / small files.

2019-02-17 Thread jesper
Hi Paul.

Thanks for your comments.

> For your examples:
>
> 16 MB file -> 4x 4 MB objects -> 4x 4x 1 MB data chunks, 4x 2x 1 MB
> coding chunks
>
> 512 kB file -> 1x 512 kB object -> 4x 128 kB data chunks, 2x 128 kb
> coding chunks
>
>
> You'll run into different problems once the erasure coded chunks end
> up being smaller than 64kb each due to bluestore min allocation sizes
> and general metadata overhead making erasure coding a bad fit for very
> small files.

Thanks for the clarification, which makes this a "very bad fit" for CephFS:

# find . -type f -print0 | xargs -0 stat | grep Size | perl -ane '/Size:
(\d+)/; print $1 . "\n";' | ministat -n
x 
N   Min   MaxMedian   AvgStddev
x 12651568 0 1.0840049e+11  9036 2217611.6 
32397960

That gives me 6.3M files < 9036 bytes in size, which will be stored as 6 x 64KB
at the BlueStore level if I understand it correctly.

We come from an XFS world where the default block size is 4K, so the situation
above worked quite nicely. I guess I would probably be way better off with an
RBD with XFS on top to solve this case using Ceph.

Is it fair to summarize your input as:

In an EC4+2 configuration, the minimum space used is 256KB + 128KB (coding),
regardless of file size.
In an EC8+3 configuration, the minimum space used is 512KB + 192KB (coding),
regardless of file size.

And for the access side:
All access to files in an EC pool requires, as a minimum, IO requests to
k shards before the first bytes can be returned; with fast_read the requests
go to all shards, but the read returns once k have responded.

Any experience with inlining data on the MDS - that would obviously help
here I guess.

Thanks.

-- 
Jesper



Re: [ceph-users] CephFS - read latency.

2019-02-17 Thread jesper
> Probably not related to CephFS. Try to compare the latency you are
> seeing to the op_r_latency reported by the OSDs.
>
> The fast_read option on the pool can also help a lot for this IO pattern.

Magic, that actually cut the read-latency in half - making it more
aligned with what to expect from the HW+network side:

N   Min   MaxMedian   AvgStddev
x 100  0.015687  0.221538  0.0252530.03259606   0.028827849

25ms as a median, 32ms average is still on the high side,
but way, way better.

Thanks.

-- 
Jesper



Re: [ceph-users] Understanding EC properties for CephFS / small files.

2019-02-16 Thread jesper
> I'm trying to understand the nuts and bolts of EC / CephFS
> We're running an EC4+2 pool on top of 72 x 7.2K rpm 10TB drives. Pretty
> slow bulk / archive storage.

Ok, did some more searching and found this:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021642.html.

Which to some degree confirms my understanding, I'd still like to get
even more insight though.

Gregory Farnum comes with this comments:
"Unfortunately any logic like this would need to be handled in your
application layer. Raw RADOS does not do object sharding or aggregation on
its own.
CERN did contribute the libradosstriper, which will break down your
multi-gigabyte objects into more typical sizes, but a generic system for
packing many small objects into larger ones is tough — the choices depend
so much on likely access patterns and such.

I would definitely recommend working out something like that, though!
"
An idea about how to advance this stuff:

I can see that this would be "very hard" to do within the Ceph concepts at the
object level, but a suggestion would be to do it at the CephFS/MDS level.

A basic thing that would "often" work would be to have, at the directory level,
a special type of "packed" object where multiple files go into the same CephFS
object. For common access patterns people are reading through entire
directories in the first place, which would also limit IO on the overall system
for tree traversals (think tar'ing up a kernel source tree or a git checkout).
I have no idea how CephFS deals with concurrent updates around such entities,
but in this scheme concurrency would have to be handled at the
packed-object level.

It would be harder to "pack files across directories", since that is
not the native way for the MDS to keep track of things.

A third way would be to inline data on the MDS more "aggressively".
How mature / well-tested / efficient is that feature?

http://docs.ceph.com/docs/master/cephfs/experimental-features/

The unfortunate consequence of bumping the 2KB size upwards, to the point
where EC pools become efficient, would be that we end up hitting the MDS way
harder than we do today. 2KB seems like a safe limit.
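
(For reference, inline data is a per-filesystem toggle - a sketch with "cephfs" as an example filesystem name; newer releases may require an extra confirmation flag, and the docs above mark the feature as experimental:)

$ ceph fs set cephfs inline_data true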





[ceph-users] Understanding EC properties for CephFS / small files.

2019-02-16 Thread jesper
Hi List.

I'm trying to understand the nuts and bolts of EC / CephFS
We're running an EC4+2 pool on top of 72 x 7.2K rpm 10TB drives. Pretty
slow bulk / archive storage.

# getfattr -n ceph.dir.layout /mnt/home/cluster/mysqlbackup
getfattr: Removing leading '/' from absolute path names
# file: mnt/home/cluster/mysqlbackup
ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304
pool=cephfs_data_ec42"

This configuration is taken directly out of the online documentation:
(Which may have been where it went all wrong from our perspective):

http://docs.ceph.com/docs/master/cephfs/file-layouts/

Ok, this means that a 16MB file will be split into 4 chunks of 4MB each,
with 2 erasure-coding chunks? I don't really understand the stripe_count
element?

And since erasure coding works at the object level, striping individual
objects across - here 4 chunks - it'll end up filling 16MB? Or
is there an internal optimization causing this not to be the case?

Additionally, when reading the file, all 4 chunks need to be read to
assemble the object, causing (at a minimum) 4 IOPS per file.

Now, my common file size is < 8MB, and 512KB files are common on
this pool.

Will that cause a 512KB file to be padded to 4MB with 3 empty chunks
to fill the erasure-coded profile, and then 2 coding chunks on top?
In total 24MB for storing 512KB?

And when reading it, will I hit 4 random IOs to read 512KB, or can
it optimize around not reading "empty" chunks?

If this is true, then I would be both performance and space/cost-wise
way better off with 3x replication.

Or is it less worse than what I get to here?

If the math is true, then we can begin to calculate chunksize and
EC profiles for when EC begins to deliver benefits.

In terms of IO it seems like I'll always suffer a 1:4 ratio on IOPS in
a reading scenario on a 4+2 EC pool, compared to a 3x replication.
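
(If that trade-off turns out to be unacceptable, the small-file directories could be pointed at a replicated data pool via a directory layout, mirroring the getfattr above - a sketch; "cephfs_data" is a hypothetical replicated pool name, and the layout only affects newly created files:)

$ setfattr -n ceph.dir.layout.pool -v cephfs_data /mnt/home/cluster/mysqlbackup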

Side note: I'm trying to get Bacula (tape backup) to read my archive off
to tape at a "reasonable time/speed".

Thanks in advance.

-- 
Jesper



Re: [ceph-users] PG_AVAILABILITY with one osd down?

2019-02-16 Thread jesper
> Hello,
> your log extract shows that:
>
> 2019-02-15 21:40:08 OSD.29 DOWN
> 2019-02-15 21:40:09 PG_AVAILABILITY warning start
> 2019-02-15 21:40:15 PG_AVAILABILITY warning cleared
>
> 2019-02-15 21:44:06 OSD.29 UP
> 2019-02-15 21:44:08 PG_AVAILABILITY warning start
> 2019-02-15 21:44:15 PG_AVAILABILITY warning cleared
>
> What you saw is the natural consequence of OSD state change. Those two
> periods of limited PG availability (6s each) are related to peering
> that happens shortly after an OSD goes down or up.
> Basically, the placement groups stored on that OSD need peering, so
> the incoming connections are directed to other (alive) OSDs. And, yes,
> during those few seconds the data are not accessible.

Thanks, and please bear with my questions - I'm pretty new to Ceph.
What will clients (CephFS, object) experience?
Will they just block until the peering has passed and then get through?

Does that mean I'll get 72 x 6 seconds of unavailability when doing
a rolling restart of my OSDs during upgrades and such? Or is a
controlled restart different from a crash?
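
(For planned maintenance the usual pattern is to tell the cluster not to mark restarting OSDs out - a sketch; this avoids rebalancing during the restarts, though the short peering pause per OSD still applies:)

$ ceph osd set noout
  ... restart / upgrade the OSDs ...
$ ceph osd unset noout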

-- 
Jesper.



[ceph-users] PG_AVAILABILITY with one osd down?

2019-02-15 Thread jesper
Yesterday I saw this one.. it puzzles me:
2019-02-15 21:00:00.000126 mon.torsk1 mon.0 10.194.132.88:6789/0 604164 :
cluster [INF] overall HEALTH_OK
2019-02-15 21:39:55.793934 mon.torsk1 mon.0 10.194.132.88:6789/0 604304 :
cluster [WRN] Health check failed: 2 slow requests are blocked > 32 sec.
Implicated osds 58 (REQUEST_SLOW)
2019-02-15 21:40:00.887766 mon.torsk1 mon.0 10.194.132.88:6789/0 604305 :
cluster [WRN] Health check update: 6 slow requests are blocked > 32 sec.
Implicated osds 9,19,52,58,68 (REQUEST_SLOW)
2019-02-15 21:40:06.973901 mon.torsk1 mon.0 10.194.132.88:6789/0 604306 :
cluster [WRN] Health check update: 14 slow requests are blocked > 32 sec.
Implicated osds 3,9,19,29,32,52,55,58,68,69 (REQUEST_SLOW)
2019-02-15 21:40:08.466266 mon.torsk1 mon.0 10.194.132.88:6789/0 604307 :
cluster [INF] osd.29 failed (root=default,host=bison) (6 reporters from
different host after 33.862482 >= grace 29.247323)
2019-02-15 21:40:08.473703 mon.torsk1 mon.0 10.194.132.88:6789/0 604308 :
cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-02-15 21:40:09.489494 mon.torsk1 mon.0 10.194.132.88:6789/0 604310 :
cluster [WRN] Health check failed: Reduced data availability: 6 pgs
peering (PG_AVAILABILITY)
2019-02-15 21:40:11.008906 mon.torsk1 mon.0 10.194.132.88:6789/0 604312 :
cluster [WRN] Health check failed: Degraded data redundancy:
3828291/700353996 objects degraded (0.547%), 77 pgs degraded (PG_DEGRADED)
2019-02-15 21:40:13.474777 mon.torsk1 mon.0 10.194.132.88:6789/0 604313 :
cluster [WRN] Health check update: 9 slow requests are blocked > 32 sec.
Implicated osds 3,9,32,55,58,69 (REQUEST_SLOW)
2019-02-15 21:40:15.060165 mon.torsk1 mon.0 10.194.132.88:6789/0 604314 :
cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data
availability: 17 pgs peering)
2019-02-15 21:40:17.128185 mon.torsk1 mon.0 10.194.132.88:6789/0 604315 :
cluster [WRN] Health check update: Degraded data redundancy:
9897139/700354131 objects degraded (1.413%), 200 pgs degraded
(PG_DEGRADED)
2019-02-15 21:40:17.128219 mon.torsk1 mon.0 10.194.132.88:6789/0 604316 :
cluster [INF] Health check cleared: REQUEST_SLOW (was: 2 slow requests are
blocked > 32 sec. Implicated osds 32,55)
2019-02-15 21:40:22.137090 mon.torsk1 mon.0 10.194.132.88:6789/0 604317 :
cluster [WRN] Health check update: Degraded data redundancy:
9897140/700354194 objects degraded (1.413%), 200 pgs degraded
(PG_DEGRADED)
2019-02-15 21:40:27.249354 mon.torsk1 mon.0 10.194.132.88:6789/0 604318 :
cluster [WRN] Health check update: Degraded data redundancy:
9897142/700354287 objects degraded (1.413%), 200 pgs degraded
(PG_DEGRADED)
2019-02-15 21:40:33.335147 mon.torsk1 mon.0 10.194.132.88:6789/0 604322 :
cluster [WRN] Health check update: Degraded data redundancy:
9897143/700354356 objects degraded (1.413%), 200 pgs degraded
(PG_DEGRADED)
... shortened ..
2019-02-15 21:43:48.496536 mon.torsk1 mon.0 10.194.132.88:6789/0 604366 :
cluster [WRN] Health check update: Degraded data redundancy:
9897168/700356693 objects degraded (1.413%), 200 pgs degraded, 201 pgs
undersized (PG_DEGRADED)
2019-02-15 21:43:53.496924 mon.torsk1 mon.0 10.194.132.88:6789/0 604367 :
cluster [WRN] Health check update: Degraded data redundancy:
9897170/700356804 objects degraded (1.413%), 200 pgs degraded, 201 pgs
undersized (PG_DEGRADED)
2019-02-15 21:43:58.497313 mon.torsk1 mon.0 10.194.132.88:6789/0 604368 :
cluster [WRN] Health check update: Degraded data redundancy:
9897172/700356879 objects degraded (1.413%), 200 pgs degraded, 201 pgs
undersized (PG_DEGRADED)
2019-02-15 21:44:03.497696 mon.torsk1 mon.0 10.194.132.88:6789/0 604369 :
cluster [WRN] Health check update: Degraded data redundancy:
9897174/700356996 objects degraded (1.413%), 200 pgs degraded, 201 pgs
undersized (PG_DEGRADED)
2019-02-15 21:44:06.939331 mon.torsk1 mon.0 10.194.132.88:6789/0 604372 :
cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-02-15 21:44:06.965401 mon.torsk1 mon.0 10.194.132.88:6789/0 604373 :
cluster [INF] osd.29 10.194.133.58:6844/305358 boot
2019-02-15 21:44:08.498060 mon.torsk1 mon.0 10.194.132.88:6789/0 604376 :
cluster [WRN] Health check update: Degraded data redundancy:
9897174/700357056 objects degraded (1.413%), 200 pgs degraded, 201 pgs
undersized (PG_DEGRADED)
2019-02-15 21:44:08.996099 mon.torsk1 mon.0 10.194.132.88:6789/0 604377 :
cluster [WRN] Health check failed: Reduced data availability: 12 pgs
peering (PG_AVAILABILITY)
2019-02-15 21:44:13.498472 mon.torsk1 mon.0 10.194.132.88:6789/0 604378 :
cluster [WRN] Health check update: Degraded data redundancy: 55/700357161
objects degraded (0.000%), 33 pgs degraded (PG_DEGRADED)
2019-02-15 21:44:15.081437 mon.torsk1 mon.0 10.194.132.88:6789/0 604379 :
cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data
availability: 12 pgs peering)
2019-02-15 21:44:18.498808 mon.torsk1 mon.0 10.194.132.88:6789/0 604380 :
cluster [WRN] Health check update: Degraded data redundancy: 14/700357230
objects degraded 

[ceph-users] CephFS - read latency.

2019-02-15 Thread jesper
Hi.

I've got a bunch of "small" files moved onto CephFS as archive/bulk storage
and now I have the backup (to tape) to spool over them. A sample of the
single-threaded backup client delivers this very consistent pattern:

$ sudo strace -T -p 7307 2>&1 | grep -A 7 -B 3 open
write(111, "\377\377\377\377", 4)   = 4 <0.11>
openat(AT_FDCWD, "/ceph/cluster/rsyncbackups/fileshare.txt", O_RDONLY) =
38 <0.30>
write(111, "\0\0\0\021197418 2 67201568", 21) = 21 <0.36>
read(38,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\33\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
65536) = 65536 <0.049733>
write(111,
"\0\1\0\0CLC\0\0\0\0\2\0\0\0\0\0\0\0\33\0\0\0\0\0\0\0\0\0\0\0\0"...,
65540) = 65540 <0.37>
read(38, " $$ $$\16\33\16 \16\33"..., 65536) = 65536
<0.000199>
write(111, "\0\1\0\0 $$ $$\16\33\16 $$"..., 65540) = 65540
<0.26>
read(38, "$ \33  \16\33\25 \33\33\33   \33\33\33
\25\0\26\2\16NVDOLOVB"..., 65536) = 65536 <0.35>
write(111, "\0\1\0\0$ \33  \16\33\25 \33\33\33   \33\33\33
\25\0\26\2\16NVDO"..., 65540) = 65540 <0.24>

The pattern is very consistent, thus it is not one PG or one OSD being
contended.
$ sudo strace -T -p 7307 2>&1 | grep -A 3 open  |grep read
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 11968 <0.070917>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 23232 <0.039789>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0P\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 65536 <0.053598>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 28240 <0.105046>
read(41, "NZCA_FS_CLCGENOMICS, 1, 1\nNZCA_F"..., 65536) = 73 <0.061966>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 65536 <0.050943>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\30\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
65536) = 65536 <0.031217>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 7392 <0.052612>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 288 <0.075930>
read(41, "1316919290-DASPHYNBAAPe2218b"..., 65536) = 940 <0.040609>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 22400 <0.038423>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 11984 <0.039051>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 9040 <0.054161>
read(41, "NZCA_FS_CLCGENOMICS, 1, 1\nNZCA_F"..., 65536) = 73 <0.040654>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 22352 <0.031236>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0N\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 65536 <0.123424>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 49984 <0.052249>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 28176 <0.052742>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 288 <0.092039>

Or to sum:
sudo strace -T -p 23748 2>&1 | grep -A 3 open  | grep read |  perl
-ane'/<(\d+\.\d+)>/; print $1 . "\n";' | head -n 1000 | ministat

N   Min   MaxMedian   AvgStddev
x 1000   3.2e-05  2.141551  0.054313   0.065834359   0.091480339


As can be seen, the "initial" read averages 65.8ms - which, if the
file size is say 1MB and the rest of the time is zero, caps read performance
at roughly 20MB/s. At that pace, the journey through double-digit TB is long,
even with 72 OSDs backing it.

Spec: Ceph Luminous 12.2.5 - Bluestore
6 OSD nodes, 10TB HDDs, 4+2 EC pool, 10GbitE

Locally the drives deliver latencies of approximately 6-8ms for a random
read. Any suggestion on where to find out where the remaining 50ms is being
spent would be truly helpful.

Large files "just works" as read-ahead does a nice job in getting
performance up.

-- 
Jesper



Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-07 Thread jesper
> That's a useful conclusion to take back.

Last question - we have our SSD pool set to 3x replication; Micron states
that NVMe is good at 2x. Is this a matter of taste and safety, or are there
any general thoughts about SSD robustness in a Ceph setup?


Jesper



Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-07 Thread jesper
> On 07/02/2019 17:07, jes...@krogh.cc wrote:
> Thanks for your explanation. In your case, you have low concurrency
> requirements, so focusing on latency rather than total iops is your
> goal. Your current setup gives 1.9 ms latency for writes and 0.6 ms for
> read. These are considered good, it is difficult to go below 1 ms for
> writes. As Wido pointed, to get latency down you need to insure you have
> C States in your cpu settings ( or just C1 state ), you have no low
> frequencies in your P States and get cpu with high GHz frequency rather
> than more cores (Nick Fisk has a good presentation on this), also avoid
> dual socket and NUMA. Also if money is no issue, you will get a bit
> better latency with 40G or 100G network.

Thanks a lot. I'm heading towards the conclusion that if I went all in
and got new HW + NVMe drives, then I'd "only" be about 3x better off than
where I am today (compared to the Micron paper).

That's a useful conclusion to take back.

-- 
Jesper



Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-07 Thread jesper
Hi Maged

Thanks for your reply.

> 6k is low as a max write iops value..even for single client. for cluster
> of 3 nodes, we see from 10k to 60k write iops depending on hardware.
>
> can you increase your threads to 64 or 128 via -t parameter

I can absolutely get it higher by increasing the parallelism. But I
may have failed to explain my purpose - I'm interested in how close I can get
with RBD to putting local SSD/NVMe in the servers. Thus putting
parallel scenarios that I would never see in production into the
tests does not really help my understanding. I think a concurrency level
of 16 is at the top of what I would expect our PostgreSQL databases to do
in real life.

> can you run fio with sync=1 on your disks.
>
> can you try with noop scheduler
>
> what is the %utilization on the disks and cpu ?
>
> can you have more than 1 disk per node

I'll have a look at that. Thanks for the suggestion.

Jesper




Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-07 Thread jesper

Thanks for the confirmation, Marc.

Can you put in a bit more hardware/network detail?

Jesper




Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-07 Thread jesper
> On 2/7/19 8:41 AM, Brett Chancellor wrote:
>> This seems right. You are doing a single benchmark from a single client.
>> Your limiting factor will be the network latency. For most networks this
>> is between 0.2 and 0.3ms.  if you're trying to test the potential of
>> your cluster, you'll need multiple workers and clients.
>>
>
> Indeed. To add to this, you will need fast (High clockspeed!) CPUs in
> order to get the latency down. The CPUs will need tuning as well like
> their power profiles and C-States.

Thanks for the insight. I'm aware, and my current CPUs are pretty old
- but I'm also in the process of learning how to make the right
decisions when expanding. If all my time ends up being spent on the
client end, then buying NVMe drives does not help me at all, nor do
better CPUs in the OSDs.

> You won't get the 1:1 performance from the SSDs on your RBD block devices.

I'm fully aware of that - Ceph / RBD / etc. come with an awesome feature
package, and that flexibility delivers overhead and eats into it.
But it helps to establish "upper bounds" and work my way towards good from there.

Thanks.

Jesper




Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-07 Thread jesper
> On Thu, 7 Feb 2019 08:17:20 +0100 jes...@krogh.cc wrote:
>> Hi List
>>
>> We are in the process of moving to the next usecase for our ceph cluster
>> (Bulk, cheap, slow, erasurecoded, cephfs) storage was the first - and
>> that works fine.
>>
>> We're currently on luminous / bluestore, if upgrading is deemed to
>> change what we're seeing then please let us know.
>>
>> We have 6 OSD hosts, each with a S4510 of 1TB with 1 SSD in each.
>> Connected
>> through a H700 MegaRaid Perc BBWC, EachDiskRaid0 - and scheduler set to
>> deadline, nomerges = 1, rotational = 0.
>>
> I'd make sure that the endurance of these SSDs is in line with your
> expected usage.

They are - at the moment :-) And Ceph allows me to change my mind without
interfering with the applications running on top - nice!

>> Each disk "should" give approximately 36K IOPS random write and the
>> double
>> random read.
>>
> Only locally, latency is your enemy.
>
> Tell us more about your network.

It is a Dell N4032, N4064 switch stack on 10GBASE-T.
All hosts are on the same subnet; the NICs are Intel X540s.
No jumbo framing and not much tuning - all kernels are on 4.15 (Ubuntu).

Pings from client to two of the osd's
--- flodhest.nzcorp.net ping statistics ---
50 packets transmitted, 50 received, 0% packet loss, time 50157ms
rtt min/avg/max/mdev = 0.075/0.105/0.158/0.021 ms
--- bison.nzcorp.net ping statistics ---
50 packets transmitted, 50 received, 0% packet loss, time 50139ms
rtt min/avg/max/mdev = 0.078/0.137/0.275/0.032 ms


> rados bench is not the sharpest tool in the shed for this.
> As it needs to allocate stuff to begin with, amongst other things.

Suggest longer test-runs?

>> This is also quite far from expected. I have 12GB of memory on the OSD
>> daemon for caching on each host - close to idle cluster - thus 50GB+ for
>> caching with a working set of < 6GB .. this should - in this case
>> not really be bound by the underlying SSD.
> Did you adjust the bluestore parameters (whatever they are this week or
> for your version) to actually use that memory?

According to top, it is picking up the caching memory.
We have this config block:

bluestore_cache_kv_max = 214748364800
bluestore_cache_kv_ratio = 0.4
bluestore_cache_meta_ratio = 0.1
bluestore_cache_size_hdd = 13958643712
bluestore_cache_size_ssd = 13958643712
bluestore_rocksdb_options =
compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,compact_on_mount=false

I actually think most of the above was applied with the 10TB hard drives
in mind, not the SSDs, but I have no idea whether these settings do "bad
things" for us.

> Don't use iostat, use atop.
> Small IOPS are extremely CPU intensive, so atop will give you an insight
> as to what might be busy besides the actual storage device.

Thanks, will do so.

More suggestions are welcome.

Doing some math:
Say network latency were the only cost driver - assume one round-trip per
IOP per thread.

16 threads at 0.15ms per round-trip gives 1000 ms/s/thread / 0.15 ms/IOP
=> ~6,666 IOPS per thread * 16 threads => ~106,666 IOPS.
OK, that's at least an upper bound on expectations in this scenario, and I
am at 28,207, thus roughly 4x off - and I have still not accounted for any
OSD or RBD userspace time in the equation.

Can I directly get service time out of the OSD daemon? That would be nice,
to see how many ms are spent at that end from an OSD perspective.
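
(On getting service time from the OSD side: the per-daemon perf counters include op latencies, as also mentioned elsewhere in this archive - a sketch, osd.0 is an example:)

$ ceph daemon osd.0 perf dump | python -m json.tool | grep -A 3 '"op_r_latency"'
$ ceph daemon osd.0 perf dump | python -m json.tool | grep -A 3 '"op_w_latency"'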

Jesper

-- 
Jesper



[ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-06 Thread jesper
Hi List

We are in the process of moving to the next use case for our Ceph cluster.
Bulk, cheap, slow, erasure-coded CephFS storage was the first - and
that works fine.

We're currently on Luminous / BlueStore; if upgrading is deemed to
change what we're seeing, then please let us know.

We have 6 OSD hosts, each with a single 1TB S4510 SSD, connected
through an H700 MegaRAID PERC BBWC (each disk as RAID0), and the scheduler set
to deadline, nomerges = 1, rotational = 0.

Each disk "should" give approximately 36K IOPS random write, and double that
for random read.

The pool is set up with 3x replication. We would like a "scale-out" setup of
well-performing SSD block devices - potentially to host databases and
things like that. I read through this nice document [0]; I know the
HW is radically different from mine, but I still think I'm at the
very low end of what 6 x S4510 should be capable of.

Since it is IOPS I care about, I have lowered the block size to 4096 - a 4M
block size nicely saturates the NICs in both directions.


$ sudo rados bench -p scbench -b 4096 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for
up to 10 seconds or 0 objects
Object prefix: benchmark_data_torsk2_11207
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
0   0 0 0 0 0   -   0
1  16  5857  5841   22.8155   22.8164  0.00238437  0.00273434
2  15 11768 11753   22.9533   23.0938   0.0028559  0.00271944
3  16 17264 17248   22.4564   21.4648  0.0024  0.00278101
4  16 22857 22841   22.3037   21.84770.002716  0.00280023
5  16 28462 28446   22.2213   21.8945  0.002201860.002811
6  16 34216 34200   22.2635   22.4766  0.00234315  0.00280552
7  16 39616 39600   22.0962   21.0938  0.00290661  0.00282718
8  16 45510 45494   22.2118   23.0234   0.0033541  0.00281253
9  16 50995 50979   22.1243   21.4258  0.00267282  0.00282371
   10  16 56745 56729   22.1577   22.4609  0.00252583   0.0028193
Total time run: 10.002668
Total writes made:  56745
Write size: 4096
Object size:4096
Bandwidth (MB/sec): 22.1601
Stddev Bandwidth:   0.712297
Max bandwidth (MB/sec): 23.0938
Min bandwidth (MB/sec): 21.0938
Average IOPS:   5672
Stddev IOPS:182
Max IOPS:   5912
Min IOPS:   5400
Average Latency(s): 0.00281953
Stddev Latency(s):  0.00190771
Max latency(s): 0.0834767
Min latency(s): 0.00120945

Min latency is fine -- but Max latency of 83ms ?
Average IOPS @ 5672 ?

$ sudo rados bench -p scbench  10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
0   0 0 0 0 0   -   0
1  15 23329 23314   91.0537   91.0703 0.000349856 0.000679074
2  16 48555 48539   94.7884   98.5352 0.000499159 0.000652067
3  16 76193 76177   99.1747   107.961 0.000443877 0.000622775
4  15103923103908   101.459   108.324 0.000678589 0.000609182
5  15132720132705   103.663   112.488 0.000741734 0.000595998
6  15161811161796   105.323   113.637 0.000333166 0.000586323
7  15190196190181   106.115   110.879 0.000612227 0.000582014
8  15221155221140   107.966   120.934 0.000471219 0.000571944
9  16251143251127   108.984   117.137 0.000267528 0.000566659
Total time run:   10.000640
Total reads made: 282097
Read size:4096
Object size:  4096
Bandwidth (MB/sec):   110.187
Average IOPS: 28207
Stddev IOPS:  2357
Max IOPS: 30959
Min IOPS: 23314
Average Latency(s):   0.000560402
Max latency(s):   0.109804
Min latency(s):   0.000212671

This is also quite far from expected. I have 12GB of memory on the OSD
daemon for caching on each host - close to idle cluster - thus 50GB+ for
caching with a working set of < 6GB .. this should - in this case
not really be bound by the underlying SSD. But if it were:

IOPS/disk * num disks / replication => 95K * 6 / 3 => 190K or 6x off?

There is no measurable service time in iostat when running the tests, thus I
have come to the conclusion that it has to be either the client side, the
network path, or the OSD daemon that delivers the increased latency /
decreased IOPS.

Are there any suggestions on how to get more insight into that?
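
One angle I plan to try is the slow-op history on the OSD admin socket,
which splits a single op into its internal stages; a sketch, assuming osd.0
is one of the OSDs serving the pool and the command is run on its host:

$ sudo ceph daemon osd.0 dump_historic_ops | less
# each entry carries timestamps for stages like queued_for_pg / reached_pg /
# commit_sent, which shows where inside the OSD the time goes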

Has anyone replicated close to the number Micron are reporting on NVMe?

Thanks a lot.

[0]
https://www.micron.com/-/media/client/global/documents/products/other-documents/micron_9200_max_ceph_12,-d-,2,-d-,8_luminous_bluestore_reference_architecture.pdf?la=en

___
ceph-users mailing list

Re: [ceph-users] Spec for Ceph Mon+Mgr?

2019-01-31 Thread Jesper Krogh


> : We're currently co-locating our mons with the head node of our Hadoop
> : installation. That may be giving us some problems, we dont know yet, but
> : thus I'm speculation about moving them to dedicated hardware.

Would it be ok to run them on kvm VM’s - of course not backed by ceph?

Jesper
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Spec for Ceph Mon+Mgr?

2019-01-22 Thread jesper
Hi.

We're currently co-locating our mons with the head node of our Hadoop
installation. That may be giving us some problems, we dont know yet, but
thus I'm speculation about moving them to dedicated hardware.

It is hard to get specifications "small" enough .. the specs for the
mon are the kind of thing we would usually virtualize our way out of .. which
seems very wrong here.

Are other people just co-locating them with something random, or what are
others typically using in a small ceph cluster (< 100 OSDs .. 7 OSD hosts)?

Thanks.

Jesper

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS - Small file - single thread - read performance.

2019-01-18 Thread jesper
Hi Everyone.

Thanks for the testing everyone - I think my system works as intented.

When reading from another client - hitting the cache of the OSD-hosts
I also get down to 7-8ms.

As mentioned, this is probably as expected.

I need to figure out how to increase parallelism somewhat - or convince users
not to create those ridiculous amounts of small files.

-- 
Jesper


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS - Small file - single thread - read performance.

2019-01-18 Thread jesper
Hi.

We intend to use CephFS for some of our shares, which we'd like to spool
to tape as part of the normal backup schedule. CephFS works nicely for
large files, but for "small" ones .. < 0.1MB .. there seems to be an
"overhead" of 20-40ms per file. I tested like this:

root@abe:/nfs/home/jk# time cat /ceph/cluster/rsyncbackups/13kbfile >
/dev/null

real0m0.034s
user0m0.001s
sys 0m0.000s

And from local page-cache right after.
root@abe:/nfs/home/jk# time cat /ceph/cluster/rsyncbackups/13kbfile >
/dev/null

real0m0.002s
user0m0.002s
sys 0m0.000s

Giving a ~20ms overhead in a single file.

This is about x3 higher than on our local filesystems (xfs) based on
same spindles.

CephFS metadata is on SSD - everything else on big-slow HDD's (in both
cases).

Is this what everyone else see?

Thanks

-- 
Jesper

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph community - how to make it even stronger

2019-01-04 Thread jesper
Hi All.

I was reading up and especially the thread on upgrading to mimic and
stable releases - caused me to reflect a bit on our ceph journey so far.

We started approximately 6 months ago - with CephFS as the dominant
use case in our HPC setup - starting at 400TB useable capacity and
as is matures going towards 1PB - mixed slow and SSD.

Some of the first confusions were:
bluestore vs. filestore - what was the recommendation actually?
Figuring out which kernel clients are usable with CephFS - and which
kernels to use on the other end?
Tuning of the MDS?
Imbalance of OSD nodes rendering the cluster down - how to balance?
Triggering kernel bugs in the kernel client during OSD_FULL?

This mailing list has been very responsive to the questions, thanks for
that.

But - compared to other open source projects we're lacking a bit of
infrastructure and guidance here.

I did check:
- http://tracker.ceph.com/projects/ceph/wiki/Wiki => Which does not seem
to be operational.
- http://docs.ceph.com/docs/mimic/start/get-involved/
Gmane is probably not coming back - we have been waiting 2 years now; can we
easily get the mailing list archives indexed some other way?

I feel that the wealth of knowledge being built up around operating ceph
is not really captured to make the next user's journey better and easier.

I would love to help out - hey - I end up spending the time anyway, but
some guidance on how to do it may help.

I would suggest:

1) Dump a 1-3 monthly status email on the project to the respective
mailing lists => Major releases, Conferences, etc
2) Get the wiki active - one of the main things I want to know about when
messing with the storage is - What is working for other people - just a
page where people can dump an aggregated output of their ceph cluster and
write 2-5 lines about the use-case for it.
3) Either get community more active on the documentation - advocate for it
- or start up more documentation on the wiki => A FAQ would be a nice
first place to start.

There may be an awful lot of things I've missed on the write up - but
please follow up.

If some of the core ceph people allready have thoughts / ideas / guidance,
please share so we collaboratively can make it better.

Lastly - thanks for the great support on the mailing list - so far - the
intent is only to try to make ceph even better.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balancing cluster with large disks - 10TB HHD

2018-12-30 Thread jesper
>> I would still like to have a log somewhere to grep and inspect what
>> balancer/upmap
>> actually does - when in my cluster. Or some ceph commands that deliveres
>> some monitoring capabilityes .. any suggestions?
> Yes, on ceph-mgr log, when log level is DEBUG.

Tried the docs .. something like:

ceph tell mds ... does not seem to work.
http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/
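
For the record, it is the mgr (not the mds) that runs the balancer module, so
the mgr log is the one to raise; a minimal sketch (the daemon name after
"mgr." is whatever `ceph -s` lists as the active mgr - "bison" is just a
placeholder here):

$ sudo ceph daemon mgr.bison config set debug_mgr 4/5   # run on the active mgr host
# the balancer's decisions should then show up in /var/log/ceph/ceph-mgr.<name>.log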

> You can get your cluster upmap's in via `ceph osd dump | grep upmap`.

Got it -- but I really need the README .. it shows the map ..
...
pg_upmap_items 6.0 [40,20]
pg_upmap_items 6.1 [59,57,47,48]
pg_upmap_items 6.2 [59,55,75,9]
pg_upmap_items 6.3 [22,13,40,39]
pg_upmap_items 6.4 [23,9]
pg_upmap_items 6.5 [25,17]
pg_upmap_items 6.6 [45,46,59,56]
pg_upmap_items 6.8 [60,54,16,68]
pg_upmap_items 6.9 [61,69]
pg_upmap_items 6.a [51,48]
pg_upmap_items 6.b [43,71,41,29]
pg_upmap_items 6.c [22,13]

..

But .. I don't have any PGs that should have only 2 replicas .. nor any
with 4 .. how should this be interpreted?

Thanks.

-- 
Jesper

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balancing cluster with large disks - 10TB HHD

2018-12-28 Thread jesper


Hi .. just an update - this looks awesome .. and in an 8x5 company,
Christmas is a good period to rebalance a cluster :-)

>> I'll try it out again - last I tried it complanied about older clients -
>> it should be better now.
> upmap is supported since kernel 4.13.
>
>> Second - should the reweights be set back to 1 then?
> Yes, also:
>
> 1. `ceph osd crush tunables optimal`

Done

> 2. All your buckets should be straw2, but in case `ceph osd crush
> set-all-straw-buckets-to-straw2`

Done

> 3. Your hosts imbalanced: elefant & capone have only eight 10TB's,
> another hosts - 12. So I recommend replace 8TB's spinners to 10TB or
> just shuffle it between hosts, like 2x8TB+10x10Tb.

Yes, we initially thought we could go with 3 OSD hosts .. but then found
out that EC pools required more -- and then added more.

> 4. Revert all your reweights.

Done

> 5. Balancer do his work: `ceph balancer mode upmap`, `ceph balancer on`.

So far - works awesome --
 sudo qms/server_documentation/ceph/ceph-osd-data-distribution hdd
hdd
x 
N   Min   MaxMedian   AvgStddev
x  72 50.82 55.65 52.88 52.916944 1.0002586

As compared to the best I got with reweighting:
$ sudo qms/server_documentation/ceph/ceph-osd-data-distribution hdd
hdd
x 
N   Min   MaxMedian   AvgStddev
x  72 45.36 54.98 52.63 52.131944 2.0746672


It took about 24 hours to rebalance -- and move quite some TB's around.

I would still like to have a log somewhere to grep and inspect what
balancer/upmap actually does in my cluster. Or some ceph commands that
deliver some monitoring capabilities .. any suggestions?
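
Short of a real log, these are the commands I have found that give some
visibility (hedged - this is just what the balancer module itself exposes on
Luminous, as far as I can tell):

$ sudo ceph balancer status          # mode, active flag, queued plans
$ sudo ceph balancer eval            # current distribution "score" (lower is better)
$ sudo ceph osd dump | grep upmap    # the pg_upmap_items the balancer has installed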

-- 
Jesper

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balancing cluster with large disks - 10TB HHD

2018-12-26 Thread jesper
> Have a look at this thread on the mailing list:
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg46506.html

Ok, done .. how do I see that it actually works?
Second - should the reweights be set back to 1 then?

Jesper

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balancing cluster with large disks - 10TB HHD

2018-12-26 Thread jesper
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> On mik, 2018-12-26 at 13:14 +0100, jes...@krogh.cc wrote:
>> Thanks for the insight and links.
>>
>> > As I can see you are on Luminous. Since Luminous Balancer plugin is
>> > available [1], you should use it instead reweight's in place,
>> especially
>> > in upmap mode [2]
>>
>> I'll try it out again - last I tried it complanied about older clients -
>> it should be better now.
>>
> require_min_compat_client luminous is required, for you to take advantage
> of
> upmap.

$ sudo ceph osd set-require-min-compat-client luminous
Error EPERM: cannot set require_min_compat_client to luminous: 54
connected client(s) look like jewel (missing 0x800); add
--yes-i-really-mean-it to do it anyway

We've standardized on the 4.15 kernel client on all CephFS clients; those
are the 54 - would it be safe to ignore the above warning? Otherwise - which
kernel do I need to go to?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balancing cluster with large disks - 10TB HHD

2018-12-26 Thread jesper


Thanks for the insight and links.

> As I can see you are on Luminous. Since Luminous Balancer plugin is
> available [1], you should use it instead reweight's in place, especially
> in upmap mode [2]

I'll try it out again - the last time I tried, it complained about older
clients - it should be better now.

> Also, may be I can catch another crush mistakes, can I see `ceph osd
> crush show-tunables, `ceph osd crush rule dump`, `ceph osd pool ls
> detail`?

Here:
$ sudo ceph osd crush show-tunables
{
"choose_local_tries": 0,
"choose_local_fallback_tries": 0,
"choose_total_tries": 50,
"chooseleaf_descend_once": 1,
"chooseleaf_vary_r": 1,
"chooseleaf_stable": 0,
"straw_calc_version": 1,
"allowed_bucket_algs": 54,
"profile": "hammer",
"optimal_tunables": 0,
"legacy_tunables": 0,
"minimum_required_version": "hammer",
"require_feature_tunables": 1,
"require_feature_tunables2": 1,
"has_v2_rules": 1,
"require_feature_tunables3": 1,
"has_v3_rules": 0,
"has_v4_buckets": 1,
"require_feature_tunables5": 0,
"has_v5_rules": 0
}

$ sudo ceph osd crush rule dump
[
{
"rule_id": 0,
"rule_name": "replicated_ruleset_hdd",
"ruleset": 0,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default~hdd"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 1,
"rule_name": "replicated_ruleset_hdd_fast",
"ruleset": 1,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -28,
"item_name": "default~hdd_fast"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 2,
"rule_name": "replicated_ruleset_ssd",
"ruleset": 2,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -21,
"item_name": "default~ssd"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 3,
"rule_name": "cephfs_data_ec42",
"ruleset": 3,
"type": 3,
"min_size": 3,
"max_size": 6,
"steps": [
{
"op": "set_chooseleaf_tries",
"num": 5
},
{
"op": "set_choose_tries",
"num": 100
},
{
"op": "take",
"item": -1,
"item_name": "default~hdd"
},
{
"op": "chooseleaf_indep",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
]

$ sudo ceph osd pool ls detail
pool 6 'kube' replicated size 3 min_size 2 crush_rule 0 object_hash
rjenkins pg_num 128 pgp_num 128 last_change 41045 flags hashpspool
stripe_width 0 application rbd
removed_snaps [1~3]
pool 15 'default.rgw.buckets.data' replicated size 3 min_size 2 crush_rule
0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 41045 flags
hashpspool stripe_width 0 application rgw
pool 17 'default.rgw.users.keys' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 16 pgp_num 16 last_change 41045 lfor 0/36590
flags hashpspool stripe_width 0 application rgw
pool 18 'default.rgw.buckets.non-ec' replicated size 3 min_size 2
crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 last_change 41045
lfor 0/36595 flags hashpspool stripe_width 0 application rgw
pool 19 'default.rgw.users.uid' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 16 pgp_num 16 last_change 41045 lfor 0/36608
flags hashpspool stripe_width 0 application rgw
pool 20 'rbd' replicated size 3 min_size 2 crush_rule 0 object_hash
rjenkins pg_num 128 pgp_num 128 last_change 41045 flags hashpspool
stripe_width 0 application rbd
pool 26 'default.rgw.data.root' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 41045 flags hashpspool
stripe_width 0 application rgw
pool 27 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 41045 flags hashpspool
stripe_width 0 application rgw
pool 28 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0

Re: [ceph-users] Balancing cluster with large disks - 10TB HHD

2018-12-25 Thread jesper
> Please, paste your `ceph osd df tree` and `ceph osd dump | head -n 12`.

$ sudo ceph osd df tree
ID  CLASSWEIGHTREWEIGHT SIZE  USEAVAIL  %USE  VAR  PGS TYPE NAME
 -8  639.98883-  639T   327T   312T 51.24 1.00   - root
default
-10  111.73999-  111T 58509G 55915G 51.13 1.00   -
host bison
 78 hdd_fast   0.90900  1.0  930G  1123M   929G  0.12 0.00   0
osd.78
 79 hdd_fast   0.81799  1.0  837G  1123M   836G  0.13 0.00   0
osd.79
 20  hdd   9.09499  0.95000 9313G  4980G  4333G 53.47 1.04 204
osd.20
 28  hdd   9.09499  1.0 9313G  4612G  4700G 49.53 0.97 200
osd.28
 29  hdd   9.09499  1.0 9313G  4848G  4465G 52.05 1.02 211
osd.29
 33  hdd   9.09499  1.0 9313G  4759G  4553G 51.10 1.00 207
osd.33
 34  hdd   9.09499  1.0 9313G  4613G  4699G 49.54 0.97 195
osd.34
 35  hdd   9.09499  0.89250 9313G  4954G  4359G 53.19 1.04 206
osd.35
 36  hdd   9.09499  1.0 9313G  4724G  4588G 50.73 0.99 200
osd.36
 37  hdd   9.09499  1.0 9313G  5013G  4300G 53.83 1.05 214
osd.37
 38  hdd   9.09499  0.92110 9313G  4962G  4350G 53.28 1.04 206
osd.38
 39  hdd   9.09499  1.0 9313G  4960G  4353G 53.26 1.04 214
osd.39
 40  hdd   9.09499  1.0 9313G  5022G  4291G 53.92 1.05 216
osd.40
 41  hdd   9.09499  0.88235 9313G  5037G  4276G 54.09 1.06 203
osd.41
  7  ssd   0.87299  1.0  893G 18906M   875G  2.07 0.04 124
osd.7
 -7  102.74084-  102T 54402G 50805G 51.71 1.01   -
host bonnie
  0  hdd   7.27699  0.87642 7451G  4191G  3259G 56.25 1.10 175
osd.0
  1  hdd   7.27699  0.86200 7451G  3837G  3614G 51.49 1.01 163
osd.1
  2  hdd   7.27699  0.74664 7451G  3920G  3531G 52.61 1.03 169
osd.2
 11  hdd   7.27699  0.77840 7451G  3983G  3467G 53.46 1.04 169
osd.11
 13  hdd   9.09499  0.76595 9313G  4894G  4419G 52.55 1.03 201
osd.13
 14  hdd   9.09499  1.0 9313G  4350G  4963G 46.71 0.91 189
osd.14
 16  hdd   9.09499  0.92635 9313G  4879G  4434G 52.39 1.02 204
osd.16
 18  hdd   9.09499  0.67932 9313G  4634G  4678G 49.76 0.97 190
osd.18
 22  hdd   9.09499  0.93053 9313G  5085G  4228G 54.60 1.07 218
osd.22
 31  hdd   9.09499  0.88536 9313G  5152G  4160G 55.33 1.08 221
osd.31
 42  hdd   9.09499  0.84232 9313G  4796G  4516G 51.51 1.01 199
osd.42
 43  hdd   9.09499  0.87662 9313G  4656G  4657G 50.00 0.98 191
osd.43
  6  ssd   0.87299  1.0  894G 20643M   874G  2.25 0.04 134
osd.6
 -6  102.74100-  102T 53627G 51580G 50.97 0.99   -
host capone
  3  hdd   7.27699  0.84938 7451G  4028G  3422G 54.07 1.06 171
osd.3
  4  hdd   7.27699  0.83890 7451G  3909G  3542G 52.46 1.02 167
osd.4
  5  hdd   7.27699  1.0 7451G  3389G  4061G 45.49 0.89 151
osd.5
  9  hdd   7.27699  1.0 7451G  3710G  3740G 49.80 0.97 161
osd.9
 15  hdd   9.09499  1.0 9313G  4952G  4360G 53.18 1.04 206
osd.15
 17  hdd   9.09499  0.95000 9313G  4865G  4448G 52.24 1.02 202
osd.17
 23  hdd   9.09499  1.0 9313G  4984G  4329G 53.52 1.04 223
osd.23
 24  hdd   9.09499  1.0 9313G  4847G  4466G 52.05 1.02 202
osd.24
 25  hdd   9.09499  0.89929 9313G  4909G  4404G 52.71 1.03 205
osd.25
 30  hdd   9.09499  0.92787 9313G  4740G  4573G 50.90 0.99 202
osd.30
 74  hdd   9.09499  0.93146 9313G  4709G  4603G 50.57 0.99 199
osd.74
 75  hdd   9.09499  1.0 9313G  4559G  4753G 48.96 0.96 194
osd.75
  8  ssd   0.87299  1.0  893G 19593M   874G  2.14 0.04 129
osd.8
-16  102.74100-  102T 53985G 51222G 51.31 1.00   -
host elefant
 19  hdd   7.27699  1.0 7451G  3665G  3786G 49.19 0.96 152
osd.19
 21  hdd   7.27699  0.89539 7451G  4102G  3349G 55.05 1.07 169
osd.21
 64  hdd   7.27699  0.89275 7451G  3956G  3494G 53.10 1.04 171
osd.64
 65  hdd   7.27699  0.92513 7451G  3976G  3475G 53.36 1.04 171
osd.65
 66  hdd   9.09499  1.0 9313G  4674G  4638G 50.20 0.98 199
osd.66
 67  hdd   9.09499  1.0 9313G  4737G  4575G 50.87 0.99 201
osd.67
 68  hdd   9.09499  0.89973 9313G  4946G  4366G 53.11 1.04 211
osd.68
 69  hdd   9.09499  1.0 9313G  4648G  4665G 49.91 0.97 204
osd.69
 70  hdd   9.09499  0.89526 9313G  4907G  4405G 52.69 1.03 209
osd.70
 71  hdd   9.09499  0.84923 9313G  4690G  4622G 50.37 0.98 198
osd.71
 72  hdd   9.09499  0.87547 9313G  4976G  4336G 53.43 1.04 211
osd.72
 73  hdd   9.09499  1.0 9313G  4683G  4630G 50.29 0.98 200
osd.73
 10  ssd   0.87299  1.0  893G 19158M   875G  2.09 0.04 126

[ceph-users] Balancing cluster with large disks - 10TB HHD

2018-12-25 Thread jesper
Hi.

We hit an OSD_FULL last week on our cluster - with an average utilization
of less than 50% .. thus hugely imbalanced. This has driven us to
adjust PGs upwards and to reweight the OSDs more aggressively.
Question: What do people see as an "acceptable" variance across OSD's?
x 
N   Min   MaxMedian   AvgStddev
x  72 45.49 56.25 52.35 51.878889 2.1764343

72 x 10TB drives. It seems hard to get further down -- thus churn will
most likely make it hard for us to stay at this level.

Currently we have ~158 PGs/OSD .. which by my math gives ~63GB/PG if they
were fully utilizing the disk - which leads me to think that somewhat
smaller PGs would give the balancing an easier job. Would it be ok to
go closer to 300 PGs/OSD? - would it be sane?

I can see that the default max is 300, but I have hard time finding out
if this is "recommendable" or just a "tunable".

* We've now seen OSD_FULL trigger irrecoverable kernel bugs in the
CephFS kernel client on our 4.15 kernels - multiple times - a forced reboot
is the only way out. We're on the Ubuntu kernels .. I haven't done the diff
against upstream (yet) and I don't intend to run our production cluster
disk-full anywhere in the near future to test it out.

Jesper

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Priority of repair vs rebalancing?

2018-12-18 Thread jesper
Hi.

In our ceph cluster we hit one OSD at 95% full while others in the same pool
only hit 40% .. (total usage is ~55%). Thus I ran:

sudo ceph osd reweight-by-utilization 110 0.05 12

Which initiated some data movement .. but right after, ceph status reported:


jk@bison:~/adm-git$ sudo ceph -s
  cluster:
id: dbc33946-ba1f-477c-84df-c63a3c9c91a6
health: HEALTH_WARN
49924979/660322545 objects misplaced (7.561%)
Degraded data redundancy: 26/660322545 objects degraded
(0.000%), 2 pgs degraded

  services:
mon: 3 daemons, quorum torsk1,torsk2,bison
mgr: bison(active), standbys: torsk1
mds: cephfs-1/1/2 up  {0=zebra01=up:active}, 1 up:standby-replay
osd: 78 osds: 78 up, 78 in; 255 remapped pgs
rgw: 9 daemons active

  data:
pools:   16 pools, 2184 pgs
objects: 141M objects, 125 TB
usage:   298 TB used, 340 TB / 638 TB avail
pgs: 26/660322545 objects degraded (0.000%)
 49924979/660322545 objects misplaced (7.561%)
 1927 active+clean
 187  active+remapped+backfilling
 68   active+remapped+backfill_wait
 2active+recovery_wait+degraded

  io:
client:   761 kB/s rd, 1284 kB/s wr, 85 op/s rd, 79 op/s wr
recovery: 623 MB/s, 665 objects/s


Any idea how those 26 objects got degraded in the process?
Just in-flight writes?

Any means to prioritize the 26 degraded objects over the 49M misplaced
objects that need to be re-placed?
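
For what it is worth, Luminous has per-PG priority commands that should cover
this; a sketch (6.12 is a made-up PG id - substitute the degraded ones from
`ceph health detail`):

$ sudo ceph health detail | grep degraded   # find the degraded PG ids
$ sudo ceph pg force-recovery 6.12          # push those PGs ahead of the backfill queue
# ceph pg force-backfill / cancel-force-recovery / cancel-force-backfill also exist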

Thanks.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client - page cache being invaildated.

2018-12-16 Thread jesper
> If a CephFS client receive a cap release request and it is able to
> perform it (no processes accessing the file at the moment), the client
> cleaned up its internal state and allows the MDS to release the cap.
> This cleanup also involves removing file data from the page cache.
>
> If your MDS was running with a too small cache size, it had to revoke
> caps over and over to adhere to its cache size, and the clients had to
> cleanup their cache over and over, too.


Well .. it could just mark it "eligible for future cleanup" - if the client
has no other use for the available memory, then this is just thrashing the
local client memory cache for a file that goes back into use a few
minutes later. Based on your description, this is what we have
been seeing.

Bumping MDS memory has pushed our problem away and our setup works fine, but
the above behaviour still seems very suboptimal - of course, if the file
changes, feel free to actively prune - but otherwise why do it? The entry
will get no hits in the client LRU cache and will be automatically
evicted by the client anyway.
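
For anyone tuning along: the knob that normally goes with more MDS RAM on
Luminous is mds_cache_memory_limit; a minimal sketch (the 16GB value and the
mds name are only examples):

$ sudo ceph tell mds.zebra01 injectargs '--mds_cache_memory_limit=17179869184'
# and persisted in ceph.conf on the MDS host:
[mds]
mds_cache_memory_limit = 17179869184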

I feel this is messing with something that has worked well for a few
decades now, but I may just be missing the fine-grained details.


> Hope this helps.

Definately - thanks.

-- 
Jesper


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Low traffic Ceph cluster with consumer SSD.

2018-11-25 Thread Jesper Krogh
On 25 Nov 2018, at 15.17, Vitaliy Filippov  wrote:
> 
> All disks (HDDs and SSDs) have cache and may lose non-transactional writes 
> that are in-flight. However, any adequate disk handles fsync's (i.e SATA 
> FLUSH CACHE commands). So transactional writes should never be lost, and in 
> Ceph ALL writes are transactional - Ceph issues fsync's all the time. Another 
> example is DBMS-es - they also issue an fsync when you COMMIT.

https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf

This may have changed since 2013; the common understanding is that the cache needs to be 
disabled to ensure that flushes are persistent, and disabling the cache on an SSD is 
either not honoured by the firmware or plummets the write performance.

Which is why enterprise disks have power loss protection in the form of capacitors.

Again, any links/info saying otherwise are very welcome.
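
The test I keep falling back to is a single-job sync write, which is roughly
the journal/WAL pattern and is where drives without capacitors fall off a
cliff; a hedged sketch (path and runtime are arbitrary, run it on an otherwise
idle disk):

$ fio --name=synctest --filename=/mnt/testdisk/fio.tmp --size=1G \
      --rw=write --bs=4k --iodepth=1 --numjobs=1 \
      --direct=1 --sync=1 --runtime=60 --time_based
# drives with power-loss protection typically sustain tens of thousands of these
# 4k sync writes per second; consumer drives often drop to a few hundred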

Jesper
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Low traffic Ceph cluster with consumer SSD.

2018-11-25 Thread jesper
>> the real risk is the lack of power loss protection. Data can be
>> corrupted on unflean shutdowns
>
> it's not! lack of "advanced power loss protection" only means lower iops
> with fsync, but not the possibility of data corruption
>
> "advanced power loss protection" is basically the synonym for
> "non-volatile cache"

A few years ago it was pretty common knowledge that if an SSD didn't have
capacitors - and thus power-loss protection - then an unexpected power-off
could lead to data-loss situations. Perhaps I'm just not up to date with
recent developments. Is it a solved problem today in consumer-grade SSDs?
.. any links to insight/testing/etc. would be welcome.

https://arstechnica.com/civis/viewtopic.php?f=11=1383499
- does at least not support the viewpoint.

Jesper

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Low traffic Ceph cluster with consumer SSD.

2018-11-24 Thread Jesper Krogh


> On 24 Nov 2018, at 18.09, Anton Aleksandrov  wrote
> We plan to have data on dedicate disk in each node and my question is about 
> WAL/DB for Bluestore. How bad would it be to place it on system-consumer-SSD? 
> How big risk is it, that everything will get "slower than using spinning HDD 
> for the same purpose"? And how big risk is it, that our nodes will die, 
> because of SSD lifespan?

the real risk is the lack of power loss protection. Data can be corrupted on 
unclean shutdowns. 

Disabling cache may help
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS kernel client versions - pg-upmap

2018-11-03 Thread jesper
Hi.

I tried to enable the "new smart balancing" - backend are on RH luminous
clients are Ubuntu 4.15 kernel.

As per: http://docs.ceph.com/docs/mimic/rados/operations/upmap/
$ sudo ceph osd set-require-min-compat-client luminous
Error EPERM: cannot set require_min_compat_client to luminous: 1 connected
client(s) look like firefly (missing 0xe010020); 1 connected
client(s) look like firefly (missing 0xe01); 1 connected
client(s) look like hammer (missing 0xe20); 55 connected
client(s) look like jewel (missing 0x800); add
--yes-i-really-mean-it to do it anyway

Ok, so a 4.15 kernel connects as a "hammer" (<1.0) client? Is there a
huge gap in upstreaming kernel client features to kernel.org, or what am I
misreading here?

Hammer is 2015-ish - 4.15 is January 2018-ish?

Is kernel client development lagging behind?

Jesper

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client - page cache being invaildated.

2018-11-03 Thread jesper
> I suspect that mds asked client to trim its cache. Please run
> following commands on an idle client.

In the meantime we migrated to the RH Ceph version and gave the MDS
both SSDs and more memory, and the problem went away.

It still puzzles me a bit - why is there a connection between the
client page cache and the MDS server performance? The only explanation
I can find is that if the MDS cannot cache metadata, it needs to go
back and fetch it from the Ceph metadata pool, and then it exposes the
data as "new" to the clients, despite it being the same. If that is
the case, then I would say there is significant room for performance
optimization here.

> If you can reproduce this issue. please send kernel log to us.

Will do if/when it reappears.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client - page cache being invaildated.

2018-10-15 Thread jesper
> On 10/15/18 12:41 PM, Dietmar Rieder wrote:
>> No big difference here.
>> all CentOS 7.5 official kernel 3.10.0-862.11.6.el7.x86_64
>
> ...forgot to mention: all is luminous ceph-12.2.7

Thanks for your time in testing; this is very valuable to me in the
debugging. Two questions:

Did you "sleep 900" in-between the execution?
Are you using the kernel client or the fuse client?

If I run them "right after each other" .. then I get the same behaviour.

-- 
Jesper


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client - page cache being invaildated.

2018-10-15 Thread jesper
>> On Sun, Oct 14, 2018 at 8:21 PM  wrote:
>> how many cephfs mounts that access the file? Is is possible that some
>> program opens that file in RW mode (even they just read the file)?
>
>
> The nature of the program is that it is "prepped" by one-set of commands
> and queried by another, thus the RW case is extremely unlikely.
> I can change permission bits to rewoke the w-bit for the user, they
> dont need it anyway... it is just the same service-users that generates
> the data and queries it today.

Just to remove the suspicion of other clients fiddling with the files I did a
more structured test. I have 4 x 10GB files from fio-benchmarking, total
40GB . Hosted on

1) CephFS /ceph/cluster/home/jk
2) NFS /z/home/jk

First I read them .. then sleep 900 seconds .. then read again (just with dd)

jk@sild12:/ceph/cluster/home/jk$ time  for i in $(seq 0 3); do echo "dd
if=test.$i.0 of=/dev/null bs=1M"; done  | parallel -j 4 ; sleep 900; time 
for i in $(seq 0 3); do echo "dd if=test.$i.0 of=/dev/null bs=1M"; done  |
parallel -j 4
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 2.56413 s, 4.2 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 2.82234 s, 3.8 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 2.9361 s, 3.7 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 3.10397 s, 3.5 GB/s

real0m3.449s
user0m0.217s
sys 0m11.497s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 315.439 s, 34.0 MB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 338.661 s, 31.7 MB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 354.725 s, 30.3 MB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 356.126 s, 30.2 MB/s

real5m56.634s
user0m0.260s
sys 0m16.515s
jk@sild12:/ceph/cluster/home/jk$


Then NFS:

jk@sild12:~$ time  for i in $(seq 0 3); do echo "dd if=test.$i.0
of=/dev/null bs=1M"; done  | parallel -j 4 ; sleep 900; time  for i in
$(seq 0 3); do echo "dd if=test.$i.0 of=/dev/null bs=1M"; done  | parallel
-j 4
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 1.60267 s, 6.7 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 2.18602 s, 4.9 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 2.47564 s, 4.3 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 2.54674 s, 4.2 GB/s

real0m2.855s
user0m0.185s
sys 0m8.888s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 1.68613 s, 6.4 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 1.6983 s, 6.3 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 2.20059 s, 4.9 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 2.58077 s, 4.2 GB/s

real0m2.980s
user0m0.173s
sys 0m8.239s
jk@sild12:~$


Can I ask one of you to run the same "test" (or similar) .. and report back
if you can reproduce it?

Thoughts/comments/suggestions are highly appreciated. Should I try with
the fuse client?

-- 
Jesper

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client - page cache being invaildated.

2018-10-14 Thread jesper
> On Sun, Oct 14, 2018 at 8:21 PM  wrote:
> how many cephfs mounts that access the file? Is is possible that some
> program opens that file in RW mode (even they just read the file)?


The nature of the program is that it is "prepped" by one set of commands
and queried by another, thus the RW case is extremely unlikely.
I can change the permission bits to revoke the w-bit for the user; they
don't need it anyway ... it is just the same service users that generate
the data and query it today.

Can ceph tell the actual number of clients? ..
We have 55-60 hosts, most of which mount the catalog.
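
(The client count itself can be pulled from the MDS; a sketch, assuming access
to the admin socket on the MDS host and that zebra01 is the active MDS name:)

$ sudo ceph daemon mds.zebra01 session ls | grep -c '"id"'   # rough session count
$ sudo ceph daemon mds.zebra01 session ls | less             # per-client details, incl. kernel/fuse version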

-- 
Jesper

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client - page cache being invaildated.

2018-10-14 Thread jesper
> Actual amount of memory used by VFS cache is available through 'grep
> Cached /proc/meminfo'. slabtop provides information about cache
> of inodes, dentries, and IO memory buffers (buffer_head).

Thanks, that was also what I got out of it. And why I reported "free"
output in the first as it also shows available and "cached" memory.

-- 
Jesper

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client - page cache being invaildated.

2018-10-14 Thread jesper
> Try looking in /proc/slabinfo / slabtop during your tests.

I need a bit of guidance here..  Does the slabinfo cover the VFS page
cache ? .. I cannot seem to find any traces (sorting by size on
machines with a huge cache does not really give anything). Perhaps
I'm holding the screwdriver wrong?

-- 
Jesper

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client - page cache being invaildated.

2018-10-14 Thread Jesper Krogh
On 14 Oct 2018, at 15.26, John Hearns  wrote:
> 
> This is a general question for the ceph list.
> Should Jesper be looking at these vm tunables?
> vm.dirty_ratio
> vm.dirty_centisecs
> 
> What effect do they have when using Cephfs?

This situation is read-only, thus there is no dirty data in the page cache.
The above should be irrelevant.

Jesper


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs kernel client - page cache being invaildated.

2018-10-14 Thread jesper
Hi

We have a dataset of ~300 GB on CephFS which is being used for computations
over and over again .. being refreshed daily or similar.

When hosting it on NFS, the files are transferred after a refresh, but from
there they would sit in the kernel page cache of the client
until they are refreshed server-side.

On CephFS it looks "similar" but "different". Where the "steady state"
operation over NFS would give client/server traffic of < 1MB/s ..
CephFS constantly pulls 50-100MB/s over the network.  This has
implications for the clients, which end up spending unnecessary time waiting
for IO during execution.

This is in a setting where the CephFS client mem look like this:

$ free -h
  totalusedfree  shared  buff/cache  
available
Mem:   377G 17G340G1.2G 19G   
354G
Swap:  8.8G430M8.4G


If I just repeatedly run (within a few minutes) something that is using the
files, then it is fully served out of the client page cache (2GB-ish/s) ..
but it looks like it is being evicted way faster than in the NFS setting?

This is not scientific .. but the CMD is of the "cat /file/on/ceph > /dev/null"
type, on a total of 24GB of data in 300-ish files.

$ free -h; time CMD ; sleep 1800; free -h; time CMD ; free -h; sleep 3600;
time CMD ;

  totalusedfree  shared  buff/cache  
available
Mem:   377G 16G312G1.2G 48G   
355G
Swap:  8.8G430M8.4G

real0m8.997s
user0m2.036s
sys 0m6.915s
  totalusedfree  shared  buff/cache  
available
Mem:   377G 17G277G1.2G 82G   
354G
Swap:  8.8G430M8.4G

real3m25.904s
user0m2.794s
sys 0m9.028s
  totalusedfree  shared  buff/cache  
available
Mem:   377G 17G283G1.2G 76G   
353G
Swap:  8.8G430M8.4G

real6m18.358s
user0m2.847s
sys 0m10.651s


Munin graphs of the system confirm that there has been zero memory
pressure over the period.

Are there things in the CephFS case that can cause the page cache to be
invalidated?
Could less aggressive read-ahead play a role?

Other thoughts on what the root cause of the different behaviour could be?

Clients are using the 4.15 kernel .. Is anyone aware of newer patches in this
area that could have an impact?

Jesper

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS performance.

2018-10-03 Thread jesper
Hi All.

First thanks for the good discussion and strong answer's I've gotten so far.

The current cluster setup is 4 OSD hosts, each with 10 x 12TB 7.2K RPM drives,
10GbitE, and metadata on rotating drives - 3x replication - 256GB memory and
32+ cores in the OSD hosts. Everything sits behind a Perc with each disk as a
single-disk RAID0 and BBWC.

Planned changes:
- is to get 1-2 more OSD-hosts
- experiment with EC-pools for CephFS
- MDS onto seperate host and metadata onto SSD's.

I'm still struggling to get "non-cached" performance up to "hardware"
speed - whatever that means. I do fio benchmarks using 10GB files, 16
threads, 4M block size -- at which I can "almost" fill the
10GbitE NIC sustained. In this configuration I would have expected it to be
"way above" 10Gbit speed, and thus have the NIC not "almost" filled but fully
filled - could that be the metadata activity? .. but for reads on "big files"
that should not be much - right?

The above is actually ok for production, thus .. not a big issue, just
information.

Single threaded performance is still struggling

Cold HDD (read from disk on the NFS server end) / NFS performance:

jk@zebra01:~$ pipebench < /nfs/16GB.file > /dev/null
Summary:
Piped   15.86 GB in 00h00m27.53s:  589.88 MB/second


Local page cache (just to say it isn't the profiling tool delivering
limitations):
jk@zebra03:~$ pipebench < /nfs/16GB.file > /dev/null
Summary:
Piped   29.24 GB in 00h00m09.15s:3.19 GB/second
jk@zebra03:~$

Now from the Ceph system:
jk@zebra01:~$ pipebench < /ceph/bigfile.file> /dev/null
Summary:
Piped   36.79 GB in 00h03m47.66s:  165.49 MB/second

Can block/stripe-size be tuned? Does it make sense?
Does read-ahead on the CephFS kernel-client need tuning?
What performance are other people seeing?
Other thoughts - recommendations?
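
Partially answering my own questions from what I can find: striping is
controlled per file/directory through layout xattrs, and kernel-client
readahead through the rasize mount option; a hedged sketch (paths, sizes and
the cephx user are placeholders, and directory layouts only affect files
created afterwards):

$ getfattr -n ceph.file.layout /ceph/cluster/somefile            # show current layout
$ setfattr -n ceph.dir.layout.object_size -v 16777216 /ceph/cluster/bigfiles
$ setfattr -n ceph.dir.layout.stripe_count -v 4 /ceph/cluster/bigfiles
# readahead for the kernel client (bytes), set at mount time:
$ sudo mount -t ceph mon1:6789:/ /ceph -o name=admin,secretfile=/etc/ceph/admin.secret,rasize=67108864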

On some of the shares we're storing pretty large files (GB size) and
need the backup to move them to tape - so it is preferred to be capable
of filling an LTO6 drive's write speed with a single thread.

40-ish 7.2K RPM drives should add up to more than the above .. right?
This is the only current load being put on the cluster - plus ~100MB/s of
recovery traffic.


Thanks.

Jesper

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore vs. Filestore

2018-10-03 Thread jesper
> Your use case sounds it might profit from the rados cache tier
> feature. It's a rarely used feature because it only works in very
> specific circumstances. But your scenario sounds like it might work.
> Definitely worth giving it a try. Also, dm-cache with LVM *might*
> help.
> But if your active working set is really just 400GB: Bluestore cache
> should handle this just fine. Don't worry about "unequal"
> distribution, every 4mb chunk of every file will go to a random OSD.

I tried it out - and will test it more - but the initial tests didn't really
convince me.

> One very powerful and simple optimization is moving the metadata pool
> to SSD only. Even if it's just 3 small but fast SSDs; that can make a
> huge difference to how fast your filesystem "feels".

They are ordered and will hopefully arrive very soon.

Can I:
1) Add disks
2) Create pool
3) stop all MDS's
4) rados cppool
5) Start MDS

.. Yes, that's a cluster-down on CephFS but it shouldn't take long. Or is
there a better guide?
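
An alternative I have seen mentioned - take it as a sketch, not a recipe - is
to skip the cppool entirely and instead point the metadata pool at an SSD-only
crush rule, letting Ceph migrate the data in the background with no CephFS
downtime (assumes the OSDs have proper device classes and that the pool is
called cephfs_metadata):

$ sudo ceph osd crush rule create-replicated replicated_ssd default host ssd
$ sudo ceph osd pool set cephfs_metadata crush_rule replicated_ssd
$ sudo ceph -s    # then watch the backfill until HEALTH_OK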

--
Jesper

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore vs. Filestore

2018-10-02 Thread jesper
> On 02.10.2018 19:28, jes...@krogh.cc wrote:
> In the cephfs world there is no central server that hold the cache. each
> cephfs client reads data directly from the osd's.

I can accept this argument, but nevertheless .. if I used Filestore - it
would work.

> This also means no
> single point of failure, and you can scale out performance by spreading
> metadata tree information over multiple MDS servers. and scale out
> storage and throughput with added osd nodes.
>
> so if the cephfs client cache is not sufficient, you can look at at the
> bluestore cache.
http://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/#cache-size

I have been there, but it seems to "not work" - I think the need to
slice the cache per OSD and statically allocate memory per OSD hurts the
efficiency (but I cannot prove it).
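
For completeness, the knobs I have been poking at are the per-OSD bluestore
cache settings; a minimal sketch of the ceph.conf side (the values are only
examples, and as said I cannot prove they help):

[osd]
# per-OSD cache for HDD-backed bluestore OSDs (Luminous default is 1 GB)
bluestore_cache_size_hdd = 8589934592
# how that cache is split between the rocksdb k/v cache, onode metadata and data
bluestore_cache_kv_ratio = 0.2
bluestore_cache_meta_ratio = 0.3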

> or you can look at adding a ssd layer over the spinning disks. with eg 
> bcache.  I assume you are using a ssd/nvram for bluestore db already

My current bluestore OSDs are backed by 10TB 7.2K RPM drives, although behind
BBWC. Can you elaborate on the "assumption", as we're not doing that? I'd like
to explore it.

> you should also look at tuning the cephfs metadata servers.
> make sure the metadata pool is on fast ssd osd's .  and tune the mds
> cache to the mds server's ram, so you cache as much metadata as possible.

Yes, we're in the process of doing that - I believe we're seeing the MDS
suffer when we saturate a few disks in the setup, since they are shared.
Thus we'll move the metadata to SSD, as per the recommendations.

-- 
Jesper

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bluestore vs. Filestore

2018-10-02 Thread jesper
Hi.

Based on some recommendations we have set up our CephFS installation using
bluestore*. We're trying to get a strong replacement for a "huge" xfs+NFS
server - 100TB-ish in size.

Current setup is - a sizeable Linux host with 512GB of memory - one large
Dell MD1200 or MD1220 - 100TB + a Linux kernel NFS server.

Since our "hot" dataset is < 400GB we can actually serve the hot data
directly out of the host page-cache and never really touch the "slow"
underlying drives. Except when new bulk data are written where a Perc with
BBWC is consuming the data.

In the CephFS + Bluestore world, Ceph "deliberately" bypasses the host
OS page cache, so even though we have 4-5 x 256GB of memory** in the OSD hosts
it is really hard to create a synthetic test where the hot data does not
end up being read from the underlying disks. Yes, the
client-side page cache works very well, but in our scenario we have 30+
hosts pulling the same data over NFS.

Is bluestore just a "bad fit" .. Filestore "should" do the right thing? Is
the recommendation to make an SSD "overlay" on the slow drives?

Thoughts?

Jesper

* Bluestore should be the new and shiny future - right?
** Total mem 1TB+



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bluestore OSDs stay down

2016-09-30 Thread Jesper Lykkegaard Karlsen

Hi,

I have been very impressed with the BlueStore test environment I made, 
which is built on Ubuntu 16.04 using the Ceph development master 
repository.


But now I have run into some self inflicted problems.

Yesterday I accidentally updated the OSDs while they were being heavily 
used. Then the OSDs started to go down one by one, and when they all had, 
I ended up with PGs in practically every possible state.


   :~# ceph health
   2016-09-30 09:50:39.044987 7f27e4ee2700 -1 WARNING: the following
   dangerous and experimental features are enabled: bluestore,rocksdb
   2016-09-30 09:50:39.052592 7f27e4ee2700 -1 WARNING: the following
   dangerous and experimental features are enabled: bluestore,rocksdb
   HEALTH_ERR 243 pgs are stuck inactive for more than 300 seconds;
   130 pgs backfill_wait; 488 pgs degraded; 55 pgs down; 49 pgs
   incomplete; 63 pgs peering; 2 pgs recovering; 281 pgs recovery_wait;
   600 pgs stale; 243 pgs stuck inactive; 357 pgs stuck unclean; 488
   pgs undersized; recovery 1240205/1848822 objects degraded (67.081%);
   recovery 397635/1848822 objects misplaced (21.507%); recovery
   57149/616274 unfound (9.273%); mds cluster is degraded; 8/8 in osds
   are down

As mentioned all OSD's are now down and refuse to come back up.

From the osd log file I see this error message:

   
/srv/autobuild-ceph/gitbuilder.git/build/out~/ceph-11.0.0-2808-g76e120c/src/os/bluestore/StupidAllocator.cc:
   317: FAILED assert(rm.empty())

Of course the data was not important in this test environment and the 
easiest would probably be to start over, but I am considering building a 
production environment on Bluestore as soon as it becomes stable, 
so for the sport of it I would like to see if I can actually recover the 
OSDs - just to get some deeper insight into Ceph recovery.


I have been through: 
http://docs.ceph.com/docs/jewel/rados/troubleshooting/troubleshooting-osd/ 
without any luck.


What would be the next steps to try?

Thanks!

//Jesper

2016-09-30 08:51:23.464389 7f17985528c0 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
2016-09-30 08:51:23.464409 7f17985528c0  0 set uid:gid to 64045:64045 (ceph:ceph)
2016-09-30 08:51:23.464430 7f17985528c0  0 ceph version v11.0.0-2808-g76e120c (76e120c705b77d2d2cef1b94cacdd11c14460a3f), process ceph-osd, pid 16590
2016-09-30 08:51:23.464479 7f17985528c0  5 object store type is bluestore
2016-09-30 08:51:23.464504 7f17985528c0 -1 WARNING: experimental feature 'bluestore' is enabled
Please be aware that this feature is experimental, untested,
unsupported, and may result in data corruption, data loss,
and/or irreparable damage to your cluster.  Do not use
feature with important data.

2016-09-30 08:51:23.466344 7f17985528c0  0 pidfile_write: ignore empty --pid-file
2016-09-30 08:51:23.468345 7f17985528c0 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
2016-09-30 08:51:23.473859 7f17985528c0 10 ErasureCodePluginSelectJerasure: load: jerasure_sse4 
2016-09-30 08:51:23.475249 7f17985528c0 10 load: jerasure load: lrc load: isa 
2016-09-30 08:51:23.475657 7f17985528c0  2 osd.0 0 mounting /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal
2016-09-30 08:51:23.475669 7f17985528c0  1 bluestore(/var/lib/ceph/osd/ceph-0) mount path /var/lib/ceph/osd/ceph-0
2016-09-30 08:51:23.475707 7f17985528c0  1 bdev create path /var/lib/ceph/osd/ceph-0/block type kernel
2016-09-30 08:51:23.476182 7f17985528c0  1 bdev(/var/lib/ceph/osd/ceph-0/block) open path /var/lib/ceph/osd/ceph-0/block
2016-09-30 08:51:23.476545 7f17985528c0  1 bdev(/var/lib/ceph/osd/ceph-0/block) open size 4000681103360 (0x3a37b2d1000, 3725 GB) block_size 4096 (4096 B) non-rotational
2016-09-30 08:51:23.476814 7f17985528c0  1 bdev create path /var/lib/ceph/osd/ceph-0/block type kernel
2016-09-30 08:51:23.477256 7f17985528c0  1 bdev(/var/lib/ceph/osd/ceph-0/block) open path /var/lib/ceph/osd/ceph-0/block
2016-09-30 08:51:23.477501 7f17985528c0  1 bdev(/var/lib/ceph/osd/ceph-0/block) open size 4000681103360 (0x3a37b2d1000, 3725 GB) block_size 4096 (4096 B) non-rotational
2016-09-30 08:51:23.477509 7f17985528c0  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-0/block size 3725 GB
2016-09-30 08:51:23.477533 7f17985528c0  1 bluefs mount
2016-09-30 08:51:23.553415 7f17985528c0 -1 /srv/autobuild-ceph/gitbuilder.git/build/out~/ceph-11.0.0-2808-g76e120c/src/os/bluestore/StupidAllocator.cc: In function 'virtual void StupidAllocator::init_rm_free(uint64_t, uint64_t)' thread 7f17985528c0 time 2016-09-30 08:51:23.550071
/srv/autobuild-ceph/gitbuilder.git/build/out~/ceph-11.0.0-2808-g76e120c/src/os/bluestore/StupidAllocator.cc: 317: FAILED assert(rm.empty())

 ceph version v11.0.0-2808-g76e120c (76e120c705b77d2d2cef1b94cacdd11c14460a3f)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55d4c31a6240]
 2: (StupidAllocator::init_rm_free(unsigned long

Re: [ceph-users] Journal symlink broken / Ceph 0.94.5 / CentOS 6.7

2015-12-19 Thread Jesper Thorhauge
Hi Loic, 

Problem solved! 

So I just started to reinstall with the last known working conf - CentOS 6.6 
(remember I updated to 6.7). During the install it complained about some "BIOS 
RAID metadata" on one of the disks (/dev/sdc) and wanted to hide it. Last time I 
just added a boot parameter to ignore this, so it would let me install the 
OS on this disk. So I thought ... hmm, maybe this is the root cause of the 
problems? I aborted the re-install, went back into the 6.7 install, and applied 
this fix to my /dev/sdc; 
https://kezhong.wordpress.com/2011/06/14/how-to-remove-bios-raid-metadata-from-disk-on-fedora/
 

I deleted both journals and OSDs and re-created them - and presto - things are 
working :-) !!! 

So I guess having an old disk with "BIOS RAID metadata" on it will disturb 
ceph. Maybe ceph should include a check for this "BIOS RAID metadata"? Someone 
besides me might decide to use old RAID disks for their ceph setup ;-) 
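
For the archives, the cleanup itself boils down to wiping the stale RAID
signature; a sketch of the two ways I know of (obviously destructive to
whatever metadata sits on the disk - double-check the device first):

$ sudo dmraid -r              # list disks carrying BIOS/fake-RAID metadata
$ sudo dmraid -r -E /dev/sdc  # erase that metadata from the disk
# or, with a newer util-linux, wipefs can list and clear the signature:
$ sudo wipefs /dev/sdc
$ sudo wipefs -a /dev/sdc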

Thank you very much for your kind help! 

Cheers, 
Jesper 

***** 

On 18/12/2015 22:09, Jesper Thorhauge wrote: 
> Hi Loic, 
> 
> Getting closer! 
> 
> lrwxrwxrwx 1 root root 10 Dec 18 19:43 1e9d527f-0866-4284-b77c-c1cb04c5a168 
> -> ../../sdc4 
> lrwxrwxrwx 1 root root 10 Dec 18 19:43 c34d4694-b486-450d-b57f-da24255f0072 
> -> ../../sdc3 
> lrwxrwxrwx 1 root root 10 Dec 18 19:42 c83b5aa5-fe77-42f6-9415-25ca0266fb7f 
> -> ../../sdb1 
> lrwxrwxrwx 1 root root 10 Dec 18 19:42 e85f4d92-c8f1-4591-bd2a-aa43b80f58f6 
> -> ../../sda1 
> 
> So symlinks are now working! Activating an OSD is a different story :-( 
> 
> "ceph-disk -vv activate /dev/sda1" gives me; 
> 
> INFO:ceph-disk:Running command: /sbin/blkid -p -s TYPE -ovalue -- /dev/sda1 
> INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. 
> --lookup osd_mount_options_xfs 
> INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. 
> --lookup osd_fs_mount_options_xfs 
> DEBUG:ceph-disk:Mounting /dev/sda1 on /var/lib/ceph/tmp/mnt.A99cDp with 
> options noatime,inode64 
> INFO:ceph-disk:Running command: /bin/mount -t xfs -o noatime,inode64 -- 
> /dev/sda1 /var/lib/ceph/tmp/mnt.A99cDp 
> DEBUG:ceph-disk:Cluster uuid is 07b5c90b-6cae-40c0-93b2-31e0ebad7315 
> INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster=ceph 
> --show-config-value=fsid 
> DEBUG:ceph-disk:Cluster name is ceph 
> DEBUG:ceph-disk:OSD uuid is e85f4d92-c8f1-4591-bd2a-aa43b80f58f6 
> DEBUG:ceph-disk:OSD id is 6 
> DEBUG:ceph-disk:Initializing OSD... 
> INFO:ceph-disk:Running command: /usr/bin/ceph --cluster ceph --name 
> client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon 
> getmap -o /var/lib/ceph/tmp/mnt.A99cDp/activate.monmap 
> got monmap epoch 6 
> INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster ceph --mkfs 
> --mkkey -i 6 --monmap /var/lib/ceph/tmp/mnt.A99cDp/activate.monmap --osd-data 
> /var/lib/ceph/tmp/mnt.A99cDp --osd-journal 
> /var/lib/ceph/tmp/mnt.A99cDp/journal --osd-uuid 
> e85f4d92-c8f1-4591-bd2a-aa43b80f58f6 --keyring 
> /var/lib/ceph/tmp/mnt.A99cDp/keyring 
> HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device 
> 2015-12-18 21:58:12.489357 7f266d7b0800 -1 journal check: ondisk fsid 
> ---- doesn't match expected 
> e85f4d92-c8f1-4591-bd2a-aa43b80f58f6, invalid (someone else's?) journal 
> HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device 
> HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device 
> HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device 
> 2015-12-18 21:58:12.680566 7f266d7b0800 -1 
> filestore(/var/lib/ceph/tmp/mnt.A99cDp) could not find 
> 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory 
> 2015-12-18 21:58:12.865810 7f266d7b0800 -1 created object store 
> /var/lib/ceph/tmp/mnt.A99cDp journal /var/lib/ceph/tmp/mnt.A99cDp/journal for 
> osd.6 fsid 07b5c90b-6cae-40c0-93b2-31e0ebad7315 
> 2015-12-18 21:58:12.865844 7f266d7b0800 -1 auth: error reading file: 
> /var/lib/ceph/tmp/mnt.A99cDp/keyring: can't open 
> /var/lib/ceph/tmp/mnt.A99cDp/keyring: (2) No such file or directory 
> 2015-12-18 21:58:12.865910 7f266d7b0800 -1 created new key in keyring 
> /var/lib/ceph/tmp/mnt.A99cDp/keyring 
> INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. 
> --lookup init 
> DEBUG:ceph-disk:Marking with init system sysvinit 
> DEBUG:ceph-disk:Authorizing OSD key... 
> INFO:ceph-disk:Running command: /usr/bin/ceph --cluster ceph --name 
> client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring auth 
> add osd.6 -i /var/lib/ceph/tmp/mnt.A99cDp/keyring osd allow * mon allow 
> profile osd 
> Error EINVAL: entity osd.6 exists but key does not match 
> ERROR:ceph-disk:Fail

Re: [ceph-users] Journal symlink broken / Ceph 0.94.5 / CentOS 6.7

2015-12-18 Thread Jesper Thorhauge
Hi Loic, 

I searched around for possible udev bugs, and then tried to run "yum update". 
udev did have a fresh update with the following version diff; 

udev-147-2.63.el6_7.1.x86_64 --> udev-147-2.63.el6_7.1.x86_64 

From what I can see this update fixes stuff related to symbolic links / 
external devices. /dev/sdc sits on external eSATA. So... 

https://rhn.redhat.com/errata/RHBA-2015-1382.html 

will reboot tonight and get back :-) 

/jesper 

***' 

I guess that's the problem you need to solve : why /dev/sdc does not generate 
udev events (different driver than /dev/sda maybe ?). Once it does, Ceph should 
work. 

A workaround could be to add something like: 

ceph-disk-udev 3 sdc3 sdc 
ceph-disk-udev 4 sdc4 sdc 

in /etc/rc.local. 

On 17/12/2015 12:01, Jesper Thorhauge wrote: 
> Nope, the previous post contained all that was in the boot.log :-( 
> 
> /Jesper 
> 
> ** 
> 
> - On 17 Dec 2015, at 11:53, Loic Dachary <l...@dachary.org> wrote: 
> 
> On 17/12/2015 11:33, Jesper Thorhauge wrote: 
>> Hi Loic, 
>> 
>> Sounds like something does go wrong when /dev/sdc3 shows up. Is there anyway 
>> i can debug this further? Log-files? Modify the .rules file...? 
> 
> Do you see traces of what happens when /dev/sdc3 shows up in boot.log ? 
> 
>> 
>> /Jesper 
>> 
>>  
>> 
>> The non-symlink files in /dev/disk/by-partuuid come to existence because of: 
>> 
>> * system boots 
>> * udev rule calls ceph-disk-udev via 95-ceph-osd.rules on /dev/sda1 
>> * ceph-disk-udev creates the symlink 
>> /dev/disk/by-partuuid/c83b5aa5-fe77-42f6-9415-25ca0266fb7f -> ../../sdb1 
>> * ceph-disk activate /dev/sda1 is mounted and finds a symlink to the journal 
>> journal -> /dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168 which 
>> does not yet exists because /dev/sdc udev rules have not been run yet 
>> * ceph-osd opens the journal in write mode and that creates the file 
>> /dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168 as a regular file 
>> * the file is empty and the osd fails to activate with the error you see 
>> (EINVAL because the file is empty) 
>> 
>> This is ok, supported and expected since there is no way to know which disk 
>> will show up first. 
>> 
>> When /dev/sdc shows up, the same logic will be triggered: 
>> 
>> * udev rule calls ceph-disk-udev via 95-ceph-osd.rules on /dev/sda1 
>> * ceph-disk-udev creates the symlink 
>> /dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168 -> ../../sdc3 
>> (overriding the file because ln -sf) 
>> * ceph-disk activate-journal /dev/sdc3 finds that 
>> c83b5aa5-fe77-42f6-9415-25ca0266fb7f is the data partition for that journal 
>> and mounts /dev/disk/by-partuuid/c83b5aa5-fe77-42f6-9415-25ca0266fb7f 
>> * ceph-osd opens the journal and all is well 
>> 
>> Except something goes wrong in your case, presumably because ceph-disk-udev 
>> is not called when /dev/sdc3 shows up ? 
>> 
>> On 17/12/2015 08:29, Jesper Thorhauge wrote: 
>>> Hi Loic, 
>>> 
>>> OSDs are on /dev/sda and /dev/sdb, journals are on /dev/sdc (sdc3 / sdc4). 
>>> 
>>> sgdisk for sda shows; 
>>> 
>>> Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown) 
>>> Partition unique GUID: E85F4D92-C8F1-4591-BD2A-AA43B80F58F6 
>>> First sector: 2048 (at 1024.0 KiB) 
>>> Last sector: 1953525134 (at 931.5 GiB) 
>>> Partition size: 1953523087 sectors (931.5 GiB) 
>>> Attribute flags:  
>>> Partition name: 'ceph data' 
>>> 
>>> for sdb 
>>> 
>>> Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown) 
>>> Partition unique GUID: C83B5AA5-FE77-42F6-9415-25CA0266FB7F 
>>> First sector: 2048 (at 1024.0 KiB) 
>>> Last sector: 1953525134 (at 931.5 GiB) 
>>> Partition size: 1953523087 sectors (931.5 GiB) 
>>> Attribute flags:  
>>> Partition name: 'ceph data' 
>>> 
>>> for /dev/sdc3 
>>> 
>>> Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown) 
>>> Partition unique GUID: C34D4694-B486-450D-B57F-DA24255F0072 
>>> First sector: 935813120 (at 446.2 GiB) 
>>> Last sector: 956293119 (at 456.0 GiB) 
>>> Partition size: 20480000 sectors (9.8 GiB) 
>>> Attribute flags:  
>>> Partition name: 'ceph journal' 
>>> 
>>> for /dev/sdc4 
>>> 
>>> Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown) 
>>> Partition unique GUID: 1

Re: [ceph-users] Journal symlink broken / Ceph 0.94.5 / CentOS 6.7

2015-12-18 Thread Jesper Thorhauge
Hi Loic, 

Damn, the updated udev didn't fix the problem :-( 

The rc.local workaround is also complaining; 

INFO:ceph-disk:Running command: /usr/bin/ceph-osd -i 0 --get-journal-uuid 
--osd-journal /dev/sdc3 
libust[2648/2648]: Warning: HOME environment variable not set. Disabling 
LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305) 
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device 
DEBUG:ceph-disk:Journal /dev/sdc3 has OSD UUID 
---- 
INFO:ceph-disk:Running command: /sbin/blkid -p -s TYPE -ovalue -- 
/dev/disk/by-partuuid/---- 
error: /dev/disk/by-partuuid/----: No such file 
or directory 
ceph-disk: Cannot discover filesystem type: device 
/dev/disk/by-partuuid/----: Command 
'/sbin/blkid' returned non-zero exit status 2 
INFO:ceph-disk:Running command: /usr/bin/ceph-osd -i 0 --get-journal-uuid 
--osd-journal /dev/sdc4 
libust[2687/2687]: Warning: HOME environment variable not set. Disabling 
LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305) 
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device 
DEBUG:ceph-disk:Journal /dev/sdc4 has OSD UUID 
---- 
INFO:ceph-disk:Running command: /sbin/blkid -p -s TYPE -ovalue -- 
/dev/disk/by-partuuid/---- 
error: /dev/disk/by-partuuid/----: No such file 
or directory 
ceph-disk: Cannot discover filesystem type: device 
/dev/disk/by-partuuid/----: Command 
'/sbin/blkid' returned non-zero exit status 2 

/dev/sdc1 and /dev/sdc2 contain the boot loader and OS, so driver-wise I guess 
things are working :-) 

But "HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device" seems to 
be the underlying issue. 

Any thoughts? 

/Jesper 

* 

Hi Loic, 

I searched around for possible udev bugs, and then tried to run "yum update". 
Udev did have a fresh update with the following version diff; 

udev-147-2.63.el6_7.1.x86_64 --> udev-147-2.63.el6_7.1.x86_64 

From what I can see this update fixes things related to symbolic links / 
external devices. /dev/sdc sits on external eSATA. So... 

https://rhn.redhat.com/errata/RHBA-2015-1382.html 

will reboot tonight and get back :-) 

/jesper 

***' 

I guess that's the problem you need to solve : why /dev/sdc does not generate 
udev events (different driver than /dev/sda maybe ?). Once it does, Ceph should 
work. 
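
One way to check that might be to watch udev while replaying add events for 
the block devices - a rough sketch (option names as in the stock RHEL6 udev, 
adjust as needed): 

# in one terminal, watch kernel and udev events 
udevadm monitor --kernel --udev 
# in another, replay "add" events for all block devices 
udevadm trigger --action=add --subsystem-match=block 
# show which rules files would run for the journal partition 
# (older udev may want the devpath without the /sys prefix) 
udevadm test /sys/block/sdc/sdc3 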

A workaround could be to add something like: 

ceph-disk-udev 3 sdc3 sdc 
ceph-disk-udev 4 sdc4 sdc 

in /etc/rc.local. 

On 17/12/2015 12:01, Jesper Thorhauge wrote: 
> Nope, the previous post contained all that was in the boot.log :-( 
> 
> /Jesper 
> 
> ** 
> 
> - Den 17. dec 2015, kl. 11:53, Loic Dachary <l...@dachary.org> skrev: 
> 
> On 17/12/2015 11:33, Jesper Thorhauge wrote: 
>> Hi Loic, 
>> 
>> Sounds like something does go wrong when /dev/sdc3 shows up. Is there any way 
>> I can debug this further? Log files? Modify the .rules file...? 
> 
> Do you see traces of what happens when /dev/sdc3 shows up in boot.log ? 
> 
>> 
>> /Jesper 
>> 
>>  
>> 
>> The non-symlink files in /dev/disk/by-partuuid come to existence because of: 
>> 
>> * system boots 
>> * udev rule calls ceph-disk-udev via 95-ceph-osd.rules on /dev/sda1 
>> * ceph-disk-udev creates the symlink 
>> /dev/disk/by-partuuid/c83b5aa5-fe77-42f6-9415-25ca0266fb7f -> ../../sdb1 
>> * ceph-disk activate /dev/sda1 is mounted and finds a symlink to the journal 
>> journal -> /dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168 which 
>> does not yet exist because /dev/sdc udev rules have not been run yet 
>> * ceph-osd opens the journal in write mode and that creates the file 
>> /dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168 as a regular file 
>> * the file is empty and the osd fails to activate with the error you see 
>> (EINVAL because the file is empty) 
>> 
>> This is ok, supported and expected since there is no way to know which disk 
>> will show up first. 
>> 
>> When /dev/sdc shows up, the same logic will be triggered: 
>> 
>> * udev rule calls ceph-disk-udev via 95-ceph-osd.rules on /dev/sda1 
>> * ceph-disk-udev creates the symlink 
>> /dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168 -> ../../sdc3 
>> (overriding the file because ln -sf) 
>> * ceph-disk activate-journal /dev/sdc3 finds that 
>> c83b5aa5-fe77-42f6-9415-25ca0266fb7f is the data partition for that journal 
>> and mounts /dev/disk/by-partuuid/c83b5aa5-fe77-42f6-9415-25ca0266fb7f 
>

Re: [ceph-users] Journal symlink broken / Ceph 0.94.5 / CentOS 6.7

2015-12-18 Thread Jesper Thorhauge
Hi Loic, 

Getting closer! 

lrwxrwxrwx 1 root root 10 Dec 18 19:43 1e9d527f-0866-4284-b77c-c1cb04c5a168 -> 
../../sdc4 
lrwxrwxrwx 1 root root 10 Dec 18 19:43 c34d4694-b486-450d-b57f-da24255f0072 -> 
../../sdc3 
lrwxrwxrwx 1 root root 10 Dec 18 19:42 c83b5aa5-fe77-42f6-9415-25ca0266fb7f -> 
../../sdb1 
lrwxrwxrwx 1 root root 10 Dec 18 19:42 e85f4d92-c8f1-4591-bd2a-aa43b80f58f6 -> 
../../sda1 

So symlinks are now working! Activating an OSD is a different story :-( 

"ceph-disk -vv activate /dev/sda1" gives me; 

INFO:ceph-disk:Running command: /sbin/blkid -p -s TYPE -ovalue -- /dev/sda1 
INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. 
--lookup osd_mount_options_xfs 
INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. 
--lookup osd_fs_mount_options_xfs 
DEBUG:ceph-disk:Mounting /dev/sda1 on /var/lib/ceph/tmp/mnt.A99cDp with options 
noatime,inode64 
INFO:ceph-disk:Running command: /bin/mount -t xfs -o noatime,inode64 -- 
/dev/sda1 /var/lib/ceph/tmp/mnt.A99cDp 
DEBUG:ceph-disk:Cluster uuid is 07b5c90b-6cae-40c0-93b2-31e0ebad7315 
INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster=ceph 
--show-config-value=fsid 
DEBUG:ceph-disk:Cluster name is ceph 
DEBUG:ceph-disk:OSD uuid is e85f4d92-c8f1-4591-bd2a-aa43b80f58f6 
DEBUG:ceph-disk:OSD id is 6 
DEBUG:ceph-disk:Initializing OSD... 
INFO:ceph-disk:Running command: /usr/bin/ceph --cluster ceph --name 
client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon 
getmap -o /var/lib/ceph/tmp/mnt.A99cDp/activate.monmap 
got monmap epoch 6 
INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster ceph --mkfs --mkkey 
-i 6 --monmap /var/lib/ceph/tmp/mnt.A99cDp/activate.monmap --osd-data 
/var/lib/ceph/tmp/mnt.A99cDp --osd-journal /var/lib/ceph/tmp/mnt.A99cDp/journal 
--osd-uuid e85f4d92-c8f1-4591-bd2a-aa43b80f58f6 --keyring 
/var/lib/ceph/tmp/mnt.A99cDp/keyring 
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device 
2015-12-18 21:58:12.489357 7f266d7b0800 -1 journal check: ondisk fsid 
---- doesn't match expected 
e85f4d92-c8f1-4591-bd2a-aa43b80f58f6, invalid (someone else's?) journal 
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device 
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device 
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device 
2015-12-18 21:58:12.680566 7f266d7b0800 -1 
filestore(/var/lib/ceph/tmp/mnt.A99cDp) could not find 
23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory 
2015-12-18 21:58:12.865810 7f266d7b0800 -1 created object store 
/var/lib/ceph/tmp/mnt.A99cDp journal /var/lib/ceph/tmp/mnt.A99cDp/journal for 
osd.6 fsid 07b5c90b-6cae-40c0-93b2-31e0ebad7315 
2015-12-18 21:58:12.865844 7f266d7b0800 -1 auth: error reading file: 
/var/lib/ceph/tmp/mnt.A99cDp/keyring: can't open 
/var/lib/ceph/tmp/mnt.A99cDp/keyring: (2) No such file or directory 
2015-12-18 21:58:12.865910 7f266d7b0800 -1 created new key in keyring 
/var/lib/ceph/tmp/mnt.A99cDp/keyring 
INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. 
--lookup init 
DEBUG:ceph-disk:Marking with init system sysvinit 
DEBUG:ceph-disk:Authorizing OSD key... 
INFO:ceph-disk:Running command: /usr/bin/ceph --cluster ceph --name 
client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring auth 
add osd.6 -i /var/lib/ceph/tmp/mnt.A99cDp/keyring osd allow * mon allow profile 
osd 
Error EINVAL: entity osd.6 exists but key does not match 
ERROR:ceph-disk:Failed to activate 
DEBUG:ceph-disk:Unmounting /var/lib/ceph/tmp/mnt.A99cDp 
INFO:ceph-disk:Running command: /bin/umount -- /var/lib/ceph/tmp/mnt.A99cDp 
Traceback (most recent call last): 
File "/usr/sbin/ceph-disk", line 2994, in  
main() 
File "/usr/sbin/ceph-disk", line 2972, in main 
args.func(args) 
File "/usr/sbin/ceph-disk", line 2178, in main_activate 
init=args.mark_init, 
File "/usr/sbin/ceph-disk", line 1954, in mount_activate 
(osd_id, cluster) = activate(path, activate_key_template, init) 
File "/usr/sbin/ceph-disk", line 2153, in activate 
keyring=keyring, 
File "/usr/sbin/ceph-disk", line 1756, in auth_key 
'mon', 'allow profile osd', 
File "/usr/sbin/ceph-disk", line 323, in command_check_call 
return subprocess.check_call(arguments) 
File "/usr/lib64/python2.6/subprocess.py", line 505, in check_call 
raise CalledProcessError(retcode, cmd) 
subprocess.CalledProcessError: Command '['/usr/bin/ceph', '--cluster', 'ceph', 
'--name', 'client.bootstrap-osd', '--keyring', 
'/var/lib/ceph/bootstrap-osd/ceph.keyring', 'auth', 'add', 'osd.6', '-i', 
'/var/lib/ceph/tmp/mnt.A99cDp/keyring', 'osd', 'allow *', 'mon', 'allow profile 
osd']' returned non-zero exit status 22 
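
Presumably the cluster still has the osd.6 key from before the reinstall, so 
the freshly generated key is rejected. If that is the case, something like the 
following (run with an admin keyring, and only if osd.6 really is being 
re-created) might clear it before retrying - just a guess, not verified here: 

ceph auth del osd.6 
ceph-disk activate /dev/sda1 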

Thanks! 

/Jesper 

*** 

Hi Jesper, 

The goal of the rc.local is twofold but mainly to ensure the 
/dev/disk/by-partuuid symlinks exist for the jou

Re: [ceph-users] Journal symlink broken / Ceph 0.94.5 / CentOS 6.7

2015-12-17 Thread Jesper Thorhauge
Hi Loic, 

Yep, 95-ceph-osd.rules contains exactly that... 

*** 

And 95-ceph-osd.rules contains the following ? 

# Check gpt partion for ceph tags and activate 
ACTION=="add", SUBSYSTEM=="block", \ 
ENV{DEVTYPE}=="partition", \ 
ENV{ID_PART_TABLE_TYPE}=="gpt", \ 
RUN+="/usr/sbin/ceph-disk-udev $number $name $parent" 

On 17/12/2015 08:29, Jesper Thorhauge wrote: 
> Hi Loic, 
> 
> OSDs are on /dev/sda and /dev/sdb, journals are on /dev/sdc (sdc3 / sdc4). 
> 
> sgdisk for sda shows; 
> 
> Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown) 
> Partition unique GUID: E85F4D92-C8F1-4591-BD2A-AA43B80F58F6 
> First sector: 2048 (at 1024.0 KiB) 
> Last sector: 1953525134 (at 931.5 GiB) 
> Partition size: 1953523087 sectors (931.5 GiB) 
> Attribute flags:  
> Partition name: 'ceph data' 
> 
> for sdb 
> 
> Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown) 
> Partition unique GUID: C83B5AA5-FE77-42F6-9415-25CA0266FB7F 
> First sector: 2048 (at 1024.0 KiB) 
> Last sector: 1953525134 (at 931.5 GiB) 
> Partition size: 1953523087 sectors (931.5 GiB) 
> Attribute flags:  
> Partition name: 'ceph data' 
> 
> for /dev/sdc3 
> 
> Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown) 
> Partition unique GUID: C34D4694-B486-450D-B57F-DA24255F0072 
> First sector: 935813120 (at 446.2 GiB) 
> Last sector: 956293119 (at 456.0 GiB) 
> Partition size: 20480000 sectors (9.8 GiB) 
> Attribute flags:  
> Partition name: 'ceph journal' 
> 
> for /dev/sdc4 
> 
> Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown) 
> Partition unique GUID: 1E9D527F-0866-4284-B77C-C1CB04C5A168 
> First sector: 956293120 (at 456.0 GiB) 
> Last sector: 976773119 (at 465.8 GiB) 
> Partition size: 20480000 sectors (9.8 GiB) 
> Attribute flags:  
> Partition name: 'ceph journal' 
> 
> 60-ceph-partuuid-workaround.rules is located in /lib/udev/rules.d, so it 
> seems correct to me. 
> 
> after a reboot, /dev/disk/by-partuuid is; 
> 
> -rw-r--r-- 1 root root 0 Dec 16 07:35 1e9d527f-0866-4284-b77c-c1cb04c5a168 
> -rw-r--r-- 1 root root 0 Dec 16 07:35 c34d4694-b486-450d-b57f-da24255f0072 
> lrwxrwxrwx 1 root root 10 Dec 16 07:35 c83b5aa5-fe77-42f6-9415-25ca0266fb7f 
> -> ../../sdb1 
> lrwxrwxrwx 1 root root 10 Dec 16 07:35 e85f4d92-c8f1-4591-bd2a-aa43b80f58f6 
> -> ../../sda1 
> 
> I don't know how to verify the symlink of the journal file - can you guide me 
> on that one? 
> 
> Thanks :-) ! 
> 
> /Jesper 
> 
> ** 
> 
> Hi, 
> 
> On 17/12/2015 07:53, Jesper Thorhauge wrote: 
>> Hi, 
>> 
>> Some more information showing in the boot.log; 
>> 
>> 2015-12-16 07:35:33.289830 7f1b990ad800 -1 
>> filestore(/var/lib/ceph/tmp/mnt.aWZTcE) mkjournal error creating journal on 
>> /var/lib/ceph/tmp/mnt.aWZTcE/journal: (22) Invalid argument 
>> 2015-12-16 07:35:33.289842 7f1b990ad800 -1 OSD::mkfs: ObjectStore::mkfs 
>> failed with error -22 
>> 2015-12-16 07:35:33.289883 7f1b990ad800 -1 ** ERROR: error creating empty 
>> object store in /var/lib/ceph/tmp/mnt.aWZTcE: (22) Invalid argument 
>> ERROR:ceph-disk:Failed to activate 
>> ceph-disk: Command '['/usr/bin/ceph-osd', '--cluster', 'ceph', '--mkfs', 
>> '--mkkey', '-i', '7', '--monmap', 
>> '/var/lib/ceph/tmp/mnt.aWZTcE/activate.monmap', '--osd-data', 
>> '/var/lib/ceph/tmp/mnt.aWZTcE', '--osd-journal', 
>> '/var/lib/ceph/tmp/mnt.aWZTcE/journal', '--osd-uuid', 
>> 'c83b5aa5-fe77-42f6-9415-25ca0266fb7f', '--keyring', 
>> '/var/lib/ceph/tmp/mnt.aWZTcE/keyring']' returned non-zero exit status 1 
>> ceph-disk: Error: One or more partitions failed to activate 
>> 
>> Maybe related to the "(22) Invalid argument" part..? 
> 
> After a reboot the symlinks are reconstructed and if they are still 
> incorrect, it means there is an inconsistency somewhere else. To debug the 
> problem, could you mount /dev/sda1 and verify the symlink of the journal file 
> ? Then verify the content of /dev/disk/by-partuuid. And also display the 
> partition information with sgdisk -i 1 /dev/sda and sgdisk -i 2 /dev/sda. Are 
> you collocating your journal with the data, on the same disk ? Or are they on 
> two different disks ? 
> 
> git log --no-merges --oneline tags/v0.94.3..tags/v0.94.5 udev 
> 
> shows nothing, meaning there has been no change to udev rules. There is one 
> change related to the installation of the udev rules 
> https://github.com/ceph/ceph/commit/4eb58ad2027148561d94bb43346b464b55d041a6. 
> Could you double check 60-ceph-partuuid-workaround.rul

Re: [ceph-users] Journal symlink broken / Ceph 0.94.5 / CentOS 6.7

2015-12-17 Thread Jesper Thorhauge
Hi Loic, 

Sounds like something does go wrong when /dev/sdc3 shows up. Is there any way I 
can debug this further? Log files? Modify the .rules file...? 

/Jesper 

 

The non-symlink files in /dev/disk/by-partuuid come to existence because of: 

* system boots 
* udev rule calls ceph-disk-udev via 95-ceph-osd.rules on /dev/sda1 
* ceph-disk-udev creates the symlink 
/dev/disk/by-partuuid/c83b5aa5-fe77-42f6-9415-25ca0266fb7f -> ../../sdb1 
* ceph-disk activate /dev/sda1 is mounted and finds a symlink to the journal 
journal -> /dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168 which 
does not yet exist because /dev/sdc udev rules have not been run yet 
* ceph-osd opens the journal in write mode and that creates the file 
/dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168 as a regular file 
* the file is empty and the osd fails to activate with the error you see 
(EINVAL because the file is empty) 

This is ok, supported and expected since there is no way to know which disk 
will show up first. 

When /dev/sdc shows up, the same logic will be triggered: 

* udev rule calls ceph-disk-udev via 95-ceph-osd.rules on /dev/sda1 
* ceph-disk-udev creates the symlink 
/dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168 -> ../../sdc3 
(overriding the file because ln -sf) 
* ceph-disk activate-journal /dev/sdc3 finds that 
c83b5aa5-fe77-42f6-9415-25ca0266fb7f is the data partition for that journal and 
mounts /dev/disk/by-partuuid/c83b5aa5-fe77-42f6-9415-25ca0266fb7f 
* ceph-osd opens the journal and all is well 

Except something goes wrong in your case, presumably because ceph-disk-udev is 
not called when /dev/sdc3 shows up ? 
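
One way to see where that chain breaks is to replay the last steps by hand 
once the disk is visible - a sketch, using the device names from this thread: 

# what the udev rule would normally do for partition 3 of /dev/sdc 
/usr/sbin/ceph-disk-udev 3 sdc3 sdc 
# then let ceph-disk read the data partition uuid from the journal and 
# activate the matching OSD 
ceph-disk -v activate-journal /dev/sdc3 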

On 17/12/2015 08:29, Jesper Thorhauge wrote: 
> Hi Loic, 
> 
> OSDs are on /dev/sda and /dev/sdb, journals are on /dev/sdc (sdc3 / sdc4). 
> 
> sgdisk for sda shows; 
> 
> Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown) 
> Partition unique GUID: E85F4D92-C8F1-4591-BD2A-AA43B80F58F6 
> First sector: 2048 (at 1024.0 KiB) 
> Last sector: 1953525134 (at 931.5 GiB) 
> Partition size: 1953523087 sectors (931.5 GiB) 
> Attribute flags:  
> Partition name: 'ceph data' 
> 
> for sdb 
> 
> Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown) 
> Partition unique GUID: C83B5AA5-FE77-42F6-9415-25CA0266FB7F 
> First sector: 2048 (at 1024.0 KiB) 
> Last sector: 1953525134 (at 931.5 GiB) 
> Partition size: 1953523087 sectors (931.5 GiB) 
> Attribute flags:  
> Partition name: 'ceph data' 
> 
> for /dev/sdc3 
> 
> Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown) 
> Partition unique GUID: C34D4694-B486-450D-B57F-DA24255F0072 
> First sector: 935813120 (at 446.2 GiB) 
> Last sector: 956293119 (at 456.0 GiB) 
> Partition size: 20480000 sectors (9.8 GiB) 
> Attribute flags:  
> Partition name: 'ceph journal' 
> 
> for /dev/sdc4 
> 
> Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown) 
> Partition unique GUID: 1E9D527F-0866-4284-B77C-C1CB04C5A168 
> First sector: 956293120 (at 456.0 GiB) 
> Last sector: 976773119 (at 465.8 GiB) 
> Partition size: 20480000 sectors (9.8 GiB) 
> Attribute flags:  
> Partition name: 'ceph journal' 
> 
> 60-ceph-partuuid-workaround.rules is located in /lib/udev/rules.d, so it 
> seems correct to me. 
> 
> after a reboot, /dev/disk/by-partuuid is; 
> 
> -rw-r--r-- 1 root root 0 Dec 16 07:35 1e9d527f-0866-4284-b77c-c1cb04c5a168 
> -rw-r--r-- 1 root root 0 Dec 16 07:35 c34d4694-b486-450d-b57f-da24255f0072 
> lrwxrwxrwx 1 root root 10 Dec 16 07:35 c83b5aa5-fe77-42f6-9415-25ca0266fb7f 
> -> ../../sdb1 
> lrwxrwxrwx 1 root root 10 Dec 16 07:35 e85f4d92-c8f1-4591-bd2a-aa43b80f58f6 
> -> ../../sda1 
> 
> I don't know how to verify the symlink of the journal file - can you guide me 
> on that one? 
> 
> Thanks :-) ! 
> 
> /Jesper 
> 
> ** 
> 
> Hi, 
> 
> On 17/12/2015 07:53, Jesper Thorhauge wrote: 
>> Hi, 
>> 
>> Some more information showing in the boot.log; 
>> 
>> 2015-12-16 07:35:33.289830 7f1b990ad800 -1 
>> filestore(/var/lib/ceph/tmp/mnt.aWZTcE) mkjournal error creating journal on 
>> /var/lib/ceph/tmp/mnt.aWZTcE/journal: (22) Invalid argument 
>> 2015-12-16 07:35:33.289842 7f1b990ad800 -1 OSD::mkfs: ObjectStore::mkfs 
>> failed with error -22 
>> 2015-12-16 07:35:33.289883 7f1b990ad800 -1 ** ERROR: error creating empty 
>> object store in /var/lib/ceph/tmp/mnt.aWZTcE: (22) Invalid argument 
>> ERROR:ceph-disk:Failed to activate 
>> ceph-disk: Command '['/usr/bin/ceph-osd', '--cluster', 'ceph', '--mkfs', 
>> '--mkkey', '-i', '7', '--monmap', 
>> '/var/lib/ceph/tmp/mnt.aWZTcE/activate.monmap', '

Re: [ceph-users] Journal symlink broken / Ceph 0.94.5 / CentOS 6.7

2015-12-17 Thread Jesper Thorhauge
Nope, the previous post contained all that was in the boot.log :-( 

/Jesper 

** 

- Den 17. dec 2015, kl. 11:53, Loic Dachary <l...@dachary.org> skrev: 

On 17/12/2015 11:33, Jesper Thorhauge wrote: 
> Hi Loic, 
> 
> Sounds like something does go wrong when /dev/sdc3 shows up. Is there any way 
> I can debug this further? Log files? Modify the .rules file...? 

Do you see traces of what happens when /dev/sdc3 shows up in boot.log ? 

> 
> /Jesper 
> 
>  
> 
> The non-symlink files in /dev/disk/by-partuuid come to existence because of: 
> 
> * system boots 
> * udev rule calls ceph-disk-udev via 95-ceph-osd.rules on /dev/sda1 
> * ceph-disk-udev creates the symlink 
> /dev/disk/by-partuuid/c83b5aa5-fe77-42f6-9415-25ca0266fb7f -> ../../sdb1 
> * ceph-disk activate /dev/sda1 is mounted and finds a symlink to the journal 
> journal -> /dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168 which 
> does not yet exist because /dev/sdc udev rules have not been run yet 
> * ceph-osd opens the journal in write mode and that creates the file 
> /dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168 as a regular file 
> * the file is empty and the osd fails to activate with the error you see 
> (EINVAL because the file is empty) 
> 
> This is ok, supported and expected since there is no way to know which disk 
> will show up first. 
> 
> When /dev/sdc shows up, the same logic will be triggered: 
> 
> * udev rule calls ceph-disk-udev via 95-ceph-osd.rules on /dev/sda1 
> * ceph-disk-udev creates the symlink 
> /dev/disk/by-partuuid/1e9d527f-0866-4284-b77c-c1cb04c5a168 -> ../../sdc3 
> (overriding the file because ln -sf) 
> * ceph-disk activate-journal /dev/sdc3 finds that 
> c83b5aa5-fe77-42f6-9415-25ca0266fb7f is the data partition for that journal 
> and mounts /dev/disk/by-partuuid/c83b5aa5-fe77-42f6-9415-25ca0266fb7f 
> * ceph-osd opens the journal and all is well 
> 
> Except something goes wrong in your case, presumably because ceph-disk-udev 
> is not called when /dev/sdc3 shows up ? 
> 
> On 17/12/2015 08:29, Jesper Thorhauge wrote: 
>> Hi Loic, 
>> 
>> OSDs are on /dev/sda and /dev/sdb, journals are on /dev/sdc (sdc3 / sdc4). 
>> 
>> sgdisk for sda shows; 
>> 
>> Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown) 
>> Partition unique GUID: E85F4D92-C8F1-4591-BD2A-AA43B80F58F6 
>> First sector: 2048 (at 1024.0 KiB) 
>> Last sector: 1953525134 (at 931.5 GiB) 
>> Partition size: 1953523087 sectors (931.5 GiB) 
>> Attribute flags:  
>> Partition name: 'ceph data' 
>> 
>> for sdb 
>> 
>> Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown) 
>> Partition unique GUID: C83B5AA5-FE77-42F6-9415-25CA0266FB7F 
>> First sector: 2048 (at 1024.0 KiB) 
>> Last sector: 1953525134 (at 931.5 GiB) 
>> Partition size: 1953523087 sectors (931.5 GiB) 
>> Attribute flags:  
>> Partition name: 'ceph data' 
>> 
>> for /dev/sdc3 
>> 
>> Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown) 
>> Partition unique GUID: C34D4694-B486-450D-B57F-DA24255F0072 
>> First sector: 935813120 (at 446.2 GiB) 
>> Last sector: 956293119 (at 456.0 GiB) 
>> Partition size: 20480000 sectors (9.8 GiB) 
>> Attribute flags:  
>> Partition name: 'ceph journal' 
>> 
>> for /dev/sdc4 
>> 
>> Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown) 
>> Partition unique GUID: 1E9D527F-0866-4284-B77C-C1CB04C5A168 
>> First sector: 956293120 (at 456.0 GiB) 
>> Last sector: 976773119 (at 465.8 GiB) 
>> Partition size: 20480000 sectors (9.8 GiB) 
>> Attribute flags:  
>> Partition name: 'ceph journal' 
>> 
>> 60-ceph-partuuid-workaround.rules is located in /lib/udev/rules.d, so it 
>> seems correct to me. 
>> 
>> after a reboot, /dev/disk/by-partuuid is; 
>> 
>> -rw-r--r-- 1 root root 0 Dec 16 07:35 1e9d527f-0866-4284-b77c-c1cb04c5a168 
>> -rw-r--r-- 1 root root 0 Dec 16 07:35 c34d4694-b486-450d-b57f-da24255f0072 
>> lrwxrwxrwx 1 root root 10 Dec 16 07:35 c83b5aa5-fe77-42f6-9415-25ca0266fb7f 
>> -> ../../sdb1 
>> lrwxrwxrwx 1 root root 10 Dec 16 07:35 e85f4d92-c8f1-4591-bd2a-aa43b80f58f6 
>> -> ../../sda1 
>> 
>> I don't know how to verify the symlink of the journal file - can you guide me 
>> on that one? 
>> 
>> Thanks :-) ! 
>> 
>> /Jesper 
>> 
>> ** 
>> 
>> Hi, 
>> 
>> On 17/12/2015 07:53, Jesper Thorhauge wrote: 
>>> Hi, 
>>> 

Re: [ceph-users] Journal symlink broken / Ceph 0.94.5 / CentOS 6.7

2015-12-16 Thread Jesper Thorhauge
Hi, 

Some more information showing in the boot.log; 

2015-12-16 07:35:33.289830 7f1b990ad800 -1 
filestore(/var/lib/ceph/tmp/mnt.aWZTcE) mkjournal error creating journal on 
/var/lib/ceph/tmp/mnt.aWZTcE/journal: (22) Invalid argument 
2015-12-16 07:35:33.289842 7f1b990ad800 -1 OSD::mkfs: ObjectStore::mkfs failed 
with error -22 
2015-12-16 07:35:33.289883 7f1b990ad800 -1 ** ERROR: error creating empty 
object store in /var/lib/ceph/tmp/mnt.aWZTcE: (22) Invalid argument 
ERROR:ceph-disk:Failed to activate 
ceph-disk: Command '['/usr/bin/ceph-osd', '--cluster', 'ceph', '--mkfs', 
'--mkkey', '-i', '7', '--monmap', 
'/var/lib/ceph/tmp/mnt.aWZTcE/activate.monmap', '--osd-data', 
'/var/lib/ceph/tmp/mnt.aWZTcE', '--osd-journal', 
'/var/lib/ceph/tmp/mnt.aWZTcE/journal', '--osd-uuid', 
'c83b5aa5-fe77-42f6-9415-25ca0266fb7f', '--keyring', 
'/var/lib/ceph/tmp/mnt.aWZTcE/keyring']' returned non-zero exit status 1 
ceph-disk: Error: One or more partitions failed to activate 

Maybe related to the "(22) Invalid argument" part..? 

/Jesper 

* 

Hi, 

I have done several reboots, and it did not lead to healthy symlinks :-( 

/Jesper 

 

Hi, 

On 16/12/2015 07:39, Jesper Thorhauge wrote: 
> Hi, 
> 
> A fresh server install on one of my nodes (and yum update) left me with 
> CentOS 6.7 / Ceph 0.94.5. All the other nodes are running Ceph 0.94.2. 
> 
> "ceph-disk prepare /dev/sda /dev/sdc" seems to work as expected, but 
> "ceph-disk activate / dev/sda1" fails. I have traced the problem to 
> "/dev/disk/by-partuuid", where the journal symlinks are broken; 
> 
> -rw-r--r-- 1 root root 0 Dec 16 07:35 1e9d527f-0866-4284-b77c-c1cb04c5a168 
> -rw-r--r-- 1 root root 0 Dec 16 07:35 c34d4694-b486-450d-b57f-da24255f0072 
> lrwxrwxrwx 1 root root 10 Dec 16 07:35 c83b5aa5-fe77-42f6-9415-25ca0266fb7f 
> -> ../../sdb1 
> lrwxrwxrwx 1 root root 10 Dec 16 07:35 e85f4d92-c8f1-4591-bd2a-aa43b80f58f6 
> -> ../../sda1 
> 
> Re-creating them manually won't survive a reboot. Is this a problem with the 
> udev rules in Ceph 0.94.3+? 

This usually is a symptom of something else going wrong (i.e. it is possible to 
confuse the kernel into creating the wrong symbolic links). The correct 
symlinks should be set when you reboot. 

> Hope that somebody can help me :-) 

Please let us know if rebooting leads to healthy symlinks. 

Cheers 
> 
> Thanks! 
> 
> Best regards, 
> Jesper 
> 
> 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre 


___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Journal symlink broken / Ceph 0.94.5 / CentOS 6.7

2015-12-16 Thread Jesper Thorhauge
Hi Loic, 

OSDs are on /dev/sda and /dev/sdb, journals are on /dev/sdc (sdc3 / sdc4). 

sgdisk for sda shows; 

Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown) 
Partition unique GUID: E85F4D92-C8F1-4591-BD2A-AA43B80F58F6 
First sector: 2048 (at 1024.0 KiB) 
Last sector: 1953525134 (at 931.5 GiB) 
Partition size: 1953523087 sectors (931.5 GiB) 
Attribute flags:  
Partition name: 'ceph data' 

for sdb 

Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown) 
Partition unique GUID: C83B5AA5-FE77-42F6-9415-25CA0266FB7F 
First sector: 2048 (at 1024.0 KiB) 
Last sector: 1953525134 (at 931.5 GiB) 
Partition size: 1953523087 sectors (931.5 GiB) 
Attribute flags:  
Partition name: 'ceph data' 

for /dev/sdc3 

Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown) 
Partition unique GUID: C34D4694-B486-450D-B57F-DA24255F0072 
First sector: 935813120 (at 446.2 GiB) 
Last sector: 956293119 (at 456.0 GiB) 
Partition size: 20480000 sectors (9.8 GiB) 
Attribute flags:  
Partition name: 'ceph journal' 

for /dev/sdc4 

Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown) 
Partition unique GUID: 1E9D527F-0866-4284-B77C-C1CB04C5A168 
First sector: 956293120 (at 456.0 GiB) 
Last sector: 976773119 (at 465.8 GiB) 
Partition size: 20480000 sectors (9.8 GiB) 
Attribute flags:  
Partition name: 'ceph journal' 

60-ceph-partuuid-workaround.rules is located in /lib/udev/rules.d, so it seems 
correct to me. 

after a reboot, /dev/disk/by-partuuid is; 

-rw-r--r-- 1 root root 0 Dec 16 07:35 1e9d527f-0866-4284-b77c-c1cb04c5a168 
-rw-r--r-- 1 root root 0 Dec 16 07:35 c34d4694-b486-450d-b57f-da24255f0072 
lrwxrwxrwx 1 root root 10 Dec 16 07:35 c83b5aa5-fe77-42f6-9415-25ca0266fb7f -> 
../../sdb1 
lrwxrwxrwx 1 root root 10 Dec 16 07:35 e85f4d92-c8f1-4591-bd2a-aa43b80f58f6 -> 
../../sda1 

I don't know how to verify the symlink of the journal file - can you guide me on 
that one? 

Thanks :-) ! 

/Jesper 

** 

Hi, 

On 17/12/2015 07:53, Jesper Thorhauge wrote: 
> Hi, 
> 
> Some more information showing in the boot.log; 
> 
> 2015-12-16 07:35:33.289830 7f1b990ad800 -1 
> filestore(/var/lib/ceph/tmp/mnt.aWZTcE) mkjournal error creating journal on 
> /var/lib/ceph/tmp/mnt.aWZTcE/journal: (22) Invalid argument 
> 2015-12-16 07:35:33.289842 7f1b990ad800 -1 OSD::mkfs: ObjectStore::mkfs 
> failed with error -22 
> 2015-12-16 07:35:33.289883 7f1b990ad800 -1 ** ERROR: error creating empty 
> object store in /var/lib/ceph/tmp/mnt.aWZTcE: (22) Invalid argument 
> ERROR:ceph-disk:Failed to activate 
> ceph-disk: Command '['/usr/bin/ceph-osd', '--cluster', 'ceph', '--mkfs', 
> '--mkkey', '-i', '7', '--monmap', 
> '/var/lib/ceph/tmp/mnt.aWZTcE/activate.monmap', '--osd-data', 
> '/var/lib/ceph/tmp/mnt.aWZTcE', '--osd-journal', 
> '/var/lib/ceph/tmp/mnt.aWZTcE/journal', '--osd-uuid', 
> 'c83b5aa5-fe77-42f6-9415-25ca0266fb7f', '--keyring', 
> '/var/lib/ceph/tmp/mnt.aWZTcE/keyring']' returned non-zero exit status 1 
> ceph-disk: Error: One or more partitions failed to activate 
> 
> Maybe related to the "(22) Invalid argument" part..? 

After a reboot the symlinks are reconstructed and if they are still incorrect, 
it means there is an inconsistency somewhere else. To debug the problem, could 
you mount /dev/sda1 and verify the symlink of the journal file ? Then verify 
the content of /dev/disk/by-partuuid. And also display the partition 
information with sgdisk -i 1 /dev/sda and sgdisk -i 2 /dev/sda. Are you 
collocating your journal with the data, on the same disk ? Or are they on two 
different disks ? 
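
Concretely, that check might look something like this (a sketch - /mnt used 
as a throwaway mount point, partition numbers as in this setup): 

# mount the data partition and see where its journal symlink points 
mount /dev/sda1 /mnt 
ls -l /mnt/journal 
readlink /mnt/journal 
umount /mnt 
# compare with what actually exists under /dev/disk/by-partuuid 
ls -l /dev/disk/by-partuuid/ 
# and the GPT metadata for the data and journal partitions 
sgdisk -i 1 /dev/sda 
sgdisk -i 3 /dev/sdc 
sgdisk -i 4 /dev/sdc 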

git log --no-merges --oneline tags/v0.94.3..tags/v0.94.5 udev 

shows nothing, meaning there has been no change to udev rules. There is one 
change related to the installation of the udev rules 
https://github.com/ceph/ceph/commit/4eb58ad2027148561d94bb43346b464b55d041a6. 
Could you double check 60-ceph-partuuid-workaround.rules is installed where it 
should ? 

Cheers 

> 
> /Jesper 
> 
> * 
> 
> Hi, 
> 
> I have done several reboots, and it did not lead to healthy symlinks :-( 
> 
> /Jesper 
> 
>  
> 
> Hi, 
> 
> On 16/12/2015 07:39, Jesper Thorhauge wrote: 
>> Hi, 
>> 
>> A fresh server install on one of my nodes (and yum update) left me with 
>> CentOS 6.7 / Ceph 0.94.5. All the other nodes are running Ceph 0.94.2. 
>> 
>> "ceph-disk prepare /dev/sda /dev/sdc" seems to work as expected, but 
>> "ceph-disk activate / dev/sda1" fails. I have traced the problem to 
>> "/dev/disk/by-partuuid", where the journal symlinks are broken; 
>> 
>> -rw-r--r-- 1 root root 0 Dec 16 07:35 1e9d527f-0866-4284-b77c-c1cb04c5a168 
>> -rw-r--r-- 1 root root 0 Dec 16 07:35 c34d46

Re: [ceph-users] Journal symlink broken / Ceph 0.94.5 / CentOS 6.7

2015-12-16 Thread Jesper Thorhauge
Hi, 

I have done several reboots, and it did not lead to healthy symlinks :-( 

/Jesper 

 

Hi, 

On 16/12/2015 07:39, Jesper Thorhauge wrote: 
> Hi, 
> 
> A fresh server install on one of my nodes (and yum update) left me with 
> CentOS 6.7 / Ceph 0.94.5. All the other nodes are running Ceph 0.94.2. 
> 
> "ceph-disk prepare /dev/sda /dev/sdc" seems to work as expected, but 
> "ceph-disk activate / dev/sda1" fails. I have traced the problem to 
> "/dev/disk/by-partuuid", where the journal symlinks are broken; 
> 
> -rw-r--r-- 1 root root 0 Dec 16 07:35 1e9d527f-0866-4284-b77c-c1cb04c5a168 
> -rw-r--r-- 1 root root 0 Dec 16 07:35 c34d4694-b486-450d-b57f-da24255f0072 
> lrwxrwxrwx 1 root root 10 Dec 16 07:35 c83b5aa5-fe77-42f6-9415-25ca0266fb7f 
> -> ../../sdb1 
> lrwxrwxrwx 1 root root 10 Dec 16 07:35 e85f4d92-c8f1-4591-bd2a-aa43b80f58f6 
> -> ../../sda1 
> 
> Re-creating them manually won't survive a reboot. Is this a problem with the 
> udev rules in Ceph 0.94.3+? 

This usually is a symptom of something else going wrong (i.e. it is possible to 
confuse the kernel into creating the wrong symbolic links). The correct 
symlinks should be set when you reboot. 

> Hope that somebody can help me :-) 

Please let us know if rebooting leads to healthy symlinks. 

Cheers 
> 
> Thanks! 
> 
> Best regards, 
> Jesper 
> 
> 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Journal symlink broken / Ceph 0.94.5 / CentOS 6.7

2015-12-15 Thread Jesper Thorhauge
Hi, 

A fresh server install on one of my nodes (and yum update) left me with CentOS 
6.7 / Ceph 0.94.5. All the other nodes are running Ceph 0.94.2. 

"ceph-disk prepare /dev/sda /dev/sdc" seems to work as expected, but "ceph-disk 
activate /dev/sda1" fails. I have traced the problem to 
"/dev/disk/by-partuuid", where the journal symlinks are broken; 

-rw-r--r-- 1 root root 0 Dec 16 07:35 1e9d527f-0866-4284-b77c-c1cb04c5a168 
-rw-r--r-- 1 root root 0 Dec 16 07:35 c34d4694-b486-450d-b57f-da24255f0072 
lrwxrwxrwx 1 root root 10 Dec 16 07:35 c83b5aa5-fe77-42f6-9415-25ca0266fb7f -> 
../../sdb1 
lrwxrwxrwx 1 root root 10 Dec 16 07:35 e85f4d92-c8f1-4591-bd2a-aa43b80f58f6 -> 
../../sda1 

Re-creating them manually won't survive a reboot. Is this a problem with the 
udev rules in Ceph 0.94.3+? 
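
For what it's worth, a quick way to confirm which package owns the udev rules 
on the freshly installed node (assuming they ended up in /lib/udev/rules.d 
like the partuuid workaround rule) might be: 

rpm -qf /lib/udev/rules.d/95-ceph-osd.rules 
rpm -qf /lib/udev/rules.d/60-ceph-partuuid-workaround.rules 
rpm -q ceph udev 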

Hope that somebody can help me :-) 

Thanks! 

Best regards, 
Jesper 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com