[ceph-users] Problem formatting erasure coded image

2019-09-22 Thread David Herselman
Hi,

I'm seeing errors in Windows VM guests's event logs, for example:
The IO operation at logical block address 0x607bf7 for Disk 1 (PDO name 
\Device\001e) was retried
Log Name: System
Source: Disk
Event ID: 153
Level: Warning

Initialising the disk to use GPT is successful but attempting to create a 
standard NTFS volume eventually times out and fails.


Pretty sure this is in production in numerous environments, so I must be doing 
something wrong... Could anyone please validate that an RBD erasure coded image 
behind a cache tier can be used as a Windows VM data disk?


Running Ceph Nautilus 14.2.4 with kernel 5.0.21

Created a new erasure coded pool backed by spinners and a new replicated SSD pool 
for metadata:
ceph osd erasure-code-profile set ec32_hdd \
  plugin=jerasure k=3 m=2 technique=reed_sol_van \
  crush-root=default crush-failure-domain=host crush-device-class=hdd \
  directory=/usr/lib/ceph/erasure-code;
ceph osd pool create ec_hdd 64 erasure ec32_hdd;
ceph osd pool set ec_hdd allow_ec_overwrites true;
ceph osd pool application enable ec_hdd rbd;
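
As a sanity check, the profile and the overwrite flag can be confirmed with, for example:
ceph osd erasure-code-profile get ec32_hdd;
ceph osd pool get ec_hdd erasure_code_profile;
ceph osd pool get ec_hdd allow_ec_overwrites;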

ceph osd crush rule create-replicated replicated_ssd default host ssd;
ceph osd pool create rbd_ssd 64 64 replicated replicated_ssd;
ceph osd pool application enable rbd_ssd rbd;

rbd create rbd_ssd/surveylance-recordings --size 1T --data-pool ec_hdd;

Added a caching tier:
ceph osd pool create ec_hdd_cache 64 64 replicated replicated_ssd;
ceph osd tier add ec_hdd ec_hdd_cache;
ceph osd tier cache-mode ec_hdd_cache writeback;
ceph osd tier set-overlay ec_hdd ec_hdd_cache;
ceph osd pool set ec_hdd_cache hit_set_type bloom;

ceph osd pool set ec_hdd_cache hit_set_count 12
ceph osd pool set ec_hdd_cache hit_set_period 14400
ceph osd pool set ec_hdd_cache target_max_bytes $[128*1024*1024*1024]
ceph osd pool set ec_hdd_cache min_read_recency_for_promote 2
ceph osd pool set ec_hdd_cache min_write_recency_for_promote 2
ceph osd pool set ec_hdd_cache cache_target_dirty_ratio 0.4
ceph osd pool set ec_hdd_cache cache_target_dirty_high_ratio 0.6
ceph osd pool set ec_hdd_cache cache_target_full_ratio 0.8
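
The tier wiring and hit_set settings can be reviewed afterwards with, for example:
ceph osd pool ls detail | grep ec_hdd
which should show the overlay on ec_hdd, cache_mode writeback and the bloom hit_set on ec_hdd_cache.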


Image appears to have been created correctly:
rbd ls rbd_ssd -l
NAME                     SIZE    PARENT   FMT   PROT   LOCK
surveylance-recordings   1 TiB            2

rbd info rbd_ssd/surveylance-recordings
rbd image 'surveylance-recordings':
size 1 TiB in 262144 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: 7341cc54df71f
data_pool: ec_hdd
block_name_prefix: rbd_data.2.7341cc54df71f
format: 2
features: layering, data-pool
op_features:
flags:
create_timestamp: Sun Sep 22 17:47:30 2019
access_timestamp: Sun Sep 22 17:47:30 2019
modify_timestamp: Sun Sep 22 17:47:30 2019
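
To rule the guest out, the image can also be exercised directly from a client host with
rbd bench, for example (the sizes and thread count here are only illustrative):
rbd bench --io-type write --io-size 4K --io-threads 16 --io-total 256M --io-pattern rand rbd_ssd/surveylance-recordings;
If that stalls as well, the problem is below the VM; if it runs cleanly, the focus shifts
to the hypervisor/guest side.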

Ceph appears healthy:
ceph -s
  cluster:
id: 31f6ea46-12cb-47e8-a6f3-60fb6bbd1782
health: HEALTH_OK

  services:
mon: 3 daemons, quorum kvm1a,kvm1b,kvm1c (age 5d)
mgr: kvm1c(active, since 5d), standbys: kvm1b, kvm1a
mds: cephfs:1 {0=kvm1c=up:active} 2 up:standby
osd: 24 osds: 24 up (since 4d), 24 in (since 4d)

  data:
pools:   9 pools, 417 pgs
objects: 325.04k objects, 1.1 TiB
usage:   3.3 TiB used, 61 TiB / 64 TiB avail
pgs: 417 active+clean

  io:
client:   25 KiB/s rd, 2.7 MiB/s wr, 17 op/s rd, 306 op/s wr
cache:0 op/s promote

ceph df
  RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED    %RAW USED
    hdd        62 TiB      59 TiB     2.9 TiB      2.9 TiB         4.78
    ssd       2.4 TiB     2.1 TiB     303 GiB      309 GiB        12.36
    TOTAL      64 TiB      61 TiB     3.2 TiB      3.3 TiB         5.07

  POOLS:
    POOL                    ID    STORED     OBJECTS    USED       %USED    MAX AVAIL
    rbd_hdd                  1    995 GiB    289.54k    2.9 TiB     5.23       18 TiB
    rbd_ssd                  2     17 B            4     48 KiB     0         666 GiB
    rbd_hdd_cache            3     99 GiB     34.91k    302 GiB    13.13      666 GiB
    cephfs_data              4    2.1 GiB        526    6.4 GiB     0.01       18 TiB
    cephfs_metadata          5    767 KiB         22    3.7 MiB     0          18 TiB
    device_health_metrics    6    5.9 MiB         24    5.9 MiB     0          18 TiB
    ec_hdd                  10    4.0 MiB          3    7.5 MiB     0          32 TiB
    ec_hdd_cache            11     67 MiB         30    200 MiB     0         666 GiB



Regards
David Herselman

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need advice with setup planning

2019-09-22 Thread Martin Verges
Hello Salsa,

Amazing! Where were you 3 months ago? Only problem is that I think we have
> no more budget for this so I can't get approval for a software license.
>

We were here and at some Ceph days as well. We do provide a completely
free version with limited features (no HA, LDAP, ...), but with the
benefit of not having to worry about installation, upgrades and a lot
of day-to-day stuff.
You therefore don't need to pay us a single cent to benefit from a tested
Ceph software stack (OS, libs, userspace, ...). You can install and
manage everything needed for Ceph to provide RBD, for example to consume
with Proxmox, and every extended feature can still be used from the
command line as you would without our software.

Please keep in mind that our license always includes support that you would
otherwise have to pay for (with your own time, or by paying someone else).

The service is critical and we are afraid that the network might be
> congested and QoS for the end user degrades
>

If the service is critical, a 10G network would be the choice. However, one
of our test clusters with 11 systems does have dual 1GbE links that work
perfectly fine. They just need to be configured correctly with a good hash
policy (we use layer3+4 in our software)
and of course will never achieve the full 10G performance.
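
As a rough illustration only, an LACP bond with a layer3+4 transmit hash can be
created with iproute2 roughly like this (the interface names eno1/eno2 are
placeholders, and the switch ports must be configured for 802.3ad):
ip link add bond0 type bond mode 802.3ad xmit_hash_policy layer3+4 miimon 100
ip link set eno1 down && ip link set eno1 master bond0
ip link set eno2 down && ip link set eno2 master bond0
ip link set bond0 up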

Btw, a brand new single port 10G card only costs ~40€ (used from 6€), dual
port starting from ~80€ (used from 20€). If it is a critical storage system,
that little money should always be possible to spend.

Great! Thanks for the help and congratulations on that demo. It is the best
> I've used and the easiest ceph setup I've found. As feedback, the last part
> of the demo tutorial is not 100% compatible with the master branch from
> github.
>

Thanks for the feedback, we do need to update the videos to the newest
version but our time is unfortunately limited :/

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Fri, 20 Sep 2019 at 18:34, Martin Verges <martin.ver...@croit.io> wrote:

> Hello Salsa,
>
> I have tested Ceph using VMs but never got to put it to use and had a lot
>> of trouble to get it to install.
>>
> if you want to get rid of all the troubles from installing to day2day
> operations, you could consider using https://croit.io/croit-virtual-demo
>
> - Use 2 HDDs for the OS using RAID 1 (I've left 3.5TB unallocated in case I
>> can use it later for storage)
>> - Install CentOS 7.7
>>
> That is ok, but it won't be necessary if you choose croit, as we boot from the
> network and don't install an operating system.
>
> - Use 2 vLANs, one for ceph internal usage and another for external
>> access. Since they have 4 network adapters, I'll try to bond them in pairs to
>> speed up network (1Gb).
>>
> If there is no internal policy that forces you to use separate networks,
> you can use a simple single-VLAN setup and bond 4x 1GbE. Otherwise it's ok.
>
>
>> - I'll try to use ceph-ansible for installation. I failed to use it on
>> lab, but it seems more recommended.
>> - Install Ceph Nautilus
>>
> Ultra easy with croit, maybe look at our videos on youtube -
> https://www.youtube.com/playlist?list=PL1g9zo59diHDSJgkZcMRUq6xROzt_YKox
>
>
>> - Each server will host OSD, MON, MGR and MDS.
>>
> Ok, but you should use SSDs for metadata.
>
>
>> - One VM for ceph-admin: This wil be used to run ceph-ansible and maybe
>> to host some ceph services later
>>
> perfect for croit ;)
>
>
>> - I'll have to serve samba, iscsi and probably NFS too. Not sure how or
>> on which servers.
>>
> Just put it on the servers as well, with croit it is just a click away and
> everything is included in our interface.
> If not using croit, you can still install it on the same systems and
> configure it by hand/script.
>
> --
> Martin Verges
> Managing director
>
> Mobile: +49 174 9335695
> E-Mail: martin.ver...@croit.io
> Chat: https://t.me/MartinVerges
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
>
> Web: https://croit.io
> YouTube: https://goo.gl/PGE1Bx
>
>
> On Fri, 20 Sep 2019 at 18:14, Salsa wrote:
>
>> I have tested Ceph using VMs but never got to put it to use and had a lot
>> of trouble to get it to install.
>>
>> Now I've been asked to do a production setup using 3 servers (Dell R740)
>> with 12 4TB disks each.
>>
>> My plan is this:
>> - Use 2 HDDs for the OS using RAID 1 (I've left 3.5TB unallocated in case I
>> can use it later for storage)
>> - Install CentOS 7.7
>> - Use 2 vLANs, one for ceph internal usage and another for external
>> access. Since they have 4 network adapters, I'll try to bond them in pairs to
>> speed up network (1Gb).
>> - I'll try to use ceph-ansible for installation. I failed to use it on
>> lab, but it seems more recommended.

Re: [ceph-users] Looking for the best way to utilize 1TB NVMe added to the host with 8x3TB HDD OSDs

2019-09-22 Thread Wladimir Mutel

Ashley Merrick wrote:

Correct, in a large cluster no problem.

I was talking about Wladimir's setup, where they are running a single node with 
a failure domain of OSD, which would mean a loss of all OSDs and all data.


	Sure, I am aware that running with 1 NVMe is risky, so we have a plan to 
add a mirrored NVMe to it at some point in the future. Hope this can be solved 
by using a simple mdadm+LVM scheme.
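
As a rough sketch of that idea (device names and sizes are placeholders only), two
NVMe partitions could be mirrored with mdadm and then carved up with LVM to hold
the DB/WAL volumes:
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1p1 /dev/nvme1n1p1
pvcreate /dev/md0
vgcreate ceph-db /dev/md0
lvcreate -L 60G -n osd0-db ceph-db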


	Btw, are there any recommendations on the cheapest Ceph node hardware? Now 
I understand that 8x3TB HDDs in a single host is quite a centralized 
setup. And I have a feeling that a good Ceph cluster should have more 
hosts than OSDs in each host. Like, with 8 OSDs per host, at least 8 
hosts. Or at least 3 hosts with 3 OSDs in each. Right? And then it 
would be reasonable to add a single NVMe per host to allow any component 
of the host to fail within failure domain=host.
I am still thinking within the cheapest concept of multiple HDDs + a 
single NVMe per host.


 On Sun, 22 Sep 2019 03:42:52 +0800 solarflow99 <solarflo...@gmail.com> wrote 


now my understanding is that an NVMe drive is recommended to help
speed up bluestore.  If it were to fail then those OSDs would be
lost but assuming there is 3x replication and enough OSDs I don't
see the problem here.  There are other scenarios where a whole
server might be lost, it doesn't mean the total loss of the cluster.


On Sat, Sep 21, 2019 at 5:27 AM Ashley Merrick
<singap...@amerrick.co.uk> wrote:

Placing it as a journal / BlueStore DB/WAL will help with writes
mostly; by the sounds of it you want to increase read
performance? How important is the data on this Ceph cluster?

If you place it as a Journal DB/WAL any failure of it will cause
total data loss so I would very much advise against this unless
this is totally for testing and total data loss is not an issue.

In that case it is worth upgrading to BlueStore by rebuilding each
OSD, placing the DB/WAL on an SSD partition; you can do this one
OSD at a time but there is no migration path so you would need
to wait for data rebuilding after each OSD change before moving
onto the next.

If you need to make sure your data is safe then you're really
limited to using it as a read-only cache, but I think even then
most setups would cause all OSDs to go offline until you
manually removed it from the read-only cache if the disk failed.
However, bcache/dm-cache may support this automatically, but it
is still a risk that I personally wouldn't want to take.

Also, it really depends on your use of Ceph and the expected I/O
activity as to what the best option may be.



 On Fri, 20 Sep 2019 14:56:12 +0800 Wladimir Mutel
<m...@mwg.dp.ua> wrote 

Dear everyone,

Last year I set up an experimental Ceph cluster (still single node,
failure domain = osd, MB Asus P10S-M WS, CPU Xeon E3-1235L, RAM 64 GB,
HDDs WD30EFRX, Ubuntu 18.04, now with kernel 5.3.0 from Ubuntu mainline
PPA and Ceph 14.2.4 from download.ceph.com/debian-nautilus/dists/bionic).
I set up a JErasure 2+1 pool, created some RBDs using that as data pool
and exported them by iSCSI (using tcmu-runner, gwcli and associated
packages). But with an HDD-only setup their performance was less than
stellar, not saturating even 1Gbit Ethernet on RBD reads.

This year my experiment was funded with a Gigabyte PCIe NVMe 1TB SSD
(GP-ASACNE2100TTTDR). Now it is plugged into the MB and is visible as a
storage device to lsblk. Also I can see its 4 interrupt queues in
/proc/interrupts, and its transfer rate measured by hdparm -t is about
2.3 GB/sec.

And now I want to ask your advice on how to best include it into this
already existing setup. Should I allocate it for OSD journals and
databases? Is there a way to reconfigure an existing OSD in this way
without destroying and recreating it? Or are there plans to ease this
kind of migration? Can I add it as a write-absorbing cache to
individual RBD images? To individual block devices at the level of
bcache/dm-cache? What about speeding up RBD reads?

I would appreciate reading your opinions and recommendations.
(Just want to warn you that in this situation I don't have the financial
option of going full-SSD.)

Thank you all in advance for your response

Re: [ceph-users] Looking for the best way to utilize 1TB NVMe added to the host with 8x3TB HDD OSDs

2019-09-22 Thread Ashley Merrick
Correct, in a large cluster no problem.

I was talking about Wladimir's setup, where they are running a single node with a 
failure domain of OSD, which would mean a loss of all OSDs and all data.




 On Sun, 22 Sep 2019 03:42:52 +0800 solarflow99 wrote 


now my understanding is that an NVMe drive is recommended to help speed up 
bluestore.  If it were to fail then those OSDs would be lost but assuming there 
is 3x replication and enough OSDs I don't see the problem here.  There are 
other scenarios where a whole server might be lost, it doesn't mean the total 
loss of the cluster.





On Sat, Sep 21, 2019 at 5:27 AM Ashley Merrick wrote:





Placing it as a journal / BlueStore DB/WAL will help with writes mostly; by the 
sounds of it you want to increase read performance? How important is the data 
on this Ceph cluster?

If you place it as a Journal DB/WAL any failure of it will cause total data 
loss so I would very much advise against this unless this is totally for 
testing and total data loss is not an issue.

In that case it is worth upgrading to BlueStore by rebuilding each OSD, placing the 
DB/WAL on an SSD partition; you can do this one OSD at a time, but there is no 
migration path so you would need to wait for data rebuilding after each OSD 
change before moving onto the next.

If you need to make sure your data is safe then you're really limited to using it 
as a read-only cache, but I think even then most setups would cause all OSDs to 
go offline until you manually removed it from the read-only cache if the disk 
failed.
However, bcache/dm-cache may support this automatically, but it is still a risk 
that I personally wouldn't want to take.

Also, it really depends on your use of Ceph and the expected I/O activity as to 
what the best option may be.




 On Fri, 20 Sep 2019 14:56:12 +0800 Wladimir Mutel wrote 


Dear everyone, 
 
Last year I set up an experimental Ceph cluster (still single node, 
failure domain = osd, MB Asus P10S-M WS, CPU Xeon E3-1235L, RAM 64 GB, 
HDDs WD30EFRX, Ubuntu 18.04, now with kernel 5.3.0 from Ubuntu mainline 
PPA and Ceph 14.2.4 from http://download.ceph.com/debian-nautilus/dists/bionic 
). I set up JErasure 2+1 pool, created some RBDs using that as data pool 
and exported them by iSCSI (using tcmu-runner, gwcli and associated 
packages). But with HDD-only setup their performance was less than 
stellar, not saturating even 1Gbit Ethernet on RBD reads. 
 
This year my experiment was funded with Gigabyte PCIe NVMe 1TB SSD 
(GP-ASACNE2100TTTDR). Now it is plugged in the MB and is visible as a 
storage device to lsblk. Also I can see its 4 interrupt queues in 
/proc/interrupts, and its transfer measured by hdparm -t is about 2.3GB/sec. 
 
And now I want to ask your advice on how to best include it into this 
already existing setup. Should I allocate it for OSD journals and 
databases ? Is there a way to reconfigure existing OSD in this way 
without destroying and recreating it ? Or are there plans to ease this 
kind of migration ? Can I add it as a write-absorbing cache to 
individual RBD images ? To individual block devices at the level of 
bcache/dm-cache ? What about speeding up RBD reads ? 
 
I would appreciate to read your opinions and recommendations. 
(just want to warn you that in this situation I don't have financial 
option of going full-SSD) 
 
Thank you all in advance for your response 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 








___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

