[ceph-users] rbd image-meta

2015-07-23 Thread Maged Mokhtar
Hello

I am trying to use the rbd image-meta set command.
I get an error from rbd that this command is not recognized,
yet it is documented in the rbd documentation:
http://ceph.com/docs/next/man/8/rbd/

I am using the Hammer release deployed using ceph-deploy on Ubuntu 14.04.
Is image-meta set supported in rbd in the Hammer release?
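
For reference, a minimal sketch of the invocation being attempted, following the syntax in that man page (pool, image, key and value are made up for illustration):

rbd image-meta set rbd/image01 last_backup 2015-07-22
rbd image-meta list rbd/image01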

Any help much appreciated.
/Maged
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Number of PGs: fix from start or change as we grow ?

2016-08-03 Thread Maged Mokhtar
Hello,

I would like to build a small cluster with 20 disks to start, but in the future I would like to gradually increase it to maybe 200 disks.
Is it better to fix the number of PGs in the pool from the beginning, or is it better to start with a small number and then gradually change the number of PGs as the system grows?
Is the act of changing the number of PGs in a running cluster something that can be done regularly?
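
For illustration, a rough sketch of the usual sizing guideline (total PGs roughly OSDs x 100 / replicas, rounded to a power of two) and the commands used to grow a pool later; the pool name and numbers are assumptions, not a recommendation:

# 20 OSDs, 3 replicas: 20 * 100 / 3 = ~667 -> start around 512 or 1024 PGs
# growing later (at this point in time pg_num can only be increased, never decreased):
ceph osd pool set rbd pg_num 1024
ceph osd pool set rbd pgp_num 1024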

Cheers 
/Maged



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] watch timeout on failure

2017-01-21 Thread Maged Mokhtar
Hi,

If a host with a kernel-mapped rbd image dies, it still keeps a watch on the rbd image header for a timeout that seems to be determined by ms_tcp_read_timeout (default 15 minutes) rather than osd_client_watch_timeout, whereas according to the docs: "If the client loses its connection to the primary OSD for a watched object, the watch will be removed after a timeout configured with osd_client_watch_timeout."

It is possible to force watch removal by blacklisting the failed host, but I was wondering if the above timeout is the correct behavior. This is using the latest 10.2.5.
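
For reference, a sketch of the blacklisting workaround mentioned above (the client address is hypothetical; the watch is dropped once the client is blacklisted):

ceph osd blacklist add 192.168.10.21:0/1023456
ceph osd blacklist ls
# remove the entry again once the failed host has been dealt with
ceph osd blacklist rm 192.168.10.21:0/1023456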

Cheers
/Maged

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] watch timeout on failure

2017-01-21 Thread Maged Mokhtar


Thanks for the clarification..


> On Sat, Jan 21, 2017 at 1:18 PM, Maged Mokhtar <mmokh...@petasan.org>
> wrote:
>> Hi,
>>
>> If a host with a kernel mapped rbd image dies, it still keeps a watch on
>> the rbd image header for a timeout that seems to be determined by
>> ms_tcp_read_timeout ( default 15 minutes ) rather than
>> osd_client_watch_timeout whereas according to the docs: "If the client
>> loses its connection to the primary OSD for a watched object, the watch
>> will be removed after a timeout configured with
>> osd_client_watch_timeout."
>>
>> It is possible to force watch removal by blacklisting the failed host,
>> but
>> i was wondering if the above timeout is the correct behavior. this is
>> using latest 10.2.5
>
> Yeah, it can do that in some cases because kernels up to 4.6 use the
> old watch-notify protocol.  If you upgrade the kernel client to 4.7 or
> higher, all watches should get removed after osd_client_watch_timeout.
>
> Thanks,
>
> Ilya
>


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] help with crush rule

2017-02-18 Thread Maged Mokhtar

Hi,

I have a need to support a small cluster with 3 hosts and 3 replicas given
that in normal operation each replica will be placed on a separate host
but in case one host dies then its replicas could be stored on separate
osds on the 2 live hosts.

I was hoping to write a rule that, in case it could only find 2 replicas on separate nodes, would emit them and do another select/emit to place the remaining replica. Is this possible? I could not find a way to define an if condition or to determine the size of the working vector actually returned.
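
For illustration, a sketch of the closest approximation (pre-Luminous rule syntax): pick 2 hosts, then 2 OSDs under each, so 3 replicas still fit when only 2 hosts are alive. Note it does not keep one replica per host in normal operation, which is exactly the conditional behaviour in question:

rule replicated_3_over_2_hosts {
        ruleset 1
        type replicated
        min_size 3
        max_size 3
        step take default
        step choose firstn 2 type host
        step chooseleaf firstn 2 type osd
        step emit
}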

Cheers /maged

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] help with crush rule

2017-02-27 Thread Maged Mokhtar

Thank you for the clarification.
apology for my late reply /maged



From: Brian Andrus 
Sent: Wednesday, February 22, 2017 2:23 AM
To: Maged Mokhtar 
Cc: ceph-users 
Subject: Re: [ceph-users] help with crush rule


I don't think a CRUSH rule exception is currently possible, but it makes sense 
to me for a feature request.


On Sat, Feb 18, 2017 at 6:16 AM, Maged Mokhtar <mmokh...@petasan.org> wrote:


  Hi,

  I have a need to support a small cluster with 3 hosts and 3 replicas given
  that in normal operation each replica will be placed on a separate host
  but in case one host dies then its replicas could be stored on separate
  osds on the 2 live hosts.

  I was hoping to write a rule that, in case it could only find 2 replicas on separate nodes, would emit them and do another select/emit to place the remaining replica. Is this possible? I could not find a way to define an if condition or to determine the size of the working vector actually returned.

  Cheers /maged

  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






-- 

Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] new Open Source Ceph based iSCSI SAN project

2016-10-18 Thread Maged Mokhtar


Thank you Mike for this update.
I sent you and Dave the relevant changes we found for hyper-v.

Cheers /maged

--
From: "Mike Christie" <mchri...@redhat.com>
Sent: Monday, October 17, 2016 9:40 PM
To: "Maged Mokhtar" <mmokh...@petasan.org>; "Lars Marowsky-Bree" 
<l...@suse.com>; <ceph-users@lists.ceph.com>; "Paul Cuzner" 
<pcuz...@redhat.com>

Subject: Re: [ceph-users] new Open Source Ceph based iSCSI SAN project


If it is just a couple kernel changes you should post them, so SUSE can
merge them in target_core_rbd and we can port them to upstream. You will
not have to carry them and SUSE and I will not have to re-debug the
problems :)

For the (non target_mode approach), everything that is needed for basic
IO, failover and failback (we only support active/passive right now and
no distributed PRs like SUSE) support is merged upstream:

- Linus's tree
(git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git) for
4.9 has the kernel changes.
- The Ceph tree (https://github.com/ceph/ceph) has some rbd command line
tool changes that are needed.
- The multipath tools tree (https://github.com/ceph/ceph) has changes
needed for how we are doing active/passive with the rbd exclusive lock.

So you can build patches against those trees.

For SUSE's approach, I think everything is in SUSE's git trees which you
probably are familiar with already.

Also, if you are going to build off of upstream/distros and/or also
support other distros as a base, Kraken will have these features, and so
will RHEL 7.3 and RHCS 2.1.

And for setup/management Paul Cuzner (https://github.com/pcuzner)
implemented ansible playbooks to set everything up:

https://github.com/pcuzner/ceph-iscsi-ansible
https://github.com/pcuzner/ceph-iscsi-config

Maybe you can use that too, but since you are SUSE based I am guessing
you are using lrbd.


On 10/17/2016 10:24 AM, Maged Mokhtar wrote:

Hi Lars,
Yes I was aware of David Disseldorp & Mike Christie efforts to upstream
the patches from a while back ago. I understand there will be a move
away from the SUSE target_mod_rbd to support a more generic device
handling but do not know what the current status of this work is. We
have made a couple of tweaks to target_mod_rbd to support some issues
we found with hyper-v which could be of use; we would be glad to help
in any way.
We will be moving to Jewel soon, but are still using Hammer simply
because we did not have time to test it well.
In our project we try to focus on HA clustered iSCSI only and make it
easy to setup and use. Drbd will not give a scale-out solution.
I will look into github, maybe it will help us in the future.

Cheers /maged

--
From: "Lars Marowsky-Bree" <l...@suse.com>
Sent: Monday, October 17, 2016 4:21 PM
To: <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] new Open Source Ceph based iSCSI SAN project


On 2016-10-17T13:37:29, Maged Mokhtar <mmokh...@petasan.org> wrote:

Hi Maged,

glad to see our patches caught your attention. You're aware that they
are being upstreamed by David Disseldorp and Mike Christie, right? You
don't have to uplift patches from our backported SLES kernel ;-)

Also, curious why you based this on Hammer; SUSE Enterprise Storage at
this point is based on Jewel. Did you experience any problems with the
older release? The newer one has important fixes.

Is this supposed to be a separate product/project forever? I mean, there
are several management frontends for Ceph at this stage gaining the
iSCSI functionality.

And, lastly, if all I wanted to build was an iSCSI target and not expose
the rest of Ceph's functionality, I'd probably build it around drbd9.

But glad to see the iSCSI frontend is gaining more traction. We have
many customers in the field deploying it successfully with our support
package.

OK, not quite lastly - could you be convinced to make the source code
available in a bit more convenient form? I doubt that's the preferred
form of distribution for development ;-) A GitHub repo maybe?


Regards,
   Lars

--
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar 
Wilde


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Blog Articles

2016-11-12 Thread Maged Mokhtar

Hi Nick,

Maybe not directly related to your use case, but it would be nice to know, at least theoretically, how this latency will increase under heavier loads, specifically near the maximum cluster iops throughput where all cores will be at/near peak utilization.

Would you be able to share any Ceph config parameters you changed to achieve low latency, what I/O scheduler you used, and whether you used jemalloc?

The Mhz per IO article is very interesting too, the single chart packs a
lot of info.

/Maged

> Hi,
>
> Yes, I specifically wanted to make sure the disk part of the
> infrastructure didn't affect the results, the main aims were to reduce
> the end to end latency in the journals and Ceph code by utilising fast
> CPU's and NVME journals. SQL transaction logs are a good
> example where this low latency, low depth behaviour is required.
>
> There are also certain cases with direct io where even though you have
> high queue depths, you can still get contention at the PG
> depending on the IO/PG distribution. Getting latency low as possible also
> helps here as well, as the PG is effectively single
> threaded at some point.
>
> Nick
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Maged Mokhtar
>> Sent: 11 November 2016 21:48
>> To: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Ceph Blog Articles
>>
>>
>>
>> Nice article on write latency. If i understand correctly, this latency
>> is measured while there is no overflow of the journal
> caused by long
>> sustained writes else you will start hitting the HDD latency. Also queue
>> depth you use is 1 ?
>>
>> Will be interested to see your article on hardware.
>>
>> /Maged
>>
>>
>>
>> > Hi All,
>> >
>> > I've recently put together some articles around some of the
>> > performance testing I have been doing.
>> >
>> > The first explores the high level theory behind latency in a Ceph
>> > infrastructure and what we have managed to achieve.
>> >
>> > http://www.sys-pro.co.uk/ceph-write-latency/
>> >
>> > The second explores some of results we got from trying to work out how
>> > much CPU a Ceph IO uses.
>> >
>> > http://www.sys-pro.co.uk/how-many-mhz-does-a-ceph-io-need/
>> >
>> > I hope they are of interest to someone.
>> >
>> > I'm currently working on a couple more explaining the choices behind
>> > the hardware that got us 700us write latency and what we finally
>> > built.
>> >
>> > Nick
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Blog Articles

2016-11-14 Thread Maged Mokhtar

Hi Nick,

Actually I was referring to an all-SSD cluster. I expect the latency to increase from when you have a low load / queue depth to when you have a cluster under heavy load at/near its maximum iops throughput, when the CPU cores are near peak utilization.


Cheers /Maged

--
From: "Nick Fisk" <n...@fisk.me.uk>
Sent: Monday, November 14, 2016 11:41 AM
To: "'Maged Mokhtar'" <mmokh...@petasan.org>; <ceph-users@lists.ceph.com>
Subject: RE: [ceph-users] Ceph Blog Articles


Hi Maged,

I would imagine as soon as you start saturating the disks, the latency 
impact would make the savings from the fast CPU's pointless.
Really you would only try and optimise the latency if you are using SSD 
based cluster.


This was only done with spinning disks in our case with a low Queue Depth 
for investigation purposes. The low latency isn't
something we are currently making use of with this cluster, but has 
enabled us to plan the correct hardware for any future SSD based

clusters.

Nick



-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Maged Mokhtar

Sent: 12 November 2016 16:08
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph Blog Articles


Hi Nick,

Maybe not directly relating to your use case, but it will nice to know, 
at least theoretically, how this latency will increase

under heavier
loads specifically near max. cluster iops throughput where all cores will 
be at/near peak utilization.


Would you be able to share any Ceph config parameters you changed to 
achieve low latency, what i/o scheduler did you use, also did

you use jemalloc ?

The Mhz per IO article is very interesting too, the single chart packs a 
lot of info.


/Maged

> Hi,
>
> Yes, I specifically wanted to make sure the disk part of the
> infrastructure didn't affect the results, the main aims were to reduce
> the end to end latency in the journals and Ceph code by utilising fast
> CPU's and NVME journals. SQL transaction logs are a good example where
> this low latency, low depth behaviour is required.
>
> There are also certain cases with direct io where even though you have
> high queue depths, you can still get contention at the PG depending on
> the IO/PG distribution. Getting latency low as possible also helps
> here as well, as the PG is effectively single threaded at some point.
>
> Nick
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>> Of Maged Mokhtar
>> Sent: 11 November 2016 21:48
>> To: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Ceph Blog Articles
>>
>>
>>
>> Nice article on write latency. If i understand correctly, this
>> latency is measured while there is no overflow of the journal
> caused by long
>> sustained writes else you will start hitting the HDD latency. Also
>> queue depth you use is 1 ?
>>
>> Will be interested to see your article on hardware.
>>
>> /Maged
>>
>>
>>
>> > Hi All,
>> >
>> > I've recently put together some articles around some of the
>> > performance testing I have been doing.
>> >
>> > The first explores the high level theory behind latency in a Ceph
>> > infrastructure and what we have managed to achieve.
>> >
>> > http://www.sys-pro.co.uk/ceph-write-latency/
>> >
>> > The second explores some of results we got from trying to work out
>> > how much CPU a Ceph IO uses.
>> >
>> > http://www.sys-pro.co.uk/how-many-mhz-does-a-ceph-io-need/
>> >
>> > I hope they are of interest to someone.
>> >
>> > I'm currently working on a couple more explaining the choices
>> > behind the hardware that got us 700us write latency and what we
>> > finally built.
>> >
>> > Nick
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Blog Articles

2016-11-11 Thread Maged Mokhtar


Nice article on write latency. If I understand correctly, this latency is measured while there is no overflow of the journal caused by long sustained writes, otherwise you will start hitting the HDD latency. Also, is the queue depth you use 1?
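
(For context, the kind of test meant here is a single-threaded 4k random write against an rbd image, roughly like the sketch below; pool, image and client names are placeholders.)

fio --name=qd1-write --ioengine=rbd --clientname=admin --pool=rbd --rbdname=image01 \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --direct=1 --runtime=60 --time_based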

Will be interested to see your article on hardware.

/Maged



> Hi All,
>
> I've recently put together some articles around some of the performance
> testing I have been doing.
>
> The first explores the high level theory behind latency in a Ceph
> infrastructure and what we have managed to achieve.
>
> http://www.sys-pro.co.uk/ceph-write-latency/
>
> The second explores some of results we got from trying to work out how
> much CPU a Ceph IO uses.
>
> http://www.sys-pro.co.uk/how-many-mhz-does-a-ceph-io-need/
>
> I hope they are of interest to someone.
>
> I'm currently working on a couple more explaining the choices behind the
> hardware that got us 700us write latency and what we
> finally built.
>
> Nick
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] new Open Source Ceph based iSCSI SAN project

2016-10-16 Thread Maged Mokhtar

Hello,

I am happy to announce PetaSAN, an open source scale-out SAN that uses Ceph 
storage and LIO iSCSI Target.

visit us at:
www.petasan.org

your feedback will be much appreciated.
maged mokhtar 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] new Open Source Ceph based iSCSI SAN project

2016-10-17 Thread Maged Mokhtar

Hi Oliver,

If you are referring to clustering reservations through VAAI: we are using upstream code from SUSE Enterprise Storage which adds clustered support for VAAI (compare and write, write same) in the kernel as well as in ceph (implemented as atomic osd operations). We have tested VMware HA and vMotion and they work fine. We have a guide you can download covering this use case.


--
From: "Oliver Dzombic" <i...@ip-interactive.de>
Sent: Sunday, October 16, 2016 10:58 PM
To: <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] new Open Source Ceph based iSCSI SAN project


Hi,

It's using LIO, which means it will have the same compatibility issues with vmware.

So I am wondering why they call it an ideal solution.

--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 16.10.2016 um 18:57 schrieb Maged Mokhtar:

Hello,

I am happy to announce PetaSAN, an open source scale-out SAN that uses
Ceph storage and LIO iSCSI Target.
visit us at:
www.petasan.org

your feedback will be much appreciated.
maged mokhtar
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] new Open Source Ceph based iSCSI SAN project

2016-10-17 Thread Maged Mokhtar

Hi Oliver,

This is our first beta version, we do not support cache tiering. We 
definitely intend to support it.

Cheers /maged

--
From: "Oliver Dzombic" <i...@ip-interactive.de>
Sent: Monday, October 17, 2016 2:05 PM
To: <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] new Open Source Ceph based iSCSI SAN project


Hi Maged,

Thank you for your clarification! That makes it interesting.

I have read that your base is ceph 0.94; in this version using a cache tier is not recommended, if I remember correctly.

Does your code modification also take care of this issue?

--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 17.10.2016 um 13:37 schrieb Maged Mokhtar:

Hi Oliver,

if you are refering to clustering reservations through VAAI. We are
using upstream code from SUSE Enterprise Storage which adds clustered
support for VAAI (compare and write, write same) in the kernel as well
as in ceph (implemented as atomic  osd operations). We have tested
VMware HA and vMotion and they work fine. We have a guide you can
download on this use case.

--
From: "Oliver Dzombic" <i...@ip-interactive.de>
Sent: Sunday, October 16, 2016 10:58 PM
To: <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] new Open Source Ceph based iSCSI SAN project


Hi,

It's using LIO, which means it will have the same compatibility issues with vmware.

So I am wondering why they call it an ideal solution.

--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 16.10.2016 um 18:57 schrieb Maged Mokhtar:

Hello,

I am happy to announce PetaSAN, an open source scale-out SAN that uses
Ceph storage and LIO iSCSI Target.
visit us at:
www.petasan.org

your feedback will be much appreciated.
maged mokhtar
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] new Open Source Ceph based iSCSI SAN project

2016-10-17 Thread Maged Mokhtar

Thank you David very much and thank you for the correction.

--
From: "David Disseldorp" <dd...@suse.de>
Sent: Monday, October 17, 2016 5:24 PM
To: "Maged Mokhtar" <mmokh...@petasan.org>
Cc: <ceph-users@lists.ceph.com>; "Oliver Dzombic" <i...@ip-interactive.de>; 
"Mike Christie" <mchri...@redhat.com>

Subject: Re: [ceph-users] new Open Source Ceph based iSCSI SAN project


Hi Maged,

Thanks for the announcement - good luck with the project!
One comment...

On Mon, 17 Oct 2016 13:37:29 +0200, Maged Mokhtar wrote:


if you are refering to clustering reservations through VAAI. We are using
upstream code from SUSE Enterprise Storage which adds clustered support 
for

VAAI (compare and write, write same) in the kernel as well as in ceph
(implemented as atomic  osd operations). We have tested VMware HA and
vMotion and they work fine. We have a guide you can download on this use
case.


Just so there's no ambiguity here, the vast majority of the clustered
compare-and-write and write-same implementation was done by Mike
Christie from Red Hat.

Cheers, David 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] new Open Source Ceph based iSCSI SAN project

2016-10-17 Thread Maged Mokhtar

Hi Lars,
Yes, I was aware of David Disseldorp's & Mike Christie's efforts to upstream the patches from a while back. I understand there will be a move away from the SUSE target_mod_rbd towards more generic device handling, but I do not know what the current status of this work is. We have made a couple of tweaks to target_mod_rbd to address some issues we found with hyper-v which could be of use; we would be glad to help in any way.
We will be moving to Jewel soon, but are still using Hammer simply because 
we did not have time to test it well.
In our project we try to focus on HA clustered iSCSI only and make it easy 
to setup and use. Drbd will not give a scale-out solution.

I will look into github, maybe it will help us in the future.

Cheers /maged

--
From: "Lars Marowsky-Bree" <l...@suse.com>
Sent: Monday, October 17, 2016 4:21 PM
To: <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] new Open Source Ceph based iSCSI SAN project


On 2016-10-17T13:37:29, Maged Mokhtar <mmokh...@petasan.org> wrote:

Hi Maged,

glad to see our patches caught your attention. You're aware that they
are being upstreamed by David Disseldorp and Mike Christie, right? You
don't have to uplift patches from our backported SLES kernel ;-)

Also, curious why you based this on Hammer; SUSE Enterprise Storage at
this point is based on Jewel. Did you experience any problems with the
older release? The newer one has important fixes.

Is this supposed to be a separate product/project forever? I mean, there
are several management frontends for Ceph at this stage gaining the
iSCSI functionality.

And, lastly, if all I wanted to build was an iSCSI target and not expose
the rest of Ceph's functionality, I'd probably build it around drbd9.

But glad to see the iSCSI frontend is gaining more traction. We have
many customers in the field deploying it successfully with our support
package.

OK, not quite lastly - could you be convinced to make the source code
available in a bit more convenient form? I doubt that's the preferred
form of distribution for development ;-) A GitHub repo maybe?


Regards,
   Lars

--
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 
21284 (AG Nürnberg)

"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Estimate Max IOPS of Cluster

2017-01-04 Thread Maged Mokhtar

Max iops  depends on the hardware type/configuration for disks/cpu/network.

For disks, the theoretical iops limit is:
read = physical disk iops x number of disks
write (with journal on same disk) = physical disk iops x number of disks / number of replicas / 3
In practice real benchmarks will vary widely from this; I've seen numbers from 30 to 80% of the theoretical value.
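
As a rough worked example with assumed numbers (not taken from this thread): 24 HDD OSDs at ~150 iops each, 3 replicas, journals co-located gives read = 150 x 24 = 3600 iops and write = 150 x 24 / 3 / 3 = 400 iops by the formula above, with real benchmarks landing anywhere in that 30-80% band.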

When the number of disks/cpu cores is high, the cpu bottleneck kicks in, again 
it depends on hardware but you could use a performance tool such as atop to 
know when this happens on your setup. There is no theoretical measure of this, 
but one good analysis i find is Nick Fisk:
http://www.sys-pro.co.uk/how-many-mhz-does-a-ceph-io-need/


Cheers
/Maged



From: John Petrini 
Sent: Tuesday, January 03, 2017 10:15 PM
To: ceph-users 
Subject: [ceph-users] Estimate Max IOPS of Cluster


Hello, 


Does any one have a reasonably accurate way to determine the max IOPS of a Ceph 
cluster?


Thank You,

___


John Petrini






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release

2017-01-07 Thread Maged Mokhtar


Adding more nodes is best if you have an unlimited budget :) You should add more OSDs per node until you start hitting CPU or network bottlenecks. Use a perf tool like atop/sysstat to know when this happens.



-------- Original message --------
From: kevin parrikar  
Date: 07/01/2017  19:56  (GMT+02:00) 
To: Lionel Bouton  
Cc: ceph-users@lists.ceph.com 
Subject: Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe 
NIC and 2 replicas -Hammer release 

Wow, that's a lot of good information. I wish I knew about all this before investing in all these devices. Since I don't have any other option, I will get better SSDs and faster HDDs.
I have one more generic question about Ceph.
To increase the throughput of a cluster, what is the standard practice: more OSDs "per" node or more OSD "nodes"?

Thanks a lot for all your help. Learned so many new things, thanks again.

Kevin
On Sat, Jan 7, 2017 at 7:33 PM, Lionel Bouton  
wrote:

  

  
  
On 07/01/2017 at 14:11, kevin parrikar wrote:

  Thanks for your valuable input.

  We were using these SSDs in our NAS box (Synology) and they were giving 13k iops for our fileserver in raid1. We had a few spare disks which we added to our ceph nodes hoping that they would give good performance, same as the NAS box. (I am not comparing NAS with ceph, just the reason why we decided to use these SSDs.)



  We don't have S3520 or S3610 at the moment but can order one of these to see how it performs in ceph. We have 4x S3500 80GB handy.

  If I create a 2 node cluster with 2x S3500 each and a replica of 2, do you think it can deliver 24MB/s of 4k writes?





Probably not. See
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

According to the page above the DC S3500 reaches 39MB/s. Its capacity isn't specified; yours are 80GB only, which is the lowest capacity I'm aware of, and for all DC models I know of the speed goes down with the capacity, so you will probably get lower than that.

If you put both data and journal on the same device you cut your bandwidth in half: this would give you an average <20MB/s per OSD (with occasional peaks above that if you don't have a sustained 20MB/s). With 4 OSDs and size=2, your total write bandwidth is <40MB/s. For a single stream of data you will only get <20MB/s though (you won't benefit from parallel writes to the 4 OSDs and will only write on 2 at a time).

Note that by comparison the 250GB 840 EVO only reaches 1.9MB/s.
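
(For reference, the linked article measures journal suitability with a single-threaded O_DSYNC write test roughly like the sketch below; the device path is a placeholder and the test overwrites whatever is on that device.)

fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 \
    --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test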



But even if you reach the 40MB/s, these models are not designed for heavy writes; you will probably kill them long before their warranty expires (IIRC these are rated for ~24GB of writes per day over the warranty period). In your configuration you only have to write 24GB each day (as you have 4 of them, write both to data and journal, and size=2) to be in this situation (this is an average of only 0.28 MB/s compared to your 24 MB/s target).

  We bought the S3500 because last time when we tried ceph, people were suggesting this model :) :)

The 3500 series might be enough with the higher capacities in some rare cases, but the 80GB model is almost useless.



You have to do the math considering:

- how much you will write to the cluster (guess high if you have to guess),
- if you will use the SSDs for both journals and data (which means writing twice to them),
- your replication level (which means you will write the same data multiple times),
- when you expect to replace the hardware,
- the amount of writes per day they support under warranty (if the manufacturer doesn't present this number prominently they are probably trying to sell you a fast car headed for a brick wall).

If your hardware can't handle the amount of writes you expect to put on it then you are screwed. There were reports of new Ceph users not aware of this and using cheap SSDs that failed in a matter of months, all at the same time. You definitely don't want to be in their position.

In fact, as problems happen (hardware failure leading to cluster storage rebalancing, for example) you should probably get a system able to handle 10x the amount of writes you expect it to handle, and then monitor the SSD SMART attributes to be alerted long before they die and replace them before problems happen. You definitely want a controller allowing access to this information. If you can't, you will have to monitor the writes and guess this value, which is risky as write amplification inside SSDs is not easy to guess...



  

Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release

2017-01-08 Thread Maged Mokhtar

Why would you still be using journals when running OSDs fully on SSDs?
When using a journal the data is first written to a journal, and then that same data is (later on) written again to disk.
This is on the assumption that the time to write the journal is only a fraction of the time it costs to write to disk. And since writing data to stable storage is on the critical path, the journal brings an advantage.
Now when the disk is already an SSD, I see very little difference between writing the data directly to disk and forgoing the journal.


There are advantages to a two-phase commit approach: without a journal, a write could fail half way through with some but not all data written, leading to integrity issues. Also note that the journal writes are done sequentially at the block level, which should be faster than flushing to the filesystem.



--
From: "Willem Jan Withagen" 
Sent: Sunday, January 08, 2017 1:47 PM
To: "Lionel Bouton" ; "kevin parrikar" 


Cc: 
Subject: Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe 
NIC and 2 replicas -Hammer release



On 7-1-2017 15:03, Lionel Bouton wrote:

Le 07/01/2017 à 14:11, kevin parrikar a écrit :

Thanks for your valuable input.
We were using these SSD in our NAS box(synology)  and it was giving
13k iops for our fileserver in raid1.We had a few spare disks which we
added to our ceph nodes hoping that it will give good performance same
as that of NAS box.(i am not comparing NAS with ceph ,just the reason
why we decided to use these SSD)

We dont have S3520 or S3610 at the moment but can order one of these
to see how it performs in ceph .We have 4xS3500  80Gb handy.
If i create a 2 node cluster with 2xS3500 each and with replica of
2,do you think it can deliver 24MB/s of 4k writes .


Probably not. See
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

According to the page above the DC S3500 reaches 39MB/s. Its capacity
isn't specified, yours are 80GB only which is the lowest capacity I'm
aware of and for all DC models I know of the speed goes down with the
capacity so you probably will get lower than that.
If you put both data and journal on the same device you cut your
bandwidth in half : so this would give you an average <20MB/s per OSD
(with occasional peaks above that if you don't have a sustained 20MB/s).
With 4 OSDs and size=2, your total write bandwidth is <40MB/s. For a
single stream of data you will only get <20MB/s though (you won't
benefit from parallel writes to the 4 OSDs and will only write on 2 at a
time).


I'm new to this part of tuning ceph, but I do have an architectual
discussion:

Why would you still be using journals when running fully OSDs on SSDs?

When using a journal the data is first written to a journal, and then that same data is (later on) written again to disk.
This is on the assumption that the time to write the journal is only a fraction of the time it costs to write to disk. And since writing data to stable storage is on the critical path, the journal brings an advantage.

Now when the disk is already an SSD, I see very little difference between writing the data directly to disk and forgoing the journal.
I would imagine that not using journals would cut writing time in half because the data is only written once. There is no loss of bandwidth on the SSD, and internally the SSD does not have to manage double the amount of erase cycles in garbage collection once the SSD comes close to being fully used.

The only thing I can imagine that makes a difference is that journal
writing is slightly faster than writing data into the FS that is used
for the disk. But that should not be such a major extra cost that it
warrants all the other disadvantages.

--WjW

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Estimate Max IOPS of Cluster

2017-01-04 Thread Maged Mokhtar

if you are asking about what tools to use:
http://tracker.ceph.com/projects/ceph/wiki/Benchmark_Ceph_Cluster_Performance

You should run many concurrent processes on different clients 
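
A sketch of what that could look like with rados bench (pool name, block size and thread count are arbitrary); the same command would be started on several client machines at the same time:

rados bench -p testpool 60 write -b 4096 -t 32 --no-cleanup
rados bench -p testpool 60 rand -t 32
rados -p testpool cleanup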



From: Maged Mokhtar 
Sent: Wednesday, January 04, 2017 6:45 PM
To: John Petrini ; ceph-users 
Subject: Re: [ceph-users] Estimate Max IOPS of Cluster



Max iops  depends on the hardware type/configuration for disks/cpu/network.

For disks, the theoretical iops limit is 
read  = physical disk iops x number of disks
write (with journal on same disk) = physical disk iops x number of disks / num 
of replicas / 3
in practice real benchmarks will vary widely from this, I've seen numbers from 
30 to 80 % of theoretical value.

When the number of disks/cpu cores is high, the cpu bottleneck kicks in, again 
it depends on hardware but you could use a performance tool such as atop to 
know when this happens on your setup. There is no theoretical measure of this, 
but one good analysis i find is Nick Fisk:
http://www.sys-pro.co.uk/how-many-mhz-does-a-ceph-io-need/


Cheers
/Maged



From: John Petrini 
Sent: Tuesday, January 03, 2017 10:15 PM
To: ceph-users 
Subject: [ceph-users] Estimate Max IOPS of Cluster


Hello, 


Does any one have a reasonably accurate way to determine the max IOPS of a Ceph 
cluster?


Thank You,

___


John Petrini






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release

2017-01-06 Thread Maged Mokhtar


The numbers are very low. I would first benchmark the system without the vm 
client using rbd 4k test such as:
rbd bench-write image01  --pool=rbd --io-threads=32 --io-size 4096
--io-pattern rand --rbd_cache=false


-------- Original message --------
From: kevin parrikar  
Date: 07/01/2017  05:48  (GMT+02:00) 
To: Christian Balzer  
Cc: ceph-users@lists.ceph.com 
Subject: Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe 
NIC and 2 replicas -Hammer release 

I really need some help here :(

Replaced all 7.2k rpm SAS disks with new Samsung 840 EVO 512GB SSDs with no separate journal disk. Now both OSD nodes have 2 SSD disks each with a replica of 2.
The total number of OSD processes in the cluster is 4, all on SSD.

But throughput has gone down from 1.4 MB/s to 1.3 MB/s for 4k writes, and for 4M it has gone down from 140MB/s to 126MB/s.

Now atop no longer shows the OSD devices as 100% busy.
However I can see both ceph-osd processes in atop with 53% and 47% disk utilization.

 PID      RDDSK    WRDSK     WCANCL    DSK    CMD
 20771    0K       648.8M    0K        53%    ceph-osd
 19547    0K       576.7M    0K        47%    ceph-osd


OSD disks(ssd) utilization from atop

DSK |  sdc | busy  6%  | read  0  | write  517  | KiB/r   0  | KiB/w  293 | 
MBr/s 0.00  | MBw/s 148.18  | avq   9.44  | avio 0.12 ms  |

DSK |  sdd | busy   5% | read   0 | write   336 | KiB/r   0  | KiB/w   292 | 
MBr/s 0.00 | MBw/s  96.12  | avq     7.62  | avio 0.15 ms  |


Queue Depth of OSD disks:
cat /sys/block/sdd/device/queue_depth
256
atop inside virtual machine:[4 CPU/3Gb RAM]
DSK |   vdc  | busy     96%  | read     0  | write  256  | KiB/r   0  | KiB/w  
512  | MBr/s   0.00  | MBw/s 128.00  | avq    7.96  | avio 3.77 ms  |


Both Guest and Host are using deadline I/O scheduler

Virtual Machine Configuration:

                                     
                   
                       
(virtual machine XML stripped by the mail archive; only the domain UUID 449da0e7-6223-457c-b2c6-b5e112099212 survived)



ceph.conf

cat /etc/ceph/ceph.conf
[global]
fsid = c4e1a523-9017-492e-9c30-8350eba1bd51
mon_initial_members = node-16 node-30 node-31
mon_host = 172.16.1.11 172.16.1.12 172.16.1.8
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
log_to_syslog_level = info
log_to_syslog = True
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 64
public_network = 172.16.1.0/24
log_to_syslog_facility = LOG_LOCAL0
osd_journal_size = 2048
auth_supported = cephx
osd_pool_default_pgp_num = 64
osd_mkfs_type = xfs
cluster_network = 172.16.1.0/24
osd_recovery_max_active = 1
osd_max_backfills = 1

[client]
rbd_cache_writethrough_until_flush = True
rbd_cache = True

[client.radosgw.gateway]
rgw_keystone_accepted_roles = _member_, Member, admin, swiftoperator
keyring = /etc/ceph/keyring.radosgw.gateway
rgw_frontends = fastcgi socket_port=9000 socket_host=127.0.0.1
rgw_socket_path = /tmp/radosgw.sock
rgw_keystone_revocation_interval = 100
Any guidance on where to look for issues?

Regards,
Kevin
On Fri, Jan 6, 2017 at 4:42 PM, kevin parrikar  
wrote:
Thanks Christian for your valuable comments; each comment is a new learning for me.
Please see inline.

On Fri, Jan 6, 2017 at 9:32 AM, Christian Balzer  wrote:


Hello,



On Fri, 6 Jan 2017 08:40:36 +0530 kevin parrikar wrote:



> Hello All,

>

> I have setup a ceph cluster based on 0.94.6 release in  2 servers each with

> 80Gb intel s3510 and 2x3 Tb 7.2 SATA disks,16 CPU,24G RAM

> which is connected to a 10G switch with a replica of 2 [ i will add 3 more

> servers to the cluster] and 3 seperate monitor nodes which are vms.

>

I'd go to the latest hammer, this version has a lethal cache-tier bug if

you should decide to try that.



80Gb Intel DC S3510 are a) slow and b) have only 0.3 DWPD.

You're going to wear those out quickly and if not replaced in time lose

data.



2 HDDs give you a theoretical speed of something like 300MB/s sustained,

when used a OSDs I'd expect the usual 50-60MB/s per OSD due to

seeks, journal (file system) and leveldb overheads.

Which perfectly matches your results.

Hmm, that makes sense, it's hitting the 7.2k rpm OSDs' peak write speed. I was under the assumption that the SSD journal flush to the OSD would happen slowly at a later time and hence I could use slower and cheaper disks for the OSDs. But in practice it looks like many articles on the internet that talk about a faster journal and slower OSDs don't seem to be correct.

Will adding more OSD disks per node improve the overall performance?

I can add 4 more disks to each node, but all are 7.2k rpm disks. I am expecting some kind of parallel writes on these disks to magically improve performance :D
This is my second experiment with Ceph; last time I gave up and purchased another

Re: [ceph-users] rbd iscsi gateway question

2017-04-06 Thread Maged Mokhtar
We were beta till early Feb. so we are relatively young. If there are 
issues/bugs, we'd certainly be interested to know through our forum. Note that 
with us you can always use the cli and bypass the UI and it will be straight 
Ceph/LIO commands if you wish.



From: Brady Deetz 
Sent: Thursday, April 06, 2017 3:21 PM
To: ceph-users 
Subject: Re: [ceph-users] rbd iscsi gateway question


I appreciate everybody's responses here. I remember the announcement of Petasan 
a whole back on here and some concerns about it.  


Is anybody using it in production yet? 


On Apr 5, 2017 9:58 PM, "Brady Deetz"  wrote:

  I apologize if this is a duplicate of something recent, but I'm not finding 
much. Does the issue still exist where dropping an OSD results in a LUN's I/O 
hanging? 


  I'm attempting to determine if I have to move off of VMWare in order to 
safely use Ceph as my VM storage.





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd iscsi gateway question

2017-04-06 Thread Maged Mokhtar
The I/O hang (it is actually a pause, not a hang) is caused by Ceph only in the case of a simultaneous failure of 2 hosts or 2 OSDs on separate hosts. A single host/OSD being out will not cause this. In the PetaSAN project (www.petasan.org) we use LIO/krbd. We have done a lot of tests on VMware; in case of I/O failure, the I/O will block for approx 30s on the VMware ESX side (default timeout, but it can be configured), then it will resume on the other MPIO path.

We are using a custom LIO/kernel upstreamed from SLE 12, used in their enterprise storage offering; it supports a direct rbd backstore. I believe there was a request to include it in the mainline kernel but it did not happen, probably waiting for the TCMU solution which will be a better/cleaner design.


Cheers /maged 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Book & questions

2017-08-13 Thread Maged Mokhtar
I would recommend getting all 3 books, they are all very good. I particularly like Nick's book; it covers a lot of hands-on issues and is quite recent.

/Maged 

On 2017-08-13 09:43, Sinan Polat wrote:

> Hi all, 
> 
> I am quite new with Ceph Storage. Currently we have a Ceph environment 
> running, but in a few months we will be setting up a new Ceph storage 
> environment. 
> 
> I have read a lot of information on the Ceph website, but the more 
> information the better for me. What book(s) would you suggest? 
> 
> I found the following books: 
> 
> Learning Ceph - Karan Singh (Jan 2015) 
> 
> Ceph Cookbook - Karan Singh (Feb 2016) 
> 
> Mastering Ceph - Nick Fisk (May 2017) 
> 
> Another question; 
> 
> Ceph is self-healing, it will distribute the replicas to the available OSD's 
> in case of a failure of one of the OSD's. Lets say my setup is configured to 
> have 3 replicas, this means when there is a failure of one the OSD's it will 
> start healing. I want that when an OSD fails and only 2 replicas are left, it 
> shouldn't do anything, only when also the 2nd OSD fails it should start 
> replicating/healing. Which configuration setting do I need to use, is it the 
> min size option? 
> 
> Thanks! 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD journaling benchmarks

2017-07-13 Thread Maged Mokhtar


--
From: "Jason Dillaman" <jdill...@redhat.com>
Sent: Thursday, July 13, 2017 4:45 AM
To: "Maged Mokhtar" <mmokh...@petasan.org>
Cc: "Mohamad Gebai" <mge...@suse.com>; "ceph-users" <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] RBD journaling benchmarks

> On Mon, Jul 10, 2017 at 3:41 PM, Maged Mokhtar <mmokh...@petasan.org> wrote:
>> On 2017-07-10 20:06, Mohamad Gebai wrote:
>>
>>
>> On 07/10/2017 01:51 PM, Jason Dillaman wrote:
>>
>> On Mon, Jul 10, 2017 at 1:39 PM, Maged Mokhtar <mmokh...@petasan.org> wrote:
>>
>> These are significant differences, to the point where it may not make sense
>> to use rbd journaling / mirroring unless there is only 1 active client.
>>
>> I interpreted the results as the same RBD image was being concurrently
>> used by two fio jobs -- which we strongly recommend against since it
>> will result in the exclusive-lock ping-ponging back and forth between
>> the two clients / jobs. Each fio RBD job should utilize its own
>> backing image to avoid such a scenario.
>>
>>
>> That is correct. The single job runs are more representative of the
>> overhead of journaling only, and it is worth noting the (expected)
>> inefficiency of multiple clients for the same RBD image, as explained by
>> Jason.
>>
>> Mohamad
>>
>> Yes i expected a penalty but not as large. There are some use cases that
>> would benefit from concurrent access to the same block device, in vmware ad
>> hyper-v several hypervisors could share the same device which is formatted
>> via a clustered file system like MS CSV ( clustered shared volumes ) or
>> VMFS, which creates a volume/datastore that houses many VMs.
> 
> Both of these use-cases would first need support for active/active
> iSCSI. While A/A iSCSI via MPIO is trivial to enable, getting it to
> properly handle failure conditions without the possibility of data
> corruption is not since it relies heavily on arbitrary initiator and
> target-based timers. The only realistic and safe solution is to rely
> on an MCS-based active/active implementation.

The case also applies to active/passive iSCSI: you still have many initiators/hypervisors writing concurrently to the same rbd image using a clustered file system (CSV/VMFS).

>> I was wondering if such a setup could be supported in the future and maybe
>> there could be a way to minimize the overhead of the exclusive lock..for
>> example by having a distributed sequence number to the different active
>> client writers and have each writer maintain its own journal, i doubt that
>> the overhead will reach the values you showed.
> 
> The journal used by the librbd mirroring feature was designed to
> support multiple concurrent writers. Of course, that original design
> was more inline with the goal of supporting multiple images within a
> consistency group.

Yes, but they will still suffer a performance penalty. My understanding is that they would need the lock while writing the data to the journal entries and thus will be waiting turns; or do they need the lock only for journal metadata, like generating a sequence number?

>> Maged
>>
>>
> 
> -- 
> Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD journaling benchmarks

2017-07-10 Thread Maged Mokhtar
On 2017-07-10 20:06, Mohamad Gebai wrote:

> On 07/10/2017 01:51 PM, Jason Dillaman wrote: On Mon, Jul 10, 2017 at 1:39 
> PM, Maged Mokhtar <mmokh...@petasan.org> wrote: These are significant 
> differences, to the point where it may not make sense
> to use rbd journaling / mirroring unless there is only 1 active client. I 
> interpreted the results as the same RBD image was being concurrently
> used by two fio jobs -- which we strongly recommend against since it
> will result in the exclusive-lock ping-ponging back and forth between
> the two clients / jobs. Each fio RBD job should utilize its own
> backing image to avoid such a scenario.

That is correct. The single job runs are more representative of the
overhead of journaling only, and it is worth noting the (expected)
inefficiency of multiple clients for the same RBD image, as explained by
Jason.

Mohamad

Yes, I expected a penalty but not as large. There are some use cases that would benefit from concurrent access to the same block device: in VMware and Hyper-V several hypervisors could share the same device, formatted with a clustered file system like MS CSV (Clustered Shared Volumes) or VMFS, which creates a volume/datastore that houses many VMs.

I was wondering if such a setup could be supported in the future, and maybe there could be a way to minimize the overhead of the exclusive lock... for example by having a distributed sequence number for the different active client writers and having each writer maintain its own journal; I doubt that the overhead would reach the values you showed.
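
(As a point of reference for the numbers being discussed, a hedged sketch of creating an image with the journaling feature and a wider journal, using the flags cited in the benchmark description; the image name and size are made up:)

rbd create rbd/image01 --size 102400 \
    --image-feature exclusive-lock,journaling --journal-splay-width 32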

Maged
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD journaling benchmarks

2017-07-10 Thread Maged Mokhtar
On 2017-07-10 18:14, Mohamad Gebai wrote:

> Resending as my first try seems to have disappeared.
> 
> Hi,
> 
> We ran some benchmarks to assess the overhead caused by enabling
> client-side RBD journaling in Luminous. The tests consists of:
> - Create an image with journaling enabled  (--image-feature journaling)
> - Run randread, randwrite and randrw workloads sequentially from a
> single client using fio
> - Collect IOPS
> 
> More info:
> - Feature exclusive-lock is enabled with journaling (required)
> - Queue depth of 128 for fio
> - With 1 and 2 threads
> 
> Cluster 1
> 
> 
> - 5 OSD nodes
> - 6 OSDs per node
> - 3 monitors
> - All SSD
> - Bluestore + WAL
> - 10GbE NIC
> - Ceph version 12.0.3-1380-g6984d41b5d
> (6984d41b5d142ce157216b6e757bcb547da2c7d2) luminous (dev)
> 
> Results:
> 
>          Default     Journaling              Jour width 32
> Jobs     IOPS        IOPS       Slowdown     IOPS      Slowdown
> RW
> 1        19521       9104       2.1x         16067     1.2x
> 2        30575       726        42.1x        488       62.6x
> Read
> 1        22775       22946      0.9x         23601     0.9x
> 2        35955       1078       33.3x        446       80.2x
> Write
> 1        18515       6054       3.0x         9765      1.9x
> 2        29586       1188       24.9x        534       55.4x
> 
> - "Default" is the baseline (with journaling disabled)
> - "Journaling" is with journaling enabled
> - "Jour width 32" is with a journal data width of 32 objects
> (--journal-splay-width 32)
> - The major slowdown for two jobs is due to locking
> - With a journal width of 32, the 0.9x slowdown (which is actually a
> speedup) is due to the read-only workload, which doesn't exercise the
> journaling code.
> - The randwrite workload exercises the journaling code the most, and is
> expected to have the highest slowdown, which is 1.9x in this case.
> 
> Cluster 2
> 
> 
> - 3 OSD nodes
> - 10 OSDs per node
> - 1 monitor
> - All HDD
> - Filestore
> - 10GbE NIC
> - Ceph version 12.1.0-289-g117b171715
> (117b1717154e1236b2d37c405a86a9444cf7871d) luminous (dev)
> 
> Results:
> 
>          Default     Journaling              Jour width 32
> Jobs     IOPS        IOPS       Slowdown     IOPS      Slowdown
> RW
> 1        11869       3674       3.2x         4914      2.4x
> 2        13127       736        17.8x        432       30.4x
> Read
> 1        14500       14700      1.0x         14703     1.0x
> 2        16673       3893       4.3x         307       54.3x
> Write
> 1        8267        1925       4.3x         2591      3.2x
> 2        8283        1012       8.2x         417       19.9x
> 
> - The number of IOPS for the write workload is quite low, which is due
> to HDDs and filestore
> 
> Mohamad
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

These are significant differences, to the point where it may not make sense to use rbd journaling / mirroring unless there is only 1 active client. Could there be a future enhancement that tries to make active/active possible? Would it help if each active writer maintained its own queue and only took the lock for a sequence number / counter, to try to minimize the lock overhead of writing to the same journal queue?

Maged
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] krbd journal support

2017-07-06 Thread Maged Mokhtar
Hi all,

Are there any plans to support rbd journal feature in kernel krbd ?
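
(For context, a quick way to check which features an image carries before attempting a krbd map; the image name is a placeholder, and a feature krbd does not support currently has to be disabled before mapping:)

rbd info rbd/image01 | grep features
rbd feature disable rbd/image01 journaling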

Cheers /Maged

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kernel mounted RBD's hanging

2017-06-30 Thread Maged Mokhtar
On 2017-06-29 16:30, Nick Fisk wrote:

> Hi All,
> 
> Putting out a call for help to see if anyone can shed some light on this.
> 
> Configuration:
> Ceph cluster presenting RBD's->XFS->NFS->ESXi
> Running 10.2.7 on the OSD's and 4.11 kernel on the NFS gateways in a
> pacemaker cluster
> Both OSD's and clients are go into a pair of switches, single L2 domain (no
> sign from pacemaker that there is network connectivity issues)
> 
> Symptoms:
> - All RBD's on a single client randomly hang for 30s to several minutes,
> confirmed by pacemaker and ESXi hosts complaining
> - Cluster load is minimal when this happens most times
> - All other clients with RBD's are not affected (Same RADOS pool), so its
> seems more of a client issue than cluster issue
> - It looks like pacemaker tries to also stop RBD+FS resource, but this also
> hangs
> - Eventually pacemaker succeeds in stopping resources and immediately
> restarts them, IO returns to normal
> - No errors, slow requests, or any other non normal Ceph status is reported
> on the cluster or ceph.log
> - Client logs show nothing apart from pacemaker
> 
> Things I've tried:
> - Different kernels (potentially happened less with older kernels, but can't
> be 100% sure)
> - Disabling scrubbing and anything else that could be causing high load
> - Enabling Kernel RBD debugging (Problem maybe happens a couple of times a
> day, debug logging was not practical as I can't reproduce on demand)
> 
> Anyone have any ideas?
> 
> Thanks,
> Nick
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

My suggestion is to do a test with pacemaker out of the picture and run
the NFS gateway(s) without HA; this may give a clue whether it is an
ESXi->NFS issue or a pacemaker issue.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to force "rbd unmap"

2017-07-05 Thread Maged Mokhtar
On 2017-07-05 20:42, Ilya Dryomov wrote:

> On Wed, Jul 5, 2017 at 8:32 PM, David Turner  wrote: 
> 
>> I had this problem occasionally in a cluster where we were regularly mapping
>> RBDs with KRBD.  Something else we saw was that after this happened for
>> un-mapping RBDs, was that it would start preventing mapping some RBDs as
>> well.  We were able to use strace and kill the sub-thread that was stuck to
>> allow the RBD to finish un-mapping, but as it turned out, the server would
>> just continue to hang on KRBD functions until the server was restarted.  The
> 
> Did you ever report this?
> 
> Thanks,
> 
> Ilya
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

This may or may not be related, but if you do a lot of mapping/unmapping
it may be better to load the rbd module with

modprobe rbd single_major=Y

Load it before running targetcli or rtslib, as they will otherwise load it
without this parameter.
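
To make this persistent across reboots you can drop the option into a
modprobe config file, for example (the path/filename is just a convention):

echo "options rbd single_major=Y" > /etc/modprobe.d/rbd.conf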

/Maged
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cluster - configuration tips and reccomendation - NVMe

2017-07-05 Thread Maged Mokhtar
On 2017-07-05 23:22, David Clarke wrote:

> On 07/05/2017 08:54 PM, Massimiliano Cuttini wrote: 
> 
>> Dear all,
>> 
>> luminous is coming and sooner we should be allowed to avoid double writing.
>> This means use 100% of the speed of SSD and NVMe.
>> Cluster made all of SSD and NVMe will not be penalized and start to make
>> sense.
>> 
>> Looking forward I'm building the next pool of storage which we'll setup
>> on next term.
>> We are taking in consideration a pool of 4 with the following single
>> node configuration:
>> 
>> * 2x E5-2603 v4 - 6 cores - 1.70GHz
>> * 2x 32Gb of RAM
>> * 2x NVMe M2 for OS
>> * 6x NVMe U2 for OSD
>> * 2x 100Gib ethernet cards
>> 
>> We have yet not sure about which Intel and how much RAM we should put on
>> it to avoid CPU bottleneck.
>> Can you help me to choose the right couple of CPU?
>> Did you see any issue on the configuration proposed?
> 
> There are notes on ceph.com regarding flash, and NVMe in particular,
> deployments:
> 
> http://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

This is a nice link, but the Ceph configuration is a bit dated: it was
done for Hammer and a couple of config params were dropped in Jewel. I
hope Intel publishes some new settings for Luminous/Bluestore!

In addition to tuning ceph.conf, sysctl and udev, it is important to run
stress benchmarks such as rados bench / rbd bench and measure the system
load via atop/collectl/sysstat. This will tell you where your
bottlenecks are. If you plan to do many tests, you may find CBT (the
Ceph Benchmarking Tool) handy, as you can script incremental tests.
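
For example (the pool name and runtimes are just placeholders), something like:

rados bench -p testpool 60 write -b 4096 -t 16 --no-cleanup
rados bench -p testpool 60 rand -t 16

run while watching atop or iostat -xm on the OSD nodes will quickly show
whether cpu, disks or network saturate first.
___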
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Iscsi configuration

2017-08-09 Thread Maged Mokhtar
Hi Sam, 

Pacemaker will take care of HA failover, but you will need to propagate
the PR (persistent reservation) data yourself.
If you are interested in a solution that works out of the box with
Windows, have a look at PetaSAN 
www.petasan.org
It works well with MS hyper-v/storage spaces/Scale Out File Server. 

Cheers
/Maged 

On 2017-08-09 18:42, Samuel Soulard wrote:

> Hmm :(  Even for an Active/Passive configuration?  I'm guessing we will need 
> to do something with Pacemaker in the meantime? 
> 
> On Wed, Aug 9, 2017 at 12:37 PM, Jason Dillaman  wrote:
> 
>> I can probably say that it won't work out-of-the-gate for Hyper-V
>> since it most likely will require iSCSI persistent reservations. That
>> support is still being added to the kernel because right now it isn't
>> being distributed to all the target portal group nodes.
>> 
>> On Wed, Aug 9, 2017 at 12:30 PM, Samuel Soulard
>> 
>>  wrote:
>>> Thanks! we'll visit back this subject once it is released.  Waiting on this
>>> to perform some tests for Hyper-V/VMware ISCSI LUNs :)
>>> 
>>> Sam
>>> 
>>> On Wed, Aug 9, 2017 at 10:35 AM, Jason Dillaman  wrote:
 
 Yes, RHEL/CentOS 7.4 or kernel 4.13 (once it's released).
 
 On Wed, Aug 9, 2017 at 6:56 AM, Samuel Soulard 
 wrote:
> Hi Jason,
> 
> Oh the documentation is awesome:
> 
> https://github.com/ritz303/ceph/blob/6ab7bc887b265127510c3c3fde6dbad0e047955d/doc/rbd/iscsi-target-cli.rst
>  [1]
> 
> So I assume that this is not yet available for CentOS and requires us to
> wait until CentOS 7.4 is released?
> 
> Thanks for the documentation, it makes everything more clear.
> 
> On Tue, Aug 8, 2017 at 9:37 PM, Jason Dillaman 
> wrote:
>> 
>> We are working hard to formalize active/passive iSCSI configuration
>> across Linux/Windows/ESX via LIO. We have integrated librbd into LIO's
>> tcmu-runner and have developed a set of support applications to
>> managing the clustered configuration of your iSCSI targets. There is
>> some preliminary documentation here [1] that will be merged once we
>> can finish our testing.
>> 
>> [1] https://github.com/ceph/ceph/pull/16182 [2]
>> 
>> On Tue, Aug 8, 2017 at 4:45 PM, Samuel Soulard
>> 
>> wrote:
>>> Hi all,
>>>
>>> Platform : Centos 7 Luminous 12.1.2
>>>
>>> First time here but, are there any guides or guidelines out there on
>>> how
>>> to
>>> configure ISCSI gateways in HA so that if one gateway fails, IO can
>>> continue
>>> on the passive node?
>>>
>>> What I've done so far
>>> -ISCSI node with Ceph client map rbd on boot
>>> -Rbd has exclusive-lock feature enabled and layering
>>> -Targetd service dependent on rbdmap.service
>>> -rbd exported through LUN ISCSI
>>> -Windows ISCSI imitator can map the lun and format / write to it
>>> (awesome)
>>>
>>> Now I have no idea where to start to have an active /passive scenario
>>> for
>>> luns exported with LIO.  Any ideas?
>>>
>>> Also the web dashboard seem to hint that it can get stats for various
>>> clients made on ISCSI gateways, I'm not sure where it pulls that
>>> information. Is Luminous now shipping a ISCSI daemon of some sort?
>>>
>>> Thanks all!
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [3]
>>>
>> 
>> 
>> 
>> --
>> Jason
> 
> 
 
 
 
 --
 Jason
>>> 
>>> 
>> 
>> --
>> Jason
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

  

Links:
--
[1]
https://github.com/ritz303/ceph/blob/6ab7bc887b265127510c3c3fde6dbad0e047955d/doc/rbd/iscsi-target-cli.rst
[2] https://github.com/ceph/ceph/pull/16182
[3] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] VMware + Ceph using NFS sync/async ?

2017-08-19 Thread Maged Mokhtar
Hi Nick, 

Your note on PG locking is interesting, but I would be surprised if its
effect is that bad. I would think that in your example the 2 ms is the
total latency; the lock is probably held for only a small portion of
that, so the concurrent operations are not serialized for the entire
time, but again I may be wrong. Also, if the lock were that bad, we
should see 4k sequential writes being much slower than random ones in
general testing, which is not the case.

Another thing that may help with the vm migration you describe is
reducing the rbd stripe size to a couple of times smaller than the 2M
in-flight window (the 32 x 64k parallel writes), so that those concurrent
writes are spread over more objects/PGs.
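
For example (placeholder pool/image names; the stripe unit is given in
bytes), such an image could be created with:

rbd create vmware-pool/vm-disk1 --size 1T --stripe-unit 65536 --stripe-count 32

Note that striping parameters can only be set when the image is created
(or copied/imported), not changed afterwards.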

Maged 

On 2017-08-16 16:12, Nick Fisk wrote:

> Hi Matt, 
> 
> Well behaved applications are the problem here. ESXi sends all writes as sync 
> writes. So although OS's will still do their own buffering, any ESXi level 
> operation is all done as sync. This is probably seen the greatest when 
> migrating vm's between datastores, everything gets done as sync 64KB ios 
> meaning, copying a 1TB VM can often take nearly 24 hours. 
> 
> Osama, can you describe the difference in performance you see between 
> Openstack and ESXi and what type of operations are these? Sync writes should 
> be the same no matter the client, except in the NFS case you will have an 
> extra network hop and potentially a little bit of PG congestion around the FS 
> journal on the RBd device. 
> 
> Osama, you can't compare Ceph to a SAN. Just in terms of network latency you 
> have an extra 2 hops. In ideal scenario you might be able to get Ceph write 
> latency down to 0.5-1ms for a 4kb io, compared to to about 0.1-0.3 for a 
> storage array. However, what you will find with Ceph is that other things 
> start to increase this average long before you would start to see this on 
> storage arrays. 
> 
> The migration is a good example of this. As I said, ESXi migrates a vm in 
> 64KB io's, but does 32 of these blocks in parallel at a time. On storage 
> arrays, these 64KB io's are coalesced in the battery protected write cached 
> into bigger IO's before being persisted to disk. The storage array can also 
> accept all 32 of these requests at once. 
> 
> A similar thing happens in Ceph/RBD/NFS via the Ceph filestore journal, but 
> that coalescing is now an extra 2 hops away and with a bit of extra latency 
> introduced by the Ceph code, we are already a bit slower. But here's the 
> killer, PG locking!!! You can't write 32 IO's in parallel to the same 
> object/PG, each one has to be processed sequentially because of the locks. 
> (Please someone correct me if I'm wrong here). If your 64KB write latency is 
> 2ms, then you can only do 500 64KB IO's a second. 64KB*500=~30MB/s vs a 
> Storage Array which would be doing the operation in the hundreds of MB/s 
> range. 
> 
> Note: When proper iSCSI for RBD support is finished, you might be able to use 
> the VAAI offloads, which would dramatically increase performance for 
> migrations as well. 
> 
> Also once persistent SSD write caching for librbd becomes available, a lot of 
> these problems will go away, as the SSD will behave like a storage array's 
> write cache and will only be 1 hop away from the client as well. 
> 
> FROM: Matt Benjamin [mailto:mbenj...@redhat.com] 
> SENT: 16 August 2017 14:49
> TO: Osama Hasebou 
> CC: n...@fisk.me.uk; ceph-users 
> SUBJECT: Re: [ceph-users] VMware + Ceph using NFS sync/async ? 
> 
> Hi Osama, 
> 
> I don't have a clear sense of the the application workflow here--and Nick 
> appears to--but I thought it worth noting that NFSv3 and NFSv4 clients 
> shouldn't normally need the sync mount option to achieve i/o stability with 
> well-behaved applications.  In both versions of the protocol, an application 
> write that is synchronous (or, more typically, the equivalent application 
> sync barrier) should not succeed until an NFS-protocol COMMIT (or in some 
> cases w/NFSv4, WRITE w/stable flag set) has been acknowledged by the NFS 
> server.  If the NFS i/o stability model is insufficient for a your workflow, 
> moreover, I'd be worried that -osync writes (which might be incompletely 
> applied during a failure event) may not be correctly enforcing your 
> invariant, either. 
> 
> Matt 
> 
> On Wed, Aug 16, 2017 at 8:33 AM, Osama Hasebou  wrote:
> 
>> Hi Nick, 
>> 
>> Thanks for replying! If Ceph is combined with Openstack then, does that mean 
>> that actually when openstack writes are happening, it is not fully sync'd 
>> (as in written to disks) before it starts receiving more data, so acting as 
>> async ? In that scenario there is a chance for data loss if things go bad, 
>> i.e power outage or something like that ? 
>> 
>> As for the slow operations, reading is quite fine when I compare it to a SAN 
>> storage system connected to VMware. It is writing data, small chunks or big 
>> ones, that suffer when trying to use the sync option with FIO for 
>> 

Re: [ceph-users] Small-cluster performance issues

2017-08-22 Thread Maged Mokhtar
It is likely your 2 spinning disks cannot keep up with the load. Things
are likely to improve if you double your OSDs, hooking them up to your
existing SSD journal. Ideally you would run a
load/performance tool (atop, collectl or sysstat) and measure how
busy your resources are, but most likely your 2 spinning disks
will show near 100% busy utilization.
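
For example, running something like

iostat -xm 5

on an OSD node while the cluster is under load (or using atop/collectl) and
looking at the %util and await columns for the spinning OSD disks versus the
journal SSD should confirm this.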

filestore_max_sync_interval: I do not recommend decreasing this to 0.1;
I would keep it at 5 sec.

osd_op_threads: do not increase this unless you have enough cores.

But adding disks is the way to go.

Maged 

On 2017-08-22 20:08, fcid wrote:

> Hello everyone,
> 
> I've been using ceph to provide storage using RBD for 60 KVM virtual machines 
> running on proxmox.
> 
> The ceph cluster we have is very small (2 OSDs + 1 mon per node, and a total 
> of 3 nodes) and we are having some performace issues, like big latency times 
> (apply lat:~0.5 s; commit lat: 0.001 s), which get worse by the weekly 
> deep-scrubs.
> 
> I wonder if doubling the numbers of OSDs would improve latency times, or if 
> there is any other configuration tweak recommended for such small cluster. 
> Also, I'm looking forward to read any experience of other users using a 
> similiar configuration.
> 
> Some technical info:
> 
> - Ceph version: 10.2.5
> 
> - OSDs have SSD journal (one SSD disk per 2 OSDs) and have a spindle for 
> backend disk.
> 
> - Using CFQ disk queue scheduler
> 
> - OSD configuration excerpt:
> 
> osd_recovery_max_active = 1
> osd_recovery_op_priority = 63
> osd_client_op_priority = 1
> osd_mkfs_options = -f -i size=2048 -n size=64k
> osd_mount_options_xfs = inode64,noatime,logbsize=256k
> osd_journal_size = 20480
> osd_op_threads = 12
> osd_disk_threads = 1
> osd_disk_thread_ioprio_class = idle
> osd_disk_thread_ioprio_priority = 7
> osd_scrub_begin_hour = 3
> osd_scrub_end_hour = 8
> osd_scrub_during_recovery = false
> filestore_merge_threshold = 40
> filestore_split_multiple = 8
> filestore_xattr_use_omap = true
> filestore_queue_max_ops = 2500
> filestore_min_sync_interval = 0.01
> filestore_max_sync_interval = 0.1
> filestore_journal_writeahead = true
> 
> Best regards,___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Config parameters for system tuning

2017-06-20 Thread Maged Mokhtar

Hi,

1) I am trying to set some of the following config values which seems to 
be present in most config examples relating to performance tuning:

journal_queue_max_ops
journal_queue_max_bytes
filestore_queue_committing_max_bytes
filestore_queue_committing_max_ops

I am using 10.2.7 but am not able to set these parameters either via the conf 
file or injection; also, ceph --show-config does not list them. Have 
they been deprecated, and should they be ignored?


2) For osd_op_threads I have seen some examples (not the official docs) 
fixing this to the number of cpu cores; is this the best recommendation, 
or could we use more threads than cores?


Cheers
Maged Mokhtar
PetaSAN
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Config parameters for system tuning

2017-06-22 Thread Maged Mokhtar

Looking at the sources, the config values were in Hammer but not Jewel.
For the journal config, I recommend
journal_queue_max_ops
journal_queue_max_bytes
be removed from the docs:
http://docs.ceph.com/docs/master/rados/configuration/journal-ref/

Also for the added filestore throttling params:
filestore_queue_max_delay_multiple
filestore_queue_high_delay_multiple
filestore_queue_low_threshhold
filestore_queue_high_threshhold
again it will be good to update the docs:
http://docs.ceph.com/docs/master/rados/configuration/filestore-config-ref/
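
(To check what a running daemon actually understands, the admin socket is
useful, e.g. on an OSD host:

ceph daemon osd.0 config show | grep journal_queue

an empty result means the option does not exist in that release.)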

I guess all eyes are on Bluestore now :)

Maged Mokhtar
PetaSAN
--
From: "Maged Mokhtar" <mmokh...@petasan.org>
Sent: Wednesday, June 21, 2017 12:33 AM
To: <ceph-users@lists.ceph.com>
Subject: [ceph-users] Config parameters for system tuning


Hi,

1) I am trying to set some of the following config values which seems to 
be present in most config examples relating to performance tuning:

journal_queue_max_ops
journal_queue_max_bytes
filestore_queue_committing_max_bytes
filestore_queue_committing_max_ops

I am using 10.2.7 but not able to set these parameters either via conf 
file or injections, also ceph --show-config does not list them. Have 
they been deprecated and should be ignored ?


2) For osd_op_threads i have seen some examples (not the official docs) 
fixing this to the number of cpu cores, is this the best recommendation 
or can could we use more threads than cores ?


Cheers
Maged Mokhtar
PetaSAN
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Squeezing Performance of CEPH

2017-06-22 Thread Maged Mokhtar
Generally you can measure your bottleneck via a tool like
atop/collectl/sysstat  and see how busy (ie %busy, %util ) your
resources are: cpu/disks/net. 

As was pointed out, in your case you will most probably have maxed out
on your disks. But the above tools should help as you grow and tune your
cluster. 

Cheers, 

Maged Mokhtar 

PetaSAN 

On 2017-06-22 19:19, Massimiliano Cuttini wrote:

> Hi everybody, 
> 
> I want to squeeze all the performance of CEPH (we are using jewel 10.2.7).
> We are testing a testing environment with 2 nodes having the same 
> configuration: 
> 
> * CentOS 7.3
> * 24 CPUs (12 for real in hyper threading)
> * 32Gb of RAM
> * 2x 100Gbit/s ethernet cards
> * 2x OS dedicated in raid SSD Disks
> * 4x OSD SSD Disks SATA 6Gbit/s
> 
> We are already expecting the following bottlenecks: 
> 
> * [ SATA speed x n° disks ] = 24Gbit/s
> * [ Networks speed x n° bonded cards ] = 200Gbit/s
> 
> So the minimum between them is 24 Gbit/s per node (not taking in account 
> protocol loss). 
> 
> 24Gbit/s per node x2 = 48Gbit/s of maximum hypotetical theorical gross speed. 
> 
> Here are the tests:
> ///IPERF2/// Tests are quite good scoring 88% of the bottleneck.
> Note: iperf2 can use only 1 connection from a bond.(it's a well know issue).
> 
>> [ ID] Interval   Transfer Bandwidth
>> [ 12]  0.0-10.0 sec  9.55 GBytes  8.21 Gbits/sec
>> [  3]  0.0-10.0 sec  10.3 GBytes  8.81 Gbits/sec
>> [  5]  0.0-10.0 sec  9.54 GBytes  8.19 Gbits/sec
>> [  7]  0.0-10.0 sec  9.52 GBytes  8.18 Gbits/sec
>> [  6]  0.0-10.0 sec  9.96 GBytes  8.56 Gbits/sec
>> [  8]  0.0-10.0 sec  12.1 GBytes  10.4 Gbits/sec
>> [  9]  0.0-10.0 sec  12.3 GBytes  10.6 Gbits/sec
>> [ 10]  0.0-10.0 sec  10.2 GBytes  8.80 Gbits/sec
>> [ 11]  0.0-10.0 sec  9.34 GBytes  8.02 Gbits/sec
>> [  4]  0.0-10.0 sec  10.3 GBytes  8.82 Gbits/sec
>> [SUM]  0.0-10.0 sec   103 GBytes  88.6 Gbits/sec
> 
> ///RADOS BENCH 
> 
> Take in consideration the maximum hypotetical speed of 48Gbit/s tests (due to 
> disks bottleneck), tests are not good enought. 
> 
> * Average MB/s in write is almost 5-7Gbit/sec (12,5% of the mhs)
> * Average MB/s in seq read is almost 24Gbit/sec (50% of the mhs)
> * Average MB/s in random read is almost 27Gbit/se (56,25% of the mhs).
> 
> Here are the reports.
> Write:
> 
>> # rados bench -p scbench 10 write --no-cleanup
>> Total time run: 10.229369
>> Total writes made:  1538
>> Write size: 4194304
>> Object size:4194304
>> Bandwidth (MB/sec): 601.406
>> Stddev Bandwidth:   357.012
>> Max bandwidth (MB/sec): 1080
>> Min bandwidth (MB/sec): 204
>> Average IOPS:   150
>> Stddev IOPS:89
>> Max IOPS:   270
>> Min IOPS:   51
>> Average Latency(s): 0.106218
>> Stddev Latency(s):  0.198735
>> Max latency(s): 1.87401
>> Min latency(s): 0.0225438
> 
> sequential read:
> 
>> # rados bench -p scbench 10 seq
>> Total time run:   2.054359
>> Total reads made: 1538
>> Read size:4194304
>> Object size:  4194304
>> Bandwidth (MB/sec):   2994.61
>> Average IOPS  748
>> Stddev IOPS:  67
>> Max IOPS: 802
>> Min IOPS: 707
>> Average Latency(s):   0.0202177
>> Max latency(s):   0.223319
>> Min latency(s):   0.00589238
> 
> random read:
> 
>> # rados bench -p scbench 10 rand
>> Total time run:   10.036816
>> Total reads made: 8375
>> Read size:4194304
>> Object size:  4194304
>> Bandwidth (MB/sec):   3337.71
>> Average IOPS: 834
>> Stddev IOPS:  78
>> Max IOPS: 927
>> Min IOPS: 741
>> Average Latency(s):   0.0182707
>> Max latency(s):   0.257397
>> Min latency(s):   0.00469212
> 
> // 
> 
> It's seems like that there are some bottleneck somewhere that we are 
> understimating.
> Can you help me to found it? 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph random read IOPS

2017-06-26 Thread Maged Mokhtar
On 2017-06-26 15:34, Willem Jan Withagen wrote:

> On 26-6-2017 09:01, Christian Wuerdig wrote: 
> 
>> Well, preferring faster clock CPUs for SSD scenarios has been floated
>> several times over the last few months on this list. And realistic or
>> not, Nick's and Kostas' setup are similar enough (testing single disk)
>> that it's a distinct possibility.
>> Anyway, as mentioned measuring the performance counters would probably
>> provide more insight.
> 
> I read the advise as:
> prefer GHz over cores.
> 
> And especially since there is a sort of balance between either GHz or
> cores, that can be an expensive one. Getting both means you have to pay
> relatively substantial more money.
> 
> And for an average Ceph server with plenty OSDs, I personally just don't
> buy that. There you'd have to look at the total throughput of the the
> system, and latency is only one of the many factors.
> 
> Let alone in a cluster with several hosts (and or racks). There the
> latency is dictated by the network. So a bad choice of network card or
> switch will out do any extra cycles that your CPU can burn.
> 
> I think that just testing 1 OSD is testing artifacts, and has very
> little to do with running an actual ceph cluster.
> 
> So if one would like to test this, the test setup should be something
> like: 3 hosts with something like 3 disks per host, min_disk=2  and a
> nice workload.
> Then turn the GHz-knob and see what happens with client latency and
> throughput.
> 
> --WjW 
> 
> In a high concurrency/queue depth situation, which is probably the most 
> common workload, there is no question that adding more cores will increase 
> IOPS almost linearly, as long as you have enough disk and network bandwidth, i.e. 
> your disk and network % utilization is low and your cpu is near 100%. Adding 
> more cores is also more economical for increasing IOPS than increasing 
> frequency. 
> But adding more cores will not lower latency below the value you get from the 
> QD=1 test. To achieve lower latency you need a faster cpu frequency. Yes it is 
> expensive, and as you said you need lower-latency switches and so on, but you 
> just have to pay more to achieve this.  
> 
> /Maged 
> 
> On Sun, Jun 25, 2017 at 4:53 AM, Willem Jan Withagen <w...@digiware.nl
> <mailto:w...@digiware.nl>> wrote:
> 
> Op 24 jun. 2017 om 14:17 heeft Maged Mokhtar <mmokh...@petasan.org
> <mailto:mmokh...@petasan.org>> het volgende geschreven:
> 
> My understanding was this test is targeting latency more than
> IOPS. This is probably why its was run using QD=1. It also makes
> sense that cpu freq will be more important than cores. 
> 
> But then it is not generic enough to be used as an advise!
> It is just a line in 3D-space. 
> As there are so many
> 
> --WjW
> 
> On 2017-06-24 12:52, Willem Jan Withagen wrote:
> 
> On 24-6-2017 05:30, Christian Wuerdig wrote: The general advice floating 
> around is that your want CPUs with high
> clock speeds rather than more cores to reduce latency and
> increase IOPS
> for SSD setups (see also
> http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/
> <http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/>)
> So
> something like a E5-2667V4 might bring better results in that
> situation.
> Also there was some talk about disabling the processor C states
> in order
> to bring latency down (something like this should be easy to test:
> https://stackoverflow.com/a/22482722/220986
> <https://stackoverflow.com/a/22482722/220986>) 
> I would be very careful to call this a general advice...
> 
> Although the article is interesting, it is rather single sided.
> 
> The only thing is shows that there is a lineair relation between
> clockspeed and write or read speeds???
> The article is rather vague on how and what is actually tested.
> 
> By just running a single OSD with no replication a lot of the
> functionality is left out of the equation.
> Nobody is running just 1 osD on a box in a normal cluster host.
> 
> Not using a serious SSD is another source of noise on the conclusion.
> More Queue depth can/will certainly have impact on concurrency.
> 
> I would call this an observation, and nothing more.
> 
> --WjW 
> On Sat, Jun 24, 2017 at 1:28 AM, Kostas Paraskevopoulos
> <reverend...@gmail.com <mailto:reverend...@gmail.com>
> <mailto:reverend...@gmail.com <mailto:reverend...@gmail.com>>>
> wrote:
> 
> Hello,
> 
> We are in the process of evaluating the performance of a testing
> cluster (3 nodes) with ceph jewel. Our setup consists of:
> 3 monitors (VMs)
> 2 physical servers each connected with 1 JBOD

Re: [ceph-users] Ceph random read IOPS

2017-06-24 Thread Maged Mokhtar
My understanding was that this test is targeting latency more than IOPS. This
is probably why it was run using QD=1. It also makes sense that cpu
frequency will be more important than cores.

On 2017-06-24 12:52, Willem Jan Withagen wrote:

> On 24-6-2017 05:30, Christian Wuerdig wrote: 
> 
>> The general advice floating around is that your want CPUs with high
>> clock speeds rather than more cores to reduce latency and increase IOPS
>> for SSD setups (see also
>> http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/) So
>> something like a E5-2667V4 might bring better results in that situation.
>> Also there was some talk about disabling the processor C states in order
>> to bring latency down (something like this should be easy to test:
>> https://stackoverflow.com/a/22482722/220986)
> 
> I would be very careful to call this a general advice...
> 
> Although the article is interesting, it is rather single sided.
> 
> The only thing is shows that there is a lineair relation between
> clockspeed and write or read speeds???
> The article is rather vague on how and what is actually tested.
> 
> By just running a single OSD with no replication a lot of the
> functionality is left out of the equation.
> Nobody is running just 1 osD on a box in a normal cluster host.
> 
> Not using a serious SSD is another source of noise on the conclusion.
> More Queue depth can/will certainly have impact on concurrency.
> 
> I would call this an observation, and nothing more.
> 
> --WjW 
> 
>> On Sat, Jun 24, 2017 at 1:28 AM, Kostas Paraskevopoulos
>> > wrote:
>> 
>> Hello,
>> 
>> We are in the process of evaluating the performance of a testing
>> cluster (3 nodes) with ceph jewel. Our setup consists of:
>> 3 monitors (VMs)
>> 2 physical servers each connected with 1 JBOD running Ubuntu Server
>> 16.04
>> 
>> Each server has 32 threads @2.1GHz and 128GB RAM.
>> The disk distribution per server is:
>> 38 * HUS726020ALS210 (SAS rotational)
>> 2 * HUSMH8010BSS200 (SAS SSD for journals)
>> 2 * ST1920FM0043 (SAS SSD for data)
>> 1 * INTEL SSDPEDME012T4 (NVME measured with fio ~300K iops)
>> 
>> Since we don't currently have a 10Gbit switch, we test the performance
>> with the cluster in a degraded state, the noout flag set and we mount
>> rbd images on the powered on osd node. We confirmed that the network
>> is not saturated during the tests.
>> 
>> We ran tests on the NVME disk and the pool created on this disk where
>> we hoped to get the most performance without getting limited by the
>> hardware specs since we have more disks than CPU threads.
>> 
>> The nvme disk was at first partitioned with one partition and the
>> journal on the same disk. The performance on random 4K reads was
>> topped at 50K iops. We then removed the osd and partitioned with 4
>> data partitions and 4 journals on the same disk. The performance
>> didn't increase significantly. Also, since we run read tests, the
>> journals shouldn't cause performance issues.
>> 
>> We then ran 4 fio processes in parallel on the same rbd mounted image
>> and the total iops reached 100K. More parallel fio processes didn't
>> increase the measured iops.
>> 
>> Our ceph.conf is pretty basic (debug is set to 0/0 for everything) and
>> the crushmap just defines the different buckets/rules for the disk
>> separation (rotational, ssd, nvme) in order to create the required
>> pools
>> 
>> Is the performance of 100.000 iops for random 4K read normal for a
>> disk that on the same benchmark runs at more than 300K iops on the
>> same hardware or are we missing something?
>> 
>> Best regards,
>> Kostas
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] VMware + CEPH Integration

2017-06-15 Thread Maged Mokhtar

Hi,

Please check the PetaSAN project 
www.petasan.org
We provide clustered iSCSI using LIO/Ceph rbd and Consul for HA.
Works well with VMWare.  
/Maged



From: Osama Hasebou 
Sent: Thursday, June 15, 2017 12:29 PM
To: ceph-users 
Subject: [ceph-users] VMware + CEPH Integration


Hi Everyone,


We would like to start testing using VMware with CEPH storage. Can people share 
their experience with production ready ideas they tried and if they were 
successful? 


I have been reading lately that either NFS or iSCSI are possible with some 
server acting as a gateway in between Ceph and VMware environment but NFS is 
better.


Thank you.


Regards,
Ossi








___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] trying to understanding crush more deeply

2017-09-22 Thread Maged Mokhtar
Per section 3.4.4, the default bucket type straw computes the hash of (PG
number, replica number, bucket id) for all buckets using the Jenkins
integer hashing function, then multiplies this by the bucket weight (for OSD
disks a weight of 1 corresponds to 1 TB; for higher levels it is the sum of
the contained weights). The selection function chooses the bucket/disk with
the max value:

c(r, x) = max_i( f(w_i) * hash(x, r, i) )

So if you add an OSD disk, there is a new disk id that enters this
competition and will win PGs from other OSDs in proportion to its weight,
which is the desired effect; but a side effect is that the weight
hierarchy has slightly changed, so some older buckets may now win PGs
from other older buckets according to the hash function.

So straw does have overhead when adding (rather than replacing); it does
not do minimal PG re-assignments. But in terms of the overall efficiency of
adding/removing buckets at the end and in the middle of the hierarchy, it is
the best overall compared to the other algorithms, as seen in chart 5 and table 2.

On 2017-09-22 08:36, Will Zhao wrote:

> Hi Sage  and all :
> I am tring to understand cursh more deeply. I have tried to read
> the code and paper, and search the mail list archives ,  but I still
> have some questions and can't understand it well.
> If I have 100 osds, and when I add a osd ,  the osdmap changes,
> and how the pg is recaulated to make sure the data movement is
> minimal.  I tried to use crushtool --show-mappings --num-rep 3  --test
> -i map , through changing the map for 100osds and 101 osds , to look
> the result , it looks like the pgmap changed a lot .  Shouldn't the
> remap  only happen to some of the pgs ? Or crush from adding  a pg is
> different from a new osdmap ? I konw I must understand something
> wrong. I appreciate if you can explain more about the logic of adding
> a osd . Or is there  more doc that I can read ? Thank you very much
> !!! : )
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-21 Thread Maged Mokhtar
On 2017-09-21 07:56, Lazuardi Nasution wrote:

> Hi, 
> 
> I'm still looking for the answer of these questions. Maybe someone can share 
> their thought on these. Any comment will be helpful too. 
> 
> Best regards, 
> 
> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution  
> wrote:
> 
>> Hi, 
>> 
>> 1. Is it possible configure use osd_data not as small partition on OSD but a 
>> folder (ex. on root disk)? If yes, how to do that with ceph-disk and any 
>> pros/cons of doing that? 
>> 2. Is WAL & DB size calculated based on OSD size or expected throughput like 
>> on journal device of filestore? If no, what is the default value and 
>> pro/cons of adjusting that? 
>> 3. Is partition alignment matter on Bluestore, including WAL & DB if using 
>> separate device for them? 
>> 
>> Best regards,
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

I am also looking for recommendations on wal/db partition sizes. Some
hints: 

The ceph-disk defaults, used when it does not find
bluestore_block_wal_size or bluestore_block_db_size in the config file, are:

wal = 512MB

db = 1/100 of bluestore_block_size (the data size) if that is set in the
config file, else 1G.

There is also a presentation by Sage back in March, see page 16: 

https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in


wal: 512 MB 

db: "a few" GB  

The wal size is probably not debatable: it acts like a journal for
small block sizes, which are constrained by iops, hence 512 MB is more
than enough. We will probably see more guidance on the db size in the future.
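
For example, to override the ceph-disk defaults you can put something like
this in ceph.conf before creating the OSDs (values are in bytes; the 10G db
value is just my guess, not an official recommendation):

bluestore_block_wal_size = 536870912
bluestore_block_db_size = 10737418240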

Maged
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] trying to understanding crush more deeply

2017-09-22 Thread Maged Mokhtar
If you have a random number generator rand() and variables A, B:

A = rand()
B = rand()

and you loop 100 times to see which is bigger, A or B, on average A will win
50 times and B will win 50 times.
Now assume you want A to win three times as often as B (on average 75 wins
vs 25); you can do this by scaling each draw by a function of its weight:

A = f(3) x rand()
B = f(1) x rand()

This is what straw/straw2 do: the hash plays the role of rand() and the
bucket weight determines the scaling.

Hashing is like a random function, but it takes (in the case of Jenkins) integer
inputs; the output is a random distribution but is repeatable if you
pass the same integer values, hence it is called pseudo random.

In code: 

straw 
draw = crush_hash32_3(bucket->h.hash, x, bucket->h.items[i], r);
draw &= 0x;
draw *= bucket->straws[i]; 

straw2
u = crush_hash32_3(bucket->h.hash, x, ids[i], r);
u &= 0x;
ln = crush_ln(u) - 0x1ll; 

/*
* divide by 16.16 fixed-point weight. note
* that the ln value is negative, so a larger
* weight means a larger (less negative) value
* for draw.
*/
draw = div64_s64(ln, weights[i]); 

In both straw and straw2, we compute the hash based on pg number,
replica count and bucket id:
- for straw: multiply the hash value by the weight (or a function that
depends on the weight)
- for straw2: create a negative number from the ln of the hash value, then
divide by the weight (or a function that depends on the weight); as per the
comment in the code, we divide rather than multiply since the value is negative.

In both cases the bucket with the highest value wins the PG.
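
As a toy example with made-up numbers: suppose for a given (pg, r) the hash
values, scaled to (0,1], come out as u_A = 0.20 for bucket A (weight 1.0) and
u_B = 0.04 for bucket B (weight 3.0). Then draw_A = ln(0.20)/1.0 = -1.61 and
draw_B = ln(0.04)/3.0 = -1.07, so B wins because its draw is larger (less
negative); over many PGs, B wins about 3 times as often as A, i.e. in
proportion to the weights.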

On 2017-09-22 18:05, Will Zhao wrote:

> Thanks !   I still have a question. Like the code in bucket_straw2_choose 
> below:
> 
> u = crush_hash32_3(bucket->h.hash, x, ids[i], r);
> u &= 0x;
> ln = crush_ln(u) - 0x1ll;
> draw = div64_s64(ln, weights[i]);
> 
> Because the x , id, r , don't change, so the ln won't change for old
> bucket, add osd or remove osd only change the weight.  Suppose for
> pgi, there are 3 bucket(host) with weight 3w, add one host with weight
> w, there are 4 buckets with weight now.  This means the movement will
> depend on  ln value , am I understand right ? I don't understand how
> this make sure the new bucket get desirable pgs ?  I read
> https://en.wikipedia.org/wiki/Jenkins_hash_function and
> http://en.wikipedia.org/wiki/Exponential_distribution,  But I can't
> link them  together to understand this ? Can you explain something
> about this ?  Apologize for my dummy. And thank you very much  .  : )
> 
> On Fri, Sep 22, 2017 at 3:50 PM, Maged Mokhtar <mmokh...@petasan.org> wrote: 
> 
>> Per section 3.4.4 The default bucket type straw computes the hash of (PG
>> number, replica number, bucket id) for all buckets using the Jenkins integer
>> hashing function, then multiply this by bucket weight (for OSD disks the
>> weight of 1 is for 1 TB, for higher level it is the sum of contained
>> weights). The selection function chooses the bucket/disk with the max value:
>> c(r,x) = maxi ( f (wi)hash(x, r, i))
>> 
>> So if you add a OSD disk, there is a new disk id that enters this
>> competition and will get PG from other OSDs proportional to its weight,
>> which is a desirable effect, but a side effect is that the weight hierarchy
>> has slightly changed so now some older buckets may win PGs from other older
>> buckets according to the hash function.
>> 
>> So straw does have overhead when adding (rather than replacing), it does not
>> do minimal PG re-assignments. But it terms of overall efficiency of
>> adding/removing of buckets at end and in middle of hierarchy it is the best
>> overall over other algorithms as seen on chart 5 and table 2.
>> 
>> On 2017-09-22 08:36, Will Zhao wrote:
>> 
>> Hi Sage  and all :
>> I am tring to understand cursh more deeply. I have tried to read
>> the code and paper, and search the mail list archives ,  but I still
>> have some questions and can't understand it well.
>> If I have 100 osds, and when I add a osd ,  the osdmap changes,
>> and how the pg is recaulated to make sure the data movement is
>> minimal.  I tried to use crushtool --show-mappings --num-rep 3  --test
>> -i map , through changing the map for 100osds and 101 osds , to look
>> the result , it looks like the pgmap changed a lot .  Shouldn't the
>> remap  only happen to some of the pgs ? Or crush from adding  a pg is
>> different from a new osdmap ? I konw I must understand something
>> wrong. I appreciate if you can explain more about the logic of adding
>> a osd . Or is there  more doc that I can read ? Thank you very much
>> !!! : )
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore disk colocation using NVRAM, SSD and SATA

2017-09-21 Thread Maged Mokhtar
On 2017-09-21 10:01, Dietmar Rieder wrote:

> Hi,
> 
> I'm in the same situation (NVMEs, SSDs, SAS HDDs). I asked the same
> questions to myself.
> For now I decided to use the NVMEs as wal and db devices for the SAS
> HDDs and on the SSDs I colocate wal and  db.
> 
> However, I'm still wonderin how (to what size) and if I should change
> the default sizes of wal and db.
> 
> Dietmar
> 
> On 09/21/2017 01:18 AM, Alejandro Comisario wrote: 
> 
>> But for example, on the same server i have 3 disks technologies to
>> deploy pools, SSD, SAS and SATA.
>> The NVME were bought just thinking on the journal for SATA and SAS,
>> since journals for SSD were colocated.
>> 
>> But now, exactly the same scenario, should i trust the NVME for the SSD
>> pool ? are there that much of a  gain ? against colocating block.* on
>> the same SSD? 
>> 
>> best.
>> 
>> On Wed, Sep 20, 2017 at 6:36 PM, Nigel Williams
>> <nigel.willi...@tpac.org.au <mailto:nigel.willi...@tpac.org.au>> wrote:
>> 
>> On 21 September 2017 at 04:53, Maximiliano Venesio
>> <mass...@nubeliu.com <mailto:mass...@nubeliu.com>> wrote:
>> 
>> Hi guys i'm reading different documents about bluestore, and it
>> never recommends to use NVRAM to store the bluefs db,
>> nevertheless the official documentation says that, is better to
>> use the faster device to put the block.db in.
>> 
>> ​Likely not mentioned since no one yet has had the opportunity to
>> test it.​
>> 
>> So how do i have to deploy using bluestore, regarding where i
>> should put block.wal and block.db ? 
>> 
>> ​block.* would be best on your NVRAM device, like this:
>> 
>> ​ceph-deploy osd create --bluestore c0osd-136:/dev/sda --block-wal
>> /dev/nvme0n1 --block-db /dev/nvme0n1
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
>> 
>> -- 
>> *Alejandro Comisario*
>> *CTO | NUBELIU*
>> E-mail: alejan...@nubeliu.com <mailto:alejan...@nubeliu.com>Cell: +54 9
>> 11 3770 1857
>> _
>> www.nubeliu.com [1] <http://www.nubeliu.com/>
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

My guess for the wal: you are dealing with a 2-step io operation, so if
it is colocated on your SSDs your iops for small writes will be
halved. The decision is: if you add a small NVMe as wal for 4 or 5
(large) SSDs, you will double their iops for small io sizes. This is not
the case for the db.

For wal size: 512 MB is recommended (the ceph-disk default).

For db size: a "few" GB; probably 10GB is a good number. I guess we will
hear more in the future.

Maged Mokhtar 

  

Links:
--
[1] http://www.nubeliu.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph iSCSI login failed due to authorization failure

2017-10-14 Thread Maged Mokhtar
On 2017-10-14 17:50, Kashif Mumtaz wrote:

> Hello Dear, 
> 
> I am trying to configure the Ceph iscsi gateway on Ceph Luminious . As per 
> below 
> 
> Ceph iSCSI Gateway -- Ceph Documentation 
> 
> 
> Ceph is iscsi gateway are configured and chap auth is set. 
> 
> /> ls 
> o- / 
> .
>  [...] 
> o- clusters 
> 
>  [Clusters: 1] 
> | o- ceph 
> ..
>  [HEALTH_WARN] 
> |   o- pools 
> ..
>  [Pools: 2] 
> |   | o- kashif 
> . [Commit: 
> 0b, Avail: 116G, Used: 1K, Commit%: 0%] 
> |   | o- rbd 
> ... [Commit: 
> 10G, Avail: 116G, Used: 3K, Commit%: 8%] 
> |   o- topology 
> ...
>  [OSDs: 13,MONs: 3] 
> o- disks 
> .
>  [10G, Disks: 1] 
> | o- rbd.disk_1 
> ...
>  [disk_1 (10G)] 
> o- iscsi-target 
> .
>  [Targets: 1] 
> o- iqn.2003-01.com.redhat.iscsi-gw:tahir 
> . 
> [Gateways: 2] 
> o- gateways 
> 
>  [Up: 2/2, Portals: 2] 
> | o- gateway 
> 
>  [192.168.10.37 (UP)] 
> | o- gateway2 
> ...
>  [192.168.10.38 (UP)] 
> o- hosts 
> ..
>  [Hosts: 1] 
> o- iqn.1994-05.com.redhat:rh7-client 
> ... [Auth: CHAP, 
> Disks: 1(10G)] 
> o- lun 0 
> ..
>  [rbd.disk_1(10G), Owner: gateway2] 
> /> 
> 
> But initiators are unable to mount it. Try both ion Linux and ESXi 6. 
> 
> Below is the  error message on iscsi gateway server log file. 
> 
> Oct 14 19:34:49 gateway kernel: iSCSI Initiator Node: 
> iqn.1998-01.com.vmware:esx0-36c45c69 is not authorized to access iSCSI target 
> portal group: 1. 
> Oct 14 19:34:49 gateway kernel: iSCSI Login negotiation failed. 
> 
> Oct 14 19:35:27 gateway kernel: iSCSI Initiator Node: 
> iqn.1994-05.com.redhat:5ef55740c576 is not authorized to access iSCSI target 
> portal group: 1. 
> Oct 14 19:35:27 gateway kernel: iSCSI Login negotiation failed. 
> 
> I am giving the ceph authentication on initiator side.
> 
> Discovery on initiator is happening  
> 
> root@server1 ~]# iscsiadm -m discovery -t st -p  192.168.10.37 
> 192.168.10.37:3260,1 iqn.2003-01.com.redhat.iscsi-gw:tahir 
> 192.168.10.38:3260,2 iqn.2003-01.com.redhat.iscsi-gw:tahir 
> 
> But when trying to login , it is giving  "iSCSI login failed due to 
> authorization failure" 
> 
> [root@server1 ~]# iscsiadm -m node -T iqn.2003-01.com.redhat.iscsi-gw:tahir  
> -l 
> Logging in to [iface: default, target: iqn.2003-01.com.redhat.iscsi-gw:tahir, 
> portal: 192.168.10.37,3260] (multiple) 
> Logging in to [iface: default, target: iqn.2003-01.com.redhat.iscsi-gw:tahir, 
> portal: 192.168.10.38,3260] (multiple) 
> iscsiadm: Could not login to [iface: default, target: 
> iqn.2003-01.com.redhat.iscsi-gw:tahir, portal: 192.168.10.37,3260]. 
> iscsiadm: initiator reported error (24 - iSCSI login failed due to 
> authorization failure) 
> iscsiadm: Could not login to [iface: default, target: 
> iqn.2003-01.com.redhat.iscsi-gw:tahir, portal: 192.168.10.38,3260]. 
> iscsiadm: initiator reported error (24 - iSCSI login failed due to 
> authorization failure) 
> iscsiadm: Could not log into all portals 
> 
> Can someone give the idea what is missing. 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

This is a bit different from the LIO setup I know, but it seems the only
client initiator configured on the target is
iqn.1994-05.com.redhat:rh7-client
whereas you are trying to log in with
iqn.1998-01.com.vmware:esx0-36c45c69 and iqn.1994-05.com.redhat:5ef55740c576,
which are not authorized. You need to add these initiator IQNs (and their
CHAP credentials) as hosts on the target.
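
For example, something along these lines in gwcli should authorize the ESXi
initiator (the chap user/password are placeholders, and the exact syntax may
differ between ceph-iscsi-cli versions):

cd /iscsi-target/iqn.2003-01.com.redhat.iscsi-gw:tahir/hosts
create iqn.1998-01.com.vmware:esx0-36c45c69
cd iqn.1998-01.com.vmware:esx0-36c45c69
auth chap=myiscsiusername/myiscsipassword
disk add rbd.disk_1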

Re: [ceph-users] osd max scrubs not honored?

2017-10-15 Thread Maged Mokhtar
On 2017-10-14 05:02, J David wrote:

> Thanks all for input on this.
> 
> It's taken a couple of weeks, but based on the feedback from the list,
> we've got our version of a scrub-one-at-a-time cron script running and
> confirmed that it's working properly.
> 
> Unfortunately, this hasn't really solved the real problem.  Even with
> just one scrub and one client running, client I/O requests routinely
> take 30-60 seconds to complete (read or write), which is so poor that
> the cluster is unusable for any sort of interactive activity.  Nobody
> is going to sit around and wait 30-60 seconds for a file to save or
> load, or for a web server to respond, or a SQL query to finish.
> 
> Running "ceph -w" blames this on slow requests blocked for > 32 seconds:
> 
> 2017-10-13 21:21:34.445798 mon.ceph1 [INF] overall HEALTH_OK
> 2017-10-13 21:21:51.305661 mon.ceph1 [WRN] Health check failed: 42
> slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2017-10-13 21:21:57.311892 mon.ceph1 [WRN] Health check update: 140
> slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2017-10-13 21:22:03.343443 mon.ceph1 [WRN] Health check update: 111
> slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2017-10-13 21:22:01.833605 osd.5 [WRN] 1 slow requests, 1 included
> below; oldest blocked for > 30.526819 secs
> 2017-10-13 21:22:01.833614 osd.5 [WRN] slow request 30.526819 seconds
> old, received at 2017-10-13 21:21:31.306718:
> osd_op(client.6104975.0:7330926 0.a2
> 0:456218c9:::rbd_data.1a24832ae8944a.0009d21d:head
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 2364416~88064] snapc 0=[] ondisk+write+known_if_redirected e18866)
> currently sub_op_commit_rec from 9
> 2017-10-13 21:22:11.238561 mon.ceph1 [WRN] Health check update: 24
> slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2017-10-13 21:22:04.834075 osd.5 [WRN] 1 slow requests, 1 included
> below; oldest blocked for > 30.291869 secs
> 2017-10-13 21:22:04.834082 osd.5 [WRN] slow request 30.291869 seconds
> old, received at 2017-10-13 21:21:34.542137:
> osd_op(client.6104975.0:7331703 0.a2
> 0:4571f0f6:::rbd_data.1a24832ae8944a.0009c8ef:head
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 2934272~46592] snapc 0=[] ondisk+write+known_if_redirected e18866)
> currently op_applied
> 2017-10-13 21:22:07.834445 osd.5 [WRN] 1 slow requests, 1 included
> below; oldest blocked for > 30.421122 secs
> 2017-10-13 21:22:07.834452 osd.5 [WRN] slow request 30.421122 seconds
> old, received at 2017-10-13 21:21:37.413260:
> osd_op(client.6104975.0:7332411 0.a2
> 0:456218c9:::rbd_data.1a24832ae8944a.0009d21d:head
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 4068352~16384] snapc 0=[] ondisk+write+known_if_redirected e18866)
> currently op_applied
> 2017-10-13 21:22:16.238929 mon.ceph1 [WRN] Health check update: 8 slow
> requests are blocked > 32 sec (REQUEST_SLOW)
> 2017-10-13 21:22:21.239234 mon.ceph1 [WRN] Health check update: 4 slow
> requests are blocked > 32 sec (REQUEST_SLOW)
> 2017-10-13 21:22:21.329402 mon.ceph1 [INF] Health check cleared:
> REQUEST_SLOW (was: 4 slow requests are blocked > 32 sec)
> 2017-10-13 21:22:21.329490 mon.ceph1 [INF] Cluster is now healthy
> 
> So far, the following steps have been taken to attempt to resolve this:
> 
> 1) Updated to Ubuntu 16.04.3 LTS and Ceph 12.2.1.
> 
> 2) Changes to ceph.conf:
> osd max scrubs = 1
> osd scrub during recovery = false
> osd deep scrub interval = 2592000
> osd scrub max interval = 2592000
> osd deep scrub randomize ratio = 0.0
> osd disk thread ioprio priority = 7
> osd disk thread ioprio class = idle
> osd scrub sleep = 0.1
> 
> 3) Kernel I/O Scheduler set to cfq.
> 
> 4) Deep-scrub moved to cron, with a limit of one running at a time.
> 
> With these changes, scrubs now take 40-45 minutes to complete, up from
> 20-25, so the amount of time where there are client I/O issues has
> actually gotten substantially worse.
> 
> To summarize the ceph cluster, it has five nodes.  Each node has
> - Intel Xeon E5-1620 v3 3.5Ghz quad core CPU
> - 64GiB DDR4 1866
> - Intel SSD DC S3700 1GB divided into three partitions used from
> Bluestore blocks.db for each OSD
> - Separate 64GB SSD for ceph monitor data & system image.
> - 3 x 7200rpm drives (Seagate Constellation ES.3 4TB or Seagate
> Enterprise Capacity 8TB)
> - Dual Intel 10Gigabit NIC w/LACP
> 
> The SATA drives all check out healthy via smartctl and several are
> either new and were tested right before insertion into this cluster,
> or have been pulled for testing.  When tested on random operations,
> they are by and large capable of 120-150 IOPS and about 30MB/sec
> throughput at 100% utilization with response times of 5-7ms.
> 
> The CPUs are 75-90% idle.  The RAM is largely unused (~55GiB free).
> The network is nearly idle (<50Mbps TX & RX, often <10Mbps).  The
> blocks.db SSDs report 0% to 0.2% utilization.  The system/monitor SSD
> reports 0-0.5% utilization.  The SATA drives report between 

Re: [ceph-users] osd max scrubs not honored?

2017-10-15 Thread Maged Mokhtar
Correction: I limit it to 128K:  

echo 128 > /sys/block/sdX/queue/read_ahead_kb 

On 2017-10-15 13:14, Maged Mokhtar wrote:

> On 2017-10-14 05:02, J David wrote:
> 
>> Thanks all for input on this.
>> 
>> It's taken a couple of weeks, but based on the feedback from the list,
>> we've got our version of a scrub-one-at-a-time cron script running and
>> confirmed that it's working properly.
>> 
>> Unfortunately, this hasn't really solved the real problem.  Even with
>> just one scrub and one client running, client I/O requests routinely
>> take 30-60 seconds to complete (read or write), which is so poor that
>> the cluster is unusable for any sort of interactive activity.  Nobody
>> is going to sit around and wait 30-60 seconds for a file to save or
>> load, or for a web server to respond, or a SQL query to finish.
>> 
>> Running "ceph -w" blames this on slow requests blocked for > 32 seconds:
>> 
>> 2017-10-13 21:21:34.445798 mon.ceph1 [INF] overall HEALTH_OK
>> 2017-10-13 21:21:51.305661 mon.ceph1 [WRN] Health check failed: 42
>> slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2017-10-13 21:21:57.311892 mon.ceph1 [WRN] Health check update: 140
>> slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2017-10-13 21:22:03.343443 mon.ceph1 [WRN] Health check update: 111
>> slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2017-10-13 21:22:01.833605 osd.5 [WRN] 1 slow requests, 1 included
>> below; oldest blocked for > 30.526819 secs
>> 2017-10-13 21:22:01.833614 osd.5 [WRN] slow request 30.526819 seconds
>> old, received at 2017-10-13 21:21:31.306718:
>> osd_op(client.6104975.0:7330926 0.a2
>> 0:456218c9:::rbd_data.1a24832ae8944a.0009d21d:head
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 2364416~88064] snapc 0=[] ondisk+write+known_if_redirected e18866)
>> currently sub_op_commit_rec from 9
>> 2017-10-13 21:22:11.238561 mon.ceph1 [WRN] Health check update: 24
>> slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2017-10-13 21:22:04.834075 osd.5 [WRN] 1 slow requests, 1 included
>> below; oldest blocked for > 30.291869 secs
>> 2017-10-13 21:22:04.834082 osd.5 [WRN] slow request 30.291869 seconds
>> old, received at 2017-10-13 21:21:34.542137:
>> osd_op(client.6104975.0:7331703 0.a2
>> 0:4571f0f6:::rbd_data.1a24832ae8944a.0009c8ef:head
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 2934272~46592] snapc 0=[] ondisk+write+known_if_redirected e18866)
>> currently op_applied
>> 2017-10-13 21:22:07.834445 osd.5 [WRN] 1 slow requests, 1 included
>> below; oldest blocked for > 30.421122 secs
>> 2017-10-13 21:22:07.834452 osd.5 [WRN] slow request 30.421122 seconds
>> old, received at 2017-10-13 21:21:37.413260:
>> osd_op(client.6104975.0:7332411 0.a2
>> 0:456218c9:::rbd_data.1a24832ae8944a.0009d21d:head
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 4068352~16384] snapc 0=[] ondisk+write+known_if_redirected e18866)
>> currently op_applied
>> 2017-10-13 21:22:16.238929 mon.ceph1 [WRN] Health check update: 8 slow
>> requests are blocked > 32 sec (REQUEST_SLOW)
>> 2017-10-13 21:22:21.239234 mon.ceph1 [WRN] Health check update: 4 slow
>> requests are blocked > 32 sec (REQUEST_SLOW)
>> 2017-10-13 21:22:21.329402 mon.ceph1 [INF] Health check cleared:
>> REQUEST_SLOW (was: 4 slow requests are blocked > 32 sec)
>> 2017-10-13 21:22:21.329490 mon.ceph1 [INF] Cluster is now healthy
>> 
>> So far, the following steps have been taken to attempt to resolve this:
>> 
>> 1) Updated to Ubuntu 16.04.3 LTS and Ceph 12.2.1.
>> 
>> 2) Changes to ceph.conf:
>> osd max scrubs = 1
>> osd scrub during recovery = false
>> osd deep scrub interval = 2592000
>> osd scrub max interval = 2592000
>> osd deep scrub randomize ratio = 0.0
>> osd disk thread ioprio priority = 7
>> osd disk thread ioprio class = idle
>> osd scrub sleep = 0.1
>> 
>> 3) Kernel I/O Scheduler set to cfq.
>> 
>> 4) Deep-scrub moved to cron, with a limit of one running at a time.
>> 
>> With these changes, scrubs now take 40-45 minutes to complete, up from
>> 20-25, so the amount of time where there are client I/O issues has
>> actually gotten substantially worse.
>> 
>> To summarize the ceph cluster, it has five nodes.  Each node has
>> - Intel Xeon E5-1620 v3 3.5Ghz quad core CPU
>> - 64GiB DDR4 1866
>> - Intel SSD DC S3700 1GB divided into three partitions used for
>> BlueStore blocks.db for each OSD
>> - Sepa

Re: [ceph-users] Bareos and libradosstriper work only for 4M stripe_unit size

2017-10-17 Thread Maged Mokhtar
>> Would it be 4 objects of 24M and 4 objects of 250KB? Or will the last
4 objects be artificially padded (with 0's) to meet the stripe_unit? 

It will be 4 objects of 24M + 1M stored on the 5th object. 

If you write 104M: 4 objects of 24M + 8M stored on the 5th object. 

If you write 105M: 4 objects of 24M + 8M stored on the 5th object + 1M
on the 6th object. 
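
To make the mapping concrete, here is a small illustrative Python sketch (not
taken from the Ceph/libradosstriper sources) that computes which object and
which in-object offset a given logical byte offset lands on, using the same
stripe_unit/stripe_count/object_size rules discussed in this thread:

# Illustrative only: RAID-0 style striping as used by libradosstriper.
# All sizes are in bytes; "offset" is the logical offset into the striped data.
def locate(offset, stripe_unit=8 << 20, stripe_count=4, object_size=24 << 20):
    objectset_size = object_size * stripe_count         # 96M per set of 4 objects
    objectset = offset // objectset_size                 # which group of objects
    in_set = offset % objectset_size
    stripe_no = in_set // stripe_unit                    # which stripe unit within the set
    obj_in_set = stripe_no % stripe_count                # round-robin over the set
    obj_index = objectset * stripe_count + obj_in_set    # 0-based object number
    obj_offset = (stripe_no // stripe_count) * stripe_unit + in_set % stripe_unit
    return obj_index, obj_offset

print(locate(96 << 20))   # offset 96M -> (4, 0): the start of the 5th object, as above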

Maged 

On 2017-10-17 01:59, Christian Wuerdig wrote:

> Maybe an additional example where the numbers don't line up all so
> nicely would be good as well. For example it's not immediately obvious
> to me what would happen with the stripe settings given by your example
> but you write 97M of data
> Would it be 4 objects of 24M and 4 objects of 250KB? Or will the last
> 4 objects be artificially padded (with 0's) to meet the stripe_unit?
> 
> On Tue, Oct 17, 2017 at 12:35 PM, Alexander Kushnirenko
>  wrote: Hi, Gregory, Ian!
> 
> There is very little information on striper mode in Ceph documentation.
> Could this explanation help?
> 
> The logic of striper mode is very much the same as in RAID-0.  There are 3
> parameters that drive it:
> 
> stripe_unit - the stripe size  (default=4M)
> stripe_count - how many objects to write in parallel (default=1)
> object_size  - when to stop increasing object size and create new objects.
> (default =4M)
> 
> For example if you write 128M of data (128 consecutive pieces of data 1M
> each) in striped mode with the following parameters:
> stripe_unit = 8M
> stripe_count = 4
> object_size = 24M
> Then 8 objects will be created - 4 objects with 24M size and 4 objects with
> 8M size.
> 
> Obj1=24M    Obj2=24M    Obj3=24M    Obj4=24M
> 00 .. 07    08 .. 0f    10 .. 17    18 .. 1f   <-- consecutive 1M pieces of data
> 20 .. 27    28 .. 2f    30 .. 37    38 .. 3f
> 40 .. 47    48 .. 4f    50 .. 57    58 .. 5f
> 
> Obj5= 8M    Obj6= 8M    Obj7= 8M    Obj8= 8M
> 60 .. 67    68 .. 6f    70 .. 77    78 .. 7f
> 
> Alexander.
> 
> On Wed, Oct 11, 2017 at 3:19 PM, Alexander Kushnirenko
>  wrote: 
> Oh!  I put a wrong link, sorry  The picture which explains stripe_unit and
> stripe count is here:
> 
> https://indico.cern.ch/event/330212/contributions/1718786/attachments/642384/883834/CephPluginForXroot.pdf
> 
> I tried to attach it in the mail, but it was blocked.
> 
> On Wed, Oct 11, 2017 at 3:16 PM, Alexander Kushnirenko
>  wrote: 
> Hi, Ian!
> 
> Thank you for your reference!
> 
> Could you comment on the following rule:
> object_size = stripe_unit * stripe_count
> Or it is not necessarily so?
> 
> I refer to page 8 in this report:
> 
> https://indico.cern.ch/event/531810/contributions/2298934/attachments/1358128/2053937/Ceph-Experience-at-RAL-final.pdf
> 
> Alexander.
> 
> On Wed, Oct 11, 2017 at 1:11 PM,  wrote: 
> Hi Gregory
> 
> You're right, when setting the object layout in libradosstriper, one
> should set all three parameters (the number of stripes, the size of the
> stripe unit, and the size of the striped object). The Ceph plugin for
> GridFTP has an example of this at
> https://github.com/stfc/gridFTPCephPlugin/blob/master/ceph_posix.cpp#L371
> 
> At RAL, we use the following values:
> 
> $STRIPER_NUM_STRIPES 1
> 
> $STRIPER_STRIPE_UNIT 8388608
> 
> $STRIPER_OBJECT_SIZE 67108864
> 
> Regards,
> 
> Ian Johnson MBCS
> 
> Data Services Group
> 
> Scientific Computing Department
> 
> Rutherford Appleton Laboratory
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] Ceph-ISCSI

2017-10-17 Thread Maged Mokhtar
The issue with active/active is the following condition:
client initiator sends a write operation to gateway server A
server A does not respond within the client timeout
client initiator re-sends the failed write operation to gateway server B
client initiator sends another write operation to gateway server C (or B)
on the same sector with different data
server A wakes up and writes its pending data, which will over-write the sector
with old data 

As Jason mentioned this is an edge condition, but it poses challenges on
how to deal with it; some approaches: 

-increase the client failover timeout + implement fencing with a
smaller heartbeat timeout (a rough sketch of the fencing idea follows below). 
-implement a distributed operation counter (using a Ceph object or a
distributed configuration/DLM tool) so that if server B gets an
operation it can detect this was because of server A failing and start
fencing action. 
-similar to the above but rely on iSCSI session counters in Microsoft
MC/S; MPIO does not generate consecutive numbers across the different
session paths. 
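
As a rough illustration of the fencing building block (purely a sketch, not how
any existing gateway implements it): each gateway could take a time-limited
exclusive advisory lock on a small RADOS object per LUN before servicing writes,
so a gateway that stalls past the lock duration loses ownership and can detect,
when it wakes up, that it must discard its pending writes. The pool, object and
cookie names below are made up:

import rados

POOL, LOCK_OBJ, GW = "iscsi-meta", "lun0.lock", "gateway-a"   # hypothetical names

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx(POOL)

# Take a 30 second exclusive advisory lock; it expires on its own if we stall.
ioctx.lock_exclusive(LOCK_OBJ, "lun-owner", GW, desc="iSCSI write fencing",
                     duration=30)
try:
    pass  # service I/O here, renewing the lock well before it expires
finally:
    ioctx.unlock(LOCK_OBJ, "lun-owner", GW)   # fails if we lost the lock meanwhile
    ioctx.close()
    cluster.shutdown()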

Maged 

On 2017-10-17 12:23, Jorge Pinilla López wrote:

> So what I have understood the final sum up was to support MC to be able to 
> Multipath Active/Active 
> 
> How is that proyect going? Windows will be able to support it because they 
> have already implemented it client-side but unless ESXi implements it, VMware 
> will only be able to do Active/Passive, am I right?
> 
> El 17/10/2017 a las 11:01, Frédéric Nass escribió: 
> 
> Hi folks, 
> 
> For those who missed it, the fun was here :-) : 
> https://youtu.be/IgpVOOVNJc0?t=3715 
> 
> Frederic. 
> 
> - On 11 Oct 17, at 17:05, Jake Young wrote:
> 
> On Wed, Oct 11, 2017 at 8:57 AM Jason Dillaman  wrote: 
> 
> On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López  
> wrote:
> 
> As far as I am able to understand there are 2 ways of setting iscsi for ceph
> 
> 1- using kernel (lrbd) only able on SUSE, CentOS, fedora... 
> 
> The target_core_rbd approach is only utilized by SUSE (and its derivatives 
> like PetaSAN) as far as I know. This was the initial approach for Red 
> Hat-derived kernels as well until the upstream kernel maintainers indicated 
> that they really do not want a specialized target backend for just krbd. The 
> next attempt was to re-use the existing target_core_iblock to interface with 
> krbd via the kernel's block layer, but that hit similar upstream walls trying 
> to get support for SCSI command passthrough to the block layer. 
> 
> 2- using userspace (tcmu , ceph-iscsi-conf, ceph-iscsi-cli) 
> 
> The TCMU approach is what upstream and Red Hat-derived kernels will support 
> going forward.  
> 
> The lrbd project was developed by SUSE to assist with configuring a cluster 
> of iSCSI gateways via the cli.  The ceph-iscsi-config + ceph-iscsi-cli 
> projects are similar in goal but take a slightly different approach. 
> ceph-iscsi-config provides a set of common Python libraries that can be 
> re-used by ceph-iscsi-cli and ceph-ansible for deploying and configuring the 
> gateway. The ceph-iscsi-cli project provides the gwcli tool which acts as a 
> cluster-aware replacement for targetcli. 
> 
> I don't know which one is better, I am seeing that oficial support is 
> pointing to tcmu but i havent done any testbench. 
> 
> We (upstream Ceph) provide documentation for the TCMU approach because that 
> is what is available against generic upstream kernels (starting with 4.14 
> when it's out). Since it uses librbd (which still needs to undergo some 
> performance improvements) instead of krbd, we know that librbd 4k IO 
> performance is slower compared to krbd, but 64k and 128k IO performance is 
> comparable. However, I think most iSCSI tuning guides would already tell you 
> to use larger block sizes (i.e. 64K NTFS blocks or 32K-128K ESX blocks). 
> 
> Does anyone tried both? Do they give the same output? Are both able to manage 
> multiple iscsi targets mapped to a single rbd disk? 
> 
> Assuming you mean multiple portals mapped to the same RBD disk, the answer is 
> yes, both approaches should support ALUA. The ceph-iscsi-config tooling will 
> only configure Active/Passive because we believe there are certain edge 
> conditions that could result in data corruption if configured for 
> Active/Active ALUA. 
> The TCMU approach also does not currently support SCSI persistent reservation 
> groups (needed for Windows clustering) because that support isn't available 
> in the upstream kernel. The SUSE kernel has an approach that utilizes two 
> round-trips to the OSDs for each IO to simulate PGR support. Earlier this 
> summer I believe SUSE started to look into how to get generic PGR support 
> merged into the upstream kernel using corosync/dlm to synchronize the states 
> between multiple nodes in the target. I am not sure of the current state of 
> that work, but it would benefit all LIO targets when complete. 
> 
> I will try to make my own testing but if anyone 

Re: [ceph-users] Ceph-ISCSI

2017-10-12 Thread Maged Mokhtar
On 2017-10-11 14:57, Jason Dillaman wrote:

> On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López  
> wrote:
> 
>> As far as I am able to understand there are 2 ways of setting iscsi for ceph
>> 
>> 1- using kernel (lrbd) only able on SUSE, CentOS, fedora...
> 
> The target_core_rbd approach is only utilized by SUSE (and its derivatives 
> like PetaSAN) as far as I know. This was the initial approach for Red 
> Hat-derived kernels as well until the upstream kernel maintainers indicated 
> that they really do not want a specialized target backend for just krbd. The 
> next attempt was to re-use the existing target_core_iblock to interface with 
> krbd via the kernel's block layer, but that hit similar upstream walls trying 
> to get support for SCSI command passthrough to the block layer. 
> 
>> 2- using userspace (tcmu , ceph-iscsi-conf, ceph-iscsi-cli)
> 
> The TCMU approach is what upstream and Red Hat-derived kernels will support 
> going forward.  
> 
> The lrbd project was developed by SUSE to assist with configuring a cluster 
> of iSCSI gateways via the cli.  The ceph-iscsi-config + ceph-iscsi-cli 
> projects are similar in goal but take a slightly different approach. 
> ceph-iscsi-config provides a set of common Python libraries that can be 
> re-used by ceph-iscsi-cli and ceph-ansible for deploying and configuring the 
> gateway. The ceph-iscsi-cli project provides the gwcli tool which acts as a 
> cluster-aware replacement for targetcli. 
> 
>> I don't know which one is better, I am seeing that oficial support is 
>> pointing to tcmu but i havent done any testbench.
> 
> We (upstream Ceph) provide documentation for the TCMU approach because that 
> is what is available against generic upstream kernels (starting with 4.14 
> when it's out). Since it uses librbd (which still needs to undergo some 
> performance improvements) instead of krbd, we know that librbd 4k IO 
> performance is slower compared to krbd, but 64k and 128k IO performance is 
> comparable. However, I think most iSCSI tuning guides would already tell you 
> to use larger block sizes (i.e. 64K NTFS blocks or 32K-128K ESX blocks). 
> 
>> Does anyone tried both? Do they give the same output? Are both able to 
>> manage multiple iscsi targets mapped to a single rbd disk?
> 
> Assuming you mean multiple portals mapped to the same RBD disk, the answer is 
> yes, both approaches should support ALUA. The ceph-iscsi-config tooling will 
> only configure Active/Passive because we believe there are certain edge 
> conditions that could result in data corruption if configured for 
> Active/Active ALUA. 
> 
> The TCMU approach also does not currently support SCSI persistent reservation 
> groups (needed for Windows clustering) because that support isn't available 
> in the upstream kernel. The SUSE kernel has an approach that utilizes two 
> round-trips to the OSDs for each IO to simulate PGR support. Earlier this 
> summer I believe SUSE started to look into how to get generic PGR support 
> merged into the upstream kernel using corosync/dlm to synchronize the states 
> between multiple nodes in the target. I am not sure of the current state of 
> that work, but it would benefit all LIO targets when complete. 
> 
>> I will try to make my own testing but if anyone has tried in advance it 
>> would be really helpful.
>> 
>> -
>> JORGE PINILLA LÓPEZ
>> jorp...@unizar.es
>> 
>> -
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [2]
> 
> -- 
> 
> Jason 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Hi Jason, 

Similar to the TCMU user space backstore approach, I would prefer that cluster
sync of PR and other task management be done in user space. It really does
not belong in the kernel and will give more flexibility in
implementation. A user space PR get/set interface could be implemented
via: 

-corosync 
-writing PR metadata to Ceph / a network share (a minimal sketch follows below)
-using Ceph watch/notify 
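
For example, the "PR metadata in Ceph" option could be as simple as each gateway
persisting the registration/reservation blob into a small RADOS object that every
gateway can read back when it (re)configures its backstore. A minimal sketch (the
pool and object names are made up, and the payload is just a placeholder):

import rados

POOL, OBJ = "iscsi-meta", "pr_iqn.2003-01.org.example:tgt0"   # hypothetical names

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx(POOL)

# one gateway stores the current PR state for a LUN...
ioctx.write_full(OBJ, b"PR_REG_START: ... PR_REG_END:")

# ...and any gateway can fetch it back (pass a length larger than the blob)
blob = ioctx.read(OBJ, length=65536)

ioctx.close()
cluster.shutdown()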

Also, in the future it may be beneficial to build/extend on Ceph features
such as exclusive locks and paxos-based leader election for applications
such as iSCSI gateways to use for resource distribution and failover, as
an alternative to Pacemaker, which has scalability limits. 

Maged 

  

Links:
--
[2] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-ISCSI

2017-10-12 Thread Maged Mokhtar
On 2017-10-12 11:32, David Disseldorp wrote:

> On Wed, 11 Oct 2017 14:03:59 -0400, Jason Dillaman wrote:
> 
> On Wed, Oct 11, 2017 at 1:10 PM, Samuel Soulard
>  wrote: Hmmm, If you failover the identity of the 
> LIO configuration including PGRs
> (I believe they are files on disk), this would work no?  Using an 2 ISCSI
> gateways which have shared storage to store the LIO configuration and PGR
> data.   
> Are you referring to the Active Persist Through Power Loss (APTPL)
> support in LIO where it writes the PR metadata to
> "/var/target/pr/aptpl_"? I suppose that would work for a
> Pacemaker failover if you had a shared file system mounted between all
> your gateways *and* the initiator requests APTPL mode(?).

I'm going off on a tangent here, but I can't seem to find where LIO
reads the /var/target/pr/aptpl_ PR state back off disk -
__core_scsi3_write_aptpl_to_file() seems to be the only function that
uses the path. Otherwise I would have thought the same, that the
propagating the file to backup gateways prior to failover would be
sufficient.

Cheers, David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

This code from rtslib may help: 

https://github.com/open-iscsi/rtslib-fb/blob/master/rtslib/tcm.py 

def _config_pr_aptpl(self):
    """
    LIO actually *writes* pr aptpl info to the filesystem, so we
    need to read it in and squirt it back into configfs when we configure
    the storage object. BLEH.
    """
    from .root import RTSRoot
    aptpl_dir = "%s/pr" % RTSRoot().dbroot

    try:
        lines = fread("%s/aptpl_%s" % (aptpl_dir, self.wwn)).split()
    except:
        return

    if not lines[0].startswith("PR_REG_START:"):
        return

    reservations = []
    for line in lines:
        if line.startswith("PR_REG_START:"):
            res_list = []
        elif line.startswith("PR_REG_END:"):
            reservations.append(res_list)
        else:
            res_list.append(line.strip())

    for res in reservations:
        fwrite(self.path + "/pr/res_aptpl_metadata", ",".join(res))
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Power outages!!! help!

2017-08-29 Thread Maged Mokhtar
One of the things to watch out for in small clusters is that OSDs can get full
rather unexpectedly in recovery/backfill cases: 

In your case you have 2 OSD nodes with 5 disks each. Since you have a
replica count of 2, each PG will have 1 copy on each host, so if an OSD fails,
all its PGs will have to be re-created on the same host, meaning they
will be distributed only among the remaining 4 OSDs on that host, which will
quickly bump their usage by nearly 20% each.
The default osd_backfill_full_ratio is 85%, so if any of the 4 OSDs was
near 70% utilization before the failure, it will easily reach 85% and cause the
cluster to error with the backfill_toofull message you see.  This is why I
suggest you add an extra disk, or try your luck raising
osd_backfill_full_ratio to 92%; it may fix things. 
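
If you try the second route, something along these lines should apply it to the
running cluster (the value is an example, and note the option takes a ratio, i.e.
0.92 rather than 92; double-check the option name on your release, and add it to
the [osd] section of ceph.conf as well if you want it to persist): 

ceph tell osd.* injectargs '--osd_backfill_full_ratio 0.92'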

/Maged 

On 2017-08-29 21:13, hjcho616 wrote:

> Nice!  Thank you for the explanation!  I feel like I can revive that OSD. =)  
> That does sound great.  I don't quite have another cluster so waiting for a 
> drive to arrive! =)   
> 
> After setting min and max_min to 1, looks like toofull flag is gone... Maybe 
> when I was making that video copy OSDs were already down... and those two 
> OSDs were not enough to take too much extra...  and on top of it that last 
> OSD alive was smaller disk (2TB vs 320GB)... so it probably was filling up 
> faster.  I should have captured that message... but turned machine off and 
> now I am at work. =P  When I get back home, I'll try to grab that and share.  
> Maybe I don't need to try to add another OSD to that cluster just yet!  OSDs 
> are about 50% full on OSD1. 
> 
> So next up, fixing osd0! 
> 
> Regards, 
> Hong   
> 
> On Tuesday, August 29, 2017 1:05 PM, David Turner  
> wrote:
> 
> But it was absolutely awesome to run an osd off of an rbd after the disk 
> failed. 
> 
> On Tue, Aug 29, 2017, 1:42 PM David Turner  wrote: 
> To addend Steve's success, the rbd was created in a second cluster in the 
> same datacenter so it didn't run the risk of deadlocking that mapping rbds on 
> machines running osds has.  It is still theoretical to work on the same 
> cluster, but more inherently dangerous for a few reasons. 
> 
> On Tue, Aug 29, 2017, 1:15 PM Steve Taylor  
> wrote: Hong,
> 
> Probably your best chance at recovering any data without special,
> expensive, forensic procedures is to perform a dd from /dev/sdb to
> somewhere else large enough to hold a full disk image and attempt to
> repair that. You'll want to use 'conv=noerror' with your dd command
> since your disk is failing. Then you could either re-attach the OSD
> from the new source or attempt to retrieve objects from the filestore
> on it.
> 
> I have actually done this before by creating an RBD that matches the
> disk size, performing the dd, running xfs_repair, and eventually
> adding it back to the cluster as an OSD. RBDs as OSDs is certainly a
> temporary arrangement for repair only, but I'm happy to report that it
> worked flawlessly in my case. I was able to weight the OSD to 0,
> offload all of its data, then remove it for a full recovery, at which
> point I just deleted the RBD.
> 
> The possibilities afforded by Ceph inception are endless. ☺
> 
> Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2799 |
> 
> If you are not the intended recipient of this message or received it 
> erroneously, please notify the sender and delete it, together with any 
> attachments, and be advised that any dissemination or copying of this message 
> is prohibited.
> 
> On Mon, 2017-08-28 at 23:17 +0100, Tomasz Kusmierz wrote:
>> Rule of thumb with batteries is:
>> - more "proper temperature" you run them at the more life you get out
>> of them
>> - more battery is overpowered for your application the longer it will
>> survive. 
>> 
>> Get your self a LSI 94** controller and use it as HBA and you will be
>> fine. but get MORE DRIVES ! ... 
>>> On 28 Aug 2017, at 23:10, hjcho616  wrote:
>>>
>>> Thank you Tomasz and Ronny.  I'll have to order some hdd soon and
>>> try these out.  Car battery idea is nice!  I may try that.. =)  Do
>>> they last longer?  Ones that fit the UPS original battery spec
>>> didn't last very long... part of the reason why I gave up on them..
>>> =P  My wife probably won't like the idea of car battery hanging out
>>> though ha!
>>>
>>> The OSD1 (one with mostly ok OSDs, except that smart failure)
>>> motherboard doesn't have any additional SATA connectors available.
>>>  Would it be safe to add another OSD host?
>>>
>>> Regards,
>>> Hong
>>>
>>>
>>>
>>> On Monday, August 28, 2017 4:43 PM, Tomasz Kusmierz >> mail.com [1]> wrote:
>>>
>>>
>>> Sorry for being brutal ... anyway 
>>> 1. get the battery for UPS ( a car battery will do as well, I've
>>> moded on ups in the past with truck battery and it was working like
>>> a 

Re: [ceph-users] RBD features(kernel client) with kernel version

2017-09-26 Thread Maged Mokhtar
On 2017-09-25 14:29, Ilya Dryomov wrote:

> On Sat, Sep 23, 2017 at 12:07 AM, Muminul Islam Russell
>  wrote: 
> 
>> Hi Ilya,
>> 
>> Hope you are doing great.
>> Sorry for bugging you. I did not find enough resources for my question.  I
>> would be really helped if you could reply me. My questions are in red
>> colour.
>> 
>> - layering: layering support:
>> Kernel: 3.10 and plus, right?
> 
> Yes.
> 
>> - striping: striping v2 support:
>> What kernel is supporting this feature?
> 
> Only the default striping v2 pattern (i.e. stripe unit == object size
> and stripe count == 1) is supported.
> 
>> - exclusive-lock: exclusive locking support:
>> It's supposed to be 4.9. Right?
> 
> Yes.
> 
>> rest the the features below is under development? or any feature is
>> available in any latest kernel?
>> - object-map: object map support (requires exclusive-lock):
>> - fast-diff: fast diff calculations (requires object-map):
>> - deep-flatten: snapshot flatten support:
>> - journaling: journaled IO support (requires exclusive-lock):
> 
> The former, none of these are available in latest kernels.
> 
> A separate data pool feature (rbd create --data-pool ) is
> supported since 4.11.
> 
> Thanks,
> 
> Ilya
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Hello Ilya, 

Any rough estimate of when rbd journaling will be added to the kernel rbd client?
I realize it is a lot of work. 

Cheers /Maged
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd create returns duplicate ID's

2017-09-29 Thread Maged Mokhtar
On 2017-09-29 10:44, Adrian Saul wrote:

> Do you mean that after you delete and remove the crush and auth entries for 
> the OSD, when you go to create another OSD later it will re-use the previous 
> OSD ID that you have destroyed in the past?
> 
> Because I have seen that behaviour as well -  but only for previously 
> allocated OSD IDs that have been osd rm/crush rm/auth del.
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Luis Periquito
>> Sent: Friday, 29 September 2017 6:01 PM
>> To: Ceph Users 
>> Subject: [ceph-users] osd create returns duplicate ID's
>> 
>> Hi all,
>> 
>> I use puppet to deploy and manage my clusters.
>> 
>> Recently, as I have been doing a removal of old hardware and adding of new
>> I've noticed that sometimes the "ceph osd create" is returning repeated IDs.
>> Usually it's on the same server, but yesterday I saw it in different servers.
>> 
>> I was expecting the OSD ID's to be unique, and when they come on the same
>> server puppet starts spewing errors - which is desirable - but when it's in
>> different servers it broke those OSDs in Ceph. As they hadn't backfill any 
>> full
>> PGs I just wiped, removed and started anew.
>> 
>> As for the process itself: The OSDs are marked out and removed from crush,
>> when empty they are auth del and osd rm. After building the server puppet
>> will osd create, and use the generated ID for crush move and mkfs.
>> 
>> Unfortunately I haven't been able to reproduce in isolation, and being a
>> production cluster logging is tuned way down.
>> 
>> This has happened in several different clusters, but they are all running
>> 10.2.7.
>> 
>> Any ideas?
>> 
>> thanks,
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> Confidentiality: This email and any attachments are confidential and may be 
> subject to copyright, legal or some other professional privilege. They are 
> intended solely for the attention and use of the named addressee(s). They may 
> only be copied, distributed or disclosed with the consent of the copyright 
> owner. If you have received this email by mistake or by breach of the 
> confidentiality clause, please notify the sender immediately by return email 
> and delete or destroy all copies of the email. Any confidentiality, privilege 
> or copyright is not waived or lost because this email has been sent to you by 
> mistake.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Ceph will re-use previously deleted OSD ids; this is desirable to minimize
data rebalancing. What is not correct is having duplicate active ids, and
I am not sure how this is happening, but I would suggest avoiding adding/removing
OSDs simultaneously, i.e. add them one at a time. If you can do it
manually, check that the OSD was added to CRUSH and its process is up and
running before trying to add a new one. If that still produces
duplicates then there is a serious issue. If adding via script, double
check it is not trying to do several tasks at once.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd create returns duplicate ID's

2017-09-29 Thread Maged Mokhtar
On 2017-09-29 11:31, Maged Mokhtar wrote:

> On 2017-09-29 10:44, Adrian Saul wrote: 
> 
> Do you mean that after you delete and remove the crush and auth entries for 
> the OSD, when you go to create another OSD later it will re-use the previous 
> OSD ID that you have destroyed in the past?
> 
> Because I have seen that behaviour as well -  but only for previously 
> allocated OSD IDs that have been osd rm/crush rm/auth del.
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Luis Periquito
> Sent: Friday, 29 September 2017 6:01 PM
> To: Ceph Users <ceph-users@lists.ceph.com>
> Subject: [ceph-users] osd create returns duplicate ID's
> 
> Hi all,
> 
> I use puppet to deploy and manage my clusters.
> 
> Recently, as I have been doing a removal of old hardware and adding of new
> I've noticed that sometimes the "ceph osd create" is returning repeated IDs.
> Usually it's on the same server, but yesterday I saw it in different servers.
> 
> I was expecting the OSD ID's to be unique, and when they come on the same
> server puppet starts spewing errors - which is desirable - but when it's in
> different servers it broke those OSDs in Ceph. As they hadn't backfill any 
> full
> PGs I just wiped, removed and started anew.
> 
> As for the process itself: The OSDs are marked out and removed from crush,
> when empty they are auth del and osd rm. After building the server puppet
> will osd create, and use the generated ID for crush move and mkfs.
> 
> Unfortunately I haven't been able to reproduce in isolation, and being a
> production cluster logging is tuned way down.
> 
> This has happened in several different clusters, but they are all running
> 10.2.7.
> 
> Any ideas?
> 
> thanks,
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com Confidentiality: This 
> email and any attachments are confidential and may be subject to copyright, 
> legal or some other professional privilege. They are intended solely for the 
> attention and use of the named addressee(s). They may only be copied, 
> distributed or disclosed with the consent of the copyright owner. If you have 
> received this email by mistake or by breach of the confidentiality clause, 
> please notify the sender immediately by return email and delete or destroy 
> all copies of the email. Any confidentiality, privilege or copyright is not 
> waived or lost because this email has been sent to you by mistake.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Ceph will re-use previously deleted OSD ids; this is desirable to minimize
data rebalancing. What is not correct is having duplicate active ids, and
I am not sure how this is happening, but I would suggest avoiding adding/removing
OSDs simultaneously, i.e. add them one at a time. If you can do it
manually, check that the OSD was added to CRUSH and its process is up and
running before trying to add a new one. If that still produces
duplicates then there is a serious issue. If adding via script, double
check it is not trying to do several tasks at once. 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

One more thing: if you are using a script to add OSDs, try to add a small
sleep/pause to allow the new OSD to get activated via udev and register
itself in CRUSH before starting to create the next one.
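
As a rough sketch of that idea (the disk/journal names and the wait are arbitrary
examples):

ceph-deploy osd create node1:sdb:/dev/ssd1   # create/activate one OSD at a time
sleep 60                                     # give udev time to activate it
ceph osd tree                                # confirm the new osd.N is up and in CRUSH
# only then move on to the next disk

___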
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados_read versus rados_aio_read performance

2017-10-01 Thread Maged Mokhtar
On 2017-10-01 16:47, Alexander Kushnirenko wrote:

> Hi, Gregory! 
> 
> Thanks for the comment.  I compiled simple program to play with write speed 
> measurements (from librados examples). Underline "write" functions are: 
> rados_write(io, "hw", read_res, 1048576, i*1048576); 
> rados_aio_write(io, "foo", comp, read_res, 1048576, i*1048576); 
> 
> So I consecutively put 1MB blocks on CEPH.   What I measured is that 
> rados_aio_write gives me about 5 times the speed of rados_write.  I make 128 
> consecutive writes in for loop to create object of maximum allowed size of 
> 132MB. 
> 
> Now if I do consecutive write from some client into CEPH storage, then what 
> is the recommended buffer size? (I'm trying to debug very poor Bareos write 
> speed of just 3MB/s to CEPH) 
> 
> Thank you, 
> Alexander 
> 
> On Fri, Sep 29, 2017 at 5:18 PM, Gregory Farnum  wrote:
> It sounds like you are doing synchronous reads of small objects here. In that 
> case you are dominated by the per-op already rather than the throughout of 
> your cluster. Using aio or multiple threads will let you parallelism requests.
> -Greg
> 
> On Fri, Sep 29, 2017 at 3:33 AM Alexander Kushnirenko  
> wrote: 
> 
> Hello, 
> 
> We see very poor performance when reading/writing rados objects.  The speed 
> is only 3-4MB/sec, compared to 95MB rados benchmarking. 
> 
> When you look on underline code it uses librados and linradosstripper 
> libraries (both have poor performance) and the code uses rados_read and 
> rados_write functions.  If you look on examples they recommend 
> rados_aio_read/write.   
> 
> Could this be the reason for poor performance? 
> 
> Thank you, 
> Alexander. ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [1]

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

Even the 95MB/s rados benchmark may still be indicative of a problem: it
defaults to creating 16 (or maybe 32) threads, so it can be writing to 16
different OSDs simultaneously.  To get a value closer to what you
are doing, try rados bench with 1 thread and a 1M block size (the default
is 4M), such as:  

rados bench -p testpool -b 1048576 30 write -t 1 --no-cleanup

  

Links:
--
[1] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Get rbd performance stats

2017-09-29 Thread Maged Mokhtar
On 2017-09-29 17:13, Matthew Stroud wrote:

> Is there a way I could get a performance stats for rbd images? I'm looking 
> for iops and throughput. 
> 
> This issue we are dealing with is that there was a sudden jump in throughput 
> and I want to be able to find out with rbd volume might be causing it. I just 
> manage the ceph cluster, not the openstack hypervisors. I'm hoping I can 
> figure out the offending volume with the tool set I have. 
> 
> Thanks, 
> 
> Matthew Stroud 
> 
> -
> 
> CONFIDENTIALITY NOTICE: This message is intended only for the use and review 
> of the individual or entity to which it is addressed and may contain 
> information that is privileged and confidential. If the reader of this 
> message is not the intended recipient, or the employee or agent responsible 
> for delivering the message solely to the intended recipient, you are hereby 
> notified that any dissemination, distribution or copying of this 
> communication is strictly prohibited. If you have received this communication 
> in error, please notify sender immediately by telephone or return email. 
> Thank you.
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

If you use a kernel-mapped rbd image you should be able to get I/O stats
from most stats tools; it will show up as a regular block device.
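
For example, if the image is mapped as /dev/rbd0, something like this will report
its IOPS and throughput every 5 seconds:

iostat -x rbd0 5

___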
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Power outages!!! help!

2017-08-28 Thread Maged Mokhtar
I would suggest either adding 1 new disk on each of the 2 machines or
increasing the osd_backfill_full_ratio to something like 90 or 92 from
the default 85. 

/Maged  

On 2017-08-28 08:01, hjcho616 wrote:

> Hello! 
> 
> I've been using ceph for long time mostly for network CephFS storage, even 
> before Argonaut release!  It's been working very well for me.  Yes, I had 
> some power outtages before and asked few questions on this list before and 
> got resolved happily!  Thank you all! 
> 
> Not sure why but we've been having quite a bit of power outages lately.  Ceph 
> appear to be running OK with those going on.. so I was pretty happy and 
> didn't thought much of it... till yesterday, When I started to move some 
> videos to cephfs, ceph decided that it was full although df showed only 54% 
> utilization!  Then I looked up, some of the osds were down! (only 3 at that 
> point!) 
> 
> I am running pretty simple ceph configuration... I have one machine running 
> MDS and mon named MDS1.  Two OSD machines with 5 2TB HDDs and 1 SSD for 
> journal named OSD1 and OSD2. 
> 
> At the time, I was running jewel 10.2.2. I looked at some of downed OSD's log 
> file and googled some of them... they appeared to be tied to version 10.2.2.  
> So I just upgraded all to 10.2.9.  Well that didn't solve my problems.. =P  
> While looking at some of this.. there was another power outage!  D'oh!  I may 
> need to invest in a UPS or something... Until this happened, all of the osd 
> down were from OSD2.  But OSD1 took a hit!  Couldn't boot, because osd-0 was 
> damaged... I tried xfs_repair -L /dev/sdb1 as suggested by command line.. I 
> was able to mount it again, phew, reboot... then /dev/sdb1 is no longer 
> accessible!  N!!! 
> 
> So this is what I have today!  I am a bit concerned as half of the osds are 
> down!  and osd.0 doesn't look good at all... 
> # ceph osd tree 
> ID WEIGHT   TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY 
> -1 16.24478 root default 
> -2  8.12239 host OSD1 
> 1  1.95250 osd.1  up  1.0  1.0 
> 0  1.95250 osd.0down0  1.0 
> 7  0.31239 osd.7  up  1.0  1.0 
> 6  1.95250 osd.6  up  1.0  1.0 
> 2  1.95250 osd.2  up  1.0  1.0 
> -3  8.12239 host OSD2 
> 3  1.95250 osd.3down0  1.0 
> 4  1.95250 osd.4down0  1.0 
> 5  1.95250 osd.5down0  1.0 
> 8  1.95250 osd.8down0  1.0 
> 9  0.31239 osd.9  up  1.0  1.0 
> 
> This looked alot better before that last extra power outage... =(  Can't 
> mount it anymore! 
> # ceph health 
> HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 44 pgs 
> backfill_toofull; 80 pgs backfill_wait; 122 pgs degraded; 6 pgs down; 8 pgs 
> inconsistent; 6 pgs peering; 2 pgs recovering; 18 pgs recovery_wait; 16 pgs 
> stale; 122 pgs stuck degraded; 6 pgs stuck inactive; 16 pgs stuck stale; 159 
> pgs stuck unclean; 102 pgs stuck undersized; 102 pgs undersized; 1 requests 
> are blocked > 32 sec; recovery 1803466/4503980 objects degraded (40.042%); 
> recovery 692976/4503980 objects misplaced (15.386%); recovery 147/2251990 
> unfound (0.007%); 1 near full osd(s); 54 scrub errors; mds cluster is 
> degraded; no legacy OSD present but 'sortbitwise' flag is not set 
> 
> Each of osds are showing different failure signature.  
> 
> I've uploaded osd log with debug osd = 20, debug filestore = 20, and debug ms 
> = 20.  You can find it in below links.  Let me know if there is preferred way 
> to share this! 
> https://drive.google.com/open?id=0By7YztAJNGUWQXItNzVMR281Snc 
> (ceph-osd.3.log) 
> https://drive.google.com/open?id=0By7YztAJNGUWYmJBb3RvLVdSQWc 
> (ceph-osd.4.log) 
> https://drive.google.com/open?id=0By7YztAJNGUWaXhRMlFOajN6M1k 
> (ceph-osd.5.log) 
> https://drive.google.com/open?id=0By7YztAJNGUWdm9BWFM5a3ExOFE 
> (ceph-osd.8.log) 
> 
> So how does this look?  Can this be fixed? =)  If so please let me know.  I 
> used to take backups but since it grew so big, I wasn't able to do so 
> anymore... and would like to get most of these back if I can.  Please let me 
> know if you need more info! 
> 
> Thank you! 
> 
> Regards, 
> Hong 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to increase the size of requests written to a ceph image

2017-10-18 Thread Maged Mokhtar
First a general comment: local RAID will be faster than Ceph for a
single threaded (queue depth=1) io operation test. A single threaded Ceph
client will see at best the same speed as a single disk for reads, and 4-6
times slower than a single disk for writes. Not to mention the latency of local
disks will be much better. Where Ceph shines is when you have many concurrent
ios: it scales, whereas RAID will decrease speed per client as you add
more. 

Having said that, i would recommend running rados/rbd bench-write and
measure 4k iops at 1 and 32 threads to get a better idea of how your
cluster performs: 

ceph osd pool create testpool 256 256 
rados bench -p testpool -b 4096 30 write -t 1
rados bench -p testpool -b 4096 30 write -t 32 
ceph osd pool delete testpool testpool --yes-i-really-really-mean-it 

rbd bench-write test-image --io-threads=1 --io-size 4096 --io-pattern
rand --rbd_cache=false
rbd bench-write test-image --io-threads=32 --io-size 4096 --io-pattern
rand --rbd_cache=false 

I think the request size difference you see is due to the io scheduler:
in the case of local disks it has more ios to re-group, so it has a better
chance of generating larger requests. Depending on your kernel, the io
scheduler may be different for rbd (blk-mq) vs sdX (cfq), but again I
would think the request size is a result, not a cause. 
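
If you want to see what each device is actually using, the scheduler in effect is
visible in sysfs, e.g. (device names are examples):

cat /sys/block/sda/queue/scheduler    # e.g. "noop deadline [cfq]"
cat /sys/block/rbd0/queue/scheduler   # a blk-mq device may just report "none"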

Maged 

On 2017-10-17 23:12, Russell Glaue wrote:

> I am running ceph jewel on 5 nodes with SSD OSDs. 
> I have an LVM image on a local RAID of spinning disks. 
> I have an RBD image on in a pool of SSD disks.
> Both disks are used to run an almost identical CentOS 7 system. 
> Both systems were installed with the same kickstart, though the disk 
> partitioning is different. 
> 
> I want to make writes on the the ceph image faster. For example, lots of 
> writes to MySQL (via MySQL replication) on a ceph SSD image are about 10x 
> slower than on a spindle RAID disk image. The MySQL server on ceph rbd image 
> has a hard time keeping up in replication. 
> 
> So I wanted to test writes on these two systems 
> I have a 10GB compressed (gzip) file on both servers. 
> I simply gunzip the file on both systems, while running iostat. 
> 
> The primary difference I see in the results is the average size of the 
> request to the disk. 
> CentOS7-lvm-raid-sata writes a lot faster to disk, and the size of the 
> request is about 40x, but the number of writes per second is about the same 
> This makes me want to conclude that the smaller size of the request for 
> CentOS7-ceph-rbd-ssd system is the cause of it being slow. 
> 
> How can I make the size of the request larger for ceph rbd images, so I can 
> increase the write throughput? 
> Would this be related to having jumbo packets enabled in my ceph storage 
> network? 
> 
> Here is a sample of the results: 
> 
> [CentOS7-lvm-raid-sata] 
> $ gunzip large10gFile.gz & 
> $ iostat -x vg_root-lv_var -d 5 -m -N 
> Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util 
> ... 
> vg_root-lv_var 0.00 0.00   30.60  452.2013.60   222.15  1000.04   
>   8.69   14.050.99   14.93   2.07 100.04 
> vg_root-lv_var 0.00 0.00   88.20  182.0039.2089.43   974.95   
>   4.659.820.99   14.10   3.70 100.00 
> vg_root-lv_var 0.00 0.00   75.45  278.2433.53   136.70   985.73   
>   4.36   33.261.34   41.91   0.59  20.84 
> vg_root-lv_var 0.00 0.00  111.60  181.8049.6089.34   969.84   
>   2.608.870.81   13.81   0.13   3.90 
> vg_root-lv_var 0.00 0.00   68.40  109.6030.4053.63   966.87   
>   1.518.460.84   13.22   0.80  14.16 
> ... 
> 
> [CentOS7-ceph-rbd-ssd] 
> $ gunzip large10gFile.gz & 
> $ iostat -x vg_root-lv_data -d 5 -m -N 
> Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util 
> ... 
> vg_root-lv_data 0.00 0.00   46.40  167.80 0.88 1.4622.36  
>1.235.662.476.54   4.52  96.82 
> vg_root-lv_data 0.00 0.00   16.60   55.20 0.36 0.1414.44  
>0.99   13.919.12   15.36  13.71  98.46 
> vg_root-lv_data 0.00 0.00   69.00  173.80 1.34 1.3222.48  
>1.255.193.775.75   3.94  95.68 
> vg_root-lv_data 0.00 0.00   74.40  293.40 1.37 1.4715.83  
>1.223.312.063.63   2.54  93.26 
> vg_root-lv_data 0.00 0.00   90.80  359.00 1.96 3.4124.45  
>1.633.631.944.05   2.10  94.38 
> ... 
> 
> [iostat key] 
> w/s == The number (after merges) of write requests completed per second for 
> the device. 
> wMB/s == The number of sectors (kilobytes, megabytes) written to the device 
> per second. 
> avgrq-sz == The average size (in kilobytes) of the requests that were issued 
> to the device. 
> avgqu-sz == The average queue length of the requests that were issued to the 
> device. 
> 
> 

Re: [ceph-users] How to increase the size of requests written to a ceph image

2017-10-18 Thread Maged Mokhtar
 4096 
> Bandwidth (MB/sec): 3.93282 
> Stddev Bandwidth:   3.66265 
> Max bandwidth (MB/sec): 13.668 
> Min bandwidth (MB/sec): 0 
> Average IOPS:   1006 
> Stddev IOPS:937 
> Max IOPS:   3499 
> Min IOPS:   0 
> Average Latency(s): 0.0317779 
> Stddev Latency(s):  0.164076 
> Max latency(s): 2.27707 
> Min latency(s): 0.0013848 
> Cleaning up (deleting benchmark objects) 
> Clean up completed and total clean up time :20.166559 
> 
> On Wed, Oct 18, 2017 at 8:51 AM, Maged Mokhtar <mmokh...@petasan.org> wrote:
> 
> First a general comment: local RAID will be faster than Ceph for a single 
> threaded (queue depth=1) io operation test. A single thread Ceph client will 
> see at best same disk speed for reads and for writes 4-6 times slower than 
> single disk. Not to mention the latency of local disks will much better. 
> Where Ceph shines is when you have many concurrent ios, it scales whereas 
> RAID will decrease speed per client as you add more. 
> 
> Having said that, i would recommend running rados/rbd bench-write and measure 
> 4k iops at 1 and 32 threads to get a better idea of how your cluster 
> performs: 
> 
> ceph osd pool create testpool 256 256 
> rados bench -p testpool -b 4096 30 write -t 1
> rados bench -p testpool -b 4096 30 write -t 32 
> ceph osd pool delete testpool testpool --yes-i-really-really-mean-it 
> 
> rbd bench-write test-image --io-threads=1 --io-size 4096 --io-pattern rand 
> --rbd_cache=false
> rbd bench-write test-image --io-threads=32 --io-size 4096 --io-pattern rand 
> --rbd_cache=false 
> 
> I think the request size difference you see is due to the io scheduler in the 
> case of local disks having more ios to re-group so has a better chance in 
> generating larger requests. Depending on your kernel, the io scheduler may be 
> different for rbd (blq-mq) vs sdx (cfq) but again i would think the request 
> size is a result not a cause. 
> 
> Maged
> 
> On 2017-10-17 23:12, Russell Glaue wrote: 
> 
> I am running ceph jewel on 5 nodes with SSD OSDs. 
> I have an LVM image on a local RAID of spinning disks. 
> I have an RBD image on in a pool of SSD disks.
> Both disks are used to run an almost identical CentOS 7 system. 
> Both systems were installed with the same kickstart, though the disk 
> partitioning is different. 
> 
> I want to make writes on the the ceph image faster. For example, lots of 
> writes to MySQL (via MySQL replication) on a ceph SSD image are about 10x 
> slower than on a spindle RAID disk image. The MySQL server on ceph rbd image 
> has a hard time keeping up in replication. 
> 
> So I wanted to test writes on these two systems 
> I have a 10GB compressed (gzip) file on both servers. 
> I simply gunzip the file on both systems, while running iostat. 
> 
> The primary difference I see in the results is the average size of the 
> request to the disk. 
> CentOS7-lvm-raid-sata writes a lot faster to disk, and the size of the 
> request is about 40x, but the number of writes per second is about the same 
> This makes me want to conclude that the smaller size of the request for 
> CentOS7-ceph-rbd-ssd system is the cause of it being slow. 
> 
> How can I make the size of the request larger for ceph rbd images, so I can 
> increase the write throughput? 
> Would this be related to having jumbo packets enabled in my ceph storage 
> network? 
> 
> Here is a sample of the results: 
> 
> [CentOS7-lvm-raid-sata] 
> $ gunzip large10gFile.gz & 
> $ iostat -x vg_root-lv_var -d 5 -m -N 
> Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util 
> ... 
> vg_root-lv_var 0.00 0.00   30.60  452.2013.60   222.15  1000.04   
>   8.69   14.050.99   14.93   2.07 100.04 
> vg_root-lv_var 0.00 0.00   88.20  182.0039.2089.43   974.95   
>   4.659.820.99   14.10   3.70 100.00 
> vg_root-lv_var 0.00 0.00   75.45  278.2433.53   136.70   985.73   
>   4.36   33.261.34   41.91   0.59  20.84 
> vg_root-lv_var 0.00 0.00  111.60  181.8049.6089.34   969.84   
>   2.608.870.81   13.81   0.13   3.90 
> vg_root-lv_var 0.00 0.00   68.40  109.6030.4053.63   966.87   
>   1.518.460.84   13.22   0.80  14.16 
> ... 
> 
> [CentOS7-ceph-rbd-ssd] 
> $ gunzip large10gFile.gz & 
> $ iostat -x vg_root-lv_data -d 5 -m -N 
> Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util 
> ... 
> vg_root-lv_data 0.00 0.00   46.40  167.80 0.88 1.4622.36  
>

Re: [ceph-users] How to increase the size of requests written to a ceph image

2017-10-18 Thread Maged Mokhtar
measuring resource load as outlined earlier will show if the drives are
performing well or not. Also how many osds do you have  ? 

On 2017-10-18 19:26, Russell Glaue wrote:

> The SSD drives are Crucial M500 
> A Ceph user did some benchmarks and found it had good performance 
> https://forum.proxmox.com/threads/ceph-bad-performance-in-qemu-guests.21551/ 
> [1] 
> 
> However, a user comment from 3 years ago on the blog post you linked to says 
> to avoid the Crucial M500 
> 
> Yet, this performance posting tells that the Crucial M500 is good. 
> https://inside.servers.com/ssd-performance-2017-c4307a92dea [2] 
> 
> On Wed, Oct 18, 2017 at 11:53 AM, Maged Mokhtar <mmokh...@petasan.org> wrote:
> 
> Check out the following link: some SSDs perform bad in Ceph due to sync 
> writes to journal 
> 
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>  [3] 
> 
> Anther thing that can help is to re-run the rados 32 threads as stress and 
> view resource usage using atop (or collectl/sar) to check for %busy cpu and 
> %busy disks to give you an idea of what is holding down your cluster..for 
> example: if cpu/disk % are all low then check your network/switches.  If disk 
> %busy is high (90%) for all disks then your disks are the bottleneck: which 
> either means you have SSDs that are not suitable for Ceph or you have too few 
> disks (which i doubt is the case). If only 1 disk %busy is high, there may be 
> something wrong with this disk should be removed. 
> 
> Maged
> 
> On 2017-10-18 18:13, Russell Glaue wrote: 
> 
> In my previous post, in one of my points I was wondering if the request size 
> would increase if I enabled jumbo packets. currently it is disabled. 
> 
> @jdillama: The qemu settings for both these two guest machines, with RAID/LVM 
> and Ceph/rbd images, are the same. I am not thinking that changing the qemu 
> settings of "min_io_size=,opt_io_size= size>" will directly address the issue. 
> @mmokhtar: Ok. So you suggest the request size is the result of the problem 
> and not the cause of the problem. meaning I should go after a different 
> issue. 
> 
> I have been trying to get write speeds up to what people on this mail list 
> are discussing. 
> It seems that for our configuration, as it matches others, we should be 
> getting about 70MB/s write speed. 
> But we are not getting that. 
> Single writes to disk are lucky to get 5MB/s to 6MB/s, but are typically 
> 1MB/s to 2MB/s. 
> Monitoring the entire Ceph cluster (using http://cephdash.crapworks.de/), I 
> have seen very rare momentary spikes up to 30MB/s. 
> 
> My storage network is connected via a 10Gb switch 
> I have 4 storage servers with a LSI Logic MegaRAID SAS 2208 controller 
> Each storage server has 9 1TB SSD drives, each drive as 1 osd (no RAID) 
> Each drive is one LVM group, with two volumes - one volume for the osd, one 
> volume for the journal 
> 
> Each osd is formatted with xfs 
> The crush map is simple: default->rack->[host[1..4]->osd] with an evenly 
> distributed weight 
> The redundancy is triple replication 
> 
> While I have read comments that having the osd and journal on the same disk 
> decreases write speed, I have also read that once past 8 OSDs per node this 
> is the recommended configuration, however this is also the reason why SSD 
> drives are used exclusively for OSDs in the storage nodes. 
> None-the-less, I was still expecting write speeds to be above 30MB/s, not 
> below 6MB/s. 
> Even at 12x slower than the RAID, using my previously posted iostat data set, 
> I should be seeing write speeds that average 10MB/s, not 2MB/s. 
> 
> In regards to the rados benchmark tests you asked me to run, here is the 
> output: 
> 
> [centos7]# rados bench -p scbench -b 4096 30 write -t 1 
> Maintaining 1 concurrent writes of 4096 bytes to objects of size 4096 for up 
> to 30 seconds or 0 objects 
> Object prefix: benchmark_data_hamms.sys.cu.cait.org_85049 
> sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s) 
> 0   0 0 0 0 0   -   0 
> 1   1   201   200   0.78356   0.78125  0.00522307  0.00496574 
> 2   1   469   468  0.915303   1.04688  0.00437497  0.00426141 
> 3   1   741   740  0.9643711.0625  0.00512853   0.0040434 
> 4   1   888   887  0.866739  0.574219  0.00307699  0.00450177 
> 5   1  1147  1146  0.895725   1.01172  0.00376454   0.0043559 
> 6   1  1325  1324  0.862293  0.695312  0.004594430.004525 
> 7   1  1494  1493   0.83339  0.660156  0.00461002  0.00458452 
> 8   1  1736  

Re: [ceph-users] Backup VM (Base image + snapshot)

2017-10-20 Thread Maged Mokhtar
Hi all, 

Can export-diff work effectively without the fast-diff rbd feature as it
is not supported in kernel rbd? 

Maged 

On 2017-10-19 23:18, Oscar Segarra wrote:

> Hi Richard,  
> 
> Thanks a lot for sharing your experience... I have made deeper investigation 
> and it looks export-diff is the most common tool used for backup as you have 
> suggested. 
> 
> I will make some tests with export-diff  and I will share my experience. 
> 
> Again, thanks a lot! 
> 
> 2017-10-16 12:00 GMT+02:00 Richard Hesketh :
> 
>> On 16/10/17 03:40, Alex Gorbachev wrote:
>>> On Sat, Oct 14, 2017 at 12:25 PM, Oscar Segarra  
>>> wrote:
 Hi,
 
 In my VDI environment I have configured the suggested ceph
 design/arquitecture:
 
 http://docs.ceph.com/docs/giant/rbd/rbd-snapshot/ [1]
 
 Where I have a Base Image + Protected Snapshot + 100 clones (one for each
 persistent VDI).
 
 Now, I'd like to configure a backup script/mechanism to perform backups of
 each persistent VDI VM to an external (non ceph) device, like NFS or
 something similar...
 
 Then, some questions:
 
 1.- Does anybody have been able to do this kind of backups?
>>> 
>>> Yes, we have been using export-diff successfully (note this is off a
>>> snapshot and not a clone) to back up and restore ceph images to
>>> non-ceph storage.  You can use merge-diff to create "synthetic fulls"
>>> and even do some basic replication to another cluster.
>>> 
>>> http://ceph.com/geen-categorie/incremental-snapshots-with-rbd/ [2]
>>> 
>>> http://docs.ceph.com/docs/master/dev/rbd-export/ [3]
>>> 
>>> http://cephnotes.ksperis.com/blog/2014/08/12/rbd-replication [4]
>>> 
>>> --
>>> Alex Gorbachev
>>> Storcium
>>> 
 2.- Is it possible to export BaseImage in qcow2 format and snapshots in
 qcow2 format as well as "linked clones" ?
 3.- Is it possible to export the Base Image in raw format, snapshots in raw
 format as well and, when recover is required, import both images and
 "relink" them?
 4.- What is the suggested solution for this scenario?
 
 Thanks a lot everybody!
>> 
>> In my setup I backup individually complete raw disk images to file, because 
>> then they're easier to manually inspect and grab data off in the event of 
>> catastrophic cluster failure. I haven't personally bothered trying to 
>> preserve the layering between master/clone images in backup form; that 
>> sounds like a bunch of effort and by inspection the amount of space it'd 
>> actually save in my use case is really minimal.
>> 
>> However I do use export-diff in order to make backups efficient - a rolling 
>> snapshot on each RBD is used to export the day's diff out of the cluster and 
>> then the ceph_apply_diff utility from https://gp2x.org/ceph/ is used to 
>> apply that diff to the raw image file (though I did patch it to work with 
>> streaming input and eliminate the necessity for a temporary file containing 
>> the diff). There are a handful of very large RBDs in my cluster for which 
>> exporting the full disk image takes a prohibitively long time, which made 
>> leveraging diffs necessary.
>> 
>> For a while, I was instead just exporting diffs and using merge-diff to 
>> munge them together into big super-diffs, and the restoration procedure 
>> would be to apply the merged diff to a freshly made image in the cluster. 
>> This worked, but it is a more fiddly recovery process; importing complete 
>> disk images is easier. I don't think it's possible to create two images in 
>> the cluster and then link them into a layering relationship; you'd have to 
>> import the base image, clone it, and them import a diff onto that clone if 
>> you wanted to recreate the original layering.
>> 
>> Rich
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [5]
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

  

Links:
--
[1] http://docs.ceph.com/docs/giant/rbd/rbd-snapshot/
[2] http://ceph.com/geen-categorie/incremental-snapshots-with-rbd/
[3] http://docs.ceph.com/docs/master/dev/rbd-export/
[4] http://cephnotes.ksperis.com/blog/2014/08/12/rbd-replication
[5] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to increase the size of requests written to a ceph image

2017-10-18 Thread Maged Mokhtar
Just run the same 32-threaded rados test as you did before and this time
run atop while the test is running, looking for %busy of cpu/disks. It
should give an idea of whether there is a bottleneck in them.  
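
For example, with the bench running in one shell and atop in another on each OSD
node (pool name as in the earlier commands):

rados bench -p testpool -b 4096 30 write -t 32
atop 2      # watch the DSK and CPU busy percentages while the bench runs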

On 2017-10-18 21:35, Russell Glaue wrote:

> I cannot run the write test reviewed at the 
> ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device blog. The tests 
> write directly to the raw disk device. 
> Reading an infile (created with urandom) on one SSD, writing the outfile to 
> another osd, yields about 17MB/s. 
> But isn't this write speed limited by the speed at which the dd infile can 
> be read? 
> And I assume the best test should be run with no other load.
> 
> How does one run the rados bench "as stress"? 
> 
> -RG 
> 
> On Wed, Oct 18, 2017 at 1:33 PM, Maged Mokhtar <mmokh...@petasan.org> wrote:
> 
> measuring resource load as outlined earlier will show if the drives are 
> performing well or not. Also how many osds do you have  ?
> 
> On 2017-10-18 19:26, Russell Glaue wrote: 
> The SSD drives are Crucial M500 
> A Ceph user did some benchmarks and found it had good performance 
> https://forum.proxmox.com/threads/ceph-bad-performance-in-qemu-guests.21551/ 
> [1] 
> 
> However, a user comment from 3 years ago on the blog post you linked to says 
> to avoid the Crucial M500 
> 
> Yet, this performance posting tells that the Crucial M500 is good. 
> https://inside.servers.com/ssd-performance-2017-c4307a92dea [2] 
> 
> On Wed, Oct 18, 2017 at 11:53 AM, Maged Mokhtar <mmokh...@petasan.org> wrote:
> 
> Check out the following link: some SSDs perform badly in Ceph due to sync 
> writes to the journal 
> 
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>  [3] 
> 
> Another thing that can help is to re-run the 32-thread rados test as stress and 
> view resource usage using atop (or collectl/sar) to check %busy cpu and 
> %busy disks, to give you an idea of what is holding down your cluster. For 
> example: if cpu/disk % are all low, then check your network/switches. If disk 
> %busy is high (90%) for all disks, then your disks are the bottleneck, which 
> either means you have SSDs that are not suitable for Ceph or you have too few 
> disks (which I doubt is the case). If only 1 disk's %busy is high, there may be 
> something wrong with that disk and it should be removed. 
> 
> Maged
> 
> On 2017-10-18 18:13, Russell Glaue wrote: 
> 
> In my previous post, in one of my points I was wondering if the request size 
> would increase if I enabled jumbo frames. Currently they are disabled. 
> 
> @jdillama: The qemu settings for both these two guest machines, with RAID/LVM 
> and Ceph/rbd images, are the same. I am not thinking that changing the qemu 
> settings of "min_io_size=<size>,opt_io_size=<size>" will directly address the issue. 
> @mmokhtar: Ok. So you suggest the request size is the result of the problem 
> and not the cause of the problem. meaning I should go after a different 
> issue. 
> 
> I have been trying to get write speeds up to what people on this mail list 
> are discussing. 
> It seems that for our configuration, as it matches others, we should be 
> getting about 70MB/s write speed. 
> But we are not getting that. 
> Single writes to disk are lucky to get 5MB/s to 6MB/s, but are typically 
> 1MB/s to 2MB/s. 
> Monitoring the entire Ceph cluster (using http://cephdash.crapworks.de/), I 
> have seen very rare momentary spikes up to 30MB/s. 
> 
> My storage network is connected via a 10Gb switch 
> I have 4 storage servers with a LSI Logic MegaRAID SAS 2208 controller 
> Each storage server has 9 1TB SSD drives, each drive as 1 osd (no RAID) 
> Each drive is one LVM group, with two volumes - one volume for the osd, one 
> volume for the journal 
> 
> Each osd is formatted with xfs 
> The crush map is simple: default->rack->[host[1..4]->osd] with an evenly 
> distributed weight 
> The redundancy is triple replication 
> 
> While I have read comments that having the osd and journal on the same disk 
> decreases write speed, I have also read that once past 8 OSDs per node this 
> is the recommended configuration, however this is also the reason why SSD 
> drives are used exclusively for OSDs in the storage nodes. 
> None-the-less, I was still expecting write speeds to be above 30MB/s, not 
> below 6MB/s. 
> Even at 12x slower than the RAID, using my previously posted iostat data set, 
> I should be seeing write speeds that average 10MB/s, not 2MB/s. 
> 
> In regards to the rados benchmark tests you asked me to run, here is the 
> output: 
> 
> [centos7]# rados bench -p scbench -b 4096 30 write -t 1 
> Maintaining 1 conc

Re: [ceph-users] Bluestore performance 50% of filestore

2017-11-15 Thread Maged Mokhtar
On 2017-11-14 21:54, Milanov, Radoslav Nikiforov wrote:

> Hi 
> 
> We have 3 node, 27 OSDs cluster running Luminous 12.2.1 
> 
> In filestore configuration there are 3 SSDs used for journals of 9 OSDs on 
> each hosts (1 SSD has 3 journal paritions for 3 OSDs). 
> 
> I've converted filestore to bluestore by wiping 1 host a time and waiting for 
> recovery. SSDs now contain block-db - again one SSD serving 3 OSDs. 
> 
> Cluster is used as storage for Openstack. 
> 
> Running fio on a VM in that Openstack reveals bluestore performance almost 
> twice slower than filestore. 
> 
> fio --name fio_test_file --direct=1 --rw=randwrite --bs=4k --size=1G 
> --numjobs=2 --time_based --runtime=180 --group_reporting 
> 
> fio --name fio_test_file --direct=1 --rw=randread --bs=4k --size=1G 
> --numjobs=2 --time_based --runtime=180 --group_reporting 
> 
> Filestore 
> 
> write: io=3511.9MB, bw=19978KB/s, iops=4994, runt=180001msec 
> 
> write: io=3525.6MB, bw=20057KB/s, iops=5014, runt=180001msec 
> 
> write: io=3554.1MB, bw=20222KB/s, iops=5055, runt=180016msec 
> 
> read : io=1995.7MB, bw=11353KB/s, iops=2838, runt=180001msec 
> 
> read : io=1824.5MB, bw=10379KB/s, iops=2594, runt=180001msec 
> 
> read : io=1966.5MB, bw=11187KB/s, iops=2796, runt=180001msec 
> 
> Bluestore 
> 
> write: io=1621.2MB, bw=9222.3KB/s, iops=2305, runt=180002msec 
> 
> write: io=1576.3MB, bw=8965.6KB/s, iops=2241, runt=180029msec 
> 
> write: io=1531.9MB, bw=8714.3KB/s, iops=2178, runt=180001msec 
> 
> read : io=1279.4MB, bw=7276.5KB/s, iops=1819, runt=180006msec 
> 
> read : io=773824KB, bw=4298.9KB/s, iops=1074, runt=180010msec 
> 
> read : io=1018.5MB, bw=5793.7KB/s, iops=1448, runt=180001msec 
> 
> - Rado 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

It will be useful to see whether this filestore edge holds up when you
increase your queue depth (threads/jobs), for example to 32 or 64. That
would represent a more practical load. 

I can see one extreme case where filestore may be faster: a cluster with a
large number of OSDs and only 1 client thread. In that case the OSD the
client io hits will not be busy syncing its journal to hdd (which it would
be under normal load), but again this is not a practical setup.  
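For example, the same randwrite test at a deeper queue could look like this (the ioengine/iodepth/numjobs values are only illustrative):

  fio --name fio_test_file --direct=1 --rw=randwrite --bs=4k --size=1G \
      --ioengine=libaio --iodepth=32 --numjobs=4 --time_based --runtime=180 \
      --group_reporting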

/Maged___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-27 Thread Maged Mokhtar
On 2017-11-27 15:02, German Anders wrote:

> Hi All, 
> 
> I've a performance question, we recently install a brand new Ceph cluster 
> with all-nvme disks, using ceph version 12.2.0 with bluestore configured. The 
> back-end of the cluster is using a bond IPoIB (active/passive) , and for the 
> front-end we are using a bonding config with active/active (20GbE) to 
> communicate with the clients. 
> 
> The cluster configuration is the following: 
> 
> MON NODES: 
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14  
> 3x 1U servers: 
> 2x Intel Xeon E5-2630v4 @2.2Ghz 
> 128G RAM 
> 2x Intel SSD DC S3520 150G (in RAID-1 for OS) 
> 2x 82599ES 10-Gigabit SFI/SFP+ Network Connection 
> 
> OSD NODES: 
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14 
> 4x 2U servers: 
> 2x Intel Xeon E5-2640v4 @2.4Ghz 
> 128G RAM 
> 2x Intel SSD DC S3520 150G (in RAID-1 for OS) 
> 1x Ethernet Controller 10G X550T 
> 1x 82599ES 10-Gigabit SFI/SFP+ Network Connection 
> 12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons 
> 1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port) 
> 
> Here's the tree: 
> 
> ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF 
> -7   48.0 root root 
> -5   24.0 rack rack1 
> -1   12.0 node cpn01 
> 0  nvme  1.0 osd.0  up  1.0 1.0 
> 1  nvme  1.0 osd.1  up  1.0 1.0 
> 2  nvme  1.0 osd.2  up  1.0 1.0 
> 3  nvme  1.0 osd.3  up  1.0 1.0 
> 4  nvme  1.0 osd.4  up  1.0 1.0 
> 5  nvme  1.0 osd.5  up  1.0 1.0 
> 6  nvme  1.0 osd.6  up  1.0 1.0 
> 7  nvme  1.0 osd.7  up  1.0 1.0 
> 8  nvme  1.0 osd.8  up  1.0 1.0 
> 9  nvme  1.0 osd.9  up  1.0 1.0 
> 10  nvme  1.0 osd.10 up  1.0 1.0 
> 11  nvme  1.0 osd.11 up  1.0 1.0 
> -3   12.0 node cpn03 
> 24  nvme  1.0 osd.24 up  1.0 1.0 
> 25  nvme  1.0 osd.25 up  1.0 1.0 
> 26  nvme  1.0 osd.26 up  1.0 1.0 
> 27  nvme  1.0 osd.27 up  1.0 1.0 
> 28  nvme  1.0 osd.28 up  1.0 1.0 
> 
> 29  nvme  1.0 osd.29 up  1.0 1.0 
> 30  nvme  1.0 osd.30 up  1.0 1.0 
> 31  nvme  1.0 osd.31 up  1.0 1.0 
> 32  nvme  1.0 osd.32 up  1.0 1.0 
> 33  nvme  1.0 osd.33 up  1.0 1.0 
> 34  nvme  1.0 osd.34 up  1.0 1.0 
> 35  nvme  1.0 osd.35 up  1.0 1.0 
> -6   24.0 rack rack2 
> -2   12.0 node cpn02 
> 12  nvme  1.0 osd.12 up  1.0 1.0 
> 13  nvme  1.0 osd.13 up  1.0 1.0 
> 14  nvme  1.0 osd.14 up  1.0 1.0 
> 15  nvme  1.0 osd.15 up  1.0 1.0 
> 16  nvme  1.0 osd.16 up  1.0 1.0 
> 17  nvme  1.0 osd.17 up  1.0 1.0 
> 18  nvme  1.0 osd.18 up  1.0 1.0 
> 19  nvme  1.0 osd.19 up  1.0 1.0 
> 20  nvme  1.0 osd.20 up  1.0 1.0 
> 21  nvme  1.0 osd.21 up  1.0 1.0 
> 22  nvme  1.0 osd.22 up  1.0 1.0 
> 23  nvme  1.0 osd.23 up  1.0 1.0 
> -4   12.0 node cpn04 
> 36  nvme  1.0 osd.36 up  1.0 1.0 
> 37  nvme  1.0 osd.37 up  1.0 1.0 
> 38  nvme  1.0 osd.38 up  1.0 1.0 
> 39  nvme  1.0 osd.39 up  1.0 1.0 
> 40  nvme  1.0 osd.40 up  1.0 1.0 
> 41  nvme  1.0 osd.41 up  1.0 1.0 
> 42  nvme  1.0 osd.42 up  1.0 1.0 
> 43  nvme  1.0 osd.43 up  1.0 1.0 
> 44  nvme  1.0 osd.44 up  1.0 1.0 
> 45  nvme  1.0 osd.45 up  1.0 1.0 
> 46  nvme  1.0 osd.46 up  1.0 1.0 
> 47  nvme  1.0 osd.47 up  1.0 1.0 
> 
> The disk partition of one of the OSD nodes: 
> 
> NAME   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT 
> nvme6n1259:10   1.1T  0 disk 
> ├─nvme6n1p2259:15   0   1.1T  0 part 
> └─nvme6n1p1259:13   0   100M  0 part  /var/lib/ceph/osd/ceph-6 
> nvme9n1259:00   1.1T  0 disk 
> ├─nvme9n1p2259:80   1.1T  0 part 
> └─nvme9n1p1259:70   100M  0 part  /var/lib/ceph/osd/ceph-9 
> sdb  8:16   0 

Re: [ceph-users] ceph-disk is now deprecated

2017-11-28 Thread Maged Mokhtar
I tend to agree with Wido. Many of us still rely on ceph-disk and hope
to see it live a little longer. 

Maged 

On 2017-11-28 13:54, Alfredo Deza wrote:

> On Tue, Nov 28, 2017 at 3:12 AM, Wido den Hollander  wrote: 
> Op 27 november 2017 om 14:36 schreef Alfredo Deza :
> 
> For the upcoming Luminous release (12.2.2), ceph-disk will be
> officially in 'deprecated' mode (bug fixes only). A large banner with
> deprecation information has been added, which will try to raise
> awareness.
> 
> As much as I like ceph-volume and the work being done, is it really a good 
> idea to use a minor release to deprecate a tool?
> 
> Can't we just introduce ceph-volume and deprecate ceph-disk at the release of 
> M? Because when you upgrade to 12.2.2 suddenly existing integrations will 
> have deprecation warnings being thrown at them while they haven't upgraded to 
> a new major version.

ceph-volume has been present since the very first release of Luminous,
the deprecation warning in ceph-disk is the only "new" thing
introduced for 12.2.2.

> As ceph-deploy doesn't support ceph-disk either I don't think it's a good 
> idea to deprecate it right now.

ceph-deploy work is being done to support ceph-volume exclusively
(ceph-disk support is dropped fully), which will mean a change in its
API in a non-backwards compatible
way. A major version change in ceph-deploy, documentation, and a bunch
of documentation is being worked on to allow users to transition to
it.

> How do others feel about this?
> 
> Wido
> 
>> We are strongly suggesting using ceph-volume for new (and old) OSD
>> deployments. The only current exceptions to this are encrypted OSDs
>> and FreeBSD systems
>> 
>> Encryption support is planned and will be coming soon to ceph-volume.
>> 
>> A few items to consider:
>> 
>> * ceph-disk is expected to be fully removed by the Mimic release
>> * Existing OSDs are supported by ceph-volume. They can be "taken over" [0 
>> [1]]
>> * ceph-ansible already fully supports ceph-volume and will soon default to it
>> * ceph-deploy support is planned and should be fully implemented soon
>> 
>> [0] http://docs.ceph.com/docs/master/ceph-volume/simple/
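The "take over" path referenced at [0] is roughly the following (a sketch; the OSD path is only an example, and flags may differ per version):

  ceph-volume simple scan /var/lib/ceph/osd/ceph-0   # captures the OSD's metadata to /etc/ceph/osd/*.json
  ceph-volume simple activate --all                  # re-activates scanned OSDs without ceph-disk/udev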
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

 

Links:
--
[1] http://docs.ceph.com/docs/master/ceph-volume/simple/___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-29 Thread Maged Mokhtar
Hi German, 

I would personally prefer to benchmark the cluster first with the more
common tools, rados bench / fio, and only later do mysql-specific tests
using sysbench. Another thing is to run the client test simultaneously on
more than 1 machine and aggregate/add the performance numbers of each: the
limitation can be caused by client-side resources, which could be
stressed differently by the different storage backends you tried. 
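For example (pool name hypothetical), start something like this on each client machine at the same time and add up the per-client results:

  rados bench -p testpool 60 write -b 4096 -t 32 --run-name $(hostname)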

Maged 

On 2017-11-28 21:20, German Anders wrote:

> Don't know if there's any statistics available really, but Im running some 
> sysbench tests with mysql before the changes and the idea is to run those 
> tests again after the 'tuning' and see if numbers get better in any way, also 
> I'm gathering numbers from some collectd and statsd collectors running on the 
> osd nodes so, I hope to get some info about that :) 
> 
> GERMAN 
> 2017-11-28 16:12 GMT-03:00 Marc Roos <m.r...@f1-outsourcing.eu>:
> 
>> I was wondering if there are any statistics available that show the
>> performance increase of doing such things?
>> 
>> -Original Message-
>> From: German Anders [mailto:gand...@despegar.com]
>> Sent: dinsdag 28 november 2017 19:34
>> To: Luis Periquito
>> Cc: ceph-users
>> Subject: Re: [ceph-users] ceph all-nvme mysql performance tuning
>> 
>> Thanks a lot Luis, I agree with you regarding the CPUs, but
>> unfortunately those were the best CPU model that we can afford :S
>> 
>> For the NUMA part, I managed to pin the OSDs by changing the
>> /usr/lib/systemd/system/ceph-osd@.service file and adding the
>> CPUAffinity list to it. But this pins ALL the OSDs to specific nodes
>> or a specific CPU list; I can't find a way to specify a list for
>> only a specific subset of OSDs.
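One way to scope the pinning to individual OSD instances is a per-instance systemd drop-in instead of editing the template unit (a sketch; the OSD id and CPU list below are made up):

  # /etc/systemd/system/ceph-osd@12.service.d/override.conf
  [Service]
  CPUAffinity=0 2 4 6

  systemctl daemon-reload
  systemctl restart ceph-osd@12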
>> 
>> Also, I notice that the NVMe disks are all on the same node (since I'm
>> using half of the shelf - so the other half will be pinned to the other
>> node), so the lanes of the NVMe disks are all on the same CPU (in this
>> case 0). Also, I find that the IB adapter that is mapped to the OSD
>> network (osd replication) is pinned to CPU 1, so this will cross the QPI
>> path.
>> 
>> And for the memory, from the other email, we are already using the
>> TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES parameter with a value of
>> 134217728
>> 
>> In this case I can pinned all the actual OSDs to CPU 0, but in the near
>> future when I add more nvme disks to the OSD nodes, I'll definitely need
>> to pinned the other half OSDs to CPU 1, someone already did this?
>> 
>> Thanks a lot,
>> 
>> Best,
>> 
>> German
>> 
>> 2017-11-28 6:36 GMT-03:00 Luis Periquito <periqu...@gmail.com>:
>> 
>> There are a few things I don't like about your machines... If you
>> want latency/IOPS (as you seemingly do) you really want the highest
>> frequency CPUs, even over number of cores. These are not too bad, but
>> not great either.
>> 
>> Also you have 2x CPU meaning NUMA. Have you pinned OSDs to NUMA
>> nodes? Ideally OSD is pinned to same NUMA node the NVMe device is
>> connected to. Each NVMe device will be running on PCIe lanes generated
>> by one of the CPUs...
>> 
>> What versions of TCMalloc (or jemalloc) are you running? Have you
>> tuned them to have a bigger cache?
>> 
>> These are from what I've learned using filestore - I've yet to run
>> full tests on bluestore - but they should still apply...
>> 
>> On Mon, Nov 27, 2017 at 5:10 PM, German Anders
>> <gand...@despegar.com> wrote:
>> 
>> Hi Nick,
>> 
>> yeah, we are using the same nvme disk with an additional
>> partition to use as journal/wal. We double check the c-state and it was
>> not configure to use c1, so we change that on all the osd nodes and mon
>> nodes and we're going to make some new tests, and see how it goes. I'll
>> get back as soon as get got those tests running.
>> 
>> Thanks a lot,
>> 
>> Best,
>> 
>> German
>> 
>> 2017-11-27 12:16 GMT-03:00 Nick Fisk <n...@fisk.me.uk>:
>> 
>> From: ceph-users
>> [mailto:ceph-users-boun...@lists.ceph.com
>> <mailto:ceph-users-boun...@lists.ceph.com> ] On Behalf Of German Anders
>> Sent: 27 November 2017 14:44
>> To: Maged Mokhtar <mmokh...@petasan.org>
>> Cc: ceph-users <ceph-users@lists.ceph.com>
>> Subject: Re: [ceph-users] ceph all-nvme mysql performance
>> tuning
>> 
>> Hi Maged,
>> 
>> Thanks a lot for the response. We try with different
>> number of threads and we're getting almost the same kind of di

Re: [ceph-users] Performance, and how much wiggle room there is with tunables

2017-11-10 Thread Maged Mokhtar
Hi Mark, 

It will be interesting to know: 

The impact of replication. I guess performance will decrease by a higher
factor than the replica count. 

I assume you mean the 30K IOPS per OSD is what the client sees; if so,
the OSD raw disk itself will be doing more IOPS. Is this correct, and if
so, what is the factor (the lower, the better the efficiency)? 

Are you running 1 OSD per physical drive or multiple? Any
recommendations? 

Cheers /Maged 

On 2017-11-10 18:51, Mark Nelson wrote:

> FWIW, on very fast drives you can achieve at least 1.4GB/s and 30K+ write 
> IOPS per OSD (before replication).  It's quite possible to do better but 
> those are recent numbers on a mostly default bluestore configuration that I'm 
> fairly confident to share.  It takes a lot of CPU, but it's possible.
> 
> Mark
> 
> On 11/10/2017 10:35 AM, Robert Stanford wrote: 
> Thank you for that excellent observation.  Are there any rumors / has
> anyone had experience with faster clusters, on faster networks?  I
> wonder how Ceph can get ("it depends"), of course, but I wonder about
> numbers people have seen.
> 
> On Fri, Nov 10, 2017 at 10:31 AM, Denes Dolhay  > wrote:
> 
> So you are using a 40 / 100 gbit connection all the way to your client?
> 
> John's question is valid because 10 gbit = 1.25GB/s ... subtract
> some ethernet, ip, tcp and protocol overhead take into account some
> additional network factors and you are about there...
> 
> Denes
> 
> On 11/10/2017 05:10 PM, Robert Stanford wrote: 
> The bandwidth of the network is much higher than that.  The
> bandwidth I mentioned came from "rados bench" output, under the
> "Bandwidth (MB/sec)" row.  I see from comparing mine to others
> online that mine is pretty good (relatively).  But I'd like to get
> much more than that.
> 
> Does "rados bench" show a near maximum of what a cluster can do?
> Or is it possible that I can tune it to get more bandwidth?
> |
> |
> 
> On Fri, Nov 10, 2017 at 3:43 AM, John Spray  > wrote:
> 
> On Fri, Nov 10, 2017 at 4:29 AM, Robert Stanford
> > wrote:
>>
>>  In my cluster, rados bench shows about 1GB/s bandwidth.
> I've done some
>> tuning:
>>
>> [osd]
>> osd op threads = 8
>> osd disk threads = 4
>> osd recovery max active = 7
>>
>>
>> I was hoping to get much better bandwidth.  My network can
> handle it, and my
>> disks are pretty fast as well.  Are there any major tunables
> I can play with
>> to increase what will be reported by "rados bench"?  Am I
> pretty much stuck
>> around the bandwidth it reported?
> 
> Are you sure your 1GB/s isn't just the NIC bandwidth limit of the
> client you're running rados bench from?
> 
> John
> 
>>
>>  Thank you
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
>>
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>  
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance, and how much wiggle room there is with tunables

2017-11-10 Thread Maged Mokhtar
rados bench is a client application that simulates client io to
stress the cluster. This applies whether you run the test from an
external client or from a cluster server that acts as a client. For
fast clusters the client will saturate (cpu/net) before the cluster
does. To get accurate results it is better to run client sweeps: run the
test in steps, adding 1 client in each step and aggregating the output
results. For small clusters the numbers will saturate quickly; for larger
clusters they converge slowly, but practically you can deduce where they
are heading. It is also best to run the clients from real client machines
and not from cluster servers, so you do not overstress your servers and
you get more accurate results, though practicality may limit this. 
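A crude single-machine approximation of such a sweep (a real sweep adds client machines per step; the pool name is made up):

  for t in 1 2 4 8 16 32 64; do
      rados bench -p testpool 30 write -b 4096 -t $t
  done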

It is also beneficial to measure your resource loads: cpu%, disk
%busy as well as network utilization, using a tool such as
atop/collectl/sysstat.  

There are tools to automate this client sweeping, aggregation of results
and collection of resource loads, most notably the Ceph Benchmarking Tool:  

https://github.com/ceph/cbt 

As for tunables, there are various recommendations for configuration
parameters for Jewel and earlier; I have not seen any for Luminous yet.
There are also various kernel sysctl.conf recommendations for use with
Ceph.  

/Maged  

On 2017-11-10 18:36, Robert Stanford wrote:

> But sorry, this was about "rados bench" which is run inside the Ceph cluster. 
>  So there's no network between the "client" and my cluster. 
> 
> On Fri, Nov 10, 2017 at 10:35 AM, Robert Stanford  
> wrote:
> 
> Thank you for that excellent observation.  Are there any rumors / has anyone 
> had experience with faster clusters, on faster networks?  I wonder how Ceph 
> can get ("it depends"), of course, but I wonder about numbers people have 
> seen. 
> 
> On Fri, Nov 10, 2017 at 10:31 AM, Denes Dolhay  wrote:
> 
> So you are using a 40 / 100 gbit connection all the way to your client? 
> 
> John's question is valid because 10 gbit = 1.25GB/s ... subtract some 
> ethernet, ip, tcp and protocol overhead take into account some additional 
> network factors and you are about there...
> 
> Denes
> 
> On 11/10/2017 05:10 PM, Robert Stanford wrote: 
> 
> The bandwidth of the network is much higher than that.  The bandwidth I 
> mentioned came from "rados bench" output, under the "Bandwidth (MB/sec)" row. 
>  I see from comparing mine to others online that mine is pretty good 
> (relatively).  But I'd like to get much more than that.
> 
> Does "rados bench" show a near maximum of what a cluster can do?  Or is it 
> possible that I can tune it to get more bandwidth?
> 
> On Fri, Nov 10, 2017 at 3:43 AM, John Spray  wrote:
> On Fri, Nov 10, 2017 at 4:29 AM, Robert Stanford
>  wrote:
>> 
>> In my cluster, rados bench shows about 1GB/s bandwidth.  I've done some
>> tuning:
>> 
>> [osd]
>> osd op threads = 8
>> osd disk threads = 4
>> osd recovery max active = 7
>> 
>> 
>> I was hoping to get much better bandwidth.  My network can handle it, and my
>> disks are pretty fast as well.  Are there any major tunables I can play with
>> to increase what will be reported by "rados bench"?  Am I pretty much stuck
>> around the bandwidth it reported?
> 
> Are you sure your 1GB/s isn't just the NIC bandwidth limit of the
> client you're running rados bench from?
> 
> John
> 
>> 
>> Thank you
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [1]
>> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [1]

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [1]

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

  

Links:
--
[1] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-11-03 Thread Maged Mokhtar
On 2017-11-03 15:59, Wido den Hollander wrote:

> Op 3 november 2017 om 14:43 schreef Mark Nelson :
> 
> On 11/03/2017 08:25 AM, Wido den Hollander wrote: 
> Op 3 november 2017 om 13:33 schreef Mark Nelson :
> 
> On 11/03/2017 02:44 AM, Wido den Hollander wrote: 
> Op 3 november 2017 om 0:09 schreef Nigel Williams 
> :
> 
> On 3 November 2017 at 07:45, Martin Overgaard Hansen  
> wrote: I want to bring this subject back in the light and hope someone can 
> provide
> insight regarding the issue, thanks. 
> Thanks Martin, I was going to do the same.
> 
> Is it possible to make the DB partition (on the fastest device) too
> big? in other words is there a point where for a given set of OSDs
> (number + size) the DB partition is sized too large and is wasting
> resources. I recall a comment by someone proposing to split up a
> single large (fast) SSD into 100GB partitions for each OSD.

It depends on the size of your backing disk. The DB will grow with the
number of objects you have on your OSD.

A 4TB drive will hold more objects then a 1TB drive (usually), same goes
for a 10TB vs 6TB.

From what I've seen now there is no such thing as a 'too big' DB.

The tests I've done so far seem to suggest that filling up a 50GB DB is
rather hard to do, unless you have billions of objects and thus tens of
millions of objects per OSD. 
Are you doing RBD, RGW, or something else to test?  What size are the
objets and are you fragmenting them? 

> Let's say the avg overhead is 16k you would need a 150GB DB for 10M objects.
> 
> You could look into your current numbers and check how many objects you have 
> per OSD.
> 
> I checked a couple of Ceph clusters I run and see about 1M objects per OSD, 
> but other only have 250k OSDs.
> 
> In all those cases even with 32k you would need a 30GB DB with 1M objects in 
> that OSD.
> 
>> The answer could be couched as some intersection of pool type (RBD /
>> RADOS / CephFS), object change(update?) intensity, size of OSD etc and
>> rule-of-thumb.
> 
> I would check your running Ceph clusters and calculate the amount of objects 
> per OSD.
> 
> total objects / num osd * 3
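As a worked example of that rule of thumb (the cluster size and object count here are made up; the 16k-32k per-object overhead figures are the ones discussed above):

  total objects = 12M, OSDs = 36, replica count = 3
  objects per OSD = 12,000,000 / 36 * 3 = 1,000,000
  DB size per OSD at 16k..32k per object = roughly 16GB..32GB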

One nagging concern I have in the back of my mind is that the amount of
space amplification in rocksdb might grow with the number of levels (ie
the number of objects).  The space used per object might be different at
10M objects and 50M objects.

True. But how many systems do we have out there with 10M objects in ONE
OSD?

The systems I checked range from 250k to 1M objects per OSD. Ofcourse,
but statistics aren't the golden rule, but users will want some
guideline on how to size their DB. 
That's actually something I would really like better insight into.  I 
don't feel like I have a sufficient understanding of how many 
objects/OSD people are really deploying in the field.  I figure 10M/OSD 
is probably a reasonable "typical" upper limit for HDDs, but I could see
some use cases with flash backed SSDs pushing far more. 
Would a poll on the ceph-users list work? I understand that you require
such feedback to make a proper judgement.

I know of one cluster which has 10M objects (heavy, heavy, heavy RGW
user) in about 400TB of data.

All other clusters I've seen aren't that high on the amount of Objects.
They are usually high on data since they have a RBD use-case which is a
lot of 4M objects.

You could also ask users to use this tool:
https://github.com/42on/ceph-collect

That tarball would give you a lot of information about the cluster and
the amount of objects per OSD and PG.

Wido

>> WAL should be sufficient with 1GB~2GB, right?
> 
> Yep.  On the surface this appears to be a simple question, but a much 
> deeper question is what are we actually doing with the WAL?  How should 
> we be storing PG log and dup ops data?  How can we get away from the 
> large WAL buffers and memtables we have now?  These are questions we are 
> actively working on solving.  For the moment though, having multiple (4) 
> 256MB WAL buffers appears to give us the best performance despite 
> resulting in large memtables, so 1-2GB for the WAL is right.
> 
> Mark
> 
> Wido
> 
> Wido
> 
> An idea occurred to me that by monitoring for the logged spill message
> (the event when the DB partition spills/overflows to the OSD), OSDs
> could be (lazily) destroyed and recreated with a new DB partition
> increased in size say by 10% each time.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___

Re: [ceph-users] How to increase the size of requests written to a ceph image

2017-12-08 Thread Maged Mokhtar
Correction to the figure quoted below: at 4M block sizes you will only need 22.5 iops.  
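The arithmetic, with the bandwidth figure from the quoted message: 90 MB/s / 4 MB per write = 22.5 write iops.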

On 2017-12-08 09:59, Maged Mokhtar wrote:

> Hi Russell, 
> 
> It is probably due to the difference in block sizes used in the test vs your 
> cluster load. You have a latency problem which is limiting your max write 
> iops to around 2.5K. For large block sizes you do not need that many iops, 
> for example if you write in 4M block sizes you will only need 12.5 iops to 
> reach your bandwidth of 90 MB/s, in such case you latency problem will not 
> affect your bandwidth. The reason i had suggested you run the original test 
> in 4k size was because this was the original problem subject of this thread, 
> the gunzip test and the small block sizes you were getting with iostat. 
> 
> If you want to know a "rough" ballpark on what block sizes you currently see 
> on your cluster, get the total bandwidth and iops as reported by ceph ( ceph 
> status should give you this ) and divide the first by the second. 
> 
> I still think you have a significant latency/iops issue: a 36 all SSDs 
> cluster should give much higher that 2.5K iops   
> 
> Maged 
> 
> On 2017-12-07 23:57, Russell Glaue wrote: 
> I want to provide an update to my interesting situation. 
> (New storage nodes were purchased and are going into the cluster soon) 
> 
> I have been monitoring the ceph storage nodes with atop and read/write 
> through put with ceph-dash for the last month. 
> I am regularly seeing 80-90MB/s of write throughput (140MB/s read) on the 
> ceph cluster. At these moments, the problem ceph node I have been speaking of 
> shows 101% disk busy on the same 3 to 4 (of the 9) OSDs. So I am getting the 
> throughput that I want with on the cluster, despite the OSDs in question. 
> 
> However, when I run the bench tests described in this thread, I do not see 
> the write throughput go above 5MB/s. 
> When I take the problem node out, and run the bench tests, I see the 
> throughput double, but not over 10MB/s. 
> 
> Why is the ceph cluster getting up to 90MB/s write in the wild, but not when 
> running the bench tests ? 
> 
> -RG 
> 
> On Fri, Oct 27, 2017 at 4:21 PM, Russell Glaue <rgl...@cait.org> wrote:
> 
> Yes, several have recommended the fio test now. I cannot perform a fio test 
> at this time. Because the post referred to directs us to write the fio test 
> data directly to the disk device, e.g. /dev/sdj. I'd have to take an OSD 
> completely out in order to perform the test. And I am not ready to do that at 
> this time. Perhaps after I attempt the hardware firmware updates, and still 
> do not have an answer, I would then take an OSD out of the cluster to run the 
> fio test. 
> Also, our M500 disks on the two newest machines are all running version MU05, 
> the latest firmware. The on the older two, they are behind a RAID0, but I 
> suspect they might be MU03 firmware.
> 
> -RG 
> 
> On Fri, Oct 27, 2017 at 4:12 PM, Brian Andrus <brian.and...@dreamhost.com> 
> wrote:
> 
> I would be interested in seeing the results from the post mentioned by an 
> earlier contributor: 
> 
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>  [1] 
> 
> Test an "old" M500 and a "new" M500 and see if the performance is A) 
> acceptable and B) comparable. Find hardware revision or firmware revision in 
> case of A=Good and B=different. 
> 
> If the "old" device doesn't test well in fio/dd testing, then the drives are 
> (as expected) not a great choice for journals and you might want to look at 
> hardware/backplane/RAID configuration differences that are somehow allowing 
> them to perform adequately. 
> 
> On Fri, Oct 27, 2017 at 12:36 PM, Russell Glaue <rgl...@cait.org> wrote:
> 
> Yes, all the MD500s we use are both journal and OSD, even the older ones. We 
> have a 3 year lifecycle and move older nodes from one ceph cluster to 
> another. 
> On old systems with 3 year old MD500s, they run as RAID0, and run faster than 
> our current problem system with 1 year old MD500s, ran as nonraid 
> pass-through on the controller. 
> 
> All disks are SATA and are connected to a SAS controller. We were wondering 
> if the SAS/SATA conversion is an issue. Yet, the older systems don't exhibit 
> a problem. 
> 
> I found what I wanted to know from a colleague, that when the current ceph 
> cluster was put together, the SSDs tested at 300+MB/s, and ceph cluster 
> writes at 30MB/s. 
> 
> Using SMART tools, the reserved cells in all drives is nearly 100%. 
> 
> Restarting the OSDs minorly improved performance. Still betting on hardware 
> issues that a firmware upgrade may resolve. 
> 
> -RG 
> 

Re: [ceph-users] How to increase the size of requests written to a ceph image

2017-12-08 Thread Maged Mokhtar
to be used as a Ceph journal in my last experience with 
> them. They make good OSDs with an NVMe in front of them perhaps, but not much 
> else. 
> 
> Ceph uses O_DSYNC for journal writes and these drives do not handle them as 
> expected. It's been many years since I've dealt with the M500s specifically, 
> but it has to do with the capacitor/power save feature and how it handles 
> those types of writes. I'm sorry I don't have the emails with specifics 
> around anymore, but last I remember, this was a hardware issue and could not 
> be resolved with firmware.  
> 
> Paging Kyle Bader... 
> 
> On Fri, Oct 27, 2017 at 9:24 AM, Russell Glaue <rgl...@cait.org> wrote:
> 
> We have older crucial M500 disks operating without such problems. So, I have 
> to believe it is a hardware firmware issue. 
> And its peculiar seeing performance boost slightly, even 24 hours later, when 
> I stop then start the OSDs. 
> 
> Our actual writes are low, as most of our Ceph Cluster based images are 
> low-write, high-memory. So a 20GB/day life/write capacity is a non-issue for 
> us. Only write speed is the concern. Our write-intensive images are locked on 
> non-ceph disks. 
> What are others using for SSD drives in their Ceph cluster? 
> With 0.50+ DWPD (Drive Writes Per Day), the Kingston SEDC400S37 models seems 
> to be the best for the price today. 
> 
> On Fri, Oct 27, 2017 at 6:34 AM, Maged Mokhtar <mmokh...@petasan.org> wrote:
> 
> It is quite likely related; things are pointing to bad disks. Probably the 
> best thing is to plan for disk replacement, the sooner the better, as it could 
> get worse.
> 
> On 2017-10-27 02:22, Christian Wuerdig wrote: 
> Hm, no necessarily directly related to your performance problem,
> however: These SSDs have a listed endurance of 72TB total data written
> - over a 5 year period that's 40GB a day or approx 0.04 DWPD. Given
> that you run the journal for each OSD on the same disk, that's
> effectively at most 0.02 DWPD (about 20GB per day per disk). I don't
> know many who'd run a cluster on disks like those. Also it means these
> are pure consumer drives which have a habit of exhibiting random
> performance at times (based on unquantified anecdotal personal
> experience with other consumer model SSDs). I wouldn't touch these
> with a long stick for anything but small toy-test clusters.
> 
> On Fri, Oct 27, 2017 at 3:44 AM, Russell Glaue <rgl...@cait.org> wrote: 
> On Wed, Oct 25, 2017 at 7:09 PM, Maged Mokhtar <mmokh...@petasan.org> wrote: 
> It depends on what stage you are in:
> in production, probably the best thing is to setup a monitoring tool
> (collectd/grahite/prometheus/grafana) to monitor both ceph stats as well as
> resource load. This will, among other things, show you if you have slowing
> disks. 
> I am monitoring Ceph performance with ceph-dash
> (http://cephdash.crapworks.de/), that is why I knew to look into the slow
> writes issue. And I am using Monitorix (http://www.monitorix.org/) to
> monitor system resources, including Disk I/O.
> 
> However, though I can monitor individual disk performance at the system
> level, it seems Ceph does not tax any disk more than the worst disk. So in
> my monitoring charts, all disks have the same performance.
> All four nodes are base-lining at 50 writes/sec during the cluster's normal
> load, with the non-problem hosts spiking up to 150, and the problem host
> only spikes up to 100.
> But during the window of time I took the problem host OSDs down to run the
> bench tests, the OSDs on the other nodes increased to 300-500 writes/sec.
> Otherwise, the chart looks the same for all disks on all ceph nodes/hosts.
> 
> Before production you should first make sure your SSDs are suitable for
> Ceph, either by being recommend by other Ceph users or you test them
> yourself for sync writes performance using fio tool as outlined earlier.
> Then after you build your cluster you can use rados and/or rbd bencmark
> tests to benchmark your cluster and find bottlenecks using atop/sar/collectl
> which will help you tune your cluster. 
> All 36 OSDs are: Crucial_CT960M500SSD1
> 
> Rados bench tests were done at the beginning. The speed was much faster than
> it is now. I cannot recall the test results, someone else on my team ran
> them. Recently, I had thought the slow disk problem was a configuration
> issue with Ceph - before I posted here. Now we are hoping it may be resolved
> with a firmware update. (If it is firmware related, rebooting the problem
> node may temporarily resolve this)
> 
> Though you did see better improvements, your cluster with 27 SSDs should
> give much higher numbers than 3k iops. If you are running rados bench while
> you have other client ios

[ceph-users] Single disk per OSD ?

2017-12-01 Thread Maged Mokhtar
Hi all, 

I believe most existing setups use 1 disk per OSD. Is this going to remain
the most common setup in the future? With the move to lvm, will the use of
multiple disks per OSD become preferable? On the other side I also see
nvme vendors recommending multiple OSDs (2, 4) per disk, as disks are
getting too fast for a single OSD process. 

Can anyone shed some light on this or share recommendations, please? 

Thanks a lot. 

Maged___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to increase the size of requests written to a ceph image

2017-10-25 Thread Maged Mokhtar
llers, but my current position has hp raid controllers and we 
> just tracked down 10 of our nodes that had >100ms await pretty much always 
> were the only 10 nodes in the cluster with failed batteries on the raid 
> controllers.
> 
> On Thu, Oct 19, 2017, 8:15 PM Christian Balzer <ch...@gol.com> wrote: 
> Hello,
> 
> On Thu, 19 Oct 2017 17:14:17 -0500 Russell Glaue wrote:
> 
>> That is a good idea.
>> However, a previous rebalancing processes has brought performance of our
>> Guest VMs to a slow drag.
>> 
> 
> Never mind that I'm not sure that these SSDs are particularly well suited
> for Ceph, your problem is clearly located on that one node.
> 
> Not that I think it's the case, but make sure your PG distribution is not
> skewed with many more PGs per OSD on that node.
> 
> Once you rule that out my first guess is the RAID controller, you're
> running the SSDs are single RAID0s I presume?
> If so, either a configuration difference or a failed BBU on the controller
> could result in the writeback cache being disabled, which would explain
> things beautifully.
> 
> As for a temporary test/fix (with reduced redundancy of course), set noout
> (or mon_osd_down_out_subtree_limit accordingly) and turn the slow host off.
> 
> This should result in much better performance than you have now and of
> course be the final confirmation of that host being the culprit.
> 
> Christian
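The checks suggested above map roughly to commands like these (a sketch):

  ceph osd df tree       # compare the PGS column across hosts for skew
  ceph osd set noout     # prevent rebalancing while the suspect host is down
  # stop the OSDs on / power off the suspect host, re-run the benchmark, then:
  ceph osd unset noout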
> 
>> 
>> On Thu, Oct 19, 2017 at 3:55 PM, Jean-Charles Lopez <jelo...@redhat.com>
>> wrote:
>> 
>>> Hi Russell,
>>>
>>> as you have 4 servers, assuming you are not doing EC pools, just stop all
>>> the OSDs on the second questionable server, mark the OSDs on that server as
>>> out, let the cluster rebalance and when all PGs are active+clean just
>>> replay the test.
>>>
>>> All IOs should then go only to the other 3 servers.
>>>
>>> JC
>>>
>>> On Oct 19, 2017, at 13:49, Russell Glaue <rgl...@cait.org> wrote:
>>>
>>> No, I have not ruled out the disk controller and backplane making the
>>> disks slower.
>>> Is there a way I could test that theory, other than swapping out hardware?
>>> -RG
>>>
>>> On Thu, Oct 19, 2017 at 3:44 PM, David Turner <drakonst...@gmail.com>
>>> wrote:
>>>
>>>> Have you ruled out the disk controller and backplane in the server
>>>> running slower?
>>>>
>>>> On Thu, Oct 19, 2017 at 4:42 PM Russell Glaue <rgl...@cait.org> wrote:
>>>>
>>>>> I ran the test on the Ceph pool, and ran atop on all 4 storage servers,
>>>>> as suggested.
>>>>>
>>>>> Out of the 4 servers:
>>>>> 3 of them performed with 17% to 30% disk %busy, and 11% CPU wait.
>>>>> Momentarily spiking up to 50% on one server, and 80% on another
>>>>> The 2nd newest server was almost averaging 90% disk %busy and 150% CPU
>>>>> wait. And more than momentarily spiking to 101% disk busy and 250% CPU 
>>>>> wait.
>>>>> For this 2nd newest server, this was the statistics for about 8 of 9
>>>>> disks, with the 9th disk not far behind the others.
>>>>>
>>>>> I cannot believe all 9 disks are bad
>>>>> They are the same disks as the newest 1st server, Crucial_CT960M500SSD1,
>>>>> and same exact server hardware too.
>>>>> They were purchased at the same time in the same purchase order and
>>>>> arrived at the same time.
>>>>> So I cannot believe I just happened to put 9 bad disks in one server,
>>>>> and 9 good ones in the other.
>>>>>
>>>>> I know I have Ceph configured exactly the same on all servers
>>>>> And I am sure I have the hardware settings configured exactly the same
>>>>> on the 1st and 2nd servers.
>>>>> So if I were someone else, I would say it maybe is bad hardware on the
>>>>> 2nd server.
>>>>> But the 2nd server is running very well without any hint of a problem.
>>>>>
>>>>> Any other ideas or suggestions?
>>>>>
>>>>> -RG
>>>>>
>>>>>
>>>>> On Wed, Oct 18, 2017 at 3:40 PM, Maged Mokhtar <mmokh...@petasan.org>
>>>>> wrote:
>>>>>
>>>>>> just run the same 32 threaded rados test as you did before and this
>>>>>> time run atop while the test is running looking for %busy of cpu/disks. 

Re: [ceph-users] OSDs wrongly marked down

2017-12-20 Thread Maged Mokhtar
Could also be your hardware under powered for the io you have. try to
check your resource load during peak workload  together with recovery
and scrubbing going on at same time.  

On 2017-12-20 17:03, David Turner wrote:

> When I have OSDs wrongly marked down it's usually to do with the 
> filestore_split_multiple and filestore_merge_threshold in a thing I call PG 
> subfolder splitting.  This is no longer a factor with bluestore, but as 
> you're running hammer, it's worth a look.  
> http://docs.ceph.com/docs/hammer/rados/configuration/filestore-config-ref/ 
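For reference, on filestore those settings live in ceph.conf on the OSD nodes; a subfolder splits at roughly filestore_split_multiple * abs(filestore_merge_threshold) * 16 objects (the values below are only an illustration):

  [osd]
  filestore merge threshold = 40
  filestore split multiple = 8
  # splits at about 8 * 40 * 16 = 5120 objects per subfolder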
> 
> On Wed, Dec 20, 2017 at 9:31 AM Garuti, Lorenzo  wrote: 
> Hi Sergio,  
> 
> in my case it was a network problem, occasionally  (due to network problems) 
> mon.{id} can't reach osd.{id}. 
> The massage  fault, initiating reconnect and  failed lossy con in your logs 
> suggest a network problem. 
> 
> See also: 
> 
> http://docs.ceph.com/docs/giant/rados/troubleshooting/troubleshooting-osd/#flapping-osds
>  
> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/troubleshooting_guide/troubleshooting-osds#flapping-osds
>  
> 
> Lorenzo 
> 
> 2017-12-20 15:13 GMT+01:00 Sergio Morales :
> 
> Hi.
> 
> I'm having problem with the OSD en  my cluster. 
> 
> Randomly some OSD get  wrongly marked down. I set my "mon osd min down 
> reporters " to OSD +1, but i still get this problem.
> 
> Any tips or ideas to do the troubleshooting? I'm using Ceph 0.94.5 on Centos 
> 7.
> 
> The logs shows this:
> 
> 2017-12-19 16:59:26.357707 7fa9177d3700  0 -- 172.17.4.2:6830/4775054 [1] >> 
> 172.17.4.3:6800/2009784 [2] pipe(0x7fa8a0907000 sd=43 :45955 s=1 pgs=1089 
> cs=1 l=0 c=0x7fa8a0965f00).connect got RESETSESSION
> 2017-12-19 16:59:26.360240 7fa8e5652700  0 -- 172.17.4.2:6830/4775054 [1] >> 
> 172.17.4.1:6808/6007742 [3] pipe(0x7fa9310e3000 sd=26 :53375 s=2 pgs=5272 
> cs=1 l=0 c=0x7fa931045680).fault, initiating reconnect
> 
> 2017-12-19 16:59:25.716758 7fa8e74c1700  0 -- 172.17.4.2:6830/4775054 [1] >> 
> 172.17.4.1:6826/1007559 [4] pipe(0x7fa907052000 sd=17 :45743 s=1 pgs=2105 
> cs=1 l=0 c=0x7fa8a051a180).connect got RESETSESSION
> 2017-12-19 16:59:25.716308 7fa9849ed700  0 -- 172.17.3.2:6802/3775054 [5] 
> submit_message osd_op_reply(392 rbd_data.129d2042eabc234.0605 
> [set-alloc-hint object_size 4194304 write_size 4194304,write 0~126976] 
> v26497'18879046 uv18879046 ondisk = 0) v6 remote, 172.17.1.3:0/5911141 [6], 
> failed lossy con, dropping message 0x7fa8830edb00
> 2017-12-19 16:59:25.718694 7fa9849ed700  0 -- 172.17.3.2:6802/3775054 [5] 
> submit_message osd_op_reply(10610054 rbd_data.6ccd3348ab9aac.011d 
> [set-alloc-hint object_size 8388608 write_size 8388608,write 876544~4096] 
> v26497'15075797 uv15075797 ondisk = 0) v6 remote, 172.17.1.4:0/1028032 [7], 
> failed lossy con, dropping message 0x7fa87a911700
> 
> -- 
> 
> Sergio A. Morales 
> Ingeniero de Sistemas 
> LINETS CHILE - 56 2 2412 5858 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> -- 
> 
> Lorenzo Garuti
> CED MaxMara
> email: garut...@maxmara.it 
> tel: 0522 3993772 - 335 8416054 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

 

Links:
--
[1] http://172.17.4.2:6830/4775054
[2] http://172.17.4.3:6800/2009784
[3] http://172.17.4.1:6808/6007742
[4] http://172.17.4.1:6826/1007559
[5] http://172.17.3.2:6802/3775054
[6] http://172.17.1.3:0/5911141
[7] http://172.17.1.4:0/1028032___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Issues with RBD when rebooting

2018-05-25 Thread Maged Mokhtar
On 2018-05-25 12:11, Josef Zelenka wrote:

> Hi, we are running a jewel cluster (54OSDs, six nodes, ubuntu 16.04) that 
> serves as a backend for openstack (newton) VMs. Today we had to reboot one of 
> the nodes (replicated pool, x2) and some of our VMs oopsed with issues with 
> their FS (mainly database VMs, postgresql) - is there a reason for this to 
> happen? if data is replicated, the VMs shouldn't even notice we rebooted one 
> of the nodes, right? Maybe i just don't understand how this works correctly, 
> but i hope someone around here can either tell me why this is happenning or 
> how to fix it.
> 
> Thanks
> 
> Josef
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

It could be a timeout setting issue. Typically your higher application
level timeouts should be larger than your low level io timeouts to allow
for recovery. Check if your postgresql has timeouts that may be set too
low.
At the low level, the OSD will be detected as failed via
osd_heartbeat_grace + osd_heartbeat_interval; you can lower this to, for
example, 20s via:
osd heartbeat grace = 15
osd heartbeat interval = 5
This will give 20 sec before the osd is reported as dead and remapping
occurs. Do not lower it too much, else you may trigger remaps on
false alarms. 

At higher levels, it may be worth double checking:
rados_osd_op_timeout in case of librbd
osd_request_timeout in case of kernel rbd (if enabled)
They need to be larger than the osd timeouts above 

At the higher levels, the OS disk timeout (usually already high enough) is:
/sys/block/sdX/device/timeout 

Your postgresql timeouts need to be higher than 20s in this case. 
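Putting the above in one place (a sketch; the librbd value is only an example and must stay well above the ~20s detection window):

  [osd]                                # ceph.conf
  osd heartbeat grace = 15
  osd heartbeat interval = 5

  [client]
  rados osd op timeout = 60            # librbd clients (example value)

  # kernel rbd: osd_request_timeout map option, if your kernel supports it
  # OS-level disk timeout inside the VM:
  cat /sys/block/sdX/device/timeout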

/Maged___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to use libradostriper to improve I/O bandwidth?

2018-06-12 Thread Maged Mokhtar
On 2018-06-12 01:01, Jialin Liu wrote:

> Hello Ceph Community,  
> 
> I used libradosstriper api to test the striping feature, it doesn't seem to 
> improve the performance at all, can anyone advise what's wrong with my 
> settings: 
> 
> The rados object store  testbed at my center has 
> osd: 48 
> oss: 4 
> monitor:2 
> pg number: 1024 
> replicated size: 3 
> 
> I have implemented a benchmark code [1]  with libradosstriper api.  
> 
> I then used 1 process with 1 thread to do the test, varying a few settings: 
> 
> * stripe count from 1 to 48, 
> * and object size from 1MB to 128 MB (with stripe size 1MB, stripe size needs 
> to be smaller than the rados object size), 
> * and file size from 100MB to 1.6GB, 
> 
> The peak bandwidth among all tests is only 130MB/s, no difference in 
> different tests. 
> 
> I suspect that the IO got serialized in the rados layer, with some uncertain 
> evidence in the libradosstriper source code (note the for loop):

>> ...
>> Striper::file_to_extents(cct(), format.c_str(), , off, len, 0, extents);
>> for (vector<ObjectExtent>::iterator p = extents.begin(); p !=
>> extents.end(); ++p) {
>>   r = m_ioCtx.aio_write(p->oid.name, rados_completion, oid_bl,
>>                         p->length, p->offset);
>> }
>> ...
 Could you please correct me if I misused or misunderstood any things?
Thanks much.  

Best, 
Jialin 
NERSC/LBNL 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

The striper source is correct, the for loop does aio writes so there is
no serialization blocking. 

/Maged 

Links:
--
[1]
https://github.com/NERSC/object-store/blob/master/tests/ceph/vpic_io/librados_test.c
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS+NFS For VMWare

2018-07-02 Thread Maged Mokhtar
Hi Nick, 

With iSCSI we reach over 150 MB/s vmotion for single vm, 1 GB/s for 7-8
vm migrations. Since these are 64KB block sizes, latency/iops is a large
factor; you need either controllers with write back cache or all flash.
hdds without write cache will suffer even with external wal/db on ssds,
giving around 80 MB/s vmotion migration. Potentially it may be possible
to get higher vmotion speeds by using fancy striping but i would not
recommend this unless your total queue depths in all your vms is small
compared to the number of osds. 

Regarding thin provisioning, a vmdk provisioned as lazy zeroed does have
an "initial" large impact on random write performance - it can be up to
10x slower. If you are writing a random 64KB io to an un-allocated vmfs
block, vmfs will first write 1MB to fill the block with zeros, then write
the 64KB client data, so although a lot of data is being written the
perceived client bandwidth is very low. The performance will gradually
get better with time until the disk is fully provisioned. It is also
possible to thick eager zero the vmdk disk at creation time. Again this
is more apparent with random writes rather than sequential or vmotion
load. 

Maged 

On 2018-06-29 18:48, Nick Fisk wrote:

> This is for us peeps using Ceph with VMWare. 
> 
> My current favoured solution for consuming Ceph in VMWare is via RBD's 
> formatted with XFS and exported via NFS to ESXi. This seems to perform better 
> than iSCSI+VMFS which seems to not play nicely with Ceph's PG contention 
> issues particularly if working with thin provisioned VMDK's. 
> 
> I've still been noticing some performance issues however, mainly noticeable 
> when doing any form of storage migrations. This is largely due to the way 
> vSphere transfers VM's in 64KB IO's at a QD of 32. vSphere does this so 
> Arrays with QOS can balance the IO easier than if larger IO's were submitted. 
> However Ceph's PG locking means that only one or two of these IO's can happen 
> at a time, seriously lowering throughput. Typically you won't be able to push 
> more than 20-25MB/s during a storage migration 
> 
> There is also another issue in that the IO needed for the XFS journal on the 
> RBD, can cause contention and effectively also means every NFS write IO sends 
> 2 down to Ceph. This can have an impact on latency as well. Due to possible 
> PG contention caused by the XFS journal updates when multiple IO's are in 
> flight, you normally end up making more and more RBD's to try and spread the 
> load. This normally means you end up having to do storage migrations…..you 
> can see where I'm getting at here. 
> 
> I've been thinking for a while that CephFS works around a lot of these 
> limitations. 
> 
> 1.   It supports fancy striping, so should mean there is less per object 
> contention 
> 
> 2.   There is no FS in the middle to maintain a journal and other 
> associated IO 
> 
> 3.   A single large NFS mount should have none of the disadvantages seen 
> with a single RBD 
> 
> 4.   No need to migrate VM's about because of #3 
> 
> 5.   No need to fstrim after deleting VM's 
> 
> 6.   Potential to do away with pacemaker and use LVS to do active/active 
> NFS as ESXi does its own locking with files 
> 
> With this in mind I exported a CephFS mount via NFS and then mounted it to an 
> ESXi host as a test. 
> 
> Initial results are looking very good. I'm seeing storage migrations to the 
> NFS mount going at over 200MB/s, which equates to several thousand IO's and 
> seems to be writing at the intended QD32. 
> 
> I need to do more testing to make sure everything works as intended, but like 
> I say, promising initial results. 
> 
> Further testing needs to be done to see what sort of MDS performance is 
> required, I would imagine that since we are mainly dealing with large files, 
> it might not be that critical. I also need to consider the stability of 
> CephFS, RBD is relatively simple and is in use by a large proportion of the 
> Ceph community. CephFS is a lot easier to "upset". 
> 
> Nick 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to increase the size of requests written to a ceph image

2017-10-26 Thread Maged Mokhtar
I hope the firmware update fixes things for you.
Regarding monitoring: if your tool is able to record disk busy%, iops and
throughput, then you do not need to run atop. 

I still highly recommend you run the fio SSD test for sync writes:
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
[6] 
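For reference, the sync write test from that post is along these lines (it is destructive - run it only against a device with no data you care about):

  fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 \
      --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test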

The other important factor for SSDs is they should have commercial grade
endurance/DWPD 

In the absence of other load, if you stress your cluster using the rados 4k
benchmark (I recommended 4k since this was the block size you were
getting when doing the RAID comparison in your initial post), your load
will be dominated by IOPS performance. You should easily be seeing a
couple of thousand IOPS at the raw disk level; at the cluster level with 30
disks, you should be roughly approaching 30 x the actual raw disk IOPS for
4k reads and about 5 x for writes (due to replicas and journal seeks).
If you were using fast SSDs (10k+ IOPS per disk), you would start
hitting other bottlenecks such as cpu%, but your case is far from this. In
your case, to get decent cluster IOPS performance you should be aiming for
a couple of thousand IOPS at the raw disk level and a busy% below
90% during the rados 4k test. 
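
As a sketch, that 4k test can be run like this (the pool name rbd is just an example; --no-cleanup keeps the objects so a read pass can follow):

rados bench -p rbd 60 write -b 4096 -t 16 --no-cleanup
rados bench -p rbd 60 rand -t 16
rados -p rbd cleanup

Watch ceph status and atop/iostat on the OSD nodes while it runs to see where the cluster tops out.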

Maged 

On 2017-10-26 16:44, Russell Glaue wrote:

> On Wed, Oct 25, 2017 at 7:09 PM, Maged Mokhtar <mmokh...@petasan.org> wrote:
> 
>> It depends on what stage you are in: 
>> in production, probably the best thing is to setup a monitoring tool 
>> (collectd/grahite/prometheus/grafana) to monitor both ceph stats as well as 
>> resource load. This will, among other things, show you if you have slowing 
>> disks.
> 
> I am monitoring Ceph performance with ceph-dash 
> (http://cephdash.crapworks.de/), that is why I knew to look into the slow 
> writes issue. And I am using Monitorix (http://www.monitorix.org/) to monitor 
> system resources, including Disk I/O. 
> 
> However, though I can monitor individual disk performance at the system 
> level, it seems Ceph does not tax any disk more than the worst disk. So in my 
> monitoring charts, all disks have the same performance. 
> All four nodes are base-lining at 50 writes/sec during the cluster's normal 
> load, with the non-problem hosts spiking up to 150, and the problem host only 
> spikes up to 100.  
> But during the window of time I took the problem host OSDs down to run the 
> bench tests, the OSDs on the other nodes increased to 300-500 writes/sec. 
> Otherwise, the chart looks the same for all disks on all ceph nodes/hosts. 
> 
>> Before production you should first make sure your SSDs are suitable for 
>> Ceph, either by being recommend by other Ceph users or you test them 
>> yourself for sync writes performance using fio tool as outlined earlier. 
>> Then after you build your cluster you can use rados and/or rbd bencmark 
>> tests to benchmark your cluster and find bottlenecks using atop/sar/collectl 
>> which will help you tune your cluster.
> 
> All 36 OSDs are: Crucial_CT960M500SSD1 
> 
> Rados bench tests were done at the beginning. The speed was much faster than 
> it is now. I cannot recall the test results, someone else on my team ran 
> them. Recently, I had thought the slow disk problem was a configuration issue 
> with Ceph - before I posted here. Now we are hoping it may be resolved with a 
> firmware update. (If it is firmware related, rebooting the problem node may 
> temporarily resolve this) 
> 
>> Though you did see better improvements, your cluster with 27 SSDs should 
>> give much higher numbers than 3k iops. If you are running rados bench while 
>> you have other client ios, then obviously the reported number by the tool 
>> will be less than what the cluster is actually giving...which you can find 
>> out via ceph status command, it will print the total cluster throughput and 
>> iops. If the total is still low i would recommend running the fio raw disk 
>> test, maybe the disks are not suitable. When you removed your 9 bad disk 
>> from 36 and your performance doubled, you still had 2 other disk slowing 
>> you..meaning near 100% busy ? It makes me feel the disk type used is not 
>> good. For these near 100% busy disks can you also measure their raw disk 
>> iops at that load (i am not sure atop shows this, if not use 
>> sat/syssyat/iostat/collecl).
> 
> I ran another bench test today with all 36 OSDs up. The overall performance 
> was improved slightly compared to the original tests. Only 3 OSDs on the 
> problem host were increasing to 101% disk busy. 
> The iops reported from ceph status during this bench test ranged from 1.6k to 
> 3.3k, the test yielding 4k iops. 
> 
> Yes, the two other OSDs/disks that were the bottleneck were at 101% disk 
>

Re: [ceph-users] How to increase the size of requests written to a ceph image

2017-10-27 Thread Maged Mokhtar
It is quite likely related; things are pointing to bad disks. Probably
the best thing is to plan for disk replacement, the sooner the better, as
it could get worse. 

On 2017-10-27 02:22, Christian Wuerdig wrote:

> Hm, no necessarily directly related to your performance problem,
> however: These SSDs have a listed endurance of 72TB total data written
> - over a 5 year period that's 40GB a day or approx 0.04 DWPD. Given
> that you run the journal for each OSD on the same disk, that's
> effectively at most 0.02 DWPD (about 20GB per day per disk). I don't
> know many who'd run a cluster on disks like those. Also it means these
> are pure consumer drives which have a habit of exhibiting random
> performance at times (based on unquantified anecdotal personal
> experience with other consumer model SSDs). I wouldn't touch these
> with a long stick for anything but small toy-test clusters.
> 
> On Fri, Oct 27, 2017 at 3:44 AM, Russell Glaue <rgl...@cait.org> wrote: 
> On Wed, Oct 25, 2017 at 7:09 PM, Maged Mokhtar <mmokh...@petasan.org> wrote: 
> It depends on what stage you are in:
> in production, probably the best thing is to setup a monitoring tool
> (collectd/grahite/prometheus/grafana) to monitor both ceph stats as well as
> resource load. This will, among other things, show you if you have slowing
> disks. 
> I am monitoring Ceph performance with ceph-dash
> (http://cephdash.crapworks.de/), that is why I knew to look into the slow
> writes issue. And I am using Monitorix (http://www.monitorix.org/) to
> monitor system resources, including Disk I/O.
> 
> However, though I can monitor individual disk performance at the system
> level, it seems Ceph does not tax any disk more than the worst disk. So in
> my monitoring charts, all disks have the same performance.
> All four nodes are base-lining at 50 writes/sec during the cluster's normal
> load, with the non-problem hosts spiking up to 150, and the problem host
> only spikes up to 100.
> But during the window of time I took the problem host OSDs down to run the
> bench tests, the OSDs on the other nodes increased to 300-500 writes/sec.
> Otherwise, the chart looks the same for all disks on all ceph nodes/hosts.
> 
> Before production you should first make sure your SSDs are suitable for
> Ceph, either by being recommend by other Ceph users or you test them
> yourself for sync writes performance using fio tool as outlined earlier.
> Then after you build your cluster you can use rados and/or rbd bencmark
> tests to benchmark your cluster and find bottlenecks using atop/sar/collectl
> which will help you tune your cluster. 
> All 36 OSDs are: Crucial_CT960M500SSD1
> 
> Rados bench tests were done at the beginning. The speed was much faster than
> it is now. I cannot recall the test results, someone else on my team ran
> them. Recently, I had thought the slow disk problem was a configuration
> issue with Ceph - before I posted here. Now we are hoping it may be resolved
> with a firmware update. (If it is firmware related, rebooting the problem
> node may temporarily resolve this)
> 
> Though you did see better improvements, your cluster with 27 SSDs should
> give much higher numbers than 3k iops. If you are running rados bench while
> you have other client ios, then obviously the reported number by the tool
> will be less than what the cluster is actually giving...which you can find
> out via ceph status command, it will print the total cluster throughput and
> iops. If the total is still low i would recommend running the fio raw disk
> test, maybe the disks are not suitable. When you removed your 9 bad disk
> from 36 and your performance doubled, you still had 2 other disk slowing
> you..meaning near 100% busy ? It makes me feel the disk type used is not
> good. For these near 100% busy disks can you also measure their raw disk
> iops at that load (i am not sure atop shows this, if not use
> sat/syssyat/iostat/collecl). 
> I ran another bench test today with all 36 OSDs up. The overall performance
> was improved slightly compared to the original tests. Only 3 OSDs on the
> problem host were increasing to 101% disk busy.
> The iops reported from ceph status during this bench test ranged from 1.6k
> to 3.3k, the test yielding 4k iops.
> 
> Yes, the two other OSDs/disks that were the bottleneck were at 101% disk
> busy. The other OSD disks on the same host were sailing along at like 50-60%
> busy.
> 
> All 36 OSD disks are exactly the same disk. They were all purchased at the
> same time. All were installed at the same time.
> I cannot believe it is a problem with the disk model. A failed/bad disk,
> perhaps is possible. But the disk model itself cannot be the problem based
> on what I am seeing. If I am seeing bad performance on a

Re: [ceph-users] What is the should be the expected latency of 10Gbit network connections

2018-01-22 Thread Maged Mokhtar
On 2018-01-22 08:39, Wido den Hollander wrote:

> On 01/20/2018 02:02 PM, Marc Roos wrote: 
> 
>> If I test my connections with sockperf via a 1Gbit switch I get around
>> 25usec, when I test the 10Gbit connection via the switch I have around
>> 12usec is that normal? Or should there be a differnce of 10x.
> 
> No, that's normal.
> 
> Tests with 8k ping packets over different links I did:
> 
> 1GbE:  0.800ms
> 10GbE: 0.200ms
> 40GbE: 0.150ms
> 
> Wido
> 
>> sockperf ping-pong
>> 
>> sockperf: Warmup stage (sending a few dummy messages)...
>> sockperf: Starting test...
>> sockperf: Test end (interrupted by timer)
>> sockperf: Test ended
>> sockperf: [Total Run] RunTime=10.100 sec; SentMessages=432875;
>> ReceivedMessages=432874
>> sockperf: = Printing statistics for Server No: 0
>> sockperf: [Valid Duration] RunTime=10.000 sec; SentMessages=428640;
>> ReceivedMessages=428640
>> sockperf: > avg-lat= 11.609 (std-dev=1.684)
>> sockperf: # dropped messages = 0; # duplicated messages = 0; #
>> out-of-order messages = 0
>> sockperf: Summary: Latency is 11.609 usec
>> sockperf: Total 428640 observations; each percentile contains 4286.40
>> observations
>> sockperf: --->  observation =  856.944
>> sockperf: ---> percentile  99.99 =   39.789
>> sockperf: ---> percentile  99.90 =   20.550
>> sockperf: ---> percentile  99.50 =   17.094
>> sockperf: ---> percentile  99.00 =   15.578
>> sockperf: ---> percentile  95.00 =   12.838
>> sockperf: ---> percentile  90.00 =   12.299
>> sockperf: ---> percentile  75.00 =   11.844
>> sockperf: ---> percentile  50.00 =   11.409
>> sockperf: ---> percentile  25.00 =   11.124
>> sockperf: --->  observation =8.888
>> 
>> sockperf: Warmup stage (sending a few dummy messages)...
>> sockperf: Starting test...
>> sockperf: Test end (interrupted by timer)
>> sockperf: Test ended
>> sockperf: [Total Run] RunTime=1.100 sec; SentMessages=22065;
>> ReceivedMessages=22064
>> sockperf: = Printing statistics for Server No: 0
>> sockperf: [Valid Duration] RunTime=1.000 sec; SentMessages=20056;
>> ReceivedMessages=20056
>> sockperf: > avg-lat= 24.861 (std-dev=1.774)
>> sockperf: # dropped messages = 0; # duplicated messages = 0; #
>> out-of-order messages = 0
>> sockperf: Summary: Latency is 24.861 usec
>> sockperf: Total 20056 observations; each percentile contains 200.56
>> observations
>> sockperf: --->  observation =   77.158
>> sockperf: ---> percentile  99.99 =   54.285
>> sockperf: ---> percentile  99.90 =   37.864
>> sockperf: ---> percentile  99.50 =   34.406
>> sockperf: ---> percentile  99.00 =   33.337
>> sockperf: ---> percentile  95.00 =   27.497
>> sockperf: ---> percentile  90.00 =   26.072
>> sockperf: ---> percentile  75.00 =   24.618
>> sockperf: ---> percentile  50.00 =   24.443
>> sockperf: ---> percentile  25.00 =   24.361
>> sockperf: --->  observation =   16.746
>> [root@c01 sbin]# sockperf ping-pong -i 192.168.0.12 -p 5001 -t 10
>> sockperf: == version #2.6 ==
>> sockperf[CLIENT] send on:sockperf: using recvfrom() to block on
>> socket(s)
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

I find the ping command with the flood option handy for measuring latency;
it gives min/max/average/std deviation stats. 

example: 

ping  -c 10 -f 10.0.1.12 

Maged___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What is the should be the expected latency of 10Gbit network connections

2018-01-22 Thread Maged Mokhtar
On 2018-01-23 08:27, Blair Bethwaite wrote:

> Firstly, the OP's premise in asking, "Or should there be a differnce
> of 10x", is fundamentally incorrect. Greater bandwidth does not mean
> lower latency, though the latter almost always results in the former.
> Unfortunately, changing the speed of light remains a difficult
> engineering challenge :-). However, you can do things like: add
> multiple links, overlap signals on the wire, and tweak error
> correction encodings; all to get more bits on the wire without making
> the wire itself any faster. Take Mellanox 100Gb ethernet, 1 lane is
> 25Gb, to get 50Gb they mash 2 lanes together, to get 100Gb they mash 4
> lanes - the latency of single bit transmission is more-or-less
> unchanged. Also note that with UDP/TCP pings or actual Ceph traffic
> we're going via the kernel stack running on the CPU and as such the
> speed & power-management of the CPU can make quite a difference.
> 
> Example 25GE on a dual-port CX-4 card in LACP bond, RHEL7 host.
> 
> $ cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 7.3 (Maipo)
> $ ofed_info | head -1
> MLNX_OFED_LINUX-4.0-1.0.1.0 (OFED-4.0-1.0.1):
> $ grep 'model name' /proc/cpuinfo | uniq
> model name  : Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> $ ibv_devinfo
> hca_id: mlx5_1
> transport:  InfiniBand (0)
> fw_ver: 14.18.1000
> node_guid:  ...
> sys_image_guid: ...
> vendor_id:  0x02c9
> vendor_part_id: 4117
> hw_ver: 0x0
> board_id:   MT_2420110034
> ...
> 
> $ sudo ping -M do -s 8972 -c 10 -f ...
> 10 packets transmitted, 10 received, 0% packet loss, time 4652ms
> rtt min/avg/max/mdev = 0.029/0.031/2.711/0.015 ms, ipg/ewma 0.046/0.031 ms
> 
> $ sudo ping -M do -s 3972 -c 10 -f ...
> 10 packets transmitted, 10 received, 0% packet loss, time 3321ms
> rtt min/avg/max/mdev = 0.019/0.022/0.364/0.003 ms, ipg/ewma 0.033/0.022 ms
> 
> $ sudo ping -M do -s 1972 -c 10 -f ...
> 10 packets transmitted, 10 received, 0% packet loss, time 2818ms
> rtt min/avg/max/mdev = 0.017/0.018/0.086/0.005 ms, ipg/ewma 0.028/0.021 ms
> 
> $ sudo ping -M do -s 472 -c 10 -f ...
> 10 packets transmitted, 10 received, 0% packet loss, time 2498ms
> rtt min/avg/max/mdev = 0.014/0.016/0.305/0.005 ms, ipg/ewma 0.024/0.017 ms
> 
> $ sudo ping -M do -c 10 -f ...
> 10 packets transmitted, 10 received, 0% packet loss, time 2363ms
> rtt min/avg/max/mdev = 0.014/0.015/0.322/0.006 ms, ipg/ewma 0.023/0.016 ms
> 
> On 22 January 2018 at 22:37, Nick Fisk <n...@fisk.me.uk> wrote: 
> 
>> Anyone with 25G ethernet willing to do the test? Would love to see what the
>> latency figures are for that.
>> 
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Maged Mokhtar
>> Sent: 22 January 2018 11:28
>> To: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] What is the should be the expected latency of
>> 10Gbit network connections
>> 
>> On 2018-01-22 08:39, Wido den Hollander wrote:
>> 
>> On 01/20/2018 02:02 PM, Marc Roos wrote:
>> 
>> If I test my connections with sockperf via a 1Gbit switch I get around
>> 25usec, when I test the 10Gbit connection via the switch I have around
>> 12usec is that normal? Or should there be a differnce of 10x.
>> 
>> No, that's normal.
>> 
>> Tests with 8k ping packets over different links I did:
>> 
>> 1GbE:  0.800ms
>> 10GbE: 0.200ms
>> 40GbE: 0.150ms
>> 
>> Wido
>> 
>> sockperf ping-pong
>> 
>> sockperf: Warmup stage (sending a few dummy messages)...
>> sockperf: Starting test...
>> sockperf: Test end (interrupted by timer)
>> sockperf: Test ended
>> sockperf: [Total Run] RunTime=10.100 sec; SentMessages=432875;
>> ReceivedMessages=432874
>> sockperf: = Printing statistics for Server No: 0
>> sockperf: [Valid Duration] RunTime=10.000 sec; SentMessages=428640;
>> ReceivedMessages=428640
>> sockperf: > avg-lat= 11.609 (std-dev=1.684)
>> sockperf: # dropped messages = 0; # duplicated messages = 0; #
>> out-of-order messages = 0
>> sockperf: Summary: Latency is 11.609 usec
>> sockperf: Total 428640 observations; each percentile contains 4286.40
>> observations
>> sockperf: --->  observation =  856.944
>> sockperf: ---> percentile  99.99 =   39.789
>> sockperf: ---> percentile  99.90 =   20.550
>> sockperf: ---> percentile  99.50 =   17.094
>> sockperf: ---> perce

Re: [ceph-users] How ceph client read data from ceph cluster

2018-01-26 Thread Maged Mokhtar
On 2018-01-26 09:09, shadow_lin wrote:

> Hi List, 
> I read a old article about how ceph client read from ceph cluster.It said the 
> client only read from the primary osd. Since ceph cluster in replicate mode 
> have serveral copys of data only read from one copy seems waste the 
> performance of concurrent read from all the copys. 
> But that artcile is rather old so maybe ceph has imporved to read from all 
> the copys? But I haven't find any info about that. 
> Any info about that would be appreciated. 
> Thanks 
> 
> 2018-01-26 
> -
> shadow_lin 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Hi  

In the majority of cases you will have more concurrent IO requests than
disks, so the load will already be distributed evenly. If this is not
the case and you have a large cluster with few clients, you may
consider using object/rbd striping so that each IO is divided into
requests to different OSDs. 
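
For illustration, striping is set per image at creation time; a sketch with made-up pool/image names (64k stripe unit spread across 16 objects, keeping the default 4M object size):

rbd create mypool/vm-disk-1 --size 100G --stripe-unit 65536 --stripe-count 16

Reads and writes to one logical region then fan out across 16 objects, and therefore across more OSDs in parallel, instead of hitting a single primary OSD.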

Maged___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How ceph client read data from ceph cluster

2018-01-26 Thread Maged Mokhtar
Hi Lin, 

Yes it will read from the primary osd, but for the reasons stated this
should not impact performance. 

Maged 

On 2018-01-26 19:52, shadow_lin wrote:

> Hi Maged, 
> I just want to make sure if I understand how ceph client read from cluster.So 
> with current version of ceph(12.2.2) the client only read from the primary 
> osd(one copy),is that true? 
> 
> 2018-01-27
> -
> 
> lin.yunfan 
> ---------
> 
> From: Maged Mokhtar <mmokh...@petasan.org> 
> Sent: 2018-01-26 20:27 
> Subject: Re: [ceph-users] How ceph client read data from ceph cluster 
> To: "shadow_lin"<shadow_...@163.com> 
> Cc: "ceph-users"<ceph-users@lists.ceph.com> 
> 
> On 2018-01-26 09:09, shadow_lin wrote: 
> Hi List, 
> I read a old article about how ceph client read from ceph cluster.It said the 
> client only read from the primary osd. Since ceph cluster in replicate mode 
> have serveral copys of data only read from one copy seems waste the 
> performance of concurrent read from all the copys. 
> But that artcile is rather old so maybe ceph has imporved to read from all 
> the copys? But I haven't find any info about that. 
> Any info about that would be appreciated. 
> Thanks 
> 
> 2018-01-26 
> -
> shadow_lin 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> Hi  
> 
> The majority of cases you will have more concurrent io requests than disks, 
> so the load will already be distributed evenly. If this is not the case and 
> you have a large cluster with fewer clients, you may consider using 
> object/rbd striping so each io will be divided into different osd requests. 
> 
> Maged___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] troubleshooting ceph performance

2018-01-30 Thread Maged Mokhtar
On 2018-01-31 08:14, Manuel Sopena Ballesteros wrote:

> Dear Ceph community, 
> 
> I have a very small ceph cluster for testing with this configuration: 
> 
> · 2x compute nodes each with: 
> 
> · dual port of 25 nic 
> 
> · 2x socket (56 cores with hyperthreading) 
> 
> · X10 intel nvme DC P3500 drives 
> 
> · 512 GB RAM 
> 
> One of the nodes is also running as a monitor. 
> 
> Installation has been done using ceph-ansible. 
> 
> Ceph version: jewel 
> 
> Storage engine: filestore 
> 
> Performance test below: 
> 
> [root@zeus-59 ceph-block-device]# ceph osd pool ls detail 
> 
> pool 0 'rbd' replicated size 2 min_size 2 crush_ruleset 0 object_hash 
> rjenkins pg_num 64 pgp_num 64 last_change 115 flags hashpspool stripe_width 0 
> 
> pool 1 'images' replicated size 2 min_size 2 crush_ruleset 0 object_hash 
> rjenkins pg_num 128 pgp_num 128 last_change 118 flags hashpspool stripe_width 
> 0 
> 
> removed_snaps [1~3,7~4] 
> 
> pool 3 'backups' replicated size 2 min_size 2 crush_ruleset 0 object_hash 
> rjenkins pg_num 128 pgp_num 128 last_change 120 flags hashpspool stripe_width 
> 0 
> 
> pool 4 'vms' replicated size 2 min_size 2 crush_ruleset 0 object_hash 
> rjenkins pg_num 128 pgp_num 128 last_change 122 flags hashpspool stripe_width 
> 0 
> 
> removed_snaps [1~7] 
> 
> pool 5 'volumes' replicated size 2 min_size 2 crush_ruleset 0 object_hash 
> rjenkins pg_num 128 pgp_num 128 last_change 124 flags hashpspool stripe_width 
> 0 
> 
> removed_snaps [1~3] 
> 
> pool 6 'scbench' replicated size 2 min_size 2 crush_ruleset 0 object_hash 
> rjenkins pg_num 100 pgp_num 100 last_change 126 flags hashpspool stripe_width 
> 0 
> 
> pool 7 'rbdbench' replicated size 2 min_size 2 crush_ruleset 0 object_hash 
> rjenkins pg_num 100 pgp_num 100 last_change 128 flags hashpspool stripe_width 
> 0 
> 
> removed_snaps [1~3] 
> 
> [root@zeus-59 ceph-block-device]# ceph osd tree 
> 
> ID WEIGHT   TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY 
> 
> -1 36.17371 root default 
> 
> -2 18.08685 host zeus-58 
> 
> 0  1.80869 osd.0 up  1.0  1.0 
> 
> 2  1.80869 osd.2 up  1.0  1.0 
> 
> 4  1.80869 osd.4 up  1.0  1.0 
> 
> 6  1.80869 osd.6 up  1.0  1.0 
> 
> 8  1.80869 osd.8 up  1.0  1.0 
> 
> 10  1.80869 osd.10up  1.0  1.0 
> 
> 12  1.80869 osd.12up  1.0  1.0 
> 
> 14  1.80869 osd.14up  1.0  1.0 
> 
> 16  1.80869 osd.16up  1.0  1.0 
> 
> 18  1.80869 osd.18up  1.0  1.0 
> 
> -3 18.08685 host zeus-59 
> 
> 1  1.80869 osd.1 up  1.0  1.0 
> 
> 3  1.80869 osd.3 up  1.0  1.0 
> 
> 5  1.80869 osd.5 up  1.0  1.0 
> 
> 7  1.80869 osd.7 up  1.0  1.0 
> 
> 9  1.80869 osd.9 up  1.0  1.0 
> 
> 11  1.80869 osd.11up  1.0  1.0 
> 
> 13  1.80869 osd.13up  1.0  1.0 
> 
> 15  1.80869 osd.15up  1.0  1.0 
> 
> 17  1.80869 osd.17up  1.0  1.0 
> 
> 19  1.80869 osd.19up  1.0  1.0 
> 
> [root@zeus-59 ceph-block-device]# ceph status 
> 
> cluster 8e930b6c-455e-4328-872d-cb9f5c0359ae 
> 
> health HEALTH_OK 
> 
> monmap e1: 1 mons at {zeus-59=10.0.32.59:6789/0} 
> 
> election epoch 3, quorum 0 zeus-59 
> 
> osdmap e129: 20 osds: 20 up, 20 in 
> 
> flags sortbitwise,require_jewel_osds 
> 
> pgmap v1166945: 776 pgs, 7 pools, 1183 GB data, 296 kobjects 
> 
> 2363 GB used, 34678 GB / 37042 GB avail 
> 
> 775 active+clean 
> 
> 1 active+clean+scrubbing+deep 
> 
> [root@zeus-59 ceph-block-device]# rados bench -p scbench 10 write 
> --no-cleanup 
> 
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 
> for up to 10 seconds or 0 objects 
> 
> Object prefix: benchmark_data_zeus-59.localdomain_2844050 
> 
> sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s) 
> 
> 0   0 0 0 0 0   -   0 
> 
> 1  16   644   6282511.4  2512   0.02102730.025206 
> 
> 2  16  1319  1303   2605.49  2700   0.0238678   0.0243974 
> 
> 3  16  2003  1987   2648.89  2736   0.0201334   0.0240726 
> 
> 4  16  2669  2653   2652.59  2664   0.0258618   0.0240468 
> 
> 5  16  3349     2666.01  2720   0.0189464   0.0239484 
> 
> 6  16  4026  4010   2672.96  2708 0.02215   0.0238954 
> 
> 7  16  4697  4681   2674.49  2684   0.0217258   0.0238887 
> 
> 8  16  5358  5342   2670.64  

Re: [ceph-users] How to clean data of osd with ssd journal(wal, db if it is bluestore) ?

2018-02-01 Thread Maged Mokhtar
Hi Lin, 

We do the extra dd after zapping the disk. ceph-disk has a zap function
that uses wipefs to wipe fs traces, dd to zero 10MB at the start of each
partition, then sgdisk to remove the partition table; I believe ceph-volume
does the same. After this zap, for each data or db block that will be
created on this device, we use the dd command to zero 500MB. This may be
a bit overboard, but other users have had similar issues: 

http://tracker.ceph.com/issues/22354 
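
As a rough sketch of what we run (device names are placeholders and this is destructive, so double-check them first):

# zap: wipe fs traces, zero 10MB at partition starts, remove the partition table
ceph-disk zap /dev/sdX
# then zero the first 500MB of each region where a new data or db block will be created
dd if=/dev/zero of=/dev/sdX bs=1M count=500 oflag=direct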

Also, the initial zap will wipe the disk and zero the start of the
partitions as they used to be; it is possible the new disk will have a db
block with a different size, so the start of the partitioning has changed. 

I am not sure if your question is because you hit this issue, because you
just want to skip the extra dd step, or because you are facing issues
cleaning disks; if it is the latter, we can send you a patch that does
this. 

Maged 

On 2018-02-01 15:04, shadow_lin wrote:

> Hi Maged, 
> The problem you met beacuse of the left over of older cluster.Did you remove 
> the db partition or you just use the old partition? 
> I thought Wido suggest to remove the partition then use the dd to be safe.Is 
> it safe I don't remove the partition and just use dd the try to destory the 
> data on that partition? 
> How would ceph-disk or ceph-volume do to the existing partition of 
> journal,db,wal?Will it clean it or it just uses it without any action? 
> 
> 2018-02-01
> -
> 
> lin.yunfan 
> -----
> 
> From: Maged Mokhtar <mmokh...@petasan.org> 
> Sent: 2018-02-01 14:22 
> Subject: Re: [ceph-users] How to clean data of osd with ssd journal(wal, db if it 
> is bluestore) ? 
> To: "David Turner"<drakonst...@gmail.com> 
> Cc: "shadow_lin"<shadow_...@163.com>,"ceph-users"<ceph-users@lists.ceph.com> 
> 
> I would recommend as Wido to use the dd command. block db device holds the 
> metada/allocation of objects stored in data block, not cleaning this is 
> asking for problems, besides it does not take any time.  In our testing 
> building new custer on top of older installation, we did see many cases where 
> osds will not start and report an error such as fsid of cluster and/or OSD 
> does not match metada in BlueFS superblock...these errors do not appear if we 
> use the dd command.  
> 
> On 2018-02-01 06:06, David Turner wrote: 
> 
> I know that for filestore journals that is fine.  I think it is also safe for 
> bluestore.  Doing Wido's recommendation of writing 100MB would be a good 
> idea, but not necessary. 
> 
> On Wed, Jan 31, 2018, 10:10 PM shadow_lin <shadow_...@163.com> wrote: 
> 
> Hi David, 
> Thanks for your reply. 
> I am wondering what if I don't remove the journal(wal,db for bluestore) 
> partion on the ssd and only zap the data disk.Then I assign the 
> journal(wal,db for bluestore) partion to a new osd.What would happen? 
> 
> 2018-02-01
> -
> 
> lin.yunfan 
> -
> 
> From: David Turner <drakonst...@gmail.com> 
> Sent: 2018-01-31 17:24 
> Subject: Re: [ceph-users] How to clean data of osd with ssd journal(wal, db if it 
> is bluestore) ? 
> To: "shadow_lin"<shadow_...@163.com> 
> Cc: "ceph-users"<ceph-users@lists.ceph.com> 
> 
> I use gdisk to remove the partition and partprobe for the OS to see the new 
> partition table. You can script it with sgdisk. 
> 
> On Wed, Jan 31, 2018, 4:10 AM shadow_lin <shadow_...@163.com> wrote: 
> 
> Hi list, 
> if I create an osd with journal(wal,db if it is bluestore) in the same hdd, I 
> use ceph-disk zap to clean the disk when I want to remove the osd and clean 
> the data on the disk. 
> But if I use a ssd partition as the journal(wal,db if it is bluestore) , how 
> should I clean the journal (wal,db if it is bluestore) of the osd I want to 
> remove?Especially when there are other osds are using other partition of the 
> same ssd  as journals(wal,db if it is bluestore) . 
> 
> 2018-01-31 
> -
> shadow_lin ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph - incorrect output of ceph osd tree

2018-01-31 Thread Maged Mokhtar
try setting: 

mon_osd_min_down_reporters = 1 
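
For example, a sketch of how this could be applied, either in ceph.conf (followed by a mon restart) or injected at runtime (depending on the release, injected values may still require a restart to take effect):

[mon]
mon_osd_min_down_reporters = 1

ceph tell mon.* injectargs '--mon_osd_min_down_reporters 1'

The idea is that on a small cluster the default of 2 distinct reporters may never be reached when a whole host dies, so its OSDs can linger as "up".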

On 2018-01-31 20:46, Steven Vacaroaia wrote:

> Hi, 
> 
> Why is ceph osd tree reports that osd.4 is up when the server on which osd.4 
> is running is actually down ?? 
> 
> Any help will be appreciated  
> 
> [root@osd01 ~]# ping -c 2 osd02 
> PING osd02 (10.10.30.182) 56(84) bytes of data. 
> From osd01 (10.10.30.181) icmp_seq=1 Destination Host Unreachable 
> From osd01 (10.10.30.181) icmp_seq=2 Destination Host Unreachable 
> 
> [root@osd01 ~]# ceph osd tree 
> ID  CLASS WEIGHT  TYPE NAME  STATUS REWEIGHT PRI-AFF 
> -9 0 root ssds 
> -10 0 host osd01-ssd 
> -11 0 host osd02-ssd 
> -12 0 host osd04-ssd 
> -1   4.22031 root default 
> -3   1.67967 host osd01 
> 0   hdd 0.55989 osd.0down0 1.0 
> 3   hdd 0.55989 osd.3down0 1.0 
> 6   hdd 0.55989 osd.6  up  1.0 1.0 
> -5   1.67967 host osd02 
> 1   hdd 0.55989 osd.1down  1.0 1.0 
> 4   hdd 0.55989 osd.4  up  1.0 1.0 
> 7   hdd 0.55989 osd.7down  1.0 1.0 
> -7   0.86096 host osd04 
> 2   hdd 0.28699 osd.2down0 1.0 
> 5   hdd 0.28699 osd.5down  1.0 1.0 
> 8   hdd 0.28699 osd.8down  1.0 1.0 
> [root@osd01 ~]# ceph tell osd.4 bench 
> ^CError EINTR: problem getting command descriptions from osd.4 
> [root@osd01 ~]# ceph osd df 
> ID CLASS WEIGHT  REWEIGHT SIZE  USEAVAIL %USE VAR  PGS 
> 0   hdd 0.559890 0  0 000   0 
> 3   hdd 0.559890 0  0 000   0 
> 6   hdd 0.55989  1.0  573G 16474M  557G 2.81 0.84   0 
> 1   hdd 0.55989  1.0  573G 16516M  557G 2.81 0.84   0 
> 4   hdd 0.55989  1.0  573G 16465M  557G 2.80 0.84   0 
> 7   hdd 0.55989  1.0  573G 16473M  557G 2.81 0.84   0 
> 2   hdd 0.286990 0  0 000   0 
> 5   hdd 0.28699  1.0  293G 16466M  277G 5.47 1.63   0 
> 8   hdd 0.28699  1.0  293G 16461M  277G 5.47 1.63   0 
> TOTAL 2881G 98857M 2784G 3.35 
> MIN/MAX VAR: 0.84/1.63  STDDEV: 1.30 
> [root@osd01 ~]# ceph osd df tree 
> ID  CLASS WEIGHT  REWEIGHT SIZE  USEAVAIL %USE VAR  PGS TYPE NAME 
> -9 0- 0  0 000   - root ssds 
> -10 0- 0  0 000   - host 
> osd01-ssd 
> -11 0- 0  0 000   - host 
> osd02-ssd 
> -12 0- 0  0 000   - host 
> osd04-ssd 
> -1   4.22031- 2881G 98857M 2784G 3.35 1.00   - root default 
> -3   1.67967-  573G 16474M  557G 2.81 0.84   - host osd01 
> 0   hdd 0.559890 0  0 000   0 osd.0 
> 3   hdd 0.559890 0  0 000   0 osd.3 
> 6   hdd 0.55989  1.0  573G 16474M  557G 2.81 0.84   0 osd.6 
> -5   1.67967- 1720G 49454M 1671G 2.81 0.84   - host osd02 
> 1   hdd 0.55989  1.0  573G 16516M  557G 2.81 0.84   0 osd.1 
> 4   hdd 0.55989  1.0  573G 16465M  557G 2.80 0.84   0 osd.4 
> 7   hdd 0.55989  1.0  573G 16473M  557G 2.81 0.84   0 osd.7 
> -7   0.86096-  587G 32928M  555G 5.47 1.63   - host osd04 
> 2   hdd 0.286990 0  0 000   0 osd.2 
> 5   hdd 0.28699  1.0  293G 16466M  277G 5.47 1.63   0 osd.5 
> 8   hdd 0.28699  1.0  293G 16461M  277G 5.47 1.63   0 osd.8 
> TOTAL 2881G 98857M 2784G 3.35 
> MIN/MAX VAR: 0.84/1.63  STDDEV: 1.30 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to clean data of osd with ssd journal(wal, db if it is bluestore) ?

2018-01-31 Thread Maged Mokhtar
I would recommend, as Wido does, using the dd command. The block db device holds
the metadata/allocation of objects stored in the data block; not cleaning it
is asking for problems, and besides, it does not take any time.  In our
testing, building a new cluster on top of an older installation, we did see
many cases where OSDs would not start and reported an error such as fsid of
cluster and/or OSD does not match metadata in BlueFS superblock...these
errors do not appear if we use the dd command.  

On 2018-02-01 06:06, David Turner wrote:

> I know that for filestore journals that is fine.  I think it is also safe for 
> bluestore.  Doing Wido's recommendation of writing 100MB would be a good 
> idea, but not necessary. 
> 
> On Wed, Jan 31, 2018, 10:10 PM shadow_lin  wrote: 
> 
> Hi David, 
> Thanks for your reply. 
> I am wondering what if I don't remove the journal(wal,db for bluestore) 
> partion on the ssd and only zap the data disk.Then I assign the 
> journal(wal,db for bluestore) partion to a new osd.What would happen? 
> 
> 2018-02-01
> -
> 
> lin.yunfan 
> -
> 
> From: David Turner  
> Sent: 2018-01-31 17:24 
> Subject: Re: [ceph-users] How to clean data of osd with ssd journal(wal, db if it 
> is bluestore) ? 
> To: "shadow_lin" 
> Cc: "ceph-users" 
> 
> I use gdisk to remove the partition and partprobe for the OS to see the new 
> partition table. You can script it with sgdisk. 
> 
> On Wed, Jan 31, 2018, 4:10 AM shadow_lin  wrote: 
> 
> Hi list, 
> if I create an osd with journal(wal,db if it is bluestore) in the same hdd, I 
> use ceph-disk zap to clean the disk when I want to remove the osd and clean 
> the data on the disk. 
> But if I use a ssd partition as the journal(wal,db if it is bluestore) , how 
> should I clean the journal (wal,db if it is bluestore) of the osd I want to 
> remove?Especially when there are other osds are using other partition of the 
> same ssd  as journals(wal,db if it is bluestore) . 
> 
> 2018-01-31 
> -
> shadow_lin ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Newbie question: stretch ceph cluster

2018-02-14 Thread Maged Mokhtar
Hi, 

You need to set the min_size to 2 in crush rule.  

The exact location and replication flow when a client writes data
depend on the object name and the number of PGs. The CRUSH rule determines
which OSDs will serve a PG; the first is the primary OSD for that PG.
The client computes the PG from the object name and writes the object to
the primary OSD for that PG, and the primary OSD is then responsible for
replicating to the other OSDs serving this PG. So for the same client,
some objects will be sent to datacenter 1 and some to 2, and the OSDs
will do the rest. 
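
You can see this mapping for any object name with something like the following (the pool/object names and the output values are only illustrative):

ceph osd map rbd some-object
# osdmap e123 pool 'rbd' (0) object 'some-object' -> pg 0.8a9b1c2d (0.2d) -> up ([12,3], p12) acting ([12,3], p12)

The first OSD in the acting set (p12 here) is the primary that the client writes to.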

The other point is how to set up monitors across 2 datacenters
and still be able to function if one goes down; this is tricky since monitors
require an odd number and form a quorum. This link is quite
interesting; I am not sure if there are better ways to do it: 

https://www.sebastien-han.fr/blog/2013/01/28/ceph-geo-replication-sort-of/


Maged 

On 2018-02-14 04:12, ST Wong (ITSC) wrote:

> Hi,
> 
> Thanks for your advice,
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Luis 
> Periquito
> Sent: Friday, February 09, 2018 11:34 PM
> To: Kai Wagner
> Cc: Ceph Users
> Subject: Re: [ceph-users] Newbie question: stretch ceph cluster
> 
> On Fri, Feb 9, 2018 at 2:59 PM, Kai Wagner  wrote: Hi and 
> welcome,
> 
> On 09.02.2018 15:46, ST Wong (ITSC) wrote:
> 
> Hi, I'm new to CEPH and got a task to setup CEPH with kind of DR feature.
> We've 2 10Gb connected data centers in the same campus.I wonder if it's
> possible to setup a CEPH cluster with following components in each 
> data
> center:
> 
> 3 x mon + mds + mgr In this scenario you wouldn't be any better, as loosing a 
> room means loosing half of your cluster. Can you run the MON somewhere else 
> that would be able to continue if you loose one of the rooms?

Will it be okay to have 3 x MON per DC so that we still have 3 x MON in
case of losing 1 DC ?  Or need more in case of double fault - losing 1
DC and failure of any MON in remaining DC will make the cluster stop
working?

>> As for MGR and MDS they're (recommended) active/passive; so one per room 
>> would be enough.
> 
> 3 x OSD (replicated factor=2, between data center)

>> replicated with size=2 is a bad idea. You can have size=4 and
>> min_size=2 and have a crush map with rules something like:

rule crosssite {
id 0
type replicated
min_size 4
max_size 4
step take default
step choose firstn 2 type room
step chooseleaf firstn 2 type host
step emit
}

>> this will store 4 copies, 2 in different hosts and 2 different rooms.

Does it mean for new data write to hostA:roomA, replication will take
place as following?
1. from hostA:roomA to hostB:roomA
2. from hostA:roomA to hostA, roomB 
3. from hostB:roomA to hostB, roomB 

If it works in this way, can copy in 3 be skipped so that for each piece
of data, there are 3 replicas - original one, replica in same room, and
replica in other room, in order to save some space?

Besides, would also like to ask if it's correct that the cluster will
continue to work (degraded) if one room is lost?

Will there be any better way to setup such 'stretched' cluster between 2
DCs?  They're extension instead of real DR site...

Sorry for the newbie questions and we'll proceed to have more study and
experiment on this.

Thanks a lot.

> So that any one of following failure won't affect the cluster's 
> operation and data availability:
> 
> any one component in either data center failure of either one of the 
> data center
> 
> Is it possible?
> 
> In general this is possible, but I would consider that replica=2 is 
> not a good idea. In case of a failure scenario or just maintenance and 
> one DC is powered off and just one single disk fails on the other DC, 
> this can already lead to data loss. My advice here would be, if anyhow 
> possible, please don't do replica=2.
> 
> In case one data center failure case, seems replication can't occur any
> more.   Any CRUSH rule can achieve this purpose?
> 
> Sorry for the newbie question.
> 
> Thanks a lot.
> 
> Regards
> 
> /st wong
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> --
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, 
> HRB
> 21284 (AG Nürnberg)
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users 

Re: [ceph-users] Ceph luminous performance - how to calculate expected results

2018-02-14 Thread Maged Mokhtar
On 2018-02-14 20:14, Steven Vacaroaia wrote:

> Hi, 
> 
> It is very useful to "set up expectations" from a performance perspective  
> 
> I have a  cluster using 3 DELL R620 with 64 GB RAM and  10 GB cluster network 
> 
> I've seen numerous posts and articles about the topic mentioning the 
> following formula  
> ( for disks WAL/DB on it ) 
> 
> OSD / replication / 2 
> 
> Example  
> My HDD are capable of 150 MB/s 
> If I have 6 OSDs, expected throughput should be aroung 250 MB/s  for a pool 
> withe replication =2  
> ( 150 x 6 / 2 / 2 ) 
> 
> How would one asses the impact of using SSD for WAL/DB i.e what performance 
> gains should I expect ? 
> Example: 
> adding an 500MB/s SDD for every 2 HDD 
> 
> Should I expect that kind of throuput on the client ( e.g Windows VM running 
> on datastore create on RBD image shared via iSCSI ) ? 
> 
> The reason I am asking is that despite rados bench meeting the expectaion, 
> local performance test are 4 times worse 
> 
> rados bench -p rbd 120 write --no-cleanup && rados bench -p rbd  120 seq 
> 
> Total time run: 120.813979 
> Total writes made:  6182 
> Write size: 4194304 
> Object size:4194304 
> Bandwidth (MB/sec): 204.678 
> Stddev Bandwidth:   36.2292 
> Max bandwidth (MB/sec): 280 
> Min bandwidth (MB/sec): 44 
> Average IOPS:   51 
> Stddev IOPS:9 
> Max IOPS:   70 
> Min IOPS:   11 
> Average Latency(s): 0.312613 
> Stddev Latency(s):  0.524001 
> Max latency(s): 2.61579 
> Min latency(s): 0.0113714 
> 
> Total time run:   113.850422 
> Total reads made: 6182 
> Read size:4194304 
> Object size:  4194304 
> Bandwidth (MB/sec):   217.197 
> Average IOPS: 54 
> Stddev IOPS:  7 
> Max IOPS: 80 
> Min IOPS: 31 
> Average Latency(s):   0.293956 
> Max latency(s):   1.99958 
> Min latency(s):   0.0192862 
> 
> Local test using CrystalDiskMark  
> 
> 57 MB/s seq read 
> 43 MB/s seq write 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Hi Steven, 

I do not believe there is a formula, as there are many factors involved.
The "OSD / replication / 2" is probably related to the theoretical peak
of a filestore-based OSD with a collocated journal; in practice this does
not mean you will reach it, due to other factors. If you have a link
with performance formulas, it would be interesting to know. 

For your test, i would check: 

The rados benchmark by default uses 4M objects and 16 threads.
You need to set up your CrystalDiskMark test with similar parameters. 
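
For example, a sketch spelling out those defaults explicitly so they can be mirrored on the client side:

rados bench -p rbd 120 write -b 4194304 -t 16 --no-cleanup
rados bench -p rbd 120 seq -t 16

On the Windows side, use the CrystalDiskMark sequential test with a comparable queue depth and a large block size rather than its shallow small-block tests, otherwise you are comparing a 16-way 4M workload against something quite different.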

The iSCSI target gateway should easily pull the rados throughput you are
seeing without too much drop, double check how your client initiator and
target are configured. You can also run atop or other performance tool
on your iSCSI gateway and see if you have any resource issues. 

Maged___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] WAL/DB size

2018-09-07 Thread Maged Mokhtar
On 2018-09-07 14:36, Alfredo Deza wrote:

> On Fri, Sep 7, 2018 at 8:27 AM, Muhammad Junaid  
> wrote: 
> 
>> Hi there
>> 
>> Asking the questions as a newbie. May be asked a number of times before by
>> many but sorry, it is not clear yet to me.
>> 
>> 1. The WAL device is just like journaling device used before bluestore. And
>> CEPH confirms Write to client after writing to it (Before actual write to
>> primary device)?
>> 
>> 2. If we have lets say 5 OSD's (4 TB SAS) and 1 200GB SSD. Should we
>> partition SSD in 10 partitions? Shoud/Can we set WAL Partition Size against
>> each OSD as 10GB? Or what min/max we should set for WAL Partition? And can
>> we set remaining 150GB as (30GB * 5) for 5 db partitions for all OSD's?
> 
> A WAL partition would only help if you have a device faster than the
> SSD where the block.db would go.
> 
> We recently updated our sizing recommendations for block.db at least
> 4% of the size of block (also referenced as the data device):
> 
> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
> 
> In your case, what you want is to create 5 logical volumes from your
> 200GB at 40GB each, without a need for a WAL device.
> 
>> Thanks in advance. Regards.
>> 
>> Muhammad Junaid
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Should the db size not depend on the number of objects stored rather
than their storage size? Or is the new recommendation assuming some
average object size?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] advice with erasure coding

2018-09-07 Thread Maged Mokhtar
On 2018-09-07 13:52, Janne Johansson wrote:

> Den fre 7 sep. 2018 kl 13:44 skrev Maged Mokhtar : 
> 
>> Good day Cephers, 
>> 
>> I want to get some guidance on erasure coding, the docs do state the 
>> different plugins and settings but to really understand them all and their 
>> use cases is not easy: 
>> 
>> -Are the majority of implementations using jerasure and just configuring k 
>> and m ?
> 
> Probably, yes 
> 
>> -For jerasure: when/if would i need to change 
>> stripe_unit/osd_pool_erasure_code_stripe_unit/packetsize/algorithm ? The 
>> main usage is rbd with 4M object size, the workload is virtualization with 
>> average block size of 64k. 
>> 
>> Any help based on people's actual experience will be greatly appreciated..
> 
> Running VMs on top of EC pools is possible, but probably not recommended. 
> All the random reads and writes they usually cause will make EC less suitable 
> than replicated pools, even if it is possible.
> -- 
> May the most significant bit of your life be positive.

Point well taken...it could be useful for backing up VMs, and maybe for VMs
without strict latency requirements, if k and m are not large.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] advice with erasure coding

2018-09-07 Thread Maged Mokhtar
Good day Cephers, 

I want to get some guidance on erasure coding. The docs do state the
different plugins and settings, but really understanding them all and
their use cases is not easy: 

-Are the majority of implementations using jerasure and just configuring
k and m (see the sketch after this list)?
-For jerasure: when/if would I need to change
stripe_unit/osd_pool_erasure_code_stripe_unit/packetsize/algorithm? The
main usage is rbd with 4M object size; the workload is virtualization
with an average block size of 64k. 
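
To make the first point concrete, by "just configuring k and m" I mean something along these lines (the profile/pool/image names are made up):

ceph osd erasure-code-profile set ec-4-2 plugin=jerasure k=4 m=2 crush-failure-domain=host
ceph osd pool create ecpool 128 128 erasure ec-4-2
# for rbd on Luminous+, the EC pool is used as a data pool and needs overwrites enabled
ceph osd pool set ecpool allow_ec_overwrites true
rbd create --size 100G --data-pool ecpool rbd/myimage

All other jerasure parameters (technique, packetsize, stripe unit) are left at their defaults here.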

Any help based on people's actual experience will be greatly
appreciated.. 

/Maged___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance tuning for SAN SSD config

2018-07-06 Thread Maged Mokhtar
On 2018-06-29 18:30, Matthew Stroud wrote:

> We back some of our ceph clusters with SAN SSD disk, particularly VSP G/F and 
> Purestorage. I'm curious what are some settings we should look into modifying 
> to take advantage of our SAN arrays. We had to manually set the class for the 
> luns to SSD class which was a big improvement. However we still see 
> situations where we get slow requests and the underlying disks and network 
> are underutilized. 
> 
> More info about our setup. We are running centos 7 with Luminous as our ceph 
> release. We have 4 osd nodes that have 5x2TB disks each and they are setup as 
> bluestore. Our ceph.conf is attached with some information removed for 
> security reasons. 
> 
> Thanks ahead of time. 
> 
> Thanks, 
> 
> Matthew Stroud 
> 
> -
> 
> CONFIDENTIALITY NOTICE: This message is intended only for the use and review 
> of the individual or entity to which it is addressed and may contain 
> information that is privileged and confidential. If the reader of this 
> message is not the intended recipient, or the employee or agent responsible 
> for delivering the message solely to the intended recipient, you are hereby 
> notified that any dissemination, distribution or copying of this 
> communication is strictly prohibited. If you have received this communication 
> in error, please notify sender immediately by telephone or return email. 
> Thank you.
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

If I understand correctly, you are using LUNs (via iSCSI) from your
external SAN as OSDs and created a separate pool with these OSDs with
device class SSD, and you are using this pool for backup. 

Some comments: 

* Using external disks as OSDs is probably not that common. It may be
better to keep the SAN and Ceph cluster separate and have your backup
tool access both; it will also be safer in case of a disaster to the
cluster, since your backup will be on a separate system.
* What backup tool/script are you using? It is better that this tool
uses a high queue depth, large block sizes and memory/page cache to
increase performance during copies.
* To try to pin down where your current bottleneck is, I would run
benchmarks (e.g. fio) using the block sizes used by your backup tool on
the raw LUNs before they are added as OSDs (as pure iSCSI disks) as well as
on both the main and backup pools (see the sketch below). Have a resource tool (e.g.
atop/sysstat/collectl) running during these tests to check for resources:
disks %busy / cores %busy / io_wait.
* You probably can use a replica count of 1 for the SAN OSDs since they
include their own RAID redundancy.
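
A sketch of the raw-LUN test from the third point (paths and the 1M block size are placeholders; use the block size your backup tool really issues, and only write to a LUN that holds no data):

fio --name=raw-lun --filename=/dev/mapper/mpathX --direct=1 --rw=write --bs=1M --iodepth=32 --numjobs=1 --runtime=60 --time_based --group_reporting
rados bench -p backup 60 write -b 1048576 -t 32 --no-cleanup

Comparing the two numbers, with atop running on the OSD nodes at the same time, usually shows whether the limit is the SAN, the network or the OSD processes.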

Maged

ceph.conf
Description: Binary data
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance tuning for SAN SSD config

2018-07-06 Thread Maged Mokhtar
Hi 

* What is your cpu utilization like? Are any cores close to
saturation?
* If you use fio to test a raw FC LUN (i.e. prior to adding it as an OSD) from
your host using random 4k blocks and a high queue depth (32 or more), do
you get high IOPS? What is the disk utilization? cpu utilization? (See the
sketch below.)
* If you repeat the above test but, instead of testing 1 LUN, run
concurrent fio tests on all 5 LUNs on the host, does the aggregate IOPS
performance scale x5? Any resource issues?
* Does increasing /sys/block/sdX/queue/nr_requests help?
* Can you use active/active multipath?
* If the above gives good performance/resource utilization, would you
get better performance if you had more than 20 OSDs/LUNs in total, for
example 40 or 60? That should not cost you anything.
* I still think you can use a replica of 1 in Ceph since your SAN
already has redundancy. It may be overkill to use both. I am not
trying to save space on the SAN but rather to reduce write latency on the
Ceph side.
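
A sketch of the raw-LUN test in the second point (the multipath device name is a placeholder; randread keeps it non-destructive, switch to randwrite only on a LUN you can safely overwrite):

fio --name=raw-4k --filename=/dev/mapper/mpathX --direct=1 --rw=randread --bs=4k --iodepth=32 --numjobs=4 --runtime=60 --time_based --group_reporting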

Maged 

On 2018-07-06 20:19, Matthew Stroud wrote:

> Good to note about the replica set, we will stick with 3. We really aren't 
> concerned about the overhead, but the additional IO that occurs during writes 
> that have an additional copy.
> 
> To be clear, we aren't using ceph in place of FC, nor the other way around. 
> We have discovered that SAN storage is cheaper (this one was surprising to 
> me) and better performant than direct attached storage (DAS) on the small 
> scale that we are building things (20T to about 100T). I'm sure that would 
> switch if we were much larger, but for now SAN is better. In summary we are 
> using SAN pretty much as a DAS and ceph uses those SAN disks for OSDs.
> 
> The biggest issue we see is slow requests during rebuilds or node/osd 
> failures but the disks and network just aren't being to their fullest. That 
> would lead me to believe that there are some host and/or osd process 
> bottlenecks going on. Other than that, just increasing the performance of our 
> ceph cluster would be a plus and that is what I'm exploring.
> 
> As per test numbers, I can't run that right now because the systems we have 
> are in prod and I don't want to impact that for io testing. However, we do 
> have a new cluster coming online shortly and I could do some benchmarking 
> there and get that back to you.
> 
> However as memory serves, we were only getting something about 90-100k iops 
> and about 15 - 50 ms latency with 10 servers running fio with 50% of random 
> and sequential workloads. With a single vm, we were getting about 14k iops 
> with about 10 - 30 ms of latency.
> 
> Thanks,
> Matthew Stroud
> 
> On 7/6/18, 11:12 AM, "Vasu Kulkarni"  wrote:
> 
> On Fri, Jul 6, 2018 at 8:38 AM, Matthew Stroud  
> wrote:
>>
>> Thanks for the reply.
>>
>>
>>
>> Actually we are using fiber channel (it's so much more performant than iscsi 
>> in our tests) as the primary storage and this is serving up traffic for RBD 
>> for openstack, so this isn't for backups.
>>
>>
>>
>> Our biggest bottle neck is trying utilize the host and/or osd process 
>> correctly. The disks are running at sub-millisecond, with about 90% of the 
>> IO being pulled from the array's cache (a.k.a. not even hitting the disks). 
>> According to the host, we never get north of 20% disk utilization, unless 
>> there is a deep scrub going on.
>>
>>
>>
>> We have debated about putting the replica size to 2 instead of 3. However 
>> this isn't much of a win for the purestorage which dedupes on the backend, 
>> so having copies of data are relatively free for that unit. 1 wouldn't work 
>> because this is hosting a production work load.
> 
> It is a mistake to use replica of 2 for production, when one of the
> copy is corrupted its hard to fix things. if you are concerned about
> storage overhead there is an option to use EC pools in luminous.  To
> get back to your original question if you are comparing the
> network/disk utilization with FC numbers than that is wrong
> comparison,  They are 2 different storage systems with different
> purposes, Ceph is scale out object storage system unlike FC systems
> where you can use commodity hardware and grow as you need, you
> generally dont need hba/fc enclosed disks but nothing stopping you
> from using your existing system. Also you generally dont need any raid
> mirroring configurations in the backend since ceph will handle the
> redundancy for you. scale out systems have more work to do than
> traditional FC systems. There are minimal configuration options for
> bluestore , what kind of disk/network utilization slowdown you are
> seeing? can you publish your numbers an

Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-12 Thread Maged Mokhtar
On 2018-03-12 14:23, David Disseldorp wrote:

> On Fri, 09 Mar 2018 11:23:02 +0200, Maged Mokhtar wrote:
> 
>> 2)I undertand that before switching the path, the initiator will send a 
>> TMF ABORT can we pass this to down to the same abort_request() function 
>> in osd_client that is used for osd_request_timeout expiry ?
> 
> IIUC, the existing abort_request() codepath only cancels the I/O on the
> client/gw side. A TMF ABORT successful response should only be sent if
> we can guarantee that the I/O is terminated at all layers below, so I
> think this would have to be implemented via an additional OSD epoch
> barrier or similar.
> 
> Cheers, David

Hi David, 

I was thinking we would take the block request, then loop over all of its
OSD requests and cancel those using the same OSD request cancel
function. 

Maged___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: [ceph bad performance], can't find a bottleneck

2018-03-12 Thread Maged Mokhtar
Hi, 

Try increasing the queue depth from the default 128 to 1024: 

rbd map image-XX -o queue_depth=1024 

Also, if you run multiple rbd images/fio tests, do you get higher
combined performance? 
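
For example, a sketch (image names and device paths are placeholders) that maps a couple of images with a deeper queue and runs one fio job across both, so the aggregate can be compared with the single-image number:

rbd map image-01 -o queue_depth=1024
rbd map image-02 -o queue_depth=1024
fio --name=multi-rbd --ioengine=libaio --direct=1 --rw=randread --bs=4k --iodepth=64 --runtime=60 --time_based --group_reporting --filename=/dev/rbd0:/dev/rbd1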

Maged 

On 2018-03-12 17:16, Sergey Kotov wrote:

> Dear moderator, i subscribed to ceph list today, could you please post my 
> message? 
> 
> -- Forwarded message --
> From: SERGEY KOTOV 
> Date: 2018-03-06 10:52 GMT+03:00
> Subject: [ceph bad performance], can't find a bottleneck
> To: ceph-users@lists.ceph.com
> Cc: Житенев Алексей , Anna Anikina 
> 
> Good day. 
> 
> Can you please help us to find bottleneck in our ceph installations. 
> We have 3 SSD-only clusters with different HW, but situation is the same - 
> overall i/o operations between client & ceph lower than 1/6 of summary 
> performance all ssd.  
> 
> For example - 
> One of our cluster has 4-nodes with ssd Toshiba 2Tb Enterprise drives, 
> installed on Ubuntu server 16.04.
> Servers are connected to the 10G switches. Latency between modes is about 
> 0.1ms. Ethernet utilisation is low.
> 
> # uname -a
> Linux storage01 4.4.0-101-generic #124-Ubuntu SMP Fri Nov 10 18:29:59 UTC 
> 2017 x86_64 x86_64 x86_64 GNU/Linux
> 
> # ceph osd versions
> {
> "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous 
> (stable)": 55
> }
> 
> When we map rbd image direct on the storage nodes via krbd, performance is 
> not good enough.
> We use fio for testing. Even we run randwrite with 4k block size test in 
> multiple thread mode, our drives don't have utilisation higher then 30% and 
> lat is ok.
> 
> At the same time iostat tool displays 100% utilisation on /dev/rbdX.
> 
> Also we can't enable rbd_cache, because of using scst iscsi over rbd mapped 
> images.
> 
> How can we resolve the issue?
> 
> Ceph config:
> 
> [global]
> fsid = beX482fX-6a91-46dX-ad22-21a8a2696abX
> mon_initial_members = storage01, storage02, storage03
> mon_host = X.Y.Z.1,X.Y.Z.2,X.Y.Z.3
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> public_network = X.Y.Z.0/24
> filestore_xattr_use_omap = true
> osd_pool_default_size = 2
> osd_pool_default_min_size = 1
> osd_pool_default_pg_num = 1024
> osd_journal_size = 10240
> osd_mkfs_type = xfs
> filestore_op_threads = 16
> filestore_wbthrottle_enable = False
> throttler_perf_counter = False
> osd crush update on start = false
> 
> [osd]
> osd_scrub_begin_hour = 1
> osd_scrub_end_hour = 6
> osd_scrub_priority = 1
> 
> osd_enable_op_tracker = False
> osd_max_backfills = 1
> osd heartbeat grace = 20
> osd heartbeat interval = 5
> osd recovery max active = 1
> osd recovery max single start = 1
> osd recovery op priority = 1
> osd recovery threads = 1
> osd backfill scan max = 16
> osd backfill scan min = 4
> osd max scrubs = 1
> osd scrub interval randomize ratio = 1.0
> osd disk thread ioprio class = idle
> osd disk thread ioprio priority = 0
> osd scrub chunk max = 1
> osd scrub chunk min = 1
> osd deep scrub stride = 1048576
> osd scrub load threshold = 5.0
> osd scrub sleep = 0.1
> 
> [client]
> rbd_cache = false
> 
> Sample fio tests:
> 
> root@storage04:~# fio --name iops --rw randread --bs 4k --filename /dev/rbd2 
> --numjobs 12 --ioengine=libaio --group_reporting 
> iops: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1 
> ... 
> fio-2.2.10 
> Starting 12 processes 
> ^Cbs: 12 (f=12): [r(12)] [1.2% done] [128.4MB/0KB/0KB /s] [32.9K/0/0 iops] 
> [eta 16m:49s] 
> fio: terminating on signal 2 
> 
> iops: (groupid=0, jobs=12): err= 0: pid=29812: Sun Feb 11 23:59:19 2018 
> read : io=1367.8MB, bw=126212KB/s, iops=31553, runt= 11097msec 
> slat (usec): min=1, max=59700, avg=375.92, stdev=495.19 
> clat (usec): min=0, max=377, avg= 1.12, stdev= 3.16 
> lat (usec): min=1, max=59702, avg=377.61, stdev=495.32 
> clat percentiles (usec): 
> |  1.00th=[0],  5.00th=[0], 10.00th=[1], 20.00th=[1], 
> | 30.00th=[1], 40.00th=[1], 50.00th=[1], 60.00th=[1], 
> | 70.00th=[1], 80.00th=[1], 90.00th=[1], 95.00th=[2], 
> | 99.00th=[2], 99.50th=[2], 99.90th=[   73], 99.95th=[   78], 
> | 99.99th=[  115] 
> bw (KB  /s): min= 8536, max=11944, per=8.33%, avg=10516.45, stdev=635.32 
> lat (usec) : 2=91.74%, 4=7.93%, 10=0.14%, 20=0.09%, 50=0.01% 
> lat (usec) : 100=0.07%, 250=0.03%, 500=0.01% 
> cpu  : usr=1.32%, sys=3.69%, ctx=329556, majf=0, minf=134 
> IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% 
> submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
> complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
> issued: total=r=350144/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 
> latency   : target=0, window=0, percentile=100.00%, depth=1 
> 
> Run status group 0 (all jobs): 
> READ: io=1367.8MB, aggrb=126212KB/s, minb=126212KB/s, maxb=126212KB/s, 
> mint=11097msec, 

Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-12 Thread Maged Mokhtar
On 2018-03-12 21:00, Ilya Dryomov wrote:

> On Mon, Mar 12, 2018 at 7:41 PM, Maged Mokhtar <mmokh...@petasan.org> wrote: 
> 
>> On 2018-03-12 14:23, David Disseldorp wrote:
>> 
>> On Fri, 09 Mar 2018 11:23:02 +0200, Maged Mokhtar wrote:
>> 
>> 2) I understand that before switching the path, the initiator will send a
>> TMF ABORT; can we pass this down to the same abort_request() function
>> in osd_client that is used for osd_request_timeout expiry?
>> 
>> IIUC, the existing abort_request() codepath only cancels the I/O on the
>> client/gw side. A TMF ABORT successful response should only be sent if
>> we can guarantee that the I/O is terminated at all layers below, so I
>> think this would have to be implemented via an additional OSD epoch
>> barrier or similar.
>> 
>> Cheers, David
>> 
>> Hi David,
>> 
>> I was thinking we would get the block request, then loop down to all its osd
>> requests and cancel those using the same osd request cancel function.
> 
> All that function does is tear down OSD client / messenger data
> structures associated with the OSD request.  Any OSD request that hit
> the TCP layer may eventually get through to the OSDs.
> 
> Thanks,
> 
> Ilya

Hi Ilya, 

OK, so I guess this also applies to osd_request_timeout expiry:
it is not guaranteed to stop all stale IOs.
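
So the hard guarantee still has to come from fencing the failed gateway via
blacklisting rather than from cancelling its requests. A minimal sketch of
what that looks like (the address is a made-up example; in practice it would
come from "rbd lock list" / "rbd status" on the image):

# fence a specific client instance of the dead gateway
ceph osd blacklist add 10.0.1.21:0/3418840123

# or fence every client instance coming from that host
# (an optional expire time in seconds can be appended)
ceph osd blacklist add 10.0.1.21

# inspect and clean up once the gateway is recovered and remapped
ceph osd blacklist ls
ceph osd blacklist rm 10.0.1.21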

Maged
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-10 Thread Maged Mokhtar



--
From: "Jason Dillaman" 
Sent: Sunday, March 11, 2018 1:46 AM
To: "shadow_lin" 
Cc: "Lazuardi Nasution" ; "Ceph Users" 

Subject: Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive 
Lock



On Sat, Mar 10, 2018 at 10:11 AM, shadow_lin  wrote:

Hi Jason,


As discussed in this thread, for active/passive, upon initiator
failover, we used the RBD exclusive-lock feature to blacklist the old
"active" iSCSI target gateway so that it cannot talk w/ the Ceph
cluster before new writes are accepted on the new target gateway.


I get that while the new active target gateway was talking to rbd, the old
active target gateway could not write because of the RBD exclusive-lock.
But after the new target gateway has done its writes, if the old target
gateway had some blocked IO during the failover, can't it then get the lock
and overwrite the new writes?


Negative -- it's blacklisted so it cannot talk to the cluster.


PS:
PetaSAN say they can do active/active iSCSI with a patched SUSE kernel.


I'll let them comment on these corner cases.


We are not currently handling these corner cases. We have not hit this in 
practice but will work on it. We need to account for in-flight time early in 
the target stack before reaching krbd/tcmu.

/Maged


2018-03-10

shadowlin



From: Jason Dillaman 
Sent: 2018-03-10 21:40
Subject: Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive 
Lock

To: "shadow_lin"
Cc: "Mike Christie","Lazuardi
Nasution","Ceph Users"

On Sat, Mar 10, 2018 at 7:42 AM, shadow_lin  wrote:

Hi Mike,
So for now only the SUSE kernel with target_core_rbd and tcmu-runner can run
active/passive multipath safely?


Negative, the LIO / tcmu-runner implementation documented here [1] is
safe for active/passive.


I am a newbie to iSCSI. I think the problem of stuck IO getting executed and
causing overwrites can happen with both active/active and active/passive.
What makes active/passive safer than active/active?


As discussed in this thread, for active/passive, upon initiator
failover, we used the RBD exclusive-lock feature to blacklist the old
"active" iSCSI target gateway so that it cannot talk w/ the Ceph
cluster before new writes are accepted on the new target gateway.


What mechanism should be implemented to avoid the problem with
active/passive and active/active multipath?


Active/passive is solved as discussed above. For active/active, we
don't have a solution that is known safe under all failure conditions.
If LIO supported MCS (multiple connections per session) instead of
just MPIO (multipath IO), the initiator would provide enough context
to the target to detect IOs from a failover situation.


2018-03-10

shadowlin



From: Mike Christie 
Sent: 2018-03-09 00:54
Subject: Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive 
Lock

To: "shadow_lin","Lazuardi
Nasution","Ceph 
Users"

Cc:

On 03/07/2018 09:24 AM, shadow_lin wrote:

Hi Christie,
Is it safe to use active/passive multipath with krbd with exclusive lock
for lio/tgt/scst/tcmu?


No. We tried to use lio and krbd initially, but there is an issue where
IO might get stuck in the target/block layer and get executed after new
IO. So for lio, tgt and tcmu it is not safe as is right now. We could
add some code to tcmu's file_example handler which can be used with krbd so
it works like the rbd one.

I do not know enough about SCST right now.



Is it safe to use active/active multipath If use suse kernel with
target_core_rbd?
Thanks.

2018-03-07

shadowlin



*From:* Mike Christie 
*Sent:* 2018-03-07 03:51
*Subject:* Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD
Exclusive Lock
*To:* "Lazuardi Nasution","Ceph
Users"
*Cc:*

On 03/06/2018 01:17 PM, Lazuardi Nasution wrote:
> Hi,
>
> I want to do load balanced multipathing (multiple iSCSI gateway/exporter
> nodes) of iSCSI backed with RBD images. Should I disable exclusive lock
> feature? What if I don't disable that feature? I'm using TGT (manual
> way) since I get so many CPU stuck error messages when I was using LIO.
>

You are using LIO/TGT with krbd right?

You cannot or shouldn't do active/active multipathing. If you have the
lock enabled then it bounces between paths for each IO and will be slow.
If you do not 

Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-09 Thread Maged Mokhtar
Hi Mike, 

> For the easy case, the SCSI command is sent directly to krbd and so if
> osd_request_timeout is less than M seconds then the command will be
> failed in time and we would not hit the problem above.
> If something happens in the target stack like the SCSI command gets
> stuck/queued then your osd_request_timeout value might be too short. 

1) Currently the osd_request_timeout timer (req->r_start_stamp) is started
in osd_client.c; this is late in the stack and, as you mentioned, things
could be stuck earlier. Would it be better to start this timer earlier,
for example in iscsi_target.c iscsit_handle_scsi_cmd() at the start of
processing, and propagate this value to osd_client?
Even more accurate would be to use SO_TIMESTAMPING and timestamp the
socket buffers as they are received to compute the time of the current
stream position. We could also use TCP Timestamps (RFC 7323) sent from the
client initiator, which are enabled by default on Linux/Win/ESX. But this
is more work. What are your thoughts?

2) I understand that before switching the path, the initiator will send a
TMF ABORT; can we pass this down to the same abort_request() function
in osd_client that is used for osd_request_timeout expiry?
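
Just to make the timing relation concrete, this is the kind of setup being
discussed -- only a rough sketch, the values are illustrative and it assumes
a kernel where osd_request_timeout is exposed as an rbd map option:

# on the gateway: fail krbd requests before the initiator completes failover
rbd map rbd/lun1 -o osd_request_timeout=25

# on the initiator, /etc/multipath.conf (excerpt): keep the failover window
# longer than that, so a write stuck behind a dead ceph link cannot complete
# after the initiator has already retried it on another path
defaults {
    fast_io_fail_tmo 30
    dev_loss_tmo     60
}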

Cheers /Maged 

On 2018-03-08 20:44, Mike Christie wrote:

> On 03/08/2018 10:59 AM, Lazuardi Nasution wrote: 
> 
>> Hi Mike,
>> 
>> Since I have moved from LIO to TGT, I can do full ALUA (active/active)
>> of multiple gateways. Of course I have to disable any write back cache
>> at any level (RBD cache and TGT cache). It seems to be safe to disable
>> exclusive lock since each RBD image is accessed only by a single client
>> and, as far as I know, most ALUA setups use RR across I/O paths.
> 
> It might be possible if you have configured your timers correctly but I
> do not think anyone has figured it all out yet.
> 
> Here is a simple but long example of the problem. Sorry for the length,
> but I want to make sure people know the risks.
> 
> You have 2 iscsi target nodes and 1 iscsi initiator connected to both
> doing active/active over them.
> 
> To make it really easy to hit, the iscsi initiator should be connected
> to the target with a different nic port or network than what is being
> used for ceph traffic.
> 
> 1. Prep the data. Just clear the first sector of your iscsi disk. On the
> initiator system do:
> 
> dd if=/dev/zero of=/dev/sdb count=1 oflag=direct
> 
> 2. Kill the network/port for one of the iscsi targets' ceph traffic. So
> for example on target node 1 pull its cable for ceph traffic if you set
> it up where iscsi and ceph use different physical ports. iSCSI traffic
> should be unaffected for this test.
> 
> 3. Write some new data over the sector we just wrote in #1. This will
> get sent from the initiator to the target ok, but get stuck in the
> rbd/ceph layer since that network is down:
> 
> dd if=somefile of=/dev/sdb count=1 oflag=direct iflag=direct
> 
> 4. The initiator's EH timers will fire, the command will get failed and
> retried on the other path. After the dd in
> #3 completes, run:
> 
> dd if=someotherfile of=/dev/sdb count=1 oflag=direct iflag=direct
> 
> This should execute quickly since it goes through the good iscsi and
> ceph path right away.
> 
> 5. Now plug the cable back in and wait for maybe 30 seconds for the
> network to come back up and the stuck command to run.
> 
> 6. Now do
> 
> dd if=/dev/sdb of=somenewfile count=1 iflag=direct oflag=direct
> 
> The data is going to be the data sent in step 3 and not the new data in
> step 4.
> 
> To get around this issue you could try to set the krbd
> osd_request_timeout to a value shorter than the initiator side failover
> timeout (for multipath-tools/open-iscsi in linux this would be
> fast_io_fail_tmo/replacement timeout) + the various TMF/EH timers, but also
> account for the transport related timers that might short-circuit/bypass
> the TMF based EH.
> 
> One problem with trying to rely on configuring that is handling all the
> corner cases. So you have:
> 
> - Transport (nop) timer or SCSI/TMF command timer set so the
> fast_io_fail/replacement timer starts at N seconds and then fires at M.
> - It is a really bad connection so it takes N - 1 seconds to get the
> SCSI command from the initiator to target.
> - At the N second mark the iscsi connection is dropped the
> fast_io_fail/replacement timer is started.
> 
> For the easy case, the SCSI command is sent directly to krbd and so if
> osd_request_timeout is less than M seconds then the command will be
> failed in time and we would not hit the problem above.
> 
> If something happens in the target stack like the SCSI command gets
> stuck/queued then your osd_request_timeout value might be too short. For
> example, if you were using tgt/lio right now and this was a
> COMPARE_AND_WRITE, the READ part might take osd_request_timeout - 1
> seconds, and then the write part might take osd_request_timeout - 1
> seconds, so you need to have your fast_io_fail long enough for that type
> of 
