Re: [ceph-users] Monitor as local VM on top of the server pool cluster?

2017-07-10 Thread Z Will
For a large cluster there will be a lot of changes at any time. This
means the pressure on the mons can be high at times, because every change
goes through the leader, so for this reason the local storage for the mons
should be good enough. I think this may be a consideration.

On Tue, Jul 11, 2017 at 11:29 AM, Brad Hubbard  wrote:
> On Tue, Jul 11, 2017 at 3:44 AM, David Turner  wrote:
>> Mons are a paxos quorum and as such want to be in odd numbers.  5 is
>> generally what people go with.  I think I've heard of a few people use 7
>> mons, but you do not want to have an even number of mons or an ever growing
>
> Unless your cluster is very large three should be sufficient.
>
>> number of mons.  The reason you do not want mons running on the same
>> hardware as osds is resource contention during recovery.  As long as the Xen
>> servers you are putting the mons on are not going to cause any source of
>> resource limitation/contention, then virtualizing them should be fine for
>> you.  Make sure that you aren't configuring the mon to run using an RBD for
>> its storage, that would be very bad.
>>
>> The mon Quorum elects a leader and that leader will be in charge of the
>> quorum.  Having local mons doesn't do anything as the clients will still be
>> talking to the mons as a quorum and won't necessarily talk to the mon
>> running on them.  The vast majority of communication to the cluster that
>> your Xen servers will be doing is to the OSDs anyway, very little
>> communication to the mons.
>>
>> On Mon, Jul 10, 2017 at 1:21 PM Massimiliano Cuttini 
>> wrote:
>>>
>>> Hi everybody,
>>>
>>> I would like to separate the MONs from the OSDs as recommended.
>>> In order to do so without new hardware I'm planning to create all the
>>> monitors as virtual machines on top of my hypervisors (Xen).
>>> I'm testing a pool of 8 Xen nodes.
>>>
>>> I'm thinking about creating 8 monitors and pinning one monitor to each Xen node.
>>> So, I'm guessing, every Ceph monitor will be local to each client node.
>>> This should speed up the system by connecting to monitors locally, with a
>>> little overhead for the monitor sync between nodes.
>>>
>>> Is it a good idea to have a local monitor virtualized on top of each
>>> hypervisor node?
>>> Do you see any underestimation or wrong design in this?
>>>
>>> Thanks for every helpful bit of info.
>>>
>>>
>>> Regards,
>>> Max
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> --
> Cheers,
> Brad
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitor as local VM on top of the server pool cluster?

2017-07-10 Thread Brad Hubbard
On Tue, Jul 11, 2017 at 3:44 AM, David Turner  wrote:
> Mons are a paxos quorum and as such want to be in odd numbers.  5 is
> generally what people go with.  I think I've heard of a few people use 7
> mons, but you do not want to have an even number of mons or an ever growing

Unless your cluster is very large three should be sufficient.

> number of mons.  The reason you do not want mons running on the same
> hardware as osds is resource contention during recovery.  As long as the Xen
> servers you are putting the mons on are not going to cause any source of
> resource limitation/contention, then virtualizing them should be fine for
> you.  Make sure that you aren't configuring the mon to run using an RBD for
> its storage, that would be very bad.
>
> The mon Quorum elects a leader and that leader will be in charge of the
> quorum.  Having local mons doesn't do anything as the clients will still be
> talking to the mons as a quorum and won't necessarily talk to the mon
> running on them.  The vast majority of communication to the cluster that
> your Xen servers will be doing is to the OSDs anyway, very little
> communication to the mons.
>
> On Mon, Jul 10, 2017 at 1:21 PM Massimiliano Cuttini 
> wrote:
>>
>> Hi everybody,
>>
>> I would like to separate the MONs from the OSDs as recommended.
>> In order to do so without new hardware I'm planning to create all the
>> monitors as virtual machines on top of my hypervisors (Xen).
>> I'm testing a pool of 8 Xen nodes.
>>
>> I'm thinking about creating 8 monitors and pinning one monitor to each Xen node.
>> So, I'm guessing, every Ceph monitor will be local to each client node.
>> This should speed up the system by connecting to monitors locally, with a
>> little overhead for the monitor sync between nodes.
>>
>> Is it a good idea to have a local monitor virtualized on top of each
>> hypervisor node?
>> Do you see any underestimation or wrong design in this?
>>
>> Thanks for every helpful bit of info.
>>
>>
>> Regards,
>> Max
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon leader election problem, should it be improved ?

2017-07-10 Thread Z Will
Hi Joao:

> Basically, this would be something similar to heartbeats. If a
monitor can't
> reach all monitors in an existing quorum, then just don't do anything.

 Based on your solution, I would make a small change:
 - send a probe to all monitors
 - if it gets a quorum, it joins the current quorum through a join_quorum
message; when the leader receives this, it changes the quorum and
claims victory again. If it times out, that means it can't reach the
leader, so it does nothing and tries again later from bootstrap.
 - if it gets > 1/2 acks, it does as before and calls an election

 With this, the leader will sometimes not have the smallest rank
number; I think that is fine. The quorum message would carry one more
byte to indicate the leader's rank number.
 I think this will perform the same as before and can tolerate some
network-partition errors, and it only needs small code changes. Any
suggestions on this? Am I missing any considerations?


On Wed, Jul 5, 2017 at 6:26 PM, Joao Eduardo Luis  wrote:
> On 07/05/2017 08:01 AM, Z Will wrote:
>>
>> Hi Joao:
>> I think this is all because we choose the monitor with the
>> smallest rank number to be the leader. For this kind of network error,
>> whichever mon has lost its connection to the mon with the smallest rank
>> number will constantly call elections; that is to say, it will constantly
>> affect the cluster until it is stopped by a human. So do you think it
>> makes sense for me to try to figure out a way to choose as leader the
>> monitor that can see the most monitors, or the one with the smallest rank
>> number if the view number is the same?
>> In the probing phase:
>>    they will know their own view, so they can set a view number.
>> In the election phase:
>>    they send the view number and the rank number.
>>    When receiving the election message, a mon compares the view number
>> (higher is leader) and the rank number (lower is leader).
>
>
> As I understand it, our elector trades off reliability in case of network
> failure for expediency in forming a quorum. This by itself is not a problem
> since we don't see many real-world cases where this behaviour happens, and
> we are a lot more interested in making sure we have a quorum - given without
> a quorum your cluster is effectively unusable.
>
> Currently, we form a quorum with a minimal number of messages passed.
> From my poor recollection, I think the Elector works something like
>
> - 1 probe message to each monitor in the monmap
> - receives defer from a monitor, or defers to a monitor
> - declares victory if number of defers is an absolute majority (including
> one's defer).
>
> An election cycle takes about 4-5 messages to complete, with roughly two
> round-trips (in the best case scenario).
>
> Figuring out which monitor is able to contact the highest number of
> monitors, and having said monitor being elected the leader, will necessarily
> increase the number of messages transferred.
>
> A rough idea would be
>
> - all monitors will send probes to all other monitors in the monmap;
> - all monitors need to ack the other's probes;
> - each monitor will count the number of monitors it can reach, and then send
> a message proposing itself as the leader to the other monitors, with the
> list of monitors they see;
> - each monitor will propose itself as the leader, or defer to some other
> monitor.
>
> This is closer to 3 round-trips.
>
> Additionally, we'd have to account for the fact that some monitors may be
> able to reach all other monitors, while some may only be able to reach a
> portion. How do we handle this scenario?
>
> - What do we do with monitors that do not reach all other monitors?
> - Do we ignore them for electoral purposes?
> - Are they part of the final quorum?
> - What if we need those monitors to form a quorum?
>
> Personally, I think the easiest solution to this problem would be
> blacklisting a problematic monitor (for a given amount of time, or until a
> new election is needed due to loss of quorum, or by human intervention).
>
> For example, if a monitor believes it should be the leader, and if all other
> monitors are deferring to someone else that is not reachable, the monitor
> could then enter a special case branch:
>
> - send a probe to all monitors
> - receive acks
> - share that with other monitors
> - if that list is missing monitors, then blacklist the monitor for a period,
> and send a message to that monitor with that decision
> - the monitor would blacklist itself and retry in a given amount of time.
>
> Basically, this would be something similar to heartbeats. If a monitor can't
> reach all monitors in an existing quorum, then just don't do anything.
>
> In any case, you are more than welcome to propose a solution. Let us know
> what you come up with and if you want to discuss this a bit more ;)
>
>   -Joao
>
>>
>> On Tue, Jul 4, 2017 at 9:25 PM, Joao Eduardo Luis  wrote:
>>>
>>> On 07/04/2017 06:57 AM, Z Will wrote:


 Hi:

Re: [ceph-users] OSD Full Ratio Luminous - Unset

2017-07-10 Thread Brad Hubbard


On Tue, Jul 11, 2017 at 12:46 PM, Ashley Merrick  wrote:
> Hello,
>
> Perfect thanks that fixed my issue!
>
> Still seems to be a bug in the ceph pg dump output, unless it has been moved
> out of the PGMap and directly into the OSDMap?

I am looking into this issue and have been trying to bisect it but it has proved
challenging.

Please feel free to open a tracker as we have discussed or I will open one once
I have more information.

Edward, if you feel this could be improved please also open a separate tracker
for the ratio ordering issue outlining the steps involved and noting the issues
encountered.

>
> ,Ashley
>
> -Original Message-
> From: Edward R Huyer [mailto:erh...@rit.edu]
> Sent: Tuesday, 11 July 2017 7:53 AM
> To: Ashley Merrick 
> Cc: ceph-users@lists.ceph.com
> Subject: [ceph-users] OSD Full Ratio Luminous - Unset
>
> I just now ran into the same problem you did, though I managed to get it 
> straightened out.
>
> It looks to me like the "ceph osd set-{full,nearfull,backfillfull}-ratio" 
> commands *do* work, with two caveats.
>
> Caveat 1:  "ceph pg dump" doesn't reflect the change for some reason.
> Caveat 2: There doesn't seem to be much of a safety measure to prevent you 
> from entering incoherent values.  In particular, if backfillfull-ratio is 
> smaller than nearfull-ratio, you will continue to have the ERR health 
> message.  Backfillfull must be greater than (or equal to?) nearfull.  So, for 
> instance, my cluster at the moment has a full ratio of .9, a backfillfull 
> ratio of 0.85, and a nearfull ratio of 0.8.
>
> So there appears to be a bug when upgrading from Jewel where the full ratios 
> don't carry over properly, a display bug of some kind with "ceph pg dump", 
> and a general issue with lack of information about full ratios being out of 
> order.  Oh, and the set ratio commands allow incoherent values, though that 
> might be intentional.
>
> -
> Edward Huyer
> School of Interactive Games and Media
> Rochester Institute of Technology
> Golisano 70-2373
> 152 Lomb Memorial Drive
> Rochester, NY 14623
> 585-475-6651
> mailto:erh...@rit.edu
>
> Obligatory Legalese:
> The information transmitted, including attachments, is intended only for the 
> person(s) or entity to which it is addressed and may contain confidential 
> and/or privileged material. Any review, retransmission, dissemination or 
> other use of, or taking of any action in reliance upon this information by 
> persons or entities other than the intended recipient is prohibited. If you 
> received this in error, please contact the sender and destroy any copies of 
> this information.
>
> So using the commands given I checked all the mons and a couple of OSDs
> that are not backfilling because they report they are backfill_full.
>
> However, when running the commands, the values I have been trying to set via
> the set commands are correctly set and reported via the admin socket.
>
> However, "ceph pg dump | head" still shows 0 for them all and the PGs still
> aren't backfilling, so apart from ceph pg dump showing incorrect values the
> admin socket shows the correct ones; the cluster is still stuck in a
> non-backfilling state / HEALTH_ERR due to those messages.
>
> ,Ashley
>
> 
> From: Brad Hubbard 
> Sent: 07 July 2017 10:31:01
> To: Ashley Merrick
> Cc: ceph-users at ceph.com
> Subject: Re: [ceph-users] OSD Full Ratio Luminous - Unset
>
> On Fri, Jul 7, 2017 at 4:49 PM, Ashley Merrick  
> wrote:
>> After looking into this further it seems none of the:
>>
>>
>> ceph osd set-{full,nearfull,backfillfull}-ratio
>>
>>
>> commands seem to be having any effect on the cluster, including the
>> backfillfull ratio. This command looks to have been added/changed
>> since Jewel as a different way of setting the above; however, it does
>> not seem to be giving the expected results.
>
> Hi Ashley,
>
> Please do open a tracker for this including commands used and resulting 
> output so we can investigate this fully.
>
> In the meantime you should be able to view and manipulate individual full 
> ratio values via the admin socket.
>
> ceph daemon osd.0 config show | grep ratio
> ceph daemon mon.a config show | grep ratio
> ceph daemon osd.0 help | grep "config set "
>
>>
>>
>> ,Ashley
>>
>> 
>> From: ceph-users  on behalf of
>> Ashley Merrick 
>> Sent: 06 July 2017 12:44:09
>> To: ceph-users at ceph.com
>> Subject: Re: [ceph-users] OSD Full Ratio Luminous - Unset
>>
>> Anyone have some feedback on this? Happy to log a bug ticket if it is
>> one, but I want to make sure I'm not missing something related to a Luminous change.
>>
>> ,Ashley
>>
>> Sent from my iPhone
>>
>> On 4 Jul 2017, at 3:30 PM, Ashley Merrick  wrote:
>>
>> OK, I noticed there is a new command to set these.
>>
>>
>> Tried these and they are still showing as 0, with the "full ratio(s) out of
>> order" error: "ceph osd set-{full,nearfull,backfillfull}-ratio"
>>
>>
>> ,Ashley
>>
>> 

Re: [ceph-users] OSD Full Ratio Luminous - Unset

2017-07-10 Thread Ashley Merrick
Hello,

Perfect thanks that fixed my issue!

Still seems to be a bug in the ceph pg dump output, unless it has been moved out
of the PGMap and directly into the OSDMap?

,Ashley

-Original Message-
From: Edward R Huyer [mailto:erh...@rit.edu] 
Sent: Tuesday, 11 July 2017 7:53 AM
To: Ashley Merrick 
Cc: ceph-users@lists.ceph.com
Subject: [ceph-users] OSD Full Ratio Luminous - Unset

I just now ran into the same problem you did, though I managed to get it 
straightened out.

It looks to me like the "ceph osd set-{full,nearfull,backfillfull}-ratio" 
commands *do* work, with two caveats.

Caveat 1:  "ceph pg dump" doesn't reflect the change for some reason.
Caveat 2: There doesn't seem to be much of a safety measure to prevent you from 
entering incoherent values.  In particular, if backfillfull-ratio is smaller 
than nearfull-ratio, you will continue to have the ERR health message.  
Backfillfull must be greater than (or equal to?) nearfull.  So, for instance, 
my cluster at the moment has a full ratio of .9, a backfillfull ratio of 0.85, 
and a nearfull ratio of 0.8.

So there appears to be a bug when upgrading from Jewel where the full ratios 
don't carry over properly, a display bug of some kind with "ceph pg dump", and 
a general issue with lack of information about full ratios being out of order.  
Oh, and the set ratio commands allow incoherent values, though that might be 
intentional.

-
Edward Huyer
School of Interactive Games and Media
Rochester Institute of Technology
Golisano 70-2373
152 Lomb Memorial Drive
Rochester, NY 14623
585-475-6651
mailto:erh...@rit.edu

Obligatory Legalese:
The information transmitted, including attachments, is intended only for the 
person(s) or entity to which it is addressed and may contain confidential 
and/or privileged material. Any review, retransmission, dissemination or other 
use of, or taking of any action in reliance upon this information by persons or 
entities other than the intended recipient is prohibited. If you received this 
in error, please contact the sender and destroy any copies of this information.

So using the commands given I checked all the mons and a couple of OSDs that
are not backfilling because they report they are backfill_full.

However, when running the commands, the values I have been trying to set via the
set commands are correctly set and reported via the admin socket.

However, "ceph pg dump | head" still shows 0 for them all and the PGs still
aren't backfilling, so apart from ceph pg dump showing incorrect values the
admin socket shows the correct ones; the cluster is still stuck in a
non-backfilling state / HEALTH_ERR due to those messages.

,Ashley


From: Brad Hubbard 
Sent: 07 July 2017 10:31:01
To: Ashley Merrick
Cc: ceph-users at ceph.com
Subject: Re: [ceph-users] OSD Full Ratio Luminous - Unset

On Fri, Jul 7, 2017 at 4:49 PM, Ashley Merrick  wrote:
> After looking into this further it seems none of the:
>
>
> ceph osd set-{full,nearfull,backfillfull}-ratio
>
>
> commands seem to be having any effect on the cluster, including the
> backfillfull ratio. This command looks to have been added/changed
> since Jewel as a different way of setting the above; however, it does
> not seem to be giving the expected results.

Hi Ashley,

Please do open a tracker for this including commands used and resulting output 
so we can investigate this fully.

In the meantime you should be able to view and manipulate individual full ratio 
values via the admin socket.

ceph daemon osd.0 config show | grep ratio
ceph daemon mon.a config show | grep ratio
ceph daemon osd.0 help | grep "config set "

>
>
> ,Ashley
>
> 
> From: ceph-users  on behalf of 
> Ashley Merrick 
> Sent: 06 July 2017 12:44:09
> To: ceph-users at ceph.com
> Subject: Re: [ceph-users] OSD Full Ratio Luminous - Unset
>
> Anyone have some feedback on this? Happy to log a bug ticket if it is
> one, but I want to make sure I'm not missing something related to a Luminous change.
>
> ,Ashley
>
> Sent from my iPhone
>
> On 4 Jul 2017, at 3:30 PM, Ashley Merrick  wrote:
>
> OK, I noticed there is a new command to set these.
>
>
> Tried these and they are still showing as 0, with the "full ratio(s) out of
> order" error: "ceph osd set-{full,nearfull,backfillfull}-ratio"
>
>
> ,Ashley
>
> 
> From: ceph-users  on behalf of 
> Ashley Merrick 
> Sent: 04 July 2017 05:55:10
> To: ceph-users at ceph.com
> Subject: [ceph-users] OSD Full Ratio Luminous - Unset
>
>
> Hello,
>
>
> On a Luminous cluster upgraded from Jewel I am seeing the following in ceph -s:
> "Full ratio(s) out of order"
>
>
> and
>
>
> ceph pg dump | head
> dumped all
> version 44281
> stamp 2017-07-04 05:52:08.337258
> last_osdmap_epoch 0
> last_pg_scan 0
> full_ratio 0
> nearfull_ratio 0
>
> I have tried to inject the values, however it has no effect; these were
> previously non-zero values and the issue only showed after running "ceph
> osd require-osd-release luminous"

[ceph-users] OSD Full Ratio Luminous - Unset

2017-07-10 Thread Edward R Huyer
I just now ran into the same problem you did, though I managed to get it 
straightened out.

It looks to me like the "ceph osd set-{full,nearfull,backfillfull}-ratio" 
commands *do* work, with two caveats.

Caveat 1:  "ceph pg dump" doesn't reflect the change for some reason.
Caveat 2: There doesn't seem to be much of a safety measure to prevent you from 
entering incoherent values.  In particular, if backfillfull-ratio is smaller 
than nearfull-ratio, you will continue to have the ERR health message.  
Backfillfull must be greater than (or equal to?) nearfull.  So, for instance, 
my cluster at the moment has a full ratio of .9, a backfillfull ratio of 0.85, 
and a nearfull ratio of 0.8.

So there appears to be a bug when upgrading from Jewel where the full ratios 
don't carry over properly, a display bug of some kind with "ceph pg dump", and 
a general issue with lack of information about full ratios being out of order.  
Oh, and the set ratio commands allow incoherent values, though that might be 
intentional.
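
For anyone hitting the same thing, a minimal sketch of setting the three ratios in a coherent order and then checking what the daemons actually see (the 0.80/0.85/0.90 values and the mon.a/osd.0 names are just examples):

ceph osd set-nearfull-ratio 0.80
ceph osd set-backfillfull-ratio 0.85
ceph osd set-full-ratio 0.90

ceph osd dump | grep -i ratio                  # ratios as stored in the OSDMap
ceph daemon mon.a config show | grep ratio     # what a mon reports
ceph daemon osd.0 config show | grep ratio     # what an OSD reports

Note that "ceph pg dump" may still print 0 for full_ratio/nearfull_ratio here, per the display bug above.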

-
Edward Huyer
School of Interactive Games and Media
Rochester Institute of Technology
Golisano 70-2373
152 Lomb Memorial Drive
Rochester, NY 14623
585-475-6651
mailto:erh...@rit.edu

Obligatory Legalese:
The information transmitted, including attachments, is intended only for the 
person(s) or entity to which it is addressed and may contain confidential 
and/or privileged material. Any review, retransmission, dissemination or other 
use of, or taking of any action in reliance upon this information by persons or 
entities other than the intended recipient is prohibited. If you received this 
in error, please contact the sender and destroy any copies of this information.

So using the commands given I checked all the mons and a couple of OSDs that
are not backfilling because they report they are backfill_full.

However, when running the commands, the values I have been trying to set via the
set commands are correctly set and reported via the admin socket.

However, "ceph pg dump | head" still shows 0 for them all and the PGs still
aren't backfilling, so apart from ceph pg dump showing incorrect values the
admin socket shows the correct ones; the cluster is still stuck in a
non-backfilling state / HEALTH_ERR due to those messages.

,Ashley


From: Brad Hubbard 
Sent: 07 July 2017 10:31:01
To: Ashley Merrick
Cc: ceph-users at ceph.com
Subject: Re: [ceph-users] OSD Full Ratio Luminous - Unset

On Fri, Jul 7, 2017 at 4:49 PM, Ashley Merrick  wrote:
> After looking into this further it seems none of the:
>
>
> ceph osd set-{full,nearfull,backfillfull}-ratio
>
>
> commands seem to be having any effect on the cluster, including the
> backfillfull ratio. This command looks to have been added/changed since
> Jewel as a different way of setting the above; however, it does not seem to be
> giving the expected results.

Hi Ashley,

Please do open a tracker for this including commands used and
resulting output so we can investigate this fully.

In the meantime you should be able to view and manipulate individual
full ratio values via the admin socket.

ceph daemon osd.0 config show|grep ratio
ceph daemon mon.a config show|grep ratio
ceph daemon osd.0 help|grep "config set "

>
>
> ,Ashley
>
> 
> From: ceph-users  on behalf of Ashley
> Merrick 
> Sent: 06 July 2017 12:44:09
> To: ceph-users at ceph.com
> Subject: Re: [ceph-users] OSD Full Ratio Luminous - Unset
>
> Anyone have some feedback on this? Happy to log a bug ticket if it is one,
> but I want to make sure I'm not missing something related to a Luminous change.
>
> ,Ashley
>
> Sent from my iPhone
>
> On 4 Jul 2017, at 3:30 PM, Ashley Merrick  wrote:
>
> OK, I noticed there is a new command to set these.
>
>
> Tried these and they are still showing as 0, with the "full ratio(s) out of order" error:
> "ceph osd set-{full,nearfull,backfillfull}-ratio"
>
>
> ,Ashley
>
> 
> From: ceph-users  on behalf of Ashley
> Merrick 
> Sent: 04 July 2017 05:55:10
> To: ceph-users at ceph.com
> Subject: [ceph-users] OSD Full Ratio Luminous - Unset
>
>
> Hello,
>
>
> On a Luminous cluster upgraded from Jewel I am seeing the following in ceph -s:
> "Full ratio(s) out of order"
>
>
> and
>
>
> ceph pg dump | head
> dumped all
> version 44281
> stamp 2017-07-04 05:52:08.337258
> last_osdmap_epoch 0
> last_pg_scan 0
> full_ratio 0
> nearfull_ratio 0
>
> I have tried to inject the values, however it has no effect; these were
> previously non-zero values and the issue only showed after running "ceph osd
> require-osd-release luminous"
>
>
> Thanks,
>
> Ashley
>
> ___
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


[ceph-users] admin_socket error

2017-07-10 Thread Oscar Segarra
Hi,

My lab environment has just one node for testing purposes.

As user ceph (with sudo privileges granted) I have executed the following
commands in my environment:

ceph-deploy install vdicnode01
ceph-deploy --cluster vdiccephmgmtcluster new vdicnode01 --cluster-network
192.168.100.0/24 --public-network 192.168.100.0/24
ceph-deploy --cluster vdiccephmgmtcluster mon create-initial

But I get the following error:

[vdicnode01][INFO  ] Running command: sudo /usr/bin/ceph
--connect-timeout=25 --cluster=vdiccephmgmtcluster
--admin-daemon=/var/run/ceph/vdiccephmgmtcluster-mon.vdicnode01.asok
mon_status
[vdicnode01][ERROR ] "ceph mon_status vdicnode01" returned 22
[vdicnode01][DEBUG ] admin_socket: exception getting command descriptions:
[Errno 2] No such file or directory
[ceph_deploy.gatherkeys][ERROR ] Failed to connect to host:vdicnode01
[ceph_deploy.gatherkeys][INFO  ] Destroy temp directory /tmp/tmp8Rp143
[ceph_deploy][ERROR ] RuntimeError: Failed to connect any mon

The conf file:

[ceph@vdicnode01 ceph]$ cat /etc/ceph/vdiccephmgmtcluster.conf
[global]
fsid = 5841962c-4100-41e9-a450-12ef53312b8f
public_network = 192.168.100.0/24
cluster_network = 192.168.100.0/24
mon_initial_members = vdicnode01
mon_host = 192.168.100.101
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

The content of the directory where I've run all commands:

[ceph@vdicnode01 vdiccephmgmtcluster]$ ls -lisa
total 64
  191316  4 drwxrwxr-x. 2 ceph ceph  4096 Jul 11 01:21 .
67153692  4 drwx--. 4 ceph ceph  4096 Jul 10 01:06 ..
 1433511  8 -rw-rw-r--. 1 ceph ceph  8097 Jul 11 01:17 ceph-deploy-ceph.log
  191306 40 -rw-rw-r--. 1 ceph ceph 39174 Jul 11 01:34 ceph-deploy-vdiccephmgmtcluster.log
  194468  4 -rw-rw-r--. 1 ceph ceph   272 Jul 11 01:21 vdiccephmgmtcluster.conf
 1433510  4 -rw---. 1 ceph ceph    73 Jul 11 01:21 vdiccephmgmtcluster.mon.keyring

Any help will be welcome!

Thanks a lot.
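
A quick way to narrow this down (just a sketch; the systemd unit name below is an assumption and may differ for a custom cluster name) is to check whether the mon actually started and created its admin socket, and if so to query it directly:

ls -l /var/run/ceph/
sudo systemctl status ceph-mon@vdicnode01
sudo ceph --cluster vdiccephmgmtcluster \
    --admin-daemon /var/run/ceph/vdiccephmgmtcluster-mon.vdicnode01.asok mon_status

If the .asok file is missing, the mon never came up, and its log under /var/log/ceph/ should say why.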
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problems with statistics after upgrade to luminous

2017-07-10 Thread Gregory Farnum
On Mon, Jul 10, 2017 at 1:00 PM Sage Weil  wrote:

> On Mon, 10 Jul 2017, Ruben Kerkhof wrote:
> > On Mon, Jul 10, 2017 at 7:44 PM, Sage Weil  wrote:
> > > On Mon, 10 Jul 2017, Gregory Farnum wrote:
> > >> On Mon, Jul 10, 2017 at 12:57 AM Marc Roos 
> wrote:
> > >>
> > >>   I need a little help with fixing some errors I am having.
> > >>
> > >>   After upgrading from Kraken I'm getting incorrect values reported
> > >>   on
> > >>   placement groups etc. At first I thought it is because I was
> > >>   changing
> > >>   the public cluster ip address range and modifying the monmap
> > >>   directly.
> > >>   But after deleting and adding a monitor this ceph daemon dump is
> > >>   still
> > >>   incorrect.
> > >>
> > >>
> > >>
> > >>
> > >>   ceph daemon mon.a perf dump cluster
> > >>   {
> > >>   "cluster": {
> > >>   "num_mon": 3,
> > >>   "num_mon_quorum": 3,
> > >>   "num_osd": 6,
> > >>   "num_osd_up": 6,
> > >>   "num_osd_in": 6,
> > >>   "osd_epoch": 3842,
> > >>   "osd_bytes": 0,
> > >>   "osd_bytes_used": 0,
> > >>   "osd_bytes_avail": 0,
> > >>   "num_pool": 0,
> > >>   "num_pg": 0,
> > >>   "num_pg_active_clean": 0,
> > >>   "num_pg_active": 0,
> > >>   "num_pg_peering": 0,
> > >>   "num_object": 0,
> > >>   "num_object_degraded": 0,
> > >>   "num_object_misplaced": 0,
> > >>   "num_object_unfound": 0,
> > >>   "num_bytes": 0,
> > >>   "num_mds_up": 1,
> > >>   "num_mds_in": 1,
> > >>   "num_mds_failed": 0,
> > >>   "mds_epoch": 816
> > >>   }
> > >>
> > >>   }
> > >>
> > >>
> > >> Huh, I didn't know that existed.
> > >>
> > >> So, yep, most of those values aren't updated any more. From a grep,
> you can
> > >> still trust:
> > >> num_mon
> > >> num_mon_quorum
> > >> num_osd
> > >> num_osd_up
> > >> num_osd_in
> > >> osd_epoch
> > >> num_mds_up
> > >> num_mds_in
> > >> num_mds_failed
> > >> mds_epoch
> > >>
> > >> We might be able to keep updating the others when we get reports from
> the
> > >> manager, but it'd be simpler to just rip them out — I don't think the
> admin
> > >> socket is really the right place to get cluster summary data like
> this.
> > >> Sage, any thoughts?
> > >
> > > These were added to fill a gap when operators are collecting everything
> > > via collectd or similar.
> >
> > Indeed, this has been reported as
> > https://github.com/collectd/collectd/issues/2345
> >
> > > Getting the same cluster-level data from
> > > multiple mons is redundant but it avoids having to code up a separate
> > > collector that polls the CLI or something.
> > >
> > > I suspect once we're funneling everything through a mgr module this
> > > problem will go away and we can remove this.
> >
> > That would be great, having collectd running on each monitor always felt
> > a bit weird. If anyone wants to contribute patches to the collectd Ceph
> > plugin to support the mgr, we would really appreciate that.
>
> To be clear, what we're currently working on right here is a *prometheus*
> module/plugin for mgr that will funnel the metrics for *all* ceph daemons
> through a single endpoint to prometheus.  I suspect we can easily
> include the cluster-level stats there.
>
> I'm not sure what the situation looks like with collectd or if there is
> any interest or work in making the mgr behave like a proxy for all
> of the cluster and daemon stats.
>
> > > Until then, these are easy
> > > to fix by populating from PGMapDigest... my vote is we do that!
> >
> > Yes please :)
>
> I've added a ticket for luminous:
>
> http://tracker.ceph.com/issues/20563
>
> sage


https://github.com/ceph/ceph/pull/16249

Checked with vstart and that appears to resolve it correctly. :)
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problems with statistics after upgrade to luminous

2017-07-10 Thread Sage Weil
On Mon, 10 Jul 2017, Ruben Kerkhof wrote:
> On Mon, Jul 10, 2017 at 7:44 PM, Sage Weil  wrote:
> > On Mon, 10 Jul 2017, Gregory Farnum wrote:
> >> On Mon, Jul 10, 2017 at 12:57 AM Marc Roos  
> >> wrote:
> >>
> >>   I need a little help with fixing some errors I am having.
> >>
> >>   After upgrading from Kraken I'm getting incorrect values reported
> >>   on
> >>   placement groups etc. At first I thought it is because I was
> >>   changing
> >>   the public cluster ip address range and modifying the monmap
> >>   directly.
> >>   But after deleting and adding a monitor this ceph daemon dump is
> >>   still
> >>   incorrect.
> >>
> >>
> >>
> >>
> >>   ceph daemon mon.a perf dump cluster
> >>   {
> >>   "cluster": {
> >>   "num_mon": 3,
> >>   "num_mon_quorum": 3,
> >>   "num_osd": 6,
> >>   "num_osd_up": 6,
> >>   "num_osd_in": 6,
> >>   "osd_epoch": 3842,
> >>   "osd_bytes": 0,
> >>   "osd_bytes_used": 0,
> >>   "osd_bytes_avail": 0,
> >>   "num_pool": 0,
> >>   "num_pg": 0,
> >>   "num_pg_active_clean": 0,
> >>   "num_pg_active": 0,
> >>   "num_pg_peering": 0,
> >>   "num_object": 0,
> >>   "num_object_degraded": 0,
> >>   "num_object_misplaced": 0,
> >>   "num_object_unfound": 0,
> >>   "num_bytes": 0,
> >>   "num_mds_up": 1,
> >>   "num_mds_in": 1,
> >>   "num_mds_failed": 0,
> >>   "mds_epoch": 816
> >>   }
> >>
> >>   }
> >>
> >>
> >> Huh, I didn't know that existed.
> >>
> >> So, yep, most of those values aren't updated any more. From a grep, you can
> >> still trust:
> >> num_mon
> >> num_mon_quorum
> >> num_osd
> >> num_osd_up
> >> num_osd_in
> >> osd_epoch
> >> num_mds_up
> >> num_mds_in
> >> num_mds_failed
> >> mds_epoch
> >>
> >> We might be able to keep updating the others when we get reports from the
> >> manager, but it'd be simpler to just rip them out — I don't think the admin
> >> socket is really the right place to get cluster summary data like this.
> >> Sage, any thoughts?
> >
> > These were added to fill a gap when operators are collecting everything
> > via collectd or similar.
> 
> Indeed, this has been reported as
> https://github.com/collectd/collectd/issues/2345
> 
> > Getting the same cluster-level data from
> > multiple mons is redundant but it avoids having to code up a separate
> > collector that polls the CLI or something.
> >
> > I suspect once we're funneling everything through a mgr module this
> > problem will go away and we can remove this.
> 
> That would be great, having collectd running on each monitor always felt 
> a bit weird. If anyone wants to contribute patches to the collectd Ceph 
> plugin to support the mgr, we would really appreciate that.

To be clear, what we're currently working on right here is a *prometheus* 
module/plugin for mgr that will funnel the metrics for *all* ceph daemons 
through a single endpoint to prometheus.  I suspect we can easily 
include the cluster-level stats there.
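
As a rough sketch of how that might be consumed once the module lands (the module name and the port are assumptions about the eventual implementation, not something settled here):

ceph mgr module enable prometheus
curl -s http://<active-mgr-host>:9283/metrics | grep '^ceph_'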

I'm not sure what the situation looks like with collectd or if there is 
any interest or work in making the mgr behave like a proxy for all 
of the cluster and daemon stats.

> > Until then, these are easy
> > to fix by populating from PGMapDigest... my vote is we do that!
> 
> Yes please :)

I've added a ticket for luminous:

http://tracker.ceph.com/issues/20563

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problems with statistics after upgrade to luminous

2017-07-10 Thread Ruben Kerkhof
On Mon, Jul 10, 2017 at 7:44 PM, Sage Weil  wrote:
> On Mon, 10 Jul 2017, Gregory Farnum wrote:
>> On Mon, Jul 10, 2017 at 12:57 AM Marc Roos  wrote:
>>
>>   I need a little help with fixing some errors I am having.
>>
>>   After upgrading from Kraken I'm getting incorrect values reported
>>   on
>>   placement groups etc. At first I thought it is because I was
>>   changing
>>   the public cluster ip address range and modifying the monmap
>>   directly.
>>   But after deleting and adding a monitor this ceph daemon dump is
>>   still
>>   incorrect.
>>
>>
>>
>>
>>   ceph daemon mon.a perf dump cluster
>>   {
>>   "cluster": {
>>   "num_mon": 3,
>>   "num_mon_quorum": 3,
>>   "num_osd": 6,
>>   "num_osd_up": 6,
>>   "num_osd_in": 6,
>>   "osd_epoch": 3842,
>>   "osd_bytes": 0,
>>   "osd_bytes_used": 0,
>>   "osd_bytes_avail": 0,
>>   "num_pool": 0,
>>   "num_pg": 0,
>>   "num_pg_active_clean": 0,
>>   "num_pg_active": 0,
>>   "num_pg_peering": 0,
>>   "num_object": 0,
>>   "num_object_degraded": 0,
>>   "num_object_misplaced": 0,
>>   "num_object_unfound": 0,
>>   "num_bytes": 0,
>>   "num_mds_up": 1,
>>   "num_mds_in": 1,
>>   "num_mds_failed": 0,
>>   "mds_epoch": 816
>>   }
>>
>>   }
>>
>>
>> Huh, I didn't know that existed.
>>
>> So, yep, most of those values aren't updated any more. From a grep, you can
>> still trust:
>> num_mon
>> num_mon_quorum
>> num_osd
>> num_osd_up
>> num_osd_in
>> osd_epoch
>> num_mds_up
>> num_mds_in
>> num_mds_failed
>> mds_epoch
>>
>> We might be able to keep updating the others when we get reports from the
>> manager, but it'd be simpler to just rip them out — I don't think the admin
>> socket is really the right place to get cluster summary data like this.
>> Sage, any thoughts?
>
> These were added to fill a gap when operators are collecting everything
> via collectd or similar.

Indeed, this has been reported as
https://github.com/collectd/collectd/issues/2345

> Getting the same cluster-level data from
> multiple mons is redundant but it avoids having to code up a separate
> collector that polls the CLI or something.
>
> I suspect once we're funneling everything through a mgr module this
> problem will go away and we can remove this.

That would be great, having collectd running on each monitor always
felt a bit weird.
If anyone wants to contribute patches to the collectd Ceph plugin to
support the mgr, we would really appreciate that.

> Until then, these are easy
> to fix by populating from PGMapDigest... my vote is we do that!

Yes please :)
>
> sage

Kind regards,

Ruben Kerkhof
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD journaling benchmarks

2017-07-10 Thread Maged Mokhtar
On 2017-07-10 20:06, Mohamad Gebai wrote:

> On 07/10/2017 01:51 PM, Jason Dillaman wrote:
>> On Mon, Jul 10, 2017 at 1:39 PM, Maged Mokhtar  wrote:
>>> These are significant differences, to the point where it may not make sense
>>> to use rbd journaling / mirroring unless there is only 1 active client.
>> I interpreted the results as the same RBD image was being concurrently
>> used by two fio jobs -- which we strongly recommend against since it
>> will result in the exclusive-lock ping-ponging back and forth between
>> the two clients / jobs. Each fio RBD job should utilize its own
>> backing image to avoid such a scenario.

> That is correct. The single job runs are more representative of the
> overhead of journaling only, and it is worth noting the (expected)
> inefficiency of multiple clients for the same RBD image, as explained by
> Jason.
>
> Mohamad

Yes, I expected a penalty but not one this large. There are some use cases that
would benefit from concurrent access to the same block device: in VMware
and Hyper-V several hypervisors can share the same device, formatted
with a clustered file system like MS CSV (Cluster Shared
Volumes) or VMFS, which creates a volume/datastore that houses many
VMs.

I was wondering if such a setup could be supported in the future, and
whether there could be a way to minimize the overhead of the exclusive
lock, for example by handing a distributed sequence number to the
different active client writers and having each writer maintain its own
journal; I doubt that the overhead would reach the values you showed.

Maged
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD journaling benchmarks

2017-07-10 Thread Mohamad Gebai

On 07/10/2017 01:51 PM, Jason Dillaman wrote:
> On Mon, Jul 10, 2017 at 1:39 PM, Maged Mokhtar  wrote:
>> These are significant differences, to the point where it may not make sense
>> to use rbd journaling / mirroring unless there is only 1 active client.
> I interpreted the results as the same RBD image was being concurrently
> used by two fio jobs -- which we strongly recommend against since it
> will result in the exclusive-lock ping-ponging back and forth between
> the two clients / jobs. Each fio RBD job should utilize its own
> backing image to avoid such a scenario.
>

That is correct. The single job runs are more representative of the
overhead of journaling only, and it is worth noting the (expected)
inefficiency of multiple clients for the same RBD image, as explained by
Jason.

Mohamad

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD journaling benchmarks

2017-07-10 Thread Jason Dillaman
On Mon, Jul 10, 2017 at 1:39 PM, Maged Mokhtar  wrote:
> These are significant differences, to the point where it may not make sense
> to use rbd journaling / mirroring unless there is only 1 active client.

I interpreted the results as the same RBD image was being concurrently
used by two fio jobs -- which we strongly recommend against since it
will result in the exclusive-lock ping-ponging back and forth between
the two clients / jobs. Each fio RBD job should utilize its own
backing image to avoid such a scenario.
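
A minimal sketch of that setup with the fio rbd engine, each job writing to its own image so the exclusive lock never has to move (pool/image names are made up):

rbd create --size 10G --image-feature exclusive-lock --image-feature journaling rbd/bench-img1
rbd create --size 10G --image-feature exclusive-lock --image-feature journaling rbd/bench-img2

fio --name=job1 --ioengine=rbd --clientname=admin --pool=rbd --rbdname=bench-img1 \
    --rw=randwrite --bs=4k --iodepth=128 --direct=1 --runtime=60 --time_based &
fio --name=job2 --ioengine=rbd --clientname=admin --pool=rbd --rbdname=bench-img2 \
    --rw=randwrite --bs=4k --iodepth=128 --direct=1 --runtime=60 --time_based &
wait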

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problems with statistics after upgrade to luminous

2017-07-10 Thread Sage Weil
On Mon, 10 Jul 2017, Gregory Farnum wrote:
> On Mon, Jul 10, 2017 at 12:57 AM Marc Roos  wrote:
> 
>   I need a little help with fixing some errors I am having.
> 
>   After upgrading from Kraken I'm getting incorrect values reported
>   on
>   placement groups etc. At first I thought it is because I was
>   changing
>   the public cluster ip address range and modifying the monmap
>   directly.
>   But after deleting and adding a monitor this ceph daemon dump is
>   still
>   incorrect.
> 
> 
> 
> 
>   ceph daemon mon.a perf dump cluster
>   {
>       "cluster": {
>           "num_mon": 3,
>           "num_mon_quorum": 3,
>           "num_osd": 6,
>           "num_osd_up": 6,
>           "num_osd_in": 6,
>           "osd_epoch": 3842,
>           "osd_bytes": 0,
>           "osd_bytes_used": 0,
>           "osd_bytes_avail": 0,
>           "num_pool": 0,
>           "num_pg": 0,
>           "num_pg_active_clean": 0,
>           "num_pg_active": 0,
>           "num_pg_peering": 0,
>           "num_object": 0,
>           "num_object_degraded": 0,
>           "num_object_misplaced": 0,
>           "num_object_unfound": 0,
>           "num_bytes": 0,
>           "num_mds_up": 1,
>           "num_mds_in": 1,
>           "num_mds_failed": 0,
>           "mds_epoch": 816
>       } 
> 
>   }
> 
> 
> Huh, I didn't know that existed.
> 
> So, yep, most of those values aren't updated any more. From a grep, you can
> still trust:
> num_mon
> num_mon_quorum
> num_osd
> num_osd_up
> num_osd_in
> osd_epoch
> num_mds_up
> num_mds_in
> num_mds_failed
> mds_epoch
> 
> We might be able to keep updating the others when we get reports from the
> manager, but it'd be simpler to just rip them out — I don't think the admin
> socket is really the right place to get cluster summary data like this.
> Sage, any thoughts?

These were added to fill a gap when operators are collecting everything 
via collectd or similar.  Getting the same cluster-level data from 
multiple mons is redundant but it avoids having to code up a separate 
collector that polls the CLI or something.

I suspect once we're funneling everything through a mgr module this 
problem will go away and we can remove this.  Until then, these are easy 
to fix by populating from PGMapDigest... my vote is we do that!

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitor as local VM on top of the server pool cluster?

2017-07-10 Thread David Turner
Mons are a paxos quorum and as such want to be in odd numbers.  5 is
generally what people go with.  I think I've heard of a few people use 7
mons, but you do not want to have an even number of mons or an ever growing
number of mons.  The reason you do not want mons running on the same
hardware as osds is resource contention during recovery.  As long as the
Xen servers you are putting the mons on are not going to cause any source
of resource limitation/contention, then virtualizing them should be fine
for you.  Make sure that you aren't configuring the mon to run using an RBD
for its storage, that would be very bad.

The mon Quorum elects a leader and that leader will be in charge of the
quorum.  Having local mons doesn't do anything as the clients will still be
talking to the mons as a quorum and won't necessarily talk to the mon
running on them.  The vast majority of communication to the cluster that
your Xen servers will be doing is to the OSDs anyway, very little
communication to the mons.
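
As a quick sanity check (a minimal sketch; the mon ID, cluster name and paths are assumptions, adjust for your deployment), you can confirm that each mon keeps its store on a local filesystem rather than anything RBD-backed, and see which mon is the current leader:

df -h /var/lib/ceph/mon/ceph-$(hostname -s)    # mon store should sit on a local disk
ceph quorum_status --format json-pretty | grep -E '"quorum_leader_name"|"name"'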

On Mon, Jul 10, 2017 at 1:21 PM Massimiliano Cuttini 
wrote:

> Hi everybody,
>
> I would like to separate the MONs from the OSDs as recommended.
> In order to do so without new hardware I'm planning to create all the
> monitors as virtual machines on top of my hypervisors (Xen).
> I'm testing a pool of 8 Xen nodes.
>
> I'm thinking about creating 8 monitors and pinning one monitor to each Xen node.
> So, I'm guessing, every Ceph monitor will be local to each client node.
> This should speed up the system by connecting to monitors locally, with a
> little overhead for the monitor sync between nodes.
>
> Is it a good idea to have a local monitor virtualized on top of each
> hypervisor node?
> Do you see any underestimation or wrong design in this?
>
> Thanks for every helpful bit of info.
>
>
> Regards,
> Max
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD journaling benchmarks

2017-07-10 Thread Maged Mokhtar
On 2017-07-10 18:14, Mohamad Gebai wrote:

> Resending as my first try seems to have disappeared.
> 
> Hi,
> 
> We ran some benchmarks to assess the overhead caused by enabling
> client-side RBD journaling in Luminous. The tests consist of:
> - Create an image with journaling enabled  (--image-feature journaling)
> - Run randread, randwrite and randrw workloads sequentially from a
> single client using fio
> - Collect IOPS
> 
> More info:
> - Feature exclusive-lock is enabled with journaling (required)
> - Queue depth of 128 for fio
> - With 1 and 2 threads
> 
> Cluster 1
> 
> 
> - 5 OSD nodes
> - 6 OSDs per node
> - 3 monitors
> - All SSD
> - Bluestore + WAL
> - 10GbE NIC
> - Ceph version 12.0.3-1380-g6984d41b5d
> (6984d41b5d142ce157216b6e757bcb547da2c7d2) luminous (dev)
> 
> Results:
> 
>            Default     Journaling              Jour width 32
> Jobs       IOPS        IOPS       Slowdown     IOPS       Slowdown
> RW
> 1          19521       9104       2.1x         16067      1.2x
> 2          30575       726        42.1x        488        62.6x
> Read
> 1          22775       22946      0.9x         23601      0.9x
> 2          35955       1078       33.3x        446        80.2x
> Write
> 1          18515       6054       3.0x         9765       1.9x
> 2          29586       1188       24.9x        534        55.4x
> 
> - "Default" is the baseline (with journaling disabled)
> - "Journaling" is with journaling enabled
> - "Jour width 32" is with a journal data width of 32 objects
> (--journal-splay-width 32)
> - The major slowdown for two jobs is due to locking
> - With a journal width of 32, the 0.9x slowdown (which is actually a
> speedup) is due to the read-only workload, which doesn't exercise the
> journaling code.
> - The randwrite workload exercises the journaling code the most, and is
> expected to have the highest slowdown, which is 1.9x in this case.
> 
> Cluster 2
> 
> 
> - 3 OSD nodes
> - 10 OSDs per node
> - 1 monitor
> - All HDD
> - Filestore
> - 10GbE NIC
> - Ceph version 12.1.0-289-g117b171715
> (117b1717154e1236b2d37c405a86a9444cf7871d) luminous (dev)
> 
> Results:
> 
>            Default     Journaling              Jour width 32
> Jobs       IOPS        IOPS       Slowdown     IOPS       Slowdown
> RW
> 1          11869       3674       3.2x         4914       2.4x
> 2          13127       736        17.8x        432        30.4x
> Read
> 1          14500       14700      1.0x         14703      1.0x
> 2          16673       3893       4.3x         307        54.3x
> Write
> 1          8267        1925       4.3x         2591       3.2x
> 2          8283        1012       8.2x         417        19.9x
> 
> - The number of IOPS for the write workload is quite low, which is due
> to HDDs and filestore
> 
> Mohamad
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

These are significant differences, to the point where it may not make
sense to use rbd journaling / mirroring unless there is only 1 active
client. Could there be a future enhancement that tries to make
active/active possible? Would it help if each active writer maintained
its own queue and only locked on a sequence number / counter, to try to
minimize the lock overhead of writing to the same journal queue?

Maged
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Monitor as local VM on top of the server pool cluster?

2017-07-10 Thread Massimiliano Cuttini

Hi everybody,

I would like to separate the MONs from the OSDs as recommended.
In order to do so without new hardware I'm planning to create all the
monitors as virtual machines on top of my hypervisors (Xen).

I'm testing a pool of 8 Xen nodes.

I'm thinking about creating 8 monitors and pinning one monitor to each Xen node.
So, I'm guessing, every Ceph monitor will be local to each client node.
This should speed up the system by connecting to monitors locally, with a
little overhead for the monitor sync between nodes.


Is it a good idea to have a local monitor virtualized on top of each
hypervisor node?

Do you see any underestimation or wrong design in this?

Thanks for every helpful bit of info.


Regards,
Max

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problems with statistics after upgrade to luminous

2017-07-10 Thread Gregory Farnum
On Mon, Jul 10, 2017 at 12:57 AM Marc Roos  wrote:

>
> I need a little help with fixing some errors I am having.
>
> After upgrading from Kraken I'm getting incorrect values reported on
> placement groups etc. At first I thought it is because I was changing
> the public cluster ip address range and modifying the monmap directly.
> But after deleting and adding a monitor this ceph daemon dump is still
> incorrect.
>
>
>
>
> ceph daemon mon.a perf dump cluster
> {
> "cluster": {
> "num_mon": 3,
> "num_mon_quorum": 3,
> "num_osd": 6,
> "num_osd_up": 6,
> "num_osd_in": 6,
> "osd_epoch": 3842,
> "osd_bytes": 0,
> "osd_bytes_used": 0,
> "osd_bytes_avail": 0,
> "num_pool": 0,
> "num_pg": 0,
> "num_pg_active_clean": 0,
> "num_pg_active": 0,
> "num_pg_peering": 0,
> "num_object": 0,
> "num_object_degraded": 0,
> "num_object_misplaced": 0,
> "num_object_unfound": 0,
> "num_bytes": 0,
> "num_mds_up": 1,
> "num_mds_in": 1,
> "num_mds_failed": 0,
> "mds_epoch": 816
> }

}
>

Huh, I didn't know that existed.

So, yep, most of those values aren't updated any more. From a grep, you can
still trust:
num_mon
num_mon_quorum
num_osd
num_osd_up
num_osd_in
osd_epoch
num_mds_up
num_mds_in
num_mds_failed
mds_epoch
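
If you only want to scrape the counters that are still meaningful, something like this works (assumes jq is available; mon.a is just an example):

ceph daemon mon.a perf dump cluster | jq '.cluster | {num_mon, num_mon_quorum,
    num_osd, num_osd_up, num_osd_in, osd_epoch, num_mds_up, num_mds_in,
    num_mds_failed, mds_epoch}'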

We might be able to keep updating the others when we get reports from the
manager, but it'd be simpler to just rip them out — I don't think the admin
socket is really the right place to get cluster summary data like this.
Sage, any thoughts?
-Greg



>
> 2017-07-10 09:51:54.219167 7f5cb7338700 -1 WARNING: the following
> dangerous and experimental features are enabled: bluestore
>   cluster:
> id: 0f1701f5-453a-4a3b-928d-f652a2bbbcb0
> health: HEALTH_OK
>
>   services:
> mon: 3 daemons, quorum a,b,c
> mgr: c(active), standbys: a, b
> mds: 1/1/1 up {0=c=up:active}, 1 up:standby
> osd: 6 osds: 6 up, 6 in
>
>   data:
> pools:   4 pools, 328 pgs
> objects: 5224k objects, 889 GB
> usage:   2474 GB used, 28264 GB / 30739 GB avail
> pgs: 327 active+clean
>  1   active+clean+scrubbing+deep
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD journaling benchmarks

2017-07-10 Thread Mohamad Gebai
Resending as my first try seems to have disappeared.

Hi,

We ran some benchmarks to assess the overhead caused by enabling
client-side RBD journaling in Luminous. The tests consist of:
- Create an image with journaling enabled  (--image-feature journaling)
- Run randread, randwrite and randrw workloads sequentially from a
single client using fio
- Collect IOPS

More info:
- Feature exclusive-lock is enabled with journaling (required)
- Queue depth of 128 for fio
- With 1 and 2 threads
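
A sketch of what such a run might look like (pool/image names, size and client name are placeholders; an illustration, not the exact commands used):

rbd create --size 100G --image-feature exclusive-lock --image-feature journaling rbd/journal-test
# for the "width 32" runs, add: --journal-splay-width 32
rbd info rbd/journal-test    # features should list exclusive-lock, journaling

fio --name=randwrite --ioengine=rbd --clientname=admin --pool=rbd --rbdname=journal-test \
    --rw=randwrite --bs=4k --iodepth=128 --numjobs=1 --direct=1 --runtime=60 --time_based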


Cluster 1


- 5 OSD nodes
- 6 OSDs per node
- 3 monitors
- All SSD
- Bluestore + WAL
- 10GbE NIC
- Ceph version 12.0.3-1380-g6984d41b5d
(6984d41b5d142ce157216b6e757bcb547da2c7d2) luminous (dev)


Results:

           Default     Journaling              Jour width 32
Jobs       IOPS        IOPS       Slowdown     IOPS       Slowdown
RW
1          19521       9104       2.1x         16067      1.2x
2          30575       726        42.1x        488        62.6x
Read
1          22775       22946      0.9x         23601      0.9x
2          35955       1078       33.3x        446        80.2x
Write
1          18515       6054       3.0x         9765       1.9x
2          29586       1188       24.9x        534        55.4x

- "Default" is the baseline (with journaling disabled)
- "Journaling" is with journaling enabled
- "Jour width 32" is with a journal data width of 32 objects
(--journal-splay-width 32)
- The major slowdown for two jobs is due to locking
- With a journal width of 32, the 0.9x slowdown (which is actually a
speedup) is due to the read-only workload, which doesn't exercise the
journaling code.
- The randwrite workload exercises the journaling code the most, and is
expected to have the highest slowdown, which is 1.9x in this case.


Cluster 2


- 3 OSD nodes
- 10 OSDs per node
- 1 monitor
- All HDD
- Filestore
- 10GbE NIC
- Ceph version 12.1.0-289-g117b171715
(117b1717154e1236b2d37c405a86a9444cf7871d) luminous (dev)


Results:

           Default     Journaling              Jour width 32
Jobs       IOPS        IOPS       Slowdown     IOPS       Slowdown
RW
1          11869       3674       3.2x         4914       2.4x
2          13127       736        17.8x        432        30.4x
Read
1          14500       14700      1.0x         14703      1.0x
2          16673       3893       4.3x         307        54.3x
Write
1          8267        1925       4.3x         2591       3.2x
2          8283        1012       8.2x         417        19.9x

- The number of IOPS for the write workload is quite low, which is due
to HDDs and filestore

Mohamad

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] hammer -> jewel 10.2.8 upgrade and setting sortbitwise

2017-07-10 Thread Sage Weil
On Mon, 10 Jul 2017, Luis Periquito wrote:
> Hi Dan,
> 
> I've enabled it in a couple of big-ish clusters and had the same
> experience - a few seconds disruption caused by a peering process
> being triggered, like any other crushmap update does. Can't remember
> if it triggered data movement, but I have a feeling it did...

That's consistent with what one should expect.

The flag triggers a new peering interval, which means the PGs will peer, 
but there is no change in the mapping or data layout or anything else.  
The only thing that is potentially scary here is that *every* PG will 
repeer at the same time.
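
A minimal sketch of doing it with the obvious checks around it (nothing here is required):

ceph osd dump | grep flags    # confirm sortbitwise is not set yet
ceph osd set sortbitwise      # every PG re-peers once, briefly
ceph -s                       # watch until no PGs remain in peering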

sage


> 
> 
> 
> On Mon, Jul 10, 2017 at 3:17 PM, Dan van der Ster  wrote:
> > Hi all,
> >
> > With 10.2.8, ceph will now warn if you didn't yet set sortbitwise.
> >
> > I just updated a test cluster, saw that warning, then did the necessary
> >   ceph osd set sortbitwise
> >
> > I noticed a short re-peering which took around 10s on this small
> > cluster with very little data.
> >
> > Has anyone done this already on a large cluster with lots of objects?
> > It would be nice to hear that it isn't disruptive before running it on
> > our big production instances.
> >
> > Cheers, Dan
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding storage to exiting clusters with minimal impact

2017-07-10 Thread bruno.canning
Hi All,

Thanks for your ideas and recommendations. I've been experimenting with:
https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-gentle-reweight
by Dan van der Ster and it is producing good results.

It does indeed seem that adjusting the crush weight up from zero is the way to 
go, rather than the osd reweight.
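
For anyone doing it by hand instead of via the script, the idea is roughly this (the osd id, weight steps and the health grep are placeholders/heuristics, not recommended values):

# walk osd.42 from crush weight 0 up to its final weight in steps,
# waiting for recovery to settle between increments
for w in 1.0 2.0 3.0 4.0 5.0 5.458; do
    ceph osd crush reweight osd.42 $w
    sleep 60
    while ceph health | grep -qE 'peering|backfill|recover'; do sleep 60; done
done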

Best wishes,
Bruno



From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Peter Maloney 
[peter.malo...@brockmann-consult.de]
Sent: 06 July 2017 19:29
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Adding storage to exiting clusters with minimal impact

Here's my possibly unique method... I had 3 nodes with 12 disks each,
and when adding 2 more nodes, I had issues with the common method you
describe, totally blocking clients for minutes, but this worked great
for me:

> my own method
> - osd max backfills = 1 and osd recovery max active = 1
> - create them with crush weight 0 so no peering happens
> - (starting here the script below does it, eg. `ceph_activate_osds 6`
> will set weight 6)
> - after they're up, set them reweight 0
> - then set crush weight to the TB of the disk
> - peering starts, but reweight is 0 so it doesn't block clients
> - when that's done, reweight 1 and it should be faster than the
> previous peering and not bug clients as much
>
>
> # list osds with hosts next to them for easy filtering with awk (doesn't support chassis, rack, etc. buckets)
> ceph_list_osd() {
> ceph osd tree | awk '
> BEGIN {found=0; host=""};
> $3 == "host" {found=1; host=$4; getline};
> $3 == "host" {found=0}
> found || $3 ~ /osd\./ {print $0 " " host}'
> }
>
> peering_sleep() {
> echo "sleeping"
> sleep 2
> while ceph health | grep -q peer; do
> echo -n .
> sleep 1
> done
> echo
> sleep 5
> }
>
> # after an osd is already created, this reweights them to 'activate' them
> ceph_activate_osds() {
> weight="$1"
> host=$(hostname -s)
>
> if [ -z "$weight" ]; then
> # TODO: somehow make this automatic...
> # This assumes all disks are the same weight.
> weight=6.00099
> fi
>
> # for crush weight 0 osds, set reweight 0 so the non-zero crush weight won't cause as many blocked requests
> for id in $(ceph_list_osd | awk '$2 == 0 {print $1}'); do
> ceph osd reweight $id 0 &
> done
> wait
> peering_sleep
>
> # the harsh reweight which we do slowly
> for id in $(ceph_list_osd | awk -v host="$host" '$5 == 0 && $7 == host {print $1}'); do
> echo ceph osd crush reweight "osd.$id" "$weight"
> ceph osd crush reweight "osd.$id" "$weight"
> peering_sleep
> done
>
> # the light reweight
> for id in $(ceph_list_osd | awk -v host="$host" '$5 == 0 && $7 == host {print $1}'); do
> ceph osd reweight $id 1 &
> done
> wait
> }


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] hammer -> jewel 10.2.8 upgrade and setting sortbitwise

2017-07-10 Thread Luis Periquito
Hi Dan,

I've enabled it in a couple of big-ish clusters and had the same
experience - a few seconds of disruption caused by a peering process
being triggered, just as any other crushmap update does. Can't remember
whether it triggered data movement, but I have a feeling it did...



On Mon, Jul 10, 2017 at 3:17 PM, Dan van der Ster  wrote:
> Hi all,
>
> With 10.2.8, ceph will now warn if you didn't yet set sortbitwise.
>
> I just updated a test cluster, saw that warning, then did the necessary
>   ceph osd set sortbitwise
>
> I noticed a short re-peering which took around 10s on this small
> cluster with very little data.
>
> Has anyone done this already on a large cluster with lots of objects?
> It would be nice to hear that it isn't disruptive before running it on
> our big production instances.
>
> Cheers, Dan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] hammer -> jewel 10.2.8 upgrade and setting sortbitwise

2017-07-10 Thread Dan van der Ster
Hi all,

With 10.2.8, ceph will now warn if you didn't yet set sortbitwise.

I just updated a test cluster, saw that warning, then did the necessary
  ceph osd set sortbitwise

I noticed a short re-peering which took around 10s on this small
cluster with very little data.

Has anyone done this already on a large cluster with lots of objects?
It would be nice to hear that it isn't disruptive before running it on
our big production instances.

Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Access rights of /var/lib/ceph with Jewel

2017-07-10 Thread Brady Deetz
From a least-privilege standpoint, o=rx seems bad. Instead, if you need a
user to have rx, why not set a default ACL on each OSD to allow Nagios to
have rx?
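
A minimal sketch of that approach (assuming the OSD data directories sit on a
filesystem mounted with ACL support and the monitoring user is literally
named "nagios"):

  # let the nagios user traverse the now-restricted top-level directory
  setfacl -m u:nagios:rx /var/lib/ceph
  # grant read on the OSD directories, plus a default ACL so that
  # directories created later (new OSDs) inherit the same entry
  setfacl -m u:nagios:rx /var/lib/ceph/osd
  setfacl -d -m u:nagios:rx /var/lib/ceph/osd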

I think it's designed according to best practice. If a user wishes to accept
additional risk, that's their risk.

On Jul 10, 2017 8:10 AM, "Jens Rosenboom"  wrote:

> 2017-07-10 10:40 GMT+00:00 Christian Balzer :
> > On Mon, 10 Jul 2017 11:27:26 +0200 Marc Roos wrote:
> >
> >> Looks to me by design (from rpm install), and the settings of the
> >> directorys below are probably the result of a user umask setting.
> >
> > I know it's deliberate, I'm asking why.
>
> It seems to have been introduced in
> https://github.com/ceph/ceph/pull/4456 and Sage writes there:
>
> > need to validate the permissiong choices for /var/log/ceph adn
> /var/lib/ceph
>
> I agree with you that setting "o=rx" would be a more sensible choice.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Access rights of /var/lib/ceph with Jewel

2017-07-10 Thread Jens Rosenboom
2017-07-10 10:40 GMT+00:00 Christian Balzer :
> On Mon, 10 Jul 2017 11:27:26 +0200 Marc Roos wrote:
>
>> Looks to me by design (from rpm install), and the settings of the
>> directorys below are probably the result of a user umask setting.
>
> I know it's deliberate, I'm asking why.

It seems to have been introduced in
https://github.com/ceph/ceph/pull/4456 and Sage writes there:

> need to validate the permissiong choices for /var/log/ceph adn /var/lib/ceph

I agree with you that setting "o=rx" would be a more sensible choice.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] subscribe

2017-07-10 Thread hui chen
subscribe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Degraded objects while OSD is being added/filled

2017-07-10 Thread Eino Tuominen
[replying to my post]


In fact, I did just this:


1. On a HEALTH_OK cluster, run "ceph osd in 245"

2. Wait for the cluster to stabilise

3. Witness this:


    cluster 0a9f2d69-5905-4369-81ae-e36e4a791831
     health HEALTH_WARN
            385 pgs backfill_wait
            1 pgs backfilling
            33 pgs degraded
            33 pgs recovery_wait
            305 pgs stuck unclean
            recovery 73550/278276590 objects degraded (0.026%)
            recovery 5151479/278276590 objects misplaced (1.851%)
     monmap e3: 3 mons at {0=130.232.243.65:6789/0,1=130.232.243.66:6789/0,2=130.232.243.67:6789/0}
            election epoch 356, quorum 0,1,2 0,1,2
     osdmap e397402: 260 osds: 260 up, 243 in; 386 remapped pgs
            flags require_jewel_osds
      pgmap v81108208: 25728 pgs, 8 pools, 203 TB data, 89746 kobjects
            614 TB used, 303 TB / 917 TB avail
            73550/278276590 objects degraded (0.026%)
            5151479/278276590 objects misplaced (1.851%)
               25293 active+clean
                 385 active+remapped+wait_backfill
                  33 active+recovery_wait+degraded
                  16 active+clean+scrubbing+deep
                   1 active+remapped+backfilling

--
  Eino Tuominen



From: Eino Tuominen
Sent: Monday, July 10, 2017 14:25
To: Gregory Farnum; Andras Pataki; ceph-users
Subject: Re: [ceph-users] Degraded objects while OSD is being added/filled


Hi Greg,


I was not clear enough. First I set the weight to 0 (ceph osd out), I waited 
until the cluster was stable and healthy (all pgs active+clean). Then I went 
and removed the now empty osds. That was when I saw degraded objects. I'm soon 
about to add some new disks to the cluster. I can reproduce this on the cluster 
if you'd like to see what's happening. What would help to debug this? ceph osd 
dump and ceph pg dump before and after the modifications?
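
A simple way to capture that state for comparison would be something like
this sketch (plain ceph CLI; the file names are arbitrary):

  # snapshot the maps before the change ...
  ceph osd dump > osd-dump.before
  ceph pg dump  > pg-dump.before
  # ... perform the OSD removal/addition, wait for the cluster to settle,
  # then snapshot again and compare
  ceph osd dump > osd-dump.after
  ceph pg dump  > pg-dump.after
  diff -u osd-dump.before osd-dump.after | less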


--

  Eino Tuominen



From: Gregory Farnum 
Sent: Thursday, July 6, 2017 19:20
To: Eino Tuominen; Andras Pataki; ceph-users
Subject: Re: [ceph-users] Degraded objects while OSD is being added/filled



On Tue, Jul 4, 2017 at 10:47 PM Eino Tuominen > 
wrote:

​Hello,


I noticed the same behaviour in our cluster.


ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)



cluster 0a9f2d69-5905-4369-81ae-e36e4a791831

 health HEALTH_WARN

1 pgs backfill_toofull

4366 pgs backfill_wait

11 pgs backfilling

45 pgs degraded

45 pgs recovery_wait

45 pgs stuck degraded

4423 pgs stuck unclean

recovery 181563/302722835 objects degraded (0.060%)

recovery 57192879/302722835 objects misplaced (18.893%)

1 near full osd(s)

noout,nodeep-scrub flag(s) set

 monmap e3: 3 mons at 
{0=130.232.243.65:6789/0,1=130.232.243.66:6789/0,2=130.232.243.67:6789/0}

election epoch 356, quorum 0,1,2 0,1,2

 osdmap e388588: 260 osds: 260 up, 242 in; 4378 remapped pgs

flags nearfull,noout,nodeep-scrub,require_jewel_osds

  pgmap v80658624: 25728 pgs, 8 pools, 202 TB data, 89212 kobjects

612 TB used, 300 TB / 912 TB avail

181563/302722835 objects degraded (0.060%)

57192879/302722835 objects misplaced (18.893%)

   21301 active+clean

4366 active+remapped+wait_backfill

  45 active+recovery_wait+degraded

  11 active+remapped+backfilling

   4 active+clean+scrubbing

   1 active+remapped+backfill_toofull

recovery io 421 MB/s, 155 objects/s

  client io 201 kB/s rd, 2034 B/s wr, 75 op/s rd, 0 op/s wr

I'm currently doing a rolling migration from Puppet on Ubuntu to Ansible on 
RHEL, and I started with a healthy cluster, evacuated some nodes by setting 
their weight to 0, removed them from the cluster and re-added them with ansible 
playbook.

Basically I ran


ceph osd crush remove osd.$num

ceph osd rm $num

ceph auth del osd.$num

in a loop for the osds I was replacing, and then let the ansible ceph-osd 
playbook to bring the host back to the cluster. Crushmap is attached.

This case is different. If you are removing OSDs before they've had the chance 
to offload themselves, objects are going to be degraded since you're removing a 
copy! :)
-Greg

​
--
  Eino Tuominen



From: ceph-users 
> 
on behalf of Gregory Farnum >
Sent: Friday, June 30, 2017 23:38
To: Andras Pataki; ceph-users
Subject: Re: [ceph-users] Degraded objects while OSD is being added/filled

On Wed, Jun 21, 

Re: [ceph-users] Degraded objects while OSD is being added/filled

2017-07-10 Thread Eino Tuominen
Hi Greg,


I was not clear enough. First I set the weight to 0 (ceph osd out), I waited 
until the cluster was stable and healthy (all pgs active+clean). Then I went 
and removed the now empty osds. That was when I saw degraded objects. I'm soon 
about to add some new disks to the cluster. I can reproduce this on the cluster 
if you'd like to see what's happening. What would help to debug this? ceph osd 
dump and ceph pg dump before and after the modifications?


--

  Eino Tuominen



From: Gregory Farnum 
Sent: Thursday, July 6, 2017 19:20
To: Eino Tuominen; Andras Pataki; ceph-users
Subject: Re: [ceph-users] Degraded objects while OSD is being added/filled



On Tue, Jul 4, 2017 at 10:47 PM Eino Tuominen > 
wrote:

​Hello,


I noticed the same behaviour in our cluster.


ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)



cluster 0a9f2d69-5905-4369-81ae-e36e4a791831

 health HEALTH_WARN

1 pgs backfill_toofull

4366 pgs backfill_wait

11 pgs backfilling

45 pgs degraded

45 pgs recovery_wait

45 pgs stuck degraded

4423 pgs stuck unclean

recovery 181563/302722835 objects degraded (0.060%)

recovery 57192879/302722835 objects misplaced (18.893%)

1 near full osd(s)

noout,nodeep-scrub flag(s) set

 monmap e3: 3 mons at 
{0=130.232.243.65:6789/0,1=130.232.243.66:6789/0,2=130.232.243.67:6789/0}

election epoch 356, quorum 0,1,2 0,1,2

 osdmap e388588: 260 osds: 260 up, 242 in; 4378 remapped pgs

flags nearfull,noout,nodeep-scrub,require_jewel_osds

  pgmap v80658624: 25728 pgs, 8 pools, 202 TB data, 89212 kobjects

612 TB used, 300 TB / 912 TB avail

181563/302722835 objects degraded (0.060%)

57192879/302722835 objects misplaced (18.893%)

   21301 active+clean

4366 active+remapped+wait_backfill

  45 active+recovery_wait+degraded

  11 active+remapped+backfilling

   4 active+clean+scrubbing

   1 active+remapped+backfill_toofull

recovery io 421 MB/s, 155 objects/s

  client io 201 kB/s rd, 2034 B/s wr, 75 op/s rd, 0 op/s wr

I'm currently doing a rolling migration from Puppet on Ubuntu to Ansible on 
RHEL, and I started with a healthy cluster, evacuated some nodes by setting 
their weight to 0, removed them from the cluster and re-added them with ansible 
playbook.

Basically I ran


ceph osd crush remove osd.$num

ceph osd rm $num

ceph auth del osd.$num

in a loop for the osds I was replacing, and then let the ansible ceph-osd 
playbook to bring the host back to the cluster. Crushmap is attached.

This case is different. If you are removing OSDs before they've had the chance 
to offload themselves, objects are going to be degraded since you're removing a 
copy! :)
-Greg

​
--
  Eino Tuominen



From: ceph-users 
> 
on behalf of Gregory Farnum >
Sent: Friday, June 30, 2017 23:38
To: Andras Pataki; ceph-users
Subject: Re: [ceph-users] Degraded objects while OSD is being added/filled

On Wed, Jun 21, 2017 at 6:57 AM Andras Pataki 
> wrote:
Hi cephers,

I noticed something I don't understand about ceph's behavior when adding an 
OSD.  When I start with a clean cluster (all PG's active+clean) and add an OSD 
(via ceph-deploy for example), the crush map gets updated and PGs get 
reassigned to different OSDs, and the new OSD starts getting filled with data.  
As the new OSD gets filled, I start seeing PGs in degraded states.  Here is an 
example:

  pgmap v52068792: 42496 pgs, 6 pools, 1305 TB data, 390 Mobjects
3164 TB used, 781 TB / 3946 TB avail
8017/994261437 objects degraded (0.001%)
2220581/994261437 objects misplaced (0.223%)
   42393 active+clean
  91 active+remapped+wait_backfill
   9 active+clean+scrubbing+deep
   1 active+recovery_wait+degraded
   1 active+clean+scrubbing
   1 active+remapped+backfilling

Any ideas why there would be any persistent degradation in the cluster while 
the newly added drive is being filled?  It takes perhaps a day or two to fill 
the drive - and during all this time the cluster seems to be running degraded.  
As data is written to the cluster, the number of degraded objects increases 
over time.  Once the newly added OSD is filled, the cluster comes back to clean 
again.

Here is the PG that is degraded in this picture:


Re: [ceph-users] Access rights of /var/lib/ceph with Jewel

2017-07-10 Thread Christian Balzer
On Mon, 10 Jul 2017 11:27:26 +0200 Marc Roos wrote:

> Looks to me by design (from rpm install), and the settings of the 
> directorys below are probably the result of a user umask setting. 

I know it's deliberate, I'm asking why.

> Anyway 
> I can imagine that it is not nagios business to access those area's, for 
> that you have: not? 
>
 DISK CRITICAL - /var/lib/ceph/osd/ceph-30 is not accessible: Permission
denied

Of course putting nagios into the ceph group "fixes" this, but seriously,
the ROOT owned ceph directories in previous incarnations were not
protected as such and I see very little reason for this.
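
For reference, the group workaround amounts to no more than the following
(monitoring user name assumed); the question is whether it should be needed
at all:

  # add the monitoring user to the ceph group; the nagios/NRPE service has to
  # be restarted before the new group membership takes effect
  usermod -a -G ceph nagios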

> ceph daemon osd.0 perf dump 
> 
I'm not interested in parsing this stuff in a yet-to-be-written script
in which (judging by some of the scripts out there, which are not a shining
example) every osd needs to be listed.

Christian
> 
> 
> 
> 
> -Original Message-
> From: Christian Balzer [mailto:ch...@gol.com] 
> Sent: maandag 10 juli 2017 8:09
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Access rights of /var/lib/ceph with Jewel
> 
> 
> Hello,
> 
> With Jewel /var/lib/ceph has these permissions: "drwxr-x---", while 
> every directory below it still has the world aXessible bit set. 
> 
> This makes it impossible (by default) for nagios and other non-root bits 
> to determine the disk usage for example.
> 
> Any rhyme or reason for this decision?
> 
> Christian


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDSs have different mdsmap epoch

2017-07-10 Thread John Spray
On Mon, Jul 10, 2017 at 7:44 AM, TYLin  wrote:
> Hi all,
>
> We have a cluster whose fsmap and mdsmap have different values. Also, each mds
> has a different mdsmap epoch. The active mds has epoch 52, and the other two
> standby mds have 53 and 55, respectively. Why is the mdsmap epoch of each mds
> different?

This is normal.  The epoch for each standby and each mdsmap indicates
the FSMap epoch where that thing was last updated.  Every time we
change something, we increment the FSMap epoch, and then copy that
number into the epoch field of whichever part of the fsmap was
changed.

We do it this way so that if an MDS or client is subscribing to
updates for just an MDSMap (not the whole FSMap), then they only get
updates when something in that part changes, not when anything in the
FSMap changes.  So the epoch number for an MDSMap will always
increase, but it will not be a contiguous range of numbers.

John
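
To see those fields side by side, the epochs can be pulled out of the JSON
quoted below, for example with jq (a sketch, assuming jq is installed):

  ceph mds stat --format=json | jq '{
      fsmap_epoch:    .fsmap.epoch,
      standby_epochs: [.fsmap.standbys[].epoch],
      mdsmap_epochs:  [.fsmap.filesystems[].mdsmap.epoch]
  }'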

>
> Our cluster:
> ceph 11.2.0
> 3 nodes. Each node has a mon, mds and 4 OSDs.
>
> $ ceph mds stat --format=json
> {
>   "fsmap": {
> "epoch": 55,
> "compat": {
>   "compat": {},
>   "ro_compat": {},
>   "incompat": {
> "feature_1": "base v0.20",
> "feature_2": "client writeable ranges",
> "feature_3": "default file layouts on dirs",
> "feature_4": "dir inode in separate object",
> "feature_5": "mds uses versioned encoding",
> "feature_6": "dirfrag is stored in omap",
> "feature_8": "file layout v2"
>   }
> },
> "feature_flags": {
>   "enable_multiple": false,
>   "ever_enabled_multiple": false
> },
> "standbys": [
>   {
> "gid": 41,
> "name": "Host2",
> "rank": -1,
> "incarnation": 0,
> "state": "up:standby",
> "state_seq": 2,
> "addr": "10.4.154.141:6816/304221716",
> "standby_for_rank": -1,
> "standby_for_fscid": -1,
> "standby_for_name": "",
> "standby_replay": false,
> "export_targets": [],
> "features": 1152921504336314400,
> "epoch": 53
>   },
>   {
> "gid": 424717,
> "name": "Host3",
> "rank": -1,
> "incarnation": 0,
> "state": "up:standby",
> "state_seq": 2,
> "addr": "10.4.154.142:6816/627678162",
> "standby_for_rank": -1,
> "standby_for_fscid": -1,
> "standby_for_name": "",
> "standby_replay": false,
> "export_targets": [],
> "features": 1152921504336314400,
> "epoch": 55
>   }
> ],
> "filesystems": [
>   {
> "mdsmap": {
>   "epoch": 52,
>   "flags": 0,
>   "ever_allowed_features": 0,
>   "explicitly_allowed_features": 0,
>   "created": "2017-06-15 11:56:32.709015",
>   "modified": "2017-06-15 11:56:32.709015",
>   "tableserver": 0,
>   "root": 0,
>   "session_timeout": 60,
>   "session_autoclose": 300,
>   "max_file_size": 1099511627776,
>   "last_failure": 0,
>   "last_failure_osd_epoch": 154,
>   "compat": {
> "compat": {},
> "ro_compat": {},
> "incompat": {
>   "feature_1": "base v0.20",
>   "feature_2": "client writeable ranges",
>   "feature_3": "default file layouts on dirs",
>   "feature_4": "dir inode in separate object",
>   "feature_5": "mds uses versioned encoding",
>   "feature_6": "dirfrag is stored in omap",
>   "feature_8": "file layout v2"
> }
>   },
>   "max_mds": 1,
>   "in": [
> 0
>   ],
>   "up": {
> "mds_0": 399818
>   },
>   "failed": [],
>   "damaged": [],
>   "stopped": [],
>   "info": {
> "gid_399818": {
>   "gid": 399818,
>   "name": “Host1",
>   "rank": 0,
>   "incarnation": 49,
>   "state": "up:active",
>   "state_seq": 492,
>   "addr": "10.4.154.140:6816/3168307953",
>   "standby_for_rank": -1,
>   "standby_for_fscid": -1,
>   "standby_for_name": "",
>   "standby_replay": false,
>   "export_targets": [],
>   "features": 1152921504336314400
> }
>   },
>   "data_pools": [
> 2
>   ],
>   "metadata_pool": 3,
>   "enabled": true,
>   "fs_name": "cephfs",
>   "balancer": ""
> },
> "id": 1
>   }
> ]
>   },
>   "mdsmap_first_committed": 1,
>   "mdsmap_last_committed": 55
> }
>
> Thanks,
> Tim Lin
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] Problems with statistics after upgrade to luminous

2017-07-10 Thread Marc Roos

I need a little help with fixing some errors I am having. 

After upgrading from Kraken I'm getting incorrect values reported for
placement groups etc. At first I thought it was because I had changed
the public cluster IP address range and modified the monmap directly,
but after deleting and re-adding a monitor this ceph daemon dump is
still incorrect.




ceph daemon mon.a perf dump cluster
{
"cluster": {
"num_mon": 3,
"num_mon_quorum": 3,
"num_osd": 6,
"num_osd_up": 6,
"num_osd_in": 6,
"osd_epoch": 3842,
"osd_bytes": 0,
"osd_bytes_used": 0,
"osd_bytes_avail": 0,
"num_pool": 0,
"num_pg": 0,
"num_pg_active_clean": 0,
"num_pg_active": 0,
"num_pg_peering": 0,
"num_object": 0,
"num_object_degraded": 0,
"num_object_misplaced": 0,
"num_object_unfound": 0,
"num_bytes": 0,
"num_mds_up": 1,
"num_mds_in": 1,
"num_mds_failed": 0,
"mds_epoch": 816
}
}

2017-07-10 09:51:54.219167 7f5cb7338700 -1 WARNING: the following 
dangerous and experimental features are enabled: bluestore
  cluster:
id: 0f1701f5-453a-4a3b-928d-f652a2bbbcb0
health: HEALTH_OK

  services:
mon: 3 daemons, quorum a,b,c
mgr: c(active), standbys: a, b
mds: 1/1/1 up {0=c=up:active}, 1 up:standby
osd: 6 osds: 6 up, 6 in

  data:
pools:   4 pools, 328 pgs
objects: 5224k objects, 889 GB
usage:   2474 GB used, 28264 GB / 30739 GB avail
pgs: 327 active+clean
 1   active+clean+scrubbing+deep
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph MeetUp Berlin on July 17

2017-07-10 Thread Robert Sander
Hi,

https://www.meetup.com/de-DE/Ceph-Berlin/events/240812906/

Come join us for an introduction to Ceph and DESY, including a tour of
their data center and photo injector test facility.

Regards
-- 
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Mandatory disclosures per §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Managing director: Peer Heinlein -- Registered office: Berlin



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MON daemons fail after creating bluestore osd with block.db partition (luminous 12.1.0-1~bpo90+1 )

2017-07-10 Thread Thomas Gebhardt
Hello,

Thomas Gebhardt schrieb am 07.07.2017 um 17:21:
> ( e.g.,
> ceph-deploy osd create --bluestore --block-db=/dev/nvme0bnp1 node1:/dev/sdi
> )

I just noticed that there was a typo in the block-db device name
(/dev/nvme0bnp1 -> /dev/nvme0n1p1). After fixing that misspelling my
cookbook worked fine and the mons are running.
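
With the corrected device name, the call from the earlier mail becomes:

  ceph-deploy osd create --bluestore --block-db=/dev/nvme0n1p1 node1:/dev/sdi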

Kind regards, Thomas
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDSs have different mdsmap epoch

2017-07-10 Thread TYLin
Hi all,

We have a cluster whose fsmap and mdsmap have different values. Also, each mds 
has a different mdsmap epoch. The active mds has epoch 52, and the other two standby 
mds have 53 and 55, respectively. Why is the mdsmap epoch of each mds different?

Our cluster:
ceph 11.2.0
3 nodes. Each node has a mon, mds and 4 OSDs. 

$ ceph mds stat --format=json
{
  "fsmap": {
"epoch": 55,
"compat": {
  "compat": {},
  "ro_compat": {},
  "incompat": {
"feature_1": "base v0.20",
"feature_2": "client writeable ranges",
"feature_3": "default file layouts on dirs",
"feature_4": "dir inode in separate object",
"feature_5": "mds uses versioned encoding",
"feature_6": "dirfrag is stored in omap",
"feature_8": "file layout v2"
  }
},
"feature_flags": {
  "enable_multiple": false,
  "ever_enabled_multiple": false
},
"standbys": [
  {
"gid": 41,
"name": "Host2",
"rank": -1,
"incarnation": 0,
"state": "up:standby",
"state_seq": 2,
"addr": "10.4.154.141:6816/304221716",
"standby_for_rank": -1,
"standby_for_fscid": -1,
"standby_for_name": "",
"standby_replay": false,
"export_targets": [],
"features": 1152921504336314400,
"epoch": 53
  },
  {
"gid": 424717,
"name": "Host3",
"rank": -1,
"incarnation": 0,
"state": "up:standby",
"state_seq": 2,
"addr": "10.4.154.142:6816/627678162",
"standby_for_rank": -1,
"standby_for_fscid": -1,
"standby_for_name": "",
"standby_replay": false,
"export_targets": [],
"features": 1152921504336314400,
"epoch": 55
  }
],
"filesystems": [
  {
"mdsmap": {
  "epoch": 52,
  "flags": 0,
  "ever_allowed_features": 0,
  "explicitly_allowed_features": 0,
  "created": "2017-06-15 11:56:32.709015",
  "modified": "2017-06-15 11:56:32.709015",
  "tableserver": 0,
  "root": 0,
  "session_timeout": 60,
  "session_autoclose": 300,
  "max_file_size": 1099511627776,
  "last_failure": 0,
  "last_failure_osd_epoch": 154,
  "compat": {
"compat": {},
"ro_compat": {},
"incompat": {
  "feature_1": "base v0.20",
  "feature_2": "client writeable ranges",
  "feature_3": "default file layouts on dirs",
  "feature_4": "dir inode in separate object",
  "feature_5": "mds uses versioned encoding",
  "feature_6": "dirfrag is stored in omap",
  "feature_8": "file layout v2"
}
  },
  "max_mds": 1,
  "in": [
0
  ],
  "up": {
"mds_0": 399818
  },
  "failed": [],
  "damaged": [],
  "stopped": [],
  "info": {
"gid_399818": {
  "gid": 399818,
  "name": “Host1",
  "rank": 0,
  "incarnation": 49,
  "state": "up:active",
  "state_seq": 492,
  "addr": "10.4.154.140:6816/3168307953",
  "standby_for_rank": -1,
  "standby_for_fscid": -1,
  "standby_for_name": "",
  "standby_replay": false,
  "export_targets": [],
  "features": 1152921504336314400
}
  },
  "data_pools": [
2
  ],
  "metadata_pool": 3,
  "enabled": true,
  "fs_name": "cephfs",
  "balancer": ""
},
"id": 1
  }
]
  },
  "mdsmap_first_committed": 1,
  "mdsmap_last_committed": 55
}

Thanks,
Tim Lin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Access rights of /var/lib/ceph with Jewel

2017-07-10 Thread Christian Balzer

Hello,

With Jewel /var/lib/ceph has these permissions: "drwxr-x---", while every
directory below it still has the world aXessible bit set. 

This makes it impossible (by default) for nagios and other non-root bits
to determine the disk usage for example.

Any rhyme or reason for this decision?

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com