Re: [ceph-users] Unexpected "out" OSD behaviour

2019-12-23 Thread Oliver Freyermuth
Dear Jonas,

I tried just now on a 14.2.5 cluster, and sadly, the unexpected behaviour is still there,
i.e. an OSD marked "out" and then restarted is not considered as a data source anymore.
I also tried with a 13.2.8 OSD (in a cluster running 13.2.6 on other OSDs, MONs 
and MGRs), same effect. 

However, the trick you described ("mark your OSD in and then out right away") helps in both cases:
the data on the OSDs is considered as a data source again and any degradation is gone.
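
For reference, what I did boils down to the following (a minimal sketch; osd.42 is a hypothetical id of an OSD that is up but marked "out"):

  ceph osd in 42
  ceph osd out 42
  # afterwards, the degraded objects should disappear again:
  ceph -s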

So while I think your patch should solve the issue, for some reason, it does 
not seem to be effective. 

Cheers,
Oliver

On 22.12.19 at 23:50, Oliver Freyermuth wrote:
> Dear Jonas,
> 
> On 22.12.19 at 23:40, Jonas Jelten wrote:
>> hi!
>>
>> I've also noticed that behavior and have submitted a patch some time ago 
>> that should fix (2):
>> https://github.com/ceph/ceph/pull/27288
> 
> thanks, this does indeed seem very much like the issue I saw! 
> I'm luckily not in a critical situation at the moment, but was just wondering 
> if this behaviour was normal (since it does not fit well
> with the goal of ensuring maximum possible redundancy at all times). 
> 
> However, I observed this on 13.2.6, which - if I read the release notes 
> correctly - should already have your patch in. Strange. 
> 
>> But it may well be that there's more cases where PGs are not discovered on 
>> devices that do have them. Just recently a
>> lot of my data was degraded and then recreated even though it would have 
>> been available on a node that had taken very
>> long to reboot.
> 
> We've set "mon_osd_down_out_subtree_limit" to "host" to make sure recovery of 
> data from full hosts does not start without one of us admins
> telling Ceph to go ahead. Maybe this also helps in your case? 
> 
>> What you can do also is to mark your OSD in and then out right away, the 
>> data is discovered then. Although with my patch
>> that shouldn't be necessary any more. Hope this helps you.
> 
> I will keep this in mind the next time it happens (I may be able to provoke 
> it, we have to drain more nodes, and once the next node is almost-empty,
> I can just restart one of the "out" OSDs and see what happens). 
> 
> Cheers and many thanks,
>   Oliver
> 
>>
>> Cheers
>>   -- Jonas
>>
>>
>> On 22/12/2019 19.48, Oliver Freyermuth wrote:
>>> Dear Cephers,
>>>
>>> I realized the following behaviour only recently:
>>>
>>> 1. Marking an OSD "out" sets the weight to zero and allows data to be migrated away (as long as it is up),
>>>    i.e. it is still considered as a "source" and nothing goes to degraded state (so far, everything expected).
>>> 2. Restarting an "out" OSD, however, means it will come back with "0 pgs", 
>>> and if data was not fully migrated away yet,
>>>it means the PGs which were still kept on it before will enter degraded 
>>> state since they now lack a copy / shard.
>>>
>>> Is (2) expected? 
>>>
>>> If so, my understanding that taking an OSD "out" to let the data be 
>>> migrated away without losing any redundancy is wrong,
>>> since redundancy will be lost as soon as the "out" OSD is restarted (e.g. 
>>> due to a crash, node reboot,...) and the only safe options would be:
>>> 1. Disable the automatic balancer. 
>>> 2. Either adjust the weights of the OSDs to drain to zero, or use pg upmap 
>>> to drain them. 
>>> 3. Reenable the automatic balancer only after having fully drained those 
>>> OSDs and performing the necessary intervention
>>>(in our case, recreating the OSDs with a faster blockdb). 
>>>
>>> Is this correct? 
>>>
>>> Cheers,
>>> Oliver
>>>
>>>
>>>
>>
> 





Re: [ceph-users] Unexpected "out" OSD behaviour

2019-12-22 Thread Oliver Freyermuth
Dear Jonas,

On 22.12.19 at 23:40, Jonas Jelten wrote:
> hi!
> 
> I've also noticed that behavior and have submitted a patch some time ago that 
> should fix (2):
> https://github.com/ceph/ceph/pull/27288

thanks, this does indeed seem very much like the issue I saw! 
I'm luckily not in a critical situation at the moment, but was just wondering 
if this behaviour was normal (since it does not fit well
with the goal of ensuring maximum possible redundancy at all times). 

However, I observed this on 13.2.6, which - if I read the release notes 
correctly - should already have your patch in. Strange. 

> But it may well be that there's more cases where PGs are not discovered on 
> devices that do have them. Just recently a
> lot of my data was degraded and then recreated even though it would have been 
> available on a node that had taken very
> long to reboot.

We've set "mon_osd_down_out_subtree_limit" to "host" to make sure recovery of 
data from full hosts does not start without one of us admins
telling Ceph to go ahead. Maybe this also helps in your case? 
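
For completeness, this is roughly how we set it (a sketch; assuming a release with the centralized config store, otherwise it goes into ceph.conf on the MONs):

  ceph config set mon mon_osd_down_out_subtree_limit host
  ceph config dump | grep mon_osd_down_out_subtree_limit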

> What you can do also is to mark your OSD in and then out right away, the data 
> is discovered then. Although with my patch
> that shouldn't be necessary any more. Hope this helps you.

I will keep this in mind the next time it happens (I may be able to provoke it, 
we have to drain more nodes, and once the next node is almost-empty,
I can just restart one of the "out" OSDs and see what happens). 

Cheers and many thanks,
    Oliver

> 
> Cheers
>   -- Jonas
> 
> 
> On 22/12/2019 19.48, Oliver Freyermuth wrote:
>> Dear Cephers,
>>
>> I realized the following behaviour only recently:
>>
>> 1. Marking an OSD "out" sets the weight to zero and allows data to be migrated away (as long as it is up),
>>    i.e. it is still considered as a "source" and nothing goes to degraded state (so far, everything expected).
>> 2. Restarting an "out" OSD, however, means it will come back with "0 pgs", 
>> and if data was not fully migrated away yet,
>>it means the PGs which were still kept on it before will enter degraded 
>> state since they now lack a copy / shard.
>>
>> Is (2) expected? 
>>
>> If so, my understanding that taking an OSD "out" to let the data be migrated 
>> away without losing any redundancy is wrong,
>> since redundancy will be lost as soon as the "out" OSD is restarted (e.g. 
>> due to a crash, node reboot,...) and the only safe options would be:
>> 1. Disable the automatic balancer. 
>> 2. Either adjust the weights of the OSDs to drain to zero, or use pg upmap 
>> to drain them. 
>> 3. Reenable the automatic balancer only after having fully drained those 
>> OSDs and performing the necessary intervention
>>(in our case, recreating the OSDs with a faster blockdb). 
>>
>> Is this correct? 
>>
>> Cheers,
>>  Oliver
>>
>>
>>
> 





[ceph-users] Unexpected "out" OSD behaviour

2019-12-22 Thread Oliver Freyermuth
Dear Cephers,

I realized the following behaviour only recently:

1. Marking an OSD "out" sets the weight to zero and allows to migrate data away 
(as long as it is up),
   i.e. it is still considered as a "source" and nothing goes to degraded state 
(so far, everything expected). 
2. Restarting an "out" OSD, however, means it will come back with "0 pgs", and 
if data was not fully migrated away yet,
   it means the PGs which were still kept on it before will enter degraded 
state since they now lack a copy / shard.

Is (2) expected? 

If so, my understanding that taking an OSD "out" to let the data be migrated 
away without losing any redundancy is wrong,
since redundancy will be lost as soon as the "out" OSD is restarted (e.g. due 
to a crash, node reboot,...) and the only safe options would be:
1. Disable the automatic balancer. 
2. Either adjust the weights of the OSDs to drain to zero, or use pg upmap to 
drain them. 
3. Reenable the automatic balancer only after having fully drained those OSDs 
and performing the necessary intervention
   (in our case, recreating the OSDs with a faster blockdb). 

Is this correct? 
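
For concreteness, the sequence I have in mind would look roughly like this (a hedged sketch; osd.42 stands for each OSD to be drained):

  ceph balancer off
  ceph osd crush reweight osd.42 0      # drain via CRUSH weight (or use pg-upmap instead)
  # wait until backfill is done and the OSD holds no data anymore:
  ceph osd safe-to-destroy osd.42
  # ... perform the intervention (recreate the OSD with a faster blockdb) ...
  ceph balancer on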

Cheers,
Oliver





Re: [ceph-users] dashboard hangs

2019-11-22 Thread Oliver Freyermuth

Hi,

On 2019-11-20 15:55, thoralf schulze wrote:

hi,

we were able to track this down to the auto balancer: disabling the auto
balancer and cleaning out old (and probably not very meaningful)
upmap-entries via ceph osd rm-pg-upmap-items brought back stable mgr
daemons and a usable dashboard.


I can confirm that. In our case, I see this on a 14.2.4 cluster (which started its life with an earlier Nautilus version and developed this issue over the past weeks), and doing:
 ceph balancer off
has been sufficient to make the mgrs stable again (i.e. I left the upmap-items in place).
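
In case somebody wants to check their own cluster, the relevant commands are roughly:

  ceph balancer status
  ceph balancer off
  # the existing upmap entries (which I left untouched) can be listed with:
  ceph osd dump | grep pg_upmap_items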

Interestingly, we did not see this with Luminous or Mimic on different clusters 
(which however have a more stable number of OSDs).

@devs: If there's any more info needed to track this down, please let us know.

Cheers,
Oliver



the not-so-sensible upmap-entries might or might not have been caused by
us updating from mimic to nautilus - it's too late to debug this now.
this seems to be consistent with bryan stillwell's findings ("mgr hangs
with upmap balancer").

thank you very much & with kind regards,
thoralf.









Re: [ceph-users] Can't create erasure coded pools with k+m greater than hosts?

2019-10-24 Thread Oliver Freyermuth

On 2019-10-24 09:46, Janne Johansson wrote:

(Slightly abbreviated)

On Thu, 24 Oct 2019 at 09:24, Frank Schilder  wrote:

  What I learned are the following:

1) Avoid this work-around (too few hosts for the EC rule) at all cost.

2) Do not use EC 2+1. It does not offer anything interesting for 
production. Use 4+2 (or 8+2, 8+3 if you have the hosts).

3) If you have no perspective of getting at least 7 servers in the long run 
(4+2=6 for EC profile, +1 for fail-over automatic rebuild), do not go for EC.

4) Before you start thinking about replicating to a second site, you should 
have a primary site running solid first.

This is collected from my experience. I would do things differently now, and maybe it helps you with deciding how to proceed. It's basically about what resources you can expect in the foreseeable future and what compromises you are willing to make with regards to sleep and sanity.


Amen to all of those points. We made similar-but-not-the-same mistakes on an EC cluster here. You are going to produce more tears than I/O if you repeat the mis-designs mentioned above.
We could add:

5) Never buy SMR drives, pretend they don't even exist. If a similar technology 
appears tomorrow for cheap SSD/NVME, skip it.


Amen from my side, too. Luckily, we only made a small fraction of these mistakes (running 4+2 on 6 servers and wondering about funny effects when taking one server offline, while we were still testing the setup, before we finally decided to ask for a 7th server), but this can in part be extrapolated.
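
For reference, such a 4+2 profile with host failure domain would be set up roughly like this (a sketch; profile and pool names as well as the PG count are just placeholders):

  ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
  ceph osd pool create cephfs_data_ec 256 256 erasure ec-4-2
  ceph osd pool set cephfs_data_ec allow_ec_overwrites true   # required for CephFS/RBD on EC pools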

Concerning SMR, I learnt that SMR-awareness is on Ceph's roadmap (for 
host-managed SMR drives). Once that is available, host-managed SMR drives 
should be a well-working and cheap solution
especially for backup / WORM workloads.
But as of now, even disk vendors will tell you to avoid SMR for datacenter
setups (unless you have a storage system aware of it and host-managed drives).

Cheers,
Oliver



--
May the most significant bit of your life be positive.









Re: [ceph-users] POOL_TARGET_SIZE_BYTES_OVERCOMMITTED

2019-09-25 Thread Oliver Freyermuth

Hi together,

can somebody confirm whether I should put this in a ticket, or whether this is wanted (but very unexpected) behaviour?
We have some pools which gain a factor of three by compression:
POOL  ID  STORED   OBJECTS  USED     %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY    USED COMPR  UNDER COMPR
rbd   2   1.2 TiB  472.44k  1.8 TiB  35.24  1.1 TiB    N/A            N/A          472.44k  717 GiB     2.1 TiB
so as of now, this always leads to a health warning via pg-autoscaler as soon 
as the cluster is 33 % filled, since it thinks the subtree is overcommitted:
 POOL                      SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
 default.rgw.buckets.data  61358M               3.0   5952G         0.0302  0.0700        1.0   32                  on
 rbd                       1856G                3.0   5952G         0.9359  0.9200        1.0   256                 on
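
For anyone who wants to reproduce the comparison, the numbers above come from (a short sketch of the commands involved):

  ceph df detail                   # STORED vs. USED plus the per-pool compression statistics
  ceph osd pool autoscale-status   # the SIZE column the autoscaler bases its decision on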

Cheers,
Oliver

On 12.09.19 at 23:34, Oliver Freyermuth wrote:

Dear Cephalopodians,

I can confirm the same problem described by Joe Ryner in 14.2.2. I'm also 
getting (in a small test setup):
-
# ceph health detail
HEALTH_WARN 1 subtrees have overcommitted pool target_size_bytes; 1 subtrees 
have overcommitted pool target_size_ratio
POOL_TARGET_SIZE_BYTES_OVERCOMMITTED 1 subtrees have overcommitted pool 
target_size_bytes
 Pools ['rbd', '.rgw.root', 'default.rgw.control', 'default.rgw.meta', 
'default.rgw.log', 'default.rgw.buckets.index', 'default.rgw.buckets.data'] 
overcommit available storage by 1.068x due to target_size_bytes0  on pools 
[]
POOL_TARGET_SIZE_RATIO_OVERCOMMITTED 1 subtrees have overcommitted pool 
target_size_ratio
 Pools ['rbd', '.rgw.root', 'default.rgw.control', 'default.rgw.meta', 
'default.rgw.log', 'default.rgw.buckets.index', 'default.rgw.buckets.data'] 
overcommit available storage by 1.068x due to target_size_ratio 0.000 on pools 
[]
-

However, there's not much actual data STORED:
-
# ceph df
RAW STORAGE:
    CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
    hdd    4.0 TiB  2.6 TiB  1.4 TiB  1.4 TiB   35.94
    TOTAL  4.0 TiB  2.6 TiB  1.4 TiB  1.4 TiB   35.94

POOLS:
    POOL                       ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
    rbd                        2   676 GiB  266.40k  707 GiB  23.42  771 GiB
    .rgw.root                  9   1.2 KiB  4        768 KiB  0      771 GiB
    default.rgw.control        10  0 B      8        0 B      0      771 GiB
    default.rgw.meta           11  1.2 KiB  8        1.3 MiB  0      771 GiB
    default.rgw.log            12  0 B      175      0 B      0      771 GiB
    default.rgw.buckets.index  13  0 B      1        0 B      0      771 GiB
    default.rgw.buckets.data   14  249 GiB  99.62k   753 GiB  24.57  771 GiB
-
The main culprit here seems to be the default.rgw.buckets.data pool, but also 
the rbd pool contains thin images.

As in the case of Joe, the autoscaler seems to look at the "USED" space, not at the 
"STORED" bytes:
-
  POOL                       SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
  default.rgw.meta           1344k                3.0   4092G         0.0000                1.0   8                   on
  default.rgw.buckets.index  0                    3.0   4092G         0.0000                1.0   8                   on
  default.rgw.control        0                    3.0   4092G         0.0000                1.0   8                   on
  default.rgw.buckets.data   788.6G               3.0   4092G         0.5782                1.0   128                 on
  .rgw.root                  768.0k               3.0   4092G         0.0000                1.0   8                   on
  rbd                        710.8G               3.0   4092G         0.5212                1.0   64                  on
  default.rgw.log            0                    3.0   4092G         0.0000                1.0   8                   on
-

This does seem like a bug to me. The warning actually fires on a cluster with 
35 % raw usage, and things are mostly balanced.
Is there already a tracker entry on this?

Cheers,
Oliver


On 2019-05-01 22:01, Joe Ryner wrote:

I think I have figured out the issue.


Re: [ceph-users] eu.ceph.com mirror out of sync?

2019-09-24 Thread Oliver Freyermuth

Dear Wido,

On 2019-09-24 08:53, Wido den Hollander wrote:



On 9/17/19 11:01 PM, Oliver Freyermuth wrote:

Dear Cephalopodians,

I realized just now that:
   https://eu.ceph.com/rpm-nautilus/el7/x86_64/
still holds only releases up to 14.2.2, and nothing is to be seen of
14.2.3 or 14.2.4,
while the main repository at:
   https://download.ceph.com/rpm-nautilus/el7/x86_64/
looks as expected.

Is this issue with the eu.ceph.com mirror already known?



I missed this message and I see what's going on. Going to fix it right away.

I manage this mirror.


many thanks, it looks like it's already fixed now, at least the new packages 
are popping up :-).

I'll also contact the other mirror owners whose mirrors appear to have issues 
or are out-of-sync, now that I have been pointed to the list of people managing 
them.

Cheers and thanks,
Oliver



Wido


Cheers,
 Oliver











Re: [ceph-users] eu.ceph.com mirror out of sync?

2019-09-23 Thread Oliver Freyermuth

Dear Matthew,

On 2019-09-24 01:50, Matthew Taylor wrote:

Hi David,

RedHat staff had transitioned the Mirror mailing list to a new domain + self 
hosted instance of Mailman on this date:


Subject:[Ceph-mirrors] FYI: Mailing list domain change
Date:   Mon, 17 Jun 2019 16:19:55 -0400
From:   David Galloway 
To: ceph-mirr...@lists.ceph.com



The new mirror list email is: ceph-mirr...@ceph.io

You can subscribe to the list via this URL: https://lists.ceph.io/postorius/lists/

Please note that the actual mirror "project" is quite loose and vastly ignored 
as mirrors can easily be considered as 'set and forget' once set up.


many thanks! This clarifies why I have never seen any mirror discussions and 
the fact that so many of the mirrors are either unreachable or out of sync at 
the same time
(well, probably since a long time, but I checked only now).



We used to have some strong advocates promoting improvement on the older mailing list (myself included); however, the list itself (old and new) has next to no traffic on it, including from RedHat staff. The list has been active since 2015-11-10 (thank you, Wido).

With that being said, and to be fair: the official docs at the time of writing don't really give any direction about the mailing list or the project itself:

https://docs.ceph.com/docs/master/install/mirrors/


As a "mirror user", indeed all this was very unclear to me, since those mirrors are 
"just part of the install instructions".



At this stage, I can really only suggest reaching out to the individual mirror 
maintainers should you have issues with them. Here is a list of current mirrors 
and their maintainer's contact info:

https://github.com/ceph/ceph/blob/master/mirroring/MIRRORS


Many thanks for this! This is really helpful.
I see Wido is there for the eu-mirror. Since he is usually very active on this 
list, I guess he is on well-deserved holidays which would explain the silence 
;-).

In any case, I will walk through the list later and contact those mirror 
operators whose mirrors are either out of date or unreachable.
An automated script checking https://MIRROR_URL/timestamp and alerting mirror owners could technically also do this.
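
A minimal sketch of such a check could look like this (the mirror list is just an assumption, and the exact format of the timestamp file is left to the reader):

  #!/bin/bash
  # report each mirror's timestamp file, warn if unreachable
  for m in eu de fr se uk au; do
    if ts=$(curl -fsS --max-time 10 "http://${m}.ceph.com/timestamp" 2>/dev/null); then
      echo "${m}.ceph.com: ${ts}"
    else
      echo "WARNING: ${m}.ceph.com unreachable or missing timestamp file"
    fi
  done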

Many thanks for the valuable information and your work in maintaining
au.ceph.com!

Cheers,
Oliver



Cheers,
Matthew.
(au.ceph.com maintainer)



On 24/9/19 6:48 am, David Majchrzak, ODERLAND Webbhotell AB wrote:

Hi,

I'll have a look at the status of se.ceph.com tomorrow morning, it's
maintained by us.

Kind Regards,

David


On mån, 2019-09-23 at 22:41 +0200, Oliver Freyermuth wrote:

Hi together,

the EU mirror still seems to be out-of-sync - does somebody on this
list happen to know whom to contact about this?
Or is this mirror unmaintained and we should switch to something
else?

Going through the list of appropriate mirrors from
https://docs.ceph.com/docs/master/install/mirrors/  (we are in
Germany) I also find:
http://de.ceph.com/
(the mirror in Germany) to be non-resolvable.

Closest by then for us is possibly France:
http://fr.ceph.com/rpm-nautilus/el7/x86_64/
but also here, there's only 14.2.2, so that's also out-of-sync.

So in the EU, at least geographically, this only leaves Sweden and
UK.
Sweden at se.ceph.com does not load for me, but UK indeed seems fine.

Should people in the EU use that mirror, or should we all just use
download.ceph.com instead of something geographically close-by?

Cheers,
Oliver


On 2019-09-17 23:01, Oliver Freyermuth wrote:

Dear Cephalopodians,

I realized just now that:
https://eu.ceph.com/rpm-nautilus/el7/x86_64/
still holds only releases up to 14.2.2, and nothing is to be seen
of 14.2.3 or 14.2.4,
while the main repository at:
https://download.ceph.com/rpm-nautilus/el7/x86_64/
looks as expected.

Is this issue with the eu.ceph.com mirror already known?

Cheers,
  Oliver




Re: [ceph-users] eu.ceph.com mirror out of sync?

2019-09-23 Thread Oliver Freyermuth

Hi together,

the EU mirror still seems to be out-of-sync - does somebody on this list happen 
to know whom to contact about this?
Or is this mirror unmaintained and we should switch to something else?

Going through the list of appropriate mirrors from 
https://docs.ceph.com/docs/master/install/mirrors/ (we are in Germany) I also 
find:
  http://de.ceph.com/
(the mirror in Germany) to be non-resolvable.

Closest by then for us is possibly France:
  http://fr.ceph.com/rpm-nautilus/el7/x86_64/
but also here, there's only 14.2.2, so that's also out-of-sync.

So in the EU, at least geographically, this only leaves Sweden and UK.
Sweden at se.ceph.com does not load for me, but UK indeed seems fine.

Should people in the EU use that mirror, or should we all just use 
download.ceph.com instead of something geographically close-by?

Cheers,
Oliver


On 2019-09-17 23:01, Oliver Freyermuth wrote:

Dear Cephalopodians,

I realized just now that:
   https://eu.ceph.com/rpm-nautilus/el7/x86_64/
still holds only releases up to 14.2.2, and nothing is to be seen of 14.2.3 or
14.2.4,
while the main repository at:
   https://download.ceph.com/rpm-nautilus/el7/x86_64/
looks as expected.

Is this issue with the eu.ceph.com mirror already known?

Cheers,
 Oliver









Re: [ceph-users] OSD's keep crasching after clusterreboot

2019-09-23 Thread Oliver Freyermuth

Hi together,

for those reading along: We had to turn off all OSDs holding our cephfs-data pool during the intervention; luckily, everything came back fine.
However, we managed to leave the MDS's, the OSDs holding the cephfs-metadata pool, and the MONs online. We restarted those sequentially afterwards, though.

So this probably means we are not affected by the upgrade bug - still, I would sleep better if somebody could confirm how to detect this bug and - if you are affected - how to edit the pool to fix it.

Cheers,
Oliver

On 2019-09-17 21:23, Oliver Freyermuth wrote:

Hi together,

it seems the issue described by Ansgar was reported and closed here as being 
fixed for newly created pools in post-Luminous releases:
https://tracker.ceph.com/issues/41336

However, it is unclear to me:
- How to find out if an EC cephfs you have created in Luminous is actually affected, 
before actually testing the "shutdown all" procedure,
   and thus having dying OSDs.
- If affected, how to fix it without purging the pool completely (which is not 
so easily done if there is 0.5 PB inside, which can't be restored without a 
long downtime).

If this is an acknowledged issue, it should probably also be mentioned in the 
upgrade notes from pre-Mimic to Mimic and newer before more people lose data.

In our case, we have such a CephFS on an EC pool created with Luminous, and are right now running Mimic 13.2.6, but never tried a "full shutdown".
We need to try that on Friday, though... (cooling system maintenance).

"osd dump" contains:

pool 1 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 1 object_hash 
rjenkins pg_num 128 pgp_num 128 last_change 40903 flags hashpspool stripe_width 
0 compression_algorithm snappy compression_mode aggressive application cephfs
pool 2 'cephfs_data' erasure size 6 min_size 5 crush_rule 2 object_hash 
rjenkins pg_num 4096 pgp_num 4096 last_change 40953 flags 
hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 
compression_algorithm snappy compression_mode aggressive application cephfs


and the EC profile is:

# ceph osd erasure-code-profile get cephfs_data
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8


Neither contains the stripe_unit explicitly, so I wonder how to find out if it 
is (in)valid.
Checking the xattr ceph.file.layout.stripe_unit of some "old" files on the FS 
reveals 4194304 in my case.

Any help appreciated.

Cheers and all the best,
 Oliver

On 09.08.19 at 08:54, Ansgar Jazdzewski wrote:

We got our OSD's back

Since we removed the EC pool (cephfs.data), we had to figure out how to
remove the PGs from the offline OSDs, and here is how we did it.

remove cephfs, remove cache layer, remove pools:
#ceph mds fail 0
#ceph fs rm cephfs --yes-i-really-mean-it
#ceph osd tier remove-overlay cephfs.data
there is now (or already was) no overlay for 'cephfs.data'
#ceph osd tier remove cephfs.data cephfs.cache
pool 'cephfs.cache' is now (or already was) not a tier of 'cephfs.data'
#ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
#ceph osd pool delete cephfs.cache cephfs.cache --yes-i-really-really-mean-it
pool 'cephfs.cache' removed
#ceph osd pool delete cephfs.data cephfs.data --yes-i-really-really-mean-it
pool 'cephfs.data' removed
#ceph osd pool delete cephfs.metadata cephfs.metadata
--yes-i-really-really-mean-it
pool 'cephfs.metadata' removed

remove placement groups of pool 23 (cephfs.data) from all offline OSDs:
DATAPATH=/var/lib/ceph/osd/ceph-${OSD}
a=`ceph-objectstore-tool --data-path ${DATAPATH} --op list-pgs | grep "^23\."`
for i in $a; do
   echo "INFO: removing ${i} from OSD ${OSD}"
   ceph-objectstore-tool --data-path ${DATAPATH} --pgid ${i} --op remove --force
done

Since we have now removed our cephfs, we still do not know if we could have
solved it without data loss by upgrading to nautilus.

Have a nice Weekend,
Ansgar

On Wed, 7 Aug 2019 at 17:03, Ansgar Jazdzewski wrote:


another update,

we now took the more destructive route and removed the cephfs pools
(luckily we had only test data in the filesystem).
Our hope was that during the startup process the OSDs would delete the
no longer needed PGs, but this is NOT the case.

So we still have the same issue; the only difference is that the PG
does not belong to a pool anymore.

  -360> 2019-08-07 14:52:32.655 7fb14db8de00  5 osd.44 pg_epoch: 196586
pg[23.f8s0(unlocked)] enter Initial
  -360> 2019-08-07 14:52:32.659 7fb14db8de00 -1
/build/ceph-13.2.6/src/osd/ECUtil.h: In function
'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)' thread
7fb14db8de00 time 2019-08-07 14:52:32.660169
/build/ceph-13.

[ceph-users] eu.ceph.com mirror out of sync?

2019-09-17 Thread Oliver Freyermuth

Dear Cephalopodians,

I realized just now that:
  https://eu.ceph.com/rpm-nautilus/el7/x86_64/
still holds only releases up to 14.2.2, and nothing is to be seen of 14.2.3 or
14.2.4,
while the main repository at:
  https://download.ceph.com/rpm-nautilus/el7/x86_64/
looks as expected.

Is this issue with the eu.ceph.com mirror already known?

Cheers,
Oliver





Re: [ceph-users] OSD's keep crasching after clusterreboot

2019-09-17 Thread Oliver Freyermuth

Hi together,

it seems the issue described by Ansgar was reported and closed here as being 
fixed for newly created pools in post-Luminous releases:
https://tracker.ceph.com/issues/41336

However, it is unclear to me:
- How to find out if an EC cephfs you have created in Luminous is actually affected, 
before actually testing the "shutdown all" procedure,
  and thus having dying OSDs.
- If affected, how to fix it without purging the pool completely (which is not 
so easily done if there is 0.5 PB inside, which can't be restored without a 
long downtime).

If this is an acknowledged issue, it should probably also be mentioned in the 
upgrade notes from pre-Mimic to Mimic and newer before more people lose data.

In our case, we have such a CephFS on an EC pool created with Luminous, and are right now running Mimic 13.2.6, but never tried a "full shutdown".
We need to try that on Friday, though... (cooling system maintenance).

"osd dump" contains:

pool 1 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 1 object_hash 
rjenkins pg_num 128 pgp_num 128 last_change 40903 flags hashpspool stripe_width 
0 compression_algorithm snappy compression_mode aggressive application cephfs
pool 2 'cephfs_data' erasure size 6 min_size 5 crush_rule 2 object_hash 
rjenkins pg_num 4096 pgp_num 4096 last_change 40953 flags 
hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 
compression_algorithm snappy compression_mode aggressive application cephfs


and the EC profile is:

# ceph osd erasure-code-profile get cephfs_data
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8


Neither contains the stripe_unit explicitly, so I wonder how to find out if it 
is (in)valid.
Checking the xattr ceph.file.layout.stripe_unit of some "old" files on the FS 
reveals 4194304 in my case.
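
For reference, that check was done along these lines (a sketch; the mount point and paths are placeholders, assuming a mounted CephFS):

  getfattr -n ceph.file.layout /mnt/cephfs/some/old/file
  getfattr -n ceph.file.layout.stripe_unit /mnt/cephfs/some/old/file
  # note: ceph.dir.layout only exists on directories with an explicit layout set
  getfattr -n ceph.dir.layout /mnt/cephfs/some/directory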

Any help appreciated.

Cheers and all the best,
Oliver

On 09.08.19 at 08:54, Ansgar Jazdzewski wrote:

We got our OSD's back

Since we removed the EC pool (cephfs.data), we had to figure out how to
remove the PGs from the offline OSDs, and here is how we did it.

remove cephfs, remove cache layer, remove pools:
#ceph mds fail 0
#ceph fs rm cephfs --yes-i-really-mean-it
#ceph osd tier remove-overlay cephfs.data
there is now (or already was) no overlay for 'cephfs.data'
#ceph osd tier remove cephfs.data cephfs.cache
pool 'cephfs.cache' is now (or already was) not a tier of 'cephfs.data'
#ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
#ceph osd pool delete cephfs.cache cephfs.cache --yes-i-really-really-mean-it
pool 'cephfs.cache' removed
#ceph osd pool delete cephfs.data cephfs.data --yes-i-really-really-mean-it
pool 'cephfs.data' removed
#ceph osd pool delete cephfs.metadata cephfs.metadata
--yes-i-really-really-mean-it
pool 'cephfs.metadata' removed

remove placement groups of pool 23 (cephfs.data) from all offline OSDs:
DATAPATH=/var/lib/ceph/osd/ceph-${OSD}
a=`ceph-objectstore-tool --data-path ${DATAPATH} --op list-pgs | grep "^23\."`
for i in $a; do
   echo "INFO: removing ${i} from OSD ${OSD}"
   ceph-objectstore-tool --data-path ${DATAPATH} --pgid ${i} --op remove --force
done

Since we have now removed our cephfs, we still do not know if we could have
solved it without data loss by upgrading to nautilus.

Have a nice Weekend,
Ansgar

On Wed, 7 Aug 2019 at 17:03, Ansgar Jazdzewski wrote:


another update,

we now took the more destructive route and removed the cephfs pools
(luckily we had only test data in the filesystem).
Our hope was that during the startup process the OSDs would delete the
no longer needed PGs, but this is NOT the case.

So we still have the same issue; the only difference is that the PG
does not belong to a pool anymore.

  -360> 2019-08-07 14:52:32.655 7fb14db8de00  5 osd.44 pg_epoch: 196586
pg[23.f8s0(unlocked)] enter Initial
  -360> 2019-08-07 14:52:32.659 7fb14db8de00 -1
/build/ceph-13.2.6/src/osd/ECUtil.h: In function
'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)' thread
7fb14db8de00 time 2019-08-07 14:52:32.660169
/build/ceph-13.2.6/src/osd/ECUtil.h: 34: FAILED assert(stripe_width %
stripe_size == 0)

We can now take one route and try to delete the PG by hand in the OSD
(bluestore) - how can this be done? Or we try to upgrade to Nautilus and
hope for the best.

Any help or hints are welcome,
have a nice one
Ansgar

On Wed, 7 Aug 2019 at 11:32, Ansgar Jazdzewski wrote:


Hi,

as a follow-up:
* a full log of one OSD failing to start https://pastebin.com/T8UQ2rZ6
* our ec-pool creation in the first place https://pastebin.com/20cC06Jn
* ceph osd dump and ceph osd erasure-code-profile get cephfs
https://pastebin.com/TRLPaWcH

as we try to dig more into it, it looks like a bug 

Re: [ceph-users] Ceph RBD Mirroring

2019-09-14 Thread Oliver Freyermuth
Dear Jason,

On 15.09.19 at 00:03, Jason Dillaman wrote:
> I was able to repeat this issue locally by restarting the primary OSD
> for the "rbd_mirroring" object. It seems that a regression was
> introduced w/ the introduction of Ceph msgr2 in that upon reconnect,
> the connection type for the client switches from ANY to V2 -- but only
> for the watcher session and not the status updates. I've opened a
> tracker ticker for this issue [1].
> 
> Thanks.

many thanks to you for the detailed investigation and reproduction!
While I did not restart the first 5 OSDs of the test cluster, I added an OSD 
and rebalanced - so I guess this can also be triggered if the primary OSD for 
the object changes,
which should of course also lead to a reconnection. 
I can also add to my observations that now, while I am not touching the cluster anymore, things stay in "up+replaying".

Thanks and all the best,
Oliver

> 
> On Fri, Sep 13, 2019 at 12:44 PM Oliver Freyermuth
>  wrote:
>>
>> On 13.09.19 at 18:38, Jason Dillaman wrote:
>>> On Fri, Sep 13, 2019 at 11:30 AM Oliver Freyermuth
>>>  wrote:
>>>>
>>>> On 13.09.19 at 17:18, Jason Dillaman wrote:
>>>>> On Fri, Sep 13, 2019 at 10:41 AM Oliver Freyermuth
>>>>>  wrote:
>>>>>>
>>>>>> On 13.09.19 at 16:30, Jason Dillaman wrote:
>>>>>>> On Fri, Sep 13, 2019 at 10:17 AM Jason Dillaman  
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> On Fri, Sep 13, 2019 at 10:02 AM Oliver Freyermuth
>>>>>>>>  wrote:
>>>>>>>>>
>>>>>>>>> Dear Jason,
>>>>>>>>>
>>>>>>>>> thanks for the very detailed explanation! This was very instructive.
>>>>>>>>> Sadly, the watchers look correct - see details inline.
>>>>>>>>>
>>>>>>>>> On 13.09.19 at 15:02, Jason Dillaman wrote:
>>>>>>>>>> On Thu, Sep 12, 2019 at 9:55 PM Oliver Freyermuth
>>>>>>>>>>  wrote:
>>>>>>>>>>>
>>>>>>>>>>> Dear Jason,
>>>>>>>>>>>
>>>>>>>>>>> thanks for taking care and developing a patch so quickly!
>>>>>>>>>>>
>>>>>>>>>>> I have another strange observation to share. In our test setup, 
>>>>>>>>>>> only a single RBD mirroring daemon is running for 51 images.
>>>>>>>>>>> It works fine with a constant stream of 1-2 MB/s, but at some point 
>>>>>>>>>>> after roughly 20 hours, _all_ images go to this interesting state:
>>>>>>>>>>> -
>>>>>>>>>>> # rbd mirror image status test-vm.X-disk2
>>>>>>>>>>> test-vm.X-disk2:
>>>>>>>>>>>   global_id:   XXX
>>>>>>>>>>>   state:   down+replaying
>>>>>>>>>>>   description: replaying, master_position=[object_number=14, 
>>>>>>>>>>> tag_tid=6, entry_tid=6338], mirror_position=[object_number=14, 
>>>>>>>>>>> tag_tid=6, entry_tid=6338], entries_behind_master=0
>>>>>>>>>>>   last_update: 2019-09-13 03:45:43
>>>>>>>>>>> -
>>>>>>>>>>> Running this command several times, I see entry_tid increasing at 
>>>>>>>>>>> both ends, so mirroring seems to be working just fine.
>>>>>>>>>>>
>>>>>>>>>>> However:
>>>>>>>>>>> -
>>>>>>>>>>> # rbd mirror pool status
>>>>>>>>>>> health: WARNING
>>>>>>>>>>> images: 51 total
>>>>>>>>>>> 51 unknown
>>>>>>>>>>> -
>>>>>>>>>>> The health warning is not visible in the dashboard (also not in the 
>>>>>>>>>>> mirroring menu), the daemon still seems to be running, dropped 
>>>>>>>>>>> nothing in the logs,

Re: [ceph-users] Ceph RBD Mirroring

2019-09-13 Thread Oliver Freyermuth

On 13.09.19 at 18:38, Jason Dillaman wrote:

On Fri, Sep 13, 2019 at 11:30 AM Oliver Freyermuth
 wrote:


On 13.09.19 at 17:18, Jason Dillaman wrote:

On Fri, Sep 13, 2019 at 10:41 AM Oliver Freyermuth
 wrote:


On 13.09.19 at 16:30, Jason Dillaman wrote:

On Fri, Sep 13, 2019 at 10:17 AM Jason Dillaman  wrote:


On Fri, Sep 13, 2019 at 10:02 AM Oliver Freyermuth
 wrote:


Dear Jason,

thanks for the very detailed explanation! This was very instructive.
Sadly, the watchers look correct - see details inline.

On 13.09.19 at 15:02, Jason Dillaman wrote:

On Thu, Sep 12, 2019 at 9:55 PM Oliver Freyermuth
 wrote:


Dear Jason,

thanks for taking care and developing a patch so quickly!

I have another strange observation to share. In our test setup, only a single 
RBD mirroring daemon is running for 51 images.
It works fine with a constant stream of 1-2 MB/s, but at some point after 
roughly 20 hours, _all_ images go to this interesting state:
-
# rbd mirror image status test-vm.X-disk2
test-vm.X-disk2:
  global_id:   XXX
  state:   down+replaying
  description: replaying, master_position=[object_number=14, tag_tid=6, 
entry_tid=6338], mirror_position=[object_number=14, tag_tid=6, entry_tid=6338], 
entries_behind_master=0
  last_update: 2019-09-13 03:45:43
-
Running this command several times, I see entry_tid increasing at both ends, so 
mirroring seems to be working just fine.

However:
-
# rbd mirror pool status
health: WARNING
images: 51 total
51 unknown
-
The health warning is not visible in the dashboard (also not in the mirroring 
menu), the daemon still seems to be running, dropped nothing in the logs,
and claims to be "ok" in the dashboard - it's only that all images show up in 
unknown state even though all seems to be working fine.

Any idea on how to debug this?
When I restart the rbd-mirror service, all images come back as green. I already 
encountered this twice in 3 days.


The dashboard relies on the rbd-mirror daemon to provide it errors and
warnings. You can see the status reported by rbd-mirror by running
"ceph service status":

$ ceph service status
{
"rbd-mirror": {
"4152": {
"status_stamp": "2019-09-13T08:58:41.937491-0400",
"last_beacon": "2019-09-13T08:58:41.937491-0400",
"status": {
"json":
"{\"1\":{\"name\":\"mirror\",\"callouts\":{},\"image_assigned_count\":1,\"image_error_count\":0,\"image_local_count\":1,\"image_remote_count\":1,\"image_warning_count\":0,\"instance_id\":\"4154\",\"leader\":true},\"2\":{\"name\":\"mirror_parent\",\"callouts\":{},\"image_assigned_count\":0,\"image_error_count\":0,\"image_local_count\":0,\"image_remote_count\":0,\"image_warning_count\":0,\"instance_id\":\"4156\",\"leader\":true}}"
}
}
}
}

In your case, most likely it seems like rbd-mirror thinks all is good
with the world so it's not reporting any errors.


This is indeed the case:

# ceph service status
{
"rbd-mirror": {
"84243": {
"status_stamp": "2019-09-13 15:40:01.149815",
"last_beacon": "2019-09-13 15:40:26.151381",
"status": {
"json": 
"{\"2\":{\"name\":\"rbd\",\"callouts\":{},\"image_assigned_count\":51,\"image_error_count\":0,\"image_local_count\":51,\"image_remote_count\":51,\"image_warning_count\":0,\"instance_id\":\"84247\",\"leader\":true}}"
}
}
},
"rgw": {
...
}
}


The "down" state indicates that the rbd-mirror daemon isn't correctly
watching the "rbd_mirroring" object in the pool. You can see who it
watching that object by running the "rados" "listwatchers" command:

$ rados -p  listwatchers rbd_mirroring
watcher=1.2.3.4:0/199388543 client.4154 cookie=94769010788992
watcher=1.2.3.4:0/199388543 client.4154 cookie=94769061031424

In my case, the "4154" from "client.4154" is the unique global id for
my connection to the cluster, which relates back to the "ceph service
status" dump which also shows status by daemon using the unique global
id.


Sadly(?), this looks as expected:

# rados -p rb

Re: [ceph-users] Ceph RBD Mirroring

2019-09-13 Thread Oliver Freyermuth

On 13.09.19 at 17:18, Jason Dillaman wrote:

On Fri, Sep 13, 2019 at 10:41 AM Oliver Freyermuth
 wrote:


On 13.09.19 at 16:30, Jason Dillaman wrote:

On Fri, Sep 13, 2019 at 10:17 AM Jason Dillaman  wrote:


On Fri, Sep 13, 2019 at 10:02 AM Oliver Freyermuth
 wrote:


Dear Jason,

thanks for the very detailed explanation! This was very instructive.
Sadly, the watchers look correct - see details inline.

On 13.09.19 at 15:02, Jason Dillaman wrote:

On Thu, Sep 12, 2019 at 9:55 PM Oliver Freyermuth
 wrote:


Dear Jason,

thanks for taking care and developing a patch so quickly!

I have another strange observation to share. In our test setup, only a single 
RBD mirroring daemon is running for 51 images.
It works fine with a constant stream of 1-2 MB/s, but at some point after 
roughly 20 hours, _all_ images go to this interesting state:
-
# rbd mirror image status test-vm.X-disk2
test-vm.X-disk2:
 global_id:   XXX
 state:   down+replaying
 description: replaying, master_position=[object_number=14, tag_tid=6, 
entry_tid=6338], mirror_position=[object_number=14, tag_tid=6, entry_tid=6338], 
entries_behind_master=0
 last_update: 2019-09-13 03:45:43
-
Running this command several times, I see entry_tid increasing at both ends, so 
mirroring seems to be working just fine.

However:
-
# rbd mirror pool status
health: WARNING
images: 51 total
   51 unknown
-
The health warning is not visible in the dashboard (also not in the mirroring 
menu), the daemon still seems to be running, dropped nothing in the logs,
and claims to be "ok" in the dashboard - it's only that all images show up in 
unknown state even though all seems to be working fine.

Any idea on how to debug this?
When I restart the rbd-mirror service, all images come back as green. I already 
encountered this twice in 3 days.


The dashboard relies on the rbd-mirror daemon to provide it errors and
warnings. You can see the status reported by rbd-mirror by running
"ceph service status":

$ ceph service status
{
   "rbd-mirror": {
   "4152": {
   "status_stamp": "2019-09-13T08:58:41.937491-0400",
   "last_beacon": "2019-09-13T08:58:41.937491-0400",
   "status": {
   "json":
"{\"1\":{\"name\":\"mirror\",\"callouts\":{},\"image_assigned_count\":1,\"image_error_count\":0,\"image_local_count\":1,\"image_remote_count\":1,\"image_warning_count\":0,\"instance_id\":\"4154\",\"leader\":true},\"2\":{\"name\":\"mirror_parent\",\"callouts\":{},\"image_assigned_count\":0,\"image_error_count\":0,\"image_local_count\":0,\"image_remote_count\":0,\"image_warning_count\":0,\"instance_id\":\"4156\",\"leader\":true}}"
   }
   }
   }
}

In your case, most likely it seems like rbd-mirror thinks all is good
with the world so it's not reporting any errors.


This is indeed the case:

# ceph service status
{
   "rbd-mirror": {
   "84243": {
   "status_stamp": "2019-09-13 15:40:01.149815",
   "last_beacon": "2019-09-13 15:40:26.151381",
   "status": {
   "json": 
"{\"2\":{\"name\":\"rbd\",\"callouts\":{},\"image_assigned_count\":51,\"image_error_count\":0,\"image_local_count\":51,\"image_remote_count\":51,\"image_warning_count\":0,\"instance_id\":\"84247\",\"leader\":true}}"
   }
   }
   },
   "rgw": {
...
   }
}


The "down" state indicates that the rbd-mirror daemon isn't correctly
watching the "rbd_mirroring" object in the pool. You can see who it
watching that object by running the "rados" "listwatchers" command:

$ rados -p  listwatchers rbd_mirroring
watcher=1.2.3.4:0/199388543 client.4154 cookie=94769010788992
watcher=1.2.3.4:0/199388543 client.4154 cookie=94769061031424

In my case, the "4154" from "client.4154" is the unique global id for
my connection to the cluster, which relates back to the "ceph service
status" dump which also shows status by daemon using the unique global
id.


Sadly(?), this looks as expected:

# rados -p rbd listwatchers rbd_mirroring
watcher=10.160.19.240:0/2922488671 client.84247 cookie=139770046978672
watcher=10.160.19.240:0

Re: [ceph-users] Ceph RBD Mirroring

2019-09-13 Thread Oliver Freyermuth

On 13.09.19 at 16:30, Jason Dillaman wrote:

On Fri, Sep 13, 2019 at 10:17 AM Jason Dillaman  wrote:


On Fri, Sep 13, 2019 at 10:02 AM Oliver Freyermuth
 wrote:


Dear Jason,

thanks for the very detailed explanation! This was very instructive.
Sadly, the watchers look correct - see details inline.

On 13.09.19 at 15:02, Jason Dillaman wrote:

On Thu, Sep 12, 2019 at 9:55 PM Oliver Freyermuth
 wrote:


Dear Jason,

thanks for taking care and developing a patch so quickly!

I have another strange observation to share. In our test setup, only a single 
RBD mirroring daemon is running for 51 images.
It works fine with a constant stream of 1-2 MB/s, but at some point after 
roughly 20 hours, _all_ images go to this interesting state:
-
# rbd mirror image status test-vm.X-disk2
test-vm.X-disk2:
global_id:   XXX
state:   down+replaying
description: replaying, master_position=[object_number=14, tag_tid=6, 
entry_tid=6338], mirror_position=[object_number=14, tag_tid=6, entry_tid=6338], 
entries_behind_master=0
last_update: 2019-09-13 03:45:43
-
Running this command several times, I see entry_tid increasing at both ends, so 
mirroring seems to be working just fine.

However:
-
# rbd mirror pool status
health: WARNING
images: 51 total
  51 unknown
-
The health warning is not visible in the dashboard (also not in the mirroring 
menu), the daemon still seems to be running, dropped nothing in the logs,
and claims to be "ok" in the dashboard - it's only that all images show up in 
unknown state even though all seems to be working fine.

Any idea on how to debug this?
When I restart the rbd-mirror service, all images come back as green. I already 
encountered this twice in 3 days.


The dashboard relies on the rbd-mirror daemon to provide it errors and
warnings. You can see the status reported by rbd-mirror by running
"ceph service status":

$ ceph service status
{
  "rbd-mirror": {
  "4152": {
  "status_stamp": "2019-09-13T08:58:41.937491-0400",
  "last_beacon": "2019-09-13T08:58:41.937491-0400",
  "status": {
  "json":
"{\"1\":{\"name\":\"mirror\",\"callouts\":{},\"image_assigned_count\":1,\"image_error_count\":0,\"image_local_count\":1,\"image_remote_count\":1,\"image_warning_count\":0,\"instance_id\":\"4154\",\"leader\":true},\"2\":{\"name\":\"mirror_parent\",\"callouts\":{},\"image_assigned_count\":0,\"image_error_count\":0,\"image_local_count\":0,\"image_remote_count\":0,\"image_warning_count\":0,\"instance_id\":\"4156\",\"leader\":true}}"
  }
  }
  }
}

In your case, most likely it seems like rbd-mirror thinks all is good
with the world so it's not reporting any errors.


This is indeed the case:

# ceph service status
{
  "rbd-mirror": {
  "84243": {
  "status_stamp": "2019-09-13 15:40:01.149815",
  "last_beacon": "2019-09-13 15:40:26.151381",
  "status": {
  "json": 
"{\"2\":{\"name\":\"rbd\",\"callouts\":{},\"image_assigned_count\":51,\"image_error_count\":0,\"image_local_count\":51,\"image_remote_count\":51,\"image_warning_count\":0,\"instance_id\":\"84247\",\"leader\":true}}"
  }
  }
  },
  "rgw": {
...
  }
}


The "down" state indicates that the rbd-mirror daemon isn't correctly
watching the "rbd_mirroring" object in the pool. You can see who it
watching that object by running the "rados" "listwatchers" command:

$ rados -p  listwatchers rbd_mirroring
watcher=1.2.3.4:0/199388543 client.4154 cookie=94769010788992
watcher=1.2.3.4:0/199388543 client.4154 cookie=94769061031424

In my case, the "4154" from "client.4154" is the unique global id for
my connection to the cluster, which relates back to the "ceph service
status" dump which also shows status by daemon using the unique global
id.


Sadly(?), this looks as expected:

# rados -p rbd listwatchers rbd_mirroring
watcher=10.160.19.240:0/2922488671 client.84247 cookie=139770046978672
watcher=10.160.19.240:0/2922488671 client.84247 cookie=139771389162560


Hmm, the unique id is different (84243 vs 84247). I wouldn't have
expected the global id to

Re: [ceph-users] Ceph RBD Mirroring

2019-09-13 Thread Oliver Freyermuth

On 13.09.19 at 16:17, Jason Dillaman wrote:

On Fri, Sep 13, 2019 at 10:02 AM Oliver Freyermuth
 wrote:


Dear Jason,

thanks for the very detailed explanation! This was very instructive.
Sadly, the watchers look correct - see details inline.

On 13.09.19 at 15:02, Jason Dillaman wrote:

On Thu, Sep 12, 2019 at 9:55 PM Oliver Freyermuth
 wrote:


Dear Jason,

thanks for taking care and developing a patch so quickly!

I have another strange observation to share. In our test setup, only a single 
RBD mirroring daemon is running for 51 images.
It works fine with a constant stream of 1-2 MB/s, but at some point after 
roughly 20 hours, _all_ images go to this interesting state:
-
# rbd mirror image status test-vm.X-disk2
test-vm.X-disk2:
global_id:   XXX
state:   down+replaying
description: replaying, master_position=[object_number=14, tag_tid=6, 
entry_tid=6338], mirror_position=[object_number=14, tag_tid=6, entry_tid=6338], 
entries_behind_master=0
last_update: 2019-09-13 03:45:43
-
Running this command several times, I see entry_tid increasing at both ends, so 
mirroring seems to be working just fine.

However:
-
# rbd mirror pool status
health: WARNING
images: 51 total
  51 unknown
-
The health warning is not visible in the dashboard (also not in the mirroring 
menu), the daemon still seems to be running, dropped nothing in the logs,
and claims to be "ok" in the dashboard - it's only that all images show up in 
unknown state even though all seems to be working fine.

Any idea on how to debug this?
When I restart the rbd-mirror service, all images come back as green. I already 
encountered this twice in 3 days.


The dashboard relies on the rbd-mirror daemon to provide it errors and
warnings. You can see the status reported by rbd-mirror by running
"ceph service status":

$ ceph service status
{
  "rbd-mirror": {
  "4152": {
  "status_stamp": "2019-09-13T08:58:41.937491-0400",
  "last_beacon": "2019-09-13T08:58:41.937491-0400",
  "status": {
  "json":
"{\"1\":{\"name\":\"mirror\",\"callouts\":{},\"image_assigned_count\":1,\"image_error_count\":0,\"image_local_count\":1,\"image_remote_count\":1,\"image_warning_count\":0,\"instance_id\":\"4154\",\"leader\":true},\"2\":{\"name\":\"mirror_parent\",\"callouts\":{},\"image_assigned_count\":0,\"image_error_count\":0,\"image_local_count\":0,\"image_remote_count\":0,\"image_warning_count\":0,\"instance_id\":\"4156\",\"leader\":true}}"
  }
  }
  }
}

In your case, most likely it seems like rbd-mirror thinks all is good
with the world so it's not reporting any errors.


This is indeed the case:

# ceph service status
{
  "rbd-mirror": {
  "84243": {
  "status_stamp": "2019-09-13 15:40:01.149815",
  "last_beacon": "2019-09-13 15:40:26.151381",
  "status": {
  "json": 
"{\"2\":{\"name\":\"rbd\",\"callouts\":{},\"image_assigned_count\":51,\"image_error_count\":0,\"image_local_count\":51,\"image_remote_count\":51,\"image_warning_count\":0,\"instance_id\":\"84247\",\"leader\":true}}"
  }
  }
  },
  "rgw": {
...
  }
}


The "down" state indicates that the rbd-mirror daemon isn't correctly
watching the "rbd_mirroring" object in the pool. You can see who it
watching that object by running the "rados" "listwatchers" command:

$ rados -p  listwatchers rbd_mirroring
watcher=1.2.3.4:0/199388543 client.4154 cookie=94769010788992
watcher=1.2.3.4:0/199388543 client.4154 cookie=94769061031424

In my case, the "4154" from "client.4154" is the unique global id for
my connection to the cluster, which relates back to the "ceph service
status" dump which also shows status by daemon using the unique global
id.


Sadly(?), this looks as expected:

# rados -p rbd listwatchers rbd_mirroring
watcher=10.160.19.240:0/2922488671 client.84247 cookie=139770046978672
watcher=10.160.19.240:0/2922488671 client.84247 cookie=139771389162560


Hmm, the unique id is different (84243 vs 84247). I wouldn't have
expected the global id to have changed. Did you restart the Ceph
cluster or MON

Re: [ceph-users] Ceph RBD Mirroring

2019-09-13 Thread Oliver Freyermuth

Dear Jason,

thanks for the very detailed explanation! This was very instructive.
Sadly, the watchers look correct - see details inline.

On 13.09.19 at 15:02, Jason Dillaman wrote:

On Thu, Sep 12, 2019 at 9:55 PM Oliver Freyermuth
 wrote:


Dear Jason,

thanks for taking care and developing a patch so quickly!

I have another strange observation to share. In our test setup, only a single 
RBD mirroring daemon is running for 51 images.
It works fine with a constant stream of 1-2 MB/s, but at some point after 
roughly 20 hours, _all_ images go to this interesting state:
-
# rbd mirror image status test-vm.X-disk2
test-vm.X-disk2:
   global_id:   XXX
   state:   down+replaying
   description: replaying, master_position=[object_number=14, tag_tid=6, 
entry_tid=6338], mirror_position=[object_number=14, tag_tid=6, entry_tid=6338], 
entries_behind_master=0
   last_update: 2019-09-13 03:45:43
-
Running this command several times, I see entry_tid increasing at both ends, so 
mirroring seems to be working just fine.

However:
-
# rbd mirror pool status
health: WARNING
images: 51 total
 51 unknown
-
The health warning is not visible in the dashboard (also not in the mirroring 
menu), the daemon still seems to be running, dropped nothing in the logs,
and claims to be "ok" in the dashboard - it's only that all images show up in 
unknown state even though all seems to be working fine.

Any idea on how to debug this?
When I restart the rbd-mirror service, all images come back as green. I already 
encountered this twice in 3 days.


The dashboard relies on the rbd-mirror daemon to provide it errors and
warnings. You can see the status reported by rbd-mirror by running
"ceph service status":

$ ceph service status
{
 "rbd-mirror": {
 "4152": {
 "status_stamp": "2019-09-13T08:58:41.937491-0400",
 "last_beacon": "2019-09-13T08:58:41.937491-0400",
 "status": {
 "json":
"{\"1\":{\"name\":\"mirror\",\"callouts\":{},\"image_assigned_count\":1,\"image_error_count\":0,\"image_local_count\":1,\"image_remote_count\":1,\"image_warning_count\":0,\"instance_id\":\"4154\",\"leader\":true},\"2\":{\"name\":\"mirror_parent\",\"callouts\":{},\"image_assigned_count\":0,\"image_error_count\":0,\"image_local_count\":0,\"image_remote_count\":0,\"image_warning_count\":0,\"instance_id\":\"4156\",\"leader\":true}}"
 }
 }
 }
}

In your case, most likely it seems like rbd-mirror thinks all is good
with the world so it's not reporting any errors.


This is indeed the case:

# ceph service status
{
"rbd-mirror": {
"84243": {
"status_stamp": "2019-09-13 15:40:01.149815",
"last_beacon": "2019-09-13 15:40:26.151381",
"status": {
"json": 
"{\"2\":{\"name\":\"rbd\",\"callouts\":{},\"image_assigned_count\":51,\"image_error_count\":0,\"image_local_count\":51,\"image_remote_count\":51,\"image_warning_count\":0,\"instance_id\":\"84247\",\"leader\":true}}"
}
}
},
"rgw": {
...
}
}


The "down" state indicates that the rbd-mirror daemon isn't correctly
watching the "rbd_mirroring" object in the pool. You can see who it
watching that object by running the "rados" "listwatchers" command:

$ rados -p  listwatchers rbd_mirroring
watcher=1.2.3.4:0/199388543 client.4154 cookie=94769010788992
watcher=1.2.3.4:0/199388543 client.4154 cookie=94769061031424

In my case, the "4154" from "client.4154" is the unique global id for
my connection to the cluster, which relates back to the "ceph service
status" dump which also shows status by daemon using the unique global
id.


Sadly(?), this looks as expected:

# rados -p rbd listwatchers rbd_mirroring
watcher=10.160.19.240:0/2922488671 client.84247 cookie=139770046978672
watcher=10.160.19.240:0/2922488671 client.84247 cookie=139771389162560

However, the dashboard still shows those images in "unknown", and this also 
shows up via command line:

# rbd mirror pool status
health: WARNING
images: 51 total
51 unknown
# rbd mirror image status test-vm.physik.uni-bonn.de-disk1
test-vm.physik.uni-bonn.de-disk2:
  global_

Re: [ceph-users] Ceph RBD Mirroring

2019-09-12 Thread Oliver Freyermuth
Dear Jason,

thanks for taking care and developing a patch so quickly! 

I have another strange observation to share. In our test setup, only a single 
RBD mirroring daemon is running for 51 images. 
It works fine with a constant stream of 1-2 MB/s, but at some point after 
roughly 20 hours, _all_ images go to this interesting state:
-
# rbd mirror image status test-vm.X-disk2
test-vm.X-disk2:
  global_id:   XXX
  state:   down+replaying
  description: replaying, master_position=[object_number=14, tag_tid=6, 
entry_tid=6338], mirror_position=[object_number=14, tag_tid=6, entry_tid=6338], 
entries_behind_master=0
  last_update: 2019-09-13 03:45:43
-
Running this command several times, I see entry_tid increasing at both ends, so 
mirroring seems to be working just fine. 

However:
-
# rbd mirror pool status
health: WARNING
images: 51 total
51 unknown
-
The health warning is not visible in the dashboard (also not in the mirroring 
menu), the daemon still seems to be running, has logged nothing suspicious,
and claims to be "ok" in the dashboard - it's only that all images show up in the 
"unknown" state even though everything seems to be working fine. 

Any idea on how to debug this? 
When I restart the rbd-mirror service, all images come back as green. I already 
encountered this twice in 3 days. 

Any idea on this (or how I can extract more information)? 
I fear keeping high-level debug logs active for ~24h is not feasible. 
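
(One thing I could still do is bump the debug level only on the running daemon via its 
admin socket once the images have gone to "unknown", instead of keeping it high for a 
day - a rough sketch, where the socket path is made up and needs to be adjusted to the 
actual asok of the rbd-mirror instance:)
-
# raise rbd-mirror / journaler debugging only while the problem is visible
ceph daemon /var/run/ceph/ceph-client.rbd_mirror_backup.<pid>.asok config set debug_rbd_mirror 20
ceph daemon /var/run/ceph/ceph-client.rbd_mirror_backup.<pid>.asok config set debug_journaler 20
# ... collect logs, then lower the levels again
ceph daemon /var/run/ceph/ceph-client.rbd_mirror_backup.<pid>.asok config set debug_rbd_mirror 0
-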

Cheers,
Oliver


On 2019-09-11 19:14, Jason Dillaman wrote:
> On Wed, Sep 11, 2019 at 12:57 PM Oliver Freyermuth
>  wrote:
>>
>> Dear Jason,
>>
>> I played a bit more with rbd mirroring and learned that deleting an image at 
>> the source (or disabling journaling on it) immediately moves the image to 
>> trash at the target -
>> but setting rbd_mirroring_delete_delay helps to have some more grace time to 
>> catch human mistakes.
>>
>> However, I have issues restoring such an image which has been moved to trash 
>> by the RBD-mirror daemon as user:
>> ---
>> [root@mon001 ~]# rbd trash ls -la
>> ID           NAME             SOURCE    DELETED_AT               STATUS                                   PARENT
>> d4fbe8f63905 test-vm-XX-disk2 MIRRORING Wed Sep 11 18:43:14 2019 protected until Thu Sep 12 18:43:14 2019
>> [root@mon001 ~]# rbd trash restore --image foo-image d4fbe8f63905
>> rbd: restore error: 2019-09-11 18:50:15.387 7f5fa9590b00 -1 
>> librbd::api::Trash: restore: Current trash source: mirroring does not match 
>> expected: user
>> (22) Invalid argument
>> ---
>> This is issued on the mon, which has the client.admin key, so it should not 
>> be a permission issue.
>> It also fails when I try that in the Dashboard.
>>
>> Sadly, the error message is not clear enough for me to figure out what could 
>> be the problem - do you see what I did wrong?
> 
> Good catch, it looks like we accidentally broke this in Nautilus when
> image live-migration support was added. I've opened a new tracker
> ticket to fix this [1].
> 
>> Cheers and thanks again,
>> Oliver
>>
>> On 2019-09-10 23:17, Oliver Freyermuth wrote:
>>> Dear Jason,
>>>
>>> On 2019-09-10 23:04, Jason Dillaman wrote:
>>>> On Tue, Sep 10, 2019 at 2:08 PM Oliver Freyermuth
>>>>  wrote:
>>>>>
>>>>> Dear Jason,
>>>>>
>>>>> On 2019-09-10 18:50, Jason Dillaman wrote:
>>>>>> On Tue, Sep 10, 2019 at 12:25 PM Oliver Freyermuth
>>>>>>  wrote:
>>>>>>>
>>>>>>> Dear Cephalopodians,
>>>>>>>
>>>>>>> I have two questions about RBD mirroring.
>>>>>>>
>>>>>>> 1) I can not get it to work - my setup is:
>>>>>>>
>>>>>>>  - One cluster holding the live RBD volumes and snapshots, in pool 
>>>>>>> "rbd", cluster name "ceph",
>>>>>>>running latest Mimic.
>>>>>>>I ran "rbd mirror pool enable rbd pool" on that cluster and 
>>>>>>> created a cephx user "rbd_mirror" with (is there a better way?):
>>>>>>>ceph auth get-or-create client.rbd_mirror mon 'allow r' osd 
>>>>>>> 'allow class-read object_prefix rbd_children, allow pool rbd r' -o 

Re: [ceph-users] POOL_TARGET_SIZE_BYTES_OVERCOMMITTED

2019-09-12 Thread Oliver Freyermuth
Dear Cephalopodians,

I can confirm the same problem described by Joe Ryner in 14.2.2. I'm also 
getting (in a small test setup):
-
# ceph health detail
HEALTH_WARN 1 subtrees have overcommitted pool target_size_bytes; 1 subtrees 
have overcommitted pool target_size_ratio
POOL_TARGET_SIZE_BYTES_OVERCOMMITTED 1 subtrees have overcommitted pool 
target_size_bytes
Pools ['rbd', '.rgw.root', 'default.rgw.control', 'default.rgw.meta', 
'default.rgw.log', 'default.rgw.buckets.index', 'default.rgw.buckets.data'] 
overcommit available storage by 1.068x due to target_size_bytes 0 on pools 
[]
POOL_TARGET_SIZE_RATIO_OVERCOMMITTED 1 subtrees have overcommitted pool 
target_size_ratio
Pools ['rbd', '.rgw.root', 'default.rgw.control', 'default.rgw.meta', 
'default.rgw.log', 'default.rgw.buckets.index', 'default.rgw.buckets.data'] 
overcommit available storage by 1.068x due to target_size_ratio 0.000 on pools 
[]
-

However, there's not much actual data STORED:
-
# ceph df
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
    hdd       4.0 TiB     2.6 TiB     1.4 TiB      1.4 TiB         35.94
    TOTAL     4.0 TiB     2.6 TiB     1.4 TiB      1.4 TiB         35.94

POOLS:
    POOL                        ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    rbd                          2     676 GiB     266.40k     707 GiB     23.42       771 GiB
    .rgw.root                    9     1.2 KiB           4     768 KiB         0       771 GiB
    default.rgw.control         10         0 B           8         0 B         0       771 GiB
    default.rgw.meta            11     1.2 KiB           8     1.3 MiB         0       771 GiB
    default.rgw.log             12         0 B         175         0 B         0       771 GiB
    default.rgw.buckets.index   13         0 B           1         0 B         0       771 GiB
    default.rgw.buckets.data    14     249 GiB      99.62k     753 GiB     24.57       771 GiB
-
The main culprit here seems to be the default.rgw.buckets.data pool, but the 
rbd pool also contains thin-provisioned images. 

As in Joe's case, the autoscaler seems to look at the "USED" space, not at 
the "STORED" bytes (see the quick arithmetic check right after the table):
-
 POOL                        SIZE     TARGET SIZE   RATE   RAW CAPACITY   RATIO    TARGET RATIO   BIAS   PG_NUM   NEW PG_NUM   AUTOSCALE 
 default.rgw.meta            1344k                  3.0    4092G          0.0000                  1.0         8                on 
 default.rgw.buckets.index   0                      3.0    4092G          0.0000                  1.0         8                on 
 default.rgw.control         0                      3.0    4092G          0.0000                  1.0         8                on 
 default.rgw.buckets.data    788.6G                 3.0    4092G          0.5782                  1.0       128                on 
 .rgw.root                   768.0k                 3.0    4092G          0.0000                  1.0         8                on 
 rbd                         710.8G                 3.0    4092G          0.5212                  1.0        64                on 
 default.rgw.log             0                      3.0    4092G          0.0000                  1.0         8                on 
-
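
The RATIO column can in fact be reproduced from SIZE and RATE alone - a quick 
back-of-the-envelope check (my assumption, which matches the numbers above: 
RATIO = SIZE * RATE / RAW CAPACITY, with RATE and RAW CAPACITY as defined in the 
documentation quoted further down):
-
echo "788.6 * 3.0 / 4092" | bc -l    # -> ~0.5782 (default.rgw.buckets.data)
echo "710.8 * 3.0 / 4092" | bc -l    # -> ~0.5212 (rbd)
-
So the ~789 G / ~711 G "SIZE" values - which are much closer to the "USED" column of 
"ceph df" than to "STORED" - are what drive the two ratios, and those two alone already 
add up to more than 1.0, which is presumably what triggers the overcommit warning.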

This does seem like a bug to me. The warning actually fires on a cluster with 
35 % raw usage, and things are mostly balanced. 
Is there already a tracker entry on this? 

Cheers,
Oliver


On 2019-05-01 22:01, Joe Ryner wrote:
> I think I have figured out the issue.
> 
>  POOL        SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  
> PG_NUM  NEW PG_NUM  AUTOSCALE 
>  images    28523G                3.0        68779G  1.2441                  
> 1000              warn 
> 
> My images are 28523G with a replication level 3 and have a total of 68779G in 
> Raw Capacity.
> 
>  According to the documentation 
> http://docs.ceph.com/docs/master/rados/operations/placement-groups/  
> 
> "*SIZE* is the amount of data stored in the pool. *TARGET SIZE*, if present, 
> is the amount of data the administrator has specified that they expect to 
> eventually be stored in this pool. The system uses the larger of the two 
> values for its calculation.
> 
> *RATE* is the multiplier for the pool that determines how much raw storage 
> capacity is consumed. For example, a 3 replica pool will have a ratio of 3.0, 
> while a k=4,m=2 erasure coded pool will have a ratio of 1.5.
> 
> *RAW CAPACITY* is the total amount of raw storage capacity on the OSDs that 
> are responsible for storing this pool’s (and perhaps other pools’) data. 
> *RATIO* is the ratio of that total capacity that this pool is 

Re: [ceph-users] Ceph RBD Mirroring

2019-09-11 Thread Oliver Freyermuth

Dear Jason,

I played a bit more with rbd mirroring and learned that deleting an image at 
the source (or disabling journaling on it) immediately moves the image to trash 
at the target -
but setting rbd_mirroring_delete_delay helps to have some more grace time to 
catch human mistakes.

However, I have issues restoring (as a regular "user" trash operation) such an image 
which has been moved to trash by the rbd-mirror daemon:
---
[root@mon001 ~]# rbd trash ls -la
ID           NAME             SOURCE    DELETED_AT               STATUS                                   PARENT
d4fbe8f63905 test-vm-XX-disk2 MIRRORING Wed Sep 11 18:43:14 2019 protected until Thu Sep 12 18:43:14 2019
[root@mon001 ~]# rbd trash restore --image foo-image d4fbe8f63905
rbd: restore error: 2019-09-11 18:50:15.387 7f5fa9590b00 -1 librbd::api::Trash: 
restore: Current trash source: mirroring does not match expected: user
(22) Invalid argument
---
This is issued on the mon, which has the client.admin key, so it should not be 
a permission issue.
It also fails when I try that in the Dashboard.

Sadly, the error message is not clear enough for me to figure out what could be 
the problem - do you see what I did wrong?

Cheers and thanks again,
Oliver

On 2019-09-10 23:17, Oliver Freyermuth wrote:

Dear Jason,

On 2019-09-10 23:04, Jason Dillaman wrote:

On Tue, Sep 10, 2019 at 2:08 PM Oliver Freyermuth
 wrote:


Dear Jason,

On 2019-09-10 18:50, Jason Dillaman wrote:

On Tue, Sep 10, 2019 at 12:25 PM Oliver Freyermuth
 wrote:


Dear Cephalopodians,

I have two questions about RBD mirroring.

1) I can not get it to work - my setup is:

 - One cluster holding the live RBD volumes and snapshots, in pool "rbd", cluster 
name "ceph",
   running latest Mimic.
   I ran "rbd mirror pool enable rbd pool" on that cluster and created a cephx user 
"rbd_mirror" with (is there a better way?):
   ceph auth get-or-create client.rbd_mirror mon 'allow r' osd 'allow 
class-read object_prefix rbd_children, allow pool rbd r' -o 
ceph.client.rbd_mirror.keyring --cluster ceph
   In that pool, two images have the journaling feature activated, all 
others have it disabled still (so I would expect these two to be mirrored).


You can just use "mon 'profile rbd' osd 'profile rbd'" for the caps --
but you definitely need more than read-only permissions to the remote
cluster since it needs to be able to create snapshots of remote images
and update/trim the image journals.


these profiles really make life a lot easier. I should have thought of them rather than 
"guessing" a potentially good configuration...




 - Another (empty) cluster running latest Nautilus, cluster name "ceph", pool 
"rbd".
   I've used the dashboard to activate mirroring for the RBD pool, and then added a peer with 
cluster name "ceph-virt", cephx-ID "rbd_mirror", filled in the mons and key 
created above.
   I've then run:
   ceph auth get-or-create client.rbd_mirror_backup mon 'allow r' osd 
'allow class-read object_prefix rbd_children, allow pool rbd rwx' -o 
client.rbd_mirror_backup.keyring --cluster ceph
   and deployed that key on the rbd-mirror machine, and started the service 
with:


Please use "mon 'profile rbd-mirror' osd 'profile rbd'" for your caps [1].


That did the trick (in combination with the above)!
Again a case of PEBKAC: I should have read the documentation until the end, 
clearly my fault.
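
For the archives, the caps I ended up with look roughly like this ("ceph auth caps" to 
fix up the already existing users; names as in my setup, adjust as needed):
-
# on the primary ("ceph-virt") cluster - the peer user the rbd-mirror daemon connects back with:
ceph auth caps client.rbd_mirror mon 'profile rbd' osd 'profile rbd'
# on the backup cluster - the local user the rbd-mirror daemon runs as:
ceph auth caps client.rbd_mirror_backup mon 'profile rbd-mirror' osd 'profile rbd'
-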

It works well now, even though it seems to run a bit slow (~35 MB/s for the 
initial sync when everything is 1 GBit/s),
but that may also be caused by a combination of some very limited hardware on the 
receiving end (which will be scaled up in the future).
A single host with 6 disks, replica 3 and a RAID controller which can only do 
RAID0 and not JBOD is certainly not ideal, so commit latency may cause this 
slow bandwidth.


You could try increasing "rbd_concurrent_management_ops" from the
default of 10 ops to something higher to attempt to account for the
latency. However, I wouldn't expect near-line speed w/ RBD mirroring.


Thanks - I will play with this option once we have more storage available in 
the target pool ;-).
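
(For later reference, something along these lines is what I would try - I have not yet 
verified whether the mon config store accepts this for the rbd-mirror client; otherwise 
the same option in ceph.conf on the rbd-mirror host should do:)
-
# raise the number of parallel image sync operations of the rbd-mirror daemon (default: 10)
ceph config set client.rbd_mirror_backup rbd_concurrent_management_ops 20
-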






   systemctl start ceph-rbd-mirror@rbd_mirror_backup.service

After this, everything looks fine:
 # rbd mirror pool info
   Mode: pool
   Peers:
UUID NAME  CLIENT
XXX  ceph-virt client.rbd_mirror

The service also seems to start fine, but logs show (debug rbd_mirror=20):

rbd::mirror::ClusterWatcher:0x5575e2a7d390 resolve_peer_config_keys: 
retrieving config-key: pool_id=2, pool_name=rbd, peer_uuid=XXX
rbd::mirror::Mirror: 0x5575e29c7240 update_pool_replayers: enter
rbd::mirror::Mirror

Re: [ceph-users] Ceph RBD Mirroring

2019-09-10 Thread Oliver Freyermuth
Dear Jason,

On 2019-09-10 23:04, Jason Dillaman wrote:
> On Tue, Sep 10, 2019 at 2:08 PM Oliver Freyermuth
>  wrote:
>>
>> Dear Jason,
>>
>> On 2019-09-10 18:50, Jason Dillaman wrote:
>>> On Tue, Sep 10, 2019 at 12:25 PM Oliver Freyermuth
>>>  wrote:
>>>>
>>>> Dear Cephalopodians,
>>>>
>>>> I have two questions about RBD mirroring.
>>>>
>>>> 1) I can not get it to work - my setup is:
>>>>
>>>> - One cluster holding the live RBD volumes and snapshots, in pool 
>>>> "rbd", cluster name "ceph",
>>>>   running latest Mimic.
>>>>   I ran "rbd mirror pool enable rbd pool" on that cluster and created 
>>>> a cephx user "rbd_mirror" with (is there a better way?):
>>>>   ceph auth get-or-create client.rbd_mirror mon 'allow r' osd 'allow 
>>>> class-read object_prefix rbd_children, allow pool rbd r' -o 
>>>> ceph.client.rbd_mirror.keyring --cluster ceph
>>>>   In that pool, two images have the journaling feature activated, all 
>>>> others have it disabled still (so I would expect these two to be mirrored).
>>>
>>> You can just use "mon 'profile rbd' osd 'profile rbd'" for the caps --
>>> but you definitely need more than read-only permissions to the remote
>>> cluster since it needs to be able to create snapshots of remote images
>>> and update/trim the image journals.
>>
>> these profiles really make life a lot easier. I should have thought of them 
>> rather than "guessing" a potentially good configuration...
>>
>>>
>>>> - Another (empty) cluster running latest Nautilus, cluster name 
>>>> "ceph", pool "rbd".
>>>>   I've used the dashboard to activate mirroring for the RBD pool, and 
>>>> then added a peer with cluster name "ceph-virt", cephx-ID "rbd_mirror", 
>>>> filled in the mons and key created above.
>>>>   I've then run:
>>>>   ceph auth get-or-create client.rbd_mirror_backup mon 'allow r' osd 
>>>> 'allow class-read object_prefix rbd_children, allow pool rbd rwx' -o 
>>>> client.rbd_mirror_backup.keyring --cluster ceph
>>>>   and deployed that key on the rbd-mirror machine, and started the 
>>>> service with:
>>>
>>> Please use "mon 'profile rbd-mirror' osd 'profile rbd'" for your caps [1].
>>
>> That did the trick (in combination with the above)!
>> Again a case of PEBKAC: I should have read the documentation until the end, 
>> clearly my fault.
>>
>> It works well now, even though it seems to run a bit slow (~35 MB/s for the 
>> initial sync when everything is 1 GBit/s),
>> but that may also be caused by combination of some very limited hardware on 
>> the receiving end (which will be scaled up in the future).
>> A single host with 6 disks, replica 3 and a RAID controller which can only 
>> do RAID0 and not JBOD is certainly not ideal, so commit latency may cause 
>> this slow bandwidth.
> 
> You could try increasing "rbd_concurrent_management_ops" from the
> default of 10 ops to something higher to attempt to account for the
> latency. However, I wouldn't expect near-line speed w/ RBD mirroring.

Thanks - I will play with this option once we have more storage available in 
the target pool ;-). 

> 
>>>
>>>>   systemctl start ceph-rbd-mirror@rbd_mirror_backup.service
>>>>
>>>>After this, everything looks fine:
>>>> # rbd mirror pool info
>>>>   Mode: pool
>>>>   Peers:
>>>>UUID NAME  CLIENT
>>>>XXX  ceph-virt client.rbd_mirror
>>>>
>>>>The service also seems to start fine, but logs show (debug 
>>>> rbd_mirror=20):
>>>>
>>>>rbd::mirror::ClusterWatcher:0x5575e2a7d390 resolve_peer_config_keys: 
>>>> retrieving config-key: pool_id=2, pool_name=rbd, peer_uuid=XXX
>>>>rbd::mirror::Mirror: 0x5575e29c7240 update_pool_replayers: enter
>>>>rbd::mirror::Mirror: 0x5575e29c7240 update_pool_replayers: restarting 
>>>> failed pool replayer for uuid: XXX cluster: ceph-virt client: 
>>>> client.rbd_mirror
>>>>rbd::mirror::PoolReplayer: 0x5575e2a7da20 init: replaying for uuid: 
>>>> XXX cluster: ceph-virt client: c

Re: [ceph-users] Ceph RBD Mirroring

2019-09-10 Thread Oliver Freyermuth
Dear Jason,

On 2019-09-10 18:50, Jason Dillaman wrote:
> On Tue, Sep 10, 2019 at 12:25 PM Oliver Freyermuth
>  wrote:
>>
>> Dear Cephalopodians,
>>
>> I have two questions about RBD mirroring.
>>
>> 1) I can not get it to work - my setup is:
>>
>> - One cluster holding the live RBD volumes and snapshots, in pool "rbd", 
>> cluster name "ceph",
>>   running latest Mimic.
>>   I ran "rbd mirror pool enable rbd pool" on that cluster and created a 
>> cephx user "rbd_mirror" with (is there a better way?):
>>   ceph auth get-or-create client.rbd_mirror mon 'allow r' osd 'allow 
>> class-read object_prefix rbd_children, allow pool rbd r' -o 
>> ceph.client.rbd_mirror.keyring --cluster ceph
>>   In that pool, two images have the journaling feature activated, all 
>> others have it disabled still (so I would expect these two to be mirrored).
> 
> You can just use "mon 'profile rbd' osd 'profile rbd'" for the caps --
> but you definitely need more than read-only permissions to the remote
> cluster since it needs to be able to create snapshots of remote images
> and update/trim the image journals.

these profiles really make life a lot easier. I should have thought of them 
rather than "guessing" a potentially good configuration... 

> 
>> - Another (empty) cluster running latest Nautilus, cluster name "ceph", 
>> pool "rbd".
>>   I've used the dashboard to activate mirroring for the RBD pool, and 
>> then added a peer with cluster name "ceph-virt", cephx-ID "rbd_mirror", 
>> filled in the mons and key created above.
>>   I've then run:
>>   ceph auth get-or-create client.rbd_mirror_backup mon 'allow r' osd 
>> 'allow class-read object_prefix rbd_children, allow pool rbd rwx' -o 
>> client.rbd_mirror_backup.keyring --cluster ceph
>>   and deployed that key on the rbd-mirror machine, and started the 
>> service with:
> 
> Please use "mon 'profile rbd-mirror' osd 'profile rbd'" for your caps [1].

That did the trick (in combination with the above)! 
Again a case of PEBKAC: I should have read the documentation until the end, 
clearly my fault. 

It works well now, even though it seems to run a bit slow (~35 MB/s for the 
initial sync when everything is 1 GBit/s), 
but that may also be caused by a combination of some very limited hardware on the 
receiving end (which will be scaled up in the future). 
A single host with 6 disks, replica 3 and a RAID controller which can only do 
RAID0 and not JBOD is certainly not ideal, so commit latency may cause this 
slow bandwidth. 

> 
>>   systemctl start ceph-rbd-mirror@rbd_mirror_backup.service
>>
>>After this, everything looks fine:
>> # rbd mirror pool info
>>   Mode: pool
>>   Peers:
>>UUID NAME  CLIENT
>>XXX  ceph-virt client.rbd_mirror
>>
>>The service also seems to start fine, but logs show (debug rbd_mirror=20):
>>
>>rbd::mirror::ClusterWatcher:0x5575e2a7d390 resolve_peer_config_keys: 
>> retrieving config-key: pool_id=2, pool_name=rbd, peer_uuid=XXX
>>rbd::mirror::Mirror: 0x5575e29c7240 update_pool_replayers: enter
>>rbd::mirror::Mirror: 0x5575e29c7240 update_pool_replayers: restarting 
>> failed pool replayer for uuid: XXX cluster: ceph-virt client: 
>> client.rbd_mirror
>>rbd::mirror::PoolReplayer: 0x5575e2a7da20 init: replaying for uuid: 
>> XXX cluster: ceph-virt client: client.rbd_mirror
>>rbd::mirror::PoolReplayer: 0x5575e2a7da20 init_rados: error connecting to 
>> remote peer uuid: XXX cluster: ceph-virt client: client.rbd_mirror: 
>> (95) Operation not supported
>>rbd::mirror::ServiceDaemon: 0x5575e29c8d70 add_or_update_callout: 
>> pool_id=2, callout_id=2, callout_level=error, text=unable to connect to 
>> remote cluster
> 
> If it's still broken after fixing your caps above, perhaps increase
> debugging for "rados", "monc", "auth", and "ms" to see if you can
> determine the source of the op not supported error.
> 
>> I already tried storing the ceph.client.rbd_mirror.keyring (i.e. from the 
>> cluster with the live images) on the rbd-mirror machine explicitly (i.e. not 
>> only in mon config storage),
>> and after doing that:
>>   rbd -m mon_ip_of_ceph_virt_cluster --id=rbd_mirror ls
>> works fine. So it's not a connectivity issue. Maybe a permission issue? Or 
>> did I miss something?
>>
>

[ceph-users] Ceph RBD Mirroring

2019-09-10 Thread Oliver Freyermuth

Dear Cephalopodians,

I have two questions about RBD mirroring.

1) I can not get it to work - my setup is:

   - One cluster holding the live RBD volumes and snapshots, in pool "rbd", cluster name 
"ceph",
 running latest Mimic.
 I ran "rbd mirror pool enable rbd pool" on that cluster and created a cephx user 
"rbd_mirror" with (is there a better way?):
 ceph auth get-or-create client.rbd_mirror mon 'allow r' osd 'allow 
class-read object_prefix rbd_children, allow pool rbd r' -o 
ceph.client.rbd_mirror.keyring --cluster ceph
 In that pool, two images have the journaling feature activated, all others 
have it disabled still (so I would expect these two to be mirrored).
 
   - Another (empty) cluster running latest Nautilus, cluster name "ceph", pool "rbd".

 I've used the dashboard to activate mirroring for the RBD pool, and then added a peer with 
cluster name "ceph-virt", cephx-ID "rbd_mirror", filled in the mons and key 
created above.
 I've then run:
 ceph auth get-or-create client.rbd_mirror_backup mon 'allow r' osd 'allow 
class-read object_prefix rbd_children, allow pool rbd rwx' -o 
client.rbd_mirror_backup.keyring --cluster ceph
 and deployed that key on the rbd-mirror machine, and started the service 
with:
 systemctl start ceph-rbd-mirror@rbd_mirror_backup.service

  After this, everything looks fine:
   # rbd mirror pool info
 Mode: pool
 Peers:
  UUID NAME  CLIENT
  XXX  ceph-virt client.rbd_mirror

  The service also seems to start fine, but logs show (debug rbd_mirror=20):

  rbd::mirror::ClusterWatcher:0x5575e2a7d390 resolve_peer_config_keys: 
retrieving config-key: pool_id=2, pool_name=rbd, peer_uuid=XXX
  rbd::mirror::Mirror: 0x5575e29c7240 update_pool_replayers: enter
  rbd::mirror::Mirror: 0x5575e29c7240 update_pool_replayers: restarting failed 
pool replayer for uuid: XXX cluster: ceph-virt client: client.rbd_mirror
  rbd::mirror::PoolReplayer: 0x5575e2a7da20 init: replaying for uuid: 
XXX cluster: ceph-virt client: client.rbd_mirror
  rbd::mirror::PoolReplayer: 0x5575e2a7da20 init_rados: error connecting to 
remote peer uuid: XXX cluster: ceph-virt client: client.rbd_mirror: 
(95) Operation not supported
  rbd::mirror::ServiceDaemon: 0x5575e29c8d70 add_or_update_callout: pool_id=2, 
callout_id=2, callout_level=error, text=unable to connect to remote cluster

I already tried storing the ceph.client.rbd_mirror.keyring (i.e. from the 
cluster with the live images) on the rbd-mirror machine explicitly (i.e. not 
only in mon config storage),
and after doing that:
 rbd -m mon_ip_of_ceph_virt_cluster --id=rbd_mirror ls
works fine. So it's not a connectivity issue. Maybe a permission issue? Or did 
I miss something?

Any idea what "operation not supported" means?
It's unclear to me whether mixing Mimic and Nautilus should work here at all, and 
whether enabling pool mirroring while only two images have journaling enabled is a 
supported case.

2) Since there is a performance drawback (about 2x) for journaling, is it also 
possible to only mirror snapshots, and leave the live volumes alone?
   This would cover the common backup use case until deferred mirroring is 
implemented (or is it there already?).

Cheers and thanks in advance,
Oliver





Re: [ceph-users] Urgent Help Needed (regarding rbd cache)

2019-08-01 Thread Oliver Freyermuth

Hi together,

Am 01.08.19 um 08:45 schrieb Janne Johansson:

Den tors 1 aug. 2019 kl 07:31 skrev Muhammad Junaid <junaid.fsd...@gmail.com>:

Your email has cleared many things to me. Let me repeat my understanding. 
Every Critical data (Like Oracle/Any Other DB) writes will be done with sync, 
fsync flags, meaning they will be only confirmed to DB/APP after it is actually 
written to Hard drives/OSD's. Any other application can do it also.
All other writes, like OS logs etc will be confirmed immediately to app/user but later on written  passing through kernel, RBD Cache, Physical drive Cache (If any)  and then to disks. These are susceptible to power-failure-loss but overall things are recoverable/non-critical. 



That last part is probably simplified a bit, I suspect between a program in a 
guest sending its data to the virtualised device, running in a KVM on top of an 
OS that has remote storage over network, to a storage server with its own OS 
and drive controller chip and lastly physical drive(s) to store the write, 
there will be something like ~10 layers of write caching possible, out of which 
the RBD you were asking about, is just one.

It is just located very conveniently before the I/O has to leave the KVM host 
and go back and forth over the network, so it is the last place where you can 
see huge gains in the guests I/O response time, but at the same time possible 
to share between lots of guests on the KVM host which should have tons of RAM 
available compared to any single guest so it is a nice way to get a large cache 
for outgoing writes.

Also, to answer your first part, yes all critical software that depend heavily 
on write ordering and integrity is hopefully already doing write operations 
that way, asking for sync(), fsync() or fdatasync() and similar calls, but I 
can't produce a list of all programs that do. Since there already are many 
layers of delayed cached writes even without virtualisation and/or ceph, 
applications that are important have mostly learned their lessons by now, so 
chances are very high that all your important databases and similar program are 
doing the right thing.


Just to add to this: one piece of software for which people care a lot about correct 
flush behaviour is of course the file system itself. BTRFS is notably very sensitive to 
broken flush / FUA ( 
https://en.wikipedia.org/wiki/Disk_buffer#Force_Unit_Access_(FUA) ) 
implementations at any layer of the I/O path, due to its rather complicated 
metadata structure.
While for in-kernel and other open source software (such as librbd), there are 
usually a lot of people checking the code for a correct implementation and 
testing things, there is also broken hardware
(or rather, firmware) in the wild.

There are even software issues around, if you think more generally and strive 
for data correctness (since corruption can happen at any layer):
I was hit by an in-kernel issue in the past (a network driver writing network statistics 
via DMA to the wrong memory location - "sometimes"),
which corrupted two BTRFS partitions of mine and caused random crashes in browsers 
and mail clients. BTRFS has only been hardened in kernel 5.2 to check the 
metadata tree before flushing it to disk.

If you are curious about known hardware issues, check out this lengthy, but 
very insightful mail on the linux-btrfs list:
https://lore.kernel.org/linux-btrfs/20190623204523.gc11...@hungrycats.org/
As you can learn there, there are many drive and firmware combinations out 
there which do not implement flush / FUA correctly and your BTRFS may be 
corrupted after a power failure. The very same thing can happen to Ceph,
but with replication across several OSDs and lower probability to have broken 
disks in all hosts makes this issue less likely.

For what it is worth, we also use writeback caching for our virtualization 
cluster and are very happy with it - we also tried pulling power plugs on 
hypervisors, MONs and OSDs at random times during writes and ext4 could always 
recover easily with an fsck
making use of the journal.
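
For completeness, the client-side cache settings we rely on are essentially the librbd 
defaults - roughly the following in ceph.conf on the hypervisors (option names from 
memory, so treat this as a sketch rather than a verified config):
-
[client]
rbd cache = true
# stay in writethrough mode until the guest issues its first flush, so guests
# that never flush are not silently exposed to writeback semantics
rbd cache writethrough until flush = true
-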

Cheers and HTH,
Oliver



But if the guest is instead running a mail filter that does antivirus checks, spam checks 
and so on, operating on files that live on the machine for something like one second, and 
then either get dropped or sent to the destination mailbox somewhere else, then having 
aggressive write caches would be very useful, since the effects of a crash would still 
mostly mean "the emails that were in the queue were lost, not acked by the final 
mailserver and will probably be resent by the previous smtp server". For such a 
guest VM, forcing sync writes would only be a net loss, it would gain much by having 
large ram write caches.

--
May the most significant bit of your life be positive.


Re: [ceph-users] Fix scrub error in bluestore.

2019-06-06 Thread Oliver Freyermuth

Hi Alfredo,

you may want to check the SMART data for the disk.
I also had such a case recently (see 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-May/035117.html for 
the thread),
and the disk had one unreadable sector which was pending reallocation.

Triggering "ceph pg repair" for the problematic placement group made the OSD 
rewrite the problematic sector and allowed the disk to reallocate this unreadable sector.
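
Concretely, this boiled down to roughly the following in my case (the device path is of 
course only an example - use the data device behind the affected OSD; in your log below, 
the placement group would be 10.c5):
-
# check the drive behind the OSD that reported the read error
smartctl -a /dev/sdX | grep -Ei 'reallocated|pending|uncorrect'
# let Ceph rewrite the bad replica/shard, which lets the drive remap the sector
ceph pg repair 10.c5
-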

Cheers,
Oliver

Am 06.06.19 um 18:45 schrieb Tarek Zegar:

Look here
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#pgs-inconsistent

Read error typically is a disk issue. The doc is not clear on how to resolve 
that





From: Alfredo Rezinovsky 
To: Ceph Users 
Date: 06/06/2019 10:58 AM
Subject: [EXTERNAL] [ceph-users] Fix scrub error in bluestore.
Sent by: "ceph-users" 

--



https://ceph.com/geen-categorie/ceph-manually-repair-object/

is a little outdated.

After stopping the OSD, flushing the journal I don't have any clue on how to 
move the object (easy in filestore).

I have this in my osd log.

2019-06-05 10:46:41.418 7f47d0502700 -1 log_channel(cluster) log [ERR] : 10.c5 
shard 2 soid 10:a39e2c78:::183f81f.0001:head : candidate had a read 
error

How can I fix it?

--
Alfrenovsky


Re: [ceph-users] Object read error - enough copies available

2019-05-31 Thread Oliver Freyermuth
Hi,

Am 31.05.19 um 12:07 schrieb Burkhard Linke:
> Hi,
> 
> 
> see my post in the recent 'CephFS object mapping.' thread. It describes the 
> necessary commands to lookup a file based on its rados object name.

many thanks! I somehow missed the important part in that thread earlier and only got to 
the functional, but not really scaling, "find . -xdev -inum xxx" approach before I 
stopped reading. Now I have followed it in full - very enlightening indeed: one needs to 
look at the xattrs of the RADOS objects! 
Very logical once you know it. 
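
For the record, the direction "RADOS object name -> file" then goes via the "parent" 
xattr of the file's first object, i.e. the one with chunk index 00000000 - a sketch from 
memory, not re-verified on this cluster:
-
# dump the backtrace stored on the first chunk of the inode in question
rados -p cephfs_data getxattr 10002954ea6.00000000 parent > /tmp/parent.bin
ceph-dencoder type inode_backtrace_t import /tmp/parent.bin decode dump_json
-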

Thanks again!
Oliver

> 
> 
> Regards,
> 
> Burkhard
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





Re: [ceph-users] Object read error - enough copies available

2019-05-30 Thread Oliver Freyermuth
Am 30.05.19 um 17:00 schrieb Oliver Freyermuth:
> Dear Cephalopodians,
> 
> I found the messages:
>  2019-05-30 16:08:51.656363 [ERR]  Error -5 reading object 
> 2:0979ae43:::10002954ea6.007c:head
>  2019-05-30 16:08:51.760660 [WRN]  Error(s) ignored for 
> 2:0979ae43:::10002954ea6.007c:head enough copies available 
> just now in our logs (Mimic 13.2.5). However, everything stayed HEALTH_OK and 
> seems fine. Pool 2 is an EC pool containing CephFS. 
> 
> Up to now I've never had to delve into the depths of RADOS, so I have some 
> questions. If there are docs and I missed them, just redirect me :-). 
> 
> - How do I find the OSDs / PG for that object (is the PG contained in the 
> name?)
>   I'd love to check SMART in more detail and deep-scrub that PG to see if 
> this was just a hiccup, or a permanent error. 

I've progressed - and put it on the list in the hope it can also help others:
# ceph osd map cephfs_data 10002954ea6.007c
osdmap e40907 pool 'cephfs_data' (2) object '10002954ea6.007c' -> pg 
2.c2759e90 (2.e90) -> up ([196,101,14,156,47,177], p196) acting 
([196,101,14,156,47,177], p196)
# ceph pg deep-scrub 2.e90
instructing pg 2.e90s0 on osd.196 to deep-scrub

Checking the OSD logs (osd 196), I find:
-
2019-05-30 16:08:51.759 7f46b36ac700  0 log_channel(cluster) log [WRN] : 
Error(s) ignored for 2:0979ae43:::10002954ea6.007c:head enough copies 
available
2019-05-30 17:13:39.817 7f46b36ac700  0 log_channel(cluster) log [DBG] : 2.e90 
deep-scrub starts
2019-05-30 17:19:51.013 7f46b36ac700 -1 log_channel(cluster) log [ERR] : 2.e90 
shard 14(2) soid 2:0979ae43:::10002954ea6.007c:head : candidate had a read 
error
2019-05-30 17:23:52.360 7f46b36ac700 -1 log_channel(cluster) log [ERR] : 
2.e90s0 deep-scrub 0 missing, 1 inconsistent objects
2019-05-30 17:23:52.360 7f46b36ac700 -1 log_channel(cluster) log [ERR] : 2.e90 
deep-scrub 1 errors
-
And now, the cluster is in HEALTH_ERR as expected. So that would probably have 
happened automatically after a while - wouldn't it be better to alert the 
operator immediately,
e.g. by scheduling an immediate deep-scrub after a read-error?

I presume "shard 14(2)" means: "Shard on OSD 14, third (index 2) in the acting 
set". Correct? 

Checking that OSDs logs, I do indeed find:
-
2019-05-30 16:08:51.566 7f2e7dc15700 -1 bdev(0x55ae2eade000 
/var/lib/ceph/osd/ceph-14/block) _aio_thread got r=-5 ((5) Input/output error)
2019-05-30 16:08:51.566 7f2e7dc15700 -1 bdev(0x55ae2eade000 
/var/lib/ceph/osd/ceph-14/block) _aio_thread translating the error to EIO for 
upper layer
2019-05-30 16:08:51.655 7f2e683ea700 -1 log_channel(cluster) log [ERR] : Error 
-5 reading object 2:0979ae43:::10002954ea6.007c:head
-
The underlying disk has one problematic sector in SMART. Issuing:
# ceph pg repair 2.e90
has triggered rewriting that sector and allowed the disk to reallocate that 
sector, and Ceph is HEALTH_OK again. 

So my issue is solved, but two questions remain:
- Is it intended that the error is "ignored" until the next deep-scrub happens? 

- Is there also a way to map the object name to a CephFS file object and 
vice-versa? 
  In one direction (file / inode to object), it seems this approach should work:
  http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005384.html
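
  i.e. roughly, as a sketch (path and pool name are placeholders from my setup):
-
# the object name prefix is the file's inode number in hex; chunks are <hex-inode>.<chunk index>
printf '%x\n' $(stat -c '%i' /cephfs/path/to/file)
# then e.g. look up the placement of the first chunk:
ceph osd map cephfs_data <hex-inode>.00000000
-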

Cheers and thanks,
Oliver





[ceph-users] Object read error - enough copies available

2019-05-30 Thread Oliver Freyermuth
Dear Cephalopodians,

I found the messages:
 2019-05-30 16:08:51.656363 [ERR]  Error -5 reading object 
2:0979ae43:::10002954ea6.007c:head
 2019-05-30 16:08:51.760660 [WRN]  Error(s) ignored for 
2:0979ae43:::10002954ea6.007c:head enough copies available 
just now in our logs (Mimic 13.2.5). However, everything stayed HEALTH_OK and 
seems fine. Pool 2 is an EC pool containing CephFS. 

Up to now I've never had to delve into the depths of RADOS, so I have some 
questions. If there are docs and I missed them, just redirect me :-). 

- How do I find the OSDs / PG for that object (is the PG contained in the name?)
  I'd love to check SMART in more detail and deep-scrub that PG to see if this 
was just a hiccup, or a permanent error. 

- Is there also a way to map the object name to a CephFS file object and 
vice-versa? 
  In one direction (file / inode to object), it seems this approach should work:
  http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005384.html

- Should Ceph stay healthy in that case? 
  Does it maybe even deep-scrub automatically, and only decide afterwards 
whether to stay healthy / whether repair is needed? 

Cheers and thanks,
Oliver





Re: [ceph-users] Balancer: uneven OSDs

2019-05-29 Thread Oliver Freyermuth
y] Loaded module_config entry 
> mgr/balancer/mode:upmap
> 2019-05-29 17:06:54.299 7f40cd3e8700 4 mgr get_config get_config key: 
> mgr/balancer/active
> 2019-05-29 17:06:54.299 7f40cd3e8700 4 mgr get_config get_config key: 
> mgr/balancer/begin_time
> 2019-05-29 17:06:54.299 7f40cd3e8700 4 mgr get_config get_config key: 
> mgr/balancer/end_time
> 2019-05-29 17:06:54.299 7f40cd3e8700 4 mgr get_config get_config key: 
> mgr/balancer/sleep_interval
> *2019-05-29 17:06:54.327 7f40cd3e8700 4 mgr[balancer] Optimize plan 
> auto_2019-05-29_17:06:54*
> 2019-05-29 17:06:54.327 7f40cd3e8700 4 mgr get_config get_config key: 
> mgr/balancer/mode
> 2019-05-29 17:06:54.327 7f40cd3e8700 4 mgr get_config get_config key: 
> mgr/balancer/max_misplaced
> 2019-05-29 17:06:54.327 7f40cd3e8700 4 mgr[balancer] Mode upmap, max 
> misplaced 0.50
> 2019-05-29 17:06:54.327 7f40cd3e8700 4 mgr[balancer] do_upmap
> 2019-05-29 17:06:54.327 7f40cd3e8700 4 mgr get_config get_config key: 
> mgr/balancer/upmap_max_iterations
> 2019-05-29 17:06:54.327 7f40cd3e8700 4 mgr get_config get_config key: 
> mgr/balancer/upmap_max_deviation
> 2019-05-29 17:06:54.327 7f40cd3e8700 4 mgr[balancer] pools ['rbd']
> *2019-05-29 17:06:54.327 7f40cd3e8700 4 mgr[balancer] prepared 0/10 changes*
> 
> 
> 
> From: Oliver Freyermuth 
> To: Tarek Zegar 
> Cc: ceph-users@lists.ceph.com
> Date: 05/29/2019 11:59 AM
> Subject: [EXTERNAL] Re: [ceph-users] Balancer: uneven OSDs
> 
> --
> 
> 
> 
> Hi Tarek,
> 
> Am 29.05.19 um 18:49 schrieb Tarek Zegar:
>> Hi Oliver,
>>
>> Thank you for the response, I did ensure that min-client-compact-level is 
>> indeed Luminous (see below). I have no kernel mapped rbd clients. Ceph 
>> versions reports mimic. Also below is the output of ceph balancer status. 
>> One thing to note, I did enable the balancer after I already filled the 
>> cluster, not from the onset. I had hoped that it wouldn't matter, though 
>> your comment "if the compat-level is too old for upmap, you'll only find a 
>> small warning about that in the logfiles" leaves me to believe that it will 
>> *not* work in doing it this way, please confirm and let me know what message 
>> to look for in /var/log/ceph.
> 
> it should also work well on existing clusters - we have also used it on a 
> Luminous cluster after it was already half-filled, and it worked well - 
> that's what it was made for ;-).
> The only issue we encountered was that the client-compat-level needed to be 
> set to Luminous before enabling the balancer plugin, but since you can always 
> disable and re-enable a plugin,
> this is not a "blocker".
> 
> Do you see anything in the logs of the active mgr when disabling and 
> re-enabling the balancer plugin?
> That's how we initially found the message that we needed to raise the 
> client-compat-level.
> 
> Cheers,
> Oliver
> 
>>
>> Thank you!
>>
>> root@hostadmin:~# ceph balancer status
>> {
>> "active": true,
>> "plans": [],
>> "mode": "upmap"
>> }
>>
>>
>>
>> root@hostadmin:~# ceph features
>> {
>> "mon": [
>> {
>> "features&quo

Re: [ceph-users] Balancer: uneven OSDs

2019-05-29 Thread Oliver Freyermuth
Hi Tarek,

Am 29.05.19 um 18:49 schrieb Tarek Zegar:
> Hi Oliver,
> 
> Thank you for the response, I did ensure that min-client-compact-level is 
> indeed Luminous (see below). I have no kernel mapped rbd clients. Ceph 
> versions reports mimic. Also below is the output of ceph balancer status. One 
> thing to note, I did enable the balancer after I already filled the cluster, 
> not from the onset. I had hoped that it wouldn't matter, though your comment 
> "if the compat-level is too old for upmap, you'll only find a small warning 
> about that in the logfiles" leaves me to believe that it will *not* work in 
> doing it this way, please confirm and let me know what message to look for in 
> /var/log/ceph.

it should also work well on existing clusters - we have also used it on a 
Luminous cluster after it was already half-filled, and it worked well - that's 
what it was made for ;-). 
The only issue we encountered was that the client-compat-level needed to be set 
to Luminous before enabling the balancer plugin, but since you can always 
disable and re-enable a plugin,
this is not a "blocker". 

Do you see anything in the logs of the active mgr when disabling and 
re-enabling the balancer plugin? 
That's how we initially found the message that we needed to raise the 
client-compat-level. 
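
For reference, what we did boils down to roughly this sequence (the 
set-require-min-compat-client command is the one from the upmap documentation):
-
# allow upmap by requiring at least luminous clients
ceph osd set-require-min-compat-client luminous
# then re-enable the balancer plugin so it picks the new level up
ceph balancer off
ceph balancer mode upmap
ceph balancer on
-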

Cheers,
Oliver

> 
> Thank you!
> 
> root@hostadmin:~# ceph balancer status
> {
> "active": true,
> "plans": [],
> "mode": "upmap"
> }
> 
> 
> 
> root@hostadmin:~# ceph features
> {
> "mon": [
> {
> "features": "0x3ffddff8ffacfffb",
> "release": "luminous",
> "num": 3
> }
> ],
> "osd": [
> {
> "features": "0x3ffddff8ffacfffb",
> "release": "luminous",
> "num": 7
> }
> ],
> "client": [
> {
> "features": "0x3ffddff8ffacfffb",
> "release": "luminous",
> "num": 1
> }
> ],
> "mgr": [
> {
> "features": "0x3ffddff8ffacfffb",
> "release": "luminous",
> "num": 3
> }
> ]
> }
> 
> 
> 
> 
> 
> From: Oliver Freyermuth 
> To: ceph-users@lists.ceph.com
> Date: 05/29/2019 11:13 AM
> Subject: [EXTERNAL] Re: [ceph-users] Balancer: uneven OSDs
> Sent by: "ceph-users" 
> 
> --
> 
> 
> 
> Hi Tarek,
> 
> what's the output of "ceph balancer status"?
> In case you are using "upmap" mode, you must make sure to have a 
> min-client-compat-level of at least Luminous:
> http://docs.ceph.com/docs/mimic/rados/operations/upmap/
> Of course, please be aware that your clients must be recent enough 
> (especially for kernel clients).
> 
> Sadly, if the compat-level is too old for upmap, you'll only find a small 
> warning about that in the logfiles,
> but no error on terminal when activating the balancer or any other kind of 
> erroneous / health condition.
> 
> Cheers,
> Oliver
> 
> Am 29.05.19 um 17:52 schrieb Tarek Zegar:
>> Can anyone help with this? Why can't I optimize this cluster, the pg counts 
>> and data distribution is way off.
>> __
>>
>> I enabled the balancer plugin and even tried to manually invoke it but it 
>> won't allow any changes. Looking at ceph osd df, it's not even at all. 
>> Thoughts?
>>
>> root@hostadmin:~# ceph osd df
>> 

Re: [ceph-users] Balancer: uneven OSDs

2019-05-29 Thread Oliver Freyermuth

Hi Tarek,

what's the output of "ceph balancer status"?
In case you are using "upmap" mode, you must make sure to have a 
min-client-compat-level of at least Luminous:
http://docs.ceph.com/docs/mimic/rados/operations/upmap/
Of course, please be aware that your clients must be recent enough (especially 
for kernel clients).

Sadly, if the compat-level is too old for upmap, you'll only find a small 
warning about that in the logfiles,
but no error on terminal when activating the balancer or any other kind of 
erroneous / health condition.

Cheers,
Oliver

Am 29.05.19 um 17:52 schrieb Tarek Zegar:

Can anyone help with this? Why can't I optimize this cluster, the pg counts and 
data distribution is way off.
__

I enabled the balancer plugin and even tried to manually invoke it but it won't 
allow any changes. Looking at ceph osd df, it's not even at all. Thoughts?

root@hostadmin:~# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
1 hdd 0.00980 0 0 B 0 B 0 B 0 0 0
3 hdd 0.00980 1.0 10 GiB 8.3 GiB 1.7 GiB 82.83 1.14 156
6 hdd 0.00980 1.0 10 GiB 8.4 GiB 1.6 GiB 83.77 1.15 144
0 hdd 0.00980 0 0 B 0 B 0 B 0 0 0
5 hdd 0.00980 1.0 10 GiB 9.0 GiB 1021 MiB 90.03 1.23 159
7 hdd 0.00980 1.0 10 GiB 7.7 GiB 2.3 GiB 76.57 1.05 141
2 hdd 0.00980 1.0 10 GiB 5.5 GiB 4.5 GiB 55.42 0.76 90
4 hdd 0.00980 1.0 10 GiB 5.9 GiB 4.1 GiB 58.78 0.81 99
8 hdd 0.00980 1.0 10 GiB 6.3 GiB 3.7 GiB 63.12 0.87 111
TOTAL 90 GiB 53 GiB 37 GiB 72.93
MIN/MAX VAR: 0.76/1.23 STDDEV: 12.67


root@hostadmin:~# osdmaptool om --upmap out.txt --upmap-pool rbd
osdmaptool: osdmap file 'om'
writing upmap command output to: out.txt
checking for upmap cleanups
upmap, max-count 100, max deviation 0.01   <--- really? It's not even close to 1% 
across the drives
limiting to pools rbd (1)
no upmaps proposed


ceph balancer optimize myplan
Error EALREADY: Unable to find further optimization, or distribution is already 
perfect




Re: [ceph-users] Quotas with Mimic (CephFS-FUSE) clients in a Luminous Cluster

2019-05-28 Thread Oliver Freyermuth

Am 28.05.19 um 03:24 schrieb Yan, Zheng:

On Mon, May 27, 2019 at 6:54 PM Oliver Freyermuth
 wrote:


Am 27.05.19 um 12:48 schrieb Oliver Freyermuth:

Am 27.05.19 um 11:57 schrieb Dan van der Ster:

On Mon, May 27, 2019 at 11:54 AM Oliver Freyermuth
 wrote:


Dear Dan,

thanks for the quick reply!

Am 27.05.19 um 11:44 schrieb Dan van der Ster:

Hi Oliver,

We saw the same issue after upgrading to mimic.

IIRC we could make the max_bytes xattr visible by touching an empty
file in the dir (thereby updating the dir inode).

e.g. touch  /cephfs/user/freyermu/.quota; rm  /cephfs/user/freyermu/.quota


sadly, no, not even with sync's in between:
-
$ touch /cephfs/user/freyermu/.quota; sync; rm -f /cephfs/user/freyermu/.quota; 
sync; getfattr --absolute-names --only-values -n ceph.quota.max_bytes 
/cephfs/user/freyermu/
/cephfs/user/freyermu/: ceph.quota.max_bytes: No such attribute
-
Also restarting the FUSE client after that does not change it. Maybe this 
requires the rest of the cluster to be upgraded to work?
I'm just guessing here, but maybe the MDS needs the file creation / update of the 
directory inode to "update" the way the quota attributes are exported. If 
something changed here with Mimic,
this would explain why the "touch" is needed. And this would also explain why 
this might only help if the MDS is upgraded to Mimic, too.



I think the relevant change which is causing this is the new_snaps in mimic.

Did you already enable them? `ceph fs set cephfs allow_new_snaps 1`


Good point! We wanted to enable these anyways with Mimic.

I've enabled it just now (since servers are still Luminous, that required 
"--yes-i-really-mean-it") but sadly, the max_bytes attribute is still not there
(also not after remounting on the client / using the file creation and deletion 
trick).


That's interesting - it suddenly started to work for one directory after 
creating a snapshot for one directory subtree on which we have quotas enabled,
and removing that snapshot again.
I can reproduce that for other directories.
So it seems enabling snapshots and snapshotting once fixes it for that 
directory tree.

If that's the case, maybe this could be added to the upgrade notes?



quota handling code changed in mimic. mimic client + luminous mds have
compat issue.  there should be no issue if  both mds and client are
both upgraded to mimic,


Thanks for the confirmation!
We have by now upgraded all our MDSs, and indeed now the trick which Dan 
outlined initially works:
 touch /directory/with/quotas/.somefile
 rm /directory/with/quotas/.somefile
to get the attribute to show up again. No creation of snaps is needed anymore, but it's 
also not showing up by itself (an update of the directory inode seems needed to trigger 
the "migration").
Since a change inside the subtree is also sufficient, this means things will 
"heal" automatically for us.

Still, this surprised me - maybe this compat issue could / should be mentioned 
in the upgrade notes?
Naïvely, I believed that (fuse) clients should be relatively safe to upgrade 
even if the rest of the cluster is not there yet.

Cheers and thanks,
Oliver



Regards
Yan, Zheng


Cheers,
 Oliver



Cheers,
  Oliver



-- dan



We have scheduled the remaining parts of the upgrade for Wednesday, and worst 
case could survive until then without quota enforcement, but it's a really 
strange and unexpected incompatibility.

Cheers,
  Oliver



Does that work?

-- dan


On Mon, May 27, 2019 at 11:36 AM Oliver Freyermuth
 wrote:


Dear Cephalopodians,

in the process of migrating a cluster from Luminous (12.2.12) to Mimic 
(13.2.5), we have upgraded the FUSE clients first (we took the chance during a 
time of low activity),
thinking that this should not cause any issues. All MDS+MON+OSDs are still on 
Luminous, 12.2.12.

However, it seems quotas have stopped working - with a (FUSE) Mimic client 
(13.2.5), I see:
$ getfattr --absolute-names --only-values -n ceph.quota.max_bytes 
/cephfs/user/freyermu/
/cephfs/user/freyermu/: ceph.quota.max_bytes: No such attribute

A Luminous client (12.2.12) on the same cluster sees:
$ getfattr --absolute-names --only-values -n ceph.quota.max_bytes 
/cephfs/user/freyermu/
5

It does not seem as if the attribute has been renamed (e.g. 
https://github.com/ceph/ceph/blob/mimic/qa/tasks/cephfs/test_quota.py still 
references it, same for the docs),
and I have to assume the clients also do not enforce quota if they do not see 
it.

Is this a known incompatibility between Mimic clients and a Luminous cluster?
The release notes of Mimic only mention that quota support was added to the 
kernel client, but nothing else quota related catches my eye.

Cheers,
   Oliver




--
Oliver Freyermuth
Universität Bonn
Physikal

Re: [ceph-users] Quotas with Mimic (CephFS-FUSE) clients in a Luminous Cluster

2019-05-27 Thread Oliver Freyermuth

Am 27.05.19 um 12:48 schrieb Oliver Freyermuth:

Am 27.05.19 um 11:57 schrieb Dan van der Ster:

On Mon, May 27, 2019 at 11:54 AM Oliver Freyermuth
 wrote:


Dear Dan,

thanks for the quick reply!

Am 27.05.19 um 11:44 schrieb Dan van der Ster:

Hi Oliver,

We saw the same issue after upgrading to mimic.

IIRC we could make the max_bytes xattr visible by touching an empty
file in the dir (thereby updating the dir inode).

e.g. touch  /cephfs/user/freyermu/.quota; rm  /cephfs/user/freyermu/.quota


sadly, no, not even with sync's in between:
-
$ touch /cephfs/user/freyermu/.quota; sync; rm -f /cephfs/user/freyermu/.quota; 
sync; getfattr --absolute-names --only-values -n ceph.quota.max_bytes 
/cephfs/user/freyermu/
/cephfs/user/freyermu/: ceph.quota.max_bytes: No such attribute
-
Also restarting the FUSE client after that does not change it. Maybe this 
requires the rest of the cluster to be upgraded to work?
I'm just guessing here, but maybe the MDS needs the file creation / update of the 
directory inode to "update" the way the quota attributes are exported. If 
something changed here with Mimic,
this would explain why the "touch" is needed. And this would also explain why 
this might only help if the MDS is upgraded to Mimic, too.



I think the relevant change which is causing this is the new_snaps in mimic.

Did you already enable them? `ceph fs set cephfs allow_new_snaps 1`


Good point! We wanted to enable these anyways with Mimic.

I've enabled it just now (since servers are still Luminous, that required 
"--yes-i-really-mean-it") but sadly, the max_bytes attribute is still not there
(also not after remounting on the client / using the file creation and deletion 
trick).


That's interesting - it suddenly started to work for one directory after 
creating a snapshot for one directory subtree on which we have quotas enabled,
and removing that snapshot again.
I can reproduce that for other directories.
So it seems enabling snapshots and snapshotting once fixes it for that 
directory tree.

If that's the case, maybe this could be added to the upgrade notes?

Cheers,
Oliver



Cheers,
 Oliver



-- dan



We have scheduled the remaining parts of the upgrade for Wednesday, and worst 
case could survive until then without quota enforcement, but it's a really 
strange and unexpected incompatibility.

Cheers,
 Oliver



Does that work?

-- dan


On Mon, May 27, 2019 at 11:36 AM Oliver Freyermuth
 wrote:


Dear Cephalopodians,

in the process of migrating a cluster from Luminous (12.2.12) to Mimic 
(13.2.5), we have upgraded the FUSE clients first (we took the chance during a 
time of low activity),
thinking that this should not cause any issues. All MDS+MON+OSDs are still on 
Luminous, 12.2.12.

However, it seems quotas have stopped working - with a (FUSE) Mimic client 
(13.2.5), I see:
$ getfattr --absolute-names --only-values -n ceph.quota.max_bytes 
/cephfs/user/freyermu/
/cephfs/user/freyermu/: ceph.quota.max_bytes: No such attribute

A Luminous client (12.2.12) on the same cluster sees:
$ getfattr --absolute-names --only-values -n ceph.quota.max_bytes 
/cephfs/user/freyermu/
5

It does not seem as if the attribute has been renamed (e.g. 
https://github.com/ceph/ceph/blob/mimic/qa/tasks/cephfs/test_quota.py still 
references it, same for the docs),
and I have to assume the clients also do not enforce quota if they do not see 
it.

Is this a known incompatibility between Mimic clients and a Luminous cluster?
The release notes of Mimic only mention that quota support was added to the 
kernel client, but nothing else quota related catches my eye.

Cheers,
  Oliver




--
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--









--
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--





Re: [ceph-users] Quotas with Mimic (CephFS-FUSE) clients in a Luminous Cluster

2019-05-27 Thread Oliver Freyermuth

Am 27.05.19 um 11:57 schrieb Dan van der Ster:

On Mon, May 27, 2019 at 11:54 AM Oliver Freyermuth
 wrote:


Dear Dan,

thanks for the quick reply!

Am 27.05.19 um 11:44 schrieb Dan van der Ster:

Hi Oliver,

We saw the same issue after upgrading to mimic.

IIRC we could make the max_bytes xattr visible by touching an empty
file in the dir (thereby updating the dir inode).

e.g. touch  /cephfs/user/freyermu/.quota; rm  /cephfs/user/freyermu/.quota


sadly, no, not even with sync's in between:
-
$ touch /cephfs/user/freyermu/.quota; sync; rm -f /cephfs/user/freyermu/.quota; 
sync; getfattr --absolute-names --only-values -n ceph.quota.max_bytes 
/cephfs/user/freyermu/
/cephfs/user/freyermu/: ceph.quota.max_bytes: No such attribute
-
Also restarting the FUSE client after that does not change it. Maybe this 
requires the rest of the cluster to be upgraded to work?
I'm just guessing here, but maybe the MDS needs the file creation / update of the 
directory inode to "update" the way the quota attributes are exported. If 
something changed here with Mimic,
this would explain why the "touch" is needed. And this would also explain why 
this might only help if the MDS is upgraded to Mimic, too.



I think the relevant change which is causing this is the new_snaps in mimic.

Did you already enable them? `ceph fs set cephfs allow_new_snaps 1`


Good point! We wanted to enable these anyways with Mimic.

I've enabled it just now (since servers are still Luminous, that required 
"--yes-i-really-mean-it") but sadly, the max_bytes attribute is still not there
(also not after remounting on the client / using the file creation and deletion 
trick).

Cheers,
Oliver



-- dan



We have scheduled the remaining parts of the upgrade for Wednesday, and worst 
case could survive until then without quota enforcement, but it's a really 
strange and unexpected incompatibility.

Cheers,
 Oliver



Does that work?

-- dan


On Mon, May 27, 2019 at 11:36 AM Oliver Freyermuth
 wrote:


Dear Cephalopodians,

in the process of migrating a cluster from Luminous (12.2.12) to Mimic 
(13.2.5), we have upgraded the FUSE clients first (we took the chance during a 
time of low activity),
thinking that this should not cause any issues. All MDS+MON+OSDs are still on 
Luminous, 12.2.12.

However, it seems quotas have stopped working - with a (FUSE) Mimic client 
(13.2.5), I see:
$ getfattr --absolute-names --only-values -n ceph.quota.max_bytes 
/cephfs/user/freyermu/
/cephfs/user/freyermu/: ceph.quota.max_bytes: No such attribute

A Luminous client (12.2.12) on the same cluster sees:
$ getfattr --absolute-names --only-values -n ceph.quota.max_bytes 
/cephfs/user/freyermu/
5

It does not seem as if the attribute has been renamed (e.g. 
https://github.com/ceph/ceph/blob/mimic/qa/tasks/cephfs/test_quota.py still 
references it, same for the docs),
and I have to assume the clients also do not enforce quota if they do not see 
it.

Is this a known incompatibility between Mimic clients and a Luminous cluster?
The release notes of Mimic only mention that quota support was added to the 
kernel client, but nothing else quota related catches my eye.

Cheers,
  Oliver

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--




--
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Quotas with Mimic (CephFS-FUSE) clients in a Luminous Cluster

2019-05-27 Thread Oliver Freyermuth

Dear Dan,

thanks for the quick reply!

On 27.05.19 at 11:44, Dan van der Ster wrote:

Hi Oliver,

We saw the same issue after upgrading to mimic.

IIRC we could make the max_bytes xattr visible by touching an empty
file in the dir (thereby updating the dir inode).

e.g. touch  /cephfs/user/freyermu/.quota; rm  /cephfs/user/freyermu/.quota


sadly, no, not even with syncs in between:
-
$ touch /cephfs/user/freyermu/.quota; sync; rm -f /cephfs/user/freyermu/.quota; 
sync; getfattr --absolute-names --only-values -n ceph.quota.max_bytes 
/cephfs/user/freyermu/
/cephfs/user/freyermu/: ceph.quota.max_bytes: No such attribute
-
Also restarting the FUSE client after that does not change it. Maybe this 
requires the rest of the cluster to be upgraded to work?
I'm just guessing here, but maybe the MDS needs the file creation / update of the 
directory inode to "update" the way the quota attributes are exported. If 
something changed here with Mimic,
this would explain why the "touch" is needed. And this would also explain why 
this might only help if the MDS is upgraded to Mimic, too.

We have scheduled the remaining parts of the upgrade for Wednesday, and worst 
case could survive until then without quota enforcement, but it's a really 
strange and unexpected incompatibility.

Cheers,
Oliver



Does that work?

-- dan


On Mon, May 27, 2019 at 11:36 AM Oliver Freyermuth
 wrote:


Dear Cephalopodians,

in the process of migrating a cluster from Luminous (12.2.12) to Mimic 
(13.2.5), we have upgraded the FUSE clients first (we took the chance during a 
time of low activity),
thinking that this should not cause any issues. All MDS+MON+OSDs are still on 
Luminous, 12.2.12.

However, it seems quotas have stopped working - with a (FUSE) Mimic client 
(13.2.5), I see:
$ getfattr --absolute-names --only-values -n ceph.quota.max_bytes 
/cephfs/user/freyermu/
/cephfs/user/freyermu/: ceph.quota.max_bytes: No such attribute

A Luminous client (12.2.12) on the same cluster sees:
$ getfattr --absolute-names --only-values -n ceph.quota.max_bytes 
/cephfs/user/freyermu/
5

It does not seem as if the attribute has been renamed (e.g. 
https://github.com/ceph/ceph/blob/mimic/qa/tasks/cephfs/test_quota.py still 
references it, same for the docs),
and I have to assume the clients also do not enforce quota if they do not see 
it.

Is this a known incompatibility between Mimic clients and a Luminous cluster?
The release notes of Mimic only mention that quota support was added to the 
kernel client, but nothing else quota related catches my eye.

Cheers,
 Oliver

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Quotas with Mimic (CephFS-FUSE) clients in a Luminous Cluster

2019-05-27 Thread Oliver Freyermuth

Dear Cephalopodians,

in the process of migrating a cluster from Luminous (12.2.12) to Mimic 
(13.2.5), we have upgraded the FUSE clients first (we took the chance during a 
time of low activity),
thinking that this should not cause any issues. All MDS+MON+OSDs are still on 
Luminous, 12.2.12.

However, it seems quotas have stopped working - with a (FUSE) Mimic client 
(13.2.5), I see:
$ getfattr --absolute-names --only-values -n ceph.quota.max_bytes 
/cephfs/user/freyermu/
/cephfs/user/freyermu/: ceph.quota.max_bytes: No such attribute

A Luminous client (12.2.12) on the same cluster sees:
$ getfattr --absolute-names --only-values -n ceph.quota.max_bytes 
/cephfs/user/freyermu/
5

It does not seem as if the attribute has been renamed (e.g. 
https://github.com/ceph/ceph/blob/mimic/qa/tasks/cephfs/test_quota.py still 
references it, same for the docs),
and I have to assume the clients also do not enforce quota if they do not see 
it.
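
For completeness, quotas on CephFS are set and read via these virtual xattrs;
roughly like this (the byte value below is only an illustrative placeholder,
not our actual limit):

$ setfattr -n ceph.quota.max_bytes -v 107374182400 /cephfs/user/freyermu/   # 100 GiB, example value
$ getfattr --absolute-names --only-values -n ceph.quota.max_bytes /cephfs/user/freyermu/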

Is this a known incompatibility between Mimic clients and a Luminous cluster?
The release notes of Mimic only mention that quota support was added to the 
kernel client, but nothing else quota related catches my eye.

Cheers,
Oliver



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inodes on /cephfs

2019-05-01 Thread Oliver Freyermuth
Dear Yury,

On 01.05.19 at 08:07, Yury Shevchuk wrote:
> cephfs is not alone at this, there are other inode-less filesystems
> around.  They all go with zeroes:
> 
> # df -i /nfs-dir
> Filesystem  Inodes IUsed IFree IUse% Mounted on
> xxx.xxx.xx.x:/xxx/xxx/x  0 0 0 - /xxx
> 
> # df -i /reiserfs-dir
> FilesystemInodes   IUsed   IFree IUse% Mounted on
> /xxx//x0   0   0-  /xxx/xxx//x
> 
> # df -i /btrfs-dir
> Filesystem   Inodes IUsed IFree IUse% Mounted on
> /xxx/xx/  0 0 0 - /

you are right, thanks for pointing me to these examples! 

> 
> Would YUM refuse to install on them all, including mainstream btrfs?
> I doubt that.  Prehaps YUM is confused by Inodes count that
> cephfs (alone!) reports as non-zero.  Look at YUM sources?

Indeed, Yum works on all these file systems. 
Here's the place in the sources:
https://github.com/rpm-software-management/rpm/blob/6913360d66510e60d7b6399cd338425d663a051b/lib/transaction.c#L172
That's actually in RPM, since Yum calls RPM and the complaint comes from RPM. 

Reading the sources, they just interpret the results from the statfs call. If a 
file system reports:
sfb.f_ffree == 0 && sfb.f_files == 0
i.e. no used and no free inodes, then it's assumed the file system has no 
notion of inodes, and the check is disabled. 
However, since CephFS reports something non-zero for the total count (f_files), 
RPM assumes it has a notion of Inodes, and a check should be performed. 
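
For anyone wanting to check what their mount reports without reading the RPM
sources, the relevant statfs fields can be inspected directly (a quick sketch
using GNU stat; %c is the total and %d the free inode count):

$ stat -f -c 'inodes_total=%c inodes_free=%d' /cephfs
$ stat -f -c 'inodes_total=%c inodes_free=%d' /     # e.g. an ext4 root for comparison

A file system that wants to opt out of the inode check like NFS or btrfs does
has to report 0 for both fields; CephFS currently reports a non-zero total with
0 free, which is exactly what trips RPM's heuristic.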

So indeed, another solution would be to change f_files to also report 0, as all 
other file systems without actual inodes seem to do. 
That would (in my opinion) also be more correct than what is currently done, 
since reporting something non-zero as f_files but zero as f_free
from a logical point of view seems "full". 
Even df shows a more useful output with both being zero - it just shows a 
"dash", highlighting that this is not information to be monitored. 

What do you think? 

Cheers,
    Oliver

> 
> 
> -- Yury
> 
> On Wed, May 01, 2019 at 01:23:57AM +0200, Oliver Freyermuth wrote:
>> On 01.05.19 at 00:51, Patrick Donnelly wrote:
>>> On Tue, Apr 30, 2019 at 8:01 AM Oliver Freyermuth
>>>  wrote:
>>>>
>>>> Dear Cephalopodians,
>>>>
>>>> we have a classic libvirtd / KVM based virtualization cluster using 
>>>> Ceph-RBD (librbd) as backend and sharing the libvirtd configuration 
>>>> between the nodes via CephFS
>>>> (all on Mimic).
>>>>
>>>> To share the libvirtd configuration between the nodes, we have symlinked 
>>>> some folders from /etc/libvirt to their counterparts on /cephfs,
>>>> so all nodes see the same configuration.
>>>> In general, this works very well (of course, there's a "gotcha": Libvirtd 
>>>> needs reloading / restart for some changes to the XMLs, we have automated 
>>>> that),
>>>> but there is one issue caused by Yum's cleverness (that's on CentOS 7). 
>>>> Whenever there's a libvirtd update, unattended upgrades fail, and we see:
>>>>
>>>>Transaction check error:
>>>>  installing package 
>>>> libvirt-daemon-driver-network-4.5.0-10.el7_6.7.x86_64 needs 2 inodes on 
>>>> the /cephfs filesystem
>>>>  installing package 
>>>> libvirt-daemon-config-nwfilter-4.5.0-10.el7_6.7.x86_64 needs 18 inodes on 
>>>> the /cephfs filesystem
>>>>
>>>> So it seems yum follows the symlinks and checks the available inodes on 
>>>> /cephfs. Sadly, that reveals:
>>>>[root@kvm001 libvirt]# LANG=C df -i /cephfs/
>>>>Filesystem Inodes IUsed IFree IUse% Mounted on
>>>>    ceph-fuse  68 68 0  100% /cephfs
>>>>
>>>> I think that's just because there is no real "limit" on the maximum inodes 
>>>> on CephFS. However, returning 0 breaks some existing tools (notably, Yum).
>>>>
>>>> What do you think? Should CephFS return something different than 0 here to 
>>>> not break existing tools?
>>>> Or should the tools behave differently? But one might also argue that if 
>>>> the total number of Inodes matches the used number of Inodes, the FS is 
>>>> indeed "full".
>>>> It's just unclear to me who to file a bug against ;-).
>>>>
>>>> Right now, I am just using:
>>>> yum -y --setopt=diskspacecheck=0 update
>>>> as a manual workaround, but this is naturally rather cumbersome.

Re: [ceph-users] Inodes on /cephfs

2019-04-30 Thread Oliver Freyermuth
On 01.05.19 at 00:51, Patrick Donnelly wrote:
> On Tue, Apr 30, 2019 at 8:01 AM Oliver Freyermuth
>  wrote:
>>
>> Dear Cephalopodians,
>>
>> we have a classic libvirtd / KVM based virtualization cluster using Ceph-RBD 
>> (librbd) as backend and sharing the libvirtd configuration between the nodes 
>> via CephFS
>> (all on Mimic).
>>
>> To share the libvirtd configuration between the nodes, we have symlinked 
>> some folders from /etc/libvirt to their counterparts on /cephfs,
>> so all nodes see the same configuration.
>> In general, this works very well (of course, there's a "gotcha": Libvirtd 
>> needs reloading / restart for some changes to the XMLs, we have automated 
>> that),
>> but there is one issue caused by Yum's cleverness (that's on CentOS 7). 
>> Whenever there's a libvirtd update, unattended upgrades fail, and we see:
>>
>>Transaction check error:
>>  installing package 
>> libvirt-daemon-driver-network-4.5.0-10.el7_6.7.x86_64 needs 2 inodes on the 
>> /cephfs filesystem
>>  installing package 
>> libvirt-daemon-config-nwfilter-4.5.0-10.el7_6.7.x86_64 needs 18 inodes on 
>> the /cephfs filesystem
>>
>> So it seems yum follows the symlinks and checks the available inodes on 
>> /cephfs. Sadly, that reveals:
>>[root@kvm001 libvirt]# LANG=C df -i /cephfs/
>>Filesystem Inodes IUsed IFree IUse% Mounted on
>>    ceph-fuse  68 68 0  100% /cephfs
>>
>> I think that's just because there is no real "limit" on the maximum inodes 
>> on CephFS. However, returning 0 breaks some existing tools (notably, Yum).
>>
>> What do you think? Should CephFS return something different than 0 here to 
>> not break existing tools?
>> Or should the tools behave differently? But one might also argue that if the 
>> total number of Inodes matches the used number of Inodes, the FS is indeed 
>> "full".
>> It's just unclear to me who to file a bug against ;-).
>>
>> Right now, I am just using:
>> yum -y --setopt=diskspacecheck=0 update
>> as a manual workaround, but this is naturally rather cumbersome.
> 
> This is fallout from [1]. See discussion on setting f_free to 0 here
> [2]. In summary, userland tools are trying to be too clever by looking
> at f_free. [I could be convinced to go back to f_free = ULONG_MAX if
> there are other instances of this.]
> 
> [1] https://github.com/ceph/ceph/pull/23323
> [2] https://github.com/ceph/ceph/pull/23323#issuecomment-409249911

Thanks for the references! That certainly clarifies why this decision was
taken, and of course I applaud the effort to prevent misleading monitoring.
Still, even though I don't have other instances at hand (yet), I am not yet 
convinced "0" is a better choice than "ULONG_MAX". 
It certainly alerts users / monitoring software about doing something wrong, 
but it prevents a check which any file system (or rather, any file system I 
encountered so far) allows. 

Yum (or other package managers doing things in a safe manner) need to ensure 
they can fully install a package in an "atomic" way before doing so,
since rolling back may be complex or even impossible (for most file systems). 
So they need a way to check if a file system can store the additional files in 
terms of space and inodes, before placing the data there,
or risk installing something only partially, and potentially being unable to 
roll back. 

In most cases, the number of free inodes allows for that check. Of course, that 
has no (direct) meaning for CephFS, so one might argue the tools should add an 
exception for CephFS - 
but as the discussion correctly stated, there's no defined way to find out 
whether the file system has a notion of "free inodes", and - if we go for an 
exceptional treatment for a list of file systems - 
not even a "clean" way to find out if the file system is CephFS (the tools will 
only see it is FUSE for ceph-fuse) [1]. 

So my question is: 
How can tools which need to ensure that a file system can accept a given
number of bytes and inodes before actually placing the data there perform
that check in the case of CephFS? 
And if they should not, how do they find out that this check which is valid on 
e.g. ext4 is not useful on CephFS? 
(or, in other words: if I would file a bug report against Yum, I could not 
think of any implementation they could make to solve this issue)

Of course, if it's just us, we can live with the workaround. We monitor space 
consumption on all file systems, and may start monitoring free inodes on our 
ext4 file systems, 
such that we can safely disable the Yum check on the affected nodes. 
But I wonder whether this is the best way

[ceph-users] Inodes on /cephfs

2019-04-30 Thread Oliver Freyermuth

Dear Cephalopodians,

we have a classic libvirtd / KVM based virtualization cluster using Ceph-RBD 
(librbd) as backend and sharing the libvirtd configuration between the nodes 
via CephFS
(all on Mimic).

To share the libvirtd configuration between the nodes, we have symlinked some 
folders from /etc/libvirt to their counterparts on /cephfs,
so all nodes see the same configuration.
In general, this works very well (of course, there's a "gotcha": Libvirtd needs 
reloading / restart for some changes to the XMLs, we have automated that),
but there is one issue caused by Yum's cleverness (that's on CentOS 7). 
Whenever there's a libvirtd update, unattended upgrades fail, and we see:

  Transaction check error:
installing package libvirt-daemon-driver-network-4.5.0-10.el7_6.7.x86_64 
needs 2 inodes on the /cephfs filesystem
installing package libvirt-daemon-config-nwfilter-4.5.0-10.el7_6.7.x86_64 
needs 18 inodes on the /cephfs filesystem

So it seems yum follows the symlinks and checks the available inodes on 
/cephfs. Sadly, that reveals:
  [root@kvm001 libvirt]# LANG=C df -i /cephfs/
  Filesystem Inodes IUsed IFree IUse% Mounted on
   ceph-fuse  68 68 0  100% /cephfs

I think that's just because there is no real "limit" on the maximum inodes on 
CephFS. However, returning 0 breaks some existing tools (notably, Yum).

What do you think? Should CephFS return something different than 0 here to not 
break existing tools?
Or should the tools behave differently? But one might also argue that if the total number 
of Inodes matches the used number of Inodes, the FS is indeed "full".
It's just unclear to me who to file a bug against ;-).

Right now, I am just using:
yum -y --setopt=diskspacecheck=0 update
as a manual workaround, but this is naturally rather cumbersome.

Cheers,
Oliver



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Some ceph config parameters default values

2019-02-16 Thread Oliver Freyermuth
Dear Cephalopodians,

in some recent threads on this list, I have read about the "knobs":

  pglog_hardlimit (false by default, available at least with 12.2.11 and 
13.2.5)
  bdev_enable_discard (false by default, advanced option, no description)
  bdev_async_discard  (false by default, advanced option, no description)

I am wondering about the defaults for these settings, and why these settings 
seem mostly undocumented. 

It seems to me that on SSD / NVMe devices, you would always want to enable 
discard for significantly increased lifetime,
or run fstrim regularly (which you can't with bluestore since it's a filesystem 
of its own). From personal experience, 
I have already lost two eMMC devices in Android phones early due to trimming 
not working fine. 
Of course, on first generation SSD devices, "discard" may lead to data loss 
(which for most devices has been fixed with firmware updates, though). 

I would presume that async-discard is also advantageous, since it seems to 
queue the discards and work on these in bulk later
instead of issuing them immediately (that's what I grasp from the code). 

Additionally, it's unclear to me whether the bdev-discard settings also affect 
WAL/DB devices, which are very commonly SSD/NVMe devices
in the Bluestore age. 
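
In case somebody wants to experiment, switching the discard options on would
look roughly like this (just a sketch; please test on a non-production OSD
first, and since I am not sure whether they can be changed at runtime, I would
set them in ceph.conf and restart the OSDs):

[osd]
    bdev_enable_discard = true
    bdev_async_discard = true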

Concerning the pglog_hardlimit, I read on that list that it's safe and limits 
maximum memory consumption especially for backfills / during recovery. 
So it "sounds" like this is also something that could be on by default. But 
maybe that is not the case yet to allow downgrades after failed upgrades? 
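
For reference, if I read the release notes correctly, the limit is activated
explicitly once all OSDs run a new enough version, along the lines of the
sketch below (please double-check against the release notes of your exact
version; as far as I understand, the flag cannot simply be cleared again once
set):

$ ceph osd dump | grep flags       # check whether pglog_hardlimit is already set
$ ceph osd set pglog_hardlimit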


So in the end, my question is: 
Is there a reason why these values are not on by default, and are also not 
really mentioned in the documentation? 
Are they just "not ready yet" / unsafe to be on by default, or are the defaults 
just like that because they have always been at this value,
and defaults will change with the next major release (nautilus)? 

Cheers,
Oliver



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read-only mounts of RBD images on multiple nodes for parallel reads

2019-01-17 Thread Oliver Freyermuth
Hi,

First off: I'm probably not the expert you are waiting for, but we are using 
CephFS for HPC / HTC (storing datafiles), and make use of containers for all 
jobs (up to ~2000 running in parallel). 
We also use RBD, but for our virtualization infrastructure. 

While I'm always one of the first to recommend CephFS / RBD, I personally think 
that another (open source) file system - CVMFS - may suit your 
container-usecase significantly better. 
We use that to store our container images (and software in several versions). 
The containers are rebuilt daily. 
CVMFS is read-only for the clients by design. An administrator commits changes 
on the "Stratum 0" server,
and the clients see the new changes shortly after the commit has happened. 
Things are revisioned, and you can roll back in case something goes wrong. 
Why did we choose CVMFS here? 
- No need to have an explicit write-lock when changing things. 
- Deduplication built-in. We build several new containers daily, and keep them 
for 30 days (for long-running jobs). 
  Deduplication spares us from the need to have many factors more of storage. 
  I still hope Ceph learns deduplication some day ;-). 
- Extreme caching. The file system works via HTTP, i.e. you can use standard 
caching proxies (squids), and all clients have their own local disk cache. The 
deduplication
  also applies to that, so only unique chunks need to be fetched. 
High availability is rather easy to get (not as easy as with Ceph, but you can 
have it by running one "Stratum 0" machine which does the writing,
at least two "Stratum 1" machines syncing everything, and if you want more 
performance also at least two squid servers in front). 
It's a FUSE filesystem, but unexpectedly well performing especially for small 
files as you have them for software and containers. 
The caching and deduplication heavily reduce traffic when you run many 
containers, especially when they all start concurrently. 

That's just my 2 cents, and your mileage may vary (for example, this does not 
work well if the machines running the containers do not have any local storage 
to cache things). 
And maybe you do not run thousands of containers in parallel, and you do not 
gain as much as we do from the deduplication. 

If it does not fit your case, I think RBD is a good way to go, but sadly I
cannot share any experience on how well / how stably it works with many
clients mounting the volume read-only in parallel. 
In our virtualization, there's always only one exclusive lock on a volume. 
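
For what it's worth, the workflow described in the quoted mail below would look
roughly like this on the command line (a sketch with hypothetical pool / image /
mountpoint names; the device path is whatever 'rbd map' prints, and the ext4
'noload' option avoids journal replay on the read-only mounts):

# on the single writer host
$ rbd create mypool/foo --size 4G --image-shared
$ rbd map mypool/foo
$ mkfs.ext4 /dev/rbd0 && mount /dev/rbd0 /mnt/foo
  ... write the data ...
$ umount /mnt/foo && rbd unmap mypool/foo

# on each of the N reader hosts
$ rbd map mypool/foo --read-only
$ mount -o ro,noload /dev/rbd0 /mnt/foo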

Cheers,
Oliver

On 17.01.19 at 19:27, Void Star Nill wrote:
> Hi,
> 
> We are trying to use Ceph in our products to address some of the use cases. We 
> think the Ceph block device is a good fit for us. One of the use cases is that 
> we have a number 
> of jobs running in containers that need to have Read-Only access to shared 
> data. The data is written once and is consumed multiple times. I have read 
> through some of the similar discussions and the recommendations on using 
> CephFS for these situations, but in our case Block device makes more sense as 
> it fits well with other use cases and restrictions we have around this use 
> case.
> 
> The following scenario seems to work as expected when we tried on a test 
> cluster, but we wanted to get an expert opinion to see if there would be any 
> issues in production. The usage scenario is as follows:
> 
> - A block device is created with "--image-shared" options:
> 
> rbd create mypool/foo --size 4G --image-shared
> 
> 
> - The image is mapped to a host, formatted in ext4 format (or other file 
> formats), mounted to a directory in read/write mode and data is written to 
> it. Please note that the image will be mapped in exclusive write mode -- no 
> other read/write mounts are allowed at this time.
> 
> - The volume is unmapped from the host and then mapped on to N number of 
> other hosts where it will be mounted in read-only mode and the data is read 
> simultaneously from N readers
> 
> As mentioned above, this seems to work as expected, but we wanted to confirm 
> that we won't run into any unexpected issues.
> 
> Appreciate any inputs on this.
> 
> Thanks,
> Shridhar
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Invalid RBD object maps of snapshots on Mimic

2019-01-12 Thread Oliver Freyermuth
On 10.01.19 at 16:53, Jason Dillaman wrote:
> On Thu, Jan 10, 2019 at 10:50 AM Oliver Freyermuth
>  wrote:
>>
>> Dear Jason and list,
>>
>> On 10.01.19 at 16:28, Jason Dillaman wrote:
>>> On Thu, Jan 10, 2019 at 4:01 AM Oliver Freyermuth
>>>  wrote:
>>>>
>>>> Dear Cephalopodians,
>>>>
>>>> I performed several consistency checks now:
>>>> - Exporting an RBD snapshot before and after the object map rebuilding.
>>>> - Exporting a backup as raw image, all backups (re)created before and 
>>>> after the object map rebuilding.
>>>> - md5summing all of that for a snapshot for which the rebuilding was 
>>>> actually needed.
>>>>
>>>> The good news: I found that all checksums are the same. So the backups are 
>>>> (at least for those I checked) not broken.
>>>>
>>>> I also checked the source and found:
>>>> https://github.com/ceph/ceph/blob/master/src/include/rbd/object_map_types.h
>>>> So to my understanding, the object map entries are OBJECT_EXISTS, but 
>>>> should be OBJECT_EXISTS_CLEAN.
>>>> Do I understand correctly that OBJECT_EXISTS_CLEAN relates to the object 
>>>> being unchanged ("clean") as compared to another snapshot / the main 
>>>> volume?
>>>>
>>>> If so, this would explain why the backups, exports etc. are all okay, 
>>>> since the backup tools only got "too many" objects in the fast-diff and
>>>> hence extracted too many objects from Ceph-RBD even though that was not 
>>>> needed. Since both Benji and Backy2 deduplicate again in their backends,
>>>> this causes only a minor network traffic inefficiency.
>>>>
>>>> Is my understanding correct?
>>>> Then the underlying issue would still be a bug, but (as it seems) a 
>>>> harmless one.
>>>
>>> Yes, your understanding is correct in that it's harmless from a
>>> data-integrity point-of-view.
>>>
>>> During the creation of the snapshot, the current object map (for the
>>> HEAD revision) is copied to a new object map for that snapshot and
>>> then all the objects in the HEAD revision snapshot are marked as
>>> EXISTS_CLEAN (if they EXIST). Somehow an IO operation is causing the
>>> object map to think there is an update, but apparently no object
>>> update is actually occurring (or at least the OSD doesn't think a
>>> change occurred).
>>
>> thanks a lot for the clarification! Good to know my understanding is correct.
>>
>> I re-checked all object maps just now. Again, the most recent snapshots show 
>> this issue, but only those.
>> The only "special" thing which probably not everybody is doing would likely 
>> be us running fstrim in the machines
>> running from the RBD regularly, to conserve space.
>>
>> I am not sure how exactly the DISCARD operation is handled in rbd. But since 
>> this was my guess, I just did an fstrim inside one of the VMs,
>> and checked the object-maps again. I get:
>> 2019-01-10 16:44:25.320 7f06f67fc700 -1 librbd::ObjectMapIterateRequest: 
>> object map error: object rbd_data.4f587327b23c6.0040 marked as 
>> 1, but should be 3
>> In this case, I got it for the volume itself and not a snapshot.
>>
>> So it seems to me that sometimes, DISCARD causes objects to think they have 
>> been updated, albeit they have not.
>> Sadly due to in-depth code knowledge and lack of a real debug setup I can 
>> not track it down further :-(.
>>
>> Cheers and hope that helps a code expert in tracking it down (at least it's 
>> not affecting data integrity),
> 
> Thanks, that definitely provides a good investigation starting point.

Should we also put it into a ticket, so it can be tracked? 
I could do it if you like. On the other hand, maybe you could summarize the 
issue more concisely than I can. 

Cheers and all the best,
Oliver

> 
>> Oliver
>>
>>>
>>>> I'll let you know if it happens again to some of our snapshots, and if so, 
>>>> if it only happens to newly created ones...
>>>>
>>>> Cheers,
>>>>  Oliver
>>>>
>>>> On 10.01.19 at 01:18, Oliver Freyermuth wrote:
>>>>> Dear Cephalopodians,
>>>>>
>>>>> inspired by 
>>>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/

Re: [ceph-users] Invalid RBD object maps of snapshots on Mimic

2019-01-10 Thread Oliver Freyermuth

Dear Jason and list,

On 10.01.19 at 16:28, Jason Dillaman wrote:

On Thu, Jan 10, 2019 at 4:01 AM Oliver Freyermuth
 wrote:


Dear Cephalopodians,

I performed several consistency checks now:
- Exporting an RBD snapshot before and after the object map rebuilding.
- Exporting a backup as raw image, all backups (re)created before and after the 
object map rebuilding.
- md5summing all of that for a snapshot for which the rebuilding was actually 
needed.

The good news: I found that all checksums are the same. So the backups are (at 
least for those I checked) not broken.

I also checked the source and found:
https://github.com/ceph/ceph/blob/master/src/include/rbd/object_map_types.h
So to my understanding, the object map entries are OBJECT_EXISTS, but should be 
OBJECT_EXISTS_CLEAN.
Do I understand correctly that OBJECT_EXISTS_CLEAN relates to the object being unchanged 
("clean") as compared to another snapshot / the main volume?

If so, this would explain why the backups, exports etc. are all okay, since the backup 
tools only got "too many" objects in the fast-diff and
hence extracted too many objects from Ceph-RBD even though that was not needed. 
Since both Benji and Backy2 deduplicate again in their backends,
this causes only a minor network traffic inefficiency.

Is my understanding correct?
Then the underlying issue would still be a bug, but (as it seems) a harmless 
one.


Yes, your understanding is correct in that it's harmless from a
data-integrity point-of-view.

During the creation of the snapshot, the current object map (for the
HEAD revision) is copied to a new object map for that snapshot and
then all the objects in the HEAD revision snapshot are marked as
EXISTS_CLEAN (if they EXIST). Somehow an IO operation is causing the
object map to think there is an update, but apparently no object
update is actually occurring (or at least the OSD doesn't think a
change occurred).


thanks a lot for the clarification! Good to know my understanding is correct.

I re-checked all object maps just now. Again, the most recent snapshots show 
this issue, but only those.
The only "special" thing which probably not everybody is doing would likely be 
us running fstrim in the machines
running from the RBD regularly, to conserve space.

I am not sure how exactly the DISCARD operation is handled in rbd. But since 
this was my guess, I just did an fstrim inside one of the VMs,
and checked the object-maps again. I get:
2019-01-10 16:44:25.320 7f06f67fc700 -1 librbd::ObjectMapIterateRequest: object 
map error: object rbd_data.4f587327b23c6.0040 marked as 1, but 
should be 3
In this case, I got it for the volume itself and not a snapshot.

So it seems to me that sometimes, DISCARD causes objects to think they have 
been updated, albeit they have not.
Sadly due to in-depth code knowledge and lack of a real debug setup I can not 
track it down further :-(.

Cheers and hope that helps a code expert in tracking it down (at least it's not 
affecting data integrity),
Oliver




I'll let you know if it happens again to some of our snapshots, and if so, if 
it only happens to newly created ones...

Cheers,
 Oliver

On 10.01.19 at 01:18, Oliver Freyermuth wrote:

Dear Cephalopodians,

inspired by 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-January/032092.html I 
did a check of the object-maps of our RBD volumes
and snapshots. We are running 13.2.1 on the cluster I am talking about, all 
hosts (OSDs, MONs, RBD client nodes) still on CentOS 7.5.

Sadly, I found that for at least 50 % of the snapshots (only the snapshots, not 
the volumes themselves), I got something like:
--
2019-01-09 23:00:06.481 7f89aeffd700 -1 librbd::ObjectMapIterateRequest: object 
map error: object rbd_data.519c46b8b4567.0260 marked as 1, but 
should be 3
2019-01-09 23:00:06.563 7f89aeffd700 -1 librbd::ObjectMapIterateRequest: object 
map error: object rbd_data.519c46b8b4567.0840 marked as 1, but 
should be 3
--
2019-01-09 23:00:09.166 7fbcff7fe700 -1 librbd::ObjectMapIterateRequest: object 
map error: object rbd_data.519c46b8b4567.0480 marked as 1, but 
should be 3
2019-01-09 23:00:09.228 7fbcff7fe700 -1 librbd::ObjectMapIterateRequest: object 
map error: object rbd_data.519c46b8b4567.0840 marked as 1, but 
should be 3
--
It often appears to affect 1-3 entries in the map of a snapshot. The Object Map 
was *not* marked invalid before I ran the check.
After rebuilding it, the check is fine again.

The cluster has not yet seen any Ceph update (it was installed as 13.2.1, we 
plan to upgrade to 13.2.4 soonish).
There have been no major causes of worries 

Re: [ceph-users] Invalid RBD object maps of snapshots on Mimic

2019-01-10 Thread Oliver Freyermuth

Dear Cephalopodians,

I performed several consistency checks now:
- Exporting an RBD snapshot before and after the object map rebuilding.
- Exporting a backup as raw image, all backups (re)created before and after the 
object map rebuilding.
- md5summing all of that for a snapshot for which the rebuilding was actually 
needed.

The good news: I found that all checksums are the same. So the backups are (at 
least for those I checked) not broken.

I also checked the source and found:
https://github.com/ceph/ceph/blob/master/src/include/rbd/object_map_types.h
So to my understanding, the object map entries are OBJECT_EXISTS, but should be 
OBJECT_EXISTS_CLEAN.
Do I understand correctly that OBJECT_EXISTS_CLEAN relates to the object being unchanged 
("clean") as compared to another snapshot / the main volume?

If so, this would explain why the backups, exports etc. are all okay, since the backup 
tools only got "too many" objects in the fast-diff and
hence extracted too many objects from Ceph-RBD even though that was not needed. 
Since both Benji and Backy2 deduplicate again in their backends,
this causes only a minor network traffic inefficiency.

Is my understanding correct?
Then the underlying issue would still be a bug, but (as it seems) a harmless 
one.

I'll let you know if it happens again to some of our snapshots, and if so, if 
it only happens to newly created ones...

Cheers,
Oliver

On 10.01.19 at 01:18, Oliver Freyermuth wrote:

Dear Cephalopodians,

inspired by 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-January/032092.html I 
did a check of the object-maps of our RBD volumes
and snapshots. We are running 13.2.1 on the cluster I am talking about, all 
hosts (OSDs, MONs, RBD client nodes) still on CentOS 7.5.

Sadly, I found that for at least 50 % of the snapshots (only the snapshots, not 
the volumes themselves), I got something like:
--
2019-01-09 23:00:06.481 7f89aeffd700 -1 librbd::ObjectMapIterateRequest: object 
map error: object rbd_data.519c46b8b4567.0260 marked as 1, but 
should be 3
2019-01-09 23:00:06.563 7f89aeffd700 -1 librbd::ObjectMapIterateRequest: object 
map error: object rbd_data.519c46b8b4567.0840 marked as 1, but 
should be 3
--
2019-01-09 23:00:09.166 7fbcff7fe700 -1 librbd::ObjectMapIterateRequest: object 
map error: object rbd_data.519c46b8b4567.0480 marked as 1, but 
should be 3
2019-01-09 23:00:09.228 7fbcff7fe700 -1 librbd::ObjectMapIterateRequest: object 
map error: object rbd_data.519c46b8b4567.0840 marked as 1, but 
should be 3
--
It often appears to affect 1-3 entries in the map of a snapshot. The Object Map 
was *not* marked invalid before I ran the check.
After rebuilding it, the check is fine again.

The cluster has not yet seen any Ceph update (it was installed as 13.2.1, we 
plan to upgrade to 13.2.4 soonish).
There have been no major causes of worries so far. We purged a single OSD disk, 
balanced PGs with upmap, modified the CRUSH topology slightly etc.
The cluster never was in a prolonged unhealthy period nor did we have to repair 
any PG.

Is this a known error?
Is it harmful, or is this just something like reference counting being off, and 
objects being in the map which did not really change in the snapshot?

Our usecase, in case that helps to understand or reproduce:
- RBDs are used as disks for qemu/kvm virtual machines.
- Every night:
   - We run an fstrim in the VM (which propagates to RBD and purges empty 
blocks), fsfreeze it, take a snapshot, thaw it again.
   - After that, we run two backups with Benji backup ( 
https://benji-backup.me/ ) and Backy2 backup ( http://backy2.com/docs/ )
 which seems to work rather well so far.
   - We purge some old snapshots.

We use the following RBD feature flags:
layering, exclusive-lock, object-map, fast-diff, deep-flatten

Since Benji and Backy2 are optimized for differential RBD backups to deduplicated 
storage, they leverage "rbd diff" (and hence make use of fast-diff, I would 
think).
If rbd diff produces wrong output due to this issue, it would affect our backups (but it 
would also affect classic backups of snapshots via "rbd export"...).
In case the issue is known or understood, can somebody extrapolate whether this means 
"rbd diff" contains too many blocks or actually misses changed blocks?


We are from now on running daily, full object-map checks on all volumes and 
backups, and automatically rebuild any object-map which was found invalid after 
the check.
Hopefully, this will allow to correlate the appearance of these issues with 
"something" happening on the cluster.
I did not detect 

[ceph-users] Invalid RBD object maps of snapshots on Mimic

2019-01-09 Thread Oliver Freyermuth
Dear Cephalopodians,

inspired by 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-January/032092.html I 
did a check of the object-maps of our RBD volumes
and snapshots. We are running 13.2.1 on the cluster I am talking about, all 
hosts (OSDs, MONs, RBD client nodes) still on CentOS 7.5. 

Sadly, I found that for at least 50 % of the snapshots (only the snapshots, not 
the volumes themselves), I got something like:
--
2019-01-09 23:00:06.481 7f89aeffd700 -1 librbd::ObjectMapIterateRequest: object 
map error: object rbd_data.519c46b8b4567.0260 marked as 1, but 
should be 3
2019-01-09 23:00:06.563 7f89aeffd700 -1 librbd::ObjectMapIterateRequest: object 
map error: object rbd_data.519c46b8b4567.0840 marked as 1, but 
should be 3
--
2019-01-09 23:00:09.166 7fbcff7fe700 -1 librbd::ObjectMapIterateRequest: object 
map error: object rbd_data.519c46b8b4567.0480 marked as 1, but 
should be 3
2019-01-09 23:00:09.228 7fbcff7fe700 -1 librbd::ObjectMapIterateRequest: object 
map error: object rbd_data.519c46b8b4567.0840 marked as 1, but 
should be 3
--
It often appears to affect 1-3 entries in the map of a snapshot. The Object Map 
was *not* marked invalid before I ran the check. 
After rebuilding it, the check is fine again. 

The cluster has not yet seen any Ceph update (it was installed as 13.2.1, we 
plan to upgrade to 13.2.4 soonish). 
There have been no major causes of worries so far. We purged a single OSD disk, 
balanced PGs with upmap, modified the CRUSH topology slightly etc. 
The cluster never was in a prolonged unhealthy period nor did we have to repair 
any PG. 

Is this a known error? 
Is it harmful, or is this just something like reference counting being off, and 
objects being in the map which did not really change in the snapshot? 

Our usecase, in case that helps to understand or reproduce:
- RBDs are used as disks for qemu/kvm virtual machines. 
- Every night:
  - We run an fstrim in the VM (which propagates to RBD and purges empty 
blocks), fsfreeze it, take a snapshot, thaw it again. 
  - After that, we run two backups with Benji backup ( https://benji-backup.me/ 
) and Backy2 backup ( http://backy2.com/docs/ )
which seems to work rather well so far. 
  - We purge some old snapshots. 
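
Roughly, that nightly sequence corresponds to the following (a simplified
sketch with hypothetical domain / pool / image names; error handling and making
sure the thaw always runs are omitted here):

$ virsh domfstrim vm001                               # propagate discards to RBD
$ virsh domfsfreeze vm001
$ rbd snap create rbd/vm001-disk0@nightly-$(date +%F)
$ virsh domfsthaw vm001                               # thaw right after the snapshot exists
  ... then Benji / Backy2 are pointed at the new snapshot ...
$ rbd snap rm rbd/vm001-disk0@some-old-snapshot       # purge old snapshots by name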

We use the following RBD feature flags:
layering, exclusive-lock, object-map, fast-diff, deep-flatten

Since Benji and Backy2 are optimized for differential RBD backups to 
deduplicated storage, they leverage "rbd diff" (and hence make use of 
fast-diff, I would think). 
If rbd diff produces wrong output due to this issue, it would affect our 
backups (but it would also affect classic backups of snapshots via "rbd 
export"...). 
In case the issue is known or understood, can somebody extrapolate whether this 
means "rbd diff" contains too many blocks or actually misses changed blocks? 


We are from now on running daily, full object-map checks on all volumes and 
backups, and automatically rebuild any object-map which was found invalid after 
the check. 
Hopefully, this will allow to correlate the appearance of these issues with 
"something" happening on the cluster. 
I did not detect a clean pattern in the affected snapshots, though, it seemed 
rather random... 
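
In case somebody wants to run the same kind of check, it boils down to a loop
like this (a sketch with a hypothetical pool / image name; it assumes that
"rbd object-map check" exits non-zero on a mismatch, otherwise one can look for
the "object map invalid" flag in "rbd info" after the check):

$ rbd object-map check rbd/vm001-disk0          # the volume itself
$ rbd snap ls rbd/vm001-disk0 | awk 'NR>1 {print $2}' | while read snap; do
      rbd object-map check rbd/vm001-disk0@"$snap" || \
      rbd object-map rebuild rbd/vm001-disk0@"$snap"
  done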

Maybe it would also help to understand this issue if somebody else using RBD in 
a similar manner on Mimic could also check the object-maps. 
Since this issue does not show up until a check is performed, this was below 
our radar for many months now... 

Cheers,
Oliver



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD snapshot atomicity guarantees?

2018-12-18 Thread Oliver Freyermuth
On 18.12.18 at 11:48, Hector Martin wrote:
> On 18/12/2018 18:28, Oliver Freyermuth wrote:
>> We have yet to observe these hangs, we are running this with ~5 VMs with ~10 
>> disks for about half a year now with daily snapshots. But all of these VMs 
>> have very "low" I/O,
>> since we put anything I/O intensive on bare metal (but with automated 
>> provisioning of course).
>>
>> So I'll chime in on your question, especially since there might be VMs on 
>> our cluster in the future where the inner OS may not be running an agent.
>> Since we did not observe this yet, I'll also add: What's your "scale", is it 
>> hundreds of VMs / disks? Hourly snapshots? I/O intensive VMs?
> 
> 5 hosts, 15 VMs, daily snapshots. I/O is variable (customer workloads); 
> usually not that high, but it can easily peak at 100% when certain things 
> happen. We don't have great I/O performance (RBD over 1gbps links to HDD 
> OSDs).
> 
> I'm poring through monitoring graphs now and I think the issue this time 
> around was just too much dirty data in the page cache of a guest. The VM that 
> failed spent 3 minutes flushing out writes to disk before its I/O was 
> quiesced, at around 100 IOPS throughput (the actual data throughput was low, 
> though, so small writes). That exceeded our timeout and then things went 
> south from there.
> 
> I wasn't sure if fsfreeze did a full sync to disk, but given the I/O behavior 
> I'm seeing that seems to be the case. Unfortunately coming up with an upper 
> bound for the freeze time seems tricky now. I'm increasing our timeout to 15 
> minutes, we'll see if the problem recurs.
> 
> Given this, it makes even more sense to just avoid the freeze if at all 
> reasonable. There's no real way to guarantee that a fsfreeze will complete in 
> a "reasonable" amount of time as far as I can tell.

Potentially, if granted arbitrary command execution by the guest agent, you 
could check (there might be a better interface than parsing meminfo...):
  cat /proc/meminfo | grep -i dirty
  Dirty: 19476 kB
You could guess from that information how long the fsfreeze may take (ideally, 
combining that with allowed IOPS). 
Of course, if you have control over your VMs, you may also play with the 
vm.dirty_ratio and vm.dirty_background_ratio. 

Interestingly, tuned on CentOS 7 configures for a "virtual-guest" profile:
vm.dirty_ratio = 30
(default is 20 %) so they optimize for performance by increasing the dirty 
buffers to delay writeback even more. 
They take the opposite for their "virtual-host" profile:
vm.dirty_background_ratio = 5
(default is 10 %). 
I believe these choices are good for performance, but may increase the time it 
takes to freeze the VMs, especially if IOPS are limited and there's a lot of 
dirty data. 
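
As a concrete example, checking and lowering those thresholds on a guest could
look like this (the values are only illustrative; the right numbers depend on
RAM size and the IOPS the guest is allowed to consume):

$ grep -E 'Dirty|Writeback' /proc/meminfo        # what is waiting to be flushed right now
$ sysctl -w vm.dirty_background_ratio=5 vm.dirty_ratio=10
# persist e.g. via /etc/sysctl.d/90-dirty.conf:
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10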

Since we also have 1 Gbps links and HDD OSDs, and plan to add more and more VMs 
and hosts, we may also observe this one day... 
So I'm curious:
How did you implement the timeout in your case? Are you using a 
qemu-agent-command issuing fsfreeze with --async and --timeout instead of 
domfsfreeze? 
We are using domfsfreeze as of now, which (probably) has an infinite timeout, 
or at least no timeout documented in the manpage. 
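
For the record, the bounded variant via the guest agent would probably look
roughly like this (a sketch; "vm001" is a hypothetical domain name and the
timeout is in seconds):

$ virsh qemu-agent-command vm001 --timeout 60 '{"execute":"guest-fsfreeze-freeze"}'
$ virsh qemu-agent-command vm001 '{"execute":"guest-fsfreeze-thaw"}'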

Cheers,
Oliver



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD snapshot atomicity guarantees?

2018-12-18 Thread Oliver Freyermuth

Dear Hector,

we are using the very same approach on CentOS 7 (freeze + thaw), but preceded 
by an fstrim. With virtio-scsi, using fstrim propagates the discards from 
within the VM to Ceph RBD (if qemu is configured accordingly),
and a lot of space is saved.

We have yet to observe these hangs, we are running this with ~5 VMs with ~10 disks for 
about half a year now with daily snapshots. But all of these VMs have very 
"low" I/O,
since we put anything I/O intensive on bare metal (but with automated 
provisioning of course).

So I'll chime in on your question, especially since there might be VMs on our 
cluster in the future where the inner OS may not be running an agent.
Since we did not observe this yet, I'll also add: What's your "scale", is it 
hundreds of VMs / disks? Hourly snapshots? I/O intensive VMs?

Cheers,
Oliver

On 18.12.18 at 10:10, Hector Martin wrote:

Hi list,

I'm running libvirt qemu guests on RBD, and currently taking backups by issuing 
a domfsfreeze, taking a snapshot, and then issuing a domfsthaw. This seems to 
be a common approach.

This is safe, but it's impactful: the guest has frozen I/O for the duration of 
the snapshot. This is usually only a few seconds. Unfortunately, the freeze 
action doesn't seem to be very reliable. Sometimes it times out, leaving the 
guest in a messy situation with frozen I/O (thaw times out too when this 
happens, or returns success but FSes end up frozen anyway). This is clearly a 
bug somewhere, but I wonder whether the freeze is a hard requirement or not.

Are there any atomicity guarantees for RBD snapshots taken *without* freezing 
the filesystem? Obviously the filesystem will be dirty and will require journal 
recovery, but that is okay; it's equivalent to a hard shutdown/crash. But is 
there any chance of corruption related to the snapshot being taken in a 
non-atomic fashion? Filesystems and applications these days should have no 
trouble with hard shutdowns, as long as storage writes follow ordering 
guarantees (no writes getting reordered across a barrier and such).

Put another way: do RBD snapshots have ~identical atomicity guarantees to e.g. 
LVM snapshots?

If we can get away without the freeze, honestly I'd rather go that route. If I 
really need to pause I/O during the snapshot creation, I might end up resorting 
to pausing the whole VM (suspend/resume), which has higher impact but also 
probably a much lower chance of messing up (or having excess latency), since it 
doesn't involve the guest OS or the qemu agent at all...






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Warning: Forged Email] Ceph 10.2.11 - Status not working

2018-12-17 Thread Oliver Freyermuth
That's kind of unrelated to Ceph, but since you wrote two mails already,
and I believe it is caused by the mailing list software for ceph-users... 

Your original mail distributed via the list ("[ceph-users] Ceph 10.2.11 - 
Status not working") did 
*not* have the forged-warning. 
Only the subsequent "Re:"-replies by yourself had it. That also matches what 
you will find in the archives. 

So my guess is that "[Warning: Forged Email]" was added by your own mailing 
system for the mail incoming to you after it was distributed by the ceph-users 
list server. 

That's probably since the mailman sending mail for ceph-users leaves the 
"From:" intact,
and that contains your domain (oeg.com.au). So the mailman server for 
ceph-users is "forging",
since it sends mail with "From: m...@oeg.com.au", but using its own IP, hence 
violating your SPF record. 
It also breaks DKIM by adding the footer (ceph-users mailing list, 
ceph-users@lists.ceph.com, 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com)
thus manipulating the body of the mail. 

So in short: The mailman used for ceph-users breaks both SPF and DKIM (most 
mailing lists still do that). My guess is that your mailing system
adds a tag "[Warning: Forged Email]" at least for mail with a "From:" matching 
your domain in case SPF and / or DKIM is broken. 

If somebody wants to "fix" this: The reason is sadly that SPF and DKIM are not 
well suited for mailing lists :-(. But workarounds exist. 
Newer mailing list software (including modern mailman releases) allows
rewriting the "From:" before sending out mail,
e.g. writing in the header:
  From: "Mike O'Connor (via ceph-users list)" 
  Reply-To: "Mike O'Connor" 
With this, SPF is fine, since the mail server sending the mail is allowed to do 
so for @lists.ceph.com . Users can still reply just fine. 
Concerning DKIM, there's also a middle ground. The cleanest (I believe) is pruning all 
previous DKIM signatures on the list server and re-signing before sending it 
out. 

Adding the footer will still break S/MIME signatures, but that's another matter. 

Cheers,
Oliver

On 18.12.18 at 01:34, Mike O'Connor wrote:
> mmm wonder why the list is saying my email is forged, wonder what I have
> wrong.
> 
> My email is sent via an outbound spam filter, but I was sure I had the
> SPF set correctly.
> 
> Mike
> 
> On 18/12/18 10:53 am, Mike O'Connor wrote:
>> Hi All
>>
>> I have a ceph cluster which has been working with out issues for about 2
>> years now, it was upgrade about 6 month ago to 10.2.11
>>
>> root@blade3:/var/lib/ceph/mon# ceph status
>> 2018-12-18 10:42:39.242217 7ff770471700  0 -- 10.1.5.203:0/1608630285 >>
>> 10.1.5.207:6789/0 pipe(0x7ff768000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1
>> c=0x7ff768001f90).fault
>> 2018-12-18 10:42:45.242745 7ff770471700  0 -- 10.1.5.203:0/1608630285 >>
>> 10.1.5.207:6789/0 pipe(0x7ff7680051e0 sd=3 :0 s=1 pgs=0 cs=0 l=1
>> c=0x7ff768002410).fault
>> 2018-12-18 10:42:51.243230 7ff770471700  0 -- 10.1.5.203:0/1608630285 >>
>> 10.1.5.207:6789/0 pipe(0x7ff7680051e0 sd=3 :0 s=1 pgs=0 cs=0 l=1
>> c=0x7ff768002f40).fault
>> 2018-12-18 10:42:54.243452 7ff770572700  0 -- 10.1.5.203:0/1608630285 >>
>> 10.1.5.205:6789/0 pipe(0x7ff768000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1
>> c=0x7ff768008060).fault
>> 2018-12-18 10:42:57.243715 7ff770471700  0 -- 10.1.5.203:0/1608630285 >>
>> 10.1.5.207:6789/0 pipe(0x7ff7680051e0 sd=3 :0 s=1 pgs=0 cs=0 l=1
>> c=0x7ff768003580).fault
>> 2018-12-18 10:43:03.244280 7ff7781b9700  0 -- 10.1.5.203:0/1608630285 >>
>> 10.1.5.205:6789/0 pipe(0x7ff7680051e0 sd=3 :0 s=1 pgs=0 cs=0 l=1
>> c=0x7ff768003670).fault
>>
>> All systems can ping each other. I simply cannot see why it's failing.
>>
>>
>> ceph.conf
>>
>> [global]
>>      auth client required = cephx
>>      auth cluster required = cephx
>>      auth service required = cephx
>>      cluster network = 10.1.5.0/24
>>      filestore xattr use omap = true
>>      fsid = 42a0f015-76da-4f47-b506-da5cdacd030f
>>      keyring = /etc/pve/priv/$cluster.$name.keyring
>>      osd journal size = 5120
>>      osd pool default min size = 1
>>      public network = 10.1.5.0/24
>>  mon_pg_warn_max_per_osd = 0
>>
>> [client]
>>      rbd cache = true
>> [osd]
>>      keyring = /var/lib/ceph/osd/ceph-$id/keyring
>>      osd max backfills = 1
>>      osd recovery max active = 1
>>      osd_disk_threads = 1
>>      osd_disk_thread_ioprio_class = idle
>>      osd_disk_thread_ioprio_priority = 7
>> [mon.2]
>>      host = blade5
>>      mon addr = 10.1.5.205:6789
>> [mon.1]
>>      host = blade3
>>      mon addr = 10.1.5.203:6789
>> [mon.3]
>>      host = blade7
>>      mon addr = 10.1.5.207:6789
>> [mon.0]
>>      host = blade1
>>      mon addr = 10.1.5.201:6789
>> [mds]
>>  mds data = /var/lib/ceph/mds/mds.$id
>>  keyring = /var/lib/ceph/mds/mds.$id/mds.$id.keyring
>> [mds.0]
>>  host = blade1
>> [mds.1]
>>  host = blade3
>> [mds.2]
>>  host = blade5
>> [mds.3]
>>  host = blade7
>>
>>
>> Any 

Re: [ceph-users] Upgrade to Luminous (mon+osd)

2018-12-03 Thread Oliver Freyermuth

There's also an additional issue which made us activate
CEPH_AUTO_RESTART_ON_UPGRADE=yes
(and of course, not have automatic updates of Ceph):
  When using compression e.g. with Snappy, it seems that already running OSDs
which try to dlopen() the snappy library can become unhappy during some version
upgrades if the installed library no longer matches what they expect
(i.e. symbols don't match).

So effectively, it seems that in some cases you can not get around restarting 
the OSDs when updating the corresponding packages.
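
For reference, on our RPM-based systems this is just a sysconfig switch (a
sketch; as far as I understand, the file is evaluated by the post-install
scriptlets of the ceph packages):

# /etc/sysconfig/ceph
CEPH_AUTO_RESTART_ON_UPGRADE=yes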

Cheers,
Oliver

On 03.12.18 at 15:51, Dan van der Ster wrote:

It's not that simple, see http://tracker.ceph.com/issues/21672

For the 12.2.8 to 12.2.10 upgrade it seems the selinux module was
updated -- so the rpms restart the ceph.target.
What's worse is that this seems to happen before all the new updated
files are in place.

Our 12.2.8 to 12.2.10 upgrade procedure is:

systemctl stop ceph.target
yum update
systemctl start ceph.target

-- Dan

On Mon, Dec 3, 2018 at 12:42 PM Paul Emmerich  wrote:


Upgrading Ceph packages does not restart the services -- exactly for
this reason.

This means there's something broken with your yum setup if the
services are restarted when only installing the new version.


Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, 3 Dec 2018 at 11:56, Jan Kasprzak  wrote:


 Hello, ceph users,

I have a small(-ish) Ceph cluster, where there are osds on each host,
and in addition to that, there are mons on the first three hosts.
Is it possible to upgrade the cluster to Luminous without service
interruption?

I have tested that when I run "yum --enablerepo Ceph update" on a
mon host, the osds on that host remain down until all three mons
are upgraded to Luminous. Is it possible to upgrade ceph-mon only,
and keep ceph-osd running the old version (Jewel in my case) as long
as possible? It seems RPM dependencies forbid this, but with --nodeps
it could be done.

Is there a supported way to upgrade a host running both mon and osd
to Luminous?

Thanks,

-Yenya

--
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
  This is the world we live in: the way to deal with computers is to google
  the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Customized Crush location hooks in Mimic

2018-11-30 Thread Oliver Freyermuth
Dear Greg,

On 30.11.18 at 18:38, Gregory Farnum wrote:
> I’m pretty sure the monitor command there won’t move intermediate buckets 
> like the host. This is so if an osd has incomplete metadata it doesn’t 
> inadvertently move 11 other OSDs into a different rack/row/whatever.
> 
> So in this case, it finds the host osd0001 and matches it, but since the 
> crush map already knows about osd0001 it doesn’t pay any attention to the 
> datacenter field.
> Whereas if you tried setting it with mynewhost, the monitor wouldn’t know 
> where that host exists and would look at the other fields to set it in the 
> specified data center.

thanks! That's a good and clear explanation. It was not apparent to me from the
documentation, but it sounds like the safest way to go.
So in the end, crush location hooks are mostly useful for freshly created OSDs,
e.g. on a new host (they should then go directly to the correct rack /
datacenter etc.).

I wonder if that's the only sensible use case, but right now it seems to me that
it is.
So for our scheme, I will indeed use the hook for that, and when hosts are moved
physically, I will move the corresponding CRUSH buckets manually to the other
rack / datacenter.
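(Concretely, if I understand the CRUSH tooling correctly, moving a full host then
boils down to something like

  ceph osd crush move osd001 datacenter=FTD

executed once after the physical move - I'll verify that on our test setup before
relying on it.)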

Thanks for the explanation!
Cheers,
Oliver

> -Greg
> On Fri, Nov 30, 2018 at 6:46 AM Oliver Freyermuth wrote:
> 
> Dear Cephalopodians,
> 
> sorry for the spam, but I found the following in mon logs just now and am 
> finally out of ideas:
> 
> --
> 2018-11-30 15:43:05.207 7f9d64aac700  0 mon.mon001@0(leader) e3 
> handle_command mon_command({"prefix": "osd crush set-device-class", "class": 
> "hdd", "ids": ["1"]} v 0) v1
> 2018-11-30 15:43:05.207 7f9d64aac700  0 log_channel(audit) log [INF] : 
> from='osd.1 10.160.12.101:6816/90528' 
> entity='osd.1' cmd=[{"prefix": "osd crush set-device-class", "class": "hdd", 
> "ids": ["1"]}]: dispatch
> 2018-11-30 15:43:05.208 7f9d64aac700  0 mon.mon001@0(leader) e3 
> handle_command mon_command({"prefix": "osd crush create-or-move", "id": 1, 
> "weight":3.6824, "args": ["datacenter=FTD", "host=osd001", "root=default"]} v 
> 0) v1
> 2018-11-30 15:43:05.208 7f9d64aac700  0 log_channel(audit) log [INF] : 
> from='osd.1 10.160.12.101:6816/90528' 
> entity='osd.1' cmd=[{"prefix": "osd crush create-or-move", "id": 1, 
> "weight":3.6824, "args": ["datacenter=FTD", "host=osd001", "root=default"]}]: 
> dispatch
> 2018-11-30 15:43:05.208 7f9d64aac700  0 mon.mon001@0(leader).osd e2464 
> create-or-move crush item name 'osd.1' initial_weight 3.6824 at location 
> {datacenter=FTD,host=osd001,root=default}
> 
> --
> So the request to move to datacenter=FTD arrives at the mon, but no 
> action is taken, and the OSD is left in FTD_1.
> 
> Cheers,
>         Oliver
> 
> Am 30.11.18 um 15:25 schrieb Oliver Freyermuth:
> > Dear Cephalopodians,
> >
> > further experiments revealed that the crush-location-hook is indeed 
> called!
> > It's just my check (writing to a file in tmp from inside the hook) 
> which somehow failed. Using "logger" works for debugging.
> >
> > So now, my hook outputs:
> > host=osd001 datacenter=FTD root=default
> > as explained before. I have also explicitly created the buckets 
> beforehand in case that is needed.
> >
> > Tree looks like that:
> > # ceph osd tree
> > ID  CLASS WEIGHT   TYPE NAME    STATUS REWEIGHT PRI-AFF
> >   -1   55.23582 root default
> >   -9  0 datacenter FTD
> > -12   18.41194 datacenter FTD_1
> >   -3   18.41194 host osd001
> >    0   hdd  3.68239 osd.0    up  1.0 1.0
> >    1   hdd  3.68239 osd.1    up  1.0 1.0
> >    2   hdd  3.68239 osd.2    up  1.0 1.0
> >    3   hdd  3.68239 osd.3    up  1.0 1.0
> >    4   hdd  3.68239 osd.4    up  1.0 1.0
> > -11  0 datacenter FTD_2
> >   -5   18.41194 host osd002
> >    5  

Re: [ceph-users] Customized Crush location hooks in Mimic

2018-11-30 Thread Oliver Freyermuth

Dear Cephalopodians,

sorry for the spam, but I found the following in mon logs just now and am 
finally out of ideas:
--
2018-11-30 15:43:05.207 7f9d64aac700  0 mon.mon001@0(leader) e3 handle_command mon_command({"prefix": "osd crush 
set-device-class", "class": "hdd", "ids": ["1"]} v 0) v1
2018-11-30 15:43:05.207 7f9d64aac700  0 log_channel(audit) log [INF] : from='osd.1 10.160.12.101:6816/90528' entity='osd.1' 
cmd=[{"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["1"]}]: 
dispatch
2018-11-30 15:43:05.208 7f9d64aac700  0 mon.mon001@0(leader) e3 handle_command mon_command({"prefix": "osd crush create-or-move", 
"id": 1, "weight":3.6824, "args": ["datacenter=FTD", "host=osd001", "root=default"]} v 0) v1
2018-11-30 15:43:05.208 7f9d64aac700  0 log_channel(audit) log [INF] : from='osd.1 10.160.12.101:6816/90528' entity='osd.1' cmd=[{"prefix": "osd 
crush create-or-move", "id": 1, "weight":3.6824, "args": ["datacenter=FTD", "host=osd001", 
"root=default"]}]: dispatch
2018-11-30 15:43:05.208 7f9d64aac700  0 mon.mon001@0(leader).osd e2464 
create-or-move crush item name 'osd.1' initial_weight 3.6824 at location 
{datacenter=FTD,host=osd001,root=default}
--
So the request to move to datacenter=FTD arrives at the mon, but no action is 
taken, and the OSD is left in FTD_1.

Cheers,
Oliver

Am 30.11.18 um 15:25 schrieb Oliver Freyermuth:

Dear Cephalopodians,

further experiments revealed that the crush-location-hook is indeed called!
It's just my check (writing to a file in tmp from inside the hook) which somehow failed. 
Using "logger" works for debugging.

So now, my hook outputs:
host=osd001 datacenter=FTD root=default
as explained before. I have also explicitly created the buckets beforehand in 
case that is needed.

Tree looks like that:
# ceph osd tree
ID  CLASS WEIGHT   TYPE NAME    STATUS REWEIGHT PRI-AFF
  -1   55.23582 root default
  -9  0 datacenter FTD
-12   18.41194 datacenter FTD_1
  -3   18.41194 host osd001
   0   hdd  3.68239 osd.0    up  1.0 1.0
   1   hdd  3.68239 osd.1    up  1.0 1.0
   2   hdd  3.68239 osd.2    up  1.0 1.0
   3   hdd  3.68239 osd.3    up  1.0 1.0
   4   hdd  3.68239 osd.4    up  1.0 1.0
-11  0 datacenter FTD_2
  -5   18.41194 host osd002
   5   hdd  3.68239 osd.5    up  1.0 1.0
   6   hdd  3.68239 osd.6    up  1.0 1.0
   7   hdd  3.68239 osd.7    up  1.0 1.0
   8   hdd  3.68239 osd.8    up  1.0 1.0
   9   hdd  3.68239 osd.9    up  1.0 1.0
  -7   18.41194 host osd003
  10   hdd  3.68239 osd.10   up  1.0 1.0
  11   hdd  3.68239 osd.11   up  1.0 1.0
  12   hdd  3.68239 osd.12   up  1.0 1.0
  13   hdd  3.68239 osd.13   up  1.0 1.0
  14   hdd  3.68239 osd.14   up  1.0 1.0

So naively, I would expect that when I restart osd.0, it should move itself 
into datacenter=FTD.
But that does not happen...

Any idea what I am missing?

Cheers,
 Oliver



Am 30.11.18 um 11:44 schrieb Oliver Freyermuth:

Dear Cephalopodians,

I'm probably missing something obvious, but I am at a loss here on how to 
actually make use of a customized crush location hook.

I'm currently on "ceph version 13.2.1" on CentOS 7 (i.e. the last version 
before the upgrade-preventing bugs). Here's what I did:

1. Write a script /usr/local/bin/customized-ceph-crush-location. The script can be 
executed by user "ceph":
   # sudo -u ceph /usr/local/bin/customized-ceph-crush-location
   host=osd001 datacenter=FTD root=default

2. Add the following to ceph.conf:
  [osd]
  crush_location_hook = /usr/local/bin/customized-ceph-crush-location

3. Restart an OSD and confirm that it is picked up:
  # systemctl restart ceph-osd@0
  # ceph config show-with-defaults osd.0
   ...
   crush_location_hook    /usr/local/bin/customized-ceph-crush-location  
file
   ...
   osd_crush_update_on_start  true   
default
   ...

However, the script is not executed; I can tell because the script should also
write a log file to /tmp, which is never created.
Also, the "datacenter" type does not show up in the crush tree.

I have already disabled S

Re: [ceph-users] Customized Crush location hooks in Mimic

2018-11-30 Thread Oliver Freyermuth

Dear Cephalopodians,

further experiments revealed that the crush-location-hook is indeed called!
It's just my check (writing to a file in tmp from inside the hook) which somehow failed. 
Using "logger" works for debugging.

So now, my hook outputs:
host=osd001 datacenter=FTD root=default
as explained before. I have also explicitly created the buckets beforehand in 
case that is needed.

Tree looks like that:
# ceph osd tree
ID  CLASS WEIGHT   TYPE NAMESTATUS REWEIGHT PRI-AFF
 -1   55.23582 root default
 -9  0 datacenter FTD
-12   18.41194 datacenter FTD_1
 -3   18.41194 host osd001
  0   hdd  3.68239 osd.0up  1.0 1.0
  1   hdd  3.68239 osd.1up  1.0 1.0
  2   hdd  3.68239 osd.2up  1.0 1.0
  3   hdd  3.68239 osd.3up  1.0 1.0
  4   hdd  3.68239 osd.4up  1.0 1.0
-11  0 datacenter FTD_2
 -5   18.41194 host osd002
  5   hdd  3.68239 osd.5up  1.0 1.0
  6   hdd  3.68239 osd.6up  1.0 1.0
  7   hdd  3.68239 osd.7up  1.0 1.0
  8   hdd  3.68239 osd.8up  1.0 1.0
  9   hdd  3.68239 osd.9up  1.0 1.0
 -7   18.41194 host osd003
 10   hdd  3.68239 osd.10   up  1.0 1.0
 11   hdd  3.68239 osd.11   up  1.0 1.0
 12   hdd  3.68239 osd.12   up  1.0 1.0
 13   hdd  3.68239 osd.13   up  1.0 1.0
 14   hdd  3.68239 osd.14   up  1.0 1.0

So naively, I would expect that when I restart osd.0, it should move itself 
into datacenter=FTD.
But that does not happen...

Any idea what I am missing?

Cheers,
Oliver



Am 30.11.18 um 11:44 schrieb Oliver Freyermuth:

Dear Cephalopodians,

I'm probably missing something obvious, but I am at a loss here on how to 
actually make use of a customized crush location hook.

I'm currently on "ceph version 13.2.1" on CentOS 7 (i.e. the last version 
before the upgrade-preventing bugs). Here's what I did:

1. Write a script /usr/local/bin/customized-ceph-crush-location. The script can be 
executed by user "ceph":
   # sudo -u ceph /usr/local/bin/customized-ceph-crush-location
   host=osd001 datacenter=FTD root=default

2. Add the following to ceph.conf:
  [osd]
  crush_location_hook = /usr/local/bin/customized-ceph-crush-location

3. Restart an OSD and confirm that it is picked up:
  # systemctl restart ceph-osd@0
  # ceph config show-with-defaults osd.0
   ...
   crush_location_hook    /usr/local/bin/customized-ceph-crush-location  
file
   ...
   osd_crush_update_on_start  true   
default
   ...

However, the script is not executed; I can tell because the script should also
write a log file to /tmp, which is never created.
Also, the "datacenter" type does not show up in the crush tree.

I have already disabled SELinux just to make sure.

Any ideas what I am missing here?

Cheers and thanks in advance,
 Oliver








[ceph-users] Customized Crush location hooks in Mimic

2018-11-30 Thread Oliver Freyermuth

Dear Cephalopodians,

I'm probably missing something obvious, but I am at a loss here on how to 
actually make use of a customized crush location hook.

I'm currently on "ceph version 13.2.1" on CentOS 7 (i.e. the last version 
before the upgrade-preventing bugs). Here's what I did:

1. Write a script /usr/local/bin/customized-ceph-crush-location. The script can be 
executed by user "ceph":
  # sudo -u ceph /usr/local/bin/customized-ceph-crush-location
  host=osd001 datacenter=FTD root=default

2. Add the following to ceph.conf:
 [osd]
 crush_location_hook = /usr/local/bin/customized-ceph-crush-location

3. Restart an OSD and confirm that it is picked up:
 # systemctl restart ceph-osd@0
 # ceph config show-with-defaults osd.0
  ...
  crush_location_hook        /usr/local/bin/customized-ceph-crush-location  file
  ...
  osd_crush_update_on_start  true                                            default
  ...
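
For completeness, the hook from step 1 is essentially just the following few lines
(a simplified sketch - the real script derives the datacenter from the host name;
as far as I understand, Ceph calls the hook with --cluster <name> --id <osd-id>
--type osd and expects a single line with the CRUSH location on stdout):

  #!/bin/bash
  # log the invocation for debugging
  logger -t customized-ceph-crush-location "called with: $*"
  # print the desired CRUSH location for this OSD
  echo "host=$(hostname -s) datacenter=FTD root=default"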

However, the script is not executed; I can tell because the script should also
write a log file to /tmp, which is never created.
Also, the "datacenter" type does not show up in the crush tree.

I have already disabled SELinux just to make sure.

Any ideas what I am missing here?

Cheers and thanks in advance,
Oliver





Re: [ceph-users] ceph df space usage confusion - balancing needed?

2018-10-26 Thread Oliver Freyermuth
Am 27.10.18 um 04:12 schrieb Linh Vu:
> Should be fine as long as your "mgr/balancer/max_misplaced" is reasonable. I 
> find the default value of 0.05 decent enough, although from experience that 
> seems like 0.05% rather than 5% as suggested here: 
> http://docs.ceph.com/docs/luminous/mgr/balancer/  

Ok! I did actually choose 0.01. Interestingly, during the initial large 
rebalancing, it went up to > 2 % of misplaced objects (in small steps) until I 
decided to stop the balancer for a day to give the cluster
enough time to adapt. 
 
> You can also choose to turn it on only during certain hours when the cluster 
> might be less busy. The config-keys are there somewhere (there's a post by 
> Dan van der Ster on the ML about them) but they don't actually work in 12.2.8 
> at least, when I tried them. I suggest just use cron to turn the balancer on 
> and off. 

I found that mail in the archives. Indeed, that seems helpful. I'll start by
leaving the balancer on permanently for now and observe whether it has any impact.
Since we rarely change the cluster's layout,
it should effectively just sit there silently most of the time. 
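
For the record, switching it on was roughly the following (Luminous still reads
the balancer settings from config-keys, if I remember correctly):

  ceph osd set-require-min-compat-client luminous   # required for upmap mode
  ceph config-key set mgr/balancer/max_misplaced 0.01
  ceph balancer mode upmap
  ceph balancer on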

Thanks!
Oliver

> 
> ----------
> *From:* Oliver Freyermuth 
> *Sent:* Friday, 26 October 2018 9:32:14 PM
> *To:* Linh Vu; Janne Johansson
> *Cc:* ceph-users@lists.ceph.com; Peter Wienemann
> *Subject:* Re: [ceph-users] ceph df space usage confusion - balancing needed?
>  
> Dear Cephalopodians,
> 
> thanks for all your feedback!
> 
> I finally "pushed the button" and let upmap run for ~36 hours.
> Previously, we had ~63 % usage of our CephFS with only 50 % raw usage, now, 
> we see only 53.77 % usage.
> 
> That's as close as I expect things to ever become, and we gained about 70 TiB 
> of free storage by that, which is almost one file server.
> So the outcome is really close to perfection :-).
> 
> I'm leaving the balancer active now in upmap mode. Any bad experiences with 
> leaving it active "forever"?
> 
> Cheers and many thanks again,
>     Oliver
> 
> Am 23.10.18 um 01:14 schrieb Linh Vu:
>> Upmap is awesome. I ran it on our new cluster before we started ingesting 
>> data, so that the PG count is balanced on all OSDs. After ingesting about 
>> 315TB, it's still beautifully balanced. Note: we have a few nodes with 8TB 
>> OSDs, and the rest on 10TBs. 
>> 
>> 
>> # ceph osd df plain
>> ID  CLASS    WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS 
>>   0   mf1hdd 7.27739  1.0 7.28TiB 2.06TiB 5.21TiB 28.34 1.01 144 
>>   1   mf1hdd 7.27739  1.0 7.28TiB 2.07TiB 5.21TiB 28.38 1.02 144 
>>   2   mf1hdd 7.27739  1.0 7.28TiB 2.03TiB 5.24TiB 27.96 1.00 142 
>>   3   mf1hdd 7.27739  1.0 7.28TiB 2.06TiB 5.21TiB 28.37 1.02 144 
>>   4   mf1hdd 7.27739  1.0 7.28TiB 2.03TiB 5.24TiB 27.96 1.00 142 
>>   5   mf1hdd 7.27739  1.0 7.28TiB 2.02TiB 5.26TiB 27.73 0.99 141 
>>   6   mf1hdd 7.27739  1.0 7.28TiB 2.03TiB 5.24TiB 27.94 1.00 142 
>>   7   mf1hdd 7.27739  1.0 7.28TiB 2.06TiB 5.21TiB 28.35 1.02 144 
>>   8   mf1hdd 7.27739  1.0 7.28TiB 2.02TiB 5.26TiB 27.76 0.99 141 
>>   9   mf1hdd 7.27739  1.0 7.28TiB 2.04TiB 5.24TiB 27.97 1.00 142 
>>  10   mf1hdd 7.27739  1.0 7.28TiB 2.06TiB 5.21TiB 28.35 1.02 144 
>>  11   mf1hdd 7.27739  1.0 7.28TiB 2.04TiB 5.24TiB 27.99 1.00 142 
>>  12   mf1hdd 7.27739  1.0 7.28TiB 2.02TiB 5.26TiB 27.75 0.99 141 
>>  13   mf1hdd 7.27739  1.0 7.28TiB 2.03TiB 5.24TiB 27.96 1.00 142 
>>  14   mf1hdd 7.27739  1.0 7.28TiB 2.02TiB 5.26TiB 27.78 0.99 141 
>>  15   mf1hdd 7.27739  1.0 7.28TiB 2.07TiB 5.21TiB 28.38 1.02 144 
>> 224 nvmemeta 0.02179  1.0 22.3GiB 1.52GiB 20.8GiB  6.82 0.24 185 
>> 225 nvmemeta 0.02179  1.0 22.4GiB 1.49GiB 20

Re: [ceph-users] ceph df space usage confusion - balancing needed?

2018-10-26 Thread Oliver Freyermuth
7.99 1.00 174 
> 137   mf1hdd 8.91019  1.0 8.91TiB 2.48TiB 6.43TiB 27.82 1.00 173 
> 138   mf1hdd 8.91019  1.0 8.91TiB 2.48TiB 6.43TiB 27.81 1.00 173 
> 139   mf1hdd 8.91019  1.0 8.91TiB 2.48TiB 6.43TiB 27.84 1.00 173 
> 140   mf1hdd 8.91019  1.0 8.91TiB 2.48TiB 6.43TiB 27.81 1.00 173 
> 141   mf1hdd 8.91019  1.0 8.91TiB 2.48TiB 6.43TiB 27.82 1.00 173 
> 142   mf1hdd 8.91019  1.0 8.91TiB 2.50TiB 6.41TiB 28.00 1.00 174 
> 143   mf1hdd 8.91019  1.0 8.91TiB 2.48TiB 6.43TiB 27.82 1.00 173 
> 240 nvmemeta 0.02179  1.0 22.3GiB 1.61GiB 20.7GiB  7.22 0.26 184 
> 241 nvmemeta 0.02179  1.0 22.4GiB 1.43GiB 20.9GiB  6.41 0.23 182 
>                         TOTAL 1.85PiB  528TiB 1.33PiB 27.93          
> MIN/MAX VAR: 0.23/1.02  STDDEV: 7.10
> 
> ------
> *From:* ceph-users  on behalf of Oliver 
> Freyermuth 
> *Sent:* Sunday, 21 October 2018 6:57:49 AM
> *To:* Janne Johansson
> *Cc:* ceph-users@lists.ceph.com; Peter Wienemann
> *Subject:* Re: [ceph-users] ceph df space usage confusion - balancing needed?
>  
> Ok, I'll try out the balancer end of the upcoming week then (after we've 
> fixed a HW-issue with one of our mons
> and the cooling system).
> 
> Until then, any further advice and whether upmap is recommended over 
> crush-compat (all clients are Luminous) are welcome ;-).
> 
> Cheers,
> Oliver
> 
> Am 20.10.18 um 21:26 schrieb Janne Johansson:
>> Ok, can't say "why" then, I'd reweigh them somewhat to even it out,
>> 1.22 -vs- 0.74 in variance is a lot, so either a balancer plugin for
>> the MGRs, a script or just a few manual tweaks might be in order.
>> 
>> Den lör 20 okt. 2018 kl 21:02 skrev Oliver Freyermuth
>> :
>>>
>>> All OSDs are of the very same size. One OSD host has slightly more disks 
>>> (33 instead of 31), though.
>>> So also that that can't explain the hefty difference.
>>>
>>> I attach the output of "ceph osd tree" and "ceph osd df".
>>>
>>> The crush rule for the ceph_data pool is:
>>> rule cephfs_data {
>>> id 2
>>> type erasure
>>> min_size 3
>>> max_size 6
>>> step set_chooseleaf_tries 5
>>> step set_choose_tries 100
>>> step take default class hdd
>>> step chooseleaf indep 0 type host
>>> step emit
>>> }
>>> So that only considers the hdd device class. EC is done with k=4 m=2.
>>>
>>> So I don't see any imbalance on the hardware level, but only a somewhat 
>>> uneven distribution of PGs.
>>> Am I missing something, or is this really just a case for the ceph balancer 
>>> plugin?
>>> I'm just a bit astonished this effect is so huge.
>>> Maybe our 4096 PGs for the ceph_data pool are not enough to get an even 
>>> distribution without balancing?
>>> But it yields about 100 PGs per OSD, as you can see...
>>>
>>> --
>>> # ceph osd tree
>>> ID  CLASS WEIGHT    TYPE NAME   STATUS REWEIGHT PRI-AFF
>>>  -1   826.26428 root default
>>>  -3 0.43700 host mon001
>>>   0   ssd   0.21799 osd.0   up  1.0 1.0
>>>   1   ssd   0.21799 osd.1   up  1.0 1.0
>>>  -5 0.43700 host mon002
>>>   2   ssd   0.21799 osd.2   up  1.0 1.0
>>>   3   ssd   0.21799 osd.3   up  1.0 1.0
>>> -31 1.81898 host mon003
>>> 230   ssd   0.90999 osd.230 up  1.0 1.0
>>> 231   ssd   0.90999 osd.231 up  1.0 1.0
>>> -10   116.64600 host osd001
>>>   4   hdd   3.64499 osd.4   up  1.0 1.0
>>>   5   hdd   3.64499 osd.5   up  1.0 1.0
>>>   6   hdd   3.64499 osd.6   up  1.0 1.0
>>>   7   hdd   3.64499 osd.7   up  1.0 1.0
>>>   8   hdd   3.64499 osd.8   up  1.0 1.0
>>>   9   hdd   3.64499 osd.9   up  1.0 1.0
>>>  10   hdd   3.64499 osd.10  up  1.0 1.0
>>>  11   hdd   3.64499 osd.11  up  1.0 1.0
>>>  12   hdd   3.64499 osd.12  up  1.0 1.0
>>>  13   hdd   3.64499 osd.13  up  1.0 1.000

Re: [ceph-users] ceph df space usage confusion - balancing needed?

2018-10-20 Thread Oliver Freyermuth
Ok, I'll try out the balancer at the end of the upcoming week then (after we've
fixed a HW issue with one of our mons and the cooling system). 

Until then, any further advice, and in particular whether upmap is recommended over 
crush-compat (all clients are Luminous), is welcome ;-). 

Cheers,
Oliver

Am 20.10.18 um 21:26 schrieb Janne Johansson:
> Ok, can't say "why" then, I'd reweigh them somewhat to even it out,
> 1.22 -vs- 0.74 in variance is a lot, so either a balancer plugin for
> the MGRs, a script or just a few manual tweaks might be in order.
> 
> Den lör 20 okt. 2018 kl 21:02 skrev Oliver Freyermuth
> :
>>
>> All OSDs are of the very same size. One OSD host has slightly more disks (33 
>> instead of 31), though.
>> So also that that can't explain the hefty difference.
>>
>> I attach the output of "ceph osd tree" and "ceph osd df".
>>
>> The crush rule for the ceph_data pool is:
>> rule cephfs_data {
>> id 2
>> type erasure
>> min_size 3
>> max_size 6
>> step set_chooseleaf_tries 5
>> step set_choose_tries 100
>> step take default class hdd
>> step chooseleaf indep 0 type host
>> step emit
>> }
>> So that only considers the hdd device class. EC is done with k=4 m=2.
>>
>> So I don't see any imbalance on the hardware level, but only a somewhat 
>> uneven distribution of PGs.
>> Am I missing something, or is this really just a case for the ceph balancer 
>> plugin?
>> I'm just a bit astonished this effect is so huge.
>> Maybe our 4096 PGs for the ceph_data pool are not enough to get an even 
>> distribution without balancing?
>> But it yields about 100 PGs per OSD, as you can see...
>>
>> --
>> # ceph osd tree
>> ID  CLASS WEIGHTTYPE NAME   STATUS REWEIGHT PRI-AFF
>>  -1   826.26428 root default
>>  -3 0.43700 host mon001
>>   0   ssd   0.21799 osd.0   up  1.0 1.0
>>   1   ssd   0.21799 osd.1   up  1.0 1.0
>>  -5 0.43700 host mon002
>>   2   ssd   0.21799 osd.2   up  1.0 1.0
>>   3   ssd   0.21799 osd.3   up  1.0 1.0
>> -31 1.81898 host mon003
>> 230   ssd   0.90999 osd.230 up  1.0 1.0
>> 231   ssd   0.90999 osd.231 up  1.0 1.0
>> -10   116.64600 host osd001
>>   4   hdd   3.64499 osd.4   up  1.0 1.0
>>   5   hdd   3.64499 osd.5   up  1.0 1.0
>>   6   hdd   3.64499 osd.6   up  1.0 1.0
>>   7   hdd   3.64499 osd.7   up  1.0 1.0
>>   8   hdd   3.64499 osd.8   up  1.0 1.0
>>   9   hdd   3.64499 osd.9   up  1.0 1.0
>>  10   hdd   3.64499 osd.10  up  1.0 1.0
>>  11   hdd   3.64499 osd.11  up  1.0 1.0
>>  12   hdd   3.64499 osd.12  up  1.0 1.0
>>  13   hdd   3.64499 osd.13  up  1.0 1.0
>>  14   hdd   3.64499 osd.14  up  1.0 1.0
>>  15   hdd   3.64499 osd.15  up  1.0 1.0
>>  16   hdd   3.64499 osd.16  up  1.0 1.0
>>  17   hdd   3.64499 osd.17  up  1.0 1.0
>>  18   hdd   3.64499 osd.18  up  1.0 1.0
>>  19   hdd   3.64499 osd.19  up  1.0 1.0
>>  20   hdd   3.64499 osd.20  up  1.0 1.0
>>  21   hdd   3.64499 osd.21  up  1.0 1.0
>>  22   hdd   3.64499 osd.22  up  1.0 1.0
>>  23   hdd   3.64499 osd.23  up  1.0 1.0
>>  24   hdd   3.64499 osd.24  up  1.0 1.0
>>  25   hdd   3.64499 osd.25  up  1.0 1.0
>>  26   hdd   3.64499 osd.26  up  1.0 1.0
>>  27   hdd   3.64499 osd.27  up  1.0 1.0
>>  28   hdd   3.64499 osd.28  up  1.0 1.0
>>  29   hdd   3.64499 osd.29  up  1.0 1.0
>>  30   hdd   3.64499 osd.30  up  1.0 1.0
>>  31   hdd   3.64499 osd.31  up  1.0 1.0
>>  32   hdd   3.64499 osd.32  up  1.0 1.0
>>  33   hdd   3.64499 osd.33  up  1.0 1.0
>>  34   hdd   3.64499 osd.34  up  1.0 1.0
>>  35   hdd   3.64499 osd.35  up  1.0 1.0
>> -13   116.64600 host osd002
>>  36   hdd   3.64499 os

Re: [ceph-users] ceph df space usage confusion - balancing needed?

2018-10-20 Thread Oliver Freyermuth
76G 1949G 47.69 0.95 104 
227   hdd 3.63899  1.0 3726G 1929G 1796G 51.78 1.03 113 
228   hdd 3.63899  1.0 3726G 1657G 2068G 44.48 0.89  97 
229   hdd 3.63899  1.0 3726G 1843G 1882G 49.47 0.98 108 
 TOTAL  825T  414T  410T 50.24  
MIN/MAX VAR: 0.01/1.29  STDDEV: 9.22
--

Am 20.10.18 um 20:35 schrieb Janne Johansson:
> Yes, if you have uneven sizes I guess you could end up in a situation
> where you have
> lots of 1TB OSDs and a number of 2TB OSD but pool replication forces
> the pool to have one
> PG replica on the 1TB OSD, then it would be possible to state "this
> pool cant write more than X G"
> but when it is full, there would be free space left on some of the
> 2TB-OSDs, but which the pool
> cant utilize. Probably same for uneven OSD hosts if you have those.
> 
> Den lör 20 okt. 2018 kl 20:28 skrev Oliver Freyermuth
> :
>>
>> Dear Janne,
>>
>> yes, of course. But since we only have two pools here, this can not explain 
>> the difference.
>> The metadata is replicated (3 copies) across ssd drives, and we have < 3 TB 
>> of total raw storage for that.
>> So looking at the raw space usage, we can ignore that.
>>
>> All the rest is used for the ceph_data pool. So the ceph_data pool, in terms 
>> of raw storage, is about 50 % used.
>>
>> But in terms of storage shown for that pool, it's almost 63 % %USED.
>> So I guess this can purely be from bad balancing, correct?
>>
>> Cheers,
>> Oliver
>>
>> Am 20.10.18 um 19:49 schrieb Janne Johansson:
>>> Do mind that drives may have more than one pool on them, so RAW space
>>> is what it says, how much free space there is. Then the avail and
>>> %USED on per-pool stats will take replication into account, it can
>>> tell how much data you may write into that particular pool, given that
>>> pools replication or EC settings.
>>>
>>> Den lör 20 okt. 2018 kl 19:09 skrev Oliver Freyermuth
>>> :
>>>>
>>>> Dear Cephalopodians,
>>>>
>>>> as many others, I'm also a bit confused by "ceph df" output
>>>> in a pretty straightforward configuration.
>>>>
>>>> We have a CephFS (12.2.7) running, with 4+2 EC profile.
>>>>
>>>> I get:
>>>> 
>>>> # ceph df
>>>> GLOBAL:
>>>> SIZE AVAIL RAW USED %RAW USED
>>>> 824T  410T 414T 50.26
>>>> POOLS:
>>>> NAMEID USED %USED MAX AVAIL OBJECTS
>>>> cephfs_metadata 1  452M  0.05  860G   365774
>>>> cephfs_data 2  275T 62.68  164T 75056403
>>>> 
>>>>
>>>> So about 50 % of raw space are used, but already ~63 % of filesystem space 
>>>> are used.
>>>> Is this purely from imperfect balancing?
>>>> In "ceph osd df", I do indeed see OSD usages spreading from 65.02 % usage 
>>>> down to 37.12 %.
>>>>
>>>> We did not yet use the balancer plugin.
>>>> We don't have any pre-luminous clients.
>>>> In that setup, I take it that "upmap" mode would be recommended - correct?
>>>> Any "gotchas" using that on luminous?
>>>>
>>>> Cheers,
>>>> Oliver
>>>>
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>>
>>
>>
> 
> 






Re: [ceph-users] ceph df space usage confusion - balancing needed?

2018-10-20 Thread Oliver Freyermuth
Dear Janne,

yes, of course. But since we only have two pools here, this cannot explain the 
difference. 
The metadata is replicated (3 copies) across SSD drives, and we have < 3 TB of 
total raw storage for that. 
So looking at the raw space usage, we can ignore it. 

All the rest is used for the ceph_data pool. So the ceph_data pool, in terms of 
raw storage, is about 50 % used. 

But in terms of the usage shown for that pool (%USED), it's almost 63 %. 
So I guess this can purely be from bad balancing, correct? 
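
As a rough sanity check of the numbers (this is just my understanding of how
"ceph df" computes them, so take it with a grain of salt): with k=4 m=2 EC the
raw overhead is 1.5x, so 275 T of user data correspond to about 275 T * 1.5 = 412.5 T
raw, which matches the 414 T RAW USED nicely. The pool %USED, on the other hand,
seems to be USED / (USED + MAX AVAIL) = 275 / (275 + 164) = 0.63, and MAX AVAIL is
derived from the fullest OSDs - so the imbalance shrinks MAX AVAIL and inflates
%USED relative to the raw usage.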

Cheers,
Oliver

Am 20.10.18 um 19:49 schrieb Janne Johansson:
> Do mind that drives may have more than one pool on them, so RAW space
> is what it says, how much free space there is. Then the avail and
> %USED on per-pool stats will take replication into account, it can
> tell how much data you may write into that particular pool, given that
> pools replication or EC settings.
> 
> Den lör 20 okt. 2018 kl 19:09 skrev Oliver Freyermuth
> :
>>
>> Dear Cephalopodians,
>>
>> as many others, I'm also a bit confused by "ceph df" output
>> in a pretty straightforward configuration.
>>
>> We have a CephFS (12.2.7) running, with 4+2 EC profile.
>>
>> I get:
>> 
>> # ceph df
>> GLOBAL:
>> SIZE AVAIL RAW USED %RAW USED
>> 824T  410T 414T 50.26
>> POOLS:
>> NAMEID USED %USED MAX AVAIL OBJECTS
>> cephfs_metadata 1  452M  0.05  860G   365774
>> cephfs_data 2  275T 62.68  164T 75056403
>> 
>>
>> So about 50 % of raw space are used, but already ~63 % of filesystem space 
>> are used.
>> Is this purely from imperfect balancing?
>> In "ceph osd df", I do indeed see OSD usages spreading from 65.02 % usage 
>> down to 37.12 %.
>>
>> We did not yet use the balancer plugin.
>> We don't have any pre-luminous clients.
>> In that setup, I take it that "upmap" mode would be recommended - correct?
>> Any "gotchas" using that on luminous?
>>
>> Cheers,
>> Oliver
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 






[ceph-users] ceph df space usage confusion - balancing needed?

2018-10-20 Thread Oliver Freyermuth
Dear Cephalopodians,

like many others, I'm also a bit confused by the "ceph df" output
in a pretty straightforward configuration. 

We have a CephFS (12.2.7) running, with 4+2 EC profile. 

I get:

# ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED 
824T  410T 414T 50.26 
POOLS:
NAMEID USED %USED MAX AVAIL OBJECTS  
cephfs_metadata 1  452M  0.05  860G   365774 
cephfs_data 2  275T 62.68  164T 75056403


So about 50 % of the raw space is used, but already ~63 % of the filesystem space is 
used. 
Is this purely from imperfect balancing? 
In "ceph osd df", I do indeed see OSD usage spreading from 65.02 % down 
to 37.12 %. 

We did not yet use the balancer plugin. 
We don't have any pre-luminous clients. 
In that setup, I take it that "upmap" mode would be recommended - correct? 
Any "gotchas" using that on luminous? 

Cheers,
Oliver





Re: [ceph-users] backup ceph

2018-09-21 Thread Oliver Freyermuth
Hi,

Am 21.09.18 um 03:28 schrieb ST Wong (ITSC):
> Hi,
> 
>>> Will the RAID 6 be mirrored to another storage in remote site for DR 
>>> purpose?
>>
>> Not yet. Our goal is to have the backup ceph to which we will replicate 
>> spread across three different buildings, with 3 replicas.
> 
> May I ask if the backup ceph is a single ceph cluster span across 3 different 
> buildings, or compose of 3 ceph clusters in 3 different buildings?   Thanks.
> 

This will be a single Ceph cluster with three replicas and a failure domain 
corresponding to the building. 
To test updates before rolling them out to the full cluster, we will also 
instantiate a small test cluster separately,
but we try to keep the number of production clusters down and rather let Ceph 
handle failover and replication than do that ourselves,
which also allows us to grow / shrink the cluster more easily as needed ;-). 
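
Just to sketch what I have in mind for the CRUSH side (assuming we map the
buildings to e.g. the "datacenter" bucket type - rule name and id are made up):

  rule backup_replicated_buildings {
  id 3
  type replicated
  min_size 2
  max_size 3
  step take default
  step chooseleaf firstn 0 type datacenter
  step emit
  }

combined with size 3 on the backup pools, so that each building ends up holding
one replica.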

All the best,
Oliver

> Thanks again for your help.
> Best Regards,
> /ST Wong
> 
> -Original Message-
> From: Oliver Freyermuth  
> Sent: Thursday, September 20, 2018 2:10 AM
> To: ST Wong (ITSC) 
> Cc: Peter Wienemann ; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] backup ceph
> 
> Hi,
> 
> Am 19.09.18 um 18:32 schrieb ST Wong (ITSC):
>> Thanks for your help.
> 
> You're welcome! 
> I should also add we don't have very long-term experience with this yet - 
> Benji is pretty modern. 
> 
>>> For the moment, we use Benji to backup to a classic RAID 6.
>> Will the RAID 6 be mirrored to another storage in remote site for DR purpose?
> 
> Not yet. Our goal is to have the backup ceph to which we will replicate 
> spread across three different buildings, with 3 replicas. 
> 
>>
>>> For RBD mirroring, you do indeed need another running Ceph Cluster, but we 
>>> plan to use that in the long run (on separate hardware of course).
>> Seems this is the way to go, regardless of additional resources required? :)
>> Btw, RBD mirroring looks like a DR copy instead of a daily backup from which 
>> we can restore image of particular date ?
> 
> We would still perform daily snapshots, and keep those both in the RBD mirror 
> and in the Benji backup. Even when fading out the current RAID 6 machine at 
> some point,
> we'd probably keep Benji and direct its output to a CephFS pool on our 
> backup Ceph cluster. If anything goes wrong with the mirroring, this still 
> leaves us
> with an independent backup approach. We also keep several days of snapshots 
> in the production RBD pool to be able to quickly roll back a VM if anything 
> goes wrong. 
> With Benji, you can also mount any of these daily snapshots via NBD in case 
> it is needed, or restore from a specific date. 
> 
> All the best,
>   Oliver
> 
>>
>> Thanks again.
>> /st wong
>>
>> -Original Message-
>> From: Oliver Freyermuth  
>> Sent: Wednesday, September 19, 2018 5:28 PM
>> To: ST Wong (ITSC) 
>> Cc: Peter Wienemann ; ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] backup ceph
>>
>> Hi,
>>
>> Am 19.09.18 um 03:24 schrieb ST Wong (ITSC):
>>> Hi,
>>>
>>> Thanks for your information.
>>> May I know more about the backup destination to use?  As the size of the 
>>> cluster will be a bit large (~70TB to start with), we're looking for some 
>>> efficient method to do that backup.   Seems RBD mirroring or incremental 
>>> snapshot s with RBD 
>>> (https://ceph.com/geen-categorie/incremental-snapshots-with-rbd/) are some 
>>> ways to go, but requires another running Ceph cluster.  Is my understanding 
>>> correct?Thanks.
>>
>> For the moment, we use Benji to backup to a classic RAID 6. With Benji, only 
>> the changed chunks are backed up, and it learns that by asking Ceph for a 
>> diff of the RBD snapshots. 
>> So that's really fast after the first backup, and especially if you do 
>> trimming (e.g. via guest agent if you run VMs) of the RBD volumes before 
>> backing them up. 
>> The same is true for Backy2, but it does not support compression (which 
>> really helps by several factors(!) in saving I/O and with zstd it does not 
>> use much CPU). 
>>
>> For RBD mirroring, you do indeed need another running Ceph Cluster, but we 
>> plan to use that in the long run (on separate hardware of course). 
>>
>>> Btw, is this one (https://benji-backup.me/) Benji you'r referring to ?  
>>> Thanks a lot.
>>
>> Exactly :-). 
>>
>> Cheers,
>>  Oliver
>>
>>>
>>>
>>>
>>> Cheers

Re: [ceph-users] backup ceph

2018-09-19 Thread Oliver Freyermuth
Hi,

Am 19.09.18 um 18:32 schrieb ST Wong (ITSC):
> Thanks for your help.

You're welcome! 
I should also add we don't have very long-term experience with this yet - Benji 
is pretty modern. 

>> For the moment, we use Benji to backup to a classic RAID 6.
> Will the RAID 6 be mirrored to another storage in remote site for DR purpose?

Not yet. Our goal is to have the backup ceph to which we will replicate spread 
across three different buildings, with 3 replicas. 

> 
>> For RBD mirroring, you do indeed need another running Ceph Cluster, but we 
>> plan to use that in the long run (on separate hardware of course).
> Seems this is the way to go, regardless of additional resources required? :)
> Btw, RBD mirroring looks like a DR copy instead of a daily backup from which 
> we can restore image of particular date ?

We would still perform daily snapshots, and keep those both in the RBD mirror 
and in the Benji backup. Even when fading out the current RAID 6 machine at 
some point,
we'd probably keep Benji and direct its output to a CephFS pool on our backup 
Ceph cluster. If anything goes wrong with the mirroring, this still leaves us
with an independent backup approach. We also keep several days of snapshots in 
the production RBD pool to be able to quickly roll back a VM if anything goes 
wrong. 
With Benji, you can also mount any of these daily snapshots via NBD in case it 
is needed, or restore from a specific date. 

All the best,
Oliver

> 
> Thanks again.
> /st wong
> 
> -Original Message-
> From: Oliver Freyermuth  
> Sent: Wednesday, September 19, 2018 5:28 PM
> To: ST Wong (ITSC) 
> Cc: Peter Wienemann ; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] backup ceph
> 
> Hi,
> 
> Am 19.09.18 um 03:24 schrieb ST Wong (ITSC):
>> Hi,
>>
>> Thanks for your information.
>> May I know more about the backup destination to use?  As the size of the 
>> cluster will be a bit large (~70TB to start with), we're looking for some 
>> efficient method to do that backup.   Seems RBD mirroring or incremental 
>> snapshot s with RBD 
>> (https://ceph.com/geen-categorie/incremental-snapshots-with-rbd/) are some 
>> ways to go, but requires another running Ceph cluster.  Is my understanding 
>> correct?Thanks.
> 
> For the moment, we use Benji to backup to a classic RAID 6. With Benji, only 
> the changed chunks are backed up, and it learns that by asking Ceph for a 
> diff of the RBD snapshots. 
> So that's really fast after the first backup, and especially if you do 
> trimming (e.g. via guest agent if you run VMs) of the RBD volumes before 
> backing them up. 
> The same is true for Backy2, but it does not support compression (which 
> really helps by several factors(!) in saving I/O and with zstd it does not 
> use much CPU). 
> 
> For RBD mirroring, you do indeed need another running Ceph Cluster, but we 
> plan to use that in the long run (on separate hardware of course). 
> 
>> Btw, is this one (https://benji-backup.me/) Benji you'r referring to ?  
>> Thanks a lot.
> 
> Exactly :-). 
> 
> Cheers,
>   Oliver
> 
>>
>>
>>
>> Cheers,
>> /ST Wong
>>
>>
>>
>> -Original Message-
>> From: Oliver Freyermuth  
>> Sent: Tuesday, September 18, 2018 6:09 PM
>> To: ST Wong (ITSC) 
>> Cc: Peter Wienemann 
>> Subject: Re: [ceph-users] backup ceph
>>
>> Hi,
>>
>> we're also just starting to collect experiences, so we have nothing to share 
>> (yet). However, we are evaluating using Benji (a well-maintained fork of 
>> Backy2 which can also compress) in addition, trimming and fsfreezing the VM 
>> disks shortly before,
>> and additionally keeping a few daily and weekly snapshots. 
>> We may add RBD mirroring to a backup system in the future. 
>>
>> Since our I/O requirements are not too high, I guess we will be fine either 
>> way, but any shared experience is very welcome. 
>>
>> Cheers,
>>  Oliver
>>
>> Am 18.09.18 um 11:54 schrieb ST Wong (ITSC):
>>> Hi,
>>>
>>>  
>>>
>>> We're newbie to Ceph.  Besides using incremental snapshots with RDB to 
>>> backup data on one Ceph cluster to another running Ceph cluster, or using 
>>> backup tools like backy2, will there be any recommended way to backup Ceph 
>>> data  ?   Someone here suggested taking snapshot of RDB daily and keeps 30 
>>> days to replace backup.  I wonder if this is practical and if performance 
>>> will be impact.
>>>
>>>  
>>>
>>> Thanks a lot.
>>>
>>> Regards
>>>
>>> /st wong
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
> 






Re: [ceph-users] backup ceph

2018-09-19 Thread Oliver Freyermuth
Hi,

Am 19.09.18 um 03:24 schrieb ST Wong (ITSC):
> Hi,
> 
> Thanks for your information.
> May I know more about the backup destination to use?  As the size of the 
> cluster will be a bit large (~70TB to start with), we're looking for some 
> efficient method to do that backup.   Seems RBD mirroring or incremental 
> snapshot s with RBD 
> (https://ceph.com/geen-categorie/incremental-snapshots-with-rbd/) are some 
> ways to go, but requires another running Ceph cluster.  Is my understanding 
> correct?Thanks.

For the moment, we use Benji to back up to a classic RAID 6. With Benji, only 
the changed chunks are backed up; it learns about them by asking Ceph for a diff 
of the RBD snapshots. 
So that's really fast after the first backup, especially if you trim the RBD 
volumes (e.g. via the guest agent if you run VMs) before backing them up. 
The same is true for Backy2, but Backy2 does not support compression (which 
really helps, by several factors(!), in saving I/O, and with zstd it does not 
use much CPU). 
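
Under the hood this boils down to the usual RBD snapshot diffing, roughly like the
following (pool and image names are just examples):

  rbd snap create rbd/vm-disk@backup-2018-09-19
  rbd diff --from-snap backup-2018-09-18 rbd/vm-disk@backup-2018-09-19 --format json

Benji does this internally, but the principle is the same: only the extents listed
in the diff have to be read and backed up.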

For RBD mirroring, you do indeed need another running Ceph Cluster, but we plan 
to use that in the long run (on separate hardware of course). 

> Btw, is this one (https://benji-backup.me/) Benji you'r referring to ?  
> Thanks a lot.

Exactly :-). 

Cheers,
Oliver

> 
> 
> 
> Cheers,
> /ST Wong
> 
> 
> 
> -Original Message-
> From: Oliver Freyermuth  
> Sent: Tuesday, September 18, 2018 6:09 PM
> To: ST Wong (ITSC) 
> Cc: Peter Wienemann 
> Subject: Re: [ceph-users] backup ceph
> 
> Hi,
> 
> we're also just starting to collect experiences, so we have nothing to share 
> (yet). However, we are evaluating using Benji (a well-maintained fork of 
> Backy2 which can also compress) in addition, trimming and fsfreezing the VM 
> disks shortly before,
> and additionally keeping a few daily and weekly snapshots. 
> We may add RBD mirroring to a backup system in the future. 
> 
> Since our I/O requirements are not too high, I guess we will be fine either 
> way, but any shared experience is very welcome. 
> 
> Cheers,
>   Oliver
> 
> Am 18.09.18 um 11:54 schrieb ST Wong (ITSC):
>> Hi,
>>
>>  
>>
>> We're newbie to Ceph.  Besides using incremental snapshots with RDB to 
>> backup data on one Ceph cluster to another running Ceph cluster, or using 
>> backup tools like backy2, will there be any recommended way to backup Ceph 
>> data  ?   Someone here suggested taking snapshot of RDB daily and keeps 30 
>> days to replace backup.  I wonder if this is practical and if performance 
>> will be impact.
>>
>>  
>>
>> Thanks a lot.
>>
>> Regards
>>
>> /st wong
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> 





Re: [ceph-users] CephFS Quota and ACL support

2018-08-28 Thread Oliver Freyermuth
Am 28.08.18 um 07:14 schrieb Yan, Zheng:
> On Mon, Aug 27, 2018 at 10:53 AM Oliver Freyermuth
>  wrote:
>>
>> Thanks for the replies.
>>
>> Am 27.08.18 um 19:25 schrieb Patrick Donnelly:
>>> On Mon, Aug 27, 2018 at 12:51 AM, Oliver Freyermuth
>>>  wrote:
>>>> These features are critical for us, so right now we use the Fuse client. 
>>>> My hope is CentOS 8 will use a recent enough kernel
>>>> to get those features automatically, though.
>>>
>>> Your cluster needs to be running Mimic and Linux v4.17+.
>>>
>>> See also: https://github.com/ceph/ceph/pull/23728/files
>>>
>>
>> Yes, I know that it's part of the official / vanilla kernel as of 4.17.
>> However, I was wondering whether this functionality is also likely to be 
>> backported to the RedHat-maintained kernel which is also used in CentOS 7?
>> Even though the kernel version is "stone-aged", it matches CentOS 7's 
>> userspace and RedHat is taking good care to implement fixes.
>>
> 
> We have already backported quota patches to RHEL 3.10 kernel. It may
> take some time for redhat to release the new kernel.

That's great news, many thanks - looking forward to it! 
I also noted the CephFS kernel client is now mentioned as "fully supported" 
with the upcoming RHEL 7.6: 
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7-beta/html-single/7.6_release_notes/index#new_features_file_systems
Those release notes still talk about missing quota support, but I guess this 
will then be added soonish :-). 

All the best,
Oliver

> 
> Regards
> Yan, Zheng
> 
>> Seeing that even features are backported, it would be really helpful if also 
>> this functionality would appear as part of CentOS 7.6 / 7.7,
>> especially since CentOS 8 still appears to be quite some time away.
>>
>> Cheers,
>> Oliver
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





Re: [ceph-users] CephFS Quota and ACL support

2018-08-27 Thread Oliver Freyermuth
Thanks for the replies. 

Am 27.08.18 um 19:25 schrieb Patrick Donnelly:
> On Mon, Aug 27, 2018 at 12:51 AM, Oliver Freyermuth
>  wrote:
>> These features are critical for us, so right now we use the Fuse client. My 
>> hope is CentOS 8 will use a recent enough kernel
>> to get those features automatically, though.
> 
> Your cluster needs to be running Mimic and Linux v4.17+.
> 
> See also: https://github.com/ceph/ceph/pull/23728/files
> 

Yes, I know that it's part of the official / vanilla kernel as of 4.17. 
However, I was wondering whether this functionality is also likely to be 
backported to the RedHat-maintained kernel which is also used in CentOS 7? 
Even though the kernel version is "stone-aged", it matches CentOS 7's userspace 
and RedHat is taking good care to implement fixes. 

Seeing that even new features are backported, it would be really helpful if this 
functionality also appeared as part of CentOS 7.6 / 7.7,
especially since CentOS 8 still appears to be quite some time away. 

Cheers,
Oliver





[ceph-users] CephFS Quota and ACL support

2018-08-27 Thread Oliver Freyermuth
Dear Cephalopodians,

sorry if this is the wrong place to ask - but does somebody know if the 
recently added quota support in the kernel client,
and the ACL support, are going to be backported to RHEL 7 / CentOS 7 kernels? 
Or can someone redirect me to the correct place to ask? 
We don't have a RHEL subscription, but are using CentOS. 

These features are critical for us, so right now we use the Fuse client. My 
hope is CentOS 8 will use a recent enough kernel
to get those features automatically, though. 
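
For reference, with the Fuse client we simply set the quotas via the usual extended
attributes, e.g. (path and limits are just examples):

  setfattr -n ceph.quota.max_bytes -v 107374182400 /cephfs/some/dir   # 100 GiB
  setfattr -n ceph.quota.max_files -v 100000 /cephfs/some/dir

and my hope is that the kernel client will simply enforce the very same attributes
once the support arrives.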

Cheers and thanks,
Oliver





Re: [ceph-users] how can time machine know difference between cephfs fuse and kernel client?

2018-08-17 Thread Oliver Freyermuth
Hi,

completely different idea: Have you tried to export the "time capsule" storage 
via AFP (using netatalk) instead of Samba? 
We are also planning to offer something like this for our users (in the 
mid-term future), but my feeling was that compatibility with netatalk / AFP 
would be better than with Samba. 
That also appears to be the implementation consumer-grade NAS devices are using 
behind the scenes for their "time capsule" functionality. 

I also don't have experience with this (yet) but I know some users backing up 
their time machine data to AFP shares from NAS devices, and in general this 
appears to work well. 
Probably it won't help with the space reporting issue, but it might still be of 
interest for the use case? 
In any case, I'd be very interested to hear whether you have experience with both, 
and if so, why you decided on Samba ;-). 

And since our plan is also to export a CephFS mounted via fuse, I'll 
closely follow your issue... 

Cheers,
Oliver

Am 17.08.18 um 17:13 schrieb Chad William Seys:
> Hello all,
>   I have used cephfs served over Samba to set up a "time capsule" server.  
> However, I could only get this to work using the cephfs kernel module.  Time 
> machine would give errors if cephfs were mounted with fuse. (Sorry, I didn't 
> write down the error messages!)
>   Anyone have an idea how the two methods of mounting are detectable by time 
> machine through Samba?
>   Windows 10 File History behaved the same way.  Error messages are "Could 
> not enable File History. There is not enough space on the disk". (Although it 
> shows the correct amount of space.) And "File History doesn't recognize this 
> drive."
>   I'd like to use cephfs fuse for the quota support.  (The kernel client is 
> said to support quotas with Mimic and kernel version >= 4.17, but that is to 
> cutting edge for me ATM.)
> 
> Thanks!
> Chad.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")

2018-07-30 Thread Oliver Freyermuth
Hi together,

for all others on this list, it might also be helpful to know which setups are 
likely affected. 
Does this only occur for Filestore disks, i.e. where ceph-volume has taken over 
managing them? 
Does it happen on every RHEL 7.5 system? 

We're still on 13.2.0 here and ceph-detect-init works fine on our CentOS 7.5 
systems (it just echoes "systemd"). 
We're on Bluestore. 
Should we hold off on an upgrade, or are we unaffected? 

Cheers,
Oliver

Am 30.07.2018 um 09:50 schrieb ceph.nov...@habmalnefrage.de:
> Hey Nathan.
> 
> No blaming here. I'm very thankful for this great peace (ok, sometime more of 
> a beast ;) ) of open-source SDS and all the great work around it incl. 
> community and users... and happy the problem is identified and can be fixed 
> for others/the future as well :)
>  
> Well, yes, can confirm your found "error" also here:
> 
> [root@sds20 ~]# ceph-detect-init
> Traceback (most recent call last):
>   File "/usr/bin/ceph-detect-init", line 9, in 
> load_entry_point('ceph-detect-init==1.0.1', 'console_scripts', 
> 'ceph-detect-init')()
>   File "/usr/lib/python2.7/site-packages/ceph_detect_init/main.py", line 56, 
> in run
> print(ceph_detect_init.get(args.use_rhceph).init)
>   File "/usr/lib/python2.7/site-packages/ceph_detect_init/__init__.py", line 
> 42, in get
> release=release)
> ceph_detect_init.exc.UnsupportedPlatform: Platform is not supported.: rhel  
> 7.5
> 
> 
> Gesendet: Sonntag, 29. Juli 2018 um 20:33 Uhr
> Von: "Nathan Cutler" 
> An: ceph.nov...@habmalnefrage.de, "Vasu Kulkarni" 
> Cc: ceph-users , "Ceph Development" 
> 
> Betreff: Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")
>> Strange...
>> - wouldn't swear, but pretty sure v13.2.0 was working ok before
>> - so what do others say/see?
>> - no one on v13.2.1 so far (hard to believe) OR
>> - just don't have this "systemctl ceph-osd.target" problem and all just 
>> works?
>>
>> If you also __MIGRATED__ from Luminous (say ~ v12.2.5 or older) to Mimic 
>> (say v13.2.0 -> v13.2.1) and __DO NOT__ see the same systemctl problems, 
>> whats your Linix OS and version (I'm on RHEL 7.5 here) ? :O
> 
> Best regards
>  Anton
> 
> 
> 
> Hi ceph.novice:
> 
> I'm the one to blame for this regretful incident. Today I have
> reproduced the issue in teuthology:
> 
> 2018-07-29T18:20:07.288 INFO:teuthology.orchestra.run.ovh093:Running:
> 'sudo TESTDIR=/home/ubuntu/cephtest bash -c ceph-detect-init'
> 2018-07-29T18:20:07.796
> INFO:teuthology.orchestra.run.ovh093.stderr:Traceback (most recent call
> last):
> 2018-07-29T18:20:07.797 INFO:teuthology.orchestra.run.ovh093.stderr:
> File "/bin/ceph-detect-init", line 9, in 
> 2018-07-29T18:20:07.797 INFO:teuthology.orchestra.run.ovh093.stderr:
> load_entry_point('ceph-detect-init==1.0.1', 'console_scripts',
> 'ceph-detect-init')()
> 2018-07-29T18:20:07.797 INFO:teuthology.orchestra.run.ovh093.stderr:
> File "/usr/lib/python2.7/site-packages/ceph_detect_init/main.py", line
> 56, in run
> 2018-07-29T18:20:07.797 INFO:teuthology.orchestra.run.ovh093.stderr:
> print(ceph_detect_init.get(args.use_rhceph).init)
> 2018-07-29T18:20:07.797 INFO:teuthology.orchestra.run.ovh093.stderr:
> File "/usr/lib/python2.7/site-packages/ceph_detect_init/__init__.py",
> line 42, in get
> 2018-07-29T18:20:07.797 INFO:teuthology.orchestra.run.ovh093.stderr:
> release=release)
> 2018-07-29T18:20:07.797
> INFO:teuthology.orchestra.run.ovh093.stderr:ceph_detect_init.exc.UnsupportedPlatform:
> Platform is not supported.: rhel 7.5
> 
> Just to be sure, can you confirm? (I.e. issue the command
> "ceph-detect-init" on your RHEL 7.5 system. Instead of saying "systemd"
> it gives an error like above?)
> 
> I'm working on a fix now at https://github.com/ceph/ceph/pull/23303
> 
> Nathan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 






Re: [ceph-users] Self shutdown of 1 whole system (Derbian stretch/Ceph 12.2.7/bluestore)

2018-07-23 Thread Oliver Freyermuth
Am 23.07.2018 um 14:59 schrieb Nicolas Huillard:
> Le lundi 23 juillet 2018 à 12:40 +0200, Oliver Freyermuth a écrit :
>> Am 23.07.2018 um 11:18 schrieb Nicolas Huillard:
>>> Le lundi 23 juillet 2018 à 18:23 +1000, Brad Hubbard a écrit :
>>>> Ceph doesn't shut down systems as in kill or reboot the box if
>>>> that's
>>>> what you're saying?
>>>
>>> That's the first part of what I was saying, yes. I was pretty sure
>>> Ceph
>>> doesn't reboot/shutdown/reset, but now it's 100% sure, thanks.
>>> Maybe systemd triggered something, but without any lasting traces.
>>> The kernel didn't leave any more traces in kernel.log, and since
>>> the
>>> server was off, there was no oops remaining on the console...
>>
>> If there was an oops, it should also be recorded in pstore. 
>> If the kernel was still running and able to show a stacktrace, even
>> if disk I/O has become impossible,
>> it will in general dump the stacktrace to pstore (e.g. UEFI pstore if
>> you boot via EFI, or ACPI pstore, if available). 
> 
> I was sure I would learn something from this thread. Thnaks!
> Unfortunately, those machines don't boot using UEFI, /sys/fs/pstore/ is
> empty, and:
> /sys/module/pstore/parameters/backend:(null)
> /sys/module/pstore/parameters/update_ms:-1
> 
> I suppose this pstore is also shown in the BMC web interface as "Server
> Health / System Log". This is empty too, and I wondered what would fill
> it. Maybe I'll use UEFI boot next time.

It's usually not shown anywhere else - in the end, the UEFI pstore is just 
permanent storage, which the Linux kernel uses to save OOPSes and other kinds 
of PANICs. 
It's very unlikely that the BMC can interpret the very same format the Linux 
kernel writes there. 

Sadly, it seems your machine does not have any backend available (unless booted 
via UEFI). 
Our machines can luckily use ACPI ERST (Error Record Serialization Table) even 
if legacy-booted. 

So probably, booting via UEFI is your only option (other options could be 
netconsole, but it is less robust / does not capture everything, or ramoops, 
but I've never used that). 

Cheers,
Oliver





Re: [ceph-users] "CPU CATERR Fault" Was: Self shutdown of 1 whole system (Derbian stretch/Ceph 12.2.7/bluestore)

2018-07-23 Thread Oliver Freyermuth
Am 23.07.2018 um 11:39 schrieb Nicolas Huillard:
> Le lundi 23 juillet 2018 à 10:28 +0200, Caspar Smit a écrit :
>> Do you have any hardware watchdog running in the system? A watchdog
>> could
>> trigger a powerdown if it meets some value. Any event logs from the
>> chassis
>> itself?
> 
> Nice suggestions ;-)
> 
> I see some [watchdog/N] and one [watchdogd] kernel threads, along with
> a "kernel: [0.116002] NMI watchdog: enabled on all CPUs,
> permanently consumes one hw-PMU counter." line in the kernel log, but
> no user-land watchdog daemon: I'm not sure if the watchdog is actually
> active.
> 
> There ARE chassis/BMC/IPMI level events, one of which is "CPU CATERR
> Fault", with a timestamp matching the timestamps below, and no more
> information.

If this kind of failure (or a less severe one) also happens at runtime, mcelog 
should catch it. 
For CATERR errors, we also found that sometimes the web interface of the BMC 
shows more information for the event log entry 
than querying the event log via ipmitool - you may want to check this. 
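
(i.e. comparing something like

  ipmitool sel elist
  ipmitool sel get 0x0042   # the record id is just a placeholder

with the corresponding entry in the BMC web UI.)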


> If I understand correctly, this is a signal emitted by the CPU, to the
> BMC, upon "catastrophic error" (more than "fatal"), which the BMC must
> respond to the way it wants, Intel suggestions including resetting the
> chassis.
> 
> https://www.intel.in/content/dam/www/public/us/en/documents/white-paper
> s/platform-level-error-strategies-paper.pdf
> 
> Does that mean that the hardware is failing, or a neutrino just crossed
> some CPU register?
> CPU is a Xeon D-1521 with ECC memory.
> 
>> Kind regards,
> 
> Many thanks!
> 
>>
>> Caspar
>>
>> 2018-07-21 10:31 GMT+02:00 Nicolas Huillard :
>>
>>> Hi all,
>>>
>>> One of my server silently shutdown last night, with no explanation
>>> whatsoever in any logs. According to the existing logs, the
>>> shutdown
>>> (without reboot) happened between 03:58:20.061452 (last timestamp
>>> from
>>> /var/log/ceph/ceph-mgr.oxygene.log) and 03:59:01.515308 (new MON
>>> election called, for which oxygene didn't answer).
>>>
>>> Is there any way in which Ceph could silently shutdown a server?
>>> Can SMART self-test influence scrubbing or compaction?
>>>
>>> The only thing I have is that smartd stated a long self-test on
>>> both
>>> OSD spinning drives on that host:
>>> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sda [SAT],
>>> starting
>>> scheduled Long Self-Test.
>>> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdb [SAT],
>>> starting
>>> scheduled Long Self-Test.
>>> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdc [SAT],
>>> starting
>>> scheduled Long Self-Test.
>>> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sda [SAT], self-
>>> test in
>>> progress, 90% remaining
>>> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdb [SAT], self-
>>> test in
>>> progress, 90% remaining
>>> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdc [SAT],
>>> previous
>>> self-test completed without error
>>>
>>> ...and smartctl now says that the self-tests didn't finish (on both
>>> drives) :
>>> # 1  Extended offline    Interrupted (host reset)      00%     10636         -
>>>
>>> MON logs on oxygene talks about rockdb compaction a few minutes
>>> before
>>> the shutdown, and a deep-scrub finished earlier:
>>> /var/log/ceph/ceph-osd.6.log
>>> 2018-07-21 03:32:54.086021 7fd15d82c700  0 log_channel(cluster) log
>>> [DBG]
>>> : 6.1d deep-scrub starts
>>> 2018-07-21 03:34:31.185549 7fd15d82c700  0 log_channel(cluster) log
>>> [DBG]
>>> : 6.1d deep-scrub ok
>>> 2018-07-21 03:43:36.720707 7fd178082700  0 --
>>> 172.22.0.16:6801/478362 >>
>>> 172.21.0.16:6800/1459922146 conn(0x556f0642b800 :6801
>>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
>>> l=1).handle_connect_msg: challenging authorizer
>>>
>>> /var/log/ceph/ceph-mgr.oxygene.log
>>> 2018-07-21 03:58:16.060137 7fbcd300  1 mgr send_beacon standby
>>> 2018-07-21 03:58:18.060733 7fbcd300  1 mgr send_beacon standby
>>> 2018-07-21 03:58:20.061452 7fbcd300  1 mgr send_beacon standby
>>>
>>> /var/log/ceph/ceph-mon.oxygene.log
>>> 2018-07-21 03:52:27.702314 7f25b5406700  4 rocksdb: (Original Log
>>> Time
>>> 2018/07/21-03:52:27.702302) [/build/ceph-12.2.7/src/
>>> rocksdb/db/db_impl_compaction_flush.cc:1392] [default] Manual
>>> compaction
>>> from level-0 to level-1 from 'mgrstat .. '
>>> 2018-07-21 03:52:27.702321 7f25b5406700  4 rocksdb:
>>> [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1403]
>>> [default] [JOB
>>> 1746] Compacting 1@0 + 1@1 files to L1, score -1.00
>>> 2018-07-21 03:52:27.702329 7f25b5406700  4 rocksdb:
>>> [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1407]
>>> [default]
>>> Compaction start summary: Base version 1745 Base level 0, inputs:
>>> [149507(602KB)], [149505(13MB)]
>>> 2018-07-21 03:52:27.702348 7f25b5406700  4 rocksdb: EVENT_LOG_v1
>>> {"time_micros": 1532137947702334, "job": 1746, "event":
>>> "compaction_started", "files_L0": [149507], "files_L1": [149505],
>>> "score":
>>> -1, "input_data_size": 14916379}

Re: [ceph-users] Self shutdown of 1 whole system (Debian stretch/Ceph 12.2.7/bluestore)

2018-07-23 Thread Oliver Freyermuth
Am 23.07.2018 um 11:18 schrieb Nicolas Huillard:
> Le lundi 23 juillet 2018 à 18:23 +1000, Brad Hubbard a écrit :
>> Ceph doesn't shut down systems as in kill or reboot the box if that's
>> what you're saying?
> 
> That's the first part of what I was saying, yes. I was pretty sure Ceph
> doesn't reboot/shutdown/reset, but now it's 100% sure, thanks.
> Maybe systemd triggered something, but without any lasting traces.
> The kernel didn't leave any more traces in kernel.log, and since the
> server was off, there was no oops remaining on the console...

If there was an oops, it should also be recorded in pstore. 
If the kernel was still running and able to show a stacktrace, even if disk I/O 
has become impossible,
it will in general dump the stacktrace to pstore (e.g. UEFI pstore if you boot 
via EFI, or ACPI pstore, if available). 

Cheers,
Oliver

> 
> I'm currently activating "Auto video recording" at the BMC/IPMI level,
> as that may help next time this event occurs... Triggers look like
> they're tuned for Windows BSOD though...
> 
> Thanks for all answers ;-)
> 
>> On Mon, Jul 23, 2018 at 5:04 PM, Nicolas Huillard > .fr> wrote:
>>> Le lundi 23 juillet 2018 à 11:07 +0700, Konstantin Shalygin a écrit
>>> :
> I even have no fancy kernel or device, just real standard
> Debian.
> The
> uptime was 6 days since the upgrade from 12.2.6...

 Nicolas, you should upgrade your 12.2.6 to 12.2.7 due bugs in
 this
 release.
> 





Re: [ceph-users] Self shutdown of 1 whole system (Debian stretch/Ceph 12.2.7/bluestore)

2018-07-21 Thread Oliver Freyermuth
Since all services are running on these machines - are you by any chance 
running low on memory? 
Do you have any monitoring of this? 

We observe some strange issues with our servers when they run for a long while 
under high memory pressure (more memory has been ordered...). 
Then it seems our Infiniband driver cannot allocate sufficiently large pages 
anymore, communication is lost between the Ceph nodes, recovery starts, 
memory usage grows even higher from this, etc. 
In some cases, it seems this may lead to a freeze / lockup (not a reboot). My 
feeling is that the CentOS 7.5 kernel is not doing as well on memory compaction 
as more modern kernels do. 

Right now, this is just a hunch of mine, but my recommendation would be to have 
some monitoring of the machine and see if something strange happens in terms of 
memory usage, CPU usage, or disk I/O (e.g. iowait)
to further pin down the issue. It may as well be something completely 
different. 

Other options to investigate would be a potential kernel stacktrace in pstore, 
or something in mcelog. 
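
If it helps, this is roughly what I would start with - a sketch assuming the 
sysstat package is installed; /proc/buddyinfo is interesting here because it 
shows how fragmented the free memory is, i.e. whether large allocations can 
still succeed: 

# sar -r 60
# cat /proc/buddyinfo
# dmesg -T | grep -iE 'oom|page allocation failure'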

Cheers,
Oliver

Am 21.07.2018 um 14:34 schrieb Nicolas Huillard:
> I forgot to mention that this server, along with all the other Ceph
> servers in my cluster, do not run anything else than Ceph, and each run
>  all the Ceph daemons (mon, mgr, mds, 2×osd).
> 
> Le samedi 21 juillet 2018 à 10:31 +0200, Nicolas Huillard a écrit :
>> Hi all,
>>
>> One of my server silently shutdown last night, with no explanation
>> whatsoever in any logs. According to the existing logs, the shutdown
>> (without reboot) happened between 03:58:20.061452 (last timestamp
>> from
>> /var/log/ceph/ceph-mgr.oxygene.log) and 03:59:01.515308 (new MON
>> election called, for which oxygene didn't answer).
>>
>> Is there any way in which Ceph could silently shutdown a server?
>> Can SMART self-test influence scrubbing or compaction?
>>
>> The only thing I have is that smartd stated a long self-test on both
>> OSD spinning drives on that host:
>> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sda [SAT], starting
>> scheduled Long Self-Test.
>> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdb [SAT], starting
>> scheduled Long Self-Test.
>> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdc [SAT], starting
>> scheduled Long Self-Test.
>> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sda [SAT], self-
>> test in progress, 90% remaining
>> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdb [SAT], self-
>> test in progress, 90% remaining
>> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdc [SAT], previous
>> self-test completed without error
>>
>> ...and smartctl now says that the self-tests didn't finish (on both
>> drives) :
>> # 1  Extended offline    Interrupted (host reset)      00%     10636         -
>>
>> MON logs on oxygene talks about rockdb compaction a few minutes
>> before
>> the shutdown, and a deep-scrub finished earlier:
>> /var/log/ceph/ceph-osd.6.log
>> 2018-07-21 03:32:54.086021 7fd15d82c700  0 log_channel(cluster) log
>> [DBG] : 6.1d deep-scrub starts
>> 2018-07-21 03:34:31.185549 7fd15d82c700  0 log_channel(cluster) log
>> [DBG] : 6.1d deep-scrub ok
>> 2018-07-21 03:43:36.720707 7fd178082700  0 -- 172.22.0.16:6801/478362 >>
>> 172.21.0.16:6800/1459922146 conn(0x556f0642b800 :6801
>>
>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
>> l=1).handle_connect_msg: challenging authorizer
>>
>> /var/log/ceph/ceph-mgr.oxygene.log
>> 2018-07-21 03:58:16.060137 7fbcd300  1 mgr send_beacon standby
>> 2018-07-21 03:58:18.060733 7fbcd300  1 mgr send_beacon standby
>> 2018-07-21 03:58:20.061452 7fbcd300  1 mgr send_beacon standby
>>
>> /var/log/ceph/ceph-mon.oxygene.log
>> 2018-07-21 03:52:27.702314 7f25b5406700  4 rocksdb: (Original Log
>> Time 2018/07/21-03:52:27.702302) [/build/ceph-
>> 12.2.7/src/rocksdb/db/db_impl_compaction_flush.cc:1392] [default]
>> Manual compaction from level-0 to level-1 from 'mgrstat .. '
>> 2018-07-21 03:52:27.702321 7f25b5406700  4 rocksdb: [/build/ceph-
>> 12.2.7/src/rocksdb/db/compaction_job.cc:1403] [default] [JOB 1746]
>> Compacting 1@0 + 1@1 files to L1, score -1.00
>> 2018-07-21 03:52:27.702329 7f25b5406700  4 rocksdb: [/build/ceph-
>> 12.2.7/src/rocksdb/db/compaction_job.cc:1407] [default] Compaction
>> start summary: Base version 1745 Base level 0, inputs:
>> [149507(602KB)], [149505(13MB)]
>> 2018-07-21 03:52:27.702348 7f25b5406700  4 rocksdb: EVENT_LOG_v1
>> {"time_micros": 1532137947702334, "job": 1746, "event":
>> "compaction_started", "files_L0": [149507], "files_L1": [149505],
>> "score": -1, "input_data_size": 14916379}
>> 2018-07-21 03:52:27.785532 7f25b5406700  4 rocksdb: [/build/ceph-
>> 12.2.7/src/rocksdb/db/compaction_job.cc:1116] [default] [JOB 1746]
>> Generated table #149508: 4904 keys, 14808953 bytes
>> 2018-07-21 03:52:27.785587 7f25b5406700  4 rocksdb: EVENT_LOG_v1
>> {"time_micros": 1532137947785565, "cf_name": "default", "job": 1746,
>> "event": "table_file_creation", "file_number": 

Re: [ceph-users] JBOD question

2018-07-20 Thread Oliver Freyermuth
Hi Satish,

that really completely depends on your controller. 

For what it's worth: We have AVAGO MegaRAID controllers (9361 series). 
They can be switched to a "JBOD personality". After doing so and reinitializing 
(powercycling), 
the cards change their PCI ID and run a different firmware, optimized for JBOD mode 
(with different caching etc.). Also, the block devices are ordered differently. 

In that mode, new disks will be exported as JBOD by default, but you can still 
do RAID1 and RAID0. 
I think RAID5 and RAID6 are disabled, though. 

We are using those to have a RAID 1 for our OS and export the rest as JBOD for 
CephFS. 

So there surely are controllers which can simply do JBOD in addition (without a 
special controller mode / "personality"), 
controllers which can be switched but still offer simple RAID levels, 
and I'm also sure there are controllers out there which can be switched to JBOD 
mode and can't do any RAID anymore in that mode. 

If that's the case, just go with software RAID for the OS, or install your 
servers with a good deployment tool so you can just reinstall them
if the OS breaks (we also do that for some Ceph servers with simpler RAID 
controllers). With a good deployment tool,
reinstalling takes 1 click and waiting 40 minutes - but of course, the server 
will still be down until a broken OS HDD is replaced physically. 
But Ceph has redundancy for that :-). 
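
Just as a rough sketch (the device names are examples, and normally the 
distribution installer sets this up for you anyway): a plain MD RAID 1 for the 
OS partitions would look roughly like this. 

# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
# mkfs.xfs /dev/md0
# mdadm --detail --scan >> /etc/mdadm.conf

The last line makes sure the array is assembled again at boot. 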

Cheers,
Oliver


Am 20.07.2018 um 23:52 schrieb Satish Patel:
> Thanks Brian,
> 
> That make sense because i was reading document and found you can
> either choose RAID or JBOD
> 
> On Fri, Jul 20, 2018 at 5:33 PM, Brian :  wrote:
>> Hi Satish
>>
>> You should be able to choose different modes of operation for each
>> port / disk. Most dell servers will let you do RAID and JBOD in
>> parallel.
>>
>> If you can't do that and can only either turn RAID on or off then you
>> can use SW RAID for your OS
>>
>>
>> On Fri, Jul 20, 2018 at 9:01 PM, Satish Patel  wrote:
>>> Folks,
>>>
>>> I never used JBOD mode before and now i am planning so i have stupid
>>> question if i switch RAID controller to JBOD mode in that case how
>>> does my OS disk will get mirror?
>>>
>>> Do i need to use software raid for OS disk when i use JBOD mode?
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 





Re: [ceph-users] Crush Rules with multiple Device Classes

2018-07-19 Thread Oliver Freyermuth
Am 19.07.2018 um 08:43 schrieb Linh Vu:
> Since the new NVMes are meant to replace the existing SSDs, why don't you 
> assign class "ssd" to the new NVMe OSDs? That way you don't need to change 
> the existing OSDs nor the existing crush rule. And the new NVMe OSDs won't 
> lose any performance, "ssd" or "nvme" is just a name.
> 
> When you deploy the new NVMe, you can chuck this under [osd] in their local 
> ceph.conf: `osd_class_update_on_start = false` They should then come up with 
> a blank class and you can set the class to ssd afterwards. 

Right, this should also work. But then I'd prefer to "relabel" the existing 
SSDs and the crush rule to read "NVME" such that the future NVMEs will update 
themselves automatically
without manual configuration. We are trying to keep our ceph.conf small to 
follow the spirit of Mimic and future releases ;-). 
I'll schedule this change for our next I/O pause just to be on the safe side. 

Thanks and all the best,
Oliver

> 
> ------
> *From:* ceph-users  on behalf of Oliver 
> Freyermuth 
> *Sent:* Thursday, 19 July 2018 6:13:25 AM
> *To:* ceph-users@lists.ceph.com
> *Cc:* Peter Wienemann
> *Subject:* [ceph-users] Crush Rules with multiple Device Classes
>  
> Dear Cephalopodians,
> 
> we use an SSD-only pool to store the metadata of our CephFS.
> In the future, we will add a few NVMEs, and in the long-term view, replace 
> the existing SSDs by NVMEs, too.
> 
> Thinking this through, I came up with three questions which I do not find 
> answered in the docs (yet).
> 
> Currently, we use the following crush-rule:
> 
> rule cephfs_metadata {
>     id 1
>     type replicated
>     min_size 1
>     max_size 10
>     step take default class ssd
>     step choose firstn 0 type osd
>     step emit
> }
> 
> As you can see, this uses "class ssd".
> 
> Now my first question is:
> 1) Is there a way to specify "take default class (ssd or nvme)"?
>    Then we could just do this for the migration period, and at some point 
> remove "ssd".
> 
> If multi-device-class in a crush rule is not supported yet, the only 
> workaround which comes to my mind right now is to issue:
>   $ ceph osd crush set-device-class nvme 
> for all our old SSD-backed osds, and modify the crush rule to refer to class 
> "nvme" straightaway.
> 
> This leads to my second question:
> 2) Since the OSD IDs do not change, Ceph should not move any data around by 
> changing both the device classes of the OSDs and the device class in the 
> crush rule - correct?
> 
> After this operation, adding NVMEs to our cluster should let them 
> automatically join this crush rule, and once all SSDs are replaced with NVMEs,
> the workaround is automatically gone.
> 
> As long as the SSDs are still there, some tunables might not fit well anymore 
> out of the box, i.e. the "sleep" values for scrub and repair, though.
> 
> Here my third question:
> 3) Are the tunables used for NVME devices the same as for SSD devices?
>    I do not find any NVME tunables here:
>    http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/
>    Only SSD, HDD and Hybrid are shown.
> 
> Cheers,
>     Oliver
> 






Re: [ceph-users] Crush Rules with multiple Device Classes

2018-07-19 Thread Oliver Freyermuth
Am 19.07.2018 um 05:57 schrieb Konstantin Shalygin:
>> Now my first question is: 
>> 1) Is there a way to specify "take default class (ssd or nvme)"? 
>>Then we could just do this for the migration period, and at some point 
>> remove "ssd". 
>>
>> If multi-device-class in a crush rule is not supported yet, the only 
>> workaround which comes to my mind right now is to issue:
>>   $ ceph osd crush set-device-class nvme 
>> for all our old SSD-backed osds, and modify the crush rule to refer to class 
>> "nvme" straightaway. 
> 
> 
> My advice is to set class to 'nvme' to your current osd's with class 'ssd' 
> and change crush rule to this class.
> 
> You will have to do it anyway, so better sooner than later. Otherwise you would 
> have to keep using the ssd class for your future drives, until you have switched 
> all your SSDs to NVMe and can forget about the ssd class.

Yes, this sounds good. I'll schedule this for as soon as we have a small I/O 
pause in any case, just to be sure this will not interfere with ongoing I/O. 
Changing the old devices and the crush rule sounds like the best plan, then all 
future NVMEs will be handled correctly without any manual intervention. 
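
For the record, the plan spelled out in commands - a sketch only, the OSD ids 
are placeholders and the rule / pool names are taken from our setup: 

# ceph osd crush rm-device-class osd.10 osd.11 osd.12
# ceph osd crush set-device-class nvme osd.10 osd.11 osd.12
# ceph osd crush rule create-replicated cephfs_metadata_nvme default osd nvme
# ceph osd pool set cephfs_metadata crush_rule cephfs_metadata_nvme

The old class has to be removed before the new one can be set. Since the OSD 
ids stay the same, my expectation is that this should not move any data - but 
we will still do it during an I/O pause, just to be safe. 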

> 
> 
>> Here my third question:
>> 3) Are the tunables used for NVME devices the same as for SSD devices?
>>I do not find any NVME tunables here:
>>http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/
>>Only SSD, HDD and Hybrid are shown. 
> 
> Ceph doesn't care about nvme/ssd. Ceph only cares whether a drive is 
> rotational (is_rotational) or not.
> 
> 
>     "bluefs_db_rotational": "0",
>     "bluefs_slow_rotational": "1",
>     "bluefs_wal_rotational": "0",
>     "bluestore_bdev_rotational": "1",
>     "journal_rotational": "0",
>     "rotational": "1"
> 

Ah, I see! 
So those tunables (osd recovery sleep ssd, osd recovery sleep hdd, osd recovery 
sleep hybrid, and the other sleep parameters) just have a misleading name ;-). 

Thanks and all the best,
Oliver


> 
> 
> k
> 





[ceph-users] Crush Rules with multiple Device Classes

2018-07-18 Thread Oliver Freyermuth
Dear Cephalopodians,

we use an SSD-only pool to store the metadata of our CephFS. 
In the future, we will add a few NVMEs, and in the long-term view, replace the 
existing SSDs by NVMEs, too. 

Thinking this through, I came up with three questions which I do not find 
answered in the docs (yet). 

Currently, we use the following crush-rule:

rule cephfs_metadata {
id 1
type replicated
min_size 1
max_size 10
step take default class ssd
step choose firstn 0 type osd
step emit
}

As you can see, this uses "class ssd". 

Now my first question is: 
1) Is there a way to specify "take default class (ssd or nvme)"? 
   Then we could just do this for the migration period, and at some point 
remove "ssd". 

If multi-device-class in a crush rule is not supported yet, the only workaround 
which comes to my mind right now is to issue:
  $ ceph osd crush set-device-class nvme 
for all our old SSD-backed osds, and modify the crush rule to refer to class 
"nvme" straightaway. 

This leads to my second question:
2) Since the OSD IDs do not change, Ceph should not move any data around by 
changing both the device classes of the OSDs and the device class in the crush 
rule - correct? 

After this operation, adding NVMEs to our cluster should let them automatically 
join this crush rule, and once all SSDs are replaced with NVMEs, 
the workaround is automatically gone. 

As long as the SSDs are still there, some tunables might not fit well anymore 
out of the box, i.e. the "sleep" values for scrub and repair, though. 

Here my third question:
3) Are the tunables used for NVME devices the same as for SSD devices?
   I do not find any NVME tunables here:
   http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/
   Only SSD, HDD and Hybrid are shown. 

Cheers,
Oliver





Re: [ceph-users] v12.2.7 Luminous released

2018-07-18 Thread Oliver Freyermuth
Am 18.07.2018 um 16:20 schrieb Sage Weil:
> On Wed, 18 Jul 2018, Oliver Freyermuth wrote:
>> Am 18.07.2018 um 14:20 schrieb Sage Weil:
>>> On Wed, 18 Jul 2018, Linh Vu wrote:
>>>> Thanks for all your hard work in putting out the fixes so quickly! :)
>>>>
>>>> We have a cluster on 12.2.5 with Bluestore and EC pool but for CephFS, 
>>>> not RGW. In the release notes, it says RGW is a risk especially the 
>>>> garbage collection, and the recommendation is to either pause IO or 
>>>> disable RGW garbage collection.
>>>>
>>>>
>>>> In our case with CephFS, not RGW, is it a lot less risky to perform the 
>>>> upgrade to 12.2.7 without the need to pause IO?
>>>>
>>>>
>>>> What does pause IO do? Do current sessions just get queued up and IO 
>>>> resume normally with no problem after unpausing?
>>>>
>>>>
>>>> If we have to pause IO, is it better to do something like: pause IO, 
>>>> restart OSDs on one node, unpause IO - repeated for all the nodes 
>>>> involved in the EC pool?
>>
>> Hi!
>>
>> sorry for asking again, but... 
>>
>>>
>>> CephFS can generate a problem rados workload too when files are deleted or 
>>> truncated.  If that isn't happening in your workload then you're probably 
>>> fine.  If deletes are mixed in, then you might consider pausing IO for the 
>>> upgrade.
>>>
>>> FWIW, if you have been running 12.2.5 for a while and haven't encountered 
>>> the OSD FileStore crashes with
>>>
>>> src/os/filestore/FileStore.cc: 5524: FAILED assert(0 == "ERROR: source must 
>>> exist")
>>>
>>> but have had OSDs go up/down then you are probably okay.
>>
>> => Does this issue only affect filestore, or also bluestore? 
>> In your "IMPORTANT" warning mail, you wrote:
>> "It seems to affect filestore and busy clusters with this specific 
>> workload."
>> concerning this issue. 
>> However, the release notes do not mention explicitly that only Filestore is 
>> affected. 
>>
>> Both Linh Vu and me are using Bluestore (exclusively). 
>> Are we potentially affected unless we pause I/O during the upgrade? 
> 
> The bug should apply to both FileStore and BlueStore, but we have only 
> seen crashes with FileStore.  I'm not entirely sure why that is.  One 
> theory is that the filestore apply timing is different and that makes the 
> bug more likely to happen.  Another is that filestore splitting is a 
> "good" source of that latency that tends to trigger the bug easily.
> 
> If it were me I would err on the safe side. :)

That's certainly the choice of a sage ;-). 

We'll do that, too - we informed our users just now I/O will be blocked for 
thirty minutes or so to give us some leeway for the upgrade... 
They will certainly survive the pause with the nice weather outside :-). 
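
For anyone finding this thread later: the pause itself is just the cluster-wide 
pause flag, which blocks all client reads and writes until it is cleared again, 
so roughly: 

# ceph osd set pause
(perform the upgrade / restart the OSDs)
# ceph osd unset pause
# ceph osd dump | grep flags

The last command is just to verify that the flag is really gone afterwards. 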

Cheers and many thanks,
Oliver

> 
> sage
> 






Re: [ceph-users] v12.2.7 Luminous released

2018-07-18 Thread Oliver Freyermuth
Am 18.07.2018 um 14:20 schrieb Sage Weil:
> On Wed, 18 Jul 2018, Linh Vu wrote:
>> Thanks for all your hard work in putting out the fixes so quickly! :)
>>
>> We have a cluster on 12.2.5 with Bluestore and EC pool but for CephFS, 
>> not RGW. In the release notes, it says RGW is a risk especially the 
>> garbage collection, and the recommendation is to either pause IO or 
>> disable RGW garbage collection.
>>
>>
>> In our case with CephFS, not RGW, is it a lot less risky to perform the 
>> upgrade to 12.2.7 without the need to pause IO?
>>
>>
>> What does pause IO do? Do current sessions just get queued up and IO 
>> resume normally with no problem after unpausing?
>>
>>
>> If we have to pause IO, is it better to do something like: pause IO, 
>> restart OSDs on one node, unpause IO - repeated for all the nodes 
>> involved in the EC pool?

Hi!

sorry for asking again, but... 

> 
> CephFS can generate a problem rados workload too when files are deleted or 
> truncated.  If that isn't happening in your workload then you're probably 
> fine.  If deletes are mixed in, then you might consider pausing IO for the 
> upgrade.
> 
> FWIW, if you have been running 12.2.5 for a while and haven't encountered 
> the OSD FileStore crashes with
> 
> src/os/filestore/FileStore.cc: 5524: FAILED assert(0 == "ERROR: source must 
> exist")
> 
> but have had OSDs go up/down then you are probably okay.

=> Does this issue only affect filestore, or also bluestore? 
In your "IMPORTANT" warning mail, you wrote:
"It seems to affect filestore and busy clusters with this specific 
workload."
concerning this issue. 
However, the release notes do not mention explicitly that only Filestore is 
affected. 

Both Linh Vu and me are using Bluestore (exclusively). 
Are we potentially affected unless we pause I/O during the upgrade? 

All the best,
Oliver

> 
> Thanks!
> sage
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 





Re: [ceph-users] v12.2.7 Luminous released

2018-07-18 Thread Oliver Freyermuth
Also many thanks from my side! 

Am 18.07.2018 um 03:04 schrieb Linh Vu:
> Thanks for all your hard work in putting out the fixes so quickly! :)
> 
> We have a cluster on 12.2.5 with Bluestore and EC pool but for CephFS, not 
> RGW. In the release notes, it says RGW is a risk especially the garbage 
> collection, and the recommendation is to either pause IO or disable RGW 
> garbage collection. 
> 
> 
> In our case with CephFS, not RGW, is it a lot less risky to perform the 
> upgrade to 12.2.7 without the need to pause IO? 
> 
> 
> What does pause IO do? Do current sessions just get queued up and IO resume 
> normally with no problem after unpausing? 

That's my understanding: pause blocks any reads and writes. If the processes 
accessing CephFS do not have any wallclock-related timeout handlers, they 
should be fine IMHO. 
I'm unsure how NFS Ganesha reacts to a longer pause, though. 
But indeed I have the very same question - we also have a pure CephFS cluster, 
without RGW, EC-pool-backed, on 12.2.5. Should we pause IO during the upgrade? 

I wonder whether it is risky or safe to upgrade without pausing I/O. 
The update notes in the blog do not state whether a pure CephFS setup is 
affected. 

Cheers,
Oliver

> 
> 
> If we have to pause IO, is it better to do something like: pause IO, restart 
> OSDs on one node, unpause IO - repeated for all the nodes involved in the EC 
> pool? 
> 
> 
> Regards,
> 
> Linh
> 
> --
> *From:* ceph-users  on behalf of Sage Weil 
> 
> *Sent:* Wednesday, 18 July 2018 4:42:41 AM
> *To:* Stefan Kooman
> *Cc:* ceph-annou...@ceph.com; ceph-de...@vger.kernel.org; 
> ceph-maintain...@ceph.com; ceph-us...@ceph.com
> *Subject:* Re: [ceph-users] v12.2.7 Luminous released
>  
> On Tue, 17 Jul 2018, Stefan Kooman wrote:
>> Quoting Abhishek Lekshmanan (abhis...@suse.com):
>>
>> > *NOTE* The v12.2.5 release has a potential data corruption issue with
>> > erasure coded pools. If you ran v12.2.5 with erasure coding, please see
> ^^^
>> > below.
>>
>> < snip >
>>
>> > Upgrading from v12.2.5 or v12.2.6
>> > -
>> >
>> > If you used v12.2.5 or v12.2.6 in combination with erasure coded
> ^
>> > pools, there is a small risk of corruption under certain workloads.
>> > Specifically, when:
>>
>> < snip >
>>
>> One section mentions Luminous clusters _with_ EC pools specifically, the 
>> other
>> section mentions Luminous clusters running 12.2.5.
> 
> I think they both do?
> 
>> I might be misreading this, but to make things clear for current Ceph
>> Luminous 12.2.5 users. Is the following statement correct?
>>
>> If you do _NOT_ use EC in your 12.2.5 cluster (only replicated pools), there 
>> is
>> no need to quiesce IO (ceph osd pause).
> 
> Correct.
> 
>> http://docs.ceph.com/docs/master/releases/luminous/#upgrading-from-other-versions
>> If your cluster did not run v12.2.5 or v12.2.6 then none of the above
>> issues apply to you and you should upgrade normally.
>>
>> ^^ Above section would indicate all 12.2.5 luminous clusters.
> 
> The intent here is to clarify that any cluster running 12.2.4 or
> older can upgrade without reading carefully. If the cluster
> does/did run 12.2.5 or .6, then read carefully because it may (or may not)
> be affected.
> 
> Does that help? Any suggested revisions to the wording in the release
> notes that make it clearer are welcome!
> 
> Thanks-
> sage
> 
> 
>>
>> Please clarify,
>>
>> Thanks,
>>
>> Stefan
>>
>> --
>> | BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
>> | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> ___
> ceph-users mailing 

Re: [ceph-users] mds daemon damaged

2018-07-13 Thread Oliver Freyermuth
Hi Kevin,

Am 13.07.2018 um 04:21 schrieb Kevin:
> That thread looks exactly like what I'm experiencing. Not sure why my 
> repeated googles didn't find it!

maybe the thread was still too "fresh" for Google's indexing. 

> 
> I'm running 12.2.6 and CentOS 7
> 
> And yes, I recently upgraded from jewel to luminous following the 
> instructions of changing the repo and then updating. Everything has been 
> working fine up until this point
> 
> Given that previous thread I feel at a bit of a loss as to what to try now 
> since that thread ended with no resolution I could see.

I hope the thread is still continuing, given that another affected person just 
commented on it. 
We had also planned to upgrade our production cluster to 12.2.6 (also on CentOS 7) 
this weekend, since we are affected by two ceph-fuse bugs 
which have been causing inconsistent directory contents for months and are fixed 
in 12.2.6 - 
but given this situation, we'd rather live with that a bit longer and hold off 
on the update... 

> 
> Thanks for pointing that out though, it seems like almost the exact same 
> situation
> 
> On 2018-07-12 18:23, Oliver Freyermuth wrote:
>> Hi,
>>
>> all this sounds an awful lot like:
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-July/027992.html
>> In that case, things started with an update to 12.2.6. Which version
>> are you running?
>>
>> Cheers,
>> Oliver
>>
>> Am 12.07.2018 um 23:30 schrieb Kevin:
>>> Sorry for the long posting but trying to cover everything
>>>
>>> I woke up to find my cephfs filesystem down. This was in the logs
>>>
>>> 2018-07-11 05:54:10.398171 osd.1 [ERR] 2.4 full-object read crc 0x6fc2f65a 
>>> != expected 0x1c08241c on 2:292cf221:::200.:head
>>>
>>> I had one standby MDS, but as far as I can tell it did not fail over. This 
>>> was in the logs
>>>
>>> (insufficient standby MDS daemons available)
>>>
>>> Currently my ceph looks like this
>>>   cluster:
>>>     id: ..
>>>     health: HEALTH_ERR
>>>     1 filesystem is degraded
>>>     1 mds daemon damaged
>>>
>>>   services:
>>>     mon: 6 daemons, quorum ds26,ds27,ds2b,ds2a,ds28,ds29
>>>     mgr: ids27(active)
>>>     mds: test-cephfs-1-0/1/1 up , 3 up:standby, 1 damaged
>>>     osd: 5 osds: 5 up, 5 in
>>>
>>>   data:
>>>     pools:   3 pools, 202 pgs
>>>     objects: 1013k objects, 4018 GB
>>>     usage:   12085 GB used, 6544 GB / 18630 GB avail
>>>     pgs: 201 active+clean
>>>  1   active+clean+scrubbing+deep
>>>
>>>   io:
>>>     client:   0 B/s rd, 0 op/s rd, 0 op/s wr
>>>
>>> I started trying to get the damaged MDS back online
>>>
>>> Based on this page 
>>> http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#disaster-recovery-experts
>>>
>>> # cephfs-journal-tool journal export backup.bin
>>> 2018-07-12 13:35:15.675964 7f3e1389bf00 -1 Header 200. is unreadable
>>> 2018-07-12 13:35:15.675977 7f3e1389bf00 -1 journal_export: Journal not 
>>> readable, attempt object-by-object dump with `rados`
>>> Error ((5) Input/output error)
>>>
>>> # cephfs-journal-tool event recover_dentries summary
>>> Events by type:
>>> 2018-07-12 13:36:03.000590 7fc398a18f00 -1 Header 200. is 
>>> unreadableErrors: 0
>>>
>>> cephfs-journal-tool journal reset - (I think this command might have worked)
>>>
>>> Next up, tried to reset the filesystem
>>>
>>> ceph fs reset test-cephfs-1 --yes-i-really-mean-it
>>>
>>> Each time same errors
>>>
>>> 2018-07-12 11:56:35.760449 mon.ds26 [INF] Health check cleared: MDS_DAMAGE 
>>> (was: 1 mds daemon damaged)
>>> 2018-07-12 11:56:35.856737 mon.ds26 [INF] Standby daemon mds.ds27 assigned 
>>> to filesystem test-cephfs-1 as rank 0
>>> 2018-07-12 11:56:35.947801 mds.ds27 [ERR] Error recovering journal 0x200: 
>>> (5) Input/output error
>>> 2018-07-12 11:56:36.900807 mon.ds26 [ERR] Health check failed: 1 mds daemon 
>>> damaged (MDS_DAMAGE)
>>> 2018-07-12 11:56:35.945544 osd.0 [ERR] 2.4 full-object read crc 0x6fc2f65a 
>>> != expected 0x1c08241c on 2:292cf221:::200.:head
>>> 2018-07-12 12:00:00.000142 mon.ds26 [ERR] overall HEALTH_ERR 1 filesystem 
>>> is degraded; 1 mds daemon damaged
>>>
>>> Tried to 

Re: [ceph-users] mds daemon damaged

2018-07-12 Thread Oliver Freyermuth
Hi,

all this sounds an awful lot like:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-July/027992.html
In that case, things started with an update to 12.2.6. Which version are you 
running? 

Cheers,
Oliver

Am 12.07.2018 um 23:30 schrieb Kevin:
> Sorry for the long posting but trying to cover everything
> 
> I woke up to find my cephfs filesystem down. This was in the logs
> 
> 2018-07-11 05:54:10.398171 osd.1 [ERR] 2.4 full-object read crc 0x6fc2f65a != 
> expected 0x1c08241c on 2:292cf221:::200.:head
> 
> I had one standby MDS, but as far as I can tell it did not fail over. This 
> was in the logs
> 
> (insufficient standby MDS daemons available)
> 
> Currently my ceph looks like this
>   cluster:
>     id: ..
>     health: HEALTH_ERR
>     1 filesystem is degraded
>     1 mds daemon damaged
> 
>   services:
>     mon: 6 daemons, quorum ds26,ds27,ds2b,ds2a,ds28,ds29
>     mgr: ids27(active)
>     mds: test-cephfs-1-0/1/1 up , 3 up:standby, 1 damaged
>     osd: 5 osds: 5 up, 5 in
> 
>   data:
>     pools:   3 pools, 202 pgs
>     objects: 1013k objects, 4018 GB
>     usage:   12085 GB used, 6544 GB / 18630 GB avail
>     pgs: 201 active+clean
>  1   active+clean+scrubbing+deep
> 
>   io:
>     client:   0 B/s rd, 0 op/s rd, 0 op/s wr
> 
> I started trying to get the damaged MDS back online
> 
> Based on this page 
> http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#disaster-recovery-experts
> 
> # cephfs-journal-tool journal export backup.bin
> 2018-07-12 13:35:15.675964 7f3e1389bf00 -1 Header 200. is unreadable
> 2018-07-12 13:35:15.675977 7f3e1389bf00 -1 journal_export: Journal not 
> readable, attempt object-by-object dump with `rados`
> Error ((5) Input/output error)
> 
> # cephfs-journal-tool event recover_dentries summary
> Events by type:
> 2018-07-12 13:36:03.000590 7fc398a18f00 -1 Header 200. is 
> unreadableErrors: 0
> 
> cephfs-journal-tool journal reset - (I think this command might have worked)
> 
> Next up, tried to reset the filesystem
> 
> ceph fs reset test-cephfs-1 --yes-i-really-mean-it
> 
> Each time same errors
> 
> 2018-07-12 11:56:35.760449 mon.ds26 [INF] Health check cleared: MDS_DAMAGE 
> (was: 1 mds daemon damaged)
> 2018-07-12 11:56:35.856737 mon.ds26 [INF] Standby daemon mds.ds27 assigned to 
> filesystem test-cephfs-1 as rank 0
> 2018-07-12 11:56:35.947801 mds.ds27 [ERR] Error recovering journal 0x200: (5) 
> Input/output error
> 2018-07-12 11:56:36.900807 mon.ds26 [ERR] Health check failed: 1 mds daemon 
> damaged (MDS_DAMAGE)
> 2018-07-12 11:56:35.945544 osd.0 [ERR] 2.4 full-object read crc 0x6fc2f65a != 
> expected 0x1c08241c on 2:292cf221:::200.:head
> 2018-07-12 12:00:00.000142 mon.ds26 [ERR] overall HEALTH_ERR 1 filesystem is 
> degraded; 1 mds daemon damaged
> 
> Tried to 'fail' mds.ds27
> # ceph mds fail ds27
> # failed mds gid 1929168
> 
> Command worked, but each time I run the reset command the same errors above 
> appear
> 
> Online searches say the object read error has to be removed. But there's no 
> object listed. This web page is the closest to the issue
> http://tracker.ceph.com/issues/20863
> 
> Recommends fixing error by hand. Tried running deep scrub on pg 2.4, it 
> completes but still have the same issue above
> 
> Final option is to attempt removing mds.ds27. If mds.ds29 was a standby and 
> has data it should become live. If it was not
> I assume we will lose the filesystem at this point
> 
> Why didn't the standby MDS failover?
> 
> Just looking for any way to recover the cephfs, thanks!
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






Re: [ceph-users] Bug? Ceph-volume /var/lib/ceph/osd permissions

2018-06-02 Thread Oliver Freyermuth
Am 02.06.2018 um 12:35 schrieb Marc Roos:
> 
> o+w? I don’t think that is necessary not?

I also wondered about that, but it seems safe - it's only a tmpfs,
with sticky bit set - and all files within have:
-rw-------.
as you can check. 
Also, on our systems, we have:
drwxr-x---.
for /var/lib/ceph, so nobody can enter there in the first place. 

Still it would be nice to remove the unnecessary permissions from
the OSD subdirectories. I guess what's there now is just the tmpfs default 
without any mask... 
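
In the meantime, one can of course check the mount and tighten the mode by 
hand - just a sketch, ceph-20 taken from your listing, and since ceph-volume 
recreates the tmpfs on each activation this will not survive the next OSD 
restart: 

# findmnt /var/lib/ceph/osd/ceph-20
# chmod 750 /var/lib/ceph/osd/ceph-20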

Cheers,
Oliver


> 
> drwxr-xr-x  2 ceph ceph 182 May  9 12:59 ceph-15
> drwxr-xr-x  2 ceph ceph 182 May  9 20:51 ceph-14
> drwxr-xr-x  2 ceph ceph 182 May 12 10:32 ceph-16
> drwxr-xr-x  2 ceph ceph   6 Jun  2 17:21 ceph-19
> drwxr-x--- 13 ceph ceph 168 Jun  2 17:47 .
> drwxrwxrwt  2 ceph ceph 300 Jun  2 17:47 ceph-20 <<<
> 
> I feel like beta tester, playing a bit with this ceph-volume.
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 





Re: [ceph-users] Should ceph-volume lvm prepare not be backwards compitable with ceph-disk?

2018-06-02 Thread Oliver Freyermuth
Am 02.06.2018 um 11:44 schrieb Marc Roos:
> 
> 
> ceph-disk does not require bootstrap-osd/ceph.keyring and ceph-volume 
> does

I believe that's expected when you use "prepare". 
For ceph-volume, "prepare" already bootstraps the OSD and fetches a fresh OSD 
id,
for which it needs the keyring. 
For ceph-disk, this was not part of "prepare", but you only needed a key for 
"activate" later, I think. 

Since we always use "create" here via ceph-deploy, I'm not an expert on the 
subtle command differences, though - 
but ceph-deploy is doing a good job at making you survive without learning them 
;-). 
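
If the keyring is simply missing on that node, fetching it should be enough - a 
sketch, assuming an admin keyring is available on that host (otherwise, copy 
the file over from a mon node): 

# ceph auth get client.bootstrap-osd -o /var/lib/ceph/bootstrap-osd/ceph.keyring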

Cheers,
Oliver

> 
> 
> 
> [@~]# ceph-disk prepare --bluestore --zap-disk /dev/sdf
> 
> ***
> Found invalid GPT and valid MBR; converting MBR to GPT format.
> ***
> 
> GPT data structures destroyed! You may now partition the disk using 
> fdisk or
> other utilities.
> Creating new GPT entries.
> The operation has completed successfully.
> The operation has completed successfully.
> The operation has completed successfully.
> The operation has completed successfully.
> meta-data=/dev/sdf1  isize=2048   agcount=4, agsize=6400 
> blks
>  =   sectsz=4096  attr=2, projid32bit=1
>  =   crc=1finobt=0, sparse=0
> data =   bsize=4096   blocks=25600, imaxpct=25
>  =   sunit=0  swidth=0 blks
> naming   =version 2  bsize=4096   ascii-ci=0 ftype=1
> log  =internal log   bsize=4096   blocks=1608, version=2
>  =   sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none   extsz=4096   blocks=0, rtextents=0
> Warning: The kernel is still using the old partition table.
> The new table will be used at the next reboot.
> The operation has completed successfully.
> 
> [@~]# ceph-disk  zap /dev/sdf
> /dev/sdf1: 4 bytes were erased at offset 0x (xfs): 58 46 53 42
> 100+0 records in
> 100+0 records out
> 104857600 bytes (105 MB) copied, 0.946816 s, 111 MB/s
> 110+0 records in
> 110+0 records out
> 115343360 bytes (115 MB) copied, 0.876412 s, 132 MB/s
> Caution: invalid backup GPT header, but valid main header; regenerating
> backup header from main header.
> 
> Warning! Main and backup partition tables differ! Use the 'c' and 'e' 
> options
> on the recovery & transformation menu to examine the two tables.
> 
> Warning! One or more CRCs don't match. You should repair the disk!
> 
> 
> 
> Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but 
> disk
> verification and recovery are STRONGLY recommended.
> 
> 
> GPT data structures destroyed! You may now partition the disk using 
> fdisk or
> other utilities.
> Creating new GPT entries.
> The operation has completed successfully.
> 
> 
> 
> [@ ~]# fdisk -l /dev/sdf
> WARNING: fdisk GPT support is currently new, and therefore in an 
> experimental phase. Use at your own discretion.
> 
> Disk /dev/sdf: 3000.6 GB, 3000592982016 bytes, 5860533168 sectors
> Units = sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 4096 bytes
> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
> Disk label type: gpt
> Disk identifier: 7DB3B9B6-CD8E-41B5-85BA-3ABB566BAF8E
> 
> 
> # Start  EndSize  TypeName
> 
> 
> [@ ~]# ceph-volume lvm prepare --bluestore --data /dev/sdf
> Running command: /bin/ceph-authtool --gen-print-key
> Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd 
> --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 
> 8a2440c2-55a3-4b09-8906-965c25e36066
>  stderr: 2018-06-02 17:00:47.309487 7f5a083c1700 -1 auth: unable to find 
> a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file 
> or directory
>  stderr: 2018-06-02 17:00:47.309502 7f5a083c1700 -1 monclient: ERROR: 
> missing keyring, cannot use cephx for authentication
>  stderr: 2018-06-02 17:00:47.309505 7f5a083c1700  0 librados: 
> client.bootstrap-osd initialization error (2) No such file or directory
>  stderr: [errno 2] error connecting to the cluster
> -->  RuntimeError: Unable to create a new OSD id
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 






Re: [ceph-users] Bug? ceph-volume zap not working

2018-06-02 Thread Oliver Freyermuth
The command mapping from ceph-disk to ceph-volume is certainly not 1:1. 
What we ended up using is:
ceph-volume lvm zap /dev/sda --destroy
This takes care of destroying the PVs and LVs (as the documentation says). 
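
If a plain zap has already left orphaned LVM metadata behind (as in your 
output), something along these lines should clean it up by hand - the VG name 
is a placeholder, ceph-volume names them ceph-<uuid>: 

# pvs; vgs; lvs
# vgremove -f ceph-<uuid>
# pvremove /dev/sdf
# wipefs --all /dev/sdf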

Cheers,
Oliver

Am 02.06.2018 um 12:16 schrieb Marc Roos:
> 
> I guess zap should be used instead of destroy? Maybe keep ceph-disk 
> backwards compatibility and keep destroy??
> 
> [root@c03 bootstrap-osd]# ceph-volume lvm zap /dev/sdf
> --> Zapping: /dev/sdf
> --> Unmounting /var/lib/ceph/osd/ceph-19
> Running command: umount -v /var/lib/ceph/osd/ceph-19
>  stderr: umount: /var/lib/ceph/osd/ceph-19 (tmpfs) unmounted
> Running command: wipefs --all /dev/sdf
>  stderr: wipefs: error: /dev/sdf: probing initialization failed: Device 
> or resource busy
> -->  RuntimeError: command returned non-zero exit status: 1
> 
> Pvs / lvs are still there, I guess these are keeping the 'resource busy'
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 





Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"

2018-05-31 Thread Oliver Freyermuth
Am 01.06.2018 um 02:59 schrieb Yan, Zheng:
> On Wed, May 30, 2018 at 5:17 PM, Oliver Freyermuth
>  wrote:
>> Am 30.05.2018 um 10:37 schrieb Yan, Zheng:
>>> On Wed, May 30, 2018 at 3:04 PM, Oliver Freyermuth
>>>  wrote:
>>>> Hi,
>>>>
>>>> ij our case, there's only a single active MDS
>>>> (+1 standby-replay + 1 standby).
>>>> We also get the health warning in case it happens.
>>>>
>>>
>>> Were there "client.xxx isn't responding to mclientcaps(revoke)"
>>> warnings in cluster log.  please send them to me if there were.
>>
>> Yes, indeed, I almost missed them!
>>
>> Here you go:
>>
>> 
>> 2018-05-29 12:16:02.491186 mon.mon003 mon.0 10.161.8.40:6789/0 11177 : 
>> cluster [WRN] MDS health message (mds.0): Client XXX:XXX failing to 
>> respond to capability release
>> 2018-05-29 12:16:03.401014 mon.mon003 mon.0 10.161.8.40:6789/0 11178 : 
>> cluster [WRN] Health check failed: 1 clients failing to respond to 
>> capability release (MDS_CLIENT_LATE_RELEASE)
>> 
>> 2018-05-29 12:16:00.567520 mds.mon001 mds.0 10.161.8.191:6800/3068262341 
>> 15745 : cluster [WRN] client.1524813 isn't responding to 
>> mclientcaps(revoke), ino 0x1388ae0 pending pAsLsXsFr issued pAsLsXsFrw, 
>> sent 63.908382 seconds ago
>> 
>> <repetition of message with increasing delays in between>
>> 
>> 2018-05-29 16:31:00.899416 mds.mon001 mds.0 10.161.8.191:6800/3068262341 
>> 17169 : cluster [WRN] client.1524813 isn't responding to 
>> mclientcaps(revoke), ino 0x1388ae0 pending pAsLsXsFr issued pAsLsXsFrw, 
>> sent 15364.240272 seconds ago
>> 
>>
>> After evicting the client, I also get:
>> 2018-05-29 17:00:00.000134 mon.mon003 mon.0 10.161.8.40:6789/0 11293 : 
>> cluster [WRN] overall HEALTH_WARN 1 clients failing to respond to capability 
>> release; 1 MDSs report slow requests
>> 2018-05-29 17:09:50.964730 mon.mon003 mon.0 10.161.8.40:6789/0 11297 : 
>> cluster [INF] MDS health message cleared (mds.0): Client XXX:XXX 
>> failing to respond to capability release
>> 2018-05-29 17:09:50.964767 mon.mon003 mon.0 10.161.8.40:6789/0 11298 : 
>> cluster [INF] MDS health message cleared (mds.0): 123 slow requests are 
>> blocked > 30 sec
>> 2018-05-29 17:09:51.015071 mon.mon003 mon.0 10.161.8.40:6789/0 11299 : 
>> cluster [INF] Health check cleared: MDS_CLIENT_LATE_RELEASE (was: 1 clients 
>> failing to respond to capability release)
>> 2018-05-29 17:09:51.015154 mon.mon003 mon.0 10.161.8.40:6789/0 11300 : 
>> cluster [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report 
>> slow requests)
>> 2018-05-29 17:09:51.015191 mon.mon003 mon.0 10.161.8.40:6789/0 11301 : 
>> cluster [INF] Cluster is now healthy
>> 2018-05-29 17:14:26.178321 mds.mon002 mds.34884 10.161.8.192:6800/2102077019 
>> 8 : cluster [WRN]  replayed op client.1495010:32710304,32710299 used ino 
>> 0x13909d0 but session next is 0x1388af6
>> 2018-05-29 17:14:26.178393 mds.mon002 mds.34884 10.161.8.192:6800/2102077019 
>> 9 : cluster [WRN]  replayed op client.1495010:32710306,32710299 used ino 
>> 0x13909d1 but session next is 0x1388af6
>> 2018-05-29 18:00:00.000132 mon.mon003 mon.0 10.161.8.40:6789/0 11304 : 
>> cluster [INF] overall HEALTH_OK
>>
>> Thanks for looking into it!
>>
>> Cheers,
>> Oliver
>>
>>
> 
> I found cause of your issue. http://tracker.ceph.com/issues/24369

Wow, many thanks! 
I did not yet manage to reproduce the stuck behaviour, since the user who could 
reliably cause it made use of the national holiday around here. 

But the issue seems extremely likely to be exactly that one - quotas are set 
for the directory tree which was affected. 
Let me know if I still should ask him to reproduce and collect the information 
from the client to confirm. 
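
In case it helps others to check whether they could be affected: the quotas are 
just virtual xattrs on the directory, so something like this (the path is a 
placeholder) shows whether any are set: 

# getfattr -n ceph.quota.max_bytes /cephfs/some/dir
# getfattr -n ceph.quota.max_files /cephfs/some/dir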

Many thanks and cheers,
Oliver

> 
>>>
>>>> Cheers,
>>>> Oliver
>>>>
>>>> Am 30.05.2018 um 03:25 schrieb Yan, Zheng:
>>>>> I could be http://tracker.ceph.com/issues/24172
>>>>>
>>>>>
>>>>> On Wed, May 30, 2018 at 9:01 AM, Linh Vu  wrote:
>>>>>> In my case, I have multiple active MDS (with directory pinning at the 
>>>>>> very
>>>>>> top level), and there would be "Client xxx failing to respond to 
>>>>>> capability
>>>>>> release" health warning every single time that happens.
>>>>>>
>>>

Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"

2018-05-30 Thread Oliver Freyermuth
Am 30.05.2018 um 10:37 schrieb Yan, Zheng:
> On Wed, May 30, 2018 at 3:04 PM, Oliver Freyermuth
>  wrote:
>> Hi,
>>
>> ij our case, there's only a single active MDS
>> (+1 standby-replay + 1 standby).
>> We also get the health warning in case it happens.
>>
> 
> Were there "client.xxx isn't responding to mclientcaps(revoke)"
> warnings in cluster log.  please send them to me if there were.

Yes, indeed, I almost missed them!

Here you go:


2018-05-29 12:16:02.491186 mon.mon003 mon.0 10.161.8.40:6789/0 11177 : cluster 
[WRN] MDS health message (mds.0): Client XXX:XXX failing to respond to 
capability release
2018-05-29 12:16:03.401014 mon.mon003 mon.0 10.161.8.40:6789/0 11178 : cluster 
[WRN] Health check failed: 1 clients failing to respond to capability release 
(MDS_CLIENT_LATE_RELEASE)

2018-05-29 12:16:00.567520 mds.mon001 mds.0 10.161.8.191:6800/3068262341 15745 
: cluster [WRN] client.1524813 isn't responding to mclientcaps(revoke), ino 
0x1388ae0 pending pAsLsXsFr issued pAsLsXsFrw, sent 63.908382 seconds ago

<repetition of message with increasing delays in between>

2018-05-29 16:31:00.899416 mds.mon001 mds.0 10.161.8.191:6800/3068262341 17169 
: cluster [WRN] client.1524813 isn't responding to mclientcaps(revoke), ino 
0x1388ae0 pending pAsLsXsFr issued pAsLsXsFrw, sent 15364.240272 seconds ago


After evicting the client, I also get:
2018-05-29 17:00:00.000134 mon.mon003 mon.0 10.161.8.40:6789/0 11293 : cluster 
[WRN] overall HEALTH_WARN 1 clients failing to respond to capability release; 1 
MDSs report slow requests
2018-05-29 17:09:50.964730 mon.mon003 mon.0 10.161.8.40:6789/0 11297 : cluster 
[INF] MDS health message cleared (mds.0): Client XXX:XXX failing to 
respond to capability release
2018-05-29 17:09:50.964767 mon.mon003 mon.0 10.161.8.40:6789/0 11298 : cluster 
[INF] MDS health message cleared (mds.0): 123 slow requests are blocked > 30 sec
2018-05-29 17:09:51.015071 mon.mon003 mon.0 10.161.8.40:6789/0 11299 : cluster 
[INF] Health check cleared: MDS_CLIENT_LATE_RELEASE (was: 1 clients failing to 
respond to capability release)
2018-05-29 17:09:51.015154 mon.mon003 mon.0 10.161.8.40:6789/0 11300 : cluster 
[INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report slow requests)
2018-05-29 17:09:51.015191 mon.mon003 mon.0 10.161.8.40:6789/0 11301 : cluster 
[INF] Cluster is now healthy
2018-05-29 17:14:26.178321 mds.mon002 mds.34884 10.161.8.192:6800/2102077019 8 
: cluster [WRN]  replayed op client.1495010:32710304,32710299 used ino 
0x13909d0 but session next is 0x1388af6
2018-05-29 17:14:26.178393 mds.mon002 mds.34884 10.161.8.192:6800/2102077019 9 
: cluster [WRN]  replayed op client.1495010:32710306,32710299 used ino 
0x13909d1 but session next is 0x1388af6
2018-05-29 18:00:00.000132 mon.mon003 mon.0 10.161.8.40:6789/0 11304 : cluster 
[INF] overall HEALTH_OK

Thanks for looking into it!

Cheers,
Oliver


> 
>> Cheers,
>> Oliver
>>
>> Am 30.05.2018 um 03:25 schrieb Yan, Zheng:
>>> I could be http://tracker.ceph.com/issues/24172
>>>
>>>
>>> On Wed, May 30, 2018 at 9:01 AM, Linh Vu  wrote:
>>>> In my case, I have multiple active MDS (with directory pinning at the very
>>>> top level), and there would be "Client xxx failing to respond to capability
>>>> release" health warning every single time that happens.
>>>>
>>>> 
>>>> From: ceph-users  on behalf of Yan, 
>>>> Zheng
>>>> 
>>>> Sent: Tuesday, 29 May 2018 9:53:43 PM
>>>> To: Oliver Freyermuth
>>>> Cc: Ceph Users; Peter Wienemann
>>>> Subject: Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to
>>>> authpin local pins"
>>>>
>>>> Single or multiple acitve mds? Were there "Client xxx failing to
>>>> respond to capability release" health warning?
>>>>
>>>> On Mon, May 28, 2018 at 10:38 PM, Oliver Freyermuth
>>>>  wrote:
>>>>> Dear Cephalopodians,
>>>>>
>>>>> we just had a "lockup" of many MDS requests, and also trimming fell
>>>>> behind, for over 2 days.
>>>>> One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status
>>>>> "currently failed to authpin local pins". Metadata pool usage did grow by 
>>>>> 10
>>>>> GB in those 2 days.
>>>>>
>>>>> Rebooting the node to force a client eviction solved the issue, and now
>>>>> metadata usage is down again, and all stuck requests were processed 
>>>>> quickly.
>>>>

Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"

2018-05-30 Thread Oliver Freyermuth
Hi,

ij our case, there's only a single active MDS
(+1 standby-replay + 1 standby). 
We also get the health warning in case it happens. 

Cheers,
Oliver

Am 30.05.2018 um 03:25 schrieb Yan, Zheng:
> I could be http://tracker.ceph.com/issues/24172
> 
> 
> On Wed, May 30, 2018 at 9:01 AM, Linh Vu  wrote:
>> In my case, I have multiple active MDS (with directory pinning at the very
>> top level), and there would be "Client xxx failing to respond to capability
>> release" health warning every single time that happens.
>>
>> 
>> From: ceph-users  on behalf of Yan, Zheng
>> 
>> Sent: Tuesday, 29 May 2018 9:53:43 PM
>> To: Oliver Freyermuth
>> Cc: Ceph Users; Peter Wienemann
>> Subject: Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to
>> authpin local pins"
>>
>> Single or multiple acitve mds? Were there "Client xxx failing to
>> respond to capability release" health warning?
>>
>> On Mon, May 28, 2018 at 10:38 PM, Oliver Freyermuth
>>  wrote:
>>> Dear Cephalopodians,
>>>
>>> we just had a "lockup" of many MDS requests, and also trimming fell
>>> behind, for over 2 days.
>>> One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status
>>> "currently failed to authpin local pins". Metadata pool usage did grow by 10
>>> GB in those 2 days.
>>>
>>> Rebooting the node to force a client eviction solved the issue, and now
>>> metadata usage is down again, and all stuck requests were processed quickly.
>>>
>>> Is there any idea on what could cause something like that? On the client,
>>> there was no CPU load, but many processes waiting for cephfs to respond.
>>> Syslog did not yield anything. It only affected one user and his user
>>> directory.
>>>
>>> If there are no ideas: How can I collect good debug information in case
>>> this happens again?
>>>
>>> Cheers,
>>> Oliver
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>>
>>> https://protect-au.mimecast.com/s/Zl9aCXLKNwFxY9nNc6jQJC?domain=lists.ceph.com
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>






Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"

2018-05-29 Thread Oliver Freyermuth
I get the feeling this is not dependent on the exact Ceph version... 

In our case, I know what the user has done (and he'll not do it again). He 
misunderstood how our cluster works and started 1100 cluster jobs,
all entering the very same directory on CephFS (mounted via ceph-fuse on 38 
machines), all running "make clean; make -j10 install". 
So 1100 processes from 38 clients have been trying to lock / delete / write the 
very same files. 

In parallel, an IDE (eclipse) and an indexing service (zeitgeist...) may have 
accessed the very same directory via nfs-ganesha since the user mounted the 
NFS-exported directory via sshfs into his desktop home directory... 

So I can't really blame CephFS for becoming as unhappy as I would become 
myself. 
However, I would have hoped it would not enter a "stuck" state in which only 
client eviction will help... 
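
For the next time this happens, evicting just the stuck session by hand should 
be less invasive than rebooting the whole node - a sketch, the client id below 
is the one from our logs and of course has to be looked up first; as far as I 
understand, eviction also blacklists the client by default: 

# ceph tell mds.0 client ls
# ceph tell mds.0 client evict id=1524813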

Cheers,
Oliver


Am 29.05.2018 um 03:26 schrieb Linh Vu:
> I get the exact opposite to the same error message "currently failed to 
> authpin local pins". Had a few clients on ceph-fuse 12.2.2 and they ran into 
> those issues a lot (evicting works). Upgrading to ceph-fuse 12.2.5 fixed it. 
> The main cluster is on 12.2.4.
> 
> 
> The cause is user's HPC jobs or even just their login on multiple nodes 
> accessing the same files, in a particular way. Doesn't happen to other users. 
> Haven't quite dug into it deep enough as upgrading to 12.2.5 fixed our 
> problem. 
> 
> ------
> *From:* ceph-users  on behalf of Oliver 
> Freyermuth 
> *Sent:* Tuesday, 29 May 2018 7:29:06 AM
> *To:* Paul Emmerich
> *Cc:* Ceph Users; Peter Wienemann
> *Subject:* Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to 
> authpin local pins"
>  
> Dear Paul,
> 
> Am 28.05.2018 um 20:16 schrieb Paul Emmerich:
>> I encountered the exact same issue earlier today immediately after upgrading 
>> a customer's cluster from 12.2.2 to 12.2.5.
>> I've evicted the session and restarted the ganesha client to fix it, as I 
>> also couldn't find any obvious problem.
> 
> interesting! In our case, the client with the problem (it happened again a 
> few hours later...) always was a ceph-fuse client. Evicting / rebooting the 
> client node helped.
> However, it may well be that the original issue was caused by a Ganesha 
> client, which we also use (and the user in question who complained was 
> accessing files in parallel via NFS and ceph-fuse),
> but I don't have a clear indication of that.
> 
> Cheers,
>     Oliver
> 
>> 
>> Paul
>> 
>> 2018-05-28 16:38 GMT+02:00 Oliver Freyermuth > <mailto:freyerm...@physik.uni-bonn.de>>:
>> 
>> Dear Cephalopodians,
>> 
>> we just had a "lockup" of many MDS requests, and also trimming fell 
>>behind, for over 2 days.
>> One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status 
>>"currently failed to authpin local pins". Metadata pool usage did grow by 10 
>>GB in those 2 days.
>> 
>> Rebooting the node to force a client eviction solved the issue, and now 
>>metadata usage is down again, and all stuck requests were processed quickly.
>> 
>> Is there any idea on what could cause something like that? On the 
>> client, there was no CPU load, but many processes were waiting for CephFS to respond.
>> Syslog did not yield anything. It only affected one user and his user 
>>directory.
>> 
>> If there are no ideas: How can I collect good debug information in case 
>>this happens again?
>> 
>> Cheers,
>>         Oliver
>> 
>> 

Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"

2018-05-28 Thread Oliver Freyermuth
Dear Paul,

Am 28.05.2018 um 20:16 schrieb Paul Emmerich:
> I encountered the exact same issue earlier today immediately after upgrading 
> a customer's cluster from 12.2.2 to 12.2.5.
> I've evicted the session and restarted the ganesha client to fix it, as I 
> also couldn't find any obvious problem.

interesting! In our case, the client with the problem (it happened again a few 
hours later...) always was a ceph-fuse client. Evicting / rebooting the client 
node helped. 
However, it may well be that the original issue was caused by a Ganesha client, 
which we also use (and the user in question who complained was accessing files 
in parallel via NFS and ceph-fuse),
but I don't have a clear indication of that. 

Cheers,
Oliver

> 
> Paul
> 
> 2018-05-28 16:38 GMT+02:00 Oliver Freyermuth  <mailto:freyerm...@physik.uni-bonn.de>>:
> 
> Dear Cephalopodians,
> 
> we just had a "lockup" of many MDS requests, and also trimming fell 
> behind, for over 2 days.
> One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status 
> "currently failed to authpin local pins". Metadata pool usage did grow by 10 
> GB in those 2 days.
> 
> Rebooting the node to force a client eviction solved the issue, and now 
> metadata usage is down again, and all stuck requests were processed quickly.
> 
> Is there any idea on what could cause something like that? On the client, 
> there was no CPU load, but many processes were waiting for CephFS to respond.
> Syslog did not yield anything. It only affected one user and his user 
> directory.
> 
> If there are no ideas: How can I collect good debug information in case 
> this happens again?
> 
> Cheers,
>         Oliver
> 
> 
> 
> 
> 
> 
> -- 
> Paul Emmerich
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io
> 
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io <http://www.croit.io>
> Tel: +49 89 1896585 90





Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Oliver Freyermuth
Am 25.05.2018 um 15:39 schrieb Sage Weil:
> On Fri, 25 May 2018, Oliver Freyermuth wrote:
>> Dear Ric,
>>
>> I played around a bit - the common denominator seems to be: Moving it 
>> within a directory subtree below a directory for which max_bytes / 
>> max_files quota settings are set, things work fine. Moving it to another 
>> directory tree without quota settings / with different quota settings, 
>> rename() returns EXDEV.
> 
> Aha, yes, this is the issue.
> 
> When you set a quota you force subvolume-like behavior.  This is done 
> because hard links across this quota boundary won't correctly account for 
> utilization (only one of the file links will accrue usage).  The 
> expectation is that quotas are usually set in locations that aren't 
> frequently renamed across.

Understood, that explains it. That's indeed also true for our application in 
most cases - 
but sometimes, we have the case that users want to migrate their data to group 
storage, or vice-versa. 

> 
> It might be possible to allow rename(2) to proceed in cases where 
> nlink==1, but the behavior will probably seem inconsistent (some files get 
> EXDEV, some don't).

I believe even this would be extremely helpful, performance-wise. At least in 
our case, hard links are seldom used;
it's more about data movement between user, group and scratch areas. 
For files with nlink>1, it's more or less expected that a copy has to be performed 
when crossing quota boundaries (I think). 

Cheers,
Oliver

> 
> sage
> 
> 
> 
>>
>> Cheers, Oliver
>>
>>
>> Am 25.05.2018 um 15:18 schrieb Ric Wheeler:
>>> That seems to be the issue - we need to understand why rename sees them as 
>>> different.
>>>
>>> Ric
>>>
>>>
>>> On Fri, May 25, 2018, 9:15 AM Oliver Freyermuth 
>>> <freyerm...@physik.uni-bonn.de <mailto:freyerm...@physik.uni-bonn.de>> 
>>> wrote:
>>>
>>> Mhhhm... that's funny, I checked an mv with an strace now. I get:
>>> 
>>> -
>>> access("/cephfs/some_folder/file", W_OK) = 0
>>> rename("foo", "/cephfs/some_folder/file") = -1 EXDEV (Invalid 
>>> cross-device link)
>>> unlink("/cephfs/some_folder/file") = 0
>>> lgetxattr("foo", "security.selinux", "system_u:object_r:fusefs_t:s0", 
>>> 255) = 30
>>> 
>>> -
>>> But I can assure it's only a single filesystem, and a single ceph-fuse 
>>> client running.
>>>
>>>     Same happens when using absolute paths.
>>>
>>> Cheers,
>>>         Oliver
>>>
>>> Am 25.05.2018 um 15:06 schrieb Ric Wheeler:
>>> > We should look at what mv uses to see if it thinks the directories 
>>> are on different file systems.
>>> >
>>> > If the fstat or whatever it looks at is confused, that might explain 
>>> it.
>>> >
>>> > Ric
>>> >
>>> >
>>> > On Fri, May 25, 2018, 9:04 AM Oliver Freyermuth 
>>> <freyerm...@physik.uni-bonn.de <mailto:freyerm...@physik.uni-bonn.de> 
>>> <mailto:freyerm...@physik.uni-bonn.de 
>>> <mailto:freyerm...@physik.uni-bonn.de>>> wrote:
>>> >
>>> >     Am 25.05.2018 um 14:57 schrieb Ric Wheeler:
>>> >     > Is this move between directories on the same file system?
>>> >
>>> >     It is, we only have a single CephFS in use. There's also only a 
>>> single ceph-fuse client running.
>>> >
>>> >     What's different, though, are different ACLs set for source and 
>>> target directory, and owner / group,
>>> >     but I hope that should not matter.
>>> >
>>> >     All the best,
>>> >     Oliver
>>> >
>>> >     > Rename as a system call only works within a file system.
>>> >     >
>>> >     > The user space mv command becomes a copy when not the same file 
>>> system. 
>>> >     >
>>> >     > Regards,
>>> >     >
>>> >     > Ric
>>> >     >
>>> >     >
>>> >     > On Fri, May 25, 2018, 8:51 AM John Spray <jsp...@redhat.com 

Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Oliver Freyermuth
Am 25.05.2018 um 15:26 schrieb Luis Henriques:
> Oliver Freyermuth <freyerm...@physik.uni-bonn.de> writes:
> 
>> Mhhhm... that's funny, I checked an mv with an strace now. I get:
>> -
>> access("/cephfs/some_folder/file", W_OK) = 0
>> rename("foo", "/cephfs/some_folder/file") = -1 EXDEV (Invalid cross-device 
>> link)
> 
> I believe this could happen if you have quotas set on any of the paths,
> or different snapshot realms.

Wow - yes, this matches my observations! 
So in this case, e.g. moving files from a "user" directory with quota to a 
"group" directory with different quota,
it's currently expected that files can not be renamed across those boundaries?
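(For reference, the quota boundary should be directly visible on the directories 
involved - a quick sketch, the paths being placeholders for our actual user / 
group trees:
-
getfattr -n ceph.quota.max_bytes /cephfs/user/someuser
getfattr -n ceph.quota.max_bytes /cephfs/group
-
If the values differ, or the attribute is only set on one side, the rename 
crosses a quota realm and mv falls back to copy + unlink, as observed.)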

Cheers,
Oliver

> 
> Cheers,
> 






Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Oliver Freyermuth
Dear Sage,

here you go, some_folder in reality is "/cephfs/group":


# stat foo
  File: ‘foo’
  Size: 1048576000  Blocks: 2048000IO Block: 4194304 regular file
Device: 27h/39d Inode: 1099515065517  Links: 1
Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
Context: system_u:object_r:fusefs_t:s0
Access: 2018-05-25 15:27:59.433279424 +0200
Modify: 2018-05-25 15:28:01.379754052 +0200
Change: 2018-05-25 15:28:01.379754052 +0200
 Birth: -

# stat -f foo
  File: "foo"
ID: 0Namelen: 255 Type: fuseblk
Block size: 4194304Fundamental block size: 4194304
Blocks: Total: 104471885  Free: 79096968   Available: 79096968
Inodes: Total: 26258533   Free: -1


# stat -f /cephfs/group/
  File: "/cephfs/group/"
ID: 0Namelen: 255 Type: fuseblk
Block size: 4194304Fundamental block size: 4194304
Blocks: Total: 104471835  Free: 79098264   Available: 79098264
Inodes: Total: 26257190   Free: -1

# stat /cephfs/group/
  File: ‘/cephfs/group/’
  Size: 73167320986856  Blocks: 1  IO Block: 4096   directory
Device: 27h/39d Inode: 1099511627888  Links: 1
Access: (0755/drwxr-xr-x)  Uid: (0/root)   Gid: (0/root)
Context: system_u:object_r:fusefs_t:s0
Access: 2018-03-09 18:22:47.061501906 +0100
Modify: 2018-05-25 15:18:02.164391701 +0200
Change: 2018-05-25 15:18:02.164391701 +0200
 Birth: -


Cheers,
Oliver

Am 25.05.2018 um 15:21 schrieb Sage Weil:
> Can you paste the output of 'stat foo' and 'stat /cephfs/some_folder'?  
> (Maybe also the same with 'stat -f'.)
> 
> Thanks!
> sage
> 
> 
> On Fri, 25 May 2018, Ric Wheeler wrote:
>> That seems to be the issue - we need to understand why rename sees them as
>> different.
>>
>> Ric
>>
>>
>> On Fri, May 25, 2018, 9:15 AM Oliver Freyermuth <
>> freyerm...@physik.uni-bonn.de> wrote:
>>
>>> Mhhhm... that's funny, I checked an mv with an strace now. I get:
>>>
>>> -
>>> access("/cephfs/some_folder/file", W_OK) = 0
>>> rename("foo", "/cephfs/some_folder/file") = -1 EXDEV (Invalid cross-device
>>> link)
>>> unlink("/cephfs/some_folder/file") = 0
>>> lgetxattr("foo", "security.selinux", "system_u:object_r:fusefs_t:s0", 255)
>>> = 30
>>>
>>> -
>>> But I can assure it's only a single filesystem, and a single ceph-fuse
>>> client running.
>>>
>>> Same happens when using absolute paths.
>>>
>>> Cheers,
>>> Oliver
>>>
>>> Am 25.05.2018 um 15:06 schrieb Ric Wheeler:
>>>> We should look at what mv uses to see if it thinks the directories are
>>> on different file systems.
>>>>
>>>> If the fstat or whatever it looks at is confused, that might explain it.
>>>>
>>>> Ric
>>>>
>>>>
>>>> On Fri, May 25, 2018, 9:04 AM Oliver Freyermuth <
>>> freyerm...@physik.uni-bonn.de <mailto:freyerm...@physik.uni-bonn.de>>
>>> wrote:
>>>>
>>>> Am 25.05.2018 um 14:57 schrieb Ric Wheeler:
>>>> > Is this move between directories on the same file system?
>>>>
>>>> It is, we only have a single CephFS in use. There's also only a
>>> single ceph-fuse client running.
>>>>
>>>> What's different, though, are different ACLs set for source and
>>> target directory, and owner / group,
>>>> but I hope that should not matter.
>>>>
>>>> All the best,
>>>> Oliver
>>>>
>>>> > Rename as a system call only works within a file system.
>>>> >
>>>> > The user space mv command becomes a copy when not the same file
>>> system.
>>>> >
>>>> > Regards,
>>>> >
>>>> > Ric
>>>> >
>>>> >
>>>> > On Fri, May 25, 2018, 8:51 AM John Spray <jsp...@redhat.com
>>> <mailto:jsp...@redhat.com> <mailto:jsp...@redhat.com >> jsp...@redhat.com>>> wrote:
>>>> >
>>>> > On Fri, May 2

Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Oliver Freyermuth
Dear Ric,

I played around a bit - the common denominator seems to be: if I move a file 
within a directory subtree below a directory for which max_bytes / max_files 
quota settings are set, things work fine. 
If I move it to another directory tree without quota settings / with different 
quota settings, rename() returns EXDEV. 

Cheers,
Oliver


Am 25.05.2018 um 15:18 schrieb Ric Wheeler:
> That seems to be the issue - we need to understand why rename sees them as 
> different.
> 
> Ric
> 
> 
> On Fri, May 25, 2018, 9:15 AM Oliver Freyermuth 
> <freyerm...@physik.uni-bonn.de <mailto:freyerm...@physik.uni-bonn.de>> wrote:
> 
> Mhhhm... that's funny, I checked an mv with an strace now. I get:
> 
> -
> access("/cephfs/some_folder/file", W_OK) = 0
> rename("foo", "/cephfs/some_folder/file") = -1 EXDEV (Invalid 
> cross-device link)
> unlink("/cephfs/some_folder/file") = 0
> lgetxattr("foo", "security.selinux", "system_u:object_r:fusefs_t:s0", 
> 255) = 30
> 
> -
> But I can assure it's only a single filesystem, and a single ceph-fuse 
> client running.
> 
> Same happens when using absolute paths.
> 
> Cheers,
>         Oliver
> 
> Am 25.05.2018 um 15:06 schrieb Ric Wheeler:
> > We should look at what mv uses to see if it thinks the directories are 
> on different file systems.
> >
> > If the fstat or whatever it looks at is confused, that might explain it.
> >
> > Ric
> >
> >
> > On Fri, May 25, 2018, 9:04 AM Oliver Freyermuth 
> <freyerm...@physik.uni-bonn.de <mailto:freyerm...@physik.uni-bonn.de> 
> <mailto:freyerm...@physik.uni-bonn.de 
> <mailto:freyerm...@physik.uni-bonn.de>>> wrote:
> >
> >     Am 25.05.2018 um 14:57 schrieb Ric Wheeler:
> >     > Is this move between directories on the same file system?
> >
> >     It is, we only have a single CephFS in use. There's also only a 
> single ceph-fuse client running.
> >
> >     What's different, though, are different ACLs set for source and 
> target directory, and owner / group,
> >     but I hope that should not matter.
> >
> >     All the best,
> >     Oliver
> >
> >     > Rename as a system call only works within a file system.
> >     >
> >     > The user space mv command becomes a copy when not the same file 
> system. 
> >     >
> >     > Regards,
> >     >
> >     > Ric
> >     >
> >     >
> >     > On Fri, May 25, 2018, 8:51 AM John Spray <jsp...@redhat.com 
> <mailto:jsp...@redhat.com> <mailto:jsp...@redhat.com 
> <mailto:jsp...@redhat.com>> <mailto:jsp...@redhat.com 
> <mailto:jsp...@redhat.com> <mailto:jsp...@redhat.com 
> <mailto:jsp...@redhat.com>>>> wrote:
> >     >
> >     >     On Fri, May 25, 2018 at 1:10 PM, Oliver Freyermuth
> >     >     <freyerm...@physik.uni-bonn.de 
> <mailto:freyerm...@physik.uni-bonn.de> <mailto:freyerm...@physik.uni-bonn.de 
> <mailto:freyerm...@physik.uni-bonn.de>> <mailto:freyerm...@physik.uni-bonn.de 
> <mailto:freyerm...@physik.uni-bonn.de> <mailto:freyerm...@physik.uni-bonn.de 
> <mailto:freyerm...@physik.uni-bonn.de>>>> wrote:
> >     >     > Dear Cephalopodians,
> >     >     >
> >     >     > I was wondering why a simple "mv" is taking extraordinarily 
> long on CephFS and must note that,
> >     >     > at least with the fuse-client (12.2.5) and when moving a 
> file from one directory to another,
> >     >     > the file appears to be copied first (byte by byte, traffic 
> going through the client?) before the initial file is deleted.
> >     >     >
> >     >     > Is this true, or am I missing something?
> >     >
> >     >     A mv should not involve copying a file through the client -- 
> it's
> >     >     implemented in the MDS as a rename from one location to 
> another.
> >     >     What's the observation that's making it seem like the data is 
> going
> >     >     through the client?
> >     >
> >     >     John
> >     >
> >     >     >
>

Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Oliver Freyermuth
Mhhhm... that's funny, I checked an mv with an strace now. I get:
-
access("/cephfs/some_folder/file", W_OK) = 0
rename("foo", "/cephfs/some_folder/file") = -1 EXDEV (Invalid cross-device link)
unlink("/cephfs/some_folder/file") = 0
lgetxattr("foo", "security.selinux", "system_u:object_r:fusefs_t:s0", 255) = 30
-
But I can assure it's only a single filesystem, and a single ceph-fuse client 
running. 

Same happens when using absolute paths. 
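That matches what coreutils mv does when rename(2) fails with EXDEV: it assumes 
a cross-filesystem move and silently falls back to copy + delete, i.e. roughly 
(a sketch with the placeholder paths from above):
-
mv foo /cephfs/some_folder/file      # rename(2) -> EXDEV
# ... so mv effectively degrades to:
cp -a foo /cephfs/some_folder/file && rm foo
-
which would explain why the data appears to be copied byte by byte through the 
client.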

Cheers,
Oliver

Am 25.05.2018 um 15:06 schrieb Ric Wheeler:
> We should look at what mv uses to see if it thinks the directories are on 
> different file systems.
> 
> If the fstat or whatever it looks at is confused, that might explain it.
> 
> Ric
> 
> 
> On Fri, May 25, 2018, 9:04 AM Oliver Freyermuth 
> <freyerm...@physik.uni-bonn.de <mailto:freyerm...@physik.uni-bonn.de>> wrote:
> 
> Am 25.05.2018 um 14:57 schrieb Ric Wheeler:
> > Is this move between directories on the same file system?
> 
> It is, we only have a single CephFS in use. There's also only a single 
> ceph-fuse client running.
> 
> What's different, though, are different ACLs set for source and target 
> directory, and owner / group,
> but I hope that should not matter.
> 
> All the best,
> Oliver
> 
> > Rename as a system call only works within a file system.
> >
> > The user space mv command becomes a copy when not the same file system. 
> >
> > Regards,
> >
> > Ric
> >
> >
> > On Fri, May 25, 2018, 8:51 AM John Spray <jsp...@redhat.com 
> <mailto:jsp...@redhat.com> <mailto:jsp...@redhat.com 
> <mailto:jsp...@redhat.com>>> wrote:
> >
> >     On Fri, May 25, 2018 at 1:10 PM, Oliver Freyermuth
> >     <freyerm...@physik.uni-bonn.de 
> <mailto:freyerm...@physik.uni-bonn.de> <mailto:freyerm...@physik.uni-bonn.de 
> <mailto:freyerm...@physik.uni-bonn.de>>> wrote:
> >     > Dear Cephalopodians,
> >     >
> >     > I was wondering why a simple "mv" is taking extraordinarily long 
> on CephFS and must note that,
> >     > at least with the fuse-client (12.2.5) and when moving a file 
> from one directory to another,
> >     > the file appears to be copied first (byte by byte, traffic going 
> through the client?) before the initial file is deleted.
> >     >
> >     > Is this true, or am I missing something?
> >
> >     A mv should not involve copying a file through the client -- it's
> >     implemented in the MDS as a rename from one location to another.
> >     What's the observation that's making it seem like the data is going
> >     through the client?
> >
> >     John
> >
> >     >
> >     > For large files, this might be rather time consuming,
> >     > and we should certainly advise all our users to not move files 
> around needlessly if this is the case.
> >     >
> >     > Cheers,
> >     >         Oliver
> >     >
> >     >
> >
> 







Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Oliver Freyermuth
Am 25.05.2018 um 14:57 schrieb Ric Wheeler:
> Is this move between directories on the same file system?

It is, we only have a single CephFS in use. There's also only a single 
ceph-fuse client running. 

What's different, though, are different ACLs set for source and target 
directory, and owner / group,
but I hope that should not matter. 

All the best,
Oliver

> Rename as a system call only works within a file system.
> 
> The user space mv command becomes a copy when not the same file system. 
> 
> Regards,
> 
> Ric
> 
> 
> On Fri, May 25, 2018, 8:51 AM John Spray <jsp...@redhat.com 
> <mailto:jsp...@redhat.com>> wrote:
> 
> On Fri, May 25, 2018 at 1:10 PM, Oliver Freyermuth
> <freyerm...@physik.uni-bonn.de <mailto:freyerm...@physik.uni-bonn.de>> 
> wrote:
> > Dear Cephalopodians,
> >
> > I was wondering why a simple "mv" is taking extraordinarily long on 
> CephFS and must note that,
> > at least with the fuse-client (12.2.5) and when moving a file from one 
> directory to another,
> > the file appears to be copied first (byte by byte, traffic going 
> through the client?) before the initial file is deleted.
> >
> > Is this true, or am I missing something?
> 
> A mv should not involve copying a file through the client -- it's
> implemented in the MDS as a rename from one location to another.
> What's the observation that's making it seem like the data is going
> through the client?
> 
> John
> 
> >
> > For large files, this might be rather time consuming,
> > and we should certainly advise all our users to not move files around 
> needlessly if this is the case.
> >
> > Cheers,
> >         Oliver
> >
> >
> 





Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Oliver Freyermuth
Am 25.05.2018 um 14:50 schrieb John Spray:
> On Fri, May 25, 2018 at 1:10 PM, Oliver Freyermuth
> <freyerm...@physik.uni-bonn.de> wrote:
>> Dear Cephalopodians,
>>
>> I was wondering why a simple "mv" is taking extraordinarily long on CephFS 
>> and must note that,
>> at least with the fuse-client (12.2.5) and when moving a file from one 
>> directory to another,
>> the file appears to be copied first (byte by byte, traffic going through the 
>> client?) before the initial file is deleted.
>>
>> Is this true, or am I missing something?
> 
> A mv should not involve copying a file through the client -- it's
> implemented in the MDS as a rename from one location to another.
> What's the observation that's making it seem like the data is going
> through the client?

The fact that it's happening with only about 1 GBit/s and all OSDs are reading 
and writing. 
I will also check the network interface of the client next time it occurs. 
Also, ceph-fuse was taking 50 % CPU load just from this. 

Also, I observe the file at the source being kept during the copy,
and the file at the target growing slowly. So it's definitely a copy, and the 
source file is only deleted at the end. 
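Next time I can also verify directly on the client that the bytes really pass 
through it - a rough sketch of what I'd look at (the admin socket path is an 
assumption and depends on the ceph-fuse instance; any interface traffic tool 
will do):
-
# watch the client's network interface while the mv runs
ifstat -i eth0 1
# and dump the ceph-fuse client counters via its admin socket
ceph daemon /var/run/ceph/ceph-client.admin.<pid>.asok perf dump
-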

> 
> John
> 
>>
>> For large files, this might be rather time consuming,
>> and we should certainly advise all our users to not move files around 
>> needlessly if this is the case.
>>
>> Cheers,
>> Oliver
>>
>>
>>





[ceph-users] CephFS "move" operation

2018-05-25 Thread Oliver Freyermuth
Dear Cephalopodians,

I was wondering why a simple "mv" is taking extraordinarily long on CephFS and 
must note that,
at least with the fuse-client (12.2.5) and when moving a file from one 
directory to another,
the file appears to be copied first (byte by byte, traffic going through the 
client?) before the initial file is deleted. 

Is this true, or am I missing something? 

For large files, this might be rather time consuming,
and we should certainly advise all our users to not move files around 
needlessly if this is the case. 
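A simple way to see the effect is to time the move of a large, freshly created 
file into another directory tree (a sketch, with a placeholder target path):
-
dd if=/dev/zero of=foo bs=1M count=1000
time mv foo /cephfs/some_other_tree/foo
-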

Cheers,
Oliver





Re: [ceph-users] Nfs-ganesha 2.6 packages in ceph repo

2018-05-16 Thread Oliver Freyermuth
Hi David,

thanks for the reply! 

Interesting that the package was not installed - it was for us, but the 
machines we run the nfs-ganesha servers on are also OSDs, so it might have been 
pulled in via ceph-packages for us. 
In any case, I'd say this means librados2 is missing as a dependency either in 
the libcephfs2 or in the nfs-ganesha packages. 

Also, good news that things work fine with 12.2.5 - so I hope our upgrade will 
also go without bumps ;-). 

My experience is sadly only a few months old. We've started with nfs-ganesha 
2.5 from the Ceph repos, but hit a bad locking issue, which I also reported to 
this list. 
After upgrading to 2.6, we did not observe any further hard issues. It seems 
that there are sometimes issues with slow locks if processes are running with a 
working directory in ceph
and other ceph-fuse clients want to access files in the same directory, but 
there are no "deadlock" situations anymore. 

In terms of tuning, I did not do anything special yet. I'm running with some 
basic NFS / Fileserver kernel tunables (sysctl):
net.core.rmem_max = 12582912
net.core.wmem_max = 12582912
net.ipv4.tcp_rmem = 10240 87380 12582912
net.ipv4.tcp_wmem = 10240 87380 12582912
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_no_metrics_save = 1
net.core.netdev_max_backlog = 25
net.core.default_qdisc = fq_codel

However, I did not do explicit testing of different values, but just followed 
general recommendations here. 

It seems ACLs and quotas are honoured by the NFS server (as expected, since it 
uses libcephfs behind the scenes). 
Right now, throughput for bulk data is close to perfect (we manage to saturate 
our 1 GBit/s link) and for metadata access it seems close to what ceph-fuse 
achieves,
which is sufficient for us. 
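In case it helps, a minimal FSAL_CEPH export block is enough to get going - the 
following is only a sketch (Export_ID and the Pseudo path are arbitrary here, 
not our exact configuration):
-
EXPORT {
    Export_ID = 1;
    Path = "/";
    Pseudo = "/cephfs";
    Access_Type = RW;
    FSAL {
        Name = CEPH;
    }
}
-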

Cheers and thanks for the feedback,
Oliver

Am 16.05.2018 um 21:06 schrieb David C:
> Hi Oliver
> 
> Thanks for following up. I just picked this up again today and it was indeed 
> librados2...the package wasn't installed! It's working now, haven't tested 
> much but I haven't noticed any problems yet. This is with 
> nfs-ganesha-2.6.1-0.1.el7.x86_64, libcephfs2-12.2.5-0.el7.x86_64 and 
> librados2-12.2.5-0.el7.x86_64. Thanks for the pointer on that.
> 
> I'd be interested to hear your experience with ganesha with cephfs if you're 
> happy to share some insights. Any tuning you would recommend?
> 
> Thanks,
> 
> On Wed, May 16, 2018 at 4:14 PM, Oliver Freyermuth 
> <freyerm...@physik.uni-bonn.de <mailto:freyerm...@physik.uni-bonn.de>> wrote:
> 
> Hi David,
> 
> did you already manage to check your librados2 version and manage to pin 
> down the issue?
> 
>     Cheers,
>         Oliver
> 
> Am 11.05.2018 um 17:15 schrieb Oliver Freyermuth:
> > Hi David,
> >
> > Am 11.05.2018 um 16:55 schrieb David C:
> >> Hi Oliver
> >>
> >> Thanks for the detailed response! I've downgraded my libcephfs2 to 
> 12.2.4 and still get a similar error:
> >>
> >> load_fsal :NFS STARTUP :CRIT :Could not dlopen 
> module:/usr/lib64/ganesha/libfsalceph.so Error:/lib64/libcephfs.so.2: 
> undefined symbol: 
> _Z14common_preinitRK18CephInitParameters18code_environment_ti
> >> load_fsal :NFS STARTUP :MAJ :Failed to load module 
> (/usr/lib64/ganesha/libfsalceph.so) because: Can not access a needed shared 
> library
> >>
> >> I'm on CentOS 7.4, using the following package versions:
> >>
> >> # rpm -qa | grep ganesha
> >> nfs-ganesha-2.6.1-0.1.el7.x86_64
> >> nfs-ganesha-vfs-2.6.1-0.1.el7.x86_64
> >> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
> >>
> >> # rpm -qa | grep ceph
> >> libcephfs2-12.2.4-0.el7.x86_64
> >> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
> >
> > Mhhhm - that sounds like a mess-up in the dependencies.
> > The symbol you are missing should be provided by
> > librados2-12.2.4-0.el7.x86_64
> > which contains
> > /usr/lib64/ceph/ceph/libcephfs-common.so.0
> > Do you have a different version of librados2 installed? If so, I wonder 
> how yum / rpm allowed that ;-).
> >
> > Thinking again, it might also be (if you indeed have a different 
> version there) that this is the cause also for the previous error.
> > If the problematic symbol is indeed not exposed, but can be resolved 
> only if both libraries (libcephfs-common and libcephfs) are loaded in unison 
> with matching versions,
> > it might be that also 12.2.5 works fine...
> >
> > First thing, in any case, is to check which version of librados2 you 
> are using ;-).
>

Re: [ceph-users] Nfs-ganesha 2.6 packages in ceph repo

2018-05-16 Thread Oliver Freyermuth
Hi David,

did you already manage to check your librados2 version and manage to pin down 
the issue? 

Cheers,
Oliver

Am 11.05.2018 um 17:15 schrieb Oliver Freyermuth:
> Hi David,
> 
> Am 11.05.2018 um 16:55 schrieb David C:
>> Hi Oliver
>>
>> Thanks for the detailed response! I've downgraded my libcephfs2 to 12.2.4 and 
>> still get a similar error:
>>
>> load_fsal :NFS STARTUP :CRIT :Could not dlopen 
>> module:/usr/lib64/ganesha/libfsalceph.so Error:/lib64/libcephfs.so.2: 
>> undefined symbol: 
>> _Z14common_preinitRK18CephInitParameters18code_environment_ti
>> load_fsal :NFS STARTUP :MAJ :Failed to load module 
>> (/usr/lib64/ganesha/libfsalceph.so) because: Can not access a needed shared 
>> library
>>
>> I'm on CentOS 7.4, using the following package versions:
>>
>> # rpm -qa | grep ganesha
>> nfs-ganesha-2.6.1-0.1.el7.x86_64
>> nfs-ganesha-vfs-2.6.1-0.1.el7.x86_64
>> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
>>
>> # rpm -qa | grep ceph
>> libcephfs2-12.2.4-0.el7.x86_64
>> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
> 
> Mhhhm - that sounds like a mess-up in the dependencies. 
> The symbol you are missing should be provided by
> librados2-12.2.4-0.el7.x86_64
> which contains
> /usr/lib64/ceph/ceph/libcephfs-common.so.0
> Do you have a different version of librados2 installed? If so, I wonder how 
> yum / rpm allowed that ;-). 
> 
> Thinking again, it might also be (if you indeed have a different version 
> there) that this is the cause also for the previous error. 
> If the problematic symbol is indeed not exposed, but can be resolved only if 
> both libraries (libcephfs-common and libcephfs) are loaded in unison with 
> matching versions,
> it might be that also 12.2.5 works fine... 
> 
> First thing, in any case, is to check which version of librados2 you are 
> using ;-). 
> 
> Cheers,
>   Oliver
> 
>>
>> I don't have the ceph user space components installed, assuming they're not 
>> necessary apart from libcephfs2? Any idea why it's giving me this error?
>>
>> Thanks,
>>
>> On Fri, May 11, 2018 at 2:17 AM, Oliver Freyermuth 
>> <freyerm...@physik.uni-bonn.de <mailto:freyerm...@physik.uni-bonn.de>> wrote:
>>
>> Hi David,
>>
>> for what it's worth, we are running with nfs-ganesha 2.6.1 from Ceph 
>> repos on CentOS 7.4 with the following set of versions:
>> libcephfs2-12.2.4-0.el7.x86_64
>> nfs-ganesha-2.6.1-0.1.el7.x86_64
>> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
>> Of course, we plan to upgrade to 12.2.5 soon-ish...
>>
>> Am 11.05.2018 um 00:05 schrieb David C:
>> > Hi All
>> > 
>> > I'm testing out the nfs-ganesha-2.6.1-0.1.el7.x86_64.rpm package from 
>> http://download.ceph.com/nfs-ganesha/rpm-V2.6-stable/luminous/x86_64/ 
>> <http://download.ceph.com/nfs-ganesha/rpm-V2.6-stable/luminous/x86_64/>
>> > 
>> > It's failing to load /usr/lib64/ganesha/libfsalceph.so
>> > 
>> > With libcephfs-12.2.1 installed I get the following error in my 
>> ganesha log:
>> > 
>> >     load_fsal :NFS STARTUP :CRIT :Could not dlopen 
>> module:/usr/lib64/ganesha/libfsalceph.so Error:
>> >     /usr/lib64/ganesha/libfsalceph.so: undefined symbol: 
>> ceph_set_deleg_timeout
>> >     load_fsal :NFS STARTUP :MAJ :Failed to load module 
>> (/usr/lib64/ganesha/libfsalceph.so) because
>> >     : Can not access a needed shared library
>>
>> That looks like an ABI incompatibility, probably the nfs-ganesha 
>> packages should block this libcephfs2-version (and older ones).
>>
>> > 
>> > 
>> > With libcephfs-12.2.5 installed I get:
>> > 
>> >     load_fsal :NFS STARTUP :CRIT :Could not dlopen 
>> module:/usr/lib64/ganesha/libfsalceph.so Error:
>> >     /lib64/libcephfs.so.2: undefined symbol: 
>> _ZNK5FSMap10parse_roleEN5boost17basic_string_viewIcSt11char_traitsIcEEEP10mds_role_tRSo
>> >     load_fsal :NFS STARTUP :MAJ :Failed to load module 
>> (/usr/lib64/ganesha/libfsalceph.so) because
>> >     : Can not access a needed shared library
>>
>> That looks ugly and makes me fear for our planned 12.2.5-upgrade.
>> Interestingly, we do not have that symbol on 12.2.4:
>> # nm -D /lib64/libcephfs.so.2 | grep FSMap
>>                  U _ZNK5FSMap10parse_roleERKSsP10mds_role_tRSo
>>                  U _ZNK5FSMap13print_summaryEPN4ceph9FormatterEPSo
>> and NFS-Gane

Re: [ceph-users] Nfs-ganesha 2.6 packages in ceph repo

2018-05-11 Thread Oliver Freyermuth
Hi David,

Am 11.05.2018 um 16:55 schrieb David C:
> Hi Oliver
> 
> Thanks for the detailed response! I've downgraded my libcephfs2 to 12.2.4 and 
> still get a similar error:
> 
> load_fsal :NFS STARTUP :CRIT :Could not dlopen 
> module:/usr/lib64/ganesha/libfsalceph.so Error:/lib64/libcephfs.so.2: 
> undefined symbol: 
> _Z14common_preinitRK18CephInitParameters18code_environment_ti
> load_fsal :NFS STARTUP :MAJ :Failed to load module 
> (/usr/lib64/ganesha/libfsalceph.so) because: Can not access a needed shared 
> library
> 
> I'm on CentOS 7.4, using the following package versions:
> 
> # rpm -qa | grep ganesha
> nfs-ganesha-2.6.1-0.1.el7.x86_64
> nfs-ganesha-vfs-2.6.1-0.1.el7.x86_64
> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
> 
> # rpm -qa | grep ceph
> libcephfs2-12.2.4-0.el7.x86_64
> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64

Mhhhm - that sounds like a mess-up in the dependencies. 
The symbol you are missing should be provided by
librados2-12.2.4-0.el7.x86_64
which contains
/usr/lib64/ceph/ceph/libcephfs-common.so.0
Do you have a different version of librados2 installed? If so, I wonder how yum 
/ rpm allowed that ;-). 

Thinking again, it might also be (if you indeed have a different version there) 
that this is the cause also for the previous error. 
If the problematic symbol is indeed not exposed, but can be resolved only if 
both libraries (libcephfs-common and libcephfs) are loaded in unison with 
matching versions,
it might be that also 12.2.5 works fine... 

First thing, in any case, is to check which version of librados2 you are 
using ;-). 
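Something along these lines should make a mismatch visible (a sketch; the 
library paths may differ on your installation):
-
rpm -q librados2 libcephfs2 nfs-ganesha-ceph
# undefined symbols libcephfs expects from the common library
nm -D /lib64/libcephfs.so.2 | grep -i 'common_preinit\|FSMap'
# what the ganesha FSAL actually links against
ldd /usr/lib64/ganesha/libfsalceph.so
-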

Cheers,
Oliver

> 
> I don't have the ceph user space components installed, assuming they're not 
> necessary apart from libcephfs2? Any idea why it's giving me this error?
> 
> Thanks,
> 
> On Fri, May 11, 2018 at 2:17 AM, Oliver Freyermuth 
> <freyerm...@physik.uni-bonn.de <mailto:freyerm...@physik.uni-bonn.de>> wrote:
> 
> Hi David,
> 
> for what it's worth, we are running with nfs-ganesha 2.6.1 from Ceph 
> repos on CentOS 7.4 with the following set of versions:
> libcephfs2-12.2.4-0.el7.x86_64
> nfs-ganesha-2.6.1-0.1.el7.x86_64
> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
> Of course, we plan to upgrade to 12.2.5 soon-ish...
> 
> Am 11.05.2018 um 00:05 schrieb David C:
> > Hi All
> > 
> > I'm testing out the nfs-ganesha-2.6.1-0.1.el7.x86_64.rpm package from 
> http://download.ceph.com/nfs-ganesha/rpm-V2.6-stable/luminous/x86_64/ 
> <http://download.ceph.com/nfs-ganesha/rpm-V2.6-stable/luminous/x86_64/>
> > 
> > It's failing to load /usr/lib64/ganesha/libfsalceph.so
> > 
> > With libcephfs-12.2.1 installed I get the following error in my ganesha 
> log:
> > 
> >     load_fsal :NFS STARTUP :CRIT :Could not dlopen 
> module:/usr/lib64/ganesha/libfsalceph.so Error:
> >     /usr/lib64/ganesha/libfsalceph.so: undefined symbol: 
> ceph_set_deleg_timeout
> >     load_fsal :NFS STARTUP :MAJ :Failed to load module 
> (/usr/lib64/ganesha/libfsalceph.so) because
> >     : Can not access a needed shared library
> 
> That looks like an ABI incompatibility, probably the nfs-ganesha packages 
> should block this libcephfs2-version (and older ones).
> 
> > 
> > 
> > With libcephfs-12.2.5 installed I get:
> > 
> >     load_fsal :NFS STARTUP :CRIT :Could not dlopen 
> module:/usr/lib64/ganesha/libfsalceph.so Error:
> >     /lib64/libcephfs.so.2: undefined symbol: 
> _ZNK5FSMap10parse_roleEN5boost17basic_string_viewIcSt11char_traitsIcEEEP10mds_role_tRSo
> >     load_fsal :NFS STARTUP :MAJ :Failed to load module 
> (/usr/lib64/ganesha/libfsalceph.so) because
> >     : Can not access a needed shared library
> 
> That looks ugly and makes me fear for our planned 12.2.5-upgrade.
> Interestingly, we do not have that symbol on 12.2.4:
> # nm -D /lib64/libcephfs.so.2 | grep FSMap
>                  U _ZNK5FSMap10parse_roleERKSsP10mds_role_tRSo
>                  U _ZNK5FSMap13print_summaryEPN4ceph9FormatterEPSo
> and NFS-Ganesha works fine.
> 
> Looking at:
> https://github.com/ceph/ceph/blob/v12.2.4/src/mds/FSMap.h 
> <https://github.com/ceph/ceph/blob/v12.2.4/src/mds/FSMap.h>
> versus
> https://github.com/ceph/ceph/blob/v12.2.5/src/mds/FSMap.h 
> <https://github.com/ceph/ceph/blob/v12.2.5/src/mds/FSMap.h>
> it seems this commit:
> 
> https://github.com/ceph/ceph/commit/7d8b3c1082b6b870710989773f3cd98a472b9a3d 
> <https://github.com/ceph/ceph/commit/7d8b3c1082b6b870710989773f3cd98a472b9a3d>
> changed libcephfs2 ABI.
> 
> I've no idea h
