[ceph-users] Multiple corrupt bluestore osds, Host Machine attacks VM OSDs

2019-07-25 Thread Daniel Williams
Hey,

I have a machine with 5 drives in a VM and 5 drives that were on the same
host machine. I've made this mistake once before: running ceph-volume
activate --all on the host machine's drives takes over the 5 drives in the VM
as well and corrupts them.

I've actually lost data this time. The pool is erasure coded 6+3, but after
losing 5 drives I lost a small number of PGs (6). Repair gives this message:

$ ceph-bluestore-tool repair --deep true --path /var/lib/ceph/osd/ceph-0

/build/ceph-13.2.6/src/os/bluestore/BlueFS.cc: In function 'int
BlueFS::_replay(bool, bool)' thread 7f21c3c6d980 time 2019-07-25
23:19:44.820537
/build/ceph-13.2.6/src/os/bluestore/BlueFS.cc: 848: FAILED assert(r !=
q->second->file_map.end())
 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x14e) [0x7f21c56c4b5e]
 2: (()+0x2c4cb7) [0x7f21c56c4cb7]
 3: (BlueFS::_replay(bool, bool)+0x4082) [0x56432ef954a2]
 4: (BlueFS::mount()+0xff) [0x56432ef958ef]
 5: (BlueStore::_open_db(bool, bool)+0x81c) [0x56432eff1a1c]
 6: (BlueStore::_fsck(bool, bool)+0x337) [0x56432f00e0a7]
 7: (main()+0xf0a) [0x56432eea7dca]
 8: (__libc_start_main()+0xeb) [0x7f21c4b7109b]
 9: (_start()+0x2a) [0x56432ef700fa]
*** Caught signal (Aborted) **
 in thread 7f21c3c6d980 thread_name:ceph-bluestore-
2019-07-25 23:19:44.817 7f21c3c6d980 -1
/build/ceph-13.2.6/src/os/bluestore/BlueFS.cc: In function 'int
BlueFS::_replay(bool, bool)' thread 7f21c3c6d980 time 2019-07-25
23:19:44.820537
/build/ceph-13.2.6/src/os/bluestore/BlueFS.cc: 848: FAILED assert(r !=
q->second->file_map.end())

I have two OSDs that don't start but at least make it further into the
repair:
ceph-bluestore-tool repair --deep true --path /var/lib/ceph/osd/ceph-8
2019-07-25 21:59:17.314 7f1a03dfb980 -1 bluestore(/var/lib/ceph/osd/ceph-8)
_verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xf139a661,
expected 0x9344f85e, device location [0x1~1000], logical extent
0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2019-07-25 21:59:17.314 7f1a03dfb980 -1 bluestore(/var/lib/ceph/osd/ceph-8)
_verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xf139a661,
expected 0x9344f85e, device location [0x1~1000], logical extent
0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2019-07-25 21:59:17.314 7f1a03dfb980 -1 bluestore(/var/lib/ceph/osd/ceph-8)
_verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xf139a661,
expected 0x9344f85e, device location [0x1~1000], logical extent
0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2019-07-25 21:59:17.314 7f1a03dfb980 -1 bluestore(/var/lib/ceph/osd/ceph-8)
_verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xf139a661,
expected 0x9344f85e, device location [0x1~1000], logical extent
0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2019-07-25 21:59:17.314 7f1a03dfb980 -1 bluestore(/var/lib/ceph/osd/ceph-8)
fsck error: #-1:7b3f43c4:::osd_superblock:0# error during read:  0~21a (5)
Input/output error
... still running 

I've read through the archives and, unlike others who have come across this,
I'm not able to recover the content without the lost OSDs.

These PGs are backing a cephfs instance, so ideally:
1. I'd be able to recover the 6 missing PGs from 3 of the 5 OSDs in a broken
state...
or, less desirable,
2. Figure out how to map the lost PGs to the cephfs files they backed, so
that I can figure out what's lost and what remains.
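
For option 2, I am guessing something along these lines could work (the data
pool name "cephfs_data" and the mount point are placeholders): walk the
filesystem, turn each file's inode into the name of its first data object,
and ask ceph which PG that object maps to. It only looks at the first object
of each file, so it is approximate for files larger than one object:

cd /mnt/cephfs
find . -type f | while read -r f; do
    ino=$(stat -c %i "$f")                # inode number of the file
    obj=$(printf '%x.00000000' "$ino")    # name of the file's first data object
    echo "$f -> $(ceph osd map cephfs_data "$obj")"
done

Then grep that output for the 6 missing PG ids.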
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New best practices for osds???

2019-07-25 Thread Anthony D'Atri
> We run a few hundred HDD OSDs for our backup cluster; we set one RAID 0 per HDD 
> in order to be able
> to use -battery protected- write cache from the RAID controller. It really 
> improves performance, for both
> bluestore and filestore OSDs.

Having run something like 6000 HDD-based FileStore OSDs with colo journals on 
RAID HBAs I’d like to offer some contrasting thoughts.

TL;DR:  Never again!  False economy.  ymmv.

Details:

* The implementation predated me and was carved in dogfood^H^H^H^H^H^H^Hstone; 
try as I might, I could not get it fixed.

* Single-drive RAID0 VDs were created to expose the underlying drives to the 
OS.  When the architecture was conceived, the HBAs in question didn’t have 
JBOD/passthrough, though a firmware update shortly thereafter did bring that 
ability.  That caching was a function of VDs wasn’t known at the time.

* My sense was that the FBWC did offer some throughput benefit for at least 
some workloads, but at the cost of latency.

* Using a RAID-capable HBA in IR mode with FBWC meant having to monitor for the 
presence and status of the BBU/supercap

* The utility needed for that monitoring, when invoked with ostensibly 
innocuous parameters, would lock up the HBA for several seconds.

* Traditional BBUs are rated for a lifespan of *only* one year.  FBWCs maybe for 
… three?  Significant cost to RMA or replace them:  time and karma wasted 
fighting with the system vendor CSO, engineer and remote hands time to take the 
system down and swap.  And then the connectors for the supercap were touchy; 
15% of the time the system would come up and not see it at all.

* The RAID-capable HBA itself + FBWC + supercap cost …. a couple three hundred 
more than an IT / JBOD equivalent

* There was a little-known flaw in secondary firmware that caused FBWC / 
supercap modules to be falsely reported bad.  The system vendor acted like I 
was making this up and washed their hands of it, even when I provided them the 
HBA vendors’ artifacts and documents.

* There were two design flaws that could and did result in cache data loss when 
a system rebooted or lost power.  There was a field notice for this, which 
required harvesting serial numbers and checking each.  The affected range of 
serials was quite a bit larger than what the validation tool admitted.  I had 
to manage the replacement of 302+ of these in production use, each needing 
engineer time to manage Ceph, to do the hands-on work, and hassle with RMA 
paperwork.

* There was a firmware / utility design flaw that caused the HDD’s onboard 
volatile write cache to be silently turned on, despite an HBA config dump 
showing a setting that should have left it off.  Again data was lost when a 
node crashed hard or lost power.

* There was another firmware flaw that prevented booting if there was pinned / 
preserved cache data after a reboot / power loss if a drive failed or was 
yanked.  The HBA’s option ROM utility would block booting and wait for input on 
the console.  One could get in and tell it to discard that cache, but it would 
not actually do so, instead looping back to the same screen.  The only way to 
get the system to boot again was to replace and RMA the HBA.

* The VD layer lessened the usefulness of iostat data.  It also complicated OSD 
deployment / removal / replacement.  A smartctl hack to access SMART attributes 
below the VD layer would work on some systems but not others.

* The HBA model in question would work normally with a certain CPU generation, 
but not with slightly newer servers with the next CPU generation.  They would 
randomly, on roughly one boot out of five, negotiate PCIe gen3 which they 
weren’t capable of handling properly, and would silently run at about 20% of 
normal speed.  Granted this isn’t necessarily specific to an IR HBA.



Add it all up, and my assertion is that the money, time, karma, and user impact 
you save from NOT dealing with a RAID HBA *more than pays for* using SSDs for 
OSDs instead.




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrading and lost OSDs

2019-07-25 Thread Bob R
I would try 'mv /etc/ceph/osd{,.old}' then run 'ceph-volume  simple scan'
again. We had some problems upgrading due to OSDs (perhaps initially
installed as firefly?) missing the 'type' attribute and iirc the
'ceph-volume simple scan' command refused to overwrite existing json files
after I made some changes to ceph-volume.

Bob

On Wed, Jul 24, 2019 at 1:24 PM Alfredo Deza  wrote:

>
>
> On Wed, Jul 24, 2019 at 4:15 PM Peter Eisch 
> wrote:
>
>> Hi,
>>
>>
>>
>> I appreciate the insistence that the directions be followed.  I wholly
>> agree.  The only liberty I took was to do a ‘yum update’ instead of just
>> ‘yum update ceph-osd’ and then reboot.  (Also my MDS runs on the MON hosts,
>> so it got updated a step early.)
>>
>>
>>
>> As for the logs:
>>
>>
>>
>> [2019-07-24 15:07:22,713][ceph_volume.main][INFO  ] Running command:
>> ceph-volume  simple scan
>>
>> [2019-07-24 15:07:22,714][ceph_volume.process][INFO  ] Running command:
>> /bin/systemctl show --no-pager --property=Id --state=running ceph-osd@*
>>
>> [2019-07-24 15:07:27,574][ceph_volume.main][INFO  ] Running command:
>> ceph-volume  simple activate --all
>>
>> [2019-07-24 15:07:27,575][ceph_volume.devices.simple.activate][INFO  ]
>> activating OSD specified in
>> /etc/ceph/osd/0-93fb5f2f-0273-4c87-a718-886d7e6db983.json
>>
>> [2019-07-24 15:07:27,576][ceph_volume.devices.simple.activate][ERROR ]
>> Required devices (block and data) not present for bluestore
>>
>> [2019-07-24 15:07:27,576][ceph_volume.devices.simple.activate][ERROR ]
>> bluestore devices found: [u'data']
>>
>> [2019-07-24 15:07:27,576][ceph_volume][ERROR ] exception caught by
>> decorator
>>
>> Traceback (most recent call last):
>>
>>   File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line
>> 59, in newfunc
>>
>> return f(*a, **kw)
>>
>>   File "/usr/lib/python2.7/site-packages/ceph_volume/main.py", line 148,
>> in main
>>
>> terminal.dispatch(self.mapper, subcommand_args)
>>
>>   File "/usr/lib/python2.7/site-packages/ceph_volume/terminal.py", line
>> 182, in dispatch
>>
>> instance.main()
>>
>>   File
>> "/usr/lib/python2.7/site-packages/ceph_volume/devices/simple/main.py", line
>> 33, in main
>>
>> terminal.dispatch(self.mapper, self.argv)
>>
>>   File "/usr/lib/python2.7/site-packages/ceph_volume/terminal.py", line
>> 182, in dispatch
>>
>> instance.main()
>>
>>   File
>> "/usr/lib/python2.7/site-packages/ceph_volume/devices/simple/activate.py",
>> line 272, in main
>>
>> self.activate(args)
>>
>>   File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line
>> 16, in is_root
>>
>> return func(*a, **kw)
>>
>>   File
>> "/usr/lib/python2.7/site-packages/ceph_volume/devices/simple/activate.py",
>> line 131, in activate
>>
>> self.validate_devices(osd_metadata)
>>
>>   File
>> "/usr/lib/python2.7/site-packages/ceph_volume/devices/simple/activate.py",
>> line 62, in validate_devices
>>
>> raise RuntimeError('Unable to activate bluestore OSD due to missing
>> devices')
>>
>> RuntimeError: Unable to activate bluestore OSD due to missing devices
>>
>>
>>
>> (this is repeated for each of the 16 drives)
>>
>>
>>
>> Any other thoughts?  (I’ll delete/create the OSDs with ceph-deploy
>> otherwise.)
>>
>
> Try using `ceph-volume simple scan --stdout` so that it doesn't persist
> data onto /etc/ceph/osd/ and inspect that the JSON produced is capturing
> all the necessary details for OSDs.
>
> Alternatively, I would look into the JSON files already produced in
> /etc/ceph/osd/ and check if the details are correct. The `scan` sub-command
> does a tremendous effort to cover all cases where ceph-disk
> created an OSD (filestore, bluestore, dmcrypt, etc...) but it is possible
> that it may be hitting a problem. This is why the tool made these JSON
> files available, so that they could be inspected and corrected if anything.
>
> The details of the scan sub-command can be found at
> http://docs.ceph.com/docs/master/ceph-volume/simple/scan/ and the JSON
> structure is described in detail below at
> http://docs.ceph.com/docs/master/ceph-volume/simple/scan/#json-contents
>
> In this particular case the tool is refusing to activate what seems to be
> a bluestore OSD. Is it really a bluestore OSD? If so, then it can't find
> where the data partition is. What does that partition look like (for any of
> the failing OSDs)? Does it use dmcrypt? How was it created? (hopefully
> with ceph-disk!)
>
> If you know the data partition for a given OSD, try and pass it onto
> 'scan'. For example if it is /dev/sda1 you could do `ceph-volume simple
> scan /dev/sda1` and check its output.
>
>
>
>>
>> peter
>>
>>
>>
>>
>> Peter Eisch
>> Senior Site Reliability Engineer
>> T 1.612.659.3228

Re: [ceph-users] Future of Filestore?

2019-07-25 Thread Stuart Longland
On 25/7/19 9:32 pm, Виталий Филиппов wrote:
> Hi again,
> 
> I reread your initial email - do you also run a nanoceph on some SBCs
> each having one 2.5" 5400rpm HDD plugged into it? What SBCs do you use? :-)

I presently have a 5-node Ceph cluster:

- 3× Supermicro A1SAi-2750F with 1 120GB 2.5" SSD for boot (and
originally, journal), and a 2.5" 2TB HDD (WD20SPZX-00U).  One has 32GB
RAM (it was a former compute node), the others have 16GB.
- 2× Intel NUC Core i5 with 1 120GB M.2 SSD for boot and a 2.5" 2TB HDD
(WD20SPZX-00U).  Both with 8GB RAM.

For compute (KVM) I have one Supermicro A1SAi-2750F and one
A2SDi-16C-HLN4F, both with 32GB RAM.

https://hackaday.io/project/10529-solar-powered-cloud-computing
-- 
Stuart Longland (aka Redhatter, VK4MSL)

I haven't lost my mind...
  ...it's backed up on a tape somewhere.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Disarmed] Re: ceph-ansible firewalld blocking ceph comms

2019-07-25 Thread DHilsbos
Nathan;

I'm not an expert on firewalld, but shouldn't you have a list of open ports?

 ports: ?

Here's the configuration on my test cluster:
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: bond0
  sources:
  services: ssh dhcpv6-client
  ports: 6789/tcp 3300/tcp 6800-7300/tcp 8443/tcp
  protocols:
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:
trusted (active)
  target: ACCEPT
  icmp-block-inversion: no
  interfaces: bond1
  sources:
  services:
  ports: 6789/tcp 3300/tcp 6800-7300/tcp 8443/tcp
  protocols:
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:

I use interfaces as selectors, but would think source selectors would work the 
same.

You might start by adding the MON ports to the firewall on the MONs:
firewall-cmd --zone=public --add-port=6789/tcp --permanent
firewall-cmd --zone=public --add-port=3300/tcp --permanent
firewall-cmd --reload
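
If your firewalld build ships the predefined Ceph service definitions (check
with 'firewall-cmd --get-services'), adding the services instead should cover
the same ports:
firewall-cmd --zone=public --add-service=ceph-mon --permanent   # MON ports (6789, plus 3300 in newer definitions)
firewall-cmd --zone=public --add-service=ceph --permanent       # OSD/MDS/MGR ports 6800-7300/tcp
firewall-cmd --reload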

Thank you,

Dominic L. Hilsbos, MBA 
Director – Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Nathan 
Harper
Sent: Thursday, July 25, 2019 2:08 PM
To: ceph-us...@ceph.com
Subject: [Disarmed] Re: [ceph-users] ceph-ansible firewalld blocking ceph comms

This is a new issue to us; we did not have the same problem running the same 
activity on our test system. 
Regards,
Nathan

On 25 Jul 2019, at 22:00, solarflow99  wrote:
I used ceph-ansible just fine, never had this problem.  

On Thu, Jul 25, 2019 at 1:31 PM Nathan Harper  wrote:
Hi all,

We've run into a strange issue with one of our clusters managed with 
ceph-ansible.   We're adding some RGW nodes to our cluster, and so re-ran 
site.yml against the cluster.  The new RGWs added successfully, but

When we did, we started to get slow requests, effectively across the whole 
cluster.   Quickly we realised that the firewall was now (apparently) blocking 
Ceph communications.   I say apparently, because the config looks correct:

[root@osdsrv05 ~]# firewall-cmd --list-all
public (active)
  target: default
  icmp-block-inversion: no
  interfaces:
  sources: 172.20.22.0/24 172.20.23.0/24
  services: ssh dhcpv6-client ceph
  ports:
  protocols:
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:

If we drop the firewall everything goes back healthy.   All the clients 
(Openstack cinder) are on the 172.20.22.0 network (172.20.23.0 is the 
replication network).  Has anyone seen this?
-- 
Nathan Harper // IT Systems Lead

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-ansible firewalld blocking ceph comms

2019-07-25 Thread Nathan Harper
This is a new issue to us; we did not have the same problem running the same 
activity on our test system. 

Regards,
Nathan

> On 25 Jul 2019, at 22:00, solarflow99  wrote:
> 
> I used ceph-ansible just fine, never had this problem.  
> 
>> On Thu, Jul 25, 2019 at 1:31 PM Nathan Harper  
>> wrote:
>> Hi all,
>> 
>> We've run into a strange issue with one of our clusters managed with 
>> ceph-ansible.   We're adding some RGW nodes to our cluster, and so re-ran 
>> site.yml against the cluster.  The new RGWs added successfully, but
>> 
>> When we did, we started to get slow requests, effectively across the whole 
>> cluster.   Quickly we realised that the firewall was now (apparently) 
>> blocking Ceph communications.   I say apparently, because the config looks 
>> correct:
>> 
>>> [root@osdsrv05 ~]# firewall-cmd --list-all
>>> public (active)
>>>   target: default
>>>   icmp-block-inversion: no
>>>   interfaces:
>>>   sources: 172.20.22.0/24 172.20.23.0/24
>>>   services: ssh dhcpv6-client ceph
>>>   ports:
>>>   protocols:
>>>   masquerade: no
>>>   forward-ports:
>>>   source-ports:
>>>   icmp-blocks:
>>>   rich rules:
>> 
>> If we drop the firewall everything goes back healthy.   All the clients 
>> (Openstack cinder) are on the 172.20.22.0 network (172.20.23.0 is the 
>> replication network).  Has anyone seen this?
>> -- 
>> Nathan Harper // IT Systems Lead
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-07-25 Thread Janek Bevendorff
I am not sure if making caps recall more aggressive helps. It seems to be the
client failing to respond to it (at least that's what the warnings say). But I
will try your new suggested settings as soon as I get the chance and will
report back with the results.

On 25 Jul 2019 11:00 pm, Patrick Donnelly  wrote:
On Thu, Jul 25, 2019 at 12:49 PM Janek Bevendorff
 wrote:
>
>
> > Based on that message, it would appear you still have an inode limit
> > in place ("mds_cache_size"). Please unset that config option. Your
> > mds_cache_memory_limit is apparently ~19GB.
>
> No, I do not have an inode limit set. Only the memory limit.
>
>
> > There is another limit mds_max_caps_per_client (default 1M) which the
> > client is hitting. That's why the MDS is recalling caps from the
> > client and not because any cache memory limit is hit. It is not
> > recommended you increase this.
> Okay, this setting isn't documented either and I did not change it,
> but it's also quite clear that it isn't working. My MDS hasn't crashed
> yet (without the recall settings it would have), but ceph fs status is
> reporting 14M inodes at this point and the number is slowly going up.

Can you share two captures of `ceph daemon mds.X perf dump` about 1
second apart.

You can also try increasing the aggressiveness of the MDS recall but
I'm surprised it's still a problem with the settings I gave you:

ceph config set mds mds_recall_max_caps 15000
ceph config set mds mds_recall_max_decay_rate 0.75

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-07-25 Thread Patrick Donnelly
On Thu, Jul 25, 2019 at 12:49 PM Janek Bevendorff
 wrote:
>
>
> > Based on that message, it would appear you still have an inode limit
> > in place ("mds_cache_size"). Please unset that config option. Your
> > mds_cache_memory_limit is apparently ~19GB.
>
> No, I do not have an inode limit set. Only the memory limit.
>
>
> > There is another limit mds_max_caps_per_client (default 1M) which the
> > client is hitting. That's why the MDS is recalling caps from the
> > client and not because any cache memory limit is hit. It is not
> > recommended you increase this.
> Okay, this setting isn't documented either and I did not change it,
> but it's also quite clear that it isn't working. My MDS hasn't crashed
> yet (without the recall settings it would have), but ceph fs status is
> reporting 14M inodes at this point and the number is slowly going up.

Can you share two captures of `ceph daemon mds.X perf dump` about 1
second apart.
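
For example, on the active MDS host (substitute your MDS name for mds.X):
ceph daemon mds.X perf dump > perf.1.json
sleep 1
ceph daemon mds.X perf dump > perf.2.json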

You can also try increasing the aggressiveness of the MDS recall but
I'm surprised it's still a problem with the settings I gave you:

ceph config set mds mds_recall_max_caps 15000
ceph config set mds mds_recall_max_decay_rate 0.75

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-ansible firewalld blocking ceph comms

2019-07-25 Thread solarflow99
I used ceph-ansible just fine, never had this problem.

On Thu, Jul 25, 2019 at 1:31 PM Nathan Harper 
wrote:

> Hi all,
>
> We've run into a strange issue with one of our clusters managed with
> ceph-ansible.   We're adding some RGW nodes to our cluster, and so re-ran
> site.yml against the cluster.  The new RGWs added successfully, but
>
> When we did, we started to get slow requests, effectively across the whole
> cluster.   Quickly we realised that the firewall was now (apparently)
> blocking Ceph communications.   I say apparently, because the config looks
> correct:
>
> [root@osdsrv05 ~]# firewall-cmd --list-all
>> public (active)
>>   target: default
>>   icmp-block-inversion: no
>>   interfaces:
>>   sources: 172.20.22.0/24 172.20.23.0/24
>>   services: ssh dhcpv6-client ceph
>>   ports:
>>   protocols:
>>   masquerade: no
>>   forward-ports:
>>   source-ports:
>>   icmp-blocks:
>>   rich rules:
>>
>
> If we drop the firewall everything goes back healthy.   All the clients
> (Openstack cinder) are on the 172.20.22.0 network (172.20.23.0 is the
> replication network).  Has anyone seen this?
> --
> *Nathan Harper* // IT Systems Lead
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-ansible firewalld blocking ceph comms

2019-07-25 Thread Nathan Harper
Hi all,

We've run into a strange issue with one of our clusters managed with
ceph-ansible.   We're adding some RGW nodes to our cluster, and so re-ran
site.yml against the cluster.  The new RGWs added successfully, but

When we did, we started to get slow requests, effectively across the whole
cluster.   Quickly we realised that the firewall was now (apparently)
blocking Ceph communications.   I say apparently, because the config looks
correct:

[root@osdsrv05 ~]# firewall-cmd --list-all
> public (active)
>   target: default
>   icmp-block-inversion: no
>   interfaces:
>   sources: 172.20.22.0/24 172.20.23.0/24
>   services: ssh dhcpv6-client ceph
>   ports:
>   protocols:
>   masquerade: no
>   forward-ports:
>   source-ports:
>   icmp-blocks:
>   rich rules:
>

If we drop the firewall everything goes back healthy.   All the clients
(Openstack cinder) are on the 172.20.22.0 network (172.20.23.0 is the
replication network).  Has anyone seen this?
-- 
*Nathan Harper* // IT Systems Lead
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-07-25 Thread Janek Bevendorff


> Based on that message, it would appear you still have an inode limit
> in place ("mds_cache_size"). Please unset that config option. Your
> mds_cache_memory_limit is apparently ~19GB.

No, I do not have an inode limit set. Only the memory limit.


> There is another limit mds_max_caps_per_client (default 1M) which the
> client is hitting. That's why the MDS is recalling caps from the
> client and not because any cache memory limit is hit. It is not
> recommended you increase this.
Okay, this setting isn't documented either and I did not change it,
but it's also quite clear that it isn't working. My MDS hasn't crashed
yet (without the recall settings it would have), but ceph fs status is
reporting 14M inodes at this point and the number is slowly going up.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large OMAP Objects in zone.rgw.log pool

2019-07-25 Thread Brett Chancellor
14.2.1
Thanks, I'll try that.

On Thu, Jul 25, 2019 at 2:54 PM Casey Bodley  wrote:

> What ceph version is this cluster running? Luminous or later should not
> be writing any new meta.log entries when it detects a single-zone
> configuration.
>
> I'd recommend editing your zonegroup configuration (via 'radosgw-admin
> zonegroup get' and 'put') to set both log_meta and log_data to false,
> then commit the change with 'radosgw-admin period update --commit'.
>
> You can then delete any meta.log.* and data_log.* objects from your log
> pool using the rados tool.
>
> On 7/25/19 2:30 PM, Brett Chancellor wrote:
> > Casey,
> >   These clusters were setup with the intention of one day doing multi
> > site replication. That has never happened. The cluster has a single
> > realm, which contains a single zonegroup, and that zonegroup contains
> > a single zone.
> >
> > -Brett
> >
> > On Thu, Jul 25, 2019 at 2:16 PM Casey Bodley  > > wrote:
> >
> > Hi Brett,
> >
> > These meta.log objects store the replication logs for metadata
> > sync in
> > multisite. Log entries are trimmed automatically once all other zones
> > have processed them. Can you verify that all zones in the multisite
> > configuration are reachable and syncing? Does 'radosgw-admin sync
> > status' on any zone show that it's stuck behind on metadata sync?
> > That
> > would prevent these logs from being trimmed and result in these large
> > omap warnings.
> >
> > On 7/25/19 1:59 PM, Brett Chancellor wrote:
> > > I'm having an issue similar to
> > >
> >
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-March/033611.html
>  .
> >
> > > I don't see where any solution was proposed.
> > >
> > > $ ceph health detail
> > > HEALTH_WARN 1 large omap objects
> > > LARGE_OMAP_OBJECTS 1 large omap objects
> > > 1 large objects found in pool 'us-prd-1.rgw.log'
> > > Search the cluster log for 'Large omap object found' for
> > more details.
> > >
> > > $ grep "Large omap object" /var/log/ceph/ceph.log
> > > 2019-07-25 14:58:21.758321 osd.3 (osd.3) 15 : cluster [WRN]
> > Large omap
> > > object found. Object:
> > > 51:61eb35fe:::meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19:head
> > > Key count: 3382154 Size (bytes): 611384043
> > >
> > > $ rados -p us-prd-1.rgw.log listomapkeys
> > > meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19 |wc -l
> > > 3382154
> > >
> > > $ rados -p us-prd-1.rgw.log listomapvals
> > > meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19
> > > This returns entries from almost every bucket, across multiple
> > > tenants. Several of the entries are from buckets that no longer
> > exist
> > > on the system.
> > >
> > > $ ceph df |egrep 'OBJECTS|.rgw.log'
> > > POOL              ID    STORED     OBJECTS    USED       %USED    MAX AVAIL
> > > us-prd-1.rgw.log  51    758 MiB    228        758 MiB    0        102 TiB
> > >
> > > Thanks,
> > >
> > > -Brett
> > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com 
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large OMAP Objects in zone.rgw.log pool

2019-07-25 Thread Casey Bodley
What ceph version is this cluster running? Luminous or later should not 
be writing any new meta.log entries when it detects a single-zone 
configuration.


I'd recommend editing your zonegroup configuration (via 'radosgw-admin 
zonegroup get' and 'put') to set both log_meta and log_data to false, 
then commit the change with 'radosgw-admin period update --commit'.


You can then delete any meta.log.* and data_log.* objects from your log 
pool using the rados tool.
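
Roughly, something like this (zonegroup name "default" is assumed, check with
'radosgw-admin zonegroup list'; the log pool name is taken from your output,
and note that in some releases log_meta/log_data are quoted strings in the
JSON):

radosgw-admin zonegroup get --rgw-zonegroup=default > zonegroup.json
# edit zonegroup.json: set log_meta and log_data to false
radosgw-admin zonegroup set < zonegroup.json
radosgw-admin period update --commit
rados -p us-prd-1.rgw.log ls | grep -E '^(meta\.log|data_log)\.' | \
    while read -r obj; do rados -p us-prd-1.rgw.log rm "$obj"; done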


On 7/25/19 2:30 PM, Brett Chancellor wrote:

Casey,
  These clusters were setup with the intention of one day doing multi 
site replication. That has never happened. The cluster has a single 
realm, which contains a single zonegroup, and that zonegroup contains 
a single zone.


-Brett

On Thu, Jul 25, 2019 at 2:16 PM Casey Bodley > wrote:


Hi Brett,

These meta.log objects store the replication logs for metadata
sync in
multisite. Log entries are trimmed automatically once all other zones
have processed them. Can you verify that all zones in the multisite
configuration are reachable and syncing? Does 'radosgw-admin sync
status' on any zone show that it's stuck behind on metadata sync?
That
would prevent these logs from being trimmed and result in these large
omap warnings.

On 7/25/19 1:59 PM, Brett Chancellor wrote:
> I'm having an issue similar to
>
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-March/033611.html .

> I don't see where any solution was proposed.
>
> $ ceph health detail
> HEALTH_WARN 1 large omap objects
> LARGE_OMAP_OBJECTS 1 large omap objects
>     1 large objects found in pool 'us-prd-1.rgw.log'
>     Search the cluster log for 'Large omap object found' for
more details.
>
> $ grep "Large omap object" /var/log/ceph/ceph.log
> 2019-07-25 14:58:21.758321 osd.3 (osd.3) 15 : cluster [WRN]
Large omap
> object found. Object:
> 51:61eb35fe:::meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19:head
> Key count: 3382154 Size (bytes): 611384043
>
> $ rados -p us-prd-1.rgw.log listomapkeys
> meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19 |wc -l
> 3382154
>
> $ rados -p us-prd-1.rgw.log listomapvals
> meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19
> This returns entries from almost every bucket, across multiple
> tenants. Several of the entries are from buckets that no longer
exist
> on the system.
>
> $ ceph df |egrep 'OBJECTS|.rgw.log'
>     POOL              ID    STORED     OBJECTS    USED       %USED    MAX AVAIL
>     us-prd-1.rgw.log  51    758 MiB    228        758 MiB    0        102 TiB
>
> Thanks,
>
> -Brett
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-07-25 Thread Patrick Donnelly
On Thu, Jul 25, 2019 at 3:08 AM Janek Bevendorff
 wrote:
>
> The rsync job has been copying quite happily for two hours now. The good
> news is that the cache size isn't increasing unboundedly with each
> request anymore. The bad news is that it still is increasing afterall,
> though much slower. I am at 3M inodes now and it started off with 900k,
> settling at 1M initially. I had a peak just now of 3.7M, but it went
> back down to 3.2M shortly after that.
>
> According to the health status, the client has started failing to
> respond to cache pressure, so it's still not working as reliably as I
> would like it to. I am also getting this very peculiar message:
>
> MDS cache is too large (7GB/19GB); 52686 inodes in use by clients
>
> I guess the 53k inodes is the number that is actively in use right now
> (compared to the 3M for which the client generally holds caps). Is that
> so? Cache memory is still well within bounds, however. Perhaps the
> message is triggered by the recall settings and just a bit misleading?

Based on that message, it would appear you still have an inode limit
in place ("mds_cache_size"). Please unset that config option. Your
mds_cache_memory_limit is apparently ~19GB.

There is another limit mds_max_caps_per_client (default 1M) which the
client is hitting. That's why the MDS is recalling caps from the
client and not because any cache memory limit is hit. It is not
recommended you increase this.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large OMAP Objects in zone.rgw.log pool

2019-07-25 Thread Brett Chancellor
Casey,
  These clusters were setup with the intention of one day doing multi site
replication. That has never happened. The cluster has a single realm, which
contains a single zonegroup, and that zonegroup contains a single zone.

-Brett

On Thu, Jul 25, 2019 at 2:16 PM Casey Bodley  wrote:

> Hi Brett,
>
> These meta.log objects store the replication logs for metadata sync in
> multisite. Log entries are trimmed automatically once all other zones
> have processed them. Can you verify that all zones in the multisite
> configuration are reachable and syncing? Does 'radosgw-admin sync
> status' on any zone show that it's stuck behind on metadata sync? That
> would prevent these logs from being trimmed and result in these large
> omap warnings.
>
> On 7/25/19 1:59 PM, Brett Chancellor wrote:
> > I'm having an issue similar to
> >
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-March/033611.html .
>
> > I don't see where any solution was proposed.
> >
> > $ ceph health detail
> > HEALTH_WARN 1 large omap objects
> > LARGE_OMAP_OBJECTS 1 large omap objects
> > 1 large objects found in pool 'us-prd-1.rgw.log'
> > Search the cluster log for 'Large omap object found' for more
> details.
> >
> > $ grep "Large omap object" /var/log/ceph/ceph.log
> > 2019-07-25 14:58:21.758321 osd.3 (osd.3) 15 : cluster [WRN] Large omap
> > object found. Object:
> > 51:61eb35fe:::meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19:head
> > Key count: 3382154 Size (bytes): 611384043
> >
> > $ rados -p us-prd-1.rgw.log listomapkeys
> > meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19 |wc -l
> > 3382154
> >
> > $ rados -p us-prd-1.rgw.log listomapvals
> > meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19
> > This returns entries from almost every bucket, across multiple
> > tenants. Several of the entries are from buckets that no longer exist
> > on the system.
> >
> > $ ceph df |egrep 'OBJECTS|.rgw.log'
> > POOL              ID    STORED     OBJECTS    USED       %USED    MAX AVAIL
> > us-prd-1.rgw.log  51    758 MiB    228        758 MiB    0        102 TiB
> >
> > Thanks,
> >
> > -Brett
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large OMAP Objects in zone.rgw.log pool

2019-07-25 Thread Casey Bodley

Hi Brett,

These meta.log objects store the replication logs for metadata sync in 
multisite. Log entries are trimmed automatically once all other zones 
have processed them. Can you verify that all zones in the multisite 
configuration are reachable and syncing? Does 'radosgw-admin sync 
status' on any zone show that it's stuck behind on metadata sync? That 
would prevent these logs from being trimmed and result in these large 
omap warnings.


On 7/25/19 1:59 PM, Brett Chancellor wrote:
I'm having an issue similar to 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-March/033611.html . 
I don't see where any solution was proposed.


$ ceph health detail
HEALTH_WARN 1 large omap objects
LARGE_OMAP_OBJECTS 1 large omap objects
    1 large objects found in pool 'us-prd-1.rgw.log'
    Search the cluster log for 'Large omap object found' for more details.

$ grep "Large omap object" /var/log/ceph/ceph.log
2019-07-25 14:58:21.758321 osd.3 (osd.3) 15 : cluster [WRN] Large omap 
object found. Object: 
51:61eb35fe:::meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19:head 
Key count: 3382154 Size (bytes): 611384043


$ rados -p us-prd-1.rgw.log listomapkeys 
meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19 |wc -l

3382154

$ rados -p us-prd-1.rgw.log listomapvals 
meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19
This returns entries from almost every bucket, across multiple 
tenants. Several of the entries are from buckets that no longer exist 
on the system.


$ ceph df |egrep 'OBJECTS|.rgw.log'
    POOL              ID    STORED     OBJECTS    USED       %USED    MAX AVAIL
    us-prd-1.rgw.log  51    758 MiB    228        758 MiB    0        102 TiB


Thanks,

-Brett

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to power off a cephfs cluster cleanly

2019-07-25 Thread Patrick Donnelly
On Thu, Jul 25, 2019 at 7:48 AM Dan van der Ster  wrote:
>
> Hi all,
>
> In September we'll need to power down a CephFS cluster (currently
> mimic) for a several-hour electrical intervention.
>
> Having never done this before, I thought I'd check with the list.
> Here's our planned procedure:
>
> 1. umount /cephfs from all hpc clients.
> 2. ceph osd set noout
> 3. wait until there is zero IO on the cluster
> 4. stop all mds's (active + standby)

You can also use `ceph fs set <fs_name> down true` which will flush all
metadata/journals, evict any lingering clients, and leave the file
system down until manually brought back up even if there are standby
MDSs available.
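
For example, with a file system named "cephfs" (name assumed):
ceph fs set cephfs down true     # before the power-down; flushes journals and evicts clients
ceph fs set cephfs down false    # after power-up, to bring the file system back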

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Large OMAP Objects in zone.rgw.log pool

2019-07-25 Thread Brett Chancellor
I'm having an issue similar to
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-March/033611.html .
I don't see where any solution was proposed.

$ ceph health detail
HEALTH_WARN 1 large omap objects
LARGE_OMAP_OBJECTS 1 large omap objects
1 large objects found in pool 'us-prd-1.rgw.log'
Search the cluster log for 'Large omap object found' for more details.

$ grep "Large omap object" /var/log/ceph/ceph.log
2019-07-25 14:58:21.758321 osd.3 (osd.3) 15 : cluster [WRN] Large omap
object found. Object:
51:61eb35fe:::meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19:head Key
count: 3382154 Size (bytes): 611384043

$ rados -p us-prd-1.rgw.log listomapkeys
meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19 |wc -l
3382154

$ rados -p us-prd-1.rgw.log listomapvals
meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19
This returns entries from almost every bucket, across multiple tenants.
Several of the entries are from buckets that no longer exist on the system.

$ ceph df |egrep 'OBJECTS|.rgw.log'
POOL              ID    STORED     OBJECTS    USED       %USED    MAX AVAIL
us-prd-1.rgw.log  51    758 MiB    228        758 MiB    0        102 TiB

Thanks,

-Brett
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nautilus: significant increase in cephfs metadata pool usage

2019-07-25 Thread Nathan Fish
I have seen significant increases (1GB -> 8GB) proportional to number
of inodes open, just like the MDS cache grows. These went away once
the stat-heavy workloads (multiple parallel rsyncs) stopped. I
disabled autoscale warnings on the metadata pools due to this
fluctuation.

On Thu, Jul 25, 2019 at 1:31 PM Dietmar Rieder
 wrote:
>
> On 7/25/19 11:55 AM, Konstantin Shalygin wrote:
> >> we just recently upgraded our cluster from luminous 12.2.10 to nautilus
> >> 14.2.1 and I noticed a massive increase of the space used on the cephfs
> >> metadata pool although the used space in the 2 data pools  basically did
> >> not change. See the attached graph (NOTE: log10 scale on y-axis)
> >>
> >> Is there any reason that explains this?
> >
> > Dietmar, how your metadata usage now? Is stop growing?
>
> it is stable now and only changes as the number of files in the FS changes.
>
> Dietmar
>
> --
> _
> D i e t m a r  R i e d e r, Mag.Dr.
> Innsbruck Medical University
> Biocenter - Division for Bioinformatics
> Innrain 80, 6020 Innsbruck
> Phone: +43 512 9003 71402
> Fax: +43 512 9003 73100
> Email: dietmar.rie...@i-med.ac.at
> Web:   http://www.icbi.at
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] test, please ignore

2019-07-25 Thread Federico Lucifredi
Anything for our favorite esquire! :-) -F2

-- "'Problem' is a bleak word for challenge" - Richard Fish
_
Federico Lucifredi
Product Management Director, Ceph Storage Platform
Red Hat
A273 4F57 58C0 7FE8 838D 4F87 AEEB EC18 4A73 88AC
redhat.com   TRIED. TESTED. TRUSTED.


On Thu, Jul 25, 2019 at 7:08 AM Tim Serong  wrote:

> Sorry for the noise, I was getting "Remote Server returned '550 Cannot
> process address'" errors earlier trying to send to
> ceph-users@lists.ceph.com, and wanted to re-test.
>
> --
> Tim Serong
> Senior Clustering Engineer
> SUSE
> tser...@suse.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to add 100 new OSDs...

2019-07-25 Thread Matthew Vernon

On 24/07/2019 20:06, Paul Emmerich wrote:

+1 on adding them all at the same time.

All these methods that gradually increase the weight aren't really 
necessary in newer releases of Ceph.


FWIW, we added a rack-full (9x60 = 540 OSDs) in one go to our production 
cluster (then running Jewel) taking it from 2520 to 3060 OSDs and it 
wasn't a big issue.


Regards,

Matthew



--
The Wellcome Sanger Institute is operated by Genome Research 
Limited, a charity registered in England with number 1021457 and a 
company registered in England with number 2742969, whose registered 
office is 215 Euston Road, London, NW1 2BE. 
___

ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to power off a cephfs cluster cleanly

2019-07-25 Thread DHilsbos
Dan;

I don't have  a lot of experience with Ceph, but I generally set all of the 
following before taking a cluster offline:
ceph osd set noout
ceph osd set nobackfill
ceph osd set norecover
ceph osd set norebalance
ceph osd set nodown
ceph osd set pause

I then unset them in the opposite order:

ceph osd unset pause
ceph osd unset nodown
ceph osd unset norebalance
ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset noout

This may be overkill though.

Will the MONs still have a quorum (i.e. will n / 2 + 1 still be running)?

Thank you,

Dominic L. Hilsbos, MBA 
Director - Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com



-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Dan 
van der Ster
Sent: Thursday, July 25, 2019 7:48 AM
To: ceph-users
Subject: [ceph-users] how to power off a cephfs cluster cleanly

Hi all,

In September we'll need to power down a CephFS cluster (currently
mimic) for a several-hour electrical intervention.

Having never done this before, I thought I'd check with the list.
Here's our planned procedure:

1. umount /cephfs from all hpc clients.
2. ceph osd set noout
3. wait until there is zero IO on the cluster
4. stop all mds's (active + standby)
5. stop all osds.
(6. we won't stop all mon's as they are not affected by that
electrical intervention)
7. power off the cluster.
...
8. power on the cluster, osd's first, then mds's. wait for health_ok.
9. ceph osd unset noout

Seems pretty simple... Are there any gotchas I'm missing? Maybe
there's some special procedure to stop the mds's cleanly?

Cheers, dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] how to power off a cephfs cluster cleanly

2019-07-25 Thread Dan van der Ster
Hi all,

In September we'll need to power down a CephFS cluster (currently
mimic) for a several-hour electrical intervention.

Having never done this before, I thought I'd check with the list.
Here's our planned procedure:

1. umount /cephfs from all hpc clients.
2. ceph osd set noout
3. wait until there is zero IO on the cluster
4. stop all mds's (active + standby)
5. stop all osds.
(6. we won't stop all mon's as they are not affected by that
electrical intervention)
7. power off the cluster.
...
8. power on the cluster, osd's first, then mds's. wait for health_ok.
9. ceph osd unset noout

Seems pretty simple... Are there any gotchas I'm missing? Maybe
there's some special procedure to stop the mds's cleanly?

Cheers, dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Anybody using 4x (size=4) replication?

2019-07-25 Thread Xavier Trilla
We had a similar situation, with one machine resetting when we restarted another 
one.

I’m not 100% sure why it happened, but I would bet it was related to several 
thousand client connections migrating from the machine we restarted to another 
one.

We have a similar setup than yours, and if you check the number of connections 
you have per machine they will be really high.

When you restart that machine all that connections have to be migrated to other 
machines, and I would say that may cause an error at a kernel level.

My suggestion (it's what we do nowadays) would be to slowly stop all the OSDs on 
the machine you are about to restart before you restart it.
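
Something along these lines (the CRUSH host name is just an example):

ceph osd set noout
for id in $(ceph osd ls-tree osd-host-01); do
    systemctl stop ceph-osd@"$id"
    sleep 30    # give clients time to move their sessions before the next OSD stops
done
# reboot the host, wait for its OSDs to rejoin, then:
ceph osd unset noout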

Cheers!



On 25 Jul 2019, at 11:01, Wido den Hollander  wrote:

> 
> 
>> On 7/25/19 9:56 AM, Xiaoxi Chen wrote:
>> The real impact of changing min_size to 1 is not about the possibility
>> of losing data, but about how much data will be lost. In both cases you will
>> lose some data; the question is just how much.
>> 
>> Let PG X -> (osd A, B, C), min_size = 2, size =3
>> In your description, 
>> 
>> T1, OSD A goes down due to upgrade; now the PG is in degraded mode
>> with (B,C). Note that the PG is still active, so new data is
>> written only to B and C.
>> 
>> T2, B goes down due to disk failure. C is the only one holding the
>> portion of data written between [T1, T2].
>> The failure rate of C in this situation is independent of whether we
>> continue writing to C.
>> 
>> If C fails at T3:
>> without changing min_size, you lose the data from [T1, T2] together with
>> the data unavailable from [T2, T3];
>> with min_size = 1, you lose the data from [T1, T3].
>> 
>> But agreed, it is a tradeoff, depending on whether you believe you won't have
>> two drive failures in a row within the 15-minute upgrade window...
> 
> Yes. So with min_size=1 you would need a single disk failure (B in your
> example) to lose data.
> 
> If min_size=2 you would need both B and C to fail within that window or
> when they are recovering A.
> 
> That is an even smaller chance than a single disk failure.
> 
> Wido
> 
>> Wido den Hollander <w...@42on.com> wrote on Thursday, 25 Jul 2019 at 15:39:
>> 
>> 
>> 
>>>On 7/25/19 9:19 AM, Xiaoxi Chen wrote:
>>> We have hit this case in production, but my solution would be to change
>>> min_size = 1 immediately so that the PG goes back to active right after.
>>> 
>>> It somewhat trades off reliability (durability) against availability during
>>> that 15-minute window, but if you are certain one out of the two "failures"
>>> is due to a recoverable issue, it is worth doing so.
>> 
>>That's actually dangerous imho.
>> 
>>Because while you set min_size=1 you will be mutating data on that
>>single disk/OSD.
>> 
>>If the other two OSDs come back, recovery will start. Now IF that single
>>disk/OSD dies while performing the recovery, you have lost data.
>> 
>>The PG (or PGs) becomes inactive and you either need to perform data
>>recovery on the failed disk or revert back to the last state.
>> 
>>I can't take that risk in this situation.
>> 
>>Wido
>> 
>>> My 0.02
>>> 
>>> Wido den Hollander <w...@42on.com> wrote on Thursday, 25 Jul 2019 at 03:48:
>>> 
>>> 
>>> 
>>>  On 7/24/19 9:35 PM, Mark Schouten wrote:
>>>  > I’d say the cure is worse than the issue you’re trying to
>>fix, but
>>>  that’s my two cents.
>>>  >
>>> 
>>>  I'm not completely happy with it either. Yes, the price goes
>>up and
>>>  latency increases as well.
>>> 
>>>  Right now I'm just trying to find a clever solution to this.
>>It's a 2k
>>>  OSD cluster and the likelihood of an host or OSD crashing is
>>reasonable
>>>  while you are performing maintenance on a different host.
>>> 
>>>  All kinds of things have crossed my mind where using size=4 is one
>>>  of them.
>>> 
>>>  Wido
>>> 
>>>  > Mark Schouten
>>>  >
>>>  >> Op 24 jul. 2019 om 21:22 heeft Wido den Hollander
>>mailto:w...@42on.com>
>>>  >> het volgende
>>geschreven:
>>>  >>
>>>  >> Hi,
>>>  >>
>>>  >> Is anybody using 4x (size=4, min_size=2) replication with Ceph?
>>>  >>
>>>  >> The reason I'm asking is that a customer of mine asked me for a
>>>  solution
>>>  >> to prevent a situation which occurred:
>>>  >>
>>>  >> A cluster running with size=3 and replication over different
>>>  racks was
>>>  >> being upgraded from 13.2.5 to 13.2.6.
>>>  >>
>>>  >> During the upgrade, which involved patching the OS as well,
>>they
>>>  >> rebooted one of the nodes. During that reboot suddenly a
>>node in a
>>>  >> different rack rebooted. It was unclear why this happened, but
>>>  the node
>>>  >> was gone.
>>>  >>
>>>  >> While the upgraded node was rebooting and the other node
>>crashed
>>>  about
>>>  >> 120 PGs were inactive due to min_size=2

Re: [ceph-users] Future of Filestore?

2019-07-25 Thread Виталий Филиппов
Hi again,

I reread your initial email - do you also run a nanoceph on some SBCs each 
having one 2.5" 5400rpm HDD plugged into it? What SBCs do you use? :-)
-- 
With best regards,
  Vitaliy Filippov___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: [lca-announce] linux.conf.au 2020 - Call for Sessions and Miniconfs now open!

2019-07-25 Thread Tim Serong
Hi All,

Just a reminder, there's only a few days left to submit talks for this
most excellent conference; the CFP is open until Sunday 28 July Anywhere
on Earth.

(I've submitted a Data Storage miniconf day, fingers crossed...)

Regards,

Tim

On 6/26/19 2:09 PM, Tim Serong wrote:
> Here we go again!  As usual the conference theme is intended to
> inspire, not to restrict; talks on any topic in the world of free and
> open source software, hardware, etc. are most welcome, and Ceph talks
> definitely fit.
> 
> I've added this to https://pad.ceph.com/p/cfp-coordination as well.
> 
>  Forwarded Message 
> Subject: [lca-announce] linux.conf.au 2020 - Call for Sessions and
> Miniconfs now open!
> Date: Tue, 25 Jun 2019 21:19:43 +1000
> From: linux.conf.au Announcements 
> Reply-To: lca-annou...@lists.linux.org.au
> To: lca-annou...@lists.linux.org.au
> 
> 
> The linux.conf.au 2020 organising team is excited to announce that the
> linux.conf.au 2020 Call for Sessions and Call for Miniconfs are now open!
> These will stay open from now until Sunday 28 July Anywhere on Earth
> (AoE) (https://en.wikipedia.org/wiki/Anywhere_on_Earth).
> 
> Our theme for linux.conf.au 2020 is "Who's Watching", focusing on
> security, privacy and ethics.
> As big data and IoT-connected devices become more pervasive, it's no
> surprise that we're more concerned about privacy and security than ever
> before.
> We've set our sights on how open source could play a role in maximising
> security and protecting our privacy in times of uncertainty.
> With the concept of privacy continuing to blur, open source could be the
> solution to give us '2020 vision'.
> 
> Call for Sessions
> 
> Would you like to talk in the main conference of linux.conf.au 2020?
> The main conference runs from Wednesday to Friday, with multiple streams
> catering for a wide range of interest areas.
> We welcome you to submit a session
> (https://linux.conf.au/programme/sessions/) proposal for either a talk
> or tutorial now.
> 
> Call for Miniconfs
> 
> Miniconfs are dedicated day-long streams focusing on single topics,
> creating a more immersive experience for delegates than a session.
> Miniconfs are run on the first two days of the conference before the
> main conference commences on Wednesday.
> If you would like to organise a miniconf
> (https://linux.conf.au/programme/miniconfs/) at linux.conf.au, we want
> to hear from you.
> 
> Have we got you interested?
> 
> You can find out how to submit your session or miniconf proposals at
> https://linux.conf.au/programme/proposals/.
> If you have any other questions you can contact us via email at
> cont...@lca2020.linux.org.au.
> 
> We are looking forward to reading your submissions.
> 
> linux.conf.au 2020 Organising Team
> 
> 
> ---
> Read this online at
> https://lca2020.linux.org.au/news/call-for-sessions-miniconfs-now-open/
> ___
> lca-announce mailing list
> lca-annou...@lists.linux.org.au
> http://lists.linux.org.au/mailman/listinfo/lca-announce
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] test, please ignore

2019-07-25 Thread Tim Serong
Sorry for the noise, I was getting "Remote Server returned '550 Cannot
process address'" errors earlier trying to send to
ceph-users@lists.ceph.com, and wanted to re-test.

-- 
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to add 100 new OSDs...

2019-07-25 Thread zhanrzh...@teamsun.com.cn
Hi,Janne
 Thank you for correcting my mistake.
Maybe the first piece of advice was unclear; I meant that OSDs should be added into
one failure domain at a time,
so that only one OSD in each PG's up set has to remap at a time.


--
zhanrzh...@teamsun.com.cn
>Den tors 25 juli 2019 kl 10:47 skrev 展荣臻(信泰) :
>
>>
>> 1. Adding OSDs in the same failure domain is to ensure only one member of each PG's up
>> set (as shown by ceph pg dump) has to remap.
>> 2. Setting "osd_pool_default_min_size=1" is to ensure objects can be read/written
>> uninterruptedly while PGs remap.
>> Is this wrong?
>>
>
>How did you read the first email, where he described how 3 copies were not
>enough, wanting to perhaps go to 4 copies
>to make sure he is not putting data at risk?
>
>The effect you describe is technically correct; it will allow writes to
>pass. But it would also go 100% against what Ceph tries to do here: retain
>the data even while doing planned maintenance, and even during
>unexpected downtime.
>
>Setting min_size=1 means you don't care at all for your data, and that you
>will be placing it under extreme risks.
>
>Not only will that single copy be a danger, but you can easily get into a
>situation where your single-copy write gets accepted and then that drive
>gets destroyed, and the cluster will know the latest writes ended up on it,
>and even getting the two older copies back will not help, since it has
>already registered that somewhere there is a newer version. For a single
>object, reverting to older (if possible) isn't all that bad, but for a
>section in the middle of a VM drive, that could mean total disaster.
>
>There are lots of people losing data with 1 copy, lots of posts on how
>repl_size=2, min_size=1 lost data for people using ceph, so I think posting
>advice to that effect goes against what ceph is good for.
>
>Not that I think the original poster would fall into that trap, but others
>might find this post later and think that it would be a good solution to
>maximize risk while adding/rebuilding 100s of OSDs. I don't agree.
>
>
>> On Thursday, 25 July 2019 at 04:36, zhanrzh...@teamsun.com.cn <
>> zhanrzh...@teamsun.com.cn> wrote:
>>
>>> I think it should to set "osd_pool_default_min_size=1" before you add osd
>>> ,
>>> and the osd that you add  at a time  should in same Failure domain.
>>>
>>
>> That sounds like weird or even bad advice?
>> What is the motivation behind it?
>>
>>
>--
>May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Please help: change IP address of a cluster

2019-07-25 Thread ST Wong (ITSC)
Hi,

Migrated the cluster to the new IP range.  Regarding the MON that didn't listen on the 
v1 port: I probably ran the command as mentioned in the manual, but it seems the part 
after the comma is not picked up, which tells the mon to listen on the v2 port only.

ceph-mon -i cmon5 --public-addr v2:10.0.1.97:3300,v1:10.0.1.97:6789

The MON map returned to normal, with both v1 and v2 for this MON, after removing the MON 
and adding it again without running this command after mkfs, then starting the 
services as usual.

Thanks a lot.
Rgds
/stwong

-Original Message-
From: ceph-users  On Behalf Of ST Wong (ITSC)
Sent: Wednesday, July 24, 2019 2:20 PM
To: Manuel Lausch ; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Please help: change IP address of a cluster

Hi,

Thanks for your help.
I changed the IP addresses of the OSD nodes successfully.  Changing the IP address on a 
MON also works, except that the MON only listens on v2 port 3300 after adding the 
MON back to the cluster.  Previously the MON listened on both v1 (6789) and v2 
(3300).  
Besides, I can't add both v1 and v2 entries manually using monmaptool like the 
following; only the 2nd one gets added.

monmaptool -add node1  v1:10.0.1.1:6789, v2:10.0.1.1:3330  {tmp}/{filename}

the monmap now looks like following:

min_mon_release 14 (nautilus)
0: [v2:10.0.1.92:3300/0,v1:10.0.1.92:6789/0] mon.cmon2
1: [v2:10.0.1.93:3300/0,v1:10.0.1.93:6789/0] mon.cmon2
2: [v2:10.0.1.94:3300/0,v1:10.0.1.94:6789/0] mon.cmon3
3: [v2:10.0.1.95:3300/0,v1:10.0.1.95:6789/0] mon.cmon4
4: v2:10.0.1.97:3300/0 mon.cmon5    <--- the MON being removed/added

Although it's okay to use v2 only, I'm afraid I've missed some steps and messed 
the cluster up.  Any advice?
Thanks again.
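
For the archive, a sketch of how both endpoints can be added in one go, assuming
your monmaptool build has the --addv option (Nautilus documents it for this case);
the monitor name and addresses are the ones used above, and the whole address
vector is one bracketed argument with no space after the comma:

monmaptool --addv cmon5 [v2:10.0.1.97:3300,v1:10.0.1.97:6789] {tmp}/{filename}
monmaptool --print {tmp}/{filename}    # verify both v1 and v2 are listed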

Best Regards,
/stwong

-Original Message-
From: ceph-users  On Behalf Of Manuel Lausch
Sent: Tuesday, July 23, 2019 7:32 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Please help: change IP address of a cluster

Hi,

I had to change the IPs of my cluster some time ago. The process was quite easy.

I don't understand what you mean by configuring and deleting static routes. 
The easiest way is if the router allows (at least for the duration of the
change) all traffic between the old and the new network. 

I did the following steps.

1. Add the new IP network, space separated, to the "public network" line in your 
ceph.conf.

2. OSDs: stop your OSDs on the first node, reconfigure the host network and 
start your OSDs again. Repeat this for all hosts, one by one.

3. MON: stop and remove one mon from the cluster, delete all data in 
/var/ceph/mon/mon, reconfigure the host network, and create the new mon 
instance (don't forget the "mon host" entries in your ceph.conf and on your clients 
as well). Of course this requires at least 3 mons in your cluster!
After 2 of the 5 mons in my cluster were done, I added the new mon addresses to my 
clients and restarted them. 

4. MGR: stop the mgr daemon, reconfigure the host network, and start the mgr daemon 
again. Do this one by one.
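
A minimal sketch of step 1, using the ranges from the original question (Ceph
accepts a comma- or space-separated list of CIDR networks here):

[global]
    public network = 10.0.7.0/24, 10.0.18.0/23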


I wouldn't recommend the "messy way" to reconfigure your mons. Removing and 
adding mons to the cluster is quite easy and in my opinion the safest approach.

The complete IP change in our cluster worked without an outage while the cluster 
was in production.

I hope I could help you.

Regards
Manuel



On Fri, 19 Jul 2019 10:22:37 +
"ST Wong (ITSC)"  wrote:

> Hi all,
> 
> Our cluster has to change to a new IP range in the same VLAN:  10.0.7.0/24
> -> 10.0.18.0/23, while the IP addresses on the private network for OSDs
> remain unchanged. I wonder if we can do that in either one of the following
> ways:
> 
> =
> 
> 1.
> 
> a.   Define static route for 10.0.18.0/23 on each node
> 
> b.   Do it one by one:
> 
> For each monitor/mgr:
> 
> -  remove from cluster
> 
> -  change IP address
> 
> -  add static route to original IP range 10.0.7.0/24
> 
> -  delete static route for 10.0.18.0/23
> 
> -  add back to cluster
> 
> For each OSD:
> 
> -  stop OSD daemons
> 
> -   change IP address
> 
> -  add static route to original IP range 10.0.7.0/24
> 
> -  delete static route for 10.0.18.0/23
> 
> -  start OSD daemons
> 
> c.   Clean up all static routes defined.
> 
> 
> 
> 2.
> 
> a.   Export and update monmap using the messy way as described in
> http://docs.ceph.com/docs/mimic/rados/operations/add-or-rm-mons/
> 
> 
> 
> ceph mon getmap -o {tmp}/{filename}
> 
> monmaptool --rm node1 --rm node2 ... --rm nodeN {tmp}/{filename}
> 
> monmaptool --add node1 v2:10.0.18.1:3300,v1:10.0.18.1:6789 --add node2
> v2:10.0.18.2:3300,v1:10.0.18.2:6789 ... --add nodeN
> v2:10.0.18.N:3300,v1:10.0.18.N:6789  {tmp}/{filename}
> 
> 
> 
> b.   stop entire cluster daemons and change IP addresses
> 
> 
> c.   For each mon node:  ceph-mon -i {mon-id} --inject-monmap
> {tmp}/{filename}
> 
> 
> 
> d.   Restart cluster daemons.
> 
> 
> 
> 3.   Or any better method...
> =
> 
> Would anyone please help?   Thanks 

Re: [ceph-users] Nautilus: significant increase in cephfs metadata pool usage

2019-07-25 Thread Dietmar Rieder
On 7/25/19 11:55 AM, Konstantin Shalygin wrote:
>> we just recently upgraded our cluster from luminous 12.2.10 to nautilus
>> 14.2.1 and I noticed a massive increase of the space used on the cephfs
>> metadata pool although the used space in the 2 data pools  basically did
>> not change. See the attached graph (NOTE: log10 scale on y-axis)
>>
>> Is there any reason that explains this?
> 
> Dietmar, how is your metadata usage now? Has it stopped growing?

it is stable now and only changes as the number of files in the FS changes.
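
For anyone tracking the same numbers, the per-pool figures can be watched over
time with the following (a sketch; pool and filesystem names are whatever yours
are called, and output formats vary slightly between releases):

ceph df detail
ceph fs status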

Dietmar

-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nautilus: significant increase in cephfs metadata pool usage

2019-07-25 Thread Konstantin Shalygin

we just recently upgraded our cluster from luminous 12.2.10 to nautilus
14.2.1 and I noticed a massive increase of the space used on the cephfs
metadata pool although the used space in the 2 data pools  basically did
not change. See the attached graph (NOTE: log10 scale on y-axis)

Is there any reason that explains this?


Dietmar, how is your metadata usage now? Has it stopped growing?




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to add 100 new OSDs...

2019-07-25 Thread Janne Johansson
On Thursday, 25 July 2019 at 10:47, 展荣臻 (信泰) wrote:

>
> 1、Adding osds in same one failure domain is to ensure only one PG in pg up
> set (ceph pg dump shows)to remap.
> 2、Setting "osd_pool_default_min_size=1" is to ensure objects to read/write
> uninterruptedly while pg remap.
> Is this wrong?
>

How did you read the first email, where he described how 3 copies were not
enough and he was perhaps wanting to go to 4 copies
to make sure he is not putting data at risk?

The effect you describe is technically correct, it will allow writes to
pass, but it would also go 100% against what ceph tries to do here, retain
the data even while doing planned maintenance, even while getting
unexpected downtime.

Setting min_size=1 means you don't care at all for your data, and that you
will be placing it under extreme risks.

Not only will that single copy be a danger, but you can easily get into a
situation where your single-copy write gets accepted and then that drive
gets destroyed, and the cluster will know the latest writes ended up on it,
and even getting the two older copies back will not help, since it has
already registered that somewhere there is a newer version. For a single
object, reverting to an older version (if possible) isn't all that bad, but for a
section in the middle of a VM drive, that could mean total disaster.

There are lots of people losing data with 1 copy, lots of posts on how
repl_size=2, min_size=1 lost data for people using ceph, so I think posting
advice to that effect goes against what ceph is good for.

Not that I think the original poster would fall into that trap, but others
might find this post later and think that it would be a good solution to
maximize risk while adding/rebuilding 100s of OSDs. I don't agree.


> On Thursday, 25 July 2019 at 04:36, zhanrzh...@teamsun.com.cn <
> zhanrzh...@teamsun.com.cn> wrote:
>
>> I think it should to set "osd_pool_default_min_size=1" before you add osd
>> ,
>> and the osd that you add  at a time  should in same Failure domain.
>>
>
> That sounds like weird or even bad advice?
> What is the motivation behind it?
>
>
-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Anybody using 4x (size=4) replication?

2019-07-25 Thread Wido den Hollander


On 7/25/19 9:56 AM, Xiaoxi Chen wrote:
> The real impact of changing min_size to 1 , is not about the possibility
> of losing data ,but how much data it will lost.. in both case you will
> lost some data , just how much.
> 
> Let PG X -> (osd A, B, C), min_size = 2, size =3
> In your description, 
> 
> T1,   OSD A goes down due to upgrade, now the PG is in degraded mode
> with   (B,C),  note that the PG is still active so that there is data
> only written to B and C.  
> 
> T2 , B goes down to due to disk failure.  C is the only one holding the
> portion of data between [T1, T2].
> The failure rate of C, in this situation  , is independent to whether we
> continue writing to C. 
> 
> if C failed in T3,  
>     w/o changing min_size, you  lost data from [T1, T2] together with
> data unavailable from [T2,T3]
>     changing min_size = 1 , you lost data from [T1, T3] 
> 
> But agree, it is a tradeoff,  depending on how you believe you wont have
> two drive failure in a row within 15 mins upgrade window...
> 

Yes. So with min_size=1 you would need only a single disk failure (B in your
example) to lose data.

If min_size=2 you would need both B and C to fail within that window or
when they are recovering A.

That is an even smaller chance than a single disk failure.
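
For reference, the change being discussed here is a per-pool setting; a sketch,
where 'mypool' is a placeholder, and note that raising size triggers backfill of
a whole extra replica:

ceph osd pool set mypool size 4
ceph osd pool set mypool min_size 2
ceph osd pool get mypool size      # verify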

Wido

> Wido den Hollander <w...@42on.com> wrote on Thursday, 25 July 2019 at 3:39 PM:
> 
> 
> 
> On 7/25/19 9:19 AM, Xiaoxi Chen wrote:
> > We had hit this case in production but my solution will be change
> > min_size = 1 immediately so that PG back to active right after.
> >
> > It somewhat tradeoff reliability(durability) with availability during
> > that window of 15 mins but if you are certain one out of two "failure"
> > is due to recoverable issue, it worth to do so.
> >
> 
> That's actually dangerous imho.
> 
> Because while you set min_size=1 you will be mutating data on that
> single disk/OSD.
> 
> If the other two OSDs come back recovery will start. Now IF that single
> disk/OSD now dies while performing the recovery you have lost data.
> 
> The PG (or PGs) becomes inactive and you either need to perform data
> recovery on the failed disk or revert back to the last state.
> 
> I can't take that risk in this situation.
> 
> Wido
> 
> > My 0.02
> >
> > Wido den Hollander <w...@42on.com> wrote on Thursday, 25 July 2019 at 3:48 AM:
> >
> >
> >
> >     On 7/24/19 9:35 PM, Mark Schouten wrote:
> >     > I’d say the cure is worse than the issue you’re trying to
> fix, but
> >     that’s my two cents.
> >     >
> >
> >     I'm not completely happy with it either. Yes, the price goes
> up and
> >     latency increases as well.
> >
> >     Right now I'm just trying to find a clever solution to this.
> It's a 2k
> >     OSD cluster and the likelihood of an host or OSD crashing is
> reasonable
> >     while you are performing maintenance on a different host.
> >
> >     All kinds of things have crossed my mind where using size=4 is one
> >     of them.
> >
> >     Wido
> >
> >     > Mark Schouten
> >     >
> >     >> On 24 Jul 2019 at 21:22, Wido den Hollander <w...@42on.com>
> >     >> wrote the following:
> >     >>
> >     >> Hi,
> >     >>
> >     >> Is anybody using 4x (size=4, min_size=2) replication with Ceph?
> >     >>
> >     >> The reason I'm asking is that a customer of mine asked me for a
> >     solution
> >     >> to prevent a situation which occurred:
> >     >>
> >     >> A cluster running with size=3 and replication over different
> >     racks was
> >     >> being upgraded from 13.2.5 to 13.2.6.
> >     >>
> >     >> During the upgrade, which involved patching the OS as well,
> they
> >     >> rebooted one of the nodes. During that reboot suddenly a
> node in a
> >     >> different rack rebooted. It was unclear why this happened, but
> >     the node
> >     >> was gone.
> >     >>
> >     >> While the upgraded node was rebooting and the other node
> crashed
> >     about
> >     >> 120 PGs were inactive due to min_size=2
> >     >>
> >     >> Waiting for the nodes to come back, recovery to finish it took
> >     about 15
> >     >> minutes before all VMs running inside OpenStack were back
> again.
> >     >>
> >     >> As you are upgraded or performing any maintenance with
> size=3 you
> >     can't
> >     >> tolerate a failure of a node as that will cause PGs to go
> inactive.
> >     >>
> >     >> This made me think about using size=4 and min_size=2 to
> prevent this
> >     >> situation.
> >     >>
> >     >> This obviously has implications on write latency and 

Re: [ceph-users] How to add 100 new OSDs...

2019-07-25 Thread 展荣臻(信泰)

1. Adding OSDs within one failure domain at a time is meant to ensure that only one PG in the up set 
(as shown by ceph pg dump) has to remap.
2. Setting "osd_pool_default_min_size=1" is meant to ensure objects can still be read and written 
without interruption while PGs remap.
Is this wrong?


-----Original Message-----
From: "Janne Johansson" 
Sent: 2019-07-25 15:01:37 (Thursday)
To: "zhanrzh...@teamsun.com.cn" 
Cc: "xavier.trilla" , ceph-users 

Subject: Re: [ceph-users] How to add 100 new OSDs...






On Thursday, 25 July 2019 at 04:36, zhanrzh...@teamsun.com.cn wrote:

I think it should to set "osd_pool_default_min_size=1" before you add osd ,
and the osd that you add  at a time  should in same Failure domain.


That sounds like weird or even bad advice?
What is the motivation behind it?


--

May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to add 100 new OSDs...

2019-07-25 Thread Thomas Byrne - UKRI STFC
As a counterpoint, adding large amounts of new hardware in gradually (or more 
specifically in a few steps) has a few benefits IMO.

- Being able to pause the operation and confirm the new hardware (and cluster) 
is operating as expected. You can identify problems with hardware with OSDs at 
10% weight that would be much harder to notice during backfilling, and could 
cause performance issues to the cluster if they ended up with their full 
complement of PGs.

- Breaking up long backfills. For a full cluster with large OSDs, backfills can 
take weeks. I find that letting the mon stores compact, and getting the cluster 
back to health OK is good for my sanity and gives a good stopping point to work 
on other cluster issues. This obviously depends on the cluster fullness and OSD 
size.

I still aim for the smallest amount of steps/work, but an initial crush 
weighting of 10-25% of final weight is a good sanity check of the new hardware, 
and gives a good indication of how to approach the rest of the backfill.
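
A sketch of what that initial weighting looks like on the CLI (osd.123 and the
weights are placeholders; the final CRUSH weight is normally the drive size in TiB):

ceph osd crush reweight osd.123 1.0     # start at roughly 10-25% of the final weight
# watch backfill and the new hardware, then step the weight up
ceph osd crush reweight osd.123 9.1     # final weight for a ~10 TB drive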

Cheers,
Tom

From: ceph-users  On Behalf Of Paul Emmerich
Sent: 24 July 2019 20:06
To: Reed Dier 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] How to add 100 new OSDs...

+1 on adding them all at the same time.

All these methods that gradually increase the weight aren't really necessary in 
newer releases of Ceph.

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, Jul 24, 2019 at 8:59 PM Reed Dier 
mailto:reed.d...@focusvq.com>> wrote:
Just chiming in to say that this too has been my preferred method for adding 
[large numbers of] OSDs.

Set the norebalance nobackfill flags.
Create all the OSDs, and verify everything looks good.
Make sure my max_backfills, recovery_max_active are as expected.
Make sure everything has peered.
Unset flags and let it run.

One crush map change, one data movement.

Reed



That works, but with newer releases I've been doing this:

- Make sure cluster is HEALTH_OK
- Set the 'norebalance' flag (and usually nobackfill)
- Add all the OSDs
- Wait for the PGs to peer. I usually wait a few minutes
- Remove the norebalance and nobackfill flag
- Wait for HEALTH_OK
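
A minimal sketch of those flag steps (the OSD creation itself is whatever tooling
you use, e.g. ceph-volume):

ceph osd set norebalance
ceph osd set nobackfill
# create/activate all the new OSDs and wait for their PGs to peer
ceph osd unset nobackfill
ceph osd unset norebalance
ceph -s    # wait for HEALTH_OK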

Wido

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Anybody using 4x (size=4) replication?

2019-07-25 Thread Xiaoxi Chen
The real impact of changing min_size to 1 is not about the possibility of
losing data, but about how much data you will lose. In both cases you will lose
some data; the question is just how much.

Let PG X -> (osd A, B, C), min_size = 2, size = 3.
In your description,

T1:  OSD A goes down due to the upgrade; now the PG is in degraded mode with
(B,C). Note that the PG is still active, so there is data that is only
written to B and C.

T2:  B goes down due to disk failure. C is the only one holding the
portion of data written between [T1, T2].
The failure rate of C, in this situation, is independent of whether we
continue writing to C.

If C fails at T3:
without changing min_size, you lose the data from [T1, T2] together with the data
unavailable from [T2, T3];
with min_size = 1, you lose the data from [T1, T3].

But agreed, it is a tradeoff, depending on how much you believe you won't have
two drive failures in a row within a 15-minute upgrade window...
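
For completeness, the knob being toggled here is per pool and just as easy to
revert once the degraded OSDs are back (the pool name is a placeholder, and the
risks Wido describes below still apply):

ceph osd pool set mypool min_size 1    # only while the PG would otherwise be inactive
# once recovery has finished
ceph osd pool set mypool min_size 2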

Wido den Hollander wrote on Thursday, 25 July 2019 at 3:39 PM:

>
>
> On 7/25/19 9:19 AM, Xiaoxi Chen wrote:
> > We had hit this case in production but my solution will be change
> > min_size = 1 immediately so that PG back to active right after.
> >
> > It somewhat tradeoff reliability(durability) with availability during
> > that window of 15 mins but if you are certain one out of two "failure"
> > is due to recoverable issue, it worth to do so.
> >
>
> That's actually dangerous imho.
>
> Because while you set min_size=1 you will be mutating data on that
> single disk/OSD.
>
> If the other two OSDs come back recovery will start. Now IF that single
> disk/OSD now dies while performing the recovery you have lost data.
>
> The PG (or PGs) becomes inactive and you either need to perform data
> recovery on the failed disk or revert back to the last state.
>
> I can't take that risk in this situation.
>
> Wido
>
> > My 0.02
> >
> > Wido den Hollander <w...@42on.com> wrote on Thursday, 25 July 2019 at 3:48 AM:
> >
> >
> >
> > On 7/24/19 9:35 PM, Mark Schouten wrote:
> > > I’d say the cure is worse than the issue you’re trying to fix, but
> > that’s my two cents.
> > >
> >
> > I'm not completely happy with it either. Yes, the price goes up and
> > latency increases as well.
> >
> > Right now I'm just trying to find a clever solution to this. It's a
> 2k
> > OSD cluster and the likelihood of an host or OSD crashing is
> reasonable
> > while you are performing maintenance on a different host.
> >
> > All kinds of things have crossed my mind where using size=4 is one
> > of them.
> >
> > Wido
> >
> > > Mark Schouten
> > >
> > >> On 24 Jul 2019 at 21:22, Wido den Hollander wrote the following:
> > >>
> > >> Hi,
> > >>
> > >> Is anybody using 4x (size=4, min_size=2) replication with Ceph?
> > >>
> > >> The reason I'm asking is that a customer of mine asked me for a
> > solution
> > >> to prevent a situation which occurred:
> > >>
> > >> A cluster running with size=3 and replication over different
> > racks was
> > >> being upgraded from 13.2.5 to 13.2.6.
> > >>
> > >> During the upgrade, which involved patching the OS as well, they
> > >> rebooted one of the nodes. During that reboot suddenly a node in a
> > >> different rack rebooted. It was unclear why this happened, but
> > the node
> > >> was gone.
> > >>
> > >> While the upgraded node was rebooting and the other node crashed
> > about
> > >> 120 PGs were inactive due to min_size=2
> > >>
> > >> Waiting for the nodes to come back, recovery to finish it took
> > about 15
> > >> minutes before all VMs running inside OpenStack were back again.
> > >>
> > >> As you are upgraded or performing any maintenance with size=3 you
> > can't
> > >> tolerate a failure of a node as that will cause PGs to go
> inactive.
> > >>
> > >> This made me think about using size=4 and min_size=2 to prevent
> this
> > >> situation.
> > >>
> > >> This obviously has implications on write latency and cost, but it
> > would
> > >> prevent such a situation.
> > >>
> > >> Is anybody here running a Ceph cluster with size=4 and min_size=2
> for
> > >> this reason?
> > >>
> > >> Thank you,
> > >>
> > >> Wido
> > >> ___
> > >> ceph-users mailing list
> > >> ceph-users@lists.ceph.com 
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-07-25 Thread Janek Bevendorff




It's possible the MDS is not being aggressive enough with asking the
single (?) client to reduce its cache size. There were recent changes
[1] to the MDS to improve this. However, the defaults may not be
aggressive enough for your client's workload. Can you try:

ceph config set mds mds_recall_max_caps 1
ceph config set mds mds_recall_max_decay_rate 1.0


Thank you. I was looking for config directives that do exactly this all 
week. Why are they not documented anywhere outside that blog post?


I added them as you described and the MDS seems to have stabilized and 
stays just under 1M inos now. I will continue to monitor it and see if 
it is working in the long run. Settings like these should be the default 
IMHO. Clients should never be able to crash the server just by holding 
onto their capabilities. If a server decides to drop things from its 
cache, clients must deal with it. Everything else threatens the 
stability of the system (and may even prevent the MDS from ever starting 
again, as we saw).



Also your other mailings made me think you may still be using the old
inode limit for the cache size. Are you using the new
mds_cache_memory_limit config option?


No, I am not. I tried it at some point to see if it made things better, 
but just like the memory cache limit, it seemed to have no effect 
whatsoever except for delaying the health warning.




Finally, if this fixes your issue (please let us know!) and you decide
to try multiple active MDS, you should definitely use pinning as the
parallel create workload will greatly benefit from it.


I will try that, although my directory tree is quite imbalanced.
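
For reference, pinning is set per directory via a virtual extended attribute on a
mounted client; a minimal sketch, assuming two active MDS ranks and a mount under
/mnt/cephfs (the subdirectories are placeholders):

setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/projects/a    # pin this subtree to rank 0
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/projects/b    # pin this subtree to rank 1
getfattr -n ceph.dir.pin /mnt/cephfs/projects/b         # verify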


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Anybody using 4x (size=4) replication?

2019-07-25 Thread Wido den Hollander


On 7/25/19 9:19 AM, Xiaoxi Chen wrote:
> We had hit this case in production but my solution will be change
> min_size = 1 immediately so that PG back to active right after.
> 
> It somewhat tradeoff reliability(durability) with availability during
> that window of 15 mins but if you are certain one out of two "failure"
> is due to recoverable issue, it worth to do so.
> 

That's actually dangerous imho.

Because while you set min_size=1 you will be mutating data on that
single disk/OSD.

If the other two OSDs come back, recovery will start. Now IF that single
disk/OSD dies while performing the recovery, you have lost data.

The PG (or PGs) becomes inactive and you either need to perform data
recovery on the failed disk or revert back to the last state.

I can't take that risk in this situation.

Wido

> My 0.02
> 
> Wido den Hollander <w...@42on.com> wrote on Thursday, 25 July 2019 at 3:48 AM:
> 
> 
> 
> On 7/24/19 9:35 PM, Mark Schouten wrote:
> > I’d say the cure is worse than the issue you’re trying to fix, but
> that’s my two cents.
> >
> 
> I'm not completely happy with it either. Yes, the price goes up and
> latency increases as well.
> 
> Right now I'm just trying to find a clever solution to this. It's a 2k
> OSD cluster and the likelihood of an host or OSD crashing is reasonable
> while you are performing maintenance on a different host.
> 
> All kinds of things have crossed my mind where using size=4 is one
> of them.
> 
> Wido
> 
> > Mark Schouten
> >
> >> On 24 Jul 2019 at 21:22, Wido den Hollander wrote the following:
> >>
> >> Hi,
> >>
> >> Is anybody using 4x (size=4, min_size=2) replication with Ceph?
> >>
> >> The reason I'm asking is that a customer of mine asked me for a
> solution
> >> to prevent a situation which occurred:
> >>
> >> A cluster running with size=3 and replication over different
> racks was
> >> being upgraded from 13.2.5 to 13.2.6.
> >>
> >> During the upgrade, which involved patching the OS as well, they
> >> rebooted one of the nodes. During that reboot suddenly a node in a
> >> different rack rebooted. It was unclear why this happened, but
> the node
> >> was gone.
> >>
> >> While the upgraded node was rebooting and the other node crashed
> about
> >> 120 PGs were inactive due to min_size=2
> >>
> >> Waiting for the nodes to come back, recovery to finish it took
> about 15
> >> minutes before all VMs running inside OpenStack were back again.
> >>
> >> As you are upgraded or performing any maintenance with size=3 you
> can't
> >> tolerate a failure of a node as that will cause PGs to go inactive.
> >>
> >> This made me think about using size=4 and min_size=2 to prevent this
> >> situation.
> >>
> >> This obviously has implications on write latency and cost, but it
> would
> >> prevent such a situation.
> >>
> >> Is anybody here running a Ceph cluster with size=4 and min_size=2 for
> >> this reason?
> >>
> >> Thank you,
> >>
> >> Wido
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com 
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kernel, Distro & Ceph

2019-07-25 Thread Dietmar Rieder
On 7/24/19 10:05 PM, Wido den Hollander wrote:
> 
> 
> On 7/24/19 9:38 PM, dhils...@performair.com wrote:
>> All;
>>
>> There's been a lot of discussion of various kernel versions on this list 
>> lately, so I thought I'd seek some clarification.
>>
>> I prefer to run CentOS, and I prefer to keep the number of "extra" 
>> repositories to a minimum.  Ceph requires adding a Ceph repo, and the EPEL 
>> repo.  Updating the kernel requires (from the research I've done) adding 
>> EL-Repo.  I believe CentOS 7 uses the 3.10 kernel.
>>
> 
> Are you planning on using CephFS? Because only on the clients using
> CephFS through the kernel you might require a new kernel.
> 

We are running CentOS stock kernels on all our HPC nodes, which mount
cephfs via the kernel client (pg-upmap and quotas enabled). No problems or
missing features noticed so far.
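
A sketch of what such a client setup implies (host names, the 'hpc' user and
paths are placeholders; pg-upmap additionally requires the cluster to only admit
luminous-or-newer clients):

# on the cluster, required once before using pg-upmap
ceph osd set-require-min-compat-client luminous
# on the HPC node, kernel client mount
mount -t ceph mon1,mon2,mon3:/ /mnt/cephfs -o name=hpc,secretfile=/etc/ceph/hpc.secret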

Dietmar

> The nodes in the Ceph cluster can run with the stock CentOS kernel.
> 
> Wido
> 
>> Under what circumstances would you recommend adding EL-Repo to CentOS 7.6, 
>> and installing kernel-ml?  Are there certain parts of Ceph which 
>> particularly benefit from kernels newer that 3.10?
>>
>> Thank you,
>>
>> Dominic L. Hilsbos, MBA 
>> Director - Information Technology 
>> Perform Air International Inc.
>> dhils...@performair.com 
>> www.PerformAir.com
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Anybody using 4x (size=4) replication?

2019-07-25 Thread Wido den Hollander


On 7/25/19 8:55 AM, Janne Johansson wrote:
> On Wednesday, 24 July 2019 at 21:48, Wido den Hollander wrote:
> 
> Right now I'm just trying to find a clever solution to this. It's a 2k
> OSD cluster and the likelihood of an host or OSD crashing is reasonable
> while you are performing maintenance on a different host.
> 
> All kinds of things have crossed my mind where using size=4 is one
> of them.
> 
> 
> The slow and boring solution would be to empty the host first. 8-(
>  

That crossed my mind as well, but that would be a very slow
operation. Try to update a 2k OSD cluster which stores ~2 PiB of data.

Wido

> -- 
> May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to add 100 new OSDs...

2019-07-25 Thread Janne Johansson
On Thursday, 25 July 2019 at 04:36, zhanrzh...@teamsun.com.cn <
zhanrzh...@teamsun.com.cn> wrote:

> I think it should to set "osd_pool_default_min_size=1" before you add osd ,
> and the osd that you add  at a time  should in same Failure domain.
>

That sounds like weird or even bad advice?
What is the motivation behind it?

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_ERR with a kitchen sink of problems: MDS damaged, readonly, and so forth

2019-07-25 Thread Sangwhan Moon
Original Message:
> 
> 
> On 7/25/19 7:49 AM, Sangwhan Moon wrote:
> > Hello,
> > 
> > Original Message:
> >>
> >>
> >> On 7/25/19 6:49 AM, Sangwhan Moon wrote:
> >>> Hello,
> >>>
> >>> I've inherited a Ceph cluster from someone who has left zero 
> >>> documentation or any handover. A couple days ago it decided to show the 
> >>> entire company what it is capable of..
> >>>
> >>> The health report looks like this:
> >>>
> >>> [root@host mnt]# ceph -s
> >>>   cluster:
> >>> id: 809718aa-3eac-4664-b8fa-38c46cdbfdab
> >>> health: HEALTH_ERR
> >>> 1 MDSs report damaged metadata
> >>> 1 MDSs are read only
> >>> 2 MDSs report slow requests
> >>> 6 MDSs behind on trimming
> >>> Reduced data availability: 2 pgs stale
> >>> Degraded data redundancy: 2593/186803520 objects degraded 
> >>> (0.001%), 2 pgs degraded, 2 pgs undersized
> >>> 1 slow requests are blocked > 32 sec. Implicated osds
> >>> 716 stuck requests are blocked > 4096 sec. Implicated osds 
> >>> 25,31,38\
> >>
> >> I would start here:
> >>
> >>>
> >>>   services:
> >>> mon: 3 daemons, quorum f,rook-ceph-mon2,rook-ceph-mon0
> >>> mgr: a(active)
> >>> mds: ceph-fs-2/2/2 up odd-fs-2/2/2 up  
> >>> {[ceph-fs:0]=ceph-fs-5b997cbf7b-5tjwh=up:active,[ceph-fs:1]=ceph-fs-5b997cbf
> >>> 7b-nstqz=up:active,[user-fs:0]=odd-fs-5668c75f9f-hflps=up:active,[user-fs:1]=odd-fs-5668c75f9f-jf59x=up:active},
> >>>  4 up:sta
> >>> ndby-replay
> >>> osd: 39 osds: 39 up, 38 in
> >>>
> >>>   data:
> >>> pools:   5 pools, 706 pgs
> >>> objects: 91212k objects, 4415 GB
> >>> usage:   10415 GB used, 13024 GB / 23439 GB avail
> >>> pgs: 2593/186803520 objects degraded (0.001%)
> >>>  703 active+clean
> >>>  2   stale+active+undersized+degraded
> >>
> >> This is a problem! Can you check:
> >>
> >> $ ceph pg dump_stuck
> >>
> >> The PGs will start with a number like 8.1a where '8' it the pool ID.
> >>
> >> Then check:
> >>
> >> $ ceph df
> >>
> >> To which pools to those PGs belong?
> >>
> >> Then check:
> >>
> >> $ ceph pg  query
> >>
> >> And the bottom somewhere should show why these PGs are not active. You
> >> might even want to try a restart of these OSDs involved with those two PGs.
> > 
> > Thanks a lot for the suggestions - I just checked and it says that the 
> > problematic PGs are 4.4f and 4.59 - but querying those seem result in the 
> > following error:
> > 
> > Error ENOENT: i don't have pgid 4.4f
> > 
> > (same applies for 4.59 - they do seem to show up in "ceph pg ls" though.)
> > 
> > In ceph pg ls, it shows that for these PGs UP, UP_PRIMARY ACTING, 
> > ACTING_PRIMARY all only have one OSD associated with it. (24, 13 - although 
> > both the PG ID mentioned above and these numbers probably don't help much 
> > with the diagnosis) Should restarting be a safe thing to try first?
> > 
> > ceph health detail says the following:
> > 
> > MDS_DAMAGE 1 MDSs report damaged metadata
> > mdsceph-fs-5b997cbf7b-5tjwh(mds.0): Metadata damage detected
> > MDS_READ_ONLY 1 MDSs are read only
> > mdsceph-fs-5b997cbf7b-5tjwh(mds.0): MDS in read-only mode
> > MDS_SLOW_REQUEST 2 MDSs report slow requests
> > mdsuser-fs-5668c75f9f-hflps(mds.0): 3 slow requests are blocked > 30 sec
> > mdsuser-fs-5668c75f9f-jf59x(mds.1): 980 slow requests are blocked > 30 
> > sec
> > MDS_TRIM 6 MDSs behind on trimming
> > mdsuser-fs-5668c75f9f-hflps(mds.0): Behind on trimming (342/128) 
> > max_segments: 128, num_segments: 342
> > mdsuser-fs-5668c75f9f-jf59x(mds.1): Behind on trimming (461/128) 
> > max_segments: 128, num_segments: 461
> > mdsuser-fs-5668c75f9f-h8p2t(mds.0): Behind on trimming (342/128) 
> > max_segments: 128, num_segments: 342
> > mdsuser-fs-5668c75f9f-7gs67(mds.1): Behind on trimming (461/128) 
> > max_segments: 128, num_segments: 461
> > mdsceph-fs-5b997cbf7b-5tjwh(mds.0): Behind on trimming (386/128) 
> > max_segments: 128, num_segments: 386
> > mdsceph-fs-5b997cbf7b-hmrxr(mds.0): Behind on trimming (386/128) 
> > max_segments: 128, num_segments: 386
> > PG_AVAILABILITY Reduced data availability: 2 pgs stale
> > pg 4.4f is stuck stale for 171783.855465, current state 
> > stale+active+undersized+degraded, last acting [24]
> > pg 4.59 is stuck stale for 171751.961506, current state 
> > stale+active+undersized+degraded, last acting [13]
> > PG_DEGRADED Degraded data redundancy: 2593/186805106 objects degraded 
> > (0.001%), 2 pgs degraded, 2 pgs undersized
> > pg 4.4f is stuck undersized for 171797.245359, current state 
> > stale+active+undersized+degraded, last acting [24]> pg 4.59 is stuck 
> > undersized for 171797.257707, current state
> stale+active+undersized+degraded, last acting [13]
> 
> So where are osd.24 and osd.13?
> 
> To which pool do these PGs belong?
> 
> But these PGs are probably the root-cause of all the issues you are seeing.
> 

Both 

Re: [ceph-users] Anybody using 4x (size=4) replication?

2019-07-25 Thread Janne Johansson
On Wednesday, 24 July 2019 at 21:48, Wido den Hollander wrote:

> Right now I'm just trying to find a clever solution to this. It's a 2k
> OSD cluster and the likelihood of an host or OSD crashing is reasonable
> while you are performing maintenance on a different host.
>
> All kinds of things have crossed my mind where using size=4 is one of them.
>

The slow and boring solution would be to empty the host first. 8-(

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_ERR with a kitchen sink of problems: MDS damaged, readonly, and so forth

2019-07-25 Thread Wido den Hollander



On 7/25/19 7:49 AM, Sangwhan Moon wrote:
> Hello,
> 
> Original Message:
>>
>>
>> On 7/25/19 6:49 AM, Sangwhan Moon wrote:
>>> Hello,
>>>
>>> I've inherited a Ceph cluster from someone who has left zero documentation 
>>> or any handover. A couple days ago it decided to show the entire company 
>>> what it is capable of..
>>>
>>> The health report looks like this:
>>>
>>> [root@host mnt]# ceph -s
>>>   cluster:
>>> id: 809718aa-3eac-4664-b8fa-38c46cdbfdab
>>> health: HEALTH_ERR
>>> 1 MDSs report damaged metadata
>>> 1 MDSs are read only
>>> 2 MDSs report slow requests
>>> 6 MDSs behind on trimming
>>> Reduced data availability: 2 pgs stale
>>> Degraded data redundancy: 2593/186803520 objects degraded 
>>> (0.001%), 2 pgs degraded, 2 pgs undersized
>>> 1 slow requests are blocked > 32 sec. Implicated osds
>>> 716 stuck requests are blocked > 4096 sec. Implicated osds 
>>> 25,31,38\
>>
>> I would start here:
>>
>>>
>>>   services:
>>> mon: 3 daemons, quorum f,rook-ceph-mon2,rook-ceph-mon0
>>> mgr: a(active)
>>> mds: ceph-fs-2/2/2 up odd-fs-2/2/2 up  
>>> {[ceph-fs:0]=ceph-fs-5b997cbf7b-5tjwh=up:active,[ceph-fs:1]=ceph-fs-5b997cbf
>>> 7b-nstqz=up:active,[user-fs:0]=odd-fs-5668c75f9f-hflps=up:active,[user-fs:1]=odd-fs-5668c75f9f-jf59x=up:active},
>>>  4 up:sta
>>> ndby-replay
>>> osd: 39 osds: 39 up, 38 in
>>>
>>>   data:
>>> pools:   5 pools, 706 pgs
>>> objects: 91212k objects, 4415 GB
>>> usage:   10415 GB used, 13024 GB / 23439 GB avail
>>> pgs: 2593/186803520 objects degraded (0.001%)
>>>  703 active+clean
>>>  2   stale+active+undersized+degraded
>>
>> This is a problem! Can you check:
>>
>> $ ceph pg dump_stuck
>>
>> The PGs will start with a number like 8.1a where '8' it the pool ID.
>>
>> Then check:
>>
>> $ ceph df
>>
>> To which pools to those PGs belong?
>>
>> Then check:
>>
>> $ ceph pg  query
>>
>> And the bottom somewhere should show why these PGs are not active. You
>> might even want to try a restart of these OSDs involved with those two PGs.
> 
> Thanks a lot for the suggestions - I just checked and it says that the 
> problematic PGs are 4.4f and 4.59 - but querying those seem result in the 
> following error:
> 
> Error ENOENT: i don't have pgid 4.4f
> 
> (same applies for 4.59 - they do seem to show up in "ceph pg ls" though.)
> 
> In ceph pg ls, it shows that for these PGs UP, UP_PRIMARY ACTING, 
> ACTING_PRIMARY all only have one OSD associated with it. (24, 13 - although 
> both the PG ID mentioned above and these numbers probably don't help much 
> with the diagnosis) Should restarting be a safe thing to try first?
> 
> ceph health detail says the following:
> 
> MDS_DAMAGE 1 MDSs report damaged metadata
> mdsceph-fs-5b997cbf7b-5tjwh(mds.0): Metadata damage detected
> MDS_READ_ONLY 1 MDSs are read only
> mdsceph-fs-5b997cbf7b-5tjwh(mds.0): MDS in read-only mode
> MDS_SLOW_REQUEST 2 MDSs report slow requests
> mdsuser-fs-5668c75f9f-hflps(mds.0): 3 slow requests are blocked > 30 sec
> mdsuser-fs-5668c75f9f-jf59x(mds.1): 980 slow requests are blocked > 30 sec
> MDS_TRIM 6 MDSs behind on trimming
> mdsuser-fs-5668c75f9f-hflps(mds.0): Behind on trimming (342/128) 
> max_segments: 128, num_segments: 342
> mdsuser-fs-5668c75f9f-jf59x(mds.1): Behind on trimming (461/128) 
> max_segments: 128, num_segments: 461
> mdsuser-fs-5668c75f9f-h8p2t(mds.0): Behind on trimming (342/128) 
> max_segments: 128, num_segments: 342
> mdsuser-fs-5668c75f9f-7gs67(mds.1): Behind on trimming (461/128) 
> max_segments: 128, num_segments: 461
> mdsceph-fs-5b997cbf7b-5tjwh(mds.0): Behind on trimming (386/128) 
> max_segments: 128, num_segments: 386
> mdsceph-fs-5b997cbf7b-hmrxr(mds.0): Behind on trimming (386/128) 
> max_segments: 128, num_segments: 386
> PG_AVAILABILITY Reduced data availability: 2 pgs stale
> pg 4.4f is stuck stale for 171783.855465, current state 
> stale+active+undersized+degraded, last acting [24]
> pg 4.59 is stuck stale for 171751.961506, current state 
> stale+active+undersized+degraded, last acting [13]
> PG_DEGRADED Degraded data redundancy: 2593/186805106 objects degraded 
> (0.001%), 2 pgs degraded, 2 pgs undersized
> pg 4.4f is stuck undersized for 171797.245359, current state 
> stale+active+undersized+degraded, last acting [24]> pg 4.59 is stuck 
> undersized for 171797.257707, current state
stale+active+undersized+degraded, last acting [13]

So where are osd.24 and osd.13?

To which pool do these PGs belong?

But these PGs are probably the root-cause of all the issues you are seeing.
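
A few read-only commands that answer both questions above (the pool ID 4 and the
OSD IDs come from the health detail output):

ceph osd find 24           # where osd.24 lives (host and CRUSH location)
ceph osd find 13
ceph osd pool ls detail    # maps pool ID 4 to its name, size and min_size
ceph pg map 4.4f           # current up/acting set for the stale PG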

Wido

> REQUEST_SLOW 3 slow requests are blocked > 32 sec. Implicated osds
> 3 ops are blocked > 2097.15 sec
> REQUEST_STUCK 717 stuck requests are blocked > 4096 sec. Implicated osds 
> 25,31,38
> 286 ops are blocked > 268435 sec
> 211 ops are blocked > 134218 sec
>   

Re: [ceph-users] HEALTH_ERR with a kitchen sink of problems: MDS damaged, readonly, and so forth

2019-07-25 Thread Sangwhan Moon
Original Message:
> On Thu, 25 Jul 2019 13:49:22 +0900 Sangwhan Moon wrote:
> 
> > osd: 39 osds: 39 up, 38 in
> 
> You might want to find that out OSD.

Thanks, I've identified the OSD and put it back in - doesn't seem to change 
anything though. :(
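
If it helps the next reader hitting the same state: the MDS metadata damage
reported above can be listed directly from the damaged rank, which at least shows
which inodes/dirfrags are affected (a sketch, using the active MDS name from
earlier in the thread):

ceph tell mds.ceph-fs-5b997cbf7b-5tjwh damage ls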

Sangwhan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com