Re: [ceph-users] Balanced MDS, all as active and recomended client settings.

2018-02-21 Thread Daniel Carrasco
Thanks, I'll check it.

I also want to find out whether there is any way to cache file metadata on the
client, to lower the MDS load. I suppose that file data is cached, but the
client still checks with the MDS whether the files have changed. On my server
the files are read-only most of the time, so the MDS data could also be cached
for a while.

Greetings!!

On 22 Feb 2018 at 03:59, "Patrick Donnelly"  wrote:

> Hello Daniel,
>
> On Wed, Feb 21, 2018 at 10:26 AM, Daniel Carrasco 
> wrote:
> > Is possible to make a better distribution on the MDS load of both nodes?.
>
> We are aware of bugs with the balancer which are being worked on. You
> can also manually create a partition if the workload can benefit:
>
> https://ceph.com/community/new-luminous-cephfs-subtree-pinning/
>
> > Is posible to set all nodes as Active without problems?
>
> No. I recommend you read the docs carefully:
>
> http://docs.ceph.com/docs/master/cephfs/multimds/
>
> > My last question is if someone can recomend me a good client
> configuration
> > like cache size, and maybe something to lower the metadata servers load.
>
> >>
> >> ##
> >> [mds]
> >>  mds_cache_size = 25
> >>  mds_cache_memory_limit = 792723456
>
> You should only specify one of those. See also:
>
> http://docs.ceph.com/docs/master/cephfs/cache-size-limits/
>
> --
> Patrick Donnelly
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] identifying public buckets

2018-02-21 Thread Robin H. Johnson
On Wed, Feb 21, 2018 at 10:19:58AM +, Dave Holland wrote:
> Hi,
> 
> We would like to scan our users' buckets to identify those which are
> publicly-accessible, to avoid potential embarrassment (or worse), e.g.
> http://www.bbc.co.uk/news/technology-42839462
> 
> I didn't find a way to use radosgw-admin to report ACL information for a
> given bucket. And using the API to query a bucket's information would
> require a valid access key for that bucket. What am I missing, please?
You can do it via the S3 API. The example below was run on Luminous, but it
should work fine on Jewel (you might have to force AWS-CLI to use a v2 signature).

You need to create a RGW user with the system flag set (it might be
possible with the newer admin flag as well).

As a concrete example, using Amazon's awscli, here:
# set the system bit on a user, if you don't already have a user with
# this power.
$ radosgw-admin user modify --uid $UID --system
# use the access+secret key from the above user.
$ AWS_ACCESS_KEY_ID='...' AWS_SECRET_ACCESS_KEY='...' \
aws \
--endpoint-url=https://$ENDPOINT \
s3api get-bucket-acl \
--bucket $BUCKETNAME

Example output (censored):
{
 "Owner": {
  "DisplayName": "ANOTHER-USER-THAT-WAS-NOT-SYSTEM", 
  "ID": "ANOTHER-USER-THAT-WAS-NOT-SYSTEM"
 }, 
 "Grants": [
  {
   "Grantee": {
"Type": "CanonicalUser", 
"DisplayName": "ANOTHER-USER-THAT-WAS-NOT-SYSTEM", 
"ID": "ANOTHER-USER-THAT-WAS-NOT-SYSTEM"
   }, 
   "Permission": "FULL_CONTROL"
  }
 ]
}
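
To sweep every bucket on the cluster, a rough sketch (untested; assumes jq is
installed and reuses the system user's keys from above) would be:

$ for bucket in $(radosgw-admin bucket list | jq -r '.[]'); do
    AWS_ACCESS_KEY_ID='...' AWS_SECRET_ACCESS_KEY='...' \
    aws --endpoint-url=https://$ENDPOINT \
    s3api get-bucket-acl --bucket "$bucket" \
    | grep -q 'global/AllUsers' && echo "PUBLIC: $bucket"
  done

Anything granted to the AllUsers group is readable (or worse) by anonymous
clients, so those are the buckets to chase up.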

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG_DAMAGED Possible data damage: 1 pg inconsistent

2018-02-21 Thread Brad Hubbard
On Wed, Feb 21, 2018 at 6:40 PM, Yoann Moulin  wrote:
> Hello,
>
> I migrated my cluster from jewel to luminous 3 weeks ago (using ceph-ansible 
> playbook), a few days after, ceph status told me "PG_DAMAGED
> Possible data damage: 1 pg inconsistent", I tried to repair the PG without 
> success, I tried to stop the OSD, flush the journal and restart the
> OSDs but the OSD refuse to start due to a bad journal. I decided to destroy 
> the OSD and recreated it from scratch. After that, everything seemed
> to be all right, but, I just saw now I have exactly the same error again on 
> the same PG on the same OSD (78).
>
>> $ ceph health detail
>> HEALTH_ERR 3 scrub errors; Possible data damage: 1 pg inconsistent
>> OSD_SCRUB_ERRORS 3 scrub errors
>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>> pg 11.5f is active+clean+inconsistent, acting [78,154,170]
>
>> $ ceph -s
>>   cluster:
>> id: f9dfd27f-c704-4d53-9aa0-4a23d655c7c4
>> health: HEALTH_ERR
>> 3 scrub errors
>> Possible data damage: 1 pg inconsistent
>>
>>   services:
>> mon: 3 daemons, quorum 
>> iccluster002.iccluster.epfl.ch,iccluster010.iccluster.epfl.ch,iccluster018.iccluster.epfl.ch
>> mgr: iccluster001(active), standbys: iccluster009, iccluster017
>> mds: cephfs-3/3/3 up  
>> {0=iccluster022.iccluster.epfl.ch=up:active,1=iccluster006.iccluster.epfl.ch=up:active,2=iccluster014.iccluster.epfl.ch=up:active}
>> osd: 180 osds: 180 up, 180 in
>> rgw: 6 daemons active
>>
>>   data:
>> pools:   29 pools, 10432 pgs
>> objects: 82862k objects, 171 TB
>> usage:   515 TB used, 465 TB / 980 TB avail
>> pgs: 10425 active+clean
>>  6 active+clean+scrubbing+deep
>>  1 active+clean+inconsistent
>>
>>   io:
>> client:   21538 B/s wr, 0 op/s rd, 33 op/s wr
>
>> ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous 
>> (stable)
>
> Short log :
>
>> 2018-02-21 09:08:33.408396 7fb7b8222700  0 log_channel(cluster) log [DBG] : 
>> 11.5f repair starts
>> 2018-02-21 09:08:33.727277 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
>> 11.5f shard 78: soid 
>> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head 
>> omap_digest 0x29fdd712 != omap_digest 0xd46bb5a1 from auth oi 
>> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9- 
>> b494-57bdb48fab4e.314528.19:head(98394'20014544 osd.78.0:1623704 
>> dirty|omap|data_digest|omap_digest s 0 uv 20014543 dd  od d46bb5a1 
>> alloc_hint [0 0 0])
>> 2018-02-21 09:08:33.727290 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
>> 11.5f shard 154: soid 
>> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head 
>> omap_digest 0x29fdd712 != omap_digest 0xd46bb5a1 from auth oi 
>> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head(98394'20014544
>>  osd.78.0:1623704 dirty|omap|data_digest|omap_digest s 0 uv 20014543 dd 
>>  od d46bb5a1 alloc_hint [0 0 0])
>> 2018-02-21 09:08:33.727293 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
>> 11.5f shard 170: soid 
>> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head 
>> omap_digest 0x29fdd712 != omap_digest 0xd46bb5a1 from auth oi 
>> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head(98394'20014544
>>  osd.78.0:1623704 dirty|omap|data_digest|omap_digest s 0 uv 20014543 dd 
>>  od d46bb5a1 alloc_hint [0 0 0])
>> 2018-02-21 09:08:33.727295 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
>> 11.5f soid 
>> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head: 
>> failed to pick suitable auth object
>> 2018-02-21 09:08:33.727333 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
>> 11.5f repair 3 errors, 0 fixed
>
> I set "debug_osd 20/20" on osd.78 and start the repair again, the log file is 
> here :
>
> ceph-post-file: 1ccac8ea-0947-4fe4-90b1-32d1048548f1
>
> What can I do in that situation ?

Take a look and see if http://tracker.ceph.com/issues/21388 is
relevant as well as the debugging and advice therein.
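
To gather more detail on that PG, the standard starting points are (adjust as
needed):

$ rados list-inconsistent-obj 11.5f --format=json-pretty
$ ceph pg 11.5f query > /tmp/pg.11.5f.json

list-inconsistent-obj shows the digest each shard reports for the object, which
makes it easier to judge whether one replica can be treated as authoritative.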

>
> Thanks for your help.
>
> --
> Yoann Moulin
> EPFL IC-IT



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balanced MDS, all as active and recomended client settings.

2018-02-21 Thread Patrick Donnelly
Hello Daniel,

On Wed, Feb 21, 2018 at 10:26 AM, Daniel Carrasco  wrote:
> Is possible to make a better distribution on the MDS load of both nodes?.

We are aware of bugs with the balancer which are being worked on. You
can also manually create a partition if the workload can benefit:

https://ceph.com/community/new-luminous-cephfs-subtree-pinning/
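
For example, once the workload is split across top-level directories, a
subtree can be pinned to a given MDS rank with an extended attribute set from a
client mount (paths below are only illustrative):

  setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/busy-dir    # pin subtree to rank 1
  setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/busy-dir   # unpin, back to the balancer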

> Is posible to set all nodes as Active without problems?

No. I recommend you read the docs carefully:

http://docs.ceph.com/docs/master/cephfs/multimds/

> My last question is if someone can recomend me a good client configuration
> like cache size, and maybe something to lower the metadata servers load.

>>
>> ##
>> [mds]
>>  mds_cache_size = 25
>>  mds_cache_memory_limit = 792723456

You should only specify one of those. See also:

http://docs.ceph.com/docs/master/cephfs/cache-size-limits/
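
For example, on Luminous it is enough to keep only the memory limit (the value
below is just the one from your own config):

[mds]
  mds_cache_memory_limit = 792723456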

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG overdose protection causing PG unavailability

2018-02-21 Thread David Turner
You could set the noin flag to prevent the new OSDs from being marked in (and
factored into CRUSH) until you are ready for all of them in the host to be
marked in. You can also set the initial CRUSH weight to 0 for new OSDs so that
they won't receive any PGs until you're ready for it.
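
For example (sketch only):

$ ceph osd set noin                      # new OSDs register but stay "out"
  ... recreate the OSDs ...
$ ceph osd unset noin
$ ceph osd in osd.<id>                   # mark them in when ready

or, in ceph.conf on the OSD hosts before creating them:

[osd]
  osd_crush_initial_weight = 0

and later raise the weight with "ceph osd crush reweight osd.<id> <weight>".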

On Wed, Feb 21, 2018, 5:46 PM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Dear Cephalopodians,
>
> in a Luminous 12.2.3 cluster with a pool with:
> - 192 Bluestore OSDs total
> - 6 hosts (32 OSDs per host)
> - 2048 total PGs
> - EC profile k=4, m=2
> - CRUSH failure domain = host
> which results in 2048*6/192 = 64 PGs per OSD on average, I run into issues
> with PG overdose protection.
>
> In case I reinstall one OSD host (zapping all disks), and recreate the
> OSDs one by one with ceph-volume,
> they will usually come back "slowly", i.e. one after the other.
>
> This means the first OSD will initially be assigned all 2048 PGs (to
> fulfill the "failure domain host" requirement),
> thus breaking through the default osd_max_pg_per_osd_hard_ratio of 2.
> We also use mon_max_pg_per_osd default, i.e. 200.
>
> This appears to cause the previously active (but of course
> undersized+degraded) PGs to enter an "activating+remapped" state,
> and hence they become unavailable.
> Thus, data availability is reduced. All this is caused by adding an OSD!
>
> Of course, as more and more OSDs are added until all 32 are back online,
> this situation is relaxed.
> Still, I observe that some PGs get stuck in this "activating" state, and
> can't seem to figure out from logs or by dumping them
> what's the actual reason. Waiting does not help, PGs stay "activating",
> data stays inaccessible.
>
> Waiting a bit and manually restarting the ceph-OSD-services on the
> reinstalled host seems to bring them back.
> Also, adjusting osd_max_pg_per_osd_hard_ratio to something large (e.g. 10)
> appears to prevent the issue.
>
> So my best guess is that this is related to PG overdose protection.
> Any ideas on how to best overcome this / similar observations?
>
> It would be nice to be able to reinstall an OSD host without temporarily
> making data unavailable,
> right now the only thing which comes to my mind is to effectively disable
> PG overdose protection.
>
> Cheers,
> Oliver
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to really change public network in ceph

2018-02-21 Thread David Turner
OSDs can change their IP every time they start: when they start and check in
with the mons, they tell the mons where they are. Changing your public network
requires restarting every daemon, so you will likely want to schedule downtime
for this. Clients can be routed and sit on whatever subnet you want, but OSDs
and mons should be on the same subnet.
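
A very rough outline of the sequence (follow the documented monitor-address
change procedure rather than this sketch; names and addresses are placeholders):

# 1. update ceph.conf on every node
[global]
  public network = 10.1.5.0/24

# 2. re-address each mon by editing the monmap (with that mon stopped)
$ ceph mon getmap -o /tmp/monmap
$ monmaptool --rm mon-a --add mon-a 10.1.5.10:6789 /tmp/monmap
$ ceph-mon -i mon-a --inject-monmap /tmp/monmap

# 3. restart the mons, then OSDs/MDS/RGW; the OSDs re-register with their new
#    addresses when they check in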

On Wed, Feb 21, 2018, 6:27 AM Mario Giammarco  wrote:

> I try to ask a simpler question: when I change monitors network and the
> network of osds, how can monitors know the new addresses of osds?
> Thanks,
> Mario
>
> 2018-02-19 10:22 GMT+01:00 Mario Giammarco :
>
>> Hello,
>> I have a test proxmox/ceph cluster with four servers.
>> I need to change the ceph public subnet from 10.1.0.0/24 to 10.1.5.0/24.
>> I have read documentation and tutorials.
>> The most critical part seems monitor map editing.
>> But it seems to me that osds need to bind to new subnet too.
>> I tried to put 10.1.0 and 10.1.5 subnets to public but it seems it
>> changes nothing.
>> Infact official documentation is unclear: it says you can put in public
>> network more than one subnet. It says they must be routed. But it does not
>> say what happens when you use multiple subnets or why you should do it.
>>
>> So I need help on these questions:
>> - exact sequence of operations to change public network in ceph (not only
>> monitors, also osds)
>> - details on multiple subnets in public networks in ceph
>>
>> Thanks in advance for any help,
>> Mario
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous v12.2.3 released

2018-02-21 Thread Sergey Malinin
Sadly, have to keep going with http://tracker.ceph.com/issues/22510






On Wednesday, February 21, 2018 at 22:50, Abhishek Lekshmanan wrote:

> We're happy to announce the third bugfix release of Luminous v12.2.x 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help with Bluestore WAL

2018-02-21 Thread David Turner
The WAL is a required part of the OSD. If you remove it, the OSD is missing a
crucial part of itself and will be unable to start until the WAL is back
online. If the SSD were to fail, then all OSDs using it would need to be
removed and recreated on the cluster.
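
For what it's worth, rebuilding one of those OSDs after replacing the SSD would
look roughly like this with ceph-volume (ids and VG/LV names are placeholders):

$ ceph osd purge <osd-id> --yes-i-really-mean-it
$ ceph-volume lvm zap --destroy /dev/sdX
$ ceph-volume lvm create --data /dev/sdX --block.wal ssd-vg/wal-sdX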

On Tue, Feb 20, 2018, 9:44 PM Konstantin Shalygin  wrote:

> Hi,
> We were recently testing luminous with bluestore. We have 6 node cluster 
> with 12 HDD and 1 SSD each, we used ceph-volume with LVM to create all the 
> OSD and attached with SSD WAL (LVM ). We create individual 10GBx12 LVM on 
> single SDD for each WAL. So all the OSD WAL is on the singe SSD. Problem is 
> if we pull the SSD out, it brings down all the 12 OSD on that node. Is that 
> expected behavior or we are missing any configuration ?
>
>
> Yes, you should plan your failure domain, i.e. what will be happens with
> your cluster if one backend ssd suddenly dies.
>
> Also you should plan mass failures of your ssd/nvme, so rule of thumb -
> don't overload your flash backend with osd. Recommend is ~4 osd per
> ssd/nvme.
>
>
>
> k
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating to new pools

2018-02-21 Thread David Turner
I recently migrated several VMs from an HDD pool to an SSD pool without any
downtime with proxmox. It is definitely possible with qemu to do no
downtime migrations between pools.

On Wed, Feb 21, 2018, 8:32 PM Alexandre DERUMIER 
wrote:

> Hi,
>
> if you use qemu, it's also possible to use drive-mirror feature from qemu.
> (can mirror and migrate from 1 storage to another storage without
> downtime).
>
> I don't known if openstack has implemented it, but It's working fine on
> proxmox.
>
>
> - Original Message -
> From: "Anthony D'Atri" 
> To: "ceph-users" 
> Sent: Thursday, 22 February 2018 01:27:23
> Subject: Re: [ceph-users] Migrating to new pools
>
> >> I was thinking we might be able to configure/hack rbd mirroring to
> mirror to
> >> a pool on the same cluster but I gather from the OP and your post that
> this
> >> is not really possible?
> >
> > No, it's not really possible currently and we have no plans to add
> > such support since it would not be of any long-term value.
>
> The long-term value would be the ability to migrate volumes from, say, a
> replicated pool to an an EC pool without extended downtime.
>
>
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrading inconvenience for Luminous

2018-02-21 Thread David Turner
Having all of the daemons in your cluster able to restart themselves at will
sounds terrifying. What's preventing every OSD from restarting at the same
time? Also, Ceph dot releases have been known to break environments; it's the
nature of such widely used software. I would recommend pinning the Ceph version
instead.
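
For example, pinning on a yum-based system looks roughly like:

$ yum install yum-plugin-versionlock
$ yum versionlock add 'ceph*'
# and when you are ready to upgrade deliberately:
$ yum versionlock clear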

On Wed, Feb 21, 2018, 6:09 PM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Dear Cephalopodians,
>
> we had our cluster (still in testing phase) configured for automatic
> updates so we got 12.2.3 "automagically" when it was released.
>
> In /etc/sysconfig/ceph, we still have the default:
> CEPH_AUTO_RESTART_ON_UPGRADE=no
> so as expected, services were not restarted.
>
> However, as soon as scrubs started to run, we got many scrub errors and
> inconsistent PGs.
> Looking into the logs, I found that some ceph-osd processes (still running
> as 12.2.2) tried to load
> the compression library (libsnappy) dynamically, and refused to do so
> since it was already updated to 12.2.3 on disk.
> This appears to have caused the OSD to report read errors.
>
> The situation was reasonably easy to fix (i.e. just restart all ceph-osd
> processes, and re-run a deep scrub some of the inconsistent PGs).
> Still, I wonder whether this could be prevented by loading the libraries
> at OSD startup (and never unloading them),
> or by shutting down the OSD in case of a library load failure.
> Did anybody else experience this as of yet?
>
> We will work around it either by version pinning or
> CEPH_AUTO_RESTART_ON_UPGRADE=yes (not decided yet).
>
> Cheers,
> Oliver
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating to new pools

2018-02-21 Thread Alexandre DERUMIER
Hi,

If you use qemu, it's also possible to use the drive-mirror feature from qemu
(it can mirror and migrate from one storage to another without downtime).

I don't know if OpenStack has implemented it, but it's working fine on Proxmox.


- Original Message -
From: "Anthony D'Atri" 
To: "ceph-users" 
Sent: Thursday, 22 February 2018 01:27:23
Subject: Re: [ceph-users] Migrating to new pools

>> I was thinking we might be able to configure/hack rbd mirroring to mirror to 
>> a pool on the same cluster but I gather from the OP and your post that this 
>> is not really possible? 
> 
> No, it's not really possible currently and we have no plans to add 
> such support since it would not be of any long-term value. 

The long-term value would be the ability to migrate volumes from, say, a 
replicated pool to an an EC pool without extended downtime. 






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rm: cannot remove dir and files (cephfs)

2018-02-21 Thread Deepak Naidu
>> rm: cannot remove '/design/4695/8/6-50kb.jpg': No space left on device
A "No space left on device" error in CephFS is typically caused by having more
than roughly 1 million (1,000,000) files in a single directory. To mitigate
this, try increasing "mds_bal_fragment_size_max" to a higher value, for example
7 million (7,000,000):

[mds]
mds_bal_fragment_size_max = 7000000

I won't go into detail here, but there are many other tuning parameters,
including enabling directory fragmentation, multiple MDS daemons, etc. It seems
your Ceph version is 10.2.10 (Jewel). Luminous (12.2.x) has better support for
multiple MDS daemons, directory fragmentation and so on; on Jewel some of these
options may still be experimental, so you might need to do a bit of reading on
CephFS first. Just a note of advice based on my experience: CephFS is ideal for
large files (MB/GB range) and not great for small files (in the KB range).
Also, split your files into multiple sub-directories to avoid running into this
kind of issue again.
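
For reference, the directory fragmentation piece is a filesystem flag (if I
recall correctly it is on by default in Luminous but still experimental in
Jewel, so read the docs before enabling it there; the fs name is a placeholder):

$ ceph fs set <fsname> allow_dirfrags true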

--
Deepak



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of ??
Sent: Friday, February 09, 2018 2:35 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] rm: cannot remove dir and files (cephfs)

ceph version 10.2.10

ceph -s
cluster 97f833aa-cc6f-41d5-bf82-bda5c09fd664
health HEALTH_OK
monmap e3: 3 mons at 
{web23=192.168.65.24:6789/0,web25=192.168.65.55:6789/0,web26=192.168.65.56:6789/0}
election epoch 1198, quorum 0,1,2 web23,web25,web26
fsmap e464: 1/1/1 up {0=web26=up:active}, 3 up:standby
osdmap e325: 4 osds: 4 up, 4 in
flags sortbitwise,require_jewel_osds
pgmap v42475: 128 pgs, 3 pools, 274 GB data, 1710 kobjects
854 GB used, 2939 GB / 3793 GB avail
128 active+clean
client io 181 kB/s wr, 0 op/s rd, 5 op/s wr

ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
3793G 2941G 851G 22.46
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
rbd 0 0 0 822G 0
cephfs_metadata 1 17968k 0 822G 101458
cephfs_data 2 273G 24.99 822G 1643059

kernel mount cephfs
grep design /etc/fstab
192.168.65.24:,192.168.65.55:,192.168.65.56:/ /design ceph 
rw,relatime,name=design,secret=...,_netdev,noatime 0 0

I can not delete same files and dir.
rm -rf /design/4*
rm: cannot remove '/design/4695/8/6-50kb.jpg': No space left on device
rm: cannot remove '/design/4695/8/9-300kb.png': No space left on device
rm: cannot remove '/design/4695/8/0-300kb.png': No space left on device

What ideas?
help




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating to new pools

2018-02-21 Thread Anthony D'Atri
>> I was thinking we might be able to configure/hack rbd mirroring to mirror to
>> a pool on the same cluster but I gather from the OP and your post that this
>> is not really possible?
> 
> No, it's not really possible currently and we have no plans to add
> such support since it would not be of any long-term value.

The long-term value would be the ability to migrate volumes from, say, a 
replicated pool to an an EC pool without extended downtime.





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume activation

2018-02-21 Thread Oliver Freyermuth
Am 21.02.2018 um 14:24 schrieb Alfredo Deza:
[snip]
>> Are there plans to have something like
>> "ceph-volume discover-and-activate"
>> which would effectively do something like:
>> ceph-volume list and activate all OSDs which are re-discovered from LVM 
>> metadata?
> 
> This is a good idea, I think ceph-disk had an 'activate all', and it
> would make it easier for the situation you explain with ceph-volume
> 
> I've created http://tracker.ceph.com/issues/23067 to follow up on this
> an implement it.

Many thanks for creating the issue, and also thanks for the extensive and clear 
reply! 

In case somebody finds it useful, I am right now using the following 
incantation:

lvs -o lv_tags | awk -vFS=, /ceph.osd_fsid/'{
    OSD_ID=gensub(".*ceph.osd_id=([0-9]+),.*", "\\1", "");
    OSD_FSID=gensub(".*ceph.osd_fsid=([a-z0-9-]+),.*", "\\1", "");
    print OSD_ID, OSD_FSID }' | sort -n | uniq

which runs very fast and directly returns the parameters to be consumed by 
"ceph-volume lvm activate" (i.e. OSD-ID and OSD-FSID). 
Of course, use that at your own risk until a good implementation in ceph-volume
is available. 
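
Piping that straight into activation is then just (same disclaimer applies):

lvs -o lv_tags | awk ... (the incantation above) ... | while read -r id fsid; do
    ceph-volume lvm activate "$id" "$fsid"
done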

Cheers,
Oliver



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Upgrading inconvenience for Luminous

2018-02-21 Thread Oliver Freyermuth
Dear Cephalopodians, 

we had our cluster (still in testing phase) configured for automatic updates so 
we got 12.2.3 "automagically" when it was released. 

In /etc/sysconfig/ceph, we still have the default: 
CEPH_AUTO_RESTART_ON_UPGRADE=no
so as expected, services were not restarted. 

However, as soon as scrubs started to run, we got many scrub errors and 
inconsistent PGs. 
Looking into the logs, I found that some ceph-osd processes (still running as 
12.2.2) tried to load
the compression library (libsnappy) dynamically, and refused to do so since it 
was already updated to 12.2.3 on disk. 
This appears to have caused the OSD to report read errors. 

The situation was reasonably easy to fix (i.e. just restart all ceph-osd 
processes, and re-run a deep scrub some of the inconsistent PGs). 
Still, I wonder whether this could be prevented by loading the libraries at OSD 
startup (and never unloading them), 
or by shutting down the OSD in case of a library load failure. 
Did anybody else experience this as of yet? 

We will work around it either by version pinning or 
CEPH_AUTO_RESTART_ON_UPGRADE=yes (not decided yet). 

Cheers,
Oliver



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] SSD Bluestore Backfills Slow

2018-02-21 Thread Reed Dier
Hi all,

I am running into an odd situation that I cannot easily explain.
I am currently in the midst of destroy and rebuild of OSDs from filestore to 
bluestore.
With my HDDs, I am seeing expected behavior, but with my SSDs I am seeing 
unexpected behavior. The HDDs and SSDs are set in crush accordingly.

My path to replacing the OSDs is to set the noout, norecover and norebalance
flags; destroy the OSD; recreate the OSD (iterating n times, all within a
single failure domain); unset the flags; and let it go. It finishes; rinse,
repeat.

For the SSD OSDs, they are SATA SSDs (Samsung SM863a) , 10 to a node, with 2 
NVMe drives (Intel P3700), 5 SATA SSDs to 1 NVMe drive, 16G partitions for 
block.db (previously filestore journals).
2x10GbE networking between the nodes. SATA backplane caps out at around 10 Gb/s 
as its 2x 6 Gb/s controllers. Luminous 12.2.2.

When the flags are unset, recovery starts and I see a very large rush of 
traffic, however, after the first machine completed, the performance tapered 
off at a rapid pace and trickles. Comparatively, I’m getting 100-200 recovery 
ops on 3 HDDs, backfilling from 21 other HDDs, where as I’m getting 150-250 
recovery ops on 5 SSDs, backfilling from 40 other SSDs. Every once in a while I 
will see a spike up to 500, 1000, or even 2000 ops on the SSDs, often a few 
hundred recovery ops from one OSD, and 8-15 ops from the others that are 
backfilling.

This is a far cry from the more than 15-30k recovery ops that it started off 
recovering with 1-3k recovery ops from a single OSD to the backfilling OSD(s). 
And an even farther cry from the >15k recovery ops I was sustaining for over an 
hour or more before. I was able to rebuild a 1.9T SSD (1.1T used) in a little 
under an hour, and I could do about 5 at a time and still keep it at roughly an 
hour to backfill all of them, but then I hit a roadblock after the first 
machine, when I tried to do 10 at a time (single machine). I am now still 
experiencing the same thing on the third node, while doing 5 OSDs at a time. 

The pools associated with these SSDs are cephfs-metadata, as well as a pure 
rados object pool we use for our own internal applications. Both are size=3, 
min_size=2.

It appears I am not the first to run into this, but it looks like there was no 
resolution: https://www.spinics.net/lists/ceph-users/msg41493.html 


Recovery parameters for the OSDs match what was in the previous thread, sans 
the osd conf block listed. And current osd_max_backfills = 30 and 
osd_recovery_max_active = 35. Very little activity on the OSDs during this 
period, so should not be any contention for iops on the SSDs.
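
For what it's worth, the effective values can be checked on the admin socket
and bumped at runtime like so (the osd id is just an example):

$ ceph daemon osd.24 config show | egrep 'osd_max_backfills|osd_recovery_max_active|osd_recovery_sleep'
$ ceph tell osd.* injectargs '--osd-max-backfills 30 --osd-recovery-max-active 35'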

The only oddity that I can attribute to things is that we had a few periods of 
time where the disk load on one of the mons was high enough to cause the mon to 
drop out of quorum for a brief amount of time, a few times. But I wouldn’t 
think backfills would just get throttled due to mons flapping.

Hopefully someone has some experience or can steer me in a path to improve the 
performance of the backfills so that I’m not stuck in backfill purgatory longer 
than I need to be.

Linking an imgur album with some screen grabs of the recovery ops over time for 
the first machine, versus the second and third machines to demonstrate the 
delta between them.
https://imgur.com/a/OJw4b 

Also including a ceph osd df of the SSDs, highlighted in red are the OSDs 
currently backfilling. Could this possibly be PG overdose? I don’t ever run 
into ‘stuck activating’ PGs, its just painfully slow backfills, like they are 
being throttled by ceph, that are causing me to worry. Drives aren’t worn, <30 
P/E cycles on the drives, so plenty of life left in them.

Thanks,
Reed

> $ ceph osd df
> ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
> 24   ssd 1.76109  1.0 1803G 1094G  708G 60.69 1.08 260
> 25   ssd 1.76109  1.0 1803G 1136G  667G 63.01 1.12 271
> 26   ssd 1.76109  1.0 1803G 1018G  785G 56.46 1.01 243
> 27   ssd 1.76109  1.0 1803G 1065G  737G 59.10 1.05 253
> 28   ssd 1.76109  1.0 1803G 1026G  776G 56.94 1.02 245
> 29   ssd 1.76109  1.0 1803G 1132G  671G 62.79 1.12 270
> 30   ssd 1.76109  1.0 1803G  944G  859G 52.35 0.93 224
> 31   ssd 1.76109  1.0 1803G 1061G  742G 58.85 1.05 252
> 32   ssd 1.76109  1.0 1803G 1003G  799G 55.67 0.99 239
> 33   ssd 1.76109  1.0 1803G 1049G  753G 58.20 1.04 250
> 34   ssd 1.76109  1.0 1803G 1086G  717G 60.23 1.07 257
> 35   ssd 1.76109  1.0 1803G  978G  824G 54.26 0.97 232
> 36   ssd 1.76109  1.0 1803G 1057G  745G 58.64 1.05 252
> 37   ssd 1.76109  1.0 1803G 1025G  777G 56.88 1.01 244
> 38   ssd 1.76109  1.0 1803G 1047G  756G 58.06 1.04 250
> 39   ssd 1.76109  1.0 1803G 1031G  771G 57.20 1.02 246
> 40   ssd 1.76109  1.0 1803G 1029G  774G 57.07 1.02 245
> 41   ssd 1.76109  1.0 1803G 1033G  770G 57.28 1.02 245
> 42   ssd 

Re: [ceph-users] ceph-volume activation

2018-02-21 Thread Oliver Freyermuth
Am 21.02.2018 um 15:58 schrieb Alfredo Deza:
> On Wed, Feb 21, 2018 at 9:40 AM, Dan van der Ster  wrote:
>> On Wed, Feb 21, 2018 at 2:24 PM, Alfredo Deza  wrote:
>>> On Tue, Feb 20, 2018 at 9:05 PM, Oliver Freyermuth
>>>  wrote:
 Many thanks for your replies!

 Are there plans to have something like
 "ceph-volume discover-and-activate"
 which would effectively do something like:
 ceph-volume list and activate all OSDs which are re-discovered from LVM 
 metadata?
>>>
>>> This is a good idea, I think ceph-disk had an 'activate all', and it
>>> would make it easier for the situation you explain with ceph-volume
>>>
>>> I've created http://tracker.ceph.com/issues/23067 to follow up on this
>>> an implement it.
>>
>> +1 thanks a lot for this thread and clear answers!
>> We were literally stuck today not knowing how to restart a ceph-volume
>> lvm created OSD.
>>
>> (It seems that once you systemctl stop ceph-osd@* on a machine, the
>> only way to get them back is ceph-volume lvm activate ... )
>>
>> BTW, ceph-osd.target now has less obvious functionality. For example,
>> this works:
>>
>>   systemctl restart ceph-osd.target
>>
>> But if you stop ceph-osd.target, then you can no longer start 
>> ceph-osd.target.
>>
>> Is this a regression or something we'll have to live with?
> 
> This sounds surprising. Stopping a ceph-osd target should not do
> anything with the devices. All that 'activate' does when called in
> ceph-volume is to ensure that
> the devices are available and mounted in the right places so that the
> OSD can start.
> 
> If you are experiencing a problem stopping an OSD that can't be
> started again, then something is going on. I would urge you to create
> a ticket with as many details as you can
> at http://tracker.ceph.com/projects/ceph-volume/issues/new

I also see this - but it's not really that "the osd can not be started again". 
The problem is that once the osd is stopped, e.g. via
systemctl stop ceph.target
doing a
systemctl start ceph.target
will not bring it up again. 

Doing a manual
systemctl start ceph-osd@36.service
will work, though. 

The ceph-osd@36.service, in fact, is not enabled on my machine,
which is likely why ceph.target will not cause it to come up. 

I am not a systemd expert, but I think the issue is that the 
ceph-volume@-services which
(I think) take care to activate the OSD services are not re-triggered. 
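
A quick way to see what is actually wired into the target on a given host
(exact unit names will of course differ):

$ systemctl is-enabled ceph-osd@36.service
$ systemctl list-units --all 'ceph-osd@*' 'ceph-volume@*'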

Cheers,
Oliver

> 
>>
>> Cheers, Dan




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PG overdose protection causing PG unavailability

2018-02-21 Thread Oliver Freyermuth
Dear Cephalopodians, 

in a Luminous 12.2.3 cluster with a pool with:
- 192 Bluestore OSDs total
- 6 hosts (32 OSDs per host)
- 2048 total PGs
- EC profile k=4, m=2
- CRUSH failure domain = host
which results in 2048*6/192 = 64 PGs per OSD on average, I run into issues with 
PG overdose protection. 

In case I reinstall one OSD host (zapping all disks), and recreate the OSDs one 
by one with ceph-volume,
they will usually come back "slowly", i.e. one after the other. 

This means the first OSD will initially be assigned all 2048 PGs (to fulfill 
the "failure domain host" requirement), 
thus breaking through the default osd_max_pg_per_osd_hard_ratio of 2. 
We also use mon_max_pg_per_osd default, i.e. 200. 

This appears to cause the previously active (but of course undersized+degraded) 
PGs to enter an "activating+remapped" state,
and hence they become unavailable. 
Thus, data availability is reduced. All this is caused by adding an OSD! 

Of course, as more and more OSDs are added until all 32 are back online, this 
situation is relaxed. 
Still, I observe that some PGs get stuck in this "activating" state, and can't 
seem to figure out from logs or by dumping them
what's the actual reason. Waiting does not help, PGs stay "activating", data 
stays inaccessible. 

Waiting a bit and manually restarting the ceph-OSD-services on the reinstalled 
host seems to bring them back. 
Also, adjusting osd_max_pg_per_osd_hard_ratio to something large (e.g. 10) 
appears to prevent the issue. 
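
Concretely, that was a runtime bump plus a config entry, roughly (the value is
only an example, and an OSD restart may be needed for it to stick everywhere):

$ ceph tell osd.* injectargs '--osd_max_pg_per_osd_hard_ratio 10'

[osd]
  osd_max_pg_per_osd_hard_ratio = 10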

So my best guess is that this is related to PG overdose protection. 
Any ideas on how to best overcome this / similar observations? 

It would be nice to be able to reinstall an OSD host without temporarily making 
data unavailable,
right now the only thing which comes to my mind is to effectively disable PG 
overdose protection. 

Cheers,
Oliver



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] SRV mechanism for looking up mons lacks IPv6 support

2018-02-21 Thread Simon Leinen
We just upgraded our last cluster to Luminous.  Since we might need to
renumber our mons in the not-too-distant future, it would be nice to
remove the literal IP addresses of the mons from ceph.conf.  Kraken and
above support a DNS-based mechanism for this based on SRV records[1].

Unfortunately our Rados cluster is IPv6-based, and in testing we found
out that the code that resolves these SRV records only looks for IPv4
addresses (A records) of the hostnames that the SRVs point to.
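
For reference, the setup from [1] boils down to records like the following
(names and addresses are placeholders), which is exactly where AAAA lookups
would be needed for v6-only mons:

_ceph-mon._tcp.example.com. 3600 IN SRV 10 60 6789 mon1.example.com.
_ceph-mon._tcp.example.com. 3600 IN SRV 10 60 6789 mon2.example.com.
mon1.example.com.           3600 IN AAAA 2001:db8::11
mon2.example.com.           3600 IN AAAA 2001:db8::12

plus dropping mon_host from ceph.conf (mon_dns_srv_name defaults to "ceph-mon").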

I just created issue #23078[2] for this.  The description points to
where I think the code would need to be changed.  If I can do anything
to help (in particular test fixes), please let me know.

This might be relevant to others who run IPv6 Rados clusters.
-- 
Simon.
[1] http://docs.ceph.com/docs/master/rados/configuration/mon-lookup-dns/
[2] http://tracker.ceph.com/issues/23078
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-21 Thread David Turner
I've been out sick for a couple days. I agree with Bryan Stillwell about
setting those flags and doing a rolling restart of all of the osds is a
good next step.

On Wed, Feb 21, 2018, 3:49 PM Bryan Stillwell 
wrote:

> Bryan,
>
>
>
> The good news is that there is progress being made on making this harder
> to screw up.  Read this article for example:
>
>
>
> https://ceph.com/community/new-luminous-pg-overdose-protection/
>
>
>
> The bad news is that I don't have a great solution for you regarding your
> peering problem.  I've run into things like that on testing clusters.  That
> almost always teaches me not to do too many operations at one time.
> Usually some combination of flags (norecover, norebalance, nobackfill,
> noout, etc.) with OSD restarts will fix the problem.  You can also query
> PGs to figure out why they aren't peering, increase logging, or if you want
> to get it back quickly you should consider RedHat support or contacting a
> Ceph consultant like Wido:
>
>
>
> In fact, I would recommend watching Wido's presentation on "10 ways to
> break your Ceph cluster" from Ceph Days Germany earlier this month for
> other things to watch out for:
>
>
>
> https://ceph.com/cephdays/germany/
>
>
>
> Bryan
>
>
>
> *From: *ceph-users  on behalf of Bryan
> Banister 
> *Date: *Tuesday, February 20, 2018 at 2:53 PM
> *To: *David Turner 
> *Cc: *Ceph Users 
>
>
> *Subject: *Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2
>
>
>
> HI David [Resending with smaller message size],
>
>
>
> I tried setting the OSDs down and that does clear the blocked requests
> momentarily but they just return back to the same state.  Not sure how to
> proceed here, but one thought was just to do a full cold restart of the
> entire cluster.  We have disabled our backups so the cluster is effectively
> down.  Any recommendations on next steps?
>
>
>
> This also seems like a pretty serious issue, given that making this change
> has effectively broken the cluster.  Perhaps Ceph should not allow you to
> increase the number of PGs so drastically or at least make you put in a
> ‘--yes-i-really-mean-it’ flag?
>
>
>
> Or perhaps just some warnings on the docs.ceph.com placement groups page (
> http://docs.ceph.com/docs/master/rados/operations/placement-groups/ ) and
> the ceph command man page?
>
>
>
> Would be good to help other avoid this pitfall.
>
>
>
> Thanks again,
>
> -Bryan
>
>
>
> *From:* David Turner [mailto:drakonst...@gmail.com ]
>
> *Sent:* Friday, February 16, 2018 3:21 PM
> *To:* Bryan Banister 
> *Cc:* Bryan Stillwell ; Janne Johansson <
> icepic...@gmail.com>; Ceph Users 
> *Subject:* Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2
>
>
>
>
> That sounds like a good next step.  Start with OSDs involved in the
> longest blocked requests.  Wait a couple minutes after the osd marks itself
> back up and continue through them.  Hopefully things will start clearing up
> so that you don't need to mark all of them down.  There is usually a only a
> couple OSDs holding everything up.
>
>
>
> On Fri, Feb 16, 2018 at 4:15 PM Bryan Banister 
> wrote:
>
> Thanks David,
>
>
>
> Taking the list of all OSDs that are stuck reports that a little over 50%
> of all OSDs are in this condition.  There isn’t any discernable pattern
> that I can find and they are spread across the three servers.  All of the
> OSDs are online as far as the service is concern.
>
>
>
>
> I have also taken all PGs that were reported the health detail output and
> looked for any that report “peering_blocked_by” but none do, so I can’t
> tell if any OSD is actually blocking the peering operation.
>
>
>
> As suggested, I got a report of all peering PGs:
>
> [root@carf-ceph-osd01 ~]# ceph health detail | grep "pg " | grep peering
> | sort -k13
>
> pg 14.fe0 is stuck peering since forever, current state peering, last
> acting [104,94,108]
>
> pg 14.fe0 is stuck unclean since forever, current state peering, last
> acting [104,94,108]
>
> pg 14.fbc is stuck peering since forever, current state peering, last
> acting [110,91,0]
>
> pg 14.fd1 is stuck peering since forever, current state peering, last
> acting [130,62,111]
>
> pg 14.fd1 is stuck unclean since forever, current state peering, last
> acting [130,62,111]
>
> pg 14.fed is stuck peering since forever, current state peering, last
> acting [32,33,82]
>
> pg 14.fed is stuck unclean since forever, current state peering, last
> acting [32,33,82]
>
> pg 14.fee is stuck peering since forever, current state peering, last
> acting [37,96,68]
>
> pg 14.fee is stuck unclean since forever, current state peering, last
> acting [37,96,68]
>
> pg 

Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-21 Thread Bryan Stillwell
Bryan,

The good news is that there is progress being made on making this harder to 
screw up.  Read this article for example:

https://ceph.com/community/new-luminous-pg-overdose-protection/

The bad news is that I don't have a great solution for you regarding your 
peering problem.  I've run into things like that on testing clusters.  That 
almost always teaches me not to do too many operations at one time.  Usually 
some combination of flags (norecover, norebalance, nobackfill, noout, etc.) 
with OSD restarts will fix the problem.  You can also query PGs to figure out 
why they aren't peering, increase logging, or if you want to get it back 
quickly you should consider RedHat support or contacting a Ceph consultant like 
Wido:
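
The generic sequence looks like this (substitute real PG and OSD ids):

$ ceph osd set noout; ceph osd set nobackfill; ceph osd set norecover; ceph osd set norebalance
$ systemctl restart ceph-osd@<id>    # on the host of an OSD involved in a stuck PG
$ ceph pg <pgid> query               # check recovery_state / peering_blocked_by
$ ceph osd unset norebalance; ceph osd unset norecover; ceph osd unset nobackfill; ceph osd unset noout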

In fact, I would recommend watching Wido's presentation on "10 ways to break 
your Ceph cluster" from Ceph Days Germany earlier this month for other things 
to watch out for:

https://ceph.com/cephdays/germany/

Bryan

From: ceph-users  on behalf of Bryan 
Banister 
Date: Tuesday, February 20, 2018 at 2:53 PM
To: David Turner 
Cc: Ceph Users 
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

HI David [Resending with smaller message size],

I tried setting the OSDs down and that does clear the blocked requests 
momentarily but they just return back to the same state.  Not sure how to 
proceed here, but one thought was just to do a full cold restart of the entire 
cluster.  We have disabled our backups so the cluster is effectively down.  Any 
recommendations on next steps?

This also seems like a pretty serious issue, given that making this change has 
effectively broken the cluster.  Perhaps Ceph should not allow you to increase 
the number of PGs so drastically or at least make you put in a 
‘--yes-i-really-mean-it’ flag?

Or perhaps just some warnings on the docs.ceph.com placement groups page 
(http://docs.ceph.com/docs/master/rados/operations/placement-groups/ ) and the 
ceph command man page?

Would be good to help other avoid this pitfall.

Thanks again,
-Bryan

From: David Turner [mailto:drakonst...@gmail.com]
Sent: Friday, February 16, 2018 3:21 PM
To: Bryan Banister >
Cc: Bryan Stillwell >; 
Janne Johansson >; Ceph Users 
>
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2


That sounds like a good next step.  Start with OSDs involved in the longest 
blocked requests.  Wait a couple minutes after the osd marks itself back up and 
continue through them.  Hopefully things will start clearing up so that you 
don't need to mark all of them down.  There is usually a only a couple OSDs 
holding everything up.

On Fri, Feb 16, 2018 at 4:15 PM Bryan Banister 
> wrote:
Thanks David,

Taking the list of all OSDs that are stuck reports that a little over 50% of 
all OSDs are in this condition.  There isn’t any discernable pattern that I can 
find and they are spread across the three servers.  All of the OSDs are online 
as far as the service is concern.

I have also taken all PGs that were reported the health detail output and 
looked for any that report “peering_blocked_by” but none do, so I can’t tell if 
any OSD is actually blocking the peering operation.

As suggested, I got a report of all peering PGs:
[root@carf-ceph-osd01 ~]# ceph health detail | grep "pg " | grep peering | sort 
-k13
pg 14.fe0 is stuck peering since forever, current state peering, last 
acting [104,94,108]
pg 14.fe0 is stuck unclean since forever, current state peering, last 
acting [104,94,108]
pg 14.fbc is stuck peering since forever, current state peering, last 
acting [110,91,0]
pg 14.fd1 is stuck peering since forever, current state peering, last 
acting [130,62,111]
pg 14.fd1 is stuck unclean since forever, current state peering, last 
acting [130,62,111]
pg 14.fed is stuck peering since forever, current state peering, last 
acting [32,33,82]
pg 14.fed is stuck unclean since forever, current state peering, last 
acting [32,33,82]
pg 14.fee is stuck peering since forever, current state peering, last 
acting [37,96,68]
pg 14.fee is stuck unclean since forever, current state peering, last 
acting [37,96,68]
pg 14.fe8 is stuck peering since forever, current state peering, last 
acting [45,31,107]
pg 14.fe8 is stuck unclean since forever, current state peering, last 
acting [45,31,107]
pg 14.fc1 is stuck peering since forever, current state peering, last 
acting [59,124,39]
pg 14.ff2 is stuck peering since forever, current state peering, last 
acting [62,117,7]
pg 14.ff2 is 

[ceph-users] Ceph auth caps - make it more user error proof

2018-02-21 Thread Enrico Kern
Hey all,

I would suggest some changes to the ceph auth caps command.

Today I almost broke half of one of our OpenStack regions with I/O
errors because of user error.

I tried to add osd blacklist caps to a cinder keyring after the Luminous
upgrade.

I did so by issuing: ceph auth caps client.cinder mon 'bla'

Doing this, I forgot that it also wipes the other caps rather than only
updating the mon caps, because you need to specify everything in one line. The
result was all of our VMs ending up with read-only filesystems after a while
because the osd caps were gone.
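
For example, adding the blacklist permission after a Luminous upgrade means
restating every cap in one go (check the existing ones first; the pool names
below are only illustrative):

$ ceph auth get client.cinder    # note the current mon/osd caps before touching them
$ ceph auth caps client.cinder \
    mon 'profile rbd' \
    osd 'profile rbd pool=volumes, profile rbd pool=vms, profile rbd-read-only pool=images'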

I suggest that if you only pass

ceph auth caps <entity> mon '...'

it should only update the caps for mon (or osd, etc.) and leave the others
untouched, or at least print a huge warning.

I know it is more of a PEBKAC problem, but Ceph does a great job of being
idiot-proof, and this would make it even more so ;)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balanced MDS, all as active and recomended client settings.

2018-02-21 Thread Daniel Carrasco
2018-02-21 19:26 GMT+01:00 Daniel Carrasco :

> Hello,
>
> I've created a Ceph cluster with 3 nodes to serve files to an high traffic
> webpage. I've configured two MDS as active and one as standby, but after
> add the new system to production I've noticed that MDS are not balanced and
> one server get the most of clients petitions (One MDS about 700 or less vs
> 4.000 or more the other).
>
> Is possible to make a better distribution on the MDS load of both nodes?.
> Is posible to set all nodes as Active without problems?
>
> I know that is possible to set max_mds to 3 and all will be active, but I
> want to know what happen if one node goes down for example, or if there are
> another side effects.
>
>
> My last question is if someone can recomend me a good client configuration
> like cache size, and maybe something to lower the metadata servers load.
>
>
> Thanks!!
>

I forgot to say my configuration xD.

I have a three-node cluster with all-in-one (AIO) nodes:

   - 3 Monitors
   - 3 OSD
   - 3 MDS (2 actives and one standby)
   - 3 MGR (1 active)

The data has 3 copies, so is in every node.

My configuration file is:
[global]
fsid = BlahBlahBlah
mon_initial_members = fs-01, fs-02, fs-03
mon_host = 192.168.4.199,192.168.4.200,192.168.4.201
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

public network = 192.168.4.0/24
osd pool default size = 3


##
### OSD
##
[osd]
  osd_pool_default_pg_num = 128
  osd_pool_default_pgp_num = 128
  osd_pool_default_size = 3
  osd_pool_default_min_size = 2

  osd_mon_heartbeat_interval = 5
  osd_mon_report_interval_max = 10
  osd_heartbeat_grace = 15
  osd_fast_fail_on_connection_refused = True


##
### MON
##
[mon]
  mon_osd_min_down_reporters = 2

##
### MDS
##
[mds]
  mds_cache_size = 25
  mds_cache_memory_limit = 792723456

##
### Client
##
[client]
  client_cache_size = 32768
  client_mount_timeout = 30
  client_oc_max_objects = 2000
  client_oc_size = 629145600
  rbd_cache = true
  rbd_cache_size = 671088640


Thanks!!!

-- 
_

  Daniel Carrasco Marín
  Ingeniería para la Innovación i2TIC, S.L.
  Tlf:  +34 911 12 32 84 Ext: 223
  www.i2tic.com
_
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Luminous v12.2.3 released

2018-02-21 Thread Abhishek Lekshmanan

We're happy to announce the third bugfix release of Luminous v12.2.x
long term stable release series. It contains a range of bug fixes and 
stability improvements across Bluestore, CephFS, RBD & RGW. We recommend all the
users of 12.2.x series update.

Notable Changes
---
 *CephFS*:
  * The CephFS client now checks for older kernels' inability to reliably clear
dentries from the kernel dentry cache. The new option
client_die_on_failed_dentry_invalidate (default: true) may be turned off to
allow the client to proceed (dangerous!).

For the full changelog with links, please refer to the release blog at
http://ceph.com/releases/v12-2-3-luminous-released/. We thank everyone
for contributing PRs, reporting bugs and helping make luminous
luminescent. Many thanks to Yuri for keeping the QE runs across the
components green. 

Getting Ceph

* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-12.2.3.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* For ceph-deploy, see 
http://docs.ceph.com/docs/master/install/install-ceph-deploy
* Release git sha1: 2dab17a455c09584f2a85e6b10888337d1ec8949

Changelog
-
* bluestore: do not crash on over-large objects (issue#22161, pr#19630, Sage 
Weil)
* bluestore: OSD crash on boot with assert caused by Bluefs on flush write 
(issue#21932, pr#19047, Jianpeng Ma)
* build/ops: ceph-base symbols not stripped in debs (issue#22640, pr#19969, 
Sage Weil)
* build/ops: ceph-conf: dump parsed config in plain text or as json 
(issue#21862, pr#18842, Piotr Dałek)
* build/ops: ceph-mgr dashboard has dependency on python-jinja2 (issue#22457, 
pr#19865, John Spray)
* build/ops: ceph-volume fails when centos7 image doesn't have lvm2 installed 
(issue#22443, issue#22217, pr#20215, Nathan Cutler, Theofilos Mouratidis)
* build/ops: Default kernel.pid_max is easily exceeded during recovery on high 
OSD-count system (issue#21929, pr#19133, David Disseldorp, Kefu Chai)
* build/ops: install-deps.sh: revert gcc to the one shipped by distro 
(issue#0, pr#19680, Kefu Chai)
* build/ops: luminous build fails with --without-radosgw (issue#22321, 
pr#19483, Jason Dillaman)
* build/ops: move ceph-\*-tool binaries out of ceph-test subpackage 
(issue#22319, issue#21762, pr#19355, liuchang0812, Nathan Cutler, Kefu Chai, 
Sage Weil)
* build.ops: rpm: adjust ceph-{osdomap,kvstore,monstore}-tool feature move 
(issue#22558, pr#19839, Kefu Chai)
* ceph: cluster [ERR] Unhandled exception from module 'balancer' while running 
on mgr.x: 'NoneType' object has no attribute 'iteritems'" in cluster log 
(issue#22090, pr#19023, Sage Weil)
* cephfs: cephfs-journal-tool: add "set pool_id" option (issue#22631, pr#20085, 
dongdong tao)
* cephfs: cephfs-journal-tool: tool would miss to report some invalid range 
(issue#22459, pr#19626, dongdong tao)
* cephfs: cephfs: potential adjust failure in lru_expire (issue#22458, 
pr#19627, dongdong tao)
* cephfs: "ceph tell mds" commands result in "File exists" errors on client 
admin socket (issue#21406, issue#21967, pr#18831, Patrick Donnelly)
* cephfs: client: anchor Inode while trimming caps (issue#22157, pr#19105, 
Patrick Donnelly)
* cephfs: client: avoid recursive lock in ll_get_vino (issue#22629, pr#20086, 
dongdong tao)
* cephfs: client: dual client segfault with racing ceph_shutdown (issue#21512, 
issue#20988, pr#20082, Jeff Layton)
* cephfs: client: implement delegation support in userland cephfs (issue#18490, 
pr#19480, Jeff Layton)
* cephfs: client: quit on failed remount during dentry invalidate test #19218 
(issue#22269, pr#19370, Patrick Donnelly)
* cephfs: List of filesystems does not get refreshed after a filesystem 
deletion (issue#21599, pr#18730, John Spray)
* cephfs: MDS : Avoid the assert failure when the inode for the cap_export from 
other… (issue#22610, pr#20300, Jianyu Li)
* cephfs: MDSMonitor: monitor gives constant "is now active in filesystem 
cephfs as rank" cluster log info messages (issue#21959, pr#19055, Patrick 
Donnelly)
* cephfs: racy is_mounted() checks in libcephfs (issue#21025, pr#17875, Jeff 
Layton)
* cephfs: src/mds/MDCache.cc: 7421: FAILED assert(CInode::count() == 
inode_map.size() + snap_inode_map.size()) (issue#21928, pr#18912, "Yan, Zheng")
* cephfs: vstart_runner: fixes for recent cephfs changes (issue#22526, 
pr#19829, Patrick Donnelly)
* ceph-volume: adds a --destroy flag to ceph-volume lvm zap (issue#22653, 
pr#20240, Andrew Schoen)
* ceph-volume: adds success messages for lvm prepare/activate/create 
(issue#22307, pr#20238, Andrew Schoen)
* ceph-volume: dmcrypt support for lvm (issue#22619, pr#20241, Alfredo Deza)
* ceph-volume dmcrypt support for simple (issue#22620, pr#20350, Andrew Schoen, 
Alfredo Deza)
* ceph-volume: do not use --key during mkfs (issue#22283, pr#20244, Kefu Chai, 
Sage Weil)
* ceph-volume: fix usage of the --osd-id flag (issue#22642, issue#22836, 
pr#20323, Andrew Schoen)
* ceph-volume Format 

Re: [ceph-users] mon service failed to start

2018-02-21 Thread Brian :
Hello

Wasn't this originally an issue with the mon store? Now you are getting a
checksum error from an OSD. I think some hardware in this node is just hosed.


On Wed, Feb 21, 2018 at 5:46 PM, Behnam Loghmani 
wrote:

> Hi there,
>
> I changed SATA port and cable of SSD disk and also update ceph to version
> 12.2.3 and rebuild OSDs
> but when recovery starts OSDs failed with this error:
>
>
> 2018-02-21 21:12:18.037974 7f3479fe2d00 -1 bluestore(/var/lib/ceph/osd/ceph-7)
> _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x84c097b0,
> expected 0xaf1040a2, device location [0x1~1000], logical extent
> 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
> 2018-02-21 21:12:18.038002 7f3479fe2d00 -1 osd.7 0 OSD::init() : unable to
> read osd superblock
> 2018-02-21 21:12:18.038009 7f3479fe2d00  1 bluestore(/var/lib/ceph/osd/ceph-7)
> umount
> 2018-02-21 21:12:18.038282 7f3479fe2d00  1 stupidalloc 0x0x55e99236c620
> shutdown
> 2018-02-21 21:12:18.038308 7f3479fe2d00  1 freelist shutdown
> 2018-02-21 21:12:18.038336 7f3479fe2d00  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/
> AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/
> MACHINE_SIZE/huge/release/12.2.3/rpm/el7/BUILD/ceph-12.2.3/src/rocksdb/db/db_impl.cc:217]
> Shutdown: ca
> nceling all background work
> 2018-02-21 21:12:18.041561 7f3465561700  4 rocksdb: (Original Log Time
> 2018/02/21-21:12:18.041514) [/home/jenkins-build/build/wor
> kspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABL
> E_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.
> 2.3/rpm/el7/BUILD/ceph-12.
> 2.3/src/rocksdb/db/compaction_job.cc:621] [default] compacted to: base
> level 1 max bytes base 268435456 files[5 0 0 0 0 0 0] max score 0.00,
> MB/sec: 2495.2 rd, 10.1 wr, level 1, files in(5, 0) out(1) MB in(213.6,
> 0.0) out(0.9), read-write-amplify(1.0) write-amplify(0.0) S
> hutdown in progress: Database shutdown or Column
> 2018-02-21 21:12:18.041569 7f3465561700  4 rocksdb: (Original Log Time
> 2018/02/21-21:12:18.041545) EVENT_LOG_v1 {"time_micros": 1519234938041530,
> "job": 3, "event": "compaction_finished", "compaction_time_micros": 89747,
> "output_level": 1, "num_output_files": 1, "total_ou
> tput_size": 902552, "num_input_records": 4470, "num_output_records": 4377,
> "num_subcompactions": 1, "num_single_delete_mismatches": 0,
> "num_single_delete_fallthrough": 44, "lsm_state": [5, 0, 0, 0, 0, 0, 0]}
> 2018-02-21 21:12:18.041663 7f3479fe2d00  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1519234938041657, "job": 4, "event": "table_file_deletion",
> "file_number": 249}
> 2018-02-21 21:12:18.042144 7f3479fe2d00  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/
> AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/
> MACHINE_SIZE/huge/release/12.2.3/rpm/el7/BUILD/ceph-12.2.3/src/rocksdb/db/db_impl.cc:343]
> Shutdown com
> plete
> 2018-02-21 21:12:18.043474 7f3479fe2d00  1 bluefs umount
> 2018-02-21 21:12:18.043775 7f3479fe2d00  1 stupidalloc 0x0x55e991f05d40
> shutdown
> 2018-02-21 21:12:18.043784 7f3479fe2d00  1 stupidalloc 0x0x55e991f05db0
> shutdown
> 2018-02-21 21:12:18.043786 7f3479fe2d00  1 stupidalloc 0x0x55e991f05e20
> shutdown
> 2018-02-21 21:12:18.043826 7f3479fe2d00  1 bdev(0x55e992254600
> /dev/vg0/wal-b) close
> 2018-02-21 21:12:18.301531 7f3479fe2d00  1 bdev(0x55e992255800
> /dev/vg0/db-b) close
> 2018-02-21 21:12:18.545488 7f3479fe2d00  1 bdev(0x55e992254400
> /var/lib/ceph/osd/ceph-7/block) close
> 2018-02-21 21:12:18.650473 7f3479fe2d00  1 bdev(0x55e992254000
> /var/lib/ceph/osd/ceph-7/block) close
> 2018-02-21 21:12:18.93 7f3479fe2d00 -1  ** ERROR: osd init failed:
> (22) Invalid argument
>
>
> On Wed, Feb 21, 2018 at 5:06 PM, Behnam Loghmani <
> behnam.loghm...@gmail.com> wrote:
>
>> but disks pass all the tests with smartctl, badblocks and there isn't any
>> error on disks. because the ssd has contain WAL/DB of OSDs it's difficult
>> to test it on other cluster nodes
>>
>> On Wed, Feb 21, 2018 at 4:58 PM,  wrote:
>>
>>> Could the problem be related with some faulty hardware (RAID-controller,
>>> port, cable) but not disk? Does "faulty" disk works OK on other server?
>>>
>>> Behnam Loghmani wrote on 21/02/18 16:09:
>>>
 Hi there,

 I changed the SSD on the problematic node with the new one and
 reconfigure OSDs and MON service on it.
 but the problem occurred again with:

 "rocksdb: submit_transaction error: Corruption: block checksum mismatch
 code = 2"

 I get fully confused now.



 On Tue, Feb 20, 2018 at 5:16 PM, Behnam Loghmani <
 behnam.loghm...@gmail.com > wrote:

 Hi Caspar,

 I checked the filesystem and there isn't any error on filesystem.
 The disk is SSD and it doesn't any attribute related to Wear level
 in smartctl and filesystem is
 mounted with default options and 

[ceph-users] Balanced MDS, all as active and recomended client settings.

2018-02-21 Thread Daniel Carrasco
Hello,

I've created a Ceph cluster with 3 nodes to serve files to a high-traffic
webpage. I've configured two MDS as active and one as standby, but after
putting the new system into production I've noticed that the MDS are not
balanced and one server gets most of the client requests (about 700 or
fewer on one MDS vs. 4,000 or more on the other).

Is it possible to get a better distribution of the MDS load across both nodes?
Is it possible to set all nodes as active without problems?

I know that it is possible to set max_mds to 3 so that all will be active,
but I want to know what happens if one node goes down, for example, or
whether there are other side effects.
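
(For reference, that setting is changed with the command below; the
filesystem name "cephfs" is an assumption here:)

  ceph fs set cephfs max_mds 3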


My last question is whether someone can recommend a good client
configuration (cache size, for example), and maybe something to lower the
metadata servers' load.


Thanks!!
-- 
_

  Daniel Carrasco Marín
  Ingeniería para la Innovación i2TIC, S.L.
  Tlf:  +34 911 12 32 84 Ext: 223
  www.i2tic.com
_
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon service failed to start

2018-02-21 Thread Behnam Loghmani
Hi there,

I changed the SATA port and cable of the SSD disk and also updated ceph to
version 12.2.3 and rebuilt the OSDs,
but when recovery starts the OSDs fail with this error:


2018-02-21 21:12:18.037974 7f3479fe2d00 -1 bluestore(/var/lib/ceph/osd/ceph-7)
_verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x84c097b0,
expected 0xaf1040a2, device location [0x1~1000], logical extent
0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2018-02-21 21:12:18.038002 7f3479fe2d00 -1 osd.7 0 OSD::init() : unable to
read osd superblock
2018-02-21 21:12:18.038009 7f3479fe2d00  1 bluestore(/var/lib/ceph/osd/ceph-7)
umount
2018-02-21 21:12:18.038282 7f3479fe2d00  1 stupidalloc 0x0x55e99236c620
shutdown
2018-02-21 21:12:18.038308 7f3479fe2d00  1 freelist shutdown
2018-02-21 21:12:18.038336 7f3479fe2d00  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_
64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/
centos7/MACHINE_SIZE/huge/release/12.2.3/rpm/el7/BUILD/
ceph-12.2.3/src/rocksdb/db/db_impl.cc:217] Shutdown: ca
nceling all background work
2018-02-21 21:12:18.041561 7f3465561700  4 rocksdb: (Original Log Time
2018/02/21-21:12:18.041514) [/home/jenkins-build/build/
workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/
AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/
release/12.2.3/rpm/el7/BUILD/ceph-12.
2.3/src/rocksdb/db/compaction_job.cc:621] [default] compacted to: base
level 1 max bytes base 268435456 files[5 0 0 0 0 0 0] max score 0.00,
MB/sec: 2495.2 rd, 10.1 wr, level 1, files in(5, 0) out(1) MB in(213.6,
0.0) out(0.9), read-write-amplify(1.0) write-amplify(0.0) S
hutdown in progress: Database shutdown or Column
2018-02-21 21:12:18.041569 7f3465561700  4 rocksdb: (Original Log Time
2018/02/21-21:12:18.041545) EVENT_LOG_v1 {"time_micros": 1519234938041530,
"job": 3, "event": "compaction_finished", "compaction_time_micros": 89747,
"output_level": 1, "num_output_files": 1, "total_ou
tput_size": 902552, "num_input_records": 4470, "num_output_records": 4377,
"num_subcompactions": 1, "num_single_delete_mismatches": 0,
"num_single_delete_fallthrough": 44, "lsm_state": [5, 0, 0, 0, 0, 0, 0]}
2018-02-21 21:12:18.041663 7f3479fe2d00  4 rocksdb: EVENT_LOG_v1
{"time_micros": 1519234938041657, "job": 4, "event": "table_file_deletion",
"file_number": 249}
2018-02-21 21:12:18.042144 7f3479fe2d00  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_
64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/
centos7/MACHINE_SIZE/huge/release/12.2.3/rpm/el7/BUILD/
ceph-12.2.3/src/rocksdb/db/db_impl.cc:343] Shutdown com
plete
2018-02-21 21:12:18.043474 7f3479fe2d00  1 bluefs umount
2018-02-21 21:12:18.043775 7f3479fe2d00  1 stupidalloc 0x0x55e991f05d40
shutdown
2018-02-21 21:12:18.043784 7f3479fe2d00  1 stupidalloc 0x0x55e991f05db0
shutdown
2018-02-21 21:12:18.043786 7f3479fe2d00  1 stupidalloc 0x0x55e991f05e20
shutdown
2018-02-21 21:12:18.043826 7f3479fe2d00  1 bdev(0x55e992254600
/dev/vg0/wal-b) close
2018-02-21 21:12:18.301531 7f3479fe2d00  1 bdev(0x55e992255800
/dev/vg0/db-b) close
2018-02-21 21:12:18.545488 7f3479fe2d00  1 bdev(0x55e992254400
/var/lib/ceph/osd/ceph-7/block) close
2018-02-21 21:12:18.650473 7f3479fe2d00  1 bdev(0x55e992254000
/var/lib/ceph/osd/ceph-7/block) close
2018-02-21 21:12:18.93 7f3479fe2d00 -1  ** ERROR: osd init failed: (22)
Invalid argument


On Wed, Feb 21, 2018 at 5:06 PM, Behnam Loghmani 
wrote:

> but disks pass all the tests with smartctl, badblocks and there isn't any
> error on disks. because the ssd has contain WAL/DB of OSDs it's difficult
> to test it on other cluster nodes
>
> On Wed, Feb 21, 2018 at 4:58 PM,  wrote:
>
>> Could the problem be related with some faulty hardware (RAID-controller,
>> port, cable) but not disk? Does "faulty" disk works OK on other server?
>>
>> Behnam Loghmani wrote on 21/02/18 16:09:
>>
>>> Hi there,
>>>
>>> I changed the SSD on the problematic node with the new one and
>>> reconfigure OSDs and MON service on it.
>>> but the problem occurred again with:
>>>
>>> "rocksdb: submit_transaction error: Corruption: block checksum mismatch
>>> code = 2"
>>>
>>> I get fully confused now.
>>>
>>>
>>>
>>> On Tue, Feb 20, 2018 at 5:16 PM, Behnam Loghmani <
>>> behnam.loghm...@gmail.com > wrote:
>>>
>>> Hi Caspar,
>>>
>>> I checked the filesystem and there isn't any error on filesystem.
>>> The disk is SSD and it doesn't any attribute related to Wear level
>>> in smartctl and filesystem is
>>> mounted with default options and no discard.
>>>
>>> my ceph structure on this node is like this:
>>>
>>> it has osd,mon,rgw services
>>> 1 SSD for OS and WAL/DB
>>> 2 HDD
>>>
>>> OSDs are created by ceph-volume lvm.
>>>
>>> the whole SSD is on 1 vg.
>>> OS is on root lv
>>> OSD.1 DB is on db-a
>>> OSD.1 WAL is on wal-a
>>> OSD.2 DB is on db-b
>>> OSD.2 WAL is on wal-b
>>>
>>> output of lvs:
>>>
>>> 

Re: [ceph-users] ceph-volume activation

2018-02-21 Thread Alfredo Deza
On Wed, Feb 21, 2018 at 9:40 AM, Dan van der Ster  wrote:
> On Wed, Feb 21, 2018 at 2:24 PM, Alfredo Deza  wrote:
>> On Tue, Feb 20, 2018 at 9:05 PM, Oliver Freyermuth
>>  wrote:
>>> Many thanks for your replies!
>>>
>>> Are there plans to have something like
>>> "ceph-volume discover-and-activate"
>>> which would effectively do something like:
>>> ceph-volume list and activate all OSDs which are re-discovered from LVM 
>>> metadata?
>>
>> This is a good idea, I think ceph-disk had an 'activate all', and it
>> would make it easier for the situation you explain with ceph-volume
>>
>> I've created http://tracker.ceph.com/issues/23067 to follow up on this
>> an implement it.
>
> +1 thanks a lot for this thread and clear answers!
> We were literally stuck today not knowing how to restart a ceph-volume
> lvm created OSD.
>
> (It seems that once you systemctl stop ceph-osd@* on a machine, the
> only way to get them back is ceph-volume lvm activate ... )
>
> BTW, ceph-osd.target now has less obvious functionality. For example,
> this works:
>
>   systemctl restart ceph-osd.target
>
> But if you stop ceph-osd.target, then you can no longer start ceph-osd.target.
>
> Is this a regression or something we'll have to live with?

This sounds surprising. Stopping a ceph-osd target should not do
anything with the devices. All that 'activate' does when called in
ceph-volume is to ensure that
the devices are available and mounted in the right places so that the
OSD can start.

If you are experiencing a problem stopping an OSD that can't be
started again, then something is going on. I would urge you to create
a ticket with as many details as you can
at http://tracker.ceph.com/projects/ceph-volume/issues/new
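
For the restart itself, a minimal sketch of re-activating by hand (the id
and fsid below are placeholders; take the real values from the listing):

  ceph-volume lvm list
  # id and fsid come from the listing above
  ceph-volume lvm activate 0 6cc43680-4f6e-4feb-92ff-9c7ba204120e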

>
> Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume activation

2018-02-21 Thread Dan van der Ster
On Wed, Feb 21, 2018 at 2:24 PM, Alfredo Deza  wrote:
> On Tue, Feb 20, 2018 at 9:05 PM, Oliver Freyermuth
>  wrote:
>> Many thanks for your replies!
>>
>> Are there plans to have something like
>> "ceph-volume discover-and-activate"
>> which would effectively do something like:
>> ceph-volume list and activate all OSDs which are re-discovered from LVM 
>> metadata?
>
> This is a good idea, I think ceph-disk had an 'activate all', and it
> would make it easier for the situation you explain with ceph-volume
>
> I've created http://tracker.ceph.com/issues/23067 to follow up on this
> an implement it.

+1 thanks a lot for this thread and clear answers!
We were literally stuck today not knowing how to restart a ceph-volume
lvm created OSD.

(It seems that once you systemctl stop ceph-osd@* on a machine, the
only way to get them back is ceph-volume lvm activate ... )

BTW, ceph-osd.target now has less obvious functionality. For example,
this works:

  systemctl restart ceph-osd.target

But if you stop ceph-osd.target, then you can no longer start ceph-osd.target.

Is this a regression or something we'll have to live with?

Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Missing clones

2018-02-21 Thread Karsten Becker
So - here is the feedback. After a long night...

The plain copying did not help... it then complains about the Snaps of
another VM (also with old Snapshots).

I remembered a thread I had read saying that the problem could be solved
by converting back to filestore, because you then have access to the data
in the filesystem. So I did that for the 3 OSDs affected. After that, of
course (argh), the PG got located on other OSDs - but at least one
was still on a filestore-converted OSD.

So I first set the primary affinity in a way that the PG was primary on
the filestore OSD. Then I quickly turned off all three OSDs. The PG got
stale then (all replicas were down). Flushed the journals to be on the
safe side.
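
(In commands, that was roughly the following; the OSD ids 10/20/30 are made
up here, with 10 being the filestore-converted one:)

  ceph osd primary-affinity osd.20 0
  ceph osd primary-affinity osd.30 0
  systemctl stop ceph-osd@10 ceph-osd@20 ceph-osd@30
  ceph-osd -i 10 --flush-journal   # repeated for 20 and 30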

Then I took a detailed look in the filesystem (with find) and found the
rbd_data.2313975238e1f29.000XXX, which was size 0. So no data in it.

I then used
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-X 
> rbd_data.2313975238e1f29.000XXX remove

on all three OSDs and fired them up again.

Then - after waiting for the cluster to get balanced again (PG still
reported as inconsistent) - I fired up a repair on the PG (primary still
on the filestore OSD).

-> Fixed.   :-)  HEALTHY

Tonight I will set the OSDs up as BlueStore again. Hopefully it will
not happen again.

I found a tip in a bug report to set "bluefs_allocator = stupid" in
ceph.conf. I did that as well and restarted all OSDs afterwards. So maybe
this prevents the problem from happening again.
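
(The ceph.conf snippet for that tip is simply, followed by an OSD restart:)

  [osd]
   bluefs_allocator = stupid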

Best
Karsten


On 20.02.2018 16:03, Eugen Block wrote:
> Alright, good luck!
> The results would be interesting. :-)


Ecologic Institut gemeinnuetzige GmbH
Pfalzburger Str. 43/44, D-10717 Berlin
Geschaeftsfuehrerin / Director: Dr. Camilla Bausch
Sitz der Gesellschaft / Registered Office: Berlin (Germany)
Registergericht / Court of Registration: Amtsgericht Berlin (Charlottenburg), 
HRB 57947
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon service failed to start

2018-02-21 Thread Behnam Loghmani
But the disks pass all the tests with smartctl and badblocks, and there
isn't any error on the disks. Because the SSD contains the WAL/DB of the
OSDs, it's difficult to test it on other cluster nodes.
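
(For the record, the sort of checks referred to would be along these lines;
/dev/sdX is a placeholder:)

  smartctl -a /dev/sdX        # health, error log and wear attributes
  smartctl -t long /dev/sdX   # extended self-test, check result later with -a
  badblocks -sv /dev/sdX      # read-only scan for unreadable sectors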

On Wed, Feb 21, 2018 at 4:58 PM,  wrote:

> Could the problem be related with some faulty hardware (RAID-controller,
> port, cable) but not disk? Does "faulty" disk works OK on other server?
>
> Behnam Loghmani wrote on 21/02/18 16:09:
>
>> Hi there,
>>
>> I changed the SSD on the problematic node with the new one and
>> reconfigure OSDs and MON service on it.
>> but the problem occurred again with:
>>
>> "rocksdb: submit_transaction error: Corruption: block checksum mismatch
>> code = 2"
>>
>> I get fully confused now.
>>
>>
>>
>> On Tue, Feb 20, 2018 at 5:16 PM, Behnam Loghmani <
>> behnam.loghm...@gmail.com > wrote:
>>
>> Hi Caspar,
>>
>> I checked the filesystem and there isn't any error on filesystem.
>> The disk is SSD and it doesn't any attribute related to Wear level in
>> smartctl and filesystem is
>> mounted with default options and no discard.
>>
>> my ceph structure on this node is like this:
>>
>> it has osd,mon,rgw services
>> 1 SSD for OS and WAL/DB
>> 2 HDD
>>
>> OSDs are created by ceph-volume lvm.
>>
>> the whole SSD is on 1 vg.
>> OS is on root lv
>> OSD.1 DB is on db-a
>> OSD.1 WAL is on wal-a
>> OSD.2 DB is on db-b
>> OSD.2 WAL is on wal-b
>>
>> output of lvs:
>>
>>data-a data-a -wi-a-
>>data-b data-b -wi-a-
>>db-a   vg0-wi-a-
>>db-b   vg0-wi-a-
>>root   vg0-wi-ao
>>wal-a  vg0-wi-a-
>>wal-b  vg0-wi-a-
>>
>> after making a heavy write on the radosgw, OSD.1 and OSD.2 has
>> stopped with "block checksum
>> mismatch" error.
>> Now on this node MON and OSDs services has stopped working with this
>> error
>>
>> I think my issue is related to this bug:
>> http://tracker.ceph.com/issues/22102
>> 
>>
>> I ran
>> #ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-1 --deep 1
>> but it returns the same error:
>>
>> *** Caught signal (Aborted) **
>>   in thread 7fbf6c923d00 thread_name:ceph-bluestore-
>> 2018-02-20 16:44:30.128787 7fbf6c923d00 -1 abort: Corruption: block
>> checksum mismatch
>>   ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba)
>> luminous (stable)
>>   1: (()+0x3eb0b1) [0x55f779e6e0b1]
>>   2: (()+0xf5e0) [0x7fbf61ae15e0]
>>   3: (gsignal()+0x37) [0x7fbf604d31f7]
>>   4: (abort()+0x148) [0x7fbf604d48e8]
>>   5: (RocksDBStore::get(std::string const&, char const*, unsigned
>> long,
>> ceph::buffer::list*)+0x1ce) [0x55f779d2b5ce]
>>   6: (BlueStore::Collection::get_onode(ghobject_t const&,
>> bool)+0x545) [0x55f779cd8f75]
>>   7: (BlueStore::_fsck(bool, bool)+0x1bb5) [0x55f779cf1a75]
>>   8: (main()+0xde0) [0x55f779baab90]
>>   9: (__libc_start_main()+0xf5) [0x7fbf604bfc05]
>>   10: (()+0x1bc59f) [0x55f779c3f59f]
>> 2018-02-20 16:44:30.131334 7fbf6c923d00 -1 *** Caught signal
>> (Aborted) **
>>   in thread 7fbf6c923d00 thread_name:ceph-bluestore-
>>
>>   ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba)
>> luminous (stable)
>>   1: (()+0x3eb0b1) [0x55f779e6e0b1]
>>   2: (()+0xf5e0) [0x7fbf61ae15e0]
>>   3: (gsignal()+0x37) [0x7fbf604d31f7]
>>   4: (abort()+0x148) [0x7fbf604d48e8]
>>   5: (RocksDBStore::get(std::string const&, char const*, unsigned
>> long,
>> ceph::buffer::list*)+0x1ce) [0x55f779d2b5ce]
>>   6: (BlueStore::Collection::get_onode(ghobject_t const&,
>> bool)+0x545) [0x55f779cd8f75]
>>   7: (BlueStore::_fsck(bool, bool)+0x1bb5) [0x55f779cf1a75]
>>   8: (main()+0xde0) [0x55f779baab90]
>>   9: (__libc_start_main()+0xf5) [0x7fbf604bfc05]
>>   10: (()+0x1bc59f) [0x55f779c3f59f]
>>   NOTE: a copy of the executable, or `objdump -rdS ` is
>> needed to interpret this.
>>
>>  -1> 2018-02-20 16:44:30.128787 7fbf6c923d00 -1 abort:
>> Corruption: block checksum mismatch
>>   0> 2018-02-20 16:44:30.131334 7fbf6c923d00 -1 *** Caught signal
>> (Aborted) **
>>   in thread 7fbf6c923d00 thread_name:ceph-bluestore-
>>
>>   ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba)
>> luminous (stable)
>>   1: (()+0x3eb0b1) [0x55f779e6e0b1]
>>   2: (()+0xf5e0) [0x7fbf61ae15e0]
>>   3: (gsignal()+0x37) [0x7fbf604d31f7]
>>   4: (abort()+0x148) [0x7fbf604d48e8]
>>   5: (RocksDBStore::get(std::string const&, char const*, unsigned
>> long,
>> ceph::buffer::list*)+0x1ce) [0x55f779d2b5ce]
>>   6: (BlueStore::Collection::get_onode(ghobject_t const&,
>> bool)+0x545) [0x55f779cd8f75]
>>   7: (BlueStore::_fsck(bool, bool)+0x1bb5) [0x55f779cf1a75]
>>   8: (main()+0xde0) [0x55f779baab90]
>>   9: (__libc_start_main()+0xf5) 

Re: [ceph-users] mon service failed to start

2018-02-21 Thread knawnd
Could the problem be related to some faulty hardware (RAID controller, port, cable) rather than the disk?
Does the "faulty" disk work OK on another server?


Behnam Loghmani wrote on 21/02/18 16:09:

Hi there,

I changed the SSD on the problematic node with the new one and reconfigure OSDs 
and MON service on it.
but the problem occurred again with:

"rocksdb: submit_transaction error: Corruption: block checksum mismatch code = 
2"

I get fully confused now.



On Tue, Feb 20, 2018 at 5:16 PM, Behnam Loghmani > wrote:


Hi Caspar,

I checked the filesystem and there isn't any error on filesystem.
The disk is SSD and it doesn't any attribute related to Wear level in 
smartctl and filesystem is
mounted with default options and no discard.

my ceph structure on this node is like this:

it has osd,mon,rgw services
1 SSD for OS and WAL/DB
2 HDD

OSDs are created by ceph-volume lvm.

the whole SSD is on 1 vg.
OS is on root lv
OSD.1 DB is on db-a
OSD.1 WAL is on wal-a
OSD.2 DB is on db-b
OSD.2 WAL is on wal-b

output of lvs:

   data-a data-a -wi-a-
   data-b data-b -wi-a-
   db-a   vg0    -wi-a-
   db-b   vg0    -wi-a-
   root   vg0    -wi-ao
   wal-a  vg0    -wi-a-
   wal-b  vg0    -wi-a-

after making a heavy write on the radosgw, OSD.1 and OSD.2 has stopped with 
"block checksum
mismatch" error.
Now on this node MON and OSDs services has stopped working with this error

I think my issue is related to this bug: 
http://tracker.ceph.com/issues/22102


I ran
#ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-1 --deep 1
but it returns the same error:

*** Caught signal (Aborted) **
  in thread 7fbf6c923d00 thread_name:ceph-bluestore-
2018-02-20 16:44:30.128787 7fbf6c923d00 -1 abort: Corruption: block 
checksum mismatch
  ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous 
(stable)
  1: (()+0x3eb0b1) [0x55f779e6e0b1]
  2: (()+0xf5e0) [0x7fbf61ae15e0]
  3: (gsignal()+0x37) [0x7fbf604d31f7]
  4: (abort()+0x148) [0x7fbf604d48e8]
  5: (RocksDBStore::get(std::string const&, char const*, unsigned long,
ceph::buffer::list*)+0x1ce) [0x55f779d2b5ce]
  6: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x545) 
[0x55f779cd8f75]
  7: (BlueStore::_fsck(bool, bool)+0x1bb5) [0x55f779cf1a75]
  8: (main()+0xde0) [0x55f779baab90]
  9: (__libc_start_main()+0xf5) [0x7fbf604bfc05]
  10: (()+0x1bc59f) [0x55f779c3f59f]
2018-02-20 16:44:30.131334 7fbf6c923d00 -1 *** Caught signal (Aborted) **
  in thread 7fbf6c923d00 thread_name:ceph-bluestore-

  ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous 
(stable)
  1: (()+0x3eb0b1) [0x55f779e6e0b1]
  2: (()+0xf5e0) [0x7fbf61ae15e0]
  3: (gsignal()+0x37) [0x7fbf604d31f7]
  4: (abort()+0x148) [0x7fbf604d48e8]
  5: (RocksDBStore::get(std::string const&, char const*, unsigned long,
ceph::buffer::list*)+0x1ce) [0x55f779d2b5ce]
  6: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x545) 
[0x55f779cd8f75]
  7: (BlueStore::_fsck(bool, bool)+0x1bb5) [0x55f779cf1a75]
  8: (main()+0xde0) [0x55f779baab90]
  9: (__libc_start_main()+0xf5) [0x7fbf604bfc05]
  10: (()+0x1bc59f) [0x55f779c3f59f]
  NOTE: a copy of the executable, or `objdump -rdS ` is needed 
to interpret this.

     -1> 2018-02-20 16:44:30.128787 7fbf6c923d00 -1 abort: Corruption: 
block checksum mismatch
  0> 2018-02-20 16:44:30.131334 7fbf6c923d00 -1 *** Caught signal 
(Aborted) **
  in thread 7fbf6c923d00 thread_name:ceph-bluestore-

  ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous 
(stable)
  1: (()+0x3eb0b1) [0x55f779e6e0b1]
  2: (()+0xf5e0) [0x7fbf61ae15e0]
  3: (gsignal()+0x37) [0x7fbf604d31f7]
  4: (abort()+0x148) [0x7fbf604d48e8]
  5: (RocksDBStore::get(std::string const&, char const*, unsigned long,
ceph::buffer::list*)+0x1ce) [0x55f779d2b5ce]
  6: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x545) 
[0x55f779cd8f75]
  7: (BlueStore::_fsck(bool, bool)+0x1bb5) [0x55f779cf1a75]
  8: (main()+0xde0) [0x55f779baab90]
  9: (__libc_start_main()+0xf5) [0x7fbf604bfc05]
  10: (()+0x1bc59f) [0x55f779c3f59f]
  NOTE: a copy of the executable, or `objdump -rdS ` is needed 
to interpret this.



Could you please help me to recover this node or find a way to prove SSD 
disk problem.

Best regards,
Behnam Loghmani




On Mon, Feb 19, 2018 at 1:35 PM, Caspar Smit > wrote:

Hi Behnam,

I would firstly recommend running a filesystem check on the monitor 
disk first to see if
there are any inconsistencies.

Is the disk 

Re: [ceph-users] Luminous : performance degrade while read operations (ceph-volume)

2018-02-21 Thread Alfredo Deza
On Tue, Feb 20, 2018 at 9:33 PM, nokia ceph 
wrote:

> Hi Alfredo Deza,
>
> I understand the point between lvm and simple however we see issue , was
> it issue in luminous because we use same ceph config and workload from
> client. The graphs i attached in previous mail is from ceph-volume lvm osd.
>

If the issue is a performance regression in Luminous I wouldn't know :( I
was trying to say that if you are seeing the same regression with
previously deployed OSDs then it can't possibly be a thing we are doing
incorrectly in ceph-volume


>
> In this case does it ococcupies 2 times only inside ceph. If we consider
> only lvm based system does this high iops because of dm-cache created for
> each osd?.
>

Not sure again. Maybe someone else might be able to chime in on this.
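
(One way to narrow down where the extra reads come from is to compare the
client-visible numbers with what the disk and the OSD process report, e.g.:)

  iostat -x 5 /dev/sdX          # per-disk read IOPS/throughput; sdX is a placeholder
  ceph daemon osd.0 perf dump   # per-OSD counters via the admin socket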


>
> Meanwhile i will update some graphs to show this once i have.
>
> Thanks,
> Muthu
>
> On Tuesday, February 20, 2018, Alfredo Deza  wrote:
>
>>
>>
>> On Mon, Feb 19, 2018 at 9:29 PM, nokia ceph 
>> wrote:
>>
>>> Hi Alfredo Deza,
>>>
>>> We have 5 node platforms with lvm osd created from scratch and another 5
>>> node platform migrated from kraken which is ceph volume simple. Both has
>>> same issue . Both platform has only hdd for osd.
>>>
>>> We also noticed 2 times disk iops more compare to kraken , this causes
>>> less read performance. During rocksdb compaction the situation is worse.
>>>
>>>
>>> Meanwhile we are building another platform creating osd using ceph-disk
>>> and analyse on this.
>>>
>>
>> If you have two platforms, one with `simple` and the other one with `lvm`
>> experiencing the same, then something else must be at fault here.
>>
>> The `simple` setup in ceph-volume basically keeps everything as it was
>> before, it just captures details of what devices were being used so OSDs
>> can be started. There is no interaction from ceph-volume
>> in there that could cause something like this.
>>
>>
>>
>>> Thanks,
>>> Muthu
>>>
>>>
>>>
>>> On Tuesday, February 20, 2018, Alfredo Deza  wrote:
>>>


 On Mon, Feb 19, 2018 at 2:01 PM, nokia ceph 
 wrote:

> Hi All,
>
> We have 5 node clusters with EC 4+1 and use bluestore since last year
> from Kraken.
> Recently we migrated all our platforms to luminous 12.2.2 and finally
> all OSDs migrated to ceph-volume simple type and on few platforms 
> installed
> ceph using ceph-volume .
>
> Now we see two times more traffic in read compare to client traffic on
> migrated platform and newly created platforms . This was not the case in
> older releases where ceph status read B/W will be same as client read
> traffic.
>
> Some network graphs :
>
> *Client network interface* towards ceph public interface : shows
> *4.3Gbps* read
>
>
> [image: Inline image 2]
>
> *Ceph Node Public interface* : Each node around 960Mbps * 5 node =*
> 4.6 Gbps *- this matches.
> [image: Inline image 3]
>
> Ceph status output : show  1032 MB/s =* 8.06 Gbps*
>
> cn6.chn6us1c1.cdn ~# ceph status
>   cluster:
> id: abda22db-3658-4d33-9681-e3ff10690f88
> health: HEALTH_OK
>
>   services:
> mon: 5 daemons, quorum cn6,cn7,cn8,cn9,cn10
> mgr: cn6(active), standbys: cn7, cn9, cn10, cn8
> osd: 340 osds: 340 up, 340 in
>
>   data:
> pools:   1 pools, 8192 pgs
> objects: 270M objects, 426 TB
> usage:   581 TB used, 655 TB / 1237 TB avail
> pgs: 8160 active+clean
>  32   active+clean+scrubbing
>
>   io:
> client:   *1032 MB/s rd*, 168 MB/s wr, 1908 op/s rd, 1594 op/s wr
>
>
> Write operation we don't see this issue. Client traffic and this
> matches.
> Is this expected behavior in Luminous and ceph-volume lvm or a bug ?
> Wrong calculation in ceph status read B/W ?
>

 You mentioned `ceph-volume simple` but here you say lvm. With LVM
 ceph-volume will create the OSDs from scratch, while "simple" will keep
 whatever OSD was created before.

 Have you created the OSDs from scratch with ceph-volume? or is it just
 using "simple" , managing a previously deployed OSD?

>
> Please provide your feedback.
>
> Thanks,
> Muthu
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>

>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume activation

2018-02-21 Thread Alfredo Deza
On Tue, Feb 20, 2018 at 9:05 PM, Oliver Freyermuth
 wrote:
> Many thanks for your replies!
>
> Am 21.02.2018 um 02:20 schrieb Alfredo Deza:
>> On Tue, Feb 20, 2018 at 5:56 PM, Oliver Freyermuth
>>  wrote:
>>> Dear Cephalopodians,
>>>
>>> with the release of ceph-deploy we are thinking about migrating our 
>>> Bluestore-OSDs (currently created with ceph-disk via old ceph-deploy)
>>> to be created via ceph-volume (with LVM).
>>
>> When you say migrating, do you mean creating them again from scratch
>> or making ceph-volume take over the previously created OSDs
>> (ceph-volume can do both)
>
> I would recreate from scratch to switch to LVM, we have a k=4 m=2 EC-pool 
> with 6 hosts, so I can just take down a full host and recreate.
> But good to know both would work!
>
>>
>>>
>>> I note two major changes:
>>> 1. It seems the block.db partitions have to be created beforehand, manually.
>>>With ceph-disk, one should not do that - or manually set the correct 
>>> PARTTYPE ID.
>>>Will ceph-volume take care of setting the PARTTYPE on existing 
>>> partitions for block.db now?
>>>Is it not necessary anymore?
>>>Is the config option bluestore_block_db_size now also obsoleted?
>>
>> Right, ceph-volume will not create any partitions for you, so no, it
>> will not take care of setting PARTTYPE either. If your setup requires
>> a block.db, then this must be
>> created beforehand and then passed onto ceph-volume. The one
>> requirement if it is a partition is to have a PARTUUID. For logical
>> volumes it can just work as-is. This is
>> explained in detail at
>> http://docs.ceph.com/docs/master/ceph-volume/lvm/prepare/#bluestore
>>
>> PARTUUID information for ceph-volume at:
>> http://docs.ceph.com/docs/master/ceph-volume/lvm/prepare/#partitioning
>
> Ok.
> So do I understand correctly that the PARTTYPE setting (i.e. those magic 
> numbers as found e.g. in ceph-disk sources in PTYPE:
> https://github.com/ceph/ceph/blob/master/src/ceph-disk/ceph_disk/main.py#L62 )
> is not needed anymore for the block.db partitions, since it was effectively 
> only there
> to have udev work?

Right, the PARTTYPE was only for udev. We need the PARTUUID to ensure
that we can find the right device/partition always (in the case of
partitions only)
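
A quick way to sanity-check that on a GPT partition created with sgdisk
(device and size below are placeholders):

  sgdisk --new=1:0:+30G --change-name=1:'ceph block.db' /dev/sdag
  blkid -o value -s PARTUUID /dev/sdag1   # must print a PARTUUID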

>
> I remember from ceph-disk that if I created the block.db partition beforehand 
> and without setting the magic PARTTYPE,
> it would become unhappy.
> ceph-volume and the systemd activation path should not care at all if I 
> understand this correctly.

Right again, this was part of the complex set of things that a
partition had to have in order for ceph-disk to work. A lot of users
thought that the partition approach was simple enough, without being
aware that a lot of extra things were needed for those partitions to be
recognized by ceph-disk.

>
> So in short, to create a new OSD, steps for me would be:
> - Create block.db partition (and don't care about PARTTYPE).
>   I do only have to make sure it has a PARTUUID.
> - ceph-volume lvm create --bluestore --block.db /dev/sdag1 --data /dev/sda
>   (or the same via ceph-deploy)

That would work, yes. When you pass a whole device to --data in that
example, ceph-volume will create a whole volume group and logical
volume out of that device and use it for bluestore. That
may or may not be what you want though. With LVM you can chop that
device in many pieces and use what you want. That "shim" in
ceph-volume is there to allow users that don't care about this to just
move forward with a whole device.

Similarly, if you want to use a logical volume for --block.db, you can.

To recap: yes, your example would work, but you have a lot of other
options if you need more flexibility.
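
As an illustration of that flexibility (volume group, logical volume and
sizes below are made up), carving a block.db out of a shared SSD with LVM
and handing both pieces to ceph-volume could look like:

  vgcreate ssd-vg /dev/sdag
  lvcreate -n osd0-db -L 30G ssd-vg
  ceph-volume lvm create --bluestore --data /dev/sda --block.db ssd-vg/osd0-db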

>
>
>>>
>>> 2. Activation does not work via udev anymore, which solves some racy things.
>>>
>>> This second major change makes me curious: How does activation work now?
>>> In the past, I could reinstall the full OS, install ceph packages, trigger 
>>> udev / reboot and all OSDs would come back,
>>> without storing any state / activating any services in the OS.
>>
>> Activation works via systemd. This is explained in detail here
>> http://docs.ceph.com/docs/master/ceph-volume/lvm/activate
>>
>> Nothing with `ceph-volume lvm` requires udev for discovery. If you
>> need to re-install the OS and recover your OSDs all you need to do is
>> to
>> re-activate them. You would need to know what the ID and UUID of the OSDs is.
>>
>> If you don't have that information handy, you can run:
>>
>> ceph-volume lvm list
>>
>> And all the information will be available. This will persist even on
>> system re-installs
>
> Understood - so indeed the manual step would be to run "list" and then 
> activate the OSDs one-by-one
> to re-create the service files.
> More cumbersome than letting udev do it's thing, but it certainly gives more 
> control,
> so it seems preferrable.
>
> Are there plans to have something like

Re: [ceph-users] mon service failed to start

2018-02-21 Thread Behnam Loghmani
Hi there,

I replaced the SSD on the problematic node with a new one and reconfigured
the OSDs and MON service on it,
but the problem occurred again with:

"rocksdb: submit_transaction error: Corruption: block checksum mismatch
code = 2"

I get fully confused now.



On Tue, Feb 20, 2018 at 5:16 PM, Behnam Loghmani 
wrote:

> Hi Caspar,
>
> I checked the filesystem and there isn't any error on filesystem.
> The disk is SSD and it doesn't any attribute related to Wear level in
> smartctl and filesystem is mounted with default options and no discard.
>
> my ceph structure on this node is like this:
>
> it has osd,mon,rgw services
> 1 SSD for OS and WAL/DB
> 2 HDD
>
> OSDs are created by ceph-volume lvm.
>
> the whole SSD is on 1 vg.
> OS is on root lv
> OSD.1 DB is on db-a
> OSD.1 WAL is on wal-a
> OSD.2 DB is on db-b
> OSD.2 WAL is on wal-b
>
> output of lvs:
>
>   data-a data-a -wi-a-
>
>   data-b data-b -wi-a-
>   db-a   vg0-wi-a-
>
>   db-b   vg0-wi-a-
>
>   root   vg0-wi-ao
>
>   wal-a  vg0-wi-a-
>
>   wal-b  vg0-wi-a-
>
> after making a heavy write on the radosgw, OSD.1 and OSD.2 has stopped
> with "block checksum mismatch" error.
> Now on this node MON and OSDs services has stopped working with this error
>
> I think my issue is related to this bug: http://tracker.ceph.com/
> issues/22102
>
> I ran
> #ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-1 --deep 1
> but it returns the same error:
>
> *** Caught signal (Aborted) **
>  in thread 7fbf6c923d00 thread_name:ceph-bluestore-
> 2018-02-20 16:44:30.128787 7fbf6c923d00 -1 abort: Corruption: block
> checksum mismatch
>  ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous
> (stable)
>  1: (()+0x3eb0b1) [0x55f779e6e0b1]
>  2: (()+0xf5e0) [0x7fbf61ae15e0]
>  3: (gsignal()+0x37) [0x7fbf604d31f7]
>  4: (abort()+0x148) [0x7fbf604d48e8]
>  5: (RocksDBStore::get(std::string const&, char const*, unsigned long,
> ceph::buffer::list*)+0x1ce) [0x55f779d2b5ce]
>  6: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x545)
> [0x55f779cd8f75]
>  7: (BlueStore::_fsck(bool, bool)+0x1bb5) [0x55f779cf1a75]
>  8: (main()+0xde0) [0x55f779baab90]
>  9: (__libc_start_main()+0xf5) [0x7fbf604bfc05]
>  10: (()+0x1bc59f) [0x55f779c3f59f]
> 2018-02-20 16:44:30.131334 7fbf6c923d00 -1 *** Caught signal (Aborted) **
>  in thread 7fbf6c923d00 thread_name:ceph-bluestore-
>
>  ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous
> (stable)
>  1: (()+0x3eb0b1) [0x55f779e6e0b1]
>  2: (()+0xf5e0) [0x7fbf61ae15e0]
>  3: (gsignal()+0x37) [0x7fbf604d31f7]
>  4: (abort()+0x148) [0x7fbf604d48e8]
>  5: (RocksDBStore::get(std::string const&, char const*, unsigned long,
> ceph::buffer::list*)+0x1ce) [0x55f779d2b5ce]
>  6: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x545)
> [0x55f779cd8f75]
>  7: (BlueStore::_fsck(bool, bool)+0x1bb5) [0x55f779cf1a75]
>  8: (main()+0xde0) [0x55f779baab90]
>  9: (__libc_start_main()+0xf5) [0x7fbf604bfc05]
>  10: (()+0x1bc59f) [0x55f779c3f59f]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed
> to interpret this.
>
> -1> 2018-02-20 16:44:30.128787 7fbf6c923d00 -1 abort: Corruption:
> block checksum mismatch
>  0> 2018-02-20 16:44:30.131334 7fbf6c923d00 -1 *** Caught signal
> (Aborted) **
>  in thread 7fbf6c923d00 thread_name:ceph-bluestore-
>
>  ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous
> (stable)
>  1: (()+0x3eb0b1) [0x55f779e6e0b1]
>  2: (()+0xf5e0) [0x7fbf61ae15e0]
>  3: (gsignal()+0x37) [0x7fbf604d31f7]
>  4: (abort()+0x148) [0x7fbf604d48e8]
>  5: (RocksDBStore::get(std::string const&, char const*, unsigned long,
> ceph::buffer::list*)+0x1ce) [0x55f779d2b5ce]
>  6: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x545)
> [0x55f779cd8f75]
>  7: (BlueStore::_fsck(bool, bool)+0x1bb5) [0x55f779cf1a75]
>  8: (main()+0xde0) [0x55f779baab90]
>  9: (__libc_start_main()+0xf5) [0x7fbf604bfc05]
>  10: (()+0x1bc59f) [0x55f779c3f59f]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed
> to interpret this.
>
>
>
> Could you please help me to recover this node or find a way to prove SSD
> disk problem.
>
> Best regards,
> Behnam Loghmani
>
>
>
>
> On Mon, Feb 19, 2018 at 1:35 PM, Caspar Smit 
> wrote:
>
>> Hi Behnam,
>>
>> I would firstly recommend running a filesystem check on the monitor disk
>> first to see if there are any inconsistencies.
>>
>> Is the disk where the monitor is running on a spinning disk or SSD?
>>
>> If SSD you should check the Wear level stats through smartctl.
>> Maybe trim (discard) enabled on the filesystem mount? (discard could
>> cause problems/corruption in combination with certain SSD firmwares)
>>
>> Caspar
>>
>> 2018-02-16 23:03 GMT+01:00 Behnam Loghmani :
>>
>>> I checked the disk that monitor is on it with smartctl and it didn't
>>> return any error and it doesn't have any 

Re: [ceph-users] Migrating to new pools

2018-02-21 Thread Jason Dillaman
On Tue, Feb 20, 2018 at 8:35 PM, Rafael Lopez  wrote:
>> There is also work-in-progress for online
>> image migration [1] that will allow you to keep using the image while
>> it's being migrated to a new destination image.
>
>
> Hi Jason,
>
> Is there any recommended method/workaround for live rbd migration in
> luminous? eg. snapshot/copy or export/import then export/import-diff?
> We are looking at options for moving large RBDs (100T) to a new pool with
> minimal downtime.
>
> I was thinking we might be able to configure/hack rbd mirroring to mirror to
> a pool on the same cluster but I gather from the OP and your post that this
> is not really possible?

No, it's not really possible currently and we have no plans to add
such support since it would not be of any long-term value. If you are
using RBD with a kernel block device, you could temporarily wrap two
mapped RBD volumes (original and new) under a md RAID1 with the
original as the primary -- and then do a RAID repair. If using
QEMU+librbd, you could use its built-in block live migration feature
(I've never played w/ this before and I believe you would need to use
the QEMU monitor instead of libvirt to configure).
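
For the snapshot/export-diff route mentioned above, a rough sketch would be
(pool and image names are placeholders; downtime is only needed around the
final diff):

  rbd snap create oldpool/img@mig1
  rbd export oldpool/img@mig1 - | rbd import - newpool/img
  rbd snap create newpool/img@mig1
  # stop the client, then copy only what changed since mig1:
  rbd snap create oldpool/img@mig2
  rbd export-diff --from-snap mig1 oldpool/img@mig2 - | rbd import-diff - newpool/img
  # repoint the client at newpool/img and start it again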

> --
> Rafael Lopez
> Research Devops Engineer
> Monash University eResearch Centre
>
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous 12.2.3 Changelog ?

2018-02-21 Thread Wido den Hollander



On 02/21/2018 01:39 PM, Konstantin Shalygin wrote:

Is there any changelog for this release ?



https://github.com/ceph/ceph/pull/20503


And this one: https://github.com/ceph/ceph/pull/20500

Wido





k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous 12.2.3 Changelog ?

2018-02-21 Thread Konstantin Shalygin

Is there any changelog for this release ?



https://github.com/ceph/ceph/pull/20503



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous 12.2.3 Changelog ?

2018-02-21 Thread Wido den Hollander

They aren't there yet: http://docs.ceph.com/docs/master/release-notes/

And no Git commit yet: 
https://github.com/ceph/ceph/commits/master/doc/release-notes.rst


I think the Release Manager is doing its best to release them asap.

12.2.3 packages were released this morning :)

Wido

On 02/21/2018 12:48 PM, Christoph Adomeit wrote:

Hi there,

I noticed that luminous 12.2.3 is already released.

Is there any changelog for this release ?

Thanks
   Christoph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Luminous 12.2.3 Changelog ?

2018-02-21 Thread Christoph Adomeit
Hi there,

I noticed that luminous 12.2.3 is already released.

Is there any changelog for this release ?

Thanks
  Christoph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to really change public network in ceph

2018-02-21 Thread Mario Giammarco
Let me ask a simpler question: when I change the monitors' network and the
network of the OSDs, how do the monitors learn the new addresses of the OSDs?
Thanks,
Mario
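
(For what it's worth, the monmap-editing part usually boils down to
something like the following, run while the mons are stopped; the mon name
"a" and the new address are placeholders. The OSDs themselves only need the
new public_network/mon_host in ceph.conf and a restart - they register
their new address with the monitors when they boot.)

  ceph-mon -i a --extract-monmap /tmp/monmap
  monmaptool --rm a /tmp/monmap
  monmaptool --add a 10.1.5.10:6789 /tmp/monmap
  ceph-mon -i a --inject-monmap /tmp/monmap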

2018-02-19 10:22 GMT+01:00 Mario Giammarco :

> Hello,
> I have a test proxmox/ceph cluster with four servers.
> I need to change the ceph public subnet from 10.1.0.0/24 to 10.1.5.0/24.
> I have read documentation and tutorials.
> The most critical part seems monitor map editing.
> But it seems to me that osds need to bind to new subnet too.
> I tried to put 10.1.0 and 10.1.5 subnets to public but it seems it changes
> nothing.
> Infact official documentation is unclear: it says you can put in public
> network more than one subnet. It says they must be routed. But it does not
> say what happens when you use multiple subnets or why you should do it.
>
> So I need help on these questions:
> - exact sequence of operations to change public network in ceph (not only
> monitors, also osds)
> - details on multiple subnets in public networks in ceph
>
> Thanks in advance for any help,
> Mario
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] identifying public buckets

2018-02-21 Thread Dave Holland
Hi,

We would like to scan our users' buckets to identify those which are
publicly-accessible, to avoid potential embarrassment (or worse), e.g.
http://www.bbc.co.uk/news/technology-42839462

I didn't find a way to use radosgw-admin to report ACL information for a
given bucket. And using the API to query a bucket's information would
require a valid access key for that bucket. What am I missing, please?

(Ceph 10.2.7)

thanks,
Dave
-- 
** Dave Holland ** Systems Support -- Informatics Systems Group **
** 01223 496923 **Wellcome Sanger Institute, Hinxton, UK**


-- 
 The Wellcome Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PG_DAMAGED Possible data damage: 1 pg inconsistent

2018-02-21 Thread Yoann Moulin
Hello,

I migrated my cluster from jewel to luminous 3 weeks ago (using the
ceph-ansible playbook). A few days later, ceph status told me "PG_DAMAGED
Possible data damage: 1 pg inconsistent". I tried to repair the PG without
success, then tried to stop the OSD, flush the journal and restart it, but
the OSD refused to start due to a bad journal. I decided to destroy the OSD
and recreate it from scratch. After that, everything seemed to be all
right, but I just saw that I now have exactly the same error again on the
same PG on the same OSD (78).

> $ ceph health detail
> HEALTH_ERR 3 scrub errors; Possible data damage: 1 pg inconsistent
> OSD_SCRUB_ERRORS 3 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
> pg 11.5f is active+clean+inconsistent, acting [78,154,170]

> $ ceph -s
>   cluster:
> id: f9dfd27f-c704-4d53-9aa0-4a23d655c7c4
> health: HEALTH_ERR
> 3 scrub errors
> Possible data damage: 1 pg inconsistent
>  
>   services:
> mon: 3 daemons, quorum 
> iccluster002.iccluster.epfl.ch,iccluster010.iccluster.epfl.ch,iccluster018.iccluster.epfl.ch
> mgr: iccluster001(active), standbys: iccluster009, iccluster017
> mds: cephfs-3/3/3 up  
> {0=iccluster022.iccluster.epfl.ch=up:active,1=iccluster006.iccluster.epfl.ch=up:active,2=iccluster014.iccluster.epfl.ch=up:active}
> osd: 180 osds: 180 up, 180 in
> rgw: 6 daemons active
>  
>   data:
> pools:   29 pools, 10432 pgs
> objects: 82862k objects, 171 TB
> usage:   515 TB used, 465 TB / 980 TB avail
> pgs: 10425 active+clean
>  6 active+clean+scrubbing+deep
>  1 active+clean+inconsistent
>  
>   io:
> client:   21538 B/s wr, 0 op/s rd, 33 op/s wr

> ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous 
> (stable)

Short log :

> 2018-02-21 09:08:33.408396 7fb7b8222700  0 log_channel(cluster) log [DBG] : 
> 11.5f repair starts
> 2018-02-21 09:08:33.727277 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
> 11.5f shard 78: soid 
> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head 
> omap_digest 0x29fdd712 != omap_digest 0xd46bb5a1 from auth oi 
> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9- 
> b494-57bdb48fab4e.314528.19:head(98394'20014544 osd.78.0:1623704 
> dirty|omap|data_digest|omap_digest s 0 uv 20014543 dd  od d46bb5a1 
> alloc_hint [0 0 0])
> 2018-02-21 09:08:33.727290 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
> 11.5f shard 154: soid 
> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head 
> omap_digest 0x29fdd712 != omap_digest 0xd46bb5a1 from auth oi 
> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head(98394'20014544
>  osd.78.0:1623704 dirty|omap|data_digest|omap_digest s 0 uv 20014543 dd 
>  od d46bb5a1 alloc_hint [0 0 0])
> 2018-02-21 09:08:33.727293 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
> 11.5f shard 170: soid 
> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head 
> omap_digest 0x29fdd712 != omap_digest 0xd46bb5a1 from auth oi 
> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head(98394'20014544
>  osd.78.0:1623704 dirty|omap|data_digest|omap_digest s 0 uv 20014543 dd 
>  od d46bb5a1 alloc_hint [0 0 0])
> 2018-02-21 09:08:33.727295 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
> 11.5f soid 
> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head: 
> failed to pick suitable auth object
> 2018-02-21 09:08:33.727333 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
> 11.5f repair 3 errors, 0 fixed
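
(The usual way to see exactly which shards disagree, before attempting
anything else, is the inconsistency listing - a sketch using the PG id from
above:)

  rados list-inconsistent-obj 11.5f --format=json-pretty
  ceph pg repair 11.5f    # what was already attempted above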

I set "debug_osd 20/20" on osd.78 and start the repair again, the log file is 
here :

ceph-post-file: 1ccac8ea-0947-4fe4-90b1-32d1048548f1

What can I do in this situation?

Thanks for your help.

-- 
Yoann Moulin
EPFL IC-IT
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com