Re: [ceph-users] MDS is Readonly

2018-05-02 Thread Yan, Zheng
try running "rados -p  touch 1002fc5d22d."
before mds restart
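
(A minimal sketch of that recovery sequence, assuming the CephFS metadata
pool is called "cephfs_metadata" and using a placeholder for the full
object name from the MDS log; adjust both to your cluster:)

  # recreate the missing (empty) dirfrag object so the MDS commit can succeed
  rados -p cephfs_metadata touch <dir-object-name>
  # restart the read-only MDS so it drops the forced read-only state
  systemctl restart ceph-mds@<id>
  # then ask the MDS to re-scrub and repair the affected tree
  ceph daemon mds.<id> scrub_path / recursive repair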

On Thu, May 3, 2018 at 2:31 AM, Pavan, Krish  wrote:
>
>
> We have a Ceph 12.2.4 CephFS cluster with two active MDS servers, and
> directories are pinned to the MDS servers. Yesterday an MDS server crashed.
> Once all FUSE clients had unmounted, we brought the MDS back online. Both
> MDSs are active now.
>
>
>
> Once it came back, we started to see that one MDS is read-only.
>
> …
>
> 2018-05-01 23:41:22.765920 7f71481b8700  1 mds.0.cache.dir(0x1002fc5d22d)
> commit error -2 v 3
>
> 2018-05-01 23:41:22.765964 7f71481b8700 -1 log_channel(cluster) log [ERR] :
> failed to commit dir 0x1002fc5d22d object, errno -2
>
> 2018-05-01 23:41:22.765974 7f71481b8700 -1 mds.0.222755 unhandled write
> error (2) No such file or directory, force readonly...
>
> 2018-05-01 23:41:22.766013 7f71481b8700  1 mds.0.cache force file system
> read-only
>
> 2018-05-01 23:41:22.766019 7f71481b8700  0 log_channel(cluster) log [WRN] :
> force file system read-only
>
> ….
>
>
>
> In the health warning I see:
>
> health: HEALTH_WARN
>
> 1 MDSs are read only
>
> 1 MDSs behind on trimming
>
>
>
> There are no errors on the OSDs backing the metadata pool.
>
> Will "ceph daemon mds.x scrub_path / force recursive repair" fix this? Or
> does an offline data-scan need to be done?
>
>
>
>
>
>
>
> Regards
>
> Krish
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CentOS release 7.4.1708 and selinux-policy-base >= 3.13.1-166.el7_4.9

2018-05-02 Thread ceph . novice

Hi all.

We are trying to set up our first CentOS 7.4.1708 Ceph cluster, based on
Luminous 12.2.5. What we get is:
 

Error: Package: 2:ceph-selinux-12.2.5-0.el7.x86_64 (Ceph-Luminous)
   Requires: selinux-policy-base >= 3.13.1-166.el7_4.9


__Host infos__:

root> lsb_release -d
Description:CentOS Linux release 7.4.1708 (Core)

root@> uname -a
Linux  3.10.0-693.11.1.el7.x86_64 #1 SMP Mon Dec 4 23:52:40 UTC 2017 
x86_64 x86_64 x86_64 GNU/Linux

__Question__:
Where can I find the selinux-policy-base-3.13.1-166.el7_4.9 package?
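
(A hedged way to check what the enabled repos actually provide, assuming
plain yum on CentOS 7; the repo carrying that exact build may differ per
mirror:)

  # list every selinux-policy-base version the enabled repos know about
  yum --showduplicates list selinux-policy-base
  # and show which repos are enabled at all
  yum repolist enabled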


Regards
 Anton
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore on HDD+SSD sync write latency experiences

2018-05-02 Thread Alex Gorbachev
Hi Nick,

On Tue, May 1, 2018 at 4:50 PM, Nick Fisk  wrote:
> Hi all,
>
>
>
> Slowly getting round to migrating clusters to Bluestore but I am interested
> in how people are handling the potential change in write latency coming from
> Filestore? Or maybe nobody is really seeing much difference?
>
>
>
> As we all know, in Bluestore, writes are not double written and in most
> cases go straight to disk. Whilst this is awesome for people with pure SSD
> or pure HDD clusters as the amount of overhead is drastically reduced, for
> people with HDD+SSD journals in Filestore land, the double write had the
> side effect of acting like a battery backed cache, accelerating writes when
> not under saturation.
>
>
>
> In some brief testing I am seeing Filestore OSDs with an NVMe journal show an
> average apply latency of around 1-2ms, whereas some new Bluestore OSDs in
> the same cluster are showing 20-40ms. I am fairly certain this is due to
> writes exhibiting the latency of the underlying 7.2k disk. Note, cluster is
> very lightly loaded, this is not anything being driven into saturation.
>
>
>
> I know there is a deferred write tuning knob which adjusts the cutover for
> when an object is double written, but at the default of 32kb, I suspect a
> lot of IOs even in the 1MB area are still drastically slower going straight
> to disk than if double written to NVMe first. Has anybody else done any
> investigation in this area? Is there any long-term harm in running a cluster
> that defers writes up to 1MB+ in size to mimic the Filestore double-write
> approach?
>
>
>
> I also suspect after looking through github that deferred writes only happen
> when overwriting an existing object or blob (not sure which case applies),
> so new allocations are still written straight to disk. Can anyone confirm?
>
>
>
> PS. If your spinning disks are connected via a RAID controller with BBWC
> then you are not affected by this.

We saw this behavior even on an Areca 1883, which does buffer HDD writes.
The way out was to put the WAL and DB on NVMe drives, which solved the
performance problems.
--
Alex Gorbachev
Storcium
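
(For reference, a minimal sketch of how such an OSD could be created with
ceph-volume; the device paths are placeholders:)

  # data on the HDD, RocksDB and WAL on separate NVMe partitions or LVs
  # (if only --block.db is given, the WAL is colocated with the DB)
  ceph-volume lvm create --bluestore --data /dev/sdb \
      --block.db /dev/nvme0n1p1 --block.wal /dev/nvme0n1p2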

>
>
>
> Thanks,
>
> Nick
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS is Readonly

2018-05-02 Thread Pavan, Krish

We have a Ceph 12.2.4 CephFS cluster with two active MDS servers, and
directories are pinned to the MDS servers. Yesterday an MDS server crashed.
Once all FUSE clients had unmounted, we brought the MDS back online. Both
MDSs are active now.

Once it came back, we started to see that one MDS is read-only.
...
2018-05-01 23:41:22.765920 7f71481b8700  1 mds.0.cache.dir(0x1002fc5d22d) 
commit error -2 v 3
2018-05-01 23:41:22.765964 7f71481b8700 -1 log_channel(cluster) log [ERR] : 
failed to commit dir 0x1002fc5d22d object, errno -2
2018-05-01 23:41:22.765974 7f71481b8700 -1 mds.0.222755 unhandled write error 
(2) No such file or directory, force readonly...
2018-05-01 23:41:22.766013 7f71481b8700  1 mds.0.cache force file system 
read-only
2018-05-01 23:41:22.766019 7f71481b8700  0 log_channel(cluster) log [WRN] : 
force file system read-only


In the health warning I see:
health: HEALTH_WARN
1 MDSs are read only
1 MDSs behind on trimming

There are no errors on the OSDs backing the metadata pool.
Will "ceph daemon mds.x scrub_path / force recursive repair" fix this? Or
does an offline data-scan need to be done?
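
(For reference, hedged sketches of both options; the MDS id and pool name
are placeholders, and the offline path is a last resort that requires the
filesystem to be taken down first:)

  # online: ask the active MDS to scrub and repair the tree
  ceph daemon mds.<id> scrub_path / force recursive repair

  # offline: rebuild metadata from the data pool with the MDS stopped
  cephfs-data-scan scan_extents <data pool>
  cephfs-data-scan scan_inodes <data pool>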



Regards
Krish
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Announcing mountpoint, August 27-28, 2018

2018-05-02 Thread Amye Scavarda
Our first mountpoint is coming!
Software-defined Storage (SDS) is changing the traditional way we
think of storage. Decoupling software from hardware allows you to
choose your hardware vendors and provides enterprises with more
flexibility.

Attend mountpoint on August 27-28, 2018 in Vancouver, BC, before Open
Source Summit North America, for this first-time event. We are joining
forces with the Ceph and Gluster communities, SDS experts, and partners
to bring you an exciting two-day event. Help lead the conversation on
open source software-defined storage and share your knowledge!

Our CFP is open from May 3rd through June 15th, 2018.
More details available, including sponsorship:
http://mountpoint.io/

--
Amye Scavarda | a...@redhat.com | Gluster Community Lead
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-05-02 Thread Nicolas Huillard
Le dimanche 08 avril 2018 à 20:40 +, Jens-U. Mozdzen a écrit :
> sorry for bringing up that old topic again, but we just faced a  
> corresponding situation and have successfully tested two migration  
> scenarios.

Thank you very much for this update, as I needed to do exactly that,
due to an SSD crash triggering hardware replacement.
The block.db volumes on the crashed SSD were lost, so the two OSDs
depending on it were re-created. I also replaced two other bad SSDs
before they failed, and thus needed to effectively replace DB/WAL devices
on the live cluster (2 SSDs on 2 hosts, 4 OSDs in total).

> it is possible to move a separate WAL/DB to a new device, though
> without changing the size. We have done this for multiple OSDs, using
> only existing (mainstream :) ) tools, and have documented the procedure
> in
> http://heiterbiswolkig.blogs.nde.ag/2018/04/08/migrating-bluestores-block-db/
> It will *not* allow you to separate the WAL/DB after OSD creation, nor
> does it allow changing the DB size.

The lost OSDs were still backfilling when I did the above procedure
(data redundancy was high enough to risk losing one more node). I even
mis-typed the "ceph osd set noout" command ("ceph osd unset noout"
instead, effectively a no-op), and replaced 2 OSDs on a single host at
the same time (thus taking more time than the 10 minutes before the
OSDs are kicked out, and triggering even more data movement).
Everything went cleanly though, thanks to your detailed commands, which
I ran one at a time, thinking twice before each [Enter].

I dug a bit into the LVM tags:
* make a backup of all pv/vg/lv config: vgcfgbackup
* check the backed-up tags: grep tags /etc/lvm/backup/*

I then noticed that:
* there are lots of "ceph.*=" tags
* tags are still present on the old DB/WAL LVs (since I didn't remove
them)
* tags are absent from the new DB/WAL LVs (ditto, I didn't create
them), which may be a problem later on...
* I changed the ceph.db_device= tag, but there is also a ceph.db_uuid=
tag which was not changed, and may or may not trigger a problem upon
reboot (I don't know if this UUID is part of the dd'ed data); see the
sketch below for inspecting and re-adding these tags
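
(A minimal sketch of inspecting and re-adding those tags with plain LVM
tools; the VG/LV names and values are placeholders, and taking a fresh
vgcfgbackup first is a good idea:)

  # show every tag ceph-volume put on the LVs
  lvs -o lv_name,vg_name,lv_tags
  # add the missing tags to the relevant LV (repeat per ceph.* tag)
  lvchange --addtag ceph.db_device=<new-db-device> <vg>/<lv>
  lvchange --addtag ceph.db_uuid=<new-db-uuid> <vg>/<lv>
  # and drop the stale ones that still point at the old device
  lvchange --deltag ceph.db_device=<old-db-device> <vg>/<lv>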

You effectively helped a lot! Thanks.

-- 
Nicolas Huillard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] GDPR encryption at rest

2018-05-02 Thread David Turner
At 'rest' means data on its own, not being accessed through an
application.  Encryption at rest is most commonly done by encrypting the
block device with something like dmcrypt; it's anything that makes having
the physical disk useless without being able to decrypt it.  You can also
just encrypt a folder with sensitive information, which would also be
encryption at rest.  Encryption that is not at rest would be something like
putting a secure layer between the data and the users that access it, such
as HTTPS/SSL.
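
(A quick, hedged way to confirm that an OSD's data device really is
encrypted at rest; the device and mapper names are placeholders:)

  # a dmcrypt-backed OSD shows a "crypt" layer between the raw device and the OSD
  lsblk -o NAME,TYPE,MOUNTPOINT /dev/sdb
  # and cryptsetup confirms the mapping is LUKS
  cryptsetup status <dm-mapper-name>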

On Wed, May 2, 2018 at 11:25 AM Alfredo Deza  wrote:

> On Wed, May 2, 2018 at 11:12 AM, David Turner 
> wrote:
> > I've heard conflicting opinions if GDPR requires data to be encrypted at
> > rest, but enough of our customers believe that it is that we're looking
> at
> > addressing it in our clusters.  I had a couple questions about the state
> of
> > encryption in ceph.
> >
> > 1) My experience with encryption in Ceph is dmcrypt, is this still the
> > standard method or is there something new with bluestore?
>
> Standard, yes.
>
> > 2) Assuming dmcrypt is still the preferred option, is it fully
> > supported/tested in ceph-volume?  There were problems with this when
> > ceph-volume was initially released, but I believe those have been
> resolved.
>
> It is fully supported, but only with LUKS. The initial release of
> ceph-volume didn't have dmcrypt support.
>
> > 3) Any other thoughts about encryption at rest?  I have an upgrade path
> to
> > get to encryption (basically the same as getting to bluestore from
> > filestore).
>
> Not sure what you mean by 'rest'. The ceph-volume encryption would
> give you the same type of encryption that was provided by ceph-disk
> with the only "gotcha" being it is LUKS (plain is not supported for
> newly encrypted devices)
>
> >
> > Thanks for your comments.
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] GDPR encryption at rest

2018-05-02 Thread Alfredo Deza
On Wed, May 2, 2018 at 11:12 AM, David Turner  wrote:
> I've heard conflicting opinions if GDPR requires data to be encrypted at
> rest, but enough of our customers believe that it is that we're looking at
> addressing it in our clusters.  I had a couple questions about the state of
> encryption in ceph.
>
> 1) My experience with encryption in Ceph is dmcrypt, is this still the
> standard method or is there something new with bluestore?

Standard, yes.

> 2) Assuming dmcrypt is still the preferred option, is it fully
> supported/tested in ceph-volume?  There were problems with this when
> ceph-volume was initially released, but I believe those have been resolved.

It is fully supported, but only with LUKS. The initial release of
ceph-volume didn't have dmcrypt support.

> 3) Any other thoughts about encryption at rest?  I have an upgrade path to
> get to encryption (basically the same as getting to bluestore from
> filestore).

Not sure what you mean by 'rest'. The ceph-volume encryption would
give you the same type of encryption that was provided by ceph-disk,
with the only "gotcha" being that it is LUKS (plain is not supported for
newly encrypted devices).
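
(For what it's worth, a minimal sketch of creating an encrypted Bluestore
OSD with ceph-volume; the device path is a placeholder:)

  # --dmcrypt encrypts the underlying devices with LUKS
  ceph-volume lvm create --bluestore --dmcrypt --data /dev/sdb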

>
> Thanks for your comments.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] GDPR encryption at rest

2018-05-02 Thread David Turner
I've heard conflicting opinions on whether GDPR requires data to be encrypted
at rest, but enough of our customers believe it does that we're looking at
addressing it in our clusters.  I had a couple of questions about the state of
encryption in Ceph.

1) My experience with encryption in Ceph is dmcrypt, is this still the
standard method or is there something new with bluestore?
2) Assuming dmcrypt is still the preferred option, is it fully
supported/tested in ceph-volume?  There were problems with this when
ceph-volume was initially released, but I believe those have been resolved.
3) Any other thoughts about encryption at rest?  I have an upgrade path to
get to encryption (basically the same as getting to bluestore from
filestore).

Thanks for your comments.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore on HDD+SSD sync write latency experiences

2018-05-02 Thread Igor Fedotov

Hi Nick,

On 5/1/2018 11:50 PM, Nick Fisk wrote:


Hi all,

Slowly getting round to migrating clusters to Bluestore but I am 
interested in how people are handling the potential change in write 
latency coming from Filestore? Or maybe nobody is really seeing much 
difference?


As we all know, in Bluestore, writes are not double written and in 
most cases go straight to disk. Whilst this is awesome for people with 
pure SSD or pure HDD clusters as the amount of overhead is drastically 
reduced, for people with HDD+SSD journals in Filestore land, the 
double write had the side effect of acting like a battery backed 
cache, accelerating writes when not under saturation.


In some brief testing I am seeing Filestore OSDs with an NVMe journal
show an average apply latency of around 1-2ms, whereas some new
Bluestore OSDs in the same cluster are showing 20-40ms. I am fairly
certain this is due to writes exhibiting the latency of the underlying 
7.2k disk. Note, cluster is very lightly loaded, this is not anything 
being driven into saturation.


I know there is a deferred write tuning knob which adjusts the cutover
for when an object is double written, but at the default of 32kb, I
suspect a lot of IOs even in the 1MB area are still drastically
slower going straight to disk than if double written to NVMe first.
Has anybody else done any investigation in this area? Is there any
long-term harm in running a cluster that defers writes up to 1MB+ in
size to mimic the Filestore double-write approach?


This should work fine under low load, but be careful when the load rises:
RocksDB and the machinery around it might become a bottleneck in
this scenario.


I also suspect after looking through github that deferred writes only 
happen when overwriting an existing object or blob (not sure which 
case applies), so new allocations are still written straight to disk. 
Can anyone confirm?


"small" writes (length < min_alloc_size) are direct if they go to unused 
chunk (4K or more depending on checksum settings) of an existing mutable 
block and write length > bluestore_prefer_deferred_size only.
E.g. appending with 4K data  blocks to an object at HDD will trigger 
deferred mode for the first of every 16 writes (given that default 
min_alloc_size for HDD is 64K). Rest 15 go direct.


"big" writes are unconditionally deferred if length <= 
bluestore_prefer_deferred_size.
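
(A hedged sketch of inspecting and raising that threshold on a running
cluster; the OSD id and value are placeholders, and as noted above a larger
cutover pushes more load onto RocksDB:)

  # check the current cutover on one OSD (HDD variant)
  ceph daemon osd.0 config get bluestore_prefer_deferred_size_hdd
  # raise it at runtime, e.g. to 1 MB, on all OSDs
  ceph tell osd.* injectargs '--bluestore_prefer_deferred_size_hdd 1048576'
  # or persist it in ceph.conf under [osd]:
  #   bluestore_prefer_deferred_size_hdd = 1048576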


PS. If your spinning disks are connected via a RAID controller with 
BBWC then you are not affected by this.


Thanks,

Nick



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph scrub logs: _scan_snaps no head for $object?

2018-05-02 Thread Stefan Kooman
Hi,

Quoting Stefan Kooman (ste...@bit.nl):
> Hi,
> 
> We see the following in the logs after we start a scrub for some osds:
> 
> ceph-osd.2.log:2017-12-14 06:50:47.180344 7f0f47db2700  0 
> log_channel(cluster) log [DBG] : 1.2d8 scrub starts
> ceph-osd.2.log:2017-12-14 06:50:47.180915 7f0f47db2700 -1 osd.2 pg_epoch: 
> 11897 pg[1.2d8( v 11890'165209 (3221'163647,11890'165209] 
> local-lis/les=11733/11734 n=67 ec=132/132 lis/c 11733/11733 les/c/f 
> 11734/11734/0 11733/11733/11733) [2,45,31] r=0 lpr=11733 crt=11890'165209 
> lcod 11890'165208 mlcod 11890'165208 active+clean+scrubbing] _scan_snaps no 
> head for 1:1b518155:::rbd_data.620652ae8944a.0126:29 (have MIN)
> ceph-osd.2.log:2017-12-14 06:50:47.180929 7f0f47db2700 -1 osd.2 pg_epoch: 
> 11897 pg[1.2d8( v 11890'165209 (3221'163647,11890'165209] 
> local-lis/les=11733/11734 n=67 ec=132/132 lis/c 11733/11733 les/c/f 
> 11734/11734/0 11733/11733/11733) [2,45,31] r=0 lpr=11733 crt=11890'165209 
> lcod 11890'165208 mlcod 11890'165208 active+clean+scrubbing] _scan_snaps no 
> head for 1:1b518155:::rbd_data.620652ae8944a.0126:14 (have MIN)
> ceph-osd.2.log:2017-12-14 06:50:47.180941 7f0f47db2700 -1 osd.2 pg_epoch: 
> 11897 pg[1.2d8( v 11890'165209 (3221'163647,11890'165209] 
> local-lis/les=11733/11734 n=67 ec=132/132 lis/c 11733/11733 les/c/f 
> 11734/11734/0 11733/11733/11733) [2,45,31] r=0 lpr=11733 crt=11890'165209 
> lcod 11890'165208 mlcod 11890'165208 active+clean+scrubbing] _scan_snaps no 
> head for 1:1b518155:::rbd_data.620652ae8944a.0126:a (have MIN)
> ceph-osd.2.log:2017-12-14 06:50:47.214198 7f0f43daa700  0 
> log_channel(cluster) log [DBG] : 1.2d8 scrub ok
> 
> So finally it logs "scrub ok", but what does " _scan_snaps no head for ..." 
> mean?
> Does this indicate a problem?

Still seeing this issue on a freshly installed Luminous cluster. I
*think* it has to do either with "cloned" RBDs that get snapshots of
their own, or with RBDs that are cloned from a snapshot.

Is there any dev who wants to debug this behaviour if I'm able to
reliably reproduce it?
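
(If it helps, a hedged sketch of mapping the rbd_data prefix from those log
lines back to an image and its snapshot/clone lineage; the pool and image
names are placeholders:)

  # find the image whose block_name_prefix matches the prefix in the log
  for img in $(rbd ls -p <pool>); do
      rbd info <pool>/$img | grep -q 620652ae8944a && echo "$img"
  done
  # then inspect its snapshots and any parent/child relationships
  rbd snap ls <pool>/<image>
  rbd children <pool>/<image>@<snap>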

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com