[ceph-users] Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2019-08-13 Thread Troy Ablan

I've opened a tracker issue at https://tracker.ceph.com/issues/41240

Background: a cluster of 13 hosts, 5 of which contain 14 SSD OSDs between 
them.  There are 409 HDD OSDs as well.


The SSDs contain the RGW index and log pools, and some smaller pools.
The HDDs contain all other pools, including the RGW data pool.

The RGW instance contains just over 1 billion objects across about 65k 
buckets.  I don't know of any action on the cluster that would have 
caused this.  There have been no changes to the crush map in months, but 
HDDs were added a couple of weeks ago and backfilling is still in progress, 
though in the home stretch.


I don't know what I can do at this point, though something suggests the 
osdmap on these OSDs is wrong and/or corrupted.  A log excerpt from the 
crash is included below.  All of the OSD logs I checked look very similar.
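
For what it's worth, a hedged sketch of how the on-disk osdmap of one of these 
OSDs could be inspected and compared against the monitors' copy (OSD id and data 
path follow the log excerpt below; the OSD must be stopped, and the output file 
names are only illustrative):

systemctl stop ceph-osd@46
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-46 \
    --op get-osdmap --file /tmp/osdmap.46    # extract the map the OSD is choking on
osdmaptool --print /tmp/osdmap.46            # does it decode outside the OSD?
ceph osd getmap -o /tmp/osdmap.mon           # current map straight from the monitors, to compare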





2019-08-13 18:09:52.913 7f76484e9d80  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm
/el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:3362] Recovered 
from manifest file:db/MANIFEST-245361 succeeded,manifest_file_number is 
245361, next_file_number is 245364, last_sequence is 606668564
6, log_number is 0,prev_log_number is 0,max_column_family is 
0,deleted_log_number is 245359


2019-08-13 18:09:52.913 7f76484e9d80  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm
/el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:3370] Column family 
[default] (ID 0), log number is 245360


2019-08-13 18:09:52.918 7f76484e9d80  4 rocksdb: EVENT_LOG_v1 
{"time_micros": 1565719792920682, "job": 1, "event": "recovery_started", 
"log_files": [245362]}
2019-08-13 18:09:52.918 7f76484e9d80  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm
/el7/BUILD/ceph-13.2.6/src/rocksdb/db/db_impl_open.cc:551] Recovering 
log #245362 mode 0
2019-08-13 18:09:52.919 7f76484e9d80  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm
/el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:2863] Creating 
manifest 245364


2019-08-13 18:09:52.933 7f76484e9d80  4 rocksdb: EVENT_LOG_v1 
{"time_micros": 1565719792935329, "job": 1, "event": "recovery_finished"}
2019-08-13 18:09:52.951 7f76484e9d80  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm
/el7/BUILD/ceph-13.2.6/src/rocksdb/db/db_impl_open.cc:1218] DB pointer 
0x56445a6c8000
2019-08-13 18:09:52.951 7f76484e9d80  1 
bluestore(/var/lib/ceph/osd/ceph-46) _open_db opened rocksdb path db 
options 
compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=

1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152
2019-08-13 18:09:52.964 7f76484e9d80  1 freelist init
2019-08-13 18:09:52.976 7f76484e9d80  1 
bluestore(/var/lib/ceph/osd/ceph-46) _open_alloc opening allocation metadata
2019-08-13 18:09:53.119 7f76484e9d80  1 
bluestore(/var/lib/ceph/osd/ceph-46) _open_alloc loaded 926 GiB in 13292 
extents

2019-08-13 18:09:53.133 7f76484e9d80 -1 *** Caught signal (Aborted) **
 in thread 7f76484e9d80 thread_name:ceph-osd

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic 
(stable)

 1: (()+0xf5d0) [0x7f763c4455d0]
 2: (gsignal()+0x37) [0x7f763b466207]
 3: (abort()+0x148) [0x7f763b4678f8]
 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f763bd757d5]
 5: (()+0x5e746) [0x7f763bd73746]
 6: (()+0x5e773) [0x7f763bd73773]
 7: (__cxa_rethrow()+0x49) [0x7f763bd739e9]
 8: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x18b8) 
[0x7f763fcb48d8]

 9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x4ad) [0x7f763fa924ad]
 10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f763fa94db1]
 11: (OSDService::try_get_map(unsigned int)+0x4f8) [0x5644576e1e08]
 12: (OSDService::get_map(unsigned int)+0x1e) [0x564457743dae]
 13: (OSD::init()+0x1d32) [0x5644576ef982]
 14: (main()+0x23a3) [0x5644575cc7a3]
 15: (__libc_start_main()+0xf5) [0x7f763b4523d5]
 16: (()+0x385900) [0x5644576a4900]
 NOTE: a copy of the executable, or `objdump -rdS <executable>`, is 
needed to interpret this.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] WAL/DB size

2019-08-13 Thread Mark Nelson

On 8/13/19 3:51 PM, Paul Emmerich wrote:


On Tue, Aug 13, 2019 at 10:04 PM Wido den Hollander  wrote:

I just checked an RGW-only setup. 6TB drive, 58% full, 11.2GB of DB in
use. No slow db in use.

random rgw-only setup here: 12TB drive, 77% full, 48GB metadata and
10GB omap for index and whatever.

That's 0.5% + 0.1%. And that's a setup that's using mostly erasure
coding and small-ish objects.



I've talked with many people from the community and I don't see an
agreement for the 4% rule.

agreed, 4% isn't a reasonable default.
I've seen setups with even 10% metadata usage, but these are weird
edge cases with very small objects on NVMe-only setups (obviously
without a separate DB device).

Paul



I agree, and I did quite a bit of the early space usage analysis.  I 
have a feeling that someone well-meaning was trying to give users a 
simple ratio to target that was big enough to handle the 
majority of use cases.  The problem is that reality isn't that simple, 
and one-size-fits-all doesn't really work here.



For RBD you can usually get away with far less than 4%.  A small 
fraction of that is often sufficient.  For tiny (say 4K) RGW objects 
(especially objects with very long names!) you can potentially end up 
using significantly more than 4%. Unfortunately there's no really good 
way for us to normalize this so long as RGW is using OMAP to store 
bucket indexes.  I think the best we can do in the long run is make it much 
clearer how space is being used on the block/db/wal devices and easier 
for users to shrink/grow the amount of "fast" disk they have on an OSD. 
Alternatively, we could put bucket indexes into RADOS objects instead of 
OMAP, but that would be a pretty big project (with its own challenges 
but potentially also with rewards).
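
In the meantime, one way to see what an individual OSD is actually consuming on 
its DB/WAL devices today is the bluefs perf counters over the admin socket (osd.0 
and the counter names below are from mimic/nautilus; jq is assumed to be available):

ceph daemon osd.0 perf dump | \
    jq '.bluefs | {db_total_bytes, db_used_bytes, slow_total_bytes, slow_used_bytes}'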



Mark

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph capacity versus pool replicated size discrepancy?

2019-08-13 Thread Konstantin Shalygin

Hey guys, this is probably a really silly question, but I’m trying to reconcile 
where all of my space has gone in one cluster that I am responsible for.

The cluster is made up of 36 2TB SSDs across 3 nodes (12 OSDs per node), all 
using FileStore on XFS.  We are running Ceph Luminous 12.2.8 on this particular 
cluster. The only pool where data is heavily stored is the “rbd” pool, of which 
7.09TiB is consumed.  With a replication of “3”, I would expect the raw 
used to be close to 21TiB, but it’s actually closer to 35TiB.  Some additional 
details are below.  Any thoughts?

[cluster] root@dashboard:~# ceph df
GLOBAL:
    SIZE        AVAIL       RAW USED    %RAW USED
    62.8TiB     27.8TiB     35.1TiB     55.81
POOLS:
    NAME                        ID    USED       %USED    MAX AVAIL    OBJECTS
    rbd                         0     7.09TiB    53.76    6.10TiB      3056783
    data                        3     29.4GiB    0.47     6.10TiB      7918
    metadata                    4     57.2MiB    0        6.10TiB      95
    .rgw.root                   5     1.09KiB    0        6.10TiB      4
    default.rgw.control         6     0B         0        6.10TiB      8
    default.rgw.meta            7     0B         0        6.10TiB      0
    default.rgw.log             8     0B         0        6.10TiB      207
    default.rgw.buckets.index   9     0B         0        6.10TiB      0
    default.rgw.buckets.data    10    0B         0        6.10TiB      0
    default.rgw.buckets.non-ec  11    0B         0        6.10TiB      0

[cluster] root@dashboard:~# ceph --version
ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)

[cluster] root@dashboard:~# ceph osd dump | grep 'replicated size'
pool 0 'rbd' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins 
pg_num 682 pgp_num 682 last_change 414873 flags hashpspool 
min_write_recency_for_promote 1 stripe_width 0 application rbd
pool 3 'data' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins 
pg_num 682 pgp_num 682 last_change 409614 flags hashpspool 
crash_replay_interval 45 min_write_recency_for_promote 1 stripe_width 0 
application cephfs
pool 4 'metadata' replicated size 3 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 682 pgp_num 682 last_change 409617 flags hashpspool 
min_write_recency_for_promote 1 stripe_width 0 application cephfs
pool 5 '.rgw.root' replicated size 3 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 409 pgp_num 409 last_change 409710 lfor 0/336229 flags 
hashpspool stripe_width 0 application rgw
pool 6 'default.rgw.control' replicated size 3 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 409 pgp_num 409 last_change 409711 lfor 0/336232 
flags hashpspool stripe_width 0 application rgw
pool 7 'default.rgw.meta' replicated size 3 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 409 pgp_num 409 last_change 409713 lfor 0/336235 flags 
hashpspool stripe_width 0 application rgw
pool 8 'default.rgw.log' replicated size 3 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 409 pgp_num 409 last_change 409712 lfor 0/336238 flags 
hashpspool stripe_width 0 application rgw
pool 9 'default.rgw.buckets.index' replicated size 3 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 409 pgp_num 409 last_change 409714 lfor 0/336241 
flags hashpspool stripe_width 0 application rgw
pool 10 'default.rgw.buckets.data' replicated size 3 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 409 pgp_num 409 last_change 409715 lfor 0/336244 
flags hashpspool stripe_width 0 application rgw
pool 11 'default.rgw.buckets.non-ec' replicated size 3 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 409 pgp_num 409 last_change 409716 lfor 0/336247 
flags hashpspool stripe_width 0 application rgw

[cluster] root@dashboard:~# ceph osd lspools
0 rbd,3 data,4 metadata,5 .rgw.root,6 default.rgw.control,7 default.rgw.meta,8 
default.rgw.log,9 default.rgw.buckets.index,10 default.rgw.buckets.data,11 
default.rgw.buckets.non-ec,

[cluster] root@dashboard:~# rados df
POOL_NAME  USED  OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS  RD  WR_OPS  WR
.rgw.root  1.09KiB  4  0  12  0  00  128KiB  0  0B
data  29.4GiB  7918  0  23754  0  00  1414777  3.74TiB  3524833  4.54TiB
default.rgw.buckets.data  0B  0  0  0  0  0

Re: [ceph-users] reproducible rbd-nbd crashes

2019-08-13 Thread Mike Christie
On 08/13/2019 07:04 PM, Mike Christie wrote:
> On 07/31/2019 05:20 AM, Marc Schöchlin wrote:
>> Hello Jason,
>>
>> it seems that there is something wrong in the rbd-nbd implementation.
>> (added this information also at  https://tracker.ceph.com/issues/40822)
>>
>> The problem not seems to be related to kernel releases, filesystem types or 
>> the ceph and network setup.
>> Release 12.2.5 seems to work properly, and at least releases >= 12.2.10 
>> seems to have the described problem.
>>
>> This night a 18 hour testrun with the following procedure was successful:
>> -
>> #!/bin/bash
>> set -x
>> while true; do
>>date
>>find /srv_ec -type f -name "*.MYD" -print0 |head -n 50|xargs -0 -P 10 -n 
>> 2 gzip -v
>>date
>>find /srv_ec -type f -name "*.MYD.gz" -print0 |head -n 50|xargs -0 -P 10 
>> -n 2 gunzip -v
>> done
>> -
>> Previous tests crashed in a reproducible manner with "-P 1" (single io 
>> gzip/gunzip) after a few minutes up to 45 minutes.
>>
>> Overview of my tests:
>>
>> - SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system, 
>> 120s device timeout
>>   -> 18 hour testrun was successful, no dmesg output
>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s 
>> device timeout
>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io 
>> errors, map/mount can be re-created without reboot
>>   -> parallel krbd device usage with 99% io usage worked without a problem 
>> while running the test
>> - FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s 
>> device timeout
>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io 
>> errors, map/mount can be re-created
>>   -> parallel krbd device usage with 99% io usage worked without a problem 
>> while running the test
>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no 
>> timeout
>>   -> failed after < 10 minutes
>>   -> system runs in a high system load, system is almost unusable, unable to 
>> shutdown the system, hard reset of vm necessary, manual exclusive lock 
>> removal is necessary before remapping the device
> 
> Did you see Mykola's question on the tracker about this test? Did the
> system become unusable at 13:00?
> 
> Above you said it took less than 10 minutes, so we want to clarify if
> the test started at 12:39 and failed at 12:49 or if it started at 12:49
> and failed by 13:00.
> 
>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 
>> 120s device timeout
>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io 
>> errors, map/mount can be re-created
> 
> How many CPUs and how much memory does the VM have?
> 
> I'm not sure which test it covers above, but for
> test-with-timeout/ceph-client.archiv.log and dmesg-crash it looks like
> the command that probably triggered the timeout got stuck in safe_write
> or write_fd, because we see:
> 
> // Command completed and right after this log message we try to write
> the reply and data to the nbd.ko module.
> 
> 2019-07-29 21:55:21.148118 7fffbf7fe700 20 rbd-nbd: writer_entry: got:
> [4500 READ 24043755000~2 0]
> 
> // We got stuck and 2 minutes go by and so the timeout fires. That kills
> the socket, so we get an error here and after that rbd-nbd is going to exit.
> 
> 2019-07-29 21:57:21.785111 7fffbf7fe700 -1 rbd-nbd: [4500
> READ 24043755000~2 0]: failed to write replay data: (32) Broken pipe
> 
> We could hit this in a couple ways:
> 
> 1. The block layer sends a command that is larger than the socket's send
> buffer limits. These are those values you sometimes set in sysctl.conf like:
> 
> net.core.rmem_max
> net.core.wmem_max
> net.core.rmem_default
> net.core.wmem_default
> net.core.optmem_max
> 
> There does not seem to be any checks/code to make sure there is some
> alignment with limits. I will send a patch but that will not help you
> right now. The max io size for nbd is 128k so make sure your net values
> are large enough. Increase the values in sysctl.conf and retry if they
> were too small.

Not sure what I was thinking. Just checked the logs and we have done IO
of the same size that got stuck and it was fine, so the socket sizes
should be ok.

We still need to add code to make sure IO sizes and the af_unix sockets
size limits match up.


> 
> 2. If memory is low on the system, we could be stuck trying to allocate
> memory in the kernel in that code path too.
> 
> rbd-nbd just uses more memory per device, so it could be why we do not
> see a problem with krbd.
> 
> 3. I wonder if we are hitting a bug with PF_MEMALLOC Ilya hit with krbd.
> He removed that code from the krbd. I will ping him on that.
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com

Re: [ceph-users] reproducible rbd-nbd crashes

2019-08-13 Thread Mike Christie
On 07/31/2019 05:20 AM, Marc Schöchlin wrote:
> Hello Jason,
> 
> it seems that there is something wrong in the rbd-nbd implementation.
> (added this information also at  https://tracker.ceph.com/issues/40822)
> 
> The problem does not seem to be related to kernel releases, filesystem types, or 
> the ceph and network setup.
> Release 12.2.5 seems to work properly, and at least releases >= 12.2.10 seem 
> to have the described problem.
> 
> Last night an 18-hour test run with the following procedure was successful:
> -
> #!/bin/bash
> set -x
> while true; do
>date
>find /srv_ec -type f -name "*.MYD" -print0 |head -n 50|xargs -0 -P 10 -n 2 
> gzip -v
>date
>find /srv_ec -type f -name "*.MYD.gz" -print0 |head -n 50|xargs -0 -P 10 
> -n 2 gunzip -v
> done
> -
> Previous tests crashed in a reproducible manner with "-P 1" (single io 
> gzip/gunzip) after a few minutes up to 45 minutes.
> 
> Overview of my tests:
> 
> - SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system, 120s 
> device timeout
>   -> 18 hour testrun was successful, no dmesg output
> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s 
> device timeout
>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io 
> errors, map/mount can be re-created without reboot
>   -> parallel krbd device usage with 99% io usage worked without a problem 
> while running the test
> - FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s 
> device timeout
>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io 
> errors, map/mount can be re-created
>   -> parallel krbd device usage with 99% io usage worked without a problem 
> while running the test
> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
>   -> failed after < 10 minutes
>   -> system runs in a high system load, system is almost unusable, unable to 
> shutdown the system, hard reset of vm necessary, manual exclusive lock 
> removal is necessary before remapping the device

Did you see Mykola's question on the tracker about this test? Did the
system become unusable at 13:00?

Above you said it took less than 10 minutes, so we want to clarify if
the test started at 12:39 and failed at 12:49 or if it started at 12:49
and failed by 13:00.

> - FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 
> 120s device timeout
>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io 
> errors, map/mount can be re-created

How many CPUs and how much memory does the VM have?

I'm not sure which test it covers above, but for
test-with-timeout/ceph-client.archiv.log and dmesg-crash it looks like
the command that probably triggered the timeout got stuck in safe_write
or write_fd, because we see:

// Command completed and right after this log message we try to write
the reply and data to the nbd.ko module.

2019-07-29 21:55:21.148118 7fffbf7fe700 20 rbd-nbd: writer_entry: got:
[4500 READ 24043755000~2 0]

// We got stuck and 2 minutes go by and so the timeout fires. That kills
the socket, so we get an error here and after that rbd-nbd is going to exit.

2019-07-29 21:57:21.785111 7fffbf7fe700 -1 rbd-nbd: [4500
READ 24043755000~2 0]: failed to write replay data: (32) Broken pipe

We could hit this in a couple ways:

1. The block layer sends a command that is larger than the socket's send
buffer limits. These are those values you sometimes set in sysctl.conf like:

net.core.rmem_max
net.core.wmem_max
net.core.rmem_default
net.core.wmem_default
net.core.optmem_max

There does not seem to be any checks/code to make sure there is some
alignment with limits. I will send a patch but that will not help you
right now. The max io size for nbd is 128k, so make sure your net values
are large enough. Increase the values in sysctl.conf and retry if they
were too small (a sketch is included after this list).

2. If memory is low on the system, we could be stuck trying to allocate
memory in the kernel in that code path too.

rbd-nbd just uses more memory per device, so it could be why we do not
see a problem with krbd.

3. I wonder if we are hitting a bug with PF_MEMALLOC Ilya hit with krbd.
He removed that code from the krbd. I will ping him on that.
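
As a purely illustrative example of raising those socket buffer ceilings before
retrying (the values below are placeholders, not recommendations):

cat > /etc/sysctl.d/90-nbd-sockbuf.conf <<'EOF'
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.core.rmem_default = 1048576
net.core.wmem_default = 1048576
net.core.optmem_max = 131072
EOF
sysctl --system    # reload all sysctl.d drop-ins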


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] WAL/DB size

2019-08-13 Thread Paul Emmerich
On Tue, Aug 13, 2019 at 10:04 PM Wido den Hollander  wrote:
> I just checked an RGW-only setup. 6TB drive, 58% full, 11.2GB of DB in
> use. No slow db in use.

random rgw-only setup here: 12TB drive, 77% full, 48GB metadata and
10GB omap for index and whatever.

That's 0.5% + 0.1%. And that's a setup that's using mostly erasure
coding and small-ish objects.


> I've talked with many people from the community and I don't see an
> agreement for the 4% rule.

agreed, 4% isn't a reasonable default.
I've seen setups with even 10% metadata usage, but these are weird
edge cases with very small objects on NVMe-only setups (obviously
without a separate DB device).

Paul

>
> Wido
>
> >
> > Thank you,
> >
> > Dominic L. Hilsbos, MBA
> > Director – Information Technology
> > Perform Air International Inc.
> > dhils...@performair.com
> > www.PerformAir.com
> >
> >
> >
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> > Wido den Hollander
> > Sent: Tuesday, August 13, 2019 12:51 PM
> > To: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] WAL/DB size
> >
> >
> >
> > On 8/13/19 5:54 PM, Hemant Sonawane wrote:
> >> Hi All,
> >> I have 4 6TB of HDD and 2 450GB SSD and I am going to partition each
> >> disk to 220GB for rock.db. So my question is does it make sense to use
> >> wal for my configuration? if yes then what could be the size of it? help
> >> will be really appreciated.
> >
> > Yes, the WAL needs to be about 1GB in size. That should work in allmost
> > all configurations.
> >
> > 220GB is more then you need for the DB as well. It's doesn't hurt, but
> > it's not needed. For each 6TB drive you need about ~60GB of space for
> > the DB.
> >
> > Wido
> >
> >> --
> >> Thanks and Regards,
> >>
> >> Hemant Sonawane
> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Canonical Livepatch broke CephFS client

2019-08-13 Thread Tim Bishop
Hi,

This email is mostly a heads up for others who might be using
Canonical's livepatch on Ubuntu on a CephFS client.

I have an Ubuntu 18.04 client with the standard kernel currently at
version linux-image-4.15.0-54-generic 4.15.0-54.58. CephFS is mounted
with the kernel client. Cluster is running mimic 13.2.6. I've got
livepatch running and this evening it did an update:

Aug 13 17:33:55 myclient canonical-livepatch[2396]: Client.Check
Aug 13 17:33:55 myclient canonical-livepatch[2396]: Checking with livepatch 
service.
Aug 13 17:33:55 myclient canonical-livepatch[2396]: updating last-check
Aug 13 17:33:55 myclient canonical-livepatch[2396]: touched last check
Aug 13 17:33:56 myclient canonical-livepatch[2396]: Applying update 54.1 for 
4.15.0-54.58-generic
Aug 13 17:33:56 myclient kernel: [3700923.970750] PKCS#7 signature not signed 
with a trusted key
Aug 13 17:33:59 myclient kernel: [3700927.069945] livepatch: enabling patch 
'lkp_Ubuntu_4_15_0_54_58_generic_54'
Aug 13 17:33:59 myclient kernel: [3700927.154956] livepatch: 
'lkp_Ubuntu_4_15_0_54_58_generic_54': starting patching transition
Aug 13 17:34:01 myclient kernel: [3700928.994487] livepatch: 
'lkp_Ubuntu_4_15_0_54_58_generic_54': patching complete
Aug 13 17:34:09 myclient canonical-livepatch[2396]: Applied patch version 54.1 
to 4.15.0-54.58-generic

And then immediately I saw:

Aug 13 17:34:18 myclient kernel: [3700945.728684] libceph: mds0 1.2.3.4:6800 
socket closed (con state OPEN)
Aug 13 17:34:18 myclient kernel: [3700946.040138] libceph: mds0 1.2.3.4:6800 
socket closed (con state OPEN)
Aug 13 17:34:19 myclient kernel: [3700947.105692] libceph: mds0 1.2.3.4:6800 
socket closed (con state OPEN)
Aug 13 17:34:20 myclient kernel: [3700948.033704] libceph: mds0 1.2.3.4:6800 
socket closed (con state OPEN)

And on the MDS:

2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367 Message signature 
does not match contents.
2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367Signature on message:
2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367sig: 
10517606059379971075
2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367Locally calculated 
signature:
2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367 
sig_check:4899837294009305543
2019-08-13 17:34:18.286 7ff165e75700  0 Signature failed.
2019-08-13 17:34:18.286 7ff165e75700  0 -- 1.2.3.4:6800/512468759 >> 
4.3.2.1:0/928333509 conn(0xe6b9500 :6800 >> 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2 cs=1 l=0).process >> 
Signature check failed

Thankfully I was able to umount -f to unfreeze the client, but I have
been unsuccessful in remounting the file system using the kernel client.
The fuse client worked fine as a workaround, but it is slower.
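
Roughly, the workaround described above (the mount point is an assumption; adjust
to your client configuration):

umount -f /mnt/cephfs        # unfreeze the stuck kernel mount
ceph-fuse /mnt/cephfs        # fall back to the FUSE client for now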

Taking a look at livepatch 54.1 I can see it touches Ceph code in the
kernel:

https://git.launchpad.net/~ubuntu-livepatch/+git/bionic-livepatches/commit/?id=3a3081c1e4c8e2e0f9f7a1ae4204eba5f38fbd29

But the relevance of those changes isn't immediately clear to me. I
expect after a reboot it'll be fine, but as yet untested.

Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] WAL/DB size

2019-08-13 Thread Wido den Hollander


On 8/13/19 10:00 PM, dhils...@performair.com wrote:
> Wido / Hemant;
> 
> Current recommendations (since at least luminous) say that a block.db device 
> should be at least 4% of the block device.  For a 6 TB drive, this would be 
> 240 GB, not 60 GB.

I know and I don't agree with that. I'm not sure where that number came
from either.

There could be various configurations, but none of the configs I have
seen require that amount of DB space.

I just checked an RGW-only setup. 6TB drive, 58% full, 11.2GB of DB in
use. No slow db in use.

I've talked with many people from the community and I don't see an
agreement for the 4% rule.

Wido

> 
> Thank you,
> 
> Dominic L. Hilsbos, MBA 
> Director – Information Technology 
> Perform Air International Inc.
> dhils...@performair.com 
> www.PerformAir.com
> 
> 
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Wido 
> den Hollander
> Sent: Tuesday, August 13, 2019 12:51 PM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] WAL/DB size
> 
> 
> 
> On 8/13/19 5:54 PM, Hemant Sonawane wrote:
>> Hi All,
>> I have 4 6TB of HDD and 2 450GB SSD and I am going to partition each
>> disk to 220GB for rock.db. So my question is does it make sense to use
>> wal for my configuration? if yes then what could be the size of it? help
>> will be really appreciated.
> 
> Yes, the WAL needs to be about 1GB in size. That should work in almost
> all configurations.
> 
> 220GB is more than you need for the DB as well. It doesn't hurt, but
> it's not needed. For each 6TB drive you need about ~60GB of space for
> the DB.
> 
> Wido
> 
>> -- 
>> Thanks and Regards,
>>
>> Hemant Sonawane
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] WAL/DB size

2019-08-13 Thread DHilsbos
Wido / Hemant;

Current recommendations (since at least luminous) say that a block.db device 
should be at least 4% of the block device.  For a 6 TB drive, this would be 240 
GB, not 60 GB.

Thank you,

Dominic L. Hilsbos, MBA 
Director – Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com



-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Wido 
den Hollander
Sent: Tuesday, August 13, 2019 12:51 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] WAL/DB size



On 8/13/19 5:54 PM, Hemant Sonawane wrote:
> Hi All,
> I have 4 6TB of HDD and 2 450GB SSD and I am going to partition each
> disk to 220GB for rock.db. So my question is does it make sense to use
> wal for my configuration? if yes then what could be the size of it? help
> will be really appreciated.

Yes, the WAL needs to be about 1GB in size. That should work in almost
all configurations.

220GB is more than you need for the DB as well. It doesn't hurt, but
it's not needed. For each 6TB drive you need about ~60GB of space for
the DB.

Wido

> -- 
> Thanks and Regards,
> 
> Hemant Sonawane
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Time of response of "rbd ls" command

2019-08-13 Thread Ilya Dryomov
On Tue, Aug 13, 2019 at 6:37 PM Gesiel Galvão Bernardes
 wrote:
>
> Hi,
>
> I recently noticed that in two of my pools the command "rbd ls" has taken 
> several minutes to return the values. These pools have between 100 and 120 
> images each.
>
> Where should I look to find the cause of this slowness? The cluster is 
> apparently fine, without any warning.
>
> Thank you very much in advance.

Hi Gesiel,

Try

$ rbd ls --debug-ms 1

and look at the timestamps.  If the latency is coming from RADOS, it
would probably be between "... osd_op(..." and "... osd_op_reply(...".
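
For example, something along these lines pulls out the request/reply pairs so the
timestamps can be compared (pool name is a placeholder; the debug output goes to stderr):

rbd ls mypool --debug-ms 1 2>&1 | grep -E 'osd_op\(|osd_op_reply\('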

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] WAL/DB size

2019-08-13 Thread Wido den Hollander


On 8/13/19 5:54 PM, Hemant Sonawane wrote:
> Hi All,
> I have 4 6TB of HDD and 2 450GB SSD and I am going to partition each
> disk to 220GB for rock.db. So my question is does it make sense to use
> wal for my configuration? if yes then what could be the size of it? help
> will be really appreciated.

Yes, the WAL needs to be about 1GB in size. That should work in almost
all configurations.

220GB is more than you need for the DB as well. It doesn't hurt, but
it's not needed. For each 6TB drive you need about ~60GB of space for
the DB.
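
For illustration, a hedged ceph-volume sketch of one such OSD built along those
lines (device, VG, and LV names are assumptions; sizes follow the numbers above):

lvcreate -L 60G -n db-sdb  ssd_vg     # ~60GB DB for one 6TB HDD
lvcreate -L 2G  -n wal-sdb ssd_vg     # optional; if omitted, the WAL lives on the DB device
ceph-volume lvm create --bluestore --data /dev/sdb \
    --block.db ssd_vg/db-sdb --block.wal ssd_vg/wal-sdb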

Wido

> -- 
> Thanks and Regards,
> 
> Hemant Sonawane
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] add writeback to Bluestore thanks to lvm-writecache

2019-08-13 Thread Olivier Bonvalet
Hi,

we use OSDs with data on HDD and db/wal on NVMe.
But for now, BlueStore's DB and WAL only store metadata, NOT data. Right?

So, when we migrated from:
A) Filestore + HDD with hardware writecache + journal on SSD
to:
B) Bluestore + HDD without hardware writecache + DB/WAL on NVMe

performance on our random-write workloads dropped.

Since the default OSD setup now uses LVM, enabling LVM writecache is easy.
But is it a good idea? Has anyone tried it?
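
For anyone experimenting, a very rough sketch of what attaching an LVM writecache
to an existing OSD's data LV could look like. This is NOT a recommendation: it
assumes LVM >= 2.03 with dm-writecache available, the OSD stopped first, and every
VG/LV/device name below is a placeholder (check `ceph-volume lvm list` for the real ones):

systemctl stop ceph-osd@0
vgextend ceph-hdd-vg /dev/nvme0n1p5                        # add the NVMe partition to the OSD's VG
lvcreate -L 50G -n osd0-wcache ceph-hdd-vg /dev/nvme0n1p5  # cache LV must live in the same VG
lvconvert --type writecache --cachevol osd0-wcache \
    ceph-hdd-vg/osd-block-xxxxxxxx
systemctl start ceph-osd@0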

Thanks,

Olivier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] More than 100% in a dashboard PG Status

2019-08-13 Thread Fyodor Ustinov
Hi!

I created Bug #41234.

Thanks for your advice!

- Original Message -
> From: "Lenz Grimmer" 
> To: "ceph-users" 
> Cc: "Alfonso Martinez Hidalgo" 
> Sent: Tuesday, 13 August, 2019 16:13:18
> Subject: Re: [ceph-users] More than 100% in a dashboard PG Status

> Hi Fyodor,
> 
> (Cc:ing Alfonso)
> 
> On 8/13/19 12:47 PM, Fyodor Ustinov wrote:
> 
>> I have ceph nautilus (upgraded from mimic, if it is important) and in
>> dashboard in "PG Status" section I see "Clean (2397%)"
>> 
>> It's a bug?
> 
> Huh, That might be possible - sorry about that. We'd be grateful if you
> could submit this on the bug tracker (please attach a screen shot as well):
> 
>  https://tracker.ceph.com/projects/mgr/issues/new
> 
> We may require additional information from you, so please keep an eye on
> the issue. Thanks in advance!
> 
> Lenz
> 
> --
> SUSE Software Solutions Germany GmbH - Maxfeldstr. 5 - 90409 Nuernberg
> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah HRB 21284 (AG Nürnberg)
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Time of response of "rbd ls" command

2019-08-13 Thread Gesiel Galvão Bernardes
Hi,

I recently noticed that in two of my pools the command "rbd ls" has taken
several minutes to return the values. These pools have between 100 and 120
images each.

Where should I look to find the cause of this slowness? The cluster is
apparently fine, without any warning.

Thank you very much in advance.

Gesiel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] More than 100% in a dashboard PG Status

2019-08-13 Thread DHilsbos
All;

I also noticed this behavior. It may have started after inducing a failure in 
the cluster in order to observe the self-healing behavior.

 In the "PG Status" section of the dashboard, I have "Clean (200%)."  This has 
not seemed to affect the functioning of the cluster.

Cluster is a new deployment, using nautilus (14.2.2).

Are there any commands I can run on the cluster to show what the numbers underlying 
this look like?
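
A couple of commands that expose the raw PG state counts the dashboard derives its
percentages from (jq is assumed to be available; exact JSON paths may vary slightly
between releases):

ceph pg stat
ceph status --format json | jq '.pgmap.pgs_by_state'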

Thank you,

Dominic L. Hilsbos, MBA 
Director - Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Fyodor 
Ustinov
Sent: Tuesday, August 13, 2019 3:48 AM
To: ceph-users
Subject: [ceph-users] More than 100% in a dashboard PG Status

Hi!

I have ceph nautilus (upgraded from mimic, if it is important) and in dashboard 
in "PG Status" section I see "Clean (2397%)"

It's a bug?

WBR,
Fyodor.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] reproducible rbd-nbd crashes

2019-08-13 Thread Marc Schöchlin
Hello Jason,

thanks for your response.
See my inline comments.

Am 31.07.19 um 14:43 schrieb Jason Dillaman:
> On Wed, Jul 31, 2019 at 6:20 AM Marc Schöchlin  wrote:
>
>
> The problem does not seem to be related to kernel releases, filesystem types, or 
> the ceph and network setup.
> Release 12.2.5 seems to work properly, and at least releases >= 12.2.10 seem 
> to have the described problem.
>  ...
>
> It's basically just a log message tweak and some changes to how the
> process is daemonized. If you could re-test w/ each release after
> 12.2.5 and pin-point where the issue starts occurring, we would have
> something more to investigate.

Are there changes related to https://tracker.ceph.com/issues/23891?


You showed me the very small set of changes in rbd-nbd.
What about librbd, librados, ...?

What else can we do to find a detailed reason for the crash?
Do you think it would be useful to enable core dump creation for that process?
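
In case core dumps turn out to be useful, a rough sketch for enabling them on the
client before remapping (paths and the image spec are illustrative):

mkdir -p /var/crash
echo '/var/crash/core.%e.%p.%t' > /proc/sys/kernel/core_pattern
ulimit -c unlimited                 # in the shell that launches rbd-nbd
rbd-nbd map mypool/myimage          # remap so the new limit applies to the process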

>> What's next? Is it a good idea to do a binary search between 12.2.12 and 
>> 12.2.5?
>>
Due to the absence of a coworker I had almost no capacity to run deeper 
tests on this problem.
But I can say that I also reproduced the problem with release 12.2.12.

The new (updated) list:

- SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system, 120s 
device timeout
  -> 18 hour testrun was successful, no dmesg output
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device 
timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, 
map/mount can be re-created without reboot
  -> parallel krbd device usage with 99% io usage worked without a problem 
while running the test
- FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s 
device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, 
map/mount can be re-created
  -> parallel krbd device usage with 99% io usage worked without a problem 
while running the test
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
  -> failed after < 10 minutes
  -> system runs in a high system load, system is almost unusable, unable to 
shutdown the system, hard reset of vm necessary, manual exclusive lock removal 
is necessary before remapping the device
- FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 120s 
device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, 
map/mount can be re-created
- FAILED: kernel 5.0, ceph 12.2.12, 2TB ec-volume, ext4 file system, 120s device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, 
map/mount can be re-created

Regards
Marc

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] WAL/DB size

2019-08-13 Thread Hemant Sonawane
Hi All,
I have 4 x 6TB HDDs and 2 x 450GB SSDs, and I am going to give each disk a 220GB
partition for rocksdb (block.db). So my question is: does it make sense to use a
WAL for my configuration? If yes, then what should its size be? Help will be
really appreciated.
-- 
Thanks and Regards,

Hemant Sonawane
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs cannot mount with kernel client

2019-08-13 Thread Ilya Dryomov
On Tue, Aug 13, 2019 at 4:30 PM Serkan Çoban  wrote:
>
> I am out of office right now, but I am pretty sure it was the same
> stack trace as in tracker.
> I will confirm tomorrow.
> Any workarounds?

Compaction

# echo 1 >/proc/sys/vm/compact_memory

might help if the memory in question is moveable.  If not, reboot and
mount on a freshly booted node.

I have raised the priority on the ticket.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph capacity versus pool replicated size discrepancy?

2019-08-13 Thread Kenneth Van Alstyne
Hey guys, this is probably a really silly question, but I’m trying to reconcile 
where all of my space has gone in one cluster that I am responsible for.

The cluster is made up of 36 2TB SSDs across 3 nodes (12 OSDs per node), all 
using FileStore on XFS.  We are running Ceph Luminous 12.2.8 on this particular 
cluster. The only pool where data is heavily stored is the “rbd” pool, of which 
7.09TiB is consumed.  With a replication of "3", I would expect the raw 
used to be close to 21TiB, but it's actually closer to 35TiB.  Some additional 
details are below.  Any thoughts?

[cluster] root@dashboard:~# ceph df
GLOBAL:
    SIZE        AVAIL       RAW USED    %RAW USED
    62.8TiB     27.8TiB     35.1TiB     55.81
POOLS:
    NAME                        ID    USED       %USED    MAX AVAIL    OBJECTS
    rbd                         0     7.09TiB    53.76    6.10TiB      3056783
    data                        3     29.4GiB    0.47     6.10TiB      7918
    metadata                    4     57.2MiB    0        6.10TiB      95
    .rgw.root                   5     1.09KiB    0        6.10TiB      4
    default.rgw.control         6     0B         0        6.10TiB      8
    default.rgw.meta            7     0B         0        6.10TiB      0
    default.rgw.log             8     0B         0        6.10TiB      207
    default.rgw.buckets.index   9     0B         0        6.10TiB      0
    default.rgw.buckets.data    10    0B         0        6.10TiB      0
    default.rgw.buckets.non-ec  11    0B         0        6.10TiB      0

[cluster] root@dashboard:~# ceph --version
ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)

[cluster] root@dashboard:~# ceph osd dump | grep 'replicated size'
pool 0 'rbd' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins 
pg_num 682 pgp_num 682 last_change 414873 flags hashpspool 
min_write_recency_for_promote 1 stripe_width 0 application rbd
pool 3 'data' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins 
pg_num 682 pgp_num 682 last_change 409614 flags hashpspool 
crash_replay_interval 45 min_write_recency_for_promote 1 stripe_width 0 
application cephfs
pool 4 'metadata' replicated size 3 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 682 pgp_num 682 last_change 409617 flags hashpspool 
min_write_recency_for_promote 1 stripe_width 0 application cephfs
pool 5 '.rgw.root' replicated size 3 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 409 pgp_num 409 last_change 409710 lfor 0/336229 flags 
hashpspool stripe_width 0 application rgw
pool 6 'default.rgw.control' replicated size 3 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 409 pgp_num 409 last_change 409711 lfor 0/336232 
flags hashpspool stripe_width 0 application rgw
pool 7 'default.rgw.meta' replicated size 3 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 409 pgp_num 409 last_change 409713 lfor 0/336235 flags 
hashpspool stripe_width 0 application rgw
pool 8 'default.rgw.log' replicated size 3 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 409 pgp_num 409 last_change 409712 lfor 0/336238 flags 
hashpspool stripe_width 0 application rgw
pool 9 'default.rgw.buckets.index' replicated size 3 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 409 pgp_num 409 last_change 409714 lfor 0/336241 
flags hashpspool stripe_width 0 application rgw
pool 10 'default.rgw.buckets.data' replicated size 3 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 409 pgp_num 409 last_change 409715 lfor 0/336244 
flags hashpspool stripe_width 0 application rgw
pool 11 'default.rgw.buckets.non-ec' replicated size 3 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 409 pgp_num 409 last_change 409716 lfor 0/336247 
flags hashpspool stripe_width 0 application rgw

[cluster] root@dashboard:~# ceph osd lspools
0 rbd,3 data,4 metadata,5 .rgw.root,6 default.rgw.control,7 default.rgw.meta,8 
default.rgw.log,9 default.rgw.buckets.index,10 default.rgw.buckets.data,11 
default.rgw.buckets.non-ec,

[cluster] root@dashboard:~# rados df
POOL_NAME  USED  OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS  RD  WR_OPS  WR
.rgw.root  1.09KiB  4  0  12  0  00  128KiB  0  0B
data  29.4GiB  7918  0  23754  0  00  1414777  3.74TiB  3524833  4.54TiB
default.rgw.buckets.data  0B  0  0  0  0  00  0  0B  0  0B
default.rgw.buckets.index  0B  0  0  0  0  00  0  0B  0  0B
default.rgw.buckets.non-ec  0B  0  0  0  0  00  0  0B  0  0B
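
As a rough way to do the reconciliation described above, summing per-pool usage and
multiplying by the replication factor (jq is assumed to be available; field names are
from the luminous JSON output and may differ slightly by release):

ceph df --format json | jq '[.pools[].stats.bytes_used] | add * 3 / 1099511627776'  # pools x3, in TiB
ceph df --format json | jq '.stats.total_used_bytes / 1099511627776'                # raw used, in TiB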

[ceph-users] CephFS "denied reconnect attempt" after updating Ceph

2019-08-13 Thread William Edwards

Hello,

I've been using CephFS for quite a while now, and am very happy with it. 
However, I'm experiencing an issue that's quite hard to debug.

On almost every server where CephFS is mounted, the CephFS mount becomes 
unusable after updating Ceph (this has happened 3 times now, each time after a 
Ceph update). When attempting to access the CephFS mount, I get a permission denied error:

root@cephmgr:~# cd /mnt/cephfs
-bash: cd: /mnt/cephfs: Permission denied
[root@da0 ~]# ls /home
ls: cannot access /home: Permission denied
root@http-cap01:~# ls /var/www/vhosts
ls: cannot access '/var/www/vhosts': Permission denied

What strikes me as odd is that on some machines the mount works fine, while on 
other machines I get the above error, even though the server configurations are identical.

Here's a log of a client machine where the CephFS mount 'broke' after updating 
Ceph:

--
Jun 12 00:56:47 http-cap01 kernel: [4163998.163498] ceph: mds0 caps stale
Jun 12 01:00:20 http-cap01 kernel: [4164210.581767] ceph: mds0 caps went stale, 
renewing
Jun 12 01:00:20 http-cap01 kernel: [4164210.581771] ceph: mds0 caps stale
Jun 12 01:00:45 http-cap01 kernel: [4164236.434456] libceph: mds0 
[fdb7:b01e:7b8e:0:10:10:10:3]:6816 socket closed (con state OPEN)
Jun 12 01:00:46 http-cap01 kernel: [4164236.980990] libceph: reset on mds0
Jun 12 01:00:46 http-cap01 kernel: [4164236.980996] ceph: mds0 closed our 
session
Jun 12 01:00:46 http-cap01 kernel: [4164236.980997] ceph: mds0 reconnect start
Jun 12 01:00:46 http-cap01 kernel: [4164237.035294] ceph: mds0 reconnect denied
Jun 12 01:00:46 http-cap01 kernel: [4164237.036990] libceph: mds0 
[fdb7:b01e:7b8e:0:10:10:10:3]:6816 socket closed (con state NEGOTIATING)
Jun 12 01:00:47 http-cap01 kernel: [4164237.972853] ceph: mds0 rejected session
Jun 12 01:05:15 http-cap01 kernel: [4164506.065665] libceph: mon0 
[fdb7:b01e:7b8e:0:10:10:10:1]:6789 session lost, hunting for new mon
Jun 12 01:05:15 http-cap01 kernel: [4164506.068613] libceph: mon1 
[fdb7:b01e:7b8e:0:10:10:10:2]:6789 session established
Jun 12 01:06:47 http-cap01 kernel: [4164597.858261] libceph: mon1 
[fdb7:b01e:7b8e:0:10:10:10:2]:6789 socket closed (con state OPEN)
Jun 12 01:06:47 http-cap01 kernel: [4164597.858323] libceph: mon1 
[fdb7:b01e:7b8e:0:10:10:10:2]:6789 session lost, hunting for new mon
Jun 12 01:06:47 http-cap01 kernel: [4164597.864745] libceph: mon2 
[fdb7:b01e:7b8e:0:10:10:10:3]:6789 session established
Jun 12 01:23:02 http-cap01 kernel: [4165572.915743] ceph: mds0 reconnect start
Jun 12 01:23:02 http-cap01 kernel: [4165572.918197] ceph: mds0 reconnect denied
Jun 12 01:23:15 http-cap01 kernel: [4165586.526195] libceph: mds0 
[fdb7:b01e:7b8e:0:10:10:10:2]:6817 socket closed (con state NEGOTIATING)
Jun 12 01:23:16 http-cap01 kernel: [4165586.992411] ceph: mds0 rejected session
--

Note the "ceph: mds0 reconnect denied"

Log on a machine where the CephFS mount kept working fine:

--
Jun 12 01:08:26 http-hlp02 kernel: [3850613.358329] libceph: mon0 
[fdb7:b01e:7b8e:0:10:10:10:1]:6789 socket closed (con state OPEN)
Jun 12 01:08:26 http-hlp02 kernel: [3850613.358418] libceph: mon0 
[fdb7:b01e:7b8e:0:10:10:10:1]:6789 session lost, hunting for new mon
Jun 12 01:08:26 http-hlp02 kernel: [3850613.369597] libceph: mon1 
[fdb7:b01e:7b8e:0:10:10:10:2]:6789 session established
Jun 12 01:09:50 http-hlp02 kernel: [3850697.708357] libceph: osd9 
[fdb7:b01e:7b8e:0:10:10:10:1]:6806 socket closed (con state OPEN)
Jun 12 01:09:50 http-hlp02 kernel: [3850697.709897] libceph: osd0 down
Jun 12 01:09:50 http-hlp02 kernel: [3850697.709898] libceph: osd1 down
Jun 12 01:09:50 http-hlp02 kernel: [3850697.709899] libceph: osd6 down
Jun 12 01:09:50 http-hlp02 kernel: [3850697.709899] libceph: osd9 down
Jun 12 01:12:37 http-hlp02 kernel: [3850864.673357] libceph: osd9 up
Jun 12 01:12:37 http-hlp02 kernel: [3850864.673378] libceph: osd6 up
Jun 12 01:12:37 http-hlp02 kernel: [3850864.673394] libceph: osd0 up
Jun 12 01:12:37 http-hlp02 kernel: [3850864.673402] libceph: osd1 up
Jun 12 01:14:30 http-hlp02 kernel: [3850977.916749] libceph: wrong peer, want 
[fdb7:b01e:7b8e:0:10:10:10:2]:6808/434742, got 
[fdb7:b01e:7b8e:0:10:10:10:2]:6808/19887
Jun 12 01:14:30 http-hlp02 kernel: [3850977.916765] libceph: osd10 
[fdb7:b01e:7b8e:0:10:10:10:2]:6808 wrong peer at address
Jun 12 01:14:30 http-hlp02 kernel: [3850977.918368] libceph: osd4 down
Jun 12 01:14:30 http-hlp02 kernel: [3850977.918369] libceph: osd5 down
Jun 12 01:14:30 http-hlp02 kernel: [3850977.918370] libceph: osd7 down
Jun 12 01:14:30 http-hlp02 kernel: [3850977.918370] libceph: osd10 down
Jun 12 01:14:30 http-hlp02 kernel: [3850977.918401] libceph: osd10 up
Jun 12 01:14:30 http-hlp02 kernel: [3850977.918406] libceph: osd7 up
Jun 12 01:14:34 http-hlp02 kernel: [3850981.985720] libceph: osd5 up
Jun 12 01:14:34 http-hlp02 kernel: [3850981.985742] libceph: osd4 up
Jun 12 01:19:56 http-hlp02 kernel: [3851304.177469] libceph: osd2 down
Jun 12 01:19:56 http-hlp02 kernel: [3851304.177471] libceph: osd3 down
Jun 12 01:19:56 

Re: [ceph-users] Cephfs cannot mount with kernel client

2019-08-13 Thread Serkan Çoban
I am out of office right now, but I am pretty sure it was the same
stack trace as in tracker.
I will confirm tomorrow.
Any workarounds?

On Tue, Aug 13, 2019 at 5:16 PM Ilya Dryomov  wrote:
>
> On Tue, Aug 13, 2019 at 3:57 PM Serkan Çoban  wrote:
> >
> > I checked /var/log/messages and see there are page allocation
> > failures. But I don't understand why?
> > The client has 768GB memory and most of it is not used, cluster has
> > 1500 OSDs. Do I need to increase vm.min_free_kbytes? It is set to 1GB
> > now.
> > Also huge_page is disabled in clients.
>
> https://tracker.ceph.com/issues/40481
>
> I can confirm if you pastebin page allocation splats.
>
> Thanks,
>
> Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs cannot mount with kernel client

2019-08-13 Thread Ilya Dryomov
On Tue, Aug 13, 2019 at 3:57 PM Serkan Çoban  wrote:
>
> I checked /var/log/messages and see there are page allocation
> failures. But I don't understand why?
> The client has 768GB memory and most of it is not used, cluster has
> 1500 OSDs. Do I need to increase vm.min_free_kbytes? It is set to 1GB
> now.
> Also huge_page is disabled in clients.

https://tracker.ceph.com/issues/40481

I can confirm if you pastebin page allocation splats.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs cannot mount with kernel client

2019-08-13 Thread Serkan Çoban
I checked /var/log/messages and see there are page allocation
failures, but I don't understand why.
The client has 768GB of memory and most of it is not used; the cluster has
1500 OSDs. Do I need to increase vm.min_free_kbytes? It is set to 1GB
now.
Also, huge pages are disabled on the clients.

Thanks,
Serkan

On Tue, Aug 13, 2019 at 3:42 PM Ilya Dryomov  wrote:
>
> On Tue, Aug 13, 2019 at 12:36 PM Serkan Çoban  wrote:
> >
> > Hi,
> >
> > Just installed nautilus 14.2.2 and setup cephfs on it. OS is all centos 7.6.
> > From a client I can mount the cephfs with ceph-fuse, but I cannot
> > mount with ceph kernel client.
> > It gives "mount error 110 connection timeout" and I can see "libceph:
> > corrupt full osdmap (-12) epoch 2759 off 656" in /var/log/messages.
> > This client is not on same subnet with ceph servers.
> >
> > However on a client with the same subnet with the servers I can
> > successfully mount both with ceph-fuse and kernel client.
> >
> > Do I need to configure anything for the clients those are in different 
> > subnet?
> > Is this a kernel issue?
>
> Hi Serkan,
>
> It is failing to allocate memory, so the subnet is probably not the
> issue.  Is there anything else pointing to memory shortage -- "page
> allocation failure" splats, etc?
>
> How much memory is available for use on that node?  How many OSDs do
> you have in your cluster?
>
> Thanks,
>
> Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS corruption

2019-08-13 Thread Yan, Zheng
The nautilus version (14.2.2) of ‘cephfs-data-scan scan_links’ can fix the
snap table; hopefully it will fix your issue.

You don't need to upgrade the whole cluster. Just install nautilus on a
temp machine or compile ceph from source.
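
A minimal sketch of what that could look like, assuming the temporary machine has
nautilus (14.2.2) installed along with the cluster's ceph.conf and admin keyring in
the default locations:

cephfs-data-scan scan_links    # the nautilus build of this tool can also repair the snap table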



On Tue, Aug 13, 2019 at 2:35 PM Adam  wrote:
>
> Pierre Dittes helped me with adding --rank=yourfsname:all and I ran the
> following steps from the disaster recovery page: journal export, dentry
> recovery, journal truncation, mds table wipes (session, snap and inode),
> scan_extents, scan_inodes, scan_links, and cleanup.
>
> Now all three of my MDS servers are crashing due to a failed assert.
> Logs with stacktrace are included (the other two servers have the same
> stacktrace in their logs).
>
> Currently I can't mount cephfs (which makes sense since there aren't any
> MDS services up for more than a few minutes before they crash).  Any
> suggestions on next steps to troubleshoot/fix this?
>
> Hopefully there's some way to recover from this and I don't have to tell
> my users that I lost all the data and we need to go back to the backups.
>  It shouldn't be a huge problem if we do, but it'll lose a lot of
> confidence in ceph and its ability to keep data safe.
>
> Thanks,
> Adam
>
> On 8/8/19 3:31 PM, Adam wrote:
> > I had a machine with insufficient memory and it seems to have corrupted
> > data on my MDS.  The filesystem seems to be working fine, with the
> > exception of accessing specific files.
> >
> > The ceph-mds logs include things like:
> > mds.0.1596621 unhandled write error (2) No such file or directory, force
> > readonly...
> > dir 0x100fb03 object missing on disk; some files may be lost
> > (/adam/programming/bash)
> >
> > I'm using mimic and trying to follow the instructions here:
> > https://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
> >
> > The punchline is this:
> > cephfs-journal-tool --rank all journal export backup.bin
> > Error ((22) Invalid argument)
> > 2019-08-08 20:02:39.847 7f06827537c0 -1 main: Couldn't determine MDS rank.
> >
> > I have a backup (outside of ceph) of all data which is inaccessible and
> > I can back anything which is accessible if need be.  There's some more
> > information below, but my main question is: what are my next steps?
> >
> > On a side note, I'd like to get involved with helping with documentation
> > (man pages, the ceph website, usage text, etc). Where can I get started?
> >
> >
> >
> > Here's the context:
> >
> > cephfs-journal-tool event recover_dentries summary
> > Error ((22) Invalid argument)
> > 2019-08-08 19:50:04.798 7f21f4ffe7c0 -1 main: missing mandatory "--rank"
> > argument
> >
> > Seems like a bug in the documentation since `--rank` is a "mandatory
> > option" according to the help text.  It looks like the rank of this node
> > for MDS is 0, based on `ceph health detail`, but using `--rank 0` or
> > `--rank all` doesn't work either:
> >
> > ceph health detail
> > HEALTH_ERR 1 MDSs report damaged metadata; 1 MDSs are read only
> > MDS_DAMAGE 1 MDSs report damaged metadata
> > mdsge.hax0rbana.org(mds.0): Metadata damage detected
> > MDS_READ_ONLY 1 MDSs are read only
> > mdsge.hax0rbana.org(mds.0): MDS in read-only mode
> >
> > cephfs-journal-tool --rank 0 event recover_dentries summary
> > Error ((22) Invalid argument)
> > 2019-08-08 19:54:45.583 7f5b37c4c7c0 -1 main: Couldn't determine MDS rank.
> >
> >
> > The only place I've found this error message is in an unanswered
> > stackoverflow question and in the source code here:
> > https://github.com/ceph/ceph/blob/master/src/tools/cephfs/JournalTool.cc#L114
> >
> > It looks like that is trying to read a filesystem map (fsmap), which
> > might be corrupted.  Running `rados export` prints part of the help text
> > and then segfaults, which is rather concerning.  This is 100% repeatable
> > (outside of gdb, details below).  I tried `rados df` and that worked
> > fine, so it's not all rados commands which are having this problem.
> > However, I tried `rados bench 60 seq` and that also printed out the
> > usage text and then segfaulted.
> >
> >
> >
> >
> >
> > Info on the `rados export` crash:
> > rados export
> > usage: rados [options] [commands]
> > POOL COMMANDS
> > 
> > IMPORT AND EXPORT
> >export [filename]
> >Serialize pool contents to a file or standard out.
> > 
> > OMAP OPTIONS:
> > --omap-key-file fileread the omap key from a file
> > *** Caught signal (Segmentation fault) **
> >  in thread 7fcb6bfff700 thread_name:fn_anonymous
> >
> > When running it in gdb:
> > (gdb) bt
> > #0  0x7fffef07331f in std::_Rb_tree > std::char_traits, std::allocator >,
> > std::pair,
> > std::allocator > const, std::map > std::__cxx11::basic_string,
> > std::allocator >, unsigned long, long, double, bool,
> > entity_addr_t, std::chrono::duration >,
> > Option::size_t, uuid_d>, std::less, std::allocator > const, boost::variant > std::char_traits, std::allocator >, unsigned long, long,
> > double, bool, entity_addr_t, 

Re: [ceph-users] More than 100% in a dashboard PG Status

2019-08-13 Thread Lenz Grimmer
Hi Fyodor,

(Cc:ing Alfonso)

On 8/13/19 12:47 PM, Fyodor Ustinov wrote:

> I have ceph nautilus (upgraded from mimic, if it is important) and in
> dashboard in "PG Status" section I see "Clean (2397%)"
> 
> It's a bug?

Huh, That might be possible - sorry about that. We'd be grateful if you
could submit this on the bug tracker (please attach a screen shot as well):

  https://tracker.ceph.com/projects/mgr/issues/new

We may require additional information from you, so please keep an eye on
the issue. Thanks in advance!
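
In the meantime, the figure can be sanity-checked from the CLI (a sketch; the
jq paths assume the Nautilus `ceph status` JSON layout):

# the PG summary the dashboard percentage is derived from
ceph pg stat
# e.g. "2048 pgs: 2048 active+clean; ..." -- clean can never exceed the pg total

# the same numbers in machine-readable form
ceph status --format=json | jq '.pgmap.num_pgs, .pgmap.pgs_by_state'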

Lenz

-- 
SUSE Software Solutions Germany GmbH - Maxfeldstr. 5 - 90409 Nuernberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah HRB 21284 (AG Nürnberg)





Re: [ceph-users] Cephfs cannot mount with kernel client

2019-08-13 Thread Ilya Dryomov
On Tue, Aug 13, 2019 at 12:36 PM Serkan Çoban  wrote:
>
> Hi,
>
> Just installed nautilus 14.2.2 and set up cephfs on it. The OS is centos 7.6 everywhere.
> From a client I can mount the cephfs with ceph-fuse, but I cannot
> mount with the ceph kernel client.
> It gives "mount error 110 connection timeout" and I can see "libceph:
> corrupt full osdmap (-12) epoch 2759 off 656" in /var/log/messages.
> This client is not on the same subnet as the ceph servers.
>
> However, on a client on the same subnet as the servers I can
> successfully mount with both ceph-fuse and the kernel client.
>
> Do I need to configure anything for clients that are in a different subnet?
> Is this a kernel issue?

Hi Serkan,

It is failing to allocate memory, so the subnet is probably not the
issue.  Is there anything else pointing to memory shortage -- "page
allocation failure" splats, etc?

How much memory is available for use on that node?  How many OSDs do
you have in your cluster?
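
A quick way to collect that information (a sketch; the first two commands run
on the client that fails to mount, the last two on any node with an admin
keyring):

# look for allocation-failure splats around the mount attempt and check free RAM
dmesg | grep -i -B2 -A10 'page allocation failure'
free -h

# cluster size drives the size of the full osdmap the kernel has to decode
ceph osd stat
ceph osd dump | head -n 20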

Thanks,

Ilya


[ceph-users] More than 100% in a dashboard PG Status

2019-08-13 Thread Fyodor Ustinov
Hi!

I have ceph nautilus (upgraded from mimic, if it is important) and in the
dashboard, in the "PG Status" section, I see "Clean (2397%)".

Is this a bug?

WBR,
Fyodor.


[ceph-users] Cephfs cannot mount with kernel client

2019-08-13 Thread Serkan Çoban
Hi,

Just installed nautilus 14.2.2 and set up cephfs on it. The OS is centos 7.6 everywhere.
From a client I can mount the cephfs with ceph-fuse, but I cannot
mount with the ceph kernel client.
It gives "mount error 110 connection timeout" and I can see "libceph:
corrupt full osdmap (-12) epoch 2759 off 656" in /var/log/messages.
This client is not on the same subnet as the ceph servers.

However, on a client on the same subnet as the servers I can
successfully mount with both ceph-fuse and the kernel client.

Do I need to configure anything for clients that are in a different subnet?
Is this a kernel issue?

Thanks,
Serkan


Re: [ceph-users] RGW how to delete orphans

2019-08-13 Thread Andrei Mikhailovsky
Hello

I was hoping to follow up on this email and find out whether Florian managed
to get to the bottom of this.

I have a case where I believe my RGW bucket is using too much space. For me,
the ceph df command shows over 16TB of usage, whereas the bucket stats show a
total of about 6TB. So it seems that around 10TB is wasted somewhere, and I
would like to find out how to trim this.
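
For what it's worth, one way to make that comparison concrete (a sketch; it
assumes the data pool is .rgw.buckets, as in the orphans command below, that
jq is installed, and that every bucket reports an "rgw.main" usage section):

# pool usage as ceph accounts for it
ceph df detail | grep rgw.buckets

# total logical size across all buckets, in TiB
radosgw-admin bucket stats \
  | jq '[.[].usage."rgw.main".size_kb_actual // 0] | add / 1024 / 1024 / 1024'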

I am running "ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) 
mimic (stable)" on all cluster nodes (server and client side). I have the total 
of 4 osd servers with 48 osds, including a combination of SSD and SAS drives 
for different pools.

I have started "radosgw-admin orphans find --pool=.rgw.buckets --job-id=find1 
--num-shards=64 --yes-i-really-mean-it" command about 2 weeks ago and the only 
output I can see from it is similar to this:

storing 20 entries at orphan.scan.find1.linked.50
storing 28 entries at orphan.scan.find1.linked.16

The command is still running and I can see an increase of about 5K IOPS in the
cluster's throughput since it started. However, I can't seem to find any
indication of its progress, nor do I see an increase in the RGW pool usage.
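
A rough way to peek at the scan (a sketch; the job id matches the find1 job
started above, and the stage names come from the list-jobs output quoted
further down this thread):

radosgw-admin orphans list-jobs --extra-info
# -> watch "search_stage" (iterate_bucket_index, then later stages) and "shard"
#    advance; when the scan finishes, "orphans find" prints "leaked: <object>"
#    lines naming the candidate orphans

# once the results have been reviewed, drop the scan's intermediate data
radosgw-admin orphans finish --job-id=find1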

Can anyone suggest the next steps, please?

Cheers

Andrei

- Original Message -
> From: "Florian Engelmann" 
> To: "Andreas Calminder" , "Christian Wuerdig" 
> 
> Cc: "ceph-users" 
> Sent: Friday, 26 October, 2018 11:28:19
> Subject: Re: [ceph-users] RGW how to delete orphans

> Hi,
> 
> we've got the same problem here. Our 12.2.5 RadosGWs crashed
> (unnoticed by us) about 30,000 times with ongoing multipart uploads.
> After a couple of days we ended up with:
> 
> NAME                   ID  QUOTA OBJECTS  QUOTA BYTES  USED    %USED  MAX AVAIL  OBJECTS   DIRTY   READ     WRITE   RAW USED
> xx-1.rgw.buckets.data  6   N/A            N/A          116TiB  87.22  17.1TiB    36264870  36.26M  3.63GiB  148MiB  194TiB
> 
> 116TB data (194TB raw) while only:
> 
> for i in $(radosgw-admin bucket list | jq -r '.[]'); do  radosgw-admin
> bucket stats --bucket=$i | jq '.usage | ."rgw.main" | .size_kb' ; done |
> awk '{ SUM += $1} END { print SUM/1024/1024/1024 }'
> 
> 46.0962
> 
> 116 - 46 = 70TB
> 
> So 70TB of objects are orphans, right?
> 
> And there are 36.264.870 objects in our rgw.buckets.data pool.
> 
> So we started:
> 
> radosgw-admin orphans list-jobs --extra-info
> [
> {
> "orphan_search_state": {
> "info": {
> "orphan_search_info": {
> "job_name": "check-orph",
> "pool": "zh-1.rgw.buckets.data",
> "num_shards": 64,
> "start_time": "2018-10-10 09:01:14.746436Z"
> }
> },
> "stage": {
> "orphan_search_stage": {
> "search_stage": "iterate_bucket_index",
> "shard": 0,
> "marker": ""
> }
> }
> }
> }
> ]
> 
> writing stdout to: orphans.txt
> 
> I am not sure about how to interpret the output but:
> 
> cat orphans.txt | awk '/^storing / { SUM += $2} END { print SUM }'
> 2145042765
> 
> So how to interpret those output lines:
> ...
> storing 16 entries at orphan.scan.check-orph.linked.62
> storing 19 entries at orphan.scan.check-orph.linked.63
> storing 13 entries at orphan.scan.check-orph.linked.0
> storing 13 entries at orphan.scan.check-orph.linked.1
> ...
> 
> Is it like
> 
> "I am storing 16 'healthy' object 'names' to the shard
> orphan.scan.check-orph.linked.62"
> 
> Is it objects? What is meant by "entries"? Where are those "shards"? Are
> they files or objects in a pool? How to know about the progress of
> "orphans find"? Is the job still doing the right thing? Time estimated
> to run on SATA disks with 194TB RAW?
> 
> The orphans find command has already stored 2,145,042,765 (more than 2
> billion) "entries"... while there are "only" 36 million objects...
> 
> Is the process still healthy and doing the right thing?
> 
> All the best,
> Florian
> 
> 
> 
> 
> 
> Am 10/3/17 um 10:48 AM schrieb Andreas Calminder:
>> The output, to stdout, is something like leaked: $objname. Am I supposed
>> to pipe it to a log, grep for leaked: and pipe it to rados delete? Or am
>> I supposed to dig around in the log pool to try and find the objects
>> there? The information available is quite vague. Maybe Yehuda can shed
>> some light on this issue?
>> 
>> Best regards,
>> /Andreas
>> 
>> On 3 Oct 2017 06:25, "Christian Wuerdig" wrote:
>> 
>> yes, at least that's how I'd interpret the information given in this
>> thread:
>> 
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-February/016521.html
>> 
>> 
>> 
>> On Tue, Oct 3, 2017 at 1:11 AM, Webert de Souza Lima wrote:
>>  > Hey Christian,
>>  >
>>  >> On 29 Sep 2017 12:32 a.m., "Christian Wuerdig" wrote:

Re: [ceph-users] optane + 4x SSDs for VM disk images?

2019-08-13 Thread vitalif

Could performance of Optane + 4x SSDs per node ever exceed that of
pure Optane disks?


No. With Ceph, the results for Optane and for merely good server SSDs are
almost the same. One thing is that you can run more OSDs per Optane than per
a usual SSD. However, the latency you get from both is almost the same, as
most of it comes from Ceph itself, not from the underlying storage. This also
means Optanes are useless for block.db/block.wal unless your SSDs are shitty
desktop ones.

And as usual I'm posting the link to my article
https://yourcmc.ru/wiki/Ceph_performance :)


You write that they are not reporting QD=1 single-threaded numbers,
but in Tables 10 and 11 the average latencies are reported, which
is "close to the same", so they get roughly:

Read latency: 0.32 ms (therefore ~3125 IOPS)
Write latency: 1.1 ms (therefore ~909 IOPS)

Really nice writeup and very true - should be a must-read for anyone
starting out with Ceph.


Thanks! :)

Tables 10 and 11 refer to QD=32 and 10 clients which is a significant 
load, at least because their CPUs were at 61.4% during the test. I think 
the latency with QD=1 and 1 client would be slightly better in their 
case (at least if they turned powersave off :)).
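
For reference, a minimal T1Q1 latency test of the kind being discussed here
(a sketch; the pool name rbd, the throwaway image testimg, and the client name
admin are assumptions, and fio must be built with the rbd engine):

fio --ioengine=rbd --direct=1 --name=t1q1 --bs=4k --iodepth=1 --rw=randwrite \
    --runtime=60 --pool=rbd --rbdname=testimg --clientname=admin
# at QD=1, IOPS is simply 1 / average latency:
# 0.32 ms -> ~3125 IOPS, 1.1 ms -> ~909 IOPS, matching the numbers above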


There is a new version of their "reference architecture" here: 
https://www.micron.com/-/media/client/global/documents/products/other-documents/micron_9300_and_red_hat_ceph_reference_architecture.pdf


The closest thing in the new PDF is 100 RBD clients with QD=1 and a 70/30 
mixed R/W workload. The numbers are messed up because they report a read 
latency of 0.72 ms and a write latency of 0.37 ms. This is probably 
reversed: it's the read that should be 0.37 ms :) A write latency of 
0.72 ms looks realistic for their setup...


--
Vitaliy Filippov


[ceph-users] CephFS meltdown fallout: mds assert failure, kernel oopses

2019-08-13 Thread Hector Martin
I just had a minor CephFS meltdown caused by underprovisioned RAM on the 
MDS servers. This is a CephFS with two ranks; I manually failed over the 
first rank and the new MDS server ran out of RAM in the rejoin phase 
(ceph-mds didn't get OOM-killed, but I think things slowed down enough 
due to swapping out that something timed out). This happened 4 times, 
with the rank bouncing between two MDS servers, until I brought up an 
MDS on a bigger machine.
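
One knob worth mentioning for this failure mode (a sketch, not something
described in this thread; the 8 GiB figure is purely illustrative) is the MDS
cache memory limit, which should sit well below the RAM actually available to
ceph-mds:

# cap the MDS cache at ~8 GiB; the default is 1 GiB, and the MDS can overshoot
# the limit somewhat, so leave headroom below physical RAM
ceph config set mds mds_cache_memory_limit 8589934592

# equivalent ceph.conf setting on Mimic:
# [mds]
#     mds cache memory limit = 8589934592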


The new MDS managed to become active, but then crashed with an assert:

2019-08-13 16:03:37.346 7fd4578b2700  1 mds.0.1164 clientreplay_done
2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.mon02 Updating MDS map to 
version 1239 from mon.1
2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.0.1164 handle_mds_map i am 
now mds.0.1164
2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.0.1164 handle_mds_map state 
change up:clientreplay --> up:active

2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.0.1164 active_start
2019-08-13 16:03:37.690 7fd45e2a7700  1 mds.0.1164 cluster recovered.
2019-08-13 16:03:45.130 7fd45e2a7700  1 mds.mon02 Updating MDS map to 
version 1240 from mon.1
2019-08-13 16:03:46.162 7fd45e2a7700  1 mds.mon02 Updating MDS map to 
version 1241 from mon.1
2019-08-13 16:03:50.286 7fd4578b2700 -1 
/build/ceph-13.2.6/src/mds/MDCache.cc: In function 'void 
MDCache::remove_inode(CInode*)' thread 7fd4578b2700 time 2019-08-13 
16:03:50.279463
/build/ceph-13.2.6/src/mds/MDCache.cc: 361: FAILED 
assert(o->get_num_ref() == 0)


 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic 
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x14e) [0x7fd46650eb5e]

 2: (()+0x2c4cb7) [0x7fd46650ecb7]
 3: (MDCache::remove_inode(CInode*)+0x59d) [0x55f423d6992d]
 4: (StrayManager::_purge_stray_logged(CDentry*, unsigned long, 
LogSegment*)+0x1f2) [0x55f423dc7192]

 5: (MDSIOContextBase::complete(int)+0x11d) [0x55f423ed42bd]
 6: (MDSLogContextBase::complete(int)+0x40) [0x55f423ed4430]
 7: (Finisher::finisher_thread_entry()+0x135) [0x7fd46650d0a5]
 8: (()+0x76db) [0x7fd465dc26db]
 9: (clone()+0x3f) [0x7fd464fa888f]

Thankfully this didn't happen on a subsequent attempt, and I got the 
filesystem happy again.


At this point, of the 4 kernel clients actively using the filesystem, 3 
had gone into a strange state (can't SSH in, partial service). Here is a 
kernel log from one of the hosts (the other two were similar):

https://mrcn.st/p/ezrhr1qR

After playing some service failover games and hard rebooting the three 
affected client boxes everything seems to be fine. The remaining FS 
client box had no kernel errors (other than blocked task warnings and 
cephfs talking about reconnections and such) and seems to be fine.


I can't find these errors anywhere, so I'm guessing they're not known bugs?

--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub


Re: [ceph-users] MDS corruption

2019-08-13 Thread ☣Adam
Pierre Dittes helped me with adding --rank=yourfsname:all and I ran the
following steps from the disaster recovery page: journal export, dentry
recovery, journal truncation, mds table wipes (session, snap and inode),
scan_extents, scan_inodes, scan_links, and cleanup.
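
For reference, a sketch of that sequence with the rank syntax that ended up
working; the filesystem name cephfs and data pool cephfs_data below are
placeholders, and the steps mirror the disaster-recovery page (several of them
are destructive, which is why the journal export backup comes first):

cephfs-journal-tool --rank=cephfs:all journal export backup.bin
cephfs-journal-tool --rank=cephfs:all event recover_dentries summary
cephfs-journal-tool --rank=cephfs:all journal reset
cephfs-table-tool all reset session
cephfs-table-tool all reset snap
cephfs-table-tool all reset inode
cephfs-data-scan scan_extents cephfs_data
cephfs-data-scan scan_inodes cephfs_data
cephfs-data-scan scan_links
cephfs-data-scan cleanup cephfs_data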

Now all three of my MDS servers are crashing due to a failed assert.
Logs with stacktrace are included (the other two servers have the same
stacktrace in their logs).

Currently I can't mount cephfs (which makes sense since there aren't any
MDS services up for more than a few minutes before they crash).  Any
suggestions on next steps to troubleshoot/fix this?

Hopefully there's some way to recover from this and I don't have to tell
my users that I lost all the data and we need to go back to the backups.
It shouldn't be a huge problem if we do, but it will cost a lot of
confidence in Ceph and its ability to keep data safe.

Thanks,
Adam

On 8/8/19 3:31 PM, ☣Adam wrote:
> I had a machine with insufficient memory and it seems to have corrupted
> data on my MDS.  The filesystem seems to be working fine, with the
> exception of accessing specific files.
> 
> The ceph-mds logs include things like:
> mds.0.1596621 unhandled write error (2) No such file or directory, force
> readonly...
> dir 0x100fb03 object missing on disk; some files may be lost
> (/adam/programming/bash)
> 
> I'm using mimic and trying to follow the instructions here:
> https://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
> 
> The punchline is this:
> cephfs-journal-tool --rank all journal export backup.bin
> Error ((22) Invalid argument)
> 2019-08-08 20:02:39.847 7f06827537c0 -1 main: Couldn't determine MDS rank.
> 
> I have a backup (outside of ceph) of all data which is inaccessible and
> I can back up anything which is accessible if need be.  There's some more
> information below, but my main question is: what are my next steps?
> 
> On a side note, I'd like to get involved with helping with documentation
> (man pages, the ceph website, usage text, etc). Where can I get started?
> 
> 
> 
> Here's the context:
> 
> cephfs-journal-tool event recover_dentries summary
> Error ((22) Invalid argument)
> 2019-08-08 19:50:04.798 7f21f4ffe7c0 -1 main: missing mandatory "--rank"
> argument
> 
> Seems like a bug in the documentation since `--rank` is a "mandatory
> option" according to the help text.  It looks like the rank of this node
> for MDS is 0, based on `ceph health detail`, but using `--rank 0` or
> `--rank all` doesn't work either:
> 
> ceph health detail
> HEALTH_ERR 1 MDSs report damaged metadata; 1 MDSs are read only
> MDS_DAMAGE 1 MDSs report damaged metadata
> mdsge.hax0rbana.org(mds.0): Metadata damage detected
> MDS_READ_ONLY 1 MDSs are read only
> mdsge.hax0rbana.org(mds.0): MDS in read-only mode
> 
> cephfs-journal-tool --rank 0 event recover_dentries summary
> Error ((22) Invalid argument)
> 2019-08-08 19:54:45.583 7f5b37c4c7c0 -1 main: Couldn't determine MDS rank.
> 
> 
> The only place I've found this error message is in an unanswered
> stackoverflow question and in the source code here:
> https://github.com/ceph/ceph/blob/master/src/tools/cephfs/JournalTool.cc#L114
> 
> It looks like that is trying to read a filesystem map (fsmap), which
> might be corrupted.  Running `rados export` prints part of the help text
> and then segfaults, which is rather concerning.  This is 100% repeatable
> (outside of gdb, details below).  I tried `rados df` and that worked
> fine, so it's not all rados commands which are having this problem.
> However, I tried `rados bench 60 seq` and that also printed out the
> usage text and then segfaulted.
> 
> 
> 
> 
> 
> Info on the `rados export` crash:
> rados export
> usage: rados [options] [commands]
> POOL COMMANDS
> 
> IMPORT AND EXPORT
>    export [filename]
>    Serialize pool contents to a file or standard out.
> 
> OMAP OPTIONS:
>    --omap-key-file file     read the omap key from a file
> *** Caught signal (Segmentation fault) **
>  in thread 7fcb6bfff700 thread_name:fn_anonymous
> 
> When running it in gdb:
> (gdb) bt
> #0  0x7fffef07331f in std::_Rb_tree<...>
> (the frame is a lookup in a std::map keyed by std::string whose values
> involve boost::variant<std::string, uint64_t, int64_t, double, bool,
> entity_addr_t, std::chrono::duration, Option::size_t, uuid_d> -- Ceph's
> config Option value type; the full template expansion was garbled and
> truncated by the mail archive)