Re: [ceph-users] Ceph assimilated configuration - unable to remove item

2019-12-13 Thread David Herselman
Hi,

I've logged a bug report (https://tracker.ceph.com/issues/43296) and Alwin from
Proxmox was kind enough to provide a work-around:
ceph config rm global rbd_default_features;
ceph config-key rm config/global/rbd_default_features;
ceph config set global rbd_default_features 31;

ceph config dump | grep -e WHO -e rbd_default_features;
WHO     MASK   LEVEL      OPTION                  VALUE   RO
global         advanced   rbd_default_features    31
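The raw key behind the stale entry can also be confirmed before removing it
(a quick check; config/global/... is the path used in the work-around above):

ceph config-key dump | grep rbd_default_features;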


Regards
David Herselman

-Original Message-
From: Stefan Kooman  
Sent: Wednesday, 11 December 2019 3:05 PM
To: David Herselman 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph assimilated configuration - unable to remove item

Quoting David Herselman (d...@syrex.co):
> Hi,
> 
> We assimilated our Ceph configuration to store attributes within Ceph 
> itself and subsequently have a minimal configuration file. Whilst this 
> works perfectly we are unable to remove configuration entries 
> populated by the assimilate-conf command.

I forgot about this issue, but I encountered it when we upgraded to Mimic. I can
confirm this bug. It's possible to have the same key present with different
values. For our production cluster we decided to stick to ceph.conf for the
time being. That's also the work-around for now if you want to override the
config store: just put the setting in your config file and restart the daemon(s).
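A minimal sketch of that work-around, using the option from this thread as an
example (value illustrative):

# /etc/ceph/ceph.conf
[global]
    rbd_default_features = 31

# ...then restart the affected daemons/clients so they re-read ceph.conf.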

Gr. Stefan


-- 
| BIT BV  https://www.bit.nl/    Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


[ceph-users] Ceph assimilated configuration - unable to remove item

2019-12-11 Thread David Herselman
Hi,

We assimilated our Ceph configuration to store attributes within Ceph itself 
and subsequently have a minimal configuration file. Whilst this works perfectly 
we are unable to remove configuration entries populated by the assimilate-conf 
command.

Ceph Nautilus 14.2.4.1 upgrade notes:
cd /etc/pve;
ceph config assimilate-conf -i ceph.conf -o ceph.conf.new;
mv ceph.conf.new ceph.conf;
pico /etc/ceph/ceph.conf
  # add back: cluster_network
  #   public_network
ceph config rm global cluster_network;
ceph config rm global public_network;
ceph config set global mon_osd_down_out_subtree_limit host;

Resulting minimal Ceph configuration file:
[admin@kvm1c ~]# cat /etc/ceph/ceph.conf
[global]
 cluster_network = 10.248.1.0/24
 filestore_xattr_use_omap = true
 fsid = 31f6ea46-12cb-47e8-a6f3-60fb6bbd1782
 mon_host = 10.248.1.60 10.248.1.61 10.248.1.62
 public_network = 10.248.1.0/24

[client]
 keyring = /etc/pve/priv/$cluster.$name.keyring

Ceph configuration entries:
[admin@kvm1c ~]# ceph config dump
WHO     MASK   LEVEL      OPTION                               VALUE          RO
global         advanced   auth_client_required                 cephx          *
global         advanced   auth_cluster_required                cephx          *
global         advanced   auth_service_required                cephx          *
global         advanced   cluster_network                      10.248.1.0/24  *
global         advanced   debug_filestore                      0/0
global         advanced   debug_journal                        0/0
global         advanced   debug_ms                             0/0
global         advanced   debug_osd                            0/0
global         basic      device_failure_prediction_mode       cloud
global         advanced   mon_allow_pool_delete                true
global         advanced   mon_osd_down_out_subtree_limit       host
global         advanced   osd_deep_scrub_interval              1209600.00
global         advanced   osd_pool_default_min_size            2
global         advanced   osd_pool_default_size                3
global         advanced   osd_scrub_begin_hour                 19
global         advanced   osd_scrub_end_hour                   6
global         advanced   osd_scrub_sleep                      0.10
global         advanced   public_network                       10.248.1.0/24  *
global         advanced   rbd_default_features                 7
global         advanced   rbd_default_features                 31
  mgr          advanced   mgr/balancer/active                  true
  mgr          advanced   mgr/balancer/mode                    upmap
  mgr          advanced   mgr/devicehealth/enable_monitoring   true

Note the duplicate 'rbd_default_features' entry. We've switched to kernel 5.3,
which supports object-map and fast-diff, and subsequently wanted to change the
default features for new RBD images to reflect this.
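For reference, the feature value is a bit mask (layering=1, striping=2,
exclusive-lock=4, object-map=8, fast-diff=16), so 7 is
layering+striping+exclusive-lock and 31 additionally enables object-map and
fast-diff. Existing images can be updated separately, e.g.:

rbd feature enable <pool>/<image> object-map fast-diff;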

Commands we entered to get here:
[admin@kvm1b ~]# ceph config dump | grep -e WHO -e rbd_default_features
WHO     MASK   LEVEL      OPTION                  VALUE   RO
global         advanced   rbd_default_features    7

[admin@kvm1b ~]# ceph config rm global rbd_default_features
[admin@kvm1b ~]# ceph config rm global rbd_default_features
[admin@kvm1b ~]# ceph config rm global rbd_default_features

[admin@kvm1b ~]# ceph config dump | grep -e WHO -e rbd_default_features
WHO     MASK   LEVEL      OPTION                  VALUE   RO
global         advanced   rbd_default_features    7

[admin@kvm1b ~]# ceph config set global rbd_default_features 31
[admin@kvm1b ~]# ceph config dump | grep -e WHO -e rbd_default_features
WHO     MASK   LEVEL      OPTION                  VALUE   RO
global         advanced   rbd_default_features    7
global         advanced   rbd_default_features    31



Regards
David Herselman


[ceph-users] Pool Max Avail and Ceph Dashboard Pool Useage on Nautilus giving different percentages

2019-12-10 Thread David Majchrzak, ODERLAND Webbhotell AB
Hi!

While browsing /#/pool in the Nautilus Ceph dashboard I noticed it said 93%
used on the single pool we have (3x replica).

ceph df detail, however, shows 81% used on the pool and 67% raw usage.

# ceph df detail
RAW STORAGE:
    CLASS    SIZE       AVAIL      USED       RAW USED    %RAW USED
    ssd      478 TiB    153 TiB    324 TiB    325 TiB     67.96
    TOTAL    478 TiB    153 TiB    324 TiB    325 TiB     67.96

POOLS:
    POOL    ID    STORED     OBJECTS    USED       %USED    MAX AVAIL    QUOTA OBJECTS    QUOTA BYTES    DIRTY     USED COMPR    UNDER COMPR
    echo     3    108 TiB    29.49M     324 TiB    81.61    24 TiB       N/A              N/A            29.49M    0 B           0 B


I know MAX AVAIL is calculated from the most full OSD (210 PGs, 79% used,
1.17 VAR). But where does the 93% full in the dashboard come from?

My guess is that it comes from calculating:

1 - MAX AVAIL / (USED + MAX AVAIL) = 0.93
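A quick check with the numbers from ceph df detail above (USED 324 TiB,
MAX AVAIL 24 TiB):

echo 'scale=4; 1 - 24/(324+24)' | bc
# -> .9311, i.e. ~93%, matching the dashboard figure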


Kind Regards,

David Majchrzak



[ceph-users] RGW bucket stats - strange behavior & slow performance requiring RGW restarts

2019-12-03 Thread David Monschein
Hi all,

I've been observing some strange behavior with my object storage cluster
running Nautilus 14.2.4. We currently have around 1800 buckets (A small
percentage of those buckets are actively used), with a total of 13.86M
objects. We have 20 RGWs right now, 10 for regular S3 access, and 10 for
static sites.

When calling $(radosgw-admin bucket stats), it normally comes back within a
few seconds, usually less than five. This returns stats for all buckets in
the cluster, which we use for accounting.

The strange behavior: Lately we've been observing a gradual increase in
runtime for bucket stats, which in extreme cases can take almost 10 minutes
to return. Things start out fine, and over the course of the week, the
runtime increases. From a few seconds to almost 10 minutes. Restarting all
of the S3 RGWs seems to fix this problem immediately. If we restart all the
radosgw processes, the runtime for bucket stats drops to 3 seconds.

This is odd behavior, and I've found nothing so far that would indicate why
this is happening. There is nothing suspicious in the RGW logs, although a
message about aborted multi-part uploads is in there:

2019-12-02 13:12:52.882 7faa7018f700 0 abort_bucket_multiparts WARNING :
aborted 8553000 incomplete multipart uploads

Otherwise, things look normal. Memory usage is low, CPU load is relatively
low and flat, and the cluster itself is not under heavy load.

Has anyone run into this before?


Re: [ceph-users] Tuning Nautilus for flash only

2019-11-28 Thread David Majchrzak, ODERLAND Webbhotell AB
Paul,

Absolutely, I said I was looking at those settings and most didn't make
any sense to me in a production environment (we've been running ceph
since Dumpling).

However, we only have one cluster on BlueStore and I wanted to get some
opinions on whether anything other than the defaults (in ceph.conf, in sysctl,
or things like the C-state tuning Wido suggested) would make any difference.
(Thank you Wido!)

Yes, running benchmarks is great, and we're already doing that
ourselves.

Cheers and have a nice evening!

-- 
David Majchrzak


On tor, 2019-11-28 at 17:46 +0100, Paul Emmerich wrote:
> Please don't run this config in production.
> Disabling checksumming is a bad idea, disabling authentication is
> also
> pretty bad.
> 
> There are also a few options in there that no longer exist (osd op
> threads) or are no longer relevant (max open files), in general, you
> should not blindly copy config files you find on the Internet. Only
> set an option to its non-default value after carefully checking what
> it does and whether it applies to your use case.
> 
> Also, run benchmarks yourself. Use benchmarks that are relevant to
> your use case.
> 
> Paul
> 



[ceph-users] Tuning Nautilus for flash only

2019-11-28 Thread David Majchrzak, ODERLAND Webbhotell AB
Hi!

We've deployed a new flash only ceph cluster running Nautilus and I'm
currently looking at any tunables we should set to get the most out of
our NVMe SSDs.

I've been looking a bit at the options from the blog post here:

https://ceph.io/community/bluestore-default-vs-tuned-performance-comparison/

with the conf here:
https://gist.github.com/likid0/1b52631ff5d0d649a22a3f30106ccea7

However some of them, like disabling checksumming, are for testing speed only
and not really applicable in a real-life scenario with critical data.

Should we stick with defaults or is there anything that could help?

We have 256 GB of RAM on each OSD host, 8 OSD hosts with 10 SSDs each and
2 OSD daemons per SSD. Should we raise the SSD BlueStore cache to 8 GB?
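As a sanity check on that: 10 SSDs with 2 OSD daemons each is 20 OSDs per
host, so 8 GB per OSD would budget roughly 160 GB of the 256 GB for Ceph. A
sketch of the relevant knobs (option names exist in Nautilus; the sizes are
illustrative, not recommendations):

[osd]
    osd_memory_target = 8589934592          # ~8 GiB per OSD; BlueStore autotunes its caches against this
    # or, to pin the BlueStore cache size directly instead:
    # bluestore_cache_size_ssd = 8589934592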

Workload is about 50/50 r/w ops running qemu VMs through librbd. So
mixed block size.

3 replicas.

Appreciate any advice!

Kind Regards,
-- 
David Majchrzak




[ceph-users] Dynamic bucket index resharding bug? - rgw.none showing unreal number of objects

2019-11-22 Thread David Monschein
Hi all. Running an Object Storage cluster with Ceph Nautilus 14.2.4.

We are running into what appears to be a serious bug that is affecting our
fairly new object storage cluster. While investigating some performance
issues -- seeing abnormally high IOPS, extremely slow bucket stat listings
(over 3 minutes) -- we noticed some dynamic bucket resharding jobs running.
Strangely enough they were resharding buckets that had very few objects.
Even more worrying was the number of new shards Ceph was planning: 65521

[root@os1 ~]# radosgw-admin reshard list
[
{
"time": "2019-11-22 00:12:40.192886Z",
"tenant": "",
"bucket_name": "redacteed",
"bucket_id": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7000639.20",
"new_instance_id":
"redacted:c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7552496.28",
"old_num_shards": 1,
"new_num_shards": 65521
}
]

Upon further inspection we noticed a seemingly impossible number of objects
(18446744073709551603) in rgw.none for the same bucket:
[root@os1 ~]# radosgw-admin bucket stats --bucket=redacted
{
"bucket": "redacted",
"tenant": "",
"zonegroup": "dbb69c5b-b33f-4af2-950c-173d695a4d2c",
"placement_rule": "default-placement",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
},
"id": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7000639.20",
"marker": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7000639.20",
"index_type": "Normal",
"owner": "d52cb8cc-1f92-47f5-86bf-fb28bc6b592c",
"ver": "0#12623",
"master_ver": "0#0",
"mtime": "2019-11-22 00:18:41.753188Z",
"max_marker": "0#",
"usage": {
"rgw.none": {
"size": 0,
"size_actual": 0,
"size_utilized": 0,
"size_kb": 0,
"size_kb_actual": 0,
"size_kb_utilized": 0,
"num_objects": 18446744073709551603
},
"rgw.main": {
"size": 63410030,
"size_actual": 63516672,
"size_utilized": 63410030,
"size_kb": 61924,
"size_kb_actual": 62028,
"size_kb_utilized": 61924,
"num_objects": 27
},
"rgw.multimeta": {
"size": 0,
"size_actual": 0,
"size_utilized": 0,
"size_kb": 0,
"size_kb_actual": 0,
"size_kb_utilized": 0,
"num_objects": 0
}
},
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
}
}

It would seem that the unreal number of objects in rgw.none is driving the
resharding process, making Ceph reshard the bucket into 65521 shards. I am
assuming 65521 is the limit.
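As a sanity check on that number (assuming num_objects is an unsigned 64-bit
counter that was decremented below zero):

echo '2^64 - 18446744073709551603' | bc
# -> 13, i.e. the counter appears to have underflowed to -13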

I have seen only a couple of references to this issue, none of which had a
resolution or much of a conversation around them:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030791.html
https://tracker.ceph.com/issues/37942

For now we are cancelling these resharding jobs since they seem to be
causing performance issues with the cluster, but this is an untenable
solution. Does anyone know what is causing this? Or how to prevent it/fix
it?
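For anyone else in the same spot, cancelling a pending job looks roughly like
this (bucket name illustrative):

radosgw-admin reshard cancel --bucket=redacted
radosgw-admin reshard list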

Thanks,
Dave Monschein


Re: [ceph-users] Zombie OSD filesystems rise from the grave during bluestore conversion

2019-11-05 Thread J David
On Tue, Nov 5, 2019 at 2:21 PM Janne Johansson  wrote:
> I seem to recall some ticket where zap would "only" clear 100M of the drive, 
> but lvm and all partition info needed more to be cleared, so using dd  
> bs=1M count=1024 (or more!) would be needed to make sure no part of the OS 
> picks up anything from the previous contents.

Based on the output, it seems like it is systematically destroying the
LVM stuff and partitions, not just doing the dd.  So far, we've
converted ~80 OSDs with no real issues other than the occasional
surprise udev remount.

Thanks!


Re: [ceph-users] Zombie OSD filesystems rise from the grave during bluestore conversion

2019-11-05 Thread J David
On Tue, Nov 5, 2019 at 3:18 AM Paul Emmerich  wrote:
> could be a new feature, I've only realized this exists/works since Nautilus.
> You seem to be on a relatively old version since you still have ceph-disk
> installed

None of this is using ceph-disk?  It's all done with ceph-volume.

The ceph clusters are all running Luminous 12.2.12, which shouldn't be
*that* old!  (Looking forward to Nautilus but it hasn't been qualified
for production use by our team yet.)

But a couple of our ceph clusters, including the ones at issue here,
originally date back to Firefly, so who knows what artifacts of the
past are still lurking around?

The next approach may be to just try to stop udev while ceph-volume
lvm zap is running.
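A rough sketch of that idea on a systemd host (unit names as on stock
Debian/systemd; untested):

systemctl stop systemd-udevd.service systemd-udevd-control.socket systemd-udevd-kernel.socket
ceph-volume lvm zap /dev/sda --destroy
systemctl start systemd-udevd-control.socket systemd-udevd-kernel.socket systemd-udevd.service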

It seems like we have a couple of months to figure this out since
we've moved on to HDD OSDs and it takes a day or so to drain a single
one. :-/

Thanks!


Re: [ceph-users] Zombie OSD filesystems rise from the grave during bluestore conversion

2019-11-04 Thread J David
On Mon, Nov 4, 2019 at 1:32 PM Paul Emmerich  wrote:
> BTW: you can run destroy before stopping the OSD, you won't need the
> --yes-i-really-mean-it if it's drained in this case

This actually does not seem to work:

$ sudo ceph osd safe-to-destroy 42
OSD(s) 42 are safe to destroy without reducing data durability.
$ sudo ceph osd destroy 42
Error EPERM: Are you SURE? This will mean real, permanent data loss,
as well as cephx and lockbox keys. Pass --yes-i-really-mean-it if you
really do.

Is that a bug?  Or did we miss a step?

Thanks!


Re: [ceph-users] Zombie OSD filesystems rise from the grave during bluestore conversion

2019-11-04 Thread J David
On Mon, Nov 4, 2019 at 1:32 PM Paul Emmerich  wrote:
> That's probably the ceph-disk udev script being triggered from
> something somewhere (and a lot of things can trigger that script...)

That makes total sense.

> Work-around: convert everything to ceph-volume simple first by running
> "ceph-volume simple scan" and "ceph-volume simple activate", that will
> disable udev in the intended way.

OK. Is there possibly a more surgical approach?  It's going to take a
really long time to convert the cluster, so we don't want to do
anything global that might cause weirdness if any of the OSD servers
with unconverted OSDs need to be rebooted during the process.
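For reference, my reading of that work-around, per OSD (paths illustrative),
is roughly:

ceph-volume simple scan /var/lib/ceph/osd/ceph-68
ceph-volume simple activate --all

i.e. scan records the running OSD as JSON under /etc/ceph/osd/ and activate
re-wires it via systemd instead of the ceph-disk udev rules.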

> BTW: you can run destroy before stopping the OSD, you won't need the
> --yes-i-really-mean-it if it's drained in this case

Great, we'll try that!

Thanks!


[ceph-users] Zombie OSD filesystems rise from the grave during bluestore conversion

2019-11-04 Thread J David
While converting a luminous cluster from filestore to bluestore, we
are running into a weird race condition on a fairly regular basis.

We have a master script that writes upgrade scripts for each OSD
server.  The script for an OSD looks like this:

ceph osd out 68
while ! ceph osd safe-to-destroy 68 ; do sleep 10 ; done
systemctl stop ceph-osd@68
sleep 10
systemctl kill ceph-osd@68
sleep 10
umount /var/lib/ceph/osd/ceph-68
ceph osd destroy 68 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sda --destroy
ceph-volume lvm create --bluestore --data /dev/sda --osd-id 68
sleep 10
while [ "`ceph health`" != "HEALTH_OK" ] ; do ceph health; sleep 10 ; done

(It's run with sh -e so any error will cause an abort.)

The problem we run into is that, in about 1 out of 10 runs, this gets to the
"lvm zap" stage and fails:

--> Zapping: /dev/sda
Running command: wipefs --all /dev/sda2
Running command: dd if=/dev/zero of=/dev/sda2 bs=1M count=10
 stderr: 10+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.00667608 s, 1.6 GB/s
--> Destroying partition since --destroy was used: /dev/sda2
Running command: parted /dev/sda --script -- rm 2
--> Unmounting /dev/sda1
Running command: umount -v /dev/sda1
 stderr: umount: /var/lib/ceph/tmp/mnt.9k0GDx (/dev/sda1) unmounted
Running command: wipefs --all /dev/sda1
 stderr: wipefs: error: /dev/sda1: probing initialization failed:
 stderr: Device or resource busy
-->  RuntimeError: command returned non-zero exit status: 1

And, lo and behold, it's right: /dev/sda1 has been remounted as
/var/lib/ceph/osd/ceph-68.

That's after the OSD has been stopped, killed, and destroyed; there
*is no* osd.68.  It happens after the filesystem has been unmounted
twice (once by an explicit umount and once by "lvm zap").  The "lvm zap"
umount shown here with the path /var/lib/ceph/tmp/mnt.9k0GDx
suggests that the remount is happening in the background somewhere
while the lvm zap is running.

If we do the zap before the osd destroy, the same thing happens but
the (still-existing) OSD does not actually restart.  So it's just the
filesystem that won't stay unmounted long enough to destroy it, not
the whole OSD.

What's causing this?  How do we keep the filesystem from lurching out
of the grave in mid-conversion like this?

This is on Debian Stretch with systemd, if that matters.

Thanks!


[ceph-users] Using multisite to migrate data between bucket data pools.

2019-10-30 Thread David Turner
This is a tangent on Paul Emmerich's response to "[ceph-users] Correct
Migration Workflow Replicated -> Erasure Code". I've tried Paul's method
before to migrate between 2 data pools. However I ran into some issues.

The first issue seems like a bug in RGW where the RGW for the new zone was
able to pull data directly from the data pool of the original zone after
the metadata had been sync'd. The metadata seemed to realize the file
actually exists and so it went ahead and grabbed it from the pool backing
the other zone. I worked around that slightly by using cephx to specify
which pools each RGW user could access, but it gives a permission denied
error instead of a file not found error. This happens on buckets that are
set not to replicate as well as buckets that failed to sync properly. Seems
like a bit of a security threat, but not a super common situation at all.

The second issue I think has to do with corrupt index files in my index
pool. Some of the buckets I don't need any more so I went to delete them
for simplicity, but the command failed to delete them. I just set them
aside for now and can just set the ones that I don't need any more to not
replicate on the bucket level. That works for most things, but then I have
a few buckets that I need to migrate, but when I set them to start
replicating the data sync between zones gets stuck. Does anyone have any
ideas on how to clean up the bucket indexes to make these operations
possible?
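For the record, the only index-repair command I'm aware of is the following
(use with care; --fix rewrites the bucket index):

radosgw-admin bucket check --bucket=<bucket> --check-objects --fix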

At this point I've disabled multisite and cleared up the new zone so I can
run operations on these buckets without dealing with multisite and
replication. I've tried a few things and can get some additional
information on my specific errors tomorrow at work.


-- Forwarded message -
From: Paul Emmerich 
Date: Wed, Oct 30, 2019 at 4:32 AM
Subject: [ceph-users] Re: Correct Migration Workflow Replicated -> Erasure
Code
To: Konstantin Shalygin 
Cc: Mac Wynkoop , ceph-users 


We've solved this off-list (because I already got access to the cluster)

For the list:

Copying on rados level is possible, but requires to shut down radosgw
to get a consistent copy. This wasn't feasible here due to the size
and performance.
We've instead added a second zone where the placement maps to an EC
pool to the zonegroup and it's currently copying over data. We'll then
make the second zone master and default and ultimately delete the
first one.
This allows for a migration without downtime.

Another possibility would be using a Transition lifecycle rule, but
that's not ideal because it doesn't actually change the bucket.

I don't think it would be too complicated to add a native bucket
migration mechanism that works similar to "bucket rewrite" (which is
intended for something similar but different).

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


Re: [ceph-users] ceph balancer do not start

2019-10-22 Thread David Turner
Off the top of my head, I'd say your cluster might have the wrong tunables
for crush-compat. I know I ran into that when I first set up the balancer
and nothing obviously said that was the problem; only researching found it
for me.

My real question, though, is why aren't you using upmap? It is
significantly better than crush-compat. Unless you have clients on really
old kernels that can't update or that are on pre-luminous Ceph versions
that can't update, there's really no reason not to use upmap.
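Switching is only a couple of commands (assuming every client really is
Luminous or newer):

ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on
ceph balancer status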

On Mon, Oct 21, 2019, 8:08 AM Jan Peters  wrote:

> Hello,
>
> I use ceph 12.2.12 and would like to activate the ceph balancer.
>
> unfortunately no redistribution of the PGs is started:
>
> ceph balancer status
> {
> "active": true,
> "plans": [],
> "mode": "crush-compat"
> }
>
> ceph balancer eval
> current cluster score 0.023776 (lower is better)
>
>
> ceph config-key dump
> {
> "initial_mon_keyring":
> "AQBLchlbABAA+5CuVU+8MB69xfc3xAXkjQ==",
> "mgr/balancer/active": "1",
> "mgr/balancer/max_misplaced:": "0.01",
> "mgr/balancer/mode": "crush-compat"
> }
>
>
> What am I not doing correctly?
>
> best regards


Re: [ceph-users] Decreasing the impact of reweighting osds

2019-10-22 Thread David Turner
Most times you are better served with simpler settings like
osd_recovery_sleep, which has three variants if you have multiple types of OSDs
in your cluster (osd_recovery_sleep_hdd, osd_recovery_sleep_ssd,
osd_recovery_sleep_hybrid).
Using those you can tweak a specific type of OSD that might be having
problems during recovery/backfill while allowing the others to continue to
backfill at regular speeds.
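For example (values illustrative, not recommendations):

ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0.2'
ceph tell osd.* injectargs '--osd_recovery_sleep_ssd 0.0'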

Additionally you mentioned reweighting OSDs, but it sounded like you do
this manually. The balancer module, especially in upmap mode, can be
configured quite well to minimize client IO impact while balancing. You can
specify times of day that it can move data (only in UTC, it ignores local
timezones), a threshold of misplaced data that it will stop moving PGs at,
the increment size it will change weights with per operation, how many
weights it will adjust with each pass, etc.
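A sketch of a few of those knobs (key names from the balancer module;
availability and exact names depend on your release, and on newer releases
they are set with "ceph config set mgr ..." instead):

ceph config-key set mgr/balancer/begin_time 2300
ceph config-key set mgr/balancer/end_time 0600
ceph config-key set mgr/balancer/max_misplaced 0.01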

On Tue, Oct 22, 2019, 6:07 PM Mark Kirkwood 
wrote:

> Thanks - that's a good suggestion!
>
> However I'd still like to know the answers to my 2 questions.
>
> regards
>
> Mark
>
> On 22/10/19 11:22 pm, Paul Emmerich wrote:
> > getting rid of filestore solves most latency spike issues during
> > recovery because they are often caused by random XFS hangs (splitting
> > dirs or just xfs having a bad day)
> >
> >
> > Paul
> >


Re: [ceph-users] [Nfs-ganesha-devel] 2.7.3 with CEPH_FSAL Crashing

2019-10-10 Thread David C
Thanks, Patrick. Looks like the fix is awaiting review; I guess my options
are to hold tight for 14.2.5 or patch it myself if I get desperate. I've seen
this crash about 4 times over the past 96 hours; is there anything I can do
to mitigate the issue in the meantime?

On Wed, Oct 9, 2019 at 9:23 PM Patrick Donnelly  wrote:

> Looks like this bug: https://tracker.ceph.com/issues/41148
>
> On Wed, Oct 9, 2019 at 1:15 PM David C  wrote:
> >
> > Hi Daniel
> >
> > Thanks for looking into this. I hadn't installed ceph-debuginfo, here's
> the bt with line numbers:
> >
> > #0  operator uint64_t (this=0x10) at
> /usr/src/debug/ceph-14.2.2/src/include/object.h:123
> > #1  Client::fill_statx (this=this@entry=0x274b980, in=0x0,
> mask=mask@entry=341, stx=stx@entry=0x7fccdbefa210) at
> /usr/src/debug/ceph-14.2.2/src/client/Client.cc:7336
> > #2  0x7fce4ea1d4ca in fill_statx (stx=0x7fccdbefa210, mask=341,
> in=..., this=0x274b980) at
> /usr/src/debug/ceph-14.2.2/src/client/Client.h:898
> > #3  Client::_readdir_cache_cb (this=this@entry=0x274b980,
> dirp=dirp@entry=0x7fcb7d0e7860,
> > cb=cb@entry=0x7fce4e9d0950 <_readdir_single_dirent_cb(void*,
> dirent*, ceph_statx*, off_t, Inode*)>, p=p@entry=0x7fccdbefa6a0,
> caps=caps@entry=341,
> > getref=getref@entry=true) at
> /usr/src/debug/ceph-14.2.2/src/client/Client.cc:7999
> > #4  0x7fce4ea1e865 in Client::readdir_r_cb (this=0x274b980,
> d=0x7fcb7d0e7860,
> > cb=cb@entry=0x7fce4e9d0950 <_readdir_single_dirent_cb(void*,
> dirent*, ceph_statx*, off_t, Inode*)>, p=p@entry=0x7fccdbefa6a0,
> want=want@entry=1775,
> > flags=flags@entry=0, getref=true) at
> /usr/src/debug/ceph-14.2.2/src/client/Client.cc:8138
> > #5  0x7fce4ea1f3dd in Client::readdirplus_r (this=,
> d=, de=de@entry=0x7fccdbefa8c0, stx=stx@entry=0x7fccdbefa730,
> want=want@entry=1775,
> > flags=flags@entry=0, out=0x7fccdbefa720) at
> /usr/src/debug/ceph-14.2.2/src/client/Client.cc:8307
> > #6  0x7fce4e9c92d8 in ceph_readdirplus_r (cmount=,
> dirp=, de=de@entry=0x7fccdbefa8c0, stx=stx@entry
> =0x7fccdbefa730,
> > want=want@entry=1775, flags=flags@entry=0, out=out@entry=0x7fccdbefa720)
> at /usr/src/debug/ceph-14.2.2/src/libcephfs.cc:629
> > #7  0x7fce4ece7b0e in fsal_ceph_readdirplus (dir=,
> cred=, out=0x7fccdbefa720, flags=0, want=1775,
> stx=0x7fccdbefa730, de=0x7fccdbefa8c0,
> > dirp=, cmount=) at
> /usr/src/debug/nfs-ganesha-2.7.3/FSAL/FSAL_CEPH/statx_compat.h:314
> > #8  ceph_fsal_readdir (dir_pub=, whence=,
> dir_state=0x7fccdbefaa30, cb=0x522640 ,
> attrmask=122830,
> > eof=0x7fccdbefac0b) at
> /usr/src/debug/nfs-ganesha-2.7.3/FSAL/FSAL_CEPH/handle.c:211
> > #9  0x005256e1 in mdcache_readdir_uncached
> (directory=directory@entry=0x7fcaa8bb84a0, whence=,
> dir_state=, cb=,
> > attrmask=, eod_met=) at
> /usr/src/debug/nfs-ganesha-2.7.3/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:1654
> > #10 0x00517a88 in mdcache_readdir (dir_hdl=0x7fcaa8bb84d8,
> whence=0x7fccdbefab18, dir_state=0x7fccdbefab30, cb=0x432db0
> , attrmask=122830,
> > eod_met=0x7fccdbefac0b) at
> /usr/src/debug/nfs-ganesha-2.7.3/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:551
> > #11 0x0043434a in fsal_readdir 
> > (directory=directory@entry=0x7fcaa8bb84d8,
> cookie=cookie@entry=0, nbfound=nbfound@entry=0x7fccdbefac0c,
> > eod_met=eod_met@entry=0x7fccdbefac0b, attrmask=122830, 
> > cb=cb@entry=0x46f600
> , opaque=opaque@entry=0x7fccdbefac20)
> > at /usr/src/debug/nfs-ganesha-2.7.3/FSAL/fsal_helper.c:1164
> > #12 0x004705b9 in nfs4_op_readdir (op=0x7fcb7fed1f80,
> data=0x7fccdbefaea0, resp=0x7fcb7d106c40)
> > at
> /usr/src/debug/nfs-ganesha-2.7.3/Protocols/NFS/nfs4_op_readdir.c:664
> > #13 0x0045d120 in nfs4_Compound (arg=,
> req=, res=0x7fcb7e001000)
> > at /usr/src/debug/nfs-ganesha-2.7.3/Protocols/NFS/nfs4_Compound.c:942
> > #14 0x004512cd in nfs_rpc_process_request
> (reqdata=0x7fcb7e1d1950) at
> /usr/src/debug/nfs-ganesha-2.7.3/MainNFSD/nfs_worker_thread.c:1328
> > #15 0x00450766 in nfs_rpc_decode_request (xprt=0x7fcaf17fb0e0,
> xdrs=0x7fcb7e1ddb90) at
> /usr/src/debug/nfs-ganesha-2.7.3/MainNFSD/nfs_rpc_dispatcher_thread.c:1345
> > #16 0x7fce6165707d in svc_rqst_xprt_task (wpe=0x7fcaf17fb2f8) at
> /usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:769
> > #17 0x7fce6165759a in svc_rqst_epoll_events (n_events= out>, sr_rec=0x56a24c0) at
> /usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:941
> > #18 svc_rqst_epoll_loop (sr_rec=) at
> /usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c

Re: [ceph-users] [Nfs-ganesha-devel] 2.7.3 with CEPH_FSAL Crashing

2019-10-09 Thread David C
Hi Daniel

Thanks for looking into this. I hadn't installed ceph-debuginfo, here's the
bt with line numbers:

#0  operator uint64_t (this=0x10) at
/usr/src/debug/ceph-14.2.2/src/include/object.h:123
#1  Client::fill_statx (this=this@entry=0x274b980, in=0x0, mask=mask@entry=341,
stx=stx@entry=0x7fccdbefa210) at
/usr/src/debug/ceph-14.2.2/src/client/Client.cc:7336
#2  0x7fce4ea1d4ca in fill_statx (stx=0x7fccdbefa210, mask=341, in=...,
this=0x274b980) at /usr/src/debug/ceph-14.2.2/src/client/Client.h:898
#3  Client::_readdir_cache_cb (this=this@entry=0x274b980, dirp=dirp@entry
=0x7fcb7d0e7860,
cb=cb@entry=0x7fce4e9d0950 <_readdir_single_dirent_cb(void*, dirent*,
ceph_statx*, off_t, Inode*)>, p=p@entry=0x7fccdbefa6a0, caps=caps@entry=341,
getref=getref@entry=true) at
/usr/src/debug/ceph-14.2.2/src/client/Client.cc:7999
#4  0x7fce4ea1e865 in Client::readdir_r_cb (this=0x274b980,
d=0x7fcb7d0e7860,
cb=cb@entry=0x7fce4e9d0950 <_readdir_single_dirent_cb(void*, dirent*,
ceph_statx*, off_t, Inode*)>, p=p@entry=0x7fccdbefa6a0, want=want@entry
=1775,
flags=flags@entry=0, getref=true) at
/usr/src/debug/ceph-14.2.2/src/client/Client.cc:8138
#5  0x7fce4ea1f3dd in Client::readdirplus_r (this=,
d=, de=de@entry=0x7fccdbefa8c0, stx=stx@entry=0x7fccdbefa730,
want=want@entry=1775,
flags=flags@entry=0, out=0x7fccdbefa720) at
/usr/src/debug/ceph-14.2.2/src/client/Client.cc:8307
#6  0x7fce4e9c92d8 in ceph_readdirplus_r (cmount=,
dirp=, de=de@entry=0x7fccdbefa8c0, stx=stx@entry
=0x7fccdbefa730,
want=want@entry=1775, flags=flags@entry=0, out=out@entry=0x7fccdbefa720)
at /usr/src/debug/ceph-14.2.2/src/libcephfs.cc:629
#7  0x7fce4ece7b0e in fsal_ceph_readdirplus (dir=,
cred=, out=0x7fccdbefa720, flags=0, want=1775,
stx=0x7fccdbefa730, de=0x7fccdbefa8c0,
dirp=, cmount=) at
/usr/src/debug/nfs-ganesha-2.7.3/FSAL/FSAL_CEPH/statx_compat.h:314
#8  ceph_fsal_readdir (dir_pub=, whence=,
dir_state=0x7fccdbefaa30, cb=0x522640 ,
attrmask=122830,
eof=0x7fccdbefac0b) at
/usr/src/debug/nfs-ganesha-2.7.3/FSAL/FSAL_CEPH/handle.c:211
#9  0x005256e1 in mdcache_readdir_uncached
(directory=directory@entry=0x7fcaa8bb84a0, whence=,
dir_state=, cb=,
attrmask=, eod_met=) at
/usr/src/debug/nfs-ganesha-2.7.3/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:1654
#10 0x00517a88 in mdcache_readdir (dir_hdl=0x7fcaa8bb84d8,
whence=0x7fccdbefab18, dir_state=0x7fccdbefab30, cb=0x432db0
, attrmask=122830,
eod_met=0x7fccdbefac0b) at
/usr/src/debug/nfs-ganesha-2.7.3/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:551
#11 0x0043434a in fsal_readdir
(directory=directory@entry=0x7fcaa8bb84d8,
cookie=cookie@entry=0, nbfound=nbfound@entry=0x7fccdbefac0c,
eod_met=eod_met@entry=0x7fccdbefac0b, attrmask=122830, cb=cb@entry=0x46f600
, opaque=opaque@entry=0x7fccdbefac20)
at /usr/src/debug/nfs-ganesha-2.7.3/FSAL/fsal_helper.c:1164
#12 0x004705b9 in nfs4_op_readdir (op=0x7fcb7fed1f80,
data=0x7fccdbefaea0, resp=0x7fcb7d106c40)
at /usr/src/debug/nfs-ganesha-2.7.3/Protocols/NFS/nfs4_op_readdir.c:664
#13 0x0045d120 in nfs4_Compound (arg=,
req=, res=0x7fcb7e001000)
at /usr/src/debug/nfs-ganesha-2.7.3/Protocols/NFS/nfs4_Compound.c:942
#14 0x004512cd in nfs_rpc_process_request (reqdata=0x7fcb7e1d1950)
at /usr/src/debug/nfs-ganesha-2.7.3/MainNFSD/nfs_worker_thread.c:1328
#15 0x00450766 in nfs_rpc_decode_request (xprt=0x7fcaf17fb0e0,
xdrs=0x7fcb7e1ddb90) at
/usr/src/debug/nfs-ganesha-2.7.3/MainNFSD/nfs_rpc_dispatcher_thread.c:1345
#16 0x7fce6165707d in svc_rqst_xprt_task (wpe=0x7fcaf17fb2f8) at
/usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:769
#17 0x7fce6165759a in svc_rqst_epoll_events (n_events=,
sr_rec=0x56a24c0) at
/usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:941
#18 svc_rqst_epoll_loop (sr_rec=) at
/usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:1014
#19 svc_rqst_run_task (wpe=0x56a24c0) at
/usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:1050
#20 0x7fce6165f123 in work_pool_thread (arg=0x7fcd381c77b0) at
/usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/work_pool.c:181
#21 0x7fce5fc17dd5 in start_thread (arg=0x7fccdbefe700) at
pthread_create.c:307
#22 0x7fce5ed8eead in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:111

On Mon, Oct 7, 2019 at 3:40 PM Daniel Gryniewicz  wrote:

> Client::fill_statx() is a fairly large function, so it's hard to know
> what's causing the crash.  Can you get line numbers from your backtrace?
>
> Daniel
>
> On 10/7/19 9:59 AM, David C wrote:
> > Hi All
> >
> > Further to my previous messages, I upgraded
> > to libcephfs2-14.2.2-0.el7.x86_64 as suggested and things certainly seem
> > a lot more stable, I have had some crashes though, could someone assist
> > in debugging this latest crash please?
> >
> > (gdb) bt
> > #0  0x7fce4e9fc1bb in Client::fi

Re: [ceph-users] [Nfs-ganesha-devel] 2.7.3 with CEPH_FSAL Crashing

2019-10-07 Thread David C
Hi All

Further to my previous messages, I upgraded
to libcephfs2-14.2.2-0.el7.x86_64 as suggested and things certainly seem a
lot more stable, I have had some crashes though, could someone assist in
debugging this latest crash please?

(gdb) bt
#0  0x7fce4e9fc1bb in Client::fill_statx(Inode*, unsigned int,
ceph_statx*) () from /lib64/libcephfs.so.2
#1  0x7fce4ea1d4ca in Client::_readdir_cache_cb(dir_result_t*, int
(*)(void*, dirent*, ceph_statx*, long, Inode*), void*, int, bool) () from
/lib64/libcephfs.so.2
#2  0x7fce4ea1e865 in Client::readdir_r_cb(dir_result_t*, int
(*)(void*, dirent*, ceph_statx*, long, Inode*), void*, unsigned int,
unsigned int, bool) () from /lib64/libcephfs.so.2
#3  0x7fce4ea1f3dd in Client::readdirplus_r(dir_result_t*, dirent*,
ceph_statx*, unsigned int, unsigned int, Inode**) () from
/lib64/libcephfs.so.2
#4  0x7fce4ece7b0e in fsal_ceph_readdirplus (dir=,
cred=, out=0x7fccdbefa720, flags=0, want=1775,
stx=0x7fccdbefa730, de=0x7fccdbefa8c0, dirp=,
cmount=)
at /usr/src/debug/nfs-ganesha-2.7.3/FSAL/FSAL_CEPH/statx_compat.h:314
#5  ceph_fsal_readdir (dir_pub=, whence=,
dir_state=0x7fccdbefaa30, cb=0x522640 ,
attrmask=122830, eof=0x7fccdbefac0b) at
/usr/src/debug/nfs-ganesha-2.7.3/FSAL/FSAL_CEPH/handle.c:211
#6  0x005256e1 in mdcache_readdir_uncached
(directory=directory@entry=0x7fcaa8bb84a0, whence=,
dir_state=, cb=, attrmask=,
eod_met=)
at
/usr/src/debug/nfs-ganesha-2.7.3/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:1654
#7  0x00517a88 in mdcache_readdir (dir_hdl=0x7fcaa8bb84d8,
whence=0x7fccdbefab18, dir_state=0x7fccdbefab30, cb=0x432db0
, attrmask=122830, eod_met=0x7fccdbefac0b) at
/usr/src/debug/nfs-ganesha-2.7.3/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:551
#8  0x0043434a in fsal_readdir
(directory=directory@entry=0x7fcaa8bb84d8,
cookie=cookie@entry=0, nbfound=nbfound@entry=0x7fccdbefac0c,
eod_met=eod_met@entry=0x7fccdbefac0b, attrmask=122830, cb=cb@entry=0x46f600
, opaque=opaque@entry=0x7fccdbefac20)
at /usr/src/debug/nfs-ganesha-2.7.3/FSAL/fsal_helper.c:1164
#9  0x004705b9 in nfs4_op_readdir (op=0x7fcb7fed1f80,
data=0x7fccdbefaea0, resp=0x7fcb7d106c40) at
/usr/src/debug/nfs-ganesha-2.7.3/Protocols/NFS/nfs4_op_readdir.c:664
#10 0x0045d120 in nfs4_Compound (arg=,
req=, res=0x7fcb7e001000) at
/usr/src/debug/nfs-ganesha-2.7.3/Protocols/NFS/nfs4_Compound.c:942
#11 0x004512cd in nfs_rpc_process_request (reqdata=0x7fcb7e1d1950)
at /usr/src/debug/nfs-ganesha-2.7.3/MainNFSD/nfs_worker_thread.c:1328
#12 0x00450766 in nfs_rpc_decode_request (xprt=0x7fcaf17fb0e0,
xdrs=0x7fcb7e1ddb90) at
/usr/src/debug/nfs-ganesha-2.7.3/MainNFSD/nfs_rpc_dispatcher_thread.c:1345
#13 0x7fce6165707d in svc_rqst_xprt_task (wpe=0x7fcaf17fb2f8) at
/usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:769
#14 0x7fce6165759a in svc_rqst_epoll_events (n_events=,
sr_rec=0x56a24c0) at
/usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:941
#15 svc_rqst_epoll_loop (sr_rec=) at
/usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:1014
#16 svc_rqst_run_task (wpe=0x56a24c0) at
/usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:1050
#17 0x7fce6165f123 in work_pool_thread (arg=0x7fcd381c77b0) at
/usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/work_pool.c:181
#18 0x7fce5fc17dd5 in start_thread () from /lib64/libpthread.so.0
#19 0x7fce5ed8eead in clone () from /lib64/libc.so.6

Package versions:

nfs-ganesha-vfs-2.7.3-0.1.el7.x86_64
nfs-ganesha-debuginfo-2.7.3-0.1.el7.x86_64
nfs-ganesha-ceph-2.7.3-0.1.el7.x86_64
nfs-ganesha-2.7.3-0.1.el7.x86_64
libcephfs2-14.2.2-0.el7.x86_64
librados2-14.2.2-0.el7.x86_64

Ganesha export:

EXPORT
{
Export_ID=100;
Protocols = 4;
Transports = TCP;
Path = /;
Pseudo = /ceph/;
Access_Type = RW;
Attr_Expiration_Time = 0;
Disable_ACL = FALSE;
Manage_Gids = TRUE;
Filesystem_Id = 100.1;
FSAL {
Name = CEPH;
}
}

Ceph.conf:

[client]
mon host = --removed--
client_oc_size = 6291456000 #6GB
client_acl_type=posix_acl
client_quota = true
client_quota_df = true

Client mount options:

rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=removed,local_lock=none,addr=removed)

On Fri, Jul 19, 2019 at 5:47 PM David C  wrote:

> Thanks, Jeff. I'll give 14.2.2 a go when it's released.
>
> On Wed, 17 Jul 2019, 22:29 Jeff Layton,  wrote:
>
>> Ahh, I just noticed you were running nautilus on the client side. This
>> patch went into v14.2.2, so once you update to that you should be good
>> to go.
>>
>> -- Jeff
>>
>> On Wed, 2019-07-17 at 17:10 -0400, Jeff Layton wrote:
>> > This is almost certainly the same bug that is fixed here:
>> >
>> > https://github.com/ceph/ceph/pull/28324
>> >
>> > It should get backported 

Re: [ceph-users] eu.ceph.com mirror out of sync?

2019-09-23 Thread David Majchrzak, ODERLAND Webbhotell AB
Hi,

I'll have a look at the status of se.ceph.com tomorrow morning, it's
maintained by us.

Kind Regards,

David


On mån, 2019-09-23 at 22:41 +0200, Oliver Freyermuth wrote:
> Hi together,
> 
> the EU mirror still seems to be out-of-sync - does somebody on this
> list happen to know whom to contact about this?
> Or is this mirror unmaintained and we should switch to something
> else?
> 
> Going through the list of appropriate mirrors from 
> https://docs.ceph.com/docs/master/install/mirrors/ (we are in
> Germany) I also find:
>http://de.ceph.com/
> (the mirror in Germany) to be non-resolvable.
> 
> Closest by then for us is possibly France:
>http://fr.ceph.com/rpm-nautilus/el7/x86_64/
> but also here, there's only 14.2.2, so that's also out-of-sync.
> 
> So in the EU, at least geographically, this only leaves Sweden and
> UK.
> Sweden at se.ceph.com does not load for me, but UK indeed seems fine.
> 
> Should people in the EU use that mirror, or should we all just use
> download.ceph.com instead of something geographically close-by?
> 
> Cheers,
>   Oliver
> 
> 
> On 2019-09-17 23:01, Oliver Freyermuth wrote:
> > Dear Cephalopodians,
> > 
> > I realized just now that:
> >https://eu.ceph.com/rpm-nautilus/el7/x86_64/
> > still holds only released up to 14.2.2, and nothing is to be seen
> > of 14.2.3 or 14.2.4,
> > while the main repository at:
> >https://download.ceph.com/rpm-nautilus/el7/x86_64/
> > looks as expected.
> > 
> > Is this issue with the eu.ceph.com mirror already knwon?
> > 
> > Cheers,
> >  Oliver
> > 
> > 


[ceph-users] Problem formatting erasure coded image

2019-09-22 Thread David Herselman
Hi,

I'm seeing errors in Windows VM guests' event logs, for example:
The IO operation at logical block address 0x607bf7 for Disk 1 (PDO name 
\Device\001e) was retried
Log Name: System
Source: Disk
Event ID: 153
Level: Warning

Initialising the disk to use GPT is successful but attempting to create a 
standard NTFS volume eventually times out and fails.


Pretty sure this is in production in numerous environments, so I must be doing
something wrong... Could anyone please validate that a cache-tiered, erasure-coded
RBD image can be used as a Windows VM data disc?


Running Ceph Nautilus 14.2.4 with kernel 5.0.21

Created new erasure coded pool backed by spinners and a new replicated ssd pool 
for metadata:
ceph osd erasure-code-profile set ec32_hdd \
  plugin=jerasure k=3 m=2 technique=reed_sol_van \
  crush-root=default crush-failure-domain=host crush-device-class=hdd \
  directory=/usr/lib/ceph/erasure-code;
ceph osd pool create ec_hdd 64 erasure ec32_hdd;
ceph osd pool set ec_hdd allow_ec_overwrites true;
ceph osd pool application enable ec_hdd rbd;

ceph osd crush rule create-replicated replicated_ssd default host ssd;
ceph osd pool create rbd_ssd 64 64 replicated replicated_ssd;
ceph osd pool application enable rbd_ssd rbd;

rbd create rbd_ssd/surveylance-recordings --size 1T --data-pool ec_hdd;

Added a caching tier:
ceph osd pool create ec_hdd_cache 64 64 replicated replicated_ssd;
ceph osd tier add ec_hdd ec_hdd_cache;
ceph osd tier cache-mode ec_hdd_cache writeback;
ceph osd tier set-overlay ec_hdd ec_hdd_cache;
ceph osd pool set ec_hdd_cache hit_set_type bloom;

ceph osd pool set ec_hdd_cache hit_set_count 12
ceph osd pool set ec_hdd_cache hit_set_period 14400
ceph osd pool set ec_hdd_cache target_max_bytes $[128*1024*1024*1024]
ceph osd pool set ec_hdd_cache min_read_recency_for_promote 2
ceph osd pool set ec_hdd_cache min_write_recency_for_promote 2
ceph osd pool set ec_hdd_cache cache_target_dirty_ratio 0.4
ceph osd pool set ec_hdd_cache cache_target_dirty_high_ratio 0.6
ceph osd pool set ec_hdd_cache cache_target_full_ratio 0.8


Image appears to have been created correctly:
rbd ls rbd_ssd -l
NAME   SIZE  PARENT FMT PROT LOCK
surveylance-recordings 1 TiB  2

rbd info rbd_ssd/surveylance-recordings
rbd image 'surveylance-recordings':
size 1 TiB in 262144 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: 7341cc54df71f
data_pool: ec_hdd
block_name_prefix: rbd_data.2.7341cc54df71f
format: 2
features: layering, data-pool
op_features:
flags:
create_timestamp: Sun Sep 22 17:47:30 2019
access_timestamp: Sun Sep 22 17:47:30 2019
modify_timestamp: Sun Sep 22 17:47:30 2019

Ceph appears healthy:
ceph -s
  cluster:
id: 31f6ea46-12cb-47e8-a6f3-60fb6bbd1782
health: HEALTH_OK

  services:
mon: 3 daemons, quorum kvm1a,kvm1b,kvm1c (age 5d)
mgr: kvm1c(active, since 5d), standbys: kvm1b, kvm1a
mds: cephfs:1 {0=kvm1c=up:active} 2 up:standby
osd: 24 osds: 24 up (since 4d), 24 in (since 4d)

  data:
pools:   9 pools, 417 pgs
objects: 325.04k objects, 1.1 TiB
usage:   3.3 TiB used, 61 TiB / 64 TiB avail
pgs: 417 active+clean

  io:
client:   25 KiB/s rd, 2.7 MiB/s wr, 17 op/s rd, 306 op/s wr
cache:0 op/s promote

ceph df
  RAW STORAGE:
    CLASS    SIZE       AVAIL      USED       RAW USED    %RAW USED
    hdd      62 TiB     59 TiB     2.9 TiB    2.9 TiB      4.78
    ssd      2.4 TiB    2.1 TiB    303 GiB    309 GiB     12.36
    TOTAL    64 TiB     61 TiB     3.2 TiB    3.3 TiB      5.07

  POOLS:
    POOL                     ID    STORED     OBJECTS    USED       %USED    MAX AVAIL
    rbd_hdd                   1    995 GiB    289.54k    2.9 TiB     5.23    18 TiB
    rbd_ssd                   2    17 B       4          48 KiB      0       666 GiB
    rbd_hdd_cache             3    99 GiB     34.91k     302 GiB    13.13    666 GiB
    cephfs_data               4    2.1 GiB    526        6.4 GiB     0.01    18 TiB
    cephfs_metadata           5    767 KiB    22         3.7 MiB     0       18 TiB
    device_health_metrics     6    5.9 MiB    24         5.9 MiB     0       18 TiB
    ec_hdd                   10    4.0 MiB    3          7.5 MiB     0       32 TiB
    ec_hdd_cache             11    67 MiB     30         200 MiB     0       666 GiB



Regards
David Herselman



[ceph-users] FYI: Mailing list domain change

2019-08-07 Thread David Galloway
Hi all,

I am in the process of migrating the upstream Ceph mailing lists from
Dreamhost to a self-hosted instance of Mailman 3.

Please update your address book and mail filters to ceph-us...@ceph.io
(notice the Top Level Domain change).

You may receive a "Welcome" e-mail as I subscribe you to the new list.
No other action should be required on your part.
-- 
David Galloway
Systems Administrator, RDU
Ceph Engineering
IRC: dgalloway


Re: [ceph-users] Ceph Nautilus - can't balance due to degraded state

2019-08-03 Thread David Herselman
pg_upmap_items 8.33c [304,305]
pg_upmap_items 8.344 [404,403]
pg_upmap_items 8.346 [201,204]
pg_upmap_items 8.349 [504,503]
pg_upmap_items 8.350 [501,500]
pg_upmap_items 8.356 [101,102]
pg_upmap_items 8.358 [404,405]
pg_upmap_items 8.363 [103,102]
pg_upmap_items 8.364 [404,403]
pg_upmap_items 8.366 [404,403]
pg_upmap_items 8.369 [304,305]
pg_upmap_items 8.36b [103,102]
pg_upmap_items 8.373 [404,403]
pg_upmap_items 8.383 [404,403]
pg_upmap_items 8.39d [203,205]
pg_upmap_items 8.3a3 [103,102]
pg_upmap_items 8.3a6 [304,305]
pg_upmap_items 8.3ab [304,305]
pg_upmap_items 8.3af [304,305]
pg_upmap_items 8.3b3 [404,405]
pg_upmap_items 8.3b4 [303,305]
pg_upmap_items 8.3b7 [404,403]
pg_upmap_items 8.3b9 [404,403]
pg_upmap_items 8.3ba [404,403,201,205]
pg_upmap_items 8.3bd [404,405]
pg_upmap_items 8.3c0 [304,305]
pg_upmap_items 8.3c3 [404,403]
pg_upmap_items 8.3ca [404,403]
pg_upmap_items 8.3cf [404,405]
pg_upmap_items 8.3d0 [404,405]
pg_upmap_items 8.3da [404,403]
pg_upmap_items 8.3e4 [404,405]
pg_upmap_items 8.3ea [404,405]
pg_upmap_items 8.3ec [203,205]
pg_upmap_items 8.3f3 [501,505]
pg_upmap_items 8.3f7 [304,305]
pg_upmap_items 8.3fb [404,405]
pg_upmap_items 8.3fc [304,305]
pg_upmap_items 8.400 [105,102,404,403]
pg_upmap_items 8.409 [404,403]
pg_upmap_items 8.40b [103,102,404,405]
pg_upmap_items 8.40c [404,400]
pg_upmap_items 8.410 [404,403]
pg_upmap_items 8.411 [404,405]
pg_upmap_items 8.417 [404,403]
pg_upmap_items 8.418 [404,403]
pg_upmap_items 9.2 [10401,10400]
pg_upmap_items 9.9 [10200,10201]


Regards
David Herselman


Re: [ceph-users] [Nfs-ganesha-devel] 2.7.3 with CEPH_FSAL Crashing

2019-07-19 Thread David C
Thanks, Jeff. I'll give 14.2.2 a go when it's released.

On Wed, 17 Jul 2019, 22:29 Jeff Layton,  wrote:

> Ahh, I just noticed you were running nautilus on the client side. This
> patch went into v14.2.2, so once you update to that you should be good
> to go.
>
> -- Jeff
>
> On Wed, 2019-07-17 at 17:10 -0400, Jeff Layton wrote:
> > This is almost certainly the same bug that is fixed here:
> >
> > https://github.com/ceph/ceph/pull/28324
> >
> > It should get backported soon-ish but I'm not sure which luminous
> > release it'll show up in.
> >
> > Cheers,
> > Jeff
> >
> > On Wed, 2019-07-17 at 10:36 +0100, David C wrote:
> > > Thanks for taking a look at this, Daniel. Below is the only
> interesting bit from the Ceph MDS log at the time of the crash but I
> suspect the slow requests are a result of the Ganesha crash rather than the
> cause of it. Copying the Ceph list in case anyone has any ideas.
> > >
> > > 2019-07-15 15:06:54.624007 7f5fda5bb700  0 log_channel(cluster) log
> [WRN] : 6 slow requests, 5 included below; oldest blocked for > 34.588509
> secs
> > > 2019-07-15 15:06:54.624017 7f5fda5bb700  0 log_channel(cluster) log
> [WRN] : slow request 33.113514 seconds old, received at 2019-07-15
> 15:06:21.510423: client_request(client.16140784:5571174 setattr
> mtime=2019-07-15 14:59:45.642408 #0x10009079cfb 2019-07
> > > -15 14:59:45.642408 caller_uid=1161, caller_gid=1131{}) currently
> failed to xlock, waiting
> > > 2019-07-15 15:06:54.624020 7f5fda5bb700  0 log_channel(cluster) log
> [WRN] : slow request 34.588509 seconds old, received at 2019-07-15
> 15:06:20.035428: client_request(client.16129440:1067288 create
> #0x1000907442e/filePathEditorRegistryPrefs.melDXAtss 201
> > > 9-07-15 14:59:53.694087 caller_uid=1161,
> caller_gid=1131{1131,4121,2330,2683,4115,2322,2779,2979,1503,3511,2783,2707,2942,2980,2258,2829,1238,1237,2793,1235,1249,2097,1154,2982,2983,3860,4101,1208,3638,3641,3644,3640,3643,3639,3642,3822,3945,4045,3521,35
> > > 22,3520,3523,}) currently failed to wrlock, waiting
> > > 2019-07-15 15:06:54.624025 7f5fda5bb700  0 log_channel(cluster) log
> [WRN] : slow request 34.583918 seconds old, received at 2019-07-15
> 15:06:20.040019: client_request(client.16140784:5570551 getattr pAsLsXsFs
> #0x1000907443b 2019-07-15 14:59:44.171408 cal
> > > ler_uid=1161, caller_gid=1131{}) currently failed to rdlock, waiting
> > > 2019-07-15 15:06:54.624028 7f5fda5bb700  0 log_channel(cluster) log
> [WRN] : slow request 34.580632 seconds old, received at 2019-07-15
> 15:06:20.043305: client_request(client.16129440:1067293 unlink
> #0x1000907442e/filePathEditorRegistryPrefs.melcdzxxc 201
> > > 9-07-15 14:59:53.701964 caller_uid=1161,
> caller_gid=1131{1131,4121,2330,2683,4115,2322,2779,2979,1503,3511,2783,2707,2942,2980,2258,2829,1238,1237,2793,1235,1249,2097,1154,2982,2983,3860,4101,1208,3638,3641,3644,3640,3643,3639,3642,3822,3945,4045,3521,35
> > > 22,3520,3523,}) currently failed to wrlock, waiting
> > > 2019-07-15 15:06:54.624032 7f5fda5bb700  0 log_channel(cluster) log
> [WRN] : slow request 34.538332 seconds old, received at 2019-07-15
> 15:06:20.085605: client_request(client.16129440:1067308 create
> #0x1000907442e/filePathEditorRegistryPrefs.melHHljMk 201
> > > 9-07-15 14:59:53.744266 caller_uid=1161,
> caller_gid=1131{1131,4121,2330,2683,4115,2322,2779,2979,1503,3511,2783,2707,2942,2980,2258,2829,1238,1237,2793,1235,1249,2097,1154,2982,2983,3860,4101,1208,3638,3641,3644,3640,3643,3639,3642,3822,3945,4045,3521,3522,3520,3523,})
> currently failed to wrlock, waiting
> > > 2019-07-15 15:06:55.014073 7f5fdcdc0700  1 mds.mds01 Updating MDS map
> to version 68166 from mon.2
> > > 2019-07-15 15:06:59.624041 7f5fda5bb700  0 log_channel(cluster) log
> [WRN] : 7 slow requests, 2 included below; oldest blocked for > 39.588571
> secs
> > > 2019-07-15 15:06:59.624048 7f5fda5bb700  0 log_channel(cluster) log
> [WRN] : slow request 30.495843 seconds old, received at 2019-07-15
> 15:06:29.128156: client_request(client.16129440:1072227 create
> #0x1000907442e/filePathEditorRegistryPrefs.mel58AQSv 2019-07-15
> 15:00:02.786754 caller_uid=1161,
> caller_gid=1131{1131,4121,2330,2683,4115,2322,2779,2979,1503,3511,2783,2707,2942,2980,2258,2829,1238,1237,2793,1235,1249,2097,1154,2982,2983,3860,4101,1208,3638,3641,3644,3640,3643,3639,3642,3822,3945,4045,3521,3522,3520,3523,})
> currently failed to wrlock, waiting
> > > 2019-07-15 15:06:59.624053 7f5fda5bb700  0 log_channel(cluster) log
> [WRN] : slow request 39.432848 seconds old, received at 2019-07-15
> 15:06:20.191151: client_request(client.16140784:5570649 mknod
> #0x1000907442e/fileP

Re: [ceph-users] [Nfs-ganesha-devel] 2.7.3 with CEPH_FSAL Crashing

2019-07-17 Thread David C
fda5bb700  0 log_channel(cluster) log [WRN] :
slow request 32.689838 seconds old, received at 2019-07-15 15:06:36.934283:
client_request(client.16129440:1072271 getattr pAsLsXsFs #0x1000907443b
2019-07-15 15:00:10.592734 caller_uid=1161,
caller_gid=1131{1131,4121,2330,2683,4115,2322,2779,2979,1503,3511,2783,2707,2942,2980,2258,2829,1238,1237,2793,1235,1249,2097,1154,2982,2983,3860,4101,1208,3638,3641,3644,3640,3643,3639,3642,3822,3945,4045,3521,3522,3520,3523,})
currently failed to rdlock, waiting
2019-07-15 15:07:09.624177 7f5fda5bb700  0 log_channel(cluster) log [WRN] :
slow request 34.962719 seconds old, received at 2019-07-15 15:06:34.661402:
client_request(client.16129440:1072256 getattr pAsLsXsFs #0x1000907443b
2019-07-15 15:00:08.319912 caller_uid=1161,
caller_gid=1131{1131,4121,2330,2683,4115,2322,2779,2979,1503,3511,2783,2707,2942,2980,2258,2829,1238,1237,2793,1235,1249,2097,1154,2982,2983,3860,4101,1208,3638,3641,3644,3640,3643,3639,3642,3822,3945,4045,3521,3522,3520,3523,})
currently failed to rdlock, waiting
2019-07-15 15:07:11.519928 7f5fdcdc0700  1 mds.mds01 Updating MDS map to
version 68169 from mon.2
2019-07-15 15:07:19.624272 7f5fda5bb700  0 log_channel(cluster) log [WRN] :
11 slow requests, 1 included below; oldest blocked for > 59.588812 secs
2019-07-15 15:07:19.624278 7f5fda5bb700  0 log_channel(cluster) log [WRN] :
slow request 32.164260 seconds old, received at 2019-07-15 15:06:47.459980:
client_request(client.16129440:1072326 getattr pAsLsXsFs #0x1000907443b
2019-07-15 15:00:21.118372 caller_uid=1161,
caller_gid=1131{1131,4121,2330,2683,4115,2322,2779,2979,1503,3511,2783,2707,2942,2980,2258,2829,1238,1237,2793,1235,1249,2097,1154,2982,2983,3860,4101,1208,3638,3641,3644,3640,3643,3639,3642,3822,3945,4045,3521,3522,3520,3523,})
currently failed to rdlock, waiting


On Tue, Jul 16, 2019 at 1:18 PM Daniel Gryniewicz  wrote:

> This is not one I've seen before, and a quick look at the code looks
> strange.  The only assert in that bit is asserting the parent is a
> directory, but the parent directory is not something that was passed in
> by Ganesha, but rather something that was looked up internally in
> libcephfs.  This is beyond my expertise, at this point.  Maybe some ceph
> logs would help?
>
> Daniel
>
> On 7/15/19 10:54 AM, David C wrote:
> > This list has been deprecated. Please subscribe to the new devel list at
> lists.nfs-ganesha.org.
> >
> >
> > Hi All
> >
> > I'm running 2.7.3 using the CEPH FSAL to export CephFS (Luminous), it
> > ran well for a few days and crashed. I have a coredump, could someone
> > assist me in debugging this please?
> >
> > (gdb) bt
> > #0  0x7f04dcab6207 in raise () from /lib64/libc.so.6
> > #1  0x7f04dcab78f8 in abort () from /lib64/libc.so.6
> > #2  0x7f04d2a9d6c5 in ceph::__ceph_assert_fail(char const*, char
> > const*, int, char const*) () from /usr/lib64/ceph/libceph-common.so.0
> > #3  0x7f04d2a9d844 in ceph::__ceph_assert_fail(ceph::assert_data
> > const&) () from /usr/lib64/ceph/libceph-common.so.0
> > #4  0x7f04cc807f04 in Client::_lookup_name(Inode*, Inode*, UserPerm
> > const&) () from /lib64/libcephfs.so.2
> > #5  0x7f04cc81c41f in Client::ll_lookup_inode(inodeno_t, UserPerm
> > const&, Inode**) () from /lib64/libcephfs.so.2
> > #6  0x7f04ccadbf0e in create_handle (export_pub=0x1baff10,
> > desc=, pub_handle=0x7f0470fd4718,
> > attrs_out=0x7f0470fd4740) at
> > /usr/src/debug/nfs-ganesha-2.7.3/FSAL/FSAL_CEPH/export.c:256
> > #7  0x00523895 in mdcache_locate_host (fh_desc=0x7f0470fd4920,
> > export=export@entry=0x1bafbf0, entry=entry@entry=0x7f0470fd48b8,
> > attrs_out=attrs_out@entry=0x0)
> >  at
> >
> /usr/src/debug/nfs-ganesha-2.7.3/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:1011
> > #8  0x0051d278 in mdcache_create_handle (exp_hdl=0x1bafbf0,
> > fh_desc=, handle=0x7f0470fd4900, attrs_out=0x0) at
> >
> /usr/src/debug/nfs-ganesha-2.7.3/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1578
> > #9  0x0046d404 in nfs4_mds_putfh
> > (data=data@entry=0x7f0470fd4ea0) at
> > /usr/src/debug/nfs-ganesha-2.7.3/Protocols/NFS/nfs4_op_putfh.c:211
> > #10 0x0046d8e8 in nfs4_op_putfh (op=0x7f03effaf1d0,
> > data=0x7f0470fd4ea0, resp=0x7f03ec1de1f0) at
> > /usr/src/debug/nfs-ganesha-2.7.3/Protocols/NFS/nfs4_op_putfh.c:281
> > #11 0x0045d120 in nfs4_Compound (arg=,
> > req=, res=0x7f03ec1de9d0) at
> > /usr/src/debug/nfs-ganesha-2.7.3/Protocols/NFS/nfs4_Compound.c:942
> > #12 0x004512cd in nfs_rpc_process_request
> > (reqdata=0x7f03ee5ed4b0) at
> > /usr/src/debug/nfs-ganesha-2.7.3/MainNFSD/nfs_worker_thread.c:1328
> > #13 0x00450766 in nfs_rpc_

[ceph-users] What's the best practice for Erasure Coding

2019-07-07 Thread David
Hi Ceph-Users,

 

I'm working with a Ceph cluster (about 50TB, 28 OSDs, all BlueStore on LVM).

Recently I've been trying out erasure-coded pools.

My question is: what's the best practice for using EC pools?

More specifically, which plugin (jerasure, isa, lrc, shec or clay) should I
adopt, and how should I choose the combination of (k,m) (e.g. (k=3,m=2), (k=6,m=3))?

 

Can anyone share their experience?

 

Thanks for any help.

 

Regards,

David

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot delete bucket

2019-06-27 Thread David Turner
I'm still going at 452M incomplete uploads. There are guides online for
manually deleting buckets kinda at the RADOS level that tend to leave data
stranded. That doesn't work for what I'm trying to do, so I'll keep going
with this and wait for that PR to come through, which should hopefully help
with bucket deletion.
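
In case it helps anyone following along, a rough way to gauge how many
incomplete uploads a bucket is carrying before committing to the delete is
something like this (assuming the aws CLI is configured against your RGW
endpoint; the bucket name and endpoint URL below are placeholders):

aws s3api list-multipart-uploads --bucket my-bucket \
    --endpoint-url http://rgw.example.com | jq '.Uploads | length'

Note it only returns the first page (1000 entries) unless you paginate, but
that's usually enough to tell whether you're in for a long ride.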

On Thu, Jun 27, 2019 at 2:58 PM Sergei Genchev  wrote:

> @David Turner
> Did your bucket delete ever finish? I am up to 35M incomplete uploads,
> and I doubt that I actually had that many upload attempts. I could be
> wrong though.
> Is there a way to force bucket deletion, even at the cost of not
> cleaning up space?
>
> On Tue, Jun 25, 2019 at 12:29 PM J. Eric Ivancich 
> wrote:
> >
> > On 6/24/19 1:49 PM, David Turner wrote:
> > > It's aborting incomplete multipart uploads that were left around. First
> > > it will clean up the cruft like that and then it should start actually
> > > deleting the objects visible in stats. That's my understanding of it
> > > anyway. I'm in the middle of cleaning up some buckets right now doing
> > > this same thing. I'm up to `WARNING : aborted 108393000 incomplete
> > > multipart uploads`. This bucket had a client uploading to it constantly
> > > with a very bad network connection.
> >
> > There's a PR to better deal with this situation:
> >
> > https://github.com/ceph/ceph/pull/28724
> >
> > Eric
> >
> > --
> > J. Eric Ivancich
> > he/him/his
> > Red Hat Storage
> > Ann Arbor, Michigan, USA
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot delete bucket

2019-06-24 Thread David Turner
It's aborting incomplete multipart uploads that were left around. First it
will clean up the cruft like that and then it should start actually
deleting the objects visible in stats. That's my understanding of it
anyway. I'm in the middle of cleaning up some buckets right now doing this
same thing. I'm up to `WARNING : aborted 108393000 incomplete multipart
uploads`. This bucket had a client uploading to it constantly with a very
bad network connection.

On Fri, Jun 21, 2019 at 1:13 PM Sergei Genchev  wrote:

>  Hello,
> Trying to delete bucket using radosgw-admin, and failing. Bucket has
> 50K objects but all of them are large. This is what I get:
> $ radosgw-admin bucket rm --bucket=di-omt-mapupdate --purge-objects
> --bypass-gc
> 2019-06-21 17:09:12.424 7f53f621f700  0 WARNING : aborted 1000
> incomplete multipart uploads
> 2019-06-21 17:09:19.966 7f53f621f700  0 WARNING : aborted 2000
> incomplete multipart uploads
> 2019-06-21 17:09:26.819 7f53f621f700  0 WARNING : aborted 3000
> incomplete multipart uploads
> 2019-06-21 17:09:33.430 7f53f621f700  0 WARNING : aborted 4000
> incomplete multipart uploads
> 2019-06-21 17:09:40.304 7f53f621f700  0 WARNING : aborted 5000
> incomplete multipart uploads
>
> Looks like it is trying to delete objects 1000 at a time, as it
> should, but failing. Bucket stats do not change.
>  radosgw-admin bucket stats --bucket=di-omt-mapupdate |jq .usage
> {
>   "rgw.main": {
> "size": 521929247648,
> "size_actual": 521930674176,
> "size_utilized": 400701129125,
> "size_kb": 509696531,
> "size_kb_actual": 509697924,
> "size_kb_utilized": 391309697,
> "num_objects": 50004
>   },
>   "rgw.multimeta": {
> "size": 0,
> "size_actual": 0,
> "size_utilized": 0,
> "size_kb": 0,
> "size_kb_actual": 0,
> "size_kb_utilized": 0,
> "num_objects": 32099
>   }
> }
> How can I get this bucket deleted?
> Thanks!
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Changing the release cadence

2019-06-17 Thread David Turner
This was a little long to respond with on Twitter, so I thought I'd share
my thoughts here. I love the idea of a 12 month cadence. I like October
because admins aren't upgrading production within the first few months of a
new release. It gives it plenty of time to be stable for the OS distros as
well as giving admins something low-key to work on over the holidays by
testing the new releases in stage/QA.

On Mon, Jun 17, 2019 at 12:22 PM Sage Weil  wrote:

> On Wed, 5 Jun 2019, Sage Weil wrote:
> > That brings us to an important decision: what time of year should we
> > release?  Once we pick the timing, we'll be releasing at that time
> *every
> > year* for each release (barring another schedule shift, which we want to
> > avoid), so let's choose carefully!
>
> I've put up a twitter poll:
>
> https://twitter.com/liewegas/status/1140655233430970369
>
> Thanks!
> sage
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure Coding - FPGA / Hardware Acceleration

2019-06-14 Thread David Byte
I can't speak to the SoftIron solution, but I have done some testing on an 
all-SSD environment comparing latency, CPU, etc between using the Intel ISA 
plugin and using Jerasure.  Very little difference is seen in CPU and 
capability in my tests, so I am not sure of the benefit.
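
For anyone who wants to repeat that kind of comparison, profiles along these
lines should do it - the k/m values and failure domain are just examples,
match them to your own layout:

ceph osd erasure-code-profile set ec-jerasure k=4 m=2 plugin=jerasure crush-failure-domain=host
ceph osd erasure-code-profile set ec-isa k=4 m=2 plugin=isa crush-failure-domain=host
ceph osd pool create ec-jerasure-test 64 64 erasure ec-jerasure
ceph osd pool create ec-isa-test 64 64 erasure ec-isa

Then run the same rados bench / fio workload against each pool and compare
latency and CPU.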

David Byte
Sr. Technology Strategist
SCE Enterprise Linux 
SCE Enterprise Storage
Alliances and SUSE Embedded
db...@suse.com
918.528.4422

On 6/14/19, 2:50 PM, "ceph-users on behalf of Brett Niver" 
 wrote:

Also the picture I saw at Cephalocon - which could have been
inaccurate, looked to me as if it multiplied the data path.

On Fri, Jun 14, 2019 at 8:27 AM Janne Johansson  wrote:
>
> Den fre 14 juni 2019 kl 13:58 skrev Sean Redmond 
:
>>
>> Hi Ceph-Uers,
>> I noticed that Soft Iron now have hardware acceleration for Erasure 
Coding[1], this is interesting as the CPU overhead can be a problem in addition 
to the extra disk I/O required for EC pools.
>> Does anyone know if any other work is ongoing to support generic FPGA 
Hardware Acceleration for EC pools, or if this is just a vendor specific 
feature.
>>
>> [1] 
https://www.theregister.co.uk/2019/05/20/softiron_unleashes_accepherator_an_erasure_coding_accelerator_for_ceph/
>
>
> Are there numbers anywhere to see how "tough" on a CPU it would be to 
calculate an EC code compared to "writing a sector to
> a disk on a remote server and getting an ack back" ? To my very untrained 
eye, it seems like a very small part of the whole picture,
> especially if you are meant to buy a ton of cards to do it.
>
> --
> May the most significant bit of your life be positive.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NFS-Ganesha CEPH_FSAL | potential locking issue

2019-05-17 Thread David C
Thanks for your response on that, Jeff. Pretty sure this has nothing to do
with Ceph or Ganesha, sorry for wasting your time. What I'm seeing is
related to writeback on the client. I can mitigate the behaviour a bit by
playing around with the vm.dirty* parameters.
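
For the archives, these are the sort of knobs I mean - the numbers are just a
starting point I'm experimenting with, not a recommendation:

sysctl -w vm.dirty_background_bytes=67108864    # start writeback early (64MB)
sysctl -w vm.dirty_bytes=268435456              # cap dirty pages at 256MB
# or the ratio-based equivalents, vm.dirty_background_ratio / vm.dirty_ratio

Capping the dirty page limits stops the client building up a huge writeback
backlog that then starves everything else when it flushes.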




On Tue, Apr 16, 2019 at 7:07 PM Jeff Layton  wrote:

> On Tue, Apr 16, 2019 at 10:36 AM David C  wrote:
> >
> > Hi All
> >
> > I have a single export of my cephfs using the ceph_fsal [1]. A CentOS 7
> machine mounts a sub-directory of the export [2] and is using it for the
> home directory of a user (e.g everything under ~ is on the server).
> >
> > This works fine until I start a long sequential write into the home
> directory such as:
> >
> > dd if=/dev/zero of=~/deleteme bs=1M count=8096
> >
> > This saturates the 1GbE link on the client which is great but during the
> transfer, apps that are accessing files in home start to lock up. Google
> Chrome for example, which puts its config in ~/.config/google-chrome/,
> locks up during the transfer, e.g I can't move between tabs, as soon as the
> transfer finishes, Chrome goes back to normal. Essentially the desktop
> environment reacts as I'd expect if the server was to go away. I'm using
> the MATE DE.
> >
> > However, if I mount a separate directory from the same export on the
> machine [3] and do the same write into that directory, my desktop
> experience isn't affected.
> >
> > I hope that makes some sense, it's a bit of a weird one to describe.
> This feels like a locking issue to me, although I can't explain why a
> single write into the root of a mount would affect access to other files
> under that same mount.
> >
>
> It's not a single write. You're doing 8G worth of 1M I/Os. The server
> then has to do all of those to the OSD backing store.
>
> > [1] CephFS export:
> >
> > EXPORT
> > {
> > Export_ID=100;
> > Protocols = 4;
> > Transports = TCP;
> > Path = /;
> > Pseudo = /ceph/;
> > Access_Type = RW;
> > Attr_Expiration_Time = 0;
> > Disable_ACL = FALSE;
> > Manage_Gids = TRUE;
> > Filesystem_Id = 100.1;
> > FSAL {
> > Name = CEPH;
> > }
> > }
> >
> > [2] Home directory mount:
> >
> > 10.10.10.226:/ceph/homes/username on /homes/username type nfs4
> (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.10.10.135,local_lock=none,addr=10.10.10.226)
> >
> > [3] Test directory mount:
> >
> > 10.10.10.226:/ceph/testing on /tmp/testing type nfs4
> (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.10.10.135,local_lock=none,addr=10.10.10.226)
> >
> > Versions:
> >
> > Luminous 12.2.10
> > nfs-ganesha-2.7.1-0.1.el7.x86_64
> > nfs-ganesha-ceph-2.7.1-0.1.el7.x86_64
> >
> > Ceph.conf on nfs-ganesha server:
> >
> > [client]
> > mon host = 10.10.10.210:6789, 10.10.10.211:6789,
> 10.10.10.212:6789
> > client_oc_size = 8388608000
> > client_acl_type=posix_acl
> > client_quota = true
> > client_quota_df = true
> >
>
> No magic bullets here, I'm afraid.
>
> Sounds like ganesha is probably just too swamped with write requests
> to do much else, but you'll probably want to do the legwork starting
> with the hanging application, and figure out what it's doing that
> takes so long. Is it some syscall? Which one?
>
> From there you can start looking at statistics in the NFS client to
> see what's going on there. Are certain RPCs taking longer than they
> should? Which ones?
>
> Once you know what's going on with the client, you can better tell
> what's going on with the server.
> --
> Jeff Layton 
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Samba vfs_ceph or kernel client

2019-05-16 Thread David Disseldorp
Hi Maged,

On Fri, 10 May 2019 18:32:15 +0200, Maged Mokhtar wrote:

> What is the recommended way for Samba gateway integration: using 
> vfs_ceph or mounting CephFS via kernel client ? i tested the kernel 
> solution in a ctdb setup and gave good performance, does it have any 
> limitations relative to vfs_ceph ?

At this stage kernel-backed and vfs_ceph-backed shares are pretty
similar feature wise. ATM kernel backed shares have the performance
advantage of page-cache + async vfs_default dispatch. vfs_ceph will
likely gain more features in future as cross-protocol share-mode locks
and leases can be supported without the requirement for a kernel
interface.
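
For reference, the two variants look roughly like this in smb.conf (share
names and paths here are only examples):

[share-vfs]
    path = /volumes/share1
    vfs objects = ceph
    ceph:config_file = /etc/ceph/ceph.conf
    ceph:user_id = samba

[share-kernel]
    # backed by a kernel CephFS mount at /mnt/cephfs
    path = /mnt/cephfs/share1

Everything else (ctdb, locking settings) can stay the same between the two.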

Cheers, David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] IMPORTANT : NEED HELP : Low IOPS on hdd : MAX AVAIL Draining fast

2019-04-27 Thread David C
On Sat, 27 Apr 2019, 18:50 Nikhil R,  wrote:

> Guys,
> We now have a total of 105 osd’s on 5 baremetal nodes each hosting 21
> osd’s on HDD which are 7Tb with journals on HDD too. Each journal is about
> 5GB
>

This would imply you've got a separate HDD partition for journals. I don't
think there's any value in that, and it would probably be detrimental to
performance.

>
> We expanded our cluster last week and added 1 more node with 21 HDD and
> journals on same disk.
> Our client i/o is too heavy and we are not able to backfill even 1 thread
> during peak hours - in case we backfill during peak hours osd's are crashing
> causing undersized pg's and if we have another osd crash we wont be able to
> use our cluster due to undersized and recovery pg's. During non-peak we can
> just backfill 8-10 pgs.
> Due to this our MAX AVAIL is draining out very fast.
>

How much RAM have you got in your nodes? In my experience that's a common
reason for crashing OSDs during recovery ops.

What does your recovery and backfill tuning look like?
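
By tuning I mean values along these lines, which is usually where I'd start on
an HDD-only cluster - treat it as a sketch rather than gospel:

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
# and, depending on your version, a recovery sleep to throttle things further:
ceph tell osd.* injectargs '--osd-recovery-sleep 0.1'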



> We are thinking of adding 2 more baremetal nodes with 21 *7tb  osd’s on
>  HDD and add 50GB SSD Journals for these.
> We aim to backfill from the 105 osd’s a bit faster and expect writes of
> backfillis coming to these osd’s faster.
>

Ssd journals would certainly help, just be sure it's a model that performs
well with Ceph

>
> Is this a good viable idea?
> Thoughts please?
>

I'd recommend sharing more detail, e.g. full spec of the nodes, Ceph version,
etc.

>
> -Nikhil
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Default Pools

2019-04-23 Thread David Turner
You should be able to see all pools in use in a RGW zone from the
radosgw-admin command. This [1] is probably overkill for most, but I deal
with multi-realm clusters so I generally think like this when dealing with
RGW.  Running this as is will create a file in your current directory for
each zone in your deployment (likely to be just one file).  My rough guess
for what you would find in that file based on your pool names would be this
[2].

If you identify any pools not listed from the zone get command, then you
can rename [3] the pool to see if it is being created and/or used by rgw
currently.  The process here would be to stop all RGW daemons, rename the
pools, start a RGW daemon, stop it again, and see which pools were
recreated.  Clean up the pools that were freshly made and rename the
original pools back into place before starting your RGW daemons again.
Please note that .rgw.root is a required pool in every RGW deployment and
will not be listed in the zones themselves.


[1]
for realm in $(radosgw-admin realm list --format=json | jq '.realms[]' -r);
do
  for zonegroup in $(radosgw-admin --rgw-realm=$realm zonegroup list
--format=json | jq '.zonegroups[]' -r); do
for zone in $(radosgw-admin --rgw-realm=$realm
--rgw-zonegroup=$zonegroup zone list --format=json | jq '.zones[]' -r); do
  echo $realm.$zonegroup.$zone.json
  radosgw-admin --rgw-realm=$realm --rgw-zonegroup=$zonegroup
--rgw-zone=$zone zone get > $realm.$zonegroup.$zone.json
done
  done
done

[2] default.default.default.json
{
"id": "{{ UUID }}",
"name": "default",
"domain_root": "default.rgw.meta",
"control_pool": "default.rgw.control",
"gc_pool": ".rgw.gc",
"log_pool": "default.rgw.log",
"user_email_pool": ".users.email",
"user_uid_pool": ".users.uid",
"system_key": {
},
"placement_pools": [
{
"key": "default-placement",
"val": {
"index_pool": "default.rgw.buckets.index",
"data_pool": "default.rgw.buckets.data",
"data_extra_pool": "default.rgw.buckets.non-ec",
"index_type": 0,
"compression": ""
}
}
],
"metadata_heap": "",
"tier_config": [],
"realm_id": "{{ UUID }}"
}

[3] ceph osd pool rename <current pool name> <new pool name>
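
In other words, the test for each suspect pool looks roughly like this (with
every RGW daemon stopped first; the pool name is just an example):

ceph osd pool rename .users.email .users.email.bak
# start a single radosgw, stop it again, then check whether it recreated the pool
ceph osd lspools | grep users.email
# if a fresh .users.email appeared, the pool is still in use - remove the fresh
# one and rename the .bak copy back before starting your RGW daemons again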

On Thu, Apr 18, 2019 at 10:46 AM Brent Kennedy  wrote:

> Yea, that was a cluster created during firefly...
>
> Wish there was a good article on the naming and use of these, or perhaps a
> way I could make sure they are not used before deleting them.  I know RGW
> will recreate anything it uses, but I don’t want to lose data because I
> wanted a clean system.
>
> -Brent
>
> -Original Message-
> From: Gregory Farnum 
> Sent: Monday, April 15, 2019 5:37 PM
> To: Brent Kennedy 
> Cc: Ceph Users 
> Subject: Re: [ceph-users] Default Pools
>
> On Mon, Apr 15, 2019 at 1:52 PM Brent Kennedy  wrote:
> >
> > I was looking around the web for the reason for some of the default
> pools in Ceph and I cant find anything concrete.  Here is our list, some
> show no use at all.  Can any of these be deleted ( or is there an article
> my googlefu failed to find that covers the default pools?
> >
> > We only use buckets, so I took out .rgw.buckets, .users and
> > .rgw.buckets.index…
> >
> > Name
> > .log
> > .rgw.root
> > .rgw.gc
> > .rgw.control
> > .rgw
> > .users.uid
> > .users.email
> > .rgw.buckets.extra
> > default.rgw.control
> > default.rgw.meta
> > default.rgw.log
> > default.rgw.buckets.non-ec
>
> All of these are created by RGW when you run it, not by the core Ceph
> system. I think they're all used (although they may report sizes of 0, as
> they mostly make use of omap).
>
> > metadata
>
> Except this one used to be created-by-default for CephFS metadata, but
> that hasn't been true in many releases. So I guess you're looking at an old
> cluster? (In which case it's *possible* some of those RGW pools are also
> unused now but were needed in the past; I haven't kept good track of them.)
> -Greg
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Osd update from 12.2.11 to 12.2.12

2019-04-22 Thread David Turner
Do you perhaps have anything in the ceph.conf files on the servers with
those OSDs that would attempt to tell the daemon that they are filestore
osds instead of bluestore?  I'm sure you know that the second part [1] of
the output in both cases only shows up after an OSD has been rebooted.  I'm
sure this too could be cleaned up by adding that line to the ceph.conf file.

[1] rocksdb_separate_wal_dir = 'false' (not observed, change may require
restart)
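
If you want to confirm what the OSDs themselves think they are, something like
this should settle it without any restarts (osd.18 is just one of the IDs from
your output):

ceph osd metadata 18 | grep osd_objectstore
# or, on the host that runs the OSD:
ceph daemon osd.18 config get osd_objectstore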

On Sun, Apr 21, 2019 at 8:32 AM Marc Roos  wrote:

>
>
> Just updated luminous, and setting max_scrubs value back. Why do I get
> osd's reporting differently
>
>
> I get these:
> osd.18: osd_max_scrubs = '1' (not observed, change may require restart)
> osd_objectstore = 'bluestore' (not observed, change may require restart)
> rocksdb_separate_wal_dir = 'false' (not observed, change may require
> restart)
> osd.19: osd_max_scrubs = '1' (not observed, change may require restart)
> osd_objectstore = 'bluestore' (not observed, change may require restart)
> rocksdb_separate_wal_dir = 'false' (not observed, change may require
> restart)
> osd.20: osd_max_scrubs = '1' (not observed, change may require restart)
> osd_objectstore = 'bluestore' (not observed, change may require restart)
> rocksdb_separate_wal_dir = 'false' (not observed, change may require
> restart)
> osd.21: osd_max_scrubs = '1' (not observed, change may require restart)
> osd_objectstore = 'bluestore' (not observed, change may require restart)
> rocksdb_separate_wal_dir = 'false' (not observed, change may require
> restart)
> osd.22: osd_max_scrubs = '1' (not observed, change may require restart)
> osd_objectstore = 'bluestore' (not observed, change may require restart)
> rocksdb_separate_wal_dir = 'false' (not observed, change may require
> restart)
>
>
> And I get osd's reporting like this:
> osd.23: osd_max_scrubs = '1' (not observed, change may require restart)
> rocksdb_separate_wal_dir = 'false' (not observed, change may require
> restart)
> osd.24: osd_max_scrubs = '1' (not observed, change may require restart)
> rocksdb_separate_wal_dir = 'false' (not observed, change may require
> restart)
> osd.25: osd_max_scrubs = '1' (not observed, change may require restart)
> rocksdb_separate_wal_dir = 'false' (not observed, change may require
> restart)
> osd.26: osd_max_scrubs = '1' (not observed, change may require restart)
> rocksdb_separate_wal_dir = 'false' (not observed, change may require
> restart)
> osd.27: osd_max_scrubs = '1' (not observed, change may require restart)
> rocksdb_separate_wal_dir = 'false' (not observed, change may require
> restart)
> osd.28: osd_max_scrubs = '1' (not observed, change may require restart)
> rocksdb_separate_wal_dir = 'false' (not observed, change may require
> restart)
>
>
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] NFS-Ganesha CEPH_FSAL | potential locking issue

2019-04-16 Thread David C
Hi All

I have a single export of my cephfs using the ceph_fsal [1]. A CentOS 7
machine mounts a sub-directory of the export [2] and is using it for the
home directory of a user (e.g everything under ~ is on the server).

This works fine until I start a long sequential write into the home
directory such as:

dd if=/dev/zero of=~/deleteme bs=1M count=8096

This saturates the 1GbE link on the client which is great but during the
transfer, apps that are accessing files in home start to lock up. Google
Chrome for example, which puts its config in ~/.config/google-chrome/,
locks up during the transfer, e.g I can't move between tabs, as soon as the
transfer finishes, Chrome goes back to normal. Essentially the desktop
environment reacts as I'd expect if the server was to go away. I'm using
the MATE DE.

However, if I mount a separate directory from the same export on the
machine [3] and do the same write into that directory, my desktop
experience isn't affected.

I hope that makes some sense, it's a bit of a weird one to describe. This
feels like a locking issue to me, although I can't explain why a single
write into the root of a mount would affect access to other files under
that same mount.

[1] CephFS export:

EXPORT
{
Export_ID=100;
Protocols = 4;
Transports = TCP;
Path = /;
Pseudo = /ceph/;
Access_Type = RW;
Attr_Expiration_Time = 0;
Disable_ACL = FALSE;
Manage_Gids = TRUE;
Filesystem_Id = 100.1;
FSAL {
Name = CEPH;
}
}

[2] Home directory mount:

10.10.10.226:/ceph/homes/username on /homes/username type nfs4
(rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.10.10.135,local_lock=none,addr=10.10.10.226)

[3] Test directory mount:

10.10.10.226:/ceph/testing on /tmp/testing type nfs4
(rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.10.10.135,local_lock=none,addr=10.10.10.226)

Versions:

Luminous 12.2.10
nfs-ganesha-2.7.1-0.1.el7.x86_64
nfs-ganesha-ceph-2.7.1-0.1.el7.x86_64

Ceph.conf on nfs-ganesha server:

[client]
mon host = 10.10.10.210:6789, 10.10.10.211:6789, 10.10.10.212:6789
client_oc_size = 8388608000
client_acl_type=posix_acl
client_quota = true
client_quota_df = true

Thanks,
David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Looking up buckets in multi-site radosgw configuration

2019-03-20 Thread David Coles
On Tue, Mar 19, 2019 at 7:51 AM Casey Bodley  wrote:

> Yeah, correct on both points. The zonegroup redirects would be the only
> way to guide clients between clusters.

Awesome. Thank you for the clarification.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Looking up buckets in multi-site radosgw configuration

2019-03-18 Thread David Coles
I'm looking at setting up a multi-site radosgw configuration where
data is sharded over multiple clusters in a single physical location;
and would like to understand how Ceph handles requests in this
configuration.

Looking through the radosgw source[1], it looks like radosgw will
return a 301 redirect if I request a bucket that is not in the current
zonegroup. This redirect appears to be to the endpoint for the
zonegroup (I assume as configured by `radosgw-admin zonegroup create
--endpoints`). This seems like it would work well for multiple
geographic regions (e.g. us-east and us-west) for ensuring that a
request is redirected to the region (zonegroup) that hosts the bucket.
We could possibly improve this by using virtual hosted buckets and having
DNS point to the correct region for that bucket.
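
By endpoints I mean something along these lines - the zonegroup names and
URLs are placeholders, but I assume this is what drives the redirect Location:

radosgw-admin zonegroup create --rgw-zonegroup=us-east \
    --endpoints=https://rgw-us-east.example.com --master --default
radosgw-admin zonegroup create --rgw-zonegroup=us-west \
    --endpoints=https://rgw-us-west.example.com
radosgw-admin period update --commit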

I notice that it's also possible to configure zones in a zonegroup
that don't peform replication[2] (e.g. us-east-1 and us-east-2). In
this case I assume that if I direct a request to the wrong zone, then
Ceph will just report that the object as not-found because, despite
the bucket metadata being replicated from the zonegroup master, the
objects will never be replicated from one zone to the other. Another
layer (like a consistent hash across the bucket name or database)
would be required for routing to the correct zone.

Is this mostly correct? Are there other ways of controlling which
cluster data is placed (i.e. placement groups)?

Thanks!

1. 
https://github.com/ceph/ceph/blob/affb7d396f76273e885cfdbcd363c1882496726c/src/rgw/rgw_op.cc#L653-L669
2. 
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/object_gateway_guide_for_red_hat_enterprise_linux/multi_site#configuring_multiple_zones_without_replication
-- 
David Coles
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd pg-upmap-items not working

2019-03-15 Thread David Turner
Why do you think that it can't resolve this by itself?  You just said that
the balancer was able to provide an optimization, but then that the
distribution isn't perfect.  When there are no further optimizations,
running `ceph balancer optimize plan` won't create a plan with any
changes.  Possibly the active mgr needs a kick.  When my cluster isn't
balancing when it's supposed to, I just run `ceph mgr fail {active mgr}`
and within a minute or so the cluster is moving PGs around.
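
Roughly what I do when it looks stuck (the jq bit is just one way of grabbing
the active mgr name - any way of finding it works):

ceph balancer status
ceph balancer eval          # current score; lower is better
ceph mgr fail $(ceph mgr dump | jq -r .active_name)
# wait a minute or two, then re-check
ceph balancer eval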

On Sat, Mar 9, 2019 at 8:05 PM Kári Bertilsson 
wrote:

> Thanks
>
> I did apply https://github.com/ceph/ceph/pull/26179.
>
> Running manual upmap commands works now. I did run "ceph balancer optimize
> new" and it did add a few upmaps.
>
> But now another issue. Distribution is far from perfect but the balancer
> can't find further optimization.
> Specifically OSD 23 is getting way more pg's than the other 3tb OSD's.
>
> See https://pastebin.com/f5g5Deak
>
> On Fri, Mar 1, 2019 at 10:25 AM  wrote:
>
>> > Backports should be available in v12.2.11.
>>
>> s/v12.2.11/ v12.2.12/
>>
>> Sorry for the typo.
>>
>>
>>
>>
>> Original message
>> *From:* Xie Xingguo (谢型果) 10072465
>> *To:* d...@vanderster.com ;
>> *Cc:* ceph-users@lists.ceph.com ;
>> *Date:* 2019-03-01 17:09
>> *Subject:* *Re: [ceph-users] ceph osd pg-upmap-items not working*
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> See https://github.com/ceph/ceph/pull/26179
>>
>> Backports should be available in v12.2.11.
>>
>> Or you can manually do it by simply adopting
>> https://github.com/ceph/ceph/pull/26127 if you are eager to get out of
>> the trap right now.
>>
>>
>> *From:* DanvanderSter 
>> *To:* Kári Bertilsson ;
>> *Cc:* ceph-users ; Xie Xingguo (谢型果) 10072465;
>> *Date:* 2019-03-01 14:48
>> *Subject:* *Re: [ceph-users] ceph osd pg-upmap-items not working*
>> It looks like that somewhat unusual crush rule is confusing the new
>> upmap cleaning.
>> (debug_mon 10 on the active mon should show those cleanups).
>>
>>
>> I'm copying Xie Xingguo, and probably you should create a tracker for this.
>>
>> -- dan
>>
>>
>>
>>
>> On Fri, Mar 1, 2019 at 3:12 AM Kári Bertilsson > > wrote:
>> >
>> > This is the pool
>>
>> > pool 41 'ec82_pool' erasure size 10 min_size 8 crush_rule 1 object_hash 
>> > rjenkins pg_num 512 pgp_num 512 last_change 63794 lfor 21731/21731 flags 
>> > hashpspool,ec_overwrites stripe_width 32768 application cephfs
>> >removed_snaps [1~5]
>> >
>> > Here is the relevant crush rule:
>>
>> > rule ec_pool { id 1 type erasure min_size 3 max_size 10 step 
>> > set_chooseleaf_tries 5 step set_choose_tries 100 step take default class 
>> > hdd step choose indep 5 type host step choose indep 2 type osd step emit }
>> >
>>
>> > Both OSD 23 and 123 are in the same host. So this change should be 
>> > perfectly acceptable by the rule set.
>>
>> > Something must be blocking the change, but i can't find anything about it 
>> > in any logs.
>> >
>> > - Kári
>> >
>> > On Thu, Feb 28, 2019 at 8:07 AM Dan van der Ster > > wrote:
>> >>
>> >> Hi,
>> >>
>> >> pg-upmap-items became more strict in v12.2.11 when validating upmaps.
>> >> E.g., it now won't let you put two PGs in the same rack if the crush
>> >> rule doesn't allow it.
>> >>
>>
>> >> Where are OSDs 23 and 123 in your cluster? What is the relevant crush 
>> >> rule?
>> >>
>> >> -- dan
>> >>
>> >>
>> >> On Wed, Feb 27, 2019 at 9:17 PM Kári Bertilsson > > wrote:
>> >> >
>> >> > Hello
>> >> >
>>
>> >> > I am trying to diagnose why upmap stopped working where it was 
>> >> > previously working fine.
>> >> >
>> >> > Trying to move pg 41.1 to 123 has no effect and seems to be ignored.
>> >> >
>> >> > # ceph osd pg-upmap-items 41.1 23 123
>> >> > set 41.1 pg_upmap_items mapping to [23->123]
>> >> >
>>
>> >> > No rebalancing happens and if I run it again it shows the same output 
>> >> > every time.
>> >> >
>> >> > I have in config
>> >> > debug mgr = 4/5
>> >> > debug mon = 4/5
>> >> >
>> >> > Paste from mon & mgr logs. Also output from "ceph osd dump"
>> >> > https://pastebin.com/9VrT4YcU
>> >> >
>> >> >
>>
>> >> > I have run "ceph osd set-require-min-compat-client luminous" long time 
>> >> > ago. And all servers running ceph have been rebooted numerous times 
>> >> > since then.
>>
>> >> > But 

Re: [ceph-users] mount cephfs on ceph servers

2019-03-12 Thread David C
Out of curiosity, are you guys re-exporting the fs to clients over
something like nfs or running applications directly on the OSD nodes?

On Tue, 12 Mar 2019, 18:28 Paul Emmerich,  wrote:

> Mounting kernel CephFS on an OSD node works fine with recent kernels
> (4.14+) and enough RAM in the servers.
>
> We did encounter problems with older kernels though
>
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Tue, Mar 12, 2019 at 10:07 AM Hector Martin 
> wrote:
> >
> > It's worth noting that most containerized deployments can effectively
> > limit RAM for containers (cgroups), and the kernel has limits on how
> > many dirty pages it can keep around.
> >
> > In particular, /proc/sys/vm/dirty_ratio (default: 20) means at most 20%
> > of your total RAM can be dirty FS pages. If you set up your containers
> > such that the cumulative memory usage is capped below, say, 70% of RAM,
> > then this might effectively guarantee that you will never hit this issue.
> >
> > On 08/03/2019 02:17, Tony Lill wrote:
> > > AFAIR the issue is that under memory pressure, the kernel will ask
> > > cephfs to flush pages, but this in turn causes the osd (mds?) to
> > > require more memory to complete the flush (for network buffers, etc).
> As
> > > long as cephfs and the OSDs are feeding from the same kernel mempool,
> > > you are susceptible. Containers don't protect you, but a full VM, like
> > > xen or kvm? would.
> > >
> > > So if you don't hit the low memory situation, you will not see the
> > > deadlock, and you can run like this for years without a problem. I
> have.
> > > But you are most likely to run out of memory during recovery, so this
> > > could compound your problems.
> > >
> > > On 3/7/19 3:56 AM, Marc Roos wrote:
> > >>
> > >>
> > >> Container =  same kernel, problem is with processes using the same
> > >> kernel.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> -Original Message-
> > >> From: Daniele Riccucci [mailto:devs...@posteo.net]
> > >> Sent: 07 March 2019 00:18
> > >> To: ceph-users@lists.ceph.com
> > >> Subject: Re: [ceph-users] mount cephfs on ceph servers
> > >>
> > >> Hello,
> > >> is the deadlock risk still an issue in containerized deployments? For
> > >> example with OSD daemons in containers and mounting the filesystem on
> > >> the host machine?
> > >> Thank you.
> > >>
> > >> Daniele
> > >>
> > >> On 06/03/19 16:40, Jake Grimmett wrote:
> > >>> Just to add "+1" on this datapoint, based on one month usage on Mimic
> > >>> 13.2.4 essentially "it works great for us"
> > >>>
> > >>> Prior to this, we had issues with the kernel driver on 12.2.2. This
> > >>> could have been due to limited RAM on the osd nodes (128GB / 45 OSD),
> > >>> and an older kernel.
> > >>>
> > >>> Upgrading the RAM to 256GB and using a RHEL 7.6 derived kernel has
> > >>> allowed us to reliably use the kernel driver.
> > >>>
> > >>> We keep 30 snapshots ( one per day), have one active metadata server,
> > >>> and change several TB daily - it's much, *much* faster than with
> fuse.
> > >>>
> > >>> Cluster has 10 OSD nodes, currently storing 2PB, using ec 8:2 coding.
> > >>>
> > >>> ta ta
> > >>>
> > >>> Jake
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On 3/6/19 11:10 AM, Hector Martin wrote:
> >  On 06/03/2019 12:07, Zhenshi Zhou wrote:
> > > Hi,
> > >
> > > I'm gonna mount cephfs from my ceph servers for some reason,
> > > including monitors, metadata servers and osd servers. I know it's
> > > not a best practice. But what is the exact potential danger if I
> > > mount cephfs from its own server?
> > 
> >  As a datapoint, I have been doing this on two machines (single-host
> >  Ceph
> >  clusters) for months with no ill effects. The FUSE client performs a
> >  lot worse than the kernel client, so I switched to the latter, and
> >  it's been working well with no deadlocks.
> > 
> > >>> ___
> > >>> ceph-users mailing list
> > >>> ceph-users@lists.ceph.com
> > >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >>>
> > >> ___
> > >> ceph-users mailing list
> > >> ceph-users@lists.ceph.com
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >>
> > >>
> > >> ___
> > >> ceph-users mailing list
> > >> ceph-users@lists.ceph.com
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >>
> > >
> > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> >
> > --
> > Hector Martin (hec...@marcansoft.com)
> > Public Key: https://mrcn.st/pub
> > ___
> > ceph-users mailing list
> > 

Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-11 Thread David Clarke
On 9/03/19 10:07 PM, Victor Hooi wrote:
> Hi,
> 
> I'm setting up a 3-node Proxmox cluster with Ceph as the shared storage,
> based around Intel Optane 900P drives (which are meant to be the bee's
> knees), and I'm seeing pretty low IOPS/bandwidth.

We found that CPU power-state settings played a large part in latency, and
therefore IOPS.  This wasn't too evident with spinning disks, but it makes a
large percentage difference in our NVMe based clusters.

You may want to investigate setting processor.max_cstate=1 or
intel_idle.max_cstate=1, whichever is appropriate for your CPUs and
kernel, on the boot cmdline.
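
A quick way to see what the CPUs are currently allowed to drop into, plus a
shortcut on EL-style distros with tuned that avoids editing the cmdline (a
sketch, not a full tuning guide):

cat /sys/module/intel_idle/parameters/max_cstate
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
# or simply:
tuned-adm profile latency-performance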



-- 
David Clarke
Systems Architect
Catalyst IT



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OpenStack with Ceph RDMA

2019-03-11 Thread David Turner
I can't speak to the rdma portion. But to clear up what each of these
does... the cluster network is only traffic between the osds for
replicating writes, reading EC data, as well as backfilling and recovery
io. Mons, mds, rgw, and osds talking with clients all happen on the public
network. The general consensus has been to not split the two networks,
except for maybe by vlans for potential statistics and graphing. Even if
you were running out of bandwidth, just upgrade the dual interface instead
of segregating them physically.

On Sat, Mar 9, 2019, 11:10 AM Lazuardi Nasution 
wrote:

> Hi,
>
> I'm looking for information about where is the RDMA messaging of Ceph
> happen, on cluster network, public network or both (it seem both, CMIIW)?
> I'm talking about configuration of ms_type, ms_cluster_type and
> ms_public_type.
>
> In case of OpenStack integration with RBD, which of above three is
> possible? In this case, should I still separate cluster network and public
> network?
>
> Best regards,
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] priorize degraged objects than misplaced

2019-03-11 Thread David Turner
Ceph has been getting better and better about prioritizing this sort of
recovery, but few of those optimizations are in Jewel, which has been out
of the support cycle for about a year. You should look into upgrading to
Mimic, where you should see a pretty good improvement in this sort of
prioritization.
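
Newer releases also give you explicit knobs for this, roughly:

ceph pg force-recovery <pgid>
ceph pg force-backfill <pgid>

which push the named PGs to the front of the queue.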

On Sat, Mar 9, 2019, 3:10 PM Fabio Abreu  wrote:

> HI Everybody,
>
> I have a doubt about degraded objects in the Jewel 10.2.7 version, can I
> priorize the degraded objects than misplaced?
>
> I asking this because I try simulate a disaster recovery scenario.
>
>
> Thanks and best regards,
> Fabio Abreu Reis
> http://fajlinux.com.br
> *Tel : *+55 21 98244-0161
> *Skype : *fabioabreureis
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH ISCSI Gateway

2019-03-11 Thread David Turner
The problem with clients on osd nodes is for kernel clients only. That's
true of krbd and the kernel client for cephfs. The only other reason not to
run any other Ceph daemon on the same node as OSDs is resource contention
if you're running at high CPU and memory utilization.

On Sat, Mar 9, 2019, 10:15 PM Mike Christie  wrote:

> On 03/07/2019 09:22 AM, Ashley Merrick wrote:
> > Been reading into the gateway, and noticed it’s been mentioned a few
> > times it can be installed on OSD servers.
> >
> > I am guessing therefore there be no issues like is sometimes mentioned
> > when using kRBD on a OSD node apart from the extra resources required
> > from the hardware.
> >
>
> That is correct. You might have a similar issue if you were to run the
> iscsi gw/target, OSD and then also run the iscsi initiator that logs
> into the iscsi gw/target all on the same node. I don't think any use
> case like that has ever come up though.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Failed to repair pg

2019-03-07 Thread David Zafman


On 3/7/19 9:32 AM, Herbert Alexander Faleiros wrote:

On Thu, Mar 07, 2019 at 01:37:55PM -0300, Herbert Alexander Faleiros wrote:
Should I do something like this? (below, after stopping osd.36)

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-36/ --journal-path 
/dev/sdc1 rbd_data.dfd5e2235befd0.0001c299 remove-clone-metadata 326022

I'm no sure about rbd_data.$RBD and $CLONEID (took from rados
list-inconsistent-obj, also below).



See what results you get from this command.

# rados list-inconsistent-snapset 2.2bb --format=json-pretty

You might just see this, which is nothing interesting.  If you don't get JSON,
re-run the scrub.


{
    "epoch": ##,
    "inconsistents": []
}

I don't think you need to do the remove-clone-metadata, because you got
"unexpected clone"; I think you'd just get "Clone 326022 not present".


I think you need to remove the clone object from osd.12 and osd.80.  For 
example:


# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12/ 
--journal-path /dev/sdXX --op list rbd_data.dfd5e2235befd0.0001c299


["2.2bb",{"oid":"rbd_data.dfd5e2235befd0.0001c299","key":"","snapid":-2,"hash":,"max":0,"pool":2,"namespace":"","max":0}]
["2.2bb",{"oid":"rbd_data.dfd5e2235befd0.0001c299","key":"","snapid":326022,"hash":#,"max":0,"pool":2,"namespace":"","max":0}]

Use the json for snapid 326022 to remove it.

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12/ 
--journal-path /dev/sdXX 
'["2.2bb",{"oid":"rbd_data.dfd5e2235befd0.0001c299","key":"","snapid":326022,"hash":#,"max":0,"pool":2,"namespace":"","max":0}]' 
remove



David

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mount cephfs on ceph servers

2019-03-06 Thread David C
The general advice has been not to use the kernel client on an OSD node, as
you may see a deadlock under certain conditions. Using the fuse client
should be fine, or you can use the kernel client inside a VM.

On Wed, 6 Mar 2019, 03:07 Zhenshi Zhou,  wrote:

> Hi,
>
> I'm gonna mount cephfs from my ceph servers for some reason,
> including monitors, metadata servers and osd servers. I know it's
> not a best practice. But what is the exact potential danger if I mount
> cephfs from its own server?
>
> Thanks
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Nfs-ganesha-devel] NFS-Ganesha CEPH_FSAL ceph.quota.max_bytes not enforced

2019-03-04 Thread David C
On Mon, Mar 4, 2019 at 5:53 PM Jeff Layton  wrote:

>
> On Mon, 2019-03-04 at 17:26 +, David C wrote:
> > Looks like you're right, Jeff. Just tried to write into the dir and am
> > now getting the quota warning. So I guess it was the libcephfs cache
> > as you say. That's fine for me, I don't need the quotas to be too
> > strict, just a failsafe really.
> >
>
> Actually, I said it was likely the NFS client cache. The Linux kernel is
> allowed to aggressively cache writes if you're doing buffered I/O. The
> NFS client has no concept of the quota here, so you'd only see
> enforcement once those writes start getting flushed back to the server.
>

Ah sorry, that makes a lot of sense!

>
>
> > Interestingly, if I create a new dir, set the same 100MB quota, I can
> > write multiple files with "dd if=/dev/zero of=1G bs=1M count=1024
> > oflag=direct". Wouldn't that bypass the cache? I have the following in
> > my ganesha.conf which I believe effectively disables Ganesha's
> > caching:
> >
> > CACHEINODE {
> > Dir_Chunk = 0;
> > NParts = 1;
> > Cache_Size = 1;
> > }
> >
>
> Using direct I/O like that should take the NFS client cache out of the
> picture. That said, cephfs quota enforcement is pretty "lazy". According
> to http://docs.ceph.com/docs/mimic/cephfs/quota/ :
>
> "Quotas are imprecise. Processes that are writing to the file system
> will be stopped a short time after the quota limit is reached. They will
> inevitably be allowed to write some amount of data over the configured
> limit. How far over the quota they are able to go depends primarily on
> the amount of time, not the amount of data. Generally speaking writers
> will be stopped within 10s of seconds of crossing the configured limit."
>
> You can write quite a bit of data in 10s of seconds (multiple GBs is not
> unreasonable here).
>
> > On Mon, Mar 4, 2019 at 2:50 PM Jeff Layton  wrote:
>
> > > > > On Mon, 2019-03-04 at 09:11 -0500, Jeff Layton wrote:
> > This list has
> > > been deprecated. Please subscribe to the new devel list at
> > > lists.nfs-ganesha.org.
> > On Fri, 2019-03-01 at 15:49 +, David C
> > > wrote:
> > > This list has been deprecated. Please subscribe to the new
> > > devel list at lists.nfs-ganesha.org.
> > > Hi All
> > >
> > > Exporting
> > > cephfs with the CEPH_FSAL
> > >
> > > I set the following on a dir:
> > >
> >
> > > > setfattr -n ceph.quota.max_bytes -v 1 /dir
> > > setfattr -n
> > > ceph.quota.max_files -v 10 /dir
> > >
> > > From an NFSv4 client, the
> > > quota.max_bytes appears to be completely ignored, I can go GBs over
> > > the quota in the dir. The quota.max_files DOES work however, if I
> > > try and create more than 10 files, I'll get "Error opening file
> > > 'dir/new file': Disk quota exceeded" as expected.
> > >
> > > From a
> > > fuse-mount on the same server that is running nfs-ganesha, I've
> > > confirmed ceph.quota.max_bytes is enforcing the quota, I'm unable to
> > > copy more than 100MB into the dir.
> > >
> > > According to [1] and [2]
> > > this should work.
> > >
> > > Cluster is Luminous 12.2.10
> > >
> > > Package
> > > versions on nfs-ganesha server:
> > >
> > > nfs-ganesha-rados-grace-
> > > 2.7.1-0.1.el7.x86_64
> > > nfs-ganesha-2.7.1-0.1.el7.x86_64
> > > nfs-
> > > ganesha-vfs-2.7.1-0.1.el7.x86_64
> > > nfs-ganesha-ceph-2.7.1-
> > > 0.1.el7.x86_64
> > > libcephfs2-13.2.2-0.el7.x86_64
> > > ceph-fuse-
> > > 12.2.10-0.el7.x86_64
> > >
> > > My Ganesha export:
> > >
> > > EXPORT
> > > {
> >
> > > > Export_ID=100;
> > > Protocols = 4;
> > > Transports = TCP;
> >
> > > >     Path = /;
> > > Pseudo = /ceph/;
> > > Access_Type = RW;
> > >
> > >Attr_Expiration_Time = 0;
> > > #Manage_Gids = TRUE;
> > >
> > >  Filesystem_Id = 100.1;
> > > FSAL {
> > > Name = CEPH;
> > >
> > >  }
> > > }
> > >
> > > My ceph.conf client section:
> > >
> > > [client]
> > >
> > >mon host = 10.10.10.210:6789, 10.10.10.211:6789,
> > > 10.10.10.212:6789
> > > client_oc_size = 8388608000
> > >
> > &

Re: [ceph-users] [Nfs-ganesha-devel] NFS-Ganesha CEPH_FSAL ceph.quota.max_bytes not enforced

2019-03-04 Thread David C
Looks like you're right, Jeff. Just tried to write into the dir and am now
getting the quota warning. So I guess it was the libcephfs cache as you
say. That's fine for me, I don't need the quotas to be too strict, just a
failsafe really.

Interestingly, if I create a new dir and set the same 100MB quota, I can write
multiple files with "dd if=/dev/zero of=1G bs=1M count=1024 oflag=direct".
Wouldn't that bypass the cache? I have the following in my ganesha.conf
which I believe effectively disables Ganesha's caching:

CACHEINODE {
Dir_Chunk = 0;
NParts = 1;
Cache_Size = 1;
}
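
For completeness, the way I'm setting and then verifying the quota on the test
dir from a fuse mount is along these lines (the path and the exact byte value
are just examples, 104857600 being 100MB spelled out):

setfattr -n ceph.quota.max_bytes -v 104857600 /mnt/cephfs/dir
getfattr -n ceph.quota.max_bytes /mnt/cephfs/dir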

Thanks,

On Mon, Mar 4, 2019 at 2:50 PM Jeff Layton  wrote:

> On Mon, 2019-03-04 at 09:11 -0500, Jeff Layton wrote:
> > This list has been deprecated. Please subscribe to the new devel list at
> lists.nfs-ganesha.org.
> > On Fri, 2019-03-01 at 15:49 +, David C wrote:
> > > This list has been deprecated. Please subscribe to the new devel list
> at lists.nfs-ganesha.org.
> > > Hi All
> > >
> > > Exporting cephfs with the CEPH_FSAL
> > >
> > > I set the following on a dir:
> > >
> > > setfattr -n ceph.quota.max_bytes -v 1 /dir
> > > setfattr -n ceph.quota.max_files -v 10 /dir
> > >
> > > From an NFSv4 client, the quota.max_bytes appears to be completely
> ignored, I can go GBs over the quota in the dir. The quota.max_files DOES
> work however, if I try and create more than 10 files, I'll get "Error
> opening file 'dir/new file': Disk quota exceeded" as expected.
> > >
> > > From a fuse-mount on the same server that is running nfs-ganesha, I've
> confirmed ceph.quota.max_bytes is enforcing the quota, I'm unable to copy
> more than 100MB into the dir.
> > >
> > > According to [1] and [2] this should work.
> > >
> > > Cluster is Luminous 12.2.10
> > >
> > > Package versions on nfs-ganesha server:
> > >
> > > nfs-ganesha-rados-grace-2.7.1-0.1.el7.x86_64
> > > nfs-ganesha-2.7.1-0.1.el7.x86_64
> > > nfs-ganesha-vfs-2.7.1-0.1.el7.x86_64
> > > nfs-ganesha-ceph-2.7.1-0.1.el7.x86_64
> > > libcephfs2-13.2.2-0.el7.x86_64
> > > ceph-fuse-12.2.10-0.el7.x86_64
> > >
> > > My Ganesha export:
> > >
> > > EXPORT
> > > {
> > > Export_ID=100;
> > > Protocols = 4;
> > > Transports = TCP;
> > > Path = /;
> > > Pseudo = /ceph/;
> > > Access_Type = RW;
> > > Attr_Expiration_Time = 0;
> > > #Manage_Gids = TRUE;
> > > Filesystem_Id = 100.1;
> > > FSAL {
> > > Name = CEPH;
> > > }
> > > }
> > >
> > > My ceph.conf client section:
> > >
> > > [client]
> > > mon host = 10.10.10.210:6789, 10.10.10.211:6789,
> 10.10.10.212:6789
> > > client_oc_size = 8388608000
> > > #fuse_default_permission=0
> > > client_acl_type=posix_acl
> > > client_quota = true
> > > client_quota_df = true
> > >
> > > Related links:
> > >
> > > [1] http://tracker.ceph.com/issues/16526
> > > [2] https://github.com/nfs-ganesha/nfs-ganesha/issues/100
> > >
> > > Thanks
> > > David
> > >
> >
> > It looks like you're having ganesha do the mount as "client.admin", and
> > I suspect that that may allow you to bypass quotas? You may want to try
> > creating a cephx user with less privileges, have ganesha connect as that
> > user and see if it changes things?
> >
>
> Actually, this may be wrong info.
>
> How are you testing being able to write to the file past quota? Are you
> using O_DIRECT I/O? If not, then it may just be that you're seeing the
> effect of the NFS client caching writes.
> --
> Jeff Layton 
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd unmap fails with error: rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy

2019-03-01 Thread David Turner
True, but not before you unmap it from the previous server. It's like
physically connecting a hard drive to two servers at the same time. Neither
knows what the other is doing to it and can corrupt your data. You should
always make sure to unmap an rbd before mapping it to another server.
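
i.e. the order matters; roughly:

rbd showmapped                      # on the old host, find the device
umount /mountpoint && rbd unmap /dev/rbd0
# only then, on the new host:
rbd map hdb-backup/ld2110

If the old host is dead and can't unmap cleanly, newer kernels accept a forced
unmap (rbd unmap -o force /dev/rbd0), but only reach for that once you're sure
nothing is still writing.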

On Fri, Mar 1, 2019, 6:28 PM solarflow99  wrote:

> It has to be mounted from somewhere, if that server goes offline, you need
> to mount it from somewhere else right?
>
>
> On Thu, Feb 28, 2019 at 11:15 PM David Turner 
> wrote:
>
>> Why are you mapping the same rbd to multiple servers?
>>
>> On Wed, Feb 27, 2019, 9:50 AM Ilya Dryomov  wrote:
>>
>>> On Wed, Feb 27, 2019 at 12:00 PM Thomas <74cmo...@gmail.com> wrote:
>>> >
>>> > Hi,
>>> > I have noticed an error when writing to a mapped RBD.
>>> > Therefore I unmounted the block device.
>>> > Then I tried to unmap it w/o success:
>>> > ld2110:~ # rbd unmap /dev/rbd0
>>> > rbd: sysfs write failed
>>> > rbd: unmap failed: (16) Device or resource busy
>>> >
>>> > The same block device is mapped on another client and there are no
>>> issues:
>>> > root@ld4257:~# rbd info hdb-backup/ld2110
>>> > rbd image 'ld2110':
>>> > size 7.81TiB in 2048000 objects
>>> > order 22 (4MiB objects)
>>> > block_name_prefix: rbd_data.3cda0d6b8b4567
>>> > format: 2
>>> > features: layering
>>> > flags:
>>> > create_timestamp: Fri Feb 15 10:53:50 2019
>>> > root@ld4257:~# rados -p hdb-backup  listwatchers
>>> rbd_data.3cda0d6b8b4567
>>> > error listing watchers hdb-backup/rbd_data.3cda0d6b8b4567: (2) No such
>>> > file or directory
>>> > root@ld4257:~# rados -p hdb-backup  listwatchers
>>> rbd_header.3cda0d6b8b4567
>>> > watcher=10.76.177.185:0/1144812735 client.21865052 cookie=1
>>> > watcher=10.97.206.97:0/4023931980 client.18484780
>>> > cookie=18446462598732841027
>>> >
>>> >
>>> > Question:
>>> > How can I force to unmap the RBD on client ld2110 (= 10.76.177.185)?
>>>
>>> Hi Thomas,
>>>
>>> It appears that /dev/rbd0 is still open on that node.
>>>
>>> Was the unmount successful?  Which filesystem (ext4, xfs, etc)?
>>>
>>> What is the output of "ps aux | grep rbd" on that node?
>>>
>>> Try lsof, fuser, check for LVM volumes and multipath -- these have been
>>> reported to cause this issue previously:
>>>
>>>   http://tracker.ceph.com/issues/12763
>>>
>>> Thanks,
>>>
>>> Ilya
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] NFS-Ganesha CEPH_FSAL ceph.quota.max_bytes not enforced

2019-03-01 Thread David C
Hi All

Exporting cephfs with the CEPH_FSAL

I set the following on a dir:

setfattr -n ceph.quota.max_bytes -v 1 /dir
setfattr -n ceph.quota.max_files -v 10 /dir

From an NFSv4 client, the quota.max_bytes appears to be completely ignored,
I can go GBs over the quota in the dir. The *quota.max_files* DOES work
however, if I try and create more than 10 files, I'll get "Error opening
file 'dir/new file': Disk quota exceeded" as expected.

From a fuse-mount on the same server that is running nfs-ganesha, I've
confirmed ceph.quota.max_bytes is enforcing the quota, I'm unable to copy
more than 100MB into the dir.

According to [1] and [2] this should work.

Cluster is Luminous 12.2.10

Package versions on nfs-ganesha server:

nfs-ganesha-rados-grace-2.7.1-0.1.el7.x86_64
nfs-ganesha-2.7.1-0.1.el7.x86_64
nfs-ganesha-vfs-2.7.1-0.1.el7.x86_64
nfs-ganesha-ceph-2.7.1-0.1.el7.x86_64
libcephfs2-13.2.2-0.el7.x86_64
ceph-fuse-12.2.10-0.el7.x86_64

My Ganesha export:

EXPORT
{
Export_ID=100;
Protocols = 4;
Transports = TCP;
Path = /;
Pseudo = /ceph/;
Access_Type = RW;
Attr_Expiration_Time = 0;
#Manage_Gids = TRUE;
Filesystem_Id = 100.1;
FSAL {
Name = CEPH;
}
}

My ceph.conf client section:

[client]
mon host = 10.10.10.210:6789, 10.10.10.211:6789, 10.10.10.212:6789
client_oc_size = 8388608000
#fuse_default_permission=0
client_acl_type=posix_acl
client_quota = true
client_quota_df = true

Related links:

[1] http://tracker.ceph.com/issues/16526
[2] https://github.com/nfs-ganesha/nfs-ganesha/issues/100

Thanks
David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic 13.2.4 rbd du slowness

2019-02-28 Thread David Turner
Have you used strace on the du command to see what it's spending its time
doing?
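
Something along these lines is what I had in mind (the image spec is just an
example):

strace -c -f rbd du rbd/myimage                            # syscall count/time summary
strace -f -tt -T -o /tmp/rbd-du.trace rbd du rbd/myimage   # full timeline

If the time turns out to be userspace CPU rather than syscalls, running perf
top while the command runs would be the next step.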

On Thu, Feb 28, 2019, 8:45 PM Glen Baars 
wrote:

> Hello Wido,
>
> The cluster layout is as follows:
>
> 3 x Monitor hosts ( 2 x 10Gbit bonded )
> 9 x OSD hosts (
> 2 x 10Gbit bonded,
> LSI cachecade and write cache drives set to single,
> All HDD in this pool,
> no separate DB / WAL. With the write cache and the SSD read cache on the
> LSI card it seems to perform well.
> 168 OSD disks
>
> No major increase in OSD disk usage or CPU usage. The RBD DU process uses
> 100% of a single 2.4Ghz core while running - I think that is the limiting
> factor.
>
> I have just tried removing most of the snapshots for that volume ( from 14
> snapshots down to 1 snapshot ) and the rbd du command now takes around 2-3
> minutes.
>
> Kind regards,
> Glen Baars
>
> -Original Message-
> From: Wido den Hollander 
> Sent: Thursday, 28 February 2019 5:05 PM
> To: Glen Baars ; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Mimic 13.2.4 rbd du slowness
>
>
>
> On 2/28/19 9:41 AM, Glen Baars wrote:
> > Hello Wido,
> >
> > I have looked at the libvirt code and there is a check to ensure that
> fast-diff is enabled on the image and only then does it try to get the real
> disk usage. The issue for me is that even with fast-diff enabled it takes
> 25min to get the space usage for a 50TB image.
> >
> > I had considered turning off fast-diff on the large images to get
> > around to issue but I think that will hurt my snapshot removal times (
> > untested )
> >
>
> Can you tell a bit more about the Ceph cluster? HDD? SSD? DB and WAL on
> SSD?
>
> Do you see OSDs spike in CPU or Disk I/O when you do a 'rbd du' on these
> images?
>
> Wido
>
> > I can't see in the code any other way of bypassing the disk usage check
> but I am not that familiar with the code.
> >
> > ---
> > if (volStorageBackendRBDUseFastDiff(features)) {
> > VIR_DEBUG("RBD image %s/%s has fast-diff feature enabled. "
> >   "Querying for actual allocation",
> >   def->source.name, vol->name);
> >
> > if (virStorageBackendRBDSetAllocation(vol, image, &info) < 0)
> > goto cleanup;
> > } else {
> > vol->target.allocation = info.obj_size * info.num_objs; }
> > --
> >
> > Kind regards,
> > Glen Baars
> >
> > -Original Message-
> > From: Wido den Hollander 
> > Sent: Thursday, 28 February 2019 3:49 PM
> > To: Glen Baars ;
> > ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Mimic 13.2.4 rbd du slowness
> >
> >
> >
> > On 2/28/19 2:59 AM, Glen Baars wrote:
> >> Hello Ceph Users,
> >>
> >> Has anyone found a way to improve the speed of the rbd du command on
> large rbd images? I have object map and fast diff enabled - no invalid
> flags on the image or it's snapshots.
> >>
> >> We recently upgraded our Ubuntu 16.04 KVM servers for Cloudstack to
> Ubuntu 18.04. The upgrades libvirt to version 4. When libvirt 4 adds an rbd
> pool it discovers all images in the pool and tries to get their disk usage.
> We are seeing a 50TB image take 25min. The pool has over 300TB of images in
> it and takes hours for libvirt to start.
> >>
> >
> > This is actually a pretty bad thing imho. As a lot of images people will
> be using do not have fast-diff enabled (images from the past) and that will
> kill their performance.
> >
> > Isn't there a way to turn this off in libvirt?
> >
> > Wido
> >
> >> We can replicate the issue without libvirt by just running a rbd du on
> the large images. The limiting factor is the cpu on the rbd du command, it
> uses 100% of a single core.
> >>
> >> Our cluster is completely bluestore/mimic 13.2.4. 168 OSDs, 12 Ubuntu
> 16.04 hosts.
> >>
> >> Kind regards,
> >> Glen Baars
> >> This e-mail is intended solely for the benefit of the addressee(s) and
> any other named recipient. It is confidential and may contain legally
> privileged or confidential information. If you are not the recipient, any
> use, distribution, disclosure or copying of this e-mail is prohibited. The
> confidentiality and legal privilege attached to this communication is not
> waived or lost by reason of the mistaken transmission or delivery to you.
> If you have received this e-mail in error, please notify us immediately.
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> > This e-mail is intended solely for the benefit of the addressee(s) and
> any other named recipient. It is confidential and may contain legally
> privileged or confidential information. If you are not the recipient, any
> use, distribution, disclosure or copying of this e-mail is prohibited. The
> confidentiality and legal privilege attached to this communication is not
> waived or lost by reason of the mistaken transmission or delivery to you.
> If you have received this e-mail in error, please notify us immediately.

Re: [ceph-users] rbd unmap fails with error: rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy

2019-02-28 Thread David Turner
Why are you mapping the same RBD to multiple servers?
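In any case, a few things worth trying on ld2110 before forcing anything
(device name as in the quoted output; the force option needs a reasonably
recent kernel):

lsblk /dev/rbd0                  # any partitions/LVM stacked on top?
lsof /dev/rbd0
fuser -vm /dev/rbd0
multipath -ll                    # multipath claiming the device?
rbd unmap -o force /dev/rbd0     # last resort once the holders are found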

On Wed, Feb 27, 2019, 9:50 AM Ilya Dryomov  wrote:

> On Wed, Feb 27, 2019 at 12:00 PM Thomas <74cmo...@gmail.com> wrote:
> >
> > Hi,
> > I have noticed an error when writing to a mapped RBD.
> > Therefore I unmounted the block device.
> > Then I tried to unmap it w/o success:
> > ld2110:~ # rbd unmap /dev/rbd0
> > rbd: sysfs write failed
> > rbd: unmap failed: (16) Device or resource busy
> >
> > The same block device is mapped on another client and there are no
> issues:
> > root@ld4257:~# rbd info hdb-backup/ld2110
> > rbd image 'ld2110':
> > size 7.81TiB in 2048000 objects
> > order 22 (4MiB objects)
> > block_name_prefix: rbd_data.3cda0d6b8b4567
> > format: 2
> > features: layering
> > flags:
> > create_timestamp: Fri Feb 15 10:53:50 2019
> > root@ld4257:~# rados -p hdb-backup  listwatchers rbd_data.3cda0d6b8b4567
> > error listing watchers hdb-backup/rbd_data.3cda0d6b8b4567: (2) No such
> > file or directory
> > root@ld4257:~# rados -p hdb-backup  listwatchers
> rbd_header.3cda0d6b8b4567
> > watcher=10.76.177.185:0/1144812735 client.21865052 cookie=1
> > watcher=10.97.206.97:0/4023931980 client.18484780
> > cookie=18446462598732841027
> >
> >
> > Question:
> > How can I force to unmap the RBD on client ld2110 (= 10.76.177.185)?
>
> Hi Thomas,
>
> It appears that /dev/rbd0 is still open on that node.
>
> Was the unmount successful?  Which filesystem (ext4, xfs, etc)?
>
> What is the output of "ps aux | grep rbd" on that node?
>
> Try lsof, fuser, check for LVM volumes and multipath -- these have been
> reported to cause this issue previously:
>
>   http://tracker.ceph.com/issues/12763
>
> Thanks,
>
> Ilya
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG Calculations Issue

2019-02-28 Thread David Turner
Those numbers look right for a pool only containing 10% of your data. Now
continue to calculate the pg counts for the remaining 90% of your data.
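If it helps, here is a rough shell transcription of the first formula
(numbers are from the 64-OSD example below; it simply rounds up to the
next power of two and leaves out the 25% adjustment):

target_per_osd=100; osds=64; pct_data=0.10; size=3
f1=$(echo "$target_per_osd * $osds * ($pct_data / 100) / $size" | bc -l)
f2=$(echo "$osds / $size" | bc -l)
raw=$(printf '%s\n%s\n' "$f1" "$f2" | sort -n | tail -1)   # higher of the two
pg=1; while (( $(echo "$pg < $raw" | bc) )); do pg=$((pg * 2)); done
echo "raw=$raw -> pg_num=$pg"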

On Wed, Feb 27, 2019, 12:17 PM Krishna Venkata 
wrote:

> Greetings,
>
>
> I am having issues in the way PGs are calculated in
> https://ceph.com/pgcalc/ [Ceph PGs per Pool Calculator ] and the formulae
> mentioned in the site.
>
> Below are my findings
>
> The formula to calculate PGs as mentioned in the https://ceph.com/pgcalc/
>  :
>
> 1.  Need to pick the highest value from either of the formulas
>
> *(( Target PGs per OSD ) x ( OSD # ) x ( %Data ))/(size)*
>
> Or
>
> *( OSD# ) / ( Size )*
>
> 2.  The output value is then rounded to the nearest power of 2
>
>1. If the nearest power of 2 is more than 25% below the original
>value, the next higher power of 2 is used.
>
>
>
> Based on the above procedure, we calculated PGs for 25, 32 and 64 OSDs
>
> *Our Dataset:*
>
> *%Data:* 0.10
>
> *Target PGs per OSD:* 100
>
> *OSDs* 25, 32 and 64
>
>
>
> *For 25 OSDs*
>
>
>
> (100*25* (0.10/100))/(3) = 0.833
>
>
>
> ( 25 ) / ( 3 ) = 8.33
>
>
>
> 1. Raw pg num 8.33  ( Since we need to pick the highest of (0.833, 8.33))
>
> 2. max pg 16 ( For, 8.33 the nearest power of 2 is 16)
>
> 3. 16 > 2.08  ( 25 % of 8.33 is 2.08 which is more than 25% the power of 2)
>
>
>
> So 16 PGs
>
> ✓  GUI Calculator gives the same value and matches with Formula.
>
>
>
> *For 32 OSD*
>
>
>
> (100*32*(0.10/100))/3 = 1.066
>
> ( 32 ) / ( 3 ) = 10.66
>
>
>
> 1. Raw pg num 10.66 ( Since we need to pick the highest of (1.066, 10.66))
>
> 2. max pg 16 ( For, 10.66 the nearest power of 2 is 16)
>
> 3.  16 > 2.655 ( 25 % of 10.66 is 2.655 which is more than 25% the power
> of 2)
>
>
>
> So 16 PGs
>
> ✗  GUI Calculator gives different value (32 PGs) which doesn’t match with
> Formula.
>
>
>
> *For 64 OSD*
>
>
>
> (100 * 64 * (0.10/100))/3 = 2.133
>
> ( 64 ) / ( 3 ) = 21.33
>
>
>
> 1. Raw pg num 21.33 ( Since we need to pick the highest of (2.133, 21.33))
>
> 2. max pg 32 ( For, 21.33 the nearest power of 2 is 32)
>
> 3. 32 > 5.3325 ( 25 % of 21.33 is 5.3325 which is more than 25% the power
> of 2)
>
>
>
> So 32 PGs
>
> ✗  GUI Calculator gives different value (64 PGs) which doesn’t match with
> Formula.
>
>
>
> We checked the PG calculator logic from [
> https://ceph.com/pgcalc_assets/pgcalc.js ] which is not matching from
> above formulae.
>
>
>
> Can someone Guide/reference us to correct formulae to calculate PGs.
>
>
>
> Thanks in advance.
>
>
>
> Regards,
>
> Krishna Venkata
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] redirect log to syslog and disable log to stderr

2019-02-28 Thread David Turner
You can always set it in your ceph.conf file and restart the mgr daemon.
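Something like this, for example (restart command assumes systemd and that
the mgr id matches the short hostname):

cat >> /etc/ceph/ceph.conf <<'EOF'
[mgr]
log to stderr = false
log to syslog = true
EOF
systemctl restart ceph-mgr@$(hostname -s)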

On Tue, Feb 26, 2019, 1:30 PM Alex Litvak 
wrote:

> Dear Cephers,
>
> In mimic 13.2.2
> ceph tell mgr.* injectargs --log-to-stderr=false
> Returns an error (no valid command found ...).  What is the correct way to
> inject mgr configuration values?
>
> The same command works on mon
>
> ceph tell mon.* injectargs --log-to-stderr=false
>
>
> Thank you in advance,
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Right way to delete OSD from cluster?

2019-02-28 Thread David Turner
The reason is that an osd still contributes to the host weight in the crush
map even while it is marked out. When you out and then purge, the purging
operation removed the osd from the map and changes the weight of the host
which changes the crush map and data moves. By weighting the osd to 0.0,
the hosts weight is already the same it will be when you purge the osd.
Weighting to 0.0 is definitely the best option for removing storage if you
can trust the data on the osd being removed.
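As a concrete (hypothetical) example for osd.12:

ceph osd crush reweight osd.12 0.0        # the only step that should move data
# wait for all PGs to be active+clean, then:
ceph osd out 12
systemctl stop ceph-osd@12                # on the host that owns it
ceph osd purge 12 --yes-i-really-mean-it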

On Tue, Feb 26, 2019, 3:19 AM Fyodor Ustinov  wrote:

> Hi!
>
> Thank you so much!
>
> I do not understand why, but your variant really causes only one rebalance
> compared to the "osd out".
>
> - Original Message -
> From: "Scottix" 
> To: "Fyodor Ustinov" 
> Cc: "ceph-users" 
> Sent: Wednesday, 30 January, 2019 20:31:32
> Subject: Re: [ceph-users] Right way to delete OSD from cluster?
>
> I generally have gone the crush reweight 0 route
> This way the drive can participate in the rebalance, and the rebalance
> only happens once. Then you can take it out and purge.
>
> If I am not mistaken this is the safest.
>
> ceph osd crush reweight <osd.id> 0
>
> On Wed, Jan 30, 2019 at 7:45 AM Fyodor Ustinov  wrote:
> >
> > Hi!
> >
> > But unless after "ceph osd crush remove" I will not got the undersized
> objects? That is, this is not the same thing as simply turning off the OSD
> and waiting for the cluster to be restored?
> >
> > - Original Message -
> > From: "Wido den Hollander" 
> > To: "Fyodor Ustinov" , "ceph-users" <
> ceph-users@lists.ceph.com>
> > Sent: Wednesday, 30 January, 2019 15:05:35
> > Subject: Re: [ceph-users] Right way to delete OSD from cluster?
> >
> > On 1/30/19 2:00 PM, Fyodor Ustinov wrote:
> > > Hi!
> > >
> > > I thought I should first do "ceph osd out", wait for the end
> relocation of the misplaced objects and after that do "ceph osd purge".
> > > But after "purge" the cluster starts relocation again.
> > >
> > > Maybe I'm doing something wrong? Then what is the correct way to
> delete the OSD from the cluster?
> > >
> >
> > You are not doing anything wrong, this is the expected behavior. There
> > are two CRUSH changes:
> >
> > - Marking it out
> > - Purging it
> >
> > You could do:
> >
> > $ ceph osd crush remove osd.X
> >
> > Wait for all good
> >
> > $ ceph osd purge X
> >
> > The last step should then not initiate any data movement.
> >
> > Wido
> >
> > > WBR,
> > > Fyodor.
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> T: @Thaumion
> IG: Thaumion
> scot...@gmail.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs recursive stats | rctime in the future

2019-02-28 Thread David C
On Wed, Feb 27, 2019 at 11:35 AM Hector Martin 
wrote:

> On 27/02/2019 19:22, David C wrote:
> > Hi All
> >
> > I'm seeing quite a few directories in my filesystem with rctime years in
> > the future. E.g
> >
> > ]# getfattr -d -m ceph.dir.* /path/to/dir
> > getfattr: Removing leading '/' from absolute path names
> > # file:  path/to/dir
> > ceph.dir.entries="357"
> > ceph.dir.files="1"
> > ceph.dir.rbytes="35606883904011"
> > ceph.dir.rctime="1851480065.090"
> > ceph.dir.rentries="12216551"
> > ceph.dir.rfiles="10540827"
> > ceph.dir.rsubdirs="1675724"
> > ceph.dir.subdirs="356"
> >
> > That's showing a last modified time of 2 Sept 2028, the day and month
> > are also wrong.
>
> Obvious question: are you sure the date/time on your cluster nodes and
> your clients is correct? Can you track down which files (if any) have
> the ctime in the future by following the rctime down the filesystem tree?
>

Times are all correct on the nodes and CephFS clients; however, the fs is
being exported over NFS. It's possible some NFS clients have the wrong time,
although I'm reasonably confident they are all correct: the machines are
synced to local time servers and use AD for auth, and things wouldn't work
if the time was that wildly out of sync.

Good idea on checking down the tree. I've found the offending files but
can't find any explanation as to why they have a modified date so far in
the future.

For example one dir is "/.config/caja/" in a user's home dir. The files in
this dir are all wildly different, the modified times are 1984, 1997,
2028...

It certainly feels like an MDS issue to me. I've used the recursive stats
since Jewel and I've never seen this before.

Any ideas?



> --
> Hector Martin (hec...@marcansoft.com)
> Public Key: https://mrcn.st/pub
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cephfs recursive stats | rctime in the future

2019-02-27 Thread David C
Hi All

I'm seeing quite a few directories in my filesystem with rctime years in
the future. E.g

]# getfattr -d -m ceph.dir.* /path/to/dir
getfattr: Removing leading '/' from absolute path names
# file:  path/to/dir
ceph.dir.entries="357"
ceph.dir.files="1"
ceph.dir.rbytes="35606883904011"
ceph.dir.rctime="1851480065.090"
ceph.dir.rentries="12216551"
ceph.dir.rfiles="10540827"
ceph.dir.rsubdirs="1675724"
ceph.dir.subdirs="356"

That's showing a last modified time of 2 Sept 2028, the day and month are
also wrong.

Most dirs are still showing the correct rctime.

I've used the recursive stats for a few years now and they've always been
reliable. The last major changes I made to this cluster was an update to
Luminous 12.2.10, moving the metadata pool to an SSD backed pool and the
addition of a second Cephfs data pool.

I have just received a scrub error this morning with 1 inconsistent pg but
I've been noticing the incorrect rctimes for a while a now so not sure if
that's related.

Any help much appreciated

Thanks
David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Usenix Vault 2019

2019-02-24 Thread David Turner
There is a scheduled birds of a feather for Ceph tomorrow night, but I also
noticed that there are only trainings tomorrow. Unless you are paying more
for those, you likely don't have much to do on Monday. That's the boat I'm
in. Is anyone interested in getting together tomorrow in Boston during the
training day?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Configuration about using nvme SSD

2019-02-24 Thread David Turner
One thing that's worked for me to get more out of nvmes with Ceph is to
create multiple partitions on the nvme with an osd on each partition. That
way you get more osd processes and CPU per nvme device. I've heard of
people using up to 4 partitions like this.
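With a recent enough ceph-volume this can be done in one go (device name is
an example; older releases need manual partitioning):

ceph-volume lvm batch --osds-per-device 4 /dev/nvme0n1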

On Sun, Feb 24, 2019, 10:25 AM Vitaliy Filippov  wrote:

> > We can get 513558 IOPS in 4K read per nvme by fio but only 45146 IOPS
> > per OSD.by rados.
>
> Don't expect Ceph to fully utilize NVMe's, it's software and it's slow :)
> some colleagues tell that SPDK works out of the box, but almost doesn't
> increase performance, because the userland-kernel interaction isn't the
> bottleneck currently, it's Ceph code itself. I also tried once, but I
> couldn't make it work. When I have some spare NVMe's I'll make another
> attempt.
>
> So... try it and share your results here :) we're all interested.
>
> --
> With best regards,
>Vitaliy Filippov
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Doubts about backfilling performance

2019-02-23 Thread David Turner
Jewel is really limited on the settings you can tweak for backfilling [1].
Luminous and Mimic have a few more knobs. An option you can do, though, is
to use osd_crush_initial_weight found [2] here. With this setting you set
your initial crush weight for new osds to 0.0 and gradually increase them
to what you want them to be. This doesn't help with already added osds, but
can help in the future.


[1]
http://docs.ceph.com/docs/jewel/rados/configuration/osd-config-ref/#backfilling
[2] http://docs.ceph.com/docs/jewel/rados/configuration/pool-pg-config-ref/
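A sketch of what that looks like in practice (ids and weights are examples).
In ceph.conf on the OSD hosts, before creating the new OSDs:

[osd]
osd crush initial weight = 0

Then, once the new OSD is up, step it towards its real weight in stages:

ceph osd crush reweight osd.42 0.5
ceph osd crush reweight osd.42 1.2
ceph osd crush reweight osd.42 1.81898   # final weight for the drive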

On Sat, Feb 23, 2019, 6:08 AM Fabio Abreu  wrote:

> Hello everybody,
>
> I try to improve the backfilling proccess without impact my client I/O,
> that is a painfull thing  when i putted a new osd in my environment.
>
> I look some options like osd backfill scan max , Can I improve the
> performance if I reduce this ?
>
> Someome recommend parameter to study in my scenario.
>
> My environment is jewel 10.2.7 .
>
> Best Regards,
> Fabio Abreu
> --
> Atenciosamente,
> Fabio Abreu Reis
> http://fajlinux.com.br
> *Tel : *+55 21 98244-0161
> *Skype : *fabioabreureis
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-22 Thread David Turner
Mon disks don't have journals, they're just a folder on a filesystem on a
disk.

On Fri, Feb 22, 2019, 6:40 AM M Ranga Swami Reddy 
wrote:

> ceph mons looks fine during the recovery.  Using  HDD with SSD
> journals. with recommeded CPU and RAM numbers.
>
> On Fri, Feb 22, 2019 at 4:40 PM David Turner 
> wrote:
> >
> > What about the system stats on your mons during recovery? If they are
> having a hard time keeping up with requests during a recovery, I could see
> that impacting client io. What disks are they running on? CPU? Etc.
> >
> > On Fri, Feb 22, 2019, 6:01 AM M Ranga Swami Reddy 
> wrote:
> >>
> >> Debug setting defaults are using..like 1/5 and 0/5 for almost..
> >> Shall I try with 0 for all debug settings?
> >>
> >> On Wed, Feb 20, 2019 at 9:17 PM Darius Kasparavičius 
> wrote:
> >> >
> >> > Hello,
> >> >
> >> >
> >> > Check your CPU usage when you are doing those kind of operations. We
> >> > had a similar issue where our CPU monitoring was reporting fine < 40%
> >> > usage, but our load on the nodes was high mid 60-80. If it's possible
> >> > try disabling ht and see the actual cpu usage.
> >> > If you are hitting CPU limits you can try disabling crc on messages.
> >> > ms_nocrc
> >> > ms_crc_data
> >> > ms_crc_header
> >> >
> >> > And setting all your debug messages to 0.
> >> > If you haven't done you can also lower your recovery settings a
> little.
> >> > osd recovery max active
> >> > osd max backfills
> >> >
> >> > You can also lower your file store threads.
> >> > filestore op threads
> >> >
> >> >
> >> > If you can also switch to bluestore from filestore. This will also
> >> > lower your CPU usage. I'm not sure that this is bluestore that does
> >> > it, but I'm seeing lower cpu usage when moving to bluestore + rocksdb
> >> > compared to filestore + leveldb .
> >> >
> >> >
> >> > On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
> >> >  wrote:
> >> > >
> >> > > Thats expected from Ceph by design. But in our case, we are using
> all
> >> > > recommendation like rack failure domain, replication n/w,etc, still
> >> > > face client IO performance issues during one OSD down..
> >> > >
> >> > > On Tue, Feb 19, 2019 at 10:56 PM David Turner <
> drakonst...@gmail.com> wrote:
> >> > > >
> >> > > > With a RACK failure domain, you should be able to have an entire
> rack powered down without noticing any major impact on the clients.  I
> regularly take down OSDs and nodes for maintenance and upgrades without
> seeing any problems with client IO.
> >> > > >
> >> > > > On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy <
> swamire...@gmail.com> wrote:
> >> > > >>
> >> > > >> Hello - I have a couple of questions on ceph cluster stability,
> even
> >> > > >> we follow all recommendations as below:
> >> > > >> - Having separate replication n/w and data n/w
> >> > > >> - RACK is the failure domain
> >> > > >> - Using SSDs for journals (1:4ratio)
> >> > > >>
> >> > > >> Q1 - If one OSD down, cluster IO down drastically and customer
> Apps impacted.
> >> > > >> Q2 - what is stability ratio, like with above, is ceph cluster
> >> > > >> workable condition, if one osd down or one node down,etc.
> >> > > >>
> >> > > >> Thanks
> >> > > >> Swami
> >> > > >> ___
> >> > > >> ceph-users mailing list
> >> > > >> ceph-users@lists.ceph.com
> >> > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> > > ___
> >> > > ceph-users mailing list
> >> > > ceph-users@lists.ceph.com
> >> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] REQUEST_SLOW across many OSDs at the same time

2019-02-22 Thread David Turner
Can you correlate the times to scheduled tasks inside of any VMs? For
instance if you have several Linux VMs with the updatedb command installed
that by default they will all be scanning their disks at the same time each
day to see where files are. Other common culprits could be scheduled
backups, db cleanup, etc. Do you track cluster io at all? When I first
configured a graphing tool on my home cluster I found the updatedb/locate
command happening with a drastic io spike at the same time every day. I
also found a spike when a couple Windows VMs were checking for updates
automatically.
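A quick way to check for the usual suspect inside a Linux VM (paths assume a
stock mlocate install):

grep -r updatedb /etc/cron.daily /etc/cron.d 2>/dev/null
cat /etc/cron.daily/mlocate 2>/dev/null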

On Fri, Feb 22, 2019, 4:28 AM mart.v  wrote:

> Hello everyone,
>
> I'm experiencing a strange behaviour. My cluster is relatively small (43
> OSDs, 11 nodes), running Ceph 12.2.10 (and Proxmox 5). Nodes are connected
> via 10 Gbit network (Nexus 6000). Cluster is mixed (SSD and HDD), but with
> different pools. Descibed error is only on the SSD part of the cluster.
>
> I noticed that few times a day the cluster slows down a bit and I have
> discovered this in logs:
>
> 2019-02-22 08:21:20.064396 mon.node1 mon.0 172.16.254.101:6789/0 1794159
> : cluster [WRN] Health check failed: 27 slow requests are blocked > 32 sec.
> Implicated osds 10,22,33 (REQUEST_SLOW)
> 2019-02-22 08:21:26.589202 mon.node1 mon.0 172.16.254.101:6789/0 1794169
> : cluster [WRN] Health check update: 199 slow requests are blocked > 32
> sec. Implicated osds 0,4,5,6,7,8,9,10,12,16,17,19,20,21,22,25,26,33,41
> (REQUEST_SLOW)
> 2019-02-22 08:21:32.655671 mon.node1 mon.0 172.16.254.101:6789/0 1794183
> : cluster [WRN] Health check update: 448 slow requests are blocked > 32
> sec. Implicated osds
> 0,3,4,5,6,7,8,9,10,12,15,16,17,19,20,21,22,24,25,26,33,41 (REQUEST_SLOW)
> 2019-02-22 08:21:38.744210 mon.node1 mon.0 172.16.254.101:6789/0 1794210
> : cluster [WRN] Health check update: 388 slow requests are blocked > 32
> sec. Implicated osds 4,8,10,16,24,33 (REQUEST_SLOW)
> 2019-02-22 08:21:42.790346 mon.node1 mon.0 172.16.254.101:6789/0 1794214
> : cluster [INF] Health check cleared: REQUEST_SLOW (was: 18 slow requests
> are blocked > 32 sec. Implicated osds 8,16)
>
> "ceph health detail" shows nothing more
>
> It is happening through the whole day and the times can't be linked to any
> read or write intensive task (e.g. backup). I also tried to disable
> scrubbing, but it kept on going. These errors were not there since
> beginning, but unfortunately I cannot track the day they started (it is
> beyond my logs).
>
> Any ideas?
>
> Thank you!
> Martin
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-22 Thread David Turner
What about the system stats on your mons during recovery? If they are
having a hard time keeping up with requests during a recovery, I could see
that impacting client io. What disks are they running on? CPU? Etc.
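Some quick things to capture on the mon hosts while a recovery is running
(assumes sysstat is installed and the mon id matches the short hostname):

iostat -x 5 3                              # per-disk latency and utilisation
uptime                                     # load average
ceph daemon mon.$(hostname -s) perf dump   # mon-side counters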

On Fri, Feb 22, 2019, 6:01 AM M Ranga Swami Reddy 
wrote:

> Debug setting defaults are using..like 1/5 and 0/5 for almost..
> Shall I try with 0 for all debug settings?
>
> On Wed, Feb 20, 2019 at 9:17 PM Darius Kasparavičius 
> wrote:
> >
> > Hello,
> >
> >
> > Check your CPU usage when you are doing those kind of operations. We
> > had a similar issue where our CPU monitoring was reporting fine < 40%
> > usage, but our load on the nodes was high mid 60-80. If it's possible
> > try disabling ht and see the actual cpu usage.
> > If you are hitting CPU limits you can try disabling crc on messages.
> > ms_nocrc
> > ms_crc_data
> > ms_crc_header
> >
> > And setting all your debug messages to 0.
> > If you haven't done you can also lower your recovery settings a little.
> > osd recovery max active
> > osd max backfills
> >
> > You can also lower your file store threads.
> > filestore op threads
> >
> >
> > If you can also switch to bluestore from filestore. This will also
> > lower your CPU usage. I'm not sure that this is bluestore that does
> > it, but I'm seeing lower cpu usage when moving to bluestore + rocksdb
> > compared to filestore + leveldb .
> >
> >
> > On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
> >  wrote:
> > >
> > > Thats expected from Ceph by design. But in our case, we are using all
> > > recommendation like rack failure domain, replication n/w,etc, still
> > > face client IO performance issues during one OSD down..
> > >
> > > On Tue, Feb 19, 2019 at 10:56 PM David Turner 
> wrote:
> > > >
> > > > With a RACK failure domain, you should be able to have an entire
> rack powered down without noticing any major impact on the clients.  I
> regularly take down OSDs and nodes for maintenance and upgrades without
> seeing any problems with client IO.
> > > >
> > > > On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy <
> swamire...@gmail.com> wrote:
> > > >>
> > > >> Hello - I have a couple of questions on ceph cluster stability, even
> > > >> we follow all recommendations as below:
> > > >> - Having separate replication n/w and data n/w
> > > >> - RACK is the failure domain
> > > >> - Using SSDs for journals (1:4ratio)
> > > >>
> > > >> Q1 - If one OSD down, cluster IO down drastically and customer Apps
> impacted.
> > > >> Q2 - what is stability ratio, like with above, is ceph cluster
> > > >> workable condition, if one osd down or one node down,etc.
> > > >>
> > > >> Thanks
> > > >> Swami
> > > >> ___
> > > >> ceph-users mailing list
> > > >> ceph-users@lists.ceph.com
> > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] faster switch to another mds

2019-02-20 Thread David Turner
If I'm not mistaken, if you stop them at the same time during a reboot on a
node with both mds and mon, the mons might receive it, but wait to finish
their own election vote before doing anything about it.  If you're trying
to keep optimal uptime for your mds, then stopping it first and on its own
makes sense.

On Wed, Feb 20, 2019 at 3:46 PM Patrick Donnelly 
wrote:

> On Tue, Feb 19, 2019 at 11:39 AM Fyodor Ustinov  wrote:
> >
> > Hi!
> >
> > From documentation:
> >
> > mds beacon grace
> > Description:The interval without beacons before Ceph declares an MDS
> laggy (and possibly replace it).
> > Type:   Float
> > Default:15
> >
> > I do not understand, 15 - are is seconds or beacons?
>
> seconds
>
> > And an additional misunderstanding - if we gently turn off the MDS (or
> MON), why it does not inform everyone interested before death - "I am
> turned off, no need to wait, appoint a new active server"
>
> The MDS does inform the monitors if it has been shutdown. If you pull
> the plug or SIGKILL, it does not. :)
>
>
> --
> Patrick Donnelly
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] faster switch to another mds

2019-02-19 Thread David Turner
It's also been mentioned a few times that when MDS and MON are on the same
host that the downtime for MDS is longer when both daemons stop at about
the same time.  It's been suggested to stop the MDS daemon, wait for `ceph
mds stat` to reflect the change, and then restart the rest of the server.
HTH.
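In practice that looks something like this (daemon id is an example):

systemctl stop ceph-mds@$(hostname -s)
ceph mds stat          # wait until a standby has taken over the rank
reboot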

On Mon, Feb 11, 2019 at 3:55 PM Gregory Farnum  wrote:

> You can't tell from the client log here, but probably the MDS itself was
> failing over to a new instance during that interval. There's not much
> experience with it, but you could experiment with faster failover by
> reducing the mds beacon and grace times. This may or may not work
> reliably...
>
> On Sat, Feb 9, 2019 at 10:52 AM Fyodor Ustinov  wrote:
>
>> Hi!
>>
>> I have ceph cluster with 3 nodes with mon/mgr/mds servers.
>> I reboot one node and see this in client log:
>>
>> Feb 09 20:29:14 ceph-nfs1 kernel: libceph: mon2 10.5.105.40:6789 socket
>> closed (con state OPEN)
>> Feb 09 20:29:14 ceph-nfs1 kernel: libceph: mon2 10.5.105.40:6789 session
>> lost, hunting for new mon
>> Feb 09 20:29:14 ceph-nfs1 kernel: libceph: mon0 10.5.105.34:6789 session
>> established
>> Feb 09 20:29:22 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket
>> closed (con state OPEN)
>> Feb 09 20:29:23 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket
>> closed (con state CONNECTING)
>> Feb 09 20:29:24 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket
>> closed (con state CONNECTING)
>> Feb 09 20:29:24 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket
>> closed (con state CONNECTING)
>> Feb 09 20:29:53 ceph-nfs1 kernel: ceph: mds0 reconnect start
>> Feb 09 20:29:53 ceph-nfs1 kernel: ceph: mds0 reconnect success
>> Feb 09 20:30:05 ceph-nfs1 kernel: ceph: mds0 recovery completed
>>
>> As I understand it, the following has happened:
>> 1. Client detects - link with mon server broken and fast switches to
>> another mon (less that 1 seconds).
>> 2. Client detects - link with mds server broken, 3 times trying reconnect
>> (unsuccessful), waiting and reconnects to the same mds after 30 seconds
>> downtime.
>>
>> I have 2 questions:
>> 1. Why?
>> 2. How to reduce switching time to another mds?
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS overwrite/truncate performance hit

2019-02-19 Thread David Turner
If your client needs to be able to handle the writes like that on its own,
RBDs might be the more appropriate use case.  You lose the ability to have
multiple clients accessing the data as easily as with CephFS, but you would
gain the features you're looking for.

On Tue, Feb 12, 2019 at 1:43 PM Gregory Farnum  wrote:

>
>
> On Tue, Feb 12, 2019 at 5:10 AM Hector Martin 
> wrote:
>
>> On 12/02/2019 06:01, Gregory Farnum wrote:
>> > Right. Truncates and renames require sending messages to the MDS, and
>> > the MDS committing to RADOS (aka its disk) the change in status, before
>> > they can be completed. Creating new files will generally use a
>> > preallocated inode so it's just a network round-trip to the MDS.
>>
>> I see. Is there a fundamental reason why these kinds of metadata
>> operations cannot be buffered in the client, or is this just the current
>> way they're implemented?
>>
>
> It's pretty fundamental, at least to the consistency guarantees we hold
> ourselves to. What happens if the client has buffered an update like that,
> performs writes to the data with those updates in mind, and then fails
> before they're flushed to the MDS? A local FS doesn't need to worry about a
> different node having a different lifetime, and can control the write order
> of its metadata and data updates on belated flush a lot more precisely than
> we can. :(
> -Greg
>
>
>>
>> e.g. on a local FS these kinds of writes can just stick around in the
>> block cache unflushed. And of course for CephFS I assume file extension
>> also requires updating the file size in the MDS, yet that doesn't block
>> while truncation does.
>>
>> > Going back to your first email, if you do an overwrite that is confined
>> > to a single stripe unit in RADOS (by default, a stripe unit is the size
>> > of your objects which is 4MB and it's aligned from 0), it is guaranteed
>> > to be atomic. CephFS can only tear writes across objects, and only if
>> > your client fails before the data has been flushed.
>>
>> Great! I've implemented this in a backwards-compatible way, so that gets
>> rid of this bottleneck. It's just a 128-byte flag file (formerly
>> variable length, now I just pad it to the full 128 bytes and rewrite it
>> in-place). This is good information to know for optimizing things :-)
>>
>> --
>> Hector Martin (hec...@marcansoft.com)
>> Public Key: https://mrcn.st/pub
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: client hangs

2019-02-19 Thread David Turner
You're attempting to use mismatching client name and keyring.  You want to
use matching name and keyring.  For your example, you would want to either
use `--keyring /etc/ceph/ceph.client.admin.keyring --name client.admin` or
`--keyring /etc/ceph/ceph.client.cephfs.keyring --name client.cephfs`.
Mixing and matching does not work.  Treat them like username and password.
You wouldn't try to log into your computer under your account with the
admin password.
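For example, to mount with the cephfs identity (key path and monitor address
as in your own commands):

ceph auth get client.cephfs        # confirm the key and caps you are mounting with
ceph-fuse --id cephfs --keyring /etc/ceph/ceph.client.cephfs.keyring \
    -m 192.168.1.17:6789 /mnt/cephfs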

On Tue, Feb 19, 2019 at 12:58 PM Hennen, Christian <
christian.hen...@uni-trier.de> wrote:

> > sounds like network issue. are there firewall/NAT between nodes?
> No, there is currently no firewall in place. Nodes and clients are on the
> same network. MTUs match, ports are opened according to nmap.
>
> > try running ceph-fuse on the node that run mds, check if it works
> properly.
> When I try to run ceph-fuse on either a client or cephfiler1
> (MON,MGR,MDS,OSDs) I get
> - "operation not permitted" when using the client keyring
> - "invalid argument" when using the admin keyring
> - "ms_handle_refused" when using the admin keyring and connecting to
> 127.0.0.1:6789
>
> ceph-fuse --keyring /etc/ceph/ceph.client.admin.keyring --name
> client.cephfs -m 192.168.1.17:6789 /mnt/cephfs
>
> -----Original Message-----
> From: Yan, Zheng 
> Sent: Tuesday, 19 February 2019 11:31
> To: Hennen, Christian 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] CephFS: client hangs
>
> On Tue, Feb 19, 2019 at 5:10 PM Hennen, Christian <
> christian.hen...@uni-trier.de> wrote:
> >
> > Hi!
> >
> > >mon_max_pg_per_osd = 400
> > >
> > >In the ceph.conf and then restart all the services / or inject the
> > >config into the running admin
> >
> > I restarted each server (MONs and OSDs weren’t enough) and now the
> health warning is gone. Still no luck accessing CephFS though.
> >
> >
> > > MDS show a client got evicted. Nothing else looks abnormal.  Do new
> > > cephfs clients also get evicted quickly?
> >
> > Aside from the fact that evicted clients don’t show up in ceph –s, we
> observe other strange things:
> >
> > ·   Setting max_mds has no effect
> >
> > ·   Ceph osd blacklist ls sometimes lists cluster nodes
> >
>
> sounds like network issue. are there firewall/NAT between nodes?
>
> > The only client that is currently running is ‚master1‘. It also hosts a
> MON and a MGR. Its syslog (https://gitlab.uni-trier.de/snippets/78) shows
> messages like:
> >
> > Feb 13 06:40:33 master1 kernel: [56165.943008] libceph: wrong peer,
> > want 192.168.1.17:6800/-2045158358, got 192.168.1.17:6800/1699349984
> >
> > Feb 13 06:40:33 master1 kernel: [56165.943014] libceph: mds1
> > 192.168.1.17:6800 wrong peer at address
> >
> > The other day I did the update from 12.2.8 to 12.2.11, which can also be
> seen in the logs. Again, there appeared these messages. I assume that’s
> normal operations since ports can change and daemons have to find each
> other again? But what about Feb 13 in the morning? I didn’t do any restarts
> then.
> >
> > Also, clients are printing messages like the following on the console:
> >
> > [1026589.751040] ceph: handle_cap_import: mismatched seq/mseq: ino
> > (1994988.fffe) mds0 seq1 mseq 15 importer mds1 has
> > peer seq 2 mseq 15
> >
> > [1352658.876507] ceph: build_path did not end path lookup where
> > expected, namelen is 23, pos is 0
> >
> > Oh, and btw, the ceph nodes are running on Ubuntu 16.04, clients are on
> 14.04 with kernel 4.4.0-133.
> >
>
> try running ceph-fuse on the node that run mds, check if it works properly.
>
>
> > For reference:
> >
> > > Cluster details: https://gitlab.uni-trier.de/snippets/77
> >
> > > MDS log:
> > > https://gitlab.uni-trier.de/snippets/79?expanded=true=simple)
> >
> >
> > Kind regards
> > Christian Hennen
> >
> > Project Manager Infrastructural Services ZIMK University of Trier
> > Germany
> >
> > From: Ashley Merrick 
> > Sent: Monday, 18 February 2019 16:53
> > To: Hennen, Christian 
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] CephFS: client hangs
> >
> > Correct yes from my expirence OSD’s aswel.
> >
> > On Mon, 18 Feb 2019 at 11:51 PM, Hennen, Christian <
> christian.hen...@uni-trier.de> wrote:
> >
> > Hi!
> >
> > >mon_max_pg_per_osd = 400
> > >
> > >In the ceph.conf and then restart all the services / or inject the
> > >config into the running admin
> >
> > I restarted all MONs, but I assume the OSDs need to be restarted as well?
> >
> > > MDS show a client got evicted. Nothing else looks abnormal.  Do new
> > > cephfs clients also get evicted quickly?
> >
> > Yeah, it seems so. But strangely there is no indication of it in 'ceph
> > -s' or 'ceph health detail'. And they don't seem to be evicted
> > permanently? Right now, only 1 client is connected. The others are shut
> down since last week.
> > 'ceph osd blacklist ls' shows 0 entries.
> >
> >
> > Kind regards
> > Christian Hennen
> >
> > Project Manager Infrastructural Services ZIMK 

Re: [ceph-users] crush map has straw_calc_version=0 and legacy tunables on luminous

2019-02-19 Thread David Turner
[1] Here is a really cool set of slides from Ceph Day Berlin where Dan van
der Ster uses the mgr balancer module with upmap to gradually change the
tunables of a cluster without causing major client impact.  The down side
for you is that upmap requires all luminous or newer clients, but if you
upgrade your kernel clients to 4.13+, then you can enable upmap in the
cluster and utilize the balancer module to upgrade your cluster tunables.
As stated [2] here, those kernel versions still report as Jewel clients,
but only because they are missing some non-essential luminous client
features; they are fully compatible with the upmap feature and the other
required features.

As a side note to the balancer manager in upmap mode, it balances your
cluster in such a way that it attempts to evenly distribute all PGs for a
pool evenly across all OSDs.  So if you have 3 different pools, the PGs for
those pools should each be within 1 or 2 PG totals on every OSD in your
cluster... it's really cool.  The slides discuss how to get your cluster to
that point as well, in case you have modified your weights or reweights at
all.


[1]
https://www.slideshare.net/Inktank_Ceph/ceph-day-berlin-mastering-ceph-operations-upmap-and-the-mgr-balancer
[2]
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-November/031206.html
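Once all of the clients really are luminous+ (see [2] about the kernel
versions), turning this on is roughly:

ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on
ceph balancer status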

On Mon, Feb 4, 2019 at 6:31 PM Shain Miley  wrote:

> For future reference I found these 2 links which answer most of the
> questions:
>
> http://docs.ceph.com/docs/master/rados/operations/crush-map/
>
>
> https://www.openstack.org/assets/presentation-media/Advanced-Tuning-and-Operation-guide-for-Block-Storage-using-Ceph-Boston-2017-final.pdf
>
>
>
> We have about 250TB (x3) in our cluster so I am leaning toward not
> changing things at this point because it sounds like there will be a
> significant amount of data movement involved for not a lot in return.
>
>
>
> If anyone knows of a strong reason I should change the tunables profile
> away from what I have…then please let me know so I don’t end up running the
> cluster in a sub-optimal state for no reason.
>
>
>
> Thanks,
>
> Shain
>
>
>
> --
>
> Shain Miley | Manager of Systems and Infrastructure, Digital Media |
> smi...@npr.org | 202.513.3649
>
>
>
> *From: *ceph-users  on behalf of Shain
> Miley 
> *Date: *Monday, February 4, 2019 at 3:03 PM
> *To: *"ceph-users@lists.ceph.com" 
> *Subject: *[ceph-users] crush map has straw_calc_version=0 and legacy
> tunables on luminous
>
>
>
> Hello,
>
> I just upgraded our cluster to 12.2.11 and I have a few questions around
> straw_calc_version and tunables.
>
> Currently ceph status shows the following:
>
> crush map has straw_calc_version=0
>
> crush map has legacy tunables (require argonaut, min is firefly)
>
>
>
>1. Will setting tunables to optimal also change the straw_calc_version
>or do I need to set that separately?
>
>
>2. Right now I have a set of rbd kernel clients connecting using
>kernel version 4.4.  The ‘ceph daemon mon.id sessions’ command shows
>that this client is still connecting using the hammer feature set (and a
>few others on jewel as well):
>
>"MonSession(client.113933130 10.35.100.121:0/3425045489 is open allow
>*, features 0x7fddff8ee8cbffb (jewel))",  “MonSession(client.112250505
>10.35.100.99:0/4174610322 is open allow *, features 0x106b84a842a42
>(hammer))",
>
>My question is what is the minimum kernel version I would need to
>upgrade the 4.4 kernel server to in order to get to jewel or luminous?
>
>
>
>3. Will setting the tunables to optimal on luminous prevent jewel and
>hammer clients from connecting?  I want to make sure I don’t do anything
>will prevent my existing clients from connecting to the cluster.
>
>
>
>
> Thanks in advance,
>
> Shain
>
>
>
> --
>
> Shain Miley | Manager of Systems and Infrastructure, Digital Media |
> smi...@npr.org | 202.513.3649
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-19 Thread David Turner
With a RACK failure domain, you should be able to have an entire rack
powered down without noticing any major impact on the clients.  I regularly
take down OSDs and nodes for maintenance and upgrades without seeing any
problems with client IO.

On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy 
wrote:

> Hello - I have a couple of questions on ceph cluster stability, even
> we follow all recommendations as below:
> - Having separate replication n/w and data n/w
> - RACK is the failure domain
> - Using SSDs for journals (1:4ratio)
>
> Q1 - If one OSD down, cluster IO down drastically and customer Apps
> impacted.
> Q2 - what is stability ratio, like with above, is ceph cluster
> workable condition, if one osd down or one node down,etc.
>
> Thanks
> Swami
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating a baremetal Ceph cluster into K8s + Rook

2019-02-19 Thread David Turner
Have you ever seen an example of a Ceph cluster being run and managed by
Rook?  It's a really cool idea and takes care of containerizing mons, rgw,
mds, etc that I've been thinking about doing anyway.  Having those
containerized means that you can upgrade all of the mon services before
any of your other daemons are even aware of a new Ceph version, even if
they're running on the same server.  There are some recent upgrade bugs for
small clusters with mons and osds on the same node that would have been
mitigated with containerized Ceph versions.  For putting OSDs in
containers, have you ever needed to run a custom compiled version of Ceph
for a few OSDs to get past a bug that was causing you some troubles?  With
OSDs in containers, you could do that without worrying about that version
of Ceph being used by any other OSDs.

On top of all of that, I keep feeling like a dinosaur for not understanding
Kubernetes better and have been really excited since seeing Rook
orchestrating a Ceph cluster in K8s.  I spun up a few VMs to start testing
configuring a Kubernetes cluster.  The Rook Slack channel recommended using
kubeadm to set up K8s to manage Ceph.

On Mon, Feb 18, 2019 at 11:50 AM Marc Roos  wrote:

>
> Why not just keep it bare metal? Especially with future ceph
> upgrading/testing. I am having centos7 with luminous and am running
> libvirt on the nodes aswell. If you configure them with a tls/ssl
> connection, you can even nicely migrate a vm, from one host/ceph node to
> the other.
> Next thing I am testing with is mesos, to use the ceph nodes to run
> containers. I am still testing this on some vm's, but looks like you
> have to install only a few rpms (maybe around 300MB) and 2 extra
> services on the nodes to get this up and running aswell. (But keep in
> mind that the help on their mailing list is not so good as here ;))
>
>
>
> -Original Message-
> From: David Turner [mailto:drakonst...@gmail.com]
> Sent: 18 February 2019 17:31
> To: ceph-users
> Subject: [ceph-users] Migrating a baremetal Ceph cluster into K8s + Rook
>
> I'm getting some "new" (to me) hardware that I'm going to upgrade my
> home Ceph cluster with.  Currently it's running a Proxmox cluster
> (Debian) which precludes me from upgrading to Mimic.  I am thinking
> about taking the opportunity to convert most of my VMs into containers
> and migrate my cluster into a K8s + Rook configuration now that Ceph is
> [1] stable on Rook.
>
> I haven't ever configured a K8s cluster and am planning to test this out
> on VMs before moving to it with my live data.  Has anyone done a
> migration from a baremetal Ceph cluster into K8s + Rook?  Additionally
> what is a good way for a K8s beginner to get into managing a K8s
> cluster.  I see various places recommend either CoreOS or kubeadm for
> starting up a new K8s cluster but I don't know the pros/cons for either.
>
> As far as migrating the Ceph services into Rook, I would assume that the
> process would be pretty simple to add/create new mons, mds, etc into
> Rook with the baremetal cluster details.  Once those are active and
> working just start decommissioning the services on baremetal.  For me,
> the OSD migration should be similar since I don't have any multi-device
> OSDs so I only need to worry about migrating individual disks between
> nodes.
>
>
> [1]
> https://blog.rook.io/rook-v0-9-new-storage-backends-in-town-ab952523ec53
>
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-19 Thread David Turner
I don't know that there's anything that can be done to resolve this yet
without rebuilding the OSD.  Based on a Nautilus tool being able to resize
the DB device, I'm assuming that Nautilus is also capable of migrating the
DB/WAL between devices.  That functionality would allow anyone to migrate
their DB back off of their spinner which is what's happening to you.  I
don't believe that sort of tooling exists yet, though, without compiling
the Nautilus Beta tooling for yourself.
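In the meantime you can at least watch the split between fast and slow DB
space per OSD (osd id is an example; jq is just for readability):

ceph daemon osd.73 perf dump | jq .bluefs
# compare db_used_bytes / db_total_bytes against slow_used_bytes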

On Tue, Feb 19, 2019 at 12:03 AM Konstantin Shalygin  wrote:

> On 2/18/19 9:43 PM, David Turner wrote:
> > Do you have historical data from these OSDs to see when/if the DB used
> > on osd.73 ever filled up?  To account for this OSD using the slow
> > storage for DB, all we need to do is show that it filled up the fast
> > DB at least once.  If that happened, then something spilled over to
> > the slow storage and has been there ever since.
>
> Yes, I have. Also I checked my JIRA records what I was do at this times
> and marked this on timeline: [1]
>
> Another graph compared osd.(33|73) for a last year: [2]
>
>
> [1] https://ibb.co/F7smCxW
>
> [1] https://ibb.co/dKWWDzW
>
> k
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade Luminous to mimic on Ubuntu 18.04

2019-02-18 Thread David Turner
Everybody is just confused that you don't have a newer version of Ceph
available. Are you running `apt-get dist-upgrade` to upgrade ceph? Do you
have any packages being held back? There is no reason that Ubuntu 18.04
shouldn't be able to upgrade to 12.2.11.
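A few things worth checking on one of the 18.04 nodes (package names are the
usual ones):

grep -r ceph /etc/apt/sources.list /etc/apt/sources.list.d/
apt-cache policy ceph-osd ceph-mon
apt-mark showhold
sudo apt-get update && sudo apt-get dist-upgrade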

On Mon, Feb 18, 2019, 4:38 PM  wrote:

> Hello people,
>
> On 11 February 2019 12:47:36 CET, c...@elchaka.de wrote:
> >Hello Ashley,
> >
>On 9 February 2019 17:30:31 CET, Ashley Merrick
>wrote:
> >>What does the output of apt-get update look like on one of the nodes?
> >>
> >>You can just list the lines that mention CEPH
> >>
> >
> >... .. .
> >Get:6 Https://Download.ceph.com/debian-luminous bionic InRelease [8393
> >B]
> >... .. .
> >
> >The Last available is 12.2.8.
>
> Any advice or recommends on how to proceed to be able to Update to
> mimic/(nautilus)?
>
> - Mehmet
> >
> >- Mehmet
> >
> >>Thanks
> >>
> >>On Sun, 10 Feb 2019 at 12:28 AM,  wrote:
> >>
> >>> Hello Ashley,
> >>>
> >>> Thank you for this fast response.
> >>>
> >>> I cannt prove this jet but i am using already cephs own repo for
> >>Ubuntu
> >>> 18.04 and this 12.2.7/8 is the latest available there...
> >>>
> >>> - Mehmet
> >>>
> >>> On 9 February 2019 17:21:32 CET, Ashley Merrick <
> >>> singap...@amerrick.co.uk> wrote:
> >>> >Around available versions, are you using the Ubuntu repo’s or the
> >>CEPH
> >>> >18.04 repo.
> >>> >
> >>> >The updates will always be slower to reach you if your waiting for
> >>it
> >>> >to
> >>> >hit the Ubuntu repo vs adding CEPH’s own.
> >>> >
> >>> >
> >>> >On Sun, 10 Feb 2019 at 12:19 AM,  wrote:
> >>> >
> >>> >> Hello m8s,
> >>> >>
> >>> >> Im curious how we should do an Upgrade of our ceph Cluster on
> >>Ubuntu
> >>> >> 16/18.04. As (At least on our 18.04 nodes) we only have 12.2.7
> >(or
> >>> >.8?)
> >>> >>
> >>> >> For an Upgrade to mimic we should First Update to Last version,
> >>> >actualy
> >>> >> 12.2.11 (iirc).
> >>> >> Which is not possible on 18.04.
> >>> >>
> >>> >> Is there a Update path from 12.2.7/8 to actual mimic release or
> >>> >better the
> >>> >> upcoming nautilus?
> >>> >>
> >>> >> Any advice?
> >>> >>
> >>> >> - Mehmet___
> >>> >> ceph-users mailing list
> >>> >> ceph-users@lists.ceph.com
> >>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>> >>
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >___
> >ceph-users mailing list
> >ceph-users@lists.ceph.com
> >http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] IRC channels now require registered and identified users

2019-02-18 Thread David Turner
Is this still broken in the 1-way direction where Slack users' comments do
not show up in IRC?  That would explain why nothing I ever type (either
helping someone or asking a question) ever gets a response.

On Tue, Dec 18, 2018 at 6:50 AM Joao Eduardo Luis  wrote:

> On 12/18/2018 11:22 AM, Joao Eduardo Luis wrote:
> > On 12/18/2018 11:18 AM, Dan van der Ster wrote:
> >> Hi Joao,
> >>
> >> Has that broken the Slack connection? I can't tell if its broken or
> >> just quiet... last message on #ceph-devel was today at 1:13am.
> >
> > Just quiet, it seems. Just tested it and the bridge is still working.
>
> Okay, turns out the ceph-ircslackbot user is not identified, and that
> makes it unable to send messages to the channel. This means the bridge
> is working in one direction only (irc to slack), and will likely break
> when/if the user leaves the channel (as it won't be able to get back in).
>
> I will figure out just how this works today. In the mean time, I've
> relaxed the requirement for registered/identified users so that the bot
> works again. It will be reactivated once this is addressed.
>
>   -Joao
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Migrating a baremetal Ceph cluster into K8s + Rook

2019-02-18 Thread David Turner
I'm getting some "new" (to me) hardware that I'm going to upgrade my home
Ceph cluster with.  Currently it's running a Proxmox cluster (Debian) which
precludes me from upgrading to Mimic.  I am thinking about taking the
opportunity to convert most of my VMs into containers and migrate my
cluster into a K8s + Rook configuration now that Ceph is [1] stable on Rook.

I haven't ever configured a K8s cluster and am planning to test this out on
VMs before moving to it with my live data.  Has anyone done a migration
from a baremetal Ceph cluster into K8s + Rook?  Additionally what is a good
way for a K8s beginner to get into managing a K8s cluster.  I see various
places recommend either CoreOS or kubeadm for starting up a new K8s cluster
but I don't know the pros/cons for either.

As far as migrating the Ceph services into Rook, I would assume that the
process would be pretty simple to add/create new mons, mds, etc into Rook
with the baremetal cluster details.  Once those are active and working just
start decommissioning the services on baremetal.  For me, the OSD migration
should be similar since I don't have any multi-device OSDs so I only need
to worry about migrating individual disks between nodes.


[1] https://blog.rook.io/rook-v0-9-new-storage-backends-in-town-ab952523ec53
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Intel P4600 3.2TB U.2 form factor NVMe firmware problems causing dead disks

2019-02-18 Thread David Turner
We have 2 clusters of [1] these disks that have 2 Bluestore OSDs per disk
(partitioned), 3 disks per node, 5 nodes per cluster.  The clusters are
12.2.4 running CephFS and RBDs.  So in total we have 15 NVMe's per cluster
and 30 NVMe's in total.  They were all built at the same time and were
running firmware version QDV10130.  On this firmware version we early on
had 2 disk failures, a few months later we had 1 more, and then a month
after that (just a few weeks ago) we had 7 disk failures in 1 week.

The failures are such that the disk is no longer visible to the OS.  This
holds true beyond server reboots as well as placing the failed disks into a
new server.  With a firmware upgrade tool we got an error that pretty much
said there's no way to get data back and to RMA the disk.  We upgraded all
of our remaining disks' firmware to QDV101D1 and haven't had any problems
since then.  Most of our failures happened while rebalancing the cluster
after replacing dead disks and we tested rigorously around that use case
after upgrading the firmware.  This firmware version seems to have resolved
whatever the problem was.
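
In case it helps anyone reading along, checking the running firmware on these
drives is roughly this (a sketch; assumes the nvme-cli and smartmontools
packages are installed, and the device name is just an example):

# list NVMe devices; the "FW Rev" column shows the running firmware
nvme list
# or per device
smartctl -a /dev/nvme0 | grep -i firmware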

We have about 100 more of these scattered among database servers and other
servers that have never had this problem while running the
QDV10130 firmware as well as firmwares between this one and the one we
upgraded to.  Bluestore on Ceph is the only use case we've had so far with
this sort of failure.

Has anyone else come across this issue before?  Our current theory is that
Bluestore is accessing the disk in a way that is triggering a bug in the
older firmware version that isn't triggered by more traditional
filesystems.  We have a scheduled call with Intel to discuss this, but
their preliminary searches into the bugfixes and known problems between
firmware versions didn't indicate the bug that we triggered.  It would be
good to have some more information about what those differences for disk
accessing might be to hopefully get a better answer from them as to what
the problem is.


[1]
https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-3-2tb-2-5inch-3d1.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Placing replaced disks to correct buckets.

2019-02-18 Thread David Turner
Also what commands did you run to remove the failed HDDs and the commands
you have so far run to add their replacements back in?
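
For reference, the usual form of those commands is something like this (the
OSD id, weight and host below are made up, not taken from your cluster):

# put the replacement OSD back under its host bucket in the CRUSH map
ceph osd crush add osd.12 7.27699 host=node3
# or, if the item already exists, adjust its position/weight
ceph osd crush set osd.12 7.27699 host=node3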

On Sat, Feb 16, 2019 at 9:55 PM Konstantin Shalygin  wrote:

> I recently replaced failed HDDs and removed them from their respective
> buckets as per procedure.
>
> But I’m now facing an issue when trying to place new ones back into the
> buckets. I’m getting an error of ‘osd nr not found’ OR ‘file or
> directory not found’ OR a command syntax error.
>
> I have been using the commands below:
>
> ceph osd crush set   
> ceph osd crush  set   
>
> I do however find the OSD number when i run command:
>
> ceph osd find 
>
> Your assistance/response to this will be highly appreciated.
>
> Regards
> John.
>
>
> Please, paste your `ceph osd tree`, your version and what exactly error
> you get include osd number.
>
> Less obfuscation is better in this, perhaps, simple case.
>
>
> k
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-18 Thread David Turner
Do you have historical data from these OSDs to see when/if the DB used on
osd.73 ever filled up?  To account for this OSD using the slow storage for
DB, all we need to do is show that it filled up the fast DB at least once.
If that happened, then something spilled over to the slow storage and has
been there ever since.
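
If you don't have historical metrics, the current BlueFS counters at least
show whether anything is sitting on the slow device right now; a quick sketch
(the OSD id is just an example, run it on the host carrying that OSD):

# non-zero slow_used_bytes means DB data has spilled onto the slow device
ceph daemon osd.73 perf dump bluefs | grep -E 'db_total_bytes|db_used_bytes|slow_used_bytes'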

On Sat, Feb 16, 2019 at 1:50 AM Konstantin Shalygin  wrote:

> On 2/16/19 12:33 AM, David Turner wrote:
> > The answer is probably going to be in how big your DB partition is vs
> > how big your HDD disk is.  From your output it looks like you have a
> > 6TB HDD with a 28GB Blocks.DB partition.  Even though the DB used size
> > isn't currently full, I would guess that at some point since this OSD
> > was created that it did fill up and what you're seeing is the part of
> > the DB that spilled over to the data disk. This is why the official
> > recommendation (that is quite cautious, but cautious because some use
> > cases will use this up) for a blocks.db partition is 4% of the data
> > drive.  For your 6TB disks that's a recommendation of 240GB per DB
> > partition.  Of course the actual size of the DB needed is dependent on
> > your use case.  But pretty much every use case for a 6TB disk needs a
> > bigger partition than 28GB.
>
>
> My current db size of osd.33 is 7910457344 bytes, and osd.73 is
> 2013265920+4685037568 bytes. 7544Mbyte (24.56% of db_total_bytes) vs
> 6388Mbyte (6.69% of db_total_bytes).
>
> Why is osd.33 not using the slow storage in this case?
>
>
>
> k
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-15 Thread David Turner
The answer is probably going to be in how big your DB partition is vs how
big your HDD disk is.  From your output it looks like you have a 6TB HDD
with a 28GB Blocks.DB partition.  Even though the DB used size isn't
currently full, I would guess that at some point since this OSD was created
that it did fill up and what you're seeing is the part of the DB that
spilled over to the data disk.  This is why the official recommendation
(that is quite cautious, but cautious because some use cases will use this
up) for a blocks.db partition is 4% of the data drive.  For your 6TB disks
that's a recommendation of 240GB per DB partition.  Of course the actual
size of the DB needed is dependent on your use case.  But pretty much every
use case for a 6TB disk needs a bigger partition than 28GB.

On Thu, Feb 14, 2019 at 11:58 PM Konstantin Shalygin  wrote:

> Wrong metadata paste of osd.73 in previous message.
>
>
> {
>
>  "id": 73,
>  "arch": "x86_64",
>  "back_addr": "10.10.10.6:6804/175338",
>  "back_iface": "vlan3",
>  "bluefs": "1",
>  "bluefs_db_access_mode": "blk",
>  "bluefs_db_block_size": "4096",
>  "bluefs_db_dev": "259:22",
>  "bluefs_db_dev_node": "nvme2n1",
>  "bluefs_db_driver": "KernelDevice",
>  "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
>  "bluefs_db_partition_path": "/dev/nvme2n1p11",
>  "bluefs_db_rotational": "0",
>  "bluefs_db_serial": "CVFT4324002Q400BGN  ",
>  "bluefs_db_size": "30064771072",
>  "bluefs_db_type": "nvme",
>  "bluefs_single_shared_device": "0",
>  "bluefs_slow_access_mode": "blk",
>  "bluefs_slow_block_size": "4096",
>  "bluefs_slow_dev": "8:176",
>  "bluefs_slow_dev_node": "sdl",
>  "bluefs_slow_driver": "KernelDevice",
>  "bluefs_slow_model": "TOSHIBA HDWE160 ",
>  "bluefs_slow_partition_path": "/dev/sdl2",
>  "bluefs_slow_rotational": "1",
>  "bluefs_slow_size": "6001069199360",
>  "bluefs_slow_type": "hdd",
>  "bluefs_wal_access_mode": "blk",
>  "bluefs_wal_block_size": "4096",
>  "bluefs_wal_dev": "259:22",
>  "bluefs_wal_dev_node": "nvme2n1",
>  "bluefs_wal_driver": "KernelDevice",
>  "bluefs_wal_model": "INTEL SSDPEDMD400G4 ",
>  "bluefs_wal_partition_path": "/dev/nvme2n1p12",
>  "bluefs_wal_rotational": "0",
>  "bluefs_wal_serial": "CVFT4324002Q400BGN  ",
>  "bluefs_wal_size": "1073741824",
>  "bluefs_wal_type": "nvme",
>  "bluestore_bdev_access_mode": "blk",
>  "bluestore_bdev_block_size": "4096",
>  "bluestore_bdev_dev": "8:176",
>  "bluestore_bdev_dev_node": "sdl",
>  "bluestore_bdev_driver": "KernelDevice",
>  "bluestore_bdev_model": "TOSHIBA HDWE160 ",
>  "bluestore_bdev_partition_path": "/dev/sdl2",
>  "bluestore_bdev_rotational": "1",
>  "bluestore_bdev_size": "6001069199360",
>  "bluestore_bdev_type": "hdd",
>  "ceph_version": "ceph version 12.2.10
> (177915764b752804194937482a39e95e0ca3de94) luminous (stable)",
>  "cpu": "Intel(R) Xeon(R) CPU E5-2609 v4 @ 1.70GHz",
>  "default_device_class": "hdd",
>  "distro": "centos",
>  "distro_description": "CentOS Linux 7 (Core)",
>  "distro_version": "7",
>  "front_addr": "172.16.16.16:6803/175338",
>  "front_iface": "vlan4",
>  "hb_back_addr": "10.10.10.6:6805/175338",
>  "hb_front_addr": "172.16.16.16:6805/175338",
>  "hostname": "ceph-osd5",
>  "journal_rotational": "0",
>  "kernel_description": "#1 SMP Tue Aug 14 21:49:04 UTC 2018",
>  "kernel_version": "3.10.0-862.11.6.el7.x86_64",
>  "mem_swap_kb": "0",
>  "mem_total_kb": "65724256",
>  "os": "Linux",
>  "osd_data": "/var/lib/ceph/osd/ceph-73",
>  "osd_objectstore": "bluestore",
>  "rotational": "1"
> }
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jewel10.2.11 EC pool out a osd, its PGs remap to the osds in the same host

2019-02-15 Thread David Turner
I'm leaving the response on the CRUSH rule for Gregory, but you have
another problem you're running into that is causing more of this data to
stay on this node than you intend.  While you `out` the OSD it is still
contributing to the Host's weight.  So the host is still set to receive
that amount of data and distribute it among the disks inside of it.  This
is the default behavior (even if you `destroy` the OSD) to minimize the
data movement for losing the disk and again for adding it back into the
cluster after you replace the device.  If you are really strapped for
space, though, then you might consider fully purging the OSD which will
reduce the Host weight to what the other OSDs are.  However if you do have
a problem in your CRUSH rule, then doing this won't change anything for you.
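
A rough sketch of the two options (the OSD id is an example; only do the
purge if you're sure you want it gone from the CRUSH map):

# drop the failed OSD's contribution to the host weight while keeping it in the map
ceph osd crush reweight osd.2 0
# or remove it from the cluster entirely (Luminous and later)
ceph osd purge 2 --yes-i-really-mean-it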

On Thu, Feb 14, 2019 at 11:15 PM hnuzhoulin2  wrote:

> Thanks. I read your reply in
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg48717.html
> so using indep will cause less data remapping when an OSD fails.
> using firstn: 1, 2, 3, 4, 5 -> 1, 2, 4, 5, 6 , 60% data remap
> using indep :1, 2, 3, 4, 5 -> 1, 2, 6, 4, 5, 25% data remap
>
> Am I right?
> If so, what is recommended when a disk fails and the total available
> size of the remaining disks in the machine is not enough (the failed disk
> cannot be replaced immediately)? Or should I reserve more spare capacity in an EC setup?
>
> On 02/14/2019 02:49, Gregory Farnum wrote:
>
> Your CRUSH rule for EC spools is forcing that behavior with the line
>
> step chooseleaf indep 1 type ctnr
>
> If you want different behavior, you’ll need a different crush rule.
>
> On Tue, Feb 12, 2019 at 5:18 PM hnuzhoulin2  wrote:
>
>> Hi, cephers
>>
>>
>> I am building a Ceph EC cluster. When a disk fails, I mark it out, but all of its
>> PGs remap to the OSDs in the same host, whereas I think they should remap to
>> other hosts in the same rack.
>> test process is:
>>
>> ceph osd pool create .rgw.buckets.data 8192 8192 erasure ISA-4-2
>> site1_sata_erasure_ruleset 4
>> ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/1
>> /etc/init.d/ceph stop osd.2
>> ceph osd out 2
>> ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/2
>> diff /tmp/1 /tmp/2 -y --suppress-common-lines
>>
>> 0 1.0 1.0 118 osd.0   | 0 1.0 1.0 126 osd.0
>> 1 1.0 1.0 123 osd.1   | 1 1.0 1.0 139 osd.1
>> 2 1.0 1.0 122 osd.2   | 2 1.0 0 0 osd.2
>> 3 1.0 1.0 113 osd.3   | 3 1.0 1.0 131 osd.3
>> 4 1.0 1.0 122 osd.4   | 4 1.0 1.0 136 osd.4
>> 5 1.0 1.0 112 osd.5   | 5 1.0 1.0 127 osd.5
>> 6 1.0 1.0 114 osd.6   | 6 1.0 1.0 128 osd.6
>> 7 1.0 1.0 124 osd.7   | 7 1.0 1.0 136 osd.7
>> 8 1.0 1.0 95 osd.8   | 8 1.0 1.0 113 osd.8
>> 9 1.0 1.0 112 osd.9   | 9 1.0 1.0 119 osd.9
>> TOTAL 3073T 197G | TOTAL 3065T 197G
>> MIN/MAX VAR: 0.84/26.56 | MIN/MAX VAR: 0.84/26.52
>>
>>
>> some config info: (detail configs see:
>> https://gist.github.com/hnuzhoulin/575883dbbcb04dff448eea3b9384c125)
>> jewel 10.2.11  filestore+rocksdb
>>
>> ceph osd erasure-code-profile get ISA-4-2
>> k=4
>> m=2
>> plugin=isa
>> ruleset-failure-domain=ctnr
>> ruleset-root=site1-sata
>> technique=reed_sol_van
>>
>> part of ceph.conf is:
>>
>> [global]
>> fsid = 1CAB340D-E551-474F-B21A-399AC0F10900
>> auth cluster required = cephx
>> auth service required = cephx
>> auth client required = cephx
>> pid file = /home/ceph/var/run/$name.pid
>> log file = /home/ceph/log/$cluster-$name.log
>> mon osd nearfull ratio = 0.85
>> mon osd full ratio = 0.95
>> admin socket = /home/ceph/var/run/$cluster-$name.asok
>> osd pool default size = 3
>> osd pool default min size = 1
>> osd objectstore = filestore
>> filestore merge threshold = -10
>>
>> [mon]
>> keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring
>> mon data = /home/ceph/var/lib/$type/$cluster-$id
>> mon cluster log file = /home/ceph/log/$cluster.log
>> [osd]
>> keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring
>> osd data = /home/ceph/var/lib/$type/$cluster-$id
>> osd journal = /home/ceph/var/lib/$type/$cluster-$id/journal
>> osd journal size = 1
>> osd mkfs type = xfs
>> osd mount options xfs = rw,noatime,nodiratime,inode64,logbsize=256k
>> osd backfill full ratio = 0.92
>> osd failsafe full ratio = 0.95
>> osd failsafe nearfull ratio = 0.85
>> osd max backfills = 1
>> osd crush update on start = false
>> osd op thread timeout = 60
>> filestore split multiple = 8
>> filestore max sync interval = 15
>> filestore min sync interval = 5
>> [osd.0]
>> host = cld-osd1-56
>> addr = X
>> user = ceph
>> devs = /disk/link/osd-0/data
>> osd journal = /disk/link/osd-0/journal
>> …….
>> [osd.503]
>> host = cld-osd42-56
>> addr = 10.108.87.52
>> user = ceph
>> devs = /disk/link/osd-503/data
>> osd journal = /disk/link/osd-503/journal
>>
>>
>> crushmap is below:
>>
>> # begin crush map
>> 

Re: [ceph-users] Problems with osd creation in Ubuntu 18.04, ceph 13.2.4-1bionic

2019-02-15 Thread David Turner
I have found that running a zap before all prepare/create commands with
ceph-volume helps things run smoother.  Zap is specifically there to clear
everything on a disk away to make the disk ready to be used as an OSD.
Your wipefs command is still fine, but then I would lvm zap the disk before
continuing.  I would run the commands like [1] this.  I also prefer the
single command lvm create as opposed to lvm prepare and lvm activate.  Try
that out and see if you still run into the problems creating the BlueStore
filesystem.

[1] ceph-volume lvm zap /dev/sdg
ceph-volume lvm prepare --bluestore --data /dev/sdg

On Thu, Feb 14, 2019 at 10:25 AM Rainer Krienke 
wrote:

> Hi,
>
> I am quite new to ceph and just try to set up a ceph cluster. Initially
> I used ceph-deploy for this but when I tried to create a BlueStore osd
> ceph-deploy fails. Next I tried the direct way on one of the OSD-nodes
> using ceph-volume to create the osd, but this also fails. Below you can
> see what  ceph-volume says.
>
> I ensured that there was no left over lvm VG and LV on the disk sdg
> before I started the osd creation for this disk. The very same error
> happens also on other disks not just for /dev/sdg. All the disk have 4TB
> in size and the linux system is Ubuntu 18.04 and finally ceph is
> installed in version 13.2.4-1bionic from this repo:
> https://download.ceph.com/debian-mimic.
>
> There is a VG and two LV's  on the system for the ubuntu system itself
> that is installed on two separate disks configured as software raid1 and
> lvm on top of the raid. But I cannot imagine that this might do any harm
> to cephs osd creation.
>
> Does anyone have an idea what might be wrong?
>
> Thanks for hints
> Rainer
>
> root@ceph1:~# wipefs -fa /dev/sdg
> root@ceph1:~# ceph-volume lvm prepare --bluestore --data /dev/sdg
> Running command: /usr/bin/ceph-authtool --gen-print-key
> Running command: /usr/bin/ceph --cluster ceph --name
> client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
> -i - osd new 14d041d6-0beb-4056-8df2-3920e2febce0
> Running command: /sbin/vgcreate --force --yes
> ceph-1433ffd0-0a80-481a-91f5-d7a47b78e17b /dev/sdg
>  stdout: Physical volume "/dev/sdg" successfully created.
>  stdout: Volume group "ceph-1433ffd0-0a80-481a-91f5-d7a47b78e17b"
> successfully created
> Running command: /sbin/lvcreate --yes -l 100%FREE -n
> osd-block-14d041d6-0beb-4056-8df2-3920e2febce0
> ceph-1433ffd0-0a80-481a-91f5-d7a47b78e17b
>  stdout: Logical volume "osd-block-14d041d6-0beb-4056-8df2-3920e2febce0"
> created.
> Running command: /usr/bin/ceph-authtool --gen-print-key
> Running command: /bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-0
> --> Absolute path not found for executable: restorecon
> --> Ensure $PATH environment variable contains common executable locations
> Running command: /bin/chown -h ceph:ceph
>
> /dev/ceph-1433ffd0-0a80-481a-91f5-d7a47b78e17b/osd-block-14d041d6-0beb-4056-8df2-3920e2febce0
> Running command: /bin/chown -R ceph:ceph /dev/dm-8
> Running command: /bin/ln -s
>
> /dev/ceph-1433ffd0-0a80-481a-91f5-d7a47b78e17b/osd-block-14d041d6-0beb-4056-8df2-3920e2febce0
> /var/lib/ceph/osd/ceph-0/block
> Running command: /usr/bin/ceph --cluster ceph --name
> client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
> mon getmap -o /var/lib/ceph/osd/ceph-0/activate.monmap
>  stderr: got monmap epoch 1
> Running command: /usr/bin/ceph-authtool /var/lib/ceph/osd/ceph-0/keyring
> --create-keyring --name osd.0 --add-key
> AQAAY2VcU968HxAAvYWMaJZmriUc4H9bCCp8XQ==
>  stdout: creating /var/lib/ceph/osd/ceph-0/keyring
> added entity osd.0 auth auth(auid = 18446744073709551615
> key=AQAAY2VcU968HxAAvYWMaJZmriUc4H9bCCp8XQ== with 0 caps)
> Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/keyring
> Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/
> Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore
> bluestore --mkfs -i 0 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap
> --keyfile - --osd-data /var/lib/ceph/osd/ceph-0/ --osd-uuid
> 14d041d6-0beb-4056-8df2-3920e2febce0 --setuser ceph --setgroup ceph
>  stderr: 2019-02-14 13:45:54.788 7f3fcecb3240 -1
> bluestore(/var/lib/ceph/osd/ceph-0/) _read_fsid unparsable uuid
>  stderr: /build/ceph-13.2.4/src/os/bluestore/KernelDevice.cc: In
> function 'virtual int KernelDevice::read(uint64_t, uint64_t,
> ceph::bufferlist*, IOContext*, bool)' thread 7f3fcecb3240 time
> 2019-02-14 13:45:54.841130
>  stderr: /build/ceph-13.2.4/src/os/bluestore/KernelDevice.cc: 821:
> FAILED assert((uint64_t)r == len)
>  stderr: ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e)
> mimic (stable)
>  stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int,
> char const*)+0x102) [0x7f3fc60d33e2]
>  stderr: 2: (()+0x26d5a7) [0x7f3fc60d35a7]
>  stderr: 3: (KernelDevice::read(unsigned long, unsigned long,
> ceph::buffer::list*, IOContext*, bool)+0x4a7) [0x561371346817]
>  stderr: 4: 

Re: [ceph-users] [Ceph-community] Deploy and destroy monitors

2019-02-13 Thread David Turner
Ceph-users is the proper ML to post questions like this.

On Thu, Dec 20, 2018 at 2:30 PM Joao Eduardo Luis  wrote:

> On 12/20/2018 04:55 PM, João Aguiar wrote:
> > I am having an issue with "ceph-ceploy mon”
> >
> > I started by creating a cluster with one monitor with "create-deploy
> new"… "create-initial”...
> > And ended up with ceph,conf like:
> > ...
> > mon_initial_members = node0
> > mon_host = 10.2.2.2
> > ….
> >
> > Later I try to deploy a new monitor (ceph-deploy mon create node1),
> wait for it to get in quorum and then destroy the node0 (ceph-deploy mon
> destroy node0).
>
> Is the new monitor forming a quorum with the existing monitor? If not,
> then you won't have monitors running when you remove node0.
>
> Does ceph-deploy remove the mon being destroyed from the monmap? If not,
> you'll have two monitors in the monmap, and you'll need a majority to
> form quorum; for a 2 monitor deployment that means you'll need 2
> monitors up and running.
>
> > Result: Ceph gets unresponsive.
>
> This is the typical symptom of absence of a quorum.
>
>   -Joao
> ___
> Ceph-community mailing list
> ceph-commun...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-community] Ceph SSE-KMS integration to use Safenet as Key Manager service

2019-02-13 Thread David Turner
Ceph-users is the correct ML to post questions like this.

On Wed, Jan 2, 2019 at 5:40 PM Rishabh S  wrote:

> Dear Members,
>
> Please let me know if you have any link with examples/detailed steps of
> Ceph-Safenet(KMS) integration.
>
> Thanks & Regards,
> Rishabh
>
> ___
> Ceph-community mailing list
> ceph-commun...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-community] Error during playbook deployment: TASK [ceph-mon : test if rbd exists]

2019-02-13 Thread David Turner
Ceph-users ML is the proper mailing list for questions like this.

On Sat, Jan 26, 2019 at 12:31 PM Meysam Kamali  wrote:

> Hi Ceph Community,
>
> I am using ansible 2.2 and ceph branch stable-2.2, on centos7, to deploy
> the playbook. But the deployment get hangs in this step "TASK [ceph-mon :
> test if rbd exists]". it gets hangs there and doesnot move.
> I have all the three ceph nodes ceph-admin, ceph-mon, ceph-osd
> I appreciate any help! Here I am providing log:
>
> ---Log --
> TASK [ceph-mon : test if rbd exists]
> ***
> task path: /root/ceph-ansible/roles/ceph-mon/tasks/ceph_keys.yml:60
> Using module file
> /usr/lib/python2.7/site-packages/ansible/modules/core/commands/command.py
>  ESTABLISH SSH CONNECTION FOR USER: None
>  SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o
> KbdInteractiveAuthentication=no -o
> PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey
> -o PasswordAuthentication=no -o ConnectTimeout=10 -o
> ControlPath=/root/.ansible/cp/%h-%r ceph2mon '/bin/sh -c '"'"'echo ~ &&
> sleep 0'"'"''
>  ESTABLISH SSH CONNECTION FOR USER: None
>  SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o
> KbdInteractiveAuthentication=no -o
> PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey
> -o PasswordAuthentication=no -o ConnectTimeout=10 -o
> ControlPath=/root/.ansible/cp/%h-%r ceph2mon '/bin/sh -c '"'"'( umask 77 &&
> mkdir -p "` echo
> /root/.ansible/tmp/ansible-tmp-1547740115.56-213823795856896 `" && echo
> ansible-tmp-1547740115.56-213823795856896="` echo
> /root/.ansible/tmp/ansible-tmp-1547740115.56-213823795856896 `" ) && sleep
> 0'"'"''
>  PUT /tmp/tmpG7u1eN TO
> /root/.ansible/tmp/ansible-tmp-1547740115.56-213823795856896/command.py
>  SSH: EXEC sftp -b - -C -o ControlMaster=auto -o
> ControlPersist=60s -o KbdInteractiveAuthentication=no -o
> PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey
> -o PasswordAuthentication=no -o ConnectTimeout=10 -o
> ControlPath=/root/.ansible/cp/%h-%r '[ceph2mon]'
>  ESTABLISH SSH CONNECTION FOR USER: None
>  SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o
> KbdInteractiveAuthentication=no -o
> PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey
> -o PasswordAuthentication=no -o ConnectTimeout=10 -o
> ControlPath=/root/.ansible/cp/%h-%r ceph2mon '/bin/sh -c '"'"'chmod u+x
> /root/.ansible/tmp/ansible-tmp-1547740115.56-213823795856896/
> /root/.ansible/tmp/ansible-tmp-1547740115.56-213823795856896/command.py &&
> sleep 0'"'"''
>  ESTABLISH SSH CONNECTION FOR USER: None
>  SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o
> KbdInteractiveAuthentication=no -o
> PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey
> -o PasswordAuthentication=no -o ConnectTimeout=10 -o
> ControlPath=/root/.ansible/cp/%h-%r -tt ceph2mon '/bin/sh -c '"'"'sudo -H
> -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo
> BECOME-SUCCESS-iefqzergptqzfhqmxouabfjfvdvbadku; /usr/bin/python
> /root/.ansible/tmp/ansible-tmp-1547740115.56-213823795856896/command.py; rm
> -rf "/root/.ansible/tmp/ansible-tmp-1547740115.56-213823795856896/" >
> /dev/null 2>&1'"'"'"'"'"'"'"'"' && sleep 0'"'"''
>
> -
>
>
> Thanks,
> Meysam Kamali
> ___
> Ceph-community mailing list
> ceph-commun...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-community] Need help related to ceph client authentication

2019-02-13 Thread David Turner
The Ceph-users ML is the correct list to ask questions like this.  Did you
figure out the problems/questions you had?

On Tue, Dec 4, 2018 at 11:39 PM Rishabh S  wrote:

> Hi Gaurav,
>
> Thank You.
>
> Yes, I am using boto, though I was looking for suggestions on how my ceph
> client should get access and secret keys.
>
> Another thing where I need help is regarding encryption
> http://docs.ceph.com/docs/mimic/radosgw/encryption/#
>
> I am little confused what does these statement means.
>
> The Ceph Object Gateway supports server-side encryption of uploaded
> objects, with 3 options for the management of encryption keys. Server-side
> encryption means that the data is sent over HTTP in its unencrypted form,
> and the Ceph Object Gateway stores that data in the Ceph Storage Cluster in
> encrypted form.
>
> Note
>
>
> Requests for server-side encryption must be sent over a secure HTTPS
> connection to avoid sending secrets in plaintext.
>
> CUSTOMER-PROVIDED KEYS
> 
>
> In this mode, the client passes an encryption key along with each request
> to read or write encrypted data. It is the client’s responsibility to
> manage those keys and remember which key was used to encrypt each object.
>
> My understanding is when ceph client is trying to upload a file/object to
> Ceph cluster then client request should be https and will include
>  “customer-provided-key”.
> Then Ceph will use customer-provided-key to encrypt file/object before
> storing data into Ceph cluster.
>
> Please correct and suggest best approach to store files/object in Ceph
> cluster.
>
> Any code example of initial handshake to upload a file/object with
> encryption-key will be of great help.
>
> Regards,
> Rishabh
>
>
> On 05-Dec-2018, at 2:15 AM, Gaurav Sitlani 
> wrote:
>
> Hi Rishabh,
> You can refer the ceph RGW doc and search for boto :
> http://docs.ceph.com/docs/master/install/install-ceph-gateway/?highlight=boto
> You can get a basic python boto script where you can mention your access
> and secret key and connect to your S3 cluster.
> I hope you know how to get your keys right.
>
> Regards,
> Gaurav Sitlani
>
>
> ___
> Ceph-community mailing list
> ceph-commun...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] all vms can not start up when boot all the ceph hosts.

2019-02-13 Thread David Turner
This might not be a Ceph issue at all depending on if you're using any sort
of caching.  If you have caching on your disk controllers at all, then the
write might have happened to the cache but never made it to the OSD disks
which would show up as problems on the VM RBDs.  Make sure you have proper
BBU's on your disk controllers and/or disable caching that might be enabled
on your controllers or disks that could be benefiting you with write speed
while the cluster is healthy, but potentially causing you to run into this
state during a catastrophe.
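
As a starting point, something like this shows whether the on-disk volatile
write cache is enabled (a sketch only; the device name is an example, and
controller/BBU settings depend on your vendor tooling):

# query the drive's write-cache flag
hdparm -W /dev/sda
# disable it if there is no battery/flash-backed cache protecting it
hdparm -W 0 /dev/sda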

On Tue, Dec 4, 2018 at 10:49 PM linghucongsong 
wrote:

>
> Thanks to all! I might have found the reason.
>
> It looks like it is related to the bug below.
>
> https://bugs.launchpad.net/nova/+bug/1773449
>
>
>
>
> At 2018-12-04 23:42:15, "Ouyang Xu"  wrote:
>
> Hi linghucongsong:
>
> I have got this issue before, you can try to fix it as below:
>
> 1. use *rbd lock ls* to get the lock for the vm
> 2. use *rbd lock rm* to remove that lock for the vm
> 3. start vm again
>
> hope that can help you.
>
> regards,
>
> Ouyang
>
> On 2018/12/4 4:48 PM, linghucongsong wrote:
>
> Hi all!
>
> I have a Ceph test environment used with OpenStack; there are some VMs
> running on the OpenStack side. It is just a test environment.
>
> My Ceph version is 12.2.4. The other day I rebooted all the Ceph hosts,
> and before doing this I did not shut down the VMs on OpenStack.
>
> When all the hosts booted up and Ceph became healthy again, I found that the
> VMs could not start up. All the VMs show the xfs error below, and even
> xfs_repair cannot repair the problem.
>
> It is just a test environment so the data is not important to me. I know
> Ceph version 12.2.4 is not stable enough, but how can it have such serious
> problems? Other people may want to be aware of this. Thanks to all. :)
>
>
>
>
>
> ___
> ceph-users mailing 
> listceph-us...@lists.ceph.comhttp://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to mount one of the cephfs namespace using ceph-fuse?

2019-02-13 Thread David Turner
Note that this format in fstab does require a certain version of util-linux
because of the funky format of the line.  Pretty much it maps all command
line options at the beginning of the line separated with commas.

On Wed, Feb 13, 2019 at 2:10 PM David Turner  wrote:

> I believe the fstab line for ceph-fuse in this case would look something
> like [1] this.  We use a line very similar to that to mount cephfs at a
> specific client_mountpoint that the specific cephx user only has access to.
>
> [1] id=acapp3,client_mds_namespace=fs1   /tmp/ceph   fuse.ceph
>  defaults,noatime,_netdev 0 2
>
> On Tue, Dec 4, 2018 at 3:22 AM Zhenshi Zhou  wrote:
>
>> Hi
>>
>> I can use this mount cephfs manually. But how to edit fstab so that the
>> system will auto-mount cephfs by ceph-fuse?
>>
>> Thanks
>>
>> Yan, Zheng wrote on Tue, Nov 20, 2018 at 8:08 PM:
>>
>>> ceph-fuse --client_mds_namespace=xxx
>>> On Tue, Nov 20, 2018 at 7:33 PM ST Wong (ITSC)  wrote:
>>> 
>>>  Hi all,
>>> 
>>> 
>>> 
>>>  We’re using mimic and enabled multiple fs flag. We can do
>>> kernel mount of particular fs (e.g. fs1) with mount option
>>> mds_namespace=fs1.However, this is not working for ceph-fuse:
>>> 
>>> 
>>> 
>>>  #ceph-fuse -n client.acapp3 -o mds_namespace=fs1 /tmp/ceph
>>> 
>>>  2018-11-20 19:30:35.246 7ff5653edcc0 -1 init, newargv =
>>> 0x5564a21633b0 newargc=9
>>> 
>>>  fuse: unknown option `mds_namespace=fs1'
>>> 
>>>  ceph-fuse[3931]: fuse failed to start
>>> 
>>>  2018-11-20 19:30:35.264 7ff5653edcc0 -1 fuse_lowlevel_new failed
>>> 
>>> 
>>> 
>>>  Sorry that I can’t find the correct option in ceph-fuse man page or
>>> doc.
>>> 
>>>  Please help.   Thanks a lot.
>>> 
>>> 
>>> 
>>>  Best Rgds
>>> 
>>>  /stwong
>>> 
>>>  ___
>>>  ceph-users mailing list
>>>  ceph-users@lists.ceph.com
>>>  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to mount one of the cephfs namespace using ceph-fuse?

2019-02-13 Thread David Turner
I believe the fstab line for ceph-fuse in this case would look something
like [1] this.  We use a line very similar to that to mount cephfs at a
specific client_mountpoint that the specific cephx user only has access to.

[1] id=acapp3,client_mds_namespace=fs1   /tmp/ceph   fuse.ceph
 defaults,noatime,_netdev 0 2

On Tue, Dec 4, 2018 at 3:22 AM Zhenshi Zhou  wrote:

> Hi
>
> I can use this mount cephfs manually. But how to edit fstab so that the
> system will auto-mount cephfs by ceph-fuse?
>
> Thanks
>
> Yan, Zheng wrote on Tue, Nov 20, 2018 at 8:08 PM:
>
>> ceph-fuse --client_mds_namespace=xxx
>> On Tue, Nov 20, 2018 at 7:33 PM ST Wong (ITSC)  wrote:
>> 
>>  Hi all,
>> 
>> 
>> 
>>  We’re using mimic and enabled multiple fs flag. We can do
>> kernel mount of particular fs (e.g. fs1) with mount option
>> mds_namespace=fs1.However, this is not working for ceph-fuse:
>> 
>> 
>> 
>>  #ceph-fuse -n client.acapp3 -o mds_namespace=fs1 /tmp/ceph
>> 
>>  2018-11-20 19:30:35.246 7ff5653edcc0 -1 init, newargv =
>> 0x5564a21633b0 newargc=9
>> 
>>  fuse: unknown option `mds_namespace=fs1'
>> 
>>  ceph-fuse[3931]: fuse failed to start
>> 
>>  2018-11-20 19:30:35.264 7ff5653edcc0 -1 fuse_lowlevel_new failed
>> 
>> 
>> 
>>  Sorry that I can’t find the correct option in ceph-fuse man page or
>> doc.
>> 
>>  Please help.   Thanks a lot.
>> 
>> 
>> 
>>  Best Rgds
>> 
>>  /stwong
>> 
>>  ___
>>  ceph-users mailing list
>>  ceph-users@lists.ceph.com
>>  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] compacting omap doubles its size

2019-02-13 Thread David Turner
Sorry for the late response on this, but life has been really busy over the
holidays.

We compact our omaps offline with the ceph-kvstore-tool.  Here [1] is a
copy of the script that we use for our clusters.  You might need to modify
things a bit for your environment.  I don't remember which version this
functionality was added to ceph-kvstore-tool, but it exists in 12.2.4.  We
need to do this because our OSDs get marked out when they try to compact
their own omaps online.  We run this script monthly and then ad-hoc as we
find OSDs compacting their own omaps live.


[1] https://gist.github.com/drakonstein/4391c0b268a35b64d4f26a12e5058ba9
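
For anyone who just wants the basic idea without the full script, an offline
compaction of a single filestore OSD's omap looks roughly like this (the OSD
id, omap backend and paths are examples; adjust for your setup):

# stop the OSD, compact its omap offline, then start it again
systemctl stop ceph-osd@0
ceph-kvstore-tool rocksdb /var/lib/ceph/osd/ceph-0/current/omap compact
systemctl start ceph-osd@0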

On Thu, Nov 29, 2018 at 6:15 PM Tomasz Płaza 
wrote:

> Hi,
>
> I have a ceph 12.2.8 cluster on filestore with rather large omap dirs
> (avg size is about 150G). Recently slow requests became a problem, so
> after some digging I decided to convert omap from leveldb to rocksdb.
> Conversion went fine and the slow request rate went down to an acceptable
> level. Unfortunately, the conversion did not shrink most of the omap dirs, so I
> tried online compaction:
>
> Before compaction: 50G/var/lib/ceph/osd/ceph-0/current/omap/
>
> After compaction: 100G/var/lib/ceph/osd/ceph-0/current/omap/
>
> Purge and recreate: 1.5G /var/lib/ceph/osd/ceph-0/current/omap/
>
>
> Before compaction: 135G/var/lib/ceph/osd/ceph-5/current/omap/
>
> After compaction: 260G/var/lib/ceph/osd/ceph-5/current/omap/
>
> Purge and recreate: 2.5G /var/lib/ceph/osd/ceph-5/current/omap/
>
>
> Compaction that makes the omap bigger seems quite weird and
> frustrating to me. Please help.
>
>
> P.S. My cluster suffered from ongoing index reshards (it is disabled
> now) and on many buckets with 4m+ objects I have a lot of old indexes:
>
> 634   bucket1
> 651   bucket2
>
> ...
> 1231 bucket17
> 1363 bucket18
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] backfill_toofull while OSDs are not full

2019-01-30 Thread David Zafman


Strange, I can't reproduce this with v13.2.4.  I tried the following 
scenarios:


pg acting 1, 0, 2 -> up 1, 0, 4 (osd.2 marked out).  The df on osd.2 
shows 0 space, but only osd.4 (backfill target) checks full space.


pg acting 1, 0, 2 -> up 4,3,5 (osd.1,0,2 all marked out).  The df for 
osd.1,0,2 shows 0 space but osd.4,3,5 (backfill targets) check full space.


FYI, in a later release, even when a backfill target is below 
backfillfull_ratio, backfill_toofull occurs if there isn't enough room for 
the PG to fit.



The question in your case is whether any of OSDs 999, 1900, or 145 was above 
90% (backfillfull_ratio) usage.
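
Something along these lines would show it (a sketch; run while the PG is stuck):

# utilisation of the three backfill targets
ceph osd df | egrep '^ *(145|999|1900) '
# and the configured ratios
ceph osd dump | grep -i ratio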


David

On 1/27/19 11:34 PM, Wido den Hollander wrote:


On 1/25/19 8:33 AM, Gregory Farnum wrote:

This doesn’t look familiar to me. Is the cluster still doing recovery so
we can at least expect them to make progress when the “out” OSDs get
removed from the set?

The recovery has already finished. It resolved itself, but in the
meantime I saw many PGs in the backfill_toofull state for a long time.

This is new since Mimic.

Wido


On Tue, Jan 22, 2019 at 2:44 PM Wido den Hollander wrote:

 Hi,

 I've got a couple of PGs which are stuck in backfill_toofull, but none
 of them are actually full.

   "up": [
     999,
     1900,
     145
   ],
   "acting": [
     701,
     1146,
     1880
   ],
   "backfill_targets": [
     "145",
     "999",
     "1900"
   ],
   "acting_recovery_backfill": [
     "145",
     "701",
     "999",
     "1146",
     "1880",
     "1900"
   ],

 I checked all these OSDs, but they are all <75% utilization.

 full_ratio 0.95
 backfillfull_ratio 0.9
 nearfull_ratio 0.9

 So I started checking all the PGs and I've noticed that each of these
 PGs has one OSD in the 'acting_recovery_backfill' which is marked as
 out.

 In this case osd.1880 is marked as out and thus it's capacity is shown
 as zero.

 [ceph@ceph-mgr ~]$ ceph osd df|grep 1880
 1880   hdd 4.54599        0     0 B      0 B      0 B     0    0  27
 [ceph@ceph-mgr ~]$

 This is on a Mimic 13.2.4 cluster. Is this expected or is this a unknown
 side-effect of one of the OSDs being marked as out?

 Thanks,

 Wido
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH_FSAL Nfs-ganesha

2019-01-30 Thread David C
Hi Patrick

Thanks for the info. If I did multiple exports, how does that work in terms
of the cache settings defined in ceph.conf? Are those settings per CephFS
client or a shared cache? I.e. if I've defined client_oc_size, would that
be per export?
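
For context, the sort of layout I'm considering looks roughly like this in
ganesha.conf (the paths, export IDs and cephx user are placeholders, not my
real config):

EXPORT {
    Export_Id = 1;
    Path = /dir1;
    Pseudo = /dir1;
    Access_Type = RW;
    FSAL {
        Name = CEPH;
        User_Id = "nfs";
    }
}
# ...and a second EXPORT block with Export_Id = 2, Path = /dir2, and so on.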

Cheers,

On Tue, Jan 15, 2019 at 6:47 PM Patrick Donnelly 
wrote:

> On Mon, Jan 14, 2019 at 7:11 AM Daniel Gryniewicz  wrote:
> >
> > Hi.  Welcome to the community.
> >
> > On 01/14/2019 07:56 AM, David C wrote:
> > > Hi All
> > >
> > > I've been playing around with the nfs-ganesha 2.7 exporting a cephfs
> > > filesystem, it seems to be working pretty well so far. A few questions:
> > >
> > > 1) The docs say " For each NFS-Ganesha export, FSAL_CEPH uses a
> > > libcephfs client,..." [1]. For arguments sake, if I have ten top level
> > > dirs in my Cephfs namespace, is there any value in creating a separate
> > > export for each directory? Will that potentially give me better
> > > performance than a single export of the entire namespace?
> >
> > I don't believe there are any advantages from the Ceph side.  From the
> > Ganesha side, you configure permissions, client ACLs, squashing, and so
> > on on a per-export basis, so you'll need different exports if you need
> > different settings for each top level directory.  If they can all use
> > the same settings, one export is probably better.
>
> There may be performance impact (good or bad) with having separate
> exports for CephFS. Each export instantiates a separate instance of
> the CephFS client which has its own bookkeeping and set of
> capabilities issued by the MDS. Also, each client instance has a
> separate big lock (potentially a big deal for performance). If the
> data for each export is disjoint (no hard links or shared inodes) and
> the NFS server is expected to have a lot of load, breaking out the
> exports can have a positive impact on performance. If there are hard
> links, then the clients associated with the exports will potentially
> fight over capabilities which will add to request latency.)
>
> --
> Patrick Donnelly
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How To Properly Failover a HA Setup

2019-01-21 Thread David C
It could also be the kernel client versions, what are you running? I
remember older kernel clients didn't always deal with recovery scenarios
very well.

On Mon, Jan 21, 2019 at 9:18 AM Marc Roos  wrote:

>
>
> I think his downtime is coming from the MDS failover; that takes a while
> in my case too. But I am not using CephFS that much yet.
>
>
>
> -Original Message-
> From: Robert Sander [mailto:r.san...@heinlein-support.de]
> Sent: 21 January 2019 10:05
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] How To Properly Failover a HA Setup
>
> On 21.01.19 09:22, Charles Tassell wrote:
> > Hello Everyone,
> >
> >I've got a 3 node Jewel cluster setup, and I think I'm missing
> > something.  When I want to take one of my nodes down for maintenance
> > (kernel upgrades or the like) all of my clients (running the kernel
> > module for the cephfs filesystem) hang for a couple of minutes before
> > the redundant servers kick in.
>
> Have you set the noout flag before doing cluster maintenance?
>
> ceph osd set noout
>
> and afterwards
>
> ceph osd unset noout
>
> Regards
> --
> Robert Sander
> Heinlein Support GmbH
> Schwedter Str. 8/9b, 10119 Berlin
>
> https://www.heinlein-support.de
>
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
>
> Amtsgericht Berlin-Charlottenburg - HRB 93818 B
> Geschäftsführer: Peer Heinlein - Sitz: Berlin
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS - Small file - single thread - read performance.

2019-01-18 Thread David C
On Fri, 18 Jan 2019, 14:46 Marc Roos wrote:
>
> [@test]# time cat 50b.img > /dev/null
>
> real0m0.004s
> user0m0.000s
> sys 0m0.002s
> [@test]# time cat 50b.img > /dev/null
>
> real0m0.002s
> user0m0.000s
> sys 0m0.002s
> [@test]# time cat 50b.img > /dev/null
>
> real0m0.002s
> user0m0.000s
> sys 0m0.001s
> [@test]# time cat 50b.img > /dev/null
>
> real0m0.002s
> user0m0.001s
> sys 0m0.001s
> [@test]#
>
> Luminous, centos7.6 kernel cephfs mount, 10Gbit, ssd meta, hdd data, mds
> 2,2Ghz
>

Did you drop the caches on your client before reading the file?
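
E.g. something like this on the client before timing the read (needs root):

sync; echo 3 > /proc/sys/vm/drop_caches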

>
>
>
> -Original Message-
> From: Alexandre DERUMIER [mailto:aderum...@odiso.com]
> Sent: 18 January 2019 15:37
> To: Burkhard Linke
> Cc: ceph-users
> Subject: Re: [ceph-users] CephFS - Small file - single thread - read
> performance.
>
> Hi,
> I don't have so big latencies:
>
> # time cat 50bytesfile > /dev/null
>
> real0m0,002s
> user0m0,001s
> sys 0m0,000s
>
>
> (It's on an ceph ssd cluster (mimic), kernel cephfs client (4.18), 10GB
> network with small latency too, client/server have 3ghz cpus)
>
>
>
> - Mail original -
> De: "Burkhard Linke" 
> À: "ceph-users" 
> Envoyé: Vendredi 18 Janvier 2019 15:29:45
> Objet: Re: [ceph-users] CephFS - Small file - single thread - read
> performance.
>
> Hi,
>
> On 1/18/19 3:11 PM, jes...@krogh.cc wrote:
> > Hi.
> >
> We have the intention of using CephFS for some of our shares, which
> we'd like to spool to tape as part of our normal backup schedule. CephFS
> works nicely for large files, but for "small" files .. < 0.1MB .. there seems to

> be an "overhead" of 20-40ms per file. I tested like this:
> >
> > root@abe:/nfs/home/jk# time cat /ceph/cluster/rsyncbackups/13kbfile >
> > /dev/null
> >
> > real 0m0.034s
> > user 0m0.001s
> > sys 0m0.000s
> >
> > And from local page-cache right after.
> > root@abe:/nfs/home/jk# time cat /ceph/cluster/rsyncbackups/13kbfile >
> > /dev/null
> >
> > real 0m0.002s
> > user 0m0.002s
> > sys 0m0.000s
> >
> > Giving a ~20ms overhead in a single file.
> >
> > This is about x3 higher than on our local filesystems (xfs) based on
> > same spindles.
> >
> > CephFS metadata is on SSD - everything else on big-slow HDD's (in both
>
> > cases).
> >
> > Is this what everyone else see?
>
>
> Each file access on client side requires the acquisition of a
> corresponding locking entity ('file capability') from the MDS. This adds
> an extra network round trip to the MDS. In the worst case the MDS needs
> to request a capability release from another client which still holds
> the cap (e.g. file is still in page cache), adding another extra network
> round trip.
>
>
> CephFS is not NFS, and has a strong consistency model. This comes at a
> price.
>
>
> Regards,
>
> Burkhard
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS - Small file - single thread - read performance.

2019-01-18 Thread David C
On Fri, Jan 18, 2019 at 2:12 PM  wrote:

> Hi.
>
> We have the intention of using CephFS for some of our shares, which we'd
> like to spool to tape as part of our normal backup schedule. CephFS works nicely
> for large files, but for "small" files .. < 0.1MB .. there seems to be an
> "overhead" of 20-40ms per file. I tested like this:
>
> root@abe:/nfs/home/jk# time cat /ceph/cluster/rsyncbackups/13kbfile >
> /dev/null
>
> real0m0.034s
> user0m0.001s
> sys 0m0.000s
>
> And from local page-cache right after.
> root@abe:/nfs/home/jk# time cat /ceph/cluster/rsyncbackups/13kbfile >
> /dev/null
>
> real0m0.002s
> user0m0.002s
> sys 0m0.000s
>
> Giving a ~20ms overhead in a single file.
>
> This is about x3 higher than on our local filesystems (xfs) based on
> same spindles.
>
> CephFS metadata is on SSD - everything else on big-slow HDD's (in both
> cases).
>
> Is this what everyone else see?
>

Pretty much. Reading a file from a pool of Filestore spinners:

# time cat 13kb > /dev/null

real0m0.013s
user0m0.000s
sys 0m0.003s

That's after dropping the caches on the client; however, the file would have
still been in the page cache on the OSD nodes as I had just created it. If the
file was coming straight off the spinners I'd expect to see something
closer to your time.

I guess if you wanted to improve the latency you would be looking at the
usual stuff e.g (off the top of my head):

- Faster network links/tuning your network
- Turning down Ceph debugging
- Trying a different striping layout on the dirs with the small files
(unlikely to have much affect)
- If you're using fuse mount try Kernel mount (or maybe vice versa)
- Play with mount options
- Tune CPU on MDS node

Still, even with all of that it's unlikely you'll get to local file-system
performance; as Burkhard says, you have the locking overhead. You'll
probably need to look at getting more parallelism going in your rsyncs.
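
A rough sketch of one way to do that (the source path is from your example,
the destination and the parallelism level are made up):

# run one rsync per top-level directory, 8 at a time
ls /ceph/cluster/rsyncbackups | xargs -P8 -I{} \
    rsync -a /ceph/cluster/rsyncbackups/{}/ /backup/{}/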



>
> Thanks
>
> --
> Jesper
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fw: Re: Why does "df" on a cephfs not report same free space as "rados df" ?

2019-01-16 Thread David Young
Forgot to reply to the list!

‐‐‐ Original Message ‐‐‐
On Thursday, January 17, 2019 8:32 AM, David Young 
 wrote:

> Thanks David,
>
> "ceph osd df" looks like this:
>
> -
> root@node1:~# ceph osd df
> ID CLASS WEIGHT  REWEIGHT SIZEUSE AVAIL%USE  VAR  PGS
> 9   hdd 7.27698  1.0 7.3 TiB 6.3 TiB 1008 GiB 86.47 1.22 122
> 10   hdd 7.27698  1.0 7.3 TiB 4.9 TiB  2.4 TiB 66.90 0.94  94
> 11   hdd 7.27739  0.90002 7.3 TiB 5.4 TiB  1.9 TiB 74.29 1.05 104
> 12   hdd 7.27698  0.95001 7.3 TiB 5.8 TiB  1.5 TiB 79.64 1.12 115
> 13   hdd   00 0 B 0 B  0 B 00  18
> 40   hdd 7.27739  1.0 7.3 TiB 6.1 TiB  1.2 TiB 83.32 1.17 120
> 41   hdd 7.27739  0.90002 7.3 TiB 5.6 TiB  1.7 TiB 76.88 1.08 113
> 42   hdd 7.27739  0.80005 7.3 TiB 6.3 TiB  1.0 TiB 85.98 1.21 123
> 43   hdd   00 0 B 0 B  0 B 00  32
> 44   hdd 7.277390 0 B 0 B  0 B 00  27
> 45   hdd 7.27739  1.0 7.3 TiB 5.1 TiB  2.2 TiB 69.44 0.98  98
> 46   hdd   00 0 B 0 B  0 B 00  38
> 47   hdd 7.27739  1.0 7.3 TiB 4.4 TiB  2.9 TiB 60.24 0.85  84
> 48   hdd 7.27739  1.0 7.3 TiB 4.5 TiB  2.8 TiB 61.66 0.87  85
> 49   hdd 7.27739  1.0 7.3 TiB 4.7 TiB  2.5 TiB 65.07 0.92  90
> 50   hdd 7.27739  1.0 7.3 TiB 4.7 TiB  2.6 TiB 64.39 0.91  87
> 51   hdd 7.27739  1.0 7.3 TiB 5.1 TiB  2.2 TiB 70.22 0.99  95
> 52   hdd 7.27739  1.0 7.3 TiB 4.9 TiB  2.4 TiB 66.69 0.94  98
> 53   hdd 7.27739  1.0 7.3 TiB 4.8 TiB  2.5 TiB 66.33 0.93  97
> 54   hdd 7.27739  1.0 7.3 TiB 4.3 TiB  3.0 TiB 59.20 0.83  82
> 0   hdd 7.27699  1.0 7.3 TiB 3.8 TiB  3.5 TiB 52.34 0.74  71
> 1   hdd 7.27699  1.0 7.3 TiB 4.9 TiB  2.4 TiB 67.62 0.95  89
> 2   hdd 7.27699  0.90002 7.3 TiB 4.9 TiB  2.4 TiB 66.69 0.94  81
> 3   hdd 7.27699  1.0 7.3 TiB 4.7 TiB  2.5 TiB 65.21 0.92  88
> 4   hdd 7.27699  0.90002 7.3 TiB 4.9 TiB  2.4 TiB 67.25 0.95  93
> 5   hdd 7.27739  0.95001 7.3 TiB 4.2 TiB  3.0 TiB 58.39 0.82  78
> 6   hdd 7.27739  1.0 7.3 TiB 5.7 TiB  1.6 TiB 78.35 1.10 105
> 7   hdd 7.27739  0.95001 7.3 TiB 5.2 TiB  2.1 TiB 71.65 1.01  98
> 8   hdd 7.27739  1.0 7.3 TiB 5.1 TiB  2.2 TiB 69.92 0.98  94
> 14   hdd 7.27739  0.95001 7.3 TiB 5.3 TiB  2.0 TiB 72.46 1.02 100
> 15   hdd 7.27739  0.85004 7.3 TiB 6.0 TiB  1.2 TiB 82.93 1.17 119
> 16   hdd 7.27739  1.0 7.3 TiB 6.3 TiB  1.0 TiB 86.11 1.21 117
> 17   hdd 7.27739  0.85004 7.3 TiB 5.2 TiB  2.1 TiB 71.48 1.01 103
> 18   hdd 7.27739  1.0 7.3 TiB 5.2 TiB  2.1 TiB 71.43 1.00 100
> 19   hdd 7.27739  1.0 7.3 TiB 5.2 TiB  2.0 TiB 72.14 1.01 103
> 20   hdd 7.27739  1.0 7.3 TiB 5.7 TiB  1.6 TiB 78.13 1.10 110
> 21   hdd 7.27739  1.0 7.3 TiB 6.2 TiB  1.0 TiB 85.58 1.20 125
> 22   hdd 7.27739  1.0 7.3 TiB 5.2 TiB  2.1 TiB 71.71 1.01 103
> 23   hdd 7.27739  0.95001 7.3 TiB 6.0 TiB  1.2 TiB 83.04 1.17 110
> 24   hdd   0  1.0 7.3 TiB 831 GiB  6.5 TiB 11.15 0.16  13
> 25   hdd 7.27739  1.0 7.3 TiB 6.3 TiB  978 GiB 86.87 1.22 121
> 26   hdd 7.27739  1.0 7.3 TiB 5.2 TiB  2.1 TiB 70.86 1.00 100
> 27   hdd 7.27739  1.0 7.3 TiB 5.9 TiB  1.4 TiB 80.92 1.14 115
> 28   hdd 7.27739  1.0 7.3 TiB 6.5 TiB  826 GiB 88.91 1.25 121
> 29   hdd 7.27739  1.0 7.3 TiB 5.2 TiB  2.1 TiB 70.99 1.00  95
> 30   hdd   0  1.0 7.3 TiB 2.0 TiB  5.3 TiB 26.99 0.38  33
> 31   hdd 7.27739  1.0 7.3 TiB 4.6 TiB  2.7 TiB 62.61 0.88  90
> 32   hdd 7.27739  0.90002 7.3 TiB 5.5 TiB  1.8 TiB 75.65 1.06 107
> 33   hdd 7.27739  1.0 7.3 TiB 5.7 TiB  1.6 TiB 77.99 1.10 111
> 34   hdd 7.277390 0 B 0 B  0 B 00  10
> 35   hdd 7.27739  1.0 7.3 TiB 5.3 TiB  2.0 TiB 73.16 1.03 106
> 36   hdd 7.27739  0.95001 7.3 TiB 6.6 TiB  694 GiB 90.68 1.28 126
> 37   hdd 7.27739  1.0 7.3 TiB 5.5 TiB  1.8 TiB 75.83 1.07 106
> 38   hdd 7.27739  0.95001 7.3 TiB 6.2 TiB  1.1 TiB 85.02 1.20 115
> 39   hdd 7.27739  1.0 7.3 TiB 4.9 TiB  2.4 TiB 67.16 0.94  94
> TOTAL 400 TiB 266 TiB  134 TiB 71.08
> MIN/MAX VAR: 0.16/1.28  STDDEV: 13.96
> root@node1:~#
> 
>
> The drives that are weighted zero are "out" pending the completion of the 
> remaining degraded objects after an OSD failure:
>
> ---
>   data:
> pools:   2 pools, 1028 pgs
> objects: 52.15 M objects, 197 TiB
> usage:   266 TiB used, 134 TiB / 400 TiB avail
> pgs: 477114/260622045 objects degraded (0.183%)
>  10027396/260622045 objects misplaced (3.847%)
> --
>
> ‐‐‐ Original Message ‐‐‐
> On Thursday, January 17, 2019 7:23 AM, David C  
> wrote:
>
>> On Wed, 16 Jan 2019, 02:20 David Young wrote:

Re: [ceph-users] Why does "df" on a cephfs not report same free space as "rados df" ?

2019-01-16 Thread David C
On Wed, 16 Jan 2019, 02:20 David Young wrote:

> Hi folks,
>
> My ceph cluster is used exclusively for cephfs, as follows:
>
> ---
> root@node1:~# grep ceph /etc/fstab
> node2:6789:/ /ceph ceph
> auto,_netdev,name=admin,secretfile=/root/ceph.admin.secret
> root@node1:~#
> ---
>
> "rados df" shows me the following:
>
> ---
> root@node1:~# rados df
> POOL_NAME  USED  OBJECTS CLONESCOPIES MISSING_ON_PRIMARY
> UNFOUND DEGRADEDRD_OPS  RDWR_OPS  WR
> cephfs_metadata 197 MiB49066  0 98132  0
> 00   9934744  55 GiB  57244243 232 GiB
> media   196 TiB 51768595  0 258842975  0
> 1   203534 477915206 509 TiB 165167618 292 TiB
>
> total_objects51817661
> total_used   266 TiB
> total_avail  135 TiB
> total_space  400 TiB
> root@node1:~#
> ---
>
> But "df" on the mounted cephfs volume shows me:
>
> ---
> root@node1:~# df -h /ceph
> Filesystem  Size  Used Avail Use% Mounted on
> 10.20.30.22:6789:/  207T  196T   11T  95% /ceph
> root@node1:~#
> ---
>
> And ceph -s shows me:
>
> ---
>   data:
> pools:   2 pools, 1028 pgs
> objects: 51.82 M objects, 196 TiB
> usage:   266 TiB used, 135 TiB / 400 TiB avail
> ---
>
> "media" is an EC pool with size of 5 (4+1), so I can expect 1TB of data to
> consume 1.25TB raw space.
>
> My question is, why does "df" show me I have 11TB free, when "rados df"
> shows me I have 135TB (raw) available?
>

Probably because your OSDs are quite unbalanced.  What does your 'ceph osd
df' look like?



>
> Thanks!
> D
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Why does "df" on a cephfs not report same free space as "rados df" ?

2019-01-15 Thread David Young
Hi folks,

My ceph cluster is used exclusively for cephfs, as follows:

---
root@node1:~# grep ceph /etc/fstab
node2:6789:/ /ceph ceph 
auto,_netdev,name=admin,secretfile=/root/ceph.admin.secret
root@node1:~#
---

"rados df" shows me the following:

---
root@node1:~# rados df
POOL_NAME  USED  OBJECTS CLONESCOPIES MISSING_ON_PRIMARY UNFOUND 
DEGRADEDRD_OPS  RDWR_OPS  WR
cephfs_metadata 197 MiB49066  0 98132  0   0
0   9934744  55 GiB  57244243 232 GiB
media   196 TiB 51768595  0 258842975  0   1   
203534 477915206 509 TiB 165167618 292 TiB

total_objects51817661
total_used   266 TiB
total_avail  135 TiB
total_space  400 TiB
root@node1:~#
---

But "df" on the mounted cephfs volume shows me:

---
root@node1:~# df -h /ceph
Filesystem  Size  Used Avail Use% Mounted on
10.20.30.22:6789:/  207T  196T   11T  95% /ceph
root@node1:~#
---

And ceph -s shows me:

---
  data:
pools:   2 pools, 1028 pgs
objects: 51.82 M objects, 196 TiB
usage:   266 TiB used, 135 TiB / 400 TiB avail
---

"media" is an EC pool with size of 5 (4+1), so I can expect 1TB of data to 
consume 1.25TB raw space.

My question is, why does "df" show me I have 11TB free, when "rados df" shows 
me I have 135TB (raw) available?

Thanks!
D___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CEPH_FSAL Nfs-ganesha

2019-01-14 Thread David C
Hi All

I've been playing around with the nfs-ganesha 2.7 exporting a cephfs
filesystem, it seems to be working pretty well so far. A few questions:

1) The docs say " For each NFS-Ganesha export, FSAL_CEPH uses a libcephfs
client,..." [1]. For arguments sake, if I have ten top level dirs in my
Cephfs namespace, is there any value in creating a separate export for each
directory? Will that potentially give me better performance than a single
export of the entire namespace?

2) Tuning: are there any recommended parameters to tune? So far I've found
I had to increase client_oc_size which seemed quite conservative.

Thanks
David

[1] http://docs.ceph.com/docs/mimic/cephfs/nfs/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs free space issue

2019-01-10 Thread David C
On Thu, Jan 10, 2019 at 4:07 PM Scottix  wrote:

> I just had this question as well.
>
> I am interested in what you mean by fullest: is it percentage-wise or raw
> space? If I have an uneven distribution and adjusted it, would it potentially
> make more space available?
>

Yes - I'd recommend using pg-upmap if all your clients are Luminous+. I
"reclaimed" about 5TB of usable space recently by balancing my PGs.

@Yoann, you've got a fair bit of variance so you would likely benefit from
pg-upmap (or other rebalancing).
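
For reference, applying the (RAW / 3) * 0.85 rule of thumb Wido gives below to Yoann's numbers: (65.5 TiB / 3) * 0.85 ≈ 18.6 TiB, which lines up with the ~19T size that df reports. Switching to the upmap balancer is roughly the following (assuming raising the min-compat client to luminous is acceptable for your setup):

ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on
ceph balancer status    # confirm it's active and executing plans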


> Thanks
> Scott
> On Thu, Jan 10, 2019 at 12:05 AM Wido den Hollander  wrote:
>
>>
>>
>> On 1/9/19 2:33 PM, Yoann Moulin wrote:
>> > Hello,
>> >
>> > I have a CEPH cluster in luminous 12.2.10 dedicated to cephfs.
>> >
>> > The raw size is 65.5 TB, with a replica 3, I should have ~21.8 TB
>> usable.
>> >
>> > But the size of the cephfs view by df is *only* 19 TB, is that normal ?
>> >
>>
>> Yes. Ceph will calculate this based on the fullest OSD. As data
>> distribution is never 100% perfect you will get such numbers.
>>
>> To go from raw to usable I use this calculation:
>>
>> (RAW / 3) * 0.85
>>
>> So yes, I take a 20%, sometimes even 30% buffer.
>>
>> Wido
>>
>> > Best regards,
>> >
>> > here some hopefully useful information :
>> >
>> >> apollo@icadmin004:~$ ceph -s
>> >>   cluster:
>> >> id: fc76846a-d0f0-4866-ae6d-d442fc885469
>> >> health: HEALTH_OK
>> >>
>> >>   services:
>> >> mon: 3 daemons, quorum icadmin006,icadmin007,icadmin008
>> >> mgr: icadmin006(active), standbys: icadmin007, icadmin008
>> >> mds: cephfs-3/3/3 up
>> {0=icadmin008=up:active,1=icadmin007=up:active,2=icadmin006=up:active}
>> >> osd: 40 osds: 40 up, 40 in
>> >>
>> >>   data:
>> >> pools:   2 pools, 2560 pgs
>> >> objects: 26.12M objects, 15.6TiB
>> >> usage:   49.7TiB used, 15.8TiB / 65.5TiB avail
>> >> pgs: 2560 active+clean
>> >>
>> >>   io:
>> >> client:   510B/s rd, 24.1MiB/s wr, 0op/s rd, 35op/s wr
>> >
>> >> apollo@icadmin004:~$ ceph df
>> >> GLOBAL:
>> >> SIZEAVAIL   RAW USED %RAW USED
>> >> 65.5TiB 15.8TiB  49.7TiB 75.94
>> >> POOLS:
>> >> NAME            ID USED    %USED MAX AVAIL  OBJECTS
>> >> cephfs_data     1  15.6TiB 85.62   2.63TiB 25874848
>> >> cephfs_metadata 2   571MiB  0.02   2.63TiB   245778
>> >
>> >> apollo@icadmin004:~$ rados df
>> >> POOL_NAME        USED    OBJECTS  CLONES   COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED     RD_OPS      RD   WR_OPS      WR
>> >> cephfs_data     15.6TiB 25874848       0 77624544                  0       0        0  324156851 25.9TiB 20114360 9.64TiB
>> >> cephfs_metadata  571MiB   245778       0   737334                  0       0        0 1802713236 87.7TiB 75729412 16.0TiB
>> >>
>> >> total_objects26120626
>> >> total_used   49.7TiB
>> >> total_avail  15.8TiB
>> >> total_space  65.5TiB
>> >
>> >> apollo@icadmin004:~$ ceph osd pool ls detail
>> >> pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 6197 lfor 0/3885
>> flags hashpspool stripe_width 0 application cephfs
>> >> pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 512 pgp_num 512 last_change 6197 lfor 0/703
>> flags hashpspool stripe_width 0 application cephfs
>> >
>> >> apollo@icadmin004:~$ df -h /apollo/
>> >> Filesystem Size  Used Avail Use% Mounted on
>> >> 10.90.36.16,10.90.36.17,10.90.36.18:/   19T   16T  2.7T  86% /apollo
>> >
>> >> apollo@icadmin004:~$ ceph fs get cephfs
>> >> Filesystem 'cephfs' (1)
>> >> fs_name  cephfs
>> >> epoch49277
>> >> flagsc
>> >> created  2018-01-23 14:06:43.460773
>> >> modified 2019-01-09 14:17:08.520888
>> >> tableserver  0
>> >> root 0
>> >> session_timeout  60
>> >> session_autoclose300
>> >> max_file_size1099511627776
>> >> last_failure 0
>> >> last_failure_osd_epoch   6216
>> >> compat   compat={},rocompat={},incompat={1=base v0.20,2=client
>> writeable ranges,3=default file layouts on dirs,4=dir inode in separate
>> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
>> anchor table,9=file layout v2}
>> >> max_mds  3
>> >> in   0,1,2
>> >> up   {0=424203,1=424158,2=424146}
>> >> failed
>> >> damaged
>> >> stopped
>> >> data_pools   [1]
>> >> metadata_pool2
>> >> inline_data  disabled
>> >> balancer
>> >> standby_count_wanted 0
>> >> 424203:  10.90.36.18:6800/3885954695 'icadmin008' mds.0.49202
>> up:active seq 6 export_targets=1,2
>> >> 424158:  10.90.36.17:6800/152758094 'icadmin007' mds.1.49198
>> up:active seq 16 export_targets=0,2
>> >> 424146:  10.90.36.16:6801/1771587593 'icadmin006' mds.2.49195
>> up:active seq 19 export_targets=0
>> >
>> >> apollo@icadmin004:~$ ceph osd tree
>> >> ID  CLASS WEIGHT   TYPE NAME STATUS REWEIGHT 

Re: [ceph-users] Mimic 13.2.3?

2019-01-08 Thread David Galloway


On 1/8/19 9:05 AM, Matthew Vernon wrote:
> Dear Greg,
> 
> On 04/01/2019 19:22, Gregory Farnum wrote:
> 
>> Regarding Ceph releases more generally:
> 
> [snip]
> 
>> I imagine we will discuss all this in more detail after the release,
>> but everybody's patience is appreciated as we work through these
>> challenges.
> 
> Thanks for this. Could you confirm that which distros (of Debian/Ubuntu)
> binary packages for the various Ceph releases are built is something
> you're going to try and sort out, please?
> 
> [e.g. my earlier post
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-December/031966.html
> ]
> 
> ...if not, should I open a tracker issue? I could build binaries myself,
> obviously, but this seems a bit wasteful...
> 
> Regards,
> 
> Matthew
> 

Hey Matthew,

The current distro matrix is:

Luminous: xenial centos7 trusty jessie stretch
Mimic: bionic xenial centos7

This may have been different in previous point releases because, as Greg
mentioned in an earlier post in this thread, the release process has
changed hands and I'm still working on getting a solid/bulletproof
process documented, in place, and (more) automated.

I wouldn't be the final decision maker but if you think we should be
building Mimic packages for Debian (for example), we could consider it.
The build process should support it, I believe.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSDs crashing in EC pool (whack-a-mole)

2019-01-08 Thread David Young
Hi all,

One of my OSD hosts recently ran into RAM contention (was swapping heavily), 
and after rebooting, I'm seeing this error on random OSDs in the cluster:

---
Jan 08 03:34:36 prod1 ceph-osd[3357939]:  ceph version 13.2.4 
(b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)
Jan 08 03:34:36 prod1 ceph-osd[3357939]:  1: /usr/bin/ceph-osd() [0xcac700]
Jan 08 03:34:36 prod1 ceph-osd[3357939]:  2: (()+0x11390) [0x7f8fa5d0e390]
Jan 08 03:34:36 prod1 ceph-osd[3357939]:  3: (gsignal()+0x38) [0x7f8fa5241428]
Jan 08 03:34:36 prod1 ceph-osd[3357939]:  4: (abort()+0x16a) [0x7f8fa524302a]
Jan 08 03:34:36 prod1 ceph-osd[3357939]:  5: (ceph::__ceph_assert_fail(char 
const*, char const*, int, char const*)+0x250) [0x7f8fa767c510]
Jan 08 03:34:36 prod1 ceph-osd[3357939]:  6: (()+0x2e5587) [0x7f8fa767c587]
Jan 08 03:34:36 prod1 ceph-osd[3357939]:  7: 
(BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
ObjectStore::Transaction*)+0x923) [0xbab5e3]
Jan 08 03:34:36 prod1 ceph-osd[3357939]:  8: 
(BlueStore::queue_transactions(boost::intrusive_ptr&,
 std::vector 
>&, boost::intrusive_ptr, ThreadPool::TPHandle*)+0x5c3) [0xbade03]
Jan 08 03:34:36 prod1 ceph-osd[3357939]:  9: 
(ObjectStore::queue_transaction(boost::intrusive_ptr&,
 ObjectStore::Transaction&&, boost::intrusive_ptr, 
ThreadPool::TPHandle*)+0x82) [0x79c812]
Jan 08 03:34:36 prod1 ceph-osd[3357939]:  10: 
(OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, 
ThreadPool::TPHandle*)+0x58) [0x730ff8]
Jan 08 03:34:36 prod1 ceph-osd[3357939]:  11: 
(OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr, 
ThreadPool::TPHandle&)+0xfe) [0x759aae]
Jan 08 03:34:36 prod1 ceph-osd[3357939]:  12: (PGPeeringItem::run(OSD*, 
OSDShard*, boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x50) [0x9c5720]
Jan 08 03:34:36 prod1 ceph-osd[3357939]:  13: 
(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x590) 
[0x769760]
Jan 08 03:34:36 prod1 ceph-osd[3357939]:  14: 
(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x476) 
[0x7f8fa76824f6]
Jan 08 03:34:36 prod1 ceph-osd[3357939]:  15: 
(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f8fa76836b0]
Jan 08 03:34:36 prod1 ceph-osd[3357939]:  16: (()+0x76ba) [0x7f8fa5d046ba]
Jan 08 03:34:36 prod1 ceph-osd[3357939]:  17: (clone()+0x6d) [0x7f8fa531341d]
Jan 08 03:34:36 prod1 ceph-osd[3357939]:  NOTE: a copy of the executable, or 
`objdump -rdS ` is needed to interpret this.
Jan 08 03:34:36 prod1 systemd[1]: ceph-osd@43.service: Main process exited, 
code=killed, status=6/ABRT
---
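
Side note: the disassembly that the NOTE at the end of the trace asks for can be produced with something like this (binary path assumed for a package install) and attached to a tracker issue:

objdump -rdS /usr/bin/ceph-osd > /tmp/ceph-osd-13.2.4.objdump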

I've restarted all the OSDs and the mons, but am still encountering the above.

Any ideas / suggestions?

Thanks!
D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balancer=on with crush-compat mode

2019-01-05 Thread David C
On Sat, 5 Jan 2019, 13:38 Marc Roos 
> I have straw2, balancer=on, crush-compat and it gives worst spread over
> my ssd drives (4 only) being used by only 2 pools. One of these pools
> has pg 8. Should I increase this to 16 to create a better result, or
> will it never be any better.
>
> For now I like to stick to crush-compat, so I can use a default centos7
> kernel.
>

Pg upmap is supported in the CentOS 7.5+ kernels
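
If in doubt, something like:

ceph features    # lists the feature release (jewel/luminous/...) reported by each group of connected clients

is a quick way to confirm what's actually connected before changing the balancer mode.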

>
> Luminous 12.2.8, 3.10.0-862.14.4.el7.x86_64, CentOS Linux release
> 7.5.1804 (Core)
>
>
>
> [@c01 ~]# cat balancer-1-before.txt | egrep '^19|^20|^21|^30'
> 19   ssd 0.48000  1.0  447GiB  164GiB  283GiB 36.79 0.93  31
> 20   ssd 0.48000  1.0  447GiB  136GiB  311GiB 30.49 0.77  32
> 21   ssd 0.48000  1.0  447GiB  215GiB  232GiB 48.02 1.22  30
> 30   ssd 0.48000  1.0  447GiB  151GiB  296GiB 33.72 0.86  27
>
> [@c01 ~]# ceph osd df | egrep '^19|^20|^21|^30'
> 19   ssd 0.48000  1.0  447GiB  157GiB  290GiB 35.18 0.87  30
> 20   ssd 0.48000  1.0  447GiB  125GiB  322GiB 28.00 0.69  30
> 21   ssd 0.48000  1.0  447GiB  245GiB  202GiB 54.71 1.35  30
> 30   ssd 0.48000  1.0  447GiB  217GiB  230GiB 48.46 1.20  30
>
> [@c01 ~]# ceph osd pool ls detail | egrep 'fs_meta|rbd.ssd'
> pool 19 'fs_meta' replicated size 3 min_size 2 crush_rule 5 object_hash
> rjenkins pg_num 16 pgp_num 16 last_change 22425 lfor 0/9035 flags
> hashpspool stripe_width 0 application cephfs
> pool 54 'rbd.ssd' replicated size 3 min_size 2 crush_rule 5 object_hash
> rjenkins pg_num 8 pgp_num 8 last_change 24666 flags hashpspool
> stripe_width 0 application rbd
>
> [@c01 ~]# ceph df |egrep 'ssd|fs_meta'
> fs_meta           19   170MiB  0.07  240GiB  2451382
> fs_data.ssd       33       0B     0  240GiB        0
> rbd.ssd           54   266GiB 52.57  240GiB    75902
> fs_data.ec21.ssd  55       0B     0  480GiB        0
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

