[ceph-users] Ceph Developer Monthly - May 2018

2018-04-25 Thread Leonardo Vaz
Hey Cephers,

This is just a friendly reminder that the next Ceph Developer Monthly
meeting is coming up:

 http://wiki.ceph.com/Planning

If you have work that you're doing that is feature work, significant
backports, or anything you would like to discuss with the core team,
please add it to the following page:

 http://wiki.ceph.com/CDM_02-MAY-2018

This edition happens at APAC-friendly hours (21:00 EST) and we will
use the following Bluejeans URL for the video conference:

 https://redhat.bluejeans.com/376400604

The meeting details are also available on Ceph Community Calendar:

 
https://calendar.google.com/calendar/b/1?cid=OXRzOWM3bHQ3dTF2aWMyaWp2dnFxbGZwbzBAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ

If you have questions or comments, please let us know.

Kindest regards,

Leo

-- 
Leonardo Vaz
Ceph Community Manager
Open Source and Standards Team
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster can't remapped objects after change crush tree

2018-04-25 Thread Konstantin Shalygin

# ceph osd crush tree
ID  CLASS WEIGHT  TYPE NAME
  -1   3.63835 root default
  -9   0.90959 pod group1
  -5   0.90959 host feather1
   1   hdd 0.90959 osd.1
-10   2.72876 pod group2
  -7   1.81918 host ds1
   2   hdd 0.90959 osd.2
   3   hdd 0.90959 osd.3
  -3   0.90958 host feather0
   0   hdd 0.90958 osd.0

And I've made a rule

# ceph osd crush rule dump pods
{
 "rule_id": 1,
 "rule_name": "pods",
 "ruleset": 1,
 "type": 1,
 "min_size": 1,
 "max_size": 10,
 "steps": [
 {
 "op": "take",
 "item": -1,
 "item_name": "default"
 },
 {
 "op": "chooseleaf_firstn",
 "num": 0,
 "type": "pod"
 },
 {
 "op": "emit"
 }
 ]
}



1. Assign a device class to your crush rule:

ceph osd crush rule create-replicated pods default pod hdd

2. Your crush tree is imbalanced:

*good*:

root:

    host1:

        - osd0

    host2:

        - osd1

    host3:

        - osd3


*bad*:

root:

    host1:

      - osd0

    host2:

        - osd1

        - osd2

        - osd3
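
Once the class-aware rule from step 1 exists it still has to be assigned to
the pool. A minimal sketch (the pool name is a placeholder; if a rule named
"pods" already exists, create the class-aware one under a different name
first):

# ceph osd crush rule ls
# ceph osd pool set <yourpool> crush_rule pods
# ceph osd pool get <yourpool> crush_rule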




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor read performance.

2018-04-25 Thread Christian Balzer
Hello,

On Wed, 25 Apr 2018 17:20:55 -0400 Jonathan Proulx wrote:

> On Wed Apr 25 02:24:19 PDT 2018 Christian Balzer wrote:
> 
> > Hello,  
> 
> > On Tue, 24 Apr 2018 12:52:55 -0400 Jonathan Proulx wrote:  
> 
> > > The performance I really care about is over rbd for VMs in my
> > > OpenStack but 'rbd bench' seems to line up pretty well with 'fio' tests
> > > inside VMs so a more or less typical random write rbd bench (from a
> > > monitor node with 10G connection on same net as osds):
> > >  
> 
> > "rbd bench" does things differently than fio (lots of happy switches
> > there) so to make absolutely sure you're not doing and apples and oranges
> > thing I'd suggest you stick to fio in a VM.  
> 
> There are some tradeoffs, yes, but I get very close results and I figured
> I'd use ceph tools for the ceph list rather than pulling in all the rest of my
> working stack, since the ceph tools do show the problem.
> 
> but I do see your point.
> 
> 
> > In comparison this fio:
> > ---
> > fio --size=1G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 
> > --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32
> > ---  
> 
> > Will only result in this, due to the network latencies of having direct
> > I/O and only having one OSD at a time being busy:
> > ---
> >   write: io=110864KB, bw=1667.2KB/s, iops=416, runt= 66499msec
> > ---  
> 
> I may simply have underestimated the impact of write caching in
> libvirt; that fio command does get me just about as crappy write
> performance as read (which would point to me just needing more IOPS from
> more/faster disks, which is definitely true to a greater or lesser
> extent).
> 
No caching with the direct flag, rather intentionally so.
But yes, these numbers are rather sad.

> WRITE: io=1024.0MB, aggrb=5705KB/s, minb=5705KB/s, maxb=5705KB/s,
>mint=183789msec, maxt=183789msec
> 
> READ: io=1024.0MB, aggrb=4322KB/s, minb=4322KB/s, maxb=4322KB/s,
>   mint=242606msec, maxt=242606msec 
> 
> > That being said, something is rather wrong here indeed, my crappy test
> > cluster shouldn't be able to outperform yours.  
> 
> well, load ... the asymmetry was my main puzzlement, but that may be illusory
> 
Yeah, those 1.7k VMs...

> > > rbd bench  --io-total=4G --io-size 4096 --io-type write \
> > > --io-pattern rand --io-threads 16 mypool/myvol
> > > 
> > > 
> > > 
> > > elapsed:   361  ops:  1048576  ops/sec:  2903.82  bytes/sec: 11894034.98
> > > 
> > > same for random read is an order of magnitude lower:
> > > 
> > > rbd bench  --io-total=4G --io-size 4096 --io-type read \
> > > --io-pattern rand --io-threads 16  mypool/myvol
> > > 
> > > elapsed:  3354  ops:  1048576  ops/sec:   312.60  bytes/sec: 1280403.47
> > > 
> > > (sequencial reads and bigger io-size help but not a lot)
> > > 
> > > ceph -s from during read bench so get a sense of comparing traffic:
> > > 
> > >   cluster:
> > > id: 
> > > health: HEALTH_OK
> > >  
> > >   services:
> > > mon: 3 daemons, quorum ceph-mon0,ceph-mon1,ceph-mon2
> > > mgr: ceph-mon0(active), standbys: ceph-mon2, ceph-mon1
> > > osd: 174 osds: 174 up, 174 in
> > > rgw: 3 daemon active
> > >  
> > >   data:
> > > pools:   19 pools, 10240 pgs
> > > objects: 17342k objects, 80731 GB
> > > usage:   240 TB used, 264 TB / 505 TB avail
> > > pgs: 10240 active+clean
> > >  
> > >   io:
> > > client:   4296 kB/s rd, 417 MB/s wr, 1635 op/s rd, 3518 op/s wr
> > > 
> > > 
> > > During deep-scrubs overnight I can see the disks doing >500MBps reads
> > > and ~150rx/iops (each at peak), while during read bench (including all
> > > traffic from ~1k VMs) individual osd data partitions peak around 25
> > > rx/iops and 1.5MBps rx bandwidth so it seems like there should be
> > > performance to spare.
> > >   
> > OK, there are a couple of things here.
> > 1k VMs?!?  
> 
> Actually 1.7k VMs just now, which caught me a bit by surprise when I
> looked at it.  Many are idle because we don't charge per use
> internally so people are sloppy, but many aren't and even the idle
> ones are writing logs and such.
> 
Indeed. 
And these writes often hit the same object and thus PG over and over, add
a little bad luck from the random fairy and some OSDs get hit much more
than others.

> > One assumes that they're not idle, looking at the output above.
> > And writes will compete with reads on the same spindle of course.
> > "performance to spare" you say, but have you verified this with iostat or
> > atop?  
> 
> This assertion is mostly based on collectd stats that show a spike in
> read ops and bandwidth during our scrub window and no large change in
> write ops or bandwidth.  So I presume the disk *could* do that much
> (at least ops wise) for client traffic as well.
> 
> here's a snap of a 24hr graph from one server (others are similar in
> general shape):
> 
> https://snapshot.raintank.io/dashboard/snapshot/gB3FDPl7uRGWmL17NHNBCuWKGsXdiqlt
> 
Thanks for that and what Blair said, you're hitting the end stops obviously
for 

Re: [ceph-users] Poor read performance.

2018-04-25 Thread Blair Bethwaite
Hi Jon,

On 25 April 2018 at 21:20, Jonathan Proulx  wrote:
>
> here's a snap of 24hr graph form one server (others are similar in
> general shape):
>
> https://snapshot.raintank.io/dashboard/snapshot/gB3FDPl7uRGWmL17NHNBCuWKGsXdiqlt

That's what, a median IOPs of about 80? Pretty high for spinning disk.
I'd guess you're seeing write-choking. You might be able to improve
things a bit by upping your librbd cache size (though obviously that
would only have an effect on new or reset instances), also perhaps
double check your block queue scheduler max_sectors_kb inside a guest
and make sure you're not splitting up all writes into 512 byte chunks.
But it does kinda look like you need more hardware, and fast.
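
For those two checks, a rough sketch (device name, config section and values
are only examples, not recommendations):

Inside the guest:

# cat /sys/block/vda/queue/max_sectors_kb
# cat /sys/block/vda/queue/max_hw_sectors_kb

And on the librbd client side (e.g. the compute nodes), in the [client]
section of ceph.conf:

rbd cache = true
rbd cache size = 67108864
rbd cache max dirty = 50331648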

-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor read performance.

2018-04-25 Thread Jonathan Proulx
On Wed Apr 25 02:24:19 PDT 2018 Christian Balzer wrote:

> Hello,

> On Tue, 24 Apr 2018 12:52:55 -0400 Jonathan Proulx wrote:

> > The performance I really care about is over rbd for VMs in my
> > OpenStack but 'rbd bench' seems to line up pretty well with 'fio' tests
> > inside VMs so a more or less typical random write rbd bench (from a
> > monitor node with 10G connection on same net as osds):
> >

> "rbd bench" does things differently than fio (lots of happy switches
> there) so to make absolutely sure you're not doing and apples and oranges
> thing I'd suggest you stick to fio in a VM.

There are some tradeoffs, yes, but I get very close results and I figured
I'd use ceph tools for the ceph list rather than pulling in all the rest of my
working stack, since the ceph tools do show the problem.

but I do see your point.


> In comparison this fio:
> ---
> fio --size=1G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 
> --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32
> ---

> Will only result in this, due to the network latencies of having direct
> I/O and only having one OSD at a time being busy:
> ---
>   write: io=110864KB, bw=1667.2KB/s, iops=416, runt= 66499msec
> ---

I may simply have underestimated the impact of write caching in
libvirt; that fio command does get me just about as crappy write
performance as read (which would point to me just needing more IOPS from
more/faster disks, which is definitely true to a greater or lesser
extent).

WRITE: io=1024.0MB, aggrb=5705KB/s, minb=5705KB/s, maxb=5705KB/s,
   mint=183789msec, maxt=183789msec

READ: io=1024.0MB, aggrb=4322KB/s, minb=4322KB/s, maxb=4322KB/s,
  mint=242606msec, maxt=242606msec 
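
(The READ line above is presumably from the matching random-read job, i.e.
Christian's fio command with just the --rw flag flipped, something like:

fio --size=1G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 \
    --rw=randread --name=fiojob --blocksize=4K --iodepth=32
)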

> That being said, something is rather wrong here indeed, my crappy test
> cluster shouldn't be able to outperform yours.

well, load ... the asymmetry was my main puzzlement, but that may be illusory

> > rbd bench  --io-total=4G --io-size 4096 --io-type write \
> > --io-pattern rand --io-threads 16 mypool/myvol
> > 
> > 
> > 
> > elapsed:   361  ops:  1048576  ops/sec:  2903.82  bytes/sec: 11894034.98
> > 
> > same for random read is an order of magnitude lower:
> > 
> > rbd bench  --io-total=4G --io-size 4096 --io-type read \
> > --io-pattern rand --io-threads 16  mypool/myvol
> > 
> > elapsed:  3354  ops:  1048576  ops/sec:   312.60  bytes/sec: 1280403.47
> > 
> > (sequencial reads and bigger io-size help but not a lot)
> > 
> > ceph -s from during read bench so get a sense of comparing traffic:
> > 
> >   cluster:
> > id: 
> > health: HEALTH_OK
> >  
> >   services:
> > mon: 3 daemons, quorum ceph-mon0,ceph-mon1,ceph-mon2
> > mgr: ceph-mon0(active), standbys: ceph-mon2, ceph-mon1
> > osd: 174 osds: 174 up, 174 in
> > rgw: 3 daemon active
> >  
> >   data:
> > pools:   19 pools, 10240 pgs
> > objects: 17342k objects, 80731 GB
> > usage:   240 TB used, 264 TB / 505 TB avail
> > pgs: 10240 active+clean
> >  
> >   io:
> > client:   4296 kB/s rd, 417 MB/s wr, 1635 op/s rd, 3518 op/s wr
> > 
> > 
> > During deep-scrubs overnight I can see the disks doing >500MBps reads
> > and ~150rx/iops (each at peak), while during read bench (including all
> > traffic from ~1k VMs) individual osd data partitions peak around 25
> > rx/iops and 1.5MBps rx bandwidth so it seems like there should be
> > performance to spare.
> > 
> OK, there are a couple of things here.
> 1k VMs?!?

Actually 1.7k VMs just now, which caught me a bit by surprise when I
looked at it.  Many are idle because we don't charge per use
internally so people are sloppy, but many aren't and even the idle
ones are writing logs and such.

> One assumes that they're not idle, looking at the output above.
> And writes will compete with reads on the same spindle of course.
> "performance to spare" you say, but have you verified this with iostat or
> atop?

This assertion is mostly based on collectd stats that show a spike in
read ops and bandwidth during our scrub window and no large change in
write ops or bandwidth.  So I presume the disk *could* do that much
(at least ops wise) for client traffic as well.

here's a snap of a 24hr graph from one server (others are similar in
general shape):

https://snapshot.raintank.io/dashboard/snapshot/gB3FDPl7uRGWmL17NHNBCuWKGsXdiqlt

(link good for 7days)

You can clearly see the low read line behind the higher writes jump up
during the scrub window (20:00->02:00 local time here) and a much smaller
bump around the 6am cron.daily thundering herd.

The scrubs do impact performance, which does mean I'm over capacity, as
I should be able to scrub without impacting production, but there's still
a fair amount of capacity used during scrubbing that doesn't seem used
outside it.

But looking harder, the only answer may be "buy hardware", which is a
valid answer.

Thanks,
-Jon
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Backup LUKS/Dmcrypt keys

2018-04-25 Thread Kevin Olbrich
Hi,

how can I back up the dmcrypt keys on luminous?
The folder under /etc/ceph does not exist anymore.
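
As far as I know, ceph-disk on luminous stores the dmcrypt keys in the
monitors' config-key store instead of under /etc/ceph. A sketch of how to
dump them (the key path below is an assumption, please verify it against the
list output):

# ceph config-key list | grep dm-crypt
# ceph config-key get dm-crypt/osd/<osd-uuid>/luks > osd-<id>-luks.key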

Kind regards
Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd reweight (doing -1 or actually -0.0001)

2018-04-25 Thread Paul Emmerich
Hi,


the reweight is internally a number between 0 and 0x10000 for the range 0
to 1.
0.8 is not representable in this number system.
Having an actual floating point number in there would be annoying because
CRUSH needs to be 100% deterministic on all clients (also, no floating
point in the kernel).

osd reweight apparently just echoes whatever you entered here if it can't
be mapped to a whole number:
(Or it rounds differently, not sure and doesn't matter.)

reweight 0.8 --> 0.8 * 0x10000 = 52428.8; is cut off to 52428 == 0xcccc.
So it just prints out 0.8 but stores 0xcccc.

The logic to read 0xcccc back and convert it back to the 0 - 1 range is:
0xcccc / 0x10000 = 0.79998779296 which is printed as 0.79999.

Anyways, the important value is what is actually stored and that's 0xcccc.
It could be argued that "osd reweight" should convert 0.8 to 0xcccd
(0.80000305175), i.e. to round instead of cut off.
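
For reference, the same arithmetic in a shell (nothing Ceph-specific, just
the fixed-point conversion described above):

$ printf '0x%x = %d\n' $(( 8 * 0x10000 / 10 )) $(( 8 * 0x10000 / 10 ))
0xcccc = 52428
$ echo 'scale=11; 52428 / 65536' | bc
.79998779296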


Paul

2018-04-25 12:33 GMT+02:00 Marc Roos :

>
> Makes me also wonder what is actually being used by ceph? And thus which
> one is wrong 'ceph osd reweight' output or 'ceph osd df' output.
>
> -Original Message-
> From: Marc Roos
> Sent: woensdag 25 april 2018 11:58
> To: ceph-users
> Subject: [ceph-users] ceph osd reweight (doing -1 or actually -0.0001)
>
>
> Is there some logic behind why ceph is doing this -1, or is this some
> coding error?
>
> 0.8 gives 0.79999, and 0.80001 gives 0.80000
>
> (ceph 12.2.4)
>
>
> [@~]# ceph osd reweight 11 0.8
> reweighted osd.11 to 0.8 (cccc)
>
> [@~]# ceph osd df
> ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
> 11   hdd 3.00000  0.79999  2794G  2503G   290G 89.59 1.33  38
>
>
>
> [@~]# ceph osd reweight 11 0.80001
> reweighted osd.11 to 0.80001 (cccd)
>
> [@~]# ceph osd df
> ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
> 11   hdd 3.00000  0.80000  2794G  2503G   290G 89.59 1.33  38
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster degraded after Ceph Upgrade 12.2.1 => 12.2.2

2018-04-25 Thread Ronny Aasen
The difference in cost between 2 and 3 servers is not HUGE, but the 
reliability difference between a size 2/1 pool and a 3/2 pool is 
massive. A 2/1 pool is just a single fault during maintenance away from 
data loss, while you need multiple simultaneous faults, and very bad 
luck, to break a 3/2 pool.


I would rather recommend using 2/2 pools if you are willing to accept a 
little downtime when a disk dies. The cluster io would stop until the 
disks backfill to cover for the lost disk, but that is better than having 
inconsistent pg's or data loss because a disk crashed during a routine 
reboot, or because 2 disks failed at once.
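
As a sketch, switching an existing pool from 2/1 to 2/2 (the pool name is a
placeholder):

# ceph osd pool set <pool> size 2
# ceph osd pool set <pool> min_size 2
# ceph osd pool ls detail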


Also worth reading is this link, a good explanation: 
https://www.spinics.net/lists/ceph-users/msg32895.html


You have good backups and are willing to restore the whole pool, and it 
is of course your privilege to run 2/1 pools, but be mindful of the 
risks of doing so.



kind regards
Ronny Aasen

BTW: I did not know Ubuntu automagically rebooted after an upgrade. You 
can probably avoid that reboot somehow in Ubuntu and do the restarts of 
services manually, if you wish to maintain service during the upgrade.
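
If the reboot is coming from unattended-upgrades, it can presumably be turned
off there (a sketch, file locations as on stock Ubuntu):

# grep -r Automatic-Reboot /etc/apt/apt.conf.d/

and in /etc/apt/apt.conf.d/50unattended-upgrades set:

Unattended-Upgrade::Automatic-Reboot "false";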





On 25.04.2018 11:52, Ranjan Ghosh wrote:
Thanks a lot for your detailed answer. The problem for us, however, 
was that we use the Ceph packages that come with the Ubuntu 
distribution. If you do a Ubuntu upgrade, all packages are upgraded in 
one go and the server is rebooted. You cannot influence anything or 
start/stop services one-by-one etc. This was concerning me, because the 
upgrade instructions didn't mention anything about an alternative or 
what to do in this case. But someone here enlightened me that - in 
general - it all doesn't matter that much *if you are just accepting a 
downtime*. And, indeed, it all worked nicely. We stopped all services 
on all servers, upgraded the Ubuntu version, rebooted all servers and 
were ready to go again. Didn't encounter any problems there. The only 
problem turned out to be our own fault and simply a firewall 
misconfiguration.


And, yes, we're running a "size:2 min_size:1" because we're on a very 
tight budget. If I understand correctly, this means: Make changes of 
files to one server. *Eventually* copy them to the other server. I 
hope this *eventually* means after a few minutes. Up until now I've 
never experienced *any* problems with file integrity with this 
configuration. In fact, Ceph is incredibly stable. Amazing. I have 
never ever had any issues whatsoever with broken files/partially 
written files, files that contain garbage etc. Even after 
starting/stopping services, rebooting etc. With GlusterFS and other 
Cluster file system I've experienced many such problems over the 
years, so this is what makes Ceph so great. I have now a lot of trust 
in Ceph, that it will eventually repair everything :-) And: If a file 
that has been written a few seconds ago is really lost it wouldn't be 
that bad for our use-case. It's a web-server. Most important stuff is 
in the DB. We have hourly backups of everything. In a huge emergency, 
we could even restore the backup from an hour ago if we really had to. 
Not nice, but if it happens every 6 years or sth due to some freak 
hardware failure, I think it is manageable. I accept it's not the 
recommended/perfect solution if you have infinite amounts of money at 
your hands, but in our case, I think it's not extremely audacious 
either to do it like this, right?



Am 11.04.2018 um 19:25 schrieb Ronny Aasen:

ceph upgrades are usually not a problem:
ceph have to be upgraded in the right order. normally when each 
service is on its own machine this is not difficult.
but when you have mon, mgr, osd, mds, and klients on the same host 
you have to do it a bit carefully..


i tend to have a terminal open with "watch ceph -s" running, and i 
never do another service until the health is ok again.


first apt upgrade the packages on all the hosts. This only update the 
software on disk and not the running services.
then do the restart of services in the right order.  and only on one 
host at the time


mons: first you restart the mon service on all mon running hosts.
all the 3 mons are active at the same time, so there is no "shifting 
around" but make sure the quorum is ok again before you do the next mon.


mgr: then restart mgr on all hosts that run mgr. there is only one 
active mgr at the time now, so here there will be a bit of shifting 
around. but it is only for statistics/management so it may affect 
your ceph -s command, but not the cluster operation.


osd: restart osd processes one osd at the time, make sure health_ok 
before doing the next osd process. do this for all hosts that have osd's


mds: restart mds's one at the time. you will notice the standby mds 
taking over for the mds that was restarted. do both.


klients: restart clients, that means remount filesystems, migrate or 
restart vm's. or restart whatever process uses the old ceph libraries.
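
A minimal sketch of that restart order with systemd unit names (assuming the
usual ceph-<daemon>@<id> units and that the mon/mgr/mds ids are the short
hostnames; watch "ceph -s" between every step as described above):

# systemctl restart ceph-mon@$(hostname -s)     (one mon host at a time)
# ceph quorum_status | grep quorum_names        (wait for full quorum again)
# systemctl restart ceph-mgr@$(hostname -s)     (one mgr host at a time)
# systemctl restart ceph-osd@<id>               (one osd at a time, wait for HEALTH_OK)
# systemctl restart ceph-mds@$(hostname -s)     (one mds at a time)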



about pools:
since you only have 2 osd's you 

[ceph-users] Is RDMA Worth Exploring? Howto ?

2018-04-25 Thread Paul Kunicki
I have a working Luminous 12.2.4 cluster on CentOS 7.4, connected via 10G and
Mellanox ConnectX-3 QDR IB, and would like to know if there are any
worthwhile gains to be had from enabling RDMA, and if there are any good,
up-to-date docs on how to do so?
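
For reference, my understanding is that the async messenger can be switched
to RDMA with settings along these lines in ceph.conf, but that this is still
considered experimental in luminous (option names should be double-checked
against the docs, and the device name is only an example):

ms_type = async+rdma
ms_async_rdma_device_name = mlx4_0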



Thanks.



  Paul Kunicki
  Systems Manager
  SproutLoud Media Networks, LLC.
  954-476-6211 ext. 144
  pkuni...@sproutloud.com
  www.sproutloud.com

  The information contained in this communication is intended solely
  for the use of the individual or entity to whom it is addressed and for
  others authorized to receive it. It may contain confidential or legally
  privileged information. If you are not the intended recipient, you are
  hereby notified that any disclosure, copying, distribution, or taking any
  action in reliance on these contents is strictly prohibited and may be
  unlawful. In the event the recipient or recipients of this communication
  are under a non-disclosure agreement, any and all information discussed
  during phone calls and online presentations fall under the agreements
  signed by both parties. If you received this communication in
error, please
  notify us immediately by responding to this e-mail.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cluster can't remapped objects after change crush tree

2018-04-25 Thread Igor Gajsin
Hi, I've got stuck on a problem with a crush rule.
I have a small cluster with 3 nodes and 4 osds. I've decided to split
it into 2 failure domains, so I made 2 buckets and put the hosts into those
buckets as in this instruction:
http://www.sebastien-han.fr/blog/2014/01/13/ceph-managing-crush-with-the-cli/

Finally, I've got crush tree like

# ceph osd crush tree
ID  CLASS WEIGHT  TYPE NAME
 -1   3.63835 root default
 -9   0.90959 pod group1
 -5   0.90959 host feather1
  1   hdd 0.90959 osd.1
-10   2.72876 pod group2
 -7   1.81918 host ds1
  2   hdd 0.90959 osd.2
  3   hdd 0.90959 osd.3
 -3   0.90958 host feather0
  0   hdd 0.90958 osd.0

And I've made a rule

# ceph osd crush rule dump pods
{
"rule_id": 1,
"rule_name": "pods",
"ruleset": 1,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "pod"
},
{
"op": "emit"
}
]
}

If I apply that rule to a pool, my cluster moves to

# ceph -s
 cluster:
id: 34b66329-b511-4d97-9e07-7b1a0a6879ef
health: HEALTH_WARN
6/42198 objects misplaced (0.014%)

  services:
mon: 3 daemons, quorum feather0,feather1,ds1
mgr: ds1(active), standbys: feather1, feather0
mds: cephfs-1/1/1 up  {0=feather0=up:active}, 2 up:standby
osd: 4 osds: 4 up, 4 in; 64 remapped pgs
rgw: 3 daemons active

  data:
pools:   8 pools, 264 pgs
objects: 14066 objects, 49429 MB
usage:   142 GB used, 3582 GB / 3725 GB avail
pgs: 6/42198 objects misplaced (0.014%)
 200 active+clean
 64  active+clean+remapped

  io:
client:   1897 kB/s wr, 0 op/s rd, 11 op/s wr

And it's frozen in that state; self-healing doesn't occur, it is just stuck in
the state: objects misplaced / pgs active+clean+remapped.

I think something is wrong with my rule, and the cluster can't move objects
to rearrange them according to the new rule. I have missed something and I have
no idea what exactly. Any help would be appreciated.
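
One way to check what the rule actually does is to test it offline. I suspect
that with only two "pod" buckets, a rule that picks one leaf per pod can hand
out at most two OSDs, so a pool of size 3 can never be placed as requested,
which would explain the permanently remapped PGs. A sketch (rule id 1,
3 replicas):

# ceph osd getcrushmap -o crushmap.bin
# crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-mappings | head
# crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-bad-mappings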

--
With best regards,
Igor Gajsin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd reweight (doing -1 or actually -0.0001)

2018-04-25 Thread Marc Roos
 
Makes me also wonder what is actually being used by ceph, and thus which 
one is wrong: the 'ceph osd reweight' output or the 'ceph osd df' output. 

-Original Message-
From: Marc Roos 
Sent: woensdag 25 april 2018 11:58
To: ceph-users
Subject: [ceph-users] ceph osd reweight (doing -1 or actually -0.0001)


Is there some logic behind why ceph is doing this -1, or is this some 
coding error?

0.8 gives 0.79999, and 0.80001 gives 0.80000

(ceph 12.2.4)


[@~]# ceph osd reweight 11 0.8
reweighted osd.11 to 0.8 (cccc)

[@~]# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
11   hdd 3.00000  0.79999  2794G  2503G   290G 89.59 1.33  38



[@~]# ceph osd reweight 11 0.80001
reweighted osd.11 to 0.80001 (cccd)

[@~]# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR  PGS
11   hdd 3.0  0.8  2794G  2503G   290G 89.59 1.33  38




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Integrating XEN Server : Long query time for "rbd ls -l" queries

2018-04-25 Thread Jason Dillaman
Since I cannot reproduce your issue, can you generate a perf CPU flame
graph on this to figure out where the user time is being spent?
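
For reference, one way to produce such a flame graph (a sketch using Brendan
Gregg's FlameGraph scripts; pool name and paths are placeholders):

# perf record -F 99 --call-graph dwarf -- rbd ls -l --pool <pool>
# perf script > rbd-ls.perf
# git clone https://github.com/brendangregg/FlameGraph
# ./FlameGraph/stackcollapse-perf.pl rbd-ls.perf | ./FlameGraph/flamegraph.pl > rbd-ls.svg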

On Wed, Apr 25, 2018 at 11:25 AM, Marc Schöchlin  wrote:
> Hello Jason,
>
> according to this, latency between client and osd should not be the problem:
> (the high amount of user time in the measure above, network
> communication should not be the problem)
>
> Finding the involved osd:
>
> # ceph osd map RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c
> rbd_directory
> osdmap e7570 pool 'RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c'
> (14) object 'rbd_directory' -> pg 14.30a98c1c (14.1c) -> up ([36,0,38],
> p36) acting ([36,0,38], p36)
>
> # ceph osd find osd.36
> {
> "osd": 36,
> "ip": "10.23.27.149:6826/7195",
> "crush_location": {
> "host": "ceph-ssd-s39",
> "root": "default"
> }
> }
>
> ssh ceph-ssd-s39
>
> # nuttcp -w1m ceph-mon-s43
> 11186.3391 MB /  10.00 sec = 9381.8890 Mbps 12 %TX 32 %RX 0 retrans 0.15
> msRTT
>
> # time rbd ls -l -p RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c
> --rbd_concurrent_management_ops=1 --id xen_test
> NAMESIZE
> PARENT
> FMT PROT LOCK
> RBD-0192938e-cb4b-4ee1-9988-b8145704ac81  20480M
> RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-8b2cfe76-44b7-4393-b376-f675366831c3@BASE
> 2
> RBD-0192938e-cb4b-4ee1-9988-b8145704ac81@BASE 20480M
> RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-8b2cfe76-44b7-4393-b376-f675366831c3@BASE
> 2 yes
> ...
> RBD-feb32ab0-a5ee-44e6-9089-486e91ee8af3  20480M
> RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-bbbc2ce0-4ad3-44ae-a52f-e57df0441e27@BASE
> 2
> __srlock__
> 0
> 2
>
> real0m23.667s
> user0m15.949s
> sys0m1.276s
>
> # time rbd ls -l -p RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c
> --rbd_concurrent_management_ops=1 --id xen_test
> NAMESIZE
> PARENT
> FMT PROT LOCK
> RBD-0192938e-cb4b-4ee1-9988-b8145704ac81  20480M
> RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-8b2cfe76-44b7-4393-b376-f675366831c3@BASE
> 2
> RBD-0192938e-cb4b-4ee1-9988-b8145704ac81@BASE 20480M
> RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-8b2cfe76-44b7-4393-b376-f675366831c3@BASE
> 2 yes
> ...
> RBD-feb32ab0-a5ee-44e6-9089-486e91ee8af3  20480M
> RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-bbbc2ce0-4ad3-44ae-a52f-e57df0441e27@BASE
> 2
>
> 
> __srlock__
> 0
> 2
>
> real0m13.937s
> user0m14.404s
> sys0m1.089s
>
> Regards
> Marc
>
>
> Am 25.04.2018 um 16:38 schrieb Jason Dillaman:
>> I'd check your latency between your client and your cluster. On my
>> development machine w/ only a single OSD running and 200 clones, each
>> with 1 snapshot, "rbd -l" only takes a couple seconds for me:
>>
>> $ time rbd ls -l --rbd_concurrent_management_ops=1 | wc -l
>> 403
>>
>> real 0m1.746s
>> user 0m1.136s
>> sys 0m0.169s
>>
>> Also, I have to ask, but how often are you expecting to scrape the
>> images from pool? The long directory list involves opening each image
>> in the pool (which involves numerous round-trips to the OSDs) plus
>> iterating through each snapshot (which also involves round-trips).
>>
>> On Wed, Apr 25, 2018 at 10:13 AM, Marc Schöchlin  wrote:
>>> Hello Piotr,
>>>
>>> i updated the issue.
>>> (https://tracker.ceph.com/issues/23853?next_issue_id=23852_issue_id=23854)
>>>
>>> # time rbd ls -l --pool
>>> RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c
>>> --rbd_concurrent_management_ops=1
>>> NAMESIZE PARENT
>>>
>>> RBD-feb32ab0-a5ee-44e6-9089-486e91ee8af3  20480M
>>> RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-bbbc2ce0-4ad3-44ae-a52f-e57df0441e27@BASE
>>> 2
>>> __srlock__
>>> 0
>>> 2
>>> 
>>> real0m18.562s
>>> user0m12.513s
>>> sys0m0.793s
>>>
>>> I also attached a json dump of my pool structure.
>>>
>>> Regards
>>> Marc
>>>
>>> Am 25.04.2018 um 14:46 schrieb Piotr Dałek:
 On 18-04-25 02:29 PM, Marc Schöchlin wrote:
> Hello list,
>
> we are trying to integrate a storage repository in xenserver.
> (i also describe the problem as a issue in the ceph bugtracker:
> https://tracker.ceph.com/issues/23853)
>
> Summary:
>
> The slowness is a real pain for us, because this prevents the xen
> storage repository to work efficently.
> Gathering information for XEN Pools with hundreds of virtual machines
> (using "--format json") would be a real pain...
> The high user time consumption and the really huge amount of threads
> suggests that there is something really inefficient in the "rbd"
> utility.
>
> So what can i do to make "rbd ls -l" faster or to get comparable
> information regarding snapshot hierarchy information?
 Can you run this command with extra argument
 "--rbd_concurrent_management_ops=1" and 

Re: [ceph-users] Integrating XEN Server : Long query time for "rbd ls -l" queries

2018-04-25 Thread Marc Schöchlin
Hello Jason,

according to this, latency between client and osd should not be the problem
(given the high amount of user time in the measurement above, network
communication should not be the problem):

Finding the involved osd:

# ceph osd map RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c
rbd_directory
osdmap e7570 pool 'RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c'
(14) object 'rbd_directory' -> pg 14.30a98c1c (14.1c) -> up ([36,0,38],
p36) acting ([36,0,38], p36)

# ceph osd find osd.36
{
    "osd": 36,
    "ip": "10.23.27.149:6826/7195",
    "crush_location": {
    "host": "ceph-ssd-s39",
    "root": "default"
    }
}

ssh ceph-ssd-s39

# nuttcp -w1m ceph-mon-s43
11186.3391 MB /  10.00 sec = 9381.8890 Mbps 12 %TX 32 %RX 0 retrans 0.15
msRTT

# time rbd ls -l -p RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c
--rbd_concurrent_management_ops=1 --id xen_test
NAME    SIZE
PARENT  
 
FMT PROT LOCK
RBD-0192938e-cb4b-4ee1-9988-b8145704ac81  20480M
RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-8b2cfe76-44b7-4393-b376-f675366831c3@BASE
  
2  
RBD-0192938e-cb4b-4ee1-9988-b8145704ac81@BASE 20480M
RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-8b2cfe76-44b7-4393-b376-f675366831c3@BASE
  
2 yes  
...
RBD-feb32ab0-a5ee-44e6-9089-486e91ee8af3  20480M
RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-bbbc2ce0-4ad3-44ae-a52f-e57df0441e27@BASE
  
2  
__srlock__
0   
 
2  

real    0m23.667s
user    0m15.949s
sys    0m1.276s

# time rbd ls -l -p RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c
--rbd_concurrent_management_ops=1 --id xen_test
NAME    SIZE
PARENT  
 
FMT PROT LOCK
RBD-0192938e-cb4b-4ee1-9988-b8145704ac81  20480M
RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-8b2cfe76-44b7-4393-b376-f675366831c3@BASE
  
2 
RBD-0192938e-cb4b-4ee1-9988-b8145704ac81@BASE 20480M
RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-8b2cfe76-44b7-4393-b376-f675366831c3@BASE
  
2 yes 
...
RBD-feb32ab0-a5ee-44e6-9089-486e91ee8af3  20480M
RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-bbbc2ce0-4ad3-44ae-a52f-e57df0441e27@BASE
  
2


__srlock__
0   
 
2  

real    0m13.937s
user    0m14.404s
sys    0m1.089s

Regards
Marc


Am 25.04.2018 um 16:38 schrieb Jason Dillaman:
> I'd check your latency between your client and your cluster. On my
> development machine w/ only a single OSD running and 200 clones, each
> with 1 snapshot, "rbd -l" only takes a couple seconds for me:
>
> $ time rbd ls -l --rbd_concurrent_management_ops=1 | wc -l
> 403
>
> real 0m1.746s
> user 0m1.136s
> sys 0m0.169s
>
> Also, I have to ask, but how often are you expecting to scrape the
> images from pool? The long directory list involves opening each image
> in the pool (which involves numerous round-trips to the OSDs) plus
> iterating through each snapshot (which also involves round-trips).
>
> On Wed, Apr 25, 2018 at 10:13 AM, Marc Schöchlin  wrote:
>> Hello Piotr,
>>
>> i updated the issue.
>> (https://tracker.ceph.com/issues/23853?next_issue_id=23852_issue_id=23854)
>>
>> # time rbd ls -l --pool
>> RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c
>> --rbd_concurrent_management_ops=1
>> NAMESIZE PARENT
>>
>> RBD-feb32ab0-a5ee-44e6-9089-486e91ee8af3  20480M
>> RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-bbbc2ce0-4ad3-44ae-a52f-e57df0441e27@BASE
>> 2
>> __srlock__
>> 0
>> 2
>> 
>> real0m18.562s
>> user0m12.513s
>> sys0m0.793s
>>
>> I also attached a json dump of my pool structure.
>>
>> Regards
>> Marc
>>
>> Am 25.04.2018 um 14:46 schrieb Piotr Dałek:
>>> On 18-04-25 02:29 PM, Marc Schöchlin wrote:
 Hello list,

 we are trying to integrate a storage repository in xenserver.
 (i also describe the problem as a issue in the ceph bugtracker:
 https://tracker.ceph.com/issues/23853)

 Summary:

 The slowness is a real pain for us, because this prevents the xen
 storage repository to work efficently.
 Gathering information for XEN Pools with hundreds of virtual machines
 (using "--format json") would be a real pain...
 The high user time consumption and the really huge amount of threads
 suggests that there is something really inefficient in the "rbd"
 utility.

 So what can i do to make "rbd ls -l" faster or to get comparable
 information regarding snapshot 

[ceph-users] ceph osd reweight (doing -1 or actually -0.0001)

2018-04-25 Thread Marc Roos

Is there some logic behind why ceph is doing this -1, or is this some 
coding error?

0.8 gives 0.79999, and 0.80001 gives 0.80000

(ceph 12.2.4)


[@~]# ceph osd reweight 11 0.8
reweighted osd.11 to 0.8 (cccc)

[@~]# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
11   hdd 3.00000  0.79999  2794G  2503G   290G 89.59 1.33  38



[@~]# ceph osd reweight 11 0.80001
reweighted osd.11 to 0.80001 (cccd)

[@~]# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
11   hdd 3.00000  0.80000  2794G  2503G   290G 89.59 1.33  38




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Blocked Requests

2018-04-25 Thread Shantur Rathore
Hi all,

So, using ceph-ansible, I built the below-mentioned cluster with 2 OSD
nodes and 3 mons.
Just after creating the osds I started benchmarking the performance using
"rbd bench" and "rados bench" and started seeing the performance drop.
Checking the status shows slow requests.


[root@storage-28-1 ~]# ceph -s
  cluster:
id: 009cbed0-e5a8-4b18-a313-098e55742e85
health: HEALTH_WARN
insufficient standby MDS daemons available
1264 slow requests are blocked > 32 sec

  services:
mon: 3 daemons, quorum storage-30,storage-29,storage-28-1
mgr: storage-30(active), standbys: storage-28-1, storage-29
mds: cephfs-3/3/3 up
{0=storage-30=up:active,1=storage-28-1=up:active,2=storage-29=up:active}
osd: 33 osds: 33 up, 33 in
tcmu-runner: 2 daemons active

  data:
pools:   3 pools, 1536 pgs
objects: 13289 objects, 42881 MB
usage:   102 GB used, 55229 GB / 55331 GB avail
pgs: 1536 active+clean

  io:
client:   1694 B/s rd, 1 op/s rd, 0 op/s wr



[root@storage-28-1 ~]# ceph health detail
HEALTH_WARN insufficient standby MDS daemons available; 904 slow
requests are blocked > 32 sec
MDS_INSUFFICIENT_STANDBY insufficient standby MDS daemons available
have 0; want 1 more
REQUEST_SLOW 904 slow requests are blocked > 32 sec
364 ops are blocked > 1048.58 sec
212 ops are blocked > 524.288 sec
164 ops are blocked > 262.144 sec
100 ops are blocked > 131.072 sec
64 ops are blocked > 65.536 sec
osd.11 has blocked requests > 524.288 sec
osds 9,32 have blocked requests > 1048.58 sec


osd 9 log : https://pastebin.com/ex41cFww

I see that from time to time different osds are reporting blocked
requests. I am not sure what could be the cause of this. Can anyone
help me fix this please.
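
A few checks that usually help to narrow this down (a sketch; the admin
socket commands have to be run on the host that owns the flagged OSD):

# ceph daemon osd.9 dump_ops_in_flight      (what is stuck right now, and in which phase)
# ceph daemon osd.9 dump_historic_ops       (recent slow ops with per-phase timings)
# ceph osd perf                             (commit/apply latency of all osds)
# iostat -x 5                               (is the underlying disk saturated?)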

[root@storage-28-1 ~]# ceph osd tree
ID CLASS WEIGHT   TYPE NAME   STATUS REWEIGHT PRI-AFF
-1   54.03387 root default
-3   27.83563 host storage-29
 2   hdd  1.63739 osd.2   up  1.0 1.0
 3   hdd  1.63739 osd.3   up  1.0 1.0
 4   hdd  1.63739 osd.4   up  1.0 1.0
 5   hdd  1.63739 osd.5   up  1.0 1.0
 6   hdd  1.63739 osd.6   up  1.0 1.0
 7   hdd  1.63739 osd.7   up  1.0 1.0
 8   hdd  1.63739 osd.8   up  1.0 1.0
 9   hdd  1.63739 osd.9   up  1.0 1.0
10   hdd  1.63739 osd.10  up  1.0 1.0
11   hdd  1.63739 osd.11  up  1.0 1.0
12   hdd  1.63739 osd.12  up  1.0 1.0
13   hdd  1.63739 osd.13  up  1.0 1.0
14   hdd  1.63739 osd.14  up  1.0 1.0
15   hdd  1.63739 osd.15  up  1.0 1.0
16   hdd  1.63739 osd.16  up  1.0 1.0
17   hdd  1.63739 osd.17  up  1.0 1.0
18   hdd  1.63739 osd.18  up  1.0 1.0
-5   26.19824 host storage-30
 0   hdd  1.63739 osd.0   up  1.0 1.0
 1   hdd  1.63739 osd.1   up  1.0 1.0
19   hdd  1.63739 osd.19  up  1.0 1.0
20   hdd  1.63739 osd.20  up  1.0 1.0
21   hdd  1.63739 osd.21  up  1.0 1.0
22   hdd  1.63739 osd.22  up  1.0 1.0
23   hdd  1.63739 osd.23  up  1.0 1.0
24   hdd  1.63739 osd.24  up  1.0 1.0
25   hdd  1.63739 osd.25  up  1.0 1.0
26   hdd  1.63739 osd.26  up  1.0 1.0
27   hdd  1.63739 osd.27  up  1.0 1.0
28   hdd  1.63739 osd.28  up  1.0 1.0
29   hdd  1.63739 osd.29  up  1.0 1.0
30   hdd  1.63739 osd.30  up  1.0 1.0
31   hdd  1.63739 osd.31  up  1.0 1.0
32   hdd  1.63739 osd.32  up  1.0 1.0

thanks


On Fri, Apr 20, 2018 at 10:24 AM, Shantur Rathore
 wrote:
>
> Thanks Alfredo.  I will use ceph-volume.
>
> On Thu, Apr 19, 2018 at 4:24 PM, Alfredo Deza  wrote:
>>
>> On Thu, Apr 19, 2018 at 11:10 AM, Shantur Rathore
>>  wrote:
>> > Hi,
>> >
>> > I am building my first Ceph cluster from hardware leftover from a previous
>> > project. I have been reading a lot of Ceph documentation but need some help
>> > to make sure I going the right way.
>> > To set the stage below is what I have
>> >
>> > Rack-1
>> >
>> > 1 x HP DL360 G9 with
>> >- 256 GB Memory
>> >- 5 x 300GB HDD
>> >- 2 x HBA SAS
>> >- 4 x 10GBe Networking Card
>> >
>> > 1 x SuperMicro chassis with 17 x HP Enterprise 400GB SSD and 17 x HP
>> > Enterprise 1.7TB HDD
>> > Chassis and HP server are connected with 2 x SAS HBA for redundancy.
>> >
>> >
>> > 

Re: [ceph-users] Integrating XEN Server : Long query time for "rbd ls -l" queries

2018-04-25 Thread Jason Dillaman
I'd check your latency between your client and your cluster. On my
development machine w/ only a single OSD running and 200 clones, each
with 1 snapshot, "rbd -l" only takes a couple seconds for me:

$ time rbd ls -l --rbd_concurrent_management_ops=1 | wc -l
403

real 0m1.746s
user 0m1.136s
sys 0m0.169s

Also, I have to ask, but how often are you expecting to scrape the
images from pool? The long directory list involves opening each image
in the pool (which involves numerous round-trips to the OSDs) plus
iterating through each snapshot (which also involves round-trips).

On Wed, Apr 25, 2018 at 10:13 AM, Marc Schöchlin  wrote:
> Hello Piotr,
>
> i updated the issue.
> (https://tracker.ceph.com/issues/23853?next_issue_id=23852_issue_id=23854)
>
> # time rbd ls -l --pool
> RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c
> --rbd_concurrent_management_ops=1
> NAMESIZE PARENT
>
> RBD-feb32ab0-a5ee-44e6-9089-486e91ee8af3  20480M
> RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-bbbc2ce0-4ad3-44ae-a52f-e57df0441e27@BASE
> 2
> __srlock__
> 0
> 2
> 
> real0m18.562s
> user0m12.513s
> sys0m0.793s
>
> I also attached a json dump of my pool structure.
>
> Regards
> Marc
>
> Am 25.04.2018 um 14:46 schrieb Piotr Dałek:
>> On 18-04-25 02:29 PM, Marc Schöchlin wrote:
>>> Hello list,
>>>
>>> we are trying to integrate a storage repository in xenserver.
>>> (i also describe the problem as a issue in the ceph bugtracker:
>>> https://tracker.ceph.com/issues/23853)
>>>
>>> Summary:
>>>
>>> The slowness is a real pain for us, because this prevents the xen
>>> storage repository to work efficently.
>>> Gathering information for XEN Pools with hundreds of virtual machines
>>> (using "--format json") would be a real pain...
>>> The high user time consumption and the really huge amount of threads
>>> suggests that there is something really inefficient in the "rbd"
>>> utility.
>>>
>>> So what can i do to make "rbd ls -l" faster or to get comparable
>>> information regarding snapshot hierarchy information?
>>
>> Can you run this command with extra argument
>> "--rbd_concurrent_management_ops=1" and share the timing of that?
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] trimming the MON level db

2018-04-25 Thread Luis Periquito
Hi all,

we have a (really) big cluster that's undergoing a very bad move and the
monitor database is growing at an alarming rate.

The cluster is running jewel (10.2.7); is there any way to trim the
monitor database before it gets back to HEALTH_OK?

I've searched and so far only found people saying not really, but just
wanted a final sanity check...
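
For what it's worth, the only knobs I'm aware of are compaction, which only
reclaims space from data that is already deletable; the old maps themselves
are normally kept until the PGs are all clean again. A sketch:

# du -sh /var/lib/ceph/mon/*/store.db
# ceph tell mon.<id> compact

or, in the [mon] section of ceph.conf, compact at every daemon start:

mon compact on start = true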

thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Integrating XEN Server : Long query time for "rbd ls -l" queries

2018-04-25 Thread Marc Schöchlin
Hello Piotr,

i updated the issue.
(https://tracker.ceph.com/issues/23853?next_issue_id=23852_issue_id=23854)

# time rbd ls -l --pool
RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c
--rbd_concurrent_management_ops=1
NAME    SIZE PARENT   

RBD-feb32ab0-a5ee-44e6-9089-486e91ee8af3  20480M
RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-bbbc2ce0-4ad3-44ae-a52f-e57df0441e27@BASE
  
2  
__srlock__
0   
 
2  

real    0m18.562s
user    0m12.513s
sys    0m0.793s

I also attached a json dump of my pool structure.

Regards
Marc

Am 25.04.2018 um 14:46 schrieb Piotr Dałek:
> On 18-04-25 02:29 PM, Marc Schöchlin wrote:
>> Hello list,
>>
>> we are trying to integrate a storage repository in xenserver.
>> (i also describe the problem as a issue in the ceph bugtracker:
>> https://tracker.ceph.com/issues/23853)
>>
>> Summary:
>>
>> The slowness is a real pain for us, because this prevents the xen
>> storage repository to work efficently.
>> Gathering information for XEN Pools with hundreds of virtual machines
>> (using "--format json") would be a real pain...
>> The high user time consumption and the really huge amount of threads
>> suggests that there is something really inefficient in the "rbd"
>> utility.
>>
>> So what can i do to make "rbd ls -l" faster or to get comparable
>> information regarding snapshot hierarchy information?
>
> Can you run this command with extra argument
> "--rbd_concurrent_management_ops=1" and share the timing of that?
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Integrating XEN Server : Long query time for "rbd ls -l" queries

2018-04-25 Thread Piotr Dałek

On 18-04-25 02:29 PM, Marc Schöchlin wrote:

Hello list,

we are trying to integrate a storage repository in xenserver.
(i also describe the problem as a issue in the ceph bugtracker:
https://tracker.ceph.com/issues/23853)

Summary:

The slowness is a real pain for us, because this prevents the xen
storage repository to work efficently.
Gathering information for XEN Pools with hundreds of virtual machines
(using "--format json") would be a real pain...
The high user time consumption and the really huge amount of threads
suggests that there is something really inefficient in the "rbd" utility.

So what can i do to make "rbd ls -l" faster or to get comparable
information regarding snapshot hierarchy information?


Can you run this command with extra argument 
"--rbd_concurrent_management_ops=1" and share the timing of that?


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 12.2.4 MGR spams syslog with "mon failed to return metadata for mds"

2018-04-25 Thread Charles Alva
Hi John,

The "ceph mds metadata mds1" produced "Error ENOENT:". Querying mds
metadata to mds2 and mds3 worked as expected. It seemed, only the active
MDS could not be queried by Ceph MGR.

I also stated wrong that Ceph MGR spamming the syslog, it should be the
ceph-mgr log itself, sorry for the confusion.


# ceph -s
  cluster:
id: b63f4ca1-f5e1-4ac1-a6fc-5ab70c65864a
health: HEALTH_OK

  services:
mon: 3 daemons, quorum mon1,mon2,mon3
mgr: mon1(active), standbys: mon2, mon3
mds: cephfs-1/1/1 up  {*0=mds1=up:active*}, 2 up:standby
osd: 14 osds: 14 up, 14 in
rgw: 3 daemons active

  data:
pools:   10 pools, 248 pgs
objects: 583k objects, 2265 GB
usage:   6816 GB used, 6223 GB / 13039 GB avail
pgs: 247 active+clean
 1   active+clean+scrubbing+deep

  io:
client:   115 kB/s rd, 759 kB/s wr, 22 op/s rd, 24 op/s wr

# ceph mds metadata mds1
*Error ENOENT:*

# ceph mds metadata mds2
{
"addr": "10.100.100.115:6800/1861195236",
"arch": "x86_64",
"ceph_version": "ceph version 12.2.4
(52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)",
"cpu": "Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz",
"distro": "ubuntu",
"distro_description": "Ubuntu 16.04.4 LTS",
"distro_version": "16.04",
"hostname": "mds2",
"kernel_description": "#1 SMP PVE 4.13.13-40 (Fri, 16 Feb 2018 09:51:20
+0100)",
"kernel_version": "4.13.13-6-pve",
"mem_swap_kb": "2048000",
"mem_total_kb": "2048000",
"os": "Linux"
}

# ceph mds metadata mds3
{
"addr": "10.100.100.116:6800/4180418633",
"arch": "x86_64",
"ceph_version": "ceph version 12.2.4
(52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)",
"cpu": "Intel(R) Xeon(R) CPU E31240 @ 3.30GHz",
"distro": "ubuntu",
"distro_description": "Ubuntu 16.04.4 LTS",
"distro_version": "16.04",
"hostname": "mds3",
"kernel_description": "#1 SMP PVE 4.13.16-47 (Mon, 9 Apr 2018 09:58:12
+0200)",
"kernel_version": "4.13.16-2-pve",
"mem_swap_kb": "4096000",
"mem_total_kb": "2048000",
"os": "Linux"
}
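
Regarding the auth caps suggestion quoted below, checking them would look
roughly like this (the key name is assumed from a standard deployment):

# ceph auth get mgr.mon1
(the mon cap should read: allow profile mgr)

# ceph auth caps mgr.mon1 mon 'allow profile mgr' osd 'allow *' mds 'allow *'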





Kind regards,

Charles Alva
Sent from Gmail Mobile

On Tue, Apr 24, 2018 at 4:29 PM, John Spray  wrote:

> On Fri, Apr 20, 2018 at 11:29 AM, Charles Alva 
> wrote:
> > Marc,
> >
> > Thanks.
> >
> > The mgr log spam occurs even without dashboard module enabled. I never
> > checked the ceph mgr log before because the ceph cluster is always
> healthy.
> > Based on the ceph mgr logs in syslog, the spam occurred long before and
> > after I enabled the dashboard module.
> >
> >> # ceph -s
> >>   cluster:
> >> id: xxx
> >> health: HEALTH_OK
> >>
> >>   services:
> >> mon: 3 daemons, quorum mon1,mon2,mon3
> >> mgr: mon1(active), standbys: mon2, mon3
> >> mds: cephfs-1/1/1 up  {0=mds1=up:active}, 2 up:standby
> >> osd: 14 osds: 14 up, 14 in
> >> rgw: 3 daemons active
> >>
> >>   data:
> >> pools:   10 pools, 248 pgs
> >> objects: 546k objects, 2119 GB
> >> usage:   6377 GB used, 6661 GB / 13039 GB avail
> >> pgs: 248 active+clean
> >>
> >>   io:
> >> client:   25233 B/s rd, 1409 kB/s wr, 6 op/s rd, 59 op/s wr
> >
> >
> >
> > My ceph mgr log is spam with following log every second. This happens on
> 2
> > separate Ceph 12.2.4 clusters.
>
> (I assume that the mon, mgr and mds are all 12.2.4)
>
> The "failed to return metadata" part is kind of mysterious.  Do you
> also get an error if you try to do "ceph mds metadata mds1" by hand?
> (that's what the mgr is trying to do).
>
> If the metadata works when using the CLI by hand, you may have an
> issue with the mgr's auth caps, check that its mon caps are set to
> "allow profile mgr".
>
> The "unhandled message" part is from a path where the mgr code is
> ignoring messages from services that don't have any metadata (I think
> this is actually a bug, as we should be considering these messages as
> handled even if we're ignoring them).
>
> John
>
> >> # less +F /var/log/ceph/ceph-mgr.mon1.log
> >>
> >>  ...
> >>
> >> 2018-04-20 06:21:18.782861 7fca238ff700  1 mgr send_beacon active
> >> 2018-04-20 06:21:19.050671 7fca14809700  0 ms_deliver_dispatch:
> unhandled
> >> message 0x55bf897d1c00 mgrreport(mds.mds1 +24-0 packed 214) v5 from
> mds.0
> >> 10.100.100.114:6800/4132681434
> >> 2018-04-20 06:21:19.051047 7fca25102700  1 mgr finish mon failed to
> return
> >> metadata for mds.mds1: (2) No such file or directory
> >> 2018-04-20 06:21:20.050889 7fca14809700  0 ms_deliver_dispatch:
> unhandled
> >> message 0x55bf897eac00 mgrreport(mds.mds1 +24-0 packed 214) v5 from
> mds.0
> >> 10.100.100.114:6800/4132681434
> >> 2018-04-20 06:21:20.051351 7fca25102700  1 mgr finish mon failed to
> return
> >> metadata for mds.mds1: (2) No such file or directory
> >> 2018-04-20 06:21:20.784455 7fca238ff700  1 mgr send_beacon active
> >> 2018-04-20 06:21:21.050968 7fca14809700  0 ms_deliver_dispatch:
> unhandled
> >> message 

[ceph-users] Integrating XEN Server : Long query time for "rbd ls -l" queries

2018-04-25 Thread Marc Schöchlin
Hello list,

we are trying to integrate a storage repository in xenserver.
(I have also described the problem as an issue in the ceph bugtracker:
https://tracker.ceph.com/issues/23853)

Summary:

The slowness is a real pain for us, because it prevents the xen
storage repository from working efficiently.
Gathering information for XEN pools with hundreds of virtual machines
(using "--format json") would be a real pain...
The high user time consumption and the really huge number of threads
suggest that there is something really inefficient in the "rbd" utility.

So what can i do to make "rbd ls -l" faster or to get comparable
information regarding snapshot hierarchy information?

Details:

We have a strange problem when listing the images of an SSD pool.

# time rbd ls -l --pool RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c
NAME    SIZE
PARENT  
 
FMT PROT LOCK
RBD-0192938e-cb4b-4ee1-9988-b8145704ac81  20480M
RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-8b2cfe76-44b7-4393-b376-f675366831c3@BASE
  
2  
RBD-0192938e-cb4b-4ee1-9988-b8145704ac81@BASE 20480M
RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-8b2cfe76-44b7-4393-b376-f675366831c3@BASE
  
2 yes  
..
__srlock__
0   
 
2  

real    0m8.726s
user    0m8.506s
sys    0m0.591s

===> This is incredibly slow for outputting 105 lines.

Without "-l" ist pretty fast (unfortunately i need this information):

# time rbd ls  --pool RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c
RBD-0192938e-cb4b-4ee1-9988-b8145704ac81
...
__srlock__

real    0m0.024s
user    0m0.015s
sys    0m0.008s

===> This is incredibly fast for outputting 71 lines.

The "@BASE" snapshots are created by the following procedure:

rbd snap create
rbd snap protect
rbd clone

It seems that lookups in rbd pools are performed by using an object named
"rbd_directory" which resides in the pool...

Querying this object with 142 entries needs 0.024 seconds.

# time rados -p RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c
listomapvals rbd_directory

id_12da262ae8944a
value (14 bytes) :
  0a 00 00 00 5f 5f 73 72  6c 6f 63 6b 5f 5f    |__srlock__|
000e

id_12f2943d1b58ba
value (44 bytes) :
  28 00 00 00 52 42 44 2d  31 62 34 32 31 38 39 31 
|(...RBD-1b421891|
0010  2d 34 34 31 63 2d 34 35  33 30 2d 62 64 66 33 2d 
|-441c-4530-bdf3-|
0020  61 64 32 62 31 31 34 61  36 33 66 63  |ad2b114a63fc|
002c
...

real    0m0.024s
user    0m0.023s
sys    0m0.000s

I also analyzed the state of the OSD holding this object:

# ceph osd map RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c
rbd_directory
osdmap e7400 pool 'RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c'
(14) object 'rbd_directory' -> pg 14.30a98c1c (14.1c) -> up ([36,0,38],
p36) acting ([36,0,38], p36)

Repeating queries resulted in 3.6% cpu usage - the logs do not provide
any useful information.

Analyzing this command with strace suggests to me that there is something
wrong with the "rbd" command implementation.

# strace -f -c rbd ls -l --pool
RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c > /dev/null
strace: Process 50286 attached
strace: Process 50287 attached
strace: Process 50288 attached
strace: Process 50289 attached
strace: Process 50290 attached
strace: Process 50291 attached
strace: Process 50292 attached

!!! 2086 threads !!!


% time seconds  usecs/call calls    errors syscall
-- --- --- - - 
 98.42  219.207328    2020    108534 20241 futex
  1.47    3.265517 162 20099   epoll_wait
  0.06    0.131473   3 46377 22894 read
  0.02    0.053740   4 14627   sendmsg
  0.01    0.017225   2 10020   write
  0.00    0.008975   4  2001   munmap
  0.00    0.007260   3  2170   epoll_ctl
  0.00    0.007171   3  2088   madvise
  0.00    0.003139   1  4166   rt_sigprocmask
  0.00    0.002670   1  2086   prctl
  0.00    0.002494   1  3381   mprotect
  0.00    0.002315   1  2087   mmap
  0.00    0.002120   1  2087   set_robust_list
  0.00    0.002098   1  2084   gettid
  0.00    0.001833   1  2086   clone
  0.00    0.001152   8   136    87 connect
  0.00    0.000739   7   102   close
  0.00    0.000623  13    49   shutdown
  0.00    0.000622   6   110   fcntl
  0.00    0.000469  10    49   socket
  0.00    0.000466   6    73    29 open
  0.00    0.000456

Re: [ceph-users] Cluster degraded after Ceph Upgrade 12.2.1 => 12.2.2

2018-04-25 Thread Simon Ironside

On 25/04/18 10:52, Ranjan Ghosh wrote:

And, yes, we're running a "size:2 min_size:1" because we're on a very 
tight budget. If I understand correctly, this means: Make changes of 
files to one server. *Eventually* copy them to the other server. I hope 
this *eventually* means after a few minutes.
size:2 means there are two replicas of every object. Writes are 
synchronous (i.e. a write isn't complete until it's written to both 
OSDs) so there's no eventually - it's immediate.


min_size:1 means the cluster will still allow access while only one 
replica is available.


Simon
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster degraded after Ceph Upgrade 12.2.1 => 12.2.2

2018-04-25 Thread Mark Schouten
On Wed, 2018-04-25 at 11:52 +0200, Ranjan Ghosh wrote:
> Thanks a lot for your detailed answer. The problem for us, however,
> was 
> that we use the Ceph packages that come with the Ubuntu distribution.
> If 
> you do a Ubuntu upgrade, all packages are upgraded in one go and the 
> server is rebooted. You cannot influence anything or start/stop
> services 
> one-by-one etc.

Ehm, huh? Does your Ubuntu decide to reboot by itself?

-- 
Kerio Operator in de Cloud? https://www.kerioindecloud.nl/
Mark Schouten  | Tuxis Internet Engineering
KvK: 61527076  | http://www.tuxis.nl/
T: 0318 200208 | i...@tuxis.nl

smime.p7s
Description: S/MIME cryptographic signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster degraded after Ceph Upgrade 12.2.1 => 12.2.2

2018-04-25 Thread Ranjan Ghosh
Thanks a lot for your detailed answer. The problem for us, however, was 
that we use the Ceph packages that come with the Ubuntu distribution. If 
you do a Ubuntu upgrade, all packages are upgraded in one go and the 
server is rebooted. You cannot influence anything or start/stop services 
one-by-one etc. This was concerning me, because the upgrade instructions 
didn't mention anything about an alternative or what to do in this case. 
But someone here enlightened me that - in general - it all doesnt matter 
that much *if you are just accepting a downtime*. And, indeed, it all 
worked nicely. We stopped all services on all servers, upgraded the 
Ubuntu version, rebooted all servers and were ready to go again. Didn't 
encounter any problems there. The only problem turned out to be our own 
fault and simply a firewall misconfiguration.


And, yes, we're running a "size:2 min_size:1" because we're on a very 
tight budget. If I understand correctly, this means: Make changes of 
files to one server. *Eventually* copy them to the other server. I hope 
this *eventually* means after a few minutes. Up until now I've never 
experienced *any* problems with file integrity with this configuration. 
In fact, Ceph is incredibly stable. Amazing. I have never ever had any 
issues whatsoever with broken files/partially written files, files that 
contain garbage etc. Even after starting/stopping services, rebooting 
etc. With GlusterFS and other cluster file systems I've experienced many 
such problems over the years, which is what makes Ceph so great. I now 
have a lot of trust in Ceph, that it will eventually repair everything 
:-) And if a file that was written a few seconds ago is really lost, it 
wouldn't be that bad for our use-case. It's a web server. The most 
important stuff is in the DB. We have hourly backups of everything. In a 
huge emergency, we could even restore the backup from an hour ago if we 
really had to. Not nice, but if it happens every six years or so due to 
some freak hardware failure, I think it is manageable. I accept it's not 
the recommended/perfect solution if you have infinite amounts of money 
at your disposal, but in our case, I think it's not extremely audacious 
either to do it like this, right?



Am 11.04.2018 um 19:25 schrieb Ronny Aasen:

Ceph upgrades are usually not a problem:
Ceph has to be upgraded in the right order. Normally, when each 
service is on its own machine, this is not difficult, but when you 
have mon, mgr, osd, mds, and clients on the same host you have to do 
it a bit carefully.


I tend to have a terminal open with "watch ceph -s" running, and I 
never touch another service until the health is OK again.


First, apt upgrade the packages on all the hosts. This only updates the 
software on disk, not the running services. Then restart the services 
in the right order, and only on one host at a time (a rough command 
sketch follows after the steps below).


mons: first restart the mon service on all mon-running hosts. All 
three mons are active at the same time, so there is no "shifting 
around", but make sure the quorum is OK again before you do the next mon.


mgr: then restart mgr on all hosts that run mgr. There is only one 
active mgr at a time now, so here there will be a bit of shifting 
around, but it is only for statistics/management, so it may affect 
your ceph -s command, not cluster operation.


osd: restart osd processes one osd at a time; make sure HEALTH_OK 
before doing the next osd process. Do this for all hosts that have OSDs.


mds: restart the mds's one at a time. You will notice the standby mds 
taking over for the mds that was restarted. Do both.


clients: restart clients; that means remounting filesystems, migrating 
or restarting VMs, or restarting whatever process uses the old Ceph 
libraries.
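
A minimal per-host sketch of the sequence above, assuming the stock 
systemd units shipped with the Debian/Ubuntu packages (the unit 
instance names and the OSD id are placeholders):

---
# mon hosts, one at a time
systemctl restart ceph-mon@$(hostname -s)
ceph -s                                   # wait for quorum / HEALTH_OK

# mgr hosts, one at a time
systemctl restart ceph-mgr@$(hostname -s)

# OSD hosts, one OSD at a time
systemctl restart ceph-osd@12             # 12 = OSD id, placeholder
ceph -s                                   # wait for HEALTH_OK before the next

# MDS hosts, one at a time
systemctl restart ceph-mds@$(hostname -s)
---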



About pools:
Since you only have two OSDs, you obviously cannot be running the 
recommended 3-replica pools? This makes me worry that you may be 
running size:2 min_size:1 pools, and are daily running the risk of 
data loss due to corruption and inconsistencies, especially when you 
restart OSDs.


If your pools are size:2 min_size:2, then your cluster will block I/O 
whenever any OSD is restarted, until the OSD is up and healthy again, 
but you have less chance of data loss than with 2/1 pools.


If you added an OSD on a third host you could run size:3 min_size:2, 
the recommended config, where you get both redundancy and high 
availability.
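
A sketch of the corresponding commands, with the pool name as a 
placeholder (run per pool):

---
ceph osd pool set <pool> size 3
ceph osd pool set <pool> min_size 2
---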



kind regards
Ronny Aasen







On 11.04.2018 17:42, Ranjan Ghosh wrote:
Ah, nevermind, we've solved it. It was a firewall issue. The only 
thing that's weird is that it became an issue immediately after an 
update. Perhaps it has something to do with monitor nodes shifting 
around. Well, thanks again for your quick support, though. It's 
much appreciated.


BR

Ranjan


Am 11.04.2018 um 17:07 schrieb Ranjan Ghosh:
Thank you for your answer. Do you have any specifics on which thread 
you're talking about? Would be very interested to read about a 
success story, because 

Re: [ceph-users] Poor read performance.

2018-04-25 Thread Christian Balzer

Hello,

On Tue, 24 Apr 2018 12:52:55 -0400 Jonathan Proulx wrote:

> Hi All,
> 
> I seem to be seeing consitently poor read performance on my cluster
> relative to both write performance and read perormance of a single
> backend disk, by quite a lot.
> 
> cluster is luminous with 174 7.2k SAS drives across 12 storage servers
> with 10G ethernet and jumbo frames.  Drives are mix 4T and 2T
> bluestore with DB on ssd.
> 
How much RAM do these hosts have and have you changed the default cache
settings of bluestore?
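
For reference, a quick way to check those on an OSD host is via the
admin socket; osd.0 is just a placeholder:

---
ceph daemon osd.0 config show | grep bluestore_cache
# the relevant knobs on luminous are bluestore_cache_size_hdd / _ssd
---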

> The performence I really care about is over rbd for VMs in my
> OpenStack but 'rbd bench' seems to line up frety well with 'fio' test
> inside VMs so a more or less typical random write rbd bench (from a
> monitor node with 10G connection on same net as osds):
>

"rbd bench" does things differently than fio (lots of happy switches
there) so to make absolutely sure you're not doing and apples and oranges
thing I'd suggest you stick to fio in a VM.

For example, your write "rbd bench" on my crap test cluster with 6 nodes
and 4 ancient SATA drives each and 1Gb/s links (but luminous and bluestore)
will get all the HDDs nearly 100% busy:
---
elapsed:   152  ops:   262144  ops/sec:  1715.01  bytes/sec: 7024699.12
---

In comparison this fio:
---
fio --size=1G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 
--rw=randwrite --name=fiojob --blocksize=4K --iodepth=32
---

Will only result in this, due to the network latencies of having direct
I/O and only having one OSD at a time being busy:
---
  write: io=110864KB, bw=1667.2KB/s, iops=416, runt= 66499msec
---

The fio result is a very important one, as IOPS for a single client won't
get faster than this; the bench result is good for finding the upper
limit of the cluster.

In contrast, the reads look like this for bench:
---
elapsed:40  ops:   262144  ops/sec:  6430.08  bytes/sec: 26337591.86
---

and this for fio:
---
  read : io=1024.0MB, bw=86717KB/s, iops=21679, runt= 12092msec
---

The latter is clearly being served from the OSD caches, which is very
visible with atop on the OSD hosts.
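
For reference, the read numbers above come from presumably the same fio
invocation with the direction flipped; a sketch, assuming the same block
size and iodepth as the write test:

---
fio --size=1G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 
--rw=randread --name=fiojob --blocksize=4K --iodepth=32
---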

That being said, something is rather wrong here indeed; my crappy test
cluster shouldn't be able to outperform yours.

> rbd bench  --io-total=4G --io-size 4096 --io-type write \
> --io-pattern rand --io-threads 16 mypool/myvol
> 
> 
> 
> elapsed:   361  ops:  1048576  ops/sec:  2903.82  bytes/sec: 11894034.98
> 
> same for random read is an order of magnitude lower:
> 
> rbd bench  --io-total=4G --io-size 4096 --io-type read \
> --io-pattern rand --io-threads 16  mypool/myvol
> 
> elapsed:  3354  ops:  1048576  ops/sec:   312.60  bytes/sec: 1280403.47
> 
> (sequencial reads and bigger io-size help but not a lot)
> 
> ceph -s from during read bench so get a sense of comparing traffic:
> 
>   cluster:
> id: 
> health: HEALTH_OK
>  
>   services:
> mon: 3 daemons, quorum ceph-mon0,ceph-mon1,ceph-mon2
> mgr: ceph-mon0(active), standbys: ceph-mon2, ceph-mon1
> osd: 174 osds: 174 up, 174 in
> rgw: 3 daemon active
>  
>   data:
> pools:   19 pools, 10240 pgs
> objects: 17342k objects, 80731 GB
> usage:   240 TB used, 264 TB / 505 TB avail
> pgs: 10240 active+clean
>  
>   io:
> client:   4296 kB/s rd, 417 MB/s wr, 1635 op/s rd, 3518 op/s wr
> 
> 
> During deep-scrubs overnight I can see the disks doing >500MBps reads
> and ~150rx/iops (each at peak), while during read bench (including all
> traffic from ~1k VMs) individual osd data partitions peak around 25
> rx/iops and 1.5MBps rx bandwidth so it seems like there should be
> performance to spare.
> 
OK, there are a couple of things here.
1k VMs?!?
One assumes that they're not idle, looking at the output above.
And writes will compete with reads on the same spindle of course.
"performance to spare" you say, but have you verified this with iostat or
atop?
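
A sketch of what I mean, to be run on the OSD hosts while the cluster is
under its normal load:

---
iostat -xz 5        # per-device utilization and await, 5-second intervals
# or: atop 5        # watch the per-disk busy percentage
---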


> Obviosly given my disk choices this isn't designed as a particularly
> high performance setup but I do expect a bit mroe performance out of
> it.
> 
> Are my expectations wrong? If not any clues what I've don (or failed
> to do) that is wrong?
> 
> Pretty sure rx/wx was much more sysmetric in earlier versions (subset
> of same hardware and filestore backend) but used a different perf tool
> so don't want to make direct comparisons.
>

It could be as simple as filestore having had lots of pagecache
available, which helped dramatically with (repeated) reads.

But without a quiescent cluster, pinning this down might be difficult.

Christian

-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor read performance.

2018-04-25 Thread David C
How does your rados bench look?

Have you tried playing around with read ahead and striping?
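
For reference, a sketch of the sort of thing I mean (the pool name and
block device are placeholders):

---
# baseline the cluster itself, bypassing rbd
rados bench -p <pool> 60 write --no-cleanup
rados bench -p <pool> 60 rand
rados -p <pool> cleanup

# inside a VM, bump read-ahead on the virtual disk
echo 4096 > /sys/block/vda/queue/read_ahead_kb
---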


On Tue, 24 Apr 2018 17:53 Jonathan Proulx,  wrote:

> Hi All,
>
> I seem to be seeing consitently poor read performance on my cluster
> relative to both write performance and read perormance of a single
> backend disk, by quite a lot.
>
> cluster is luminous with 174 7.2k SAS drives across 12 storage servers
> with 10G ethernet and jumbo frames.  Drives are mix 4T and 2T
> bluestore with DB on ssd.
>
> The performence I really care about is over rbd for VMs in my
> OpenStack but 'rbd bench' seems to line up frety well with 'fio' test
> inside VMs so a more or less typical random write rbd bench (from a
> monitor node with 10G connection on same net as osds):
>
> rbd bench  --io-total=4G --io-size 4096 --io-type write \
> --io-pattern rand --io-threads 16 mypool/myvol
>
> 
>
> elapsed:   361  ops:  1048576  ops/sec:  2903.82  bytes/sec: 11894034.98
>
> same for random read is an order of magnitude lower:
>
> rbd bench  --io-total=4G --io-size 4096 --io-type read \
> --io-pattern rand --io-threads 16  mypool/myvol
>
> elapsed:  3354  ops:  1048576  ops/sec:   312.60  bytes/sec: 1280403.47
>
> (sequencial reads and bigger io-size help but not a lot)
>
> ceph -s from during read bench so get a sense of comparing traffic:
>
>   cluster:
> id: 
> health: HEALTH_OK
>
>   services:
> mon: 3 daemons, quorum ceph-mon0,ceph-mon1,ceph-mon2
> mgr: ceph-mon0(active), standbys: ceph-mon2, ceph-mon1
> osd: 174 osds: 174 up, 174 in
> rgw: 3 daemon active
>
>   data:
> pools:   19 pools, 10240 pgs
> objects: 17342k objects, 80731 GB
> usage:   240 TB used, 264 TB / 505 TB avail
> pgs: 10240 active+clean
>
>   io:
> client:   4296 kB/s rd, 417 MB/s wr, 1635 op/s rd, 3518 op/s wr
>
>
> During deep-scrubs overnight I can see the disks doing >500MBps reads
> and ~150rx/iops (each at peak), while during read bench (including all
> traffic from ~1k VMs) individual osd data partitions peak around 25
> rx/iops and 1.5MBps rx bandwidth so it seems like there should be
> performance to spare.
>
> Obviosly given my disk choices this isn't designed as a particularly
> high performance setup but I do expect a bit mroe performance out of
> it.
>
> Are my expectations wrong? If not any clues what I've don (or failed
> to do) that is wrong?
>
> Pretty sure rx/wx was much more sysmetric in earlier versions (subset
> of same hardware and filestore backend) but used a different perf tool
> so don't want to make direct comparisons.
>
> -Jon
>
> --
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Broken rgw user

2018-04-25 Thread Simon Ironside

Hi Everyone,

I've got a problem with one rgw user on Hammer 0.94.7.

* "radosgw-admin user info" no longer works:

could not fetch user info: no user info saved

* I can still retrieve their stats via "radosgw-admin user stats", 
although the returned data is wrong:


{
"stats": {
"total_entries": 0,
"total_bytes": 0,
"total_bytes_rounded": 0
},
"last_stats_sync": "2018-04-24 15:58:27.354280Z",
"last_stats_update": "0.00"
}

* The user still shows in metadata list user

* As far as I can see, the user still works; I can access the account OK 
with s3cmd etc.


Does anyone know how to fix the user info and user stats issues?
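
For context, I assume the raw metadata and a stats resync are the first
things to poke at; a sketch (the uid is a placeholder, and --sync-stats
only if the radosgw-admin on Hammer supports it):

---
radosgw-admin metadata get user:<uid>
radosgw-admin user stats --uid=<uid> --sync-stats
---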

Thanks,
Simon.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com