Re: [ceph-users] Cluster unusable after 50% full, even with index sharding

2018-04-13 Thread Christian Balzer

Hello,

On Fri, 13 Apr 2018 11:59:01 -0500 Robert Stanford wrote:

>  I have 65TB stored on 24 OSDs on 3 hosts (8 OSDs per host).  SSD journals
> and spinning disks.  Our performance before was acceptable for our purposes
> - 300+MB/s simultaneous transmit and receive.  Now that we're up to about
> 50% of our total storage capacity (65/120TB, say), the write performance is
> still ok, but the read performance is unworkable (35MB/s!)
> 
As always, please provide full details:
versions, hardware, which SSDs and HDDs and how they are connected, what
filesystem is on the OSDs, etc.
 
>  I am using index sharding, with 256 shards.  I don't see any CPUs
> saturated on any host (we are using radosgw by the way, and the load is
> light there as well).  The hard drives don't seem to be *too* busy (a
> random OSD shows ~10 wa in top).  The network's fine, as we were doing much
> better in terms of speed before we filled up.
>
top is an abysmal tool for these things; use atop in a big terminal window
on all 3 hosts for full situational awareness.
"iostat -x 3" might do in a pinch for the I/O-related bits, too.

Keep in mind that a single busy OSD will drag the performance of the whole
cluster down. 
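
For example, something along these lines on each host can spot that one busy
disk quickly (a sketch only; the device names are placeholders for your actual
HDDs and journal SSDs):

  atop 3
  iostat -x sdb sdc sdd sde nvme0n1 3

A single HDD pinned near 100% util, or with a much higher await than its
siblings, is usually the first hint of an unbalanced or dying OSD.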

Other things to check and verify (example commands for each point below):
1. Are the OSDs reasonably balanced PG-wise?
2. How fragmented are the OSD filesystems?
3. Is a deep scrub running during the low-performance times?
4. Have you run out of RAM for the pagecache and, more importantly, the SLAB
for dentries due to the number of objects (files)?
If so, reads will require many more disk accesses than otherwise.
This is a typical wall to run into and can be mitigated by more RAM and
sysctl tuning.
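
For reference, a rough way to go through the list above (a sketch only; pool,
device and mountpoint names are placeholders, and the sysctl value is an
example, not a recommendation):

  # 1. PG count and fill level per OSD
  ceph osd df tree

  # 2. fragmentation, assuming XFS on the OSDs (run against each OSD's device)
  xfs_db -r -c frag /dev/sdb1

  # 3. any (deep) scrubs running right now?
  ceph -s
  ceph pg dump | grep -i scrub

  # 4. dentry/inode SLAB and pagecache pressure
  slabtop -o | head -20
  grep -E 'dentry|xfs_inode' /proc/slabinfo
  sysctl vm.vfs_cache_pressure=10    # keep dentries/inodes cached longer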

Christian
 
>   Is there anything we can do about this, short of replacing hardware?  Is
> it really a limitation of Ceph that getting 50% full makes your cluster
> unusable?  Index sharding has seemed to not help at all (I did some
> benchmarking, with 128 shards and then 256; same result each time.)
> 
>  Or are we out of luck?


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Rakuten Communications


Re: [ceph-users] rbd-nbd not resizing even after kernel tweaks

2018-04-13 Thread Alex Gorbachev
On Thu, Apr 12, 2018 at 9:38 AM, Alex Gorbachev  
wrote:
> On Thu, Apr 12, 2018 at 7:57 AM, Jason Dillaman  wrote:
>> If you run "partprobe" after you resize in your second example, is the
>> change visible in "parted"?
>
> No, partprobe does not help:
>
> root@lumd1:~# parted /dev/nbd2 p
> Model: Unknown (unknown)
> Disk /dev/nbd2: 2147MB
> Sector size (logical/physical): 512B/512B
> Partition Table: loop
> Disk Flags:
>
> Number  Start  End     Size    File system  Flags
>  1  0.00B  2147MB  2147MB  xfs
>
> root@lumd1:~# partprobe
> root@lumd1:~# parted /dev/nbd2 p
> Model: Unknown (unknown)
> Disk /dev/nbd2: 2147MB
> Sector size (logical/physical): 512B/512B
> Partition Table: loop
> Disk Flags:
>
> Number  Start  End     Size    File system  Flags
>  1  0.00B  2147MB  2147MB  xfs
>
>
>
>>
>> On Wed, Apr 11, 2018 at 11:01 PM, Alex Gorbachev  
>> wrote:
>>> On Wed, Apr 11, 2018 at 2:13 PM, Jason Dillaman  wrote:
 I've tested the patch on both 4.14.0 and 4.16.0 and it appears to
 function correctly for me. parted can see the newly added free-space
 after resizing the RBD image and our stress tests once again pass
 successfully. Do you have any additional details on the issues you are
 seeing?
>>>
>>> I recompiled again with 4.14-24 and tested; the resize shows up OK
>>> when the filesystem is not mounted.  dmesg also shows the "detected
>>> capacity change" message.  However, if I create a filesystem and mount
>>> it, the capacity change is no longer detected.  Steps as follows:
>>>
>>> root@lumd1:~# rbd create -s 1024 --image-format 2 matte/n4
>>> root@lumd1:~# rbd-nbd map matte/n4
>>> /dev/nbd2
>>> root@lumd1:~# mkfs.xfs /dev/nbd2
>>> meta-data=/dev/nbd2              isize=512    agcount=4, agsize=65536 blks
>>>          =                       sectsz=512   attr=2, projid32bit=1
>>>          =                       crc=1        finobt=1, sparse=0
>>> data     =                       bsize=4096   blocks=262144, imaxpct=25
>>>          =                       sunit=0      swidth=0 blks
>>> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
>>> log      =internal log           bsize=4096   blocks=2560, version=2
>>>          =                       sectsz=512   sunit=0 blks, lazy-count=1
>>> realtime =none                   extsz=4096   blocks=0, rtextents=0
>>> root@lumd1:~# parted /dev/nbd2 p
>>> Model: Unknown (unknown)
>>> Disk /dev/nbd2: 1074MB
>>> Sector size (logical/physical): 512B/512B
>>> Partition Table: loop
>>> Disk Flags:
>>>
>>> Number  Start  End     Size    File system  Flags
>>>  1  0.00B  1074MB  1074MB  xfs
>>>
>>> root@lumd1:~# rbd resize --pool matte --image n4 --size 2048
>>> Resizing image: 100% complete...done.
>>> root@lumd1:~# parted /dev/nbd2 p
>>> Model: Unknown (unknown)
>>> Disk /dev/nbd2: 2147MB
>>> Sector size (logical/physical): 512B/512B
>>> Partition Table: loop
>>> Disk Flags:
>>>
>>> Number  Start  End     Size    File system  Flags
>>>  1  0.00B  2147MB  2147MB  xfs
>>>
>>> -- All is well so far, now let's mount the fs
>>>
>>> root@lumd1:~# mount /dev/nbd2 /mnt
>>> root@lumd1:~# rbd resize --pool matte --image n4 --size 3072
>>> Resizing image: 100% complete...done.
>>> root@lumd1:~# parted /dev/nbd2 p
>>> Model: Unknown (unknown)
>>> Disk /dev/nbd2: 2147MB
>>> Sector size (logical/physical): 512B/512B
>>> Partition Table: loop
>>> Disk Flags:
>>>
>>> Number  Start  End     Size    File system  Flags
>>>  1  0.00B  2147MB  2147MB  xfs
>>>
>>> -- Now the change is not detected
>>>
>>>
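
A cross-check that may help narrow this down (a sketch, using the same device
and mountpoint as in the steps above and assuming standard util-linux/xfsprogs
tools): see whether the kernel block layer itself has picked up the new size,
independent of what parted reports.

  blockdev --getsize64 /dev/nbd2    # size as the block layer sees it, in bytes
  cat /sys/block/nbd2/size          # same, in 512-byte sectors
  xfs_growfs /mnt                   # if the block layer sees 3G, grow the mounted XFS

If blockdev/sysfs already report the new size while parted does not, the
remaining issue is the partition-table/udev notification rather than the
capacity change itself.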

 On Wed, Apr 11, 2018 at 12:06 PM, Jason Dillaman  
 wrote:
> I'll give it a try locally and see if I can figure it out. Note that
> this commit [1] also dropped the call to "bd_set_size" within
> "nbd_size_update", which seems suspicious to me at initial glance.
>
> [1] 
> https://github.com/torvalds/linux/commit/29eaadc0364943b6352e8994158febcb699c9f9b#diff-bc9273bcb259fef182ae607a1d06a142L180
>
> On Wed, Apr 11, 2018 at 11:09 AM, Alex Gorbachev 
>  wrote:
>>> On Wed, Apr 11, 2018 at 10:27 AM, Alex Gorbachev 
>>>  wrote:
 On Wed, Apr 11, 2018 at 2:43 AM, Mykola Golub 
  wrote:
> On Tue, Apr 10, 2018 at 11:14:58PM -0400, Alex Gorbachev wrote:
>
>> So Josef fixed the one issue that enables e.g. lsblk and sysfs size to
>> reflect the correct size on change.  However, partprobe and parted
>> still do not detect the change; a complete unmap and remap of the rbd-nbd
>> device and a remount of the filesystem are required.
>
> Does your rbd-nbd include this fix [1], targeted for v12.2.3?
>
> [1] http://tracker.ceph.com/issues/22172

 It should, the rbd-nbd version is 12.2.4

 root@lumd1:~# rbd-nbd -v
 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) 
 luminous (stable)

[ceph-users] Error Creating OSD

2018-04-13 Thread Rhian Resnick
Evening,

When attempting to create an OSD, we receive the following error:

[ceph-admin@ceph-storage3 ~]$ sudo ceph-volume lvm create --bluestore --data 
/dev/sdu
Running command: ceph-authtool --gen-print-key
Running command: ceph --cluster ceph --name client.bootstrap-osd --keyring 
/var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 
c8cb8cff-dad9-48b8-8d77-6f130a4b629d
--> Was unable to complete a new OSD, will rollback changes
--> OSD will be fully purged from the cluster, because the ID was generated
Running command: ceph osd purge osd.140 --yes-i-really-mean-it
 stderr: purged osd.140
-->  MultipleVGsError: Got more than 1 result looking for volume group: 
ceph-6a2e8f21-bca2-492b-8869-eecc995216cc

Any hints on what to do? This occurs whenever we attempt to create OSDs on this
node.
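
The error itself points at the likely cause: ceph-volume found more than one
LVM volume group answering to that ceph-... name. A hedged first diagnostic
step (stock LVM tools; double-check UUIDs before removing anything, and only
remove a VG you are certain holds no live OSD):

  sudo vgs -o vg_name,vg_uuid,pv_count,vg_size
  sudo pvs -o pv_name,vg_name,vg_uuid
  # once the stale duplicate is identified by its UUID:
  sudo vgremove --select vg_uuid=<stale-uuid>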


Rhian Resnick

Associate Director Middleware and HPC

Office of Information Technology


Florida Atlantic University

777 Glades Road, CM22, Rm 173B

Boca Raton, FL 33431

Phone 561.297.2647

Fax 561.297.0222



[ceph-users] How much damage have I done to RGW hardcore-wiping a bucket out of its existence?

2018-04-13 Thread Katie Holly
Hi everyone,

I found myself in a situation where dynamic sharding and writing data to a
bucket containing a little more than 5M objects at the same time caused
corruption of the data, rendering the entire bucket unusable. I tried several
solutions to fix this bucket and ended up ditching it.

What I tried before going the hardcore way:

* radosgw-admin reshard list -> didn't list any reshard process going on at the 
time, but
* radosgw-admin reshard cancel --bucket $bucket -> canceled the reshard process 
going on in the background, overall load on the cluster dropped after a few 
minutes

At this point I decided to start from scratch since a lot of the data was 
corrupted due to a broken application version writing to this bucket.

* aws s3 rm --recursive s3://$bucket -> Deleted most files, but 13k files 
consuming around 500G total weren't deleted, re-running the same command didn't 
fix that
* aws s3 rb s3://$bucket -> That obviously didn't work since the bucket isn't 
empty
* radosgw-admin bucket rm --bucket $bucket -> "ERROR: could not remove 
non-empty bucket $bucket" and "ERROR: unable to remove bucket(39) Directory not 
empty"
* radosgw-admin bucket rm --bucket $bucket --purge-objects -> "No such file or 
directory"

After some days of helpless Googling and trying various combinations of 
radosgw-admin bucket, bi, reshard and other commands that all did pretty much 
nothing, I did

* rados -p $pool ls | tr '\t' '\n' | fgrep $bucket_marker_id | tr '\n' '\0' | 
xargs -0 -n 128 -P 32 rados -p $pool rm

That deleted the orphan objects from the rados pool, cleaning up the ~500G
of used data.
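
For anyone following the same route, a dry run of that pipeline is worth doing
first (a sketch, reusing the same $pool and $bucket_marker_id as above):

  rados -p $pool ls | fgrep "$bucket_marker_id" | wc -l     # how many objects would be removed
  rados -p $pool ls | fgrep "$bucket_marker_id" | head -20  # eyeball a sample before deleting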

* radosgw-admin bucket check --bucket $bucket -> listed some objects in an 
array, probably the lost ones that weren't deleted
* radosgw-admin bucket check --bucket $bucket --fix ( --check-objects) -> 
didn't do anything

* radosgw-admin bi purge --bucket=$bucket --yes-i-really-mean-it -> This 
deleted the bucket index
* radosgw-admin bucket list -> Bucket still appeared in the list
* aws s3 ls -> Bucket was still appearing in the list
* aws s3 rb $bucket -> "NoSuchBucket"
* aws s3 rm --recursive s3://$bucket -> No error or output
* aws s3 rb $bucket -> No error
* aws s3 ls -> Bucket is no longer in the list

At this point, I decided to restart all RGW frontend instances to make sure 
nothing is being cached. To confirm that it's really gone now, let's check 
everything...

* aws s3 ls -> Check.
* radosgw-admin bucket list -> Check.
* radosgw-admin metadata get bucket:$bucket -> Check.
* radosgw-admin bucket stats --bucket $bucket -> Check.
But:
* radosgw-admin reshard list -> It's doing a reshard; I stopped that for now.
However, all RGW frontend instances were logging this repeatedly for some
minutes:

> block_while_resharding ERROR: bucket is still resharding, please retry
> RGWWatcher::handle_error cookie xxx err (107) Transport endpoint is not 
> connected
> NOTICE: resharding operation on bucket index detected, blocking
> NOTICE: resharding operation on bucket index detected, blocking
> block_while_resharding ERROR: bucket is still resharding, please retry
> block_while_resharding ERROR: bucket is still resharding, please retry
> NOTICE: resharding operation on bucket index detected, blocking
> NOTICE: resharding operation on bucket index detected, blocking
> RGWWatcher::handle_error cookie xxx err (107) Transport endpoint is not 
> connected
> RGWWatcher::handle_error cookie xxx err (107) Transport endpoint is not 
> connected
> block_while_resharding ERROR: bucket is still resharding, please retry
> RGWWatcher::handle_error cookie xxx err (107) Transport endpoint is not 
> connected
> NOTICE: resharding operation on bucket index detected, blocking
> block_while_resharding ERROR: bucket is still resharding, please retry
> NOTICE: resharding operation on bucket index detected, blocking

One of the RGW frontend instances crashed during this, all others seem to be 
running fine at the moment:

> 2018-04-13 23:19:41.599307 7f35c6e00700  0 ERROR: flush_read_list(): 
> d->client_cb->handle_data() returned -5
> terminate called after throwing an instance of 'ceph::buffer::bad_alloc'
>   what():  buffer::bad_alloc
> *** Caught signal (Aborted) **
>  in thread 7f35f341d700 thread_name:msgr-worker-0


* aws s3 mb s3://$bucket -> This command succeeded
* aws s3 cp $file s3://$bucket/$file -> This command succeeded as well

My question at this point would be: how much have I damaged this cluster from
an RGW point of view, and is it possible to undo that damage? If I want to
proceed with cleaning up the old bucket data, where should I continue, and how
would I verify that everything that might further damage the cluster at a
later point is really gone?
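
A hedged verification sketch (Luminous-era radosgw-admin assumed; this only
checks for leftover metadata and pending cleanup, it cannot prove that nothing
else was damaged):

  radosgw-admin metadata list bucket | grep "$bucket"
  radosgw-admin metadata list bucket.instance | grep "$bucket"
  radosgw-admin reshard list
  radosgw-admin gc list --include-all | head
  # slow full scan for leaked RADOS objects:
  radosgw-admin orphans find --pool=$pool --job-id=check-$bucket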

Thanks in advance for any help regarding this, and yes, I know that I should 
have asked on the mailing list first before doing anything stupid. Please let 
me know if I missed any information and I'll add it asap.

-- 
Best regards

Katie Holly

[ceph-users] Cluster unusable after 50% full, even with index sharding

2018-04-13 Thread Robert Stanford
 I have 65TB stored on 24 OSDs on 3 hosts (8 OSDs per host).  SSD journals
and spinning disks.  Our performance before was acceptable for our purposes
- 300+MB/s simultaneous transmit and receive.  Now that we're up to about
50% of our total storage capacity (65/120TB, say), the write performance is
still OK, but the read performance is unworkable (35MB/s!).

 I am using index sharding, with 256 shards.  I don't see any CPUs
saturated on any host (we are using radosgw by the way, and the load is
light there as well).  The hard drives don't seem to be *too* busy (a
random OSD shows ~10 wa in top).  The network's fine, as we were doing much
better in terms of speed before we filled up.

  Is there anything we can do about this, short of replacing hardware?  Is
it really a limitation of Ceph that getting 50% full makes your cluster
unusable?  Index sharding seems not to have helped at all (I did some
benchmarking with 128 shards and then 256; same result each time.)

 Or are we out of luck?
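
One way to narrow this down would be to benchmark the underlying RADOS layer
directly, taking radosgw and the bucket index out of the picture (a sketch
with a throwaway pool; the pool name and PG count are made up, and deleting
the pool again requires mon_allow_pool_delete=true):

  ceph osd pool create bench.test 128 128
  rados bench -p bench.test 60 write --no-cleanup
  rados bench -p bench.test 60 seq          # reads back the objects just written
  rados -p bench.test cleanup
  ceph osd pool delete bench.test bench.test --yes-i-really-really-mean-it

If the seq (read) numbers are also down around 35MB/s, the problem sits below
RGW; if they look healthy, the index pool and RGW itself are the next suspects.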


Re: [ceph-users] CephFS MDS stuck (failed to rdlock when getattr / lookup)

2018-04-13 Thread Oliver Freyermuth
Dear Cephalopodians,

a small addition. 

As far as I know, the I/O the user is performing is based on the following 
directory structure:
datafolder/some_older_tarball.tar.gz
datafolder/sometarball.tar.gz
datafolder/processing_number_2/
datafolder/processing_number_3/
datafolder/processing_number_4/

The problem appeared to start when:
- many clients were reading from datafolder/some_older_tarball.tar.gz, but
extracting somewhere else (to another filesystem),
- and then one single client started to create datafolder/sometarball.tar.gz,
packaging files from another filesystem.
Can this cause such a lockup? If so, can we prevent it somehow?

Cheers,
Oliver

Am 13.04.2018 um 18:16 schrieb Oliver Freyermuth:
> Dear Cephalopodians,
> 
> in our cluster (CentOS 7.4, EC Pool, Snappy compression, Luminous 12.2.4), 
> we often have all (~40) clients accessing one file in readonly mode, even 
> with multiple processes per client doing that. 
> 
> Sometimes (I do not yet know when, nor why!) the MDS ends up in a situation 
> like:
> ---
> 2018-04-13 18:08:34.37 7f1ce4472700  0 log_channel(cluster) log [WRN] : 
> 292 slow requests, 5 included below; oldest blocked for > 1745.864417 secs
> 2018-04-13 18:08:34.378900 7f1ce4472700  0 log_channel(cluster) log [WRN] : 
> slow request 960.563534 seconds old, received at 2018-04-13 17:52:33.815273: 
> client_request(client.34720:16487379 getattr pAsLsXsFs #0x109ff6d 
> 2018-04-13 17:52:33.814904 caller_uid=94894, caller_gid=513{513,}) currently 
> failed to rdlock, waiting
> 2018-04-13 18:08:34.378904 7f1ce4472700  0 log_channel(cluster) log [WRN] : 
> slow request 30.636678 seconds old, received at 2018-04-13 18:08:03.742128: 
> client_request(client.34302:16453640 getattr pAsLsXsFs #0x109ff6d 
> 2018-04-13 18:08:03.741630 caller_uid=94894, caller_gid=513{513,}) currently 
> failed to rdlock, waiting
> 2018-04-13 18:08:34.378908 7f1ce4472700  0 log_channel(cluster) log [WRN] : 
> slow request 972.648926 seconds old, received at 2018-04-13 17:52:21.729881: 
> client_request(client.34720:16487334 lookup #0x101fcab/sometarball.tar.gz 
> 2018-04-13 17:52:21.729450 caller_uid=94894, caller_gid=513{513,}) currently 
> failed to rdlock, waiting
> 2018-04-13 18:08:34.378913 7f1ce4472700  0 log_channel(cluster) log [WRN] : 
> slow request 1685.953657 seconds old, received at 2018-04-13 17:40:28.425149: 
> client_request(client.34810:16564864 lookup #0x101fcab/sometarball.tar.gz 
> 2018-04-13 17:40:28.424961 caller_uid=94894, caller_gid=513{513,}) currently 
> failed to rdlock, waiting
> 2018-04-13 18:08:34.378918 7f1ce4472700  0 log_channel(cluster) log [WRN] : 
> slow request 1552.743795 seconds old, received at 2018-04-13 17:42:41.635012: 
> client_request(client.34302:16453566 getattr pAsLsXsFs #0x109ff6d 
> 2018-04-13 17:42:41.634726 caller_uid=94894, caller_gid=513{513,}) currently 
> failed to rdlock, waiting
> ---
> As you can see (oldest blocked for > 1745.864417 secs) it stays in that 
> situation for quite a while. 
> The number of blocked requests is also not decreasing, but instead slowly 
> increasing whenever a new request is added to the queue. 
> 
> We have a setup of one active MDS, a standby-replay, and a standby. 
> Triggering a failover does not help, it only resets the "oldest blocked" 
> time. 
> 
> I checked the following things on the active MDS:
> ---
> # ceph daemon mds.mon001 objecter_requests
> {
> "ops": [],
> "linger_ops": [],
> "pool_ops": [],
> "pool_stat_ops": [],
> "statfs_ops": [],
> "command_ops": []
> }
> # ceph daemon mds.mon001 ops | grep event | grep -v "initiated" | grep -v 
> "failed to rdlock" | grep -v events
> => no output, only "initiated" and "rdlock" are in the queue. 
> ---
> 
> There's also almost no CPU load, almost no other I/O, and ceph is 
> deep-scrubbing ~pg (this also finishes and the next pg is scrubbed fine),
> and the scrubbing is not even happening in the metadata pool (easy to see in 
> the Luminous dashboard):
> ---
> # ceph -s
>   cluster:
> id: some_funny_hash
> health: HEALTH_WARN
> 1 MDSs report slow requests
>  
>   services:
> mon: 3 daemons, quorum mon003,mon001,mon002
> mgr: mon001(active), standbys: mon002, mon003
> mds: cephfs_baf-1/1/1 up  {0=mon001=up:active}, 1 up:standby-replay, 1 
> up:standby
> osd: 196 osds: 196 up, 196 in
>  
>   data:
> pools:   2 pools, 4224 pgs
> objects: 15649k objects, 61761 GB
> usage:   114 TB used, 586 TB / 700 TB avail
> pgs: 4223 active+clean
>      1    active+clean+scrubbing+deep
>  
>   io:
> client:   175 kB/s rd, 3 op/s rd, 0 op/s wr
> ---

[ceph-users] CephFS MDS stuck (failed to rdlock when getattr / lookup)

2018-04-13 Thread Oliver Freyermuth
Dear Cephalopodians,

in our cluster (CentOS 7.4, EC Pool, Snappy compression, Luminous 12.2.4), 
we often have all (~40) clients accessing one file in readonly mode, even with 
multiple processes per client doing that. 

Sometimes (I do not yet know when, nor why!) the MDS ends up in a situation 
like:
---
2018-04-13 18:08:34.37 7f1ce4472700  0 log_channel(cluster) log [WRN] : 292 
slow requests, 5 included below; oldest blocked for > 1745.864417 secs
2018-04-13 18:08:34.378900 7f1ce4472700  0 log_channel(cluster) log [WRN] : 
slow request 960.563534 seconds old, received at 2018-04-13 17:52:33.815273: 
client_request(client.34720:16487379 getattr pAsLsXsFs #0x109ff6d 
2018-04-13 17:52:33.814904 caller_uid=94894, caller_gid=513{513,}) currently 
failed to rdlock, waiting
2018-04-13 18:08:34.378904 7f1ce4472700  0 log_channel(cluster) log [WRN] : 
slow request 30.636678 seconds old, received at 2018-04-13 18:08:03.742128: 
client_request(client.34302:16453640 getattr pAsLsXsFs #0x109ff6d 
2018-04-13 18:08:03.741630 caller_uid=94894, caller_gid=513{513,}) currently 
failed to rdlock, waiting
2018-04-13 18:08:34.378908 7f1ce4472700  0 log_channel(cluster) log [WRN] : 
slow request 972.648926 seconds old, received at 2018-04-13 17:52:21.729881: 
client_request(client.34720:16487334 lookup #0x101fcab/sometarball.tar.gz 
2018-04-13 17:52:21.729450 caller_uid=94894, caller_gid=513{513,}) currently 
failed to rdlock, waiting
2018-04-13 18:08:34.378913 7f1ce4472700  0 log_channel(cluster) log [WRN] : 
slow request 1685.953657 seconds old, received at 2018-04-13 17:40:28.425149: 
client_request(client.34810:16564864 lookup #0x101fcab/sometarball.tar.gz 
2018-04-13 17:40:28.424961 caller_uid=94894, caller_gid=513{513,}) currently 
failed to rdlock, waiting
2018-04-13 18:08:34.378918 7f1ce4472700  0 log_channel(cluster) log [WRN] : 
slow request 1552.743795 seconds old, received at 2018-04-13 17:42:41.635012: 
client_request(client.34302:16453566 getattr pAsLsXsFs #0x109ff6d 
2018-04-13 17:42:41.634726 caller_uid=94894, caller_gid=513{513,}) currently 
failed to rdlock, waiting
---
As you can see (oldest blocked for > 1745.864417 secs) it stays in that 
situation for quite a while. 
The number of blocked requests is also not decreasing, but instead slowly 
increasing whenever a new request is added to the queue. 

We have a setup of one active MDS, a standby-replay, and a standby. 
Triggering a failover does not help, it only resets the "oldest blocked" time. 

I checked the following things on the active MDS:
---
# ceph daemon mds.mon001 objecter_requests
{
"ops": [],
"linger_ops": [],
"pool_ops": [],
"pool_stat_ops": [],
"statfs_ops": [],
"command_ops": []
}
# ceph daemon mds.mon001 ops | grep event | grep -v "initiated" | grep -v 
"failed to rdlock" | grep -v events
=> no output, only "initiated" and "rdlock" are in the queue. 
---
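
A hedged next step when this happens again is to find out which client session
the blocked requests belong to and what caps it holds (admin-socket commands as
available in Luminous; the client IDs come from the slow-request log lines
above):

  ceph daemon mds.mon001 ops > /tmp/mds_ops.json    # full detail, incl. the inode involved
  ceph daemon mds.mon001 session ls                 # map client.34720 / client.34302 ... to hosts
  # last resort, drops that client's caps so the others can proceed:
  # ceph tell mds.mon001 client evict id=34720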

There's also almost no CPU load, almost no other I/O, and ceph is 
deep-scrubbing ~pg (this also finishes and the next pg is scrubbed fine),
and the scrubbing is not even happening in the metadata pool (easy to see in 
the Luminous dashboard):
---
# ceph -s
  cluster:
id: some_funny_hash
health: HEALTH_WARN
1 MDSs report slow requests
 
  services:
mon: 3 daemons, quorum mon003,mon001,mon002
mgr: mon001(active), standbys: mon002, mon003
mds: cephfs_baf-1/1/1 up  {0=mon001=up:active}, 1 up:standby-replay, 1 
up:standby
osd: 196 osds: 196 up, 196 in
 
  data:
pools:   2 pools, 4224 pgs
objects: 15649k objects, 61761 GB
usage:   114 TB used, 586 TB / 700 TB avail
pgs: 4223 active+clean
     1    active+clean+scrubbing+deep
 
  io:
client:   175 kB/s rd, 3 op/s rd, 0 op/s wr
---

Does anybody have any idea what's going on here? 

Yesterday, this also happened, but resolved itself after about 1 hour. 
Right now, it's going on for about half an hour... 

Cheers,
Oliver





Re: [ceph-users] osds with different disk sizes may killing performance (?? ?)

2018-04-13 Thread David Turner
You'll find it said time and time again on the ML... avoid disks of
different sizes in the same cluster.  It's a headache that sucks.  It's not
impossible, it's not even overly hard to pull off... but it's very easy to
cause a mess and a lot of headaches.  It will also make it harder to
diagnose performance issues in the cluster.

There is no way to fill up all disks evenly with the same number of bytes
and then stop filling the small disks when they're full and only continue
filling the larger disks.  What will happen if you fill all disks
evenly with bytes instead of % is that the small disks will get filled
completely and all writes to the cluster will block until you do something
to reduce the amount used on the full disks.
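
If you do end up with mixed sizes anyway, the practical knob is the (CRUSH)
weight. A sketch only; the OSD id, weights and threshold are placeholders, and
lowering a big disk's CRUSH weight trades usable capacity for evenness:

  ceph osd df tree                        # current weights and fill per OSD
  ceph osd crush reweight osd.12 3.64     # e.g. let an 8TB OSD take roughly half its size
  ceph osd test-reweight-by-utilization   # dry run of the automatic variant
  ceph osd reweight-by-utilization 110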

On Fri, Apr 13, 2018 at 1:28 AM Ronny Aasen 
wrote:

> On 13. april 2018 05:32, Chad William Seys wrote:
> > Hello,
> > > I think your observations suggest that, to a first approximation,
> > > filling drives with bytes to the same absolute level is better for
> > > performance than filling drives to the same percentage full. Assuming
> > > random distribution of PGs, this would cause the smallest drives to be
> > > as active as the largest drives.
> > > E.g. if every drive had 1TB of data, each would be equally likely to
> > > contain the PG of interest.
> > > Of course, as more data was added the smallest drives could not hold
> > > more and the larger drives would become more active, but at least the smaller
> > > drives would be as active as possible.
>
> But in this case you would have a steep drop-off in performance: when
> you reach the fill level where small drives do not accept more data,
> suddenly you would have a performance cliff where only your larger disks
> are doing new writes, and only the larger disks are doing reads on new data.
>
>
> It is also easier to make the logical connection while you are
> installing new nodes/disks than a year later, when your cluster just
> happens to reach that fill level.
>
> It would also be an easier job balancing disks between nodes when you
> are adding OSDs anyway and the new ones are mostly empty, rather than
> when your small OSDs are full and your large disks have significant
> data on them.
>
>
>
> kind regards
> Ronny Aasen