Re: [ceph-users] help! pg inactive and slow requests after filestore to bluestore migration, version 12.2.12

2019-12-12 Thread Bryan Stillwell
Jelle,

Try putting just the WAL on the Optane NVMe.  I'm guessing your DB is too big
to fit within 5GB.  We used 5GB journals on our nodes as well, but when we
switched to BlueStore (using ceph-volume lvm batch) it created 37GiB logical
volumes for our DBs (a 200GB SSD split across 5 OSDs, or a 400GB SSD across 10).

Also, injecting those settings into the cluster will only work until the OSD is 
restarted.  You'll need to add them to ceph.conf to be persistent.
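
For example, something like this in ceph.conf on the OSD nodes would survive
restarts (a sketch using the values you injected; they're your numbers, not
recommendations):

[osd]
osd_recovery_max_active = 1
osd_recovery_op_priority = 1
osd_recovery_sleep = 0.3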

Bryan

> On Dec 12, 2019, at 3:40 PM, Jelle de Jong  wrote:
> 
> Notice: This email is from an external sender.
> 
> 
> 
> Hello everybody,
> 
> I got a three-node Ceph cluster made of E3-1220v3 CPUs, 24GB RAM, 6 HDD OSDs
> with a 32GB Intel Optane NVMe journal device per node, and 10Gb networking.
> 
> I wanted to move to bluestore due to dropping support of filestore, our
> cluster was working fine with filestore and we could take complete nodes
> out for maintenance without issues.
> 
> root@ceph04:~# ceph osd pool get libvirt-pool size
> size: 3
> root@ceph04:~# ceph osd pool get libvirt-pool min_size
> min_size: 2
> 
> I removed all OSDs from one node, zapping the OSD and journal devices. We
> recreated the OSDs as BlueStore and used a small 5GB partition as the
> RocksDB device (instead of a journal) for each OSD.
> 
> I saw the cluster suffer with inactive PGs and slow requests.
> 
> I tried setting the following on all nodes, but it made no difference:
> ceph tell osd.* injectargs '--osd_recovery_max_active 1'
> ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
> ceph tell osd.* injectargs '--osd_recovery_sleep 0.3'
> systemctl restart ceph-osd.target
> 
> It took three days to recover and during this time clients were not
> responsive.
> 
> How can I migrate to BlueStore without inactive PGs or slow requests? I
> have several more FileStore clusters and I would like to know how to
> migrate them without inactive PGs and slow requests.
> 
> As a side question: I optimized our cluster for FileStore, and the Intel
> Optane NVMe journals showed good fio dsync write results. Does BlueStore
> also use dsync writes for the RocksDB device, or should we select NVMe
> devices on other specifications? My tests with FileStore showed that the
> Optane NVMe SSD was faster than the Samsung 970 Pro NVMe SSD, and I only
> need a few GB for FileStore journals, but with BlueStore's RocksDB the
> situation is different and I can't find documentation on how to speed
> test NVMe devices for BlueStore.
> 
> Kind regards,
> 
> Jelle
> 
> root@ceph04:~# ceph osd tree
> ID CLASS WEIGHT   TYPE NAME   STATUS REWEIGHT PRI-AFF
> -1   60.04524 root default
> -2   20.01263 host ceph04
> 0   hdd  2.72899 osd.0   up  1.0 1.0
> 1   hdd  2.72899 osd.1   up  1.0 1.0
> 2   hdd  5.45799 osd.2   up  1.0 1.0
> 3   hdd  2.72899 osd.3   up  1.0 1.0
> 14   hdd  3.63869 osd.14  up  1.0 1.0
> 15   hdd  2.72899 osd.15  up  1.0 1.0
> -3   20.01263 host ceph05
> 4   hdd  5.45799 osd.4   up  1.0 1.0
> 5   hdd  2.72899 osd.5   up  1.0 1.0
> 6   hdd  2.72899 osd.6   up  1.0 1.0
> 13   hdd  3.63869 osd.13  up  1.0 1.0
> 16   hdd  2.72899 osd.16  up  1.0 1.0
> 18   hdd  2.72899 osd.18  up  1.0 1.0
> -4   20.01997 host ceph06
> 8   hdd  5.45999 osd.8   up  1.0 1.0
> 9   hdd  2.73000 osd.9   up  1.0 1.0
> 10   hdd  2.73000 osd.10  up  1.0 1.0
> 11   hdd  2.73000 osd.11  up  1.0 1.0
> 12   hdd  3.64000 osd.12  up  1.0 1.0
> 17   hdd  2.73000 osd.17  up  1.0 1.0
> 
> 
> root@ceph04:~# ceph status
>  cluster:
>id: 85873cda-4865-4147-819d-8deda5345db5
>health: HEALTH_WARN
>18962/11801097 objects misplaced (0.161%)
>1/3933699 objects unfound (0.000%)
>Reduced data availability: 42 pgs inactive
>Degraded data redundancy: 3645135/11801097 objects degraded
> (30.888%), 959 pgs degraded, 960 pgs undersized
>110 slow requests are blocked > 32 sec. Implicated osds 3,10,11
> 
>  services:
>mon: 3 daemons, quorum ceph04,ceph05,ceph06
>mgr: ceph04(active), standbys: ceph06, ceph05
>osd: 18 osds: 18 up, 18 in; 964 remapped pgs
> 
>  data:
>pools:   1 pools, 1024 pgs
>objects: 3.93M objects, 15.0TiB
>usage:   31.2TiB used, 28.8TiB / 60.0TiB avail
>pgs: 4.102% pgs not active
> 3645135/11801097 objects degraded (30.888%)
> 18962/11801097 objects misplaced (0.161%)
> 1/3933699 objects unfound (0.000%)
> 913 active+undersized+degraded+remapped+backfill_wait
> 60  active+clean
> 41  activating+undersized+degraded+remapped
> 4   active+remapped+backfill_wait
> 4   

Re: [ceph-users] Ceph OSD node trying to possibly start OSDs that were purged

2019-10-29 Thread Bryan Stillwell
On Oct 29, 2019, at 11:23 AM, Jean-Philippe Méthot 
 wrote:
> A few months back, we had one of our OSD node motherboards die. At the time, 
> we simply waited for recovery and purged the OSDs that were on the dead node. 
> We just replaced that node and added back the drives as new OSDs. At the ceph 
> administration level, everything looks fine, no duplicate OSDs when I execute 
> map commands or ask Ceph to list what OSDs are on the node. However, on the 
> OSD node, in /var/log/ceph/ceph-volume, I see that every time the server 
> boots, ceph-volume tries to search for OSD fsids that don’t exist. Here’s the 
> error:
> 
> [2019-10-29 13:12:02,864][ceph_volume][ERROR ] exception caught by decorator
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 59, 
> in newfunc
> return f(*a, **kw)
>   File "/usr/lib/python2.7/site-packages/ceph_volume/main.py", line 148, in 
> main
> terminal.dispatch(self.mapper, subcommand_args)
>   File "/usr/lib/python2.7/site-packages/ceph_volume/terminal.py", line 182, 
> in dispatch
> instance.main()
>   File "/usr/lib/python2.7/site-packages/ceph_volume/devices/lvm/main.py", 
> line 40, in main
> terminal.dispatch(self.mapper, self.argv)
>   File "/usr/lib/python2.7/site-packages/ceph_volume/terminal.py", line 182, 
> in dispatch
> instance.main()
>   File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 16, 
> in is_root
> return func(*a, **kw)
>   File "/usr/lib/python2.7/site-packages/ceph_volume/devices/lvm/trigger.py", 
> line 70, in main
> Activate(['--auto-detect-objectstore', osd_id, osd_uuid]).main()
>   File 
> "/usr/lib/python2.7/site-packages/ceph_volume/devices/lvm/activate.py", line 
> 339, in main
> self.activate(args)
>   File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 16, 
> in is_root
> return func(*a, **kw)
>   File 
> "/usr/lib/python2.7/site-packages/ceph_volume/devices/lvm/activate.py", line 
> 249, in activate
> raise RuntimeError('could not find osd.%s with fsid %s' % (osd_id, 
> osd_fsid))
> RuntimeError: could not find osd.213 with fsid 
> 22800a80-2445-41a3-8643-69b4b84d598a
> 
> Of course this fsid ID isn’t listed anywhere in Ceph. Where does ceph-volume 
> get this fsid from? Even when looking at the code, it’s not particularly 
> obvious. This is ceph mimic running on CentOS 7 and bluestore.

That's not the cluster fsid, but the osd fsid.  Try running this command on 
your OSD node to get more details:

ceph-volume inventory --format json-pretty

My guess is there are some systemd files laying around for the old OSDs, or you 
were using 'ceph-volume simple' in the past (check for /etc/ceph/osd/).
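
A sketch of what to look for (the unit name below uses the osd id/fsid from
your error message):

# ceph-volume enables one activation unit per OSD, named
# ceph-volume@lvm-<osd id>-<osd fsid>; stale ones stick around after a purge.
ls /etc/systemd/system/multi-user.target.wants/ | grep ceph-volume

# If one references the purged OSD, disable it:
systemctl disable ceph-volume@lvm-213-22800a80-2445-41a3-8643-69b4b84d598a.service

# And check for leftover 'ceph-volume simple' metadata files:
ls /etc/ceph/osd/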

Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inconsistent PGs caused by omap_digest mismatch

2019-04-09 Thread Bryan Stillwell

> On Apr 8, 2019, at 5:42 PM, Bryan Stillwell  wrote:
> 
> 
>> On Apr 8, 2019, at 4:38 PM, Gregory Farnum  wrote:
>> 
>> On Mon, Apr 8, 2019 at 3:19 PM Bryan Stillwell  
>> wrote:
>>> 
>>> There doesn't appear to be any correlation between the OSDs which would 
>>> point to a hardware issue, and since it's happening on two different 
>>> clusters I'm wondering if there's a race condition that has been fixed in a 
>>> later version?
>>> 
>>> Also, what exactly is the omap digest?  From what I can tell it appears to 
>>> be some kind of checksum for the omap data.  Is that correct?
>> 
>> Yeah; it's just a crc over the omap key-value data that's checked
>> during deep scrub. Same as the data digest.
>> 
>> I've not noticed any issues around this in Luminous but I probably
>> wouldn't have, so will have to leave it up to others if there are
>> fixes in since 12.2.8.
> 
> Thanks for adding some clarity to that Greg!
> 
> For some added information, this is what the logs reported earlier today:
> 
> 2019-04-08 11:46:15.610169 osd.504 osd.504 10.16.10.30:6804/8874 33 : cluster 
> [ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest 
> 0x26a1241b != omap_digest 0x4c10ee76 from shard 504
> 2019-04-08 11:46:15.610190 osd.504 osd.504 10.16.10.30:6804/8874 34 : cluster 
> [ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest 
> 0x26a1241b != omap_digest 0x4c10ee76 from shard 504
> 
> I then tried deep scrubbing it again to see if the data was fine, but the 
> digest calculation was just having problems.  It came back with the same 
> problem with new digest values:
> 
> 2019-04-08 15:56:21.186291 osd.504 osd.504 10.16.10.30:6804/8874 49 : cluster 
> [ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest 
> 0x93bac8f != omap_digest 0xab1b9c6f from shard 504
> 2019-04-08 15:56:21.186313 osd.504 osd.504 10.16.10.30:6804/8874 50 : cluster 
> [ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest 
> 0x93bac8f != omap_digest 0xab1b9c6f from shard 504
> 
> Which makes sense, but doesn’t explain why the omap data is getting out of 
> sync across multiple OSDs and clusters…
> 
> I’ll see what I can figure out tomorrow, but if anyone else has some hints I 
> would love to hear them.

I’ve dug into this more today and it appears that the omap data contains an 
extra entry on the OSDs with the mismatched omap digests.  I then searched the 
RGW logs and found that a DELETE happened shortly after the OSD booted, but the 
omap data wasn’t updated on that OSD so it became mismatched.

Here’s a timeline of the events which caused PG 7.9 to become inconsistent:

2019-04-04 14:37:34 - osd.492 marked itself down
2019-04-04 14:40:35 - osd.492 boot
2019-04-04 14:41:55 - DELETE call happened
2019-04-08 12:06:14 - omap_digest mismatch detected (pg 7.9 is 
active+clean+inconsistent, acting [492,546,523])

Here’s the timeline for PG 7.2b:

2019-04-03 13:54:17 - osd.488 marked itself down
2019-04-03 13:59:27 - osd.488 boot
2019-04-03 14:00:54 - DELETE call happened
2019-04-08 12:42:21 - omap_digest mismatch detected (pg 7.2b is 
active+clean+inconsistent, acting [488,511,541])

Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inconsistent PGs caused by omap_digest mismatch

2019-04-08 Thread Bryan Stillwell

> On Apr 8, 2019, at 4:38 PM, Gregory Farnum  wrote:
> 
> On Mon, Apr 8, 2019 at 3:19 PM Bryan Stillwell  wrote:
>> 
>> There doesn't appear to be any correlation between the OSDs which would 
>> point to a hardware issue, and since it's happening on two different 
>> clusters I'm wondering if there's a race condition that has been fixed in a 
>> later version?
>> 
>> Also, what exactly is the omap digest?  From what I can tell it appears to 
>> be some kind of checksum for the omap data.  Is that correct?
> 
> Yeah; it's just a crc over the omap key-value data that's checked
> during deep scrub. Same as the data digest.
> 
> I've not noticed any issues around this in Luminous but I probably
> wouldn't have, so will have to leave it up to others if there are
> fixes in since 12.2.8.

Thanks for adding some clarity to that Greg!

For some added information, this is what the logs reported earlier today:

2019-04-08 11:46:15.610169 osd.504 osd.504 10.16.10.30:6804/8874 33 : cluster 
[ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest 
0x26a1241b != omap_digest 0x4c10ee76 from shard 504
2019-04-08 11:46:15.610190 osd.504 osd.504 10.16.10.30:6804/8874 34 : cluster 
[ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest 
0x26a1241b != omap_digest 0x4c10ee76 from shard 504

I then tried deep scrubbing it again to see if the data was fine, but the 
digest calculation was just having problems.  It came back with the same 
problem with new digest values:

2019-04-08 15:56:21.186291 osd.504 osd.504 10.16.10.30:6804/8874 49 : cluster 
[ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest 
0x93bac8f != omap_digest 0xab1b9c6f from shard 504
2019-04-08 15:56:21.186313 osd.504 osd.504 10.16.10.30:6804/8874 50 : cluster 
[ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest 
0x93bac8f != omap_digest 0xab1b9c6f from shard 504

Which makes sense, but doesn’t explain why the omap data is getting out of sync 
across multiple OSDs and clusters…

I’ll see what I can figure out tomorrow, but if anyone else has some hints I 
would love to hear them.

Thanks,
Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Inconsistent PGs caused by omap_digest mismatch

2019-04-08 Thread Bryan Stillwell
We have two separate RGW clusters running Luminous (12.2.8) that have started 
seeing an increase in PGs going active+clean+inconsistent with the reason being 
caused by an omap_digest mismatch.  Both clusters are using FileStore and the 
inconsistent PGs are happening on the .rgw.buckets.index pool which was moved 
from HDDs to SSDs within the last few months.

We've been repairing them by first making sure the odd omap_digest is not the 
primary by setting the primary-affinity to 0 if needed, doing the repair, and 
then setting the primary-affinity back to 1.

For example PG 7.3 went inconsistent earlier today:

# rados list-inconsistent-obj 7.3 -f json-pretty | jq -r '.inconsistents[] | .errors, .shards'
[
  "omap_digest_mismatch"
]
[
  {
"osd": 504,
"primary": true,
"errors": [],
"size": 0,
"omap_digest": "0x4c10ee76",
"data_digest": "0x"
  },
  {
"osd": 525,
"primary": false,
"errors": [],
"size": 0,
"omap_digest": "0x26a1241b",
"data_digest": "0x"
  },
  {
"osd": 556,
"primary": false,
"errors": [],
"size": 0,
"omap_digest": "0x26a1241b",
"data_digest": "0x"
  }
]

Since the odd omap_digest is on osd.504 and osd.504 is the primary, we would 
set the primary-affinity to 0 with:

# ceph osd primary-affinity osd.504 0

Do the repair:

# ceph pg repair 7.3

And then once the repair is complete we would set the primary-affinity back to 
1 on osd.504:

# ceph osd primary-affinity osd.504 1
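
Put together, the whole procedure looks roughly like this (an untested sketch;
it just waits for the inconsistent flag to clear before restoring the
primary-affinity):

#!/bin/bash
# Demote the OSD with the odd omap_digest, repair the PG, then restore it.
PG=7.3
OSD=504

ceph osd primary-affinity osd.${OSD} 0
ceph pg repair ${PG}

# The PG drops the "inconsistent" state once the repair finishes.
while ceph health detail | grep -q "pg ${PG} is.*inconsistent"; do
    sleep 30
done

ceph osd primary-affinity osd.${OSD} 1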

There doesn't appear to be any correlation between the OSDs which would point 
to a hardware issue, and since it's happening on two different clusters I'm 
wondering if there's a race condition that has been fixed in a later version?

Also, what exactly is the omap digest?  From what I can tell it appears to be 
some kind of checksum for the omap data.  Is that correct?

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Is repairing an RGW bucket index broken?

2019-03-11 Thread Bryan Stillwell
I'm wondering if the 'radosgw-admin bucket check --fix' command is broken in 
Luminous (12.2.8)?

I'm asking because I'm trying to reproduce a situation we have on one of our 
production clusters and it doesn't seem to do anything.  Here's the steps of my 
test:

1. Create a bucket with 1 million objects
2. Verify the bucket got sharded into 10 shards of (100,000 objects each)
3. Remove one of the shards using the rados command
4. Verify the bucket is broken
5. Attempt to fix the bucket
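
For step 3 the shard was removed with something along these lines (shard .7 in
this test, which is why it's missing from the listing below):

rados -p .rgw.buckets.index rm .dir.default.1434737011.12485.7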

I got as far as step 4:

# rados -p .rgw.buckets.index ls | grep "default.1434737011.12485" | sort
.dir.default.1434737011.12485.0
.dir.default.1434737011.12485.1
.dir.default.1434737011.12485.2
.dir.default.1434737011.12485.3
.dir.default.1434737011.12485.4
.dir.default.1434737011.12485.5
.dir.default.1434737011.12485.6
.dir.default.1434737011.12485.8
.dir.default.1434737011.12485.9
# radosgw-admin bucket list --bucket=bstillwell-1mil
ERROR: store->list_objects(): (2) No such file or directory

But step 5 is proving problematic:

# time radosgw-admin bucket check --fix --bucket=bstillwell-1mil

real0m0.201s
user0m0.105s
sys 0m0.033s

# time radosgw-admin bucket check --fix --check-objects --bucket=bstillwell-1mil

real0m0.188s
user0m0.102s
sys 0m0.025s


Could someone help me figure out what I'm missing?

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Suggestions/experiences with mixed disk sizes and models from 4TB - 14TB

2019-01-17 Thread Bryan Stillwell
I've run my home cluster with drives ranging in size from 500GB to 8TB before, 
and the biggest issue you run into is that the bigger drives will get 
proportionally more PGs, which increases their memory requirements.  Typically 
you want around 100 PGs/OSD, but if you mix 4TB and 14TB drives in a cluster 
the 14TB drives will have 3.5 times the number of PGs.  So if the 4TB drives 
have 100 PGs, the 14TB drives will have 350.  Or if the 14TB drives have 100 
PGs, the 4TB drives will only have about 28 PGs on them.  Using the balancer 
plugin in the mgr will pretty much be required.
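
You can see how the PGs actually end up distributed with:

# The PGS column shows the per-OSD placement group count:
ceph osd df tree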

Also since you're using EC you'll need to make sure the math works with these 
nodes receiving 2-3.5 times the data.

Bryan

From: ceph-users  on behalf of Götz Reinicke 

Date: Wednesday, January 16, 2019 at 2:33 AM
To: ceph-users 
Subject: [ceph-users] Suggestions/experiences with mixed disk sizes and models 
from 4TB - 14TB

Dear Ceph users,

I’d like to get some feedback for the following thought:

Currently I run some 24*4TB bluestore OSD nodes. The main focus is on storage 
space over IOPS.

We use erasure code and cephfs, and things look good right now.

The „but“ is, I do need more disk space and don’t have so much more rack space 
available, so I was thinking of adding some 8TB or even 12TB OSDs and/or 
exchange over time 4TB OSDs with bigger disks.

My question is: How are your experiences with the current >=8TB SATA disks? 
Are there some very bad models out there which I should avoid?

The current OSD nodes are connected by 4*10Gb bonds, so for 
replication/recovery speed, is a 24-bay chassis with bigger disks useful, or 
should I go with smaller chassis? Or does the chassis size not matter all that 
much in my setup?

I know EC is quite compute-intensive, so maybe bigger disks also have an 
impact there?

Lots of questions; maybe you can help answer some of them.

Best regards and Thanks a lot for feedback . Götz



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to reduce min_size of an EC pool?

2019-01-17 Thread Bryan Stillwell
When you use 3+2 EC that means you have 3 data chunks and 2 erasure chunks for 
your data.  So you can handle two failures, but not three.  The min_size 
setting is preventing you from going below 3 because that's the number of data 
chunks you specified for the pool.  I'm sorry to say this, but since the data 
was wiped off the other 3 nodes there isn't anything that can be done to 
recover it.
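
If you want to double-check the k/m values behind the pool, something like
this works (the profile name is whatever the first command returns):

ceph osd pool get default.rgw.buckets.data erasure_code_profile
ceph osd erasure-code-profile get <profile name from above>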

Bryan


From: ceph-users  on behalf of Félix 
Barbeira 
Date: Thursday, January 17, 2019 at 1:27 PM
To: Ceph Users 
Subject: [ceph-users] How to reduce min_size of an EC pool?

I want to bring my cluster back to a HEALTHY state because right now I have no 
access to the data.

I have a 3+2 EC pool on a 5-node cluster. 3 nodes were lost and all their data 
wiped. They were reinstalled and added to the cluster again.

The "ceph health detail" command says to reduce min_size number to a value 
lower than 3, but:

root@ceph-monitor02:~# ceph osd pool set default.rgw.buckets.data min_size 2
Error EINVAL: pool min_size must be between 3 and 5
root@ceph-monitor02:~#

This is the situation:

root@ceph-monitor01:~# ceph -s
  cluster:
id: ce78b02d-03df-4f9e-a35a-31b5f05c4c63
health: HEALTH_WARN
Reduced data availability: 515 pgs inactive, 512 pgs incomplete

  services:
mon: 3 daemons, quorum ceph-monitor01,ceph-monitor03,ceph-monitor02
mgr: ceph-monitor02(active), standbys: ceph-monitor01, ceph-monitor03
osd: 57 osds: 57 up, 57 in

  data:
pools:   8 pools, 568 pgs
objects: 4.48 M objects, 10 TiB
usage:   24 TiB used, 395 TiB / 419 TiB avail
pgs: 0.528% pgs unknown
 90.141% pgs not active
 512 incomplete
 53  active+clean
 3   unknown

root@ceph-monitor01:~#

And this is the output of health detail:

root@ceph-monitor01:~# ceph health detail
HEALTH_WARN Reduced data availability: 515 pgs inactive, 512 pgs incomplete
PG_AVAILABILITY Reduced data availability: 515 pgs inactive, 512 pgs incomplete
pg 10.1cd is stuck inactive since forever, current state incomplete, last 
acting [9,48,41,58,17] (reducing pool default.rgw.buckets.data min_size from 3 
may help; search ceph.com/docs for 'incomplete')
pg 10.1ce is incomplete, acting [3,13,14,42,21] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1cf is incomplete, acting [36,27,3,39,51] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d0 is incomplete, acting [29,9,38,4,56] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d1 is incomplete, acting [2,34,17,7,30] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d2 is incomplete, acting [41,45,53,13,32] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d3 is incomplete, acting [7,28,15,20,3] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d4 is incomplete, acting [11,40,25,23,0] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d5 is incomplete, acting [32,51,20,57,28] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d6 is incomplete, acting [2,53,8,16,15] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d7 is incomplete, acting [1,2,33,43,42] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d8 is incomplete, acting [27,49,9,48,20] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d9 is incomplete, acting [37,8,7,11,20] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1da is incomplete, acting [27,14,33,15,53] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1db is incomplete, acting [58,53,6,26,4] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1dc is incomplete, acting [21,12,47,35,19] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1dd is incomplete, acting [51,4,52,24,7] 

Re: [ceph-users] pgs stuck in creating+peering state

2019-01-17 Thread Bryan Stillwell
Since you're using jumbo frames, make sure everything between the nodes 
properly supports them (NICs & switches).  I've tested this in the past by 
using the size option in ping (you need to use a payload size of 8972 instead 
of 9000 to account for the 28-byte IP/ICMP header):

ping -s 8972 192.168.160.237
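
To also rule out silent fragmentation along the way, the don't-fragment bit
can be set as well (assuming Linux iputils ping):

ping -M do -s 8972 192.168.160.237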

If that works, then you'll need to pull out tcpdump/wireshark to determine why 
the packets aren't able to return.

Bryan

From: ceph-users  on behalf of Johan Thomsen 

Date: Thursday, January 17, 2019 at 5:42 AM
To: Kevin Olbrich 
Cc: ceph-users 
Subject: Re: [ceph-users] pgs stuck in creating+peering state

Thank you for responding!

First thing: I disabled the firewall on all the nodes.
More specifically not firewalld, but the NixOS firewall, since I run NixOS.
I can netcat both udp and tcp traffic on all ports between all nodes
without problems.

Next, I tried raising the mtu to 9000 on the nics where the cluster
network is connected - although I don't see why the mtu should affect
the heartbeat?
I have two bonded nics connected to the cluster network (mtu 9000) and
two separate bonded nics hooked on the public network (mtu 1500).
I've tested traffic and routing on both pairs of nics and traffic gets
through without issues, apparently.


None of the above solved the problem :-(


On Thu, Jan 17, 2019 at 12:01 Kevin Olbrich <k...@sv01.de> wrote:

Are you sure, no service like firewalld is running?
Did you check that all machines have the same MTU and jumbo frames are
enabled if needed?

I had this problem when I first started with ceph and forgot to
disable firewalld.
Replication worked perfectly fine but the OSD was kicked out every few seconds.

Kevin

On Thu, Jan 17, 2019 at 11:57 Johan Thomsen <wr...@ownrisk.dk> wrote:
>
> Hi,
>
> I have a sad ceph cluster.
> All my osds complain about failed reply on heartbeat, like so:
>
> osd.10 635 heartbeat_check: no reply from 192.168.160.237:6810 osd.42
> ever on either front or back, first ping sent 2019-01-16
> 22:26:07.724336 (cutoff 2019-01-16 22:26:08.225353)
>
> .. I've checked the network sanity all I can, and all ceph ports are
> open between nodes both on the public network and the cluster network,
> and I have no problems sending traffic back and forth between nodes.
> I've tried tcpdump'ing and traffic is passing in both directions
> between the nodes, but unfortunately I don't natively speak the ceph
> protocol, so I can't figure out what's going wrong in the heartbeat
> conversation.
>
> Still:
>
> # ceph health detail
>
> HEALTH_WARN nodown,noout flag(s) set; Reduced data availability: 1072
> pgs inactive, 1072 pgs peering
> OSDMAP_FLAGS nodown,noout flag(s) set
> PG_AVAILABILITY Reduced data availability: 1072 pgs inactive, 1072 pgs peering
> pg 7.3cd is stuck inactive for 245901.560813, current state
> creating+peering, last acting [13,41,1]
> pg 7.3ce is stuck peering for 245901.560813, current state
> creating+peering, last acting [1,40,7]
> pg 7.3cf is stuck peering for 245901.560813, current state
> creating+peering, last acting [0,42,9]
> pg 7.3d0 is stuck peering for 245901.560813, current state
> creating+peering, last acting [20,8,38]
> pg 7.3d1 is stuck peering for 245901.560813, current state
> creating+peering, last acting [10,20,42]
>()
>
>
> I've set "noout" and "nodown" to prevent all osd's from being removed
> from the cluster. They are all running and marked as "up".
>
> # ceph osd tree
>
> ID  CLASS WEIGHTTYPE NAME  STATUS REWEIGHT PRI-AFF
>  -1   249.73434 root default
> -25   166.48956 datacenter m1
> -2483.24478 pod kube1
> -3541.62239 rack 10
> -3441.62239 host ceph-sto-p102
>  40   hdd   7.27689 osd.40 up  1.0 1.0
>  41   hdd   7.27689 osd.41 up  1.0 1.0
>  42   hdd   7.27689 osd.42 up  1.0 1.0
>()
>
> I'm at a point where I don't know which options and what logs to check 
> anymore?
>
> Any debug hint would be very much appreciated.
>
> btw. I have no important data in the cluster (yet), so if the solution
> is to drop all osd and recreate them, it's ok for now. But I'd really
> like to know how the cluster ended in this state.
>
> /Johan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Rebuilding RGW bucket indices from objects

2019-01-17 Thread Bryan Stillwell
This is sort of related to my email yesterday, but has anyone ever rebuilt a 
bucket index using the objects themselves?

It seems to be that it would be possible since the bucket_id is contained 
within the rados object name:

# rados -p .rgw.buckets.index listomapkeys .dir.default.56630221.139618
error getting omap key set .rgw.buckets.index/.dir.default.56630221.139618: (2) 
No such file or directory
# rados -p .rgw.buckets ls | grep default.56630221.139618
default.56630221.139618__shadow_.IxIe8byqV61eu6g7gSVXBpHfrB3BlC4_1
default.56630221.139618_backup.20181214
default.56630221.139618_backup.20181220
default.56630221.139618__shadow_.GQcmQKfbBkb9WEF1X-6qGBEVfppGKEJ_1
...[ many more snipped ]...
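
A rough sketch of how the object names could be recovered from the data pool
(head objects only; the __shadow_/__multipart_ tail objects are skipped):

# Assumes head objects are named "<bucket_id>_<object name>" as shown above.
BUCKET_ID="default.56630221.139618"
rados -p .rgw.buckets ls | \
    grep "^${BUCKET_ID}_" | \
    grep -v "^${BUCKET_ID}__shadow_" | \
    grep -v "^${BUCKET_ID}__multipart_" | \
    sed "s/^${BUCKET_ID}_//"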

Thanks,
Bryan



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fixing a broken bucket index in RGW

2019-01-16 Thread Bryan Stillwell
I'm looking for some help in fixing a bucket index on a Luminous (12.2.8)
cluster running on FileStore.

First some background on how I believe the bucket index became broken.  Last
month we had a PG in our .rgw.buckets.index pool become inconsistent:

2018-12-11 09:12:17.743983 osd.1879 osd.1879 10.36.173.147:6820/60041 16 : 
cluster [ERR] 7.8e : soid 7:717333b6:::.dir.default.1110451812.43.2:head 
omap_digest 0x59e4f686 != omap_digest 0x37b99ba6 from shard 1879

We then attempted to repair the PG by using 'ceph pg repair 7.8e', but I
have a feeling the primary copy must have been corrupt (later that day I
learned about 'rados list-inconsistent-obj 7.8e -f json-pretty').  The
repair resulted in an unfound object:

2018-12-11 09:32:02.651241 osd.1753 osd.1753 10.32.12.32:6820/3455358 13 : 
cluster [ERR] 7.8e push 7:717333b6:::.dir.default.1110451812.43.2:head v 
767605'30158112 failed because local copy is 767605'30158924

A couple hours later we started getting reports of 503s from multiple
customers.  Believing that the unfound object was the cause of the problem
we used the 'mark_unfound_lost revert' option to roll back to the previous
version:

ceph pg 7.8e mark_unfound_lost revert

This fixed the cluster, but broke the bucket.

Attempting to list the bucket contents results in:

[root@p3cephrgw007 ~]# radosgw-admin bucket list --bucket=backups.579
ERROR: store->list_objects(): (2) No such file or directory


This bucket appears to have been automatically sharded after we upgraded to
Luminous, so we do have an old bucket instance available (but it's too old
to be very helpful):

[root@p3cephrgw007 ~]# radosgw-admin metadata list bucket.instance | grep backups.579
"backups.579:default.1110451812.43",
"backups.579:default.28086735.566138",


Looking for for all the shards based on the name only pulls up the first 2
shards:

[root@p3cephrgw007 ~]# rados -p .rgw.buckets.index ls | grep "default.1110451812.43"
...
.dir.default.1110451812.43.0
...
.dir.default.1110451812.43.1
...


But the bucket metadata says there should be three:

[root@p3cephrgw007 ~]# radosgw-admin metadata get bucket.instance:backups.579:default.1110451812.43 | jq -r '.data.bucket_info.num_shards'
3
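
A quick way to compare the expected shard objects against what's actually in
the index pool (a sketch):

# num_shards is 3, so shards .0 through .2 should exist:
for shard in 0 1 2; do echo ".dir.default.1110451812.43.${shard}"; done
rados -p .rgw.buckets.index ls | grep '^\.dir\.default\.1110451812\.43\.' | sort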


If we look in the log message above it said .dir.default.1110451812.43.2 was
the rados object that was slightly newer, so the revert command we ran must
have removed it instead of rolling it back to the previous version.

This leaves me with some questions:

What would have been the better way for dealing with this problem when the
whole cluster stopped working?

Is there a way to recreate the bucket index?  I see a couple options in the
docs for fixing the bucket index (--fix) and for rebuilding the bucket index
(--check-objects), but I don't see any explanations on how it goes about
doing that.  Will it attempt to scan all the objects in the cluster to
determine which ones belong in this bucket index?  Will the missing shard be
ignored and the fixed bucket index be missing 1/3rd of the objects?

Thanks,
Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osdmaps not being cleaned up in 12.2.8

2019-01-11 Thread Bryan Stillwell
I've created the following bug report to address this issue:

http://tracker.ceph.com/issues/37875

Bryan

From: ceph-users  on behalf of Bryan 
Stillwell 
Date: Friday, January 11, 2019 at 8:59 AM
To: Dan van der Ster 
Cc: ceph-users 
Subject: Re: [ceph-users] osdmaps not being cleaned up in 12.2.8

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/013060.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osdmaps not being cleaned up in 12.2.8

2019-01-11 Thread Bryan Stillwell
That thread looks like the right one.

So far I haven't needed to restart the osd's for the churn trick to work.  I 
bet you're right that something thinks it still needs one of the old osdmaps on 
your cluster.  Last night our cluster finished another round of expansions and 
we're seeing up to 49,272 osdmaps hanging around.  The churn trick seems to be 
working again too.

Bryan

From: Dan van der Ster 
Date: Thursday, January 10, 2019 at 3:13 AM
To: Bryan Stillwell 
Cc: ceph-users 
Subject: Re: [ceph-users] osdmaps not being cleaned up in 12.2.8

Hi Bryan,

I think this is the old hammer thread you refer to:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/013060.html

We also have osdmaps accumulating on v12.2.8 -- ~12000 per osd at the moment.

I'm trying to churn the osdmaps like before, but our maps are not being trimmed.

Did you need to restart the osd's before the churn trick would work?
If so, it seems that something is holding references to old maps, like
like that old hammer issue.

Cheers, Dan


On Tue, Jan 8, 2019 at 5:39 PM Bryan Stillwell <bstillw...@godaddy.com> wrote:

I was able to get the osdmaps to slowly trim (maybe 50 would trim with each 
change) by making small changes to the CRUSH map like this:



for i in {1..100}; do
  ceph osd crush reweight osd.1754 4.1
  sleep 5
  ceph osd crush reweight osd.1754 4
  sleep 5
done



I believe this was the solution Dan came across back in the hammer days.  It 
works, but not ideal for sure.  Across the cluster it freed up around 50TB of 
data!



Bryan



From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Bryan 
Stillwell <bstillw...@godaddy.com>
Date: Monday, January 7, 2019 at 2:40 PM
To: ceph-users <ceph-users@lists.ceph.com>
Subject: [ceph-users] osdmaps not being cleaned up in 12.2.8



I have a cluster with over 1900 OSDs running Luminous (12.2.8) that isn't 
cleaning up old osdmaps after doing an expansion.  This is even after the 
cluster became 100% active+clean:



# find /var/lib/ceph/osd/ceph-1754/current/meta -name 'osdmap*' | wc -l

46181



With the osdmaps being over 600KB in size this adds up:



# du -sh /var/lib/ceph/osd/ceph-1754/current/meta

31G    /var/lib/ceph/osd/ceph-1754/current/meta



I remember running into this during the hammer days:



http://tracker.ceph.com/issues/13990



Did something change recently that may have broken this fix?



Thanks,

Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osdmaps not being cleaned up in 12.2.8

2019-01-08 Thread Bryan Stillwell
I was able to get the osdmaps to slowly trim (maybe 50 would trim with each 
change) by making small changes to the CRUSH map like this:

for i in {1..100}; do
ceph osd crush reweight osd.1754 4.1
sleep 5
ceph osd crush reweight osd.1754 4
sleep 5
done

I believe this was the solution Dan came across back in the hammer days.  It 
works, but not ideal for sure.  Across the cluster it freed up around 50TB of 
data!

Bryan

From: ceph-users  on behalf of Bryan 
Stillwell 
Date: Monday, January 7, 2019 at 2:40 PM
To: ceph-users 
Subject: [ceph-users] osdmaps not being cleaned up in 12.2.8

I have a cluster with over 1900 OSDs running Luminous (12.2.8) that isn't 
cleaning up old osdmaps after doing an expansion.  This is even after the 
cluster became 100% active+clean:

# find /var/lib/ceph/osd/ceph-1754/current/meta -name 'osdmap*' | wc -l
46181

With the osdmaps being over 600KB in size this adds up:

# du -sh /var/lib/ceph/osd/ceph-1754/current/meta
31G    /var/lib/ceph/osd/ceph-1754/current/meta

I remember running into this during the hammer days:

http://tracker.ceph.com/issues/13990

Did something change recently that may have broken this fix?

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is it possible to increase Ceph Mon store?

2019-01-07 Thread Bryan Stillwell
I believe the option you're looking for is mon_data_size_warn.  The default is 
set to 16106127360 bytes (15GiB).

I've found that sometimes the mons need a little help getting started with 
trimming if you just completed a large expansion.  Earlier today I had a 
cluster where the mon's data directory was over 40GB on all the mons.  When I 
restarted them one at a time with 'mon_compact_on_start = true' set in the 
'[mon]' section of ceph.conf, they stayed around 40GB in size.   However, when 
I was about to hit send on an email to the list about this very topic, the 
warning cleared up and now the data directory is now between 1-3GB on each of 
the mons.  This was on a cluster with >1900 OSDs.
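
For reference, the procedure was roughly this (a sketch; mon IDs and unit
names depend on your deployment):

# In ceph.conf on the mon nodes:
[mon]
mon_compact_on_start = true

# Then restart the mons one at a time, waiting for quorum between each:
systemctl restart ceph-mon@$(hostname -s)
ceph quorum_status | jq .quorum_names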

Bryan

From: ceph-users  on behalf of Pardhiv Karri 

Date: Monday, January 7, 2019 at 11:08 AM
To: ceph-users 
Subject: [ceph-users] Is it possible to increase Ceph Mon store?

Hi,

We have a large Ceph cluster (Hammer version). We recently saw its mon store 
growing too big > 15GB on all 3 monitors without any rebalancing happening for 
quiet sometime. We have compacted the DB using  "#ceph tell mon.[ID] compact" 
for now. But is there a way to increase the size of the mon store to 32GB or 
something to avoid getting the Ceph health to warning state due to Mon store 
growing too big?

--
Thanks,
Pardhiv Karri



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] osdmaps not being cleaned up in 12.2.8

2019-01-07 Thread Bryan Stillwell
I have a cluster with over 1900 OSDs running Luminous (12.2.8) that isn't 
cleaning up old osdmaps after doing an expansion.  This is even after the 
cluster became 100% active+clean:

# find /var/lib/ceph/osd/ceph-1754/current/meta -name 'osdmap*' | wc -l
46181

With the osdmaps being over 600KB in size this adds up:

# du -sh /var/lib/ceph/osd/ceph-1754/current/meta
31G /var/lib/ceph/osd/ceph-1754/current/meta

I remember running into this during the hammer days:

http://tracker.ceph.com/issues/13990

Did something change recently that may have broken this fix?

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Omap issues - metadata creating too many

2019-01-03 Thread Bryan Stillwell
Josef,

I've noticed that when dynamic resharding is on it'll reshard some of our 
bucket indices daily (sometimes more).  This causes a lot of wasted space in 
the .rgw.buckets.index pool which might be what you are seeing.

You can get a listing of all the bucket instances in your cluster with this 
command:

radosgw-admin metadata list bucket.instance | jq -r '.[]' | sort

Give that a try and see if you see the same problem.  It seems that once you 
remove the old bucket instances the omap dbs don't reduce in size until you 
compact them.
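
For the compaction itself, an online compact per OSD does the trick:

ceph tell osd.510 compact    # repeat for each OSD backing .rgw.buckets.index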

Bryan

From: Josef Zelenka 
Date: Thursday, January 3, 2019 at 3:49 AM
To: "J. Eric Ivancich" 
Cc: "ceph-users@lists.ceph.com" , Bryan Stillwell 

Subject: Re: [ceph-users] Omap issues - metadata creating too many

Hi, I had the default - so it was on (according to the Ceph KB). I turned it
off, but the issue persists. I noticed Bryan Stillwell (cc-ing him) had
the same issue (reported about it yesterday) and tried his tips about
compacting, but it doesn't do anything. However, I have to add to his
last point that this happens even with BlueStore. Is there anything we can
do to clean up the omap manually?

Josef

On 18/12/2018 23:19, J. Eric Ivancich wrote:
On 12/17/18 9:18 AM, Josef Zelenka wrote:
Hi everyone, I'm running a Luminous 12.2.5 cluster with 6 hosts on
Ubuntu 16.04 - 12 HDDs for data each, plus 2 SSD metadata OSDs (three
nodes have an additional SSD I added to have more space to rebalance the
metadata). Currently, the cluster is used mainly as radosgw storage,
with 28TB of data in total and 2x replication for both the metadata and data
pools (a CephFS instance is running alongside, but I don't think
it's the perpetrator - this likely happened before we had it). All
pools aside from the CephFS data pool and the radosgw data pool are
located on the SSDs. Now, the interesting thing - at random
times, the metadata OSDs fill up their entire capacity with OMAP data
and go read-only, and we currently have no other option than deleting
them and re-creating them. The fill-up comes at a random time; it doesn't seem
to be triggered by anything and it isn't caused by some influx of data. It
seems like some kind of a bug to me to be honest, but I'm not certain -
has anyone else seen this behavior with their radosgw? Thanks a lot
Hi Josef,

Do you have rgw_dynamic_resharding turned on? Try turning it off and see
if the behavior continues.

One theory is that dynamic resharding is triggered and possibly not
completing. This could add a lot of data to omap for the incomplete
bucket index shards. After a delay it tries resharding again, possibly
failing again, and adding more data to the omap. This continues.

If this is the ultimate issue we have some commits on the upstream
luminous branch that are designed to address this set of issues.

But we should first see if this is the cause.

Eric

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Compacting omap data

2019-01-02 Thread Bryan Stillwell
Recently on one of our bigger clusters (~1,900 OSDs) running Luminous (12.2.8), 
we had a problem where OSDs would frequently get restarted while deep-scrubbing.

After digging into it I found that a number of the OSDs had very large omap 
directories (50GiB+).  I believe these were OSDs that had previous held PGs 
that were part of the .rgw.buckets.index pool which I have recently moved to 
all SSDs, however, it seems like the data remained on the HDDs.

I was able to reduce the data usage on most of the OSDs (from ~50 GiB to < 200 
MiB!) by compacting the omap dbs offline by setting 'leveldb_compact_on_mount = 
true' in the [osd] section of ceph.conf, but that didn't work on the newer OSDs 
which use rocksdb.  On those I had to do an online compaction using a command 
like:

$ ceph tell osd.510 compact

That worked, but today when I tried doing that on some of the SSD-based OSDs 
which are backing .rgw.buckets.index I started getting slow requests and the 
compaction ultimately failed with this error:

$ ceph tell osd.1720 compact
osd.1720: Error ENXIO: osd down

When I tried it again it succeeded:

$ ceph tell osd.1720 compact
osd.1720: compacted omap in 420.999 seconds

The data usage on that OSD dropped from 57.8 GiB to 43.4 GiB which was nice, 
but I don't believe that'll get any smaller until I start splitting the PGs in 
the .rgw.buckets.index pool to better distribute that pool across the SSD-based 
OSDs.

The first question I have is what is the option to do an offline compaction of 
rocksdb so I don't impact our customers while compacting the rest of the 
SSD-based OSDs?

The next question is if there's a way to configure Ceph to automatically 
compact the omap dbs in the background in a way that doesn't affect user 
experience?

Finally, I was able to figure out that the omap directories were getting large 
because we're using filestore on this cluster, but how could someone determine 
this when using BlueStore?

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Removing orphaned radosgw bucket indexes from pool

2018-11-29 Thread Bryan Stillwell
Wido,

I've been looking into this large omap objects problem on a couple of our 
clusters today and came across your script during my research.

The script has been running for a few hours now and I'm already over 100,000 
'orphaned' objects!

It appears that ever since upgrading to Luminous (12.2.5 initially, followed by 
12.2.8) this cluster has been resharding the large bucket indexes at least once 
a day and not cleaning up the previous bucket indexes:

for instance in $(radosgw-admin metadata list bucket.instance | jq -r '.[]' | grep go-test-dashboard); do
  mtime=$(radosgw-admin metadata get bucket.instance:${instance} | grep mtime)
  num_shards=$(radosgw-admin metadata get bucket.instance:${instance} | grep num_shards)
  echo "${instance}: ${mtime} ${num_shards}"
done | column -t | sort -k3
go-test-dashboard:default.188839135.327804:  "mtime": "2018-06-01 22:35:28.693095Z",  "num_shards": 0,
go-test-dashboard:default.617828918.2898:    "mtime": "2018-06-02 22:35:40.438738Z",  "num_shards": 46,
go-test-dashboard:default.617828918.4:       "mtime": "2018-06-02 22:38:21.537259Z",  "num_shards": 46,
go-test-dashboard:default.617663016.10499:   "mtime": "2018-06-03 23:00:04.185285Z",  "num_shards": 46,
[...snip...]
go-test-dashboard:default.891941432.342061:  "mtime": "2018-11-28 01:41:46.777968Z",  "num_shards": 7,
go-test-dashboard:default.928133068.2899:    "mtime": "2018-11-28 20:01:49.390237Z",  "num_shards": 46,
go-test-dashboard:default.928133068.5115:    "mtime": "2018-11-29 01:54:17.788355Z",  "num_shards": 7,
go-test-dashboard:default.928133068.8054:    "mtime": "2018-11-29 20:21:53.733824Z",  "num_shards": 7,
go-test-dashboard:default.891941432.359004:  "mtime": "2018-11-29 20:22:09.201965Z",  "num_shards": 46,

The num_shards is typically around 46, but looking at all 288 instances of that 
bucket index, it has varied between 3 and 62 shards.

Have you figured anything more out about this since you posted this originally 
two weeks ago?

Thanks,
Bryan

From: ceph-users  on behalf of Wido den 
Hollander 
Date: Thursday, November 15, 2018 at 5:43 AM
To: Ceph Users 
Subject: [ceph-users] Removing orphaned radosgw bucket indexes from pool

Hi,

Recently we've seen multiple messages on the mailinglists about people
seeing HEALTH_WARN due to large OMAP objects on their cluster. This is
due to the fact that starting with 12.2.6 OSDs warn about this.

I've got multiple people asking me the same questions and I've done some
digging around.

Somebody on the ML wrote this script:

for bucket in `radosgw-admin metadata list bucket | jq -r '.[]' | sort`; do
  actual_id=`radosgw-admin bucket stats --bucket=${bucket} | jq -r '.id'`
  for instance in `radosgw-admin metadata list bucket.instance | jq -r '.[]' | grep ${bucket}: | cut -d ':' -f 2`
  do
    if [ "$actual_id" != "$instance" ]
    then
      radosgw-admin bi purge --bucket=${bucket} --bucket-id=${instance}
      radosgw-admin metadata rm bucket.instance:${bucket}:${instance}
    fi
  done
done

That partially works, but 'orphaned' objects in the index pool do not work.

So I wrote my own script [0]:

#!/bin/bash
INDEX_POOL=$1

if [ -z "$INDEX_POOL" ]; then
    echo "Usage: $0 <index pool>"
    exit 1
fi

INDEXES=$(mktemp)
METADATA=$(mktemp)

trap "rm -f ${INDEXES} ${METADATA}" EXIT

radosgw-admin metadata list bucket.instance|jq -r '.[]' > ${METADATA}
rados -p ${INDEX_POOL} ls > $INDEXES

for OBJECT in $(cat ${INDEXES}); do
    MARKER=$(echo ${OBJECT} | cut -d '.' -f 3,4,5)
    grep ${MARKER} ${METADATA} > /dev/null
    if [ "$?" -ne 0 ]; then
        echo $OBJECT
    fi
done

It does not remove anything, but for example, it returns these objects:

.dir.eb32b1ca-807a-4867-aea5-ff43ef7647c6.10406917.5752
.dir.eb32b1ca-807a-4867-aea5-ff43ef7647c6.10289105.6162
.dir.eb32b1ca-807a-4867-aea5-ff43ef7647c6.10289105.6186

The output of:

$ radosgw-admin metadata list|jq -r '.[]'

Does not contain:
- eb32b1ca-807a-4867-aea5-ff43ef7647c6.10406917.5752
- eb32b1ca-807a-4867-aea5-ff43ef7647c6.10289105.6162
- eb32b1ca-807a-4867-aea5-ff43ef7647c6.10289105.6186

So for me these objects do not seem to be tied to any bucket and seem to
be leftovers which were not cleaned up.

For example, I see these objects tied to a bucket:

- b32b1ca-807a-4867-aea5-ff43ef7647c6.10289105.6160
- eb32b1ca-807a-4867-aea5-ff43ef7647c6.10289105.6188
- eb32b1ca-807a-4867-aea5-ff43ef7647c6.10289105.6167

But notice the difference: 6160, 6188, 6167, but not 6162 nor 6186

Before I remove these objects I want to verify with other users if they
see the same and if my thinking is correct.

Wido

[0]: https://gist.github.com/wido/6650e66b09770ef02df89636891bef04

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com

Re: [ceph-users] ceph-mgr hangs on larger clusters in Luminous

2018-10-18 Thread Bryan Stillwell
I could see something related to that bug might be happening, but we're not 
seeing the "clock skew" or "signal: Hangup" messages in our logs.

One reason that this cluster might be running into this problem is that we 
appear to have a script that is gathering stats for collectd which is running 
'ceph pg dump' every 16-17 seconds.  I guess you could say we're stress testing 
that code path fairly well...  :)

Bryan

On Thu, Oct 18, 2018 at 6:17 PM Bryan Stillwell <bstillw...@godaddy.com> wrote:

After we upgraded from Jewel (10.2.10) to Luminous (12.2.5) we started seeing a 
problem where the new ceph-mgr would sometimes hang indefinitely when doing 
commands like 'ceph pg dump' on our largest cluster (~1,300 OSDs).  The rest of 
our clusters (10+) aren't seeing the same issue, but they are all under 600 
OSDs each.  Restarting ceph-mgr seems to fix the issue for 12 hours or so, but 
usually overnight it'll get back into the state where the hang reappears.  At 
first I thought it was a hardware issue, but switching the primary ceph-mgr to 
another node didn't fix the problem.



I've increased the logging to 20/20 for debug_mgr, and while a working dump 
looks like this:



2018-10-18 09:26:16.256911 7f9dbf5e7700  4 mgr.server handle_command decoded 3

2018-10-18 09:26:16.256917 7f9dbf5e7700  4 mgr.server handle_command prefix=pg 
dump

2018-10-18 09:26:16.256937 7f9dbf5e7700 10 mgr.server _allowed_command  
client.admin capable

2018-10-18 09:26:16.256951 7f9dbf5e7700  0 log_channel(audit) log [DBG] : 
from='client.1414554763 10.2.4.2:0/2175076978' entity='client.admin' 
cmd=[{"prefix": "pg dump", "target": ["mgr", ""], "format": "json-pretty"}]: 
dispatch

2018-10-18 09:26:22.567583 7f9dbf5e7700  1 mgr.server reply handle_command (0) 
Success dumped all



A failed dump call doesn't show up at all.  The "mgr.server handle_command 
prefix=pg dump" log entry doesn't seem to even make it to the logs.

This could be a manifestation of
https://tracker.ceph.com/issues/23460, as the "pg dump" path is one of
the places where the pgmap and osdmap locks are taken together.

Deadlockyness aside, this code path could use some improvement so that
both locks aren't being held unnecessarily, and so that we aren't
holding up all other accesses to pgmap while doing a dump.

John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mgr hangs on larger clusters in Luminous

2018-10-18 Thread Bryan Stillwell
Thanks Dan!

It does look like we're hitting the ms_tcp_read_timeout.  I changed it to 79 
seconds and I've had a couple dumps that were hung for ~2m40s 
(2*ms_tcp_read_timeout) and one that was hung for 8 minutes 
(6*ms_tcp_read_timeout).

I agree that 15 minutes (900s) is a long timeout.  Anyone know the reasoning 
for that decision?
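
For anyone else wanting to try this, the setting goes in ceph.conf (a sketch;
we're testing 79 seconds here, Dan uses 60):

[global]
ms tcp read timeout = 79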

Bryan

From: Dan van der Ster 
Date: Thursday, October 18, 2018 at 2:03 PM
To: Bryan Stillwell 
Cc: ceph-users 
Subject: Re: [ceph-users] ceph-mgr hangs on larger clusters in Luminous

15 minutes seems like the ms tcp read timeout would be related.

Try shortening that and see if it works around the issue...

(We use ms tcp read timeout = 60 over here -- the 900s default seems
really long to keep idle connections open)

-- dan


On Thu, Oct 18, 2018 at 9:39 PM Bryan Stillwell <bstillw...@godaddy.com> wrote:

I left some of the 'ceph pg dump' commands running and twice they returned 
results after 30 minutes, and three times it took 45 minutes.  Is there 
something that runs every 15 minutes that would let these commands finish?

Bryan

From: Bryan Stillwell mailto:bstillw...@godaddy.com>>
Date: Thursday, October 18, 2018 at 11:16 AM
To: "ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>" 
mailto:ceph-users@lists.ceph.com>>
Subject: ceph-mgr hangs on larger clusters in Luminous

After we upgraded from Jewel (10.2.10) to Luminous (12.2.5) we started seeing a 
problem where the new ceph-mgr would sometimes hang indefinitely when doing 
commands like 'ceph pg dump' on our largest cluster (~1,300 OSDs).  The rest of 
our clusters (10+) aren't seeing the same issue, but they are all under 600 
OSDs each.  Restarting ceph-mgr seems to fix the issue for 12 hours or so, but 
usually overnight it'll get back into the state where the hang reappears.  At 
first I thought it was a hardware issue, but switching the primary ceph-mgr to 
another node didn't fix the problem.

I've increased the logging to 20/20 for debug_mgr, and while a working dump 
looks like this:

2018-10-18 09:26:16.256911 7f9dbf5e7700  4 mgr.server handle_command decoded 3
2018-10-18 09:26:16.256917 7f9dbf5e7700  4 mgr.server handle_command prefix=pg 
dump
2018-10-18 09:26:16.256937 7f9dbf5e7700 10 mgr.server _allowed_command  
client.admin capable
2018-10-18 09:26:16.256951 7f9dbf5e7700  0 log_channel(audit) log [DBG] : 
from='client.1414554763 10.2.4.2:0/2175076978' entity='client.admin' 
cmd=[{"prefix": "pg dump", "target": ["mgr", ""], "format": "json-pretty"}]: 
dispatch
2018-10-18 09:26:22.567583 7f9dbf5e7700  1 mgr.server reply handle_command (0) 
Success dumped all

A failed dump call doesn't show up at all.  The "mgr.server handle_command 
prefix=pg dump" log entry doesn't seem to even make it to the logs.

This problem also continued to appear after upgrading to 12.2.8.

Has anyone else seen this?

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mgr hangs on larger clusters in Luminous

2018-10-18 Thread Bryan Stillwell
I left some of the 'ceph pg dump' commands running and twice they returned 
results after 30 minutes, and three times it took 45 minutes.  Is there 
something that runs every 15 minutes that would let these commands finish?

Bryan

From: Bryan Stillwell 
Date: Thursday, October 18, 2018 at 11:16 AM
To: "ceph-users@lists.ceph.com" 
Subject: ceph-mgr hangs on larger clusters in Luminous

After we upgraded from Jewel (10.2.10) to Luminous (12.2.5) we started seeing a 
problem where the new ceph-mgr would sometimes hang indefinitely when doing 
commands like 'ceph pg dump' on our largest cluster (~1,300 OSDs).  The rest of 
our clusters (10+) aren't seeing the same issue, but they are all under 600 
OSDs each.  Restarting ceph-mgr seems to fix the issue for 12 hours or so, but 
usually overnight it'll get back into the state where the hang reappears.  At 
first I thought it was a hardware issue, but switching the primary ceph-mgr to 
another node didn't fix the problem.
 
I've increased the logging to 20/20 for debug_mgr, and while a working dump 
looks like this:
 
2018-10-18 09:26:16.256911 7f9dbf5e7700  4 mgr.server handle_command decoded 3
2018-10-18 09:26:16.256917 7f9dbf5e7700  4 mgr.server handle_command prefix=pg 
dump
2018-10-18 09:26:16.256937 7f9dbf5e7700 10 mgr.server _allowed_command  
client.admin capable
2018-10-18 09:26:16.256951 7f9dbf5e7700  0 log_channel(audit) log [DBG] : 
from='client.1414554763 10.2.4.2:0/2175076978' entity='client.admin' 
cmd=[{"prefix": "pg dump", "target": ["mgr", ""], "format": "json-pretty"}]: 
dispatch
2018-10-18 09:26:22.567583 7f9dbf5e7700  1 mgr.server reply handle_command (0) 
Success dumped all
 
A failed dump call doesn't show up at all.  The "mgr.server handle_command 
prefix=pg dump" log entry doesn't seem to even make it to the logs.
 
This problem also continued to appear after upgrading to 12.2.8.
 
Has anyone else seen this?
 
Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-mgr hangs on larger clusters in Luminous

2018-10-18 Thread Bryan Stillwell
After we upgraded from Jewel (10.2.10) to Luminous (12.2.5) we started seeing a 
problem where the new ceph-mgr would sometimes hang indefinitely when doing 
commands like 'ceph pg dump' on our largest cluster (~1,300 OSDs).  The rest of 
our clusters (10+) aren't seeing the same issue, but they are all under 600 
OSDs each.  Restarting ceph-mgr seems to fix the issue for 12 hours or so, but 
usually overnight it'll get back into the state where the hang reappears.  At 
first I thought it was a hardware issue, but switching the primary ceph-mgr to 
another node didn't fix the problem.

I've increased the logging to 20/20 for debug_mgr, and while a working dump 
looks like this:

2018-10-18 09:26:16.256911 7f9dbf5e7700  4 mgr.server handle_command decoded 3
2018-10-18 09:26:16.256917 7f9dbf5e7700  4 mgr.server handle_command prefix=pg 
dump
2018-10-18 09:26:16.256937 7f9dbf5e7700 10 mgr.server _allowed_command  
client.admin capable
2018-10-18 09:26:16.256951 7f9dbf5e7700  0 log_channel(audit) log [DBG] : 
from='client.1414554763 10.2.4.2:0/2175076978' entity='client.admin' 
cmd=[{"prefix": "pg dump", "target": ["mgr", ""], "format": "json-pretty"}]: 
dispatch
2018-10-18 09:26:22.567583 7f9dbf5e7700  1 mgr.server reply handle_command (0) 
Success dumped all

A failed dump call doesn't show up at all.  The "mgr.server handle_command 
prefix=pg dump" log entry doesn't seem to even make it to the logs.

This problem also continued to appear after upgrading to 12.2.8.

Has anyone else seen this?

Thanks,
Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rocksdb mon stores growing until restart

2018-09-19 Thread Bryan Stillwell
> On 08/30/2018 11:00 AM, Joao Eduardo Luis wrote:
> > On 08/30/2018 09:28 AM, Dan van der Ster wrote:
> > Hi,
> > Is anyone else seeing rocksdb mon stores slowly growing to >15GB,
> > eventually triggering the 'mon is using a lot of disk space' warning?
> > Since upgrading to luminous, we've seen this happen at least twice.
> > Each time, we restart all the mons and then stores slowly trim down to
> > <500MB. We have 'mon compact on start = true', but it's not the
> > compaction that's shrinking the rockdb's -- the space used seems to
> > decrease over a few minutes only after *all* mons have been restarted.
> > This reminds me of a hammer-era issue where references to trimmed maps
> > were leaking -- I can't find that bug at the moment, though.
>
> Next time this happens, mind listing the store contents and check if you
> are holding way too many osdmaps? You shouldn't be holding more osdmaps
> than the default IF the cluster is healthy and all the pgs are clean.
>
> I've chased a bug pertaining this last year, even got a patch, but then
> was unable to reproduce it. Didn't pursue merging the patch any longer
> (I think I may still have an open PR for it though), simply because it
> was no longer clear if it was needed.

I just had this happen to me while using ceph-gentle-split on a 12.2.5
cluster with 1,370 OSDs.  Unfortunately, I restarted the mon nodes which
fixed the problem before finding this thread.  I'm only halfway done
with the split, so I'll see if the problem resurfaces again.
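
If it does come back, I'll try to check how many osdmaps the mons are actually
holding before restarting them.  Something like this should show it (the first
command assumes 'ceph report' includes the committed osdmap range; the second
requires the mon to be stopped first, and the store path is just an example):

ceph report 2>/dev/null | grep -E '"osdmap_(first|last)_committed"'
ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-$(hostname -s)/store.db list osdmap | wc -l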

Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v13.2.1 Mimic released

2018-07-27 Thread Bryan Stillwell
I decided to upgrade my home cluster from Luminous (v12.2.7) to Mimic (v13.2.1) 
today and ran into a couple issues:

1. When restarting the OSDs during the upgrade it seems to forget my upmap 
settings.  I had to manually return them to the way they were with commands 
like:

ceph osd pg-upmap-items 5.1 11 18 8 6 9 0
ceph osd pg-upmap-items 5.1f 11 17

I also saw this when upgrading from v12.2.5 to v12.2.7.

2. Also after restarting the first OSD during the upgrade I saw 21 messages 
like these in ceph.log:

2018-07-27 15:53:49.868552 osd.1 osd.1 10.0.0.207:6806/4029643 97 : cluster 
[WRN] failed to encode map e100467 with expected crc
2018-07-27 15:53:49.922365 osd.6 osd.6 10.0.0.16:6804/90400 25 : cluster [WRN] 
failed to encode map e100467 with expected crc
2018-07-27 15:53:49.925585 osd.6 osd.6 10.0.0.16:6804/90400 26 : cluster [WRN] 
failed to encode map e100467 with expected crc
2018-07-27 15:53:49.944414 osd.18 osd.18 10.0.0.15:6808/120845 8 : cluster 
[WRN] failed to encode map e100467 with expected crc
2018-07-27 15:53:49.944756 osd.17 osd.17 10.0.0.15:6800/120749 13 : cluster 
[WRN] failed to encode map e100467 with expected crc

Is this a sign that full OSD maps were sent out by the mons to every OSD like 
back in the hammer days?  I seem to remember that OSD maps should be a lot 
smaller now, so maybe this isn't as big of a problem as it was back then?
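
Going back to the first issue, next time I'll probably just snapshot the upmap
entries before restarting anything so they're easy to put back.  Something like
this should capture them:

ceph osd dump | grep pg_upmap_items > upmap-before-upgrade.txt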

Thanks,
Bryan

From: ceph-users  on behalf of Sage Weil 

Date: Friday, July 27, 2018 at 1:25 PM
To: "ceph-annou...@lists.ceph.com" , 
"ceph-users@lists.ceph.com" , 
"ceph-maintain...@lists.ceph.com" , 
"ceph-de...@vger.kernel.org" 
Subject: [ceph-users] v13.2.1 Mimic released

This is the first bugfix release of the Mimic v13.2.x long term stable release
series. This release contains many fixes across all components of Ceph,
including a few security fixes. We recommend that all users upgrade.

Notable Changes
--

* CVE 2018-1128: auth: cephx authorizer subject to replay attack (issue#24836 
http://tracker.ceph.com/issues/24836, Sage Weil)
* CVE 2018-1129: auth: cephx signature check is weak (issue#24837 
http://tracker.ceph.com/issues/24837, Sage Weil)
* CVE 2018-10861: mon: auth checks not correct for pool ops (issue#24838
* 

Re: [ceph-users] Ceph osd crush weight to utilization incorrect on one node

2018-05-11 Thread Bryan Stillwell
> We have a large 1PB ceph cluster. We recently added 6 nodes with 16 2TB disks
> each to the cluster. All the 5 nodes rebalanced well without any issues and
> the sixth/last node OSDs started acting weird as I increase weight of one osd
> the utilization doesn't change but a different osd on the same node
> utilization is getting increased. Rebalance complete fine but utilization is
> not right.
>
> Increased weight of OSD 610 to 0.2 from 0.0 but utilization of OSD 611
> started increasing but its weight is 0.0. If I increase weight of OSD 611 to
> 0.2 then its overall utilization is growing to what if its weight is 0.4. So
> if I increase weight of 610 and 615 to their full weight then utilization on
> OSD 610 is 1% and on OSD 611 is inching towards 100% where I had to stop and
> downsize the OSD's crush weight back to 0.0 to avoid any implications on ceph
> cluster. Its not just one osd but different OSD's on that one node. The only
> correlation I found out is 610 and 611 OSD Journal partitions are on the same
> SSD drive and all the OSDs are SAS drives. Any help on how to debug or
> resolve this will be helpful.

You didn't say which version of Ceph you were using, but based on the output
of 'ceph osd df' I'm guessing it's a pre-Jewel (maybe Hammer?) cluster?

I've found that data placement can be a little weird when you have really
low CRUSH weights (0.2) on one of the nodes where the other nodes have large
CRUSH weights (2.0).  I've had it where a single OSD in a node was getting
almost all the data.  It wasn't until I increased the weights to be more in
line with the rest of the cluster that it evened back out.

I believe this can also be caused by not having enough PGs in your cluster.
Or the PGs you do have aren't distributed correctly based on the data usage
in each pool.  Have you used https://ceph.com/pgcalc/ to determine the
correct number of PGs you should have per pool?
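
As a rough example of what the calculator does for a single dominant pool (the
OSD count here is made up): with 96 OSDs, 3x replication, and a target of ~100
PGs per OSD you get 96 * 100 / 3 = 3200, which rounds up to the next power of
two, 4096.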

Since you are likely running a pre-Jewel cluster it could also be that you
haven't switched your tunables to use the straw2 data placement algorithm:

http://docs.ceph.com/docs/master/rados/operations/crush-map/#hammer-crush-v4

That should help as well.  Once that's enabled you can convert your existing
buckets to straw2 as well.  Just be careful you don't have any old clients
connecting to your cluster that don't support that feature yet.
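
If it helps, on older releases the straw2 conversion is just a crushmap edit.
A rough sketch of the steps (test on a copy of the map first):

ceph osd crush tunables hammer
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt and change 'alg straw' to 'alg straw2' on each bucket
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin

Expect some data movement when the new map goes in.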

Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RGW (Swift) failures during upgrade from Jewel to Luminous

2018-05-08 Thread Bryan Stillwell
We recently began our upgrade testing for going from Jewel (10.2.10) to
Luminous (12.2.5) on our clusters.  The first part of the upgrade went
pretty smoothly (upgrading the mon nodes, adding the mgr nodes, upgrading
the OSD nodes), however, when we got to the RGWs we started seeing internal
server errors (500s) on the Jewel RGWs once the first RGW was upgraded to
Luminous.  Further testing found two different problems:

The first problem (internal server error) was seen when the container and
object were created by a Luminous RGW, but then a Jewel RGW attempted to
list the container.

The second problem (container appears to be empty) was seen when the
container was created by a Luminous RGW, an object was added using a Jewel
RGW, and then the container was listed by a Luminous RGW.

Here were all the tests I performed:

Test #1: Create container (Jewel),    Add object (Jewel),    List container (Jewel),    Result: Success
Test #2: Create container (Jewel),    Add object (Jewel),    List container (Luminous), Result: Success
Test #3: Create container (Jewel),    Add object (Luminous), List container (Jewel),    Result: Success
Test #4: Create container (Jewel),    Add object (Luminous), List container (Luminous), Result: Success
Test #5: Create container (Luminous), Add object (Jewel),    List container (Jewel),    Result: Success
Test #6: Create container (Luminous), Add object (Jewel),    List container (Luminous), Result: Failure (Container appears empty)
Test #7: Create container (Luminous), Add object (Luminous), List container (Jewel),    Result: Failure (Internal Server Error)
Test #8: Create container (Luminous), Add object (Luminous), List container (Luminous), Result: Success

It appears that we ran into these bugs because our load balancer was
alternating between the RGWs while they were running a mixture of the two
versions (like you would expect during an upgrade).

Has anyone run into this problem as well?  Is there a way to work around it
besides disabling half the RGWs, upgrading that half, swinging all the
traffic to the upgraded RGWs, upgrading the other half, and then enabling
the second half?

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-21 Thread Bryan Stillwell
Bryan,

The good news is that there is progress being made on making this harder to 
screw up.  Read this article for example:

https://ceph.com/community/new-luminous-pg-overdose-protection/

The bad news is that I don't have a great solution for you regarding your 
peering problem.  I've run into things like that on testing clusters.  That 
almost always teaches me not to do too many operations at one time.  Usually 
some combination of flags (norecover, norebalance, nobackfill, noout, etc.) 
with OSD restarts will fix the problem.  You can also query PGs to figure out 
why they aren't peering, increase logging, or if you want to get it back 
quickly you should consider RedHat support or contacting a Ceph consultant like 
Wido.

In fact, I would recommend watching Wido's presentation on "10 ways to break 
your Ceph cluster" from Ceph Days Germany earlier this month for other things 
to watch out for:

https://ceph.com/cephdays/germany/
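
To be concrete about the flag juggling I mentioned above, a typical sequence
for me looks something like this (adjust to your situation):

ceph osd set noout
ceph osd set norebalance
ceph osd set nobackfill
ceph osd set norecover
# restart the OSDs involved in the stuck PGs and wait for peering to settle
ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset norebalance
ceph osd unset noout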

Bryan

From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Bryan 
Banister <bbanis...@jumptrading.com>
Date: Tuesday, February 20, 2018 at 2:53 PM
To: David Turner <drakonst...@gmail.com>
Cc: Ceph Users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

HI David [Resending with smaller message size],

I tried setting the OSDs down and that does clear the blocked requests 
momentarily but they just return back to the same state.  Not sure how to 
proceed here, but one thought was just to do a full cold restart of the entire 
cluster.  We have disabled our backups so the cluster is effectively down.  Any 
recommendations on next steps?

This also seems like a pretty serious issue, given that making this change has 
effectively broken the cluster.  Perhaps Ceph should not allow you to increase 
the number of PGs so drastically or at least make you put in a 
‘--yes-i-really-mean-it’ flag?

Or perhaps just some warnings on the docs.ceph.com placement groups page 
(http://docs.ceph.com/docs/master/rados/operations/placement-groups/ ) and the 
ceph command man page?

Would be good to help other avoid this pitfall.

Thanks again,
-Bryan

From: David Turner [mailto:drakonst...@gmail.com]
Sent: Friday, February 16, 2018 3:21 PM
To: Bryan Banister <bbanis...@jumptrading.com<mailto:bbanis...@jumptrading.com>>
Cc: Bryan Stillwell <bstillw...@godaddy.com<mailto:bstillw...@godaddy.com>>; 
Janne Johansson <icepic...@gmail.com<mailto:icepic...@gmail.com>>; Ceph Users 
<ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

Note: External Email

That sounds like a good next step.  Start with OSDs involved in the longest 
blocked requests.  Wait a couple minutes after the osd marks itself back up and 
continue through them.  Hopefully things will start clearing up so that you 
don't need to mark all of them down.  There are usually only a couple of OSDs
holding everything up.

On Fri, Feb 16, 2018 at 4:15 PM Bryan Banister 
<bbanis...@jumptrading.com<mailto:bbanis...@jumptrading.com>> wrote:
Thanks David,

Taking the list of all OSDs that are stuck reports that a little over 50% of 
all OSDs are in this condition.  There isn’t any discernable pattern that I can 
find and they are spread across the three servers.  All of the OSDs are online 
as far as the service is concern.

I have also taken all PGs that were reported the health detail output and 
looked for any that report “peering_blocked_by” but none do, so I can’t tell if 
any OSD is actually blocking the peering operation.

As suggested, I got a report of all peering PGs:
[root@carf-ceph-osd01 ~]# ceph health detail | grep "pg " | grep peering | sort 
-k13
pg 14.fe0 is stuck peering since forever, current state peering, last 
acting [104,94,108]
pg 14.fe0 is stuck unclean since forever, current state peering, last 
acting [104,94,108]
pg 14.fbc is stuck peering since forever, current state peering, last 
acting [110,91,0]
pg 14.fd1 is stuck peering since forever, current state peering, last 
acting [130,62,111]
pg 14.fd1 is stuck unclean since forever, current state peering, last 
acting [130,62,111]
pg 14.fed is stuck peering since forever, current state peering, last 
acting [32,33,82]
pg 14.fed is stuck unclean since forever, current state peering, last 
acting [32,33,82]
pg 14.fee is stuck peering since forever, current state peering, last 
acting [37,96,68]
pg 14.fee is stuck unclean since forever, current state peering, last 
acting [37,96,68]
pg 14.fe8 is stuck peering since forever, current state peering, last 
acting [45,31,107]
pg 14.fe8 is stuck unclean since forever, current state peering, last 
acting [45,31,107]
pg 14.fc1 is stuck peering since forever, current state peering, last 
acting [59,124,39]
  

Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-13 Thread Bryan Stillwell
It may work fine, but I would suggest limiting the number of operations going 
on at the same time.

Bryan

From: Bryan Banister <bbanis...@jumptrading.com>
Date: Tuesday, February 13, 2018 at 1:16 PM
To: Bryan Stillwell <bstillw...@godaddy.com>, Janne Johansson 
<icepic...@gmail.com>
Cc: Ceph Users <ceph-users@lists.ceph.com>
Subject: RE: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

Thanks for the response Bryan!

Would it be good to go ahead and do the increase up to 4096 PGs for the pool
given that it's only at 52% done with the rebalance backfilling operations?

Thanks in advance!!
-Bryan

-Original Message-----
From: Bryan Stillwell [mailto:bstillw...@godaddy.com]
Sent: Tuesday, February 13, 2018 12:43 PM
To: Bryan Banister 
<bbanis...@jumptrading.com<mailto:bbanis...@jumptrading.com>>; Janne Johansson 
<icepic...@gmail.com<mailto:icepic...@gmail.com>>
Cc: Ceph Users <ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

Note: External Email
-

Bryan,

Based off the information you've provided so far, I would say that your largest 
pool still doesn't have enough PGs.

If you originally had only 512 PGs for your largest pool (I'm guessing
.rgw.buckets has 99% of your data), then on a balanced cluster you would have 
just ~11.5 PGs per OSD (3*512/133).  That's way lower than the recommended 100 
PGs/OSD.

Based on the number of disks and assuming your .rgw.buckets pool has 99% of the 
data, you should have around 4,096 PGs for that pool.  You'll still end up with 
an uneven distribution, but the outliers shouldn't be as far out.

Sage recently wrote a new balancer plugin that makes balancing a cluster 
something that happens automatically.  He gave a great talk at LinuxConf 
Australia that you should check out, here's a link into the video where he 
talks about the balancer and the need for it:

https://youtu.be/GrStE7XSKFE?t=20m14s

Even though your objects are fairly large, they are getting broken up into 
chunks that are spread across the cluster.  You can see how large each of your 
PGs are with a command like this:

ceph pg dump | grep '[0-9]*\.[0-9a-f]*' | awk '{ print $1 "\t" $7 }' |sort -n 
-k2

You'll see that within a pool the PG sizes are fairly close to the same size, 
but in your cluster the PGs are fairly large (~200GB would be my guess).

Bryan

From: ceph-users 
<ceph-users-boun...@lists.ceph.com<mailto:ceph-users-boun...@lists.ceph.com>> 
on behalf of Bryan Banister 
<bbanis...@jumptrading.com<mailto:bbanis...@jumptrading.com>>
Date: Monday, February 12, 2018 at 2:19 PM
To: Janne Johansson <icepic...@gmail.com<mailto:icepic...@gmail.com>>
Cc: Ceph Users <ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

Hi Janne and others,

We used the “ceph osd reweight-by-utilization “ command to move a small amount 
of data off of the top four OSDs by utilization.  Then we updated the pg_num 
and pgp_num on the pool from 512 to 1024 which started moving roughly 50% of 
the objects around as a result.  The unfortunate issue is that the weights on 
the OSDs are still roughly equivalent and the OSDs that are nearfull were still 
getting allocated objects during the rebalance backfill operations.

At this point I have made some massive changes to the weights of the OSDs in an 
attempt to stop Ceph from allocating any more data to OSDs that are getting 
close to full.  Basically the OSD with the lowest utilization remains weighted 
at 1 and the rest of the OSDs are now reduced in weight based on the percent 
usage of the OSD + the %usage of the OSD with the amount of data (21% at the 
time).  This means the OSD that is at the most full at this time at 86% full 
now has a weight of only .33 (it was at 89% when reweight was applied).  I’m 
not sure this is a good idea, but it seemed like the only option I had.  Please 
let me know if I’m making a bad situation worse!

I still have the question on how this happened in the first place and how to 
prevent it from happening going forward without a lot of monitoring and 
reweighting on weekends/etc to keep things balanced.  It sounds like Ceph is 
really expecting that objects stored into a pool will roughly have the same 
size, is that right?

Our backups going into this pool have very large variation in size, so would it 
be better to create multiple pools based on expected size of objects and then 
put backups of similar size into each pool?

The backups also have basically the same names with the only difference being 
the date which it was taken (e.g. backup name difference in subsequent days can 
be one digit at times), so does this mean that large backups with basically the 
same name will end up being placed in the

Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-13 Thread Bryan Stillwell
Bryan,

Based off the information you've provided so far, I would say that your largest 
pool still doesn't have enough PGs.

If you originally had only 512 PGs for your largest pool (I'm guessing
.rgw.buckets has 99% of your data), then on a balanced cluster you would have 
just ~11.5 PGs per OSD (3*512/133).  That's way lower than the recommended 100 
PGs/OSD.

Based on the number of disks and assuming your .rgw.buckets pool has 99% of the 
data, you should have around 4,096 PGs for that pool.  You'll still end up with 
an uneven distribution, but the outliers shouldn't be as far out.
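
If you do go to 4,096, I'd bump pg_num/pgp_num in a couple of steps rather than
one jump, roughly like this (assuming the pool really is .rgw.buckets):

ceph osd pool set .rgw.buckets pg_num 2048
ceph osd pool set .rgw.buckets pgp_num 2048
# let the backfill finish, then repeat with 4096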

Sage recently wrote a new balancer plugin that makes balancing a cluster 
something that happens automatically.  He gave a great talk at LinuxConf 
Australia that you should check out, here's a link into the video where he 
talks about the balancer and the need for it:

https://youtu.be/GrStE7XSKFE?t=20m14s

Even though your objects are fairly large, they are getting broken up into 
chunks that are spread across the cluster.  You can see how large each of your 
PGs are with a command like this:

ceph pg dump | grep '[0-9]*\.[0-9a-f]*' | awk '{ print $1 "\t" $7 }' |sort -n 
-k2

You'll see that within a pool the PG sizes are fairly close to the same size, 
but in your cluster the PGs are fairly large (~200GB would be my guess).

Bryan

From: ceph-users  on behalf of Bryan 
Banister 
Date: Monday, February 12, 2018 at 2:19 PM
To: Janne Johansson 
Cc: Ceph Users 
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

Hi Janne and others,
 
We used the “ceph osd reweight-by-utilization “ command to move a small amount 
of data off of the top four OSDs by utilization.  Then we updated the pg_num 
and pgp_num on the pool from 512 to 1024 which started moving roughly 50% of 
the objects around as a result.  The unfortunate issue is that the weights on 
the OSDs are still roughly equivalent and the OSDs that are nearfull were still 
getting allocated objects during the rebalance backfill operations.
 
At this point I have made some massive changes to the weights of the OSDs in an 
attempt to stop Ceph from allocating any more data to OSDs that are getting 
close to full.  Basically the OSD with the lowest utilization remains weighted 
at 1 and the rest of the OSDs are now reduced in weight based on the percent 
usage of the OSD + the %usage of the OSD with the amount of data (21% at the 
time).  This means the OSD that is at the most full at this time at 86% full 
now has a weight of only .33 (it was at 89% when reweight was applied).  I’m 
not sure this is a good idea, but it seemed like the only option I had.  Please 
let me know if I’m making a bad situation worse!
 
I still have the question on how this happened in the first place and how to 
prevent it from happening going forward without a lot of monitoring and 
reweighting on weekends/etc to keep things balanced.  It sounds like Ceph is 
really expecting that objects stored into a pool will roughly have the same 
size, is that right?
 
Our backups going into this pool have very large variation in size, so would it 
be better to create multiple pools based on expected size of objects and then 
put backups of similar size into each pool?
 
The backups also have basically the same names with the only difference being 
the date which it was taken (e.g. backup name difference in subsequent days can 
be one digit at times), so does this mean that large backups with basically the 
same name will end up being placed in the same PGs based on the CRUSH 
calculation using the object name?
 
Thanks,
-Bryan
 
From: Janne Johansson [mailto:icepic...@gmail.com] 
Sent: Wednesday, January 31, 2018 9:34 AM
To: Bryan Banister 
Cc: Ceph Users 
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2
 
Note: External Email

 
 
2018-01-31 15:58 GMT+01:00 Bryan Banister :
 
 
Given that this will move data around (I think), should we increase the pg_num 
and pgp_num first and then see how it looks?
 
 
I guess adding pgs and pgps will move stuff around too, but if the PGCALC 
formula says you should have more then that would still be a good
start. Still, a few manual reweights first to take the 85-90% ones down might 
be good, some move operations are going to refuse adding things
to too-full OSDs, so you would not want to get accidentally bumped above such a 
limit due to some temp-data being created during moves.
 
Also, dont bump pgs like crazy, you can never move down. Aim for getting ~100 
per OSD at most, and perhaps even then in smaller steps so
that the creation (and evening out of data to the new empty PGs) doesn't kill 
normal client I/O perf in the meantime.

 
-- 
May the most significant bit of your life be positive.




[ceph-users] Switching failure domains

2018-01-31 Thread Bryan Stillwell
We're looking into switching the failure domains on several of our
clusters from host-level to rack-level and I'm trying to figure out the
least impactful way to accomplish this.

First off, I've made this change before on a couple large (500+ OSDs)
OpenStack clusters where the volumes, images, and vms pools were all
about 33% of the cluster.  The way I did it then was to create a new
rule which had a switch-based failure domain and then did one pool at a
time.

That worked pretty well, but now I've inherited several large RGW
clusters (500-1000+ OSDs) where 99% of the data is in the .rgw.buckets
pool with slower and bigger disks (7200 RPM 4TB SATA HDDs vs. the 10k
RPM 1.2TB SAS HDDs I was using previously).  This makes the change take
longer and early testing has shown it being fairly impactful.

I'm wondering if there is a way to more gradually switch to a rack-based
failure domain?

One of the ideas we had was to create new hosts that are actually the
racks and gradually move all the OSDs to those hosts.  Once that is
complete we should be able to turn those hosts into racks and switch the
failure domain at the same time.

Does anyone see a problem with that approach?

I was also wondering if we could take advantage of RGW in any way to
gradually move the data to a new pool with the proper failure domain set
on it?

BTW, these clusters will all be running jewel (10.2.10).  The time I
made the switch previously was done on hammer.

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Problems removing buckets with --bypass-gc

2017-10-31 Thread Bryan Stillwell
As mentioned in another thread I'm trying to remove several thousand buckets on 
a hammer cluster (0.94.10), but I'm running into a problem using --bypass-gc.

I usually see either this error:

# radosgw-admin bucket rm --bucket=sg2pl598 --purge-objects --bypass-gc
2017-10-31 09:21:04.111599 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=4194304 stripe_ofs=4194304 part_ofs=0 rule->part_size=15728640
2017-10-31 09:21:04.121664 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=8388608 stripe_ofs=8388608 part_ofs=0 rule->part_size=15728640
2017-10-31 09:21:04.126356 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=12582912 stripe_ofs=12582912 part_ofs=0 rule->part_size=15728640
2017-10-31 09:21:04.130582 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=15728640 stripe_ofs=15728640 part_ofs=15728640 
rule->part_size=15728640
2017-10-31 09:21:04.135791 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=19922944 stripe_ofs=19922944 part_ofs=15728640 
rule->part_size=15728640
2017-10-31 09:21:04.140240 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=24117248 stripe_ofs=24117248 part_ofs=15728640 
rule->part_size=15728640
2017-10-31 09:21:04.145792 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=28311552 stripe_ofs=28311552 part_ofs=15728640 
rule->part_size=15728640
2017-10-31 09:21:04.149964 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=31457280 stripe_ofs=31457280 part_ofs=31457280 
rule->part_size=15728640
2017-10-31 09:21:04.165820 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=35651584 stripe_ofs=35651584 part_ofs=31457280 
rule->part_size=15728640
2017-10-31 09:21:04.171099 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=39845888 stripe_ofs=39845888 part_ofs=31457280 
rule->part_size=15728640
2017-10-31 09:21:04.176765 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=44040192 stripe_ofs=44040192 part_ofs=31457280 
rule->part_size=15728640
2017-10-31 09:21:04.183664 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=47185920 stripe_ofs=47185920 part_ofs=47185920 rule->part_size=83674
2017-10-31 09:21:04.188140 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=47269594 stripe_ofs=47269594 part_ofs=47269594 rule->part_size=83674
2017-10-31 09:21:05.034837 7f45f5d108c0 -1 ERROR: failed to get obj ref with 
ret=-22
2017-10-31 09:21:05.034846 7f45f5d108c0 -1 ERROR: delete obj aio failed with -22

or this error:

# radosgw-admin bucket rm --bucket=sg2pl593 --purge-objects --bypass-gc
2017-10-31 09:24:09.082063 7fe7f4be68c0  0 RGWObjManifest::operator++(): 
result: ofs=4194304 stripe_ofs=4194304 part_ofs=0 rule->part_size=15728640
2017-10-31 09:24:09.090394 7fe7f4be68c0  0 RGWObjManifest::operator++(): 
result: ofs=8388608 stripe_ofs=8388608 part_ofs=0 rule->part_size=15728640
2017-10-31 09:24:09.095172 7fe7f4be68c0  0 RGWObjManifest::operator++(): 
result: ofs=12582912 stripe_ofs=12582912 part_ofs=0 rule->part_size=15728640
2017-10-31 09:24:09.099116 7fe7f4be68c0  0 RGWObjManifest::operator++(): 
result: ofs=15728640 stripe_ofs=15728640 part_ofs=15728640 
rule->part_size=15728640
[...snip...]
2017-10-31 09:24:09.245171 7fe7f4be68c0  0 RGWObjManifest::operator++(): 
result: ofs=110100480 stripe_ofs=110100480 part_ofs=110100480 
rule->part_size=15728640
2017-10-31 09:24:09.251659 7fe7f4be68c0  0 RGWObjManifest::operator++(): 
result: ofs=114294784 stripe_ofs=114294784 part_ofs=110100480 
rule->part_size=15728640
2017-10-31 09:24:09.269739 7fe7f4be68c0  0 RGWObjManifest::operator++(): 
result: ofs=118489088 stripe_ofs=118489088 part_ofs=110100480 
rule->part_size=15728640
2017-10-31 09:24:09.273871 7fe7f4be68c0  0 RGWObjManifest::operator++(): 
result: ofs=122683392 stripe_ofs=122683392 part_ofs=110100480 
rule->part_size=15728640
2017-10-31 09:24:09.274968 7fe7f4be68c0 -1 ERROR: could not drain handles as 
aio completion returned with -2

Then successive runs continue failing at the same spot preventing further 
progress.  I can then run it without --bypass-gc for a few seconds followed by 
running it with --bypass-gc, but usually it fails again after a few minutes.

For example, here's another run on sg2pl593 after running it without 
--bypass-gc for a few seconds:

# radosgw-admin bucket rm --bucket=sg2pl593 --purge-objects --bypass-gc
2017-10-31 09:28:03.704490 7efdb31d08c0  0 RGWObjManifest::operator++(): 
result: ofs=565628 stripe_ofs=565628 part_ofs=0 rule->part_size=0
2017-10-31 09:28:03.890675 7efdb31d08c0  0 RGWObjManifest::operator++(): 
result: ofs=1757663 stripe_ofs=1757663 part_ofs=0 rule->part_size=0
2017-10-31 09:28:04.144966 7efdb31d08c0  0 RGWObjManifest::operator++(): 
result: ofs=2723340 stripe_ofs=2723340 part_ofs=0 rule->part_size=0
2017-10-31 09:28:04.380761 7efdb31d08c0 -1 ERROR: could not drain handles as 
aio completion returned with -2
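
Until the underlying error is understood, something like this crude loop might
at least automate the back-and-forth I described above (the bucket name is just
the one from this example, and it assumes the failing runs exit non-zero):

while ! radosgw-admin bucket rm --bucket=sg2pl593 --purge-objects --bypass-gc; do
    timeout 10 radosgw-admin bucket rm --bucket=sg2pl593 --purge-objects || true
done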

This cluster recently switched from a production cluster to a test cluster 
after a data migration, so I have the option to 

Re: [ceph-users] Speeding up garbage collection in RGW

2017-10-27 Thread Bryan Stillwell
On Wed, Oct 25, 2017 at 4:02 PM, Yehuda Sadeh-Weinraub <yeh...@redhat.com> 
wrote:
>
> On Wed, Oct 25, 2017 at 2:32 PM, Bryan Stillwell <bstillw...@godaddy.com> 
> wrote:
> > That helps a little bit, but overall the process would take years at this
> > rate:
> >
> > # for i in {1..3600}; do ceph df -f json-pretty |grep -A7 '".rgw.buckets"' 
> > |grep objects; sleep 60; done
> >  "objects": 1660775838
> >  "objects": 1660775733
> >  "objects": 1660775548
> >  "objects": 1660774825
> >  "objects": 1660774790
> >  "objects": 1660774735
> >
> > This is on a hammer cluster.  Would upgrading to Jewel or Luminous speed up
> > this process at all?
>
> I'm not sure it's going to help much, although the omap performance
> might improve there. The big problem is that the omaps are just too
> big, so that every operation on them take considerable time. I think
> the best way forward there is to take a list of all the rados objects
> that need to be removed from the gc omaps, and then get rid of the gc
> objects themselves (newer ones will be created, this time using the
> new configurable). Then remove the objects manually (and concurrently)
> using the rados command line tool.
> The one problem I see here is that even just removal of objects with
> large omaps can affect the availability of the osds that hold these
> objects. I discussed that now with Josh, and we think the best way to
> deal with that is not to remove the gc objects immediatly, but to
> rename the gc pool, and create a new one (with appropriate number of
> pgs). This way new gc entries will now go into the new gc pool (with
> higher number of gc shards), and you don't need to remove the old gc
> objects (thus no osd availability problem). Then you can start
> trimming the old gc objects (on the old renamed pool) by using the
> rados command. It'll take a very very long time, but the process
> should pick up speed slowly, as the objects shrink.

That's fine for us.  We'll be tearing down this cluster in a few weeks
and adding the nodes to the new cluster we created.  I just wanted to
explore other options now that we can use it as a test cluster.

The solution you described with renaming the .rgw.gc pool and creating a
new one is pretty interesting.  I'll have to give that a try, but until
then I've been trying to remove some of the other buckets with the
--bypass-gc option and it keeps dying with output like this:

# radosgw-admin bucket rm --bucket=sg2pl5000 --purge-objects --bypass-gc
2017-10-27 08:00:00.865993 7f2b387228c0  0 RGWObjManifest::operator++(): 
result: ofs=1488744 stripe_ofs=1488744 part_ofs=0 rule->part_size=0
2017-10-27 08:00:04.385875 7f2b387228c0  0 RGWObjManifest::operator++(): 
result: ofs=673900 stripe_ofs=673900 part_ofs=0 rule->part_size=0
2017-10-27 08:00:04.517241 7f2b387228c0  0 RGWObjManifest::operator++(): 
result: ofs=1179224 stripe_ofs=1179224 part_ofs=0 rule->part_size=0
2017-10-27 08:00:05.791876 7f2b387228c0  0 RGWObjManifest::operator++(): 
result: ofs=566620 stripe_ofs=566620 part_ofs=0 rule->part_size=0
2017-10-27 08:00:26.815081 7f2b387228c0  0 RGWObjManifest::operator++(): 
result: ofs=1090645 stripe_ofs=1090645 part_ofs=0 rule->part_size=0
2017-10-27 08:00:46.757556 7f2b387228c0  0 RGWObjManifest::operator++(): 
result: ofs=1488744 stripe_ofs=1488744 part_ofs=0 rule->part_size=0
2017-10-27 08:00:47.093813 7f2b387228c0 -1 ERROR: could not drain handles as 
aio completion returned with -2


I can typically make further progress by running it again:

# radosgw-admin bucket rm --bucket=sg2pl5000 --purge-objects --bypass-gc
2017-10-27 08:20:57.310859 7fae9c3d48c0  0 RGWObjManifest::operator++(): 
result: ofs=673900 stripe_ofs=673900 part_ofs=0 rule->part_size=0
2017-10-27 08:20:57.406684 7fae9c3d48c0  0 RGWObjManifest::operator++(): 
result: ofs=1179224 stripe_ofs=1179224 part_ofs=0 rule->part_size=0
2017-10-27 08:20:57.808050 7fae9c3d48c0 -1 ERROR: could not drain handles as 
aio completion returned with -2


and again:

# radosgw-admin bucket rm --bucket=sg2pl5000 --purge-objects --bypass-gc
2017-10-27 08:22:04.992578 7ff8071038c0  0 RGWObjManifest::operator++(): 
result: ofs=566620 stripe_ofs=566620 part_ofs=0 rule->part_size=0
2017-10-27 08:22:05.726485 7ff8071038c0 -1 ERROR: could not drain handles as 
aio completion returned with -2


What does this error mean, and is there any way to keep it from dying
like this?  This cluster is running 0.94.10, but I can upgrade it to jewel
pretty easily if you would like.

Thanks,
Bryan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Speeding up garbage collection in RGW

2017-10-25 Thread Bryan Stillwell
That helps a little bit, but overall the process would take years at this rate:

# for i in {1..3600}; do ceph df -f json-pretty |grep -A7 '".rgw.buckets"' 
|grep objects; sleep 60; done
"objects": 1660775838
"objects": 1660775733
"objects": 1660775548
"objects": 1660774825
"objects": 1660774790
"objects": 1660774735

This is on a hammer cluster.  Would upgrading to Jewel or Luminous speed up 
this process at all?

Bryan

From: Yehuda Sadeh-Weinraub <yeh...@redhat.com>
Date: Wednesday, October 25, 2017 at 11:32 AM
To: Bryan Stillwell <bstillw...@godaddy.com>
Cc: David Turner <drakonst...@gmail.com>, Ben Hines <bhi...@gmail.com>, 
"ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Speeding up garbage collection in RGW

Some of the options there won't do much for you as they'll only affect
newer object removals. I think the default number of gc objects is
just inadequate for your needs. You can try manually running
'radosgw-admin gc process' concurrently (for the start 2 or 3
processes), see if it makes any dent there. I think one of the problem
is that the gc omaps grew so much that operations on them are too
slow.

Yehuda

On Wed, Oct 25, 2017 at 9:05 AM, Bryan Stillwell 
<bstillw...@godaddy.com<mailto:bstillw...@godaddy.com>> wrote:
We tried various options like the ones Ben mentioned to speed up the garbage
collection process and were unsuccessful.  Luckily, we had the ability to 
create a new cluster and move all the data that wasn't part of the POC which 
created our problem.

One of the things we ran into was the .rgw.gc pool became too large to handle 
drive failures without taking down the cluster.  We eventually had to move that 
pool to SSDs just to get the cluster healthy.  It was not obvious it was 
getting large though, because this is what it looked like in the 'ceph df' 
output:

 NAME      ID  USED  %USED  MAX AVAIL  OBJECTS
 .rgw.gc   17  0     0      235G       2647

However, if you look at the SSDs we used (repurposed journal SSDs to get out of 
the disaster) in 'ceph osd df' you can see quite a bit of data is being used:

410 0.2  1.0  181G 23090M   158G 12.44 0.18
411 0.2  1.0  181G 29105M   152G 15.68 0.22
412 0.2  1.0  181G   110G 72223M 61.08 0.86
413 0.2  1.0  181G 42964M   139G 23.15 0.33
414 0.2  1.0  181G 33530M   148G 18.07 0.26
415 0.2  1.0  181G 38420M   143G 20.70 0.29
416 0.2  1.0  181G 92215M 93355M 49.69 0.70
417 0.2  1.0  181G 64730M   118G 34.88 0.49
418 0.2  1.0  181G 61353M   121G 33.06 0.47
419 0.2  1.0  181G 77168M   105G 41.58 0.59

That's ~560G of omap data for the .rgw.gc pool that isn't being reported in 
'ceph df'.

Right now the cluster is still around while we wait to verify the new cluster 
isn't missing anything.  So if there is anything the RGW developers would like 
to try on it to speed up the gc process, we should be able to do that.

Bryan

From: ceph-users 
<ceph-users-boun...@lists.ceph.com<mailto:ceph-users-boun...@lists.ceph.com>> 
on behalf of David Turner <drakonst...@gmail.com<mailto:drakonst...@gmail.com>>
Date: Tuesday, October 24, 2017 at 4:07 PM
To: Ben Hines <bhi...@gmail.com<mailto:bhi...@gmail.com>>
Cc: "ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>" 
<ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
Subject: Re: [ceph-users] Speeding up garbage collection in RGW

Thank you so much for chiming in, Ben.

Can you explain what each setting value means? I believe I understand min wait, 
that's just how long to wait before allowing the object to be cleaned up.  gc 
max objs is how many will be cleaned up during each period?  gc processor 
period is how often it will kick off gc to clean things up?  And gc processor 
max time is the longest the process can run after the period starts?  Is that 
about right for that?  I read somewhere saying that prime numbers are optimal 
for gc max objs.  Do you know why that is?  I notice you're using one there.  
What is lc max objs?  I couldn't find a reference for that setting.

Additionally, do you know if the radosgw-admin gc list is ever cleaned up, or 
is it an ever growing list?  I got up to 3.6 Billion objects in the list before 
I killed the gc list command.

On Tue, Oct 24, 2017 at 4:47 PM Ben Hines 
<bhi...@gmail.com<mailto:bhi...@gmail.com>> wrote:
I agree the settings are rather confusing. We also have many millions of 
objects and had this trouble, so i set these rather aggressive gc settings on 
our cluster which result in gc almost always running. We also use lifecycles to 
expire objects.

rgw lifecy

Re: [ceph-users] Speeding up garbage collection in RGW

2017-10-25 Thread Bryan Stillwell
u're in this position?  There were about 8M 
objects that were deleted from this bucket.  I've come across a few references 
to the rgw-gc settings in the config, but nothing that explained the times well 
enough for me to feel comfortable doing anything with them.

On Tue, Jul 25, 2017 at 4:01 PM Bryan Stillwell <bstillw...@godaddy.com> wrote:
Excellent, thank you!  It does exist in 0.94.10!  :)
 
Bryan
 
From: Pavan Rallabhandi <prallabha...@walmartlabs.com>
Date: Tuesday, July 25, 2017 at 11:21 AM

To: Bryan Stillwell <bstillw...@godaddy.com>, "ceph-users@lists.ceph.com" 
<ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Speeding up garbage collection in RGW
 
I’ve just realized that the option is present in Hammer (0.94.10) as well, you 
should try that.
 
From: Bryan Stillwell <bstillw...@godaddy.com>
Date: Tuesday, 25 July 2017 at 9:45 PM
To: Pavan Rallabhandi <prallabha...@walmartlabs.com>, 
"ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: EXT: Re: [ceph-users] Speeding up garbage collection in RGW
 
Unfortunately, we're on hammer still (0.94.10).  That option looks like it 
would work better, so maybe it's time to move the upgrade up in the schedule.
 
I've been playing with the various gc options and I haven't seen any speedups 
like we would need to remove them in a reasonable amount of time.
 
Thanks,
Bryan
 
From: Pavan Rallabhandi <prallabha...@walmartlabs.com>
Date: Tuesday, July 25, 2017 at 3:00 AM
To: Bryan Stillwell <bstillw...@godaddy.com>, "ceph-users@lists.ceph.com" 
<ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Speeding up garbage collection in RGW
 
If your Ceph version is >=Jewel, you can try the `--bypass-gc` option in 
radosgw-admin, which would remove the tails objects as well without marking 
them to be GCed.
 
Thanks,
 
On 25/07/17, 1:34 AM, "ceph-users on behalf of Bryan Stillwell" 
<ceph-users-boun...@lists.ceph.com on behalf of bstillw...@godaddy.com> wrote:
 
I'm in the process of cleaning up a test that an internal customer did on 
our production cluster that produced over a billion objects spread across 6000 
buckets.  So far I've been removing the buckets like this:

printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket 
rm --bucket={} --purge-objects

However, the disk usage doesn't seem to be getting reduced at the same rate 
the objects are being removed.  From what I can tell a large number of the 
objects are waiting for garbage collection.

When I first read the docs it sounded like the garbage collector would only 
remove 32 objects every hour, but after looking through the logs I'm seeing 
about 55,000 objects removed every hour.  That's about 1.3 million a day, so at 
this rate it'll take a couple years to clean up the rest!  For comparison, the 
purge-objects command above is removing (but not GC'ing) about 30 million 
objects a day, so a much more manageable 33 days to finish.

I've done some digging and it appears like I should be changing these 
configuration options:

rgw gc max objs (default: 32)
rgw gc obj min wait (default: 7200)
rgw gc processor max time (default: 3600)
rgw gc processor period (default: 3600)
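
If I do end up changing them, I assume it would go into ceph.conf under the rgw
client section, something like this (the section name and values are just
placeholders, not recommendations):

[client.radosgw.gateway]
    rgw gc max objs = 1024
    rgw gc obj min wait = 300
    rgw gc processor max time = 3600
    rgw gc processor period = 3600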

A few questions I have though are:

Should 'rgw gc processor max time' and 'rgw gc processor period' always be 
set to the same value?

Which would be better, increasing 'rgw gc max objs' to something like 1024, 
or reducing the 'rgw gc processor' times to something like 60 seconds?

Any other guidance on the best way to adjust these values?

Thanks,
Bryan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] radosgw crashing after buffer overflows detected

2017-09-11 Thread Bryan Stillwell
I found a couple OSDs that were seeing medium errors and marked them out
of the cluster.  Once all the PGs were moved off those OSDs all the
buffer overflows went away.

So there must be some kind of bug that's being triggered when an OSD is
misbehaving.
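
For anyone hitting the same thing, the check and the fix were roughly along
these lines (the OSD id is an example; the kernel logs the medium errors on the
OSD host):

dmesg | grep -i 'medium error'
ceph osd out 123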

Bryan

From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Bryan 
Stillwell <bstillw...@godaddy.com>
Date: Friday, September 8, 2017 at 9:26 AM
To: ceph-users <ceph-users@lists.ceph.com>
Subject: [ceph-users] radosgw crashing after buffer overflows detected


For about a week we've been seeing a decent number of buffer overflows
detected across all our RGW nodes in one of our clusters.  This started
happening a day after we started weighing in some new OSD nodes, so
we're thinking it's probably related to that.  Could someone help us
determine the root cause of this?

Cluster details:
  Distro: CentOS 7.2
  Release: 0.94.10-0.el7.x86_64
  OSDs: 1120
  RGW nodes: 10

See log messages below.  If you know how to improve the call trace
below I would like to hear that too.  I tried installing the
ceph-debuginfo-0.94.10-0.el7.x86_64 package, but that didn't seem to
help.

Thanks,
Bryan


# From /var/log/messages:

Sep  7 20:06:11 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  7 21:01:55 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  7 21:37:00 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  7 23:14:54 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  7 23:17:08 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 00:12:39 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 07:04:07 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 07:17:49 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 07:41:39 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 07:59:29 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated


# From /var/log/ceph/client.radosgw.p3cephrgw003.log:

 0> 2017-09-08 07:59:29.696615 7f7b296a2700 -1 *** Caught signal (Aborted) 
**
in thread 7f7b296a2700

ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
1: /bin/radosgw() [0x6d3d92]
2: (()+0xf100) [0x7f7f425e9100]
3: (gsignal()+0x37) [0x7f7f4141d5f7]
4: (abort()+0x148) [0x7f7f4141ece8]
5: (()+0x75317) [0x7f7f4145d317]
6: (__fortify_fail()+0x37) [0x7f7f414f5ac7]
7: (()+0x10bc80) [0x7f7f414f3c80]
8: (()+0x10da37) [0x7f7f414f5a37]
9: (OS_Accept()+0xc1) [0x7f7f435bd8b1]
10: (FCGX_Accept_r()+0x9c) [0x7f7f435bb91c]
11: (RGWFCGXProcess::run()+0x7bf) [0x58136f]
12: (RGWProcessControlThread::entry()+0xe) [0x5821fe]
13: (()+0x7dc5) [0x7f7f425e1dc5]
14: (clone()+0x6d) [0x7f7f414de21d]
NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Client features by IP?

2017-09-08 Thread Bryan Stillwell
On 09/07/2017 01:26 PM, Josh Durgin wrote:
> On 09/07/2017 11:31 AM, Bryan Stillwell wrote:
>> On 09/07/2017 10:47 AM, Josh Durgin wrote:
>>> On 09/06/2017 04:36 PM, Bryan Stillwell wrote:
>>>> I was reading this post by Josh Durgin today and was pretty happy to
>>>> see we can get a summary of features that clients are using with the
>>>> 'ceph features' command:
>>>>
>>>> http://ceph.com/community/new-luminous-upgrade-complete/
>>>>
>>>> However, I haven't found an option to display the IP address of
>>>> those clients with the older feature sets.  Is there a flag I can
>>>> pass to 'ceph features' to list the IPs associated with each feature
>>>> set?
>>>
>>> There is not currently, we should add that - it'll be easy to backport
>>> to luminous too. The only place both features and IP are shown is in
>>> 'debug mon = 10' logs right now.
>>
>> I think that would be great!  The first thing I would want to do after
>> seeing an old client listed would be to find it and upgrade it.  Having
>> the IP of the client would make that a ton easier!
>
> Yup, should've included that in the first place!
>
>> Anything I could do to help make that happen?  File a feature request
>> maybe?
>
> Sure, adding a short tracker.ceph.com ticket would help, that way we can
> track the backport easily too.

Ticket created:

http://tracker.ceph.com/issues/21315
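
In the meantime, dumping the mon sessions looks like it includes both the
client address and the feature bits, so that may work as a stopgap (the mon
name is just an example):

ceph daemon mon.$(hostname -s) sessions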

Thanks Josh!

Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw crashing after buffer overflows detected

2017-09-08 Thread Bryan Stillwell
For about a week we've been seeing a decent number of buffer overflows
detected across all our RGW nodes in one of our clusters.  This started
happening a day after we started weighing in some new OSD nodes, so
we're thinking it's probably related to that.  Could someone help us
determine the root cause of this?

Cluster details:
  Distro: CentOS 7.2
  Release: 0.94.10-0.el7.x86_64
  OSDs: 1120
  RGW nodes: 10

See log messages below.  If you know how to improve the call trace
below I would like to hear that too.  I tried installing the
ceph-debuginfo-0.94.10-0.el7.x86_64 package, but that didn't seem to
help.

Thanks,
Bryan


# From /var/log/messages:

Sep  7 20:06:11 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  7 21:01:55 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  7 21:37:00 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  7 23:14:54 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  7 23:17:08 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 00:12:39 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 07:04:07 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 07:17:49 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 07:41:39 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 07:59:29 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated


# From /var/log/ceph/client.radosgw.p3cephrgw003.log:

 0> 2017-09-08 07:59:29.696615 7f7b296a2700 -1 *** Caught signal (Aborted) 
**
 in thread 7f7b296a2700

 ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
 1: /bin/radosgw() [0x6d3d92]
 2: (()+0xf100) [0x7f7f425e9100]
 3: (gsignal()+0x37) [0x7f7f4141d5f7]
 4: (abort()+0x148) [0x7f7f4141ece8]
 5: (()+0x75317) [0x7f7f4145d317]
 6: (__fortify_fail()+0x37) [0x7f7f414f5ac7]
 7: (()+0x10bc80) [0x7f7f414f3c80]
 8: (()+0x10da37) [0x7f7f414f5a37]
 9: (OS_Accept()+0xc1) [0x7f7f435bd8b1]
 10: (FCGX_Accept_r()+0x9c) [0x7f7f435bb91c]
 11: (RGWFCGXProcess::run()+0x7bf) [0x58136f]
 12: (RGWProcessControlThread::entry()+0xe) [0x5821fe]
 13: (()+0x7dc5) [0x7f7f425e1dc5]
 14: (clone()+0x6d) [0x7f7f414de21d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Client features by IP?

2017-09-07 Thread Bryan Stillwell
On 09/07/2017 10:47 AM, Josh Durgin wrote:
> On 09/06/2017 04:36 PM, Bryan Stillwell wrote:
> > I was reading this post by Josh Durgin today and was pretty happy to
> > see we can get a summary of features that clients are using with the
> > 'ceph features' command:
> >
> > http://ceph.com/community/new-luminous-upgrade-complete/
> >
> > However, I haven't found an option to display the IP address of
> > those clients with the older feature sets.  Is there a flag I can
> > pass to 'ceph features' to list the IPs associated with each feature
> > set?
>
> There is not currently, we should add that - it'll be easy to backport
> to luminous too. The only place both features and IP are shown is in
> 'debug mon = 10' logs right now.

I think that would be great!  The first thing I would want to do after
seeing an old client listed would be to find it and upgrade it.  Having
the IP of the client would make that a ton easier!

Anything I could do to help make that happen?  File a feature request
maybe?

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Client features by IP?

2017-09-06 Thread Bryan Stillwell
I was reading this post by Josh Durgin today and was pretty happy to see we can 
get a summary of features that clients are using with the 'ceph features' 
command:

http://ceph.com/community/new-luminous-upgrade-complete/

However, I haven't found an option to display the IP address of those clients 
with the older feature sets.  Is there a flag I can pass to 'ceph features' to 
list the IPs associated with each feature set?

Thanks,
Bryan 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] expanding cluster with minimal impact

2017-08-07 Thread Bryan Stillwell
Dan,

We recently went through an expansion of an RGW cluster and found that we 
needed 'norebalance' set whenever making CRUSH weight changes to avoid slow 
requests.  We were also increasing the CRUSH weight by 1.0 each time which 
seemed to reduce the extra data movement we were seeing with smaller weight 
increases.  Maybe something to try out next time?
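
The rough sequence we used looked something like this (the OSD id and weight
are examples; we bumped the CRUSH weight by ~1.0 at a time toward the final
value):

ceph osd set norebalance
ceph osd crush reweight osd.42 2.0
# wait for peering to settle, then let the data move:
ceph osd unset norebalance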

Bryan

From: ceph-users  on behalf of Dan van der 
Ster 
Date: Friday, August 4, 2017 at 1:59 AM
To: Laszlo Budai 
Cc: ceph-users 
Subject: Re: [ceph-users] expanding cluster with minimal impact

Hi Laszlo,

The script defaults are what we used to do a large intervention (the
default delta weight is 0.01). For our clusters going any faster
becomes disruptive, but this really depends on your cluster size and
activity.

BTW, in case it wasn't clear, to use this script for adding capacity
you need to create the new OSDs to your cluster with initial crush
weight = 0.0

osd crush initial weight = 0
osd crush update on start = true

-- Dan



On Thu, Aug 3, 2017 at 8:12 PM, Laszlo Budai  wrote:
Dear all,

I need to expand a ceph cluster with minimal impact. Reading previous
threads on this topic from the list I've found the ceph-gentle-reweight
script
(https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-gentle-reweight)
created by Dan van der Ster (Thank you Dan for sharing the script with us!).

I've done some experiments, and it looks promising, but the parameters need to
be set properly. Has any of you tested this script before? What is the
recommended delta_weight to use? From the default parameters of the script I
can see that the default delta weight is 0.5% of the target weight, which means
200 reweighting cycles. I have experimented with a reweight ratio of 5% while
running a fio test on a client. The results were OK (I mean no slow requests),
but my test cluster was a very small one.

If any of you has done some larger experiments with this script I would be
really interested to read about your results.

Thank you!
Laszlo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Speeding up garbage collection in RGW

2017-07-25 Thread Bryan Stillwell
Excellent, thank you!  It does exist in 0.94.10!  :)

Bryan

From: Pavan Rallabhandi <prallabha...@walmartlabs.com>
Date: Tuesday, July 25, 2017 at 11:21 AM
To: Bryan Stillwell <bstillw...@godaddy.com>, "ceph-users@lists.ceph.com" 
<ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Speeding up garbage collection in RGW

I’ve just realized that the option is present in Hammer (0.94.10) as well, you 
should try that.

From: Bryan Stillwell <bstillw...@godaddy.com>
Date: Tuesday, 25 July 2017 at 9:45 PM
To: Pavan Rallabhandi <prallabha...@walmartlabs.com>, 
"ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: EXT: Re: [ceph-users] Speeding up garbage collection in RGW

Unfortunately, we're on hammer still (0.94.10).  That option looks like it 
would work better, so maybe it's time to move the upgrade up in the schedule.

I've been playing with the various gc options and I haven't seen any speedups 
like we would need to remove them in a reasonable amount of time.

Thanks,
Bryan

From: Pavan Rallabhandi <prallabha...@walmartlabs.com>
Date: Tuesday, July 25, 2017 at 3:00 AM
To: Bryan Stillwell <bstillw...@godaddy.com>, "ceph-users@lists.ceph.com" 
<ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Speeding up garbage collection in RGW

If your Ceph version is >=Jewel, you can try the `--bypass-gc` option in 
radosgw-admin, which removes the tail objects as well without marking 
them to be GCed.
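
For example (the bucket name is illustrative, and it's worth confirming the
flag exists in your build before relying on it):

radosgw-admin bucket rm --bucket=bucket1 --purge-objects --bypass-gc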

Thanks,

On 25/07/17, 1:34 AM, "ceph-users on behalf of Bryan Stillwell" 
<ceph-users-boun...@lists.ceph.com<mailto:ceph-users-boun...@lists.ceph.com> on 
behalf of bstillw...@godaddy.com<mailto:bstillw...@godaddy.com>> wrote:

I'm in the process of cleaning up a test that an internal customer did on 
our production cluster that produced over a billion objects spread across 6000 
buckets.  So far I've been removing the buckets like this:

printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket 
rm --bucket={} --purge-objects

However, the disk usage doesn't seem to be getting reduced at the same rate 
the objects are being removed.  From what I can tell a large number of the 
objects are waiting for garbage collection.

When I first read the docs it sounded like the garbage collector would only 
remove 32 objects every hour, but after looking through the logs I'm seeing 
about 55,000 objects removed every hour.  That's about 1.3 million a day, so at 
this rate it'll take a couple years to clean up the rest!  For comparison, the 
purge-objects command above is removing (but not GC'ing) about 30 million 
objects a day, so a much more manageable 33 days to finish.

I've done some digging and it appears like I should be changing these 
configuration options:

rgw gc max objs (default: 32)
rgw gc obj min wait (default: 7200)
rgw gc processor max time (default: 3600)
rgw gc processor period (default: 3600)

A few questions I have though are:

Should 'rgw gc processor max time' and 'rgw gc processor period' always be 
set to the same value?

Which would be better, increasing 'rgw gc max objs' to something like 1024, 
or reducing the 'rgw gc processor' times to something like 60 seconds?

Any other guidance on the best way to adjust these values?

Thanks,
Bryan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Speeding up garbage collection in RGW

2017-07-25 Thread Bryan Stillwell
Unfortunately, we're on hammer still (0.94.10).  That option looks like it 
would work better, so maybe it's time to move the upgrade up in the schedule.

I've been playing with the various gc options and I haven't seen any speedups 
like we would need to remove them in a reasonable amount of time.

Thanks,
Bryan

From: Pavan Rallabhandi <prallabha...@walmartlabs.com>
Date: Tuesday, July 25, 2017 at 3:00 AM
To: Bryan Stillwell <bstillw...@godaddy.com>, "ceph-users@lists.ceph.com" 
<ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Speeding up garbage collection in RGW

If your Ceph version is >=Jewel, you can try the `--bypass-gc` option in 
radosgw-admin, which removes the tail objects as well without marking 
them to be GCed.

Thanks,

On 25/07/17, 1:34 AM, "ceph-users on behalf of Bryan Stillwell" 
<ceph-users-boun...@lists.ceph.com<mailto:ceph-users-boun...@lists.ceph.com> on 
behalf of bstillw...@godaddy.com<mailto:bstillw...@godaddy.com>> wrote:

I'm in the process of cleaning up a test that an internal customer did on 
our production cluster that produced over a billion objects spread across 6000 
buckets.  So far I've been removing the buckets like this:

printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket 
rm --bucket={} --purge-objects

However, the disk usage doesn't seem to be getting reduced at the same rate 
the objects are being removed.  From what I can tell a large number of the 
objects are waiting for garbage collection.

When I first read the docs it sounded like the garbage collector would only 
remove 32 objects every hour, but after looking through the logs I'm seeing 
about 55,000 objects removed every hour.  That's about 1.3 million a day, so at 
this rate it'll take a couple years to clean up the rest!  For comparison, the 
purge-objects command above is removing (but not GC'ing) about 30 million 
objects a day, so a much more manageable 33 days to finish.

I've done some digging and it appears like I should be changing these 
configuration options:

rgw gc max objs (default: 32)
rgw gc obj min wait (default: 7200)
rgw gc processor max time (default: 3600)
rgw gc processor period (default: 3600)

A few questions I have though are:

Should 'rgw gc processor max time' and 'rgw gc processor period' always be 
set to the same value?

Which would be better, increasing 'rgw gc max objs' to something like 1024, 
or reducing the 'rgw gc processor' times to something like 60 seconds?

Any other guidance on the best way to adjust these values?

Thanks,
Bryan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Speeding up garbage collection in RGW

2017-07-24 Thread Bryan Stillwell
Wouldn't doing it that way cause problems since references to the objects 
wouldn't be getting removed from .rgw.buckets.index?

Bryan

From: Roger Brown <rogerpbr...@gmail.com>
Date: Monday, July 24, 2017 at 2:43 PM
To: Bryan Stillwell <bstillw...@godaddy.com>, "ceph-users@lists.ceph.com" 
<ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Speeding up garbage collection in RGW

I hope someone else can answer your question better, but in my case I found 
something like this helpful to delete objects faster than I could through the 
gateway: 

rados -p default.rgw.buckets.data ls | grep 'replace this with pattern matching 
files you want to delete' | xargs -d '\n' -n 200 rados -p 
default.rgw.buckets.data rm


On Mon, Jul 24, 2017 at 2:02 PM Bryan Stillwell <bstillw...@godaddy.com> wrote:
I'm in the process of cleaning up a test that an internal customer did on our 
production cluster that produced over a billion objects spread across 6000 
buckets.  So far I've been removing the buckets like this:

printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket rm 
--bucket={} --purge-objects

However, the disk usage doesn't seem to be getting reduced at the same rate the 
objects are being removed.  From what I can tell a large number of the objects 
are waiting for garbage collection.

When I first read the docs it sounded like the garbage collector would only 
remove 32 objects every hour, but after looking through the logs I'm seeing 
about 55,000 objects removed every hour.  That's about 1.3 million a day, so at 
this rate it'll take a couple years to clean up the rest!  For comparison, the 
purge-objects command above is removing (but not GC'ing) about 30 million 
objects a day, so a much more manageable 33 days to finish.

I've done some digging and it appears like I should be changing these 
configuration options:

rgw gc max objs (default: 32)
rgw gc obj min wait (default: 7200)
rgw gc processor max time (default: 3600)
rgw gc processor period (default: 3600)

A few questions I have though are:

Should 'rgw gc processor max time' and 'rgw gc processor period' always be set 
to the same value?

Which would be better, increasing 'rgw gc max objs' to something like 1024, or 
reducing the 'rgw gc processor' times to something like 60 seconds?

Any other guidance on the best way to adjust these values?

Thanks,
Bryan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Speeding up garbage collection in RGW

2017-07-24 Thread Bryan Stillwell
I'm in the process of cleaning up a test that an internal customer did on our 
production cluster that produced over a billion objects spread across 6000 
buckets.  So far I've been removing the buckets like this:

printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket rm 
--bucket={} --purge-objects

However, the disk usage doesn't seem to be getting reduced at the same rate the 
objects are being removed.  From what I can tell a large number of the objects 
are waiting for garbage collection.

When I first read the docs it sounded like the garbage collector would only 
remove 32 objects every hour, but after looking through the logs I'm seeing 
about 55,000 objects removed every hour.  That's about 1.3 million a day, so at 
this rate it'll take a couple years to clean up the rest!  For comparison, the 
purge-objects command above is removing (but not GC'ing) about 30 million 
objects a day, so a much more manageable 33 days to finish.

I've done some digging and it appears like I should be changing these 
configuration options:

rgw gc max objs (default: 32)
rgw gc obj min wait (default: 7200)
rgw gc processor max time (default: 3600)
rgw gc processor period (default: 3600)
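
(For reference, these live in ceph.conf under the RGW client section; a hedged
sketch with purely illustrative values, and the section name here is only an
assumption about the deployment:)

[client.radosgw.gateway]
rgw gc max objs = 1024
rgw gc obj min wait = 300
rgw gc processor max time = 600
rgw gc processor period = 600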

A few questions I have though are:

Should 'rgw gc processor max time' and 'rgw gc processor period' always be set 
to the same value?

Which would be better, increasing 'rgw gc max objs' to something like 1024, or 
reducing the 'rgw gc processor' times to something like 60 seconds?

Any other guidance on the best way to adjust these values?

Thanks,
Bryan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Directory size doesn't match contents

2017-06-15 Thread Bryan Stillwell
On 6/15/17, 9:20 AM, "John Spray" <jsp...@redhat.com> wrote:
>
> On Wed, Jun 14, 2017 at 4:31 PM, Bryan Stillwell <bstillw...@godaddy.com> 
> wrote:
> > I have a cluster running 10.2.7 that is seeing some extremely large 
> > directory sizes in CephFS according to the recursive stats:
> >
> > $ ls -lhd Originals/
> > drwxrwxr-x 1 bryan bryan 16E Jun 13 13:27 Originals/
>
> What client (and version of the client) are you using?

I'm using the ceph-fuse client from the 10.2.7-1trusty packages.


> rstats being out of date is a known issue, but getting a completely
> bogus value like this is not.
>
> Do you get the correct value if you mount a new client and look from there?

I tried doing a new ceph-fuse mount on another host running
10.2.7-1trusty and also see the same problem there:

$ ceph-fuse --version
ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
[root@b3:/root]$ ls -ld /ceph/Originals
drwxrwxr-x 1 bryan bryan 1844674382704167 Jun 13 13:27 /ceph/Originals


I then tried mounting it with a newer kernel and rstats don't seem to be
working for that directory or any other directory:

[root@shilling:/root]$ uname -a
Linux shilling 4.8.0-52-generic #55~16.04.1-Ubuntu SMP Fri Apr 28 14:36:29 UTC 
2017 x86_64 x86_64 x86_64 GNU/Linux
[root@shilling:/root]$ ls -ld /ceph-old/{Logs,Music,Originals,Pictures}
drwxrwxr-x 1 bryan bryan 111 Feb 29  2016 /ceph-old/Logs
drwxr-xr-x 1 bryan bryan   5 Feb 17  2012 /ceph-old/Music
drwxrwxr-x 1 bryan bryan   1 Jun 13 13:27 /ceph-old/Originals
drwxr-xr-x 1 bryan bryan  25 Jul  1  2015 /ceph-old/Pictures

I also gave ceph-fuse in kraken a try too:

[root@shilling:/root]$ ceph-fuse --version
ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)
[root@shilling:/root]$ ls -ld /ceph-old/Originals
drwxrwxr-x 1 bryan bryan 1844674382704167 Jun 13 13:27 /ceph-old/Originals


Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Directory size doesn't match contents

2017-06-14 Thread Bryan Stillwell
I have a cluster running 10.2.7 that is seeing some extremely large directory 
sizes in CephFS according to the recursive stats:

$ ls -lhd Originals/
drwxrwxr-x 1 bryan bryan 16E Jun 13 13:27 Originals/

du reports a much smaller (and accurate) number:

$ du -sh Originals/
300GOriginals/

This directory recently saw some old rsync temporary files re-appear that I 
have since removed.  Perhaps that could be related?
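
(For anyone comparing the two numbers, the recursive stats can also be read
directly as CephFS virtual xattrs; a hedged sketch, assuming the client
exposes them:)

getfattr -n ceph.dir.rbytes Originals/
getfattr -n ceph.dir.rfiles Originals/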

Thanks,
Bryan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd_op_tp timeouts

2017-06-13 Thread Bryan Stillwell
Is this on an RGW cluster?

If so, you might be running into the same problem I was seeing with large 
bucket sizes:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018504.html

The solution is to shard your buckets so the bucket index doesn't get too big.
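
New buckets can also be created pre-sharded via ceph.conf; a hedged sketch (the
section name and shard count are illustrative, and it only affects buckets
created after the change):

[client.radosgw.gateway]
rgw override bucket index max shards = 16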

Bryan

From: ceph-users  on behalf of Tyler Bischel 

Date: Monday, June 12, 2017 at 5:12 PM
To: "ceph-us...@ceph.com" 
Subject: [ceph-users] osd_op_tp timeouts

Hi,
  We've been having this ongoing problem with threads timing out on the OSDs.  
Typically we'll see the OSD become unresponsive for about a minute, as threads 
from other OSDs time out.  The timeouts don't seem to be correlated to high 
load.  We turned up the logs to 10/10 for part of a day to catch some of these 
in progress, and saw the pattern below in the logs several times (grepping for 
individual threads involved in the time outs).

We are using Jewel 10.2.7.

Logs:

2017-06-12 18:45:12.530698 7f82ebfa8700 10 osd.30 pg_epoch: 5484 pg[10.6d2( v 
5484'12967030 (5469'12963946,5484'12967030] local-les=5476 n=419 ec=593 les/c/f 
5476/5476/0 5474/5475/5455) [27,16,30] r=2 lpr=5475 pi=4780-5474/109 luod=0'0 
lua=5484'12967019 crt=5484'12967027 lcod 5484'12967028 active] add_log_entry 
5484'12967030 (0'0) modify   
10:4b771c01:::0b405695-e5a7-467f-bb88-37ce8153f1ef.1270728618.3834_filter0634p1mdw1-11203-593EE138-2E:head
 by client.1274027169.0:3107075054 2017-06-12 18:45:12.523899

2017-06-12 18:45:12.530718 7f82ebfa8700 10 osd.30 pg_epoch: 5484 pg[10.6d2( v 
5484'12967030 (5469'12963946,5484'12967030] local-les=5476 n=419 ec=593 les/c/f 
5476/5476/0 5474/5475/5455) [27,16,30] r=2 lpr=5475 pi=4780-5474/109 luod=0'0 
lua=5484'12967019 crt=5484'12967028 lcod 5484'12967028 active] append_log: 
trimming to 5484'12967028 entries 5484'12967028 (5484'12967026) delete   
10:4b796a74:::0b405695-e5a7-467f-bb88-37ce8153f1ef.1270728618.3834_filter0469p1mdw1-21390-593EE137-57:head
 by client.1274027164.0:3183456083 2017-06-12 18:45:12.491741

2017-06-12 18:45:12.530754 7f82ebfa8700  5 write_log with: dirty_to: 0'0, 
dirty_from: 4294967295'18446744073709551615, dirty_divergent_priors: false, 
divergent_priors: 0, writeout_from: 5484'12967030, trimmed:

2017-06-12 18:45:28.171843 7f82dc503700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.171877 7f82dc402700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.174900 7f82d8887700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.174979 7f82d8786700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.248499 7f82df05e700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.248651 7f82df967700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.261044 7f82d8483700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15



Metrics:
OSD Disk IO Wait spikes from 2ms to 1s, CPU Procs Blocked spikes from 0 to 16, 
IO In progress spikes from 0 to hundreds, IO Time Weighted, IO Time spike.  
Average Queue Size on the device spikes.  One minute later, Write Time, Reads, 
and Read Time spike briefly.

Any thoughts on what may be causing this behavior?

--Tyler

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Living with huge bucket sizes

2017-06-08 Thread Bryan Stillwell
This has come up quite a few times before, but since I was only working with
RBD before I didn't pay too close attention to the conversation.  I'm looking
for the best way to handle existing clusters that have buckets with a large
number of objects (>20 million) in them.  The cluster I'm doing test on is
currently running hammer (0.94.10), so if things got better in jewel I would
love to hear about it!

One idea I've played with is to create a new SSD pool by adding an OSD
to every journal SSD.  My thinking was that our data is mostly small
objects (~100KB) so the journal drives were unlikely to be getting close
to any throughput limitations.  They should also have plenty of IOPs
left to handle the .rgw.buckets.index pool.

So on our test cluster I created a separate root that I called
rgw-buckets-index, I added all the OSDs I created on the journal SSDs,
and created a new crush rule to place data on it:

ceph osd crush rule create-simple rgw-buckets-index_ruleset rgw-buckets-index chassis

Once everything was set up correctly I tried switching the
.rgw.buckets.index pool over to it by doing:

ceph osd set norebalance
ceph osd pool set .rgw.buckets.index crush_ruleset 1
# Wait for peering to complete
ceph osd unset norebalance

Things started off well, but once it got to backfilling the PGs which
have the large buckets on them, I started seeing a large number of slow
requests like these:

  ack+ondisk+write+known_if_redirected e68708) currently waiting for degraded 
object
  ondisk+write+known_if_redirected e68708) currently waiting for degraded object
  ack+ondisk+write+known_if_redirected e68708) currently waiting for rw locks

Digging in on the OSDs, it seems they would either restart or die after
seeing a lot of these messages:

  heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f8f5d604700' had timed 
out after 30

or:

  heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f99ec2e4700' had timed out 
after 15

The ones that died saw messages like these:

  heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fcd59e7c700' had timed 
out after 60

Followed by:

  heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fcd48c1d700' had suicide 
timed out after 150


The backfilling process would appear to hang on some of the PGs, but I
figured out that they were recovering omap data and was able to keep an
eye on the process by running:

watch 'ceph pg 272.22 query | grep omap_recovered_to'

A lot of the timeouts happened after the PGs finished the omap recovery,
which took over an hour on one of the PGs.

Has anyone found a good solution for this for existing large buckets?  I
know sharding is the solution going forward, but afaik it can't be done
on existing buckets yet (although the dynamic resharding work mentioned
on today's performance call sounds promising).

Thanks,
Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] centos and 'print continue' support

2014-05-23 Thread Bryan Stillwell
Yesterday I went through manually configuring a ceph cluster with a
rados gateway on centos 6.5, and I have a question about the
documentation.  On this page:

https://ceph.com/docs/master/radosgw/config/

It mentions "On CentOS/RHEL distributions, turn off print continue. If
you have it set to true, you may encounter problems with PUT
operations."  However, when I had 'rgw print continue = false' in my
ceph.conf, adding objects with the python boto module would hang at:

key.set_contents_from_string('Hello World!')

After switching it to 'rgw print continue = true' things started working.
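
(For clarity, the relevant ceph.conf line; the gateway section name is an
assumption about the setup:)

[client.radosgw.gateway]
rgw print continue = true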

I'm wondering if this is because I installed the custom
apache/mod_fastcgi packages from the instructions on this page?:

http://ceph.com/docs/master/install/install-ceph-gateway/#id2

If that's the case, could the docs be updated to mention that setting
'rgw print continue = false' is only needed if you're using the distro
packages?

Thanks,
Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Full OSD with 29% free

2013-10-31 Thread Bryan Stillwell
Shain,

After getting the segfaults when running 'xfs_db -r -c freesp -s' on
a couple partitions, I'm concerned that 2K block sizes aren't nearly
as well tested as 4K block sizes.  This could just be a problem with
RHEL/CentOS 6.4 though, so if you're using a newer kernel the problem
might already be fixed.  There also appears to be more overhead with
2K block sizes which I believe manifests as high CPU usage by the
xfsalloc processes.  However, my cluster has been running in a clean
state for over 24 hours and none of the scrubs have found a problem
yet.

According to 'ceph -s' my cluster has the following stats:

 osdmap e16882: 40 osds: 40 up, 40 in
  pgmap v3520420: 2808 pgs, 13 pools, 5694 GB data, 72705 kobjects
18095 GB used, 13499 GB / 31595 GB avail

That's about 78k per object on average (5694 GB of data spread across
roughly 72.7 million objects), so if your files aren't that small I
would stay with 4K block sizes to avoid headaches.

Bryan


On Thu, Oct 31, 2013 at 6:43 AM, Shain Miley smi...@npr.org wrote:

 Bryan,

 We are setting up a cluster using xfs and have been a bit concerned about 
 running into similar issues to the ones you described below.

 I am just wondering if you came across any potential downsides to using a 2K 
 block size with xfs on your osd's.

 Thanks,

 Shain

 Shain Miley | Manager of Systems and Infrastructure, Digital Media | 
 smi...@npr.org | 202.513.3649

 
 From: ceph-users-boun...@lists.ceph.com [ceph-users-boun...@lists.ceph.com] 
 on behalf of Bryan Stillwell [bstillw...@photobucket.com]
 Sent: Wednesday, October 30, 2013 2:18 PM
 To: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Full OSD with 29% free

 I wanted to report back on this since I've made some progress on
 fixing this issue.

 After converting every OSD on a single server to use a 2K block size,
 I've been able to cross 90% utilization without running into the 'No
 space left on device' problem.  They're currently between 51% and 75%,
 but I hit 90% over the weekend after a couple OSDs died during
 recovery.

 This conversion was pretty rough though with OSDs randomly dying
 multiple times during the process (logs point at suicide time outs).
 When looking at top I would frequently see xfsalloc pegging multiple
 cores, so I wonder if that has something to do with it.  I also had
 the 'xfs_db -r -c freesp -s' command segfault on me a few times
 which was fixed by running xfs_repair on those partitions.  This has
 me wondering how well XFS is tested with non-default block sizes on
 CentOS 6.4...

 Anyways, after about a week I was finally able to get the cluster to
 fully recover today.  Now I need to repeat the process on 7 more
 servers before I can finish populating my cluster...

 In case anyone is wondering how I switched to a 2K block size, this is
 what I added to my ceph.conf:

 [osd]
 osd_mount_options_xfs = rw,noatime,inode64
 osd_mkfs_options_xfs = -f -b size=2048


 The cluster is currently running the 0.71 release.

 Bryan

 On Mon, Oct 21, 2013 at 2:39 PM, Bryan Stillwell
 bstillw...@photobucket.com wrote:
  So I'm running into this issue again and after spending a bit of time
  reading the XFS mailing lists, I believe the free space is too
  fragmented:
 
  [root@den2ceph001 ceph-0]# xfs_db -r -c freesp -s /dev/sdb1
    from      to extents  blocks    pct
1   1 85773 85773   0.24
2   3  176891  444356   1.27
4   7  430854 2410929   6.87
8  15 2327527 30337352  86.46
   16  31   75871 1809577   5.16
  total free extents 3096916
  total free blocks 35087987
  average free extent size 11.33
 
 
  Compared to a drive which isn't reporting 'No space left on device':
 
  [root@den2ceph008 ~]# xfs_db -r -c freesp -s /dev/sdc1
    from      to extents   blocks    pct
       1       1  133148   133148   0.15
       2       3  320737   808506   0.94
       4       7  809748  4532573   5.27
       8      15 4536681 59305608  68.96
      16      31   31531   751285   0.87
      32      63     364    16367   0.02
      64     127      90     9174   0.01
     128     255       9     2072   0.00
     256     511      48    18018   0.02
     512    1023     128   102422   0.12
    1024    2047     290   451017   0.52
    2048    4095     538  1649408   1.92
    4096    8191     851  5066070   5.89
    8192   16383     746  8436029   9.81
   16384   32767     194  4042573   4.70
   32768   65535      15   614301   0.71
   65536  131071       1    66630   0.08
  total free extents 5835119
  total free blocks 86005201
  average free extent size 14.7392
 
 
  What I'm wondering is if reducing the block size from 4K to 2K (or 1K)
  would help?  I'm pretty sure this would require re-running
  mkfs.xfs on every OSD to fix if that's the case...
 
  Thanks,
  Bryan
 
 
  On Mon, Oct 14, 2013 at 5:28 PM, Bryan Stillwell
  bstillw...@photobucket.com wrote:
 
  The filesystem isn't as full now, but the fragmentation is pretty low:
 
  [root

Re: [ceph-users] Full OSD with 29% free

2013-10-31 Thread Bryan Stillwell
Shain,

I investigated the segfault a little more since I sent this message
and found this email thread:

http://oss.sgi.com/archives/xfs/2012-06/msg00066.html

After reading that I did the following:

[root@den2ceph001 ~]# xfs_db -r -c freesp -s /dev/sdb1
Segmentation fault (core dumped)
[root@den2ceph001 ~]# service ceph stop osd.0
=== osd.0 ===
Stopping Ceph osd.0 on den2ceph001...kill 2407...kill 2407...done
[root@den2ceph001 ~]# umount /dev/sdb1
[root@den2ceph001 ~]# xfs_db -r -c freesp -s /dev/sdb1
    from      to extents   blocks    pct
       1       1   44510    44510   0.05
       2       3   60341   142274   0.16
       4       7   68836   355735   0.39
       8      15  274122  3212122   3.50
      16      31 1429274 37611619  41.02
      32      63   43225  1945740   2.12
      64     127   39480  3585579   3.91
     128     255   36046  6544005   7.14
     256     511   30946 10899979  11.89
     512    1023   14119  9907129  10.80
    1024    2047    5727  7998938   8.72
    2048    4095    2647  6811258   7.43
    4096    8191     362  1940622   2.12
    8192   16383      59   603690   0.66
   16384   32767       5    90464   0.10
total free extents 2049699
total free blocks 91693664
average free extent size 44.7352


That gives me a little more confidence in using 2K block sizes now.  :)

Bryan

On Thu, Oct 31, 2013 at 11:02 AM, Bryan Stillwell
bstillw...@photobucket.com wrote:
 Shain,

 After getting the segfaults when running 'xfs_db -r -c freesp -s' on
 a couple partitions, I'm concerned that 2K block sizes aren't nearly
 as well tested as 4K block sizes.  This could just be a problem with
 RHEL/CentOS 6.4 though, so if you're using a newer kernel the problem
 might already be fixed.  There also appears to be more overhead with
 2K block sizes which I believe manifests as high CPU usage by the
 xfsalloc processes.  However, my cluster has been running in a clean
 state for over 24 hours and none of the scrubs have found a problem
 yet.

 According to 'ceph -s' my cluster has the following stats:

  osdmap e16882: 40 osds: 40 up, 40 in
   pgmap v3520420: 2808 pgs, 13 pools, 5694 GB data, 72705 kobjects
 18095 GB used, 13499 GB / 31595 GB avail

 That's about 78k per object on average, so if your files aren't that
 small I would stay with 4K block sizes to avoid headaches.

 Bryan


 On Thu, Oct 31, 2013 at 6:43 AM, Shain Miley smi...@npr.org wrote:

 Bryan,

 We are setting up a cluster using xfs and have been a bit concerned about 
 running into similar issues to the ones you described below.

 I am just wondering if you came across any potential downsides to using a 2K 
 block size with xfs on your osd's.

 Thanks,

 Shain

 Shain Miley | Manager of Systems and Infrastructure, Digital Media | 
 smi...@npr.org | 202.513.3649

 
 From: ceph-users-boun...@lists.ceph.com [ceph-users-boun...@lists.ceph.com] 
 on behalf of Bryan Stillwell [bstillw...@photobucket.com]
 Sent: Wednesday, October 30, 2013 2:18 PM
 To: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Full OSD with 29% free

 I wanted to report back on this since I've made some progress on
 fixing this issue.

 After converting every OSD on a single server to use a 2K block size,
 I've been able to cross 90% utilization without running into the 'No
 space left on device' problem.  They're currently between 51% and 75%,
 but I hit 90% over the weekend after a couple OSDs died during
 recovery.

 This conversion was pretty rough though with OSDs randomly dying
 multiple times during the process (logs point at suicide time outs).
 When looking at top I would frequently see xfsalloc pegging multiple
 cores, so I wonder if that has something to do with it.  I also had
 the 'xfs_db -r -c freesp -s' command segfault on me a few times
 which was fixed by running xfs_repair on those partitions.  This has
 me wondering how well XFS is tested with non-default block sizes on
 CentOS 6.4...

 Anyways, after about a week I was finally able to get the cluster to
 fully recover today.  Now I need to repeat the process on 7 more
 servers before I can finish populating my cluster...

 In case anyone is wondering how I switched to a 2K block size, this is
 what I added to my ceph.conf:

 [osd]
 osd_mount_options_xfs = rw,noatime,inode64
 osd_mkfs_options_xfs = -f -b size=2048


 The cluster is currently running the 0.71 release.

 Bryan

 On Mon, Oct 21, 2013 at 2:39 PM, Bryan Stillwell
 bstillw...@photobucket.com wrote:
  So I'm running into this issue again and after spending a bit of time
  reading the XFS mailing lists, I believe the free space is too
  fragmented:
 
  [root@den2ceph001 ceph-0]# xfs_db -r -c freesp -s /dev/sdb1
   from      to extents  blocks    pct
1   1 85773 85773   0.24
2   3  176891  444356   1.27
4   7  430854 2410929   6.87
8  15 2327527 30337352  86.46
   16  31   75871 1809577   5.16
  total free extents 3096916
  total free

Re: [ceph-users] Full OSD with 29% free

2013-10-30 Thread Bryan Stillwell
I wanted to report back on this since I've made some progress on
fixing this issue.

After converting every OSD on a single server to use a 2K block size,
I've been able to cross 90% utilization without running into the 'No
space left on device' problem.  They're currently between 51% and 75%,
but I hit 90% over the weekend after a couple OSDs died during
recovery.

This conversion was pretty rough though with OSDs randomly dying
multiple times during the process (logs point at suicide time outs).
When looking at top I would frequently see xfsalloc pegging multiple
cores, so I wonder if that has something to do with it.  I also had
the 'xfs_db -r -c freesp -s' command segfault on me a few times
which was fixed by running xfs_repair on those partitions.  This has
me wondering how well XFS is tested with non-default block sizes on
CentOS 6.4...

Anyways, after about a week I was finally able to get the cluster to
fully recover today.  Now I need to repeat the process on 7 more
servers before I can finish populating my cluster...

In case anyone is wondering how I switched to a 2K block size, this is
what I added to my ceph.conf:

[osd]
osd_mount_options_xfs = rw,noatime,inode64
osd_mkfs_options_xfs = -f -b size=2048
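
(Those options only take effect when an OSD's filesystem is created or
mounted; the roughly equivalent manual commands, with the device and mount
point purely illustrative, would be something like:)

mkfs.xfs -f -b size=2048 /dev/sdb1
mount -o rw,noatime,inode64 /dev/sdb1 /var/lib/ceph/osd/ceph-0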


The cluster is currently running the 0.71 release.

Bryan

On Mon, Oct 21, 2013 at 2:39 PM, Bryan Stillwell
bstillw...@photobucket.com wrote:
 So I'm running into this issue again and after spending a bit of time
 reading the XFS mailing lists, I believe the free space is too
 fragmented:

 [root@den2ceph001 ceph-0]# xfs_db -r -c freesp -s /dev/sdb1
    from      to extents  blocks    pct
   1   1 85773 85773   0.24
   2   3  176891  444356   1.27
   4   7  430854 2410929   6.87
   8  15 2327527 30337352  86.46
  16  31   75871 1809577   5.16
 total free extents 3096916
 total free blocks 35087987
 average free extent size 11.33


 Compared to a drive which isn't reporting 'No space left on device':

 [root@den2ceph008 ~]# xfs_db -r -c freesp -s /dev/sdc1
    from      to extents   blocks    pct
       1       1  133148   133148   0.15
       2       3  320737   808506   0.94
       4       7  809748  4532573   5.27
       8      15 4536681 59305608  68.96
      16      31   31531   751285   0.87
      32      63     364    16367   0.02
      64     127      90     9174   0.01
     128     255       9     2072   0.00
     256     511      48    18018   0.02
     512    1023     128   102422   0.12
    1024    2047     290   451017   0.52
    2048    4095     538  1649408   1.92
    4096    8191     851  5066070   5.89
    8192   16383     746  8436029   9.81
   16384   32767     194  4042573   4.70
   32768   65535      15   614301   0.71
   65536  131071       1    66630   0.08
 total free extents 5835119
 total free blocks 86005201
 average free extent size 14.7392


 What I'm wondering is if reducing the block size from 4K to 2K (or 1K)
 would help?  I'm pretty sure this would require re-running
 mkfs.xfs on every OSD to fix if that's the case...

 Thanks,
 Bryan


 On Mon, Oct 14, 2013 at 5:28 PM, Bryan Stillwell
 bstillw...@photobucket.com wrote:

 The filesystem isn't as full now, but the fragmentation is pretty low:

 [root@den2ceph001 ~]# df /dev/sdc1
 Filesystem   1K-blocks  Used Available Use% Mounted on
 /dev/sdc1      486562672 270845628 215717044  56% /var/lib/ceph/osd/ceph-1
 [root@den2ceph001 ~]# xfs_db -c frag -r /dev/sdc1
 actual 3481543, ideal 3447443, fragmentation factor 0.98%

 Bryan

 On Mon, Oct 14, 2013 at 4:35 PM, Michael Lowe j.michael.l...@gmail.com 
 wrote:
 
  How fragmented is that file system?
 
  Sent from my iPad
 
   On Oct 14, 2013, at 5:44 PM, Bryan Stillwell 
   bstillw...@photobucket.com wrote:
  
   This appears to be more of an XFS issue than a ceph issue, but I've
   run into a problem where some of my OSDs failed because the filesystem
   was reported as full even though there was 29% free:
  
   [root@den2ceph001 ceph-1]# touch blah
   touch: cannot touch `blah': No space left on device
   [root@den2ceph001 ceph-1]# df .
   Filesystem   1K-blocks  Used Available Use% Mounted on
   /dev/sdc1486562672 342139340 144423332  71% 
   /var/lib/ceph/osd/ceph-1
   [root@den2ceph001 ceph-1]# df -i .
   FilesystemInodes   IUsed   IFree IUse% Mounted on
   /dev/sdc160849984 4097408 567525767% 
   /var/lib/ceph/osd/ceph-1
   [root@den2ceph001 ceph-1]#
  
   I've tried remounting the filesystem with the inode64 option like a
   few people recommended, but that didn't help (probably because it
   doesn't appear to be running out of inodes).
  
   This happened while I was on vacation and I'm pretty sure it was
   caused by another OSD failing on the same node.  I've been able to
   recover from the situation by bringing the failed OSD back online, but
   it's only a matter of time until I'll be running into this issue again
   since my cluster is still being populated

Re: [ceph-users] Full OSD with 29% free

2013-10-14 Thread Bryan Stillwell
The filesystem isn't as full now, but the fragmentation is pretty low:

[root@den2ceph001 ~]# df /dev/sdc1
Filesystem   1K-blocks  Used Available Use% Mounted on
/dev/sdc1      486562672 270845628 215717044  56% /var/lib/ceph/osd/ceph-1
[root@den2ceph001 ~]# xfs_db -c frag -r /dev/sdc1
actual 3481543, ideal 3447443, fragmentation factor 0.98%

Bryan

On Mon, Oct 14, 2013 at 4:35 PM, Michael Lowe j.michael.l...@gmail.com wrote:

 How fragmented is that file system?

 Sent from my iPad

  On Oct 14, 2013, at 5:44 PM, Bryan Stillwell bstillw...@photobucket.com 
  wrote:
 
  This appears to be more of an XFS issue than a ceph issue, but I've
  run into a problem where some of my OSDs failed because the filesystem
  was reported as full even though there was 29% free:
 
  [root@den2ceph001 ceph-1]# touch blah
  touch: cannot touch `blah': No space left on device
  [root@den2ceph001 ceph-1]# df .
  Filesystem   1K-blocks  Used Available Use% Mounted on
   /dev/sdc1      486562672 342139340 144423332  71% /var/lib/ceph/osd/ceph-1
   [root@den2ceph001 ceph-1]# df -i .
   Filesystem       Inodes   IUsed    IFree IUse% Mounted on
   /dev/sdc1      60849984 4097408 56752576    7% /var/lib/ceph/osd/ceph-1
  [root@den2ceph001 ceph-1]#
 
  I've tried remounting the filesystem with the inode64 option like a
  few people recommended, but that didn't help (probably because it
  doesn't appear to be running out of inodes).
 
  This happened while I was on vacation and I'm pretty sure it was
  caused by another OSD failing on the same node.  I've been able to
  recover from the situation by bringing the failed OSD back online, but
  it's only a matter of time until I'll be running into this issue again
  since my cluster is still being populated.
 
  Any ideas on things I can try the next time this happens?
 
  Thanks,
  Bryan
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance issues with small files

2013-09-05 Thread Bryan Stillwell
Wouldn't using only the first two characters in the file name result
in less than 65k buckets being used?

For example if the file names contained 0-9 and a-f, that would only
be 256 buckets (16*16).  Or if they contained 0-9, a-z, and A-Z, that
would only be 3,844 buckets (62 * 62).
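
(A minimal sketch of the naming scheme being described, assuming the bucket is
keyed off the first two characters of the file name:)

filename=a3f9c0.jpg
bucket="bucket-${filename:0:2}"    # -> bucket-a3; hex names give 16*16 = 256 buckets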

Bryan


On Thu, Sep 5, 2013 at 8:19 AM, Bill Omer bill.o...@gmail.com wrote:

 That's correct.  We created 65k buckets, using two hex characters as the 
 naming convention, then stored the files in each container based on the 
 first two characters of the file name.  The end result was 20-50 files per 
 bucket.  Once all of the buckets were created and files were being loaded, we 
 still observed an increase in latency over time.

 Is there a way to disable indexing?  Or are there other settings you can 
 suggest to attempt to speed this process up?


 On Wed, Sep 4, 2013 at 5:21 PM, Mark Nelson mark.nel...@inktank.com wrote:

 Just for clarification, distributing objects over lots of buckets isn't 
 helping improve small object performance?

 The degradation over time is similar to something I've seen in the past, 
 with higher numbers of seeks on the underlying OSD device over time.  Is it 
 always (temporarily) resolved writing to a new empty bucket?

 Mark


 On 09/04/2013 02:45 PM, Bill Omer wrote:

 We've actually done the same thing, creating 65k buckets and storing
 20-50 objects in each.  No change really, not noticeable anyway


 On Wed, Sep 4, 2013 at 2:43 PM, Bryan Stillwell
 bstillw...@photobucket.com mailto:bstillw...@photobucket.com wrote:

 So far I haven't seen much of a change.  It's still working through
 removing the bucket that reached 1.5 million objects though (my
 guess is that'll take a few more days), so I believe that might have
 something to do with it.

 Bryan


 On Wed, Sep 4, 2013 at 12:14 PM, Mark Nelson
 mark.nel...@inktank.com mailto:mark.nel...@inktank.com wrote:

 Bryan,

 Good explanation.  How's performance now that you've spread the
 load over multiple buckets?

 Mark

 On 09/04/2013 12:39 PM, Bryan Stillwell wrote:

 Bill,

 I've run into a similar issue with objects averaging
 ~100KiB.  The
 explanation I received on IRC is that there are scaling
 issues if you're
 uploading them all to the same bucket because the index
 isn't sharded.
The recommended solution is to spread the objects out to
 a lot of
 buckets.  However, that ran me into another issue once I hit
 1000
 buckets which is a per user limit.  I switched the limit to
 be unlimited
 with this command:

 radosgw-admin user modify --uid=your_username --max-buckets=0

 Bryan


 On Wed, Sep 4, 2013 at 11:27 AM, Bill Omer
 bill.o...@gmail.com mailto:bill.o...@gmail.com
 mailto:bill.o...@gmail.com mailto:bill.o...@gmail.com

 wrote:

  I'm testing ceph for storing a very large number of
 small files.
I'm seeing some performance issues and would like to
 see if anyone
  could offer any insight as to what I could do to
 correct this.

  Some numbers:

  Uploaded 184111 files, with an average file size of
 5KB, using
  10 separate servers to upload the request using Python
 and the
  cloudfiles module.  I stopped uploading after 53
 minutes, which
  seems to average 5.7 files per second per node.


  My storage cluster consists of 21 OSD's across 7
 servers, with their
  journals written to SSD drives.  I've done a default
 installation,
  using ceph-deploy with the dumpling release.

  I'm using statsd to monitor the performance, and what's
 interesting
  is when I start with an empty bucket, performance is
 amazing, with
  average response times of 20-50ms.  However as time
 goes on, the
  response times go in to the hundreds, and the average
 number of
  uploads per second drops.

  I've installed radosgw on all 7 ceph servers.  I've
 tested using a
  load balancer to distribute the api calls, as well as
 pointing the
  10 worker servers to a single instance.  I've not seen
 a real
  different in performance with this either.


  Each of the ceph servers are 16 core Xeon 2.53GHz with
 72GB of ram,
  OCZ Vertex4 SSD drives for the journals and Seagate
 Barracuda ES2

Re: [ceph-users] Performance issues with small files

2013-09-05 Thread Bryan Stillwell
Mark,

Yesterday I blew away all the objects and restarted my test using
multiple buckets, and things are definitely better!

After ~20 hours I've already uploaded ~3.5 million objects, which is
much better than the ~1.5 million I did over ~96 hours this past
weekend.  Unfortunately it seems that things have slowed down a bit.
The average upload rate over those first 20 hours was ~48
objects/second, but now I'm only seeing ~20 objects/second.  This is
with 18,836 buckets.

Bryan

On Wed, Sep 4, 2013 at 12:43 PM, Bryan Stillwell
bstillw...@photobucket.com wrote:
 So far I haven't seen much of a change.  It's still working through removing
 the bucket that reached 1.5 million objects though (my guess is that'll take
 a few more days), so I believe that might have something to do with it.

 Bryan


 On Wed, Sep 4, 2013 at 12:14 PM, Mark Nelson mark.nel...@inktank.com
 wrote:

 Bryan,

 Good explanation.  How's performance now that you've spread the load over
 multiple buckets?

 Mark

 On 09/04/2013 12:39 PM, Bryan Stillwell wrote:

 Bill,

 I've run into a similar issue with objects averaging ~100KiB.  The
 explanation I received on IRC is that there are scaling issues if you're
 uploading them all to the same bucket because the index isn't sharded.
   The recommended solution is to spread the objects out to a lot of
 buckets.  However, that ran me into another issue once I hit 1000
 buckets which is a per user limit.  I switched the limit to be unlimited
 with this command:

 radosgw-admin user modify --uid=your_username --max-buckets=0

 Bryan


 On Wed, Sep 4, 2013 at 11:27 AM, Bill Omer bill.o...@gmail.com
 mailto:bill.o...@gmail.com wrote:

 I'm testing ceph for storing a very large number of small files.
   I'm seeing some performance issues and would like to see if anyone
 could offer any insight as to what I could do to correct this.

 Some numbers:

 Uploaded 184111 files, with an average file size of 5KB, using
 10 separate servers to upload the request using Python and the
 cloudfiles module.  I stopped uploading after 53 minutes, which
 seems to average 5.7 files per second per node.


 My storage cluster consists of 21 OSD's across 7 servers, with their
 journals written to SSD drives.  I've done a default installation,
 using ceph-deploy with the dumpling release.

 I'm using statsd to monitor the performance, and what's interesting
 is when I start with an empty bucket, performance is amazing, with
 average response times of 20-50ms.  However as time goes on, the
 response times go in to the hundreds, and the average number of
 uploads per second drops.

 I've installed radosgw on all 7 ceph servers.  I've tested using a
 load balancer to distribute the api calls, as well as pointing the
 10 worker servers to a single instance.  I've not seen a real
 different in performance with this either.


 Each of the ceph servers are 16 core Xeon 2.53GHz with 72GB of ram,
 OCZ Vertex4 SSD drives for the journals and Seagate Barracuda ES2
 drives for storage.


 Any help would be greatly appreciated.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance issues with small files

2013-09-05 Thread Bryan Stillwell
I need to restart the upload process again because all the objects
have a content-type of 'binary/octet-stream' instead of 'image/jpeg',
'image/png', etc.  I plan on enabling monitoring this time so we can
see if there are any signs of what might be going on.  Did you want me
to increase the number of buckets to see if that changes anything?
This is pretty easy for me to do.

Bryan

On Thu, Sep 5, 2013 at 11:07 AM, Mark Nelson mark.nel...@inktank.com wrote:
 based on your numbers, you were at something like an average of 186 objects
 per bucket at the 20 hour mark?  I wonder how this trend compares to what
 you'd see with a single bucket.

 With that many buckets you should have indexes well spread across all of the
 OSDs.  It'd be interesting to know what the iops/throughput is on all of
 your OSDs now (blktrace/seekwatcher can help here, but they are not the
 easiest tools to setup/use).

 Mark

 On 09/05/2013 11:59 AM, Bryan Stillwell wrote:

 Mark,

 Yesterday I blew away all the objects and restarted my test using
 multiple buckets, and things are definitely better!

 After ~20 hours I've already uploaded ~3.5 million objects, which is
 much better than the ~1.5 million I did over ~96 hours this past
 weekend.  Unfortunately it seems that things have slowed down a bit.
 The average upload rate over those first 20 hours was ~48
 objects/second, but now I'm only seeing ~20 objects/second.  This is
 with 18,836 buckets.

 Bryan

 On Wed, Sep 4, 2013 at 12:43 PM, Bryan Stillwell
 bstillw...@photobucket.com wrote:

 So far I haven't seen much of a change.  It's still working through
 removing
 the bucket that reached 1.5 million objects though (my guess is that'll
 take
 a few more days), so I believe that might have something to do with it.

 Bryan


 On Wed, Sep 4, 2013 at 12:14 PM, Mark Nelson mark.nel...@inktank.com
 wrote:


 Bryan,

 Good explanation.  How's performance now that you've spread the load
 over
 multiple buckets?

 Mark

 On 09/04/2013 12:39 PM, Bryan Stillwell wrote:


 Bill,

 I've run into a similar issue with objects averaging ~100KiB.  The
 explanation I received on IRC is that there are scaling issues if
 you're
 uploading them all to the same bucket because the index isn't sharded.
The recommended solution is to spread the objects out to a lot of
 buckets.  However, that ran me into another issue once I hit 1000
 buckets which is a per user limit.  I switched the limit to be
 unlimited
 with this command:

 radosgw-admin user modify --uid=your_username --max-buckets=0

 Bryan


 On Wed, Sep 4, 2013 at 11:27 AM, Bill Omer bill.o...@gmail.com
 mailto:bill.o...@gmail.com wrote:

  I'm testing ceph for storing a very large number of small files.
I'm seeing some performance issues and would like to see if
 anyone
  could offer any insight as to what I could do to correct this.

  Some numbers:

  Uploaded 184111 files, with an average file size of 5KB, using
  10 separate servers to upload the request using Python and the
  cloudfiles module.  I stopped uploading after 53 minutes, which
  seems to average 5.7 files per second per node.


  My storage cluster consists of 21 OSD's across 7 servers, with
 their
  journals written to SSD drives.  I've done a default installation,
  using ceph-deploy with the dumpling release.

  I'm using statsd to monitor the performance, and what's
 interesting
  is when I start with an empty bucket, performance is amazing, with
  average response times of 20-50ms.  However as time goes on, the
  response times go in to the hundreds, and the average number of
  uploads per second drops.

  I've installed radosgw on all 7 ceph servers.  I've tested using a
  load balancer to distribute the api calls, as well as pointing the
  10 worker servers to a single instance.  I've not seen a real
  different in performance with this either.


  Each of the ceph servers are 16 core Xeon 2.53GHz with 72GB of
 ram,
  OCZ Vertex4 SSD drives for the journals and Seagate Barracuda ES2
  drives for storage.


  Any help would be greatly appreciated.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance issues with small files

2013-09-04 Thread Bryan Stillwell
So far I haven't seen much of a change.  It's still working through
removing the bucket that reached 1.5 million objects though (my guess is
that'll take a few more days), so I believe that might have something to do
with it.

Bryan


On Wed, Sep 4, 2013 at 12:14 PM, Mark Nelson mark.nel...@inktank.comwrote:

 Bryan,

 Good explanation.  How's performance now that you've spread the load over
 multiple buckets?

 Mark

 On 09/04/2013 12:39 PM, Bryan Stillwell wrote:

 Bill,

 I've run into a similar issue with objects averaging ~100KiB.  The
 explanation I received on IRC is that there are scaling issues if you're
 uploading them all to the same bucket because the index isn't sharded.
   The recommended solution is to spread the objects out to a lot of
 buckets.  However, that ran me into another issue once I hit 1000
 buckets which is a per user limit.  I switched the limit to be unlimited
 with this command:

 radosgw-admin user modify --uid=your_username --max-buckets=0

 Bryan


 On Wed, Sep 4, 2013 at 11:27 AM, Bill Omer bill.o...@gmail.com
 mailto:bill.o...@gmail.com wrote:

 I'm testing ceph for storing a very large number of small files.
   I'm seeing some performance issues and would like to see if anyone
 could offer any insight as to what I could do to correct this.

 Some numbers:

 Uploaded 184111 files, with an average file size of 5KB, using
 10 separate servers to upload the request using Python and the
 cloudfiles module.  I stopped uploading after 53 minutes, which
 seems to average 5.7 files per second per node.


 My storage cluster consists of 21 OSD's across 7 servers, with their
 journals written to SSD drives.  I've done a default installation,
 using ceph-deploy with the dumpling release.

 I'm using statsd to monitor the performance, and what's interesting
 is when I start with an empty bucket, performance is amazing, with
 average response times of 20-50ms.  However as time goes on, the
 response times go in to the hundreds, and the average number of
 uploads per second drops.

 I've installed radosgw on all 7 ceph servers.  I've tested using a
 load balancer to distribute the api calls, as well as pointing the
 10 worker servers to a single instance.  I've not seen a real
 different in performance with this either.


 Each of the ceph servers are 16 core Xeon 2.53GHz with 72GB of ram,
 OCZ Vertex4 SSD drives for the journals and Seagate Barracuda ES2
 drives for storage.


 Any help would be greatly appreciated.










-- 
[image: Photobucket] http://photobucket.com

*Bryan Stillwell*
SENIOR SYSTEM ADMINISTRATOR

E: bstillw...@photobucket.com
O: 303.228.5109
M: 970.310.6085

[image: Facebook] http://www.facebook.com/photobucket[image:
Twitter]http://twitter.com/photobucket[image:
Photobucket] http://photobucket.com/images/photobucket
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Moving an MDS

2013-06-11 Thread Bryan Stillwell
I have a cluster I originally built on argonaut and have since
upgraded it to bobtail and then cuttlefish.  I originally configured
it with one node for both the mds node and mon node, and 4 other nodes
for hosting osd's:

a1: mon.a/mds.a
b1: osd.0, osd.1, osd.2, osd.3, osd.4, osd.20
b2: osd.5, osd.6, osd.7, osd.8, osd.9, osd.21
b3: osd.10, osd.11, osd.12, osd.13, osd.14, osd.22
b4: osd.15, osd.16, osd.17, osd.18, osd.19, osd.23

Yesterday I added two more mon nodes and moved mon.a off of a1 so it
now looks like:

a1: mds.a
b1: osd.0, osd.1, osd.2, osd.3, osd.4, osd.20
b2: mon.a, osd.5, osd.6, osd.7, osd.8, osd.9, osd.21
b3: mon.b, osd.10, osd.11, osd.12, osd.13, osd.14, osd.22
b4: mon.c, osd.15, osd.16, osd.17, osd.18, osd.19, osd.23

What I would like to do is move mds.a to server b1 so I can power-off
a1 and bring up b5 with another 6 osd's (power in my basement is at a
premium), but I'm not finding much in the way of documentation on how
to do that.  I found some docs on doing it with ceph-deploy, but since
I built this a while ago I haven't been using ceph-deploy (and I
haven't had a great experience using it for building a new cluster
either).

Could some one point me at some docs on how to do this?  Also should I
be running with multiple mds nodes at this time?

Thanks,
Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Moving an MDS

2013-06-11 Thread Bryan Stillwell
On Tue, Jun 11, 2013 at 3:50 PM, Gregory Farnum g...@inktank.com wrote:
 You should not run more than one active MDS (less stable than a
 single-MDS configuration, bla bla bla), but you can run multiple
 daemons and let the extras serve as a backup in case of failure. The
 process for moving an MDS is pretty easy: turn on a daemon somewhere
 else, confirm it's connected to the cluster, then turn off the old
 one.
 Doing it that way will induce ~30 seconds of MDS unavailability while
 it times out, but on cuttlefish you should be able to force an instant
 takeover if the new daemon uses the same name as the old one (I
 haven't worked with this much myself so I might be missing a detail;
 if this is important you should check).

 (These relatively simple takeovers are thanks to the MDS only storing
 data in RADOS, and are one of the big design considerations in the
 system architecture).

Thanks Greg!

That sounds pretty easy.  Although it has me wondering: which config
option differentiates between an active MDS and a backup MDS daemon?

Bryan
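
For the archive, here is a minimal sketch of the daemon-swap Greg
describes, assuming a manually deployed cuttlefish cluster with cephx
enabled.  The new daemon name (mds.b1), the auth caps, the paths, and
the init commands are assumptions rather than anything stated in this
thread, so verify them against the docs for your setup:

# on b1: create a data dir and key, and add an [mds.b1] section
# (host = b1) to ceph.conf; point its 'keyring' option at this path
# if it isn't picked up automatically
mkdir -p /var/lib/ceph/mds/ceph-b1
ceph auth get-or-create mds.b1 mds 'allow' osd 'allow *' mon 'allow rwx' \
    -o /var/lib/ceph/mds/ceph-b1/keyring

# start the new daemon and confirm it registers (active or standby)
ceph-mds -i b1 -c /etc/ceph/ceph.conf
ceph mds stat

# once it shows up, stop the old daemon on a1
service ceph stop mds.a    # sysvinit; under upstart: stop ceph-mds id=a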
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mon problems after upgrading to cuttlefish

2013-05-22 Thread Bryan Stillwell
I attempted to upgrade my bobtail cluster to cuttlefish tonight and I
believe I'm running into some mon related issues.  I did the original
install manually instead of with mkcephfs or ceph-deploy, so I think
that might have something to do with this error:

root@a1:~# ceph-mon -d -c /etc/ceph/ceph.conf
2013-05-22 23:37:29.283975 7f8fb97b3780  0 ceph version 0.61.2
(fea782543a844bb277ae94d3391788b76c5bee60), process ceph-mon, pid 5531
IO error: /var/lib/ceph/mon/ceph-admin/store.db/LOCK: No such file or directory
2013-05-22 23:37:29.286534 7f8fb97b3780  1 unable to open monitor
store at /var/lib/ceph/mon/ceph-admin
2013-05-22 23:37:29.286544 7f8fb97b3780  1 check for old monitor store format
2013-05-22 23:37:29.286550 7f8fb97b3780  1
store(/var/lib/ceph/mon/ceph-admin) mount
2013-05-22 23:37:29.286559 7f8fb97b3780  1
store(/var/lib/ceph/mon/ceph-admin) basedir
/var/lib/ceph/mon/ceph-admin dne
2013-05-22 23:37:29.286564 7f8fb97b3780 -1 unable to mount monitor
store: (2) No such file or directory
2013-05-22 23:37:29.286577 7f8fb97b3780 -1 found errors while
attempting to convert the monitor store: (2) No such file or directory
root@a1:~# ls -l /var/lib/ceph/mon/
total 4
drwxr-xr-x 15 root root 4096 May 22 23:30 ceph-a


I only have one mon daemon in this cluster as well.  I was planning on
expanding it to 3 mons tonight, but when I try to run most commands they
just hang now.

I do see the store.db directory in the ceph-a directory if that helps:

root@a1:~# ls -l  /var/lib/ceph/mon/ceph-a/
total 868
drwxr-xr-x 2 root root   4096 May 22 23:30 auth
drwxr-xr-x 2 root root   4096 May 22 23:30 auth_gv
-rw------- 1 root root     37 Feb  4 14:22 cluster_uuid
-rw------- 1 root root      2 May 22 23:30 election_epoch
-rw------- 1 root root    120 Feb  4 14:22 feature_set
-rw------- 1 root root      2 Dec 28 11:35 joined
-rw------- 1 root root     77 May 22 22:30 keyring
-rw------- 1 root root      0 Dec 28 11:35 lock
drwxr-xr-x 2 root root  20480 May 22 23:30 logm
drwxr-xr-x 2 root root  20480 May 22 23:30 logm_gv
-rw------- 1 root root     21 Dec 28 11:35 magic
drwxr-xr-x 2 root root  12288 May 22 23:30 mdsmap
drwxr-xr-x 2 root root  12288 May 22 23:30 mdsmap_gv
drwxr-xr-x 2 root root   4096 Dec 28 11:35 monmap
drwxr-xr-x 2 root root 233472 May 22 23:30 osdmap
drwxr-xr-x 2 root root 237568 May 22 23:30 osdmap_full
drwxr-xr-x 2 root root 253952 May 22 23:30 osdmap_gv
drwxr-xr-x 2 root root  20480 May 22 23:30 pgmap
drwxr-xr-x 2 root root  20480 May 22 23:30 pgmap_gv
drwxr-xr-x 2 root root   4096 May 22 23:36 store.db
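
One detail worth noting from the two listings above: the daemon is trying
to open a store under /var/lib/ceph/mon/ceph-admin, while the actual data
lives under /var/lib/ceph/mon/ceph-a, i.e. it is running with the id
"admin" instead of "a" (most likely because no -i was passed on the
command line).  A minimal sketch of starting it against the existing
store, treating the exact invocation as an assumption to double-check:

ceph-mon -d -i a -c /etc/ceph/ceph.conf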


Any help would be appreciated.

Thanks,
Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-deploy documentation fixes

2013-05-07 Thread Bryan Stillwell
With the release of cuttlefish, I decided to try out ceph-deploy and
ran into some documentation errors along the way:


http://ceph.com/docs/master/rados/deployment/preflight-checklist/

Under 'CREATE A USER' it has the following line:

To provide full privileges to the user, add the following to
/etc/sudoers.d/chef.

Based on the command that followed, chef should be replaced with ceph.


http://ceph.com/docs/master/rados/deployment/ceph-deploy-osd/

Under 'ZAP DISKS' it has an 'Important' message that states:

Important: This will delete all data in the partition.

If I understand it correctly, this should be changed to:

Important: This will delete all data on the disk.


Under 'PREPARE OSDS' it first gives an example to prepare a disk:

ceph-deploy osd prepare {host-name}:{path/to/disk}[:{path/to/journal}]

And then it gives an example that attempts to prepare a partition:

ceph-deploy osd prepare osdserver1:/dev/sdb1:/dev/ssd1


The same issue exists for 'ACTIVATE OSDS' and 'CREATE OSDS'.
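
For comparison, a whole-disk invocation that actually matches the stated
synopsis would presumably look something like the following (the host and
device names are made up for illustration):

ceph-deploy disk zap osdserver1:/dev/sdb
ceph-deploy osd prepare osdserver1:/dev/sdb:/dev/ssd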


Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Corruption by missing blocks

2013-04-23 Thread Bryan Stillwell
I've run into an issue where after copying a file to my cephfs cluster
the md5sums no longer match.  I believe I've tracked it down to some
parts of the file which are missing:

$ obj_name=$(cephfs title1.mkv show_location -l 0 | grep object_name \
    | sed -e 's/.*:\W*\([0-9a-f]*\)\.[0-9a-f]*/\1/')
$ echo "Object name: $obj_name"
Object name: 1001120

$ file_size=$(stat title1.mkv | grep Size | awk '{ print $2 }')
$ printf "File size: %d MiB (%d Bytes)\n" $(($file_size/1048576)) $file_size
File size: 20074 MiB (21049178117 Bytes)

$ blocks=$((file_size/4194304+1))
$ printf "Blocks: %d\n" $blocks
Blocks: 5019

$ for b in `seq 0 $(($blocks-1))`; do rados -p data stat \
    ${obj_name}.`printf '%8.8x\n' $b` | grep error; done
 error stat-ing data/1001120.1076: No such file or directory
 error stat-ing data/1001120.11c7: No such file or directory
 error stat-ing data/1001120.129c: No such file or directory
 error stat-ing data/1001120.12f4: No such file or directory
 error stat-ing data/1001120.1307: No such file or directory
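
(For context: with the default 4 MiB object size used above, those
suffixes are just hex block indices, so the first missing object,
1001120.1076, is block 0x1076 = 4214, i.e. data starting around
4214 * 4 MiB, roughly 16.5 GiB into the file and well within the 5019
blocks computed above.  That reading of the object names is my
interpretation of the naming scheme, not something spelled out in this
thread.)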


Any ideas where to look to investigate why these blocks never got
written?

Here's the current state of the cluster:

ceph -s
   health HEALTH_OK
   monmap e1: 1 mons at {a=172.24.88.50:6789/0}, election epoch 1, quorum 0 a
   osdmap e22059: 24 osds: 24 up, 24 in
pgmap v1783615: 1920 pgs: 1917 active+clean, 3
active+clean+scrubbing+deep; 4667 GB data, 9381 GB used, 4210 GB /
13592 GB avail
   mdsmap e437: 1/1/1 up {0=a=up:active}

Here's my current crushmap:

# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 pool

# buckets
host b1 {
id -2   # do not change unnecessarily
# weight 2.980
alg straw
hash 0  # rjenkins1
item osd.0 weight 0.500
item osd.1 weight 0.500
item osd.2 weight 0.500
item osd.3 weight 0.500
item osd.4 weight 0.500
item osd.20 weight 0.480
}
host b2 {
id -4   # do not change unnecessarily
# weight 4.680
alg straw
hash 0  # rjenkins1
item osd.5 weight 0.500
item osd.6 weight 0.500
item osd.7 weight 2.200
item osd.8 weight 0.500
item osd.9 weight 0.500
item osd.21 weight 0.480
}
host b3 {
id -5   # do not change unnecessarily
# weight 3.480
alg straw
hash 0  # rjenkins1
item osd.10 weight 0.500
item osd.11 weight 0.500
item osd.12 weight 1.000
item osd.13 weight 0.500
item osd.14 weight 0.500
item osd.22 weight 0.480
}
host b4 {
id -6   # do not change unnecessarily
# weight 3.480
alg straw
hash 0  # rjenkins1
item osd.15 weight 0.500
item osd.16 weight 1.000
item osd.17 weight 0.500
item osd.18 weight 0.500
item osd.19 weight 0.500
item osd.23 weight 0.480
}
pool default {
id -1   # do not change unnecessarily
# weight 14.620
alg straw
hash 0  # rjenkins1
item b1 weight 2.980
item b2 weight 4.680
item b3 weight 3.480
item b4 weight 3.480
}

# rules
rule data {
ruleset 0
type replicated
min_size 2
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule metadata {
ruleset 1
type replicated
min_size 2
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule rbd {
ruleset 2
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map


Thanks,
Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Corruption by missing blocks

2013-04-23 Thread Bryan Stillwell
I've tried a few different ones:

1. cp to cephfs mounted filesystem on Ubuntu 12.10 (quantal)
2. rsync over ssh to cephfs mounted filesystem on Ubuntu 12.04.2 (precise)
3. scp to cephfs mounted filesystem on Ubuntu 12.04.2 (precise)

It's fairly reproducible, so I can collect logs for you.  Which ones
would you be interested in?

The cluster has been in a couple states during testing (during
expansion/rebalancing and during an all active+clean state).

BTW, all the nodes are running with the 0.56.4-1precise packages.

Bryan

On Tue, Apr 23, 2013 at 12:56 PM, Gregory Farnum g...@inktank.com wrote:
 On Tue, Apr 23, 2013 at 11:38 AM, Bryan Stillwell
 bstillw...@photobucket.com wrote:
 I've run into an issue where after copying a file to my cephfs cluster
 the md5sums no longer match.  I believe I've tracked it down to some
 parts of the file which are missing:

 $ obj_name=$(cephfs title1.mkv show_location -l 0 | grep object_name
 | sed -e s/.*:\W*\([0-9a-f]*\)\.[0-9a-f]*/\1/)
 $ echo Object name: $obj_name
 Object name: 1001120

 $ file_size=$(stat title1.mkv | grep Size | awk '{ print $2 }')
 $ printf File size: %d MiB (%d Bytes)\n $(($file_size/1048576)) $file_size
 File size: 20074 MiB (21049178117 Bytes)

 $ blocks=$((file_size/4194304+1))
 $ printf Blocks: %d\n $blocks
 Blocks: 5019

 $ for b in `seq 0 $(($blocks-1))`; do rados -p data stat
 ${obj_name}.`printf '%8.8x\n' $b` | grep error; done
  error stat-ing data/1001120.1076: No such file or directory
  error stat-ing data/1001120.11c7: No such file or directory
  error stat-ing data/1001120.129c: No such file or directory
  error stat-ing data/1001120.12f4: No such file or directory
  error stat-ing data/1001120.1307: No such file or directory


 Any ideas where to look to investigate what caused these blocks to not
 be written?

 What client are you using to write this? Is it fairly reproducible (so
 you could collect logs of it happening)?

 Usually the only times I've seen anything like this were when either
 the file data was supposed to go into a pool which the client didn't
 have write permissions on, or when the RADOS cluster was in bad shape
 and so the data never got flushed to disk. Has your cluster been
 healthy since you started writing the file out?
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com
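
(Side note for the archive: a quick way to rule out the permissions cause
Greg mentions is to list the auth caps and confirm that the key your
cephfs client uses has rw access to the data pool, e.g. with
"ceph auth list"; which key that is depends on your setup, so treat the
exact name as an assumption.)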



Re: [ceph-users] Corruption by missing blocks

2013-04-23 Thread Bryan Stillwell
On Tue, Apr 23, 2013 at 5:24 PM, Sage Weil s...@inktank.com wrote:

 On Tue, 23 Apr 2013, Bryan Stillwell wrote:
  I'm testing this now, but while going through the logs I saw something
  that might have something to do with this:
 
  Apr 23 16:35:28 a1 kernel: [692455.496594] libceph: corrupt inc osdmap
  epoch 22146 off 102 (88021e0dc802 of
  88021e0dc79c-88021e0dc802)

 Oh, that's not right...  What kernel version is this?  Which ceph version?

$ uname -a
Linux a1 3.2.0-39-generic #62-Ubuntu SMP Thu Feb 28 00:28:53 UTC 2013
x86_64 x86_64 x86_64 GNU/Linux
$ ceph -v
ceph version 0.56.4 (63b0f854d1cef490624de5d6cf9039735c7de5ca)

Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Corruption by missing blocks

2013-04-23 Thread Bryan Stillwell
On Tue, Apr 23, 2013 at 5:45 PM, Sage Weil s...@inktank.com wrote:
 On Tue, 23 Apr 2013, Bryan Stillwell wrote:
 On Tue, Apr 23, 2013 at 5:24 PM, Sage Weil s...@inktank.com wrote:
 
  On Tue, 23 Apr 2013, Bryan Stillwell wrote:
   I'm testing this now, but while going through the logs I saw something
   that might have something to do with this:
  
   Apr 23 16:35:28 a1 kernel: [692455.496594] libceph: corrupt inc osdmap
   epoch 22146 off 102 (88021e0dc802 of
   88021e0dc79c-88021e0dc802)
 
  Oh, that's not right...  What kernel version is this?  Which ceph version?

 $ uname -a
 Linux a1 3.2.0-39-generic #62-Ubuntu SMP Thu Feb 28 00:28:53 UTC 2013
 x86_64 x86_64 x86_64 GNU/Linux

 Oh, that's a sufficiently old kernel that we don't support.  3.4 or later
 is considered stable.  You should be able to get recent mainline kernels
 from an ubuntu ppa...

It looks like Canonical released a 3.5.0 kernel as a security update
for precise, so I'll give that a try.

Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Corruption by missing blocks

2013-04-23 Thread Bryan Stillwell
On Tue, Apr 23, 2013 at 5:54 PM, Gregory Farnum g...@inktank.com wrote:
 On Tue, Apr 23, 2013 at 4:45 PM, Sage Weil s...@inktank.com wrote:
 On Tue, 23 Apr 2013, Bryan Stillwell wrote:
 On Tue, Apr 23, 2013 at 5:24 PM, Sage Weil s...@inktank.com wrote:
 
  On Tue, 23 Apr 2013, Bryan Stillwell wrote:
   I'm testing this now, but while going through the logs I saw something
   that might have something to do with this:
  
   Apr 23 16:35:28 a1 kernel: [692455.496594] libceph: corrupt inc osdmap
   epoch 22146 off 102 (88021e0dc802 of
   88021e0dc79c-88021e0dc802)
 
  Oh, that's not right...  What kernel version is this?  Which ceph version?

 $ uname -a
 Linux a1 3.2.0-39-generic #62-Ubuntu SMP Thu Feb 28 00:28:53 UTC 2013
 x86_64 x86_64 x86_64 GNU/Linux

 Oh, that's a sufficiently old kernel that we don't support.  3.4 or later
 is considered stable.  You should be able to get recent mainline kernels
 from an ubuntu ppa...

 By which he means that could have caused the trouble and there are
 some osdmap decoding problems which are fixed in later kernels. :)
 I'd forgotten about these problems, although fortunately they're not
 consistent. But especially for CephFS you'll want to stick with
 userspace rather than kernelspace for a while if you aren't in the
 habit of staying very up-to-date.

Thanks, that's good to know.  :)

The first copy test using fuse finished and the MD5s match up!  I'm
going to do some more testing overnight, but this seems to be the
cause.
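
For reference, a minimal sketch of a ceph-fuse mount for this kind of
test (the mount point is hypothetical; the monitor address is the one
from the ceph -s output earlier in the thread):

mkdir -p /mnt/cephfs
ceph-fuse -m 172.24.88.50:6789 /mnt/cephfs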

Thanks for the help!

Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS First product release discussion

2013-03-05 Thread Bryan Stillwell
On Tue, Mar 5, 2013 at 12:44 PM, Kevin Decherf ke...@kdecherf.com wrote:

 On Tue, Mar 05, 2013 at 12:27:04PM -0600, Dino Yancey wrote:
  The only two features I'd deem necessary for our workload would be
  stable distributed metadata / MDS and a working fsck equivalent.
  Snapshots would be great once the feature is deemed stable, as would

 We have the same needs here.

Stable distributed metadata and snapshots are the most important to me.

Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com