Re: [ceph-users] CephFS msg length greater than osd_max_write_size

2019-05-22 Thread Ryan Leimenstoll
Thanks for the reply! We will be more proactive about evicting clients in the 
future rather than waiting.


One follow-up, however: it seems that the filesystem going read-only only raised a 
WARNING state, which didn’t immediately catch our eye amid some other 
rebalancing operations. Is there a reason this wouldn’t be a HEALTH_ERR 
condition, since it represents a significant service degradation?
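
In the meantime, the specific condition can at least be caught by grepping the 
health output; a minimal sketch, assuming the warning text contains "read only" 
as it does on Nautilus:

ceph health detail | grep -i 'read only' && echo "CephFS forced read-only"

The JSON form, ceph health detail -f json, is easier to parse from monitoring scripts.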


Thanks!
Ryan


> On May 22, 2019, at 4:20 AM, Yan, Zheng  wrote:
> 
> On Tue, May 21, 2019 at 6:10 AM Ryan Leimenstoll
>  wrote:
>> 
>> Hi all,
>> 
>> We recently encountered an issue where our CephFS filesystem unexpectedly 
>> was set to read-only. When we look at some of the logs from the daemons I 
>> can see the following:
>> 
>> On the MDS:
>> ...
>> 2019-05-18 16:34:24.341 7fb3bd610700 -1 mds.0.89098 unhandled write error 
>> (90) Message too long, force readonly...
>> 2019-05-18 16:34:24.341 7fb3bd610700  1 mds.0.cache force file system 
>> read-only
>> 2019-05-18 16:34:24.341 7fb3bd610700  0 log_channel(cluster) log [WRN] : 
>> force file system read-only
>> 2019-05-18 16:34:41.289 7fb3c0616700  1 heartbeat_map is_healthy 'MDSRank' 
>> had timed out after 15
>> 2019-05-18 16:34:41.289 7fb3c0616700  0 mds.beacon.objmds00 Skipping beacon 
>> heartbeat to monitors (last acked 4.00101s ago); MDS internal heartbeat is 
>> not healthy!
>> ...
>> 
>> On one of the OSDs it was most likely targeting:
>> ...
>> 2019-05-18 16:34:24.140 7f8134e6c700 -1 osd.602 pg_epoch: 682796 pg[49.20b( 
>> v 682796'15706523 (682693'15703449,682796'15706523] 
>> local-lis/les=673041/673042 n=10524 ec=245563/245563 lis/c 673041/673041 
>> les/c/f 673042/673042/0 673038/673041/668565) [602,530,558] r=0 lpr=673041 
>> crt=682796'15706523 lcod 682796'15706522 mlcod 682796'15706522 active+clean] 
>> do_op msg data len 95146005 > osd_max_write_size 94371840 on 
>> osd_op(mds.0.89098:48609421 49.20b 49:d0630e4c:::mds0_sessionmap:head 
>> [omap-set-header,omap-set-vals] snapc 0=[] 
>> ondisk+write+known_if_redirected+full_force e682796) v8
>> 2019-05-18 17:10:33.695 7f813466b700  0 log_channel(cluster) log [DBG] : 
>> 49.31c scrub starts
>> 2019-05-18 17:10:34.980 7f813466b700  0 log_channel(cluster) log [DBG] : 
>> 49.31c scrub ok
>> 2019-05-18 22:17:37.320 7f8134e6c700 -1 osd.602 pg_epoch: 683434 pg[49.20b( 
>> v 682861'15706526 (682693'15703449,682861'15706526] 
>> local-lis/les=673041/673042 n=10525 ec=245563/245563 lis/c 673041/673041 
>> les/c/f 673042/673042/0 673038/673041/668565) [602,530,558] r=0 lpr=673041 
>> crt=682861'15706526 lcod 682859'15706525 mlcod 682859'15706525 active+clean] 
>> do_op msg data len 95903764 > osd_max_write_size 94371840 on 
>> osd_op(mds.0.91565:357877 49.20b 49:d0630e4c:::mds0_sessionmap:head 
>> [omap-set-header,omap-set-vals,omap-rm-keys] snapc 0=[] 
>> ondisk+write+known_if_redirected+full_force e683434) v8
>> …
>> 
>> During this time there were some health concerns with the cluster. 
>> Significantly, since the error above seems to be related to the SessionMap, 
>> we had a client that had a few blocked requests for over 35948 secs (it’s a 
>> member of a compute cluster so we let the node drain/finish jobs before 
>> rebooting). We have also had some issues with certain OSDs running older 
>> hardware staying up/responding timely to heartbeats after upgrading to 
>> Nautilus, although that seems to be an iowait/load issue that we are 
>> actively working to mitigate separately.
>> 
> 
> This prevents the MDS from trimming completed requests recorded in the
> session, which results in a very large SessionMap item. To recover, blacklist
> the client that has the blocked requests, then restart the MDS.
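
A minimal sketch of that recovery path, assuming Nautilus-era commands and that 
the stuck client can be identified from the session list (ids, addresses and the 
MDS host name below are placeholders):

ceph tell mds.0 client ls                      # find the client with the blocked requests
ceph tell mds.0 client evict id=<client-id>    # evict it (by default this also blacklists its address)
ceph osd blacklist add <client-ip>:0/0         # or blacklist the address directly
systemctl restart ceph-mds@<mds-host>          # then restart the active MDS
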
> 
>> We are running Nautilus 14.2.1 on RHEL7.6. There is only one MDS Rank, with 
>> an active/standby setup between two MDS nodes. MDS clients are mounted using 
>> the RHEL7.6 kernel driver.
>> 
>> My read here would be that the MDS is sending too large a message to the 
>> OSD, however my understanding was that the MDS should be using 
>> osd_max_write_size to determine the size of that message [0]. Is this maybe 
>> a bug in how this is calculated on the MDS side?
>> 
>> 
>> Thanks!
>> Ryan Leimenstoll
>> rleim...@umiacs.umd.edu
>> University of Maryland Institute for Advanced Computer Studies
>> 
>> 
>> 
>> [0] https://www.spinics.net/lists/ceph-devel/msg11951.html
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS msg length greater than osd_max_write_size

2019-05-20 Thread Ryan Leimenstoll
Hi all, 

We recently encountered an issue where our CephFS filesystem unexpectedly was 
set to read-only. When we look at some of the logs from the daemons I can see 
the following: 

On the MDS:
...
2019-05-18 16:34:24.341 7fb3bd610700 -1 mds.0.89098 unhandled write error (90) 
Message too long, force readonly...
2019-05-18 16:34:24.341 7fb3bd610700  1 mds.0.cache force file system read-only
2019-05-18 16:34:24.341 7fb3bd610700  0 log_channel(cluster) log [WRN] : force 
file system read-only
2019-05-18 16:34:41.289 7fb3c0616700  1 heartbeat_map is_healthy 'MDSRank' had 
timed out after 15
2019-05-18 16:34:41.289 7fb3c0616700  0 mds.beacon.objmds00 Skipping beacon 
heartbeat to monitors (last acked 4.00101s ago); MDS internal heartbeat is not 
healthy!
...

On one of the OSDs it was most likely targeting:
...
2019-05-18 16:34:24.140 7f8134e6c700 -1 osd.602 pg_epoch: 682796 pg[49.20b( v 
682796'15706523 (682693'15703449,682796'15706523] local-lis/les=673041/673042 
n=10524 ec=245563/245563 lis/c 673041/673041 les/c/f 673042/673042/0 
673038/673041/668565) [602,530,558] r=0 lpr=673041 crt=682796'15706523 lcod 
682796'15706522 mlcod 682796'15706522 active+clean] do_op msg data len 95146005 
> osd_max_write_size 94371840 on osd_op(mds.0.89098:48609421 49.20b 
49:d0630e4c:::mds0_sessionmap:head [omap-set-header,omap-set-vals] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e682796) v8
2019-05-18 17:10:33.695 7f813466b700  0 log_channel(cluster) log [DBG] : 49.31c 
scrub starts
2019-05-18 17:10:34.980 7f813466b700  0 log_channel(cluster) log [DBG] : 49.31c 
scrub ok
2019-05-18 22:17:37.320 7f8134e6c700 -1 osd.602 pg_epoch: 683434 pg[49.20b( v 
682861'15706526 (682693'15703449,682861'15706526] local-lis/les=673041/673042 
n=10525 ec=245563/245563 lis/c 673041/673041 les/c/f 673042/673042/0 
673038/673041/668565) [602,530,558] r=0 lpr=673041 crt=682861'15706526 lcod 
682859'15706525 mlcod 682859'15706525 active+clean] do_op msg data len 95903764 
> osd_max_write_size 94371840 on osd_op(mds.0.91565:357877 49.20b 
49:d0630e4c:::mds0_sessionmap:head [omap-set-header,omap-set-vals,omap-rm-keys] 
snapc 0=[] ondisk+write+known_if_redirected+full_force e683434) v8
…

During this time there were some health concerns with the cluster. 
Significantly, since the error above seems to be related to the SessionMap, we 
had a client that had a few blocked requests for over 35948 secs (it’s a member 
of a compute cluster so we let the node drain/finish jobs before rebooting). We 
have also had some issues with certain OSDs running older hardware staying 
up/responding timely to heartbeats after upgrading to Nautilus, although that 
seems to be an iowait/load issue that we are actively working to mitigate 
separately.
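
For reference, the blocked client and its requests can be inspected from the 
health output and the MDS admin socket (daemon name objmds00 is taken from the 
MDS log above; run the daemon commands on the active MDS host):

ceph health detail                    # lists the slow/blocked request warnings
ceph daemon mds.objmds00 ops          # dump of in-flight and blocked MDS operations
ceph daemon mds.objmds00 session ls   # per-client session details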

We are running Nautilus 14.2.1 on RHEL7.6. There is only one MDS Rank, with an 
active/standby setup between two MDS nodes. MDS clients are mounted using the 
RHEL7.6 kernel driver. 

My read here would be that the MDS is sending too large a message to the OSD, 
however my understanding was that the MDS should be using osd_max_write_size to 
determine the size of that message [0]. Is this maybe a bug in how this is 
calculated on the MDS side?
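
One sanity check, for what it's worth, is to confirm the value each daemon is 
actually running with via the admin socket (daemon ids taken from the logs above; 
run on the respective hosts):

ceph daemon mds.objmds00 config get osd_max_write_size
ceph daemon osd.602 config get osd_max_write_size

The default is 90 MB (94371840 bytes), which matches the limit the OSD enforced 
above, so the 95146005-byte SessionMap update from the MDS simply exceeded it.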


Thanks!
Ryan Leimenstoll
rleim...@umiacs.umd.edu
University of Maryland Institute for Advanced Computer Studies



[0] https://www.spinics.net/lists/ceph-devel/msg11951.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-07 Thread Ryan
I just ran your test on a cluster with 5 hosts (2x Intel 6130 each), 12x 860 Evo
2TB SSD per host (6 per SAS3008), 2x bonded 10GB NIC, 2x Arista switches.

Pool with 3x replication

rados bench -p scbench -b 4096 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for
up to 10 seconds or 0 objects
Object prefix: benchmark_data_dc1-kube-01_3458991
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      16      5090      5074   19.7774   19.8203   0.00312568  0.00315352
    2      16     10441     10425   20.3276   20.9023   0.00332591  0.00307105
    3      16     15548     15532    20.201   19.9492   0.00337573  0.00309134
    4      16     20906     20890   20.3826   20.9297   0.00282902  0.00306437
    5      16     26107     26091   20.3686   20.3164   0.00269844  0.00306698
    6      16     31246     31230   20.3187   20.0742   0.00339814  0.00307462
    7      16     36372     36356   20.2753   20.0234   0.00286653   0.0030813
    8      16     41470     41454   20.2293   19.9141   0.00272051  0.00308839
    9      16     46815     46799   20.3011   20.8789   0.00284063  0.00307738
Total time run: 10.0035
Total writes made:  51918
Write size: 4096
Object size:4096
Bandwidth (MB/sec): 20.2734
Stddev Bandwidth:   0.464082
Max bandwidth (MB/sec): 20.9297
Min bandwidth (MB/sec): 19.8203
Average IOPS:   5189
Stddev IOPS:118
Max IOPS:   5358
Min IOPS:   5074
Average Latency(s): 0.00308195
Stddev Latency(s):  0.00142825
Max latency(s): 0.0267947
Min latency(s): 0.00217364

rados bench -p scbench 10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)   avg lat(s)
    0       0         0         0         0         0            -            0
    1      15     39691     39676    154.95   154.984   0.00027022  0.000395993
    2      16     83701     83685   163.416    171.91  0.000318949  0.000375363
    3      15    129218    129203   168.199   177.805  0.000300898  0.000364647
    4      15    173733    173718   169.617   173.887  0.000311723   0.00036156
    5      15    216073    216058   168.769   165.391  0.000407594  0.000363371
    6      16    260381    260365   169.483   173.074  0.000323371  0.000361829
    7      15    306838    306823   171.193   181.477  0.000284247  0.000358199
    8      15    353675    353660   172.661   182.957  0.000338128  0.000355139
    9      15    399221    399206   173.243   177.914  0.000422527   0.00035393
Total time run:   10.0003
Total reads made: 446353
Read size:4096
Object size:  4096
Bandwidth (MB/sec):   174.351
Average IOPS: 44633
Stddev IOPS:  2220
Max IOPS: 46837
Min IOPS: 39676
Average Latency(s):   0.000351679
Max latency(s):   0.00530195
Min latency(s):   0.000135292

On Thu, Feb 7, 2019 at 2:17 AM  wrote:

> Hi List
>
> We are in the process of moving to the next usecase for our ceph cluster
> (Bulk, cheap, slow, erasurecoded, cephfs) storage was the first - and
> that works fine.
>
> We're currently on luminous / bluestore, if upgrading is deemed to
> change what we're seeing then please let us know.
>
> We have 6 OSD hosts, each with a single 1TB S4510 SSD. Connected
> through a H700 MegaRaid Perc BBWC, EachDiskRaid0 - and scheduler set to
> deadline, nomerges = 1, rotational = 0.
>
> Each disk "should" give approximately 36K IOPS random write and the double
> random read.
>
> Pool is set up with 3x replication. We would like a "scaleout" setup of
> well performing SSD block devices - potentially to host databases and
> things like that. I read through this nice document [0]; I know the
> HW is radically different from mine, but I still think I'm in the
> very low end of what 6 x S4510 should be capable of doing.
>
> Since it is IOPS i care about I have lowered block size to 4096 -- 4M
> blocksize nicely saturates the NIC's in both directions.
>
>
> $ sudo rados bench -p scbench -b 4096 10 write --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for
> up to 10 seconds or 0 objects
> Object prefix: benchmark_data_torsk2_11207
> sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>   0       0         0         0         0         0            -           0
>   1      16      5857      5841   22.8155   22.8164   0.00238437  0.00273434
>   2      15     11768     11753   22.9533   23.0938    0.0028559  0.00271944
>   3      16     17264     17248   22.4564   21.4648       0.0024  0.00278101
>   4      16     22857     22841   22.3037   21.8477     0.002716  0.00280023
>   5      16     28462     28446   22.2213   21.8945   0.00220186    0.002811
>   6      16     34216     34200   22.2635   22.4766   0.00234315  0.00280552
> 7  16 39616 

Re: [ceph-users] Object Gateway Cloud Sync to S3

2019-02-05 Thread Ryan
On Tue, Feb 5, 2019 at 3:35 PM Ryan  wrote:

> I've been trying to configure the cloud sync module to push changes to an
> Amazon S3 bucket without success. I've configured the module according to
> the docs with the trivial configuration settings. Is there an error log I
> should be checking? Is the "radosgw-admin sync status
> --rgw-zone=mycloudtierzone" the correct command to check status?
>
> Thanks,
> Ryan
>

It turns out I can get it to sync as long as I leave "radosgw-admin
--rgw-zone=aws-docindex data sync run --source-zone=default" running. I
thought with mimic the sync was built into the ceph-radosgw service? I'm
running version 13.2.4. I'm also seeing these errors on the console while
running that command.

2019-02-05 17:40:10.679 7fb1ef06b680  0 meta sync: ERROR:
RGWBackoffControlCR called coroutine returned -2
2019-02-05 17:40:10.694 7fb1ef06b680  0 RGW-SYNC:data:sync:shard[25]:
ERROR: failed to read remote data log info: ret=-2
2019-02-05 17:40:10.695 7fb1ef06b680  0 meta sync: ERROR:
RGWBackoffControlCR called coroutine returned -2
2019-02-05 17:40:10.711 7fb1ef06b680  0 RGW-SYNC:data:sync:shard[43]:
ERROR: failed to read remote data log info: ret=-2
2019-02-05 17:40:10.712 7fb1ef06b680  0 meta sync: ERROR:
RGWBackoffControlCR called coroutine returned -2
2019-02-05 17:40:10.720 7fb1ef06b680  0 meta sync: ERROR:
RGWBackoffControlCR called coroutine returned -2

Additionally "radosgw-admin --rgw-zone=aws-docindex data sync error list
--source-zone=default" is showing numerous error code 39 responses/

 "message": "failed to sync bucket instance: (39) Directory not empty"
"message": "failed to sync object(39) Directory not empty"

When it successfully completes I see the following

  metadata sync syncing
full sync: 0/64 shards
incremental sync: 64/64 shards
metadata is caught up with master
  data sync source: af57fe9a-43a7-4998-9574-4016f5fa6661 (default)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is caught up with source

When I stop the "data sync run" the status will just sit on

  data sync source: af57fe9a-43a7-4998-9574-4016f5fa6661 (default)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is behind on 1 shards
behind shards: [75]
oldest incremental change not applied: 2019-02-05
17:44:51.0.367478s

Thanks,
Ryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Object Gateway Cloud Sync to S3

2019-02-05 Thread Ryan
I've been trying to configure the cloud sync module to push changes to an
Amazon S3 bucket without success. I've configured the module according to
the docs with the trivial configuration settings. Is there an error log I
should be checking? Is the "radosgw-admin sync status
--rgw-zone=mycloudtierzone" the correct command to check status?

Thanks,
Ryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] deleting a file

2018-12-14 Thread Rhys Ryan - NOAA Affiliate
Hello.

I am deleting files via S3CMD and for the most part have no issue.  Every
once in a while though, I get a positive response that a file has been
deleted but when I check back the next day, the file is still there.

I was wondering if there is a way to delete a file from within CEPH?  I
don't want to go through the RADOS Gateway but instead SSH into the system
and delete the file.
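
If the object is still listed in the bucket index, it can be removed through 
radosgw's own tooling rather than by touching RADOS directly; a sketch, with 
bucket/key names as placeholders:

radosgw-admin object stat --bucket=mybucket --object=path/to/key   # confirm it is still there
radosgw-admin object rm --bucket=mybucket --object=path/to/key

Deleting the underlying rados objects by hand is best avoided, since it leaves 
the bucket index and usage stats inconsistent.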

Thank you and happy holidays.
Rhys Ryan
Data Architect
NOAA
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "rgw relaxed s3 bucket names" and underscores

2018-10-02 Thread Ryan Leimenstoll
Nope, you are right. I think it was just boto catching this for me and I took 
that for granted. 

I think that is the behavior I would expect too: S3-compliant restrictions on 
create, while allowing legacy buckets to remain. Anyway, I noticed you created a ticket 
[0] in the tracker for this, thanks!

Best,
Ryan

[0] https://tracker.ceph.com/issues/36293 


> On Oct 2, 2018, at 6:08 PM, Robin H. Johnson  wrote:
> 
> On Tue, Oct 02, 2018 at 12:37:02PM -0400, Ryan Leimenstoll wrote:
>> I was hoping to get some clarification on what "rgw relaxed s3 bucket
>> names = false" is intended to filter.
> Yes, it SHOULD have caught this case, but does not.
> 
> Are you sure it rejects the uppercase? My test also showed that it did
> NOT reject the uppercase as intended.
> 
> This code did used to work, I contributed to the logic and discussion
> for earlier versions. A related part I wanted was allowing access to
> existing buckets w/ relaxed names, but disallowing creating of relaxed
> names.
> 
> -- 
> Robin Hugh Johnson
> Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
> E-Mail   : robb...@gentoo.org
> GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
> GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] "rgw relaxed s3 bucket names" and underscores

2018-10-02 Thread Ryan Leimenstoll
Hi all, 

I was hoping to get some clarification on what "rgw relaxed s3 bucket names = 
false" is intended to filter. In our cluster (Luminous 12.2.8, serving S3) it 
seems that RGW, with that setting set to false, is still allowing buckets with 
underscores in the name to be created, although this is now prohibited by 
Amazon in US-East and seemingly all of their other regions [0]. Since clients 
typically follow Amazon’s direction, should RGW be rejecting underscores in 
these names to be in compliance? (I did notice it already rejects uppercase 
letters.) 
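
For reference, the behaviour is straightforward to test against a non-production 
RGW; the setting and the bucket names below are only illustrative:

[client.rgw.gateway]
rgw relaxed s3 bucket names = false

Then, after restarting radosgw, try both cases from a client and compare with 
the observations above:

s3cmd mb s3://bucket_with_underscores
s3cmd mb s3://BucketWithUppercase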

Thanks much!
Ryan Leimenstoll
rleim...@umiacs.umd.edu


[0] https://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs-data-scan safety on active filesystem

2018-05-08 Thread Ryan Leimenstoll
Hi Gregg, John, 

Thanks for the warning. It was definitely conveyed that they are dangerous. I 
thought the online part was implied to be a bad idea, but just wanted to verify.

John,

We were mostly operating off of what the mds logs reported. After bringing the 
mds back online and active, we mounted the volume using the kernel driver to 
one host and started a recursive ls through the root of the filesystem to see 
what was broken. There were seemingly two main paths of the tree that were 
affected initially, both reporting errors like the following in the mds log 
(I’ve swapped out the paths):

Group 1:
2018-05-04 12:04:38.004029 7fc81f69a700 -1 log_channel(cluster) log [ERR] : dir 
0x10011125556 object missing on disk; some files may be lost 
(/cephfs/redacted1/path/dir1) 
2018-05-04 12:04:38.028861 7fc81f69a700 -1 log_channel(cluster) log [ERR] : dir 
0x1001112bf14 object missing on disk; some files may be lost 
(/cephfs/redacted1/path/dir2)
2018-05-04 12:04:38.030504 7fc81f69a700 -1 log_channel(cluster) log [ERR] : dir 
0x10011131118 object missing on disk; some files may be lost 
(/cephfs/redacted1/path/dir3) 

Group 2:
2021-05-04 13:24:29.495892 7fc81f69a700 -1 log_channel(cluster) log [ERR] : dir 
0x1001102c5f6 object missing on disk; some files may be lost 
(/cephfs/redacted2/path/dir1) 

Some of the paths it complained about appeared empty via ls, although trying to 
rm [-r] them via the mount failed with an error suggesting files still exist in 
the directory. We removed the dir objects in the metadata pool that it was still 
warning about (rados -p metapool rm 10011125556., for example). This 
cleaned up the errors on those paths. We then did the same for Group 2. 

After this, we initiated a recursive scrub with the mds daemon on the root of 
the filesystem to run over the weekend.
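
For reference, the recursive scrub mentioned here is driven from the MDS admin 
socket; on Luminous it would have been along the lines of (daemon name is a 
placeholder):

ceph daemon mds.<active-mds> scrub_path / recursive
ceph daemon mds.<active-mds> scrub_path / recursive repair   # same, but also repair what it can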

In retrospect, we probably should have done the data scan steps mentioned in 
the disaster recovery guide before bringing the system online. The cluster is 
currently healthy (or, rather, reporting healthy) and has been for a while.

My understanding here is that we would need something like the cephfs-data-scan 
steps to recreate metadata or at least identify (for cleanup) objects that may 
have been stranded in the data pool. Is there anyway, likely with another tool, 
to do this for an active cluster? If not, is this something that can be done 
with some amount of safety on an offline system? (not sure how long it would 
take, data pool is ~100T large w/ 242 million objects, and downtime is a big 
pain point for our users with deadlines).
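
For completeness, the documented offline procedure would look roughly like the 
sketch below; it assumes the filesystem is taken down and all MDS daemons are 
stopped first, and that the metadata pool has been backed up (pool and host 
names are placeholders):

systemctl stop ceph-mds@<mds-host>        # on each MDS host

cephfs-data-scan scan_extents <data-pool>
cephfs-data-scan scan_inodes <data-pool>
cephfs-data-scan scan_links

The scans accept --worker_n/--worker_m so they can be run in parallel from 
several nodes, which with ~242 million objects is more or less mandatory.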

Thanks,

Ryan

> On May 8, 2018, at 5:05 AM, John Spray <jsp...@redhat.com> wrote:
> 
> On Mon, May 7, 2018 at 8:50 PM, Ryan Leimenstoll
> <rleim...@umiacs.umd.edu> wrote:
>> Hi All,
>> 
>> We recently experienced a failure with our 12.2.4 cluster running a CephFS
>> instance that resulted in some data loss due to a seemingly problematic OSD
>> blocking IO on its PGs. We restarted the (single active) mds daemon during
>> this, which caused damage due to the journal not having the chance to flush
>> back. We reset the journal, session table, and fs to bring the filesystem
>> online. We then removed some directories/inodes that were causing the
>> cluster to report damaged metadata (and were otherwise visibly broken by
>> navigating the filesystem).
> 
> This may be over-optimistic of me, but is there any chance you kept a
> detailed record of exactly what damage was reported, and what you did
> to the filesystem so far?  It's hard to give any intelligent advice on
> repairing it, when we don't know exactly what was broken, and a bunch
> of unknown repair-ish things have already manipulated the metadata
> behind the scenes.
> 
> John
> 
>> With that, there are now some paths that seem to have been orphaned (which
>> we expected). We did not run the ‘cephfs-data-scan’ tool [0] in the name of
>> getting the system back online ASAP. Now that the filesystem is otherwise
>> stable, can we initiate a scan_links operation with the mds active safely?
>> 
>> [0]
>> http://docs.ceph.com/docs/luminous/cephfs/disaster-recovery/#recovery-from-missing-metadata-objects
>> 
>> Thanks much,
>> Ryan Leimenstoll
>> rleim...@umiacs.umd.edu
>> University of Maryland Institute for Advanced Computer Studies
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs-data-scan safety on active filesystem

2018-05-07 Thread Ryan Leimenstoll
Hi All, 

We recently experienced a failure with our 12.2.4 cluster running a CephFS 
instance that resulted in some data loss due to a seemingly problematic OSD 
blocking IO on its PGs. We restarted the (single active) mds daemon during 
this, which caused damage due to the journal not having the chance to flush 
back. We reset the journal, session table, and fs to bring the filesystem 
online. We then removed some directories/inodes that were causing the cluster 
to report damaged metadata (and were otherwise visibly broken by navigating the 
filesystem).

With that, there are now some paths that seem to have been orphaned (which we 
expected). We did not run the ‘cephfs-data-scan’ tool [0] in the name of 
getting the system back online ASAP. Now that the filesystem is otherwise 
stable, can we initiate a scan_links operation with the mds active safely?

[0] 
http://docs.ceph.com/docs/luminous/cephfs/disaster-recovery/#recovery-from-missing-metadata-objects

Thanks much, 
Ryan Leimenstoll
rleim...@umiacs.umd.edu
University of Maryland Institute for Advanced Computer Studies


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] change radosgw object owner

2018-03-08 Thread Ryan Leimenstoll
Hi Robin, 

Thanks for the pointer! My one concern is that it didn’t seem to update 
the original object owner’s quota, which is a bit of a sticking point. 
Is this expected (and is there a workaround)? I will admit to being a bit naive 
about how radosgw’s quota system works under the hood. 
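
A rough sketch of what may help, assuming the discrepancy is only in the cached 
user statistics (uids are placeholders):

radosgw-admin user stats --uid=olduser --sync-stats
radosgw-admin user stats --uid=newuser --sync-stats
radosgw-admin bucket stats --bucket=$MYBUCKET   # per-bucket usage for comparison

--sync-stats recalculates a user's usage from the buckets currently linked to 
them, which should bring the quota accounting back in line after the re-link.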

Thanks,
Ryan

> On Mar 6, 2018, at 2:54 PM, Robin H. Johnson <robb...@gentoo.org> wrote:
> 
> On Tue, Mar 06, 2018 at 02:40:11PM -0500, Ryan Leimenstoll wrote:
>> Hi all, 
>> 
>> We are trying to move a bucket in radosgw from one user to another in an 
>> effort to both change ownership and attribute the storage usage of the data to 
>> the receiving user’s quota. 
>> 
>> I have unlinked the bucket and linked it to the new user using: 
>> 
>> radosgw-admin bucket unlink —bucket=$MYBUCKET —uid=$USER
>> radosgw-admin bucket link —bucket=$MYBUCKET —bucket-id=$BUCKET_ID 
>> —uid=$NEWUSER
>> 
>> However, perhaps as expected, the owner of all the objects in the
>> bucket remain as $USER. I don’t believe changing the owner is a
>> supported operation from the S3 protocol, however it would be very
>> helpful to have the ability to do this on the radosgw backend. This is
>> especially useful for large buckets/datasets where copying the objects
>> out and into radosgw could be time consuming.
> At the raw radosgw-admin level, you should be able to do it with
> bi-list/bi-get/bi-put. The downside here is that I don't think the BI ops are
> exposed in the HTTP Admin API, so it's going to be really expensive to chown
> lots of objects.
> 
> Using a quick example:
> # radosgw-admin \
>  --uid UID-CENSORED \
>  --bucket BUCKET-CENSORED \
>  bi get \
>  --object=OBJECTNAME-CENSORED
> {
>"type": "plain",
>"idx": "OBJECTNAME-CENSORED",
>"entry": {
>"name": "OBJECTNAME-CENSORED",
>"instance": "",
>"ver": {
>"pool": 5,
>"epoch": 266028
>},
>"locator": "",
>"exists": "true",
>"meta": {
>"category": 1,
>"size": 1066,
>"mtime": "2016-11-17 17:01:29.668746Z",
>"etag": "e7a75c39df3d123c716d5351059ad2d9",
>"owner": "UID-CENSORED",
>"owner_display_name": "UID-CENSORED",
>"content_type": "image/png",
>"accounted_size": 1066,
>"user_data": ""
>},
>"tag": "default.293024600.1188196",
>"flags": 0,
>"pending_map": [],
>"versioned_epoch": 0
>}
> }
> 
> -- 
> Robin Hugh Johnson
> Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
> E-Mail   : robb...@gentoo.org
> GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
> GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] change radosgw object owner

2018-03-06 Thread Ryan Leimenstoll
Hi all, 

We are trying to move a bucket in radosgw from one user to another, in an effort 
to both change ownership and attribute the storage usage of the data to the 
receiving user’s quota. 

I have unlinked the bucket and linked it to the new user using: 

radosgw-admin bucket unlink --bucket=$MYBUCKET --uid=$USER
radosgw-admin bucket link --bucket=$MYBUCKET --bucket-id=$BUCKET_ID --uid=$NEWUSER

However, perhaps as expected, the owner of all the objects in the bucket remain 
as $USER. I don’t believe changing the owner is a supported operation from the 
S3 protocol, however it would be very helpful to have the ability to do this on 
the radosgw backend. This is especially useful for large buckets/datasets where 
copying the objects out and into radosgw could be time consuming.

 Is this something that is currently possible within radosgw? We are running 
Ceph 12.2.2. 

Thanks,
Ryan Leimenstoll
rleim...@umiacs.umd.edu
University of Maryland Institute for Advanced Computer Studies
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rgw resharding operation seemingly won't end

2017-10-10 Thread Ryan Leimenstoll
Thanks for the response Yehuda. 


Status:
[root@objproxy02 UMobjstore]# radosgw-admin reshard status --bucket=$bucket_name
[
{
"reshard_status": 1,
"new_bucket_instance_id": 
"8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.47370206.1",
"num_shards": 4
}
]

I cleared the flag using the bucket check --fix command and will keep an eye on 
that tracker issue. 

Do you have any insight into why the RGWs ultimately paused/reloaded and failed 
to come back? I am happy to provide more information that could assist. At the 
moment we are somewhat nervous to reenable dynamic sharding as it seems to have 
contributed to this problem. 
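
For reference, keeping it disabled is a one-line ceph.conf setting on the rgw 
hosts (section name depends on how the rgw instances are named), followed by a 
restart of the radosgw service:

[client.rgw.objproxy02]
rgw dynamic resharding = false

After that, radosgw-admin reshard list should stay empty unless a reshard is 
queued manually.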

Thanks,
Ryan



> On Oct 9, 2017, at 5:26 PM, Yehuda Sadeh-Weinraub <yeh...@redhat.com> wrote:
> 
> On Mon, Oct 9, 2017 at 1:59 PM, Ryan Leimenstoll
> <rleim...@umiacs.umd.edu> wrote:
>> Hi all,
>> 
>> We recently upgraded to Ceph 12.2.1 (Luminous) from 12.2.0 however are now 
>> seeing issues running radosgw. Specifically, it appears an automatically 
>> triggered resharding operation won’t end, despite the jobs being cancelled 
>> (radosgw-admin reshard cancel). I have also disabled dynamic sharding for 
>> the time being in the ceph.conf.
>> 
>> 
>> [root@objproxy02 ~]# radosgw-admin reshard list
>> []
>> 
>> The two buckets were also reported in the `radosgw-admin reshard list` 
>> before our RGW frontends paused recently (and only came back after a service 
>> restart). These two buckets cannot currently be written to at this point 
>> either.
>> 
>> 2017-10-06 22:41:19.547260 7f90506e9700 0 block_while_resharding ERROR: 
>> bucket is still resharding, please retry
>> 2017-10-06 22:41:19.547411 7f90506e9700 0 WARNING: set_req_state_err 
>> err_no=2300 resorting to 500
>> 2017-10-06 22:41:19.547729 7f90506e9700 0 ERROR: 
>> RESTFUL_IO(s)->complete_header() returned err=Input/output error
>> 2017-10-06 22:41:19.548570 7f90506e9700 1 == req done req=0x7f90506e3180 
>> op status=-2300 http_status=500 ==
>> 2017-10-06 22:41:19.548790 7f90506e9700 1 civetweb: 0x55766d111000: 
>> $MY_IP_HERE$ - - [06/Oct/2017:22:33:47 -0400] "PUT /
>> $REDACTED_BUCKET_NAME$/$REDACTED_KEY_NAME$ HTTP/1.1" 1 0 - Boto3/1.4.7 
>> Python/2.7.12 Linux/4.9.43-17.3
>> 9.amzn1.x86_64 exec-env/AWS_Lambda_python2.7 Botocore/1.7.2 Resource
>> [.. slightly later in the logs..]
>> 2017-10-06 22:41:53.516272 7f90406c9700 1 rgw realm reloader: Frontends 
>> paused
>> 2017-10-06 22:41:53.528703 7f907893f700 0 ERROR: failed to clone shard, 
>> completion_mgr.get_next() returned ret=-125
>> 2017-10-06 22:44:32.049564 7f9074136700 0 ERROR: keystone revocation 
>> processing returned error r=-22
>> 2017-10-06 22:59:32.059222 7f9074136700 0 ERROR: keystone revocation 
>> processing returned error r=-22
>> 
>> Can anyone advise on the best path forward to stop the current sharding 
>> states and avoid this moving forward?
>> 
> 
> What does 'radosgw-admin reshard status --bucket=' return?
> I think just manually resharding the buckets should clear this flag,
> is that not an option?
> manual reshard: radosgw-admin bucket reshard --bucket=
> --num-shards=
> 
> also, the 'radosgw-admin bucket check --fix' might clear that flag.
> 
> For some reason it seems that the reshard cancellation code is not
> clearing that flag on the bucket index header (pretty sure it used to
> do it at one point). I'll open a tracker ticket.
> 
> Thanks,
> Yehuda
> 
>> 
>> Some other details:
>> - 3 rgw instances
>> - Ceph Luminous 12.2.1
>> - 584 active OSDs, rgw bucket index is on Intel NVMe OSDs
>> 
>> 
>> Thanks,
>> Ryan Leimenstoll
>> rleim...@umiacs.umd.edu
>> University of Maryland Institute for Advanced Computer Studies
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rgw resharding operation seemingly won't end

2017-10-09 Thread Ryan Leimenstoll
Hi all, 

We recently upgraded to Ceph 12.2.1 (Luminous) from 12.2.0 however are now 
seeing issues running radosgw. Specifically, it appears an automatically 
triggered resharding operation won’t end, despite the jobs being cancelled 
(radosgw-admin reshard cancel). I have also disabled dynamic sharding for the 
time being in the ceph.conf.


[root@objproxy02 ~]# radosgw-admin reshard list
[]

The two buckets were also reported in the `radosgw-admin reshard list` before 
our RGW frontends paused recently (and only came back after a service restart). 
These two buckets cannot currently be written to at this point either. 

2017-10-06 22:41:19.547260 7f90506e9700 0 block_while_resharding ERROR: bucket 
is still resharding, please retry 
2017-10-06 22:41:19.547411 7f90506e9700 0 WARNING: set_req_state_err 
err_no=2300 resorting to 500 
2017-10-06 22:41:19.547729 7f90506e9700 0 ERROR: 
RESTFUL_IO(s)->complete_header() returned err=Input/output error 
2017-10-06 22:41:19.548570 7f90506e9700 1 == req done req=0x7f90506e3180 op 
status=-2300 http_status=500 == 
2017-10-06 22:41:19.548790 7f90506e9700 1 civetweb: 0x55766d111000: 
$MY_IP_HERE$ - - [06/Oct/2017:22:33:47 -0400] "PUT / 
$REDACTED_BUCKET_NAME$/$REDACTED_KEY_NAME$ HTTP/1.1" 1 0 - Boto3/1.4.7 
Python/2.7.12 Linux/4.9.43-17.3 
9.amzn1.x86_64 exec-env/AWS_Lambda_python2.7 Botocore/1.7.2 Resource 
[.. slightly later in the logs..]
2017-10-06 22:41:53.516272 7f90406c9700 1 rgw realm reloader: Frontends paused 
2017-10-06 22:41:53.528703 7f907893f700 0 ERROR: failed to clone shard, 
completion_mgr.get_next() returned ret=-125 
2017-10-06 22:44:32.049564 7f9074136700 0 ERROR: keystone revocation processing 
returned error r=-22 
2017-10-06 22:59:32.059222 7f9074136700 0 ERROR: keystone revocation processing 
returned error r=-22 

Can anyone advise on the best path forward to stop the current sharding states 
and avoid this moving forward?


Some other details:
 - 3 rgw instances
 - Ceph Luminous 12.2.1
 - 584 active OSDs, rgw bucket index is on Intel NVMe OSDs
 

Thanks,
Ryan Leimenstoll
rleim...@umiacs.umd.edu
University of Maryland Institute for Advanced Computer Studies



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Luminous RGW dynamic sharding

2017-09-20 Thread Ryan Leimenstoll
Hi all, 

I noticed Luminous now has dynamic sharding for RGW bucket indices as a 
production option. Does anyone know of any potential caveats or issues we 
should be aware of before enabling this? Beyond the Luminous v12.2.0 release 
notes and a few mailing list entries from during the release candidate phase, I 
haven’t seen much mention of it. For some time now we have been experiencing 
blocked requests when deep scrubbing PGs in our bucket index, so this could be 
quite useful for us. 
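
For reference, the knobs involved appear to be small; the values below are my 
understanding of the Luminous defaults and may need adjusting per bucket sizes:

[client.rgw.<instance>]
rgw dynamic resharding = true
rgw max objs per shard = 100000

Pending operations can be watched with radosgw-admin reshard list and 
radosgw-admin reshard status --bucket=<name>.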

Thanks,
Ryan Leimenstoll
rleim...@umiacs.umd.edu
University of Maryland Institute for Advanced Computer Studies

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RGW Multisite Sync Memory Usage

2017-07-26 Thread Ryan Leimenstoll
Hi all, 

We are currently trying to migrate our RGW Object Storage service from one zone 
to another (in the same zonegroup) in part to make use of erasure coded data 
pools. That being said, the rgw daemon is reliably getting OOM killed on the 
rgw origin host serving the original zone (and thus the current production 
data) as a result of high rgw memory usage. We are willing to consider more 
memory for the rgw daemon’s hosts to solve this problem, but I was wondering what 
would be expected memory-wise (at least as a rule of thumb). I noticed there 
were a few memory related rgw sync fixes in 10.2.9, but so far upgrading hasn’t 
seemed to prevent crashing. 


Some details about our cluster:
Ceph Version: 10.2.9
OS: RHEL 7.3

584 OSDs
Serving RBD, CephFS, and RGW

RGW Origin Hosts:
Virtualized via KVM/QEMU, RHEL 7.3
Memory: 32GB
CPU: 12 virtual cores (Hypervisor processors: Intel E5-2630)

First zone data and index pools:
pool name KB  objects   clones degraded  
unfound   rdrd KB   wrwr KB
.rgw.buckets112190858231 3423974600
0   2713542251 265848150719475841837 153970795085
.rgw.buckets.index0 497200  
  0   3721485483   5926323574 360300980


Thanks,
Ryan Leimenstoll
University of Maryland Institute for Advanced Computer Studies

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-dbg package for Xenial (ubuntu-16.04.x) broken

2016-08-03 Thread J. Ryan Earl
Inspecting the ceph-dbg packages under
http://download.ceph.com/debian-jewel/pool/main/c/ceph/ it looks like this
is an ongoing issue and not specific to just 10.2.2.  Specifically there
are only 2 ceph-dbg package versions:

ceph-dbg_10.0.2-1trusty_amd64.deb
ceph-dbg_10.0.2-1~bpo80+1_amd64.deb

There aren't even 10.0.2 'ceph' packages there, only 10.1.x and 10.2.x
versions of the actual binaries.  So it seems that there are literally no
debug packages available for any of the Debian-based Jewel releases
available.  This seems like a systemic issue.

I've created an issue on the tracker: http://tracker.ceph.com/issues/16912

On Wed, Aug 3, 2016 at 1:30 PM, Ken Dreyer <kdre...@redhat.com> wrote:

> For some reason, during the v10.2.2 release,
> ceph-dbg_10.0.2-1xenial_amd64.deb did not get transferred to
> http://download.ceph.com/debian-jewel/pool/main/c/ceph/
>
> - Ken
>
> On Wed, Aug 3, 2016 at 12:27 PM, J. Ryan Earl <o...@jryanearl.us> wrote:
> > Hello,
> >
> > New to the list.  I'm working on performance tuning and testing a new
> Ceph
> > cluster built on Ubuntu 16.04 LTS and newest "Jewel" Ceph release.  I'm
> in
> > the process of collecting stack frames as part of a profiling inspection
> > using FlameGraph (https://github.com/brendangregg/FlameGraph) to inspect
> > where the CPU is spending time but need to load the 'dbg' packages to get
> > symbol information.  However, it appears the 'ceph-dbg' package has
> broken
> > dependencies:
> >
> > ceph1.oak:/etc/apt# apt-get install ceph-dbg
> > Reading package lists... Done
> > Building dependency tree
> > Reading state information... Done
> > Some packages could not be installed. This may mean that you have
> > requested an impossible situation or if you are using the unstable
> > distribution that some required packages have not yet been created
> > or been moved out of Incoming.
> > The following information may help to resolve the situation:
> >
> > The following packages have unmet dependencies:
> >  ceph-dbg : Depends: ceph (= 10.2.2-0ubuntu0.16.04.2) but 10.2.2-1xenial is to be installed
> > E: Unable to correct problems, you have held broken packages.
> >
> > Any ideas on how to quickly work around this issue so I can continue
> > performance profiling?
> >
> > Thank you,
> > -JR
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-dbg package for Xenial (ubuntu-16.04.x) broken

2016-08-03 Thread J. Ryan Earl
Hello,

New to the list.  I'm working on performance tuning and testing a new Ceph
cluster built on Ubuntu 16.04 LTS and newest "Jewel" Ceph release.  I'm in
the process of collecting stack frames as part of a profiling inspection
using FlameGraph (https://github.com/brendangregg/FlameGraph) to inspect
where the CPU is spending time but need to load the 'dbg' packages to get
symbol information.  However, it appears the 'ceph-dbg' package has broken
dependencies:

ceph1.oak:/etc/apt# apt-get install ceph-dbg
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 ceph-dbg : Depends: ceph (= 10.2.2-0ubuntu0.16.04.2) but 10.2.2-1xenial is to be installed
E: Unable to correct problems, you have held broken packages.

Any ideas on how to quickly work around this issue so I can continue
performance profiling?

Thank you,
-JR
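
One possible workaround, if symbols from the distro build are acceptable, is to 
let apt downgrade to the matching Ubuntu-archive versions so ceph-dbg's strict 
version dependency can be satisfied (versions taken from the error above):

apt-get install ceph=10.2.2-0ubuntu0.16.04.2 ceph-dbg=10.2.2-0ubuntu0.16.04.2

Alternatively, pin the ceph packages to a single origin in /etc/apt/preferences.d/ 
so ceph and ceph-dbg always resolve to the same version. Either way it is only a 
stopgap until matching -dbg packages appear on download.ceph.com.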
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High 0.94.5 OSD memory use at 8GB RAM/TB raw disk during recovery

2015-12-01 Thread Ryan Tokarek

> On Nov 30, 2015, at 6:52 PM, Laurent GUERBY <laur...@guerby.net> wrote:
> 
> Hi,
> 
> We lost a disk today in our ceph cluster so we added a new machine with
> 4 disks to replace the capacity and we activated straw1 tunable too
> (we also tried straw2 but we quickly backed up this change).
> 
> During recovery OSD started crashing on all of our machines
> the issue being OSD RAM usage that goes very high, eg:
> 
> 24078 root  20   0 27.784g 0.026t  10888 S   5.9 84.9
> 16:23.63 /usr/bin/ceph-osd --cluster=ceph -i 41 -f
> /dev/sda1   2.7T  2.2T  514G  82% /var/lib/ceph/osd/ceph-41
> 
> That's about 8GB resident RAM per TB of disk, way above
> what we provisionned ~ 2-4 GB RAM/TB.

We had something vaguely similar (not nearly that dramatic though!) happen to 
us. During a recovery (actually, I think this was rebalancing after upgrading 
from an earlier version of ceph), our OSDs took so much memory they would get 
killed by oom_killer and we couldn't keep the cluster up long enough to get 
back to healthy. 

A solution for us was to enable zswap; previously we had been running with no 
swap at all. 

If you are running a kernel newer than 3.11 (you might want more recent than 
that as I believe there were major fixes after 3.17), then enabling zswap 
allows the kernel to compress pages in memory before needing to touch disk. The 
default max pool size for this is 20% of memory. There is extra CPU time to 
compress/decompress, but it's much faster than going to disk, and the OSD data 
appears to be quite compressible. For us, nothing actually made it to the disk, 
but a swap file must be enabled for zswap to do its work. 

https://www.kernel.org/doc/Documentation/vm/zswap.txt
http://askubuntu.com/questions/471912/zram-vs-zswap-vs-zcache-ultimate-guide-when-to-use-which-one

Add "zswap.enabled=1" to your kernel bool parameters and reboot. 

If you have no swap file/partition/disk/whatever, then you need one for zswap 
to actually do anything. Here is an example, but use whatever sizes, locations, 
process you prefer:

dd if=/dev/zero of=/var/swap bs=1M count=8192
chmod 600 /var/swap
mkswap /var/swap
swapon /var/swap

Consider adding it to /etc/fstab:
/var/swap   swapswapdefaults 0 0 

This got us through the rebalancing. The OSDs eventually returned to normal, 
but we've just left zswap enabled with no apparent problems. I don't know that 
it will be enough for your situation, but it might help. 

Ryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] hanging nfsd requests on an RBD to NFS gateway

2015-10-22 Thread Ryan Tokarek

> On Oct 22, 2015, at 3:57 PM, John-Paul Robinson <j...@uab.edu> wrote:
> 
> Hi,
> 
> Has anyone else experienced a problem with RBD-to-NFS gateways blocking
> nfsd server requests when their ceph cluster has a placement group that
> is not servicing I/O for some reason, eg. too few replicas or an osd
> with slow request warnings?

We have experienced exactly that kind of problem except that it sometimes 
happens even when ceph health reports "HEALTH_OK". This has been incredibly 
vexing for us. 


If the cluster is unhealthy for some reason, then I'd expect your/our symptoms 
as writes can't be completed. 

I'm guessing that you have file systems with barriers turned on. Whichever file 
system has a barrier write stuck on the problem PG will cause any other 
process trying to write anywhere in that FS to block as well. This likely means a 
cascade of nfsd processes will block as they each try to service various client 
writes to that FS. Even though, theoretically, the rest of the "disk" (rbd) and 
other file systems might still be writable, the NFS processes will still be in 
uninterruptible sleep just because of that stuck write request (or such is my 
understanding). 

Disabling barriers on the gateway machine might postpone the problem (never 
tried it and don't want to) until you hit your vm.dirty_bytes or vm.dirty_ratio 
thresholds, but it is dangerous as you could much more easily lose data. You'd 
be better off solving the underlying issues when they happen (too few replicas 
available or overloaded osds). 


For us, even when the cluster reports itself as healthy, we sometimes have this 
problem. All nfsd processes block. sync blocks. echo 3 > 
/proc/sys/vm/drop_caches blocks. There is a persistent 4-8MB "Dirty" in 
/proc/meminfo. None of the osds log slow requests. Everything seems fine on the 
osds and mons. Neither CPU nor I/O load is extraordinary on the ceph nodes, but 
at least one file system on the gateway machine will stop accepting writes. 
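
When it happens, a quick way to see who is stuck (standard tools, nothing 
ceph-specific; run as root):

grep Dirty /proc/meminfo                          # the stuck 4-8MB shows up here
ps -eo state,pid,wchan:32,cmd | awk '$1=="D"'     # processes in uninterruptible sleep
cat /proc/<pid>/stack                             # kernel stack of a stuck nfsd/flusher thread
sysctl vm.dirty_ratio vm.dirty_background_ratio   # current writeback thresholds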

If we just wait, the situation resolves itself in 10 to 30 minutes. A forced 
reboot of the NFS gateway "solves" the performance problem, but is annoying and 
dangerous (we unmount all of the file systems that we still can, but 
the stuck ones leave us resorting to a sysrq-b). 

This is on Scientific Linux 6.7 systems with elrepo 4.1.10 Kernels running Ceph 
Firefly (0.8.10) and XFS file systems exported over NFS and samba. 

Ryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] hanging nfsd requests on an RBD to NFS gateway

2015-10-22 Thread Ryan Tokarek

> On Oct 22, 2015, at 10:19 PM, John-Paul Robinson <j...@uab.edu> wrote:
> 
> A few clarifications on our experience:
> 
> * We have 200+ rbd images mounted on our RBD-NFS gateway.  (There's
> nothing easier for a user to understand than "your disk is full".)

Same here, and agreed. It sounds like our situations are similar except for my 
blocking on an apparently healthy cluster issue. 

> * I'd expect more contention potential with a single shared RBD back
> end, but with many distinct and presumably isolated backend RBD images,
> I've always been surprised that *all* the nfsd task hang.  This leads me
> to think  it's an nfsd issue rather than and rbd issue.  (I realize this
> is an rbd list, looking for shared experience. ;) )

It's definitely possible. I've experienced exactly the behavior you're seeing. 
My guess is that when an nfsd thread blocks and goes dark, affected clients 
(even if it's only one) will retransmit their requests, thinking there's a 
network issue, causing more nfsds to go dark until all the server threads are 
stuck (that could be hogwash, but it fits the behavior). Or perhaps there are 
enough individual clients writing to the affected NFS volume that they consume 
all the available nfsd threads (I'm not sure about your client to FS and nfsd 
thread ratio, but that is plausible in my situation).  I think some testing 
with xfs_freeze and non-critical nfs server/clients is called for. 
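
A minimal version of that test, on a throwaway export (the mount point is a 
placeholder):

xfs_freeze -f /exports/test    # writes to this FS now block, mimicking the failure case
xfs_freeze -u /exports/test    # thaw when done observing the nfsd behaviour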

I don't think this part is related to ceph except that it happens to be 
providing the underlying storage. I'm fairly certain that my problems with an 
apparently healthy cluster blocking writes is a ceph problem, but I haven't 
figured out what the source of that is. 

> * I haven't seen any difference between reads and writes.  Any access to
> any backing RBD store from the NFS client hangs.

All NFS clients are hung, but in my situation, it's usually only 1-3 local file 
systems that stop accepting writes. NFS is completely unresponsive, but local 
and remote-samba operations on the unaffected file systems are totally happy. 

I don't have a solution to the NFS issue, but I've seen it all too often. I wonder 
whether setting a huge number of threads and/or playing with client retransmit 
times would help, but I suspect this problem is just intrinsic to Linux NFS 
servers. 
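
For the thread-count experiment, on EL6/7 that would be roughly (paths and 
defaults may differ per distro):

RPCNFSDCOUNT=64                # in /etc/sysconfig/nfs, then: service nfs restart
rpc.nfsd 64                    # or change it on the fly
cat /proc/fs/nfsd/threads      # confirm the running count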

Ryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] getting started

2013-09-16 Thread Justin Ryan
Hi,

I'm brand new to Ceph, attempting to follow the Getting Started guide
(http://ceph.com/docs/master/start/) with 2 VMs. I completed the Preflight
without issue.  I completed the Storage Cluster Quick Start
(http://ceph.com/docs/master/start/quick-ceph-deploy/), but have some questions:

The *Single Node Quick Start* grey box -- does 'single node' mean if you're
running the whole thing on a single machine, if you have only one server
node like the diagram at the top of the page, or if you're only running one
OSD process? I'm not sure if I need to make the `osd crush chooseleaf type`
change.

Are the LIST, ZAP, and ADD OSDS ON STANDALONE DISKS sections an alternative
to the MULTIPLE OSDS ON THE OS DISK (DEMO ONLY) section? I thought I set up
my OSDs already on /tmp/osd{0,1}.

Moving on to the Block Device Quick Start
(http://ceph.com/docs/master/start/quick-rbd/) -- it says "To use this guide,
you must have executed the procedures in the Object Store Quick Start guide
first" -- but the link to the Object Store Quick Start actually points to the
Storage Cluster Quick Start (http://ceph.com/docs/master/start/quick-ceph-deploy/) --
which is it?

Most importantly, it says "Ensure your Ceph Storage Cluster is in an active
+ clean state before working with the Ceph Block Device" --- how can I tell
if my cluster is active+clean?? The only ceph* command on the admin node is
ceph-deploy, and running `ceph` on the server node:

ceph@jr-ceph2:~$ ceph
2013-09-16 16:53:10.880267 7feb96c1b700 -1 monclient(hunting): ERROR:
missing keyring, cannot use cephx for authentication
2013-09-16 16:53:10.880271 7feb96c1b700  0 librados: client.admin
initialization error (2) No such file or directory
Error connecting to cluster: ObjectNotFound
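
That error usually just means the admin keyring isn't present/readable on that 
node; a typical fix from the admin node would be something like (hostname taken 
from this setup):

ceph-deploy admin jr-ceph2
sudo chmod +r /etc/ceph/ceph.client.admin.keyring   # run on jr-ceph2

after which a non-root user should be able to run ceph health / ceph status and 
look for "active+clean" in the pgmap line.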

Thanks in advance for any help, and apologies if I missed anything obvious.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] getting started

2013-09-16 Thread Justin Ryan
thanks, running as root does give me status, but not clean.

r...@jr-ceph2.vm:~# ceph status
  cluster 9059dfad-924a-425c-a20b-17dc1d53111e
   health HEALTH_WARN 91 pgs degraded; 192 pgs stuck unclean; recovery
21/42 degraded (50.000%)
   monmap e1: 1 mons at {jr-ceph2=10.88.26.55:6789/0}, election epoch 2,
quorum 0 jr-ceph2
   osdmap e10: 2 osds: 2 up, 2 in
pgmap v2715: 192 pgs: 101 active+remapped, 91 active+degraded; 9518
bytes data, 9148 MB used, 362 GB / 391 GB avail; 21/42 degraded (50.000%)
   mdsmap e4: 1/1/1 up {0=jr-ceph2.XXX=up:active}

I don't see anything telling in the ceph logs; should I wait for the new
quickstart?
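
If both OSDs are on the same host (which the two-VM setup above implies), the 
default CRUSH rule cannot place the second replica on a separate host, which is 
exactly what produces the degraded/stuck-unclean PGs. A sketch of the quick-start 
fix, assuming that is the situation:

osd crush chooseleaf type = 0    # in ceph.conf [global], before creating the OSDs

For an already-built cluster the equivalent is editing the CRUSH rule
(ceph osd getcrushmap / crushtool / ceph osd setcrushmap), or simply rebuilding
the test cluster with the setting in place.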


On Mon, Sep 16, 2013 at 2:27 PM, John Wilkins john.wilk...@inktank.com wrote:

 We will have a new update to the quick start this week.

 On Mon, Sep 16, 2013 at 12:18 PM, Alfredo Deza alfredo.d...@inktank.com
 wrote:
  On Mon, Sep 16, 2013 at 12:58 PM, Justin Ryan justin.r...@kixeye.com
 wrote:
  Hi,
 
  I'm brand new to Ceph, attempting to follow the Getting Started guide
 with 2
  VMs. I completed the Preflight without issue.  I completed Storage
 Cluster
  Quick Start, but have some questions:
 
  The Single Node Quick Start grey box -- does 'single node' mean if
 you're
  running the whole thing on a single machine, if you have only one server
  node like the diagram at the top of the page, or if you're only running
 one
  OSD process? I'm not sure if I need to make the `osd crush chooseleaf
 type`
  change.
 
  Are the LIST, ZAP, and ADD OSDS ON STANDALONE DISKS sections an
 alternative
  to the MULTIPLE OSDS ON THE OS DISK (DEMO ONLY) section? I thought I
 set up
  my OSDs already on /tmp/osd{0,1}.
 
  Moving on to the Block Device Quick Start -- it says To use this
 guide, you
  must have executed the procedures in the Object Store Quick Start guide
  first -- but the link to the Object Store Quick Start actually points
 to
  the Storage Cluster Quick Start -- which is it?
 
  Most importantly, it says Ensure your Ceph Storage Cluster is in an
 active
  + clean state before working with the Ceph Block Device --- how can
 tell if
  my cluster is active+clean?? The only ceph* command on the admin node is
  ceph-deploy, and running `ceph` on the server node:
 
  ceph@jr-ceph2:~$ ceph
  2013-09-16 16:53:10.880267 7feb96c1b700 -1 monclient(hunting): ERROR:
  missing keyring, cannot use cephx for authentication
  2013-09-16 16:53:10.880271 7feb96c1b700  0 librados: client.admin
  initialization error (2) No such file or directory
  Error connecting to cluster: ObjectNotFound
 
  There is a ticket open for this, but you basically need super-user
  permissions here to run (any?) ceph commands.
 
  Thanks in advance for any help, and apologies if I missed anything
 obvious.
 
 
 
 
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 --
 John Wilkins
 Senior Technical Writer
 Intank
 john.wilk...@inktank.com
 (415) 425-9599
 http://inktank.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Format 2 Image support in the RBD driver

2013-04-18 Thread Whelan, Ryan
I've not been following the list for long, so forgive me if this has been 
covered, but is there a plan for format 2 image support in the kernel RBD driver?  I 
assume with Linux 3.9 in the RC phase, it's not likely to appear there?
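
In the meantime, format 2 images can still be created and used via librbd (QEMU 
and friends); a quick sketch with a hypothetical pool/image name:

rbd create rbd/testimg --size 1024 --image-format 2   # older rbd CLIs spell it --format 2
rbd info rbd/testimg                                   # shows "format: 2"

It is only the kernel client (rbd map) that lags behind on format 2.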

Thanks!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com