from:"Brian \:"

Re: [ceph-users] unsubscribe

2019-07-12 Thread Brian Topping

It’s in the mail headers on every email: 
mailto:ceph-users-requ...@lists.ceph.com?subject=unsubscribe

> On Jul 12, 2019, at 5:00 PM, Robert Stanford  wrote:
> 
> unsubscribe
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] How does monitor know OSD is dead?

2019-07-02 Thread Brian :

I wouldn't say that's a pretty common failure. The flaw here perhaps is the
design of the cluster and that it was relying on a single power source.
Power sources fail. Dual power supplies connected to a b power sources in
the data centre is pretty standard.

On Tuesday, July 2, 2019, Bryan Henderson  wrote:
>> Normally in the case of a restart then somebody who used to have a
>> connection to the OSD would still be running and flag it as dead. But
>> if *all* the daemons in the cluster lose their soft state, that can't
>> happen.
>
> OK, thanks.  I guess that explains it.  But that's a pretty serious design
> flaw, isn't it?  What I experienced is a pretty common failure mode: a
power
> outage caused the entire cluster to die simultaneously, then when power
came
> back, some OSDs didn't (the most common time for a server to fail is at
> startup).
>
> I wonder if I could close this gap with additional monitoring of my own.
I
> could have a cluster bringup protocol that detects OSD processes that
aren't
> running after a while and mark those OSDs down.  It would be cleaner,
though,
> if I could just find out from the monitor what OSDs are in the map but not
> connected to the monitor cluster.  Is that possible?
>
> A related question: If I mark an OSD down administratively, does it stay
down
> until I give a command to mark it back up, or will the monitor detect
signs of
> life and declare it up again on its own?
>
> --
> Bryan Henderson   San Jose, California
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Weird behaviour of ceph-deploy

2019-06-17 Thread Brian Topping

I don’t have an answer for you, but it’s going to help others to have shown:
Versions of all nodes involved and multi-master configuration
Confirm forward and reverse DNS and SSH / remote sudo since you are using deploy
Specific steps that did not behave properly
> On Jun 17, 2019, at 6:29 AM, CUZA Frédéric  wrote:
> 
> I’ll keep updating this until I find a solution so if anyone faces the same 
> problem he might have solution.
>  
> Atm : I install the new osd node with ceph-deploy and nothing change, node is 
> still not present in the cluster nor in the crushmap.
> I decided to manually add it to the crush map :
> ceph osd crush add-bucket sd0051 host
> and move it to where it should be :
> ceph osd crush move sd0051 room=roomA
> Then I added an osd to that node :
> ceph-deploy osd create sd0051 --data /dev/sde --block-db /dev/sda1 
> --block-wal /dev/sdb1 –bluestore
> Once finally created the osd is still not linked to the host where it is 
> created and I can’t move it the this host right now;
>  
>  
> Regards,
>  
>  
> De : ceph-users  De la part de CUZA 
> Frédéric
> Envoyé : 15 June 2019 00:34
> À : ceph-users@lists.ceph.com
> Objet : Re: [ceph-users] Weird behaviour of ceph-deploy
>  
> Little update :
> I check one osd I’ve installed even if the host isn’t not present in the 
> crushmap (or in cluster I guess) and I found this :
>  
> monclient: wait_auth_rotating timed out after 30
> osd.xxx 0 unable to obtain rotating service keys; retrying
>  
> I alosy add the host to the admins host :
> ceph-deploy admin sd0051
> and nothing change.
>  
> When I do the install there is not .conf pushed the new node.
>  
> Regards,
>  
> De : ceph-users  > De la part de CUZA Frédéric
> Envoyé : 14 June 2019 18:28
> À : ceph-users@lists.ceph.com 
> Objet : [ceph-users] Weird behaviour of ceph-deploy
>  
> Hi everyone,
>  
> I am facing a strange behavious from ceph-deploy.
> I try to add a new node to our cluster :
> ceph-deploy install --no-adjust-repos sd0051
>  
> Everything seems to work fine but the new bucket (host) is not created in the 
> crushmap and when I try to add a new osd to that host, the osd is created but 
> is not link to any host (normal behaviour the host is not present).
> Anyone already faces this ?
>  
> FI : We already add new node with this and this is the first time we face it.
>  
> Thanks !
>  
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] one pg blocked at ctive+undersized+degraded+remapped+backfilling

2019-06-13 Thread Brian Chang-Chien

We want to change index pool(radosgw) rule from sata to ssd, when we run
ceph osd pool set default.rgw.buckets.index crush_ruleset x
All of index pg migrated to ssd, but only one pg is still stuck in sata and
cannot be migrated
and it status is active+undersized+degraded+remapped+backfilling


ceph version : 10.2.5
default.rgw.buckets.index  size=2 min_size=1


How can I solve the problem of continuous backfilling?


 health HEALTH_WARN
 1 pgs backfilling
 1 pgs degraded
 1 pgs stuck unclean
 1 pgs undersized
 1 requests are blocked > 32 sec
 recovery 13/548944664 objects degraded (0.000%)
 recovery 31/548944664 objects misplaced (0.000%)
 monmap e1: 3 mons at {sa101=
192.168.8.71:6789/0,sa102=192.168.8.72:6789/0,sa103=192.168.8.73:6789/0}
election epoch 198, quorum 0,1,2 sa101,sa102,sa103
 osdmap e113094: 311 osds: 295 up, 295 in; 1 remapped pgs
flags noout,noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
  pgmap v62454723: 4752 pgs, 15 pools, 134 TB data, 174 Mobjects
409 TB used, 1071 TB / 1481 TB avail
13/548944664 objects degraded (0.000%)
31/548944664 objects misplaced (0.000%)
4751 active+clean
   1 active+undersized+degraded+remapped+backfilling

[sa101 ~]# ceph pg map 11.28
osdmap e113094 pg 11.28 (11.28) -> up [251,254] acting [192]

[sa101 ~]# ceph health detail
HEALTH_WARN 1 pgs backfilling; 1 pgs degraded; 1 pgs stuck unclean; 1 pgs
undersized; 1 requests are blocked > 32 sec; 1 osds have slow requests;
recovery 13/548949428 objects degraded (0.000%); recovery 31/548949428
objects misplaced (0.000%);
noout,noscrub,nodeep-scrub,sortbitwise,require_jewel_osds flag(s) set
pg 11.28 is stuck unclean for 624019.077931, current state
active+undersized+degraded+remapped+backfilling, last acting [192]
pg 11.28 is active+undersized+degraded+remapped+backfilling, acting [192]
1 ops are blocked > 32.768 sec on osd.192
1 osds have slow requests
recovery 13/548949428 objects degraded (0.000%)
recovery 31/548949428 objects misplaced (0.000%)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] pool migration for cephfs?

2019-05-15 Thread Brian Topping

Lars, I just got done doing this after generating about a dozen CephFS subtrees 
for different Kubernetes clients. 

tl;dr: there is no way for files to move between filesystem formats (ie CephFS 
,> RBD) without copying them.

If you are doing the same thing, there may be some relevance for you in 
https://github.com/kubernetes/enhancements/pull/643. It’s worth checking to see 
if it meets your use case if so.

In any event, what I ended up doing was letting Kubernetes create the new PV 
with the RBD provisioner, then using find piped to cpio to move the file 
subtree. In a non-Kubernetes environment, one would simply create the 
destination RBD as usual. It should be most performant to do this on a monitor 
node.

cpio ensures you don’t lose metadata. It’s been fine for me, but if you have 
special xattrs that the clients of the files need, be sure to test that those 
are copied over. It’s very difficult to move that metadata once a file is 
copied and even harder to deal with a situation where the destination volume 
went live and some files on the destination are both newer versions and missing 
metadata. 

Brian

> On May 15, 2019, at 6:05 AM, Lars Täuber  wrote:
> 
> Hi,
> 
> is there a way to migrate a cephfs to a new data pool like it is for rbd on 
> nautilus?
> https://ceph.com/geen-categorie/ceph-pool-migration/
> 
> Thanks
> Lars
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] PG stuck peering - OSD cephx: verify_authorizer key problem

2019-04-26 Thread Brian Topping

> On Apr 26, 2019, at 1:50 PM, Gregory Farnum  wrote:
> 
> Hmm yeah, it's probably not using UTC. (Despite it being good
> practice, it's actually not an easy default to adhere to.) cephx
> requires synchronized clocks and probably the same timezone (though I
> can't swear to that.)

Apps don’t “see” timezones, timezones are a rendering transform of an absolute 
time. The instant “now” is the same throughout space and time, regardless of 
how that instant is quantified. UNIX wall time is just one such quantification.

Problems ensue when the rendered time is incorrect for the time zone shown in 
the rendering. If a machine that is “not using time zones” shows that the time 
is 3PM UTC and one lives in London, the internal time will be correct. On the 
other hand, if they live in NYC, the internal time is incorrect. This is to say 
15:00UTC rendered at 3PM in NYC is very wrong *because it’s not 3PM in London*, 
where UTC is true.

tl;dr: Make sure that your clock is set for the correct time in the time zone 
in whatever rendering is set. It doesn’t matter where the system actually 
resides or whether the TZ matches it’s geographic location. ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] SOLVED: Multi-site replication speed

2019-04-20 Thread Brian Topping

Followup: Seems to be solved, thanks again for your help. I did have some 
issues with the replication that may have been solved by getting the metadata 
init/run finished first. I haven’t replicated that back to the production 
servers yet, but I’m a lot more comfortable with the behaviors by setting this 
up on a test cluster. 

I believe this is the desired state for unidirectional replication:

> [root@left01 ~]# radosgw-admin sync status
>   realm d5078dd2-6a6e-49f8-941e-55c02ad58af7 (example-test)
>   zonegroup de533461-2593-45d2-8975-99072d860bb2 (us)
>zone 479d3f20-d57d-4b37-995b-510ba10756bf (left)
>   metadata sync no sync (zone is master)
>   data sync source: 5dc80bbc-3d9d-46d5-8f3e-4611fbc17fbe (right)
> not syncing from zone


> [root@right01 ~]# radosgw-admin sync status
>   realm d5078dd2-6a6e-49f8-941e-55c02ad58af7 (example-test)
>   zonegroup de533461-2593-45d2-8975-99072d860bb2 (us)
>zone 5dc80bbc-3d9d-46d5-8f3e-4611fbc17fbe (right)
>   metadata sync syncing
> full sync: 0/64 shards
> incremental sync: 64/64 shards
> metadata is caught up with master
>   data sync source: 479d3f20-d57d-4b37-995b-510ba10756bf (left)
> syncing
> full sync: 0/128 shards
> incremental sync: 128/128 shards
> data is caught up with source

My confusion at the start of this thread was not knowing what would be 
replicated. It was not clear to me that the only objects that would be 
replicated were the S3/Swift objects that were created by other S3/Swift 
clients. The note at [1] from John Spray pretty much sorted out everything I 
was looking to do as well as what I did not fully understand about the Ceph 
stack. All in all, a very informative adventure! 

Hopefully the thread is helpful to others who follow. I’m happy to answer 
questions off-thread as well.

best, Brian

[1] 
https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/415#note_16192610

> On Apr 19, 2019, at 10:21 PM, Brian Topping  wrote:
> 
> Hi Casey,
> 
> I set up a completely fresh cluster on a new VM host.. everything is fresh 
> fresh fresh. I feel like it installed cleanly and because there is 
> practically zero latency and unlimited bandwidth as peer VMs, this is a 
> better place to experiment. The behavior is the same as the other cluster.
> 
> The realm is “example-test”, has a single zone group named “us”, and there 
> are zones “left” and “right”. The master zone is “left” and I am trying to 
> unidirectionally replicate to “right”. “left” is a two node cluster and right 
> is a single node cluster. Both show "too few PGs per OSD” but are otherwise 
> 100% active+clean. Both clusters have been completely restarted to make sure 
> there are no latent config issues, although only the RGW nodes should require 
> that. 
> 
> The thread at [1] is the most involved engagement I’ve found with a staff 
> member on the subject, so I checked and believe I attached all the logs that 
> were requested there. They all appear to be consistent and are attached below.
> 
> For start: 
>> [root@right01 ~]# radosgw-admin sync status
>>   realm d5078dd2-6a6e-49f8-941e-55c02ad58af7 (example-test)
>>   zonegroup de533461-2593-45d2-8975-99072d860bb2 (us)
>>zone 5dc80bbc-3d9d-46d5-8f3e-4611fbc17fbe (right)
>>   metadata sync syncing
>> full sync: 0/64 shards
>> incremental sync: 64/64 shards
>> metadata is caught up with master
>>   data sync source: 479d3f20-d57d-4b37-995b-510ba10756bf (left)
>> syncing
>> full sync: 0/128 shards
>> incremental sync: 128/128 shards
>> data is caught up with source
> 
> 
> I tried the information at [2] and do not see any ops in progress, just 
> “linger_ops”. I don’t know what those are, but probably explain the slow 
> stream of requests back and forth between the two RGW endpoints:
>> [root@right01 ~]# ceph daemon client.rgw.right01.54395.94074682941968 
>> objecter_requests
>> {
>> "ops": [],
>> "linger_ops": [
>> {
>> "linger_id": 2,
>> "pg": "2.16dafda0",
>> "osd": 0,
>> "object_id": "notify.1",
>> "object_locator": "@2",
>> "target_object_id": "notify.1",
>> "target_object_locator": &quo

Re: [ceph-users] Multi-site replication speed

2019-04-19 Thread Brian Topping

Hi Casey,

I set up a completely fresh cluster on a new VM host.. everything is fresh 
fresh fresh. I feel like it installed cleanly and because there is practically 
zero latency and unlimited bandwidth as peer VMs, this is a better place to 
experiment. The behavior is the same as the other cluster.

The realm is “example-test”, has a single zone group named “us”, and there are 
zones “left” and “right”. The master zone is “left” and I am trying to 
unidirectionally replicate to “right”. “left” is a two node cluster and right 
is a single node cluster. Both show "too few PGs per OSD” but are otherwise 
100% active+clean. Both clusters have been completely restarted to make sure 
there are no latent config issues, although only the RGW nodes should require 
that. 

The thread at [1] is the most involved engagement I’ve found with a staff 
member on the subject, so I checked and believe I attached all the logs that 
were requested there. They all appear to be consistent and are attached below.

For start: 
> [root@right01 ~]# radosgw-admin sync status
>   realm d5078dd2-6a6e-49f8-941e-55c02ad58af7 (example-test)
>   zonegroup de533461-2593-45d2-8975-99072d860bb2 (us)
>zone 5dc80bbc-3d9d-46d5-8f3e-4611fbc17fbe (right)
>   metadata sync syncing
> full sync: 0/64 shards
> incremental sync: 64/64 shards
> metadata is caught up with master
>   data sync source: 479d3f20-d57d-4b37-995b-510ba10756bf (left)
> syncing
> full sync: 0/128 shards
> incremental sync: 128/128 shards
> data is caught up with source


I tried the information at [2] and do not see any ops in progress, just 
“linger_ops”. I don’t know what those are, but probably explain the slow stream 
of requests back and forth between the two RGW endpoints:
> [root@right01 ~]# ceph daemon client.rgw.right01.54395.94074682941968 
> objecter_requests
> {
> "ops": [],
> "linger_ops": [
> {
> "linger_id": 2,
> "pg": "2.16dafda0",
> "osd": 0,
> "object_id": "notify.1",
> "object_locator": "@2",
> "target_object_id": "notify.1",
> "target_object_locator": "@2",
> "paused": 0,
> "used_replica": 0,
> "precalc_pgid": 0,
> "snapid": "head",
> "registered": "1"
> },
> ...
> ],
> "pool_ops": [],
> "pool_stat_ops": [],
> "statfs_ops": [],
> "command_ops": []
> }
> 


The next thing I tried is `radosgw-admin data sync run --source-zone=left` from 
the right side. I get bursts of messages of the following form:
> 2019-04-19 21:46:34.281 7f1c006ad580  0 RGW-SYNC:data:sync:shard[1]: ERROR: 
> failed to read remote data log info: ret=-2
> 2019-04-19 21:46:34.281 7f1c006ad580  0 meta sync: ERROR: RGWBackoffControlCR 
> called coroutine returned -2


When I sorted and filtered the messages, each burst has one RGW-SYNC message 
for each of the PGs on the left side identified by the number in “[]”. Since 
left has 128 PGs, these are the numbers between 0-127. The bursts happen about 
once every five seconds.

The packet traces between the nodes during the `data sync run` are mostly 
requests and responses of the following form:
> HTTP GET: 
> http://right01.example.com:7480/admin/log/?type=data=7=true=de533461-2593-45d2-8975-99072d860bb2
>  
> <http://right01.example.com:7480/admin/log/?type=data=7=true=de533461-2593-45d2-8975-99072d860bb2>HTTP
>  404 RESPONSE: 
> {"Code":"NoSuchKey","RequestId":"tx02a01-005cba9593-371d-right","HostId":"371d-right-us”}

When I stop the `data sync run`, these 404s stop, so clearly the `data sync 
run` isn’t changing a state in the rgw, but doing something synchronously. In 
the past, I have done a `data sync init` but it doesn’t seem like doing it 
repeatedly will make a difference so I didn’t do it any more.

NEXT STEPS:

I am working on how to get better logging output from daemons and hope to find 
something in there that will help. If I am lucky, I will find something in 
there and can report back so this thread is useful for others. If I have not 
written back, I probably haven’t found anything, so would be grateful for any 
leads.

Kind regards and thank you!

Brian

[1] 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/013188.html 
<http://lists.ceph.com/pipermail/ceph-users-ceph

Re: [ceph-users] Are there any statistics available on how most production ceph clusters are being used?

2019-04-19 Thread Brian Topping

> On Apr 19, 2019, at 10:59 AM, Janne Johansson  wrote:
> 
> May the most significant bit of your life be positive.

Marc, my favorite thing about open source software is it has a 100% money back 
satisfaction guarantee: If you are not completely satisfied, you can have an 
instant refund, just for waving your arm! :D

Seriously though, Janne is right, for any OSS project. Think of it like a party 
where the some people go home “when it’s over” and some people stick around and 
help clean up. Using myself as an example, I’ve been asking questions about RGW 
multi-site, and now that I have a little more experience with it (not much more 
— it’s not working yet, just where I can see gaps in the documentation), I owe 
it to those that have helped me get here by filling those gaps in the docs. 

That’s where I can start, and when I understand what’s going on with more 
authority, I can go into the source and create changes that alter how it works 
for others to review.

Note in both cases I am proposing concrete changes, which is far more effective 
than trying to describe situations that others may have never been in. Many can 
try to help, but if it is frustrating for them, they will lose interest. Good 
pull requests are never frustrating to understand, even if they need more work 
to handle cases others know about. It’s a more quantitative means of expression.

If that kind of commitment doesn’t sound appealing, buy support contracts. Pay 
back in to the community so that those with passion for the product can do 
exactly what I’ve described here. There’s no shame in that, but users like you 
and me need to be careful with the time of those who have put their lives into 
this, at least until we can put more into the party than we have taken out.

Hope that helps!  :B
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] rgw windows/mac clients shitty, develop a new one?

2019-04-19 Thread Brian :

I've always used the standalone mac and Linux package version. Wasn't aware
of the 'bundled software' in the installers. Ugh. Thanks for pointing it
out.

On Thursday, April 18, 2019, Janne Johansson  wrote:
> https://www.reddit.com/r/netsec/comments/8t4xrl/filezilla_malware/
>
> not saying it definitely is, or isn't malware-ridden, but it sure was
shady at that time.
> I would suggest not pointing people to it.
>
> Den tors 18 apr. 2019 kl 16:41 skrev Brian : :
>>
>> Hi Marc
>>
>> Filezilla has decent S3 support https://filezilla-project.org/
>>
>> ymmv of course!
>>
>> On Thu, Apr 18, 2019 at 2:18 PM Marc Roos 
wrote:
>> >
>> >
>> > I have been looking a bit at the s3 clients available to be used, and I
>> > think they are quite shitty, especially this Cyberduck that processes
>> > files with default reading rights to everyone. I am in the process to
>> > advice clients to use for instance this mountain duck. But I am not to
>> > happy about it. I don't like the fact that everything has default
>> > settings for amazon or other stuff in there for ftp or what ever.
>> >
>> > I am thinking of developing something in-house, more aimed at the ceph
>> > environments, easier/better to use.
>> >
>> > What I can think of:
>> >
>> > - cheaper, free or maybe even opensource
>> > - default settings for your ceph cluster
>> > - only configuration for object storage (no amazon, rackspace,
backblaze
>> > shit)
>> > - default secure settings
>> > - offer in the client only functionality that is available from the
>> > specific ceph release
>> > - integration with the finder / explorer windows
>> >
>> > I am curious who would be interested in a such new client? Maybe better
>> > to send me your wishes directly, and not clutter the mailing list with
>> > this.
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> --
> May the most significant bit of your life be positive.
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] rgw windows/mac clients shitty, develop a new one?

2019-04-18 Thread Brian :

Hi Marc

Filezilla has decent S3 support https://filezilla-project.org/

ymmv of course!

On Thu, Apr 18, 2019 at 2:18 PM Marc Roos  wrote:
>
>
> I have been looking a bit at the s3 clients available to be used, and I
> think they are quite shitty, especially this Cyberduck that processes
> files with default reading rights to everyone. I am in the process to
> advice clients to use for instance this mountain duck. But I am not to
> happy about it. I don't like the fact that everything has default
> settings for amazon or other stuff in there for ftp or what ever.
>
> I am thinking of developing something in-house, more aimed at the ceph
> environments, easier/better to use.
>
> What I can think of:
>
> - cheaper, free or maybe even opensource
> - default settings for your ceph cluster
> - only configuration for object storage (no amazon, rackspace, backblaze
> shit)
> - default secure settings
> - offer in the client only functionality that is available from the
> specific ceph release
> - integration with the finder / explorer windows
>
> I am curious who would be interested in a such new client? Maybe better
> to send me your wishes directly, and not clutter the mailing list with
> this.
>
>
>
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Multi-site replication speed

2019-04-18 Thread Brian Topping

Hi Casey, thanks for this info. It’s been doing something for 36 hours, but not 
updating the status at all. So it either takes a really long time for 
“preparing for full sync” or I’m doing something wrong. This is helpful 
information, but there’s a myriad of states that the system could be in. 

With that, I’m going to set up a lab rig and see if I can build a fully 
replicated state. At that point, I’ll have a better understanding of what a 
working system responds like and maybe I can at least ask better questions, 
hopefully figure it out myself. 

Thanks again! Brian

> On Apr 16, 2019, at 08:38, Casey Bodley  wrote:
> 
> Hi Brian,
> 
> On 4/16/19 1:57 AM, Brian Topping wrote:
>>> On Apr 15, 2019, at 5:18 PM, Brian Topping >> <mailto:brian.topp...@gmail.com>> wrote:
>>> 
>>> If I am correct, how do I trigger the full sync?
>> 
>> Apologies for the noise on this thread. I came to discover the 
>> `radosgw-admin [meta]data sync init` command. That’s gotten me with 
>> something that looked like this for several hours:
>> 
>>> [root@master ~]# radosgw-admin  sync status
>>>   realm 54bb8477-f221-429a-bbf0-76678c767b5f (example)
>>>   zonegroup 8e33f5e9-02c8-4ab8-a0ab-c6a37c2bcf07 (us)
>>>zone b6e32bc8-f07e-4971-b825-299b5181a5f0 (secondary)
>>>   metadata sync preparing for full sync
>>> full sync: 64/64 shards
>>> full sync: 0 entries to sync
>>> incremental sync: 0/64 shards
>>> metadata is behind on 64 shards
>>> behind shards: 
>>> [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63]
>>>   data sync source: 35835cb0-4639-43f4-81fd-624d40c7dd6f (master)
>>> preparing for full sync
>>> full sync: 1/128 shards
>>> full sync: 0 buckets to sync
>>> incremental sync: 127/128 shards
>>> data is behind on 1 shards
>>> behind shards: [0]
>> 
>> I also had the data sync showing a list of “behind shards”, but both of them 
>> sat in “preparing for full sync” for several hours, so I tried 
>> `radosgw-admin [meta]data sync run`. My sense is that was a bad idea, but 
>> neither of the commands seem to be documented and the thread I found them on 
>> indicated they wouldn’t damage the source data.
>> 
>> QUESTIONS at this point:
>> 
>> 1) What is the best sequence of commands to properly start the sync? Does 
>> init just set things up and do nothing until a run is started?
> The sync is always running. Each shard starts with full sync (where it lists 
> everything on the remote, and replicates each), then switches to incremental 
> sync (where it polls the replication logs for changes). The 'metadata sync 
> init' command clears the sync status, but this isn't synchronized with the 
> metadata sync process running in radosgw(s) - so the gateways need to restart 
> before they'll see the new status and restart the full sync. The same goes 
> for 'data sync init'.
>> 2) Are there commands I should run before that to clear out any previous bad 
>> runs?
> Just restart gateways, and you should see progress via 'sync status'.
>> 
>> *Thanks very kindly for any assistance. *As I didn’t really see any 
>> documentation outside of setting up the realms/zones/groups, it seems like 
>> this would be useful information for others that follow.
>> 
>> best, Brian
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Multi-site replication speed

2019-04-15 Thread Brian Topping

> On Apr 15, 2019, at 5:18 PM, Brian Topping  wrote:
> 
> If I am correct, how do I trigger the full sync?

Apologies for the noise on this thread. I came to discover the `radosgw-admin 
[meta]data sync init` command. That’s gotten me with something that looked like 
this for several hours:

> [root@master ~]# radosgw-admin  sync status
>   realm 54bb8477-f221-429a-bbf0-76678c767b5f (example)
>   zonegroup 8e33f5e9-02c8-4ab8-a0ab-c6a37c2bcf07 (us)
>zone b6e32bc8-f07e-4971-b825-299b5181a5f0 (secondary)
>   metadata sync preparing for full sync
> full sync: 64/64 shards
> full sync: 0 entries to sync
> incremental sync: 0/64 shards
> metadata is behind on 64 shards
> behind shards: 
> [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63]
>   data sync source: 35835cb0-4639-43f4-81fd-624d40c7dd6f (master)
> preparing for full sync
> full sync: 1/128 shards
> full sync: 0 buckets to sync
> incremental sync: 127/128 shards
> data is behind on 1 shards
> behind shards: [0]

I also had the data sync showing a list of “behind shards”, but both of them 
sat in “preparing for full sync” for several hours, so I tried `radosgw-admin 
[meta]data sync run`. My sense is that was a bad idea, but neither of the 
commands seem to be documented and the thread I found them on indicated they 
wouldn’t damage the source data. 

QUESTIONS at this point:

1) What is the best sequence of commands to properly start the sync? Does init 
just set things up and do nothing until a run is started?
2) Are there commands I should run before that to clear out any previous bad 
runs?

Thanks very kindly for any assistance. As I didn’t really see any documentation 
outside of setting up the realms/zones/groups, it seems like this would be 
useful information for others that follow.

best, Brian___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Multi-site replication speed

2019-04-15 Thread Brian Topping

I’m starting to wonder if I actually have things configured and working 
correctly, but the light traffic I am seeing is that of an incremental 
replication. That would make sense, the cluster being replicated does not have 
a lot of traffic on it yet. Obviously, without the full replication, the 
incremental is pretty useless.

Here’s the status coming from the secondary side:

> [root@secondary ~]# radosgw-admin  sync status
>   realm 54bb8477-f221-429a-bbf0-76678c767b5f (example)
>   zonegroup 8e33f5e9-02c8-4ab8-a0ab-c6a37c2bcf07 (us)
>zone b6e32bc8-f07e-4971-b825-299b5181a5f0 (secondary)
>   metadata sync syncing
> full sync: 0/64 shards
> incremental sync: 64/64 shards
> metadata is caught up with master
>   data sync source: 35835cb0-4639-43f4-81fd-624d40c7dd6f (master)
> syncing
> full sync: 0/128 shards
> incremental sync: 128/128 shards
> data is caught up with source


If I am correct, how do I trigger the full sync?

Thanks!! Brian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Multi-site replication speed

2019-04-14 Thread Brian Topping



> On Apr 14, 2019, at 2:08 PM, Brian Topping  wrote:
> 
> Every so often I might see the link running at 20 Mbits/sec, but it’s not 
> consistent. It’s probably going to take a very long time at this rate, if 
> ever. What can I do?

Correction: I was looking at statistics on an aggregate interface while my 
laptop was rebuilding a mailbox. The typical transfer is around 60Kbits/sec, 
but as I said, iperf3 can easily push the link between the two points to 
>750Mbits/sec. Also, system load always has >90% idle on both machines...

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] Multi-site replication speed

2019-04-14 Thread Brian Topping

Hi all! I’m finally running with Ceph multi-site per 
http://docs.ceph.com/docs/nautilus/radosgw/multisite/ 
, woo hoo!

I wanted to confirm that the process can be slow. It’s been a couple of hours 
since the sync started and `radosgw-admin sync status` does not report any 
errors, but the speeds are nowhere near link saturation. iperf3 reports 773 
Mbits/sec on the link in TCP mode, latency is about 5ms. 

Every so often I might see the link running at 20 Mbits/sec, but it’s not 
consistent. It’s probably going to take a very long time at this rate, if ever. 
What can I do?

I’m using civetweb without SSL on the gateway endpoints, only one 
master/mon/rgw for each end on Nautilus 14.2.0.

Apologies if I’ve missed some crucial tuning docs or archive messages somewhere 
on the subject.

Thanks! Brian___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] 1/3 mon not working after upgrade to Nautilus

2019-03-25 Thread Brian Topping

Did you check port access from other nodes?  My guess is a forgotten firewall 
re-emerged on that node after reboot. 

Sent from my iPhone

> On Mar 25, 2019, at 07:26, Clausen, Jörn  wrote:
> 
> Hi again!
> 
>> moment, one of my three MONs (the then active one) fell out of the 
> 
> "active one" is of course nonsense, I confused it with MGRs. Which are 
> running okay, btw, on the same three hosts.
> 
> I reverted the MON back to a snapshot (vSphere) before the upgrade, repeated 
> the upgrade, and ended up in the same situation. ceph-mon.log is filled with 
> ~3000 lines per second.
> 
> The only line I can assume has any value to this is
> 
> mon.cephtmon03@-1(probing) e1  my rank is now 2 (was -1)
> 
> What does that mean?
> 
> -- 
> Jörn Clausen
> Daten- und Rechenzentrum
> GEOMAR Helmholtz-Zentrum für Ozeanforschung Kiel
> Düsternbrookerweg 20
> 24105 Kiel
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Migrating a baremetal Ceph cluster into K8s + Rook

2019-02-19 Thread Brian Topping

> On Feb 19, 2019, at 3:30 PM, Vitaliy Filippov  wrote:
> 
> In our russian-speaking Ceph chat we swear "ceph inside kuber" people all the 
> time because they often do not understand in what state their cluster is at 
> all

Agreed 100%. This is a really good way to lock yourself out of your data (and 
maybe lose it), especially if you’re new to Kubernetes and using Rook to manage 
Ceph. 

Some months ago, I was on VMs running on Citrix. Everything is stable on 
Kubernetes and Ceph now, but it’s been a lot of work. I’d suggest starting with 
Kubernetes first, especially if you are going to do this on bare metal. I can 
give you some ideas about how to lay things out if you are running with limited 
hardware.

Brian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Downsizing a cephfs pool

2019-02-08 Thread Brian Topping

Thanks Hector. So many things going through my head and I totally forgot to 
explore if just turning off the warnings (if only until I get more disks) was 
an option. 

This is 1000% more sensible for sure.

> On Feb 8, 2019, at 7:19 PM, Hector Martin  wrote:
> 
> My practical suggestion would be to do nothing for now (perhaps tweaking
> the config settings to shut up the warnings about PGs per OSD). Ceph
> will gain the ability to downsize pools soon, and in the meantime,
> anecdotally, I have a production cluster where we overshot the current
> recommendation by 10x due to confusing documentation at the time, and
> it's doing fine :-)
> 
> Stable multi-FS support is also coming, so really, multiple ways to fix
> your problem will probably materialize Real Soon Now, and in the
> meantime having more PGs than recommended isn't the end of the world.
> 
> (resending because the previous reply wound up off-list)
> 
> On 09/02/2019 10.39, Brian Topping wrote:
>> Thanks again to Jan, Burkhard, Marc and Hector for responses on this. To
>> review, I am removing OSDs from a small cluster and running up against
>> the “too many PGs per OSD problem due to lack of clarity. Here’s a
>> summary of what I have collected on it:
>> 
>> 1. The CephFS data pool can’t be changed, only added to. 
>> 2. CephFS metadata pool might be rebuildable
>>via https://www.spinics.net/lists/ceph-users/msg29536.html, but the
>>post is a couple of years old, and even then, the author stated that
>>he wouldn’t do this unless it was an emergency.
>> 3. Running multiple clusters on the same hardware is deprecated, so
>>there’s no way to make a new cluster with properly-sized pools and
>>cpio across.
>> 4. Running multiple filesystems on the same hardware is considered
>>experimental: 
>> http://docs.ceph.com/docs/master/cephfs/experimental-features/#multiple-filesystems-within-a-ceph-cluster.
>>It’s unclear what permanent changes this will effect on the cluster
>>that I’d like to use moving forward. This would be a second option
>>to mount and cpio across.
>> 5. Importing pools (ie `zpool export …`, `zpool import …`) from other
>>clusters is likely not supported, so even if I created a new cluster
>>on a different machine, getting the pools back in the original
>>cluster is fraught.
>> 6. There’s really no way to tell Ceph where to put pools, so when the
>>new drives are added to CRUSH, everything starts rebalancing unless
>>`max pg per osd` is set to some small number that is already
>>exceeded. But if I start copying data to the new pool, doesn’t it fail?
>> 7. Maybe the former problem can be avoided by changing the weights of
>>the OSDs...
>> 
>> 
>> All these options so far seem either a) dangerous or b) like I’m going
>> to have a less-than-pristine cluster to kick off the next ten years
>> with. Unless I am mistaken in that, the only options are to copy
>> everything at least once or twice more:
>> 
>> 1. Copy everything back off CephFS to a `mdadm` RAID 1 with two of the
>>6TB drives. Blow away the cluster and start over with the other two
>>drives, copy everything back to CephFS, then re-add the freed drive
>>used as a store. Might be done by the end of next week.
>> 2. Create a new, properly sized cluster on a second machine, copy
>>everything over ethernet, then move the drives and the
>>`/var/lib/ceph` and `/etc/ceph` back to the cluster seed.
>> 
>> 
>> I appreciate small clusters are not the target use case of Ceph, but
>> everyone has to start somewhere!
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> 
> -- 
> Hector Martin (hec...@marcansoft.com)
> Public Key: https://mrcn.st/pub

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Downsizing a cephfs pool

2019-02-08 Thread Brian Topping

Thanks again to Jan, Burkhard, Marc and Hector for responses on this. To
review, I am removing OSDs from a small cluster and running up against the “too
many PGs per OSD problem due to lack of clarity. Here’s a summary of what I
have collected on it:

The CephFS data pool can’t be changed, only added to.
CephFS metadata pool might be rebuildable via
https://www.spinics.net/lists/ceph-users/msg29536.html
, but the post is a
couple of years old, and even then, the author stated that he wouldn’t do this
unless it was an emergency.
Running multiple clusters on the same hardware is deprecated, so there’s no way
to make a new cluster with properly-sized pools and cpio across.
Running multiple filesystems on the same hardware is considered experimental:
http://docs.ceph.com/docs/master/cephfs/experimental-features/#multiple-filesystems-within-a-ceph-cluster

.
It’s unclear what permanent changes this will effect on the cluster that I’d
like to use moving forward. This would be a second option to mount and cpio
across.
Importing pools (ie `zpool export …`, `zpool import …`) from other clusters is
likely not supported, so even if I created a new cluster on a different
machine, getting the pools back in the original cluster is fraught.
There’s really no way to tell Ceph where to put pools, so when the new drives
are added to CRUSH, everything starts rebalancing unless `max pg per osd` is
set to some small number that is already exceeded. But if I start copying data
to the new pool, doesn’t it fail?
Maybe the former problem can be avoided by changing the weights of the OSDs...

All these options so far seem either a) dangerous or b) like I’m going to have
a less-than-pristine cluster to kick off the next ten years with. Unless I am
mistaken in that, the only options are to copy everything at least once or
twice more:

Copy everything back off CephFS to a `mdadm` RAID 1 with two of the 6TB drives.
Blow away the cluster and start over with the other two drives, copy everything
back to CephFS, then re-add the freed drive used as a store. Might be done by
the end of next week.
Create a new, properly sized cluster on a second machine, copy everything over
ethernet, then move the drives and the `/var/lib/ceph` and `/etc/ceph` back to
the cluster seed.

I appreciate small clusters are not the target use case of Ceph, but everyone
has to start somewhere!___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Downsizing a cephfs pool

2019-02-08 Thread Brian Topping

Thanks Marc and Burkhard. I think what I am learning is it’s best to copy 
between filesystems with cpio, if not impossible to do it any other way due to 
the “fs metadata in first pool” problem.

FWIW, the mimic docs still describe how to create a differently named cluster 
on the same hardware. But then I see 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021560.html 
 
saying that behavior is deprecated and problematic. 

A hard lesson, but no data was lost. I will set up two machines and a new 
cluster with the larger drives tomorrow.___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Downsizing a cephfs pool

2019-02-08 Thread Brian Topping

Hi Mark, that’s great advice, thanks! I’m always grateful for the knowledge. 

What about the issue with the pools containing a CephFS though? Is it something 
where I can just turn off the MDS, copy the pools and rename them back to the 
original name, then restart the MDS? 

Agreed about using smaller numbers. When I went to using seven disks, I was 
getting warnings about too few PGs per OSD. I’m sure this is something one 
learns to cope with via experience and I’m still picking that up. Had hoped not 
I get in a bind like this so quickly, but hey, here I am again :)

> On Feb 8, 2019, at 01:53, Marc Roos  wrote:
> 
> 
> There is a setting to set the max pg per osd. I would set that 
> temporarily so you can work, create a new pool with 8 pg's and move data 
> over to the new pool, remove the old pool, than unset this max pg per 
> osd.
> 
> PS. I am always creating pools starting 8 pg's and when I know I am at 
> what I want in production I can always increase the pg count.
> 
> 
> 
> -Original Message-
> From: Brian Topping [mailto:brian.topp...@gmail.com] 
> Sent: 08 February 2019 05:30
> To: Ceph Users
> Subject: [ceph-users] Downsizing a cephfs pool
> 
> Hi all, I created a problem when moving data to Ceph and I would be 
> grateful for some guidance before I do something dumb.
> 
> 
> 1.I started with the 4x 6TB source disks that came together as a 
> single XFS filesystem via software RAID. The goal is to have the same 
> data on a cephfs volume, but with these four disks formatted for 
> bluestore under Ceph.
> 2.The only spare disks I had were 2TB, so put 7x together. I sized 
> data and metadata for cephfs at 256 PG, but it was wrong.
> 3.The copy went smoothly, so I zapped and added the original 4x 6TB 
> disks to the cluster.
> 4.I realized what I did, that when the 7x2TB disks were removed, 
> there were going to be far too many PGs per OSD.
> 
> 
> I just read over https://stackoverflow.com/a/39637015/478209, but that 
> addresses how to do this with a generic pool, not pools used by CephFS. 
> It looks easy to copy the pools, but once copied and renamed, CephFS may 
> not recognize them as the target and the data may be lost.
> 
> Do I need to create new pools and copy again using cpio? Is there a 
> better way?
> 
> Thanks! Brian
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] Downsizing a cephfs pool

2019-02-07 Thread Brian Topping

Hi all, I created a problem when moving data to Ceph and I would be grateful 
for some guidance before I do something dumb.

I started with the 4x 6TB source disks that came together as a single XFS 
filesystem via software RAID. The goal is to have the same data on a cephfs 
volume, but with these four disks formatted for bluestore under Ceph.
The only spare disks I had were 2TB, so put 7x together. I sized data and 
metadata for cephfs at 256 PG, but it was wrong.
The copy went smoothly, so I zapped and added the original 4x 6TB disks to the 
cluster.
I realized what I did, that when the 7x2TB disks were removed, there were going 
to be far too many PGs per OSD.

I just read over https://stackoverflow.com/a/39637015/478209 
, but that addresses how to do 
this with a generic pool, not pools used by CephFS. It looks easy to copy the 
pools, but once copied and renamed, CephFS may not recognize them as the target 
and the data may be lost.

Do I need to create new pools and copy again using cpio? Is there a better way?

Thanks! Brian___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Rezising an online mounted ext4 on a rbd - failed

2019-01-30 Thread Brian Godette

Did you mkfs with -O 64bit or have it in the [defaults] section of 
/etc/mke2fs.conf before creating the filesystem? If you didn't 4TB is as big as 
it goes and can't be changed after the fact. If the device is already larger 
than 4TB when you create the filesystem, mkfs does the right then and silently 
enables 64bit.


man ext4



From: ceph-users  on behalf of Götz Reinicke 

Sent: Saturday, January 26, 2019 8:10 AM
To: Ceph Users
Subject: Re: [ceph-users] Rezising an online mounted ext4 on a rbd - failed



Am 26.01.2019 um 14:16 schrieb Kevin Olbrich 
mailto:k...@sv01.de>>:

Am Sa., 26. Jan. 2019 um 13:43 Uhr schrieb Götz Reinicke
mailto:goetz.reini...@filmakademie.de>>:

Hi,

I have a fileserver which mounted a 4TB rbd, which is ext4 formatted.

I grow that rbd and ext4 starting with an 2TB rbd that way:

rbd resize testpool/disk01--size 4194304

resize2fs /dev/rbd0

Today I wanted to extend that ext4 to 8 TB and did:

rbd resize testpool/disk01--size 8388608

resize2fs /dev/rbd0

=> which gives an error: The filesystem is already 1073741824 blocks. Nothing 
to do.


   I bet I missed something very simple. Any hint? Thanks and regards . Götz

Try "partprobe" to read device metrics again.

Did not change anything and did not give any output/log messages.

/Götz


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] One host with 24 OSDs is offline - best way to get it back online

2019-01-26 Thread Brian Topping

I went through this as I reformatted all the OSDs with a much smaller cluster 
last weekend. When turning nodes back on, PGs would sometimes move, only to 
move back, prolonging the operation and system stress. 

What I took away is it’s least overall system stress to have the OSD tree back 
to target state as quickly as safe and practical. Replication will happen as 
replication will, but if the strategy changes midway, it just means the same 
speed of movement over a longer time. 

> On Jan 26, 2019, at 15:41, Chris  wrote:
> 
> It sort of depends on your workload/use case.  Recovery operations can be 
> computationally expensive.  If your load is light because its the weekend you 
> should be able to turn that host back on  as soon as you resolve whatever the 
> issue is with minimal impact.  You can also increase the priority of the 
> recovery operation to make it go faster if you feel you can spare additional 
> IO and it won't affect clients.
> 
> We do this in our cluster regularly and have yet to see an issue (given that 
> we take care to do it during periods of lower client io)
> 
>> On January 26, 2019 17:16:38 Götz Reinicke  
>> wrote:
>> 
>> Hi,
>> 
>> one host out of 10 is down for yet unknown reasons. I guess a power failure. 
>> I could not yet see the server.
>> 
>> The Cluster is recovering and remapping fine, but still has some objects to 
>> process.
>> 
>> My question: May I just switch the server back on and in best case, the 24 
>> OSDs get back online and recovering will do the job without problems.
>> 
>> Or what might be a good way to handle that host? Should I first wait till 
>> the recover is finished?
>> 
>> Thanks for feedback and suggestions - Happy Saturday Night  :) . Regards . 
>> Götz
>> 
>> 
>> --
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Problem with OSDs

2019-01-21 Thread Brian Topping

> On Jan 21, 2019, at 6:47 AM, Alfredo Deza  wrote:
> 
> When creating an OSD, ceph-volume will capture the ID and the FSID and
> use these to create a systemd unit. When the system boots, it queries
> LVM for devices that match that ID/FSID information.

Thanks Alfredo, I see that now. The name comes from the symlink and is passed 
into the script as %i. I should have seen that before, but at best I would have 
done a hacky job of recreating them manually, so in hindsight I’m glad I did 
not see that sooner.

> Is it possible you've attempted to create an OSD and then failed, and
> tried again? That would explain why there would be a systemd unit with
> an FSID that doesn't match. By the output, it does look like
> you have an OSD 1, but with a different FSID (467... instead of
> e3b...). You could try to disable the failing systemd unit with:
> 
>systemctl disable
> ceph-volume@lvm-1-e3bfc69e-a145-4e19-aac2-5f888e1ed2ce.service 
> 
> 
> (Follow up with OSD 3) and then run:
> 
>ceph-volume lvm activate --all

That worked and recovered startup of all four OSDs on the second node. In an 
abundance of caution, I only disabled one of the volumes with systemctl disable 
and then ran ceph-volume lvm activate --all. That cleaned up all of them 
though, so there was nothing left to do.

https://bugzilla.redhat.com/show_bug.cgi?id=1567346#c21 
 helped resolve the 
final issue getting to HEALTH_OK. After rebuilding the mon/mgr node, I did not 
properly clear / restore the firewall. It’s odd that osd tree was reporting 
that two of the OSDs were up and in when the ports for mon/mgr/mds were all 
inaccessible.

I don’t believe there were any failed creation attempts. Cardinal process rule 
with filesystems: Always maintain a known-good state that can be rolled back 
to. If an error comes up that can’t be fully explained, roll back and restart. 
Sometimes a command gets missed by the best of fingers and fully caffeinated 
minds.. :)  I do see that I didn’t do a `ceph osd purge` on the empty/downed 
OSDs that were gracefully `out`. That explains the tree with the even numbered 
OSDs on the rebuilt node. After purging the references to the empty OSDs and 
re-adding the volumes, I am back to full health with all devices and OSDs up/in.

THANK YOU!!! :D___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] quick questions about a 5-node homelab setup

2019-01-21 Thread Brian Topping

> On Jan 18, 2019, at 3:48 AM, Eugen Leitl  wrote:
> 
> 
> (Crossposting this from Reddit /r/ceph , since likely to have more technical 
> audience present here).
> 
> I've scrounged up 5 old Atom Supermicro nodes and would like to run them 
> 365/7 for limited production as RBD with Bluestore (ideally latest 13.2.4 
> Mimic), triple copy redundancy. Underlying OS is a Debian 9 64 bit, minimal 
> install.

The other thing to consider about a lab is “what do you want to learn?” If 
reliability isn’t an issue (ie you aren’t putting your family pictures on it), 
regardless of the cluster technology, you can often learn basics more quickly 
without the overhead of maintaining quorums and all that stuff on day one. So 
at risk of being a heretic, start small, for instance with single mon/manager 
and add more later. 

Adding mons into a running cluster just as unique and valuable of an experience 
as maintaining perfect quorum. Knowing when, why and where to add resources is 
much harder if one builds out a monster cluster from the start. This is to say 
“if there are no bottlenecks to solve for, there is far less learning being 
required”. And when it comes to being proficient with such a critical piece of 
production infrastructure, you’ll want to have as many experiences with the 
system going sideways and bringing it back as you can. 

Production heroes are measured by their uptime statistics, and when things get 
testy, the more cluster-foo you have (regardless of the cluster), the less risk 
you’ll have maintaining perfect stats.

$0.02… 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] Problem with OSDs

2019-01-20 Thread Brian Topping

-0
> brw-rw. 1 ceph ceph 253, 1 Jan 20 16:19 /dev/dm-1
> brw-rw. 1 ceph ceph 253, 2 Jan 20 16:19 /dev/dm-2
> brw-rw. 1 ceph ceph 253, 3 Jan 20 16:19 /dev/dm-3
> brw-rw. 1 ceph ceph 253, 4 Jan 20 16:19 /dev/dm-4
> [root@gw02 ~]# dmsetup ls
> ceph--1f3d4406--af86--4813--8d06--a001c57408fa-osd--block--5c0d0404--390e--4801--94a9--da52c104206f
>(253:1)
> ceph--f5f453df--1d41--4883--b0f8--d662c6ba8bea-osd--block--084cf33d--8a38--4c82--884a--7c88e3161479
>(253:4)
> ceph--033e2bbe--5005--45d9--9ecd--4b541fe010bd-osd--block--e854930d--1617--4fe7--b3cd--98ef284643fd
>(253:2)
> hndc1.centos02-root   (253:0)
> ceph--c7640f3e--0bf5--4d75--8dd4--00b6434c84d9-osd--block--4672bb90--8cea--4580--85f2--1e692811a05a
>(253:3)

How can I debug this? I suspect this is just some kind of a UID swap that that 
happened somewhere, but I don’t know what the chain of truth is through the 
database files to connect the two together and make sure I have the correct OSD 
blocks where the mon expects to find them.

Thanks! Brian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Boot volume on OSD device

2019-01-19 Thread Brian Topping

> On Jan 18, 2019, at 10:58 AM, Hector Martin  wrote:
> 
> Just to add a related experience: you still need 1.0 metadata (that's
> the 1.x variant at the end of the partition, like 0.9.0) for an
> mdadm-backed EFI system partition if you boot using UEFI. This generally
> works well, except on some Dell servers where the firmware inexplicably
> *writes* to the ESP, messing up the RAID mirroring. 

I love this list. You guys are great. I have to admit I was kind of intimidated 
at first, I felt a little unworthy in the face of such cutting-edge tech. 
Thanks to everyone that’s helped with my posts.

Hector, one of the things I was thinking through last night and finally pulled 
the trigger on today was the overhead of various subsystems. LVM does not 
create much overhead, but tiny initial mistakes explode into a lot of wasted 
CPU over the course of a deployment lifetime. So I wanted to review everything 
and thought I would share my notes here.

My main constraint is I had four disks on a single machine to start with and 
any one of the disks should be able to fail without affecting the ability for 
the machine to boot, the bad disk replaced without requiring obscure admin 
skills, and the final recovery to the promised land of “HEALTH_OK”. A single 
machine Ceph deployment is not much better than just using local storage, 
except the ability to later scale out. That’s the use case I’m addressing here.

The first exploration I had was how to optimize for a good balance between 
safety for mon logs, disk usage and performance for the boot partitions. As I 
learned, an OSD can fit in a single partition with no spillover, so I had three 
partitions to work with. `inotifywait -mr /var/lib/ceph/` provided a good 
handle on what was being written to the log and with what frequency and I could 
see that the log was mostly writes.

https://theithollow.com/2012/03/21/understanding-raid-penalty/ 
 provided a 
good background that I did not previously have on the RAID write penalty. I 
combined this with what I learned in 
https://serverfault.com/questions/685289/software-vs-hardware-raid-performance-and-cache-usage/685328#685328

.
 By the end of these two articles, I felt like I knew all the tradeoffs, but 
the final decision really came down to the penalty table in the first article 
and a “RAID penalty” of 2 for RAID 10, which was the same as the penalty for 
RAID 1, but with 50% better storage efficiency.

For the boot partition, there are fewer choices. Specifying anything other than 
RAID 1 will not keep all the copies of /boot both up-to-date and ready to 
seamlessly restart the machine in case of a disk failure. Combined with a 
choice of RAID 10 for the root partition, we are left with a configuration that 
can reliably boot from any single drive failure (maybe two, I don’t know what 
mdadm would do if a “less than perfect storm” happened with one mirror from 
each stripe were to be lost instead of two mirrors from one stripe…)

With this setup, each disk used exactly two partitions and mdadm is using the 
latest MD metadata because Grub2 knows how to deal with everything. As well, 
`sfdisk /dev/sd[abcd]` shows all disks marked with the first partition as 
bootable. Milestone 1 success!

The next piece I was unsure of but didn’t want to spam the list with stuff I 
could just try was how many partitions an OSD would use. Hector mentioned that 
he was using LVM for Bluestore volumes. I privately wondered the value in 
creating LVM VGs when groups did not span disks. But this is exactly what the 
`ceph-deploy osd create` command as documented does in creating Bluestore OSDs. 
Knowing how to wire LVM is not rocket science, but if possible, I wanted to 
avoid as many manual steps as possible. This was a biggie.

And after adding the OSD partitions one after the other, “HEALTH_OK”. w00t!!! 
Final Milestone Success!!

I know there’s no perfect starter configuration for every hardware environment, 
but I thought I would share exactly what I ended up with here for future 
seekers. This has been a fun adventure. 

Next up: Convert my existing two pre-production nodes that need to use this 
layout. Fortunately there’s nothing on the second node except Ceph and I can 
take that one down pretty easily. It will be good practice to gracefully shut 
down the four OSDs on that node without losing any data, reformat the node with 
this pattern, bring it the cluster back to health, then migrate the mon (and 
the workloads) to it while I do the same for the first node. With that, I’ll be 
able to remove these satanic SATADOMs and get back to some real work!! ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Today's DocuBetter meeting topic is... SEO

2019-01-18 Thread Brian Topping

Hi Noah!

With an eye toward improving documentation and community, two things come to 
mind:

1. I didn’t know about this meeting or I would have done my very best to enlist 
my roommate, who probably could have answered these questions very quickly. I 
do know there’s something to do with the metadata tags in the HTML  that 
manages most of this. Web spiders see these tags and know what to do.

2. I realized I really didn’t know there were any Ceph meetings like this and 
thought I would raise awareness to 
https://github.com/kubernetes/community/blob/master/events/community-meeting.md 
<https://github.com/kubernetes/community/blob/master/events/community-meeting.md>,
 where the kubernetes team has created an iCal subscription that one can 
automatically get alerts and updates for upcoming events. Best, they work 
accurately across time zones, so no need to have people doing math ("daylight 
savings time” is a pet peeve, please don’t get me started! :))

Hope this provides some value! 

Brian

> On Jan 18, 2019, at 11:37 AM, Noah Watkins  wrote:
> 
> 1 PM PST / 9 PM GMT
> https://bluejeans.com/908675367
> 
> On Fri, Jan 18, 2019 at 10:31 AM Noah Watkins  wrote:
>> 
>> We'll be discussing SEO for the Ceph documentation site today at the
>> DocuBetter meeting. Currently when Googling or DuckDuckGoing for
>> Ceph-related things you may see results from master, mimic, or what's
>> a dumpling? The goal is figure out what sort of approach we can take
>> to make these results more relevant. If you happen to know a bit about
>> the topic of SEO please join and contribute to the conversation.
>> 
>> Best,
>> Noah
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Boot volume on OSD device

2019-01-18 Thread Brian Topping

> On Jan 18, 2019, at 4:29 AM, Hector Martin  wrote:
> 
> On 12/01/2019 15:07, Brian Topping wrote:
>> I’m a little nervous that BlueStore assumes it owns the partition table and 
>> will not be happy that a couple of primary partitions have been used. Will 
>> this be a problem?
> 
> You should look into using ceph-volume in LVM mode. This will allow you to 
> create an OSD out of any arbitrary LVM logical volume, and it doesn't care 
> about other volumes on the same PV/VG. I'm running BlueStore OSDs sharing PVs 
> with some non-Ceph stuff without any issues. It's the easiest way for OSDs to 
> coexist with other stuff right now.

Very interesting, thanks!

On the subject, I just rediscovered the technique of putting boot and root 
volumes on mdadm-backed stores. The last time I felt the need for this, it was 
a lot of careful planning and commands. 

Now, at least with RHEL/CentOS, it’s now available in Anaconda. As it’s set up 
before mkfs, there’s no manual hackery to reduce the size of a volume to make 
room for the metadata. Even better, one isn’t stuck using metadata 0.9.0 just 
because they need the /boot volume to have the header at the end (grub now 
understands mdadm 1.2 headers). Just be sure /boot is RAID 1 and it doesn’t 
seem to matter what one does with the rest of the volumes. Kernel upgrades 
process correctly as well (another major hassle in the old days since mkinitrd 
had to be carefully managed).

best, B

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Offsite replication scenario

2019-01-16 Thread Brian Topping

> On Jan 16, 2019, at 12:08 PM, Anthony Verevkin  wrote:
> 
> I would definitely see huge value in going to 3 MONs here (and btw 2 on-site 
> MGR and 2 on-site MDS)
> However 350Kbps is quite low and MONs may be latency sensitive, so I suggest 
> you do heavy QoS if you want to use that link for ANYTHING else.
> If you do so, make sure your clients are only listing the on-site MONs so 
> they don't try to read from the off-site MON.
> Still you risk the stability of the cluster if the off-site MON starts 
> lagging. If it's still considered on while lagging, all changes to cluster 
> (osd going up/down, etc) would be blocked by waiting it to commit.

Using QOS is definitely something I hadn’t thought of, thanks. Setting up tc 
wouldn’t be rocket science. I’d probably make sure that the offsite mon wasn’t 
even reachable from clients, just the other mons.

> Even if you choose against an off-site MON, maybe consider 2 on-site MON 
> instead. Yes, you'd double the risk of cluster going to a halt if any one 
> node dies vs one specific node dying. But if that happens you have a manual 
> way of downgrading to a single MON (and you still have your MON's data) vs 
> risking to get stuck with a OSD-only node that had never had MON installed 
> and not having a copy of MON DB.

I had thought through some of that. That’s a good idea of having it ready to go 
though. 

It got also me thinking if it would be practical to have the two local and one 
remote set up and ready to go as you note, but only two running at a time. So 
if I need to take a primary node down, I would re-add the remote, do the 
service on the primary node, bring it back, re-establish mon health, then 
remove the remote. That’s probably great until there’s an actual primary 
failure — no quorum and the out-of-date remote can’t be re-added so one is just 
as bad off. 

> I also see how you want to get the data out for backups.
> Having a third replica off-site definitely won't fly with such bandwidth as 
> it would once again block the IO until committed by the off-site OSD.
> I am not quite sure RBD mirroring would play nicely with this kind of link 
> either. Maybe stick with application-level off-site backups.
> And again, whatever replication/backup strategy you do, need to QoS or else 
> you'd cripple your connection which I assume is used for some other 
> communications as well.

The connection is a consumer-grade gig fiber. It would be great if I could 
optimize it for higher speeds, the locations are only 30 miles from each other! 

It seems at this point that I’m not missing anything, I’m grateful for your 
thoughts! I think I just need to get another node in there, whether through a 
VPS from the provider or another unit. The cost of dealing with the house of 
cards being built seems much higher as soon as there is a single unplanned 
configuration and hours get put into bringing it back to health.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] /var/lib/ceph/mon/ceph-{node}/store.db on mon nodes

2019-01-16 Thread Brian Topping

Thanks guys! This does leave me a little worried that I only have one mon at 
the moment based on reasons in my previous emails in the list (physical limit 
of two nodes at the moment). Going to have to get more creative!

Sent from my iPhone

> On Jan 16, 2019, at 02:56, Wido den Hollander  wrote:
> 
> 
> 
>> On 1/16/19 10:36 AM, Matthew Vernon wrote:
>> Hi,
>> 
>>> On 16/01/2019 09:02, Brian Topping wrote:
>>> 
>>> I’m looking at writes to a fragile SSD on a mon node,
>>> /var/lib/ceph/mon/ceph-{node}/store.db is the big offender at the
>>> moment.
>>> Is it required to be on a physical disk or can it be in tempfs? One
>>> of the log files has paxos strings, so I’m guessing it has to be on
>>> disk for a panic recovery? Are there other options?
>> Yeah, the mon store is worth keeping ;-) It can get quite large with a
>> large cluster and/or big rebalances. We bought some extra storage for
>> our mons and put the mon store onto dedicated storage.
> 
> Yes, this can't be stressed enough. Keep in mind: If you loose the MON
> stores you will effectively loose your cluster and thus data!
> 
> With some tooling you might be able to rebuild your MON store, but
> that's a task you don't want to take.
> 
> Use a DC-grade SSD for your MON stores with enough space (~100GB) and
> you'll be fine.
> 
> Wido
> 
>> 
>> Regards,
>> 
>> Matthew
>> 
>> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] /var/lib/ceph/mon/ceph-{node}/store.db on mon nodes

2019-01-16 Thread Brian Topping

I’m looking at writes to a fragile SSD on a mon node, 
/var/lib/ceph/mon/ceph-{node}/store.db is the big offender at the moment.

Is it required to be on a physical disk or can it be in tempfs? One of the log 
files has paxos strings, so I’m guessing it has to be on disk for a panic 
recovery? Are there other options?

Thanks, Brian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Offsite replication scenario

2019-01-14 Thread Brian Topping

Ah! Makes perfect sense now. Thanks!! 

Sent from my iPhone

> On Jan 14, 2019, at 12:30, Gregory Farnum  wrote:
> 
>> On Fri, Jan 11, 2019 at 10:07 PM Brian Topping  
>> wrote:
>> Hi all,
>> 
>> I have a simple two-node Ceph cluster that I’m comfortable with the care and 
>> feeding of. Both nodes are in a single rack and captured in the attached 
>> dump, it has two nodes, only one mon, all pools size 2. Due to physical 
>> limitations, the primary location can’t move past two nodes at the present 
>> time. As far as hardware, those two nodes are 18-core Xeon with 128GB RAM 
>> and connected with 10GbE. 
>> 
>> My next goal is to add an offsite replica and would like to validate the 
>> plan I have in mind. For it’s part, the offsite replica can be considered 
>> read-only except for the occasional snapshot in order to run backups to 
>> tape. The offsite location is connected with a reliable and secured ~350Kbps 
>> WAN link. 
> 
> Unfortunately this is just not going to work. All writes to a Ceph OSD are 
> replicated synchronously to every replica, all reads are served from the 
> primary OSD for any given piece of data, and unless you do some hackery on 
> your CRUSH map each of your 3 OSD nodes is going to be a primary for about 
> 1/3 of the total data.
> 
> If you want to move your data off-site asynchronously, there are various 
> options for doing that in RBD (either periodic snapshots and export-diff, or 
> by maintaining a journal and streaming it out) and RGW (with the multi-site 
> stuff). But you're not going to be successful trying to stretch a Ceph 
> cluster over that link.
> -Greg
>  
>> 
>> The following presuppositions bear challenge:
>> 
>> * There is only a single mon at the present time, which could be expanded to 
>> three with the offsite location. Two mons at the primary location is 
>> obviously a lower MTBF than one, but  with a third one on the other side of 
>> the WAN, I could create resiliency against *either* a WAN failure or a 
>> single node maintenance event. 
>> * Because there are two mons at the primary location and one at the offsite, 
>> the degradation mode for a WAN loss (most likely scenario due to facility 
>> support) leaves the primary nodes maintaining the quorum, which is 
>> desirable. 
>> * It’s clear that a WAN failure and a mon failure at the primary location 
>> will halt cluster access.
>> * The CRUSH maps will be managed to reflect the topology change.
>> 
>> If that’s a good capture so far, I’m comfortable with it. What I don’t 
>> understand is what to expect in actual use:
>> 
>> * Is the link speed asymmetry between the two primary nodes and the offsite 
>> node going to create significant risk or unexpected behaviors?
>> * Will the performance of the two primary nodes be limited to the speed that 
>> the offsite mon can participate? Or will the primary mons correctly 
>> calculate they have quorum and keep moving forward under normal operation?
>> * In the case of an extended WAN outage (and presuming full uptime on 
>> primary site mons), would return to full cluster health be simply a matter 
>> of time? Are there any limits on how long the WAN could be down if the other 
>> two maintain quorum?
>> 
>> I hope I’m asking the right questions here. Any feedback appreciated, 
>> including blogs and RTFM pointers.
>> 
>> 
>> Thanks for a great product!! I’m really excited for this next frontier!
>> 
>> Brian
>> 
>> > [root@gw01 ~]# ceph -s
>> >  cluster:
>> >id: 
>> >health: HEALTH_OK
>> > 
>> >  services:
>> >mon: 1 daemons, quorum gw01
>> >mgr: gw01(active)
>> >mds: cephfs-1/1/1 up  {0=gw01=up:active}
>> >osd: 8 osds: 8 up, 8 in
>> > 
>> >  data:
>> >pools:   3 pools, 380 pgs
>> >objects: 172.9 k objects, 11 GiB
>> >usage:   30 GiB used, 5.8 TiB / 5.8 TiB avail
>> >pgs: 380 active+clean
>> > 
>> >  io:
>> >client:   612 KiB/s wr, 0 op/s rd, 50 op/s wr
>> > 
>> > [root@gw01 ~]# ceph df
>> > GLOBAL:
>> >SIZEAVAIL   RAW USED %RAW USED 
>> >5.8 TiB 5.8 TiB   30 GiB  0.51 
>> > POOLS:
>> >NAMEID USED%USED MAX AVAIL OBJECTS 
>> >cephfs_metadata 2  264 MiB 0   2.7 TiB1085 
>> >cephfs_data 3  8.3 GiB  0.29   2.7 TiB  171283 
>> >rbd

[ceph-users] Boot volume on OSD device

2019-01-11 Thread Brian Topping

Question about OSD sizes: I have two cluster nodes, each with 4x 800GiB SLC SSD 
using BlueStore. They boot from SATADOM so the OSDs are data-only, but the MLC 
SATADOM have terrible reliability and the SLC are way overpriced for this 
application.

Can I carve off 64GiB of from one of the four drives on a node without causing 
problems? If I understand the strategy properly, this will cause mild extra 
load on the other three drives as the weight goes down on the partitioned 
drive, but it probably won’t be a big deal.

Assuming the correct procedure is documented at 
http://docs.ceph.com/docs/mimic/rados/operations/add-or-rm-osds/, first 
removing the OSD as documented, zap it, carve off the partition of the freed 
drive, then adding the remaining space back in.

I’m a little nervous that BlueStore assumes it owns the partition table and 
will not be happy that a couple of primary partitions have been used. Will this 
be a problem?

Thanks, Brian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] Offsite replication scenario

2019-01-11 Thread Brian Topping

Hi all,

I have a simple two-node Ceph cluster that I’m comfortable with the care and 
feeding of. Both nodes are in a single rack and captured in the attached dump, 
it has two nodes, only one mon, all pools size 2. Due to physical limitations, 
the primary location can’t move past two nodes at the present time. As far as 
hardware, those two nodes are 18-core Xeon with 128GB RAM and connected with 
10GbE. 

My next goal is to add an offsite replica and would like to validate the plan I 
have in mind. For it’s part, the offsite replica can be considered read-only 
except for the occasional snapshot in order to run backups to tape. The offsite 
location is connected with a reliable and secured ~350Kbps WAN link. 

The following presuppositions bear challenge:

* There is only a single mon at the present time, which could be expanded to 
three with the offsite location. Two mons at the primary location is obviously 
a lower MTBF than one, but  with a third one on the other side of the WAN, I 
could create resiliency against *either* a WAN failure or a single node 
maintenance event. 
* Because there are two mons at the primary location and one at the offsite, 
the degradation mode for a WAN loss (most likely scenario due to facility 
support) leaves the primary nodes maintaining the quorum, which is desirable. 
* It’s clear that a WAN failure and a mon failure at the primary location will 
halt cluster access.
* The CRUSH maps will be managed to reflect the topology change.

If that’s a good capture so far, I’m comfortable with it. What I don’t 
understand is what to expect in actual use:

* Is the link speed asymmetry between the two primary nodes and the offsite 
node going to create significant risk or unexpected behaviors?
* Will the performance of the two primary nodes be limited to the speed that 
the offsite mon can participate? Or will the primary mons correctly calculate 
they have quorum and keep moving forward under normal operation?
* In the case of an extended WAN outage (and presuming full uptime on primary 
site mons), would return to full cluster health be simply a matter of time? Are 
there any limits on how long the WAN could be down if the other two maintain 
quorum?

I hope I’m asking the right questions here. Any feedback appreciated, including 
blogs and RTFM pointers.


Thanks for a great product!! I’m really excited for this next frontier!

Brian

> [root@gw01 ~]# ceph -s
>  cluster:
>id: 
>health: HEALTH_OK
> 
>  services:
>mon: 1 daemons, quorum gw01
>mgr: gw01(active)
>mds: cephfs-1/1/1 up  {0=gw01=up:active}
>osd: 8 osds: 8 up, 8 in
> 
>  data:
>pools:   3 pools, 380 pgs
>objects: 172.9 k objects, 11 GiB
>usage:   30 GiB used, 5.8 TiB / 5.8 TiB avail
>pgs: 380 active+clean
> 
>  io:
>client:   612 KiB/s wr, 0 op/s rd, 50 op/s wr
> 
> [root@gw01 ~]# ceph df
> GLOBAL:
>SIZEAVAIL   RAW USED %RAW USED 
>5.8 TiB 5.8 TiB   30 GiB  0.51 
> POOLS:
>NAMEID USED%USED MAX AVAIL OBJECTS 
>cephfs_metadata 2  264 MiB 0   2.7 TiB1085 
>cephfs_data 3  8.3 GiB  0.29   2.7 TiB  171283 
>rbd 4  2.0 GiB  0.07   2.7 TiB 542 
> [root@gw01 ~]# ceph osd tree
> ID CLASS WEIGHT  TYPE NAME STATUS REWEIGHT PRI-AFF 
> -1   5.82153 root default  
> -3   2.91077 host gw01 
> 0   ssd 0.72769 osd.0 up  1.0 1.0 
> 2   ssd 0.72769 osd.2 up  1.0 1.0 
> 4   ssd 0.72769 osd.4 up  1.0 1.0 
> 6   ssd 0.72769 osd.6 up  1.0 1.0 
> -5   2.91077 host gw02 
> 1   ssd 0.72769 osd.1 up  1.0 1.0 
> 3   ssd 0.72769 osd.3 up  1.0 1.0 
> 5   ssd 0.72769 osd.5 up  1.0 1.0 
> 7   ssd 0.72769 osd.7 up  1.0 1.0 
> [root@gw01 ~]# ceph osd df
> ID CLASS WEIGHT  REWEIGHT SIZEUSE AVAIL   %USE VAR  PGS 
> 0   ssd 0.72769  1.0 745 GiB 4.9 GiB 740 GiB 0.66 1.29 115 
> 2   ssd 0.72769  1.0 745 GiB 3.1 GiB 742 GiB 0.42 0.82  83 
> 4   ssd 0.72769  1.0 745 GiB 3.6 GiB 742 GiB 0.49 0.96  90 
> 6   ssd 0.72769  1.0 745 GiB 3.5 GiB 742 GiB 0.47 0.93  92 
> 1   ssd 0.72769  1.0 745 GiB 3.4 GiB 742 GiB 0.46 0.90  76 
> 3   ssd 0.72769  1.0 745 GiB 3.9 GiB 741 GiB 0.52 1.02 102 
> 5   ssd 0.72769  1.0 745 GiB 3.9 GiB 741 GiB 0.52 1.02  98 
> 7   ssd 0.72769  1.0 745 GiB 4.0 GiB 741 GiB 0.54 1.06 104 
>TOTAL 5.8 TiB  30 GiB 5.8 TiB 0.51  
> MIN/MAX VAR: 0.82/1.29  STDDEV: 0.07
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] list admin issues

2018-12-22 Thread Brian :

Sorry to drag this one up again.

Just got the unsubscribed due to excessive bounces thing.

'Your membership in the mailing list ceph-users has been disabled due
to excessive bounces The last bounce received from you was dated
21-Dec-2018.  You will not get any more messages from this list until
you re-enable your membership.  You will receive 3 more reminders like
this before your membership in the list is deleted.'

can anyone check MTA logs to see what the bounce is?


On Tue, Nov 6, 2018 at 4:24 PM Janne Johansson  wrote:
>
> Den lör 6 okt. 2018 kl 15:06 skrev Elias Abacioglu
> :
> > I'm bumping this old thread cause it's getting annoying. My membership get 
> > disabled twice a month.
> > Between my two Gmail accounts I'm in more than 25 mailing lists and I see 
> > this behavior only here. Why is only ceph-users only affected? Maybe 
> > Christian was on to something, is this intentional?
> > Reality is that there is a lot of ceph-users with Gmail accounts, perhaps 
> > it wouldn't be so bad to actually trying to figure this one out?
> > So can the maintainers of this list please investigate what actually gets 
> > bounced? Look at my address if you want.
> > I got disabled 20181006, 20180927, 20180916, 20180725, 20180718 most 
> > recently.
>
> Guess it's time for it again.
>
> --
> May the most significant bit of your life be positive.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] JBOD question

2018-07-20 Thread Brian :

Hi Satish

You should be able to choose different modes of operation for each
port / disk. Most dell servers will let you do RAID and JBOD in
parallel.

If you can't do that and can only either turn RAID on or off then you
can use SW RAID for your OS


On Fri, Jul 20, 2018 at 9:01 PM, Satish Patel  wrote:
> Folks,
>
> I never used JBOD mode before and now i am planning so i have stupid
> question if i switch RAID controller to JBOD mode in that case how
> does my OS disk will get mirror?
>
> Do i need to use software raid for OS disk when i use JBOD mode?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Ceph snapshots

2018-06-27 Thread Brian :

Hi John

Have you looked at ceph documentation?

RBD: http://docs.ceph.com/docs/luminous/rbd/rbd-snapshot/

The ceph project documentation is really good for most areas. Have a
look at what you can find then come back with more specific questions!

Thanks
Brian




On Wed, Jun 27, 2018 at 2:24 PM, John Molefe  wrote:
> Hi everyone
>
> I would like some advice and insight into how ceph snapshots work and how it
> can be setup.
>
> Responses will be much appreciated.
>
> Thanks
> John
>
> Vrywaringsklousule / Disclaimer:
> http://www.nwu.ac.za/it/gov-man/disclaimer.html
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Ceph Mimic on CentOS 7.5 dependency issue (liboath)

2018-06-23 Thread Brian :

Hi Stefan

$ sudo yum provides liboath
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: mirror.strencom.net
 * epel: mirror.sax.uk.as61049.net
 * extras: mirror.strencom.net
 * updates: mirror.strencom.net
liboath-2.4.1-9.el7.x86_64 : Library for OATH handling
Repo: epel



On Sat, Jun 23, 2018 at 9:02 AM, Stefan Kooman  wrote:
> Hi list,
>
> I'm trying to install "Ceph mimic" on a CentOS 7.5 client (base
> install). I Added the "rpm-mimic" repo from our mirror and tried to
> install ceph-common, but I run into a dependency problem:
>
> --> Finished Dependency Resolution
> Error: Package: 2:ceph-common-13.2.0-0.el7.x86_64 
> (ceph.download.bit.nl_rpm-mimic_el7_x86_64)
>Requires: liboath.so.0()(64bit)
> Error: Package: 2:ceph-common-13.2.0-0.el7.x86_64 
> (ceph.download.bit.nl_rpm-mimic_el7_x86_64)
>Requires: liboath.so.0(LIBOATH_1.10.0)(64bit)
> Error: Package: 2:ceph-common-13.2.0-0.el7.x86_64 
> (ceph.download.bit.nl_rpm-mimic_el7_x86_64)
>Requires: liboath.so.0(LIBOATH_1.2.0)(64bit)
> Error: Package: 2:librgw2-13.2.0-0.el7.x86_64 
> (ceph.download.bit.nl_rpm-mimic_el7_x86_64)
>
> Is this "oath" package something I need to install from a 3rd party repo?
>
> Gr. Stefan
>
>
> --
> | BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] HDD-only performance, how far can it be sped up ?

2018-06-20 Thread Brian :

Hi Wladimir,

A combination of slow enough clock speed , erasure code, single node
and SATA spinners is probably going to lead to not a really great
evaluation. Some of the experts will chime in here with answers to
your specific questions I"m sure but this test really isn't ever going
to give great results.

Brian

On Wed, Jun 20, 2018 at 8:28 AM, Wladimir Mutel  wrote:
> Dear all,
>
> I set up a minimal 1-node Ceph cluster to evaluate its performance. We
> tried to save as much as possible on the hardware, so now the box has Asus
> P10S-M WS motherboard, Xeon E3-1235L v5 CPU, 64 GB DDR4 ECC RAM and 8x3TB
> HDDs (WD30EFRX) connected to on-board SATA ports. Also we are trying to save
> on storage redundancy, so for most of our RBD images we use erasure-coded
> data-pool (default profile, jerasure 2+1) instead of 3x replication. I
> started with Luminous/Xenial 12.2.5 setup which initialized my OSDs as
> Bluestore during deploy, then updated it to Mimic/Bionic 13.2.0. Base OS is
> Ubuntu 18.04 with kernel updated to 4.17.2 from Ubuntu mainline PPA.
>
> With this setup, I created a number of RBD images to test iSCSI, rbd-nbd
> and QEMU+librbd performance (running QEMU VMs on the same box). And that
> worked moderately well as far as data volume transferred within one session
> was limited. The fastest transfers I had with 'rbd import' which pulled an
> ISO image file at up to 25 MBytes/sec from the remote CIFS share over
> Gigabit Ethernet and stored it into EC data-pool. Windows 2008 R2 & 2016
> setup, update installation, Win 2008 upgrade to 2012 and to 2016 within QEMU
> VM also went through tolerably well. I found that cache=writeback gives the
> best performance with librbd, unlike cache=unsafe which gave the best
> performance with VMs on plain local SATA drives. Also I have a subjective
> feeling (not confirmed by exact measurements) that providing a huge
> libRBD cache (like, cache size = 1GB, max dirty = 7/8GB, max dirty age = 60)
> improved Windows VM performance on bursty writes (like, during Windows
> update installations) as well as on reboots (due to cached reads).
>
> Now, what discouraged me, was my next attempt to clone an NTFS partition
> of ~2TB from a physical drive (via USB3-SATA3 convertor) to a partition on
> an RBD image. I tried to map RBD image with rbd-nbd either locally or
> remotely over Gigabit Ethernet, and the fastest speed I got with ntfsclone
> was about 8 MBytes/sec. Which means that it could spend up to 3 days copying
> these ~2TB of NTFS data. I thought about running
> ntfsclone /dev/sdX1 -o - | rbd import ... - , but ntfsclone needs to rewrite
> a part of existing RBD image starting from certain offset, so I decided this
> was not a solution in my situation. Now I am thinking about taking out one
> of OSDs and using it as a 'bcache' for this operation, but I am not sure how
> good is bcache performance with cache on rotating HDD. I know that keeping
> OSD logs and RocksDB on the same HDD creates a seeky workload which hurts
> overall transfer performance.
>
> Also I am thinking about a number of next-close possibilities, and I
> would like to hear your opinions on the benefits and drawbacks of each of
> them.
>
> 1. Would iSCSI access to that RBD image improve my performance (compared
> to rbd-nbd) ? I did not check that yet, but I noticed that Windows
> transferred about 2.5 MBytes/sec while formatting NTFS volume on this RBD
> attached to it by iSCSI. So, for seeky/sparse workloads like NTFS formatting
> the performance was not great.
>
> 2. Would it help to run ntfsclone in Linux VM, with RBD image accessed
> through QEMU+librbd ? (also going to measure that myself)
>
> 3. Is there any performance benefits in using Ceph cache-tier pools with
> my setup ? I hear now use of this technique is advised against, no?
>
> 4. We have an unused older box (Supermicro X8SIL-F mobo, Xeon X3430 CPU,
> 32 GB of DDR3 ECC RAM, 6 onboard SATA ports, used from 2010 to 2017, in
> perfectly working condition) which can be stuffed with up to 6 SATA HDDs and
> added to this Ceph cluster, so far with only Gigabit network interconnect.
> Like, move 4 OSDs out of first box into it, to have 2 boxes with 4 HDDs
> each. Is this going to improve Ceph performance with the setup described
> above ?
>
> 5. I hear that RAID controllers like Adaptec 5805, LSI 2108 provide
> better performance with SATA HDDs exported as JBODs than onboard SATA AHCI
> controllers due to more aggressive caching and reordering requests. Is this
> true ?
>
> 6. On the local market we can buy Kingston KC1000/960GB NVMe drive for
> moderately reasonable price. Its specification has rewrite limit of 1 PB and
> 0.58 DWPD (drive rewrite per day). Is there a

Re: [ceph-users] PM1633a

2018-06-18 Thread Brian :

Thanks Paul Wido and Konstantin! If we give them a go I'll share some
test results.


On Sat, Jun 16, 2018 at 12:09 PM, Konstantin Shalygin  wrote:
> Hello List - anyone using these drives and have any good  / bad things
> to say about them?
>
>
> A few moths ago I was asking about PM1725
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024950.html
>
> No feedback, so I bought HGST SN260, because on the same price is better for
> write iops (200k vs 180k on Samsung) and SN260 is MLC memory, Samsung is
> TLC.
>
>
>
> k
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] PM1633a

2018-06-15 Thread Brian :

Hello List - anyone using these drives and have any good  / bad things
to say about them?

Thanks!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] CephFS Single Threaded Performance

2018-02-26 Thread Brian Woods

I have a small test cluster (just two nodes) and after rebuilding it
several times I found my latest configuration that SHOULD be the fastest is
by far the slowest (per thread).


I have around 10 spinals that I have an erasure encoded CephFS on. When I
installed several SSDs and recreated it with the meta data and the write
cache on SSD my performance plummeted from about 10-20MBps to 2-3MBps, but
only per thread… I did a rados benchmark and the SSDs Meta and Write pools
can sustain anywhere from 50 to 150MBps without issue.


And, if I spool up multiple copies to the FS, each copy adds to that
throughput without much of a hit. In fact I can go up to about 8 copied
(about 16MBps) before they start slowing down at all. Even while I have
several threads actively writing I still benchmark around 25MBps.


Any ideas why single threaded performance would take a hit like this?
Almost everything is running on a single node (just a few OSDs on another
node) and I have plenty of RAM (96GBs) and CPU (8 Xeon Cores).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] mon service failed to start

2018-02-21 Thread Brian :

Hello

Wasn't this originally an issue with mon store now you are getting a
checksum error from an OSD? I think some hardware here in this node is just
hosed.


On Wed, Feb 21, 2018 at 5:46 PM, Behnam Loghmani 
wrote:

> Hi there,
>
> I changed SATA port and cable of SSD disk and also update ceph to version
> 12.2.3 and rebuild OSDs
> but when recovery starts OSDs failed with this error:
>
>
> 2018-02-21 21:12:18.037974 7f3479fe2d00 -1 bluestore(/var/lib/ceph/osd/ceph-7)
> _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x84c097b0,
> expected 0xaf1040a2, device location [0x1~1000], logical extent
> 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
> 2018-02-21 21:12:18.038002 7f3479fe2d00 -1 osd.7 0 OSD::init() : unable to
> read osd superblock
> 2018-02-21 21:12:18.038009 7f3479fe2d00  1 bluestore(/var/lib/ceph/osd/ceph-7)
> umount
> 2018-02-21 21:12:18.038282 7f3479fe2d00  1 stupidalloc 0x0x55e99236c620
> shutdown
> 2018-02-21 21:12:18.038308 7f3479fe2d00  1 freelist shutdown
> 2018-02-21 21:12:18.038336 7f3479fe2d00  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/
> AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/
> MACHINE_SIZE/huge/release/12.2.3/rpm/el7/BUILD/ceph-12.2.3/src/rocksdb/db/db_impl.cc:217]
> Shutdown: ca
> nceling all background work
> 2018-02-21 21:12:18.041561 7f3465561700  4 rocksdb: (Original Log Time
> 2018/02/21-21:12:18.041514) [/home/jenkins-build/build/wor
> kspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABL
> E_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.
> 2.3/rpm/el7/BUILD/ceph-12.
> 2.3/src/rocksdb/db/compaction_job.cc:621] [default] compacted to: base
> level 1 max bytes base 268435456 files[5 0 0 0 0 0 0] max score 0.00,
> MB/sec: 2495.2 rd, 10.1 wr, level 1, files in(5, 0) out(1) MB in(213.6,
> 0.0) out(0.9), read-write-amplify(1.0) write-amplify(0.0) S
> hutdown in progress: Database shutdown or Column
> 2018-02-21 21:12:18.041569 7f3465561700  4 rocksdb: (Original Log Time
> 2018/02/21-21:12:18.041545) EVENT_LOG_v1 {"time_micros": 1519234938041530,
> "job": 3, "event": "compaction_finished", "compaction_time_micros": 89747,
> "output_level": 1, "num_output_files": 1, "total_ou
> tput_size": 902552, "num_input_records": 4470, "num_output_records": 4377,
> "num_subcompactions": 1, "num_single_delete_mismatches": 0,
> "num_single_delete_fallthrough": 44, "lsm_state": [5, 0, 0, 0, 0, 0, 0]}
> 2018-02-21 21:12:18.041663 7f3479fe2d00  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1519234938041657, "job": 4, "event": "table_file_deletion",
> "file_number": 249}
> 2018-02-21 21:12:18.042144 7f3479fe2d00  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/
> AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/
> MACHINE_SIZE/huge/release/12.2.3/rpm/el7/BUILD/ceph-12.2.3/src/rocksdb/db/db_impl.cc:343]
> Shutdown com
> plete
> 2018-02-21 21:12:18.043474 7f3479fe2d00  1 bluefs umount
> 2018-02-21 21:12:18.043775 7f3479fe2d00  1 stupidalloc 0x0x55e991f05d40
> shutdown
> 2018-02-21 21:12:18.043784 7f3479fe2d00  1 stupidalloc 0x0x55e991f05db0
> shutdown
> 2018-02-21 21:12:18.043786 7f3479fe2d00  1 stupidalloc 0x0x55e991f05e20
> shutdown
> 2018-02-21 21:12:18.043826 7f3479fe2d00  1 bdev(0x55e992254600
> /dev/vg0/wal-b) close
> 2018-02-21 21:12:18.301531 7f3479fe2d00  1 bdev(0x55e992255800
> /dev/vg0/db-b) close
> 2018-02-21 21:12:18.545488 7f3479fe2d00  1 bdev(0x55e992254400
> /var/lib/ceph/osd/ceph-7/block) close
> 2018-02-21 21:12:18.650473 7f3479fe2d00  1 bdev(0x55e992254000
> /var/lib/ceph/osd/ceph-7/block) close
> 2018-02-21 21:12:18.93 7f3479fe2d00 -1  ** ERROR: osd init failed:
> (22) Invalid argument
>
>
> On Wed, Feb 21, 2018 at 5:06 PM, Behnam Loghmani <
> behnam.loghm...@gmail.com> wrote:
>
>> but disks pass all the tests with smartctl, badblocks and there isn't any
>> error on disks. because the ssd has contain WAL/DB of OSDs it's difficult
>> to test it on other cluster nodes
>>
>> On Wed, Feb 21, 2018 at 4:58 PM,  wrote:
>>
>>> Could the problem be related with some faulty hardware (RAID-controller,
>>> port, cable) but not disk? Does "faulty" disk works OK on other server?
>>>
>>> Behnam Loghmani wrote on 21/02/18 16:09:
>>>
 Hi there,

 I changed the SSD on the problematic node with the new one and
 reconfigure OSDs and MON service on it.
 but the problem occurred again with:

 "rocksdb: submit_transaction error: Corruption: block checksum mismatch
 code = 2"

 I get fully confused now.



 On Tue, Feb 20, 2018 at 5:16 PM, Behnam Loghmani <
 behnam.loghm...@gmail.com > wrote:

 Hi Caspar,

 I checked the filesystem and there isn't any error on filesystem.
 The disk is SSD and it doesn't any attribute related to Wear level
 in smartctl and filesystem is
 mounted with default options and

[ceph-users] data_digest_mismatch_oi with missing object and I/O errors (repaired!)

2018-01-17 Thread Brian Andrus

We recently had a few inconsistent PGs crop up on one of our clusters, and
I wanted to describe the process used to repair them for review and perhaps
to help someone in the future.

Our state roughly matched David's described comment here:

http://tracker.ceph.com/issues/21388#note-1

However, we were missing the object entirely on the primary OSD. This may
have been due to previous manual repair attempts, but the exact cause of
the missing object is unclear.

In order to get the PG into a state consistent with David's comment, I
exported the perceived "good" copy of the PG using ceph-objectstore-tool
and imported it to the primary OSD.

At this point, a repair would consistently cause an empty listing in "rados
list-inconsistent-obj" (but still inconsistent), and a deep-scrub would
cause the "list-inconsistent-obj" state to appear as David described.
However, "rados get" resulted in I/O errors.

I again used ceph-objectstore-tool with the "get-bytes" option to dump the
object contents to a file and "rados put" that.

It seemed to work and the customer's VM hasn't noticed anything awry yet...
but then again it wasn't prior to this either. Seems the right data is in
place and the PG is consistent after a deep-scrub.

Pretty standard stuff, but might help with alternative ways of dumping byte
data in the future as long as others don't see an issue with this. I see at
least one other with the same I/O error on the bug.

--
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] Ceph not reclaiming space or overhead?

2017-12-21 Thread Brian Woods

I will start with I am very new to ceph and am trying to teach myself the
ins and outs.  While doing this I have been creating and destroying pools
as I experiment on some test hardware.  Something I noticed was that when a
pool is deleted, the space is not always freed 100%.  This is true even
after days of idle time.

Right now with 7 OSD and a few empty pools I have 70GBs of raw spaced used.

Now, I am not sure if this is normal, but I did migrate my OSDs to
bluestore and have been adding OSDs.  So maybe some space is just overhead
for each OSD?  I lost one of my disks and the usage dropped to 70GBs.
Though when I had that failure I got some REALLY odd results from ceph -s…
  Note the number of data objects (242 total) vs. the number of degraded
objects (101 of 726):

--

root@MediaServer:~# ceph -s

 cluster:

id: 26c81563-ee27-4967-a950-afffb795f29e

health: HEALTH_WARN

   1 filesystem is degraded

   insufficient standby MDS daemons available

   1 osds down

   Degraded data redundancy: 101/726 objects degraded (13.912%), 92
pgs unclean, 92 pgs degraded, 92 pgs undersized

 services:

mon: 2 daemons, quorum TheMonolith,MediaServer

mgr: MediaServer.domain(active), standbys: TheMonolith.domain

mds: MediaStoreFS-1/1/1 up  {0=MediaMDS=up:reconnect(laggy or crashed)}

osd: 8 osds: 7 up, 8 in

rgw: 2 daemons active

 data:

pools:   8 pools, 176 pgs

objects: 242 objects, 3568 bytes

usage:   80463 MB used, 10633 GB / 10712 GB avail

pgs: 101/726 objects degraded (13.912%)

92 active+undersized+degraded

84 active+clean

--

After reweighting the failed OSD out:

--

root@MediaServer:/var/log/ceph# ceph -s

 cluster:

id: 26c81563-ee27-4967-a950-afffb795f29e

health: HEALTH_WARN

   1 filesystem is degraded

   insufficient standby MDS daemons available

 services:

mon: 2 daemons, quorum TheMonolith,MediaServer

mgr: MediaServer.domain(active), standbys: TheMonolith.domain

mds: MediaStoreFS-1/1/1 up  {0=MediaMDS=up:reconnect(laggy or crashed)}

osd: 8 osds: 7 up, 7 in

rgw: 2 daemons active

 data:

pools:   8 pools, 176 pgs

objects: 242 objects, 3568 bytes

usage:   71189 MB used, 8779 GB / 8849 GB avail

pgs: 176 active+clean

--

My pools:

--

root@MediaServer:/var/log/ceph# ceph df

GLOBAL:

SIZE  AVAIL RAW USED %RAW USED

8849G 8779G   71189M  0.79

POOLS:

NAME  ID USED %USED MAX AVAIL
OBJECTS

.rgw.root 6  1322 0 3316G
3

default.rgw.control   7 0 0 3316G
11

default.rgw.meta  8 0 0 3316G
0

default.rgw.log   9 0 0 3316G
207

MediaStorePool190 0 5970G
0

MediaStorePool-Meta   20 2246 0 3316G
21

MediaStorePool-WriteCache 210 0 3316G
0

rbd   220 0 4975G
0

--

Am I looking at some sort of a file system leak, or is this normal?


Also, before I deleted (or broke rather) my last pool, I marked OSDs in and
out and tracked the space. The data pool was erasure with 4 data and 1
parity and all data cleared from the cache pool:


Obj Used Total Size
Data Expected Usage Difference Notes

639 10712 417 521.25 -117.75 8 OSDs
337k 636 10246 417 521.25 -114.75 7 OSDs (complete removal, osd 0, 500GB)
337k 629 10712 417 521.25 -107.75 8 OSDs (Wiped and re-added as osd.51002)
337k 631 9780 417 521.25 -109.75 7 OSDs (out, crush removed, osd 5, 1TB)
337k 649 10712 417 521.25 -127.75 8 OSDs (crush add, osd in)
337k 643 9780 417 521.25 -121.75 7 OSDs (out, osd 5, 1TB)
337k 625 9780 417 521.25 -103.75 7 OSDs (crush reweight 0, osd 5, 1TB)

There was enough difference between the in and out of OSDs that I kinda
think something is up. Even with the 80GBs removed from the difference when
I have no data at all, that still leaved me with upwards of 40GBs of
unaccounted for usage...


Debian 9 \ Kernel: 4.4.0-104-generic

ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous
(stable)


Thanks for your input! It's appreciated!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] How to increase the size of requests written to a ceph image

2017-10-27 Thread Brian Andrus

I would be interested in seeing the results from the post mentioned by an
earlier contributor:

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

Test an "old" M500 and a "new" M500 and see if the performance is A)
acceptable and B) comparable. Find hardware revision or firmware revision
in case of A=Good and B=different.

If the "old" device doesn't test well in fio/dd testing, then the drives
are (as expected) not a great choice for journals and you might want to
look at hardware/backplane/RAID configuration differences that are somehow
allowing them to perform adequately.

On Fri, Oct 27, 2017 at 12:36 PM, Russell Glaue <rgl...@cait.org> wrote:

> Yes, all the MD500s we use are both journal and OSD, even the older ones.
> We have a 3 year lifecycle and move older nodes from one ceph cluster to
> another.
> On old systems with 3 year old MD500s, they run as RAID0, and run faster
> than our current problem system with 1 year old MD500s, ran as nonraid
> pass-through on the controller.
>
> All disks are SATA and are connected to a SAS controller. We were
> wondering if the SAS/SATA conversion is an issue. Yet, the older systems
> don't exhibit a problem.
>
> I found what I wanted to know from a colleague, that when the current ceph
> cluster was put together, the SSDs tested at 300+MB/s, and ceph cluster
> writes at 30MB/s.
>
> Using SMART tools, the reserved cells in all drives is nearly 100%.
>
> Restarting the OSDs minorly improved performance. Still betting on
> hardware issues that a firmware upgrade may resolve.
>
> -RG
>
>
> On Oct 27, 2017 1:14 PM, "Brian Andrus" <brian.and...@dreamhost.com>
> wrote:
>
> @Russel, are your "older Crucial M500"s being used as journals?
>
> Crucial M500s are not to be used as a Ceph journal in my last experience
> with them. They make good OSDs with an NVMe in front of them perhaps, but
> not much else.
>
> Ceph uses O_DSYNC for journal writes and these drives do not handle them
> as expected. It's been many years since I've dealt with the M500s
> specifically, but it has to do with the capacitor/power save feature and
> how it handles those types of writes. I'm sorry I don't have the emails
> with specifics around anymore, but last I remember, this was a hardware
> issue and could not be resolved with firmware.
>
> Paging Kyle Bader...
>
> On Fri, Oct 27, 2017 at 9:24 AM, Russell Glaue <rgl...@cait.org> wrote:
>
>> We have older crucial M500 disks operating without such problems. So, I
>> have to believe it is a hardware firmware issue.
>> And its peculiar seeing performance boost slightly, even 24 hours later,
>> when I stop then start the OSDs.
>>
>> Our actual writes are low, as most of our Ceph Cluster based images are
>> low-write, high-memory. So a 20GB/day life/write capacity is a non-issue
>> for us. Only write speed is the concern. Our write-intensive images are
>> locked on non-ceph disks.
>>
>> What are others using for SSD drives in their Ceph cluster?
>> With 0.50+ DWPD (Drive Writes Per Day), the Kingston SEDC400S37 models
>> seems to be the best for the price today.
>>
>>
>>
>> On Fri, Oct 27, 2017 at 6:34 AM, Maged Mokhtar <mmokh...@petasan.org>
>> wrote:
>>
>>> It is quiet likely related, things are pointing to bad disks. Probably
>>> the best thing is to plan for disk replacement, the sooner the better as it
>>> could get worse.
>>>
>>>
>>>
>>> On 2017-10-27 02:22, Christian Wuerdig wrote:
>>>
>>> Hm, no necessarily directly related to your performance problem,
>>> however: These SSDs have a listed endurance of 72TB total data written
>>> - over a 5 year period that's 40GB a day or approx 0.04 DWPD. Given
>>> that you run the journal for each OSD on the same disk, that's
>>> effectively at most 0.02 DWPD (about 20GB per day per disk). I don't
>>> know many who'd run a cluster on disks like those. Also it means these
>>> are pure consumer drives which have a habit of exhibiting random
>>> performance at times (based on unquantified anecdotal personal
>>> experience with other consumer model SSDs). I wouldn't touch these
>>> with a long stick for anything but small toy-test clusters.
>>>
>>> On Fri, Oct 27, 2017 at 3:44 AM, Russell Glaue <rgl...@cait.org> wrote:
>>>
>>>
>>> On Wed, Oct 25, 2017 at 7:09 PM, Maged Mokhtar <mmokh...@petasan.org>
>>> wrote:
>>>
>>>
>>> It depends on what stage you are in:
>>> in producti

Re: [ceph-users] How to increase the size of requests written to a ceph image

2017-10-27 Thread Brian Andrus

>>28  32 29592 29560   4.12325   3.08203
>> 0.00188185
>> 0.0301911
>>29  32 30595 30563   4.11616   3.91797
>> 0.00379099
>> 0.0296794
>>30  32 31031 30999   4.03572   1.70312
>> 0.00283347
>> 0.0302411
>> Total time run: 30.822350
>> Total writes made:  31032
>> Write size: 4096
>> Object size:4096
>> Bandwidth (MB/sec): 3.93282
>> Stddev Bandwidth:   3.66265
>> Max bandwidth (MB/sec): 13.668
>> Min bandwidth (MB/sec): 0
>> Average IOPS:   1006
>> Stddev IOPS:937
>> Max IOPS:   3499
>> Min IOPS:   0
>> Average Latency(s): 0.0317779
>> Stddev Latency(s):  0.164076
>> Max latency(s): 2.27707
>> Min latency(s): 0.0013848
>> Cleaning up (deleting benchmark objects)
>> Clean up completed and total clean up time :20.166559
>>
>>
>>
>>
>> On Wed, Oct 18, 2017 at 8:51 AM, Maged Mokhtar
>> <mmokh...@petasan.org>
>> wrote:
>>
>> First a general comment: local RAID will be faster than Ceph
>> for a
>> single threaded (queue depth=1) io operation test. A single
>> thread Ceph
>> client will see at best same disk speed for reads and for
>> writes 4-6 times
>> slower than single disk. Not to mention the latency of local
>> disks will
>> much better. Where Ceph shines is when you have many
>> concurrent ios, it
>> scales whereas RAID will decrease speed per client as you add
>> more.
>>
>> Having said that, i would recommend running rados/rbd
>> bench-write
>> and measure 4k iops at 1 and 32 threads to get a better idea
>> of how your
>> cluster performs:
>>
>> ceph osd pool create testpool 256 256
>> rados bench -p testpool -b 4096 30 write -t 1
>> rados bench -p testpool -b 4096 30 write -t 32
>> ceph osd pool delete testpool testpool
>> --yes-i-really-really-mean-it
>>
>> rbd bench-write test-image --io-threads=1 --io-size 4096
>> --io-pattern rand --rbd_cache=false
>> rbd bench-write test-image --io-threads=32 --io-size 4096
>> --io-pattern rand --rbd_cache=false
>>
>> I think the request size difference you see is due to the io
>> scheduler in the case of local disks having more ios to
>> re-group so has a
>> better chance in generating larger requests. Depending on
>> your kernel, the
>> io scheduler may be different for rbd (blq-mq) vs sdx (cfq)
>> but again i
>> would think the request size is a result not a cause.
>>
>> Maged
>>
>> On 2017-10-17 23:12, Russell Glaue wrote:
>>
>> I am running ceph jewel on 5 nodes with SSD OSDs.
>> I have an LVM image on a local RAID of spinning disks.
>> I have an RBD image on in a pool of SSD disks.
>> Both disks are used to run an almost identical CentOS 7
>> system.
>> Both systems were installed with the same kickstart, though
>> the disk
>> partitioning is different.
>>
>> I want to make writes on the the ceph image faster. For
>> example,
>> lots of writes to MySQL (via MySQL replication) on a ceph SSD
>> image are
>> about 10x slower than on a spindle RAID disk image. The MySQL
>> server on
>> ceph rbd image has a hard time keeping up in replication.
>>
>> So I wanted to test writes on these two systems
>> I have a 10GB compressed (gzip) file on both servers.
>> I simply gunzip the file on both systems, while running
>> iostat.
>>
>> The primary difference I see in the results is the average
>> size of
>> the request to the disk.
>> CentOS7-lvm-raid-sata writes a lot faster to disk, and the
>> size of
>> the request is about 40x, but the number of writes per second
>> is about the
>> same
>> This makes me want to conclude that the smaller size of the
>> request
>> for CentOS7-ceph-rbd-ssd system is the cause of it being
>> slow.
>>
>>
>> How can I make the size of the request larger for ceph rbd
>> images,
>> so I can increase the write throughput?
>> Would this be related to having jumbo packets enabled in my
>> ceph
>> storage network?
>>
>>
>> Here is a sample of the results:
>>
>> [CentOS7-lvm-raid-sata]
>> $ gunzip large10gFile.gz &
>> $ iostat -x vg_root-lv_var -d 5 -m -N
>> Device: rrqm/s   wrqm/s r/s w/srMB/s
>> wMB/s
>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> ...
>> vg_root-lv_var 0.00 0.00   30.60  452.2013.60
>> 222.15
>>  1000.04 8.69   14.050.99   14.93   2.07 100.04
>> vg_root-lv_var 0.00 0.00   88.20  182.0039.20
>> 89.43
>> 974.95 4.659.820.99   14.10   3.70 100.00
>> vg_root-lv_var 0.00 0.00   75.45  278.2433.53
>> 136.70
>> 985.73 4.36   33.261.34   41.91   0.59  20.84
>> vg_root-lv_var 0.00 0.00  111.60  181.8049.60
>> 89.34
>> 969.84 2.608.870.81   13.81   0.13   3.90
>> vg_root-lv_var 0.00 0.00   68.40  109.6030.40
>> 53.63
>> 966.87 1.518.460.84   13.22   0.80  14.16
>> ...
>>
>> [CentOS7-ceph-rbd-ssd]
>> $ gunzip large10gFile.gz &
>> $ iostat -x vg_root-lv_data -d 5 -m -N
>> Device: rrqm/s   wrqm/s r/s w/srMB/s
>> wMB/s
>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> ...
>> vg_root-lv_data 0.00 0.00   46.40  167.80 0.88
>> 1.46
>>22.36 1.235.662.476.54   4.52  96.82
>> vg_root-lv_data 0.00 0.00   16.60   55.20 0.36
>> 0.14
>>14.44 0.99   13.919.12   15.36  13.71  98.46
>> vg_root-lv_data 0.00 0.00   69.00  173.80 1.34
>> 1.32
>>22.48 1.255.193.775.75   3.94  95.68
>> vg_root-lv_data 0.00 0.00   74.40  293.40 1.37
>> 1.47
>>15.83 1.223.312.063.63   2.54  93.26
>> vg_root-lv_data 0.00 0.00   90.80  359.00 1.96
>> 3.41
>>24.45 1.633.631.944.05   2.10  94.38
>> ...
>>
>> [iostat key]
>> w/s == The number (after merges) of write requests completed
>> per
>> second for the device.
>> wMB/s == The number of sectors (kilobytes, megabytes) written
>> to the
>> device per second.
>> avgrq-sz == The average size (in kilobytes) of the requests
>> that
>> were issued to the device.
>> avgqu-sz == The average queue length of the requests that
>> were
>> issued to the device.
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>>
>>
>> --
>> Christian BalzerNetwork/Systems Engineer
>> ch...@gol.com   Rakuten Communications
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>>
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Why size=3

2017-10-25 Thread Brian Andrus

Apologies, corrected second link:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-March/016663.html

On Wed, Oct 25, 2017 at 9:44 AM, Brian Andrus <brian.and...@dreamhost.com>
wrote:

> Please see the following mailing list topics that have covered this topic
> in detail:
>
> "2x replication: A BIG warning":
> https://www.spinics.net/lists/ceph-users/msg32915.html
>
> "replica questions":
> https://www.spinics.net/lists/ceph-users/msg32915.html
>
> On Wed, Oct 25, 2017 at 9:39 AM, Ian Bobbitt <ibobb...@globalnoc.iu.edu>
> wrote:
>
>> I'm working on the specifications for a very small Ceph cluster for VM
>> images (OpenStack/QEMU-KVM/etc).
>>
>> I'm being asked to justify sticking with the default redundancy levels
>> (size=3, min_size=2) rather than dropping them to
>> size=2, min_size=1.
>>
>> Can someone help me articulate why we should be keeping 3 copies, beyond
>> "it's the default"?
>>
>> -- Ian
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
> Brian Andrus | Cloud Systems Engineer | DreamHost
> brian.and...@dreamhost.com | www.dreamhost.com
>



-- 
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Why size=3

2017-10-25 Thread Brian Andrus

Please see the following mailing list topics that have covered this topic
in detail:

"2x replication: A BIG warning":
https://www.spinics.net/lists/ceph-users/msg32915.html

"replica questions":
https://www.spinics.net/lists/ceph-users/msg32915.html

On Wed, Oct 25, 2017 at 9:39 AM, Ian Bobbitt <ibobb...@globalnoc.iu.edu>
wrote:

> I'm working on the specifications for a very small Ceph cluster for VM
> images (OpenStack/QEMU-KVM/etc).
>
> I'm being asked to justify sticking with the default redundancy levels
> (size=3, min_size=2) rather than dropping them to
> size=2, min_size=1.
>
> Can someone help me articulate why we should be keeping 3 copies, beyond
> "it's the default"?
>
> -- Ian
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Blocked requests

2017-09-07 Thread Brian Andrus

+degraded+remapped+wait_backfill, acting
> [11,18]
>
> pg 3.709 is active+undersized+degraded+remapped+wait_backfill, acting
> [10,4]
>
> 
>
> pg 3.5d8 is active+undersized+degraded+remapped+wait_backfill, acting
> [2,10]
>
> pg 3.5dc is active+undersized+degraded+remapped+wait_backfill, acting
> [8,19]
>
> pg 3.5f8 is active+undersized+degraded+remapped+wait_backfill, acting
> [2,21]
>
> pg 3.624 is active+undersized+degraded+remapped+wait_backfill, acting
> [12,9]
>
> 2 ops are blocked > 1048.58 sec on osd.9
>
> 3 ops are blocked > 65.536 sec on osd.9
>
> 7 ops are blocked > 1048.58 sec on osd.8
>
> 1 ops are blocked > 524.288 sec on osd.8
>
> 1 ops are blocked > 131.072 sec on osd.8
>
> 
>
> 1 ops are blocked > 524.288 sec on osd.2
>
> 1 ops are blocked > 262.144 sec on osd.2
>
> 2 ops are blocked > 65.536 sec on osd.21
>
> 9 ops are blocked > 1048.58 sec on osd.5
>
> 9 ops are blocked > 524.288 sec on osd.5
>
> 71 ops are blocked > 131.072 sec on osd.5
>
> 19 ops are blocked > 65.536 sec on osd.5
>
> 35 ops are blocked > 32.768 sec on osd.5
>
> 14 osds have slow requests
>
> recovery 4678/1097738 objects degraded (0.426%)
>
> recovery 10364/1097738 objects misplaced (0.944%)
>
>
>
>
>
> *From: *David Turner <drakonst...@gmail.com>
> *Date: *Thursday, September 7, 2017 at 11:33 AM
> *To: *Matthew Stroud <mattstr...@overstock.com>, "
> ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
> *Subject: *Re: [ceph-users] Blocked requests
>
>
>
> To be fair, other times I have to go in and tweak configuration settings
> and timings to resolve chronic blocked requests.
>
>
>
> On Thu, Sep 7, 2017 at 1:32 PM David Turner <drakonst...@gmail.com> wrote:
>
> `ceph health detail` will give a little more information into the blocked
> requests.  Specifically which OSDs are the requests blocked on and how long
> have they actually been blocked (as opposed to '> 32 sec').  I usually find
> a pattern after watching that for a time and narrow things down to an OSD,
> journal, etc.  Some times I just need to restart a specific OSD and all is
> well.
>
>
>
> On Thu, Sep 7, 2017 at 10:33 AM Matthew Stroud <mattstr...@overstock.com>
> wrote:
>
> After updating from 10.2.7 to 10.2.9 I have a bunch of blocked requests
> for ‘currently waiting for missing object’. I have tried bouncing the osds
> and rebooting the osd nodes, but that just moves the problems around.
> Previous to this upgrade we had no issues. Any ideas of what to look at?
>
>
>
> Thanks,
>
> Matthew Stroud
>
>
> --
>
>
> CONFIDENTIALITY NOTICE: This message is intended only for the use and
> review of the individual or entity to which it is addressed and may contain
> information that is privileged and confidential. If the reader of this
> message is not the intended recipient, or the employee or agent responsible
> for delivering the message solely to the intended recipient, you are hereby
> notified that any dissemination, distribution or copying of this
> communication is strictly prohibited. If you have received this
> communication in error, please notify sender immediately by telephone or
> return email. Thank you.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> --
>
>
> CONFIDENTIALITY NOTICE: This message is intended only for the use and
> review of the individual or entity to which it is addressed and may contain
> information that is privileged and confidential. If the reader of this
> message is not the intended recipient, or the employee or agent responsible
> for delivering the message solely to the intended recipient, you are hereby
> notified that any dissemination, distribution or copying of this
> communication is strictly prohibited. If you have received this
> communication in error, please notify sender immediately by telephone or
> return email. Thank you.
>
>
> --
>
> CONFIDENTIALITY NOTICE: This message is intended only for the use and
> review of the individual or entity to which it is addressed and may contain
> information that is privileged and confidential. If the reader of this
> message is not the intended recipient, or the employee or agent responsible
> for delivering the message solely to the intended recipient, you are hereby
> notified that any dissemination, distribution or copying of this
> communication is strictly prohibited. If you have received this
> communication in error, please notify sender immediately by telephone or
> return email. Thank you.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] Installing ceph on Centos 7.3

2017-07-18 Thread Brian Wallis

I’m failing to get an install of ceph to work on a new Centos 7.3.1611 server. 
I’m following the instructions at 
http://docs.ceph.com/docs/master/start/quick-ceph-deploy/ to no avail.

First question, is it possible to install ceph on Centos 7.3 or should I choose 
a different version or different linux distribution to use for now?

When I run ceph-deploy on Centos 7.3 I get the following error.

[cephuser@ceph1 my-cluster]$ ceph-deploy install ceph2
[ceph_deploy.conf][DEBUG ] found configuration file at: 
/home/cephuser/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.25): /bin/ceph-deploy install ceph2
[ceph_deploy.install][DEBUG ] Installing stable version hammer on cluster ceph 
hosts ceph2
[ceph_deploy.install][DEBUG ] Detecting platform for host ceph2 ...
[ceph2][DEBUG ] connection detected need for sudo
[ceph2][DEBUG ] connected to host: ceph2
[ceph2][DEBUG ] detect platform information from remote host
[ceph2][DEBUG ] detect machine type
[ceph_deploy.install][INFO  ] Distro info: CentOS Linux 7.3.1611 Core
[ceph2][INFO  ] installing ceph on ceph2
[ceph2][INFO  ] Running command: sudo yum clean all
[ceph2][DEBUG ] Loaded plugins: fastestmirror, priorities
[ceph2][DEBUG ] Cleaning repos: base epel extras grafana influxdb updates
[ceph2][DEBUG ] Cleaning up everything
[ceph2][INFO  ] adding EPEL repository
[ceph2][INFO  ] Running command: sudo yum -y install epel-release
[ceph2][DEBUG ] Loaded plugins: fastestmirror, priorities
[ceph2][DEBUG ] Determining fastest mirrors
[ceph2][DEBUG ]  * base: 
centos.mirror.digitalpacific.com.au
[ceph2][DEBUG ]  * extras: 
centos.mirror.digitalpacific.com.au
[ceph2][DEBUG ]  * updates: 
centos.mirror.digitalpacific.com.au
[ceph2][DEBUG ] Package matching epel-release-7-9.noarch already installed. 
Checking for update.
[ceph2][DEBUG ] Nothing to do
[ceph2][INFO  ] Running command: sudo yum -y install yum-priorities
[ceph2][DEBUG ] Loaded plugins: fastestmirror, priorities
[ceph2][DEBUG ] Loading mirror speeds from cached hostfile
[ceph2][DEBUG ]  * base: 
centos.mirror.digitalpacific.com.au
[ceph2][DEBUG ]  * extras: 
centos.mirror.digitalpacific.com.au
[ceph2][DEBUG ]  * updates: 
centos.mirror.digitalpacific.com.au
[ceph2][DEBUG ] Package yum-plugin-priorities-1.1.31-40.el7.noarch already 
installed and latest version
[ceph2][DEBUG ] Nothing to do
[ceph2][DEBUG ] Configure Yum priorities to include obsoletes
[ceph2][WARNIN] check_obsoletes has been enabled for Yum priorities plugin
[ceph2][INFO  ] Running command: sudo rpm --import 
https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc
[ceph2][INFO  ] Running command: sudo rpm -Uvh --replacepkgs 
http://ceph.com/rpm-hammer/el7/noarch/ceph-release-1-0.el7.noarch.rpm
[ceph2][DEBUG ] Retrieving 
http://ceph.com/rpm-hammer/el7/noarch/ceph-release-1-0.el7.noarch.rpm
[ceph2][WARNIN] error: open of  failed: No such file or directory
[ceph2][WARNIN] error: open of Index failed: No such file or 
directory
[ceph2][WARNIN] error: open of of failed: No such file or directory
[ceph2][WARNIN] error: open of /rpm-hammer/ failed: No such file 
or directory
[ceph2][WARNIN] error: open of  failed: No such file or directory
[ceph2][WARNIN] error: open of Index failed: No such file or directory
[ceph2][WARNIN] error: open of of failed: No such file or directory
[ceph2][WARNIN] error: open of /rpm-hammer/../ failed: No such file or 
directory
[ceph2][WARNIN] error: open of el6/ failed: No such file or 
directory
[ceph2][WARNIN] error: open of 24-Apr-2016 failed: No such file or directory
[ceph2][WARNIN] error: open of 00:05 failed: No such file or directory
[ceph2][WARNIN] error: -: not an rpm package (or package manifest):
[ceph2][WARNIN] error: open of el7/ failed: No such file or 
directory
[ceph2][WARNIN] error: open of 29-Aug-2016 failed: No such file or directory
[ceph2][WARNIN] error: open of 11:53 failed: No such file or directory
[ceph2][WARNIN] error: -: not an rpm package (or package manifest):
[ceph2][WARNIN] error: open of fc20/ failed: No such file or 
directory
[ceph2][WARNIN] error: open of 07-Apr-2015 failed: No such file or directory
[ceph2][WARNIN] error: open of 19:21 failed: No such file or directory
[ceph2][WARNIN] error: -: not an rpm package (or package manifest):
[ceph2][WARNIN] error: open of rhel6/ failed: No such file or 
directory
[ceph2][WARNIN] error: open of 07-Apr-2015 failed: No such file or directory
[ceph2][WARNIN] error: open of 19:22 failed: No such file or directory
[ceph2][WARNIN] error: -: not an rpm package (or package manifest):
[ceph2][WARNIN] error: open of  failed: No such file or 
directory
[ceph2][WARNIN] error: open of  failed: No such file or directory
[ceph2][ERROR ] RuntimeError: command returned

Re: [ceph-users] Adding storage to exiting clusters with minimal impact

2017-07-06 Thread Brian Andrus

On Thu, Jul 6, 2017 at 9:18 AM, Gregory Farnum <gfar...@redhat.com> wrote:

> On Thu, Jul 6, 2017 at 7:04 AM <bruno.cann...@stfc.ac.uk> wrote:
>
>> Hi Ceph Users,
>>
>>
>>
>> We plan to add 20 storage nodes to our existing cluster of 40 nodes, each
>> node has 36 x 5.458 TiB drives. We plan to add the storage such that all
>> new OSDs are prepared, activated and ready to take data but not until we
>> start slowly increasing their weightings. We also expect this not to cause
>> any backfilling before we adjust the weightings.
>>
>>
>>
>> When testing the deployment on our development cluster, adding a new OSD
>> to the host bucket with a crush weight of 5.458 and an OSD reweight of 0
>> (we have set “noin”) causes the acting sets of a few pools to change, thus
>> triggering backfilling. Interestingly, none of the pool backfilling have
>> the new OSD in their acting set.
>>
>>
>>
>> This was not what we expected, so I have to ask, is what we are trying to
>> achieve possible and how best we should go about doing it.
>>
>
> Yeah, there's an understandable but unfortunate bit where when you add a
> new CRUSH device/bucket to a CRUSH bucket (so, a new disk to a host, or a
> new host to a rack) you change the overall weight of that bucket (the host
> or rack). So even though the new OSD might be added with a *reweight* of
> zero, it has a "real" weight of 5.458 and so a little bit more data is
> mapped into the host/rack, even though none gets directed to the new disk
> until you set its reweight value up.
>
> As you note below, if you add the disks with a weight of zero that doesn't
> happen, so you can try doing that and weighting them up gradually.
> -Greg
>

This works well for us - Adding in OSDs with crush weight of 0 (osd crush
initial weight = 0) and slowly crush weighting them in while the reweight
remains at 1. This should also result in less overall data movement if that
is a concern.



>
>>
>> Commands run:
>>
>> ceph osd crush add osd.43 0 host=ceph-sn833 - causes no backfilling
>>
>> ceph osd crush add osd.44 5.458 host=ceph-sn833 - does cause backfilling
>>
>>
>>
>> For multiple hosts and OSDs, we plan to prepare a new crushmap and inject
>> that into the cluster.
>>
>>
>>
>> Best wishes,
>>
>> Bruno
>>
>>
>>
>>
>>
>> Bruno Canning
>>
>> LHC Data Store System Administrator
>>
>> Scientific Computing Department
>>
>> STFC Rutherford Appleton Laboratory
>>
>> Harwell Oxford
>>
>> Didcot
>>
>> OX11 0QX
>>
>> Tel. +44 ((0)1235) 446621
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-04 Thread Brian Andrus

Hi Stefan - we simply disabled exclusive-lock on all older (pre-jewel)
images. We still allow the default jewel featuresets for newly created
images because as you mention - the issue does not seem to affect them.

On Thu, May 4, 2017 at 10:19 AM, Stefan Priebe - Profihost AG <
s.pri...@profihost.ag> wrote:

> Hello Brian,
>
> this really sounds the same. I don't see this on a cluster with only
> images created AFTER jewel. And it seems to start happening after i
> enabled exclusive lock on all images.
>
> Did just use feature disable, exclusive-lock,fast-diff,object-map or did
> you also restart all those vms?
>
> Greets,
> Stefan
>
> Am 04.05.2017 um 19:11 schrieb Brian Andrus:
> > Sounds familiar... and discussed in "disk timeouts in libvirt/qemu
> VMs..."
> >
> > We have not had this issue since reverting exclusive-lock, but it was
> > suggested this was not the issue. So far it's held up for us with not a
> > single corrupt filesystem since then.
> >
> > On some images (ones created post-Jewel upgrade) the feature could not
> > be disabled, but these don't seem to be affected. Of course, we never
> > did pinpoint the cause of timeouts, so it's entirely possible something
> > else was causing it but no other major changes went into effect.
> >
> > One thing to look for that might confirm the same issue are timeouts in
> > the guest VM. Most OS kernel will report a hung task in conjunction with
> > the hang up/lock/corruption. Wondering if you're seeing that too.
> >
> > On Wed, May 3, 2017 at 10:49 PM, Stefan Priebe - Profihost AG
> > <s.pri...@profihost.ag <mailto:s.pri...@profihost.ag>> wrote:
> >
> > Hello,
> >
> > since we've upgraded from hammer to jewel 10.2.7 and enabled
> > exclusive-lock,object-map,fast-diff we've problems with corrupting
> VM
> > filesystems.
> >
> > Sometimes the VMs are just crashing with FS errors and a restart can
> > solve the problem. Sometimes the whole VM is not even bootable and we
> > need to import a backup.
> >
> > All of them have the same problem that you can't revert to an older
> > snapshot. The rbd command just hangs at 99% forever.
> >
> > Is this a known issue - anythink we can check?
> >
> > Greets,
> > Stefan
> > _______
> > ceph-users mailing list
> > ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
> >
> >
> >
> >
> > --
> > Brian Andrus | Cloud Systems Engineer | DreamHost
> > brian.and...@dreamhost.com | www.dreamhost.com <http://www.dreamhost.com
> >
>



-- 
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-04 Thread Brian Andrus

Sounds familiar... and discussed in "disk timeouts in libvirt/qemu VMs..."

We have not had this issue since reverting exclusive-lock, but it was
suggested this was not the issue. So far it's held up for us with not a
single corrupt filesystem since then.

On some images (ones created post-Jewel upgrade) the feature could not be
disabled, but these don't seem to be affected. Of course, we never did
pinpoint the cause of timeouts, so it's entirely possible something else
was causing it but no other major changes went into effect.

One thing to look for that might confirm the same issue are timeouts in the
guest VM. Most OS kernel will report a hung task in conjunction with the
hang up/lock/corruption. Wondering if you're seeing that too.

On Wed, May 3, 2017 at 10:49 PM, Stefan Priebe - Profihost AG <
s.pri...@profihost.ag> wrote:

> Hello,
>
> since we've upgraded from hammer to jewel 10.2.7 and enabled
> exclusive-lock,object-map,fast-diff we've problems with corrupting VM
> filesystems.
>
> Sometimes the VMs are just crashing with FS errors and a restart can
> solve the problem. Sometimes the whole VM is not even bootable and we
> need to import a backup.
>
> All of them have the same problem that you can't revert to an older
> snapshot. The rbd command just hangs at 99% forever.
>
> Is this a known issue - anythink we can check?
>
> Greets,
> Stefan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-- 
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Client's read affinity

2017-04-04 Thread Brian Andrus

Jason, I haven't heard much about this feature.

Will the localization have effect if the crush location configuration is
set in the [osd] section, or does it need to apply globally for clients as
well?

On Fri, Mar 31, 2017 at 6:38 AM, Jason Dillaman <jdill...@redhat.com> wrote:

> Assuming you are asking about RBD-back VMs, it is not possible to
> localize the all reads to the VM image. You can, however, enable
> localization of the parent image since that is a read-only data set.
> To enable that feature, set "rbd localize parent reads = true" and
> populate the "crush location = host=X rack=Y etc=Z" in your ceph.conf.
>
> On Fri, Mar 31, 2017 at 9:00 AM, Alejandro Comisario
> <alejan...@nubeliu.com> wrote:
> > any experiences ?
> >
> > On Wed, Mar 29, 2017 at 2:02 PM, Alejandro Comisario
> > <alejan...@nubeliu.com> wrote:
> >> Guys hi.
> >> I have a Jewel Cluster divided into two racks which is configured on
> >> the crush map.
> >> I have clients (openstack compute nodes) that are closer from one rack
> >> than to another.
> >>
> >> I would love to (if is possible) to specify in some way the clients to
> >> read first from the nodes on a specific rack then try the other one if
> >> is not possible.
> >>
> >> Is that doable ? can somebody explain me how to do it ?
> >> best.
> >>
> >> --
> >> Alejandrito
> >
> >
> >
> > --
> > Alejandro Comisario
> > CTO | NUBELIU
> > E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
> > _
> > www.nubeliu.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Jason
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Flapping OSDs

2017-04-03 Thread Brian :

Hi Vlad

Is there anything in syslog on any of the hosts when this happens?

Had a similar issue with a single node recently and it was caused by a
firmware issue on a single ssd. That would cause the controller to reset
and osds on that node would flap as a result.

flashed the SSD with new FW and issue hasn't come up since.

Brian


On Mon, Apr 3, 2017 at 8:03 AM, Vlad Blando <vbla...@morphlabs.com> wrote:

> Most of the time random and most of the time 1 at a time, but I also see
> 2-3 that are down at the same time.
>
> The network seems fine, the bond seems fine, I just don't know where to
> look anymore. My other option is to redo the server but that's the last
> resort, as much as possible I don't want to.
>
>
>
> On Mon, Apr 3, 2017 at 2:24 PM, Maxime Guyot <maxime.gu...@elits.com>
> wrote:
>
>> Hi Vlad,
>>
>>
>>
>> I am curious if those OSDs are flapping all at once? If a single host is
>> affected I would consider the network connectivity (bottlenecks and
>> misconfigured bonds can generate strange situations), storage controller
>> and firmware.
>>
>>
>>
>> Cheers,
>>
>> Maxime
>>
>>
>>
>> *From: *ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Vlad
>> Blando <vbla...@morphlabs.com>
>> *Date: *Sunday 2 April 2017 16:28
>> *To: *ceph-users <ceph-us...@ceph.com>
>> *Subject: *[ceph-users] Flapping OSDs
>>
>>
>>
>> Hi,
>>
>>
>>
>> One of my ceph nodes have flapping OSDs, network between nodes are fine,
>> it's on a 10GBase-T network. I don't see anything wrong with the network,
>> but these OSDs are going up/down.
>>
>>
>>
>> [root@avatar0-ceph4 ~]# ceph osd tree
>>
>> # idweight  type name   up/down reweight
>>
>> -1  174.7   root default
>>
>> -2  29.12   host avatar0-ceph2
>>
>> 16  3.64osd.16  up  1
>>
>> 17  3.64osd.17  up  1
>>
>> 18  3.64osd.18  up  1
>>
>> 19  3.64osd.19  up  1
>>
>> 20  3.64osd.20  up  1
>>
>> 21  3.64osd.21  up  1
>>
>> 22  3.64osd.22  up  1
>>
>> 23  3.64osd.23  up  1
>>
>> -3  29.12   host avatar0-ceph0
>>
>> 0   3.64osd.0   up  1
>>
>> 1   3.64osd.1   up  1
>>
>> 2   3.64osd.2   up  1
>>
>> 3   3.64osd.3   up  1
>>
>> 4   3.64osd.4   up  1
>>
>> 5   3.64osd.5   up  1
>>
>> 6   3.64osd.6   up  1
>>
>> 7   3.64osd.7   up  1
>>
>> -4  29.12   host avatar0-ceph1
>>
>> 8   3.64osd.8   up  1
>>
>> 9   3.64osd.9   up  1
>>
>> 10  3.64osd.10  up  1
>>
>> 11  3.64osd.11  up  1
>>
>> 12  3.64osd.12  up  1
>>
>> 13  3.64osd.13  up  1
>>
>> 14  3.64osd.14  up  1
>>
>> 15  3.64osd.15  up  1
>>
>> -5  29.12   host avatar0-ceph3
>>
>> 24  3.64osd.24  up  1
>>
>> 25  3.64osd.25  up  1
>>
>> 26  3.64osd.26  up  1
>>
>> 27  3.64osd.27  up  1
>>
>> 28  3.64osd.28  up  1
>>
>> 29  3.64osd.29  up  1
>>
>> 30  3.64osd.30  up  1
>>
>> 31  3.64osd.31  up  1
>>
>> -6  29.12   host avatar0-ceph4
>>
>> 32  3.64osd.32  up  1
>>
>> 33  3.64osd.33  up  1
>>
>> 34  3.64osd.34  up  1
>>
>> 35  3.64osd.35  up  1
>>
>> 36  3.64osd.36  up  1
>>
>> 37  3.64osd.37  up  1
>>
>> 38  3.64osd.38  up  1
>

Re: [ceph-users] disk timeouts in libvirt/qemu VMs...

2017-03-28 Thread Brian Andrus

Just adding some anecdotal input. It likely won't be ultimately helpful
other than a +1..

Seemingly, we also have the same issue since enabling exclusive-lock on
images. We experienced these messages at a large scale when making a CRUSH
map change a few weeks ago that resulted in many many VMs experiencing the
blocked task kernel messages, requiring reboots.

We've since disabled on all images we can, but there are still jewel-era
instances that cannot have the feature disabled. Since disabling the
feature, I have not observed any cases of blocked tasks, but so far given
the limited timeframe I'd consider that anecdotal.


On Mon, Mar 27, 2017 at 12:31 PM, Hall, Eric <eric.h...@vanderbilt.edu>
wrote:

> In an OpenStack (mitaka) cloud, backed by a ceph cluster (10.2.6 jewel),
> using libvirt/qemu (1.3.1/2.5) hypervisors on Ubuntu 14.04.5 compute and
> ceph hosts, we occasionally see hung processes (usually during boot, but
> otherwise as well), with errors reported in the instance logs as shown
> below.  Configuration is vanilla, based on openstack/ceph docs.
>
> Neither the compute hosts nor the ceph hosts appear to be overloaded in
> terms of memory or network bandwidth, none of the 67 osds are over 80%
> full, nor do any of them appear to be overwhelmed in terms of IO.  Compute
> hosts and ceph cluster are connected via a relatively quiet 1Gb network,
> with an IBoE net between the ceph nodes.  Neither network appears
> overloaded.
>
> I don’t see any related (to my eye) errors in client or server logs, even
> with 20/20 logging from various components (rbd, rados, client,
> objectcacher, etc.)  I’ve increased the qemu file descriptor limit
> (currently 64k... overkill for sure.)
>
> I “feels” like a performance problem, but I can’t find any capacity issues
> or constraining bottlenecks.
>
> Any suggestions or insights into this situation are appreciated.  Thank
> you for your time,
> --
> Eric
>
>
> [Fri Mar 24 20:30:40 2017] INFO: task jbd2/vda1-8:226 blocked for more
> than 120 seconds.
> [Fri Mar 24 20:30:40 2017]   Not tainted 3.13.0-52-generic #85-Ubuntu
> [Fri Mar 24 20:30:40 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [Fri Mar 24 20:30:40 2017] jbd2/vda1-8 D 88043fd13180 0   226
> 2 0x
> [Fri Mar 24 20:30:40 2017]  88003728bbd8 0046
> 88042690 88003728bfd8
> [Fri Mar 24 20:30:40 2017]  00013180 00013180
> 88042690 88043fd13a18
> [Fri Mar 24 20:30:40 2017]  88043ffb9478 0002
> 811ef7c0 88003728bc50
> [Fri Mar 24 20:30:40 2017] Call Trace:
> [Fri Mar 24 20:30:40 2017]  [] ?
> generic_block_bmap+0x50/0x50
> [Fri Mar 24 20:30:40 2017]  [] io_schedule+0x9d/0x140
> [Fri Mar 24 20:30:40 2017]  [] sleep_on_buffer+0xe/0x20
> [Fri Mar 24 20:30:40 2017]  [] __wait_on_bit+0x62/0x90
> [Fri Mar 24 20:30:40 2017]  [] ?
> generic_block_bmap+0x50/0x50
> [Fri Mar 24 20:30:40 2017]  []
> out_of_line_wait_on_bit+0x77/0x90
> [Fri Mar 24 20:30:40 2017]  [] ?
> autoremove_wake_function+0x40/0x40
> [Fri Mar 24 20:30:40 2017]  [] __wait_on_buffer+0x2a/0x30
> [Fri Mar 24 20:30:40 2017]  [] jbd2_journal_commit_
> transaction+0x185d/0x1ab0
> [Fri Mar 24 20:30:40 2017]  [] ?
> try_to_del_timer_sync+0x4f/0x70
> [Fri Mar 24 20:30:40 2017]  [] kjournald2+0xbd/0x250
> [Fri Mar 24 20:30:40 2017]  [] ?
> prepare_to_wait_event+0x100/0x100
> [Fri Mar 24 20:30:40 2017]  [] ? commit_timeout+0x10/0x10
> [Fri Mar 24 20:30:40 2017]  [] kthread+0xd2/0xf0
> [Fri Mar 24 20:30:40 2017]  [] ?
> kthread_create_on_node+0x1c0/0x1c0
> [Fri Mar 24 20:30:40 2017]  [] ret_from_fork+0x7c/0xb0
> [Fri Mar 24 20:30:40 2017]  [] ?
> kthread_create_on_node+0x1c0/0x1c0
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] osds down after upgrade hammer to jewel

2017-03-28 Thread Brian Andrus

Well, you said you were running v0.94.9, but are there any OSDs running
pre-v0.94.4 as the error states?

On Tue, Mar 28, 2017 at 6:51 AM, Jaime Ibar <ja...@tchpc.tcd.ie> wrote:

>
>
> On 28/03/17 14:41, Brian Andrus wrote:
>
> What does
> # ceph tell osd.* version
>
> ceph tell osd.21 version
> Error ENXIO: problem getting command descriptions from osd.21
>
>
> reveal? Any pre-v0.94.4 hammer OSDs running as the error states?
>
> Yes, as this is the first one I tried to upgrade.
> The other ones are running hammer
>
> Thanks
>
>
>
> On Tue, Mar 28, 2017 at 1:21 AM, Jaime Ibar <ja...@tchpc.tcd.ie> wrote:
>
>> Hi,
>>
>> I did change the ownership to user ceph. In fact, OSD processes are
>> running
>>
>> ps aux | grep ceph
>> ceph2199  0.0  2.7 1729044 918792 ?  Ssl  Mar27   0:21
>> /usr/bin/ceph-osd --cluster=ceph -i 42 -f --setuser ceph --setgroup ceph
>> ceph2200  0.0  2.7 1721212 911084 ?  Ssl  Mar27   0:20
>> /usr/bin/ceph-osd --cluster=ceph -i 18 -f --setuser ceph --setgroup ceph
>> ceph2212  0.0  2.8 1732532 926580 ?  Ssl  Mar27   0:20
>> /usr/bin/ceph-osd --cluster=ceph -i 3 -f --setuser ceph --setgroup ceph
>> ceph2215  0.0  2.8 1743552 935296 ?  Ssl  Mar27   0:20
>> /usr/bin/ceph-osd --cluster=ceph -i 47 -f --setuser ceph --setgroup ceph
>> ceph2341  0.0  2.7 1715548 908312 ?  Ssl  Mar27   0:20
>> /usr/bin/ceph-osd --cluster=ceph -i 51 -f --setuser ceph --setgroup ceph
>> ceph2383  0.0  2.7 1694944 893768 ?  Ssl  Mar27   0:20
>> /usr/bin/ceph-osd --cluster=ceph -i 56 -f --setuser ceph --setgroup ceph
>> [...]
>>
>> If I run one of the osd increasing debug
>>
>> ceph-osd --debug_osd 5 -i 31
>>
>> this is what I get in logs
>>
>> [...]
>>
>> 0 osd.31 14016 done with init, starting boot process
>> 2017-03-28 09:19:15.280182 7f083df0c800  1 osd.31 14016 We are healthy,
>> booting
>> 2017-03-28 09:19:15.280685 7f081cad3700  1 osd.31 14016 osdmap indicates
>> one or more pre-v0.94.4 hammer OSDs is running
>> [...]
>>
>> It seems the osd is running but ceph is not aware of it
>>
>> Thanks
>> Jaime
>>
>>
>>
>>
>> On 27/03/17 21:56, George Mihaiescu wrote:
>>
>>> Make sure the OSD processes on the Jewel node are running. If you didn't
>>> change the ownership to user ceph, they won't start.
>>>
>>>
>>> On Mar 27, 2017, at 11:53, Jaime Ibar <ja...@tchpc.tcd.ie> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I'm upgrading ceph cluster from Hammer 0.94.9 to jewel 10.2.6.
>>>>
>>>> The ceph cluster has 3 servers (one mon and one mds each) and another 6
>>>> servers with
>>>> 12 osds each.
>>>> The monitoring and mds have been succesfully upgraded to latest jewel
>>>> release, however
>>>> after upgrade the first osd server(12 osds), ceph is not aware of them
>>>> and
>>>> are marked as down
>>>>
>>>> ceph -s
>>>>
>>>> cluster 4a158d27-f750-41d5-9e7f-26ce4c9d2d45
>>>>  health HEALTH_WARN
>>>> [...]
>>>> 12/72 in osds are down
>>>> noout flag(s) set
>>>>  osdmap e14010: 72 osds: 60 up, 72 in; 14641 remapped pgs
>>>> flags noout
>>>> [...]
>>>>
>>>> ceph osd tree
>>>>
>>>> 3   3.64000 osd.3  down  1.0 1.0
>>>> 8   3.64000 osd.8  down  1.0 1.0
>>>> 14   3.64000 osd.14 down  1.0 1.0
>>>> 18   3.64000 osd.18 down  1.0  1.0
>>>> 21   3.64000 osd.21 down  1.0  1.0
>>>> 28   3.64000 osd.28 down  1.0  1.0
>>>> 31   3.64000 osd.31 down  1.0  1.0
>>>> 37   3.64000 osd.37 down  1.0  1.0
>>>> 42   3.64000 osd.42 down  1.0  1.0
>>>> 47   3.64000 osd.47 down  1.0  1.0
>>>> 51   3.64000 osd.51 down  1.0  1.0
>>>> 56   3.64000 osd.56 down  1.0  1.0
>>>>
>>>> If I run this command with one of the down osd
>>>> ceph osd in 14
>>>> osd.14 is already in.
>>>> howe

Re: [ceph-users] osds down after upgrade hammer to jewel

2017-03-28 Thread Brian Andrus

What does
# ceph tell osd.* version

reveal? Any pre-v0.94.4 hammer OSDs running as the error states?


On Tue, Mar 28, 2017 at 1:21 AM, Jaime Ibar <ja...@tchpc.tcd.ie> wrote:

> Hi,
>
> I did change the ownership to user ceph. In fact, OSD processes are running
>
> ps aux | grep ceph
> ceph2199  0.0  2.7 1729044 918792 ?  Ssl  Mar27   0:21
> /usr/bin/ceph-osd --cluster=ceph -i 42 -f --setuser ceph --setgroup ceph
> ceph2200  0.0  2.7 1721212 911084 ?  Ssl  Mar27   0:20
> /usr/bin/ceph-osd --cluster=ceph -i 18 -f --setuser ceph --setgroup ceph
> ceph2212  0.0  2.8 1732532 926580 ?  Ssl  Mar27   0:20
> /usr/bin/ceph-osd --cluster=ceph -i 3 -f --setuser ceph --setgroup ceph
> ceph2215  0.0  2.8 1743552 935296 ?  Ssl  Mar27   0:20
> /usr/bin/ceph-osd --cluster=ceph -i 47 -f --setuser ceph --setgroup ceph
> ceph2341  0.0  2.7 1715548 908312 ?  Ssl  Mar27   0:20
> /usr/bin/ceph-osd --cluster=ceph -i 51 -f --setuser ceph --setgroup ceph
> ceph2383  0.0  2.7 1694944 893768 ?  Ssl  Mar27   0:20
> /usr/bin/ceph-osd --cluster=ceph -i 56 -f --setuser ceph --setgroup ceph
> [...]
>
> If I run one of the osd increasing debug
>
> ceph-osd --debug_osd 5 -i 31
>
> this is what I get in logs
>
> [...]
>
> 0 osd.31 14016 done with init, starting boot process
> 2017-03-28 09:19:15.280182 7f083df0c800  1 osd.31 14016 We are healthy,
> booting
> 2017-03-28 09:19:15.280685 7f081cad3700  1 osd.31 14016 osdmap indicates
> one or more pre-v0.94.4 hammer OSDs is running
> [...]
>
> It seems the osd is running but ceph is not aware of it
>
> Thanks
> Jaime
>
>
>
>
> On 27/03/17 21:56, George Mihaiescu wrote:
>
>> Make sure the OSD processes on the Jewel node are running. If you didn't
>> change the ownership to user ceph, they won't start.
>>
>>
>> On Mar 27, 2017, at 11:53, Jaime Ibar <ja...@tchpc.tcd.ie> wrote:
>>>
>>> Hi all,
>>>
>>> I'm upgrading ceph cluster from Hammer 0.94.9 to jewel 10.2.6.
>>>
>>> The ceph cluster has 3 servers (one mon and one mds each) and another 6
>>> servers with
>>> 12 osds each.
>>> The monitoring and mds have been succesfully upgraded to latest jewel
>>> release, however
>>> after upgrade the first osd server(12 osds), ceph is not aware of them
>>> and
>>> are marked as down
>>>
>>> ceph -s
>>>
>>> cluster 4a158d27-f750-41d5-9e7f-26ce4c9d2d45
>>>  health HEALTH_WARN
>>> [...]
>>> 12/72 in osds are down
>>> noout flag(s) set
>>>  osdmap e14010: 72 osds: 60 up, 72 in; 14641 remapped pgs
>>> flags noout
>>> [...]
>>>
>>> ceph osd tree
>>>
>>> 3   3.64000 osd.3  down  1.0 1.0
>>> 8   3.64000 osd.8  down  1.0 1.0
>>> 14   3.64000 osd.14 down  1.0 1.0
>>> 18   3.64000 osd.18 down  1.0  1.0
>>> 21   3.64000 osd.21 down  1.0  1.0
>>> 28   3.64000 osd.28 down  1.0  1.0
>>> 31   3.64000 osd.31 down  1.0  1.0
>>> 37   3.64000 osd.37 down  1.0  1.0
>>> 42   3.64000 osd.42 down  1.0  1.0
>>> 47   3.64000 osd.47 down  1.0  1.0
>>> 51   3.64000 osd.51 down  1.0  1.0
>>> 56   3.64000 osd.56 down  1.0  1.0
>>>
>>> If I run this command with one of the down osd
>>> ceph osd in 14
>>> osd.14 is already in.
>>> however ceph doesn't mark it as up and the cluster health remains
>>> in degraded state.
>>>
>>> Do I have to upgrade all the osds to jewel first?
>>> Any help as I'm running out of ideas?
>>>
>>> Thanks
>>> Jaime
>>>
>>> --
>>>
>>> Jaime Ibar
>>> High Performance & Research Computing, IS Services
>>> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
>>> http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
>>> Tel: +353-1-896-3725
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
> --
>
> Jaime Ibar
> High Performance & Research Computing, IS Services
> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
> http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
> Tel: +353-1-896-3725
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] active+clean+inconsistent and pg repair

2017-03-17 Thread Brian Andrus

We went through a period of time where we were experiencing these daily...

cd to the PG directory on each OSD and do a find for "238e1f29.0076024c"
(mentioned in your error message). This will likely return a file that has
a slash in the name, something like rbd\udata.
238e1f29.0076024c_head_blah_1f...

hexdump -C the object (tab completing the name helps) and pipe the output
to a different location. Once you obtain the hexdumps, do a diff or cmp
against them and find which one is not like the others.

If the primary is not the outlier, perform the PG repair without worry. If
the primary is the outlier, you will need to stop the OSD, move the object
out of place, start it back up and then it will be okay to issue a PG
repair.

Other less common inconsistent PGs we see are differing object sizes (easy
to detect with a simple list of file size) and differing attributes ("attr
-l", but the error logs are usually precise in identifying the problematic
PG copy).

On Fri, Mar 17, 2017 at 8:16 AM, Shain Miley <smi...@npr.org> wrote:

> Hello,
>
> Ceph status is showing:
>
> 1 pgs inconsistent
> 1 scrub errors
> 1 active+clean+inconsistent
>
> I located the error messages in the logfile after querying the pg in
> question:
>
> root@hqosd3:/var/log/ceph# zgrep -Hn 'ERR' ceph-osd.32.log.1.gz
>
> ceph-osd.32.log.1.gz:846:2017-03-17 02:25:20.281608 7f7744d7f700 -1
> log_channel(cluster) log [ERR] : 3.2b8 shard 32: soid
> 3/4650a2b8/rb.0.fe307e.238e1f29.0076024c/head candidate had a read
> error, data_digest 0x84c33490 != known data_digest 0x974a24a7 from auth
> shard 62
>
>
> ceph-osd.32.log.1.gz:847:2017-03-17 02:30:40.264219 7f7744d7f700 -1
> log_channel(cluster) log [ERR] : 3.2b8 deep-scrub 0 missing, 1 inconsistent
> objects
>
> ceph-osd.32.log.1.gz:848:2017-03-17 02:30:40.264307 7f7744d7f700 -1
> log_channel(cluster) log [ERR] : 3.2b8 deep-scrub 1 errors
>
> Is this a case where it would be safe to use 'ceph pg repair'?
> The documentation indicates there are times where running this command is
> less safe than others...and I would like to be sure before I do so.
>
> Thanks,
> Shain
>
>
> --
> NPR | Shain Miley | Manager of Infrastructure, Digital Media | smi...@npr.org 
> | 202.513.3649 <(202)%20513-3649>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>

-- 
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Ceph on XenServer

2017-02-25 Thread Brian :

Hi Max,

Have you considered Proxmox at all? Nicely integrates with Ceph storage. I
moved from Xenserver longtime ago and have no regrets.

Thanks
Brians

On Sat, Feb 25, 2017 at 12:47 PM, Massimiliano Cuttini 
wrote:

> Hi Iban,
>
> you are running xen (just the software) not xenserver (ad hoc linux
> distribution).
> Xenserver is a linux distribution based on CentOS.
> You cannot recompile the kernel by your-own (... well, you can do, but
> it's not a good idea).
> And you should not install rpm by your-own (... but, i'm gonna do this).
>
> Then I'm stuck with some plugin and just some rpm compatible with kernel
> 3.10.0-514.6.1.el7.x86_64
> However I found a way to install more or less the Ceph Client.
> I have forked the plugin of https://github.com/rposudnevskiy/RBDSR It
> seems the most update.
> Here you can find my findings and updates: https://github.com/phoenixweb/
> RBDSR
> However at the moment I'm stuck with a kernel unsupported feature. :(
> Let's see if I can let this work.
>
> Thanks for your sharings
> Max
>
>
>
>
>
> Il 24/02/2017 18:46, Iban Cabrillo ha scritto:
>
> Hi Massimiliano,
>   We are running CEPH agains our openstack instance running Xen:
>
> ii  xen-hypervisor-4.6-amd64 4.6.0-1ubuntu4.3
>amd64Xen Hypervisor on AMD64
> ii  xen-system-amd64 4.6.0-1ubuntu4.1
>amd64Xen System on AMD64 (meta-package)
> ii  xen-utils-4.64.6.0-1ubuntu4.3
>amd64XEN administrative tools
> ii  xen-utils-common 4.6.0-1ubuntu4.3
>all  Xen administrative tools - common files
> ii  xenstore-utils   4.6.0-1ubuntu4.1
>amd64Xenstore command line utilities for Xen
>
>
> 2017-02-24 15:52 GMT+01:00 Massimiliano Cuttini :
>
>> Dear all,
>>
>> even if Ceph should be officially supported by Xen since 4 years.
>>
>>- http://xenserver.org/blog/entry/tech-preview-of-xenserver-
>>libvirt-ceph.html
>>- https://ceph.com/geen-categorie/xenserver-support-for-rbd/
>>
>> rbd is supported on libvirt 1.3.2, but It has to be recompiled the
>
> ii  libvirt-bin   1.3.2-0~15.10~ amd64programs for
> the libvirt library
> ii  libvirt0:amd64   1.3.2-0~15.10~ amd64library for
> interfacing with different virtualization systems
>
>>
>>
>> Still there is no support yet.
>>
>> At this point there are only some self-made plugin and solution around.
>> Here some:
>>
>>- https://github.com/rposudnevskiy/RBDSR
>>- https://github.com/mstarikov/rbdsr
>>- https://github.com/FlorianHeigl/xen-ceph-rbd
>>
>> Nobody know how much they are compatible or if they are gonna to break
>> Xen the next update.
>>
>> The ugly truth is that XEN is not still able to fully support Ceph and we
>> can only pray that one of the plugin above will not destroy our precious
>> data or VDI.
>>
>> Does anybody had some experience with the plugin above?
>> Which one you'll reccomend?
>> Is there any good installation guide to let this work correctly?
>> Can I just install Ceph with ceph-deploy or do I have to unlock repos on
>> the xenserver and instal it manuall, Thanks for any kind of support.
>>
> Attaching volumes from openstack portal (or using cinder volume manager),
> has been working fine since 6 months ago, but I do not know what will
> happen on next updates
>
>
> Regards,
>> Max
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] help with crush rule

2017-02-21 Thread Brian Andrus

I don't think a CRUSH rule exception is currently possible, but it makes
sense to me for a feature request.

On Sat, Feb 18, 2017 at 6:16 AM, Maged Mokhtar <mmokh...@petasan.org> wrote:

>
> Hi,
>
> I have a need to support a small cluster with 3 hosts and 3 replicas given
> that in normal operation each replica will be placed on a separate host
> but in case one host dies then its replicas could be stored on separate
> osds on the 2 live hosts.
>
> I was hoping to write a rule that in case it could only find 2 replicas on
> separated nodes will emit it and do another select/emit to place the
> reaming replica. Is this possible ? i could not find a way to define an if
> condition or being able to determine the size of the working vector
> actually returned.
>
> Cheers /maged
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Adding multiple osd's to an active cluster

2017-02-17 Thread Brian Andrus

As described recently in several other threads, we like to add OSDs in to
their proper CRUSH location, but with the following parameter set:

  osd crush initial weight = 0

We then bring the OSDs in to the cluster (0 impact in our environment), and
then gradually increase CRUSH weight to bring them to their final desired
value, all at the same time.

The script I use basically checks for all OSDs less than our target weight
in each iteration, and moves it closer to the target weight by a defined
increment and then waiting for HEALTH_OK or other acceptable state.

I would suggest starting with .001 with large groups of OSDs. We can
comfortably bring in 100 OSDs with increments of .004 at a time or so..
Theoretically we could just let them all weight in at once, but this allows
us to find a comfortable rate and pause the process whenever/wherever we
want if it does cause issues.

Hope that helps.

On Fri, Feb 17, 2017 at 1:42 AM, nigel davies <nigdav...@gmail.com> wrote:

> Hay All
>
> How is the best way to added multiple osd's to an active cluster?
>
> As the last time i done this i all most killed the VM's we had running on
> the cluster
>
> Thanks
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>

-- 
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] crushtool mappings wrong

2017-02-16 Thread Brian Andrus

v10.2.5 - crushtool working fine to show rack mappings. How are you running
the command? Get some sleep! ha.

crushtool -i /tmp/crush.map --test --ruleset 3 --num-rep 3 --show-mappings

rule byrack {
ruleset 3
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type rack
step emit
}


On Thu, Feb 16, 2017 at 7:10 AM, Blair Bethwaite <blair.bethwa...@gmail.com>
wrote:

> Am I going nuts (it is extremely late/early here), or is crushtool
> totally broken? I'm trying to configure a ruleset that will place
> exactly one replica into three different racks (under each of which
> there are 8-10 hosts). crushtool has given me empty mappings for just
> about every rule I've tried that wasn't just the simplest: chooseleaf
> 0 host. Suspecting something was up with crushtool I have now tried to
> verify correctness on an existing rule and it is including OSDs in the
> result mappings that are not even in this hierarchy...
>
> (this is on a 10.2.2 install)
>
> --
> Cheers,
> ~Blairo
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] extending ceph cluster with osds close to near full ratio (85%)

2017-02-14 Thread Brian Andrus

m osd.902 weight 3.640
> item osd.903 weight 3.640
> }
>
>
> As mentioned before we created a temporary bucket called "fresh-install"
> containing the newly installed nodes (i.e.):
>
> root fresh-install {
> id -34  # do not change unnecessarily
> # weight 218.400
> alg straw
> hash 0  # rjenkins1
> item osd-k5-36-fresh weight 72.800
> item osd-k7-41-fresh weight 72.800
> item osd-l4-36-fresh weight 72.800
> }
>
> Then, by steps of 6 OSDs (2 OSDs from each new host), we move OSDs from
> the "fresh-install" to the "sas" bucket.
>
>
I would highly recommend a simple script to weight in gradually as
described above. Much more controllable and you can twiddle the knobs to
your heart's desire.

>
> Thank you in advance for all the suggestions.
>
> Cheers,
> Tyanko
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
Hope that helps.

-- 
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] would people mind a slow osd restart during luminous upgrade?

2017-02-09 Thread Brian Andrus

On Thu, Feb 9, 2017 at 9:12 AM, David Turner <drakonst...@gmail.com> wrote:

> When we upgraded to Jewel 10.2.3 from Hammer 0.94.7 in our QA cluster we
> had issues with client incompatibility.  We first tried upgrading our
> clients before upgrading the cluster.  This broke creating RBDs, cloning
> RBDs, and probably many other things.  We quickly called that test a wash
> and redeployed the cluster back to 0.94.7 and redid the upgrade by
> partially upgrading the cluster, testing, fully upgrading the cluster,
> testing, and finally upgraded the clients to Jewel.  This worked with no
> issues creating RBDs, cloning, snapshots, deleting, etc.
>
> I'm not sure if there was a previous reason that we decided to always
> upgrade the clients first.  It might have had to do with the upgrade from
> Firefly to Hammer.  It's just something we always test now, especially with
> full version upgrades.  That being said, making sure that there is a client
> that was regression tested throughout the cluster upgrade would be great to
> have in the release notes.
>

I agree - it would have been nice to have this in the release notes,
however we only hit it because we're hyperconverged (clients using Jewel
against a Hammer cluster that hasn't yet had daemons restarted). We are
fixing it by setting rbd_default_features = 3 in our upcoming upgrade. We
will then unset it once the cluster is running Jewel.


>
> On Thu, Feb 9, 2017 at 7:29 AM Sage Weil <sw...@redhat.com> wrote:
>
>> On Thu, 9 Feb 2017, David Turner wrote:
>> > The only issue I can think of is if there isn't a version of the clients
>> > fully tested to work with a partially upgraded cluster or a documented
>> > incompatibility requiring downtime. We've had upgrades where we had to
>> > upgrade clients first and others that we had to do the clients last due
>> to
>> > issues with how the clients interacted with an older cluster, partially
>> > upgraded cluster, or newer cluster.
>>
>> We maintain client compatibiltity across *many* releases and several
>> years.  In general this under the control of the administrator via their
>> choice of CRUSH tunables, which effectively let you choose the oldest
>> client you'd like to support.
>>
>> I'm curious which upgrade you had problems with?  Generally speaking the
>> only "client" upgrade ordering issue is with the radosgw clients, which
>> need to be upgraded after the OSDs.
>>
>> > If the FileStore is changing this much, I can imagine a Jewel client
>> having
>> > a hard time locating the objects it needs from a Luminous cluster.
>>
>> In this case the change would be internal to a single OSD and have no
>> effect on the client/osd interaction or placement of objects.
>>
>> sage
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Latency between datacenters

2017-02-08 Thread Brian Andrus

On Wed, Feb 8, 2017 at 5:34 PM, Marcus Furlong <furlo...@gmail.com> wrote:

> On 9 February 2017 at 09:34, Trey Palmer <t...@mailchimp.com> wrote:
> > The multisite configuration available starting in Jewel sound more
> > appropriate for your situation.
> >
> > But then you need two separate clusters, each large enough to contain
> all of
> > your objects.
>
> On that note, is anyone aware of documentation that details the
> differences between federated gateway and multisite? And where each
> would be most appropriate?

They are similar, but mostly reworked and simplified in the case of
multisite. A multisite configuration allows writes to non-master zones.

>
> Regards,
> Marcus.
> --
> Marcus Furlong
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

On Tue, Feb 7, 2017 at 12:17 PM, Daniel Picolli Biazus
<picol...@gmail.com> wrote:
> Hi Guys,
>
> I have been planning to deploy a Ceph Cluster with the following hardware:
>
> OSDs:
>
> 4 Servers Xeon D 1520 / 32 GB RAM / 5 x 6TB SAS 2 (6 OSD daemon per
server)
>
> Monitor/Rados Gateways
>
> 5 Servers Xeon D 1520 32 GB RAM / 2 x 1TB SAS 2 (5 MON daemon/ 4 rados
> daemon)
>
> Usage: Object Storage only
>
> However I need to deploy 2 OSD and 3 MON Servers in Miami datacenter
and
> another 2 OSD and 2 MON Servers in Montreal Datacenter. The latency
between
> these datacenters is 50 milliseconds.
>Considering this scenario, should I use Federated Gateways or should I
> use a single Cluster ?

There's nothing stopping you from separating the datacenters with CRUSH
root definitions to have one zone serving from the pools existing solely in
one Datacenter, and another zone servicing the other. That way the transfer
of data and metadata will occur at the RadosGW level (more latency-tolerant
than rados. this is what it was designed to do), while still being managed
at one point.

It's worth testing both configurations, as well as the effects of latency
on your monitors. In some cases I'd consider trying to source another MON
and running two separate clusters, but simply put, YMMV.

>
> Thanks in advance
> Daniel

-- 
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Running 'ceph health' as non-root user

2017-02-01 Thread Brian ::

This is great - had no idea you could have this level of control with
Ceph authentication.


On Wed, Feb 1, 2017 at 12:29 PM, John Spray  wrote:
> On Wed, Feb 1, 2017 at 8:55 AM, Michael Hartz  
> wrote:
>> I am running ceph as part of a Proxmox Virtualization cluster, which is 
>> doing great.
>>
>> However for monitoring purpose I would like to periodically check with 'ceph 
>> health' as a non-root user.
>> This fails with the following message:
>>> su -c 'ceph health' -s /bin/bash nagios
>> Error initializing cluster client: PermissionDeniedError('error calling 
>> conf_read_file',)
>>
>> Please note: running the command as root user works as intended.
>>
>> Someone else suggested to allow group permissions on the admin keyring, i.e. 
>> chmod 660 /etc/ceph/ceph.client.admin.keyring
>> Link: https://github.com/thelan/ceph-zabbix/issues/12
>> This didn't work.
>
> Nobody should ever need to give their unprivileged users sudo access
> to the ceph CLI or access to the the ceph admin key, just to run the
> status command.
>
> Ceph's own authentication system has fine grained control over
> execution of mon commands.  You can create a special user that can
> only run the status command like this:
> ceph auth get-or-create client.status mon 'allow command "status"' >
> ./status.keyring
>
> ...and then invoke status as that user like this:
> ceph --name client.status --keyring ./status.keyring status
>
> You can then make sure your unprivileged user has read access to
> status.keyring and to ceph.conf (or give it its own copy of
> ceph.conf).
>
> John
>
>
>>
>> Has anyone hints on this?
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Running 'ceph health' as non-root user

2017-02-01 Thread Brian ::

And left out - your command line for the ceph checks in nagios should
be prefixed by sudo

'sudo ceph health'

server# su nagios
$ ceph health
Error initializing cluster client: Error('error calling
conf_read_file: errno EACCES',)
$sudo ceph health
HEALTH_OK


On Wed, Feb 1, 2017 at 9:55 AM, Brian :: <b...@iptel.co> wrote:
> Hi Michael,
>
> Install sudo on proxmox server and add an entry for nagios like:
>
> nagios ALL=(ALL) NOPASSWD:/usr/bin/ceph
>
> in a file in /etc/sudoers.d
>
> Brian
>
> On Wed, Feb 1, 2017 at 8:55 AM, Michael Hartz <michael.ha...@yandex.com> 
> wrote:
>> I am running ceph as part of a Proxmox Virtualization cluster, which is 
>> doing great.
>>
>> However for monitoring purpose I would like to periodically check with 'ceph 
>> health' as a non-root user.
>> This fails with the following message:
>>> su -c 'ceph health' -s /bin/bash nagios
>> Error initializing cluster client: PermissionDeniedError('error calling 
>> conf_read_file',)
>>
>> Please note: running the command as root user works as intended.
>>
>> Someone else suggested to allow group permissions on the admin keyring, i.e. 
>> chmod 660 /etc/ceph/ceph.client.admin.keyring
>> Link: https://github.com/thelan/ceph-zabbix/issues/12
>> This didn't work.
>>
>> Has anyone hints on this?
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Running 'ceph health' as non-root user

2017-02-01 Thread Brian ::

Hi Michael,

Install sudo on proxmox server and add an entry for nagios like:

nagios ALL=(ALL) NOPASSWD:/usr/bin/ceph

in a file in /etc/sudoers.d

Brian

On Wed, Feb 1, 2017 at 8:55 AM, Michael Hartz <michael.ha...@yandex.com> wrote:
> I am running ceph as part of a Proxmox Virtualization cluster, which is doing 
> great.
>
> However for monitoring purpose I would like to periodically check with 'ceph 
> health' as a non-root user.
> This fails with the following message:
>> su -c 'ceph health' -s /bin/bash nagios
> Error initializing cluster client: PermissionDeniedError('error calling 
> conf_read_file',)
>
> Please note: running the command as root user works as intended.
>
> Someone else suggested to allow group permissions on the admin keyring, i.e. 
> chmod 660 /etc/ceph/ceph.client.admin.keyring
> Link: https://github.com/thelan/ceph-zabbix/issues/12
> This didn't work.
>
> Has anyone hints on this?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Unique object IDs and crush on object striping

2017-01-31 Thread Brian Andrus

On Tue, Jan 31, 2017 at 7:42 AM, Ukko <ukkohakkarai...@gmail.com> wrote:

> Hi,
>
> Two quickies:
>
> 1) How does Ceph handle unique object IDs without any
> central information about the object names?
>

That's where CRUSH comes in. It maps an object name to a unique placement
group ID based on the available placement groups. Some of my favorite
explanations of data placement comes from the core Ceph developers. [1] [2]

2) How CRUSH is used in case of splitting an object in
> stripes?
>

The splitting/striping of data actually occurs at a layer above CRUSH. The
clients handle that and calculate object placement with CRUSH based on
unique object names.

> Thanks!
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
[1] https://youtu.be/05spXfLKKVU?t=9m14s
[2] https://youtu.be/lG6eeUNw9iI?t=18m49s

-- 
Brian Andrus
Cloud Systems Engineer
DreamHost, LLC
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] is docs.ceph.com down?

2017-01-19 Thread Brian Andrus

On Thu, Jan 19, 2017 at 11:31 AM, Kamble, Nitin A <nitin.kam...@teradata.com
> wrote:

> My browser is getting hang on http://docs.ceph.com <http://ceph.com>
>
>
Check the thread regarding tracker.ceph.com from earlier this morning. Same
issue.


> Thanks,
> Nitin
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
I think your DNS cache may be preventing you from seeing the site at this
point, as it appears the Ceph project guys have got the site back up.

-- 
Brian Andrus
Cloud Systems Engineer
DreamHost, LLC
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Problems with http://tracker.ceph.com/?

2017-01-19 Thread Brian Andrus

Much of the Ceph project VMs (including tracker.ceph.com) is currently
hosted on DreamCompute. The migration to our new service/cluster that was
completed on 2017-01-17, the Ceph project was somehow enabled in our new
OpenStack project without enabling a service in our billing system (this
shouldn't be possible).

Since tenant_deletes (started by customers leaving for example) often fail,
we run daily audits that root out accounts without Service Instances in our
billing system, and issue a tenant delete in OpenStack. In hindsight, it
should probably look for accounts that are INACTIVE, and not non-existent.
I have enabled a Service Instance for the DreamCompute service, so it
should NOT happen again. This did happen yesterday as well, but we
incorrectly assessed the situation and thus happened again today.

The good news is the tenant delete failed. The bad news is we're looking
for the tracker volume now, which is no longer present in the Ceph project.

The Ceph project guys are understandably upset, and from the DreamHost
side, we're currently looking to recover the tracker volume.

On Thu, Jan 19, 2017 at 8:51 AM, Sean Redmond <sean.redmo...@gmail.com>
wrote:

> Looks like there maybe an issue with the ceph.com and tracker.ceph.com
> website at the moment
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>

-- 
Brian Andrus
Cloud Systems Engineer
DreamHost, LLC
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Calamari or Alternative

2017-01-13 Thread Brian Godette

We're using:

https://github.com/rochaporto/collectd-ceph

for time-series, with a slightly modified Grafana dashboard from the one 
referenced.

https://github.com/Crapworks/ceph-dash

for quick health status.

Both took a small bit of modification to make them work with Jewel at the time, 
not sure if that's been taken care of since.

From: ceph-users  on behalf of Marko 
Stojanovic 
Sent: Friday, January 13, 2017 1:30 AM
To: Tu Holmes; John Petrini
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Calamari or Alternative

 There is another nice tool for ceph monitoring:

https://github.com/inkscope/inkscope

Little hard to setup but beside just monitoring you can also manage some items 
using it.

regards
Marko

On 1/13/17 07:30, Tu Holmes wrote:
I'll give ceph-dash a look.

Thanks!
On Thu, Jan 12, 2017 at 9:19 PM John Petrini 
> wrote:
I used Calamari before making the move to Ubuntu 16.04 and upgrading to Jewel. 
At the time I tried to install it on 16.04 but couldn't get it working.

I'm now using ceph-dash along with the 
nagios plugin check_ceph_dash and 
I've found that this gets me everything I need. A nice looking dashboard, 
graphs and alerting on the most important stats.

Another plus is that it's incredibly easy to setup; you can have the dashboard 
up and running in five minutes.

___

The information transmitted is intended only for the person or entity to which 
it is addressed and may contain confidential and/or privileged material. Any 
review, retransmission,  dissemination or other use of, or taking of any action 
in reliance upon, this information by persons or entities other than the 
intended recipient is prohibited. If you received this in error, please contact 
the sender and delete the material from any computer.

On Fri, Jan 13, 2017 at 12:06 AM, Tu Holmes 
> wrote:
Hey Cephers.

Question for you.

Do you guys use Calamari or an alternative?

If so, why has the installation of Calamari not really gotten much better 
recently.

Are you still building the vagrant installers and building packages?

Just wondering what you are all doing.

Thanks.

//Tu

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release

2017-01-10 Thread Brian Andrus

On Mon, Jan 9, 2017 at 3:33 PM, Willem Jan Withagen <w...@digiware.nl> wrote:

> On 9-1-2017 23:58, Brian Andrus wrote:
> > Sorry for spam... I meant D_SYNC.
>
> That term does not run any lights in Google...
> So I would expect it has to O_DSYNC.
> (https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-
> test-if-your-ssd-is-suitable-as-a-journal-device/)
>
> Now you tell me there is a SSDs that does take correct action with
> O_SYNC but not with O_DSYNC... That makes no sense to me. It is a
> typical solution in the OS as speed trade-off versus a bit less
> consistent FS.
>
> Either a device actually writes its data persistenly (either in silicon
> cells, or keeps it in RAM with a supercapacitor), or it does not.
> Something else I can not think off. Maybe my EE background is sort of in
> the way here. And I know that is rather hard to write correct SSD
> firmware, I seen lots of firmware upgrades to actually fix serious
> corner cases.
>
> Now the second thing is how hard does a drive lie when being told that
> the request write is synchronised. And Oke is only returned when data is
> in stable storage, and can not be lost.
>
> If there is a possibility that a sync write to a drive is not
> persistent, then that is a serious breach of the sync write contract.
> There will always be situations possible that these drives will lose data.
> And if data is no longer in the journal, because the writing process
> thinks the data is on stable storage it has deleted the data from the
> journal. In this case that data is permanently lost.
>
> Now you have a second chance (even a third) with Ceph, because data is
> stored multiple times. And you can go to another OSD and try to get it
> back.
>
> --WjW
>

I'm not disagreeing per se.


I think the main point I'm trying to address is - as long as the backing
OSD isn't egregiously handling large amounts of writes and it has a good
journal in front of it (that properly handles O_DSYNC [not D_SYNC as
Sebastien's article states]), it is unlikely inconsistencies will occur
upon a crash and subsequent restart.

Therefore - while not ideal to rely on journals to maintain consistency,
that is what they are there for. There is a situation where
"consumer-grade" SSDs could be used as OSDs. While not ideal, it can and
has been done before, and may be preferable to tossing out $500k of SSDs
(Seen it firsthand!)



>
> >
> > On Mon, Jan 9, 2017 at 2:56 PM, Brian Andrus <brian.and...@dreamhost.com
> > <mailto:brian.and...@dreamhost.com>> wrote:
> >
> > Hi Willem, the SSDs are probably fine for backing OSDs, it's the
> > O_DSYNC writes they tend to lie about.
> >
> > They may have a failure rate higher than enterprise-grade SSDs, but
> > are otherwise suitable for use as OSDs if journals are placed
> elsewhere.
> >
> > On Mon, Jan 9, 2017 at 2:39 PM, Willem Jan Withagen <w...@digiware.nl
> > <mailto:w...@digiware.nl>> wrote:
> >
> > On 9-1-2017 18:46, Oliver Humpage wrote:
> > >
> > >> Why would you still be using journals when running fully OSDs
> on
> > >> SSDs?
> > >
> > > In our case, we use cheaper large SSDs for the data (Samsung
> 850 Pro
> > > 2TB), whose performance is excellent in the cluster, but as
> has been
> > > pointed out in this thread can lose data if power is suddenly
> > > removed.
> > >
> > > We therefore put journals onto SM863 SSDs (1 journal SSD per 3
> OSD
> > > SSDs), which are enterprise quality and have power outage
> protection.
> > > This seems to balance speed, capacity, reliability and budget
> fairly
> > > well.
> >
> > This would make me feel very uncomfortable.
> >
> > So you have a reliable journal, so upto there thing do work:
> >   Once in the journal you data is safe.
> >
> > But then you async transfer the data to disk. And that is an SSD
> > that
> > lies to you? It will tell you that the data is written. But if
> > you pull
> > the power, then it turns out that the data is not really stored.
> >
> > And then the only way to get the data consistent again, is to
> > (deep)scrub.
> >
> > Not a very appealing lookout??
> >
> > --WjW
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com <mailto:ceph-

Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release

2017-01-09 Thread Brian Andrus

Sorry for spam... I meant D_SYNC.

On Mon, Jan 9, 2017 at 2:56 PM, Brian Andrus <brian.and...@dreamhost.com>
wrote:

> Hi Willem, the SSDs are probably fine for backing OSDs, it's the O_DSYNC
> writes they tend to lie about.
>
> They may have a failure rate higher than enterprise-grade SSDs, but are
> otherwise suitable for use as OSDs if journals are placed elsewhere.
>
> On Mon, Jan 9, 2017 at 2:39 PM, Willem Jan Withagen <w...@digiware.nl>
> wrote:
>
>> On 9-1-2017 18:46, Oliver Humpage wrote:
>> >
>> >> Why would you still be using journals when running fully OSDs on
>> >> SSDs?
>> >
>> > In our case, we use cheaper large SSDs for the data (Samsung 850 Pro
>> > 2TB), whose performance is excellent in the cluster, but as has been
>> > pointed out in this thread can lose data if power is suddenly
>> > removed.
>> >
>> > We therefore put journals onto SM863 SSDs (1 journal SSD per 3 OSD
>> > SSDs), which are enterprise quality and have power outage protection.
>> > This seems to balance speed, capacity, reliability and budget fairly
>> > well.
>>
>> This would make me feel very uncomfortable.
>>
>> So you have a reliable journal, so upto there thing do work:
>>   Once in the journal you data is safe.
>>
>> But then you async transfer the data to disk. And that is an SSD that
>> lies to you? It will tell you that the data is written. But if you pull
>> the power, then it turns out that the data is not really stored.
>>
>> And then the only way to get the data consistent again, is to (deep)scrub.
>>
>> Not a very appealing lookout??
>>
>> --WjW
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> --
> Brian Andrus
> Cloud Systems Engineer
> DreamHost, LLC
>



-- 
Brian Andrus
Cloud Systems Engineer
DreamHost, LLC
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release

2017-01-09 Thread Brian Andrus

Hi Willem, the SSDs are probably fine for backing OSDs, it's the O_DSYNC
writes they tend to lie about.

They may have a failure rate higher than enterprise-grade SSDs, but are
otherwise suitable for use as OSDs if journals are placed elsewhere.

On Mon, Jan 9, 2017 at 2:39 PM, Willem Jan Withagen <w...@digiware.nl> wrote:

> On 9-1-2017 18:46, Oliver Humpage wrote:
> >
> >> Why would you still be using journals when running fully OSDs on
> >> SSDs?
> >
> > In our case, we use cheaper large SSDs for the data (Samsung 850 Pro
> > 2TB), whose performance is excellent in the cluster, but as has been
> > pointed out in this thread can lose data if power is suddenly
> > removed.
> >
> > We therefore put journals onto SM863 SSDs (1 journal SSD per 3 OSD
> > SSDs), which are enterprise quality and have power outage protection.
> > This seems to balance speed, capacity, reliability and budget fairly
> > well.
>
> This would make me feel very uncomfortable.
>
> So you have a reliable journal, so upto there thing do work:
>   Once in the journal you data is safe.
>
> But then you async transfer the data to disk. And that is an SSD that
> lies to you? It will tell you that the data is written. But if you pull
> the power, then it turns out that the data is not really stored.
>
> And then the only way to get the data consistent again, is to (deep)scrub.
>
> Not a very appealing lookout??
>
> --WjW
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Brian Andrus
Cloud Systems Engineer
DreamHost, LLC
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Cluster pause - possible consequences

2017-01-04 Thread Brian Andrus

On Mon, Jan 2, 2017 at 6:46 AM, Wido den Hollander <w...@42on.com> wrote:

>
> > Op 2 januari 2017 om 15:43 schreef Matteo Dacrema <mdacr...@enter.eu>:
> >
> >
> > Increasing pg_num will lead to several slow requests and cluster freeze,
> but  due to creating pgs operation , for what I’ve seen until now.
> > During the creation period all the request are frozen , and the creation
> period take a lot of time even for 128 pgs.
> >
> > I’ve observed that during creation period most of the OSD goes at 100%
> of their performance capacity. I think that without operation running in
> the cluster I’ll be able to upgrade pg_num quickly without causing down
> time several times.
> >
>
> First, slowly increase pg_num to the number you want, then increase
> pgp_num in small baby steps as well.
>
> Wido
>

As Wido mentioned, low+slow is the way to go for production environments.
increase in small increments.

pg_num increases should be fairly transparent to client IO, but test first
by increasing your pool in increasing amounts. pgp_num increase will cause
client interruption in a lot of cases, so this is what you'll need to be
wary of.

Here's some select logic from a quick and dirty script I wrote to do the
last PG increase job, maybe it will help in your endeavors:

https://gist.github.com/oddomatik/7cca9b64d7b13d17e800cc35894037ac


>
> > Matteo
> >
> > > Il giorno 02 gen 2017, alle ore 15:02, c...@jack.fr.eu.org ha scritto:
> > >
> > > Well, as the doc said:
> > >> Set or clear the pause flags in the OSD map. If set, no IO requests
> will be sent to any OSD. Clearing the flags via unpause results in
> resending pending requests.
> > > If you do that on a production cluster, that means your cluster will no
> > > longer be in production :)
> > >
> > > Depending on your needs, but ..
> > > Maybe you want do this operation as fast as possible
> > > Or maybe you want to make that operation as transparent as possible,
> > > from a user point of view
> > >
> > > You may have a look at osd_recovery_op_priority &
> > > osd_client_op_priority, they might be interesting for you
> > >
> > > On 02/01/2017 14:37, Matteo Dacrema wrote:
> > >> Hi All,
> > >>
> > >> what happen if I set pause flag on a production cluster?
> > >> I mean, will all the request remain pending/waiting or all the
> volumes attached to the VMs will become read-only?
> > >>
> > >> I need to quickly upgrade placement group number from 3072 to 8192 or
> better to 165336 and I think doing it without client operations will be
> much faster.
> > >>
> > >> Thanks
> > >> Regards
> > >> Matteo
> > >>
> > >>
> > >>
> > >>
> > >> ___
> > >> ceph-users mailing list
> > >> ceph-users@lists.ceph.com
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >>
> > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > > --
> > > Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato
> non infetto.
> > > Seguire il link qui sotto per segnalarlo come spam:
> > > http://mx01.enter.it/cgi-bin/learn-msg.cgi?id=9F3C956B85.A333A
> > >
> > >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Brian Andrus
Cloud Systems Engineer
DreamHost, LLC
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Pool Sizes

2017-01-04 Thread Brian Andrus

Think "many objects, few pools". The number of pools do not scale well
because of PG limitations. Keep a small number of pools with the proper
number of PGs. See this tool for pool sizing:

https://ceph.com/pgcalc

By default, the developers have chosen a 4MB object size for the built-in
clients. This is a sensible choice and will result in good performance for
most workloads, but can depend on what type of operations you most
frequently perform and how your client interacts with the cluster.

I will let some other folks chime in with firsthand experience, but I have
worked with pools containing billions of objects and observed them
functioning fine. A few issues I can foresee off the top of my head are
potential underlying filesystem limits (should be okay) and keeping cluster
operations to a minimum (resizing/deleting pools).

Since we're talking about scale, CERN's videos are interesting for
examining the current challenges in Ceph at scale. (mostly hardware
observations)

https://youtu.be/A_VojOZjJTY

Yahoo chose a "super-cluster" architecture to work around former
limitations with large clusters, but I do believe many of the findings
CERN/Yahoo have uncovered have been addressed in recent versions of Ceph,
or are being targeted by developers in upcoming versions.

https://yahooeng.tumblr.com/post/116391291701/yahoo-cloud-object-store-object-storage-at

On Sat, Dec 31, 2016 at 3:53 PM, Kent Borg <kentb...@borg.org> wrote:

> More newbie questions about librados...
>
> I am making design decisions now that I want to scale to really big sizes
> in the future, and so need to understand where size limits and performance
> bottlenecks come from. Ceph has a reputation for being able to scale to
> exabytes, but I don't see much on how one should sensibly get to such
> scales. Do I make big objects? Pools with lots of objects in them? Lots of
> pools? A pool that has a thousand objects of a megabyte each vs. a pool
> that has a million objects or a thousand bytes each: why should one take
> one approach and when should one take the other? How big can a pool get? Is
> a billion objects a lot, something that Ceph works to handle, or is it
> something Ceph thinks is no big deal? Is a trillion objects a lot? Is a
> million pools a lot? A billion pools? How many is "lots" for Ceph?
>
> I plan to accumulate data indefinitely, I plan to add cluster capacity on
> a regular schedule, I want performance that doesn't degrade with size.
>
> Where do things break down? What is the wrong way to scale Ceph?
>
> Thanks,
>
> -kb, the Kent who guesses putting all his data in a single xattr or single
> RADOS object would be the wrong way.
>
> P.S. Happy New Year!
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-- 
Brian Andrus
Cloud Systems Engineer
DreamHost, LLC
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] radosgw setup issue

2017-01-04 Thread Brian Andrus

-12-22 17:36:46.768578 7f084beeb9c0  1 -- 39.0.16.9:0/1011033520
> --> 39.0.16.7:6789/0 -- mon_get_version(what=osdmap handle=10) v1 -- ?+0
> 0x7f084c8f2aa0 con 0x7f084c8e9480
> >> 2016-12-22 17:36:46.768681 7f082a7fc700  1 -- 39.0.16.9:0/1011033520
> <== mon.0 39.0.16.7:6789/0 16  mon_get_version_reply(handle=10
> version=9506) v2  24+0+0 (2901705056 0 0) 0x7f0814001110 con
> 0x7f084c8e9480
> >> 2016-12-22 17:36:46.768724 7f084beeb9c0  1 -- 39.0.16.9:0/1011033520
> --> 39.0.16.7:6789/0 -- pool_op(create pool 0 auid 0 tid 1 name .rgw.root
> v0) v4 -- ?+0 0x7f084c8f4520 con 0x7f084c8e9480
> >> 2016-12-22 17:36:47.052029 7f082a7fc700  1 -- 39.0.16.9:0/1011033520
> <== mon.0 39.0.16.7:6789/0 17  pool_op_reply(tid 1 (34) Numerical
> result out of range v9507) v1  43+0+0 (366377631 0 0) 0x7f0814001110
> con 0x7f084c8e9480
> >
>
> Thanks for the response Orit.
>
> > what filesystem are you using? we longer support ext4
>
> OSDs are using XFS for Filestore.
>
> > Another option is version mismatch between rgw and the ods.
>
> Exact same version of ceph binaries are installed on OSD, MON and RGW
> nodes.
>
> Is there anything useful in the error messages?
>
> 2016-12-22 17:36:46.768314 7f084beeb9c0 10 failed to list objects
> pool_iterate_begin() returned r=-2
> 2016-12-22 17:36:46.768320 7f084beeb9c0 10 WARNING: store->list_zones()
> returned r=-2
>
> Is this the point where the failure has begun?
>
> As I see, the basic issue is, RGW is not able to create the needed pools
> on demand. I wish there was more detailed output regarding the Numerical
> result out of range issue.
> I am suspecting it may be related to the set of defaults used while
> creating pools automatically. Possibly the default crush rule.
>
> Thanks,
> Nitin
>
> >
> > Orit
> >
> >> 2016-12-22 17:36:47.052067 7f082a7fc700  1 -- 39.0.16.9:0/1011033520
> --> 39.0.16.7:6789/0 -- mon_subscribe({osdmap=9507}) v2 -- ?+0
> 0x7f0818022bb0 con 0x7f084c8e9480
> >> 2016-12-22 17:36:47.055809 7f082a7fc700  1 -- 39.0.16.9:0/1011033520
> <== mon.0 39.0.16.7:6789/0 18  osd_map(9507..9507 src has 8863..9507)
> v3  214+0+0 (1829214220 0 0) 0x7f0814001110 con 0x7f084c8e9480
> >> 2016-12-22 17:36:47.055858 7f084beeb9c0  0 ERROR:  storing info for
> 84f3cdd9-71d9-4d74-a6ba-c0e87d776a2b: (34) Numerical result out of range
> >> 2016-12-22 17:36:47.055869 7f084beeb9c0  0 create_default: error in
> create_default  zone params: (34) Numerical result out of range
> >> 2016-12-22 17:36:47.055876 7f084beeb9c0  0 failure in zonegroup
> create_default: ret -34 (34) Numerical result out of range
> >> 2016-12-22 17:36:47.055970 7f084beeb9c0  1 -- 39.0.16.9:0/1011033520
> mark_down 0x7f084c8e9480 -- 0x7f084c8ec0f0
> >> 2016-12-22 17:36:47.056169 7f084beeb9c0  1 -- 39.0.16.9:0/1011033520
> mark_down_all
> >> 2016-12-22 17:36:47.056263 7f084beeb9c0  1 -- 39.0.16.9:0/1011033520
> shutdown complete.
> >> 2016-12-22 17:36:47.056426 7f084beeb9c0 -1 Couldn't init storage
> provider (RADOS)
> >>
> >>
> >>
> >> I did not create the pools for rgw, as they get created automatically.
> few weeks back, I could setup RGW on jewel successfully. But this time I am
> not able to see any obvious issues which I can fix.
> >>
> >>
> >> [0] http://docs.ceph.com/docs/jewel/radosgw/config/
> >>
> >> Thanks in advance,
> >> Nitin
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Brian Andrus
Cloud Systems Engineer
DreamHost, LLC
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Unbalanced OSD's

2017-01-03 Thread Brian Andrus

On Mon, Jan 2, 2017 at 4:25 AM, Jens Dueholm Christensen <j...@ramboll.com>
wrote:

> On Friday, December 30, 2016 07:05 PM Brian Andrus wrote:
>
> > We have a set it and forget it cronjob setup once an hour to keep things
> a bit more balanced.
> >
> > 1 * * * * /bin/bash /home/briana/reweight_osd.sh 2>&1 | /usr/bin/logger
> -t ceph_reweight
> >
> > The script checks and makes sure cluster health is OK and no other
> rebalancing is going on. It will
> > also check the reported STDDEV from `ceph osd df` and if outside
> acceptable ranges executes a
> > gentle reweight.
>
> Would you mind sharing that script?


The script attached is a quick and dirty bash script, not perfect, blah
blah blah... consider it still under development but feel free to use it as
a base for your own environment needs.

https://gist.github.com/oddomatik/1e94c67f521ebd16c789e4cbe1d0a5d1

> The three parameters after the reweight-by-utilization are not well
> documented, but they are
> >
> > 103 - Select OSDs that are 3% above the average (default is 120 but we
> want a larger pool of OSDs to
> choose from to get an eventual tighter tolerance)
> > .010 - don't reweight any OSD more than this increment (keeps the impact
> low)
> > 10 - number of OSDs to select (to keep impact manageable)
>
> Ah! Thank you for that pointer.
>
> For the record the same arguments can be used for dry-runs of "ceph osd
> test-reweight-by-utilization ..."
> and correspond to these values in the output from
> test-reweight-by-utilization:
>
>   oload 120
>   max_change 0.05
>   max_change_osds 4
>
> The above values are the current defaults in Hammer (0.94.9), but can
> easily be changed to see the impact
> before running the actual rebalance..
>
> Regards,
> Jens Dueholm Christensen
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Brian Andrus
Cloud Systems Engineer
DreamHost, LLC
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Unbalanced OSD's

2016-12-30 Thread Brian Andrus

We have a set it and forget it cronjob setup once an hour to keep things a
bit more balanced.

1 * * * * /bin/bash /home/briana/reweight_osd.sh 2>&1 | /usr/bin/logger -t
ceph_reweight

The script checks and makes sure cluster health is OK and no other
rebalancing is going on. It will also check the reported STDDEV from `ceph
osd df` and if outside acceptable ranges executes a gentle reweight.

 ceph osd reweight-by-utilization 103 .015 10

It's definitely an "over time" kind of thing, but after a week we are
already seeing pretty good results. Pending OSD reboots, a few months from
now our cluster should be seeing quite a bit less difference in utilization.

The three parameters after the reweight-by-utilization are not well
documented, but they are

103 - Select OSDs that are 3% above the average (default is 120 but we want
a larger pool of OSDs to choose from to get an eventual tighter tolerance)
.010 - don't reweight any OSD more than this increment (keeps the impact
low)
10 - number of OSDs to select (to keep impact manageable)

Hope that helps.

On Fri, Dec 30, 2016 at 2:27 AM, Kees Meijs <k...@nefos.nl> wrote:

> Thanks, I'll try a manual reweight at first.
>
> Have a happy new year's eve (yes, I know it's a day early)!
>
> Regards,
> Kees
>
> On 30-12-16 11:17, Wido den Hollander wrote:
> > For this reason you can do a OSD reweight by running the 'ceph osd
> reweight-by-utilization' command or do it manually with 'ceph osd reweight
> X 0-1'
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-- 
Brian Andrus
Cloud Systems Engineer
DreamHost, LLC
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] How can I ask to Ceph Cluster to move blocks now when osd is down?

2016-12-27 Thread Brian Andrus

The OSD needs to be marked as "out" before data movement will occur. This
is to give it a chance to come back in.

By default as seen in config_opts.h (linked to in one of your other recent
threads), an OSD should be marked out 300 seconds after going down if it
isn't marked back up before then. You can adjust osd_down_out_interval as
desired. It looks like it's been defaulted to 600s in newer versions.

On Thu, Dec 22, 2016 at 2:54 AM, Stéphane Klein <cont...@stephane-klein.info
> wrote:

> Hi,
>
> How can I ask to Ceph Cluster to move blocks now when osd is down?
>
> Best regards,
> Stéphane
> --
> Stéphane Klein <cont...@stephane-klein.info>
> blog: http://stephane-klein.info
> cv : http://cv.stephane-klein.info
> Twitter: http://twitter.com/klein_stephane
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>

-- 
Brian Andrus
Cloud Systems Engineer
DreamHost, LLC
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] ceph and rsync

2016-12-16 Thread Brian ::

The fact that you are all SSD I would do exactly what Wido said -
gracefully remove the OSD and gracefully bring up the OSD on the new
SSD.

Let Ceph do what its designed to do. The rsync idea looks great on
paper - not sure what issues you will run into in practise.


On Fri, Dec 16, 2016 at 12:38 PM, Alessandro Brega
 wrote:
> 2016-12-16 10:19 GMT+01:00 Wido den Hollander :
>>
>>
>> > Op 16 december 2016 om 9:49 schreef Alessandro Brega
>> > :
>> >
>> >
>> > 2016-12-16 9:33 GMT+01:00 Wido den Hollander :
>> >
>> > >
>> > > > Op 16 december 2016 om 9:26 schreef Alessandro Brega <
>> > > alessandro.bre...@gmail.com>:
>> > > >
>> > > >
>> > > > Hi guys,
>> > > >
>> > > > I'm running a ceph cluster using 0.94.9-1trusty release on XFS for
>> > > > RBD
>> > > > only. I'd like to replace some SSDs because they are close to their
>> > > > TBW.
>> > > >
>> > > > I know I can simply shutdown the OSD, replace the SSD, restart the
>> > > > OSD
>> > > and
>> > > > ceph will take care of the rest. However I don't want to do it this
>> > > > way,
>> > > > because it leaves my cluster for the time of the rebalance/
>> > > > backfilling
>> > > in
>> > > > a degraded state.
>> > > >
>> > > > I'm thinking about this process:
>> > > > 1. keep old OSD running
>> > > > 2. copy all data from current OSD folder to new OSD folder (using
>> > > > rsync)
>> > > > 3. shutdown old OSD
>> > > > 4. redo step 3 to update to the latest changes
>> > > > 5. restart OSD with new folder
>> > > >
>> > > > Are there any issues with this approach? Do I need any special rsync
>> > > flags
>> > > > (rsync -avPHAX --delete-during)?
>> > > >
>> > >
>> > > Indeed X for transferring xattrs, but also make sure that the
>> > > partitions
>> > > are GPT with the proper GUIDs.
>> > >
>> > > I would never go for this approach in a running setup. Since it's a
>> > > SSD
>> > > cluster I wouldn't worry about the rebalance and just have Ceph do the
>> > > work
>> > > for you.
>> > >
>> > >
>> > Why not - if it's completely safe. It's much faster (local copy),
>> > doesn't
>> > put load on the network (local copy), much safer (2-3 minutes instead of
>> > 1-2 hours degraded time (2TB SSD)), and it's really simple (2 rsync
>> > commands). Thank you.
>> >
>>
>> I wouldn't say it is completely safe, hence my remark. If you copy, indeed
>> make sure you copy all the xattrs, but also make sure the partitions tables
>> match.
>>
>> That way it should work, but it's not a 100% guarantee.
>>
>
> Ok, thanks!  Can a ceph dev confirm? I do not want to loose any data ;)
>
> Alessandro
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] CEPH mirror down again

2016-11-25 Thread Andrus, Brian Contractor

Hmm. Apparently download.ceph.com = us-west.ceph.com
And there is no repomd.xml on us-east.ceph.com

This seems to happen a little too often for something that is stable and 
released. Makes it seem like the old BBS days of “I want to play DOOM, so I’m 
shutting the services down”


Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238




From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Vy 
Nguyen Tan
Sent: Friday, November 25, 2016 7:28 PM
To: Joao Eduardo Luis <j...@suse.de>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] CEPH mirror down again

Hi Matt and Joao,

Thank you for your information. I am installing Ceph with alternative mirror 
(ceph-deploy install --repo-url http://hk.ceph.com/rpm-jewel/el7/ --gpg-url 
http://hk.ceph.com/keys/release.asc {host}) and everything work again.

On Sat, Nov 26, 2016 at 10:12 AM, Joao Eduardo Luis 
<j...@suse.de<mailto:j...@suse.de>> wrote:
On 11/26/2016 03:05 AM, Vy Nguyen Tan wrote:
Hello,

I want to install CEPH on new nodes but I can't reach CEPH repo, It
seems the repo are broken. I am using CentOS 7.2 and ceph-deploy 1.5.36.

Patrick sent an email to the list informing this would happen back on Nov 18th; 
quote:
Due to Dreamhost shutting down the old DreamCompute cluster in their
US-East 1 region, we are in the process of beginning the migration of
Ceph infrastructure.  We will need to move 
download.ceph.com<http://download.ceph.com>,
tracker.ceph.com<http://tracker.ceph.com>, and 
docs.ceph.com<http://docs.ceph.com> to their US-East 2 region.

The current plan is to move the VMs on 25 NOV 2016 throughout the day.
Expect them to be down intermittently.

  -Joao

P.S.: also, it's Ceph; not CEPH.

___
ceph-users mailing list
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] how possible is that ceph cluster crash

2016-11-19 Thread Brian ::

HI Lionel,

Mega Ouch - I've recently seen the act of measuring power consumption
in a data centre (they clamp a probe onto the cable for an AMP reading
seemingly) take out a cabinet which had *redundant* power feeds - so
anything is possible I guess.

Regards
Brian


On Sat, Nov 19, 2016 at 11:20 AM, Lionel Bouton
<lionel-subscript...@bouton.name> wrote:
> Le 19/11/2016 à 00:52, Brian :: a écrit :
>> This is like your mother telling not to cross the road when you were 4
>> years of age but not telling you it was because you could be flattened
>> by a car :)
>>
>> Can you expand on your answer? If you are in a DC with AB power,
>> redundant UPS, dual feed from the electric company, onsite generators,
>> dual PSU servers, is it still a bad idea?
>
> Yes it is.
>
> In such a datacenter where we have a Ceph cluster there was a complete
> shutdown because of a design error : the probes used by the solution
> responsible for starting and stopping the generators were installed
> before the breakers installed on the feeds. After a blackout where
> generators kicked in the breakers opened due to a surge when power was
> restored. The generators were stopped because power was restored, and
> the UPS systems failed 3 minutes later. Closing the breakers couldn't be
> done in time (you don't approach them without being heavily protected,
> putting on the suit to protect you needs more time than simply closing
> the breaker).
>
> There's no such thing as uninterruptible power supply.
>
> Best regards,
>
> Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] how possible is that ceph cluster crash

2016-11-18 Thread Brian ::

This is like your mother telling not to cross the road when you were 4
years of age but not telling you it was because you could be flattened
by a car :)

Can you expand on your answer? If you are in a DC with AB power,
redundant UPS, dual feed from the electric company, onsite generators,
dual PSU servers, is it still a bad idea?




On Fri, Nov 18, 2016 at 6:52 PM, Samuel Just  wrote:
> Never *ever* use nobarrier with ceph under *any* circumstances.  I
> cannot stress this enough.
> -Sam
>
> On Fri, Nov 18, 2016 at 10:39 AM, Craig Chi  wrote:
>> Hi Nick and other Cephers,
>>
>> Thanks for your reply.
>>
>>> 2) Config Errors
>>> This can be an easy one to say you are safe from. But I would say most
>>> outages and data loss incidents I have seen on the mailing
>>> lists have been due to poor hardware choice or configuring options such as
>>> size=2, min_size=1 or enabling stuff like nobarriers.
>>
>> I am wondering the pros and cons of the nobarrier option used by Ceph.
>>
>> It is well known that nobarrier is dangerous when power outage happens, but
>> if we already have replicas in different racks or PDUs, will Ceph reduce the
>> risk of data lost with this option?
>>
>> I have seen many performance tuning articles providing nobarrier option in
>> xfs, but there are not many of then mention the trade-off of nobarrier.
>>
>> Is it really unacceptable to use nobarrier in production environment? I will
>> be much grateful if you guys are willing to share any experiences about
>> nobarrier and xfs.
>>
>> Sincerely,
>> Craig Chi (Product Developer)
>> Synology Inc. Taipei, Taiwan. Ext. 361
>>
>> On 2016-11-17 05:04, Nick Fisk  wrote:
>>
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>> Pedro Benites
>>> Sent: 16 November 2016 17:51
>>> To: ceph-users@lists.ceph.com
>>> Subject: [ceph-users] how possible is that ceph cluster crash
>>>
>>> Hi,
>>>
>>> I have a ceph cluster with 50 TB, with 15 osds, it is working fine for one
>>> year and I would like to grow it and migrate all my old
>> storage,
>>> about 100 TB to ceph, but I have a doubt. How possible is that the cluster
>>> fail and everything went very bad?
>>
>> Everything is possible, I think there are 3 main risks
>>
>> 1) Hardware failure
>> I would say Ceph is probably one of the safest options in regards to
>> hardware failures, certainly if you start using 4TB+ disks.
>>
>> 2) Config Errors
>> This can be an easy one to say you are safe from. But I would say most
>> outages and data loss incidents I have seen on the mailing
>> lists have been due to poor hardware choice or configuring options such as
>> size=2, min_size=1 or enabling stuff like nobarriers.
>>
>> 3) Ceph Bugs
>> Probably the rarest, but potentially the most scary as you have less
>> control. They do happen and it's something to be aware of
>>
>> How reliable is ceph?
>>> What is the risk about lose my data.? is necessary backup my data?
>>
>> Yes, always backup your data, no matter solution you use. Just like RAID !=
>> Backup, neither does ceph.
>>
>>>
>>> Regards.
>>> Pedro.
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>>
>> Sent from Synology MailPlus
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] rgw print continue and civetweb

2016-11-14 Thread Brian Andrus

Hi William,

"rgw print continue = true" is an apache specific setting, as mentioned
here:

http://docs.ceph.com/docs/master/install/install-ceph-
gateway/#migrating-from-apache-to-civetweb

I do not believe it is needed for civetweb. For documentation, you can see
or change the version branch in the URL to make sure you are consuming the
latest version.


On Sun, Nov 13, 2016 at 8:03 PM, William Josefsson <
william.josef...@gmail.com> wrote:

> Hi list, can anyone please clarify if the default 'rgw print continue
> = true', is supported by civetweb?
>
> I'm using radosgw with civetweb, and this document (may be outdated?)
> mentions to install apache,
> http://docs.ceph.com/docs/hammer/install/install-ceph-gateway/. This
> ticket seems to keep 'print continue' with civetweb enabled.
> http://tracker.ceph.com/issues/12640
>
> Can anyone please clarify whether civetweb support the default
> 100-continue setting? thx will
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Brian Andrus
Cloud Systems Engineer
DreamHost, LLC
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Instance filesystem corrupt

2016-10-27 Thread Brian ::

What is the issue exactly?


On Fri, Oct 28, 2016 at 2:47 AM,  wrote:

> I think this issue may not related to your poor hardware.
>
>
>
> Our cluster has 3 Ceph monitor and 4 OSD.
>
>
>
> Each server has
>
> 2 cpu ( Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz ) , 32 GB memory
>
> OSD nodes has 2 SSD for journal disks  and 8 SATA disks ( 6TB / 7200 rpm )
>
> ALL of them were connected to each other by 4 x 10Gbps cable ( 802.3 ad )
>
>
>
> The utilization of our Cpeh is only 13% , most of time the IOPS was kept
> under 1500.
>
>
>
> We still getting this issue…..
>
>
>
>
>
> *From:* Ahmed Mostafa [mailto:ahmedmostafa...@gmail.com]
> *Sent:* Friday, October 28, 2016 6:30 AM
> *To:* dilla...@redhat.com
> *Cc:* Keynes Lee/WHQ/Wistron ;
> ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] Instance filesystem corrupt
>
>
>
> So i couldn't actually wait till the morning
>
>
>
> I sat rbd cache to false and tried to create the same number of instances,
> but the same issue happened again.
>
>
>
> I want to note, that if i rebooted any of the virtual machines that has
> this issue, it works without any problem afterwards.
>
>
>
> Does this mean that over-utilization could be the cause of my problem ?
> The cluster i have have bad hardware and this is the only logical
> explanation i can reach .
>
>
>
> By bad hardware i mean core-i5 processors for instance, i can see the %wa
> reaching 50-60% too.
>
>
>
> Thank you
>
>
>
>
>
> On Thu, Oct 27, 2016 at 4:13 PM, Jason Dillaman 
> wrote:
>
> The only effect I could see out of a highly overloaded system would be
> that the OSDs might appear to become unresponsive to the VMs. Are any of
> you using cache tiering or librbd cache? For the latter, there was one
> issue [1] that can result in read corruption that affects hammer and prior
> releases.
>
>
>
> [1] http://tracker.ceph.com/issues/16002
>
>
>
> On Thu, Oct 27, 2016 at 1:34 AM, Ahmed Mostafa 
> wrote:
>
> This is more or less the same bahaviour i have in ky environment
>
>
>
> By any chance is anyone running their osds and their hypervisors on the
> same machine ?
>
>
>
> And could high workload, like starting 40 - 60 or above virtual machines
> have an effect on this problem ?
>
>
>
>
> On Thursday, 27 October 2016,  wrote:
>
>
>
> Most of filesystem corrupt causes instances crashed, we saw that after a
> shutdown / restart
>
> ( triggered by OpenStack portal  buttons or triggered by OS commands in
> Instances )
>
>
>
> Some are early-detected, we saw filesystem errors in OS logs on instances.
>
> Then we make a filesystem check ( FSCK / chkdsk ) immediately, issue fixed.
>
>
>
> [image: cid:image007.jpg@01D1747D.DB260110]
>
> *Keynes  Lee**李* *俊* *賢*
>
> Direct:
>
> +886-2-6612-1025
>
> Mobile:
>
> +886-9-1882-3787
>
> Fax:
>
> +886-2-6612-1991
>
>
>
> E-Mail:
>
> keynes_...@wistron.com
>
>
>
>
>
> *From:* Jason Dillaman [mailto:jdill...@redhat.com ]
> *Sent:* Wednesday, October 26, 2016 9:38 PM
> *To:* Keynes Lee/WHQ/Wistron 
> *Cc:* will.bo...@target.com; ceph-users 
> *Subject:* Re: [ceph-users] [EXTERNAL] Instance filesystem corrupt
>
>
>
> I am not aware of any similar reports against librbd on Firefly. Do you
> use any configuration overrides? Does the filesystem corruption appears
> while the instances are running or only after a shutdown / restart of the
> instance?
>
>
>
> On Wed, Oct 26, 2016 at 12:46 AM,  wrote:
>
> No , we are using Firefly (0.80.7).
>
> As we are using HPE Helion OpenStack 2.1.5, and what the version is was
> embedded is Firefly.
>
>
>
> An upgrade was planning, but should will not happen  soon.
>
>
>
>
>
>
>
>
>
>
>
> *From:* Will.Boege [mailto:will.bo...@target.com ]
> *Sent:* Wednesday, October 26, 2016 12:03 PM
> *To:* Keynes Lee/WHQ/Wistron ;
> ceph-users@lists.ceph.com
> *Subject:* Re: [EXTERNAL] [ceph-users] Instance filesystem corrupt
>
>
>
> Just out of curiosity, did you recently upgrade to Jewel?
>
>
>
> *From: *ceph-users  on behalf of "
> keynes_...@wistron.com" 
> *Date: *Tuesday, October 25, 2016 at 10:52 PM
> *To: *"ceph-users@lists.ceph.com" 
> *Subject: *[EXTERNAL] [ceph-users] Instance filesystem corrupt
>
>
>
> We are using OpenStack + Ceph.
>
> Recently we found a lot of filesystem corrupt incident on instances.
>
> Some of them are correctable, fixed by fsck, but the others have no luck,
> just corrupt and can never start up again.
>
>
>
> We found this issue on vary operation systems of instances. They are
>
> Redhat4 / CentOS 7 / Windows 2012
>
>
>
> Could someone please advise us some troubleshooting direction ?
>
>
>
>
>
> [image: id:image007.jpg@01D1747D.DB260110]
>
> *Keynes  Lee**李* *俊* *賢*
>
>

Re: [ceph-users] ceph website problems?

2016-10-11 Thread Brian ::

Looks like they are having major challenges getting that ceph cluster
running again.. Still down.

On Tuesday, October 11, 2016, Ken Dreyer  wrote:
> I think this may be related:
>
http://www.dreamhoststatus.com/2016/10/11/dreamcompute-us-east-1-cluster-service-disruption/
>
> On Tue, Oct 11, 2016 at 5:57 AM, Sean Redmond 
wrote:
>> Hi,
>>
>> Looks like the ceph website and related sub domains are giving errors for
>> the last few hours.
>>
>> I noticed the below that I use are in scope.
>>
>> http://ceph.com/
>> http://docs.ceph.com/
>> http://download.ceph.com/
>> http://tracker.ceph.com/
>>
>> Thanks
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] too many PGs per OSD (326 > max 300) warning when ALL PGs are 256

2016-10-10 Thread Andrus, Brian Contractor

David,
Thanks for the info. I am getting an understanding of how this works.
Now I used the ceph-deploy tool to create the rgw pools. It seems then that the 
tool isn’t the best at creating the pools necessary for an rgw gateway as it 
made all of them the default sizes for pg_num/pgp_num
Perhaps, then, it is wiser to have a very low default for those so the 
ceph-deploy tool doesn assign a large value to something that will merely hold 
control or other metadata?


Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238



From: David Turner [mailto:david.tur...@storagecraft.com]
Sent: Monday, October 10, 2016 10:33 AM
To: Andrus, Brian Contractor <bdand...@nps.edu>; ceph-users@lists.ceph.com
Subject: RE: [ceph-users] too many PGs per OSD (326 > max 300) warning when ALL 
PGs are 256

You have 11 pools with 256 pgs, 1 pool with 128 and 1 pool with 64... that's 
3,008 pgs in your entire cluster.  Multiply that number by your replica size 
and divide by how many OSDs you have in your cluster and you'll see what your 
average PGs per osd is.  Based on the replica size you shared, that's a total 
number of 6,528 copies of PGs to be divided amongst the OSDS in your cluster.  
Your cluster will be in warning if that number is greater than 300 per OSD, 
like you're seeing.  When designing your cluster and how many pools, pgs, and 
replica size you will be setting, please consult the pgcalc tool found here 
http://ceph.com/pgcalc/.  You cannot reduce the number of PGs in a pool, so the 
easiest way to resolve this issue is mostly likely going to be destroying pools 
and recreating them with the proper number of PGs.

The PG number should be based on what percentage of the data in your cluster 
will be in this pool.  If I'm planning to have about 1024 PGs total in my 
cluster and I give 256 PGs to 4 different pools, then what I'm saying is that 
each of those 4 pools will have the exact same amount of data as each other.  
On the other hand, if I believe that 1 of those pools will have 90% of the data 
and the other 3 pools will have very little data, then I'll probably give the 
larger pool 1024 PGs and the rest of them 64 PGs (or less depending on what I'm 
aiming for).  It is beneficial to keep the pg_num and pgp_num counts to base 2 
numbers.

[cid:image001.jpg@01D222E8.1655BB40]<https://storagecraft.com>

David Turner | Cloud Operations Engineer | StorageCraft Technology 
Corporation<https://storagecraft.com>
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943


If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.



From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Andrus, Brian 
Contractor [bdand...@nps.edu]
Sent: Monday, October 10, 2016 11:14 AM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: [ceph-users] too many PGs per OSD (326 > max 300) warning when ALL PGs 
are 256
Ok, this is an odd one to me…
I have several pools, ALL of them are set with pg_num and pgp_num = 256. Yet, 
the warning about too many PGs per OSD is showing up.
Here are my pools:

pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins 
pg_num 256 pgp_num 256 last_change 134 flags hashpspool stripe_width 0
pool 1 'cephfs_data' replicated size 3 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 256 pgp_num 256 last_change 203 flags hashpspool 
crash_replay_interval 45 stripe_width 0
pool 2 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 0 
object_hash rjenkins pg_num 64 pgp_num 64 last_change 196 flags hashpspool 
stripe_width 0
pool 3 'vmimages' replicated size 2 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 128 pgp_num 128 last_change 213 flags hashpspool stripe_width 0
removed_snaps [1~3]
pool 25 '.rgw.root' replicated size 2 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 256 pgp_num 256 last_change 6199 flags hashpspool stripe_width 0
pool 26 'default.rgw.control' replicated size 2 min_size 1 crush_ruleset 0 
object_hash rjenkins pg_num 256 pgp_num 256 last_change 6202 flags hashpspool 
stripe_width 0
pool 27 'default.rgw.data.root' replicated size 2 min_size 1 crush_ruleset 0 
object_hash rjenkins pg_num 256 pgp_num 256 last_change 6204 flags hashpspool 
stripe_width 0
pool 28 'default.rgw.gc' replicated size 2 min_size 1 crush_ruleset 0 
object_hash rjenkins pg_num 256 pgp_num 256 last_change 6205 flags hashpspool 
stripe_width 0
pool 29 'default.rgw.log' replicated size 2 min_size 1 crush_ruleset 0 
object_hash rjenkins pg_num 256 pgp_num 256 last_change 6206 flags hashpspool 
stripe_width 0

[ceph-users] too many PGs per OSD (326 > max 300) warning when ALL PGs are 256

2016-10-10 Thread Andrus, Brian Contractor

Ok, this is an odd one to me...
I have several pools, ALL of them are set with pg_num and pgp_num = 256. Yet, 
the warning about too many PGs per OSD is showing up.
Here are my pools:

pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins 
pg_num 256 pgp_num 256 last_change 134 flags hashpspool stripe_width 0
pool 1 'cephfs_data' replicated size 3 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 256 pgp_num 256 last_change 203 flags hashpspool 
crash_replay_interval 45 stripe_width 0
pool 2 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 0 
object_hash rjenkins pg_num 64 pgp_num 64 last_change 196 flags hashpspool 
stripe_width 0
pool 3 'vmimages' replicated size 2 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 128 pgp_num 128 last_change 213 flags hashpspool stripe_width 0
removed_snaps [1~3]
pool 25 '.rgw.root' replicated size 2 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 256 pgp_num 256 last_change 6199 flags hashpspool stripe_width 0
pool 26 'default.rgw.control' replicated size 2 min_size 1 crush_ruleset 0 
object_hash rjenkins pg_num 256 pgp_num 256 last_change 6202 flags hashpspool 
stripe_width 0
pool 27 'default.rgw.data.root' replicated size 2 min_size 1 crush_ruleset 0 
object_hash rjenkins pg_num 256 pgp_num 256 last_change 6204 flags hashpspool 
stripe_width 0
pool 28 'default.rgw.gc' replicated size 2 min_size 1 crush_ruleset 0 
object_hash rjenkins pg_num 256 pgp_num 256 last_change 6205 flags hashpspool 
stripe_width 0
pool 29 'default.rgw.log' replicated size 2 min_size 1 crush_ruleset 0 
object_hash rjenkins pg_num 256 pgp_num 256 last_change 6206 flags hashpspool 
stripe_width 0
pool 30 'default.rgw.users.uid' replicated size 2 min_size 1 crush_ruleset 0 
object_hash rjenkins pg_num 256 pgp_num 256 last_change 6211 flags hashpspool 
stripe_width 0
pool 31 'default.rgw.meta' replicated size 2 min_size 1 crush_ruleset 0 
object_hash rjenkins pg_num 256 pgp_num 256 last_change 6214 flags hashpspool 
stripe_width 0
pool 32 'default.rgw.buckets.index' replicated size 2 min_size 1 crush_ruleset 
0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 6216 flags hashpspool 
stripe_width 0
pool 33 'default.rgw.buckets.data' replicated size 2 min_size 1 crush_ruleset 0 
object_hash rjenkins pg_num 256 pgp_num 256 last_change 6218 flags hashpspool 
stripe_width 0


so why would the warning show up, and how do I get it to go away and stay away?


Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] The principle of config Federated Gateways

2016-10-05 Thread Brian Chang-Chien

Hi all

I have a question about config federated gateway

Why only sync data and metadata between zones in the same regions

and only sync metadata between zones in the different regions

In different regions, can't sync zone data , can tell me any concern?

Thx
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] too many PGs per OSD when pg_num = 256??

2016-09-22 Thread Andrus, Brian Contractor

Hmm. Something happened then. I only have 20 OSDs. What may cause that?

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238



From: David Turner [mailto:david.tur...@storagecraft.com]
Sent: Thursday, September 22, 2016 10:04 AM
To: Andrus, Brian Contractor <bdand...@nps.edu>; ceph-users@lists.ceph.com
Subject: RE: too many PGs per OSD when pg_num = 256??

So you have 3,520 pgs.  Assuming all of your pools are using 3 replicas, and 
using the 377 pgs/osd in your health_warn state, that would mean your cluster 
has 28 osds.

When you calculate how many pgs a pool should have, you need to account for how 
many osds you have, how much percentage of data each pool will account for out 
of your entire cluster, and go from there.  The ceph PG Calc tool will be an 
excellent resource to help you figure out how many pgs each pool should have.  
It takes all of those factors into account.  http://ceph.com/pgcalc/

[cid:image001.jpg@01D214B9.781F53F0]<https://storagecraft.com>

David Turner | Cloud Operations Engineer | StorageCraft Technology 
Corporation<https://storagecraft.com>
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943


If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.


____
From: Andrus, Brian Contractor [bdand...@nps.edu]
Sent: Thursday, September 22, 2016 10:41 AM
To: David Turner; ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: too many PGs per OSD when pg_num = 256??
David,
I have 15 pools:
# ceph osd lspools|sed 's/,/\n/g'
0 rbd
1 cephfs_data
2 cephfs_metadata
3 vmimages
14 .rgw.root
15 default.rgw.control
16 default.rgw.data.root
17 default.rgw.gc
18 default.rgw.log
19 default.rgw.users.uid
20 default.rgw.users.keys
21 default.rgw.users.email
22 default.rgw.meta
23 default.rgw.buckets.index
24 default.rgw.buckets.data
# ceph -s | grep -Eo '[0-9]+ pgs'
3520 pgs



Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238



From: David Turner [mailto:david.tur...@storagecraft.com]
Sent: Thursday, September 22, 2016 8:57 AM
To: Andrus, Brian Contractor <bdand...@nps.edu<mailto:bdand...@nps.edu>>; 
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: too many PGs per OSD when pg_num = 256??

Forgot the + for the regex.

ceph -s | grep -Eo '[0-9]+ pgs'

[cid:image001.jpg@01D214B9.781F53F0]<https://storagecraft.com>

David Turner | Cloud Operations Engineer | StorageCraft Technology 
Corporation<https://storagecraft.com>
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943


If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.



From: David Turner
Sent: Thursday, September 22, 2016 9:53 AM
To: Andrus, Brian Contractor; 
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: too many PGs per OSD when pg_num = 256??
How many pools do you have?  How many pgs does your total cluster have, not 
just your rbd pool?

ceph osd lspools
ceph -s | grep -Eo '[0-9] pgs'

My guess is that you have other pools with pgs and the cumulative total of pgs 
per osd is too many.

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Andrus, Brian 
Contractor [bdand...@nps.edu]
Sent: Thursday, September 22, 2016 9:33 AM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: [ceph-users] too many PGs per OSD when pg_num = 256??
All,

I am getting a warning:

 health HEALTH_WARN
too many PGs per OSD (377 > max 300)
pool cephfs_data has many more objects per pg than average (too few 
pgs?)

yet, when I check the settings:
# ceph osd pool get rbd pg_num
pg_num: 256
# ceph osd pool get rbd pgp_num
pgp_num: 256

How does something like this happen?
I did create a radosgw several weeks ago and have put a single file in it for 
testing, but that is it. It only started giving the warning a couple days ago.

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] too many PGs per OSD when pg_num = 256??

2016-09-22 Thread Andrus, Brian Contractor

David,
I have 15 pools:
# ceph osd lspools|sed 's/,/\n/g'
0 rbd
1 cephfs_data
2 cephfs_metadata
3 vmimages
14 .rgw.root
15 default.rgw.control
16 default.rgw.data.root
17 default.rgw.gc
18 default.rgw.log
19 default.rgw.users.uid
20 default.rgw.users.keys
21 default.rgw.users.email
22 default.rgw.meta
23 default.rgw.buckets.index
24 default.rgw.buckets.data
# ceph -s | grep -Eo '[0-9]+ pgs'
3520 pgs



Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238



From: David Turner [mailto:david.tur...@storagecraft.com]
Sent: Thursday, September 22, 2016 8:57 AM
To: Andrus, Brian Contractor <bdand...@nps.edu>; ceph-users@lists.ceph.com
Subject: RE: too many PGs per OSD when pg_num = 256??

Forgot the + for the regex.

ceph -s | grep -Eo '[0-9]+ pgs'

[cid:image001.jpg@01D214B5.3D5480F0]<https://storagecraft.com>

David Turner | Cloud Operations Engineer | StorageCraft Technology 
Corporation<https://storagecraft.com>
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943


If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.



From: David Turner
Sent: Thursday, September 22, 2016 9:53 AM
To: Andrus, Brian Contractor; 
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: too many PGs per OSD when pg_num = 256??
How many pools do you have?  How many pgs does your total cluster have, not 
just your rbd pool?

ceph osd lspools
ceph -s | grep -Eo '[0-9] pgs'

My guess is that you have other pools with pgs and the cumulative total of pgs 
per osd is too many.

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Andrus, Brian 
Contractor [bdand...@nps.edu]
Sent: Thursday, September 22, 2016 9:33 AM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: [ceph-users] too many PGs per OSD when pg_num = 256??
All,

I am getting a warning:

 health HEALTH_WARN
too many PGs per OSD (377 > max 300)
pool cephfs_data has many more objects per pg than average (too few 
pgs?)

yet, when I check the settings:
# ceph osd pool get rbd pg_num
pg_num: 256
# ceph osd pool get rbd pgp_num
pgp_num: 256

How does something like this happen?
I did create a radosgw several weeks ago and have put a single file in it for 
testing, but that is it. It only started giving the warning a couple days ago.

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] too many PGs per OSD when pg_num = 256??

2016-09-22 Thread Andrus, Brian Contractor

All,

I am getting a warning:

 health HEALTH_WARN
too many PGs per OSD (377 > max 300)
pool cephfs_data has many more objects per pg than average (too few 
pgs?)

yet, when I check the settings:
# ceph osd pool get rbd pg_num
pg_num: 256
# ceph osd pool get rbd pgp_num
pgp_num: 256

How does something like this happen?
I did create a radosgw several weeks ago and have put a single file in it for 
testing, but that is it. It only started giving the warning a couple days ago.

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] swiftclient call radosgw, it always response 401 Unauthorized

2016-09-22 Thread Brian Chang-Chien

onOne", "publicURL": "http://10.62.13.253:9292;, "id":
"2e5c0bb777184d078b7859fd59b75ed7", "internalURL":
"http://10.62.13.253:9292"}],
"type": "image", "name": "glance"}, {"endpoints_links": [], "endpoints":
[{"adminURL": "http://10.62.13.253:8777;, "region": "RegionOne",
"publicURL": "http://10.62.13.253:8777;, "id":
"d126c0eaa5c0457a801a729f0ad33c7f", "internalURL":
"http://10.62.13.253:8777"}],
"type": "metering", "name": "ceilometer"}, {"endpoints_links": [],
"endpoints": [{"adminURL": "
http://10.62.13.253:8776/v1/d72acd4eb56a4278ace834a391eeb158;, "region":
"RegionOne", "publicURL": "
http://10.62.13.253:8776/v1/d72acd4eb56a4278ace834a391eeb158;, "id":
"6dc2fe16ef3948e4b05ce012a93d947a", "internalURL": "
http://10.62.13.253:8776/v1/d72acd4eb56a4278ace834a391eeb158"}], "type":
"volume", "name": "cinder"}, {"endpoints_links": [], "endpoints":
[{"adminURL": "http://10.62.13.253:8773/services/Admin;, "region":
"RegionOne", "publicURL": "http://10.62.13.253:8773/services/Cloud;, "id":
"1f98895bc24745b3925d2f845180e2d7", "internalURL": "
http://10.62.13.253:8773/services/Cloud"}], "type": "ec2", "name":
"nova_ec2"}, {"endpoints_links": [], "endpoints": [{"adminURL": "
http://10.62.9.140:7480/swift/v1;, "region": "RegionOne", "publicURL": "
http://10.62.9.140:7480/swift/v1;, "id":
"5366d03ba9104b93b45f04f17375ca0c", "internalURL": "
http://10.62.9.140:7480/swift/v1"}], "type": "object-store", "name":
"swift"}, {"endpoints_links": [], "endpoints": [{"adminURL": "
http://10.62.13.253:35357/v2.0;, "region": "RegionOne", "publicURL": "
http://10.62.13.253:5000/v2.0;, "id": "9ade29d900434e558eec20e1f3c1b05d",
"internalURL": "http://10.62.13.253:5000/v2.0"}], "type": "identity",
"name": "keystone"}], "user": {"username": "admin", "roles_links": [],
"id": "f94a2334efd4477abe136d3eccdf036d", "roles": [{"name": "admin"}],
"name": "admin"}, "metadata": {"is_admin": 0, "roles":
["829f2966c3ec4b06aad875525bf0a7d9"]}}}
2016-09-22 17:10:36.117222 7f8104ff9700  0 validated token: admin:admin
expires: 1474538885
2016-09-22 17:10:36.117240 7f8104ff9700 20 updating
user=d72acd4eb56a4278ace834a391eeb158
2016-09-22 17:10:36.117292 7f8104ff9700 20 get_system_obj_state:
rctx=0x7f8104ff1f70
obj=default.rgw.users.uid:d72acd4eb56a4278ace834a391eeb158$d72acd4eb56a4278ace834a391eeb158
state=0x7f8148067918 s->prefetch_data=0
2016-09-22 17:10:36.117303 7f8104ff9700 10 cache get:
name=default.rgw.users.uid+d72acd4eb56a4278ace834a391eeb158$d72acd4eb56a4278ace834a391eeb158
: miss
2016-09-22 17:10:36.118300 7f8104ff9700 10 cache put:
name=default.rgw.users.uid+d72acd4eb56a4278ace834a391eeb158$d72acd4eb56a4278ace834a391eeb158
info.flags=0
2016-09-22 17:10:36.118321 7f8104ff9700 10 adding
default.rgw.users.uid+d72acd4eb56a4278ace834a391eeb158$d72acd4eb56a4278ace834a391eeb158
to cache LRU end
2016-09-22 17:10:36.118346 7f8104ff9700 20 get_system_obj_state:
rctx=0x7f8104ff1f70
obj=default.rgw.users.uid:d72acd4eb56a4278ace834a391eeb158
state=0x7f8148067918 s->prefetch_data=0
2016-09-22 17:10:36.118355 7f8104ff9700 10 cache get:
name=default.rgw.users.uid+d72acd4eb56a4278ace834a391eeb158 : miss
2016-09-22 17:10:36.119091 7f8104ff9700 10 cache put:
name=default.rgw.users.uid+d72acd4eb56a4278ace834a391eeb158 info.flags=0
2016-09-22 17:10:36.119109 7f8104ff9700 10 adding
default.rgw.users.uid+d72acd4eb56a4278ace834a391eeb158 to cache LRU end
2016-09-22 17:10:36.119116 7f8104ff9700  0 NOTICE: couldn't map swift user
d72acd4eb56a4278ace834a391eeb158




and then will stop here

any suggest for this issue?
Thx














2016-09-21 19:09 GMT+08:00 Radoslaw Zarzynski <rzarzyn...@mirantis.com>:

> Hi,
>
> Responded inline.
>
> On Wed, Sep 21, 2016 at 4:54 AM, Brian Chang-Chien
> <brian.changch...@gmail.com> wrote:
> >
> >
> > [global]
> > ...
> > debug rgw = 20
> > [client.radosgw.gateway]
> > host = brianceph
> > rgw keystone url = http://10.62.13.253:35357
> > rgw keystone admin tok

1 2 3 >

1 - 100 of 263 matches

Mail list logo