[ceph-users] Mon crashes virtual void LogMonitor::update_from_paxos(bool*)

2020-01-15 Thread Kevin Hrpcek
__libc_start_main()+0xf5) [0x7f09397c6505] 9: (()+0x24ad40) [0x55e5a02fad40] -261> 2020-01-15 16:36:46.086 7f0946674a00 -1 *** Caught signal (Aborted) ** in thread 7f0946674a00 thread_name:ceph-mon

[ceph-users] January Ceph Science Group Virtual Meeting

2020-01-13 Thread Kevin Hrpcek
2.) Enter Meeting ID: 908675367 3.) Press # Want to test your video connection? https://bluejeans.com/111 Kevin

[ceph-users] Ceph Science User Group Call October

2019-10-21 Thread Kevin Hrpcek
https://bluejeans.com/111

Re: [ceph-users] Ceph Scientific Computing User Group

2019-08-27 Thread Kevin Hrpcek
Wednesday of each month. Here's the pad to collect agenda/notes: https://pad.ceph.com/p/Ceph_Science_User_Group_Index -- Mike Perez (thingee) On Tue, Jul 23, 2019 at 10:40 AM Kevin Hrpcek <kevin.hrp...@ssec.wisc.edu> wrote: Update We're going to hold off until August for this so

Re: [ceph-users] How to add 100 new OSDs...

2019-07-24 Thread Kevin Hrpcek
? Or the reweight? (I guess you change the crush weight, am I right?) Thanks! On 24 Jul 2019, at 19:17, Kevin Hrpcek <kevin.hrp...@ssec.wisc.edu> wrote: I often add 50+ OSDs at a time and my cluster is all NLSAS. Here is what I do, you can obviously change the weight in

Re: [ceph-users] How to add 100 new OSDs...

2019-07-24 Thread Kevin Hrpcek
I often add 50+ OSDs at a time and my cluster is all NLSAS. Here is what I do, you can obviously change the weight increase steps to what you are comfortable with. This has worked well for me and my workloads. I've sometimes seen peering take longer if I do steps too quickly but I don't run any
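A minimal sketch of that kind of stepped weight-in is below. It is not the exact procedure from the thread; the OSD IDs, target weight, step size, and pause are placeholder values, and it assumes admin access to the ceph CLI from the host running it.

    #!/usr/bin/env python3
    # Sketch: raise the CRUSH weight of a batch of new OSDs in small steps,
    # pausing between rounds so peering/backfill can settle. All values are placeholders.
    import subprocess
    import time

    OSD_IDS = range(700, 750)   # the newly added OSDs (hypothetical IDs)
    TARGET_WEIGHT = 9.09560     # final CRUSH weight, e.g. for a 10 TB drive
    STEP = 1.0                  # weight increase per round
    SETTLE_SECONDS = 1800       # crude pause; in practice watch backfill/peering instead

    weight = 0.0
    while weight < TARGET_WEIGHT:
        weight = min(weight + STEP, TARGET_WEIGHT)
        for osd in OSD_IDS:
            subprocess.run(
                ["ceph", "osd", "crush", "reweight", "osd.%d" % osd, str(weight)],
                check=True)
        time.sleep(SETTLE_SECONDS)

Raising the weight gradually limits how much data the cluster tries to move at once, which is the point of the approach described above.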

Re: [ceph-users] Ceph Scientific Computing User Group

2019-07-23 Thread Kevin Hrpcek
Update We're going to hold off until August for this so we can promote it on the Ceph twitter with more notice. Sorry for the inconvenience if you were planning on the meeting tomorrow. Keep a watch on the list, twitter, or ceph calendar for updates. Kevin On 7/5/19 11:15 PM, Kevin Hrpcek

Re: [ceph-users] Ceph Scientific Computing User Group

2019-07-05 Thread Kevin Hrpcek
a topic for meetings. I will be brainstorming some conversation starters but it would also be interesting to have people give a deep dive into their use of ceph and what they have built around it to support the science being done at their facility. Kevin On 6/17/19 10:43 AM, Kevin Hrpcek

[ceph-users] Ceph Scientific Computing User Group

2019-06-17 Thread Kevin Hrpcek
Hey all, At cephalocon some of us who work in scientific computing got together for a BoF and had a good conversation. There was some interest in finding a way to continue the conversation focused on ceph in scientific computing and htc/hpc environments. We are considering putting together

Re: [ceph-users] Mimic upgrade failure

2018-09-19 Thread Kevin Hrpcek
for an osd that reported a failure and seeing what error code is coming up on the failed ping connection? That might provide a useful hint (e.g., ECONNREFUSED vs EMFILE or something). I'd also confirm that with nodown set the mon quorum stabilizes... sage On Mon, 10 Sep 2018, Kevin Hrpcek

Re: [ceph-users] Mimic upgrade failure

2018-09-12 Thread Kevin Hrpcek
with nodown set the mon quorum stabilizes... sage On Mon, 10 Sep 2018, Kevin Hrpcek wrote: Update for the list archive. I went ahead and finished the mimic upgrade with the osds in a fluctuating state of up and down. The cluster did start to normalize a lot easier after everything was on

Re: [ceph-users] Mimic upgrade failure

2018-09-10 Thread Kevin Hrpcek
the mix of luminous and mimic did not play well together for some reason. Maybe it has to do with the scale of my cluster, 871 osd, or maybe I've missed some tuning as my cluster has scaled to this size. Kevin On 09/09/2018 12:49 PM, Kevin Hrpcek wrote: Nothing too crazy for non default

Re: [ceph-users] Mimic upgrade failure

2018-09-09 Thread Kevin Hrpcek
things are, setting pause on the cluster to just finish the upgrade faster might not be a bad idea either. This should be a simple question, have you confirmed that there are no networking problems between the MONs while the elections are happening? On Sat, Sep 8, 2018, 7:52 PM Kevin Hrpcek

Re: [ceph-users] Mimic upgrade failure

2018-09-08 Thread Kevin Hrpcek
018, Kevin Hrpcek wrote: Hello, I've had a Luminous -> Mimic upgrade go very poorly and my cluster is stuck with almost all pgs down. One problem is that the mons have started to re-elect a new quorum leader almost every minute. This is making it difficult to monitor the cluster and even run any co

[ceph-users] Mimic upgrade failure

2018-09-08 Thread Kevin Hrpcek
Hello, I've had a Luminous -> Mimic upgrade go very poorly and my cluster is stuck with almost all pgs down. One problem is that the mons have started to re-elect a new quorum leader almost every minute. This is making it difficult to monitor the cluster and even run any commands on it since

Re: [ceph-users] separate monitoring node

2018-06-19 Thread Kevin Hrpcek
I use icinga2 as well with a check_ceph.py that I wrote a couple years ago. The method I use is that icinga2 runs the check from the icinga2 host itself. ceph-common is installed on the icinga2 host since the check_ceph script is a wrapper and parser for the ceph command output using python's
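The script itself is not in the archive, so the fragment below is only a guess at the general shape of such a wrapper: an Icinga/Nagios-style plugin that shells out to the ceph CLI and maps HEALTH_OK/HEALTH_WARN/HEALTH_ERR to exit codes 0/1/2. It is not the original check_ceph.py.

    #!/usr/bin/env python3
    # Sketch of an Icinga/Nagios-style health check built around the ceph CLI.
    # Requires ceph-common and a keyring readable by the monitoring user.
    import subprocess
    import sys

    EXIT_CODES = {"HEALTH_OK": 0, "HEALTH_WARN": 1, "HEALTH_ERR": 2}

    try:
        out = subprocess.run(["ceph", "health", "detail"], check=True,
                             capture_output=True, text=True, timeout=30).stdout.strip()
    except Exception as exc:
        print("UNKNOWN: could not query cluster: %s" % exc)
        sys.exit(3)

    status = out.split()[0] if out else "UNKNOWN"
    print(out.splitlines()[0] if out else "UNKNOWN: empty output from ceph")
    sys.exit(EXIT_CODES.get(status, 3))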

[ceph-users] Reweighting causes whole cluster to peer/activate

2018-06-14 Thread Kevin Hrpcek
Hello, I'm seeing something that seems to be odd behavior when reweighting OSDs. I've just upgraded to 12.2.5 and am adding in a new osd server to the cluster. I gradually weight the 10TB OSDs into the cluster by doing a +1, letting things backfill for a while, then +1 until I reach my
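One way to judge when that peering/activation burst has settled before the next +1 is to poll the cluster status. The sketch below assumes the pgs_by_state layout reported by `ceph status --format json` on Luminous, which may differ on other releases.

    #!/usr/bin/env python3
    # Sketch: wait until no PGs report peering/activating before applying the next
    # reweight step. JSON field names are assumed from Luminous-era `ceph status -f json`.
    import json
    import subprocess
    import time

    BUSY = ("peering", "activating")

    def busy_pg_count():
        out = subprocess.run(["ceph", "status", "--format", "json"],
                             check=True, capture_output=True, text=True).stdout
        by_state = json.loads(out).get("pgmap", {}).get("pgs_by_state", [])
        return sum(s["count"] for s in by_state
                   if any(b in s["state_name"] for b in BUSY))

    while True:
        n = busy_pg_count()
        if n == 0:
            print("no PGs peering/activating; safe to take the next step")
            break
        print("%d PGs still peering/activating" % n)
        time.sleep(10)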

Re: [ceph-users] librados python pool alignment size write failures

2018-04-03 Thread Kevin Hrpcek
Thanks for the input Greg, we've submitted the patch to the ceph github repo https://github.com/ceph/ceph/pull/21222 Kevin On 04/02/2018 01:10 PM, Gregory Farnum wrote: On Mon, Apr 2, 2018 at 8:21 AM Kevin Hrpcek <kevin.hrp...@ssec.wisc.edu>

[ceph-users] librados python pool alignment size write failures

2018-04-02 Thread Kevin Hrpcek
to most users... Any insight would be appreciated as we'd prefer to use an official solution rather than our bindings fix for long-term use. Tested on Luminous 12.2.2 and 12.2.4. Thanks, Kevin
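The bindings patch itself is not reproduced here; as a rough illustration of the underlying constraint, the sketch below queries the pool's required alignment and rejects writes that are not a multiple of it. The requires_alignment()/alignment() helpers on rados.Ioctx are assumptions (exactly the kind of interface being asked for, and possibly absent from the 12.2.x python bindings), and my-ec-pool is a placeholder pool name.

    #!/usr/bin/env python3
    # Sketch: check that appends to an alignment-constrained (e.g. erasure-coded) pool
    # are sized in multiples of the pool's required alignment before submitting them.
    # The alignment helpers on Ioctx are assumed, not guaranteed, in Luminous bindings.
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("my-ec-pool")        # placeholder pool name
        align = ioctx.alignment() if ioctx.requires_alignment() else None

        def aligned_append(obj_name, data):
            """Refuse writes that would violate the pool's alignment requirement."""
            if align and len(data) % align:
                raise ValueError("%d-byte write is not a multiple of the required "
                                 "alignment (%d bytes)" % (len(data), align))
            ioctx.append(obj_name, data)

        aligned_append("example-object", b"\0" * (align or 4096))
        ioctx.close()
    finally:
        cluster.shutdown()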

Re: [ceph-users] Ceph luminous - throughput performance issue

2018-01-31 Thread Kevin Hrpcek
Steven, I've recently done some performance testing on Dell hardware. Here are some of my messy results. I was mainly testing the effects of the R0 stripe sizing on the PERC card. Each disk has its own R0 so that write back is enabled. VDs were created like this but with different

Re: [ceph-users] Pool shard/stripe settings for file too large files?

2017-11-09 Thread Kevin Hrpcek
Marc, If you're running luminous you may need to increase osd_max_object_size. This snippet is from the Luminous change log. "The default maximum size for a single RADOS object has been reduced from 100GB to 128MB. The 100GB limit was completely impractical in practice while the 128MB limit
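For anyone hitting the same limit, a minimal example of raising it on a Luminous-era cluster managed through ceph.conf is shown below; the 1 GiB value is only an illustration, and the OSDs need to pick the change up (for example by restarting them).

    [osd]
    # Luminous default is 128 MB; this illustrative value raises the cap to 1 GiB
    osd max object size = 1073741824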

Re: [ceph-users] Cluster Down from reweight-by-utilization

2017-11-06 Thread Kevin Hrpcek
by quickly setting nodown,noout,noup when everything is already down will help as well. Sage, thanks again for your input and advice. Kevin On 11/04/2017 11:54 PM, Sage Weil wrote: On Sat, 4 Nov 2017, Kevin Hrpcek wrote: Hey Sage, Thanks for getting back to me this late on a weekend. Do you
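For reference, those flags are set and cleared with the ceph CLI roughly as follows (sketch only; leaving them set long-term hides real failures, so unset them once the cluster has recovered):

    ceph osd set nodown
    ceph osd set noout
    ceph osd set noup
    # ... bring the cluster back, then clear the flags:
    ceph osd unset noup
    ceph osd unset noout
    ceph osd unset nodown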

Re: [ceph-users] Cluster Down from reweight-by-utilization

2017-11-04 Thread Kevin Hrpcek
Hey Sage, Thanks for getting back to me this late on a weekend. Do you know why the OSDs were going down? Are there any crash dumps in the osd logs, or is the OOM killer getting them? That's a part I can't nail down yet. OSDs didn't crash; after the reweight-by-utilization, OSDs on some of our

[ceph-users] Cluster Down from reweight-by-utilization

2017-11-04 Thread Kevin Hrpcek
1 stale+active+clean+scrubbing 1 active+recovering+undersized+degraded 1 stale+active+remapped+backfilling 1 inactive 1 active+clean+scrubbing 1 stale+active+clean+scrubbing+d