Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-05-23 Thread Sean Sullivan
a variable is a different type or a missing delimiter. womp. I am definitely out of my depth but now is a great time to learn! Can anyone shed some more light as to what may be wrong? On Fri, May 4, 2018 at 7:49 PM, Yan, Zheng <uker...@gmail.com> wrote: > On Wed, May 2, 2018 at 7:19 AM, Se

Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-05-04 Thread Sean Sullivan
> kh10-8 (200MB) mds log -- https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh10-8.log > kh09-8 (4.1GB) mds log -- https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh09-8.log > > On Tue, May 1, 2018 at 12:09 AM, Patrick Donnelly <pdonn..

Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-05-01 Thread Sean Sullivan
lly <pdonn...@redhat.com> wrote: > Hello Sean, > > On Mon, Apr 30, 2018 at 2:32 PM, Sean Sullivan <lookcr...@gmail.com> > wrote: > > I was creating a new user and mount point. On another hardware node I > > mounted CephFS as admin to mount as root. I created

Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-04-30 Thread Sean Sullivan
018 at 7:24 PM, Sean Sullivan <lookcr...@gmail.com> wrote: > So I think I can reliably reproduce this crash from a ceph client. > > ``` > root@kh08-8:~# ceph -s > cluster: > id: 9f58ee5a-7c5d-4d68-81ee-debe16322544 > health: HEALTH_OK > > services: >

Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-04-30 Thread Sean Sullivan
can't seem to get them to start again. On Mon, Apr 30, 2018 at 5:06 PM, Sean Sullivan <lookcr...@gmail.com> wrote: > I had 2 MDS servers (one active one standby) and both were down. I took a > dumb chance and marked the active as down (it said it was up but laggy). > Then started th

Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-04-30 Thread Sean Sullivan
at 4:32 PM, Sean Sullivan <lookcr...@gmail.com> wrote: > I was creating a new user and mount point. On another hardware node I > mounted CephFS as admin to mount as root. I created /aufstest and then > unmounted. From there it seems that both of my mds nodes crashed for some > reason

[ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-04-30 Thread Sean Sullivan
I was creating a new user and mount point. On another hardware node I mounted CephFS as admin to mount as root. I created /aufstest and then unmounted. From there it seems that both of my mds nodes crashed for some reason and I can't start them any more. https://pastebin.com/1ZgkL9fa -- my mds

[ceph-users] luminous ubuntu 16.04 HWE (4.10 kernel). ceph-disk can't prepare a disk

2017-10-22 Thread Sean Sullivan
On freshly installed ubuntu 16.04 servers with the HWE kernel selected (4.10). I can not use ceph-deploy or ceph-disk to provision osd. whenever I try I get the following:: ceph-disk -v prepare --dmcrypt --dmcrypt-key-dir /etc/ceph/dmcrypt-keys --bluestore --cluster ceph --fs-type xfs --

[ceph-users] zombie partitions, ceph-disk failure.

2017-10-20 Thread Sean Sullivan
I am trying to stand up ceph (luminous) on three 72-disk supermicro servers running ubuntu 16.04 with HWE enabled (for a 4.10 kernel for cephfs). I am not sure how this is possible but even though I am running the following line to wipe all disks of their partitions, once I run ceph-disk to partition

Re: [ceph-users] Luminous can't seem to provision more than 32 OSDs per server

2017-10-19 Thread Sean Sullivan
I have tried using ceph-disk directly and I'm running into all sorts of trouble, but I'm trying my best. Currently I am using the following cobbled script, which seems to be working: https://github.com/seapasulli/CephScripts/blob/master/provision_storage.sh I'm at 11 right now. I hope this works.

[ceph-users] Luminous can't seem to provision more than 32 OSDs per server

2017-10-18 Thread Sean Sullivan
I am trying to install Ceph luminous (ceph version 12.2.1) on 4 ubuntu 16.04 servers each with 74 disks, 60 of which are HGST 7200rpm sas drives:: HGST HUS724040AL sdbv sas root@kg15-2:~# lsblk --output MODEL,KNAME,TRAN | grep HGST | wc -l 60 I am trying to deploy them all with :: a line like

[ceph-users] ceph-monstore-tool rebuild assert error

2017-02-07 Thread Sean Sullivan
I have a hammer cluster that died a bit ago (hammer 94.9) consisting of 3 monitors and 630 osds spread across 21 storage hosts. The clusters monitors all died due to leveldb corruption and the cluster was shut down. I was finally given word that I could try to revive the cluster this week!

Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2017-02-06 Thread Sean Sullivan
-- I have tried copying my monitor and admin keyring into the admin.keyring used to try to rebuild and it still fails. I am not sure whether this is due to my packages or if something else is wrong. Is there a way to test or see what may be happening? On Sat, Aug 13, 2016 a

[ceph-users] ceph radosgw - 500 errors -- odd

2017-01-13 Thread Sean Sullivan
I am sorry for posting this if this has been addressed already. I am not sure on how to search through old ceph-users mailing list posts. I used to use gmane.org but that seems to be down. My setup:: I have a moderate ceph cluster (ceph hammer 94.9 - fe6d859066244b97b24f09d46552afc2071e6f90 ).

Re: [ceph-users] Filling up ceph past 75%

2016-08-28 Thread Sean Sullivan
give it another shot in a test instance and see how it goes. Thanks for your help as always Mr. Balzer. On Aug 28, 2016 8:59 PM, "Christian Balzer" <ch...@gol.com> wrote: > > Hello, > > On Sun, 28 Aug 2016 14:34:25 -0500 Sean Sullivan wrote: > > > I was curi

[ceph-users] Filling up ceph past 75%

2016-08-28 Thread Sean Sullivan
I was curious if anyone has filled ceph storage beyond 75%. Admittedly we lost a single host due to power failure and are down 1 host until the replacement parts arrive, but outside of that I am seeing disparity between the most and least full osd:: ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR
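The VAR column quoted above is, as I understand it, each OSD's utilization divided by the cluster-wide average utilization. A minimal sketch of that arithmetic, using made-up %USE numbers (not figures from this cluster), shows how the most/least-full disparity surfaces:

```python
# Sketch of how `ceph osd df` derives its VAR column: each OSD's %USE
# divided by the cluster-wide average %USE. The numbers below are made
# up for illustration, not taken from the cluster in this thread.
pct_use = {0: 81.2, 1: 62.5, 2: 74.9, 3: 55.0}  # osd id -> %USE

avg = sum(pct_use.values()) / len(pct_use)
var = {osd: round(use / avg, 2) for osd, use in pct_use.items()}

most = max(pct_use, key=pct_use.get)
least = min(pct_use, key=pct_use.get)
print(f"avg %USE: {avg:.2f}")
print(f"most full:  osd.{most} ({pct_use[most]}%, VAR {var[most]})")
print(f"least full: osd.{least} ({pct_use[least]}%, VAR {var[least]})")
```

A VAR spread this wide (0.80 to 1.19 here) is what makes filling past 75% risky: the fullest OSD hits its full ratio long before the cluster average does.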

Re: [ceph-users] How can we repair OSD leveldb?

2016-08-18 Thread Sean Sullivan
We have a hammer cluster that experienced a similar power failure and ended up corrupting our monitors leveldb stores. I am still trying to repair ours but I can give you a few tips that seem to help. 1.) I would copy the database off to somewhere safe right away. Just opening it seems to change
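Tip 1 above (copy the database somewhere safe before touching it, since merely opening leveldb can mutate it) can be sketched as a timestamped snapshot. Paths here are illustrative stand-ins; a real mon store lives under /var/lib/ceph/mon/:

```python
# Sketch of tip 1: snapshot the monitor's leveldb store before any
# repair attempt, because merely opening it seems to change it.
# The paths below are illustrative; a real store.db lives under
# /var/lib/ceph/mon/ceph-<hostname>/.
import shutil
import tempfile
import time
from pathlib import Path

def snapshot_store(store: Path, backup_root: Path) -> Path:
    """Copy store.db to a timestamped backup directory and return it."""
    dest = backup_root / f"{store.name}.{int(time.time())}.bak"
    shutil.copytree(store, dest)  # fails loudly if dest already exists
    return dest

# Demo against a throwaway directory standing in for store.db:
tmp = Path(tempfile.mkdtemp())
fake_store = tmp / "store.db"
fake_store.mkdir()
(fake_store / "CURRENT").write_text("MANIFEST-000001\n")
backup = snapshot_store(fake_store, tmp)
print("backed up to", backup)
```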

Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2016-08-13 Thread Sean Sullivan
--- end dump of recent events --- Aborted (core dumped) --- --- I feel like I am so close but so far. Can anyone give me a nudge as to what I can do next? it looks like it is bombing out on trying to get an updated paxos. On F

Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2016-08-12 Thread Sean Sullivan
? All should have the same keys/values although constructed differently right? I can't blindly copy /var/lib/ceph/mon/ceph-$(hostname)/store.db/ from one host to another right? But can I copy the keys/values from one to another? On Fri, Aug 12, 2016 at 12:45 PM, Sean Sullivan <seapasu...@uchicago.

Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2016-08-12 Thread Sean Sullivan
Op 11 augustus 2016 om 15:17 schreef Sean Sullivan < > seapasu...@uchicago.edu>: > > > > > > Hello Wido, > > > > Thanks for the advice. While the data center has a/b circuits and > > redundant power, etc if a ground fault happens it travels outside and > &

Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2016-08-11 Thread Sean Sullivan
of the previous osd maps are there. I just don't understand what key/values I need inside. On Aug 11, 2016 1:33 AM, "Wido den Hollander" <w...@42on.com> wrote: > > > Op 11 augustus 2016 om 0:10 schreef Sean Sullivan < > seapasu...@uchicago.edu>: > > > > >

[ceph-users] Fwd: lost power. monitors died. Cephx errors now

2016-08-10 Thread Sean Sullivan
I think it just got worse:: all three monitors on my other cluster say that ceph-mon can't open /var/lib/ceph/mon/$(hostname). Is there any way to recover if you lose all 3 monitors? I saw a post by Sage saying that the data can be recovered as all of the data is held on other servers. Is this

[ceph-users] lost power. monitors died. Cephx errors now

2016-08-10 Thread Sean Sullivan
So our datacenter lost power and 2/3 of our monitors died with FS corruption. I tried fixing it but it looks like the store.db didn't make it. I copied the working journal via 1. sudo mv /var/lib/ceph/mon/ceph-$(hostname){,.BAK} 2. sudo ceph-mon -i {mon-id} --mkfs --monmap

[ceph-users] Power Outage! Oh No!

2016-08-10 Thread Sean Sullivan
So we recently had a power outage and I seem to have lost 2 of 3 of my monitors. I have since copied /var/lib/ceph/mon/ceph-$(hostname){,.BAK} and then created a new cephfs and finally generated a new filesystem via ''' sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename} --keyring

Re: [ceph-users] Radosgw (civetweb) hangs once around 850 established connections

2016-03-20 Thread Sean Sullivan
Hi Ben! I'm using ubuntu 14.04 I have restarted the gateways with the numthreads line you suggested. I hope this helps. I would think I would get some kind of throttle log or something. 500 seems really strange as well. Do you have a thread for this? RGW still has a weird race condition

[ceph-users] Ceph-deploy won't write journal if partition exists and using --dmcrypt

2015-07-16 Thread Sean Sullivan
Some context. I have a small cluster running ubuntu 14.04 and giant (now hammer). I ran some updates and everything was fine. Rebooted a node and a drive must have failed as it no longer shows up. I use --dmcrypt with ceph deploy and 5 osds per ssd journal. To do this I created the ssd

Re: [ceph-users] RGW - Can't download complete object

2015-05-13 Thread Sean Sullivan
Sorry for the delay. It took me a while to figure out how to do a range request and append the data to a single file. The good news is that the end file seems to be 14G in size which matches the files manifest size. The bad news is that the file is completely corrupt and the radosgw log has

Re: [ceph-users] RGW - Can't download complete object

2015-05-13 Thread Sean Sullivan
-Weinraub yeh...@redhat.com To: Sean Sullivan seapasu...@uchicago.edu Cc: ceph-users@lists.ceph.com Sent: Wednesday, May 13, 2015 2:33:07 PM Subject: Re: [ceph-users] RGW - Can't download complete object That's another interesting issue. Note that for part 12_80 the manifest specifies (I assume

Re: [ceph-users] Civet RadosGW S3 not storing complete obects; civetweb logs stop after rotation

2015-04-28 Thread Sean Sullivan
Will do. The reason for the partial request is that the total size of the file is close to 1TB so attempting a download would take quite some time on our 10Gb connection. What is odd is that if I request the last bit received to the end of the file we get a 406 can not be satisfied response
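Reassembling a near-1TB object from range requests only works if the byte spans tile the manifest size exactly; any gap or overlap corrupts the stitched-together file. A minimal sketch of the range computation (sizes are illustrative, and note HTTP Range end offsets are inclusive):

```python
# Sketch of splitting an object of known size into HTTP Range header
# values that tile it exactly. Any gap or overlap here would corrupt
# the file reassembled from the parts; a range past the end is what
# draws a "cannot be satisfied" response. Sizes are illustrative.
def byte_ranges(total_size: int, chunk: int):
    """Yield 'bytes=start-end' values covering [0, total_size)."""
    for start in range(0, total_size, chunk):
        end = min(start + chunk, total_size) - 1  # Range ends are inclusive
        yield f"bytes={start}-{end}"

ranges = list(byte_ranges(10_000, 4096))
print(ranges)
```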

[ceph-users] Can not list objects in large bucket

2015-03-11 Thread Sean Sullivan
I have a single radosgw user with 2 s3 keys and 1 swift key. I have created a few buckets and I can list all of the contents of bucket A and C but not B with either S3 (boto) or python-swiftclient. I am able to list the first 1000 entries using radosgw-admin 'bucket list --bucket=bucketB'
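The 1000-entry cutoff seen with radosgw-admin matches the S3 listing page size: ListObjects returns at most 1000 keys per call, and a full bucket walk feeds the last key of each page back as the marker until the response is no longer truncated. A sketch of that loop against a stubbed key space (no real bucket or endpoint involved):

```python
# Sketch of S3-style marker pagination: each ListObjects call returns
# at most `page` keys, and the last key of one page seeds the marker
# for the next request. A stubbed, sorted key space stands in for a
# real bucket/endpoint here.
KEYS = sorted(f"obj-{i:05d}" for i in range(2500))

def list_objects(marker: str = "", page: int = 1000):
    """Stub of ListObjects: keys strictly after `marker`, capped at `page`."""
    after = [k for k in KEYS if k > marker]
    return after[:page], len(after) > page  # (keys, truncated?)

all_keys, marker, truncated = [], "", True
while truncated:
    keys, truncated = list_objects(marker)
    all_keys.extend(keys)
    if keys:
        marker = keys[-1]

print(len(all_keys))  # all 2500 keys, despite the 1000-key page cap
```

If a bucket stops listing past the first page with both boto and swiftclient while this loop pattern works elsewhere, the bucket index itself is the next suspect.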

Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-23 Thread Sean Sullivan
I am trying to understand these drive throttle markers that were mentioned to get an idea of why these drives are marked as slow:: here is the iostat of the drive /dev/sdbm http://paste.ubuntu.com/9607168/ an IO wait of .79 doesn't seem bad but a write wait of 21.52 seems really high. Looking

Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-22 Thread Sean Sullivan
/q5E6JjkG On 12/19/2014 08:10 PM, Christian Balzer wrote Hello Sean, On Fri, 19 Dec 2014 02:47:41 -0600 Sean Sullivan wrote: Hello Christian, Thanks again for all of your help! I started a bonnie test using the following:: bonnie -d /mnt/rbd/scratch2/ -m $(hostname) -f -b While that gives you

Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-22 Thread Sean Sullivan
less hardware. Hopefully it is indeed a bug of somesort and not yet another screw up on my end. Furthermore hopefully I find the bug and fix it for others to find and profit from ^_^. Thanks for all of your help! On 12/22/2014 05:26 PM, Craig Lewis wrote: On Mon, Dec 22, 2014 at 2:57 PM, Sean

Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-19 Thread Sean Sullivan
: Hello, On Thu, 18 Dec 2014 23:45:57 -0600 Sean Sullivan wrote: Wow Christian, Sorry I missed these in line replies. Give me a minute to gather some data. Thanks a million for the in depth responses! No worries. I thought about raiding it but I needed the space unfortunately. I had a 3x60 osd

[ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-18 Thread Sean Sullivan
Hello Yall! I can't figure out why my gateways are performing so poorly and I am not sure where to start looking. My RBD mounts seem to be performing fine (over 300 MB/s) while uploading a 5G file to Swift/S3 takes 2m32s (32MBps I believe). If we try a 1G file it's closer to 8MBps. Testing with
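The figures quoted are easy to sanity-check: 5 GiB over 2m32s works out to about 33.7 MiB/s, in line with the "~32MBps" estimate, roughly a tenth of the RBD throughput. A quick sketch of the arithmetic:

```python
# Sanity-checking the throughput figures quoted above: a 5G file
# uploaded through RGW in 2m32s, versus >300 MB/s seen over RBD.
def mib_per_s(size_gib: float, minutes: int, seconds: int) -> float:
    """Convert a transfer of size_gib GiB over m:s into MiB/s."""
    return size_gib * 1024 / (minutes * 60 + seconds)

rgw = mib_per_s(5, 2, 32)  # ~33.7 MiB/s, matching the ~32MBps estimate
print(f"RGW upload: {rgw:.1f} MiB/s vs >300 MB/s over RBD")
```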

Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-18 Thread Sean Sullivan
if they're behaving differently on the different requests is one angle of attack. The other is look into is if the RGW daemons are hitting throttler limits or something that the RBD clients aren't. -Greg On Thu, Dec 18, 2014 at 7:35 PM Sean Sullivan seapasu...@uchicago.edu

Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-18 Thread Sean Sullivan
gateway was made last minute to test and rule out the hardware. On December 18, 2014 10:57:41 PM Christian Balzer ch...@gol.com wrote: Hello, Nice cluster, I wouldn't mind getting my hand or her ample nacelles, er, wrong movie. ^o^ On Thu, 18 Dec 2014 21:35:36 -0600 Sean Sullivan wrote: Hello

Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-18 Thread Sean Sullivan
Wow Christian, Sorry I missed these in line replies. Give me a minute to gather some data. Thanks a million for the in depth responses! I thought about raiding it but I needed the space unfortunately. I had a 3x60 osd node test cluster that we tried before this and it didn't have this

Re: [ceph-users] ceph health related message

2014-09-22 Thread Sean Sullivan
I had this happen to me as well. Turned out to be a connlimit thing for me. I would check dmesg/the kernel log and see if you see any conntrack "limit reached, connection dropped" messages, then increase connlimit. Odd as I connected over ssh for this but I can't deny syslog.
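The symptom described above shows up in the kernel log as "nf_conntrack: table full, dropping packet" lines. A minimal sketch of scanning a captured dmesg for them (the sample lines are fabricated for illustration):

```python
# Sketch of the check suggested above: scan dmesg output for the
# conntrack-table-full message that accompanies silently dropped
# connections. The sample log lines are fabricated for illustration.
import re

SAMPLE_DMESG = """\
[1021.4] random: nonblocking pool is initialized
[2041.7] nf_conntrack: table full, dropping packet
[2041.9] nf_conntrack: table full, dropping packet
"""

drops = [line for line in SAMPLE_DMESG.splitlines()
         if re.search(r"nf_conntrack: table full", line)]
print(f"{len(drops)} conntrack drop message(s) found")
```

If such lines appear, raising the conntrack table size (net.netfilter.nf_conntrack_max) or the per-source connlimit is the usual fix.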

[ceph-users] Swift can upload, list, and delete, but not download

2014-09-19 Thread Sean Sullivan
So this was working a moment ago and I was running rados bencharks as well as swift benchmarks to try to see how my install was doing. Now when I try to download an object I get this read_length error:: http://pastebin.com/R4CW8Cgj To try to poke at this I wiped all of the .rgw pools, removed

[ceph-users] Ceph can't seem to forget

2014-08-07 Thread Sean Sullivan
I think I have a split issue or I can't seem to get rid of these objects. How can I tell ceph to forget the objects and revert? How this happened is that due to the python 2.7.8/ceph bug, a whole rack of ceph went down (it had ubuntu 14.10 and that seemed to have 2.7.8 before 14.04). I didn't

[ceph-users] Ceph can't seem to forget.

2014-08-06 Thread Sean Sullivan
I forgot to register before posting so reposting. I think I have a split issue or I can't seem to get rid of these objects. How can I tell ceph to forget the objects and revert? How this happened is that due to the python 2.7.8/ceph bug, a whole rack of ceph went down (it had ubuntu 14.10 and