Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-05-01 Thread Sean Sullivan
Forgot to reply to all: Sure thing! I couldn't install the ceph-mds-dbg packages without upgrading. I just finished upgrading the cluster to 12.2.5. The issue still persists in 12.2.5. From here I'm not really sure how to generate the backtrace, so I hope I did it right. For others on Ubuntu t
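A minimal sketch of one way to capture an MDS backtrace on Ubuntu once ceph-mds-dbg is installed, assuming a core file was written when the daemon crashed (the core path below is a placeholder, not from the thread):

    # install debug symbols matching the running ceph-mds version
    apt-get install ceph-mds-dbg

    # dump backtraces for all threads from the core file into a text file
    gdb --batch -ex 'set pagination off' -ex 'thread apply all bt' \
        /usr/bin/ceph-mds /path/to/core > mds-backtrace.txt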

Re: [ceph-users] v12.2.5 Luminous released

2018-05-01 Thread Sergey Malinin
Useless due to http://tracker.ceph.com/issues/22102 > On 24.04.2018, at 23:29, Abhishek wrote: > > We're glad to announce the fifth bugfix release of Luminous v12.2.x long term > stable

[ceph-users] Bluestore on HDD+SSD sync write latency experiences

2018-05-01 Thread Nick Fisk
Hi all, Slowly getting round to migrating clusters to Bluestore, but I am interested in how people are handling the potential change in write latency coming from Filestore. Or maybe nobody is really seeing much difference? As we all know, in Bluestore, writes are not double written and in mo
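For comparing sync write latency before and after such a migration, one common approach is a single-threaded fio run with O_SYNC writes against a mapped RBD image; a sketch only, with /dev/rbd0 as a placeholder device (this destroys data on the test image):

    # 4k writes, queue depth 1, direct + O_SYNC; the "clat" percentiles in the
    # output approximate per-write commit latency
    fio --name=syncwrite --filename=/dev/rbd0 --ioengine=libaio \
        --rw=write --bs=4k --iodepth=1 --numjobs=1 --direct=1 --sync=1 \
        --runtime=60 --time_based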

Re: [ceph-users] Intel Xeon Scalable and CPU frequency scaling on NVMe/SSD Ceph OSDs

2018-05-01 Thread Nick Fisk
4.16 required? https://www.phoronix.com/scan.php?page=news_item&px=Skylake-X-P-State-Linux-4.16 -Original Message- From: ceph-users On Behalf Of Blair Bethwaite Sent: 01 May 2018 16:46 To: Wido den Hollander Cc: ceph-users ; Nick Fisk Subject: Re: [ceph-users] Intel Xeon Scalable and

Re: [ceph-users] CephFS MDS stuck (failed to rdlock when getattr / lookup)

2018-05-01 Thread Daniel Gryniewicz
On 05/01/2018 01:43 PM, Oliver Freyermuth wrote: Hi all, Am 17.04.2018 um 19:38 schrieb Oliver Freyermuth: Am 17.04.2018 um 19:35 schrieb Daniel Gryniewicz: On 04/17/2018 11:40 AM, Oliver Freyermuth wrote: Am 17.04.2018 um 17:34 schrieb Paul Emmerich: [...] We are right now using t

Re: [ceph-users] troubleshooting librados error with concurrent requests

2018-05-01 Thread Sam Whitlock
Thank you! I will try what you suggested. Here is the full backtrace: #0 0x7f816529b360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0 #1 0x7f80c716bb82 in Cond::Wait (this=this@entry=0x7f7f977dce20, mutex=...) at ./common/Cond.h:56 #2 0x7f80c717656f

[ceph-users] Configuration multi region

2018-05-01 Thread Anatoliy Guskov
Hello all, I created an S3 RGW cluster scheme like this: Master — Slave1 (region EU), Master — Slave2 (region US). The master is used as the store for user logins and bucket info. What do you think: is that a good idea, or is there a better way to store user data and bucket info? And last questio
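For reference, the Luminous multisite model maps this master/slave idea onto a realm with a master zone and secondary zones; user and bucket metadata live in the master zone and are synced to the secondaries. A heavily abbreviated sketch, with realm/zone names, endpoints and keys made up for illustration:

    # on the master (names, endpoints and keys are placeholders)
    radosgw-admin realm create --rgw-realm=example --default
    radosgw-admin zonegroup create --rgw-zonegroup=global \
        --endpoints=http://rgw-master:8080 --master --default
    radosgw-admin zone create --rgw-zonegroup=global --rgw-zone=master \
        --endpoints=http://rgw-master:8080 --master --default
    radosgw-admin period update --commit

    # on a secondary (e.g. EU): pull the realm using a system user's keys,
    # then create the local zone; metadata syncs from the master automatically
    radosgw-admin realm pull --url=http://rgw-master:8080 \
        --access-key=<sync-key> --secret=<sync-secret>
    radosgw-admin zone create --rgw-zonegroup=global --rgw-zone=eu \
        --endpoints=http://rgw-eu:8080 \
        --access-key=<sync-key> --secret=<sync-secret>
    radosgw-admin period update --commit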

Re: [ceph-users] Collecting BlueStore per Object DB overhead

2018-05-01 Thread Gregory Farnum
On Mon, Apr 30, 2018 at 10:57 PM Wido den Hollander wrote: > > > On 04/30/2018 10:25 PM, Gregory Farnum wrote: > > > > > > On Thu, Apr 26, 2018 at 11:36 AM Wido den Hollander > > wrote: > > > > Hi, > > > > I've been investigating the per object overhead for BlueStor
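One rough way to approximate per-object RocksDB overhead is to compare BlueFS DB usage on the OSDs against the cluster's object counts; a sketch, assuming the OSD admin sockets are reachable and jq is installed (osd.0 is a placeholder):

    # bytes RocksDB currently occupies on this OSD's DB device
    ceph daemon osd.0 perf dump | jq '.bluefs.db_used_bytes'

    # object counts per pool, to divide summed DB bytes by total objects
    ceph df --format=json | jq '.pools[] | {name: .name, objects: .stats.objects}'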

Re: [ceph-users] troubleshooting librados error with concurrent requests

2018-05-01 Thread Gregory Farnum
Can you provide the full backtrace? It kinda looks like you've left something out. In general though, a Wait inside of an operate call just means the thread has submitted its request and is waiting for the answer to come back. It could be blocked locally or remotely. If it's blocked remotely, the
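One way to tell whether such a call is blocked remotely is to ask the client's admin socket which OSDs its outstanding Objecter requests are waiting on; a sketch, assuming an admin socket has been enabled for the client (the .asok path and osd.12 are placeholders):

    # list the client's in-flight requests and the OSD each one is waiting on
    ceph daemon /var/run/ceph/ceph-client.admin.123.asok objecter_requests

    # then inspect the OSD named in that output
    ceph daemon osd.12 dump_ops_in_flight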

Re: [ceph-users] CephFS MDS stuck (failed to rdlock when getattr / lookup)

2018-05-01 Thread Oliver Freyermuth
Hi all, Am 17.04.2018 um 19:38 schrieb Oliver Freyermuth: > Am 17.04.2018 um 19:35 schrieb Daniel Gryniewicz: >> On 04/17/2018 11:40 AM, Oliver Freyermuth wrote: >>> Am 17.04.2018 um 17:34 schrieb Paul Emmerich: >> >>> [...] We are right now using the packages from https://eu

Re: [ceph-users] radosgw bucket listing (s3 ls s3://$bucketname) slow with ~2 billion objects

2018-05-01 Thread Casey Bodley
The main problem with efficiently listing many-sharded buckets is the requirement to provide entries in sorted order. This means that each http request has to fetch ~1000 entries from every shard, combine them into a sorted order, and throw out the leftovers. The next request to continue the li
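Putting numbers on that for the bucket in this thread: with 32768 shards and roughly 1000 entries fetched per shard per request, every listing page touches on the order of 32768 × 1000 ≈ 33 million index entries just to hand back the next 1000 names in sorted order, which is why a full listing stretches into hours.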

Re: [ceph-users] Intel Xeon Scalable and CPU frequency scaling on NVMe/SSD Ceph OSDs

2018-05-01 Thread Blair Bethwaite
Also curious about this over here. We've got a rack's worth of R740XDs with Xeon 4114's running RHEL 7.4 and intel-pstate isn't even active on them, though I don't believe they are any different at the OS level to our Broadwell nodes (where it is loaded). Have you tried poking the kernel's pmqos i
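A quick way to see which frequency driver and governor a node is actually using (standard sysfs paths, shown here only as a sketch):

    # which driver (intel_pstate vs acpi-cpufreq) and governor are active
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

    # current per-core frequencies
    grep MHz /proc/cpuinfo

The PM QoS interface mentioned above is /dev/cpu_dma_latency; a latency request written there only stays in effect while the writing process keeps the file open.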

Re: [ceph-users] radosgw bucket listing (s3 ls s3://$bucketname) slow with ~2 billion objects

2018-05-01 Thread Robert Stanford
I second the indexless bucket suggestion. The downside being that you can't use bucket policies like object expiration in that case. On Tue, May 1, 2018 at 10:02 AM, David Turner wrote: > Any time using shared storage like S3 or cephfs/nfs/gluster/etc the > absolute rule that I refuse to break
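For completeness, indexless buckets are set up through a placement target rather than per bucket; new buckets created under that placement skip the bucket index entirely (and therefore cannot be listed at all). A rough sketch with made-up placement and pool names; the exact flag spellings are taken from memory of the RGW placement docs and should be treated as assumptions:

    # placement id and pool names are placeholders
    radosgw-admin zonegroup placement add --rgw-zonegroup=default \
        --placement-id=indexless-placement
    radosgw-admin zone placement add --rgw-zone=default \
        --placement-id=indexless-placement \
        --data-pool=default.rgw.buckets.data \
        --index-pool=default.rgw.buckets.index \
        --placement-index-type=indexless
    # commit/restart RGW afterwards so the new placement takes effect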

[ceph-users] Intel Xeon Scalable and CPU frequency scaling on NVMe/SSD Ceph OSDs

2018-05-01 Thread Wido den Hollander
Hi, I've been trying to get the lowest latency possible out of the new Xeon Scalable CPUs and so far I got down to 1.3ms with the help of Nick. However, I can't seem to pin the CPUs to always run at their maximum frequency. If I disable power saving in the BIOS they stay at 2.1GHz (Silver 4110),
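A sketch of the usual knobs for keeping cores at their top frequency when intel_pstate is in use; the C-state parameters are the commonly cited ones and are an assumption here, not something verified in this thread:

    # force the performance governor and tell intel_pstate not to scale down
    cpupower frequency-set -g performance
    echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct

    # deep C-states also add wakeup latency; limiting them is typically done on
    # the kernel command line, e.g. intel_idle.max_cstate=1 processor.max_cstate=1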

Re: [ceph-users] radosgw bucket listing (s3 ls s3://$bucketname) slow with ~2 billion objects

2018-05-01 Thread David Turner
Any time using shared storage like S3 or cephfs/nfs/gluster/etc the absolute rule that I refuse to break is to never rely on a directory listing to know where objects/files are. You should be maintaining a database of some sort or a deterministic naming scheme. The only time a full listing of a d

Re: [ceph-users] Collecting BlueStore per Object DB overhead

2018-05-01 Thread David Turner
Primary RGW usage. 270M objects, 857TB data/1195TB raw, EC 8+3 in the RGW data pool, less than 200K objects in all other pools. OSDs 366 and 367 are NVMe OSDs, the rest are 10TB disks for data/DB and 2GB WAL NVMe partition. The only things on the NVMe OSDs are the RGW metadata pools. I only have

Re: [ceph-users] radosgw bucket listing (s3 ls s3://$bucketname) slow with ~2 billion objects

2018-05-01 Thread Robert Stanford
Listing will always take forever when using a high shard number, AFAIK. That's the tradeoff for sharding. Are those 2B objects in one bucket? How's your read and write performance compared to a bucket with a lower number (thousands) of objects, with that shard number? On Tue, May 1, 2018 at 7:59

[ceph-users] radosgw bucket listing (s3 ls s3://$bucketname) slow with ~2 billion objects

2018-05-01 Thread Katie Holly
One of our radosgw buckets has grown a lot in size, `rgw bucket stats --bucket $bucketname` reports a total of 2,110,269,538 objects with the bucket index sharded across 32768 shards, listing the root context of the bucket with `s3 ls s3://$bucketname` takes more than an hour which is the hard l
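For anyone checking their own buckets, the shard layout and per-shard object counts can be read back with radosgw-admin; a sketch of the Luminous-era commands:

    # per-bucket totals, including num_objects and the index shard layout
    radosgw-admin bucket stats --bucket=$bucketname

    # flags buckets whose objects-per-shard count exceeds the warning threshold
    radosgw-admin bucket limit check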

[ceph-users] troubleshooting librados error with concurrent requests

2018-05-01 Thread Sam Whitlock
I am using librados in an application to read and write many small files (<128MB) concurrently, both in the same process and in different processes (across many nodes). The application is built on TensorFlow (the read and write operations are custom kernels I wrote). I'm having an issue with this app
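When many concurrent librados operations stall, one thing worth ruling out is the client-side Objecter throttles (objecter_inflight_ops / objecter_inflight_op_bytes), which silently block new requests once the in-flight limits are reached. A sketch of how one might inspect them, assuming an admin socket is enabled for the client (the .asok path is a placeholder):

    # current throttle limits for this client (defaults: 1024 ops / 100 MB in flight)
    ceph daemon /var/run/ceph/ceph-client.admin.123.asok config show \
        | grep objecter_inflight

    # if they are the bottleneck, they can be raised in ceph.conf on the client:
    #   [client]
    #   objecter inflight ops = 10240
    #   objecter inflight op bytes = 1048576000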

Re: [ceph-users] Please help me get rid of Slow / blocked requests

2018-05-01 Thread John Hearns
>Sounds like one of the following could be happening: > 1) RBD write caching doing the 37K IOPS, which will need to flush at some point which causes the drop. I am not sure this will help Shantur. But you could try running 'watch cat /proc/meminfo' during a benchmark run. You might be able to spo

Re: [ceph-users] Please help me get rid of Slow / blocked requests

2018-05-01 Thread Van Leeuwen, Robert
> On 5/1/18, 12:02 PM, "ceph-users on behalf of Shantur Rathore" wrote: > I am not sure if the benchmark is overloading the cluster as 3 out of 5 runs the benchmark goes around 37K IOPS and suddenly for the problematic runs it drops to 0 IOPS for a couple of minutes and then

Re: [ceph-users] Please help me get rid of Slow / blocked requests

2018-05-01 Thread Shantur Rathore
Hi Paul, Thanks for replying to my query. I am not sure if the benchmark is overloading the cluster as 3 out of 5 runs the benchmark goes around 37K IOPS and suddenly for the problematic runs it drops to 0 IOPS for a couple of minutes and then resumes. This is a test cluster so nothing else is run
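When the benchmark stalls at 0 IOPS, it can help to catch which OSDs are holding the slow requests at that moment; a minimal sketch using standard commands (osd.7 is a placeholder):

    # which OSDs currently report slow/blocked requests
    ceph health detail

    # what those requests are stuck on (waiting for subops, disk, etc.)
    ceph daemon osd.7 dump_ops_in_flight
    ceph daemon osd.7 dump_historic_ops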