[ceph-users] adding crush ruleset

2019-04-29 Thread Luis Periquito
Hi,

I need to add a more complex crush ruleset to a cluster and was trying
to script that as I'll need to do it often.

Is there any way to create these other than manually editing the crush map?

This is to create a k=4 + m=2 erasure-coded pool across 3 rooms, with 2 chunks in each room.
The ruleset would be something like (haven't tried/tuned the rule):

rule xxx {
   type erasure
   min_size 5
   max_size 6
   step set_chooseleaf_tries 5
   step set_choose_tries 100
   step take default
   step choose indep 3 type room
   step chooseleaf indep 2 type host
}
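
So far the only scriptable route I can think of is round-tripping the map
with crushtool rather than editing it interactively - a rough sketch
(untested; file names are examples):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# append the rule above to crushmap.txt (e.g. from a template kept in git)
cat rule-snippet.txt >> crushmap.txt
crushtool -c crushmap.txt -o crushmap.new
# sanity-check the mappings before injecting (rule id as compiled)
crushtool -i crushmap.new --test --rule 1 --num-rep 6 --show-mappings
ceph osd setcrushmap -i crushmap.new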

thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pool/volume live migration

2019-02-08 Thread Luis Periquito
This is indeed for an OpenStack cloud - it didn't require any level of
performance (so was created on an EC pool) and now it does :(

So the idea would be:
1- create a new pool
2- change cinder to use the new pool

for each volume
  3- stop the usage of the volume (stop the instance?)
  4- "live migrate" the volume to the new pool
  5- start up the instance again
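
In command form, steps 3-5 would presumably boil down to something like
this (a sketch assuming Nautilus clients; pool and image names are made up):

# instance stopped first, so nothing has the image open
rbd migration prepare old-ec-pool/volume-1234 new-pool/volume-1234
rbd migration execute new-pool/volume-1234
rbd migration commit new-pool/volume-1234
# re-point the Cinder/libvirt definition at the new pool, then start the instance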


Does that sound right?

thanks,

On Fri, Feb 8, 2019 at 4:25 PM Jason Dillaman  wrote:
>
> Correction: at least for the initial version of live-migration, you
> need to temporarily stop clients that are using the image, execute
> "rbd migration prepare", and then restart the clients against the new
> destination image. The "prepare" step will fail if it detects that the
> source image is in-use.
>
> On Fri, Feb 8, 2019 at 9:00 AM Jason Dillaman  wrote:
> >
> > Indeed, it is forthcoming in the Nautilus release.
> >
> > You would initiate a "rbd migration prepare <src-image-spec>
> > <dst-image-spec>" to transparently link the dst-image-spec to the
> > src-image-spec. Any active Nautilus clients against the image will
> > then re-open the dst-image-spec for all IO operations. Read requests
> > that cannot be fulfilled by the new dst-image-spec will be forwarded
> > to the original src-image-spec (similar to how parent/child cloning
> > behaves). Write requests to the dst-image-spec will force a deep-copy
> > of all impacted src-image-spec backing data objects (including
> > snapshot history) to the associated dst-image-spec backing data
> > object.  At any point a storage admin can run "rbd migration execute"
> > to deep-copy all src-image-spec data blocks to the dst-image-spec.
> > Once the migration is complete, you would just run "rbd migration
> > commit" to remove src-image-spec.
> >
> > Note: at some point prior to "rbd migration commit", you will need to
> > take minimal downtime to switch OpenStack volume registration from the
> > old image to the new image if you are changing pools.
> >
> > On Fri, Feb 8, 2019 at 5:33 AM Caspar Smit  wrote:
> > >
> > > Hi Luis,
> > >
> > > According to slide 21 of Sage's presentation at FOSDEM it is coming in 
> > > Nautilus:
> > >
> > > https://fosdem.org/2019/schedule/event/ceph_project_status_update/attachments/slides/3251/export/events/attachments/ceph_project_status_update/slides/3251/ceph_new_in_nautilus.pdf
> > >
> > > Kind regards,
> > > Caspar
> > >
> > > Op vr 8 feb. 2019 om 11:24 schreef Luis Periquito :
> > >>
> > >> Hi,
> > >>
> > >> a recurring topic is live migration and pool type change (moving from
> > >> EC to replicated or vice versa).
> > >>
> > >> When I went to the OpenStack open infrastructure (aka summit) Sage
> > >> mentioned about support of live migration of volumes (and as a result
> > >> of pools) in Nautilus. Is this still the case and is expected to have
> > >> live migration working by then?
> > >>
> > >> thanks,
> > >> ___
> > >> ceph-users mailing list
> > >> ceph-users@lists.ceph.com
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
> > --
> > Jason
>
>
>
> --
> Jason
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] pool/volume live migration

2019-02-08 Thread Luis Periquito
Hi,

A recurring topic is live migration and pool type change (moving from
EC to replicated or vice versa).

When I went to the OpenStack Open Infrastructure event (aka the Summit),
Sage mentioned support for live migration of volumes (and, as a result,
of pools) in Nautilus. Is this still the case, and is live migration
expected to be working by then?

thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] safe to remove leftover bucket index objects

2018-10-22 Thread Luis Periquito
It may be related to http://tracker.ceph.com/issues/34307 - I have a
cluster whose OMAP size is larger than the stored data...
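
For the marker comparison Dan describes below, a rough sketch of the check
I'd run before deleting anything (the pool name and the assumption that
sharded index objects end in ".<shard-number>" both need verifying):

POOL=default.rgw.buckets.index
# collect both the id and the marker of every existing bucket
radosgw-admin bucket stats | jq -r '.[] | .id, .marker' | sort -u > /tmp/live-markers
rados -p "$POOL" ls | while read -r obj; do
    marker=${obj#.dir.}      # strip the ".dir." prefix
    marker=${marker%.*}      # strip the trailing shard number (assumption)
    grep -qx "$marker" /tmp/live-markers || echo "unreferenced: $obj"
done
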
On Mon, Oct 22, 2018 at 11:09 AM Wido den Hollander  wrote:
>
>
>
> On 8/31/18 5:31 PM, Dan van der Ster wrote:
> > So it sounds like you tried what I was going to do, and it broke
> > things. Good to know... thanks.
> >
> > In our case, what triggered the extra index objects was a user running
> > PUT /bucketname/ around 20 million times -- this apparently recreates
> > the index objects.
> >
>
> I'm asking the same!
>
> Large omap object found. Object:
> 6:199f36b7:::.dir.ea087a7e-cb26-420f-9717-a98080b0623c.134167.15.1:head
> Key count: 5374754 Size (bytes): 1366279268
>
> In this case I can't find '134167.15.1' in any of the buckets when I do:
>
> for BUCKET in $(radosgw-admin metadata bucket list|jq -r '.[]'); do
> radosgw-admin metadata get bucket:$BUCKET > bucket.$BUCKET
> done
>
> If I grep through all the bucket.* files this object isn't showing up
> anywhere.
>
> Before I remove the object I want to make sure that it's safe to delete it.
>
> A garbage collector for the bucket index pools would be very great to have.
>
> Wido
>
> > -- dan
> >
> > On Thu, Aug 30, 2018 at 7:20 PM David Turner  wrote:
> >>
> >> I'm glad you asked this, because it was on my to-do list. I know that,
> >> based on ours, an object not matching an existing bucket marker does not
> >> mean it's safe to delete.  I have an index pool with 22k objects in it. 70
> >> objects match existing bucket markers. I was having a problem on the cluster
> >> and started deleting the objects in the index pool; after going through 200
> >> objects I stopped it and tested, and had lost access to 3 buckets. Luckily
> >> for me they were all buckets I'd been working on deleting, so no need for
> >> recovery.
> >>
> >> I then compared bucket IDs to the objects in that pool, but still only
> >> found a couple hundred more matching objects. I have no idea what the
> >> other 22k objects are in the index pool that don't match bucket markers
> >> or bucket IDs. I did confirm there was no resharding happening, both in the
> >> reshard list and in all bucket reshard statuses.
> >>
> >> Does anyone know how to parse the names of these objects and how to tell
> >> what can be deleted?  This is of particular interest as I have another
> >> cluster with 1M objects in the index pool.
> >>
> >> On Thu, Aug 30, 2018, 7:29 AM Dan van der Ster  wrote:
> >>>
> >>> Replying to self...
> >>>
> >>> On Wed, Aug 1, 2018 at 11:56 AM Dan van der Ster  
> >>> wrote:
> 
>  Dear rgw friends,
> 
>  Somehow we have more than 20 million objects in our
>  default.rgw.buckets.index pool.
>  They are probably leftover from this issue we had last year:
>  http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018565.html
>  and we want to clean the leftover / unused index objects
> 
>  To do this, I would rados ls the pool, get a list of all existing
>  buckets and their current marker, then delete any objects with an
>  unused marker.
>  Does that sound correct?
> >>>
> >>> More precisely, for example, there is an object
> >>> .dir.61c59385-085d-4caa-9070-63a3868dccb6.2978181.59.8 in the index
> >>> pool.
> >>> I run `radosgw-admin bucket stats` to get the marker for all current
> >>> existing buckets.
> >>> The marker 61c59385-085d-4caa-9070-63a3868dccb6.2978181.59 is not
> >>> mentioned in the bucket stats output.
> >>> Is it safe to rados rm 
> >>> .dir.61c59385-085d-4caa-9070-63a3868dccb6.2978181.59.8 ??
> >>>
> >>> Thanks in advance!
> >>>
> >>> -- dan
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
>  Can someone suggest a better way?
> 
>  Cheers, Dan
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OMAP size on disk

2018-10-09 Thread Luis Periquito
Hi all,

I have several clusters, all running Luminous (12.2.7) providing an S3
interface. All of them have dynamic resharding enabled and working.

One of the newer clusters is starting to give warnings on the used
space for the OMAP directory. The default.rgw.buckets.index pool is
replicated with 3x copies of the data.

I created a new crush ruleset to only use a few well-known SSDs, and
the OMAP directory size changed as expected: if I set the OSD as out
and then tell it to compact, the size of the OMAP will shrink. If I set
the OSD as in, the OMAP will grow back to its previous state. And while
the backfill is going on we get loads of key recoveries.

Total physical space for OMAP in the OSDs that have them is ~1TB, so
given a 3x replica ~330G before replication.

The data size for the default.rgw.buckets.data is just under 300G.
There is one bucket who has ~1.7M objects and 22 shards.

After deleting that bucket the size of the database didn't change -
even after running gc process and telling the OSD to compact its
database.

This is not happening in older clusters, i.e. those created with Hammer.
Could this be a bug?

I looked at getting all the OMAP keys and sizes
(https://ceph.com/geen-categorie/get-omap-keyvalue-size/) and they add
up to close to the value I expected them to take, looking at the
physical storage.
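
A simplified sketch of that kind of check (it only counts keys per index
object; the linked article also sums the value sizes):

POOL=default.rgw.buckets.index
rados -p "$POOL" ls | while read -r obj; do
    echo "$(rados -p "$POOL" listomapkeys "$obj" | wc -l) $obj"
done | sort -rn | head -20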

Any ideas where to look next?

thanks for all the help.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH puzzle: step weighted-take

2018-09-27 Thread Luis Periquito
I think your objective is to move the data without anyone else
noticing. What I usually do is reduce the priority of the recovery
process as much as possible. Do note this will make the recovery take
a looong time, and will also make recovery from failures slow...
ceph tell osd.* injectargs '--osd_recovery_sleep 0.9'
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
ceph tell osd.* injectargs '--osd-client-op-priority 63'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
ceph tell osd.* injectargs '--osd_recovery_max_chunk 524288'

I would also assume you have set osd_scrub_during_recovery to false.



On Thu, Sep 27, 2018 at 4:19 PM Dan van der Ster  wrote:
>
> Dear Ceph friends,
>
> I have a CRUSH data migration puzzle and wondered if someone could
> think of a clever solution.
>
> Consider an osd tree like this:
>
>   -2   4428.02979 room 0513-R-0050
>  -72911.81897 rack RA01
>   -4917.27899 rack RA05
>   -6917.25500 rack RA09
>   -9786.23901 rack RA13
>  -14895.43903 rack RA17
>  -65   1161.16003 room 0513-R-0060
>  -71578.76001 ipservice S513-A-IP38
>  -70287.56000 rack BA09
>  -80291.20001 rack BA10
>  -76582.40002 ipservice S513-A-IP63
>  -75291.20001 rack BA11
>  -78291.20001 rack BA12
>
> In the beginning, for reasons that are not important, we created two pools:
>   * poolA chooses room=0513-R-0050 then replicates 3x across the racks.
>   * poolB chooses room=0513-R-0060, replicates 2x across the
> ipservices, then puts a 3rd replica in room 0513-R-0050.
>
> For clarity, here is the crush rule for poolB:
> type replicated
> min_size 1
> max_size 10
> step take 0513-R-0060
> step chooseleaf firstn 2 type ipservice
> step emit
> step take 0513-R-0050
> step chooseleaf firstn -2 type rack
> step emit
>
> Now to the puzzle.
> For reasons that are not important, we now want to change the rule for
> poolB to put all three 3 replicas in room 0513-R-0060.
> And we need to do this in a way which is totally non-disruptive
> (latency-wise) to the users of either pools. (These are both *very*
> active RBD pools).
>
> I see two obvious ways to proceed:
>   (1) simply change the rule for poolB to put a third replica on any
> osd in room 0513-R-0060. I'm afraid though that this would involve way
> too many concurrent backfills, cluster-wide, even with
> osd_max_backfills=1.
>   (2) change poolB size to 2, then change the crush rule to that from
> (1), then reset poolB size to 3. This would risk data availability
> during the time that the pool is size=2, and also risks that every osd
> in room 0513-R-0050 would be too busy deleting for some indeterminate
> time period (10s of minutes, I expect).
>
> So I would probably exclude those two approaches.
>
> Conceptually what I'd like to be able to do is a gradual migration,
> which if I may invent some syntax on the fly...
>
> Instead of
>step take 0513-R-0050
> do
>step weighted-take 99 0513-R-0050 1 0513-R-0060
>
> That is, 99% of the time take room 0513-R-0050 for the 3rd copies, 1%
> of the time take room 0513-R-0060.
> With a mechanism like that, we could gradually adjust those "step
> weighted-take" lines until 100% of the 3rd copies were in 0513-R-0060.
>
> I have a feeling that something equivalent to that is already possible
> with weight-sets or some other clever crush trickery.
> Any ideas?
>
> Best Regards,
>
> Dan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw bucket stats vs s3cmd du

2018-09-18 Thread Luis Periquito
Hi all,

I have a couple of very big s3 buckets that store temporary data. We
keep writing to the buckets some files which are then read and
deleted. They serve as a temporary storage.

We're writing (and deleting) circa 1TB of data daily in each of those
buckets, and their size has been mostly stable over time.

The issue has arisen that radosgw-admin bucket stats says one bucket
is 10T and the other is 4T; but s3cmd du (and I did a sync which
agrees) says 3.5T and 2.3T respectively.

The bigger bucket suffered from the orphaned objects bug
(http://tracker.ceph.com/issues/18331). The smaller one was created on
10.2.3, so it may also have suffered from the same bug.

Any ideas what could be at play here? How can we reduce actual usage?

trimming part of the radosgw-admin bucket stats output
"usage": {
"rgw.none": {
"size": 0,
"size_actual": 0,
"size_utilized": 0,
"size_kb": 0,
"size_kb_actual": 0,
"size_kb_utilized": 0,
"num_objects": 18446744073709551572
},
"rgw.main": {
"size": 10870197197183,
"size_actual": 10873866362880,
"size_utilized": 18446743601253967400,
"size_kb": 10615426951,
"size_kb_actual": 10619010120,
"size_kb_utilized": 18014398048099578,
"num_objects": 1702444
},
"rgw.multimeta": {
"size": 0,
"size_actual": 0,
"size_utilized": 0,
"size_kb": 0,
"size_kb_actual": 0,
"size_kb_utilized": 0,
"num_objects": 406462
}
},
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fwd: Planning all flash cluster

2018-06-20 Thread Luis Periquito
adding back in the list :)

-- Forwarded message -
From: Luis Periquito 
Date: Wed, Jun 20, 2018 at 1:54 PM
Subject: Re: [ceph-users] Planning all flash cluster
To: 


On Wed, Jun 20, 2018 at 1:35 PM Nick A  wrote:
>
> Thank you, I was under the impression that 4GB RAM per 1TB was quite 
> generous, or is that not the case with all flash clusters? What's the 
> recommended RAM per OSD currently? Happy to throw more at it for a 
> performance boost. The important thing is that I'd like all nodes to be 
> absolutely identical.
I'm doing 8G per OSD, though I use 1.9T SSDs.

>
> Based on replies so far, it looks like 5 nodes might be a better idea, maybe 
> each with 14 OSD's (960GB SSD's)? Plenty of 16 slot 2U chassis around to make 
> it a no brainer if that's what you'd recommend!
I tend to add more nodes: 1U with 4-8 SSDs per chassis to start with,
using a single CPU with a high frequency. For IOPS/latency, CPU
frequency is really important.
I have started a cluster that only has 2 SSDs (which I share with the
OS) for data, but has 8 nodes. Those servers can take up to 10 drives.

I'm using the Fujitsu RX1330 (I believe the Dell equivalent is the R330),
with an Intel E3-1230v6 CPU and 64G of RAM, dual 10G and a PSAS
(passthrough controller).

>
> The H710 doesn't do JBOD or passthrough, hence looking for an alternative 
> HBA. It would be nice to do the boot drives as hardware RAID 1 though, so a 
> card that can do both at the same time (like the H730 found R630's etc) would 
> be ideal.
>
> Regards,
> Nick
>
> On 20 June 2018 at 13:18, Luis Periquito  wrote:
>>
>> Adding more nodes from the beginning would probably be a good idea.
>>
>> On Wed, Jun 20, 2018 at 12:58 PM Nick A  wrote:
>> >
>> > Hello Everyone,
>> >
>> > We're planning a small cluster on a budget, and I'd like to request any 
>> > feedback or tips.
>> >
>> > 3x Dell R720XD with:
>> > 2x Xeon E5-2680v2 or very similar
>> The CPUs look good and sufficiently fast for IOPS.
>>
>> > 96GB RAM
>> 4GB per OSD looks a bit on the short side. Probably 192G would help.
>>
>> > 2x Samsung SM863 240GB boot/OS drives
>> > 4x Samsung SM863 960GB OSD drives
>> > Dual 40/56Gbit Infiniband using IPoIB.
>> >
>> > 3 replica, MON on OSD nodes, RBD only (no object or CephFS).
>> >
>> > We'll probably add another 2 OSD drives per month per node until full (24 
>> > SSD's per node), at which point, more nodes. We've got a few SM863's in 
>> > production on other system and are seriously impressed with them, so would 
>> > like to use them for Ceph too.
>> >
>> > We're hoping this is going to provide a decent amount of IOPS, 20k would 
>> > be ideal. I'd like to avoid NVMe Journals unless it's going to make a 
>> > truly massive difference. Same with carving up the SSD's, would rather 
>> > not, and just keep it as simple as possible.
>> I agree: those SSDs shouldn't really require a journal device. Not
>> sure about the 20k IOPS specially without any further information.
>> Doing 20k IOPS at 1kB block is totally different at 1MB block...
>> >
>> > Is there anything that obviously stands out as severely unbalanced? The 
>> > R720XD comes with a H710 - instead of putting them in RAID0, I'm thinking 
>> > a different HBA might be a better idea, any recommendations please?
>> Don't know that HBA. Does it support pass through mode or HBA mode?
>> >
>> > Regards,
>> > Nick
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Planning all flash cluster

2018-06-20 Thread Luis Periquito
Adding more nodes from the beginning would probably be a good idea.

On Wed, Jun 20, 2018 at 12:58 PM Nick A  wrote:
>
> Hello Everyone,
>
> We're planning a small cluster on a budget, and I'd like to request any 
> feedback or tips.
>
> 3x Dell R720XD with:
> 2x Xeon E5-2680v2 or very similar
The CPUs look good and sufficiently fast for IOPS.

> 96GB RAM
4GB per OSD looks a bit on the short side. Probably 192G would help.

> 2x Samsung SM863 240GB boot/OS drives
> 4x Samsung SM863 960GB OSD drives
> Dual 40/56Gbit Infiniband using IPoIB.
>
> 3 replica, MON on OSD nodes, RBD only (no object or CephFS).
>
> We'll probably add another 2 OSD drives per month per node until full (24 
> SSD's per node), at which point, more nodes. We've got a few SM863's in 
> production on other system and are seriously impressed with them, so would 
> like to use them for Ceph too.
>
> We're hoping this is going to provide a decent amount of IOPS, 20k would be 
> ideal. I'd like to avoid NVMe Journals unless it's going to make a truly 
> massive difference. Same with carving up the SSD's, would rather not, and 
> just keep it as simple as possible.
I agree: those SSDs shouldn't really require a journal device. Not
sure about the 20k IOPS, especially without any further information.
Doing 20k IOPS at a 1kB block size is totally different from doing it
at a 1MB block size...
>
> Is there anything that obviously stands out as severely unbalanced? The 
> R720XD comes with a H710 - instead of putting them in RAID0, I'm thinking a 
> different HBA might be a better idea, any recommendations please?
Don't know that HBA. Does it support pass through mode or HBA mode?
>
> Regards,
> Nick
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] issues on CT + EC pool

2018-05-04 Thread Luis Periquito
Hi,

I have a big-ish cluster that, amongst other things, has a radosgw
configured to have an EC data pool (k=12, m=4). The cluster is
currently running Jewel (10.2.7).

That pool spans 244 HDDs and has 2048 PGs.

from the df detail:
NAME            ID CATEGORY QUOTA OBJECTS QUOTA BYTES USED   %USED MAX AVAIL OBJECTS  DIRTY  READ   WRITE RAW USED
.rgw.buckets.ec 26 -        N/A           N/A         76360G 28.66 185T      97908947 95614k 73271k 185M  101813G
ct-radosgw      37 -        N/A           N/A         4708G  70.69 1952G     5226185  2071k  591M   1518M 9416G

The ct-radosgw should be size 3, but currently due to an unrelated
issue (pdu failure) is size 2.

Whenever I flush data from the cache tier to the base tier the OSDs
start updating their local leveldb database, using up 100% IO, until
they: a) are set as down for no answer, and/or b) suicide timeout.

I have other pools targeting those same OSDs but until now nothing has
happened when the IO goes to the other pools.

Any ideas on how to proceed?

thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] trimming the MON level db

2018-04-30 Thread Luis Periquito
On Sat, Apr 28, 2018 at 10:24 AM, Wido den Hollander <w...@42on.com> wrote:
>
>
> On 04/27/2018 08:31 PM, David Turner wrote:
>> I'm assuming that the "very bad move" means that you have some PGs not
>> in active+clean.  Any non-active+clean PG will prevent your mons from
>> being able to compact their db store.  This is by design so that if
>> something were to happen where the data on some of the copies of the PG
>> were lost and gone forever the mons could do their best to enable the
>> cluster to reconstruct the PG knowing when OSDs went down/up, when PGs
>> moved to new locations, etc.
>>
>> Thankfully there isn't a way around this.  Something you can do is stop
> >> a mon, move the /var/lib/ceph/mon/ceph-$(hostname -s)/ folder to a new disk with
>> more space, set it to mount in the proper location, and start it back
>> up.  You would want to do this for each mon to give them more room for
>> the mon store to grow.  Make sure to give the mon plenty of time to get
>> back up into the quorum before moving on to the next one.
>>
>
> Indeed. This is an unknown thing with Monitors for a lot of people. I
> always suggest installing a >200GB DC-grade SSD in Monitors to make sure
> you can make large movements without running into trouble with the MONs.
>
> So yes, move this data to a new disk. Without all PGs active+clean you
> can't trim the store.

It's currently on a P3700 400G SSD and almost full (280G). We've been
able to stop it from growing at the rate it was growing before. The
main issue was an OSD crashing under load, making a few others
crash/OOM as well. Setting nodown has been a lifesaver.
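
For the record, the per-monitor move David describes above would look
roughly like this (a sketch; device, mount point and unit names are
assumptions, and sysvinit clusters would use the service scripts instead):

systemctl stop ceph-mon@$(hostname -s)
mkfs.xfs /dev/sdx1                       # hypothetical new, larger device
mount /dev/sdx1 /mnt/newmon
rsync -a /var/lib/ceph/mon/ceph-$(hostname -s)/ /mnt/newmon/
umount /mnt/newmon
# add an fstab entry so /dev/sdx1 mounts at /var/lib/ceph/mon/ceph-$(hostname -s)
mount /dev/sdx1 /var/lib/ceph/mon/ceph-$(hostname -s)
systemctl start ceph-mon@$(hostname -s)
ceph quorum_status        # wait for the mon to rejoin quorum before the next one
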
>
>> On Wed, Apr 25, 2018 at 10:25 AM Luis Periquito <periqu...@gmail.com
>> <mailto:periqu...@gmail.com>> wrote:
>>
>> Hi all,
>>
>> we have a (really) big cluster that's ongoing a very bad move and the
>> monitor database is growing at an alarming rate.
>>
>> The cluster is running jewel (10.2.7) and is there any way to trim the
>> monitor database before it gets HEALTH_OK?
>>
>> I've searched and so far only found people saying not really, but just
>> wanted a final sanity check...
>>
>> thanks,
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] trimming the MON level db

2018-04-25 Thread Luis Periquito
Hi all,

we have a (really) big cluster that's undergoing a very bad data move
and the monitor database is growing at an alarming rate.

The cluster is running jewel (10.2.7) and is there any way to trim the
monitor database before it gets HEALTH_OK?

I've searched and so far only found people saying not really, but just
wanted a final sanity check...

thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] IO rate-limiting with Ceph RBD (and libvirt)

2018-03-23 Thread Luis Periquito
On Fri, Mar 23, 2018 at 4:05 AM, Anthony D'Atri  wrote:
> FYI: I/O limiting in combination with OpenStack 10/12 + Ceph doesn?t work
> properly. Bug: https://bugzilla.redhat.com/show_bug.cgi?id=1476830
>
>
> That's an OpenStack bug, nothing to do with Ceph.  Nothing stops you from
> using virsh to throttle directly:
>
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_tuning_and_optimization_guide/sect-virtualization_tuning_optimization_guide-blockio-techniques
>
> https://github.com/cernceph/ceph-scripts/blob/master/tools/virsh-throttle-rbd.py
>

Actually it's not even an OpenStack bug, it's a misunderstanding of
how volume limits work: the flavour will set the limits on the
instances when they boot from an image, but the reporter is booting
from volumes, so the limits have to be set with Cinder volume types...
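
For reference, the volume-type route is roughly (a sketch; names and
limits are examples):

cinder qos-create limited-io consumer=front-end read_iops_sec=500 write_iops_sec=500
cinder type-create limited
cinder qos-associate <qos-spec-id> <volume-type-id>
# volumes created with "--volume-type limited" then get the libvirt throttles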

>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Newbie question: stretch ceph cluster

2018-02-09 Thread Luis Periquito
On Fri, Feb 9, 2018 at 2:59 PM, Kai Wagner  wrote:
> Hi and welcome,
>
>
> On 09.02.2018 15:46, ST Wong (ITSC) wrote:
>
> Hi, I'm new to CEPH and got a task to setup CEPH with kind of DR feature.
> We've 2 10Gb connected data centers in the same campus.I wonder if it's
> possible to setup a CEPH cluster with following components in each data
> center:
>
>
> 3 x mon + mds + mgr
In this scenario you wouldn't be any better off, as losing a room means
losing half of your cluster. Can you run a MON somewhere else that
would be able to continue if you lose one of the rooms?

As for the MGR and MDS, they're (recommended to be) active/passive, so
one per room would be enough.
>
> 3 x OSD (replicated factor=2, between data center)

replicated with size=2 is a bad idea. You can have size=4 and
min_size=2 and have a crush map with rules something like:

rule crosssite {
id 0
type replicated
min_size 4
max_size 4
step take default
step choose firstn 2 type room
step chooseleaf firstn 2 type host
step emit
}

this will store 4 copies: it picks 2 rooms and, within each room, 2 different hosts.
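
The matching pool settings would be something like this (a sketch; on
pre-Luminous releases the last setting is crush_ruleset and takes the
rule id instead of the name):

ceph osd pool set <pool> size 4
ceph osd pool set <pool> min_size 2
ceph osd pool set <pool> crush_rule crosssite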

>
>
> So that any one of following failure won't affect the cluster's operation
> and data availability:
>
> any one component in either data center
> failure of either one of the data center
>
>
> Is it possible?
>
> In general this is possible, but I would consider that replica=2 is not a
> good idea. In case of a failure scenario or just maintenance and one DC is
> powered off and just one single disk fails on the other DC, this can already
> lead to data loss. My advice here would be, if anyhow possible, please don't
> do replica=2.
>
> In case one data center failure case, seems replication can't occur any
> more.   Any CRUSH rule can achieve this purpose?
>
>
> Sorry for the newbie question.
>
>
> Thanks a lot.
>
> Regards
>
> /st wong
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> --
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB
> 21284 (AG Nürnberg)
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High apply latency

2018-01-31 Thread Luis Periquito
On a cursory look at the information, it seems the cluster is
overloaded with requests.

Just a guess, but if you look at IO usage on those spindles they'll be
at or around 100% usage most of the time.

If that is the case then increasing the pg_num and pgp_num won't help,
and short term, will make it worse.

Metadata pools (like default.rgw.buckets.index) really excel on an SSD
pool, even a small one. I carved a small OSD out of the journal SSDs
for those kinds of workloads.
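
On Luminous with device classes, moving the index pool to SSDs is roughly
(a sketch; the rule name is an example, and the move will backfill all of
that pool's objects):

ceph osd crush rule create-replicated rgw-index-ssd default host ssd
ceph osd pool set default.rgw.buckets.index crush_rule rgw-index-ssd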

On Wed, Jan 31, 2018 at 2:26 PM, Jakub Jaszewski
 wrote:
> Is it safe to increase pg_num and pgp_num from 1024 up to 2048 for volumes
> and default.rgw.buckets.data pools?
> How will it impact cluster behavior? I guess cluster rebalancing will occur
> and will take long time considering amount of data we have on it ?
>
> Regards
> Jakub
>
>
>
> On Wed, Jan 31, 2018 at 1:37 PM, Jakub Jaszewski 
> wrote:
>>
>> Hi,
>>
>> I'm wondering why slow requests are being reported mainly when the request
>> has been put into the queue for processing by its PG  (queued_for_pg ,
>> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/#debugging-slow-request).
>> Could it be due too low pg_num/pgp_num ?
>>
>> It looks that slow requests are mainly addressed to
>> default.rgw.buckets.data (pool id 20) , volumes (pool id 3) and
>> default.rgw.buckets.index (pool id 14)
>>
>> 2018-01-31 12:06:55.899557 osd.59 osd.59 10.212.32.22:6806/4413 38 :
>> cluster [WRN] slow request 30.125793 seconds old, received at 2018-01-31
>> 12:06:25.773675: osd_op(client.857003.0:126171692 3.a4fec1ad 3.a4fec1ad
>> (undecoded) ack+ondisk+write+known_if_redirected e5722) currently
>> queued_for_pg
>>
>> Btw how can I get more human-friendly client information from log entry
>> like above ?
>>
>> Current pg_num/pgp_num
>>
>> pool 3 'volumes' replicated size 3 min_size 2 crush_rule 0 object_hash
>> rjenkins pg_num 1024 pgp_num 1024 last_change 4502 flags hashpspool
>> stripe_width 0 application rbd
>> removed_snaps [1~3]
>> pool 14 'default.rgw.buckets.index' replicated size 3 min_size 2
>> crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 4502 flags
>> hashpspool stripe_width 0 application rgw
>> pool 20 'default.rgw.buckets.data' erasure size 9 min_size 6 crush_rule 1
>> object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 4502 flags
>> hashpspool stripe_width 4224 application rgw
>>
>> Usage
>>
>> GLOBAL:
>> SIZE AVAIL RAW USED %RAW USED OBJECTS
>> 385T  144T 241T 62.54  31023k
>> POOLS:
>> NAME ID QUOTA OBJECTS QUOTA BYTES
>> USED   %USED MAX AVAIL OBJECTS  DIRTY  READ   WRITE
>> RAW USED
>> volumes  3  N/A   N/A
>> 40351G 70.9116557G 10352314 10109k  2130M  2520M
>> 118T
>> default.rgw.buckets.index14 N/A   N/A
>> 0 016557G  205205   160M 27945k
>> 0
>> default.rgw.buckets.data 20 N/A   N/A
>> 79190G 70.5133115G 20865953 20376k   122M   113M
>> 116T
>>
>>
>>
>> # ceph osd pool ls detail
>> pool 0 'rbd' replicated size 3 min_size 2 crush_rule 0 object_hash
>> rjenkins pg_num 64 pgp_num 64 last_change 4502 flags hashpspool stripe_width
>> 0 application rbd
>> pool 1 'vms' replicated size 3 min_size 2 crush_rule 0 object_hash
>> rjenkins pg_num 1024 pgp_num 1024 last_change 4502 flags hashpspool
>> stripe_width 0 application rbd
>> pool 2 'images' replicated size 3 min_size 2 crush_rule 0 object_hash
>> rjenkins pg_num 512 pgp_num 512 last_change 5175 flags hashpspool
>> stripe_width 0 application rbd
>> removed_snaps [1~7,14~2]
>> pool 3 'volumes' replicated size 3 min_size 2 crush_rule 0 object_hash
>> rjenkins pg_num 1024 pgp_num 1024 last_change 4502 flags hashpspool
>> stripe_width 0 application rbd
>> removed_snaps [1~3]
>> pool 4 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash
>> rjenkins pg_num 8 pgp_num 8 last_change 4502 flags hashpspool stripe_width 0
>> application rgw
>> pool 5 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 8 pgp_num 8 last_change 4502 flags hashpspool
>> stripe_width 0 application rgw
>> pool 6 'default.rgw.data.root' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 8 pgp_num 8 last_change 4502 flags hashpspool
>> stripe_width 0 application rgw
>> pool 7 'default.rgw.gc' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 8 pgp_num 8 last_change 4502 flags hashpspool
>> stripe_width 0 application rgw
>> pool 8 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 8 pgp_num 8 last_change 4502 flags hashpspool
>> stripe_width 0 application rgw
>> pool 9 'default.rgw.users.uid' replicated size 3 min_size 2 

Re: [ceph-users] issue adding OSDs

2018-01-12 Thread Luis Periquito
"ceph versions" returned all daemons as running 12.2.1.

On Fri, Jan 12, 2018 at 8:00 AM, Janne Johansson <icepic...@gmail.com> wrote:
> Running "ceph mon versions" and "ceph osd versions" and so on as you do the
> upgrades would have helped I guess.
>
>
> 2018-01-11 17:28 GMT+01:00 Luis Periquito <periqu...@gmail.com>:
>>
>> this was a bit weird, but is now working... Writing for future
>> reference if someone faces the same issue.
>>
>> this cluster was upgraded from jewel to luminous following the
>> recommended process. When it was finished I just set the require_osd
>> to luminous. However I hadn't restarted the daemons since. So just
>> restarting all the OSDs made the problem go away.
>>
>> How to check if that was the case? The OSDs now have a "class" associated.
>>
>>
>>
>> On Wed, Jan 10, 2018 at 7:16 PM, Luis Periquito <periqu...@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > I'm running a cluster with 12.2.1 and adding more OSDs to it.
>> > Everything is running version 12.2.1 and require_osd is set to
>> > luminous.
>> >
>> > one of the pools is replicated with size 2 min_size 1, and is
>> > seemingly blocking IO while recovering. I have no slow requests,
>> > looking at the output of "ceph osd perf" it seems brilliant (all
>> > numbers are lower than 10).
>> >
>> > clients are RBD (OpenStack VM in KVM) and using (mostly) 10.2.7. I've
>> > tagged those OSDs as out and the RBD just came back to life. I did
>> > have some objects degraded:
>> >
>> > 2018-01-10 18:23:52.081957 mon.mon0 mon.0 x.x.x.x:6789/0 410414 :
>> > cluster [WRN] Health check update: 9926354/49526500 objects misplaced
>> > (20.043%) (OBJECT_MISPLACED)
>> > 2018-01-10 18:23:52.081969 mon.mon0 mon.0 x.x.x.x:6789/0 410415 :
>> > cluster [WRN] Health check update: Degraded data redundancy:
>> > 5027/49526500 objects degraded (0.010%), 1761 pgs unclean, 27 pgs
>> > degraded (PG_DEGRADED)
>> >
>> > any thoughts as to what might be happening? I've run such operations
>> > many a times...
>> >
>> > thanks for all help, as I'm grasping as to figure out what's
>> > happening...
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> --
> May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] issue adding OSDs

2018-01-11 Thread Luis Periquito
this was a bit weird, but is now working... Writing for future
reference if someone faces the same issue.

this cluster was upgraded from jewel to luminous following the
recommended process. When it was finished I just set require_osd_release
to luminous. However I hadn't restarted the daemons since. So just
restarting all the OSDs made the problem go away.

How to check if that was the case? The OSDs now have a "class" associated.
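
Quick ways to check (a sketch):

ceph versions                              # every OSD should report the Luminous build
ceph osd tree | head                       # restarted OSDs show a device class (hdd/ssd)
ceph osd dump | grep require_osd_release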



On Wed, Jan 10, 2018 at 7:16 PM, Luis Periquito <periqu...@gmail.com> wrote:
> Hi,
>
> I'm running a cluster with 12.2.1 and adding more OSDs to it.
> Everything is running version 12.2.1 and require_osd is set to
> luminous.
>
> one of the pools is replicated with size 2 min_size 1, and is
> seemingly blocking IO while recovering. I have no slow requests,
> looking at the output of "ceph osd perf" it seems brilliant (all
> numbers are lower than 10).
>
> clients are RBD (OpenStack VM in KVM) and using (mostly) 10.2.7. I've
> tagged those OSDs as out and the RBD just came back to life. I did
> have some objects degraded:
>
> 2018-01-10 18:23:52.081957 mon.mon0 mon.0 x.x.x.x:6789/0 410414 :
> cluster [WRN] Health check update: 9926354/49526500 objects misplaced
> (20.043%) (OBJECT_MISPLACED)
> 2018-01-10 18:23:52.081969 mon.mon0 mon.0 x.x.x.x:6789/0 410415 :
> cluster [WRN] Health check update: Degraded data redundancy:
> 5027/49526500 objects degraded (0.010%), 1761 pgs unclean, 27 pgs
> degraded (PG_DEGRADED)
>
> any thoughts as to what might be happening? I've run such operations
> many a times...
>
> thanks for all help, as I'm grasping as to figure out what's happening...
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] issue adding OSDs

2018-01-10 Thread Luis Periquito
Hi,

I'm running a cluster with 12.2.1 and adding more OSDs to it.
Everything is running version 12.2.1 and require_osd is set to
luminous.

one of the pools is replicated with size 2 min_size 1, and is
seemingly blocking IO while recovering. I have no slow requests,
looking at the output of "ceph osd perf" it seems brilliant (all
numbers are lower than 10).

clients are RBD (OpenStack VM in KVM) and using (mostly) 10.2.7. I've
tagged those OSDs as out and the RBD just came back to life. I did
have some objects degraded:

2018-01-10 18:23:52.081957 mon.mon0 mon.0 x.x.x.x:6789/0 410414 :
cluster [WRN] Health check update: 9926354/49526500 objects misplaced
(20.043%) (OBJECT_MISPLACED)
2018-01-10 18:23:52.081969 mon.mon0 mon.0 x.x.x.x:6789/0 410415 :
cluster [WRN] Health check update: Degraded data redundancy:
5027/49526500 objects degraded (0.010%), 1761 pgs unclean, 27 pgs
degraded (PG_DEGRADED)

any thoughts as to what might be happening? I've run such operations
many a times...

thanks for all the help, as I'm grasping at straws trying to figure out what's happening...
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance issues on Luminous

2018-01-04 Thread Luis Periquito
you never said if it was bluestore or filestore?

Can you look in the server to see which component is being stressed
(network, cpu, disk)? Utilities like atop are very handy for this.

Regarding those specific SSDs, they are particularly bad when running
for some time without trimming - performance nosedives by at least an
order of magnitude. If you really want to take that risk, look at least
at the PROs. And some workloads will always be slow on them.

You never say what your target environment is: do you value
IOPS/latency? Those CPUs won't be great, and I've read a few things
recommending avoiding NUMA (there are 2 CPUs in there). And (higher)
frequency is more important than the number of cores for a high-IOPS
cluster.
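
Also, a 4M sequential dd/rados bench mostly measures bandwidth; for the
latency/IOPS side something like fio's rbd engine is more telling (a
sketch; pool, image and client names are examples):

fio --name=randwrite --ioengine=rbd --clientname=admin --pool=rbdbench \
    --rbdname=testimg --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
    --runtime=60 --time_based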

On Thu, Jan 4, 2018 at 3:56 PM, Rafał Wądołowski
 wrote:
> I have size of 2.
>
> We know about this risk and we accept it, but we still don't know why
> performance so so bad.
>
> Cheers,
>
> Rafał Wądołowski
>
> On 04.01.2018 16:51, c...@elchaka.de wrote:
>
> I assume you have size of 3 then divide your expected 400 with 3 and you are
> not far Away from what you get...
>
> In Addition you should Never use Consumer grade ssds for ceph as they will
> be reach the DWPD very soon...
>
> Am 4. Januar 2018 09:54:55 MEZ schrieb "Rafał Wądołowski"
> :
>>
>> Hi folks,
>>
>> I am currently benchmarking my cluster for an performance issue and I
>> have no idea, what is going on. I am using these devices in qemu.
>>
>> Ceph version 12.2.2
>>
>> Infrastructure:
>>
>> 3 x Ceph-mon
>>
>> 11 x Ceph-osd
>>
>> Ceph-osd has 22x1TB Samsung SSD 850 EVO 1TB
>>
>> 96GB RAM
>>
>> 2x E5-2650 v4
>>
>> 4x10G Network (2 seperate bounds for cluster and public) with MTU 9000
>>
>>
>> I had tested it with rados bench:
>>
>> # rados bench -p rbdbench 30 write -t 1
>>
>> Total time run: 30.055677
>> Total writes made:  1199
>> Write size: 4194304
>> Object size:4194304
>> Bandwidth (MB/sec): 159.571
>> Stddev Bandwidth:   6.83601
>> Max bandwidth (MB/sec): 168
>> Min bandwidth (MB/sec): 140
>> Average IOPS:   39
>> Stddev IOPS:1
>> Max IOPS:   42
>> Min IOPS:   35
>> Average Latency(s): 0.0250656
>> Stddev Latency(s):  0.00321545
>> Max latency(s): 0.0471699
>> Min latency(s): 0.0206325
>>
>> # ceph tell osd.0 bench
>> {
>>  "bytes_written": 1073741824,
>>  "blocksize": 4194304,
>>  "bytes_per_sec": 414199397
>> }
>>
>> Testing osd directly
>>
>> # dd if=/dev/zero of=/dev/sdc bs=4M oflag=direct count=100
>> 100+0 records in
>> 100+0 records out
>> 419430400 bytes (419 MB, 400 MiB) copied, 1.0066 s, 417 MB/s
>>
>> When I do dd inside vm (bs=4M wih direct), I have result like in rados
>> bench.
>>
>> I think that the speed should be arround ~400MB/s.
>>
>> Is there any new parameters for rbd in luminous? Maybe I forgot about
>> some performance tricks? If more information needed feel free to ask.
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Running Jewel and Luminous mixed for a longer period

2018-01-02 Thread Luis Periquito
On Tue, Dec 5, 2017 at 1:20 PM, Wido den Hollander  wrote:
> Hi,
>
> I haven't tried this before but I expect it to work, but I wanted to check 
> before proceeding.
>
> I have a Ceph cluster which is running with manually formatted FileStore XFS 
> disks, Jewel, sysvinit and Ubuntu 14.04.
>
> I would like to upgrade this system to Luminous, but since I have to 
> re-install all servers and re-format all disks I'd like to move it to 
> BlueStore at the same time.
>
> This system however has 768 3TB disks and has a utilization of about 60%. You 
> can guess, it will take a long time before all the backfills complete.
>
> The idea is to take a machine down, wipe all disks, re-install it with Ubuntu 
> 16.04 and Luminous and re-format the disks with BlueStore.
>
> The OSDs get back, start to backfill and we wait.
Are you OUT'ing the OSDs or removing them altogether (ceph osd crush
remove + ceph osd rm)?

I've noticed that when you remove them completely the data movement is
much bigger.

>
> My estimation is that we can do one machine per day, but we have 48 machines 
> to do. Realistically this will take ~60 days to complete.

That seems a bit optimistic for me. But it depends on how aggressive
you are, and how busy those spindles are.

>
> Afaik running Jewel (10.2.10) mixed with Luminous (12.2.2) should work just 
> fine I wanted to check if there are any caveats I don't know about.
>
> I'll upgrade the MONs to Luminous first before starting to upgrade the OSDs. 
> Between each machine I'll wait for a HEALTH_OK before proceeding allowing the 
> MONs to trim their datastore.

You have to: As far as I've seen after upgrading one of the MONs to
Luminous, the new OSDs running Luminous refuse to start until you have
*ALL* MONs running Luminous.

>
> The question is: Does it hurt to run Jewel and Luminous mixed for ~60 days?
>
> I think it won't, but I wanted to double-check.

I thought the same. I was running 10.2.3 and doing about the same to
upgrade to 10.2.7, so keeping Jewel. The process was pretty much the
same, but I had to pause for a month half way through (because of
unrelated issues) and every so often the cluster would just stop. At
least one of the OSDs would stop responding and pile up slow
requests, even though it was idle. It was random OSDs, and it happened
both on HDD and SSD (this is a cache-tiered S3 storage cluster) and on
either version. I tried injectargs but got no useful output - it just
looked as if the OSD was idle. Restart the OSD and it would spring back
to life...

So not sure if you get similar issues, but I'm now avoiding mixed
versions as much as I can.
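
A quick way to keep an eye on what is actually running during the long
mixed period (a sketch; the summary form needs Luminous mons):

ceph versions            # per-daemon release summary (Luminous mons onwards)
ceph tell osd.* version  # per-OSD check that also works against Jewel OSDs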

>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Open Compute (OCP) servers for Ceph

2017-12-22 Thread Luis Periquito
Hi Wido,

what are you trying to optimise? Space? Power? Are you tied to OCP?

I remember Ciara had some interesting designs like this
http://www.ciaratech.com/product.php?id_prod=539=en_cat1=1_cat2=67
though I don't believe they are OCP.

I also had a look and supermicro has a few that may fill your
requirements 
(https://www.supermicro.nl/products/system/1U/6019/SSG-6019P-ACR12L.cfm)



On Fri, Dec 22, 2017 at 1:40 PM, Dan van der Ster  wrote:
> Hi Wido,
>
> We have used a few racks of Wiwynn OCP servers in a Ceph cluster for a
> couple of years.
> The machines are dual Xeon [1] and use some of those 2U 30-disk "Knox"
> enclosures.
>
> Other than that, I have nothing particularly interesting to say about
> these. Our data centre procurement team have also moved on with
> standard racked equipment, so I suppose they also found these
> uninteresting.
>
> Cheers, Dan
>
> [1] http://www.wiwynn.com/english/product/type/details/32?ptype=28
>
>
> On Fri, Dec 22, 2017 at 12:04 PM, Wido den Hollander  wrote:
>> Hi,
>>
>> I'm looking at OCP [0] servers for Ceph and I'm not able to find yet what
>> I'm looking for.
>>
>> First of all, the geek in me loves OCP and the design :-) Now I'm trying to
>> match it with Ceph.
>>
>> Looking at wiwynn [1] they offer a few OCP servers:
>>
>> - 3 nodes in 2U with a single 3.5" disk [2]
>> - 2U node with 30 disks and a Atom C2000 [3]
>> - 2U JDOD with 12G SAS [4]
>>
>> For Ceph I would want:
>>
>> - 1U node / 12x 3.5" / Fast CPU
>> - 1U node / 24x 2.5" / Fast CPU
>>
>> They don't seem to exist yet when looking for OCP server.
>>
>> Although 30 drives is fine, it would become a very large Ceph cluster when
>> building with something like that.
>>
>> Has anybody build Ceph clusters yet using OCP hardaware? If so, which vendor
>> and what are your experiences?
>>
>> Thanks!
>>
>> Wido
>>
>> [0]: http://www.opencompute.org/
>> [1]: http://www.wiwynn.com/
>> [2]: http://www.wiwynn.com/english/product/type/details/65?ptype=28
>> [3]: http://www.wiwynn.com/english/product/type/details/33?ptype=28
>> [4]: http://www.wiwynn.com/english/product/type/details/43?ptype=28
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] determining the source of io in the cluster

2017-12-18 Thread Luis Periquito
As that is a small cluster I hope you still don't have a lot of
instances running...

You can add "admin socket" to the client configuration part and then
read performance information via that. IIRC that prints cumulative
totals for bytes and ops, but it should be simple to read it twice and
calculate the difference. This will generate one socket per mounted
volume (hence my hope that you don't have many).
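
A sketch of what I mean (the socket path uses the usual metavariables;
the client name in the example is made up):

# ceph.conf on the hypervisors
[client]
    admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok

# then, per socket:
ceph --admin-daemon /var/run/ceph/ceph-client.cinder.<pid>.<cctid>.asok perf dump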

On Mon, Dec 18, 2017 at 4:36 PM, Josef Zelenka
 wrote:
> Hi everyone,
>
> we have recently deployed a Luminous(12.2.1) cluster on Ubuntu - three osd
> nodes and three monitors, every osd has 3x 2TB SSD + an NVMe drive for a
> blockdb. We use it as a backend for our Openstack cluster, so we store
> volumes there. IN the last few days, the read op/s rose to around 10k-25k
> constantly(it fluctuates between those two) and it doesn't seem to go down.
> I can see, that the io/read ops come from the pool where we store VM
> volumes, but i can't source this issue to a particular volume. Is that even
> possible? Any experiences with debugging this? Any info or advice is greatly
> appreciated.
>
> Thanks
>
> Josef Zelenka
>
> Cloudevelops
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-28 Thread Luis Periquito
There are a few things I don't like about your machines... If you want
latency/IOPS (as you seemingly do) you really want the highest frequency
CPUs, even over number of cores. These are not too bad, but not great
either.

Also, you have 2 CPUs, meaning NUMA. Have you pinned OSDs to NUMA nodes?
Ideally each OSD is pinned to the same NUMA node its NVMe device is
connected to. Each NVMe device runs on PCIe lanes provided by one of the
CPUs...

What versions of TCMalloc (or jemalloc) are you running? Have you tuned
them to have a bigger cache?

These are from what I've learned using filestore - I've yet to run full
tests on bluestore - but they should still apply...
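
A sketch of the kind of pinning I mean (device name, OSD id and CPU
ranges are examples only):

cat /sys/class/nvme/nvme0/device/numa_node    # which node the NVMe hangs off, e.g. 0
lscpu | grep 'NUMA node0'                     # which CPUs belong to that node
systemctl edit ceph-osd@0                     # add an override such as:
#   [Service]
#   CPUAffinity=0-9 20-29
systemctl restart ceph-osd@0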

On Mon, Nov 27, 2017 at 5:10 PM, German Anders  wrote:

> Hi Nick,
>
> yeah, we are using the same nvme disk with an additional partition to use
> as journal/wal. We double check the c-state and it was not configure to use
> c1, so we change that on all the osd nodes and mon nodes and we're going to
> make some new tests, and see how it goes. I'll get back as soon as get got
> those tests running.
>
> Thanks a lot,
>
> Best,
>
>
> *German*
>
> 2017-11-27 12:16 GMT-03:00 Nick Fisk :
>
>> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
>> Of *German Anders
>> *Sent:* 27 November 2017 14:44
>> *To:* Maged Mokhtar 
>> *Cc:* ceph-users 
>> *Subject:* Re: [ceph-users] ceph all-nvme mysql performance tuning
>>
>>
>>
>> Hi Maged,
>>
>>
>>
>> Thanks a lot for the response. We try with different number of threads
>> and we're getting almost the same kind of difference between the storage
>> types. Going to try with different rbd stripe size, object size values and
>> see if we get more competitive numbers. Will get back with more tests and
>> param changes to see if we get better :)
>>
>>
>>
>>
>>
>> Just to echo a couple of comments. Ceph will always struggle to match the
>> performance of a traditional array for mainly 2 reasons.
>>
>>
>>
>>1. You are replacing some sort of dual ported SAS or internally RDMA
>>connected device with a network for Ceph replication traffic. This will
>>instantly have a large impact on write latency
>>2. Ceph locks at the PG level and a PG will most likely cover at
>>least one 4MB object, so lots of small accesses to the same blocks (on a
>>block device) will wait on each other and go effectively at a single
>>threaded rate.
>>
>>
>>
>> The best thing you can do to mitigate these, is to run the fastest
>> journal/WAL devices you can, fastest network connections (ie 25Gb/s) and
>> run your CPU’s at max C and P states.
>>
>>
>>
>> You stated that you are running the performance profile on the CPU’s.
>> Could you also just double check that the C-states are being held at C1(e)?
>> There are a few utilities that can show this in realtime.
>>
>>
>>
>> Other than that, although there could be some minor tweaks, you are
>> probably nearing the limit of what you can hope to achieve.
>>
>>
>>
>> Nick
>>
>>
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Best,
>>
>>
>> *German*
>>
>>
>>
>> 2017-11-27 11:36 GMT-03:00 Maged Mokhtar :
>>
>> On 2017-11-27 15:02, German Anders wrote:
>>
>> Hi All,
>>
>>
>>
>> I've a performance question, we recently install a brand new Ceph cluster
>> with all-nvme disks, using ceph version 12.2.0 with bluestore configured.
>> The back-end of the cluster is using a bond IPoIB (active/passive) , and
>> for the front-end we are using a bonding config with active/active (20GbE)
>> to communicate with the clients.
>>
>>
>>
>> The cluster configuration is the following:
>>
>>
>>
>> *MON Nodes:*
>>
>> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>>
>> 3x 1U servers:
>>
>>   2x Intel Xeon E5-2630v4 @2.2Ghz
>>
>>   128G RAM
>>
>>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>>
>>   2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>>
>>
>>
>> *OSD Nodes:*
>>
>> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>>
>> 4x 2U servers:
>>
>>   2x Intel Xeon E5-2640v4 @2.4Ghz
>>
>>   128G RAM
>>
>>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>>
>>   1x Ethernet Controller 10G X550T
>>
>>   1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>>
>>   12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
>>
>>   1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
>>
>>
>>
>>
>>
>> Here's the tree:
>>
>>
>>
>> ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
>>
>> -7   48.0 root root
>>
>> -5   24.0 rack rack1
>>
>> -1   12.0 node cpn01
>>
>>  0  nvme  1.0 osd.0  up  1.0 1.0
>>
>>  1  nvme  1.0 osd.1  up  1.0 1.0
>>
>>  2  nvme  1.0 osd.2  up  1.0 1.0
>>
>>  3  nvme  1.0 osd.3  up  1.0 1.0
>>
>>  4  nvme  1.0 osd.4  up  1.0 1.0
>>
>>  5  nvme  1.0 osd.5  up  1.0 1.0

Re: [ceph-users] Ceph cache pool full

2017-10-06 Thread Luis Periquito
Not looking at anything else: you didn't set target_max_bytes or
target_max_objects, so the tier never knows when to start flushing...
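
Something along these lines (a sketch; the values are examples only -
size them to the NVMe capacity you actually have):

ceph osd pool set cephfs_cache target_max_bytes 1099511627776     # ~1 TiB
ceph osd pool set cephfs_cache target_max_objects 1000000
ceph osd pool set cephfs_cache cache_target_dirty_ratio 0.4
ceph osd pool set cephfs_cache cache_target_full_ratio 0.8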

On Fri, Oct 6, 2017 at 4:49 PM, Shawfeng Dong  wrote:
> Dear all,
>
> Thanks a lot for the very insightful comments/suggestions!
>
> There are 3 OSD servers in our pilot Ceph cluster, each with 2x 1TB SSDs
> (boot disks), 12x 8TB SATA HDDs and 2x 1.2TB NVMe SSDs. We use the bluestore
> backend, with the first NVMe as the WAL and DB devices for OSDs on the HDDs.
> And we try to create a cache tier out of the second NVMes.
>
> Here are the outputs of the commands suggested by David:
>
> 1) # ceph df
> GLOBAL:
> SIZE AVAIL RAW USED %RAW USED
> 265T  262T2847G  1.05
> POOLS:
> NAMEID USED  %USED  MAX AVAIL OBJECTS
> cephfs_data 1  0  0  248T   0
> cephfs_metadata 2  8515k  0  248T  24
> cephfs_cache3  1381G 100.00 0  355385
>
> 2) # ceph osd df
>  0   hdd 7.27829  1.0 7452G 2076M  7450G  0.03  0.03 174
>  1   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 169
>  2   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 173
>  3   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 159
>  4   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 173
>  5   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 162
>  6   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 149
>  7   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 179
>  8   hdd 7.27829  1.0 7452G 2076M  7450G  0.03  0.03 163
>  9   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 194
> 10   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 185
> 11   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 168
> 36  nvme 1.09149  1.0 1117G  855G   262G 76.53 73.01  79
> 12   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 180
> 13   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 168
> 14   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 178
> 15   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 170
> 16   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 149
> 17   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 203
> 18   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 173
> 19   hdd 7.27829  1.0 7452G 2076M  7450G  0.03  0.03 158
> 20   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 154
> 21   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 160
> 22   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 167
> 23   hdd 7.27829  1.0 7452G 2076M  7450G  0.03  0.03 188
> 37  nvme 1.09149  1.0 1117G 1061G 57214M 95.00 90.63  98
> 24   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 187
> 25   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 200
> 26   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 147
> 27   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 171
> 28   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 162
> 29   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 152
> 30   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 174
> 31   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 176
> 32   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 182
> 33   hdd 7.27829  1.0 7452G 2072M  7450G  0.03  0.03 155
> 34   hdd 7.27829  1.0 7452G 2076M  7450G  0.03  0.03 166
> 35   hdd 7.27829  1.0 7452G 2076M  7450G  0.03  0.03 176
> 38  nvme 1.09149  1.0 1117G  857G   260G 76.71 73.18  79
> TOTAL  265T 2847G   262T  1.05
> MIN/MAX VAR: 0.03/90.63  STDDEV: 22.81
>
> 3) # ceph osd tree
> -1   265.29291 root default
> -388.43097 host pulpo-osd01
>  0   hdd   7.27829 osd.0up  1.0 1.0
>  1   hdd   7.27829 osd.1up  1.0 1.0
>  2   hdd   7.27829 osd.2up  1.0 1.0
>  3   hdd   7.27829 osd.3up  1.0 1.0
>  4   hdd   7.27829 osd.4up  1.0 1.0
>  5   hdd   7.27829 osd.5up  1.0 1.0
>  6   hdd   7.27829 osd.6up  1.0 1.0
>  7   hdd   7.27829 osd.7up  1.0 1.0
>  8   hdd   7.27829 osd.8up  1.0 1.0
>  9   hdd   7.27829 osd.9up  1.0 1.0
> 10   hdd   7.27829 osd.10   up  1.0 1.0
> 11   hdd   7.27829 osd.11   up  1.0 1.0
> 36  nvme   1.09149 osd.36   up  1.0 1.0
> -588.43097 host pulpo-osd02
> 12   hdd   7.27829 osd.12   up  1.0 1.0
> 13   hdd   7.27829 osd.13   up  1.0 1.0
> 14   hdd   7.27829 osd.14   up  1.0 1.0
> 15   hdd   7.27829 osd.15   up  1.0 1.0
> 16   hdd   7.27829 osd.16   up  1.0 1.0
> 17   hdd   7.27829 osd.17   up  1.0 

Re: [ceph-users] osd create returns duplicate ID's

2017-09-29 Thread Luis Periquito
On Fri, Sep 29, 2017 at 9:44 AM, Adrian Saul
 wrote:
>
> Do you mean that after you delete and remove the crush and auth entries for 
> the OSD, when you go to create another OSD later it will re-use the previous 
> OSD ID that you have destroyed in the past?
>

The issue is that it has been giving out the same ID more than once. As
this is a rebuild it'll be old IDs, but that's not an issue.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] osd create returns duplicate ID's

2017-09-29 Thread Luis Periquito
Hi all,

I use puppet to deploy and manage my clusters.

Recently, while removing old hardware and adding new, I've noticed
that "ceph osd create" sometimes returns repeated IDs. Usually it's on
the same server, but yesterday I saw it across different servers.

I was expecting the OSD IDs to be unique. When the duplicates show up
on the same server puppet starts spewing errors - which is desirable -
but when they land on different servers it broke those OSDs in Ceph. As
they hadn't backfilled any full PGs I just wiped, removed and started anew.

As for the process itself: the OSDs are marked out and removed from
crush, and when empty they are auth del'd and osd rm'd. After rebuilding
the server, puppet runs osd create and uses the generated ID for the
crush move and mkfs.
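
For context, this is roughly the sequence puppet automates (osd.42 and the host name are made-up examples, and this is only a skeleton):

ceph osd out 42
# wait for backfill to finish, then remove it completely:
ceph osd crush remove osd.42
ceph auth del osd.42
ceph osd rm 42

# later, on the rebuilt server:
ID=$(ceph osd create)                            # should hand out an unused ID
ceph-osd -i "$ID" --mkfs --mkkey                 # prepare the data dir
ceph osd crush add osd.$ID 1.0 host=<hostname>   # place it in the crush map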

Unfortunately I haven't been able to reproduce in isolation, and being
a production cluster logging is tuned way down.

This has happened in several different clusters, but they are all
running 10.2.7.

Any ideas?

thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] erasure code profile

2017-09-22 Thread Luis Periquito
On Fri, Sep 22, 2017 at 9:49 AM, Dietmar Rieder
 wrote:
> Hmm...
>
> not sure what happens if you lose 2 disks in 2 different rooms, isn't
> there a risk that you lose data ?

yes, and that's why I don't really like the profile...
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] erasure code profile

2017-09-22 Thread Luis Periquito
Hi all,

I've been trying to work out what the best erasure code profile would be,
but I don't really like the one I came up with...

I have 3 rooms that are part of the same cluster, and I need to design
it so we can lose any one of the 3.

As this is a backup cluster I was thinking of using a k=2 m=1 code with
ruleset-failure-domain=room, as the OSD tree is already built accordingly.
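
Something like this is what I had in mind (profile, pool name and pg counts are made up; on newer releases the option is spelled crush-failure-domain instead):

ceph osd erasure-code-profile set backup_k2m1 k=2 m=1 ruleset-failure-domain=room
ceph osd erasure-code-profile get backup_k2m1
ceph osd pool create backup_data 1024 1024 erasure backup_k2m1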

Can anyone think of a better profile?

thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] after reboot node appear outside the root root tree

2017-09-13 Thread Luis Periquito
What's your "osd crush update on start" option?

Further information can be found at
http://docs.ceph.com/docs/master/rados/operations/crush-map/
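
If you want OSDs to keep a manually-set location across restarts, the relevant bit of ceph.conf on the OSD nodes looks something like this (the crush location line is only an illustrative alternative, using your custom root/rack/node types):

[osd]
    osd crush update on start = false
    # or keep the automatic update but tell each OSD where it belongs, e.g.:
    # osd crush location = root=root rack=rack2 node=cpn02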

On Wed, Sep 13, 2017 at 4:38 PM, German Anders  wrote:
> Hi cephers,
>
> I'm having an issue with a newly created cluster 12.2.0
> (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc). Basically when I
> reboot one of the nodes, and when it come back, it come outside of the root
> type on the tree:
>
> root@cpm01:~# ceph osd tree
> ID  CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
> -15   12.0 root default
>  36  nvme  1.0 osd.36 up  1.0 1.0
>  37  nvme  1.0 osd.37 up  1.0 1.0
>  38  nvme  1.0 osd.38 up  1.0 1.0
>  39  nvme  1.0 osd.39 up  1.0 1.0
>  40  nvme  1.0 osd.40 up  1.0 1.0
>  41  nvme  1.0 osd.41 up  1.0 1.0
>  42  nvme  1.0 osd.42 up  1.0 1.0
>  43  nvme  1.0 osd.43 up  1.0 1.0
>  44  nvme  1.0 osd.44 up  1.0 1.0
>  45  nvme  1.0 osd.45 up  1.0 1.0
>  46  nvme  1.0 osd.46 up  1.0 1.0
>  47  nvme  1.0 osd.47 up  1.0 1.0
>  -7   36.0 root root
>  -5   24.0 rack rack1
>  -1   12.0 node cpn01
>   01.0 osd.0  up  1.0 1.0
>   11.0 osd.1  up  1.0 1.0
>   21.0 osd.2  up  1.0 1.0
>   31.0 osd.3  up  1.0 1.0
>   41.0 osd.4  up  1.0 1.0
>   51.0 osd.5  up  1.0 1.0
>   61.0 osd.6  up  1.0 1.0
>   71.0 osd.7  up  1.0 1.0
>   81.0 osd.8  up  1.0 1.0
>   91.0 osd.9  up  1.0 1.0
>  101.0 osd.10 up  1.0 1.0
>  111.0 osd.11 up  1.0 1.0
>  -3   12.0 node cpn03
>  241.0 osd.24 up  1.0 1.0
>  251.0 osd.25 up  1.0 1.0
>  261.0 osd.26 up  1.0 1.0
>  271.0 osd.27 up  1.0 1.0
>  281.0 osd.28 up  1.0 1.0
>  291.0 osd.29 up  1.0 1.0
>  301.0 osd.30 up  1.0 1.0
>  311.0 osd.31 up  1.0 1.0
>  321.0 osd.32 up  1.0 1.0
>  331.0 osd.33 up  1.0 1.0
>  341.0 osd.34 up  1.0 1.0
>  351.0 osd.35 up  1.0 1.0
>  -6   12.0 rack rack2
>  -2   12.0 node cpn02
>  121.0 osd.12 up  1.0 1.0
>  131.0 osd.13 up  1.0 1.0
>  141.0 osd.14 up  1.0 1.0
>  151.0 osd.15 up  1.0 1.0
>  161.0 osd.16 up  1.0 1.0
>  171.0 osd.17 up  1.0 1.0
>  181.0 osd.18 up  1.0 1.0
>  191.0 osd.19 up  1.0 1.0
>  201.0 osd.20 up  1.0 1.0
>  211.0 osd.21 up  1.0 1.0
>  221.0 osd.22 up  1.0 1.0
>  231.0 osd.23 up  1.0 1.0
>  -4  0 node cpn04
>
> Any ideas why this happens? And how can I fix it? It's supposed to be inside
> rack2
>
> Thanks in advance,
>
> Best,
>
> German
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster in error state (full) with raw usage 32% of total capacity

2017-08-16 Thread Luis Periquito
Not going over the obvious - that crush map just doesn't look correct
or even sane, and the policy itself doesn't sound very sane either -
but I'm sure you understand the caveats and issues it may present...

What's most probably happening is that one (or several) pools are using
those same OSDs and the requests to those PGs are also getting blocked
because of the full disk. This means that some (or all) of the
remaining OSDs are waiting for that one to complete some IO, and
whilst those OSDs have IOs waiting to complete they also stop
responding to the IO that was purely local.

Adding more insanity to your architecture, what should work (the
keyword here being should, as I have never tested, seen or even thought
of such a scenario) would be to have some OSDs for local storage and
other OSDs for distributed storage.

As for the architecture itself, and not knowing much about your use-case,
it may make sense to keep local storage in something other than Ceph -
you're not using any of the facilities it provides and you're paying
some overheads - or to use a different strategy for it. IIRC there was
a way to hint data locality to Ceph...


On Wed, Aug 16, 2017 at 8:39 AM, Mandar Naik  wrote:
> Hi,
> I just wanted to give a friendly reminder for this issue. I would appreciate
> if someone
> can help me out here. Also, please do let me know in case some more
> information is
> required here.
>
> On Thu, Aug 10, 2017 at 2:41 PM, Mandar Naik  wrote:
>>
>> Hi Peter,
>> Thanks a lot for the reply. Please find 'ceph osd df' output here -
>>
>> # ceph osd df
>> ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR  PGS
>>  2 0.04399  1.0 46056M 35576k 46021M  0.08 0.00   0
>>  1 0.04399  1.0 46056M 40148k 46017M  0.09 0.00 384
>>  0 0.04399  1.0 46056M 43851M  2205M 95.21 2.99 192
>>  0 0.04399  1.0 46056M 43851M  2205M 95.21 2.99 192
>>  1 0.04399  1.0 46056M 40148k 46017M  0.09 0.00 384
>>  2 0.04399  1.0 46056M 35576k 46021M  0.08 0.00   0
>>   TOTAL   134G 43925M 94244M 31.79
>> MIN/MAX VAR: 0.00/2.99  STDDEV: 44.85
>>
>> I setup this cluster by manipulating CRUSH map using CLI. I had a default
>> root
>> before but it gave me an impression that since every rack is under a
>> single
>> root bucket its marking entire cluster down in case one of the osd is 95%
>> full. So I
>> removed root bucket but that still did not help me. No crush rule is
>> referring
>> to root bucket in the above mentioned case.
>>
>> Yes, I added one osd under two racks by linking host bucket from one rack
>> to another
>> using following command -
>>
>> "osd crush link   [...] :  link existing entry for
>>  under location "
>>
>>
>> On Thu, Aug 10, 2017 at 1:40 PM, Peter Maloney
>>  wrote:
>>>
>>> I think a `ceph osd df` would be useful.
>>>
>>> And how did you set up such a cluster? I don't see a root, and you have
>>> each osd in there more than once...is that even possible?
>>>
>>>
>>>
>>> On 08/10/17 08:46, Mandar Naik wrote:
>>>
>>> Hi,
>>>
>>> I am evaluating ceph cluster for a solution where ceph could be used for
>>> provisioning
>>>
>>> pools which could be either stored local to a node or replicated across a
>>> cluster.  This
>>>
>>> way ceph could be used as single point of solution for writing both local
>>> as well as replicated
>>>
>>> data. Local storage helps avoid possible storage cost that comes with
>>> replication factor of more
>>>
>>> than one and also provide availability as long as the data host is alive.
>>>
>>>
>>> So I tried an experiment with Ceph cluster where there is one crush rule
>>> which replicates data across
>>>
>>> nodes and other one only points to a crush bucket that has local ceph
>>> osd. Cluster configuration
>>>
>>> is pasted below.
>>>
>>>
>>> Here I observed that if one of the disk is full (95%) entire cluster goes
>>> into error state and stops
>>>
>>> accepting new writes from/to other nodes. So ceph cluster became unusable
>>> even though it’s only
>>>
>>> 32% full. The writes are blocked even for pools which are not touching
>>> the full osd.
>>>
>>>
>>> I have tried playing around crush hierarchy but it did not help. So is it
>>> possible to store data in the above
>>>
>>> manner with Ceph ? If yes could we get cluster state in usable state
>>> after one of the node is full ?
>>>
>>>
>>>
>>> # ceph df
>>>
>>>
>>> GLOBAL:
>>>
>>>SIZE AVAIL  RAW USED %RAW USED
>>>
>>>134G 94247M   43922M 31.79
>>>
>>>
>>> # ceph –s
>>>
>>>
>>>cluster ba658a02-757d-4e3c-7fb3-dc4bf944322f
>>>
>>> health HEALTH_ERR
>>>
>>>1 full osd(s)
>>>
>>>full,sortbitwise,require_jewel_osds flag(s) set
>>>
>>> monmap e3: 3 mons at
>>> {ip-10-0-9-122=10.0.9.122:6789/0,ip-10-0-9-146=10.0.9.146:6789/0,ip-10-0-9-210=10.0.9.210:6789/0}
>>>
>>>election epoch 14, quorum 0,1,2
>>> ip-10-0-9-122,ip-10-0-9-146,ip-10-0-9-210
>>>
>>> osdmap e93: 3 osds: 3 up, 3 

Re: [ceph-users] hammer -> jewel 10.2.8 upgrade and setting sortbitwise

2017-07-10 Thread Luis Periquito
Hi Dan,

I've enabled it in a couple of big-ish clusters and had the same
experience - a few seconds disruption caused by a peering process
being triggered, like any other crushmap update does. Can't remember
if it triggered data movement, but I have a feeling it did...



On Mon, Jul 10, 2017 at 3:17 PM, Dan van der Ster  wrote:
> Hi all,
>
> With 10.2.8, ceph will now warn if you didn't yet set sortbitwise.
>
> I just updated a test cluster, saw that warning, then did the necessary
>   ceph osd set sortbitwise
>
> I noticed a short re-peering which took around 10s on this small
> cluster with very little data.
>
> Has anyone done this already on a large cluster with lots of objects?
> It would be nice to hear that it isn't disruptive before running it on
> our big production instances.
>
> Cheers, Dan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Changing SSD Landscape

2017-06-08 Thread Luis Periquito
Looking at that anandtech comparison it seems the Micron is usually
worse than the P3700.

This week I asked for a few nodes with P3700 400G and got an answer
that they're end of sale, and the supplier wouldn't be able to get them
anywhere in the world. Has anyone got a good replacement for these?

The official replacement is the P4600, but those start at 2T and come
with the appropriate price rise (slightly cheaper per GB than the
P3700), and they haven't been officially released yet.

The P4800X (Optane) costs about the same as the P4600 and is small...

Not really sure about the Micron 9100, and couldn't find anything
interesting/comparable in the Samsung range...


On Wed, May 17, 2017 at 5:03 PM, Reed Dier <reed.d...@focusvq.com> wrote:
> Agreed, the issue I have seen is that the P4800X (Optane) is demonstrably
> more expensive than the P3700 for a roughly equivalent amount of storage
> space (400G v 375G).
>
> However, the P4800X is perfectly suited to a Ceph environment, with 30 DWPD,
> or 12.3 PBW. And on top of that, it seems to generally outperform the P3700
> in terms of latency, iops, and raw throughput, especially at greater queue
> depths. The biggest thing I took away was performance consistency.
>
> Anandtech did a good comparison against the P3700 and the Micron 9100 MAX,
> ironically the 9100 MAX has been the model I have been looking at to replace
> P3700’s in future OSD nodes.
>
> http://www.anandtech.com/show/11209/intel-optane-ssd-dc-p4800x-review-a-deep-dive-into-3d-xpoint-enterprise-performance/
>
> There are also the DC P4500 and P4600 models in the pipeline from Intel,
> also utilizing 3D NAND, however I have been told that they will not be
> shipping in volume until mid to late Q3.
> And as was stated earlier, these are all starting at much larger storage
> sizes, 1-4T in size, and with respective endurance ratings of 1.79 PBW and
> 10.49 PBW for endurance on the 2TB versions of each of those. Which should
> equal about .5 and ~3 DWPD for most workloads.
>
> At least the Micron 5100 MAX are finally shipping in volume to offer a
> replacement to Intel S3610, though no good replacement for the S3710 yet
> that I’ve seen on the endurance part.
>
> Reed
>
> On May 17, 2017, at 5:44 AM, Luis Periquito <periqu...@gmail.com> wrote:
>
> Anyway, in a couple months we'll start testing the Optane drives. They
> are small and perhaps ideal journals, or?
>
> The problem with optanes is price: from what I've seen they cost 2x or
> 3x as much as the P3700...
> But at least from what I've read they do look really great...
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Changing SSD Landscape

2017-05-17 Thread Luis Periquito
>> Anyway, in a couple months we'll start testing the Optane drives. They
>> are small and perhaps ideal journals, or?
>>
The problem with optanes is price: from what I've seen they cost 2x or
3x as much as the P3700...
But at least from what I've read they do look really great...
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Increase PG or reweight OSDs?

2017-05-03 Thread Luis Periquito
TL;DR: add the OSDs and then split the PGs

They are different commands for different situations...

Changing the weight is for bringing a bigger number of nodes/devices
into service. Depending on the size of the cluster, the size of the
devices, how busy it is and by how much you're growing it, the impact
will vary.

Usually people add the devices and slowly increase the OSDs' weight to
gradually increase the usage and data on them. There are some ways to
improve performance and/or reduce the impact of that operation, like
the number of allowed concurrent backfills and the op/backfill priority
settings.

The other one will take *all* of the objects in the existing PGs and
redistribute them into a new set of PGs.

The number of PGs doesn't change with the number of OSDs, so the more
OSDs you have when doing the splitting the better - the amount of work
is the same, so the more workers there are the less each has to do.

If the impact on client IO is important - for example if the cluster is
busy - then you can additionally set the noscrub/nodeep-scrub flags.
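
A rough sketch of the whole sequence, with the throttling knobs mentioned above (values, pool and OSD names are placeholders):

# throttle recovery/backfill and stop scrubs while the cluster reshuffles
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'
ceph osd set noscrub
ceph osd set nodeep-scrub

# bring the new OSDs in first (step the reweight as needed)
ceph osd crush reweight osd.<id> <target-weight>

# once the cluster has settled, split the PGs
ceph osd pool set <pool> pg_num <new-num>
ceph osd pool set <pool> pgp_num <new-num>

# and undo the flags afterwards
ceph osd unset noscrub
ceph osd unset nodeep-scrub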

On Tue, May 2, 2017 at 7:16 AM, M Ranga Swami Reddy
 wrote:
> Hello,
> I have added 5 new Ceph OSD nodes to my ceph cluster. Here, I wanted
> to increase PG/PGP numbers of pools based new OSDs count. Same time
> need to increase the newly added OSDs weight from 0 -> 1.
>
> My question is:
> Do I need to increase the PG/PGP num increase and then reweight the OSDs?
> Or
> Reweight the OSDs first and then increase the PG/PGP num. of pool(s)?
>
> Both will cause the rebalnce...but wanted to understand, which one
> should be preferable to do on running cluster.
>
> Thanks
> Swami
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel power tuning - 30% throughput performance increase

2017-05-03 Thread Luis Periquito
One of the things I've noticed in the latest (3+ years) batch of CPUs
is that they increasingly ignore the cpu frequency scaling drivers and
do what they want. More than that, interfaces like /proc/cpuinfo are
completely incorrect.

I keep checking the real frequencies using applications like "i7z",
which shows the real per-core frequency.

On the flip side, as more of this is controlled directly by the CPUs,
it should also be safer to run that way over a longer period of time.

In my testing, done on Trusty with the default 3.13 kernel, changing
the governor to performance and disabling all powersave options in the
BIOS meant circa 50% better latency (for both the SSD and HDD clusters),
with around a 10% increase in power usage.
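
For reference, this is roughly what that check/change looks like (standard cpufreq sysfs paths):

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # what the OS thinks it is doing
i7z                                                         # real per-core frequencies

# force the performance governor on every core
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done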

On Wed, May 3, 2017 at 8:43 AM, Dan van der Ster  wrote:
> Hi Blair,
>
> We use cpu_dma_latency=1, because it was in the latency-performance profile.
> And indeed by setting cpu_dma_latency=0 on one of our OSD servers,
> powertop now shows the package as 100% in turbo mode.
>
> So I suppose we'll pay for this performance boost in energy.
> But more importantly, can the CPU survive being in turbo 100% of the time?
>
> -- Dan
>
>
>
> On Wed, May 3, 2017 at 9:13 AM, Blair Bethwaite
>  wrote:
>> Hi all,
>>
>> We recently noticed that despite having BIOS power profiles set to
>> performance on our RHEL7 Dell R720 Ceph OSD nodes, that CPU frequencies
>> never seemed to be getting into the top of the range, and in fact spent a
>> lot of time in low C-states despite that BIOS option supposedly disabling
>> C-states.
>>
>> After some investigation this C-state issue seems to be relatively common,
>> apparently the BIOS setting is more of a config option that the OS can
>> choose to ignore. You can check this by examining
>> /sys/module/intel_idle/parameters/max_cstate - if this is >1 and you *think*
>> C-states are disabled then your system is messing with you.
>>
>> Because the contemporary Intel power management driver
>> (https://www.kernel.org/doc/Documentation/cpu-freq/intel-pstate.txt) now
>> limits the proliferation of OS level CPU power profiles/governors, the only
>> way to force top frequencies is to either set kernel boot command line
>> options or use the /dev/cpu_dma_latency, aka pmqos, interface.
>>
>> We did the latter using the pmqos_static.py, which was previously part of
>> the RHEL6 tuned latency-performance profile, but seems to have been dropped
>> in RHEL7 (don't yet know why), and in any case the default tuned profile is
>> throughput-performance (which does not change cpu_dma_latency). You can find
>> the pmqos-static.py script here
>> https://github.com/NetSys/NetBricks/blob/master/scripts/tuning/pmqos-static.py.
>>
>> After setting `./pmqos-static.py cpu_dma_latency=0` across our OSD nodes we
>> saw a conservative 30% increase in backfill and recovery throughput - now
>> when our main RBD pool of 900+ OSDs is backfilling we expect to see ~22GB/s,
>> previously that was ~15GB/s.
>>
>> We have just got around to opening a case with Red Hat regarding this as at
>> minimum Ceph should probably be actively using the pmqos interface and tuned
>> should be setting this with recommendations for the latency-performance
>> profile in the RHCS install guide. We have done no characterisation of it on
>> Ubuntu yet, however anecdotally it looks like it has similar issues on the
>> same hardware.
>>
>> Merry xmas.
>>
>> Cheers,
>> Blair
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw leaking objects

2017-04-05 Thread Luis Periquito
To try and make my life easier: do you already have such a script written?

Also, has the source of the orphans been found, or will they continue
to happen after the upgrade to the newer version?

thanks,

On Mon, Apr 3, 2017 at 4:59 PM, Yehuda Sadeh-Weinraub <yeh...@redhat.com> wrote:
> On Mon, Apr 3, 2017 at 1:32 AM, Luis Periquito <periqu...@gmail.com> wrote:
>>> Right. The tool isn't removing objects (yet), because we wanted to
>>> have more confidence in the tool before having it automatically
>>> deleting all the found objects. The process currently is to manually
>>> move these objects to a different backup pool (via rados cp, rados
>>> rm), then when you're confident that no needed data was lost in the
>>> process remove the backup pool. In the future we'll automate that.
>>
>> My problem exactly. I don't have enough confidence in myself to just
>> delete a bunch of random objects... Any idea as to when will be
>> available such tool?
>
> Why random? The objects are the ones that the orphan tool pointed at.
> And the idea is to move these objects to a safe place before removal,
> so that even if the wrong objects are removed, they can be recovered.
> There is no current ETA for the tool, but the tool will probably have
> the same two steps as reflected here: 1. backup, 2. remove backup.
>
> Yehuda
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw leaking objects

2017-04-03 Thread Luis Periquito
> Right. The tool isn't removing objects (yet), because we wanted to
> have more confidence in the tool before having it automatically
> deleting all the found objects. The process currently is to manually
> move these objects to a different backup pool (via rados cp, rados
> rm), then when you're confident that no needed data was lost in the
> process remove the backup pool. In the future we'll automate that.

My problem exactly. I don't have enough confidence in myself to just
delete a bunch of random objects... Any idea as to when such a tool
will be available?
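
For anyone following the thread, the manual backup step described above would look something like this - a sketch only; I'm assuming a pre-created backup pool and that rados cp accepts --target-pool on this release, so check your rados(8) manpage first:

# orphans.txt = the object names reported by "radosgw-admin orphans find"
while read -r obj; do
    rados -p default.rgw.buckets.data cp "$obj" "$obj" --target-pool=default.rgw.buckets.backup
    rados -p default.rgw.buckets.data rm "$obj"
done < orphans.txt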
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw leaking objects

2017-03-30 Thread Luis Periquito
I have a cluster that has been leaking objects in radosgw and I've
upgraded it to 10.2.6.

After that I ran
radosgw-admin orphans find --pool=default.rgw.buckets.data --job-id=orphans

which found a bunch of objects. And ran
radosgw-admin orphans finish --pool=default.rgw.buckets.data --job-id=orphans

which returned quickly, but no real space was recovered. Running the
orphans find again still reported quite a few leaked objects.

Shouldn't this be working, and finding/deleting all the leaked objects?

If it helps, this cluster has a cache tiering solution... I've run a
cache-flush-evict-all on the ct pool, but no changes...

Should I open a bug for this? http://tracker.ceph.com/issues/18331 and
http://tracker.ceph.com/issues/18258 indicate this should already be
fixed in jewel 10.2.6...

thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] failed to encode map with expected crc

2017-03-17 Thread Luis Periquito
Hi All,

I've just run an upgrade on our test cluster, going from 10.2.3 to
10.2.6, and got the wonderful "failed to encode map with expected crc"
message.

Wasn't this supposed to happen only when going from pre-jewel to jewel?

should I be looking at something else?

thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] property upgrade Ceph from 10.2.3 to 10.2.5 without downtime

2017-01-20 Thread Luis Periquito
I've run through many upgrades without anyone noticing, including in
very busy openstack environments.

As a rule of thumb you should upgrade MONs, OSDs, MDSs and RadosGWs in
that order; however, you should always read the upgrade instructions on
the release notes page
(http://docs.ceph.com/docs/master/release-notes/).

Again as a rule of thumb, minor version upgrades have little to no
special upgrade notes; the issues are usually with major version
upgrades.
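
In practice the per-node part of a 10.2.3 -> 10.2.5 upgrade looks roughly like this (Ubuntu shown; adjust the package and daemon commands for your distro and init system):

apt-get update && apt-get install -y ceph       # pull in the 10.2.5 packages
systemctl restart ceph-mon@$(hostname -s)       # on each mon node, one at a time
systemctl restart ceph-osd@<id>                 # on each osd node, a few OSDs at a time
ceph -s                                         # wait for HEALTH_OK before the next node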

On Thu, Jan 19, 2017 at 5:50 PM, Oliver Dzombic  wrote:
> Hi,
>
> i did exactly the same, with
>
> Centos 7, 2x OSD server ~ 30 OSDs all in all, 3x mon server, 3x mds server
>
> Following sequence:
>
> - Update OSD server A & restart ceph.target service
> - Update OSD server B & restart ceph.target service
> - Update inactive mon / mds server C & restart ceph.target service
> - Update inactive mon / mds server B & restart ceph.target service
> - Update active mon / mds server A & restart ceph.target service
>
> Result: No downtime, no issues.
>
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
>
>
> Am 19.01.2017 um 18:40 schrieb Vy Nguyen Tan:
>> Hello everyone,
>>
>> I am planning for upgrade Ceph cluster from 10.2.3 to 10.2.5. I am
>> wondering can I upgrade Ceph cluster without downtime? And how to
>> upgrade Ceph from 10.2.3 to 10.2.5 without downtime ?
>>
>> Thanks for help.
>>
>> Regards,
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RGW pool usage is higher that total bucket size

2017-01-05 Thread Luis Periquito
Hi,

I have a cluster with RGW in which one bucket is really big, so every
so often we delete stuff from it.

That bucket is now taking 3.3T after we deleted just over 1T from it.
That was done last week.

The pool (.rgw.buckets) is using 5.1T, and before the deletion was
taking almost 6T.

How can I delete old data from the pool? Currently the pool is using
1.5T more than the sum of the buckets...

radosgw-admin gc list --include-all returns a few objects that were
recently deleted (whose time stamps are a couple of hours in the
future).

This cluster was originally Hammer, has since been upgraded to
Infernalis and now running Jewel (10.2.3).

Any ideas on recovering the lost space?
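
For what it's worth, the next thing I plan to try (assuming Jewel's radosgw-admin) is forcing a garbage-collection pass rather than waiting for the gc schedule:

radosgw-admin gc list --include-all
radosgw-admin gc process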

thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] A VM with 6 volumes - hangs

2016-11-14 Thread Luis Periquito
Without knowing the cluster architecture it's hard to know exactly
what may be happening. And you sent no information on your cluster...

How is the cluster hardware? Where are the journals? How busy are the
disks (% time busy)? What is the pool size? Are these replicated or EC
pools?



On Mon, Nov 14, 2016 at 3:16 PM, German Anders  wrote:
> try to see the specific logs for those particularly osd's, and see if
> something is there, also take a deep close to the pg's that hold those osds
>
> Best,
>
>
> German
>
> 2016-11-14 12:04 GMT-03:00 M Ranga Swami Reddy :
>>
>> When this issue seen, ceph logs shows "slow requests to OSD"
>>
>> But Ceph status is in OK state.
>>
>> Thanks
>> Swami
>>
>> On Mon, Nov 14, 2016 at 8:27 PM, German Anders 
>> wrote:
>>>
>>> Could you share some info about the ceph cluster? logs? did you see
>>> anything different from normal op on the logs?
>>>
>>> Best,
>>>
>>>
>>> German
>>>
>>> 2016-11-14 11:46 GMT-03:00 M Ranga Swami Reddy :

 +ceph-devel

 On Fri, Nov 11, 2016 at 5:09 PM, M Ranga Swami Reddy
  wrote:
>
> Hello,
> I am using the ceph volumes with a VM. Details are below:
>
> VM:
>   OS: Ubuntu 14.0.4
>CPU: 12 Cores
>RAM: 40 GB
>
> Volumes:
>Size: 1 TB
> No:   6 Volumes
>
>
> With above, VM got hung without any read/write operation.
>
> Any suggestions..
>
> Thanks
> Swami



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>>
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster having blocke requests very frequently

2016-11-14 Thread Luis Periquito
Without knowing the cluster architecture it's hard to know exactly what may
be happening.

How is the cluster hardware? Where are the journals? How busy are the disks
(% time busy)? What is the pool size? Are these replicated or EC pools?

Have you tried tuning the deep-scrub processes? Have you tried stopping
them altogether? Are the journals on SSDs? My first feeling is that the
cluster may be hitting its limits (also you have at least one OSD
getting full)...
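
A quick way to test the scrub theory (the flags are cluster-wide, so unset them afterwards):

ceph osd set noscrub
ceph osd set nodeep-scrub
# ... watch whether the blocked requests disappear ...
ceph osd unset noscrub
ceph osd unset nodeep-scrub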

On Mon, Nov 14, 2016 at 3:16 PM, Thomas Danan 
wrote:

> Hi All,
>
>
>
> We have a cluster in production who is suffering from intermittent blocked
> request (25 requests are blocked > 32 sec). The blocked request occurrences
> are frequent and global to all OSDs.
>
> From the OSD daemon logs, I can see related messages:
>
>
>
> 16-11-11 18:25:29.917518 7fd28b989700 0 log_channel(cluster) log [WRN] :
> slow request 30.429723 seconds old, received at 2016-11-11 18:24:59.487570:
> osd_op(client.2406272.1:336025615 rbd_data.66e952ae8944a.00350167
> [set-alloc-hint object_size 4194304 write_size 4194304,write 0~524288]
> 0.8d3c9da5 snapc 248=[248,216] ondisk+write e201514) currently waiting for
> subops from 210,499,821
>
>
>
> . So I guess the issue is related to replication process when writing new
> data on the cluster. Again it is never the same secondary OSDs that are
> displayed in OSD daemon logs.
>
> As a result we are experiencing very important IO Write latency on ceph
> client side (can be up to 1 hour !!!).
>
> We have checked network health as well as disk health but we were not able
> to find any issue.
>
>
>
> Wanted to know if this issue was already observed or if you have ideas to
> investigate / WA the issue.
>
> Many thanks...
>
>
>
> Thomas
>
>
>
> The cluster is composed with 37DN and 851 OSDs and 5 MONs
>
> The Ceph clients are accessing the client with RBD
>
> Cluster is Hammer 0.94.5 version
>
>
>
> cluster 1a26e029-3734-4b0e-b86e-ca2778d0c990
>
> health HEALTH_WARN
>
> 25 requests are blocked > 32 sec
>
> 1 near full osd(s)
>
> noout flag(s) set
>
> monmap e3: 5 mons at {NVMBD1CGK190D00=10.137.81.13:
> 6789/0,nvmbd1cgy050d00=10.137.78.226:6789/0,nvmbd1cgy070d00=
> 10.137.78.232:6789/0,nvmbd1cgy090d00=10.137.78.228:
> 6789/0,nvmbd1cgy130d00=10.137.78.218:6789/0}
>
> election epoch 664, quorum 0,1,2,3,4 nvmbd1cgy130d00,nvmbd1cgy050d00,
> nvmbd1cgy090d00,nvmbd1cgy070d00,NVMBD1CGK190D00
>
> osdmap e205632: 851 osds: 850 up, 850 in
>
> flags noout
>
> pgmap v25919096: 10240 pgs, 1 pools, 197 TB data, 50664 kobjects
>
> 597 TB used, 233 TB / 831 TB avail
>
> 10208 active+clean
>
> 32 active+clean+scrubbing+deep
>
> client io 97822 kB/s rd, 205 MB/s wr, 2402 op/s
>
>
>
>
>
>
>
> *Thank you*
>
> *Thomas Danan*
>
> *Director of Product Development*
>
>
>
> Office+33 1 49 03 77 53
>
> Mobile+33 7 76 35 76 43
>
> Skype thomas.danan
>
>  www.mycom-osi.com
>
>
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] crashing mon with crush_ruleset change

2016-11-14 Thread Luis Periquito
I have a pool where, every time I try to change its crush_ruleset, 2
out of my 3 mons crash, and it's always the same ones. I've tried
leaving the first one down and then it crashes the second.

It's a replicated pool, and I have other pools that look exactly the same.

I've deep-scrubbed all the PGs to make sure there was no corruption.

OTOH the issue is with the .rgw.meta pool, which from what I've read
may (?) not really be needed...

thanks,

Some more information:

ceph osd pool ls detail (on 2 pools)
pool 20 '.log' replicated size 4 min_size 2 crush_ruleset 4
object_hash rjenkins pg_num 16 pgp_num 16 last_change 35288 owner
18446744073709551615 flags hashpspool min_write_recency_for_promote 1
stripe_width 0
pool 26 '.rgw.meta' replicated size 4 min_size 2 crush_ruleset 0
object_hash rjenkins pg_num 16 pgp_num 16 last_change 62814 owner
18446744073709551615 flags hashpspool stripe_width 0

On the log I get (I've increase logging)
-8> 2016-11-14 14:28:55.049924 7f3cf3115700 10 --
10.33.40.37:6789/0 >> 10.252.24.104:0/3170739472 pipe(0x7f3d108da800
sd=23 :6789 s=2 pgs=1 cs=1 l=1 c=0x7f3d0e782600).write_ack 7
-7> 2016-11-14 14:28:55.049920 7f3cf561e700  1 --
10.33.40.37:6789/0 <== client.56744214 10.252.24.104:0/3170739472 7
 mon_command({"var": "crush_ruleset", "prefix": "osd pool set",
"pool": ".rgw.meta", "val": "4"} v 0) v1  125+0+0 (3985264784 0 0)
0x7f3d100ca400 con 0x7f3d0e782600
-6> 2016-11-14 14:28:55.049929 7f3cf3115700 10 --
10.33.40.37:6789/0 >> 10.252.24.104:0/3170739472 pipe(0x7f3d108da800
sd=23 :6789 s=2 pgs=1 cs=1 l=1 c=0x7f3d0e782600).writer: state = open
policy.server=1
-5> 2016-11-14 14:28:55.049969 7f3cf561e700  0
mon.ed05sv38@0(leader) e13 handle_command mon_command({"var":
"crush_ruleset", "prefix": "osd pool set", "pool": ".rgw.meta", "val":
"4"} v 0) v1
-4> 2016-11-14 14:28:55.050003 7f3cf561e700  0 log_channel(audit)
log [INF] : from='client.? 10.252.24.104:0/3170739472'
entity='client.admin' cmd=[{"var": "crush_ruleset", "prefix": "osd
pool set", "pool": ".rgw.meta", "val": "4"}]: dispatch
-3> 2016-11-14 14:28:55.050008 7f3cf561e700  1 --
10.33.40.37:6789/0 --> 10.33.40.37:6789/0 -- log(1 entries from seq
105 at 2016-11-14 14:28:55.050005) v1 -- ?+0 0x7f3d0ea29a80 con
0x7f3d0d652280
-2> 2016-11-14 14:28:55.050021 7f3cf561e700 10
mon.ed05sv38@0(leader).paxosservice(osdmap 62242..62823) dispatch
0x7f3d100ca400 mon_command({"var": "crush_ruleset", "prefix": "osd
pool set", "pool": ".rgw.meta", "val": "4"} v 0) v1 from
client.56744214 10.252.24.104:0/3170739472 con 0x7f3d0e782600
-1> 2016-11-14 14:28:55.050026 7f3cf561e700  5
mon.ed05sv38@0(leader).paxos(paxos active c 79137289..79137877)
is_readable = 1 - now=2016-11-14 14:28:55.050027
lease_expire=2016-11-14 14:28:59.753490 has v0 lc 79137877
 0> 2016-11-14 14:28:55.051509 7f3cf561e700 -1 *** Caught signal
(Segmentation fault) **
 in thread 7f3cf561e700 thread_name:ms_dispatch

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (()+0x4f3022) [0x7f3d01f73022]
 2: (()+0x10340) [0x7f3d00d76340]
 3: (OSDMonitor::prepare_command_pool_set(std::map,
boost::detail::variant::void_, boost::detail::variant::void_,
boost::detail::variant::void_, boost::detail::variant::void_,
boost::detail::variant::void_, boost::detail::variant::void_,
boost::detail::variant::void_, boost::detail::variant::void_,
boost::detail::variant::void_, boost::detail::variant::void_,
boost::detail::variant::void_, boost::detail::variant::void_,
boost::detail::variant::void_, boost::detail::variant::void_,
boost::detail::variant::void_>, std::less,
std::allocator,
boost::detail::variant::void_, boost::detail::variant::void_,
boost::detail::variant::void_, boost::detail::variant::void_,
boost::detail::variant::void_, boost::detail::variant::void_,
boost::detail::variant::void_, boost::detail::variant::void_,
boost::detail::variant::void_, boost::detail::variant::void_,
boost::detail::variant::void_, boost::detail::variant::void_,
boost::detail::variant::void_, boost::detail::variant::void_,
boost::detail::variant::void_> > > >&, std::basic_stringstream&)+0x1228)
[0x7f3d01d8d358]
 4: (OSDMonitor::prepare_command_impl(std::shared_ptr,
std::map,
boost::detail::variant::void_, boost::detail::variant::void_,
boost::detail::variant::void_, boost::detail::variant::void_,
boost::detail::variant::void_, boost::detail::variant::void_,
boost::detail::variant::void_, boost::detail::variant::void_,
boost::detail::variant::void_, boost::detail::variant::void_,
boost::detail::variant::void_, boost::detail::variant::void_,
boost::detail::variant::void_, boost::detail::variant::void_,
boost::detail::variant::void_>, std::less,
std::allocator,
boost::detail::variant::void_, 

[ceph-users] weird state whilst upgrading to jewel

2016-10-10 Thread Luis Periquito
I was upgrading a really old cluster from Infernalis (9.2.1) to Jewel
(10.2.3) and got some weird, but interesting issues. This cluster
started its life with Bobtail -> Dumpling -> Emperor -> Firefly ->
Giant -> Hammer -> Infernalis and now Jewel.

When I upgraded the first MON (out of 3) everything just worked as it
should. Upgraded the second, and the first and second crashed. Reverted
the binaries on one of them to Infernalis, deleted the store.db folder
in the other one, started it as Jewel (now had 2x Infernalis and 1x
Jewel) and let it sync the store. Upgraded the other nodes and
everything was fine.

Or so it mostly seems. Other than the usual "failed to encode map xxx
with expected crc".

I had some weird size graphs in calamari, and looking closer (ceph df) I got:
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
 10E  932P   5E 52.46

oooh I got a really big cluster, it's usually a lot smaller (size is 655T).

a snippet cut from "ceph -s"
 health HEALTH_ERR
1 full osd(s)
flags full
  pgmap v77393779: 6384 pgs, 26 pools, 66584 GB data, 52605 kobjects
5502 PB used, 17316 PB / 10488 PB avail

health detail shows: osd.89 is full at 266%
that is one of the OSDs being upgraded...

The cluster ends up recovering on its own and showing the regular sane
values... But this does seem to indicate some sort of underlying
issue.

has anyone seen such an issue?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph and SMI-S

2016-08-04 Thread Luis Periquito
Hi all,

I've been asked whether Ceph supports the Storage Management Initiative
Specification (SMI-S). This is in the context of monitoring our ceph
clusters/environments.

I've looked around and found no references to it being supported. But does it?

thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Number of PGs: fix from start or change as we grow ?

2016-08-03 Thread Luis Periquito
Changing the number of PGs is one of the most expensive operations you
can run, and should be avoided as much as possible.

Having said that, you should try to avoid having way too many PGs on
very few OSDs, but that is certainly preferable to splitting PGs later...

On Wed, Aug 3, 2016 at 1:15 PM, Maged Mokhtar
 wrote:
> Hello,
>
> I would like to build a small cluster with 20 disks to start but in the
> future would like to gradually increase it to maybe 200 disks.
> Is it better to fix the number of PGs in the pool from the beginning or is
> it better to start with a small number and then gradually change the number
> of PGs as the system grows ?
> Is the act of changing the number of PGs in a running cluster something that
> can be done regularly ?
>
> Cheers
> /Maged
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow requests on cluster.

2016-07-14 Thread Luis Periquito
Hi Jaroslaw,

several things spring to mind. I'm assuming the cluster is
healthy (other than the slow requests), right?

From the (little) information you sent it seems the pools are
replicated with size 3, is that correct?

Are there any long-running delete processes? They usually have a
negative impact on performance, especially as they don't really show up
in the IOPS statistics.
I've also seen something like this happen when there's a slow disk/osd.
You can check with "ceph osd perf" and look for higher numbers.
Restarting that OSD usually brings the cluster back to life, if that's
the issue.
If nothing shows, try a "ceph tell osd.* version"; a misbehaving OSD
usually doesn't respond properly to that command (slow or even timing
out).

Also, you don't say how many scrub/deep-scrub processes are running.
If not properly handled they are a performance killer too.

Last, but by far not least, have you ever thought of creating an SSD
pool (even a small one) and moving all pools but .rgw.buckets there?
The other ones are small enough, and would benefit from having their
own "reserved" osds...
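
Something like this is usually enough to spot the culprit (the restart command depends on your init system, and the osd id is a placeholder):

ceph osd perf | sort -n -k2 | tail     # OSDs with the highest commit/apply latency
ceph tell osd.* version                # a misbehaving OSD answers slowly or times out
restart ceph-osd id=<id>               # or: systemctl restart ceph-osd@<id>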



On Thu, Jul 14, 2016 at 1:59 PM, Jaroslaw Owsiewski
 wrote:
> Hi,
>
> we have a problem with performance drastically slowing down on a cluster. We use
> radosgw with the S3 protocol. Our configuration:
>
> 153 OSD SAS 1.2TB with journal on SSD disks (ratio 4:1)
> - no problems with networking, no hardware issues, etc.
>
> Output from "ceph df":
>
> GLOBAL:
> SIZE AVAIL RAW USED %RAW USED
> 166T  129T   38347G 22.44
> POOLS:
> NAME   ID USED   %USED MAX AVAIL
> OBJECTS
> .rgw   9  70330k 039879G
> 393178
> .rgw.root  10848 039879G
> 3
> .rgw.control   11  0 039879G
> 8
> .rgw.gc12  0 039879G
> 32
> .rgw.buckets   13 10007G  5.8639879G
> 331079052
> .rgw.buckets.index 14  0 039879G
> 2994652
> .rgw.buckets.extra 15  0 039879G
> 2
> .log   16   475M 039879G
> 408
> .intent-log17  0 039879G
> 0
> .users 19729 039879G
> 49
> .users.email   20414 039879G
> 26
> .users.swift   21  0 039879G
> 0
> .users.uid 22  17170 039879G
> 89
>
> Problems began last Saturday.
> Throughput was 400k req per hour - mostly PUTs and HEADs ~100kb.
>
> Ceph version is hammer.
>
>
> We have two clusters with similar configuration and both experienced same
> problems at once.
>
> Any hints
>
>
> Latest output from "ceph -w":
>
> 2016-07-14 14:43:16.197131 osd.26 [WRN] 17 slow requests, 16 included below;
> oldest blocked for > 34.766976 secs
> 2016-07-14 14:43:16.197138 osd.26 [WRN] slow request 32.99 seconds old,
> received at 2016-07-14 14:42:43.641440: osd_op(client.75866283.0:20130084
> .dir.default.75866283.65796.3 [delete] 14.122252f4
> ondisk+write+known_if_redirected e18788) currently commit_sent
> 2016-07-14 14:43:16.197145 osd.26 [WRN] slow request 32.536551 seconds old,
> received at 2016-07-14 14:42:43.660487: osd_op(client.75866283.0:20130121
> .dir.default.75866283.65799.6 [delete] 14.d2dc1672
> ondisk+write+known_if_redirected e18788) currently commit_sent
> 2016-07-14 14:43:16.197153 osd.26 [WRN] slow request 30.971549 seconds old,
> received at 2016-07-14 14:42:45.225490: osd_op(client.75866283.0:20132345
> gc.12 [call rgw.gc_set_entry] 12.a45046b8
> ack+ondisk+write+known_if_redirected e18788) currently waiting for rw locks
> 2016-07-14 14:43:16.197158 osd.26 [WRN] slow request 30.967568 seconds old,
> received at 2016-07-14 14:42:45.229471: osd_op(client.76495939.0:20147494
> gc.12 [call rgw.gc_set_entry] 12.a45046b8
> ack+ondisk+write+known_if_redirected e18788) currently waiting for rw locks
> 2016-07-14 14:43:16.197162 osd.26 [WRN] slow request 32.253169 seconds old,
> received at 2016-07-14 14:42:43.943870: osd_op(client.75866283.0:20130663
> .dir.default.75866283.65805.7 [delete] 14.2b5a1672
> ondisk+write+known_if_redirected e18788) currently commit_sent
> 2016-07-14 14:43:17.197429 osd.26 [WRN] 3 slow requests, 2 included below;
> oldest blocked for > 31.967882 secs
> 2016-07-14 14:43:17.197434 osd.26 [WRN] slow request 31.579897 seconds old,
> received at 2016-07-14 14:42:45.617456: osd_op(client.76495939.0:20147877
> gc.12 [call rgw.gc_set_entry] 12.a45046b8
> ack+ondisk+write+known_if_redirected e18788) currently waiting for rw locks
> 2016-07-14 14:43:17.197439 osd.26 [WRN] slow request 30.897873 seconds old,
> received at 2016-07-14 14:42:46.299480: osd_op(client.76495939.0:20148668
> 

Re: [ceph-users] Lessons learned upgrading Hammer -> Jewel

2016-07-13 Thread Luis Periquito
Thanks for sharing Wido.

From your information you only talk about the MONs and OSDs. What about
the RGW nodes? You stated at the beginning that 99% is rgw...
On Wed, Jul 13, 2016 at 3:56 PM, Wido den Hollander  wrote:
> Hello,
>
> The last 3 days I worked at a customer with a 1800 OSD cluster which had to 
> be upgraded from Hammer 0.94.5 to Jewel 10.2.2
>
> The cluster in this case is 99% RGW, but also some RBD.
>
> I wanted to share some of the things we encountered during this upgrade.
>
> All 180 nodes are running CentOS 7.1 on a IPv6-only network.
>
> ** Hammer Upgrade **
> At first we upgraded from 0.94.5 to 0.94.7, this went well except for the 
> fact that the monitors got spammed with these kind of messages:
>
>   "Failed to encode map eXXX with expected crc"
>
> Some searching on the list brought me to:
>
>   ceph tell osd.* injectargs -- --clog_to_monitors=false
>
>  This reduced the load on the 5 monitors and made recovery succeed smoothly.
>
>  ** Monitors to Jewel **
>  The next step was to upgrade the monitors from Hammer to Jewel.
>
>  Using Salt we upgraded the packages and afterwards it was simple:
>
>killall ceph-mon
>chown -R ceph:ceph /var/lib/ceph
>chown -R ceph:ceph /var/log/ceph
>
> Now, a systemd quirck. 'systemctl start ceph.target' does not work, I had to 
> manually enabled the monitor and start it:
>
>   systemctl enable ceph-mon@srv-zmb04-05.service
>   systemctl start ceph-mon@srv-zmb04-05.service
>
> Afterwards the monitors were running just fine.
>
> ** OSDs to Jewel **
> To upgrade the OSDs to Jewel we initially used Salt to update the packages on 
> all systems to 10.2.2, we then used a Shell script which we ran on one node 
> at a time.
>
> The failure domain here is 'rack', so we executed this in one rack, then the 
> next one, etc, etc.
>
> Script can be found on Github: 
> https://gist.github.com/wido/06eac901bd42f01ca2f4f1a1d76c49a6
>
> Be aware that the chown can take a long, long, very long time!
>
> We ran into the issue that some OSDs crashed after start. But after trying 
> again they would start.
>
>   "void FileStore::init_temp_collections()"
>
> I reported this in the tracker as I'm not sure what is happening here: 
> http://tracker.ceph.com/issues/16672
>
> ** New OSDs with Jewel **
> We also had some new nodes which we wanted to add to the Jewel cluster.
>
> Using Salt and ceph-disk we ran into a partprobe issue in combination with 
> ceph-disk. There was already a Pull Request for the fix, but that was not 
> included in Jewel 10.2.2.
>
> We manually applied the PR and it fixed our issues: 
> https://github.com/ceph/ceph/pull/9330
>
> Hope this helps other people with their upgrades to Jewel!
>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw live upgrade hammer -> jewel

2016-07-07 Thread Luis Periquito
Hi all,

I have (some) ceph clusters running hammer and they are serving S3 data.
There are a few radosgw serving requests, in a load balanced form
(actually OSPF anycast IPs).

Usually upgrades go smoothly: I upgrade one node at a time, and traffic
just gets redirected to the nodes that are still running.

From my tests, with the upgrade to Jewel this is no longer the case, as
I have to stop all the radosgw instances, then run a script to migrate
the pools, and only then start the radosgw processes again.

I tested using both Infernalis and Hammer and it seems the behaviour
is the same.

Is there a way to run this upgrade live, without any downtime from the
radosgw service? What would be the best upgrade strategy?

thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] layer3 network

2016-07-07 Thread Luis Periquito
If, like me, you have several different networks, or they overlap for
whatever reason, I just have the options:

mon addr = IP:port
osd addr = IP

in the relevant sections. However I use puppet to deploy ceph, and all
files are "manually" created.

So it becomes something like this:

[mon.mon1]
  host = mon1
  mon addr = x.y.z.a:6789
  mon data = /var/lib/ceph/mon/internal-mon1
[osd.0]
  host = dskh1
  osd addr = x.y.z.a
  osd data = /var/lib/ceph/osd/osd-0
  osd journal = /var/lib/ceph/osd/journal/osd-0
  keyring = /var/lib/ceph/osd/osd-0/keyring
  osd max backfills = 1
  osd recovery max active = 1
  osd recovery op priority = 1
  osd client op priority = 63
  osd disk thread ioprio class = idle
  osd disk thread ioprio priority = 7
[osd.1]


Naturally, the host and addr parts are correct for our environment.

On 7 July 2016 at 11:36, Nick Fisk <n...@fisk.me.uk> wrote:

> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of Matyas Koszik
> > Sent: 07 July 2016 11:26
> > To: ceph-users@lists.ceph.com
> > Subject: [ceph-users] layer3 network
> >
> >
> >
> > Hi,
> >
> > My setup uses a layer3 network, where each node has two connections
> (/31s), equipped with a loopback address and redundancy is
> > provided via OSPF. In this setup it is important to use the loopback
> address as source for outgoing connections, since the
> interface
> > addresses are not protected from failure, but the loopback address is.
> >
> > So I set the public addr and the cluster addr to the desired ip, but it
> seems that the outgoing connections do not use this as the
> source
> > address.
> > I'm using jewel; is this the expected behavior?
>
> Do your public/cluster networks overlap the physical connection IP's? From
> what I understand Ceph binds to the interface whose IP
> lies within the range specified in the conf file.
>
> So for example if public addr = 192.168.1.0/24
>
> Then your loopback should be in that range, but you must make sure the
> physical nics lie outside this range.
>
> I'm following this with interest as I am about to deploy something very
> similar.
>
> >
> > Matyas
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 

Luis Periquito

Unix Team Lead

<http://www.ocado.com/>

Head Office, Titan Court, 3 Bishop Square, Hatfield Business Park,
Hatfield, Herts AL10 9NE

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] changing k and m in a EC pool

2016-06-30 Thread Luis Periquito
>> I have created an Erasure Coded pool and would like to change the K
>> and M of it. Is there any way to do it without destroying the pool?
>>
> No.
>
> http://docs.ceph.com/docs/master/rados/operations/erasure-code/
>
> "Choosing the right profile is important because it cannot be modified
> after the pool is created: a new pool with a different profile needs to be
> created and all objects from the previous pool moved to the new."
>

Maybe my question should have been: how can I copy those objects from
one pool to another, given the pool is default.rgw.buckets.data. And
can it be done online?

thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] changing k and m in a EC pool

2016-06-30 Thread Luis Periquito
Hi all,

I have created an Erasure Coded pool and would like to change the K
and M of it. Is there any way to do it without destroying the pool?

The cluster doesn't have much IO, but the pool (rgw data) has just
over 10T, and I didn't want to lose it.


thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSPF to the host

2016-06-08 Thread Luis Periquito
> OTOH, running ceph on dynamically routed networks will put your routing
> daemon (e.g. bird) in a SPOF position...
>
I run a somewhat large estate with either BGP or OSPF attachment; not
only is ceph happy with either of them, I have also never had issues
with the routing daemons (after setting them up properly). However I
only run rj45 copper.

I've only had issues when both links, for several unrelated reasons,
became unavailable, at which point the host is simply not contactable.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSPF to the host

2016-06-06 Thread Luis Periquito
Nick,

TL;DR: works brilliantly :)

Where I work we have all of the ceph nodes (and a lot of other stuff) using
OSPF and BGP server attachment. With that we're able to implement solutions
like Anycast addresses, which removes the need for load balancers in front
of the radosgw service.

The biggest issues we've had were around per-flow vs per-packet traffic
load balancing, but as long as you keep it simple you shouldn't have any
issues.

Currently we have a P2P network between the servers and the ToR switches on
/31 subnets, and then create a virtual loopback address, which is the
interface we use for all communications. Running tests like iperf we're
able to reach 19Gbps (on a 2x10Gbps network). OTOH we no longer have the
ability to separate traffic between the public and OSD networks, but we
never really felt the need for it.

Also spend a bit of time planning what the network will look like and its
topology. If done properly (think details like route summarization) it's
really worth the extra effort.
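
For what it's worth, the host side of this is small; a minimal sketch of the
sort of bird config I mean (bird 1.x syntax, addresses and interface names
are made up, so adapt rather than copy it):

  # /etc/bird/bird.conf
  router id 10.0.0.11;

  protocol direct {
          interface "lo";           # advertise the loopback "server" /32
  }

  protocol kernel {
          import none;
          export all;               # install learned routes in the kernel
  }

  protocol ospf {
          area 0 {
                  interface "eth0", "eth1" {
                          type pointopoint;   # the /31 P2P links to the ToRs
                  };
                  interface "lo" {
                          stub;               # advertise only, no adjacency
                  };
          };
  }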



On Mon, Jun 6, 2016 at 11:57 AM, Nick Fisk  wrote:

> Hi All,
>
>
>
> Has anybody had any experience with running the network routed down all
> the way to the host?
>
>
>
> I know the standard way most people configured their OSD nodes is to bond
> the two nics which will then talk via a VRRP gateway and then probably from
> then on the networking is all Layer3. The main disadvantage I see here is
> that you need a beefy inter switch link to cope with the amount of traffic
> flowing between switches to the VRRP address. I’ve been trying to design
> around this by splitting hosts into groups with different VRRP gateways on
> either switch, but this relies on using active/passive bonding on the OSD
> hosts to make sure traffic goes from the correct Nic to the directly
> connected switch.
>
>
>
> What I was thinking, instead of terminating the Layer3 part of the network
> at the access switches, terminate it at the hosts. If each Nic of the OSD
> host had a different subnet and the actual “OSD Server” address bound to a
> loopback adapter, OSPF should advertise this loopback adapter address as
> reachable via the two L3 links on the physically attached Nic’s. This
> should give you a redundant topology which also will respect your
> physically layout and potentially give you higher performance due to ECMP.
>
>
>
> Any thoughts, any pitfalls?
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] civetweb vs Apache for rgw

2016-05-24 Thread Luis Periquito
It may be possible to do HTTPS with civetweb, but I use Apache because of
its HTTPS configuration.
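
That said, newer civetweb builds shipped with ceph can apparently terminate
SSL themselves; something along these lines (the section name and cert path
are just examples, and I haven't verified it myself):

  [client.rgw.gateway1]
  # "443s" marks the port as SSL; the PEM must contain key + certificate
  rgw frontends = civetweb port=443s ssl_certificate=/etc/ceph/private/keyandcert.pem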

On Tue, May 24, 2016 at 5:49 AM, fridifree  wrote:
> What apache gives that civetweb not?
> Thank you
>
> On May 23, 2016 11:49 AM, "Anand Bhat"  wrote:
>>
>> For performance, civetweb is better as fastcgi module associated with
>> apache is single threaded. But Apache does have fancy features which
>> civetweb lacks. If you are looking for just the performance, then go for
>> civetweb.
>>
>> Regards,
>> Anand
>>
>> On Mon, May 23, 2016 at 12:43 PM, fridifree  wrote:
>>>
>>> Hi everyone,
>>> What would give the best performance in most of the cases, civetweb or
>>> apache?
>>>
>>> Thank you
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>>
>> --
>>
>> 
>> Never say never.
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel ubuntu release is half cooked

2016-05-24 Thread Luis Periquito
Hi,


On Mon, May 23, 2016 at 8:24 PM, Anthony D'Atri  wrote:
>
>
> Re:
>
>> 2. Inefficient chown documentation - The documentation states that one 
>> should "chown -R ceph:ceph /var/lib/ceph" if one is looking to have ceph-osd 
>> ran as user ceph and not as root. Now, this command would run a chown 
>> process one osd at a time. I am considering my cluster to be a fairly small 
>> cluster with just 30 osds between 3 osd servers. It takes about 60 minutes 
>> to run the chown command on each osd (3TB disks with about 60% usage). It 
>> would take about 10 hours to complete this command on each osd server, which 
>> is just mad in my opinion. I can't imagine this working well at all on 
>> servers with 20-30 osds! IMHO the docs should be adjusted to instruct users 
>> to run the chown in _parallel_ on all osds instead of doing it one by one.
>
>
> I suspect the docs are playing it safe there, Ceph runs on servers of widely 
> varying scale, capabilities, and robustness.  Running 30 chown -R processes 
> in parallel could present noticeable impact on a production server.

I did this process in two separate, isolated steps:
- first I upgraded, ensuring the "setuser match path =
/var/lib/ceph/$type/$cluster-$id" option was set. It's in the 9.2.0 and
10.2.0 release notes. This meant that after the upgrade everything was
still running as root as before, and there was no need to change the
permissions.
- then, one daemon at a time, I ran chown -R to root (which didn't
change anything but read all the FS metadata), stopped the
daemon, re-ran the chown to ceph (which was now very fast) and started
that daemon. Actual downtime per daemon was under 5 minutes. I did set
the noout flag for the OSDs.

To help with the journal I created a udev rules file so the journal
devices were already owned by ceph; root was still able to use them.

This process works for both OSD and MON. I have yet to do the MDS, and
my radosgw were already running as ceph user.
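
For reference, the per-OSD sequence boils down to something like this
(trusty/upstart names, and osd.12 is just an example):

  ceph osd set noout
  # first pass while the OSD is still running: ownership doesn't change,
  # but it walks the whole filestore and warms the metadata cache
  chown -R root:root /var/lib/ceph/osd/ceph-12
  stop ceph-osd id=12
  chown -R ceph:ceph /var/lib/ceph/osd/ceph-12   # now very fast
  start ceph-osd id=12
  ceph osd unset noout
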
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] using jemalloc in trusty

2016-05-23 Thread Luis Periquito
Thanks Somnath, I expected as much. But given the hint in the config
files, do you know if the packages are built to use jemalloc? It seems not...

On Mon, May 23, 2016 at 3:34 PM, Somnath Roy <somnath@sandisk.com> wrote:
> You need to build ceph code base to use jemalloc for OSDs..LD_PRELOAD won't 
> work..
>
> Thanks & regards
> Somnath
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Luis 
> Periquito
> Sent: Monday, May 23, 2016 7:30 AM
> To: Ceph Users
> Subject: [ceph-users] using jemalloc in trusty
>
> I've been running some tests with jewel, and wanted to enable jemalloc.
> I noticed that the new jewel release now loads properly /etc/default/ceph and 
> has an option to use jemalloc.
>
> I've installed jemalloc, enabled the LD_PRELOAD option, however doing some 
> tests it seems that it's still using tcmalloc: I still see the 
> "tcmalloc::CentralFreeList::FetchFromSpans()" and it's accompanying lines in 
> perf top.
>
> Also from a lsof I can see the tcmalloc libraries being used, but not the 
> jemalloc ones...
>
> Does anyone know what I'm doing wrong? I'm using the standard binaries from 
> the repo 10.2.1 and ubuntu trusty with kernel 3.13.0-52-generic.
>
> thanks,
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> PLEASE NOTE: The information contained in this electronic mail message is 
> intended only for the use of the designated recipient(s) named above. If the 
> reader of this message is not the intended recipient, you are hereby notified 
> that you have received this message in error and that any review, 
> dissemination, distribution, or copying of this message is strictly 
> prohibited. If you have received this communication in error, please notify 
> the sender by telephone or e-mail (as shown above) immediately and destroy 
> any and all copies of this message in your possession (whether hard copies or 
> electronically stored copies).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] using jemalloc in trusty

2016-05-23 Thread Luis Periquito
I've been running some tests with jewel, and wanted to enable jemalloc.
I noticed that the new jewel release now loads properly
/etc/default/ceph and has an option to use jemalloc.

I've installed jemalloc, enabled the LD_PRELOAD option, however doing
some tests it seems that it's still using tcmalloc: I still see the
"tcmalloc::CentralFreeList::FetchFromSpans()" and it's accompanying
lines in perf top.

Also from a lsof I can see the tcmalloc libraries being used, but not
the jemalloc ones...
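
A couple of checks I've been using to see what actually ends up in the
process (paths are the trusty defaults, adjust as needed):

  # is the binary linked against tcmalloc at build time?
  ldd /usr/bin/ceph-osd | grep -Ei 'tcmalloc|jemalloc'
  # what is mapped into a running OSD, LD_PRELOAD included?
  grep -Ei 'tcmalloc|jemalloc' /proc/$(pidof -s ceph-osd)/maps
  # the preload line being tested
  grep -v '^#' /etc/default/ceph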

Does anyone know what I'm doing wrong? I'm using the standard binaries
from the repo 10.2.1 and ubuntu trusty with kernel 3.13.0-52-generic.

thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Thin Provisioning on OpenStack Instances

2016-04-01 Thread Luis Periquito
You want to enable the "show_image_direct_url = True" option.

Full configuration information can be found at
http://docs.ceph.com/docs/master/rbd/rbd-openstack/
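
For reference, this is the glance-api.conf bit I mean (and note the image
needs to be in raw format for RBD copy-on-write clones to work):

  # /etc/glance/glance-api.conf
  [DEFAULT]
  # expose the RBD location of the image so nova/cinder can clone it
  # copy-on-write instead of copying the full 40GB every time
  show_image_direct_url = True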

On Thu, Mar 31, 2016 at 10:49 PM, Mario Codeniera
 wrote:
> Hi,
>
> Is there anyone done thin provisioning on OpenStack instances (virtual
> machine)? Based on the current configurations, it works well with my cloud
> using ceph 0.94.5 with SSD journal (from 18mins to around 7 mins for
> creating an 40GB instance and not good SSD iops). But what I wanted is the
> storage space as it copied the whole image from Glance (40GB) to each newly
> created virtual machine, is there any chances that it will copy only the top
> changes? somewhat like a vmware-like snapshot, but still the base image is
> still there.
>
> Current setup:
> xxx --> (uploaded glance image, say Centos 7 with 40GB)
>
> if create an instance,
> xxx + yyy  where yyy is the new changes
> (40GB + MB/GB changes)
>
>
> Plan setup:
> (it will save storage as it will not copy xxx)
> yyy is only stored on the ceph
>
>
> As per testing on current cloud, the OpenStack snapshot still copy the whole
> image + new changes. Correct me if wrong as still using the Kilo release
> (2015.1.1) or maybe it was a misconfiguration? And the more I added the
> users, the more OSDs will be added too.
>
> Any insights are highly appreciated.
>
>
> Thanks,
> Mario
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] osd timeout

2016-03-09 Thread Luis Periquito
I have a cluster spread across 2 racks, with a crush rule that splits
data across those racks.

To test a failure scenario we powered off one of the racks, and
expected ceph to continuing running. Of the 56 OSDs that were powered
off 52 were quickly set as down in the cluster (it took around 30
seconds) but the remaining 4, all in different hosts, took the 900s
with the "marked down after no pg stats for 900.118248seconds"
message.

now for some questions:
Is it expected that some OSDs don't get marked down as they should? I/O
was happening to the cluster before, during and after this event...
Should we reduce the 900s timeout to a much lower value? How can we
make sure beforehand that it's not too low? How low should we go?
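
For reference, I believe these are the knobs involved (ceph.conf sketch; the
option names and values are my assumption of the defaults, so please
double-check before changing anything):

  [global]
  # how long an OSD can miss heartbeats before its peers report it down
  osd heartbeat grace = 20

  [mon]
  # the "no pg stats for 900s" cut-off from the log message above
  mon osd report timeout = 900
  # how many distinct OSDs must report a peer before it is marked down
  mon osd min down reporters = 1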

After those 18mins elapsed (900 seconds) the cluster resumed all IO
and came back to life as expected.

The big issue is that having these 4 OSDs down, but not marked down, caused
almost all IO to this cluster to stop, especially as the OSDs built up a
bigger and bigger queue of slow requests...

Another issue was during recovery: after we turned those servers back
on and started ceph on those nodes, the cluster just ground to a halt
for a while. "osd perf" had all the numbers <100, slow requests were at
the same level in all OSDs, nodes didn't have any IOWait, CPUs were
idle, load avg < 0.5, so we couldn't find anything that pointed to a
culprit. However one of the OSDs timed out a "tell osd.* version" and
restarting that OSD made the cluster responsive again. Any idea on how
to detect this type of situation?

This cluster is running hammer (0.94.5) and has 112 OSDs, 56 in each rack.

thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] abort slow requests ?

2016-03-04 Thread Luis Periquito
You should really fix the objects that are stuck on peering.

So far what I've seen in ceph is that it prefers data integrity over
availability. So if it thinks that it can't keep everything working
properly it tends to stop (i.e. the blocked requests), thus I don't
believe there's a way to do this.

On Fri, Mar 4, 2016 at 1:04 AM, Ben Hines  wrote:
> I have a few bad objects in ceph which are 'stuck on peering'.  The clients
> hit them and they build up and eventually stop all traffic to the OSD.   I
> can open up traffic by resetting the OSD (aborting those requests)
> temporarily.
>
> Is there a way to tell ceph to cancel/abort these 'slow requests' once they
> get to certain amount of time? Rather than building up and blocking
> everything..
>
> -Ben
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade from Hammer LTS to Infernalis or wait for Jewel LTS?

2016-03-04 Thread Luis Periquito
On Wed, Mar 2, 2016 at 9:32 AM, Mihai Gheorghe  wrote:
> Hi,
>
> I've got two questions!
>
> First. We are currently running Hammer in production. We are thinking of
> upgrading to Infernalis. Should we upgrade now or wait for the next LTS,
> Jewel? On ceph releases i can see Hammers EOL is estimated in november 2016
> while Infernalis is June 2016.

I don't know where you got this information but it seems wrong. From
previous history the last 2 LTS versions are supported (currently
Firefly and Hammer). That would mean that Hammer should be supported
until the L version is released. Infernalis should be supported until
the release of Jewel.

> If i follow the upgrade procedure there should not be any problems, right?

So far we've upgraded every version without issues. But past performance...

>
> Second. When Jewel LTS will be released, does anybody know if we can upgrade
> straight from Hammer or first we need to upgrade to Infernalis and then
> Jewel. If the latter is the case i see no reason not to upgrade now to
> Infernalis and wait for Jewel release to upgrade again. This way we can take
> advantage of the new features in Infernalis.

Usually you can upgrade LTS -> LTS, so you should be able to go from
Hammer to Jewel. The same should be true for Infernalis. However,
minimum versions may apply (e.g. you need at least version 0.94.4 to
upgrade to Infernalis).
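
A quick way to check what every daemon is actually running before you jump:

  ceph tell mon.* version
  ceph tell osd.* version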

>
> Also what is the correct order of upgrading? Mons first then OSDs?

Usually mons, then osds and then mds and radosgw. But if there's
something different it'll be published in the release notes.

>
> Any input on the matter would be greatly apreciated.

If it were me, it would depend on what you value most: if you prefer
stability and a conservative approach I'd install Hammer; if you
prefer features and performance I'd install Infernalis.
As an example, all major players (like Red Hat, Fujitsu, SUSE, etc.) use
only the LTS versions for their distros.

>
> Thank you.
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] s3 bucket creation time

2016-02-29 Thread Luis Periquito
Hi all,

I have a biggish ceph environment and currently creating a bucket in
radosgw can take as long as 20s.

What affects the time a bucket takes to be created? How can I improve that time?

I've tried to create in several "bucket-location" with different
backing pools (some of them empty) and the time was the same.

thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Benchmark individual OSD's

2015-10-29 Thread Luis Periquito
The only way I can think of doing that is creating a new crush rule that
selects that specific OSD with min_size = max_size = 1, then creating a
pool with size = 1 using that crush rule.

Then you can use that pool as you'd use any other pool.

I haven't tested it, however it should work.
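
A sketch of what I mean (again untested; osd.3, the ruleset id and the pool
name are just examples):

  # export and decompile the current crush map
  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt

  # append a rule like this to crush.txt (pick an unused ruleset id):
  #
  # rule osd3-only {
  #    ruleset 10
  #    type replicated
  #    min_size 1
  #    max_size 1
  #    step take osd.3
  #    step emit
  # }

  crushtool -c crush.txt -o crush.new
  ceph osd setcrushmap -i crush.new

  # single-copy pool mapped to that OSD, then benchmark it
  ceph osd pool create bench-osd3 32 32 replicated osd3-only
  ceph osd pool set bench-osd3 size 1
  rados bench -p bench-osd3 30 write --no-cleanup
  rados bench -p bench-osd3 30 seq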

On Thu, Oct 29, 2015 at 1:44 AM, Lindsay Mathieson
 wrote:
>
> On 29 October 2015 at 11:39, Lindsay Mathieson 
> wrote:
>>
>> Is there a way to benchmark individual OSD's?
>
>
> nb - Non-destructive :)
>
>
> --
> Lindsay
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph and upgrading OS version

2015-10-22 Thread Luis Periquito
There are several routes you can follow for this work. The best one
will depend on cluster size, current data, pool definition (size),
performance expectations, etc.

They range from doing a dist-upgrade a node at a time, to
remove-upgrade-then-add nodes to the cluster. But given that ceph is
"self-healing", if you are somewhat careful you can do the upgrade
online without much disruption, if any (performance will always be
impacted).
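
The node-at-a-time route is roughly this (assuming you can take one node out
at a time and let it rejoin):

  ceph osd set noout      # stop ceph rebalancing while the node is away
  # on the node being upgraded (12.04 -> 14.04):
  do-release-upgrade      # or your preferred dist-upgrade route, then reboot
  ceph -s                 # wait for the OSDs/mon to rejoin and PGs to recover
  ceph osd unset noout    # once everything is active+clean again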

On Thu, Oct 22, 2015 at 12:22 PM, Andrei Mikhailovsky  wrote:
>
> Any thoughts anyone?
>
> Is it safe to perform OS version upgrade on the osd and mon servers?
>
> Thanks
>
> Andrei
>
> 
> From: "Andrei Mikhailovsky" 
> To: ceph-us...@ceph.com
> Sent: Tuesday, 20 October, 2015 8:05:19 PM
> Subject: [ceph-users] ceph and upgrading OS version
>
> Hello everyone
>
> I am planning to upgrade my ceph servers from Ubuntu 12.04 to 14.04 and I am
> wondering if you have a recommended process of upgrading the OS version
> without causing any issues to the ceph cluster?
>
> Many thanks
>
> Andrei
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-20 Thread Luis Periquito
On Tue, Oct 20, 2015 at 3:26 AM, Haomai Wang  wrote:
> The fact is that journal could help a lot for rbd use cases,
> especially for small ios. I don' t think it will be bottleneck. If we
> just want to reduce double write, it doesn't solve any performance
> problem.
>

One trick I've been using in my ceph clusters is hiding a slow write
backend behind a fast journal device. The write performance will be that
of the fast (and small) journal device. This only helps on writes, but it
can make a huge difference.

I've even run some tests showing (within 10%, RBD and S3) that the
backend device doesn't matter and the write performance is exactly the
same as that of the journal device fronting all the writes.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-20 Thread Luis Periquito
>> One trick I've been using in my ceph clusters is hiding a slow write
>> backend behind a fast journal device. The write performance will be of
>> the fast (and small) journal device. This only helps on write, but it
>> can make a huge difference.
>>
>
> Do you mean an external filesystem journal? What filesystem? ext4/xfs?
> I tried that on a physical machine and it worked wonders with both of them, 
> even though data wasn't journaled and hit the platters - I don't yet 
> understand how that was possible but the benchmark just flew.
>

I just have a raw partition on the journal device (SSD) and point "osd
journal" to that block device (something like "osd journal =
/dev/vgsde/journal-8"). So there is no filesystem on the journal device.
The osd data is then on a local HDD using a normal XFS filesystem.

To help this I usually have big amounts of RAM (an average of 6G per
OSD), so the buffered writes to the spindle can take their time to
flush.
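
For illustration, the relevant bits of ceph.conf look something like this
(the filestore sync intervals are knobs I'd tune, not recommendations):

  [osd.8]
  # raw LV on the SSD, no filesystem on it
  osd journal = /dev/vgsde/journal-8

  [osd]
  # let writes sit in the journal/page cache and flush to the spindle lazily
  filestore min sync interval = 0.01
  filestore max sync interval = 15
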
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] planet.ceph.com

2015-10-20 Thread Luis Periquito
Hi,

I was looking for some ceph resources and saw a reference to
planet.ceph.com. However when I opened it I was sent to a dental
clinic (?). That doesn't sound right, does it?

I was at this page when I saw the reference...

thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-20 Thread Luis Periquito
>
> On 10/20/2015 08:41 AM, Robert LeBlanc wrote:
>>
>> Given enough load, that fast Jornal will get filled and you will only be
>> as fast as the back disk can flush (and at the same time service reads).
>> That the the situation we are in right now. We are still seeing better
>> performance than a raw spindle, but only 150 IOPs, not 15000 IOPS that
>> the SSD can do. You are still ultimately bound by the back end disk.
>>
>> Robert LeBlanc
>>

This is true. I've seen this happen in "enterprise-grade" storage
systems, where you have an amount of cache which is very quick to
write to. When that fills up you drop to write-through mode, or even
worse into back-to-back mode (think sync write IO).

However, given the cluster size, the way ceph works, replication
factors, etc., the volume you need to write at once can be very big,
and it easily grows with more OSDs/nodes.

OTOH, worst case you are exactly where you started: HDD performance.

Also you can start doing "smart" stuff, like allowing small random IO
into the journal but coalescing it into big writes to the back-end
filesystem. If there are any problems you can just replay the journal.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw limiting requests

2015-10-15 Thread Luis Periquito
I've been trying to find a way to limit the number of requests a user
can make to the radosgw per unit of time - the first thing developers
here did was fire off parallel queries to the radosgw as fast as
possible, making it very slow.

I've looked into quotas, but they only refer to space, objects and buckets.

Is it possible to limit the request rate?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] monitor crashing

2015-10-13 Thread Luis Periquito
It seems I've hit this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1231630

is there any way I can recover this cluster? It worked in our test
cluster, but crashed the production one...
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] monitor crashing

2015-10-13 Thread Luis Periquito
I'm currently running Hammer (0.94.3), created an invalid LRC profile
(a typo in the l= value: it should have been l=4 but was l=3, and now I
don't have enough distinct ruleset-locality buckets) and created a pool.
Is there any way to delete this pool? Remember, I can't start the
ceph-mon...

On Tue, Oct 13, 2015 at 11:56 AM, Luis Periquito <periqu...@gmail.com> wrote:
> It seems I've hit this bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=1231630
>
> is there any way I can recover this cluster? It worked in our test
> cluster, but crashed the production one...
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RGW: Can't download a big file

2015-09-29 Thread Luis Periquito
I'm having some issues downloading a big file (60G+).

After some investigation it seems to be very similar to
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001272.html,
however I'm currently running Hammer 0.94.3. The files were, however,
uploaded when the cluster was running Firefly (IIRC 0.80.7).

From the radosgw --debug-ms=1 --debug-rgw=20 output:
[...]
2015-09-28 16:08:14.596299 7f6361f93700  1 -- 10.248.33.13:0/1004902 -->
10.249.33.12:6814/15506 -- osd_op(client.129331531.0:817
default.26848827.2__shadow_Exchange
Stores/Directors.edb.WA70vanNXi08cmwrdmURjXFIUCgaIVv.35_2 [read 0~4194304]
9.7a8bb21d ack+read+known_if_redirected e59831) v5 -- ?+0 0x7f63e507aca0
con 0x7f63e4d2ef50
2015-09-28 16:08:14.596361 7f6361f93700 20 rados->aio_operate r=0
bl.length=0
2015-09-28 16:08:14.596371 7f6361f93700 20 rados->get_obj_iterate_cb
oid=default.26848827.2__shadow_Exchange
Stores/Directors.edb.WA70vanNXi08cmwrdmURjXFIUCgaIVv.35_3
obj-ofs=1795162112 read_ofs=0 len=4194304
2015-09-28 16:08:14.596385 7f6361f93700  1 -- 10.248.33.13:0/1004902 -->
10.249.33.14:6825/14173 -- osd_op(client.129331531.0:818
default.26848827.2__shadow_Exchange
Stores/Directors.edb.WA70vanNXi08cmwrdmURjXFIUCgaIVv.35_3 [read 0~4194304]
9.6407682a ack+read+known_if_redirected e59831) v5 -- ?+0 0x7f63e4e77370
con 0x7f63e4d8aa60
2015-09-28 16:08:14.596421 7f6361f93700 20 rados->aio_operate r=0
bl.length=0
2015-09-28 16:08:14.596445 7f6361f93700 20 rados->get_obj_iterate_cb
oid=default.26848827.2__shadow_Exchange
Stores/Directors.edb.WA70vanNXi08cmwrdmURjXFIUCgaIVv.35_4
obj-ofs=1799356416 read_ofs=0 len=4194304
2015-09-28 16:08:14.596474 7f6361f93700  1 -- 10.248.33.13:0/1004902 -->
10.249.33.39:6805/28145 -- osd_op(client.129331531.0:819
default.26848827.2__shadow_Exchange
Stores/Directors.edb.WA70vanNXi08cmwrdmURjXFIUCgaIVv.35_4 [read 0~4194304]
9.ceaa43a4 ack+read+known_if_redirected e59831) v5 -- ?+0 0x7f63e4e77370
con 0x7f63e4f94220
2015-09-28 16:08:14.596502 7f6361f93700 20 rados->aio_operate r=0
bl.length=0
2015-09-28 16:08:14.597073 7f6337d30700  1 -- 10.248.33.13:0/1004902 <==
osd.48 10.248.33.12:6821/27677 9  osd_op_reply(816
default.26848827.2__shadow_Exchange
Stores/Directors.edb.WA70vanNXi08cmwrdmURjXFIUCgaIVv.35_1 [read 0~4194304]
v0'0 uv0 ack = -2 ((2) No such file or directory)) v6  260+0+0
(3636490520 0 0) 0x7f6414011d30 con 0x7f63e4d9dc10
2015-09-28 16:08:14.597152 7f640bfff700 20 get_obj_aio_completion_cb: io
completion ofs=1786773504 len=4194304
2015-09-28 16:08:14.597159 7f640bfff700  0 ERROR: got unexpected error when
trying to read object: -2
2015-09-28 16:08:14.597184 7f6361f93700 20 get_obj_data::cancel_all_io()
2015-09-28 16:08:14.597234 7f6339641700  1 -- 10.248.33.13:0/1004902 <==
osd.42 10.249.33.12:6814/15506 8  osd_op_reply(817
default.26848827.2__shadow_Exchange
Stores/Directors.edb.WA70vanNXi08cmwrdmURjXFIUCgaIVv.35_2 [read 0~4194304]
v0'0 uv0 ack = -2 ((2) No such file or directory)) v6  260+0+0
(4010761036 0 0) 0x7f63fc011820 con 0x7f63e4d2ef50
2015-09-28 16:08:14.597257 7f640bfff700 20 get_obj_aio_completion_cb: io
completion ofs=1790967808 len=4194304
2015-09-28 16:08:14.597263 7f640bfff700  0 ERROR: got unexpected error when
trying to read object: -2
2015-09-28 16:08:14.597298 7f6336320700  1 -- 10.248.33.13:0/1004902 <==
osd.23 10.249.33.14:6825/14173 8  osd_op_reply(818
default.26848827.2__shadow_Exchange
Stores/Directors.edb.WA70vanNXi08cmwrdmURjXFIUCgaIVv.35_3 [read 0~4194304]
v0'0 uv0 ack = -2 ((2) No such file or directory)) v6  260+0+0 (8601650
0 0) 0x7f63ec015050 con 0x7f63e4d8aa60
2015-09-28 16:08:14.597323 7f640bfff700 20 get_obj_aio_completion_cb: io
completion ofs=1795162112 len=4194304
2015-09-28 16:08:14.597326 7f640bfff700  0 ERROR: got unexpected error when
trying to read object: -2
2015-09-28 16:08:14.597177 7f6336d26700  1 -- 10.248.33.13:0/1004902 <==
osd.66 10.249.33.39:6805/28145 9  osd_op_reply(819
default.26848827.2__shadow_Exchange
Stores/Directors.edb.WA70vanNXi08cmwrdmURjXFIUCgaIVv.35_4 [read 0~4194304]
v0'0 uv0 ack = -2 ((2) No such file or directory)) v6  260+0+0
(3449160520 0 0) 0x7f63d80173e0 con 0x7f63e4f94220
2015-09-28 16:08:14.597338 7f640bfff700 20 get_obj_aio_completion_cb: io
completion ofs=1799356416 len=4194304
2015-09-28 16:08:14.597339 7f640bfff700  0 ERROR: got unexpected error when
trying to read object: -2
2015-09-28 16:08:14.597507 7f6361f93700  2 req 1:32.448590:s3:GET
/Exchange%20Stores/Directors.edb:get_obj:http status=404
2015-09-28 16:08:14.597515 7f6361f93700  1 == req done
req=0x7f6420019d70 http_status=404 ==


Doing a rados ls -p .rgw.buckets and searching for the ".35_" parts, I
only get these:

default.26848827.2__shadow_Exchange
Stores/Directors.edb.WA70vanNXi08cmwrdmURjXFIUCgaIVv.35_11
default.26848827.2__shadow_Exchange
Stores/Directors.edb.WA70vanNXi08cmwrdmURjXFIUCgaIVv.35_10
default.26848827.2__shadow_Exchange
Stores/Directors.edb.WA70vanNXi08cmwrdmURjXFIUCgaIVv.35_12

[ceph-users] radosgw Storage policies

2015-09-28 Thread Luis Periquito
Hi All,

I was listening to the Ceph talk about radosgw in which Yehuda mentions
storage policies. I started looking in the documentation for how to
implement/use them and couldn't find much information:
http://docs.ceph.com/docs/master/radosgw/s3/ says it doesn't currently
support it, and http://docs.ceph.com/docs/master/radosgw/swift/ doesn't
mention it.

From the release notes it seems to be for the Swift interface, not S3. Is
this correct? Can we create them for the S3 interface, or only Swift?


thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw and keystone version 3 domains

2015-09-25 Thread Luis Periquito
I'm having the exact same issue, and after looking into it, it seems that
radosgw is hardcoded to authenticate using the v2 API.

from the config file: rgw keystone url = http://openstackcontrol.lab:35357/

the "/v2.0/" is hardcoded and gets appended to the authentication request.

a snippet taken from radosgw (ran with "-d --debug-ms=1 --debug-rgw=20"
options)

2015-09-25 12:40:00.359333 7ff4bcf61700  1 == starting new request
req=0x7ff57801b810 =
2015-09-25 12:40:00.359355 7ff4bcf61700  2 req 1:0.21::GET
/swift/v1::initializing
2015-09-25 12:40:00.359358 7ff4bcf61700 10 host=s3.lab.tech.lastmile.com
2015-09-25 12:40:00.359363 7ff4bcf61700 20 subdomain= domain=
s3.lab.tech.lastmile.com in_hosted_domain=1
2015-09-25 12:40:00.359400 7ff4bcf61700 10 ver=v1 first= req=
2015-09-25 12:40:00.359410 7ff4bcf61700 10 s->object= s->bucket=
2015-09-25 12:40:00.359419 7ff4bcf61700  2 req 1:0.85:swift:GET
/swift/v1::getting op
2015-09-25 12:40:00.359422 7ff4bcf61700  2 req 1:0.89:swift:GET
/swift/v1:list_buckets:authorizing
2015-09-25 12:40:00.359428 7ff4bcf61700 20
token_id=6b67585266ce4aee9e326e72c81865dd
2015-09-25 12:40:00.359451 7ff4bcf61700 20 sending request to
http://openstackcontrol.lab:35357/v2.0/tokens/6b67585266ce4aee9e326e72c81865dd
2015-09-25 12:40:00.377066 7ff4bcf61700 20 received response: {"error":
{"message": "Non-default domain is not supported (Disable debug mode to
suppress these details.)", "code": 401, "title": "Unauthorized"}}
2015-09-25 12:40:00.377175 7ff4bcf61700  0 user does not hold a matching
role; required roles: admin, Member, _member_
2015-09-25 12:40:00.377179 7ff4bcf61700 10 failed to authorize request
2015-09-25 12:40:00.377216 7ff4bcf61700  2 req 1:0.017883:swift:GET
/swift/v1:list_buckets:http status=401
2015-09-25 12:40:00.377219 7ff4bcf61700  1 == req done
req=0x7ff57801b810 http_status=401 ==


From this it seems that radosgw doesn't support auth v3! Are there any
plans to add that support?


On Sat, Sep 19, 2015 at 6:56 AM, Shinobu Kinjo  wrote:

> What's error message you saw when you tried?
>
> Shinobu
>
> - Original Message -
> From: "Abhishek L" 
> To: "Robert Duncan" 
> Cc: ceph-us...@ceph.com
> Sent: Friday, September 18, 2015 12:29:20 PM
> Subject: Re: [ceph-users] radosgw and keystone version 3 domains
>
> On Fri, Sep 18, 2015 at 4:38 AM, Robert Duncan 
> wrote:
> >
> > Hi
> >
> >
> >
> > It seems that radosgw cannot find users in Keystone V3 domains, that is,
> >
> > When keystone is configured for domain specific  drivers radossgw cannot
> find the users in the keystone users table (as they are not there)
> >
> > I have a deployment in which ceph providers object block ephemeral and
> user storage, however any user outside of the ‘default’ sql backed domain
> cannot be found by radosgw.
> >
> > Has anyone seen this issue before when using ceph in openstack? Is it
> possible to configure radosgw to use a keystone v3 url?
>
> I'm not sure whether keystone v3 support for radosgw is there yet,
> particularly for the swift api. Currently keystone v2 api is supported,
> and due to the change in format between v2 and v3 tokens, I'm not sure
> whether swift apis will work with v3 yet, though keystone v3 *might*
> just work on the s3 interface due to the different format used.
>
>
> >
> >
> > Thanks,
> >
> > Rob.
> >
> > 
> >
> > The information contained and transmitted in this e-mail is confidential
> information, and is intended only for the named recipient to which it is
> addressed. The content of this e-mail may not have been sent with the
> authority of National College of Ireland. Any views or opinions presented
> are solely those of the author and do not necessarily represent those of
> National College of Ireland. If the reader of this message is not the named
> recipient or a person responsible for delivering it to the named recipient,
> you are notified that the review, dissemination, distribution,
> transmission, printing or copying, forwarding, or any other use of this
> message or any part of it, including any attachments, is strictly
> prohibited. If you have received this communication in error, please delete
> the e-mail and destroy all record of this communication. Thank you for your
> assistance.
> >
> > 
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list

Re: [ceph-users] radosgw and keystone version 3 domains

2015-09-25 Thread Luis Periquito
This was reported in http://tracker.ceph.com/issues/8052 about a year ago.
This ticket hasn't been updated...

On Fri, Sep 25, 2015 at 1:37 PM, Robert Duncan <robert.dun...@ncirl.ie>
wrote:

> I would be interested if anyone even has a work around to this - no matter
> how arcane.
> If anyone gets this to work I would be most obliged
>
> -Original Message-
> From: Shinobu Kinjo [mailto:ski...@redhat.com]
> Sent: 25 September 2015 13:31
> To: Luis Periquito
> Cc: Abhishek L; Robert Duncan; ceph-users
> Subject: Re: [ceph-users] radosgw and keystone version 3 domains
>
> Thanks for the info.
>
> Shinobu
>
> - Original Message -
> From: "Luis Periquito" <periqu...@gmail.com>
> To: "Shinobu Kinjo" <ski...@redhat.com>
> Cc: "Abhishek L" <abhishek.lekshma...@gmail.com>, "Robert Duncan" <
> robert.dun...@ncirl.ie>, "ceph-users" <ceph-us...@ceph.com>
> Sent: Friday, September 25, 2015 8:52:48 PM
> Subject: Re: [ceph-users] radosgw and keystone version 3 domains
>
> I'm having the exact same issue, and after looking it seems that radosgw
> is hardcoded to authenticate using v2 api.
>
> from the config file: rgw keystone url =
> http://openstackcontrol.lab:35357/
>
> the "/v2.0/" is hardcoded and gets appended to the authentication request.
>
> a snippet taken from radosgw (ran with "-d --debug-ms=1 --debug-rgw=20"
> options)
>
> 2015-09-25 12:40:00.359333 7ff4bcf61700  1 == starting new request
> req=0x7ff57801b810 =
> 2015-09-25 12:40:00.359355 7ff4bcf61700  2 req 1:0.21::GET
> /swift/v1::initializing
> 2015-09-25 12:40:00.359358 7ff4bcf61700 10 host=s3.lab.tech.lastmile.com
> 2015-09-25 12:40:00.359363 7ff4bcf61700 20 subdomain= domain=
> s3.lab.tech.lastmile.com in_hosted_domain=1
> 2015-09-25 12:40:00.359400 7ff4bcf61700 10 ver=v1 first= req=
> 2015-09-25 12:40:00.359410 7ff4bcf61700 10 s->object=
> s->bucket=
> 2015-09-25 12:40:00.359419 7ff4bcf61700  2 req 1:0.85:swift:GET
> /swift/v1::getting op
> 2015-09-25 12:40:00.359422 7ff4bcf61700  2 req 1:0.89:swift:GET
> /swift/v1:list_buckets:authorizing
> 2015-09-25 12:40:00.359428 7ff4bcf61700 20
> token_id=6b67585266ce4aee9e326e72c81865dd
> 2015-09-25 12:40:00.359451 7ff4bcf61700 20 sending request to
> http://openstackcontrol.lab:35357/v2.0/tokens/6b67585266ce4aee9e326e72c81865dd
> 2015-09-25 12:40:00.377066 7ff4bcf61700 20 received response: {"error":
> {"message": "Non-default domain is not supported (Disable debug mode to
> suppress these details.)", "code": 401, "title": "Unauthorized"}}
> 2015-09-25 12:40:00.377175 7ff4bcf61700  0 user does not hold a matching
> role; required roles: admin, Member, _member_
> 2015-09-25 12:40:00.377179 7ff4bcf61700 10 failed to authorize request
> 2015-09-25 12:40:00.377216 7ff4bcf61700  2 req 1:0.017883:swift:GET
> /swift/v1:list_buckets:http status=401
> 2015-09-25 12:40:00.377219 7ff4bcf61700  1 == req done
> req=0x7ff57801b810 http_status=401 ==
>
>
> From this it seems that radosgw doesn't support auth v3! Are there any
> plans to add that support?
>
>
> On Sat, Sep 19, 2015 at 6:56 AM, Shinobu Kinjo <ski...@redhat.com> wrote:
>
> > What's error message you saw when you tried?
> >
> > Shinobu
> >
> > - Original Message -
> > From: "Abhishek L" <abhishek.lekshma...@gmail.com>
> > To: "Robert Duncan" <robert.dun...@ncirl.ie>
> > Cc: ceph-us...@ceph.com
> > Sent: Friday, September 18, 2015 12:29:20 PM
> > Subject: Re: [ceph-users] radosgw and keystone version 3 domains
> >
> > On Fri, Sep 18, 2015 at 4:38 AM, Robert Duncan
> > <robert.dun...@ncirl.ie>
> > wrote:
> > >
> > > Hi
> > >
> > >
> > >
> > > It seems that radosgw cannot find users in Keystone V3 domains, that
> > > is,
> > >
> > > When keystone is configured for domain specific  drivers radossgw
> > > cannot
> > find the users in the keystone users table (as they are not there)
> > >
> > > I have a deployment in which ceph providers object block ephemeral
> > > and
> > user storage, however any user outside of the ‘default’ sql backed
> > domain cannot be found by radosgw.
> > >
> > > Has anyone seen this issue before when using ceph in openstack? Is
> > > it
> > possible to configure radosgw to use a keystone v3 url?
> >
> > I'm not sure whether keystone v3 support for radosgw is there yet,
> > particularly for the swift api

[ceph-users] EC pool design

2015-09-09 Thread Luis Periquito
I'm in the process of adding more resources to an existing cluster.

I'll have 38 hosts, with 2 HDDs each, for an EC pool. I plan on adding a
cache pool in front of it (is it worth it? It's S3 data, mostly writes, and
objects are usually 200kB up to several MB/GB...); all of the hosts are on
the same rack. All the other pools will go onto a separate set of SSD-based
OSDs and will be replicated.

I was reading Somnath's email regarding the performance of different EC
backends, and he compares the jerasure performance with different plugins.

This cluster is currently Hammer, so I was looking at using LRC. Is it
worth using LRC over standard jerasure? What would be a good k and m? I was
thinking k=12, m=4, l=4 as I have more than enough hosts for these values,
but what if I lose more than one host? Will LRC still be able to recover
using the "adjacent" group?

And what about performance? From Somnath's email it seemed the bigger the k
and m the worse it would perform...

What are the usual values you all use?

PS: I still haven't seen Mark Nelson performance presentation...
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Defective Gbic brings whole Cluster down

2015-08-28 Thread Luis Periquito
I've seen one misbehaving OSD stop all the IO in a cluster... I've had a
situation where everything seemed fine with the OSD/node but the cluster
was grinding to a halt. There was no iowait, the disk wasn't very busy, it
wasn't doing recoveries, it was up+in, no scrubs... Restarting the OSD and
everything recovered like magic...

On Thu, Aug 27, 2015 at 8:38 PM, Robert LeBlanc rob...@leblancnet.us
wrote:

 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA256

 +1

 :)

 - 
 Robert LeBlanc
 PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

 On Thu, Aug 27, 2015 at 1:16 PM, Jan Schermer  wrote:
 Well, there's no other way to get reliable performance and SLAs compared to 
 traditional storage when what you work with is commodity hardware in a mesh-y 
 configuration.
 And we do like the idea of killing the traditional storage, right? I think 
 80s called already and wanted their SAN back...

 Jan

  On 27 Aug 2015, at 21:01, Robert LeBlanc  wrote:
 
  -BEGIN PGP SIGNED MESSAGE-
  Hash: SHA256
 
  I know writing to min_size as sync and size-min_size as async has been
  discussed before and would help here. From what I understand required
  a lot of code changes and goes against the strong consistency model of
  Ceph. I'm not sure if it will be implemented although I do love this
  idea to help against tail latency.
  - 
  Robert LeBlanc
  PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
 
 
  On Thu, Aug 27, 2015 at 12:48 PM, Jan Schermer  wrote:
  Don't kick out the node, just deal with it gracefully and without 
  interruption... if the IO reached the quorum number of OSDs then there's 
  no need to block anymore, just queue it. Reads can be mirrored or retried 
  (much quicker, because making writes idempotent, ordered and async is 
  pretty hard and expensive).
  If there's an easy way to detect unreliable OSD that flaps - great, let's 
  have a warning in ceph health.
 
  Jan
 
  On 27 Aug 2015, at 20:43, Robert LeBlanc  wrote:
 
  -BEGIN PGP SIGNED MESSAGE-
  Hash: SHA256
 
  This has been discussed a few times. The consensus seems to be to make
  sure error rates of NICs or other such metrics are included in your
  monitoring solution. It would also be good to preform periodic network
  tests like a full size ping with nofrag set between all nodes and have
  your monitoring solution report that as well.
 
  Although I would like to see such a feature in Ceph, the concern is
  that such a feature can quickly get out of hand and that something
  else that is really designed for it should do it. I can understand
  where they are coming from in that regard, but having Ceph kick out a
  misbehaving node quickly is appealing as well (there would have to be
  a way to specify that only so many nodes could be kicked out).
  - 
  Robert LeBlanc
  PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
 
 
  On Thu, Aug 27, 2015 at 9:37 AM, Christoph Adomeit  wrote:
  Hello Ceph Users,
 
  yesterday I had a defective Gbic in 1 node of my 10 node ceph cluster.
 
  The Gbic was working somehow but had 50% packet-loss. Some packets went 
  through, some did not.
 
  What happend that the whole cluster did not service requests in time, 
  there were lots of timeouts and so on
  until the problem was isolated. Monitors and osds where asked for data 
  but did dot answer or answer late.
 
  I am wondering, here we have a highly redundant network setup and a 
  highly redundant piece of software, but a small
  network fault brings down the whole cluster.
 
  Is there anything that can be configured or changed in ceph so that 
  availability will become better in case of flapping networks ?
 
  I understand, it is not a ceph problem but a network problem but maybe 
  something can be learned from such incidents  ?
 
  Thanks
  Christoph
  --
  Christoph Adomeit
  GATWORKS GmbH
  Reststrauch 191
  41199 Moenchengladbach
  Sitz: Moenchengladbach
  Amtsgericht Moenchengladbach, HRB 6303
  Geschaeftsfuehrer:
  Christoph Adomeit, Hans Wilhelm Terstappen
 
  christoph.adom...@gatworks.de Internetloesungen vom Feinsten
  Fon. +49 2166 9149-32  Fax. +49 2166 9149-10

  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

Re: [ceph-users] RadosGW - multiple dns names

2015-08-26 Thread Luis Periquito
On Mon, Feb 23, 2015 at 10:18 PM, Yehuda Sadeh-Weinraub yeh...@redhat.com
wrote:



 --

 *From: *Shinji Nakamoto shinji.nakam...@mgo.com
 *To: *ceph-us...@ceph.com
 *Sent: *Friday, February 20, 2015 3:58:39 PM
 *Subject: *[ceph-users] RadosGW - multiple dns names

 We have multiple interfaces on our Rados gateway node, each of which is
 assigned to one of our many VLANs with a unique IP address.

 Is it possible to set multiple DNS names for a single Rados GW, so it can
 handle the request to each of the VLAN specific IP address DNS names?


 Not yet, however, the upcoming hammer release will support that (hostnames
 will be configured as part of the region).


I tested this using Hammer (0.94.2) and it doesn't seem to work. I'm just
adding multiple "rgw dns name" lines to the configuration. Did it make
Hammer, or am I doing it the wrong way? I couldn't find any docs either
way...
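
Based on Yehuda's note above that hostnames are now part of the region, what
I'll try next is something like this (the syntax is my best guess from the
hammer tooling, so please correct me if it's wrong):

  radosgw-admin region get > region.json
  # edit region.json and add the names to its "hostnames" list, e.g.
  #   "hostnames": ["prd-apiceph001", "prd-backendceph001"]
  radosgw-admin region set < region.json
  radosgw-admin regionmap update
  # then restart the radosgw daemons so they pick up the new region map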



 Yehuda


 eg.
 rgw dns name = prd-apiceph001
 rgw dns name = prd-backendceph001
 etc.




 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating data into a newer ceph instance

2015-08-26 Thread Luis Periquito
I would say the easiest way would be to leverage all the self-healing of
ceph: add the new nodes to the old cluster, allow or force all the data to
migrate between nodes, and then remove the old ones.

Well, to be fair you could probably just install radosgw on another node and
use it as your gateway without the need to even create a new OSD node...

Or was there a reason to create a new cluster? I can tell you that one of
the clusters I have has been around since bobtail, and it's now on hammer...
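
If you go the add-new/remove-old route, the removal side is roughly this per
old OSD (osd.0 as an example; wait for the cluster to get back to
active+clean between steps):

  ceph osd crush reweight osd.0 0     # drain the data off gradually
  ceph osd out 0
  stop ceph-osd id=0                  # upstart name, adjust for your init system
  ceph osd crush remove osd.0
  ceph auth del osd.0
  ceph osd rm 0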

On Wed, Aug 26, 2015 at 2:50 PM, Chang, Fangzhe (Fangzhe) 
fangzhe.ch...@alcatel-lucent.com wrote:

 Hi,



 We have been running Ceph/Radosgw version 0.80.7 (Giant) and stored quite
 some amount of data in it. We are only using ceph as an object store via
 radosgw. Last week cheph-radosgw daemon suddenly refused to start (with
 logs only showing “initialization timeout” error on Centos 7).  This
 triggers me to install a newer instance --- Ceph/Radosgw version 0.94.2
 (Hammer). The new instance has a different set of key rings by default. The
 next step is to have all the data migrated. Does anyone know how to get the
 existing data out of the old ceph  cluster (Giant) and into the new
 instance (Hammer)? Please note that in the old three-node cluster ceph osd
 is still running but radosgw is not. Any suggestion will be greatly
 appreciated.

 Thanks.



 Regards,



 Fangzhe Chang







 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating data into a newer ceph instance

2015-08-26 Thread Luis Periquito
I tend not to do too much at a time: either upgrade or migrate data, not
both. The actual upgrade process is seamless... So you can just as easily
upgrade the current cluster to hammer, and add/remove nodes on the fly. All
of this is quite straightforward (other than the data migration itself).

On Wed, Aug 26, 2015 at 3:17 PM, Chang, Fangzhe (Fangzhe) 
fangzhe.ch...@alcatel-lucent.com wrote:

 Thanks, Luis.



 The motivation for using the newer version is to keep up-to-date with Ceph
 development, since we suspect the old versioned radosgw could not be
 restarted possibly due to library mismatch.

 Do you know whether the self-healing feature of ceph is applicable between
 different versions or not?



 Fangzhe



 *From:* Luis Periquito [mailto:periqu...@gmail.com]
 *Sent:* Wednesday, August 26, 2015 10:11 AM
 *To:* Chang, Fangzhe (Fangzhe)
 *Cc:* ceph-users@lists.ceph.com
 *Subject:* Re: [ceph-users] Migrating data into a newer ceph instance



 I Would say the easiest way would be to leverage all the self-healing of
 ceph: add the new nodes to the old cluster, allow or force all the data to
 migrate between nodes, and then remove the old ones out.



 Well to be fair you could probably just install radosgw on another node
 and use it as your gateway without the need to even create a new OSD node...



 Or was there a reason to create a new cluster? I can tell you that one of
 the clusters I have has been around since bobtail, and now it's hammer...



 On Wed, Aug 26, 2015 at 2:50 PM, Chang, Fangzhe (Fangzhe) 
 fangzhe.ch...@alcatel-lucent.com wrote:

 Hi,



 We have been running Ceph/Radosgw version 0.80.7 (Giant) and stored quite
 some amount of data in it. We are only using ceph as an object store via
 radosgw. Last week cheph-radosgw daemon suddenly refused to start (with
 logs only showing “initialization timeout” error on Centos 7).  This
 triggers me to install a newer instance --- Ceph/Radosgw version 0.94.2
 (Hammer). The new instance has a different set of key rings by default. The
 next step is to have all the data migrated. Does anyone know how to get the
 existing data out of the old ceph  cluster (Giant) and into the new
 instance (Hammer)? Please note that in the old three-node cluster ceph osd
 is still running but radosgw is not. Any suggestion will be greatly
 appreciated.

 Thanks.



 Regards,



 Fangzhe Chang








 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw secret_key

2015-08-24 Thread Luis Periquito
When I create a new user using radosgw-admin, most of the time the secret
key gets escaped with a backslash, making it not work as printed. Something
like secret_key: xx\/\/.

Why would the / need to be escaped? Why is it printing \/ instead of /,
which does work?

Usually I just remove the backslashes and it works fine. I've seen this on
several different clusters.

Is it just me?

This may require opening a bug in the tracking tool, but just asking here
first.
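
For what it's worth, \/ seems to be just JSON's escaped form of /, which is
why removing the backslashes by hand works; any JSON-aware tool will
unescape it for you. A sketch, with the uid being an example:

  radosgw-admin user info --uid=testuser \
    | python -c 'import json,sys; print(json.load(sys.stdin)["keys"][0]["secret_key"])'
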
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD GHz vs. Cores Question

2015-08-22 Thread Luis Periquito
I've been meaning to write an email with the experience we had at the
company I work for. For lack of a more complete one I'll just share some of
the findings. Please note these are my experiences, and are correct for my
environment. The clients are running on OpenStack, and all servers are
trusty. Tests were made with Hammer (0.94.2).

TLDR: if performance is your objective, buy 1S boxes with a high clock
frequency, good journal SSDs, and not many SSDs. Also change the CPU
governor to performance mode instead of the default ondemand. And don't
forget 10Gig is a must. Replicated pools are also a must for performance.

We wanted to have a small cluster (30TB RAW), performance was important
(IOPS and latency), network was designed to be 10G copper with BGP attached
hosts. There was complete leeway in design and some in budget.

Starting with the network: that design required us to create only a single
network, but both links are usable - iperf between boxes usually shows
around 17-19Gbit/s.

We could choose the nodes, and we evaluated dual-CPU and single-CPU nodes.
The dual-CPU option had 24 2.5'' drive bays in a 2U chassis whereas the
single-CPU one had 8 2.5'' drive bays in a 1U chassis. Long story short, we
chose the single CPU (E3 1241 v3). On the CPU side, all the tests we did
with the scaling governors showed that the performance governor would give
us a 30-50% boost in IOPS. Latency also improved, but not by much. The
downside was that each system increased power usage by 5W (!?).
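
For reference, the governor switch itself is just this (persist it via your
config management of choice):

  # switch every core from ondemand to performance
  for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > "$g"
  done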

For the difference in price (£80) we bought the boxes with 32G of ram.

As for the disks, as we wanted fast IO we had to go with SSDs. Due to the
budget we had we went with 4x Samsung 850 PRO + 1x Intel S3710 200G. We
also tested the P3600, but one of the critical IO clients had far worse
performance with it. From benchmarking, the write performance is that of
the Intel SSD. We ran tests with an Intel SSD for the journal + a different
Intel SSD for data, and performance was, within the margin of error, the
same as with an Intel SSD for the journal + a Samsung SSD for data.
Single-SSD performance was slightly lower with either one (around 10%).

From what I've seen: on very big sequential read and write I can get up to
700-800 MBps. On random IO (8k, random writes, reads or mixed workloads) we
still haven't finished all the tests, but so far it indicates the SSDs are
the bottleneck on the writes, and ceph latency on the reads. However we've
been able to extract 400 MBps read IO with 4 clients, each doing 32
threads. I don't have the numbers here but that represents around 50k IOPS
out of a smallish cluster.

Stuff we still have to do revolves around jemalloc vs tcmalloc - trusty has
the bug on the thread cache bytes variable. Also we still have to test
various tunable options, like threads, caches, etc...

Hope this helps.


On Sat, Aug 22, 2015 at 4:45 PM, Nick Fisk n...@fisk.me.uk wrote:

 Another thing that is probably worth considering is the practical side as
 well. A lot of the Xeon E5 boards tend to have more SAS/SATA ports and
 onboard 10GB, this can make quite a difference to the overall cost of the
 solution if you need to buy extra PCI-E cards.

 Unless I've missed one, I've not spotted a Xeon-D board with a large amount
 of onboard sata/sas ports. Please let me know if such a system exists as I
 would be very interested.

 We settled on the Hadoop version of the Supermicro Fat Twin. 12 x 3.5
 disks
 + 2x 2.5 SSD's per U, onboard 10GB-T and the fact they share chassis and
 PSU's keeps the price down. For bulk storage one of these with a single 8
 core low clocked E5 Xeon is ideal in my mind. I did a spreadsheet working
 out U space, power and cost per GB for several different types of server,
 this solution came out ahead in nearly every category.

 If there is a requirement for a high perf SSD tier I would probably look at
 dedicated SSD nodes as I doubt you could cram enough CPU power into a
 single
 server to drive 12xSSD's.

 You mentioned low latency was a key requirement, is this always going to be
 at low queue depths? If you just need very low latency but won't actually
 be
 driving the SSD's very hard you will probably find a very highly clocked E3
 is the best bet with 2-4 SSD's per node. However if you drive the SSD's
 hard, a single one can easily max out several cores.

  -Original Message-
  From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
  Mark Nelson
  Sent: 22 August 2015 00:00
  To: ceph-users@lists.ceph.com
  Subject: Re: [ceph-users] OSD GHz vs. Cores Question
 
  FWIW, we recently were looking at a couple of different options for the
  machines in our test lab that run the nightly QA suite jobs via
 teuthology.
 
   From a cost/benefit perspective, I think it really comes down to
 something
  like a XEON E3-12XXv3 or the new XEON D-1540, each of which have
  advantages/disadvantages.
 
  We were very tempted by the Xeon D but it was still just a little too new
 for
  us so we ended up going with servers using more standard E3 processors.
  The Xeon 

Re: [ceph-users] Question

2015-08-17 Thread Luis Periquito
Yes. The issue is resource sharing, as usual: the MONs will use disk I/O,
memory and CPU. If the cluster is small (a test?) then there's no problem in
using the same disks. If the cluster starts to get bigger you may want to
dedicate resources (e.g. the disk for the MONs isn't used by an OSD). If
the cluster is big enough you may want to dedicate a node to being a MON.

On Mon, Aug 17, 2015 at 2:56 PM, Kris Vaes k...@s3s.eu wrote:

 Hi,

 Maybe this seems like a strange question, but I could not find this info in
 the docs. I have the following question:

 For a Ceph cluster you need OSD daemons and monitor daemons.

 On a host you can run several OSD daemons (best one per drive, as the docs
 recommend).

 But now my question: can you also run the monitor daemon on a host where you
 already run some OSD daemons?

 Is this possible, and what are the implications of doing so?



 Met Vriendelijke Groeten
 Cordialement
 Kind Regards
 Cordialmente
 С приятелски поздрави


 This message (including any attachments) may be privileged or
 confidential. If you have received it by mistake, please notify the sender
 by return e-mail and delete this message from your system. Any unauthorized
 use or dissemination of this message in whole or in part is strictly
 prohibited. S3S rejects any liability for the improper, incomplete or
 delayed transmission of the information contained in this message, as well
 as for damages resulting from this e-mail message. S3S cannot guarantee
 that the message received by you has not been intercepted by third parties
 and/or manipulated by computer programs used to transmit messages and
 viruses.

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] НА: tcmalloc use a lot of CPU

2015-08-17 Thread Luis Periquito
How big are those ops? Are they random? How many nodes? How many SSDs/OSDs?
What are you using to run the tests? Using atop on the OSD nodes, where is
your bottleneck?

On Mon, Aug 17, 2015 at 1:05 PM, Межов Игорь Александрович me...@yuterra.ru
 wrote:

 Hi!

 We also observe the same behavior on our test Hammer install, and I wrote
 about it some time ago:

 http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/22609

 Jan Schermer gave us some suggestions in that thread, but we still haven't
 got any positive results - TCMalloc usage is high. The usage drops to ~10%
 when we disable CRC in messages, disable debug and disable cephx auth,
 but that is of course not for production use. We also got a different
 trace while performing FIO-RBD benchmarks on the SSD pool:
 ---
   46,07%  [kernel]  [k] _raw_spin_lock
6,51%  [kernel]  [k] mb_cache_entry_alloc
5,74%  libtcmalloc.so.4.2.2  [.]
 tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**)
5,50%  libtcmalloc.so.4.2.2  [.] tcmalloc::SLL_Next(void*)
3,86%  libtcmalloc.so.4.2.2  [.] TCMalloc_PageMap335::get(unsigned
 long) const
2,73%  libtcmalloc.so.4.2.2  [.]
 tcmalloc::CentralFreeList::ReleaseToSpans(void*)
0,69%  libtcmalloc.so.4.2.2  [.]
 tcmalloc::CentralFreeList::ReleaseListToSpans(void*)
0,69%  libtcmalloc.so.4.2.2  [.]
 tcmalloc::PageHeap::GetDescriptor(unsigned long) const
0,64%  libtcmalloc.so.4.2.2  [.] tcmalloc::SLL_PopRange(void**, int,
 void**, void**)
 ---

 I don't clearly understand what's happening in this case: the SSD pool is
 connected to the same host, but a different controller (C60X onboard instead
 of LSI 2208), the I/O scheduler is set to noop, and the pool is built from
 4 x 400GB Intel DC S3700, so I think it should perform better - more
 than 30-40 kIOPS. But we got the trace above and no more than 12-15 kIOPS.
 Where could the problem be?






 Megov Igor
 CIO, Yuterra

 --
 *From:* ceph-users ceph-users-boun...@lists.ceph.com on behalf of YeYin 
 ey...@qq.com
 *Sent:* 17 August 2015 12:58
 *To:* ceph-users
 *Subject:* [ceph-users] tcmalloc use a lot of CPU

 Hi, all,
   When I do a performance test with rados bench, I found that tcmalloc
 consumed a lot of CPU:

 Samples: 265K of event 'cycles', Event count (approx.): 104385445900
 +  27.58%  libtcmalloc.so.4.1.0[.]
 tcmalloc::CentralFreeList::FetchFromSpans()
 +  15.25%  libtcmalloc.so.4.1.0[.]
 tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
 unsigned long,
 +  12.20%  libtcmalloc.so.4.1.0[.]
 tcmalloc::CentralFreeList::ReleaseToSpans(void*)
 +   1.63%  perf[.] append_chain
 +   1.39%  libtcmalloc.so.4.1.0[.]
 tcmalloc::CentralFreeList::ReleaseListToSpans(void*)
 +   1.02%  libtcmalloc.so.4.1.0[.]
 tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)
 +   0.85%  libtcmalloc.so.4.1.0[.] 0x00017e6f
 +   0.75%  libtcmalloc.so.4.1.0[.]
 tcmalloc::ThreadCache::IncreaseCacheLimitLocked()
 +   0.67%  libc-2.12.so[.] memcpy
 +   0.53%  libtcmalloc.so.4.1.0[.] operator delete(void*)

 Ceph version:
 # ceph --version
 ceph version 0.87.2 (87a7cec9ab11c677de2ab23a7668a77d2f5b955e)

 Kernel version:
 3.10.83

 Is this phenomenon normal? Does anyone have any idea about this problem?

 Thanks.
 Ye


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph distributed osd

2015-08-17 Thread Luis Periquito
I don't quite understand your question. You created a 1G RBD/disk and it's full.
You are able to grow it, though - but that's a Linux management task, not a
Ceph one.

As everything is thin-provisioned you can create an RBD with an arbitrary
size - I've created one of 1PB when the cluster only had 600GB of raw space
available.
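
For reference, a rough sketch of the grow workflow (the pool/image names are
taken from the 'rbd showmapped' output quoted below; on this Hammer-era
release 'rbd resize --size' takes megabytes):

  # grow the 1 GB image to 10 GB
  rbd resize --size 10240 repo/integrepotest
  # then grow the filesystem on top with its own tool
  # (resize2fs for ext4, xfs_growfs for XFS, tunefs.ocfs2 for OCFS2)

Thin provisioning means an image only consumes space as data is written, which
is why something like 'rbd create repo/bigtest --size 1073741824' (a nominal
1 PB image) works even on a cluster with far less raw capacity.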

On Mon, Aug 17, 2015 at 1:18 PM, gjprabu gjpr...@zohocorp.com wrote:

 Hi All,

    Can anybody help with this issue?

 Regards
 Prabu

  On Mon, 17 Aug 2015 12:08:28 +0530 *gjprabu gjpr...@zohocorp.com
 gjpr...@zohocorp.com* wrote 

 Hi All,

Also please find osd information.

 ceph osd dump | grep 'replicated size'
 pool 2 'repo' replicated size 2 min_size 2 crush_ruleset 0 object_hash
 rjenkins pg_num 126 pgp_num 126 last_change 21573 flags hashpspool
 stripe_width 0

 Regards
 Prabu




  On Mon, 17 Aug 2015 11:58:55 +0530 *gjprabu gjpr...@zohocorp.com
 gjpr...@zohocorp.com* wrote 



 Hi All,

    We need to test three OSDs and one image with replica 2 (size 1GB). While
 testing, data cannot be written beyond 1GB. Is there any option to write to
 the third OSD?

 *ceph osd pool get  repo  pg_num*
 *pg_num: 126*

 *# rbd showmapped *
 *id pool image  snap device*
 *0  rbd  integdownloads -/dev/rbd0 *-- *Already one*
 *2  repo integrepotest  -/dev/rbd2  -- newly created*


 [root@hm2 repository]# df -Th
 Filesystem   Type  Size  Used Avail Use% Mounted on
 /dev/sda5ext4  289G   18G  257G   7% /
 devtmpfs devtmpfs  252G 0  252G   0% /dev
 tmpfstmpfs 252G 0  252G   0% /dev/shm
 tmpfstmpfs 252G  538M  252G   1% /run
 tmpfstmpfs 252G 0  252G   0% /sys/fs/cgroup
 /dev/sda2ext4  488M  212M  241M  47% /boot
 /dev/sda4ext4  1.9T   20G  1.8T   2% /var
 /dev/mapper/vg0-zoho ext4  8.6T  1.7T  6.5T  21% /zoho
 /dev/rbd0ocfs2 977G  101G  877G  11% /zoho/build/downloads
 */dev/rbd2ocfs21000M 1000M 0 100%
 /zoho/build/repository*

 @:~$ scp -r sample.txt root@integ-hm2:/zoho/build/repository/
 root@integ-hm2's password:
 sample.txt
 100% 1024MB   4.5MB/s   03:48
 scp: /zoho/build/repository//sample.txt: *No space left on device*

 Regards
 Prabu




  On Thu, 13 Aug 2015 19:42:11 +0530 *gjprabu gjpr...@zohocorp.com
 gjpr...@zohocorp.com* wrote 



 Dear Team,

  We are using two Ceph OSDs with replica 2 and it is working
 properly. My doubt is this: Pool A's image size will be 10GB and it is
 replicated across two OSDs - what will happen if that size limit is reached?
 Is there any chance of making the data continue to be written on another
 two OSDs?

 Regards
 Prabu







 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon cpu usage

2015-07-25 Thread Luis Periquito
I think I figured it out! All 4 of the OSDs on one host (OSD 107-110) were
sending massive amounts of auth requests to the monitors, seemingly
overwhelming them.

The weird bit is that I removed them (osd crush remove, auth del, osd rm),
dd'd the box and all of its disks, and reinstalled - and guess what? They are
still making a lot of requests to the MONs... this will require some further
investigation.

As this is happening during my holidays, I have just disabled them and will
investigate further when I get back.
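
For anyone debugging something similar: the monitor admin socket can show who
is hammering it. A sketch (assuming the mon id matches the short hostname,
which is the usual convention):

  # list the sessions currently open against this monitor
  ceph daemon mon.$(hostname -s) sessions
  # overall state of this monitor (rank, quorum, etc.)
  ceph daemon mon.$(hostname -s) mon_status

The sessions dump includes the entity name and address of each client, which
should make a misbehaving OSD or host stand out.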


On Fri, Jul 24, 2015 at 11:11 PM, Kjetil Jørgensen kje...@medallia.com
wrote:

 It sounds slightly similar to what I just experienced.

 I had one monitor out of three which seemed to essentially run one core
 at full tilt continuously, and its virtual address space had grown to the
 point where top started calling it Tb. Requests hitting this monitor
 did not get very timely responses (although I don't know whether this was
 happening consistently or only sporadically).

 I ended up re-building the monitor from the two healthy ones I had, which
 made the problem go away for me.

 After-the-fact inspection of the monitor I ripped out clocked its store in at
 1.3Gb, compared to the 250Mb of the other two; after the rebuild they're all
 comparable in size.

 In my case this started out on firefly and persisted after upgrading to
 hammer, which prompted the rebuild, as I suspected it was related to
 something persistent for this monitor.

 I do not have that much more useful to contribute to this discussion,
 since I've more-or-less destroyed any evidence by re-building the monitor.

 Cheers,
 KJ

 On Fri, Jul 24, 2015 at 1:55 PM, Luis Periquito periqu...@gmail.com
 wrote:

 The leveldb is smallish: around 70mb.

 I ran debug mon = 10 for a while,  but couldn't find any interesting
 information. I would run out of space quite quickly though as the log
 partition only has 10g.
 On 24 Jul 2015 21:13, Mark Nelson mnel...@redhat.com wrote:

 On 07/24/2015 02:31 PM, Luis Periquito wrote:

 Now it's official,  I have a weird one!

 Restarted one of the ceph-mons with jemalloc and it didn't make any
 difference. It's still using a lot of cpu and still not freeing up
 memory...

 The issue is that the cluster almost stops responding to requests, and
 if I restart the primary mon (that had almost no memory usage nor cpu)
 the cluster goes back to its merry way responding to requests.

 Does anyone have any idea what may be going on? The worst bit is that I
 have several clusters just like this (well they are smaller), and as we
 do everything with puppet, they should be very similar... and all the
 other clusters are just working fine, without any issues whatsoever...


 We've seen cases where leveldb can't compact fast enough and memory
 balloons, but it's usually associated with extreme CPU usage as well. It
 would be showing up in perf though if that were the case...


 On 24 Jul 2015 10:11, Jan Schermer j...@schermer.cz
 mailto:j...@schermer.cz wrote:

 You don’t (shouldn’t) need to rebuild the binary to use jemalloc. It
 should be possible to do something like

 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 ceph-osd …

 The last time we tried it segfaulted after a few minutes, so YMMV
 and be careful.

 Jan

  On 23 Jul 2015, at 18:18, Luis Periquito periqu...@gmail.com
 mailto:periqu...@gmail.com wrote:

 Hi Greg,

 I've been looking at the tcmalloc issues, but did seem to affect
 osd's, and I do notice it in heavy read workloads (even after the
 patch and
 increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728). This
 is affecting the mon process though.

 looking at perf top I'm getting most of the CPU usage in mutex
 lock/unlock
   5.02% libpthread-2.19.so http://libpthread-2.19.so/[.]
 pthread_mutex_unlock
   3.82%  libsoftokn3.so[.] 0x0001e7cb
   3.46% libpthread-2.19.so http://libpthread-2.19.so/[.]
 pthread_mutex_lock

 I could try to use jemalloc, are you aware of any built binaries?
 Can I mix a cluster with different malloc binaries?


 On Thu, Jul 23, 2015 at 10:50 AM, Gregory Farnum g...@gregs42.com
 mailto:g...@gregs42.com wrote:

 On Thu, Jul 23, 2015 at 8:39 AM, Luis Periquito
 periqu...@gmail.com mailto:periqu...@gmail.com wrote:
  The ceph-mon is already taking a lot of memory, and I ran a
 heap stats
  
  MALLOC:   32391696 (   30.9 MiB) Bytes in use by
 application
  MALLOC: +  27597135872 (26318.7 MiB) Bytes in page heap
 freelist
  MALLOC: + 16598552 (   15.8 MiB) Bytes in central cache
 freelist
  MALLOC: + 14693536 (   14.0 MiB) Bytes in transfer cache
 freelist
  MALLOC: + 17441592 (   16.6 MiB) Bytes in thread cache
 freelists
  MALLOC: +116387992 (  111.0 MiB) Bytes

Re: [ceph-users] ceph-mon cpu usage

2015-07-24 Thread Luis Periquito
The leveldb store is smallish: around 70 MB.

I ran debug mon = 10 for a while, but couldn't find any interesting
information. I would run out of space quite quickly though, as the log
partition only has 10 GB.
On 24 Jul 2015 21:13, Mark Nelson mnel...@redhat.com wrote:

 On 07/24/2015 02:31 PM, Luis Periquito wrote:

 Now it's official,  I have a weird one!

 Restarted one of the ceph-mons with jemalloc and it didn't make any
 difference. It's still using a lot of cpu and still not freeing up
 memory...

 The issue is that the cluster almost stops responding to requests, and
 if I restart the primary mon (that had almost no memory usage nor cpu)
 the cluster goes back to its merry way responding to requests.

 Does anyone have any idea what may be going on? The worst bit is that I
 have several clusters just like this (well they are smaller), and as we
 do everything with puppet, they should be very similar... and all the
 other clusters are just working fine, without any issues whatsoever...


 We've seen cases where leveldb can't compact fast enough and memory
 balloons, but it's usually associated with extreme CPU usage as well. It
 would be showing up in perf though if that were the case...


 On 24 Jul 2015 10:11, Jan Schermer j...@schermer.cz
 mailto:j...@schermer.cz wrote:

 You don’t (shouldn’t) need to rebuild the binary to use jemalloc. It
 should be possible to do something like

 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 ceph-osd …

 The last time we tried it segfaulted after a few minutes, so YMMV
 and be careful.

 Jan

  On 23 Jul 2015, at 18:18, Luis Periquito periqu...@gmail.com
 mailto:periqu...@gmail.com wrote:

 Hi Greg,

 I've been looking at the tcmalloc issues, but did seem to affect
 osd's, and I do notice it in heavy read workloads (even after the
 patch and
 increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728). This
 is affecting the mon process though.

 looking at perf top I'm getting most of the CPU usage in mutex
 lock/unlock
   5.02% libpthread-2.19.so http://libpthread-2.19.so/[.]
 pthread_mutex_unlock
   3.82%  libsoftokn3.so[.] 0x0001e7cb
   3.46% libpthread-2.19.so http://libpthread-2.19.so/[.]
 pthread_mutex_lock

 I could try to use jemalloc, are you aware of any built binaries?
 Can I mix a cluster with different malloc binaries?


 On Thu, Jul 23, 2015 at 10:50 AM, Gregory Farnum g...@gregs42.com
 mailto:g...@gregs42.com wrote:

 On Thu, Jul 23, 2015 at 8:39 AM, Luis Periquito
 periqu...@gmail.com mailto:periqu...@gmail.com wrote:
  The ceph-mon is already taking a lot of memory, and I ran a
 heap stats
  
  MALLOC:   32391696 (   30.9 MiB) Bytes in use by
 application
  MALLOC: +  27597135872 (26318.7 MiB) Bytes in page heap
 freelist
  MALLOC: + 16598552 (   15.8 MiB) Bytes in central cache
 freelist
  MALLOC: + 14693536 (   14.0 MiB) Bytes in transfer cache
 freelist
  MALLOC: + 17441592 (   16.6 MiB) Bytes in thread cache
 freelists
  MALLOC: +116387992 (  111.0 MiB) Bytes in malloc metadata
  MALLOC:   
  MALLOC: =  27794649240 (26507.0 MiB) Actual memory used
 (physical + swap)
  MALLOC: + 26116096 (   24.9 MiB) Bytes released to OS
 (aka unmapped)
  MALLOC:   
  MALLOC: =  27820765336 (26531.9 MiB) Virtual address space used
  MALLOC:
  MALLOC:   5683  Spans in use
  MALLOC: 21  Thread heaps in use
  MALLOC:   8192  Tcmalloc page size
  
 
  after that I ran the heap release and it went back to normal.
  
  MALLOC:   22919616 (   21.9 MiB) Bytes in use by
 application
  MALLOC: +  4792320 (4.6 MiB) Bytes in page heap
 freelist
  MALLOC: + 18743448 (   17.9 MiB) Bytes in central cache
 freelist
  MALLOC: + 20645776 (   19.7 MiB) Bytes in transfer cache
 freelist
  MALLOC: + 18456088 (   17.6 MiB) Bytes in thread cache
 freelists
  MALLOC: +116387992 (  111.0 MiB) Bytes in malloc metadata
  MALLOC:   
  MALLOC: =201945240 (  192.6 MiB) Actual memory used
 (physical + swap)
  MALLOC: + 27618820096 tel:%2B%20%2027618820096 (26339.4
 MiB) Bytes released to OS (aka unmapped)
  MALLOC:   
  MALLOC: =  27820765336 (26531.9 MiB) Virtual address space used
  MALLOC:
  MALLOC:   5639  Spans in use
  MALLOC: 29

Re: [ceph-users] ceph-mon cpu usage

2015-07-24 Thread Luis Periquito
Now it's official, I have a weird one!

I restarted one of the ceph-mons with jemalloc and it didn't make any
difference: it's still using a lot of CPU and still not freeing up memory...

The issue is that the cluster almost stops responding to requests, and if I
restart the primary mon (that had almost no memory usage nor cpu) the
cluster goes back to its merry way responding to requests.

Does anyone have any idea what may be going on? The worst bit is that I
have several clusters just like this (well they are smaller), and as we do
everything with puppet, they should be very similar... and all the other
clusters are just working fine, without any issues whatsoever...
 On 24 Jul 2015 10:11, Jan Schermer j...@schermer.cz wrote:

 You don’t (shouldn’t) need to rebuild the binary to use jemalloc. It
 should be possible to do something like

 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 ceph-osd …

 The last time we tried it segfaulted after a few minutes, so YMMV and be
 careful.

 Jan

 On 23 Jul 2015, at 18:18, Luis Periquito periqu...@gmail.com wrote:

 Hi Greg,

 I've been looking at the tcmalloc issues, but did seem to affect osd's,
 and I do notice it in heavy read workloads (even after the patch and
 increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728). This is
 affecting the mon process though.

 looking at perf top I'm getting most of the CPU usage in mutex lock/unlock
   5.02%  libpthread-2.19.so[.] pthread_mutex_unlock
   3.82%  libsoftokn3.so[.] 0x0001e7cb
   3.46%  libpthread-2.19.so[.] pthread_mutex_lock

 I could try to use jemalloc, are you aware of any built binaries? Can I
 mix a cluster with different malloc binaries?


 On Thu, Jul 23, 2015 at 10:50 AM, Gregory Farnum g...@gregs42.com wrote:

 On Thu, Jul 23, 2015 at 8:39 AM, Luis Periquito periqu...@gmail.com
 wrote:
  The ceph-mon is already taking a lot of memory, and I ran a heap stats
  
  MALLOC:   32391696 (   30.9 MiB) Bytes in use by application
  MALLOC: +  27597135872 (26318.7 MiB) Bytes in page heap freelist
  MALLOC: + 16598552 (   15.8 MiB) Bytes in central cache freelist
  MALLOC: + 14693536 (   14.0 MiB) Bytes in transfer cache freelist
  MALLOC: + 17441592 (   16.6 MiB) Bytes in thread cache freelists
  MALLOC: +116387992 (  111.0 MiB) Bytes in malloc metadata
  MALLOC:   
  MALLOC: =  27794649240 (26507.0 MiB) Actual memory used (physical +
 swap)
  MALLOC: + 26116096 (   24.9 MiB) Bytes released to OS (aka unmapped)
  MALLOC:   
  MALLOC: =  27820765336 (26531.9 MiB) Virtual address space used
  MALLOC:
  MALLOC:   5683  Spans in use
  MALLOC: 21  Thread heaps in use
  MALLOC:   8192  Tcmalloc page size
  
 
  after that I ran the heap release and it went back to normal.
  
  MALLOC:   22919616 (   21.9 MiB) Bytes in use by application
  MALLOC: +  4792320 (4.6 MiB) Bytes in page heap freelist
  MALLOC: + 18743448 (   17.9 MiB) Bytes in central cache freelist
  MALLOC: + 20645776 (   19.7 MiB) Bytes in transfer cache freelist
  MALLOC: + 18456088 (   17.6 MiB) Bytes in thread cache freelists
  MALLOC: +116387992 (  111.0 MiB) Bytes in malloc metadata
  MALLOC:   
  MALLOC: =201945240 (  192.6 MiB) Actual memory used (physical +
 swap)
  MALLOC: + 27618820096 (26339.4 MiB) Bytes released to OS (aka unmapped)
  MALLOC:   
  MALLOC: =  27820765336 (26531.9 MiB) Virtual address space used
  MALLOC:
  MALLOC:   5639  Spans in use
  MALLOC: 29  Thread heaps in use
  MALLOC:   8192  Tcmalloc page size
  
 
  So it just seems the monitor is not returning unused memory into the OS
 or
  reusing already allocated memory it deems as free...

 Yep. This is a bug (best we can tell) in some versions of tcmalloc
 combined with certain distribution stacks, although I don't think
 we've seen it reported on Trusty (nor on a tcmalloc distribution that
 new) before. Alternatively some folks are seeing tcmalloc use up lots
 of CPU in other scenarios involving memory return and it may manifest
 like this, but I'm not sure. You could look through the mailing list
 for information on it.
 -Greg


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon cpu usage

2015-07-23 Thread Luis Periquito
Hi Greg,

I've been looking at the tcmalloc issues, but they seemed to affect OSDs, and
I do notice them in heavy read workloads (even after the patch and after
increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728). This is
affecting the mon process, though.

Looking at perf top, I'm seeing most of the CPU usage in mutex lock/unlock:
  5.02%  libpthread-2.19.so[.] pthread_mutex_unlock
  3.82%  libsoftokn3.so[.] 0x0001e7cb
  3.46%  libpthread-2.19.so[.] pthread_mutex_lock

I could try to use jemalloc - are you aware of any pre-built binaries? Can I
mix different malloc binaries within a cluster?


On Thu, Jul 23, 2015 at 10:50 AM, Gregory Farnum g...@gregs42.com wrote:

 On Thu, Jul 23, 2015 at 8:39 AM, Luis Periquito periqu...@gmail.com
 wrote:
  The ceph-mon is already taking a lot of memory, and I ran a heap stats
  
  MALLOC:   32391696 (   30.9 MiB) Bytes in use by application
  MALLOC: +  27597135872 (26318.7 MiB) Bytes in page heap freelist
  MALLOC: + 16598552 (   15.8 MiB) Bytes in central cache freelist
  MALLOC: + 14693536 (   14.0 MiB) Bytes in transfer cache freelist
  MALLOC: + 17441592 (   16.6 MiB) Bytes in thread cache freelists
  MALLOC: +116387992 (  111.0 MiB) Bytes in malloc metadata
  MALLOC:   
  MALLOC: =  27794649240 (26507.0 MiB) Actual memory used (physical + swap)
  MALLOC: + 26116096 (   24.9 MiB) Bytes released to OS (aka unmapped)
  MALLOC:   
  MALLOC: =  27820765336 (26531.9 MiB) Virtual address space used
  MALLOC:
  MALLOC:   5683  Spans in use
  MALLOC: 21  Thread heaps in use
  MALLOC:   8192  Tcmalloc page size
  
 
  after that I ran the heap release and it went back to normal.
  
  MALLOC:   22919616 (   21.9 MiB) Bytes in use by application
  MALLOC: +  4792320 (4.6 MiB) Bytes in page heap freelist
  MALLOC: + 18743448 (   17.9 MiB) Bytes in central cache freelist
  MALLOC: + 20645776 (   19.7 MiB) Bytes in transfer cache freelist
  MALLOC: + 18456088 (   17.6 MiB) Bytes in thread cache freelists
  MALLOC: +116387992 (  111.0 MiB) Bytes in malloc metadata
  MALLOC:   
  MALLOC: =201945240 (  192.6 MiB) Actual memory used (physical + swap)
  MALLOC: + 27618820096 (26339.4 MiB) Bytes released to OS (aka unmapped)
  MALLOC:   
  MALLOC: =  27820765336 (26531.9 MiB) Virtual address space used
  MALLOC:
  MALLOC:   5639  Spans in use
  MALLOC: 29  Thread heaps in use
  MALLOC:   8192  Tcmalloc page size
  
 
  So it just seems the monitor is not returning unused memory into the OS
 or
  reusing already allocated memory it deems as free...

 Yep. This is a bug (best we can tell) in some versions of tcmalloc
 combined with certain distribution stacks, although I don't think
 we've seen it reported on Trusty (nor on a tcmalloc distribution that
 new) before. Alternatively some folks are seeing tcmalloc use up lots
 of CPU in other scenarios involving memory return and it may manifest
 like this, but I'm not sure. You could look through the mailing list
 for information on it.
 -Greg

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon cpu usage

2015-07-23 Thread Luis Periquito
The ceph-mon is already using a lot of memory, so I ran heap stats:

MALLOC:   32391696 (   30.9 MiB) Bytes in use by application
MALLOC: +  27597135872 (26318.7 MiB) Bytes in page heap freelist
MALLOC: + 16598552 (   15.8 MiB) Bytes in central cache freelist
MALLOC: + 14693536 (   14.0 MiB) Bytes in transfer cache freelist
MALLOC: + 17441592 (   16.6 MiB) Bytes in thread cache freelists
MALLOC: +116387992 (  111.0 MiB) Bytes in malloc metadata
MALLOC:   
MALLOC: =  27794649240 (26507.0 MiB) Actual memory used (physical + swap)
MALLOC: + 26116096 (   24.9 MiB) Bytes released to OS (aka unmapped)
MALLOC:   
MALLOC: =  27820765336 (26531.9 MiB) Virtual address space used
MALLOC:
MALLOC:   5683  Spans in use
MALLOC: 21  Thread heaps in use
MALLOC:   8192  Tcmalloc page size


after that I ran the heap release and it went back to normal.

MALLOC:   22919616 (   21.9 MiB) Bytes in use by application
MALLOC: +  4792320 (4.6 MiB) Bytes in page heap freelist
MALLOC: + 18743448 (   17.9 MiB) Bytes in central cache freelist
MALLOC: + 20645776 (   19.7 MiB) Bytes in transfer cache freelist
MALLOC: + 18456088 (   17.6 MiB) Bytes in thread cache freelists
MALLOC: +116387992 (  111.0 MiB) Bytes in malloc metadata
MALLOC:   
MALLOC: =201945240 (  192.6 MiB) Actual memory used (physical + swap)
MALLOC: +  27618820096 (26339.4 MiB) Bytes released to OS (aka unmapped)
MALLOC:   
MALLOC: =  27820765336 (26531.9 MiB) Virtual address space used
MALLOC:
MALLOC:   5639  Spans in use
MALLOC: 29  Thread heaps in use
MALLOC:   8192  Tcmalloc page size


So it just seems the monitor is neither returning unused memory to the OS nor
reusing already-allocated memory it deems free...
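
For completeness, the heap commands used above were along these lines (this
relies on the daemons being built against tcmalloc; the mon id is a
placeholder):

  ceph tell mon.<id> heap stats      # produces the dump shown above
  ceph tell mon.<id> heap release    # asks tcmalloc to hand the freelist back to the OS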


On Wed, Jul 22, 2015 at 4:29 PM, Luis Periquito periqu...@gmail.com wrote:

 This cluster is server RBD storage for openstack, and today all the I/O
 was just stopped.
 After looking in the boxes ceph-mon was using 17G ram - and this was on
 *all* the mons. Restarting the main one just made it work again (I
 restarted the other ones because they were using a lot of ram).
 This has happened twice now (first was last Monday).

 As this is considered a prod cluster there is no logging enabled, and I
 can't reproduce it - our test/dev clusters have been working fine, and have
 neither symptoms, but they were upgraded from firefly.
 What can we do to help debug the issue? Any ideas on how to identify the
 underlying issue?

 thanks,

 On Mon, Jul 20, 2015 at 1:59 PM, Luis Periquito periqu...@gmail.com
 wrote:

 Hi all,

 I have a cluster with 28 nodes (all physical, 4Cores, 32GB Ram), each
 node has 4 OSDs for a total of 112 OSDs. Each OSD has 106 PGs (counted
 including replication). There are 3 MONs on this cluster.
 I'm running on Ubuntu trusty with kernel 3.13.0-52-generic, with Hammer
 (0.94.2).

 This cluster was installed with Hammer (0.94.1) and has only been
 upgraded to the latest available version.

 On the three mons one is mostly idle, one is using ~170% CPU, and one is
 using ~270% CPU. They will change as I restart the process (usually the
 idle one is the one with the lowest uptime).

 Running a perf top againt the ceph-mon PID on the non-idle boxes it
 wields something like this:

   4.62%  libpthread-2.19.so[.] pthread_mutex_unlock
   3.95%  libpthread-2.19.so[.] pthread_mutex_lock
   3.91%  libsoftokn3.so[.] 0x0001db26
   2.38%  [kernel]  [k] _raw_spin_lock
   2.09%  libtcmalloc.so.4.1.2  [.] operator new(unsigned long)
   1.79%  ceph-mon  [.] DispatchQueue::enqueue(Message*, int,
 unsigned long)
   1.62%  ceph-mon  [.] RefCountedObject::get()
   1.58%  libpthread-2.19.so[.] pthread_mutex_trylock
   1.32%  libtcmalloc.so.4.1.2  [.] operator delete(void*)
   1.24%  libc-2.19.so  [.] 0x00097fd0
   1.20%  ceph-mon  [.] ceph::buffer::ptr::release()
   1.18%  ceph-mon  [.] RefCountedObject::put()
   1.15%  libfreebl3.so [.] 0x000542a8
   1.05%  [kernel]  [k] update_cfs_shares
   1.00%  [kernel]  [k] tcp_sendmsg

 The cluster is mostly idle, and it's healthy. The store is 69MB big, and
 the MONs are consuming around 700MB of RAM.

 Any ideas on this situation? Is it safe to ignore?



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon cpu usage

2015-07-22 Thread Luis Periquito
This cluster is serving RBD storage for OpenStack, and today all the I/O
just stopped.
After looking at the boxes, ceph-mon was using 17G of RAM - and this was on
*all* the mons. Restarting the main one made things work again (I also
restarted the other ones because they were using a lot of RAM).
This has happened twice now (the first time was last Monday).

As this is considered a prod cluster there is no logging enabled, and I
can't reproduce it - our test/dev clusters have been working fine and show
none of these symptoms, but they were upgraded from firefly.
What can we do to help debug the issue? Any ideas on how to identify the
underlying cause?
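
One low-impact way to get some logging out of a prod mon without restarting it
is injectargs; a sketch (the mon id and debug levels are just examples, and
should be turned back down afterwards):

  ceph tell mon.<id> injectargs '--debug_mon 10 --debug_ms 1'
  # ...wait for / reproduce the problem, collect the mon log...
  ceph tell mon.<id> injectargs '--debug_mon 1/5 --debug_ms 0/5'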

thanks,

On Mon, Jul 20, 2015 at 1:59 PM, Luis Periquito periqu...@gmail.com wrote:

 Hi all,

 I have a cluster with 28 nodes (all physical, 4Cores, 32GB Ram), each node
 has 4 OSDs for a total of 112 OSDs. Each OSD has 106 PGs (counted including
 replication). There are 3 MONs on this cluster.
 I'm running on Ubuntu trusty with kernel 3.13.0-52-generic, with Hammer
 (0.94.2).

 This cluster was installed with Hammer (0.94.1) and has only been upgraded
 to the latest available version.

 On the three mons one is mostly idle, one is using ~170% CPU, and one is
 using ~270% CPU. They will change as I restart the process (usually the
 idle one is the one with the lowest uptime).

 Running a perf top againt the ceph-mon PID on the non-idle boxes it wields
 something like this:

   4.62%  libpthread-2.19.so[.] pthread_mutex_unlock
   3.95%  libpthread-2.19.so[.] pthread_mutex_lock
   3.91%  libsoftokn3.so[.] 0x0001db26
   2.38%  [kernel]  [k] _raw_spin_lock
   2.09%  libtcmalloc.so.4.1.2  [.] operator new(unsigned long)
   1.79%  ceph-mon  [.] DispatchQueue::enqueue(Message*, int,
 unsigned long)
   1.62%  ceph-mon  [.] RefCountedObject::get()
   1.58%  libpthread-2.19.so[.] pthread_mutex_trylock
   1.32%  libtcmalloc.so.4.1.2  [.] operator delete(void*)
   1.24%  libc-2.19.so  [.] 0x00097fd0
   1.20%  ceph-mon  [.] ceph::buffer::ptr::release()
   1.18%  ceph-mon  [.] RefCountedObject::put()
   1.15%  libfreebl3.so [.] 0x000542a8
   1.05%  [kernel]  [k] update_cfs_shares
   1.00%  [kernel]  [k] tcp_sendmsg

 The cluster is mostly idle, and it's healthy. The store is 69MB big, and
 the MONs are consuming around 700MB of RAM.

 Any ideas on this situation? Is it safe to ignore?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-mon cpu usage

2015-07-20 Thread Luis Periquito
Hi all,

I have a cluster with 28 nodes (all physical: 4 cores, 32GB RAM); each node
has 4 OSDs, for a total of 112 OSDs. Each OSD has 106 PGs (counted including
replication). There are 3 MONs in this cluster.
I'm running Ubuntu trusty with kernel 3.13.0-52-generic, with Hammer
(0.94.2).

This cluster was installed with Hammer (0.94.1) and has only been upgraded
to the latest available version.

On the three mons one is mostly idle, one is using ~170% CPU, and one is
using ~270% CPU. They will change as I restart the process (usually the
idle one is the one with the lowest uptime).

Running perf top against the ceph-mon PID on the non-idle boxes yields
something like this:

  4.62%  libpthread-2.19.so[.] pthread_mutex_unlock
  3.95%  libpthread-2.19.so[.] pthread_mutex_lock
  3.91%  libsoftokn3.so[.] 0x0001db26
  2.38%  [kernel]  [k] _raw_spin_lock
  2.09%  libtcmalloc.so.4.1.2  [.] operator new(unsigned long)
  1.79%  ceph-mon  [.] DispatchQueue::enqueue(Message*, int,
unsigned long)
  1.62%  ceph-mon  [.] RefCountedObject::get()
  1.58%  libpthread-2.19.so[.] pthread_mutex_trylock
  1.32%  libtcmalloc.so.4.1.2  [.] operator delete(void*)
  1.24%  libc-2.19.so  [.] 0x00097fd0
  1.20%  ceph-mon  [.] ceph::buffer::ptr::release()
  1.18%  ceph-mon  [.] RefCountedObject::put()
  1.15%  libfreebl3.so [.] 0x000542a8
  1.05%  [kernel]  [k] update_cfs_shares
  1.00%  [kernel]  [k] tcp_sendmsg
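
For reference, output like the above comes from something along the lines of

  perf top -p $(pidof ceph-mon)

run on the monitor host itself.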

The cluster is mostly idle, and it's healthy. The store is 69MB big, and
the MONs are consuming around 700MB of RAM.

Any ideas on this situation? Is it safe to ignore?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] monitor election

2015-06-04 Thread Luis Periquito
Hi all,

I've seen several discussions about monitor elections, and how the monitor
with the lowest IP is always the leader.

Is there any way to change or influence this behaviour, other than changing
the IPs of the monitors themselves?

thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] active+clean+scrubbing+deep

2015-06-02 Thread Luis Periquito
That's a normal process (deep scrubbing) running...

For more information see:
http://ceph.com/docs/master/rados/configuration/osd-config-ref/#scrubbing
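
If a PG really does stay in deep-scrubbing far longer than expected, a common
approach is to find its primary OSD and restart that (a sketch; the PG id
comes from the quoted message below):

  ceph pg map 19.b3        # shows the up/acting set; the first OSD listed is the primary
  ceph osd set noscrub
  ceph osd set nodeep-scrub
  # restart the primary OSD of 19.b3, wait for it to rejoin, then:
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub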

On Tue, Jun 2, 2015 at 9:55 AM, Никитенко Виталий v1...@yandex.ru wrote:

 Hi!

 I have ceph version 0.94.1.

 root@ceph-node1:~# ceph -s
 cluster 3e0d58cd-d441-4d44-b49b-6cff08c20abf
  health HEALTH_OK
  monmap e2: 3 mons at {ceph-mon=
 10.10.100.3:6789/0,ceph-node1=10.10.100.1:6789/0,ceph-node2=10.10.100.2:6789/0
 }
 election epoch 428, quorum 0,1,2 ceph-node1,ceph-node2,ceph-mon
  osdmap e978: 16 osds: 16 up, 16 in
   pgmap v6735569: 2012 pgs, 8 pools, 2801 GB data, 703 kobjects
 5617 GB used, 33399 GB / 39016 GB avail
 2011 active+clean
1 active+clean+scrubbing+deep
   client io 174 kB/s rd, 30641 kB/s wr, 80 op/s

 root@ceph-node1:~# ceph pg dump  | grep -i deep | cut -f 1
   dumped all in format plain
   pg_stat
   19.b3

 In the log file I see
 2015-05-14 03:23:51.556876 7fc708a37700  0 log_channel(cluster) log [INF]
 : 19.b3 deep-scrub starts
 but no 19.b3 deep-scrub ok.

 Then I ran ceph pg deep-scrub 19.b3; nothing happens, and there are no
 records about it in the log file.

 What can I do to return the PG to the active+clean state?
 Does it make sense to restart the OSD, or the entire server where the OSD lives?

 Thanks.
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 0.87-1

2015-02-26 Thread Luis Periquito
There are release notes online for this version:
http://ceph.com/docs/master/release-notes/#v0-87-1-giant

It seems someone just forgot to announce it on the ML.

On Thu, Feb 26, 2015 at 7:12 AM, Alexandre DERUMIER aderum...@odiso.com
wrote:

 Hi,

 I know that Loic Dachary is currently working on backporting new features
 to giant.

 I see that 0.87.1 has been tagged in git too.

 Here is the difference:
 https://github.com/ceph/ceph/compare/v0.87...v0.87.1


 Loic, any announcement/release notes yet?


 - Mail original -
 From: Lindsay Mathieson lindsay.mathie...@gmail.com
 To: ceph-users ceph-us...@ceph.com
 Sent: Thursday, 26 February 2015 01:51:51
 Subject: [ceph-users] Ceph 0.87-1

 The Ceph Debian Giant repo ( http://ceph.com/debian-giant ) seems to have
 had an update from 0.87 to 0.87-1 on the 24-Feb.

 Are there release notes anywhere on what changed, etc.? Is there an upgrade
 procedure?

 thanks,

 --
 Lindsay

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] running giant/hammer mds with firefly osds

2015-02-20 Thread Luis Periquito
Hi Dan,

I remember http://tracker.ceph.com/issues/9945 introducing some issues with
running cephfs between different versions of giant/firefly.

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg14257.html

So if you upgrade please be aware that you'll also have to update the
clients.

On Fri, Feb 20, 2015 at 10:33 AM, Dan van der Ster d...@vanderster.com
wrote:

 Hi all,

 Back in the dumpling days, we were able to run the emperor MDS with
 dumpling OSDs -- this was an improvement over the dumpling MDS.

 Now we have stable firefly OSDs, but I was wondering if we can reap
 some of the recent CephFS developments by running a giant or ~hammer
 MDS with our firefly OSDs. Did anyone try that yet?

 Best Regards, Dan
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fixing a crushmap

2015-02-20 Thread Luis Periquito
The process of creating an erasure-coded pool and a replicated one is
slightly different. You can use Sebastien's guide to create/manage the OSD
tree, but you should follow this guide
http://ceph.com/docs/giant/dev/erasure-coded-pool/ to create the EC pool.

I'm not sure about creating an EC pool the way you did (i.e. I have never
tried it). The normal replicated ones do work like that.
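
As a minimal sketch of what that guide describes (the profile name, pool name,
pg counts and failure domain below are just examples):

  ceph osd erasure-code-profile set myprofile k=4 m=2 ruleset-failure-domain=host
  ceph osd pool create ecpool 128 128 erasure myprofile

The EC pool normally gets its CRUSH ruleset generated from the profile, rather
than reusing a hand-written, replicated-style ruleset.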

On Fri, Feb 20, 2015 at 4:49 PM, Kyle Hutson kylehut...@ksu.edu wrote:

 I manually edited my crushmap, basing my changes on
 http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/
 I have SSDs and HDDs in the same box and was wanting to separate them by
 ruleset. My current crushmap can be seen at http://pastie.org/9966238

  I had it installed and everything looked good... until I created a new
 pool. All of the new PGs are stuck in creating. I first tried creating an
 erasure-coded pool using ruleset 3, then created another pool using ruleset
 0. Same result.

 I'm not opposed to an 'RTFM' answer, so long as you can point me to the
 right one. I've seen very little documentation on crushmap rules, in
 particular.

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Total number PGs using multiple pools

2015-01-27 Thread Luis Periquito
Although the documentation is not great, and is open to interpretation, there
is a PG calculator here: http://ceph.com/pgcalc/.
With it you should be able to simulate your use case and generate numbers
based on your scenario.
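
As a worked example of the kind of calculation the pgcalc does (using the
numbers from the question quoted below; ~100 PGs per OSD is the usual rule of
thumb):

  total PGs  ~ (10 OSDs * 100) / 3 (replica size) ~ 333
  3 pools holding roughly equal data -> ~111 PGs each
  rounded up to the next power of two -> pg_num = 128 per pool

So the budget is per cluster, and it gets divided between the pools in
proportion to the data you expect each one to hold.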

On Mon, Jan 26, 2015 at 8:00 PM, Italo Santos okd...@gmail.com wrote:

  Thanks for your answer.

 But what I'd like to understand is whether these numbers are per-pool or
 per-cluster. If they are per-cluster, then when planning the cluster
 deployment I'll decide how many pools I'd like to have on that cluster and
 their replica counts.

 Regards.

 *Italo Santos*
 http://italosantos.com.br/

 On Saturday, January 17, 2015 at 07:04, lidc...@redhat.com wrote:

  Here are a few values commonly used:

    - Less than 5 OSDs: set pg_num to 128
    - Between 5 and 10 OSDs: set pg_num to 512
    - Between 10 and 50 OSDs: set pg_num to 4096
    - If you have more than 50 OSDs, you need to understand the tradeoffs
      and how to calculate the pg_num value yourself

 But I think 10 OSDs is too small for a RADOS cluster.


 *From:* Italo Santos okd...@gmail.com
 *Date:* 2015-01-17 05:00
 *To:* ceph-users ceph-users@lists.ceph.com
 *Subject:* [ceph-users] Total number PGs using multiple pools
 Hello,

 Into placement groups documentation
 http://ceph.com/docs/giant/rados/operations/placement-groups/ we have
 the message bellow:

 “*When using multiple data pools for storing objects, you need to ensure
 that you balance the number of placement groups per pool with the number of
 placement groups per OSD so that you arrive at a reasonable total number of
 placement groups that provides reasonably low variance per OSD without
 taxing system resources or making the peering process too slow.*”

  This means that, if I have a cluster with 10 OSDs and 3 pools with size = 3,
 each pool can have only ~111 PGs?

  Ex.: (100 * 10 OSDs) / 3 replicas = ~333 PGs; 333 PGs / 3 pools = ~111 PGs per pool

  I don't know if my reasoning is right… I'd be glad for any help.

 Regards.

 *Italo Santos*
 http://italosantos.com.br/



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

