Re: failed to open http://apt-mirror.front.sepia.ceph.com

2015-09-23 Thread Sage Weil
On Wed, 23 Sep 2015, Loic Dachary wrote:
> Hi,
> 
> On 23/09/2015 12:29, wangsongbo wrote:
> > 64.90.32.37 apt-mirror.front.sepia.ceph.com
> 
> It works for me. Could you send a traceroute 
> apt-mirror.front.sepia.ceph.com ?

This is a private IP internal to the sepia lab.  Anything outside the lab 
shouldn't be using it...

sage


Re: RBD mirroring CLI proposal ...

2015-09-23 Thread Jason Dillaman
> In this case the commands look a little confusing to me, as from their
> names I would rather think they enable/disable mirror for existent
> images too. Also, I don't see a command to check what current
> behaviour is. And, I suppose it would be useful if we could configure
> other default features for a pool (exclusive-lock, object-map, ...)
> Also, I am not sure we should specify  this way, as it is
> not consistent with other rbd commands. By default rbd operates on
> 'rbd' pool, which can be changed by --pool option. So what do you
> think if we have something similar to 'rbd feature' commands?
> 
>   rbd [--pool ] default-feature enable 
>   rbd [--pool ] default-feature disable 
>   rbd [--pool ] default-feature show []
> 
> (If  is not specified in the last command, all features are
> shown).
> 
> Similarly, it might be useful to have 'rbd feature show' command:
> 
>   rbd feature show  []
> 
> BTW, where do you think these default feature flags will be stored?
> Storing in pg_pool_t::flags I suppose is the easiest but it looks like
> a layering violation.
> 

I used 'mirror pool enable/disable' to keep all the related commands together.  
I wasn't attempting to create a mechanism to specify arbitrary default features 
for a given pool, only the ability to enable mirroring (by default) on a given 
pool since that is the use case discussed at CDS. 

Image features can already be seen (along with lots of other image stats) via 
"rbd info ".  Mirror pool settings can be seen via the previously 
proposed "mirror pool info" command.


Re: Adding Data-At-Rest compression support to Ceph

2015-09-23 Thread Sage Weil
On Wed, 23 Sep 2015, Igor Fedotov wrote:
> Hi Sage,
> thanks a lot for your feedback.
> 
> Regarding issues with offset mapping and stripe size exposure.
> What's about the idea to apply compression in two-tier (cache+backing storage)
> model only ?

I'm not sure we win anything by making it a two-tier only thing... simply 
making it a feature of the EC pool means we can also address EC pool users 
like radosgw.

> I doubt single-tier one is widely used for EC pools since there is no random
> write support in such mode. Thus this might be an acceptable limitation.
> At the same time it seems that appends caused by cached object flush have
> fixed block size (8Mb by default). And object is totally rewritten on the next
> flush if any. This makes offset mapping less tricky.
> Decompression should be applied in any model though as cache tier shutdown and
> subsequent compressed data access is possibly  a valid use case.

Yeah, we need to handle random reads either way, so I think the offset 
mapping is going to be needed anyway.  And I don't think there is any 
real difference from the EC pool's perspective between a direct user 
like radosgw and the cache tier writing objects--in both cases it's 
doing appends and deletes.

sage


> 
> Thanks,
> Igor
> 
> On 22.09.2015 22:11, Sage Weil wrote:
> > On Tue, 22 Sep 2015, Igor Fedotov wrote:
> > > Hi guys,
> > > 
> > > I can find some talks about adding compression support to Ceph. Let me
> > > share
> > > some thoughts and proposals on that too.
> > > 
> > > First of all I'd like to consider several major implementation options
> > > separately. IMHO this makes sense since they have different applicability,
> > > value and implementation specifics. Besides that less parts are easier for
> > > both understanding and implementation.
> > > 
> > >* Data-At-Rest Compression. This is about compressing basic data volume
> > > kept
> > > by the Ceph backing tier. The main reason for that is data store costs
> > > reduction. One can find similar approach introduced by Erasure Coding Pool
> > > implementation - cluster capacity increases (i.e. storage cost reduces) at
> > > the
> > > expense of additional computations. This is especially effective when
> > > combined
> > > with the high-performance cache tier.
> > >*  Intermediate Data Compression. This case is about applying
> > > compression
> > > for intermediate data like system journals, caches etc. The intention is
> > > to
> > > improve expensive storage resource  utilization (e.g. solid state drives
> > > or
> > > RAM ). At the same time the idea to apply compression ( feature that
> > > undoubtedly introduces additional overhead ) to the crucial heavy-duty
> > > components probably looks contradictory.
> > >*  Exchange Data Compression. This one to be applied to messages
> > > transported
> > > between client and storage cluster components as well as internal cluster
> > > traffic. The rationale for that might be the desire to improve cluster
> > > run-time characteristics, e.g. limited data bandwidth caused by the
> > > network or
> > > storage devices throughput. The potential drawback is client overburdening
> > > -
> > > client computation resources might become a bottleneck since they take
> > > most of
> > > compression/decompression tasks.
> > > 
> > > Obviously it would be great to have support for all the above cases, e.g.
> > > object compression takes place at the client and cluster components handle
> > > that naturally during the object life-cycle. Unfortunately significant
> > > complexities arise on this way. Most of them are related to partial object
> > > access, both reading and writing. It looks like huge development (
> > > redesigning, refactoring and new code development ) and testing efforts
> > > are
> > > required on this way. It's hard to estimate the value of such aggregated
> > > support at the current moment too.
> > > Thus the approach I'm suggesting is to drive the progress eventually and
> > > consider cases separately. At the moment my proposal is to add
> > > Data-At-Rest
> > > compression to Erasure Coded pools as the most definite one from both
> > > implementation and value points of view.
> > > 
> > > How we can do that.
> > > 
> > > Ceph Cluster Architecture suggests two-tier storage model for production
> > > usage. Cache tier built on high-performance expensive storage devices
> > > provides
> > > performance. Storage tier with low-cost less-efficient devices provides
> > > cost-effectiveness and capacity. Cache tier is supposed to use ordinary
> > > data
> > > replication while storage one can use erasure coding (EC) for effective
> > > and
> > > reliable data keeping. EC provides less store costs with the same
> > > reliability
> > > comparing to data replication approach at the expenses of additional
> > > computations. Thus Ceph already has some trade off between capacity and
> > > computation efforts. Actually Data-At-Rest compression is exactly about
> > 

Re: failed to open http://apt-mirror.front.sepia.ceph.com

2015-09-23 Thread Loic Dachary
Hi,

On 23/09/2015 12:29, wangsongbo wrote:
> 64.90.32.37 apt-mirror.front.sepia.ceph.com

It works for me. Could you send a traceroute apt-mirror.front.sepia.ceph.com ?

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: [Ceph-announce] Important security notice regarding release signing key

2015-09-23 Thread Gaudenz Steinlin
Sage Weil  writes:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
>
> Last week, Red Hat investigated an intrusion on the sites of both the Ceph 
> community project (ceph.com) and Inktank (download.inktank.com), which 
> were hosted on a computer system outside of Red Hat infrastructure.
>
> Ceph.com provided Ceph community versions downloads signed with a Ceph 
> signing key (id 7EBFDD5D17ED316D). Download.inktank.com provided releases 
> of the Red Hat Ceph product for Ubuntu and CentOS operating systems signed 
> with an Inktank signing key (id 5438C7019DCEEEAD). While the investigation 
> into the intrusion is ongoing, our initial focus was on the integrity of 
> the software and distribution channel for both sites.
>
> To date, our investigation has not discovered any compromised code or 
> binaries available for download on these sites. However, we cannot fully 
> rule out the possibility that some compromised code or binaries were 
> available for download at some point in the past. Further, we can no 
> longer trust the integrity of the Ceph signing key, and therefore have 
> created a new signing key (id E84AC2C0460F3994) for verifying downloads. 
> This new key is committed to the ceph.git repository and is 
> also available from
>
>   https://git.ceph.com/release.asc
>
> The new key should look like:
>
> pub   4096R/460F3994 2015-09-15
> uid  Ceph.com (release key) 
>
> All future release git tags will be signed with this new key.
>
> This intrusion did not affect other Ceph sites such as download.ceph.com 
> (which contained some older Ceph downloads) or git.ceph.com (which mirrors 
> various source repositories), and is not known to have affected any other 
> Ceph community infrastructure.  There is no evidence that build system or 
> the Ceph github source repository were compromised.
>
> New hosts for ceph.com and download.ceph.com have been created and the 
> sites have been rebuilt.  All content available on download.ceph.com has 
> been verified, and all ceph.com URLs for package locations now redirect 
> there.  There is still some content missing from download.ceph.com that 
> will appear later today: source tarballs will be regenerated from git, and 
> older release packages are being resigned with the new release key.  DNS 
> changes are still propagating so you may not see the new versions of the 
> ceph.com and download.ceph.com sites for another hour or so.

It would be nice to have a way to verify the integrity of tarballs
downloaded from http://download.ceph.com/tarballs/. Could you please add
individual signatures or a sha256sum file signed with your release key?
This is important for people building from source tarballs and for
distribution packagers basing their packages on those tarballs. Debian and
Ubuntu packages are currently built from them.
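
Concretely, something like the following would be enough for downstreams to
check a tarball (all file names below are examples only -- the signed checksum
file is exactly what is being asked for here and does not exist yet):

   wget http://download.ceph.com/tarballs/ceph-0.94.3.tar.gz
   wget http://download.ceph.com/tarballs/sha256sums.asc   # hypothetical signed checksum list
   wget https://git.ceph.com/release.asc
   gpg --import release.asc                 # import the new release key
   gpg --verify sha256sums.asc              # check the signature on the checksum list
   gpg --decrypt sha256sums.asc | grep ceph-0.94.3.tar.gz | sha256sum -c -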

Gaudenz




Re: ceph-mon always election when change crushmap in firefly

2015-09-23 Thread Sage Weil
On Wed, 23 Sep 2015, Alexander Yang wrote:
> hello,
> We use Ceph+Openstack in our private cloud. In our cluster, we have
> 5 mons and 800 osds, the Capacity is about 1Pb. And run about 700 vms and
> 1100 volumes,
> recently, we increase our pg_num , now the cluster have about 7
> pgs. In my real intention, I want every osd have 100pgs. but after increase
> pg_num, I find I'm wrong. Because the different crush weight for different
> osd, the osd's pg_num is different, some osd have exceed  500pgs.
> Now, the problem is  appear, cause some reason when i want to change
> some osd  weight, that means change the crushmap.  This change cause about
> 0.03% data to migrate. the mon is always begin to election. It's will hung
> the cluster, and when they end, the  original  leader still is the new
> leader. And during the mon election, on the upper layer, vm have too many
> slow request will appear. so now i dare to do any operation about change
> crushmap. But i worry about an important thing, If  when our cluster  down
>  one host even down one rack.   By the time, the cluster curshmap will
> change large, and the migrate data also large. I worry the cluster will
> hung  long time. and result on upper layer, all vm became to  shutdown.
> In my opinion, I guess when I change the crushmap,* the leader mon
> maybe calculate the too many information*, or* too many client want to get
> the new crushmap from leader mon*.  It must be hung the mon thread, so the
> leader mon can't heatbeat to other mons, the other mons think the leader is
> down then begin the new election.  I am sorry if i guess is wrong.
> The crushmap in accessory. So who can give me some advice or guide,
> Thanks very much!

There were huge improvements made in hammer in terms of mon efficiency in 
these cases where it is under load.  I recommend upgrading as that will 
help.

You can also mitigate the problem somewhat by adjusting the mon_lease and 
associated settings up.  Scale all of mon_lease, mon_lease_renew_interval, 
mon_lease_ack_timeout, mon_accept_timeout by 2x or 3x.
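
For example, assuming the stock defaults of mon_lease = 5, 
mon_lease_renew_interval = 3, mon_lease_ack_timeout = 10 and 
mon_accept_timeout = 10 seconds (double-check the defaults shipped with your 
version), a 3x bump in ceph.conf on the monitors would look roughly like:

    [mon]
        mon lease = 15
        mon lease renew interval = 9
        mon lease ack timeout = 30
        mon accept timeout = 30

Restart the monitors (or inject the new values with 'ceph tell mon.* 
injectargs') after changing them.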

It also sounds like you may be using some older tunables/settings 
for your pools or crush rules.  Can you attach the output of 'ceph osd 
dump' and 'ceph osd crush dump | tail -n 20' ?

sage


Re: RBD mirroring CLI proposal ...

2015-09-23 Thread Jason Dillaman
> > > * rbd mirror pool add 
> > > This will register a remote cluster/pool as a peer to the
> > > current,
> > > local pool.  All existing mirrored images and all future mirrored
> > > images will have this peer registered as a journal client.
> > >
> > > * rbd mirror pool remove 
> > > This will deregister a remote cluster/pool as a peer to the
> > > current,
> > > local pool.  All existing mirrored images will have the remote
> > > deregistered from image journals.
> 
> I am not sure we need 'pool' here. I would prefer:
> 
>   rbd mirror peer add 
>   rbd mirror peer remove 
>   rbd mirror peer show
> 
> Where,  should not necessary contain pool, because
> the default could be used or what is specified via '--pool' option.
> 

I used "pool" to signify that these commands operate on a per-pool basis, which 
is how mirroring will function.


Re: [Ceph-announce] Important security notice regarding release signing key

2015-09-23 Thread Sage Weil
On Wed, 23 Sep 2015, Gaudenz Steinlin wrote:
> Sage Weil  writes:
> 
> > -BEGIN PGP SIGNED MESSAGE-
> > Hash: SHA1
> >
> > Last week, Red Hat investigated an intrusion on the sites of both the Ceph 
> > community project (ceph.com) and Inktank (download.inktank.com), which 
> > were hosted on a computer system outside of Red Hat infrastructure.
> >
> > Ceph.com provided Ceph community versions downloads signed with a Ceph 
> > signing key (id 7EBFDD5D17ED316D). Download.inktank.com provided releases 
> > of the Red Hat Ceph product for Ubuntu and CentOS operating systems signed 
> > with an Inktank signing key (id 5438C7019DCEEEAD). While the investigation 
> > into the intrusion is ongoing, our initial focus was on the integrity of 
> > the software and distribution channel for both sites.
> >
> > To date, our investigation has not discovered any compromised code or 
> > binaries available for download on these sites. However, we cannot fully 
> > rule out the possibility that some compromised code or binaries were 
> > available for download at some point in the past. Further, we can no 
> > longer trust the integrity of the Ceph signing key, and therefore have 
> > created a new signing key (id E84AC2C0460F3994) for verifying downloads. 
> > This new key is committed to the ceph.git repository and is 
> > also available from
> >
> > https://git.ceph.com/release.asc
> >
> > The new key should look like:
> >
> > pub   4096R/460F3994 2015-09-15
> > uid  Ceph.com (release key) 
> >
> > All future release git tags will be signed with this new key.
> >
> > This intrusion did not affect other Ceph sites such as download.ceph.com 
> > (which contained some older Ceph downloads) or git.ceph.com (which mirrors 
> > various source repositories), and is not known to have affected any other 
> > Ceph community infrastructure.  There is no evidence that build system or 
> > the Ceph github source repository were compromised.
> >
> > New hosts for ceph.com and download.ceph.com have been created and the 
> > sites have been rebuilt.  All content available on download.ceph.com has 
> > been verified, and all ceph.com URLs for package locations now redirect 
> > there.  There is still some content missing from download.ceph.com that 
> > will appear later today: source tarballs will be regenerated from git, and 
> > older release packages are being resigned with the new release key.  DNS 
> > changes are still propagating so you may not see the new versions of the 
> > ceph.com and download.ceph.com sites for another hour or so.
> 
> It would be nice to have a way to verify the integrity of tarballs
> downloaded from http://download.ceph.com/tarballs/. Could you please add
> individual signatures or an sha256sum file signed with your release key.
> This is important for people building from source tarballs and
> distribution packagers baseing their packages from tarballs. Debian and
> Ubuntu packages are currently built from them.

Future releases will have tarball signatures.  Alfredo and Andrew are 
working on the new build/release tooling now.

sage


Re: Adding Data-At-Rest compression support to Ceph

2015-09-23 Thread Igor Fedotov

Hi Sage,
thanks a lot for your feedback.

Regarding issues with offset mapping and stripe size exposure.
What about the idea of applying compression only in the two-tier 
(cache + backing storage) model?
I doubt the single-tier setup is widely used for EC pools since there is no 
random write support in that mode, so this might be an acceptable 
limitation.
At the same time it seems that appends caused by cached object flushes 
have a fixed block size (8 MB by default), and the object is totally rewritten 
on the next flush, if any. This makes offset mapping less tricky.
Decompression should be supported in any model though, as cache tier 
shutdown followed by access to the compressed data is probably a valid use 
case.


Thanks,
Igor

On 22.09.2015 22:11, Sage Weil wrote:

On Tue, 22 Sep 2015, Igor Fedotov wrote:

Hi guys,

I can find some talks about adding compression support to Ceph. Let me share
some thoughts and proposals on that too.

First of all I'd like to consider several major implementation options
separately. IMHO this makes sense since they have different applicability,
value and implementation specifics. Besides that less parts are easier for
both understanding and implementation.

   * Data-At-Rest Compression. This is about compressing basic data volume kept
by the Ceph backing tier. The main reason for that is data store costs
reduction. One can find similar approach introduced by Erasure Coding Pool
implementation - cluster capacity increases (i.e. storage cost reduces) at the
expense of additional computations. This is especially effective when combined
with the high-performance cache tier.
   *  Intermediate Data Compression. This case is about applying compression
for intermediate data like system journals, caches etc. The intention is to
improve expensive storage resource  utilization (e.g. solid state drives or
RAM ). At the same time the idea to apply compression ( feature that
undoubtedly introduces additional overhead ) to the crucial heavy-duty
components probably looks contradictory.
   *  Exchange Data Compression. This one to be applied to messages transported
between client and storage cluster components as well as internal cluster
traffic. The rationale for that might be the desire to improve cluster
run-time characteristics, e.g. limited data bandwidth caused by the network or
storage devices throughput. The potential drawback is client overburdening -
client computation resources might become a bottleneck since they take most of
compression/decompression tasks.

Obviously it would be great to have support for all the above cases, e.g.
object compression takes place at the client and cluster components handle
that naturally during the object life-cycle. Unfortunately significant
complexities arise on this way. Most of them are related to partial object
access, both reading and writing. It looks like huge development (
redesigning, refactoring and new code development ) and testing efforts are
required on this way. It's hard to estimate the value of such aggregated
support at the current moment too.
Thus the approach I'm suggesting is to drive the progress eventually and
consider cases separately. At the moment my proposal is to add Data-At-Rest
compression to Erasure Coded pools as the most definite one from both
implementation and value points of view.

How we can do that.

Ceph Cluster Architecture suggests two-tier storage model for production
usage. Cache tier built on high-performance expensive storage devices provides
performance. Storage tier with low-cost less-efficient devices provides
cost-effectiveness and capacity. Cache tier is supposed to use ordinary data
replication while storage one can use erasure coding (EC) for effective and
reliable data keeping. EC provides less store costs with the same reliability
comparing to data replication approach at the expenses of additional
computations. Thus Ceph already has some trade off between capacity and
computation efforts. Actually Data-At-Rest compression is exactly about the
same. Moreover one can tie EC and Data-At-Rest compression together to achieve
even better storage effectiveness.
There are two possible ways on adding Data-At-Rest compression:
   *  Use data compression built into a file system beyond the Ceph.
   *  Add compression to Ceph OSD.

At first glance Option 1. looks pretty attractive but there are some drawbacks
for this approach. Here they are:
   *  File System lock-in. BTRFS is the only file system supporting transparent
compression among ones recommended for Ceph usage.  Moreover
AFAIK it's still not recommended for production usage, see:
http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/
*  Limited flexibility - one can use compression methods and policies
supported by FS only.
*  Data compression depends on volume or mount point properties (and is
bound to OSD). Without additional support Ceph lacks the ability to have
different compression policies for different 

Re: RBD mirroring CLI proposal ...

2015-09-23 Thread Jason Dillaman
> > In this case the commands look a little confusing to me, as from their
> > names I would rather think they enable/disable mirror for existent
> > images too. Also, I don't see a command to check what current
> > behaviour is. And, I suppose it would be useful if we could configure
> > other default features for a pool (exclusive-lock, object-map, ...)
> > Also, I am not sure we should specify  this way, as it is
> > not consistent with other rbd commands. By default rbd operates on
> > 'rbd' pool, which can be changed by --pool option. So what do you
> > think if we have something similar to 'rbd feature' commands?
> >
> >   rbd [--pool ] default-feature enable 
> >   rbd [--pool ] default-feature disable 
> >   rbd [--pool ] default-feature show []
> >
> > (If  is not specified in the last command, all features are
> > shown).
> 
> I haven't read the discussion in full, so feel free to ignore, but I'd
> much rather have rbd create create images with the same set of features
> enabled no matter which pool it's pointed to.
> 
> I'm not clear on what exactly a pool policy is.  If it's just a set of
> feature bits to enable at create time, I don't think it's worth
> introducing at all.  If it's a set feature bits + a set of mirroring
> related options, I think it should just be a set of those options.
> Then, rbd create --enable-mirroring could be a way to create an image
> with mirroring enabled.
> 
> My point is a plain old "rbd create foo" shouldn't depend an any new
> pool-level metadata.  It's not that hard to remember which features you
> want for which pools and rbd create shortcuts like --enable-object-map
> and --enable-mirroring would hide feature bit dependencies and save
> typing.  --enable-mirroring would also serve as a ticket to go look at
> new metadata and pull out any mirroring related defaults.
> 

While I agree that I don't necessarily want to go down the road of specifying 
per-pool default features, I think there is a definite use case for being able 
to enable mirroring on a per-pool basis by default.  Without such an ability, 
taking OpenStack for example, it would require changes to Glance, Cinder, and 
Nova in order to support DR configurations -- or you could get it automatically 
with a little pre-configuration.
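
As a rough sketch of that pre-configuration (the command names follow this 
proposal and are not final; the cluster and pool names are hypothetical):

    # on the primary site, enable mirroring by default for the pools OpenStack uses
    rbd --cluster site-a --pool images mirror pool enable
    rbd --cluster site-a --pool volumes mirror pool enable
    # register the secondary site as a peer for one of those pools
    rbd --cluster site-a --pool images mirror pool add site-b/images

Glance, Cinder, and Nova would then keep issuing their normal image creation 
calls and pick up mirroring without any code changes.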


failed to open http://apt-mirror.front.sepia.ceph.com

2015-09-23 Thread wangsongbo

Hi Loic and other Cephers,

I am running teuthology suites in our testing. Because the connection to 
"apt-mirror.front.sepia.ceph.com" timed out, "ceph-cm-ansible" failed.

From a web browser, I got a response like this: "502 Bad Gateway".
"64.90.32.37 apt-mirror.front.sepia.ceph.com" has been added to /etc/hosts.
Have the resources been removed?


Thanks and Regards,
WangSongbo


Re: Adding Data-At-Rest compression support to Ceph

2015-09-23 Thread Igor Fedotov



On 23.09.2015 17:05, Gregory Farnum wrote:

On Wed, Sep 23, 2015 at 6:15 AM, Sage Weil  wrote:

On Wed, 23 Sep 2015, Igor Fedotov wrote:

Hi Sage,
thanks a lot for your feedback.

Regarding issues with offset mapping and stripe size exposure.
What's about the idea to apply compression in two-tier (cache+backing storage)
model only ?

I'm not sure we win anything by making it a two-tier only thing... simply
making it a feature of the EC pool means we can also address EC pool users
like radosgw.


I doubt single-tier one is widely used for EC pools since there is no random
write support in such mode. Thus this might be an acceptable limitation.
At the same time it seems that appends caused by cached object flush have
fixed block size (8Mb by default). And object is totally rewritten on the next
flush if any. This makes offset mapping less tricky.
Decompression should be applied in any model though as cache tier shutdown and
subsequent compressed data access is possibly  a valid use case.

Yeah, we need to handle random reads either way, so I think the offset
mapping is going to be needed anyway.

The idea of making the primary responsible for object compression
really concerns me. It means for instance that a single random access
will likely require access to multiple objects, and breaks many of the
optimizations we have right now or in the pipeline (for instance:
direct client access).
Could you please elaborate on why access to multiple objects is required for 
a single random access?
In my opinion we need to access exactly the same object set as 
before: in an EC pool each appended block is split into multiple shards 
that go to their respective OSDs. In the general case one has to retrieve a set of 
adjacent shards from several OSDs for a single read request. In the case of 
compression the only difference is the data range that the compressed shard 
set occupies, i.e. we simply need to translate the requested data range to the 
actually stored one and retrieve that data from the OSDs. What am I missing?

And apparently only the EC pool will support
compression, which is frustrating for all the replicated pool users
out there...
In my opinion replicated pool users should consider EC pools first 
if they care about space savings; they automatically gain 50% space 
savings that way. Compression brings even more savings, but that's rather 
the second step along this path.

Is there some reason we don't just want to apply encryption across an
OSD store? Perhaps doing it on the filesystem level is the wrong way
(for reasons named above) but there are other mechanisms like inline
block device compression that I think are supposed to work pretty
well.
If I understand the idea of inline block device compression correctly, it 
has some drawbacks similar to the FS compression approach. A few to mention:
* Less flexibility - per-device compression only, no way to have 
per-pool compression, and no control over the compression process.
* Potentially higher overhead when operating - there is no way to bypass 
processing of non-compressible data, e.g. shards holding erasure codes.
* Potentially higher overhead for recovery on OSD death - one needs to 
decompress data at the working OSDs and compress it at the new OSD. That's not 
necessary if compression takes place prior to EC though.

The only thing that doesn't get us that I can see mentioned here
is the over-the-wire compression — and Haomai already has patches for
that, which should be a lot easier to validate and will work at all
levels of the stack!
-Greg




Re: Adding Data-At-Rest compression support to Ceph

2015-09-23 Thread Gregory Farnum
On Wed, Sep 23, 2015 at 6:15 AM, Sage Weil  wrote:
> On Wed, 23 Sep 2015, Igor Fedotov wrote:
>> Hi Sage,
>> thanks a lot for your feedback.
>>
>> Regarding issues with offset mapping and stripe size exposure.
>> What's about the idea to apply compression in two-tier (cache+backing 
>> storage)
>> model only ?
>
> I'm not sure we win anything by making it a two-tier only thing... simply
> making it a feature of the EC pool means we can also address EC pool users
> like radosgw.
>
>> I doubt single-tier one is widely used for EC pools since there is no random
>> write support in such mode. Thus this might be an acceptable limitation.
>> At the same time it seems that appends caused by cached object flush have
>> fixed block size (8Mb by default). And object is totally rewritten on the 
>> next
>> flush if any. This makes offset mapping less tricky.
>> Decompression should be applied in any model though as cache tier shutdown 
>> and
>> subsequent compressed data access is possibly  a valid use case.
>
> Yeah, we need to handle random reads either way, so I think the offset
> mapping is going to be needed anyway.

The idea of making the primary responsible for object compression
really concerns me. It means for instance that a single random access
will likely require access to multiple objects, and breaks many of the
optimizations we have right now or in the pipeline (for instance:
direct client access). And apparently only the EC pool will support
compression, which is frustrating for all the replicated pool users
out there...

Is there some reason we don't just want to apply compression across an 
OSD store? Perhaps doing it on the filesystem level is the wrong way
(for reasons named above) but there are other mechanisms like inline
block device compression that I think are supposed to work pretty
well. The only thing that doesn't get us that I can see mentioned here
is the over-the-wire compression — and Haomai already has patches for
that, which should be a lot easier to validate and will work at all
levels of the stack!
-Greg


Re: failed to open http://apt-mirror.front.sepia.ceph.com

2015-09-23 Thread Loic Dachary


On 23/09/2015 15:11, Sage Weil wrote:
> On Wed, 23 Sep 2015, Loic Dachary wrote:
>> Hi,
>>
>> On 23/09/2015 12:29, wangsongbo wrote:
>>> 64.90.32.37 apt-mirror.front.sepia.ceph.com
>>
>> It works for me. Could you send a traceroute 
>> apt-mirror.front.sepia.ceph.com ?
> 
> This is a private IP internal to the sepia lab.  Anythign outside the lab 
> shouldn't be using it...

This is the public facing IP and is required for teuthology to run outside of 
the lab (http://tracker.ceph.com/issues/12212).

64.90.32.37 apt-mirror.front.sepia.ceph.com

suggests the workaround was used. And a traceroute will confirm if the 
resolution happens as expected (with the public IP) or with a private IP 
(meaning the workaround is not in place where it should be).

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre





09/23/2015 Weekly Ceph Performance Meeting IS ON!

2015-09-23 Thread Mark Nelson
8AM PST as usual! Discussion topics include an update on transparent 
huge pages testing and I think Ben would like to talk a bit about CBT 
PRs.  Please feel free to add your own!


Here's the links:

Etherpad URL:
http://pad.ceph.com/p/performance_weekly

To join the Meeting:
https://bluejeans.com/268261044

To join via Browser:
https://bluejeans.com/268261044/browser

To join with Lync:
https://bluejeans.com/268261044/lync


To join via Room System:
Video Conferencing System: bjn.vc -or- 199.48.152.152
Meeting ID: 268261044

To join via Phone:
1) Dial:
  +1 408 740 7256
  +1 888 240 2560(US Toll Free)
  +1 408 317 9253(Alternate Number)
  (see all numbers - http://bluejeans.com/numbers)
2) Enter Conference ID: 268261044

Mark


Re: Adding Data-At-Rest compression support to Ceph

2015-09-23 Thread Igor Fedotov

Sage,

so you are saying that radosgw tends to use EC pools directly without 
caching, right?


I agree that we need offset mapping anyway.

And the difference between cache writes and direct writes is mainly in 
block size granularity: 8 MB vs. 4 KB. In the latter case we have higher 
overhead for both offset mapping and compression. But I agree - no real 
difference from an implementation point of view.

OK, let's try to handle both use cases.

So what do you think - can we proceed with implementing this feature, or do 
we need more discussion on it?


Thanks,
Igor.

On 23.09.2015 16:15, Sage Weil wrote:

On Wed, 23 Sep 2015, Igor Fedotov wrote:

Hi Sage,
thanks a lot for your feedback.

Regarding issues with offset mapping and stripe size exposure.
What's about the idea to apply compression in two-tier (cache+backing storage)
model only ?

I'm not sure we win anything by making it a two-tier only thing... simply
making it a feature of the EC pool means we can also address EC pool users
like radosgw.


I doubt single-tier one is widely used for EC pools since there is no random
write support in such mode. Thus this might be an acceptable limitation.
At the same time it seems that appends caused by cached object flush have
fixed block size (8Mb by default). And object is totally rewritten on the next
flush if any. This makes offset mapping less tricky.
Decompression should be applied in any model though as cache tier shutdown and
subsequent compressed data access is possibly  a valid use case.

Yeah, we need to handle random reads either way, so I think the offset
mapping is going to be needed anyway.  And I don't think there is any
real difference from teh EC pool's perspective between a direct user
like radosgw and the cache tier writing objects--in both cases it's
doing appends and deletes.

sage



Thanks,
Igor

On 22.09.2015 22:11, Sage Weil wrote:

On Tue, 22 Sep 2015, Igor Fedotov wrote:

Hi guys,

I can find some talks about adding compression support to Ceph. Let me
share
some thoughts and proposals on that too.

First of all I'd like to consider several major implementation options
separately. IMHO this makes sense since they have different applicability,
value and implementation specifics. Besides that less parts are easier for
both understanding and implementation.

* Data-At-Rest Compression. This is about compressing basic data volume
kept
by the Ceph backing tier. The main reason for that is data store costs
reduction. One can find similar approach introduced by Erasure Coding Pool
implementation - cluster capacity increases (i.e. storage cost reduces) at
the
expense of additional computations. This is especially effective when
combined
with the high-performance cache tier.
*  Intermediate Data Compression. This case is about applying
compression
for intermediate data like system journals, caches etc. The intention is
to
improve expensive storage resource  utilization (e.g. solid state drives
or
RAM ). At the same time the idea to apply compression ( feature that
undoubtedly introduces additional overhead ) to the crucial heavy-duty
components probably looks contradictory.
*  Exchange Data Compression. This one to be applied to messages
transported
between client and storage cluster components as well as internal cluster
traffic. The rationale for that might be the desire to improve cluster
run-time characteristics, e.g. limited data bandwidth caused by the
network or
storage devices throughput. The potential drawback is client overburdening
-
client computation resources might become a bottleneck since they take
most of
compression/decompression tasks.

Obviously it would be great to have support for all the above cases, e.g.
object compression takes place at the client and cluster components handle
that naturally during the object life-cycle. Unfortunately significant
complexities arise on this way. Most of them are related to partial object
access, both reading and writing. It looks like huge development (
redesigning, refactoring and new code development ) and testing efforts
are
required on this way. It's hard to estimate the value of such aggregated
support at the current moment too.
Thus the approach I'm suggesting is to drive the progress eventually and
consider cases separately. At the moment my proposal is to add
Data-At-Rest
compression to Erasure Coded pools as the most definite one from both
implementation and value points of view.

How we can do that.

Ceph Cluster Architecture suggests two-tier storage model for production
usage. Cache tier built on high-performance expensive storage devices
provides
performance. Storage tier with low-cost less-efficient devices provides
cost-effectiveness and capacity. Cache tier is supposed to use ordinary
data
replication while storage one can use erasure coding (EC) for effective
and
reliable data keeping. EC provides less store costs with the same
reliability
comparing to data replication approach at the expenses of additional

Re: Adding Data-At-Rest compression support to Ceph

2015-09-23 Thread Sage Weil
On Wed, 23 Sep 2015, Igor Fedotov wrote:
> Sage,
> 
> so you are saying that radosgw tend to use EC pools directly without caching,
> right?
> 
> I agree that we need offset mapping anyway.
> 
> And the difference between cache writes and direct writes is mainly in block
> size granularity: 8 Mb vs. 4 Kb. In the latter case we have higher overhead
> for both offset mapping and compression. But I agree - no real difference from
> implementation point of view.
> OK, let's try to handle both use cases.
> 
> So what do you think - can proceed with this feature implementation or we need
> more discussion on that?

I think we should consider other options before moving forward.

Greg mentions doing this in the fs layer or even devicemapper.  That's 
attractive because it requires no work on our end.

Another option is to do this in the ObjectStore implementation.  It would 
be horribly inefficient to do in all cases, but we could provide a hint 
that all writes to an object will be appends.  This is something that 
NewStore, for example, could probably do without too much trouble.

sage


> 
> Thanks,
> Igor.
> 
> On 23.09.2015 16:15, Sage Weil wrote:
> > On Wed, 23 Sep 2015, Igor Fedotov wrote:
> > > Hi Sage,
> > > thanks a lot for your feedback.
> > > 
> > > Regarding issues with offset mapping and stripe size exposure.
> > > What's about the idea to apply compression in two-tier (cache+backing
> > > storage)
> > > model only ?
> > I'm not sure we win anything by making it a two-tier only thing... simply
> > making it a feature of the EC pool means we can also address EC pool users
> > like radosgw.
> > 
> > > I doubt single-tier one is widely used for EC pools since there is no
> > > random
> > > write support in such mode. Thus this might be an acceptable limitation.
> > > At the same time it seems that appends caused by cached object flush have
> > > fixed block size (8Mb by default). And object is totally rewritten on the
> > > next
> > > flush if any. This makes offset mapping less tricky.
> > > Decompression should be applied in any model though as cache tier shutdown
> > > and
> > > subsequent compressed data access is possibly  a valid use case.
> > Yeah, we need to handle random reads either way, so I think the offset
> > mapping is going to be needed anyway.  And I don't think there is any
> > real difference from teh EC pool's perspective between a direct user
> > like radosgw and the cache tier writing objects--in both cases it's
> > doing appends and deletes.
> > 
> > sage
> > 
> > 
> > > Thanks,
> > > Igor
> > > 
> > > On 22.09.2015 22:11, Sage Weil wrote:
> > > > On Tue, 22 Sep 2015, Igor Fedotov wrote:
> > > > > Hi guys,
> > > > > 
> > > > > I can find some talks about adding compression support to Ceph. Let me
> > > > > share
> > > > > some thoughts and proposals on that too.
> > > > > 
> > > > > First of all I'd like to consider several major implementation options
> > > > > separately. IMHO this makes sense since they have different
> > > > > applicability,
> > > > > value and implementation specifics. Besides that less parts are easier
> > > > > for
> > > > > both understanding and implementation.
> > > > > 
> > > > > * Data-At-Rest Compression. This is about compressing basic data
> > > > > volume
> > > > > kept
> > > > > by the Ceph backing tier. The main reason for that is data store costs
> > > > > reduction. One can find similar approach introduced by Erasure Coding
> > > > > Pool
> > > > > implementation - cluster capacity increases (i.e. storage cost
> > > > > reduces) at
> > > > > the
> > > > > expense of additional computations. This is especially effective when
> > > > > combined
> > > > > with the high-performance cache tier.
> > > > > *  Intermediate Data Compression. This case is about applying
> > > > > compression
> > > > > for intermediate data like system journals, caches etc. The intention
> > > > > is
> > > > > to
> > > > > improve expensive storage resource  utilization (e.g. solid state
> > > > > drives
> > > > > or
> > > > > RAM ). At the same time the idea to apply compression ( feature that
> > > > > undoubtedly introduces additional overhead ) to the crucial heavy-duty
> > > > > components probably looks contradictory.
> > > > > *  Exchange Data Compression. This one to be applied to messages
> > > > > transported
> > > > > between client and storage cluster components as well as internal
> > > > > cluster
> > > > > traffic. The rationale for that might be the desire to improve cluster
> > > > > run-time characteristics, e.g. limited data bandwidth caused by the
> > > > > network or
> > > > > storage devices throughput. The potential drawback is client
> > > > > overburdening
> > > > > -
> > > > > client computation resources might become a bottleneck since they take
> > > > > most of
> > > > > compression/decompression tasks.
> > > > > 
> > > > > Obviously it would be great to have support for all the above cases,
> > > > > e.g.
> > > > > object 

Re: 09/23/2015 Weekly Ceph Performance Meeting IS ON!

2015-09-23 Thread Alexandre DERUMIER
Hi Mark,

Can you post the video recordings of the previous meetings?

Thanks

Alexandre


- Original Message -
From: "Mark Nelson" 
To: "ceph-devel" 
Sent: Wednesday, 23 September 2015 15:51:21
Subject: 09/23/2015 Weekly Ceph Performance Meeting IS ON!

8AM PST as usual! Discussion topics include an update on transparent 
huge pages testing and I think Ben would like to talk a bit about CBT 
PRs. Please feel free to add your own! 

Here's the links: 

Etherpad URL: 
http://pad.ceph.com/p/performance_weekly 

To join the Meeting: 
https://bluejeans.com/268261044 

To join via Browser: 
https://bluejeans.com/268261044/browser 

To join with Lync: 
https://bluejeans.com/268261044/lync 


To join via Room System: 
Video Conferencing System: bjn.vc -or- 199.48.152.152 
Meeting ID: 268261044 

To join via Phone: 
1) Dial: 
+1 408 740 7256 
+1 888 240 2560(US Toll Free) 
+1 408 317 9253(Alternate Number) 
(see all numbers - http://bluejeans.com/numbers) 
2) Enter Conference ID: 268261044 

Mark 


Re: Adding Data-At-Rest compression support to Ceph

2015-09-23 Thread Samuel Just
I think before moving forward with any sort of implementation, the
design would need to be pretty much completely mapped out --
particularly how the offset mapping will be handled and stored.  The
right thing to do would be to produce a blueprint and submit it to the
list.  I also would vastly prefer to do it on the client side if
possible.  Certainly, radosgw could do the compression just as easily
as the osds (except for the load on the radosgw heads, I suppose).
-Sam

On Wed, Sep 23, 2015 at 8:26 AM, Igor Fedotov  wrote:
>
>
> On 23.09.2015 17:05, Gregory Farnum wrote:
>>
>> On Wed, Sep 23, 2015 at 6:15 AM, Sage Weil  wrote:
>>>
>>> On Wed, 23 Sep 2015, Igor Fedotov wrote:

 Hi Sage,
 thanks a lot for your feedback.

 Regarding issues with offset mapping and stripe size exposure.
 What's about the idea to apply compression in two-tier (cache+backing
 storage)
 model only ?
>>>
>>> I'm not sure we win anything by making it a two-tier only thing... simply
>>> making it a feature of the EC pool means we can also address EC pool
>>> users
>>> like radosgw.
>>>
 I doubt single-tier one is widely used for EC pools since there is no
 random
 write support in such mode. Thus this might be an acceptable limitation.
 At the same time it seems that appends caused by cached object flush
 have
 fixed block size (8Mb by default). And object is totally rewritten on
 the next
 flush if any. This makes offset mapping less tricky.
 Decompression should be applied in any model though as cache tier
 shutdown and
 subsequent compressed data access is possibly  a valid use case.
>>>
>>> Yeah, we need to handle random reads either way, so I think the offset
>>> mapping is going to be needed anyway.
>>
>> The idea of making the primary responsible for object compression
>> really concerns me. It means for instance that a single random access
>> will likely require access to multiple objects, and breaks many of the
>> optimizations we have right now or in the pipeline (for instance:
>> direct client access).
>
> Could you please elaborate why multiple objects access is required on single
> random access?
> In my opinion we need to access absolutely the same object set as before: in
> EC pool each appended block is spitted into multiple shards that go to
> respective OSDs. In general case one has to retrieve a set of adjacent
> shards from several OSDs on single read request. In case of compression the
> only difference is in data range that compressed shard set occupy. I.e. we
> simply need to translate requested data range to the actually stored one and
> retrieve that data from OSDs. What's missed?
>>
>> And apparently only the EC pool will support
>> compression, which is frustrating for all the replicated pool users
>> out there...
>
> In my opinion  replicated pool users should consider EC pool usage first if
> they care about space saving. They automatically gain 50% space saving this
> way. Compression brings even more saving but that's rather the second step
> on this way.
>>
>> Is there some reason we don't just want to apply encryption across an
>> OSD store? Perhaps doing it on the filesystem level is the wrong way
>> (for reasons named above) but there are other mechanisms like inline
>> block device compression that I think are supposed to work pretty
>> well.
>
> If I understand the idea of inline block device compression correctly it has
> some of drawbacks similar to FS compression approach. Ones to mention:
> * Less flexibility - per device compression only, no way to have per-pool
> compression. No control on the compression process.
> * Potentially higher overhead when operating- There is no way to bypass
> non-compressible data processing, e.g. shards with Erasure codes.
> * Potentially higher overhead for recovery on OSD death - one needs to
> decompress data at working OSDs and compress it at new OSD. That's not
> necessary if compression takes place prior to EC though.
>
>> The only thing that doesn't get us that I can see mentioned here
>> is the over-the-wire compression — and Haomai already has patches for
>> that, which should be a lot easier to validate and will work at all
>> levels of the stack!
>> -Greg
>
>


Re: failed to open http://apt-mirror.front.sepia.ceph.com

2015-09-23 Thread wangsongbo

Sage and Loic,
Thanks for your reply.
I am running teuthology in our testing. I can send a traceroute to 
64.90.32.37, but when ceph-cm-ansible runs the "yum-complete-transaction 
--cleanup-only" command, it gets this response:
"http://apt-mirror.front.sepia.ceph.com/misc-rpms/repodata/repomd.xml: 
[Errno 14] PYCURL ERROR 7 - "Failed connect to 
apt-mirror.front.sepia.ceph.com:80; Connection timed out""
When I replace "apt-mirror.front.sepia.ceph.com" with "64.90.32.37" in the 
repo file and then run "yum-complete-transaction --cleanup-only", 
I get a response like this:
"http://64.90.32.37/misc-rpms/repodata/repomd.xml: [Errno 14] 
PYCURL ERROR 22 - "The requested URL returned error: 502 Bad Gateway""

I do not know whether this is related to last week's attack.

Thanks and Regards,
WangSongbo

On 15/9/23 11:22 PM, Loic Dachary wrote:


On 23/09/2015 15:11, Sage Weil wrote:

On Wed, 23 Sep 2015, Loic Dachary wrote:

Hi,

On 23/09/2015 12:29, wangsongbo wrote:

64.90.32.37 apt-mirror.front.sepia.ceph.com

It works for me. Could you send a traceroute
apt-mirror.front.sepia.ceph.com ?

This is a private IP internal to the sepia lab.  Anythign outside the lab
shouldn't be using it...

This is the public facing IP and is required for teuthology to run outside of 
the lab (http://tracker.ceph.com/issues/12212).

64.90.32.37 apt-mirror.front.sepia.ceph.com

suggests the workaround was used. And a traceroute will confirm if the 
resolution happens as expected (with the public IP) or with a private IP 
(meaning the workaround is not in place where it should).

Cheers





perf counters from a performance discrepancy

2015-09-23 Thread Deneau, Tom
Hi all --

Looking for guidance with perf counters...
I am trying to see whether the perf counters can tell me anything about the 
following discrepancy

I populate a number of 40k size objects in each of two pools, poolA and poolB.
Both pools cover osds on a single node, 5 osds total.

   * Config 1 (1p): 
  * use single rados bench client with 32 threads to do seq read of 2 
objects from poolA.

   * Config 2 (2p):
  * use two concurrent rados bench clients (running on same client node) 
with 16 threads each,
   one reading 1 objects from poolA,
   one reading 1 objects from poolB,

So in both configs, we have 32 threads total and the number of objects read is 
the same.
Note: in all cases, we drop the caches before doing the seq reads

The combined bandwidth (MB/sec) for the 2 clients in config 2 is about 1/3 of 
the bandwidth for
the single client in config 1.


I gathered perf counters before and after each run and looked at the difference 
of
the before and after counters for both the 1p and 2p cases.  Here are some 
things I noticed
that are different between the two runs.  Can someone take a look and let me 
know
whether any of these differences are significant?  I am asking in particular about the
throttle-msgr_dispatch_throttler ones, since I don't know the detailed 
definitions of those fields.
Note: these are the numbers for one of the 5 osds, the other osds are similar...
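
For anyone who wants to reproduce the collection, this is roughly one way to 
grab the dumps on an OSD node (assuming the default admin socket paths):

    for sock in /var/run/ceph/ceph-osd.*.asok; do
        id=$(basename "$sock" .asok)
        ceph --admin-daemon "$sock" perf dump > "${id}-before.json"
    done
    # ... drop caches, run the rados bench seq reads ...
    # then repeat with ${id}-after.json and diff the two dumps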

* The field osd/loadavg is always about 3 times higher on the 2p c

some latency-related counters
--
osd/op_latency/sum 1p=6.24801117205061, 2p=579.722513078945
osd/op_process_latency/sum 1p=3.48506945394911, 2p=42.6278494549915
osd/op_r_latency/sum 1p=6.2480111719924, 2p=579.722513079003
osd/op_r_process_latency/sum 1p=3.48506945399276, 2p=42.6278494550061


and some throttle-msgr_dispatch_throttler related counters
--
throttle-msgr_dispatch_throttler-client/get 1p=1337, 2p=1339, diff=2
throttle-msgr_dispatch_throttler-client/get_sum 1p=222877, 2p=223088, diff=211
throttle-msgr_dispatch_throttler-client/put 1p=1337, 2p=1339, diff=2
throttle-msgr_dispatch_throttler-client/put_sum 1p=222877, 2p=223088, diff=211
throttle-msgr_dispatch_throttler-hb_back_server/get 1p=58, 2p=134, diff=76
throttle-msgr_dispatch_throttler-hb_back_server/get_sum 1p=2726, 2p=6298, 
diff=3572
throttle-msgr_dispatch_throttler-hb_back_server/put 1p=58, 2p=134, diff=76
throttle-msgr_dispatch_throttler-hb_back_server/put_sum 1p=2726, 2p=6298, 
diff=3572
throttle-msgr_dispatch_throttler-hb_front_server/get 1p=58, 2p=134, diff=76
throttle-msgr_dispatch_throttler-hb_front_server/get_sum 1p=2726, 2p=6298, 
diff=3572
throttle-msgr_dispatch_throttler-hb_front_server/put 1p=58, 2p=134, diff=76
throttle-msgr_dispatch_throttler-hb_front_server/put_sum 1p=2726, 2p=6298, 
diff=3572
throttle-msgr_dispatch_throttler-hbclient/get 1p=168, 2p=252, diff=84
throttle-msgr_dispatch_throttler-hbclient/get_sum 1p=7896, 2p=11844, diff=3948
throttle-msgr_dispatch_throttler-hbclient/put 1p=168, 2p=252, diff=84
throttle-msgr_dispatch_throttler-hbclient/put_sum 1p=7896, 2p=11844, diff=3948

-- Tom Deneau, AMD



Re: failed to open http://apt-mirror.front.sepia.ceph.com

2015-09-23 Thread Loic Dachary


On 23/09/2015 18:50, wangsongbo wrote:
> Sage and Loic,
> Thanks for your reply.
> I am running teuthology in our testing.I can send a traceroute to 64.90.32.37.
> but when ceph-cm-ansible run the " yum-complete-transaction --cleanup-only" 
> command,
> it got such a response 
> :"http://apt-mirror.front.sepia.ceph.com/misc-rpms/repodata/repomd.xml: 
> [Errno 14] PYCURL ERROR 7 - "Failed connect to 
> apt-mirror.front.sepia.ceph.com:80; Connection timed out"
> And I replace "apt-mirror.front.sepia.ceph.com"  to "64.90.32.37" in repo 
> file, then run "yum-complete-transaction --cleanup-only" command,
> I got a response like this:"http://64.90.32.37/misc-rpms/repodata/repomd.xml: 
> [Errno 14] PYCURL ERROR 22 - "The requested URL returned error: 502 Bad 
> Gateway""
> I do not know whether it was affected by the last week's attack.

Querying the IP directly won't get you where the mirror is (it's a vhost). I 
think ansible fails because it queries the DNS and does not use the entry you 
set in the /etc/hosts file. The OpenStack teuthology backend sets a specific 
entry in the DNS to workaround the problem (see 
https://github.com/ceph/teuthology/blob/master/teuthology/openstack/setup-openstack.sh#L318)
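
If it helps, a quick way to see the difference is to compare the normal
resolver path (which includes /etc/hosts) with a direct DNS lookup -- a rough
sketch, hostname as in this thread (dig comes from bind-utils on CentOS):

  getent hosts apt-mirror.front.sepia.ceph.com   # nsswitch path, includes /etc/hosts
  dig +short apt-mirror.front.sepia.ceph.com     # DNS only, which is what ansible uses

If the second one does not return the address you expect, the workaround is
not in effect for the resolver that matters here.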

Cheers

> 
> Thanks and Regards,
> WangSongbo
> 
> On 15/9/23 11:22 PM, Loic Dachary wrote:
>>
>> On 23/09/2015 15:11, Sage Weil wrote:
>>> On Wed, 23 Sep 2015, Loic Dachary wrote:
 Hi,

 On 23/09/2015 12:29, wangsongbo wrote:
> 64.90.32.37 apt-mirror.front.sepia.ceph.com
 It works for me. Could you send a traceroute
 apt-mirror.front.sepia.ceph.com ?
>>> This is a private IP internal to the sepia lab.  Anythign outside the lab
>>> shouldn't be using it...
>> This is the public facing IP and is required for teuthology to run outside 
>> of the lab (http://tracker.ceph.com/issues/12212).
>>
>> 64.90.32.37 apt-mirror.front.sepia.ceph.com
>>
>> suggests the workaround was used. And a traceroute will confirm if the 
>> resolution happens as expected (with the public IP) or with a private IP 
>> (meaning the workaround is not in place where it should).
>>
>> Cheers
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: [ceph-users] Potential OSD deadlock?

2015-09-23 Thread Mark Nelson
FWIW, we've got some 40GbE Intel cards in the community performance 
cluster on a Mellanox 40GbE switch that appear (knock on wood) to be 
running fine with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from 
Intel that older drivers might cause problems though.
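
If it's useful for comparing, something like this on each node shows which
driver and firmware are loaded (the interface name below is just the one from
our box):

  ethtool -i ens513f1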


Here's ifconfig from one of the nodes:

ens513f1: flags=4163  mtu 1500
inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
RX errors 0  dropped 0  overruns 0  frame 0
TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Mark

On 09/23/2015 01:48 PM, Robert LeBlanc wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

OK, here is the update on the saga...

I traced some more of the blocked I/Os and it seems that communication
between two hosts seemed worse than between the others. I did a two-way ping flood
between the two hosts using max packet sizes (1500). After 1.5M
packets, no lost pings. I then had the ping flood running while I
put Ceph load on the cluster, and the dropped pings started increasing;
after stopping the Ceph workload the pings stopped dropping.

I then ran iperf between all the nodes with the same results, so that
ruled out Ceph to a large degree. I then booted into the
3.10.0-229.14.1.el7.x86_64 kernel, and with an hour of testing so far there
haven't been any dropped pings or blocked I/O. Our 40 Gb NICs really
need the network enhancements in the 4.x series to work well.

Does this sound familiar to anyone? I'll probably start bisecting the
kernel to see where this issue is introduced. Both of the clusters
with this issue are running 4.x; other than that, they are on pretty
different hardware and network configs.

Thanks,
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.1.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
/XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
4OEo
=P33I
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc  wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

This is IPoIB and we have the MTU set to 64K. There was some issues
pinging hosts with "No buffer space available" (hosts are currently
configured for 4GB to test SSD caching rather than page cache). I
found that MTU under 32K worked reliable for ping, but still had the
blocked I/O.

I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
the blocked I/O.
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:

On Tue, 22 Sep 2015, Samuel Just wrote:

I looked at the logs, it looks like there was a 53 second delay
between when osd.17 started sending the osd_repop message and when
osd.13 started reading it, which is pretty weird.  Sage, didn't we
once see a kernel issue which caused some messages to be mysteriously
delayed for many 10s of seconds?


Every time we have seen this behavior and diagnosed it in the wild it has
been a network misconfiguration.  Usually related to jumbo frames.

sage




What kernel are you running?
-Sam

On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

OK, looping in ceph-devel to see if I can get some more eyes. I've
extracted what I think are important entries from the logs for the
first blocked request. NTP is running all the servers so the logs
should be close in terms of time. Logs for 12:50 to 13:00 are
available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz

2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
2015-09-22 

Re: RBD mirroring CLI proposal ...

2015-09-23 Thread Ilya Dryomov
On Wed, Sep 23, 2015 at 4:08 PM, Jason Dillaman  wrote:
>> > In this case the commands look a little confusing to me, as from their
>> > names I would rather think they enable/disable mirror for existent
>> > images too. Also, I don't see a command to check what current
>> > behaviour is. And, I suppose it would be useful if we could configure
>> > other default features for a pool (exclusive-lock, object-map, ...)
>> > Also, I am not sure we should specify  this way, as it is
>> > not consistent with other rbd commands. By default rbd operates on
>> > 'rbd' pool, which can be changed by --pool option. So what do you
>> > think if we have something similar to 'rbd feature' commands?
>> >
>> >   rbd [--pool ] default-feature enable 
>> >   rbd [--pool ] default-feature disable 
>> >   rbd [--pool ] default-feature show []
>> >
>> > (If  is not specified in the last command, all features are
>> > shown).
>>
>> I haven't read the discussion in full, so feel free to ignore, but I'd
>> much rather have rbd create create images with the same set of features
>> enabled no matter which pool it's pointed to.
>>
>> I'm not clear on what exactly a pool policy is.  If it's just a set of
>> feature bits to enable at create time, I don't think it's worth
>> introducing at all.  If it's a set feature bits + a set of mirroring
>> related options, I think it should just be a set of those options.
>> Then, rbd create --enable-mirroring could be a way to create an image
>> with mirroring enabled.
>>
>> My point is a plain old "rbd create foo" shouldn't depend an any new
>> pool-level metadata.  It's not that hard to remember which features you
>> want for which pools and rbd create shortcuts like --enable-object-map
>> and --enable-mirroring would hide feature bit dependencies and save
>> typing.  --enable-mirroring would also serve as a ticket to go look at
>> new metadata and pull out any mirroring related defaults.
>>
>
> While I agree that I don't necessarily want to go down the road of specifying 
> per-pool default features, I think there is a definite use-case need to be 
> able to enable mirroring on a per-pool basis by default.  Without such an 
> ability, taking OpenStack for example, it would require changes to Glance, 
> Cinder, and Nova in order to support DR configurations -- or you could get it 
> automatically with a little pre-configuration.

So a pool policy is just a set of feature bits?

I think Cinder at least creates images with rbd_default_features from
ceph.conf and adds in layering if it's not set, meaning there is no
interface for passing through feature bits (or anything else really,
things like striping options, etc).  What pool-level default feature
bits infrastructure would do is replace a big (cluster-level) hammer
with a smaller (pool-level) hammer.  You'd have to add librbd APIs for
it and someone eventually will try to follow suit and add defaults for
other settings.  You said you weren't attempting to create a mechanism
to specify arbitrary default features for a given pool, but I think it
will come up in the future if we introduce this - it's only logical.

What we might want to do instead is use this mirroring milestone to add
support for a proper key-value interface for passing in features and
other settings for individual rbd images to OpenStack.  I assume it's
all python dicts with OpenStack, so it shouldn't be hard?  I know that
getting patches into OpenStack can be frustrating at times and I might
be underestimating the importance of the use case you have in mind, but
patching our OpenStack drivers rather than adding what essentially is
a workaround to librbd makes a lot more sense to me.

Thanks,

Ilya


Re: RBD mirroring CLI proposal ...

2015-09-23 Thread Jason Dillaman
> So a pool policy is just a set of feature bits?

It would have to store additional details as well.

> I think Cinder at least creates images with rbd_default_features from
> ceph.conf and adds in layering if it's not set, meaning there is no
> interface for passing through feature bits (or anything else really,
> things like striping options, etc).  What pool-level default feature
> bits infrastructure would do is replace a big (cluster-level) hammer
> with a smaller (pool-level) hammer.  You'd have to add librbd APIs for
> it and someone eventually will try to follow suit and add defaults for
> other settings.  You said you weren't attempting to create a mechanism
> to specify arbitrary default features for a given pool, but I think it
> will come up in the future if we introduce this - it's only logical.
> 
> What we might want to do instead is use this mirroring milestone to add
> support for a proper key-value interface for passing in features and
> other settings for individual rbd images to OpenStack.  I assume it's
> all python dicts with OpenStack, so it shouldn't be hard?  I know that
> getting patches into OpenStack can be frustrating at times and I might
> be underestimating the importance of the use case you have in mind, but
> patching our OpenStack drivers rather than adding what essentially is
> a workaround to librbd makes a lot more sense to me.
> 

It would be less work to skip adding the pool-level defaults which is a plus 
given everything else required.  However, putting aside how long it would take 
for the required changes to trickle down from OpenStack, Qemu, etc (since I 
agree that shouldn't drive design), in some ways your proposal could be seen as 
blurring the configuration encapsulation between clients and Ceph. 

Is the goal to configure my storage policies in one place or should I have to 
update all my client configuration settings (not that big of a deal if you are 
using something like Puppet to push down consistent configs across your 
servers)? Trying to think like an end-user, I think I would prefer configuring 
it once within the storage system itself.  I am not familiar with any other 
storage systems that configure mirroring via OpenStack config files, but I 
could be wrong since there are a lot of volume drivers now.

I do like the idea of key/value configuration pairs on image create and I had 
even proposed that a few weeks ago on a separate email thread since we 
shouldn't keep expanding rbd_create/rbd_clone/rbd_copy for every possible 
configuration override.  

--

Jason


Re: [ceph-users] Potential OSD deadlock?

2015-09-23 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

We were able to only get ~17Gb out of the XL710 (heavily tweaked)
until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
seems that there were some major reworks in the network handling in
the kernel to efficiently handle that network rate. If I remember
right we also saw a drop in CPU utilization. I'm starting to think
that we did see packet loss while congesting our ISLs in our initial
testing, but we could not tell where the dropping was happening. We
saw some on the switches, but it didn't seem to be bad if we weren't
trying to congest things. We probably already saw this issue, just
didn't know it.
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
> FWIW, we've got some 40GbE Intel cards in the community performance cluster
> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
> drivers might cause problems though.
>
> Here's ifconfig from one of the nodes:
>
> ens513f1: flags=4163  mtu 1500
> inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
> inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
> ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
> RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
> RX errors 0  dropped 0  overruns 0  frame 0
> TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>
> Mark
>
>
> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>>
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> OK, here is the update on the saga...
>>
>> I traced some more of blocked I/Os and it seems that communication
>> between two hosts seemed worse than others. I did a two way ping flood
>> between the two hosts using max packet sizes (1500). After 1.5M
>> packets, no lost pings. Then then had the ping flood running while I
>> put Ceph load on the cluster and the dropped pings started increasing
>> after stopping the Ceph workload the pings stopped dropping.
>>
>> I then ran iperf between all the nodes with the same results, so that
>> ruled out Ceph to a large degree. I then booted in the the
>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there
>> hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>> need the network enhancements in the 4.x series to work well.
>>
>> Does this sound familiar to anyone? I'll probably start bisecting the
>> kernel to see where this issue in introduced. Both of the clusters
>> with this issue are running 4.x, other than that, they are pretty
>> differing hardware and network configs.
>>
>> Thanks,
>> -BEGIN PGP SIGNATURE-
>> Version: Mailvelope v1.1.0
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>> 4OEo
>> =P33I
>> -END PGP SIGNATURE-
>> 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>> wrote:
>>>
>>> -BEGIN PGP SIGNED MESSAGE-
>>> Hash: SHA256
>>>
>>> This is IPoIB and we have the MTU set to 64K. There was some issues
>>> pinging hosts with "No buffer space available" (hosts are currently
>>> configured for 4GB to test SSD caching rather than page cache). I
>>> found that MTU under 32K worked reliable for ping, but still had the
>>> blocked I/O.
>>>
>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>>> the blocked I/O.
>>> - 
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:

 On Tue, 22 Sep 2015, Samuel Just wrote:
>
> I looked at the logs, it looks like there was a 53 second delay
> between when osd.17 started sending the osd_repop message and when
> osd.13 started reading it, which is pretty weird.  Sage, didn't we
> once see a kernel issue which caused some messages to be mysteriously
> delayed for many 10s of seconds?


 Every 

Re: RBD mirroring CLI proposal ...

2015-09-23 Thread Ilya Dryomov
On Wed, Sep 23, 2015 at 9:28 PM, Jason Dillaman  wrote:
>> So a pool policy is just a set of feature bits?
>
> It would have to store additional details as well.
>
>> I think Cinder at least creates images with rbd_default_features from
>> ceph.conf and adds in layering if it's not set, meaning there is no
>> interface for passing through feature bits (or anything else really,
>> things like striping options, etc).  What pool-level default feature
>> bits infrastructure would do is replace a big (cluster-level) hammer
>> with a smaller (pool-level) hammer.  You'd have to add librbd APIs for
>> it and someone eventually will try to follow suit and add defaults for
>> other settings.  You said you weren't attempting to create a mechanism
>> to specify arbitrary default features for a given pool, but I think it
>> will come up in the future if we introduce this - it's only logical.
>>
>> What we might want to do instead is use this mirroring milestone to add
>> support for a proper key-value interface for passing in features and
>> other settings for individual rbd images to OpenStack.  I assume it's
>> all python dicts with OpenStack, so it shouldn't be hard?  I know that
>> getting patches into OpenStack can be frustrating at times and I might
>> be underestimating the importance of the use case you have in mind, but
>> patching our OpenStack drivers rather than adding what essentially is
>> a workaround to librbd makes a lot more sense to me.
>>
>
> It would be less work to skip adding the pool-level defaults which is a plus 
> given everything else required.  However, putting aside how long it would 
> take for the required changes to trickle down from OpenStack, Qemu, etc 
> (since I agree that shouldn't drive design), in some ways your proposal could 
> be seen as blurring the configuration encapsulation between clients and Ceph.
>
> Is the goal to configure my storage policies in one place or should I have to 
> update all my client configuration settings (not that big of a deal if you 
> are using something like Puppet to push down consistent configs across your 
> servers)? Trying to think like an end-user, I think I would prefer 
> configuring it once within the storage system itself.  I am not familiar with 
> any other storage systems that configure mirroring via OpenStack config 
> files, but I could be wrong since there are a lot of volume drivers now.

I'm not very familiar with OpenStack so I don't know either, I'm just
pointing out that, as far as at least cinder goes, we currently use
a cluster-wide default for something that is inherently a per-image
property, that there is no way to change it, and that there is a way to
configure only a small subset of settings.  I don't see it as blurring
the configuration encapsulation: if a user is creating an image from
OpenStack (or any other client for that matter), they should be able to
specify all the settings they want for a given image and not rely on
cluster-wide or pool-wide defaults.  (Maybe I'm too fixed on this idea
that per-image properties should be per-image and you are trying to
think bigger.  What I'm ranting about here is status quo, mirroring and
the new use cases and configuration challenges it brings along are
somewhat off to the side.)

I'm not against pool-level defaults per se, I just think if we go down
this road it's going to be hard to draw a line in the future, and I want
to make sure we are not adding it just to work around deficiencies in
our OpenStack drivers (and possibly librbd create-like APIs).

Thanks,

Ilya


Re: Adding Data-At-Rest compression support to Ceph

2015-09-23 Thread Gregory Farnum
On Wed, Sep 23, 2015 at 8:26 AM, Igor Fedotov  wrote:
>
>
> On 23.09.2015 17:05, Gregory Farnum wrote:
>>
>> On Wed, Sep 23, 2015 at 6:15 AM, Sage Weil  wrote:
>>>
>>> On Wed, 23 Sep 2015, Igor Fedotov wrote:

 Hi Sage,
 thanks a lot for your feedback.

 Regarding issues with offset mapping and stripe size exposure.
 What's about the idea to apply compression in two-tier (cache+backing
 storage)
 model only ?
>>>
>>> I'm not sure we win anything by making it a two-tier only thing... simply
>>> making it a feature of the EC pool means we can also address EC pool
>>> users
>>> like radosgw.
>>>
 I doubt single-tier one is widely used for EC pools since there is no
 random
 write support in such mode. Thus this might be an acceptable limitation.
 At the same time it seems that appends caused by cached object flush
 have
 fixed block size (8Mb by default). And object is totally rewritten on
 the next
 flush if any. This makes offset mapping less tricky.
 Decompression should be applied in any model though as cache tier
 shutdown and
 subsequent compressed data access is possibly  a valid use case.
>>>
>>> Yeah, we need to handle random reads either way, so I think the offset
>>> mapping is going to be needed anyway.
>>
>> The idea of making the primary responsible for object compression
>> really concerns me. It means for instance that a single random access
>> will likely require access to multiple objects, and breaks many of the
>> optimizations we have right now or in the pipeline (for instance:
>> direct client access).
>
> Could you please elaborate why multiple objects access is required on single
> random access?

It sounds to me like you were planning to take an incoming object
write, compress it, and then chunk it. If you do that, the symbols
("abcdefgh = a", "ijklmnop = b", etc) for the compression are likely
to reside in the first object and need to be fetched for each read in
other objects.

> In my opinion we need to access absolutely the same object set as before: in
> EC pool each appended block is spitted into multiple shards that go to
> respective OSDs. In general case one has to retrieve a set of adjacent
> shards from several OSDs on single read request.

Usually we just need to get the object info from the primary and then
read whichever object has the data for the requested region. If the
region spans a stripe boundary we might need to get two, but often we
don't...

> In case of compression the
> only difference is in data range that compressed shard set occupy. I.e. we
> simply need to translate requested data range to the actually stored one and
> retrieve that data from OSDs. What's missed?
>>
>> And apparently only the EC pool will support
>> compression, which is frustrating for all the replicated pool users
>> out there...
>
> In my opinion  replicated pool users should consider EC pool usage first if
> they care about space saving. They automatically gain 50% space saving this
> way. Compression brings even more saving but that's rather the second step
> on this way.

EC pools have important limitations that replicated pools don't, like
not working for object classes or allowing random overwrites. You can
stick a replicated cache pool in front but that comes with another
whole can of worms. Anybody with a large enough proportion of active
data won't find that solution suitable but might still want to reduce
space required where they can, like with local compression.

>> Is there some reason we don't just want to apply encryption across an
>> OSD store? Perhaps doing it on the filesystem level is the wrong way
>> (for reasons named above) but there are other mechanisms like inline
>> block device compression that I think are supposed to work pretty
>> well.
>
> If I understand the idea of inline block device compression correctly it has
> some of drawbacks similar to FS compression approach. Ones to mention:
> * Less flexibility - per device compression only, no way to have per-pool
> compression. No control on the compression process.

What would the use case be here? I can imagine not wanting to slow
down your cache pools with it or something (although realistically I
don't think that's a concern unless the sheer CPU usage is a problem
with frequent writes), but those would be on separate OSDs/volumes
anyway.
Plus, block device compression is also able to include all the *other*
stuff that doesn't fit inside the object proper (xattrs and omap).

> * Potentially higher overhead when operating- There is no way to bypass
> non-compressible data processing, e.g. shards with Erasure codes.

My information theory intuition has never been very good, but I don't
think the coded chunks are any less compressible than the data they're
coding for, in general...

> * Potentially higher overhead for recovery on OSD death - one needs to
> decompress data at working OSDs and 

Re: perf counters from a performance discrepancy

2015-09-23 Thread Gregory Farnum
On Wed, Sep 23, 2015 at 11:19 AM, Sage Weil  wrote:
> On Wed, 23 Sep 2015, Deneau, Tom wrote:
>> Hi all --
>>
>> Looking for guidance with perf counters...
>> I am trying to see whether the perf counters can tell me anything about the 
>> following discrepancy
>>
>> I populate a number of 40k size objects in each of two pools, poolA and 
>> poolB.
>> Both pools cover osds on a single node, 5 osds total.
>>
>>* Config 1 (1p):
>>   * use single rados bench client with 32 threads to do seq read of 
>> 2 objects from poolA.
>>
>>* Config 2 (2p):
>>   * use two concurrent rados bench clients (running on same client node) 
>> with 16 threads each,
>>one reading 1 objects from poolA,
>>one reading 1 objects from poolB,
>>
>> So in both configs, we have 32 threads total and the number of objects read 
>> is the same.
>> Note: in all cases, we drop the caches before doing the seq reads
>>
>> The combined bandwidth (MB/sec) for the 2 clients in config 2 is about 1/3 
>> of the bandwidth for
>> the single client in config 1.
>
> How were the object written?  I assume the cluster is backed by spinning
> disks?
>
> I wonder if this is a disk layout issue.  If the 20,000 objects are
> written in order, they willb e roughly sequential on disk, and the 32
> thread case will read them in order.  In the 2x 10,000 case, the two
> clients are reading two sequences of objects written at different
> times, and the disk arms will be swinging around more.
>
> My guess is that if the reads were reading the objects in a random order
> the performance would be the same... I'm not sure that rados bench does
> that though?
>
> sage
>
>>
>>
>> I gathered perf counters before and after each run and looked at the 
>> difference of
>> the before and after counters for both the 1p and 2p cases.  Here are some 
>> things I noticed
>> that are different between the two runs.  Can someone take a look and let me 
>> know
>> whether any of these differences are significant.  In particular, for the
>> throttle-msgr_dispatch_throttler ones, since I don't know the detailed 
>> definitions of these fields.
>> Note: these are the numbers for one of the 5 osds, the other osds are 
>> similar...
>>
>> * The field osd/loadavg is always about 3 times higher on the 2p c
>>
>> some latency-related counters
>> --
>> osd/op_latency/sum 1p=6.24801117205061, 2p=579.722513078945
>> osd/op_process_latency/sum 1p=3.48506945394911, 2p=42.6278494549915
>> osd/op_r_latency/sum 1p=6.2480111719924, 2p=579.722513079003
>> osd/op_r_process_latency/sum 1p=3.48506945399276, 2p=42.6278494550061

So, yep, the individual read ops are taking much longer in the
two-client case. Naively that's pretty odd.

>>
>>
>> and some throttle-msgr_dispatch_throttler related counters
>> --
>> throttle-msgr_dispatch_throttler-client/get 1p=1337, 2p=1339, diff=2
>> throttle-msgr_dispatch_throttler-client/get_sum 1p=222877, 2p=223088, 
>> diff=211
>> throttle-msgr_dispatch_throttler-client/put 1p=1337, 2p=1339, diff=2
>> throttle-msgr_dispatch_throttler-client/put_sum 1p=222877, 2p=223088, 
>> diff=211
>> throttle-msgr_dispatch_throttler-hb_back_server/get 1p=58, 2p=134, diff=76
>> throttle-msgr_dispatch_throttler-hb_back_server/get_sum 1p=2726, 2p=6298, 
>> diff=3572
>> throttle-msgr_dispatch_throttler-hb_back_server/put 1p=58, 2p=134, diff=76
>> throttle-msgr_dispatch_throttler-hb_back_server/put_sum 1p=2726, 2p=6298, 
>> diff=3572
>> throttle-msgr_dispatch_throttler-hb_front_server/get 1p=58, 2p=134, diff=76
>> throttle-msgr_dispatch_throttler-hb_front_server/get_sum 1p=2726, 2p=6298, 
>> diff=3572
>> throttle-msgr_dispatch_throttler-hb_front_server/put 1p=58, 2p=134, diff=76
>> throttle-msgr_dispatch_throttler-hb_front_server/put_sum 1p=2726, 2p=6298, 
>> diff=3572
>> throttle-msgr_dispatch_throttler-hbclient/get 1p=168, 2p=252, diff=84
>> throttle-msgr_dispatch_throttler-hbclient/get_sum 1p=7896, 2p=11844, 
>> diff=3948
>> throttle-msgr_dispatch_throttler-hbclient/put 1p=168, 2p=252, diff=84
>> throttle-msgr_dispatch_throttler-hbclient/put_sum 1p=7896, 2p=11844, 
>> diff=3948

IIRC these are just saying how many times the dispatch throttler was
accessed on each messenger — nothing here is surprising, you're doing
basically the same number of messages on the client messengers, and
the heartbeat messengers are passing more because the test takes
longer.
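
For reference, these per-OSD counters come out of the admin socket; assuming
the default socket paths, something like this dumps them before and after
a run:

  ceph daemon osd.0 perf dump
  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump   # equivalent form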

I'd go with Sage's idea for what is actually causing this, or try and
look at how the latency changes over time — if you're going to two
pools instead of one, presumably you're doubling the amount of
metadata that needs to be read into memory during the run? Perhaps
that's just a significant enough effect with your settings that you're
seeing a bunch of extra directory lookups impact your throughput more
than expected... :/
-Greg

Re: perf counters from a performance discrepancy

2015-09-23 Thread Mark Nelson



On 09/23/2015 01:25 PM, Gregory Farnum wrote:

On Wed, Sep 23, 2015 at 11:19 AM, Sage Weil  wrote:

On Wed, 23 Sep 2015, Deneau, Tom wrote:

Hi all --

Looking for guidance with perf counters...
I am trying to see whether the perf counters can tell me anything about the 
following discrepancy

I populate a number of 40k size objects in each of two pools, poolA and poolB.
Both pools cover osds on a single node, 5 osds total.

* Config 1 (1p):
   * use single rados bench client with 32 threads to do seq read of 2 
objects from poolA.

* Config 2 (2p):
   * use two concurrent rados bench clients (running on same client node) 
with 16 threads each,
one reading 1 objects from poolA,
one reading 1 objects from poolB,

So in both configs, we have 32 threads total and the number of objects read is 
the same.
Note: in all cases, we drop the caches before doing the seq reads

The combined bandwidth (MB/sec) for the 2 clients in config 2 is about 1/3 of 
the bandwidth for
the single client in config 1.


How were the objects written?  I assume the cluster is backed by spinning
disks?

I wonder if this is a disk layout issue.  If the 20,000 objects are
written in order, they will be roughly sequential on disk, and the 32
thread case will read them in order.  In the 2x 10,000 case, the two
clients are reading two sequences of objects written at different
times, and the disk arms will be swinging around more.

My guess is that if the reads were reading the objects in a random order
the performance would be the same... I'm not sure that rados bench does
that though?

sage




I gathered perf counters before and after each run and looked at the difference 
of
the before and after counters for both the 1p and 2p cases.  Here are some 
things I noticed
that are different between the two runs.  Can someone take a look and let me 
know
whether any of these differences are significant.  In particular, for the
throttle-msgr_dispatch_throttler ones, since I don't know the detailed 
definitions of these fields.
Note: these are the numbers for one of the 5 osds, the other osds are similar...

* The field osd/loadavg is always about 3 times higher on the 2p c

some latency-related counters
--
osd/op_latency/sum 1p=6.24801117205061, 2p=579.722513078945
osd/op_process_latency/sum 1p=3.48506945394911, 2p=42.6278494549915
osd/op_r_latency/sum 1p=6.2480111719924, 2p=579.722513079003
osd/op_r_process_latency/sum 1p=3.48506945399276, 2p=42.6278494550061


So, yep, the individual read ops are taking much longer in the
two-client case. Naively that's pretty odd.




and some throttle-msgr_dispatch_throttler related counters
--
throttle-msgr_dispatch_throttler-client/get 1p=1337, 2p=1339, diff=2
throttle-msgr_dispatch_throttler-client/get_sum 1p=222877, 2p=223088, diff=211
throttle-msgr_dispatch_throttler-client/put 1p=1337, 2p=1339, diff=2
throttle-msgr_dispatch_throttler-client/put_sum 1p=222877, 2p=223088, diff=211
throttle-msgr_dispatch_throttler-hb_back_server/get 1p=58, 2p=134, diff=76
throttle-msgr_dispatch_throttler-hb_back_server/get_sum 1p=2726, 2p=6298, 
diff=3572
throttle-msgr_dispatch_throttler-hb_back_server/put 1p=58, 2p=134, diff=76
throttle-msgr_dispatch_throttler-hb_back_server/put_sum 1p=2726, 2p=6298, 
diff=3572
throttle-msgr_dispatch_throttler-hb_front_server/get 1p=58, 2p=134, diff=76
throttle-msgr_dispatch_throttler-hb_front_server/get_sum 1p=2726, 2p=6298, 
diff=3572
throttle-msgr_dispatch_throttler-hb_front_server/put 1p=58, 2p=134, diff=76
throttle-msgr_dispatch_throttler-hb_front_server/put_sum 1p=2726, 2p=6298, 
diff=3572
throttle-msgr_dispatch_throttler-hbclient/get 1p=168, 2p=252, diff=84
throttle-msgr_dispatch_throttler-hbclient/get_sum 1p=7896, 2p=11844, diff=3948
throttle-msgr_dispatch_throttler-hbclient/put 1p=168, 2p=252, diff=84
throttle-msgr_dispatch_throttler-hbclient/put_sum 1p=7896, 2p=11844, diff=3948


IIRC these are just saying how many times the dispatch throttler was
accessed on each messenger — nothing here is surprising, you're doing
basically the same number of messages on the client messengers, and
the heartbeat messengers are passing more because the test takes
longer.

I'd go with Sage's idea for what is actually causing this, or try and
look at how the latency changes over time — if you're going to two
pools instead of one, presumably you're doubling the amount of
metadata that needs to be read into memory during the run? Perhaps
that's just a significant enough effect with your settings that you're
seeing a bunch of extra directory lookups impact your throughput more
than expected... :/


FWIW, typically if I've seen an effect, it's been the opposite where 
multiple rados bench processes are slightly faster (maybe simply related 
to the client side implementation).  Running collectl or iostat would 
show various interval statistics 

Re: RBD mirroring CLI proposal ...

2015-09-23 Thread Mykola Golub
On Tue, Sep 22, 2015 at 01:32:49PM -0400, Jason Dillaman wrote:

> > > * rbd mirror pool enable 
> > > This will, by default, ensure that all images created in this
> > > pool have exclusive lock, journaling, and mirroring feature bits
> > > enabled.
> > > 
> > > * rbd mirror pool disable 
> > > This will clear the default image features for new images in this
> > > pool.
> > 
> > Will 'rbd mirror pool enable|disable' change behaviour only for newly
> > created images in the pool or will enable|disable mirroring for
> > existent images too?
>
> Since the goal is to set default pool behavior, it would only apply
> to newly created images.  You can enable/disable on specific images
> using the 'rbd mirror image enable/disable' commands.

In this case the commands look a little confusing to me, as from their
names I would rather think they enable/disable mirroring for existing
images too. Also, I don't see a command to check what the current
behaviour is. And I suppose it would be useful if we could configure
other default features for a pool (exclusive-lock, object-map, ...).
Also, I am not sure we should specify the pool this way, as it is
not consistent with other rbd commands. By default rbd operates on
the 'rbd' pool, which can be changed by the --pool option. So what do you
think if we have something similar to the 'rbd feature' commands?

  rbd [--pool <pool>] default-feature enable <feature>
  rbd [--pool <pool>] default-feature disable <feature>
  rbd [--pool <pool>] default-feature show [<feature>]

(If <feature> is not specified in the last command, all features are
shown).

Similarly, it might be useful to have 'rbd feature show' command:

  rbd feature show <image> [<feature>]

BTW, where do you think these default feature flags will be stored?
Storing in pg_pool_t::flags I suppose is the easiest but it looks like
a layering violation.

-- 
Mykola Golub


Re: RBD mirroring CLI proposal ...

2015-09-23 Thread Mykola Golub
On Wed, Sep 23, 2015 at 09:33:14AM +0300, Mykola Golub wrote:

> Also, I am not sure we should specify  this way, as it is
> not consistent with other rbd commands. By default rbd operates on
> 'rbd' pool, which can be changed by --pool option.

The same reasoning for these commands:

> > * rbd mirror pool add 
> > This will register a remote cluster/pool as a peer to the current,
> > local pool.  All existing mirrored images and all future mirrored
> > images will have this peer registered as a journal client.
> >
> > * rbd mirror pool remove 
> > This will deregister a remote cluster/pool as a peer to the
> > current,
> > local pool.  All existing mirrored images will have the remote
> > deregistered from image journals.

I am not sure we need 'pool' here. I would prefer:

  rbd mirror peer add 
  rbd mirror peer remove 
  rbd mirror peer show

Where the peer specification should not necessarily contain the pool, because
the default could be used, or the one specified via the '--pool' option.

-- 
Mykola Golub


Re: RBD mirroring CLI proposal ...

2015-09-23 Thread Ilya Dryomov
On Wed, Sep 23, 2015 at 9:33 AM, Mykola Golub  wrote:
> On Tue, Sep 22, 2015 at 01:32:49PM -0400, Jason Dillaman wrote:
>
>> > > * rbd mirror pool enable 
>> > > This will, by default, ensure that all images created in this
>> > > pool have exclusive lock, journaling, and mirroring feature bits
>> > > enabled.
>> > >
>> > > * rbd mirror pool disable 
>> > > This will clear the default image features for new images in this
>> > > pool.
>> >
>> > Will 'rbd mirror pool enable|disable' change behaviour only for newly
>> > created images in the pool or will enable|disable mirroring for
>> > existent images too?
>>
>> Since the goal is to set default pool behavior, it would only apply
>> to newly created images.  You can enable/disable on specific images
>> using the 'rbd mirror image enable/disable' commands.
>
> In this case the commands look a little confusing to me, as from their
> names I would rather think they enable/disable mirror for existent
> images too. Also, I don't see a command to check what current
> behaviour is. And, I suppose it would be useful if we could configure
> other default features for a pool (exclusive-lock, object-map, ...)
> Also, I am not sure we should specify  this way, as it is
> not consistent with other rbd commands. By default rbd operates on
> 'rbd' pool, which can be changed by --pool option. So what do you
> think if we have something similar to 'rbd feature' commands?
>
>   rbd [--pool ] default-feature enable 
>   rbd [--pool ] default-feature disable 
>   rbd [--pool ] default-feature show []
>
> (If  is not specified in the last command, all features are
> shown).

I haven't read the discussion in full, so feel free to ignore, but I'd
much rather have rbd create create images with the same set of features
enabled no matter which pool it's pointed to.

I'm not clear on what exactly a pool policy is.  If it's just a set of
feature bits to enable at create time, I don't think it's worth
introducing at all.  If it's a set of feature bits + a set of mirroring
related options, I think it should just be a set of those options.
Then, rbd create --enable-mirroring could be a way to create an image
with mirroring enabled.

My point is a plain old "rbd create foo" shouldn't depend on any new
pool-level metadata.  It's not that hard to remember which features you
want for which pools and rbd create shortcuts like --enable-object-map
and --enable-mirroring would hide feature bit dependencies and save
typing.  --enable-mirroring would also serve as a ticket to go look at
new metadata and pull out any mirroring related defaults.
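
To be clear, --enable-mirroring doesn't exist today; purely as an illustration
of the proposed shortcut (pool/image names are placeholders), the invocation
would look roughly like:

  rbd create --size 10240 --enable-mirroring mypool/myimage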

Thanks,

Ilya


Re: Ceph problem

2015-09-23 Thread zhao.ming...@h3c.com

Dear ceph-devel

The cluster environment is composed of three hosts; each host runs a monitor
process and ten OSD processes.

If one of the hosts in the cluster is restarted, running an 'rbd create……'
command blocks for 120 seconds, whereas normally it blocks for 20 seconds.

While the host is down, the status of its OSDs in the cluster remains "UP".
Tracking the function "OSDMonitor::check_failure", we see that the variable
"osd_xinfo_t.laggy_interval" has a very large value, far more
than 20 seconds, whereas normally it is 0.
So, why does the variable "osd_xinfo_t.laggy_interval" exist?


Looking forward to your reply,Thank you!

-----Original Message-----
From: huang jun [mailto:hjwsm1...@gmail.com]
Sent: 2015-09-23 10:23
To: zhaomingyue 09440 (RD)
Cc: gfar...@redhat.com; ceph-devel@vger.kernel.org
Subject: Re: Ceph problem

you can add debug_rados or debug_ms to the rbd create command to see what happened
during the 120s
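
For example (image and pool names are placeholders), something roughly like:

  rbd create test-image --size 1024 --pool rbd --debug-ms 1 --debug-rados 20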

2015-09-23 9:59 GMT+08:00 zhao.ming...@h3c.com :
>
> ---Original Message---
> From: "陈杰" <276648...@qq.com>
> Sent: 2015-09-21 17:52:19
> To: "ceph-devel@vger.kernel.org";
> Subject: Ceph problem
>
> Dear ceph-devel
>   I'm a software engineer at H3C. We found a Ceph problem in testing and we
> need your help.
>  If one of the hosts in the cluster is restarted, running an 'rbd create……'
> command blocks for 120 seconds, whereas normally it blocks for 20 seconds.
> The cluster environment is composed of three hosts; each host runs a
> monitor process and ten OSD processes
>
> Looking forward to your reply,Thank you!
> --
> ---
> This e-mail and its attachments contain confidential information from 
> H3C, which is intended only for the person or entity whose address is 
> listed above. Any use of the information contained herein in any way 
> (including, but not limited to, total or partial disclosure, 
> reproduction, or dissemination) by persons other than the intended
> recipient(s) is prohibited. If you receive this e-mail in error, 
> please notify the sender by phone or email immediately and delete it!



--
thanks
huangjun


Re: perf counters from a performance discrepancy

2015-09-23 Thread Sage Weil
On Wed, 23 Sep 2015, Deneau, Tom wrote:
> Hi all --
> 
> Looking for guidance with perf counters...
> I am trying to see whether the perf counters can tell me anything about the 
> following discrepancy
> 
> I populate a number of 40k size objects in each of two pools, poolA and poolB.
> Both pools cover osds on a single node, 5 osds total.
> 
>* Config 1 (1p): 
>   * use single rados bench client with 32 threads to do seq read of 2 
> objects from poolA.
> 
>* Config 2 (2p):
>   * use two concurrent rados bench clients (running on same client node) 
> with 16 threads each,
>one reading 1 objects from poolA,
>one reading 1 objects from poolB,
> 
> So in both configs, we have 32 threads total and the number of objects read 
> is the same.
> Note: in all cases, we drop the caches before doing the seq reads
> 
> The combined bandwidth (MB/sec) for the 2 clients in config 2 is about 1/3 of 
> the bandwidth for
> the single client in config 1.

How were the objects written?  I assume the cluster is backed by spinning
disks?

I wonder if this is a disk layout issue.  If the 20,000 objects are 
written in order, they will be roughly sequential on disk, and the 32
thread case will read them in order.  In the 2x 10,000 case, the two 
clients are reading two sequences of objects written at different 
times, and the disk arms will be swinging around more.

My guess is that if the reads were reading the objects in a random order 
the performance would be the same... I'm not sure that rados bench does 
that though?
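
Something along these lines would test that, assuming your rados bench build
has the rand mode and the objects were left in place (--no-cleanup):

  rados -p poolA bench 60 rand -t 32     # 1p-style run
  rados -p poolA bench 60 rand -t 16 &   # 2p-style run, plus the same against poolB
  rados -p poolB bench 60 rand -t 16 &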

sage

> 
> 
> I gathered perf counters before and after each run and looked at the 
> difference of
> the before and after counters for both the 1p and 2p cases.  Here are some 
> things I noticed
> that are different between the two runs.  Can someone take a look and let me 
> know
> whether any of these differences are significant.  In particular, for the
> throttle-msgr_dispatch_throttler ones, since I don't know the detailed 
> definitions of these fields.
> Note: these are the numbers for one of the 5 osds, the other osds are 
> similar...
> 
> * The field osd/loadavg is always about 3 times higher on the 2p c
> 
> some latency-related counters
> --
> osd/op_latency/sum 1p=6.24801117205061, 2p=579.722513078945
> osd/op_process_latency/sum 1p=3.48506945394911, 2p=42.6278494549915
> osd/op_r_latency/sum 1p=6.2480111719924, 2p=579.722513079003
> osd/op_r_process_latency/sum 1p=3.48506945399276, 2p=42.6278494550061
> 
> 
> and some throttle-msgr_dispatch_throttler related counters
> --
> throttle-msgr_dispatch_throttler-client/get 1p=1337, 2p=1339, diff=2
> throttle-msgr_dispatch_throttler-client/get_sum 1p=222877, 2p=223088, diff=211
> throttle-msgr_dispatch_throttler-client/put 1p=1337, 2p=1339, diff=2
> throttle-msgr_dispatch_throttler-client/put_sum 1p=222877, 2p=223088, diff=211
> throttle-msgr_dispatch_throttler-hb_back_server/get 1p=58, 2p=134, diff=76
> throttle-msgr_dispatch_throttler-hb_back_server/get_sum 1p=2726, 2p=6298, 
> diff=3572
> throttle-msgr_dispatch_throttler-hb_back_server/put 1p=58, 2p=134, diff=76
> throttle-msgr_dispatch_throttler-hb_back_server/put_sum 1p=2726, 2p=6298, 
> diff=3572
> throttle-msgr_dispatch_throttler-hb_front_server/get 1p=58, 2p=134, diff=76
> throttle-msgr_dispatch_throttler-hb_front_server/get_sum 1p=2726, 2p=6298, 
> diff=3572
> throttle-msgr_dispatch_throttler-hb_front_server/put 1p=58, 2p=134, diff=76
> throttle-msgr_dispatch_throttler-hb_front_server/put_sum 1p=2726, 2p=6298, 
> diff=3572
> throttle-msgr_dispatch_throttler-hbclient/get 1p=168, 2p=252, diff=84
> throttle-msgr_dispatch_throttler-hbclient/get_sum 1p=7896, 2p=11844, diff=3948
> throttle-msgr_dispatch_throttler-hbclient/put 1p=168, 2p=252, diff=84
> throttle-msgr_dispatch_throttler-hbclient/put_sum 1p=7896, 2p=11844, diff=3948
> 
> -- Tom Deneau, AMD
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


Re: [ceph-users] Potential OSD deadlock?

2015-09-23 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

OK, here is the update on the saga...

I traced some more of the blocked I/Os and it seems that communication
between two hosts seemed worse than between the others. I did a two-way ping flood
between the two hosts using max packet sizes (1500). After 1.5M
packets, no lost pings. I then had the ping flood running while I
put Ceph load on the cluster, and the dropped pings started increasing;
after stopping the Ceph workload the pings stopped dropping.

I then ran iperf between all the nodes with the same results, so that
ruled out Ceph to a large degree. I then booted into the
3.10.0-229.14.1.el7.x86_64 kernel, and with an hour of testing so far there
haven't been any dropped pings or blocked I/O. Our 40 Gb NICs really
need the network enhancements in the 4.x series to work well.
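
For anyone who wants to reproduce, the tests were roughly along these lines
(host names are placeholders and the flood ping needs root):

  ping -f -s 1472 peer-host         # two-way flood at ~1500-byte frames
  iperf -s                          # on one node
  iperf -c peer-host -t 60 -P 8     # from the other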

Does this sound familiar to anyone? I'll probably start bisecting the
kernel to see where this issue is introduced. Both of the clusters
with this issue are running 4.x; other than that, they are on pretty
different hardware and network configs.

Thanks,
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.1.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
/XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
4OEo
=P33I
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> This is IPoIB and we have the MTU set to 64K. There was some issues
> pinging hosts with "No buffer space available" (hosts are currently
> configured for 4GB to test SSD caching rather than page cache). I
> found that MTU under 32K worked reliable for ping, but still had the
> blocked I/O.
>
> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
> the blocked I/O.
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>> On Tue, 22 Sep 2015, Samuel Just wrote:
>>> I looked at the logs, it looks like there was a 53 second delay
>>> between when osd.17 started sending the osd_repop message and when
>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>>> once see a kernel issue which caused some messages to be mysteriously
>>> delayed for many 10s of seconds?
>>
>> Every time we have seen this behavior and diagnosed it in the wild it has
>> been a network misconfiguration.  Usually related to jumbo frames.
>>
>> sage
>>
>>
>>>
>>> What kernel are you running?
>>> -Sam
>>>
>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>>> > -BEGIN PGP SIGNED MESSAGE-
>>> > Hash: SHA256
>>> >
>>> > OK, looping in ceph-devel to see if I can get some more eyes. I've
>>> > extracted what I think are important entries from the logs for the
>>> > first blocked request. NTP is running all the servers so the logs
>>> > should be close in terms of time. Logs for 12:50 to 13:00 are
>>> > available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>>> >
>>> > 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>>> > 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>>> > 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>>> > 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>>> > 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>>> > 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>>> > 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>>> > 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>>> > 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>>> > 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>>> >
>>> > In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>>> > osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
>>> > but for some reason osd.13 doesn't get the message until 53 seconds
>>> > later. osd.17 seems happy to just wait and doesn't resend the data
>>> > (well, I'm not 100% sure how to tell which entries are the actual data
>>> > transfer).
>>> >
>>> > It looks like osd.17 is receiving responses to start the communication
>>> > with osd.13, but the op 

ceph pg query - num_objects_missing_on_primary

2015-09-23 Thread GuangYang
Hello,
Doing a 'ceph pg {id} query' dumps the info from all peers; however, for
every peer it only shows 'num_objects_missing_on_primary', which is the
same across all peers.

Wouldn't it be better to show 'num_objects_missing' for each peer rather than
for the primary?
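
For reference, the command in question (with a placeholder pgid):

  ceph pg 2.1f query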

Thanks,
Guang 

RE: perf counters from a performance discrepancy

2015-09-23 Thread Deneau, Tom
I will be out of office for a week but will put this on the list of things to 
try when I get back.

-- Tom
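
For when I'm back: as I read the suggestion below, the switch is roughly this
in the [global] section of every ceph.conf (the exact option name for the
async messenger may differ on master):

  [global]
  ms type = async        # or "simple" for the baseline run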

> -Original Message-
> From: Samuel Just [mailto:sj...@redhat.com]
> Sent: Wednesday, September 23, 2015 3:28 PM
> To: Deneau, Tom
> Cc: Mark Nelson; Gregory Farnum; Sage Weil; ceph-devel@vger.kernel.org
> Subject: Re: perf counters from a performance discrepancy
> 
> Just to eliminate a variable, can you reproduce this on master, first with
> the simple messenger, and then with the async messenger? (make sure to
> switch the messengers on all daemons and clients, just put it in the
> [global] section on all configs).
> -Sam
> 
> On Wed, Sep 23, 2015 at 1:05 PM, Deneau, Tom  wrote:
> >
> >
> >> -Original Message-
> >> From: Mark Nelson [mailto:mnel...@redhat.com]
> >> Sent: Wednesday, September 23, 2015 1:43 PM
> >> To: Gregory Farnum; Sage Weil
> >> Cc: Deneau, Tom; ceph-devel@vger.kernel.org
> >> Subject: Re: perf counters from a performance discrepancy
> >>
> >>
> >>
> >> On 09/23/2015 01:25 PM, Gregory Farnum wrote:
> >> > On Wed, Sep 23, 2015 at 11:19 AM, Sage Weil 
> wrote:
> >> >> On Wed, 23 Sep 2015, Deneau, Tom wrote:
> >> >>> Hi all --
> >> >>>
> >> >>> Looking for guidance with perf counters...
> >> >>> I am trying to see whether the perf counters can tell me anything
> >> >>> about the following discrepancy
> >> >>>
> >> >>> I populate a number of 40k size objects in each of two pools,
> >> >>> poolA
> >> and poolB.
> >> >>> Both pools cover osds on a single node, 5 osds total.
> >> >>>
> >> >>> * Config 1 (1p):
> >> >>>* use single rados bench client with 32 threads to do seq
> >> >>> read
> >> of 2 objects from poolA.
> >> >>>
> >> >>> * Config 2 (2p):
> >> >>>* use two concurrent rados bench clients (running on same
> >> client node) with 16 threads each,
> >> >>> one reading 1 objects from poolA,
> >> >>> one reading 1 objects from poolB,
> >> >>>
> >> >>> So in both configs, we have 32 threads total and the number of
> >> >>> objects
> >> read is the same.
> >> >>> Note: in all cases, we drop the caches before doing the seq reads
> >> >>>
> >> >>> The combined bandwidth (MB/sec) for the 2 clients in config 2 is
> >> >>> about 1/3 of the bandwidth for the single client in config 1.
> >> >>
> >> >> How were the object written?  I assume the cluster is backed by
> >> >> spinning disks?
> >> >>
> >> >> I wonder if this is a disk layout issue.  If the 20,000 objects
> >> >> are written in order, they will be roughly sequential on disk, and
> >> >> the 32 thread case will read them in order.  In the 2x 10,000
> >> >> case, the two clients are reading two sequences of objects written
> >> >> at different times, and the disk arms will be swinging around more.
> >> >>
> >> >> My guess is that if the reads were reading the objects in a random
> >> >> order the performance would be the same... I'm not sure that rados
> >> >> bench does that though?
> >> >>
> >> >> sage
> >> >>
> >> >>>
> >> >>>
> >> >>> I gathered perf counters before and after each run and looked at
> >> >>> the difference of the before and after counters for both the 1p
> >> >>> and 2p cases.  Here are some things I noticed that are different
> >> >>> between the two runs.  Can someone take a look and let me know
> >> >>> whether any of these differences are significant.  In particular,
> >> >>> for the
> >> throttle-msgr_dispatch_throttler ones, since I don't know the
> >> detailed definitions of these fields.
> >> >>> Note: these are the numbers for one of the 5 osds, the other osds
> >> >>> are
> >> similar...
> >> >>>
> >> >>> * The field osd/loadavg is always about 3 times higher on the 2p
> >> >>> c
> >> >>>
> >> >>> some latency-related counters
> >> >>> --
> >> >>> osd/op_latency/sum 1p=6.24801117205061, 2p=579.722513078945
> >> >>> osd/op_process_latency/sum 1p=3.48506945394911,
> >> >>> 2p=42.6278494549915 osd/op_r_latency/sum 1p=6.2480111719924,
> >> >>> 2p=579.722513079003 osd/op_r_process_latency/sum
> >> >>> 1p=3.48506945399276,
> >> >>> 2p=42.6278494550061
> >> >
> >> > So, yep, the individual read ops are taking much longer in the
> >> > two-client case. Naively that's pretty odd.
> >> >
> >> >>>
> >> >>>
> >> >>> and some throttle-msgr_dispatch_throttler related counters
> >> >>> --
> >> >>> throttle-msgr_dispatch_throttler-client/get 1p=1337, 2p=1339,
> >> >>> diff=2 throttle-msgr_dispatch_throttler-client/get_sum 1p=222877,
> >> >>> 2p=223088, diff=211 throttle-msgr_dispatch_throttler-client/put
> >> >>> 1p=1337, 2p=1339, diff=2
> >> >>> throttle-msgr_dispatch_throttler-client/put_sum 1p=222877,
> >> >>> 2p=223088, diff=211
> >> >>> throttle-msgr_dispatch_throttler-hb_back_server/get 1p=58,
> >> >>> 2p=134,
> >> >>> diff=76 throttle-msgr_dispatch_throttler-hb_back_server/get_sum
> >> >>> 1p=2726, 2p=6298, 

RE: perf counters from a performance discrepancy

2015-09-23 Thread Deneau, Tom


> -Original Message-
> From: Gregory Farnum [mailto:gfar...@redhat.com]
> Sent: Wednesday, September 23, 2015 3:39 PM
> To: Deneau, Tom
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: perf counters from a performance discrepancy
> 
> On Wed, Sep 23, 2015 at 9:33 AM, Deneau, Tom  wrote:
> > Hi all --
> >
> > Looking for guidance with perf counters...
> > I am trying to see whether the perf counters can tell me anything
> > about the following discrepancy
> >
> > I populate a number of 40k size objects in each of two pools, poolA and
> poolB.
> > Both pools cover osds on a single node, 5 osds total.
> >
> >* Config 1 (1p):
> >   * use single rados bench client with 32 threads to do seq read of
> 20000 objects from poolA.
> >
> >* Config 2 (2p):
> >   * use two concurrent rados bench clients (running on same client
> node) with 16 threads each,
> >one reading 10000 objects from poolA,
> >one reading 10000 objects from poolB,
> >
> > So in both configs, we have 32 threads total and the number of objects
> read is the same.
> > Note: in all cases, we drop the caches before doing the seq reads
> >
> > The combined bandwidth (MB/sec) for the 2 clients in config 2 is about
> > 1/3 of the bandwidth for the single client in config 1.
> >
> >
> > I gathered perf counters before and after each run and looked at the
> > difference of the before and after counters for both the 1p and 2p
> > cases.  Here are some things I noticed that are different between the
> > two runs.  Can someone take a look and let me know whether any of
> > these differences are significant.  In particular, for the throttle-
> msgr_dispatch_throttler ones, since I don't know the detailed definitions
> of these fields.
> > Note: these are the numbers for one of the 5 osds, the other osds are
> similar...
> >
> > * The field osd/loadavg is always about 3 times higher on the 2p c
> >
> > some latency-related counters
> > --
> > osd/op_latency/sum 1p=6.24801117205061, 2p=579.722513078945
> > osd/op_process_latency/sum 1p=3.48506945394911, 2p=42.6278494549915
> > osd/op_r_latency/sum 1p=6.2480111719924, 2p=579.722513079003
> > osd/op_r_process_latency/sum 1p=3.48506945399276, 2p=42.6278494550061
> 
> So if you've got 20k objects and 5 OSDs then each OSD is getting ~4k reads
> during this test. Which if I'm reading these properly means OSD-side
> latency is something like 1.5 milliseconds for the single client and...144
> milliseconds for the two-client case! You might try dumping some of the
> historic ops out of the admin socket and seeing where the time is getting
> spent (is it all on disk accesses?). And trying to reproduce something
> like this workload on your disks without Ceph involved.
> -Greg

Greg --

Not sure how much it matters but in looking at the pools more closely I
was getting mixed up with an earlier experiment with pools that just used 5 
osds.
The pools for this example actually distributed across 15 osds on 3 nodes.

What is the recommended command for dumping historic ops out of the admin 
socket?

-- Tom






Re: perf counters from a performance discrepancy

2015-09-23 Thread Samuel Just
Just to eliminate a variable, can you reproduce this on master, first
with the simple messenger, and then with the async messenger? (make
sure to switch the messengers on all daemons and clients, just put it
in the [global] section on all configs).
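
Something like this in ceph.conf is what I mean (a minimal sketch; I'm
assuming the async messenger is selectable via 'ms type' in the branch you
build, and it may also need to be whitelisted as experimental depending on
the branch):

    [global]
        ms type = simple     # first run
        # ms type = async    # second run; same line on mons, osds and clients

Restart all daemons and re-run the clients after flipping it.
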
-Sam

On Wed, Sep 23, 2015 at 1:05 PM, Deneau, Tom  wrote:
>
>
>> -Original Message-
>> From: Mark Nelson [mailto:mnel...@redhat.com]
>> Sent: Wednesday, September 23, 2015 1:43 PM
>> To: Gregory Farnum; Sage Weil
>> Cc: Deneau, Tom; ceph-devel@vger.kernel.org
>> Subject: Re: perf counters from a performance discrepancy
>>
>>
>>
>> On 09/23/2015 01:25 PM, Gregory Farnum wrote:
>> > On Wed, Sep 23, 2015 at 11:19 AM, Sage Weil  wrote:
>> >> On Wed, 23 Sep 2015, Deneau, Tom wrote:
>> >>> Hi all --
>> >>>
>> >>> Looking for guidance with perf counters...
>> >>> I am trying to see whether the perf counters can tell me anything
>> >>> about the following discrepancy
>> >>>
>> >>> I populate a number of 40k size objects in each of two pools, poolA
>> and poolB.
>> >>> Both pools cover osds on a single node, 5 osds total.
>> >>>
>> >>> * Config 1 (1p):
>> >>>* use single rados bench client with 32 threads to do seq read
>> of 20000 objects from poolA.
>> >>>
>> >>> * Config 2 (2p):
>> >>>* use two concurrent rados bench clients (running on same
>> client node) with 16 threads each,
>> >>> one reading 10000 objects from poolA,
>> >>> one reading 10000 objects from poolB,
>> >>>
>> >>> So in both configs, we have 32 threads total and the number of objects
>> read is the same.
>> >>> Note: in all cases, we drop the caches before doing the seq reads
>> >>>
>> >>> The combined bandwidth (MB/sec) for the 2 clients in config 2 is
>> >>> about 1/3 of the bandwidth for the single client in config 1.
>> >>
>> >> How were the object written?  I assume the cluster is backed by
>> >> spinning disks?
>> >>
>> >> I wonder if this is a disk layout issue.  If the 20,000 objects are
>> >> written in order, they will be roughly sequential on disk, and the 32
>> >> thread case will read them in order.  In the 2x 10,000 case, the two
>> >> clients are reading two sequences of objects written at different
>> >> times, and the disk arms will be swinging around more.
>> >>
>> >> My guess is that if the reads were reading the objects in a random
>> >> order the performance would be the same... I'm not sure that rados
>> >> bench does that though?
>> >>
>> >> sage
>> >>
>> >>>
>> >>>
>> >>> I gathered perf counters before and after each run and looked at the
>> >>> difference of the before and after counters for both the 1p and 2p
>> >>> cases.  Here are some things I noticed that are different between
>> >>> the two runs.  Can someone take a look and let me know whether any
>> >>> of these differences are significant.  In particular, for the
>> throttle-msgr_dispatch_throttler ones, since I don't know the detailed
>> definitions of these fields.
>> >>> Note: these are the numbers for one of the 5 osds, the other osds are
>> similar...
>> >>>
>> >>> * The field osd/loadavg is always about 3 times higher on the 2p c
>> >>>
>> >>> some latency-related counters
>> >>> --
>> >>> osd/op_latency/sum 1p=6.24801117205061, 2p=579.722513078945
>> >>> osd/op_process_latency/sum 1p=3.48506945394911, 2p=42.6278494549915
>> >>> osd/op_r_latency/sum 1p=6.2480111719924, 2p=579.722513079003
>> >>> osd/op_r_process_latency/sum 1p=3.48506945399276,
>> >>> 2p=42.6278494550061
>> >
>> > So, yep, the individual read ops are taking much longer in the
>> > two-client case. Naively that's pretty odd.
>> >
>> >>>
>> >>>
>> >>> and some throttle-msgr_dispatch_throttler related counters
>> >>> --
>> >>> throttle-msgr_dispatch_throttler-client/get 1p=1337, 2p=1339, diff=2
>> >>> throttle-msgr_dispatch_throttler-client/get_sum 1p=222877,
>> >>> 2p=223088, diff=211 throttle-msgr_dispatch_throttler-client/put
>> >>> 1p=1337, 2p=1339, diff=2
>> >>> throttle-msgr_dispatch_throttler-client/put_sum 1p=222877,
>> >>> 2p=223088, diff=211
>> >>> throttle-msgr_dispatch_throttler-hb_back_server/get 1p=58, 2p=134,
>> >>> diff=76 throttle-msgr_dispatch_throttler-hb_back_server/get_sum
>> >>> 1p=2726, 2p=6298, diff=3572
>> >>> throttle-msgr_dispatch_throttler-hb_back_server/put 1p=58, 2p=134,
>> >>> diff=76 throttle-msgr_dispatch_throttler-hb_back_server/put_sum
>> >>> 1p=2726, 2p=6298, diff=3572
>> >>> throttle-msgr_dispatch_throttler-hb_front_server/get 1p=58, 2p=134,
>> >>> diff=76 throttle-msgr_dispatch_throttler-hb_front_server/get_sum
>> >>> 1p=2726, 2p=6298, diff=3572
>> >>> throttle-msgr_dispatch_throttler-hb_front_server/put 1p=58, 2p=134,
>> >>> diff=76 throttle-msgr_dispatch_throttler-hb_front_server/put_sum
>> >>> 1p=2726, 2p=6298, diff=3572
>> >>> throttle-msgr_dispatch_throttler-hbclient/get 1p=168, 2p=252,
>> >>> diff=84 

RE: perf counters from a performance discrepancy

2015-09-23 Thread Deneau, Tom


> -Original Message-
> From: Mark Nelson [mailto:mnel...@redhat.com]
> Sent: Wednesday, September 23, 2015 1:43 PM
> To: Gregory Farnum; Sage Weil
> Cc: Deneau, Tom; ceph-devel@vger.kernel.org
> Subject: Re: perf counters from a performance discrepancy
> 
> 
> 
> On 09/23/2015 01:25 PM, Gregory Farnum wrote:
> > On Wed, Sep 23, 2015 at 11:19 AM, Sage Weil  wrote:
> >> On Wed, 23 Sep 2015, Deneau, Tom wrote:
> >>> Hi all --
> >>>
> >>> Looking for guidance with perf counters...
> >>> I am trying to see whether the perf counters can tell me anything
> >>> about the following discrepancy
> >>>
> >>> I populate a number of 40k size objects in each of two pools, poolA
> and poolB.
> >>> Both pools cover osds on a single node, 5 osds total.
> >>>
> >>> * Config 1 (1p):
> >>>* use single rados bench client with 32 threads to do seq read
> of 20000 objects from poolA.
> >>>
> >>> * Config 2 (2p):
> >>>* use two concurrent rados bench clients (running on same
> client node) with 16 threads each,
> >>> one reading 10000 objects from poolA,
> >>> one reading 10000 objects from poolB,
> >>>
> >>> So in both configs, we have 32 threads total and the number of objects
> read is the same.
> >>> Note: in all cases, we drop the caches before doing the seq reads
> >>>
> >>> The combined bandwidth (MB/sec) for the 2 clients in config 2 is
> >>> about 1/3 of the bandwidth for the single client in config 1.
> >>
> >> How were the object written?  I assume the cluster is backed by
> >> spinning disks?
> >>
> >> I wonder if this is a disk layout issue.  If the 20,000 objects are
> >> written in order, they will be roughly sequential on disk, and the 32
> >> thread case will read them in order.  In the 2x 10,000 case, the two
> >> clients are reading two sequences of objects written at different
> >> times, and the disk arms will be swinging around more.
> >>
> >> My guess is that if the reads were reading the objects in a random
> >> order the performance would be the same... I'm not sure that rados
> >> bench does that though?
> >>
> >> sage
> >>
> >>>
> >>>
> >>> I gathered perf counters before and after each run and looked at the
> >>> difference of the before and after counters for both the 1p and 2p
> >>> cases.  Here are some things I noticed that are different between
> >>> the two runs.  Can someone take a look and let me know whether any
> >>> of these differences are significant.  In particular, for the
> throttle-msgr_dispatch_throttler ones, since I don't know the detailed
> definitions of these fields.
> >>> Note: these are the numbers for one of the 5 osds, the other osds are
> similar...
> >>>
> >>> * The field osd/loadavg is always about 3 times higher on the 2p c
> >>>
> >>> some latency-related counters
> >>> --
> >>> osd/op_latency/sum 1p=6.24801117205061, 2p=579.722513078945
> >>> osd/op_process_latency/sum 1p=3.48506945394911, 2p=42.6278494549915
> >>> osd/op_r_latency/sum 1p=6.2480111719924, 2p=579.722513079003
> >>> osd/op_r_process_latency/sum 1p=3.48506945399276,
> >>> 2p=42.6278494550061
> >
> > So, yep, the individual read ops are taking much longer in the
> > two-client case. Naively that's pretty odd.
> >
> >>>
> >>>
> >>> and some throttle-msgr_dispatch_throttler related counters
> >>> --
> >>> throttle-msgr_dispatch_throttler-client/get 1p=1337, 2p=1339, diff=2
> >>> throttle-msgr_dispatch_throttler-client/get_sum 1p=222877,
> >>> 2p=223088, diff=211 throttle-msgr_dispatch_throttler-client/put
> >>> 1p=1337, 2p=1339, diff=2
> >>> throttle-msgr_dispatch_throttler-client/put_sum 1p=222877,
> >>> 2p=223088, diff=211
> >>> throttle-msgr_dispatch_throttler-hb_back_server/get 1p=58, 2p=134,
> >>> diff=76 throttle-msgr_dispatch_throttler-hb_back_server/get_sum
> >>> 1p=2726, 2p=6298, diff=3572
> >>> throttle-msgr_dispatch_throttler-hb_back_server/put 1p=58, 2p=134,
> >>> diff=76 throttle-msgr_dispatch_throttler-hb_back_server/put_sum
> >>> 1p=2726, 2p=6298, diff=3572
> >>> throttle-msgr_dispatch_throttler-hb_front_server/get 1p=58, 2p=134,
> >>> diff=76 throttle-msgr_dispatch_throttler-hb_front_server/get_sum
> >>> 1p=2726, 2p=6298, diff=3572
> >>> throttle-msgr_dispatch_throttler-hb_front_server/put 1p=58, 2p=134,
> >>> diff=76 throttle-msgr_dispatch_throttler-hb_front_server/put_sum
> >>> 1p=2726, 2p=6298, diff=3572
> >>> throttle-msgr_dispatch_throttler-hbclient/get 1p=168, 2p=252,
> >>> diff=84 throttle-msgr_dispatch_throttler-hbclient/get_sum 1p=7896,
> >>> 2p=11844, diff=3948 throttle-msgr_dispatch_throttler-hbclient/put
> >>> 1p=168, 2p=252, diff=84
> >>> throttle-msgr_dispatch_throttler-hbclient/put_sum 1p=7896, 2p=11844,
> >>> diff=3948
> >
> > IIRC these are just saying how many times the dispatch throttler was
> > accessed on each messenger — nothing here is surprising, you're doing
> > basically the same number of messages on the 

async messenger peering hang

2015-09-23 Thread Samuel Just
I'm seeing some rados runs stuck on peering messages not getting sent
by the async messenger: http://tracker.ceph.com/issues/13213.  Can you
take a look?
-Sam
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: perf counters from a performance discrepancy

2015-09-23 Thread Gregory Farnum
On Wed, Sep 23, 2015 at 9:33 AM, Deneau, Tom  wrote:
> Hi all --
>
> Looking for guidance with perf counters...
> I am trying to see whether the perf counters can tell me anything about the 
> following discrepancy
>
> I populate a number of 40k size objects in each of two pools, poolA and poolB.
> Both pools cover osds on a single node, 5 osds total.
>
>* Config 1 (1p):
>   * use single rados bench client with 32 threads to do seq read of 20000
> objects from poolA.
>
>* Config 2 (2p):
>   * use two concurrent rados bench clients (running on same client node) 
> with 16 threads each,
>one reading 10000 objects from poolA,
>one reading 10000 objects from poolB,
>
> So in both configs, we have 32 threads total and the number of objects read 
> is the same.
> Note: in all cases, we drop the caches before doing the seq reads
>
> The combined bandwidth (MB/sec) for the 2 clients in config 2 is about 1/3 of 
> the bandwidth for
> the single client in config 1.
>
>
> I gathered perf counters before and after each run and looked at the 
> difference of
> the before and after counters for both the 1p and 2p cases.  Here are some 
> things I noticed
> that are different between the two runs.  Can someone take a look and let me 
> know
> whether any of these differences are significant.  In particular, for the
> throttle-msgr_dispatch_throttler ones, since I don't know the detailed 
> definitions of these fields.
> Note: these are the numbers for one of the 5 osds, the other osds are 
> similar...
>
> * The field osd/loadavg is always about 3 times higher on the 2p c
>
> some latency-related counters
> --
> osd/op_latency/sum 1p=6.24801117205061, 2p=579.722513078945
> osd/op_process_latency/sum 1p=3.48506945394911, 2p=42.6278494549915
> osd/op_r_latency/sum 1p=6.2480111719924, 2p=579.722513079003
> osd/op_r_process_latency/sum 1p=3.48506945399276, 2p=42.6278494550061

So if you've got 20k objects and 5 OSDs then each OSD is getting ~4k
reads during this test. Which if I'm reading these properly means
OSD-side latency is something like 1.5 milliseconds for the single
client and...144 milliseconds for the two-client case! You might try
dumping some of the historic ops out of the admin socket and seeing
where the time is getting spent (is it all on disk accesses?). And
trying to reproduce something like this workload on your disks without
Ceph involved.
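
As a rough sketch of that kind of Ceph-free check (an fio job file; the
paths, sizes and job names are made up -- lay the two file sets down in
separate passes first so they land on disk at different times, and sync and
drop caches before each read run):

    [global]
    ioengine=libaio
    direct=1
    rw=read
    bs=40k
    size=400m
    [setA]
    directory=/data/test/setA
    numjobs=16
    [setB]
    directory=/data/test/setB
    numjobs=16

For the one-client analogue, drop [setB] and set numjobs=32 on [setA], then
compare aggregate throughput between the two runs.
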
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: perf counters from a performance discrepancy

2015-09-23 Thread Gregory Farnum
On Wed, Sep 23, 2015 at 1:51 PM, Deneau, Tom  wrote:
>
>
>> -Original Message-
>> From: Gregory Farnum [mailto:gfar...@redhat.com]
>> So if you've got 20k objects and 5 OSDs then each OSD is getting ~4k reads
>> during this test. Which if I'm reading these properly means OSD-side
>> latency is something like 1.5 milliseconds for the single client and...144
>> milliseconds for the two-client case! You might try dumping some of the
>> historic ops out of the admin socket and seeing where the time is getting
>> spent (is it all on disk accesses?). And trying to reproduce something
>> like this workload on your disks without Ceph involved.
>> -Greg
>
> Greg --
>
> Not sure how much it matters but in looking at the pools more closely I
> was getting mixed up with an earlier experiment with pools that just used 5 
> osds.
> The pools for this example actually distributed across 15 osds on 3 nodes.

Okay, so this is running on hard drives, not SSDs. The speed
differential is a lot more plausibly in the drive/filesystem/total
layout in that case, then...

>
> What is the recommended command for dumping historic ops out of the admin 
> socket?

"ceph daemon osd. dump_historic_ops", I think. "ceph daemon osd.
help" will include it in the list, though.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: failed to open http://apt-mirror.front.sepia.ceph.com

2015-09-23 Thread wangsongbo
Loic, it's my fault. The DNS server I had set was unreachable. Once I fixed
that, everything works.


Thanks and Regards,
WangSongbo

On 15/9/24 1:01 AM, Loic Dachary wrote:


On 23/09/2015 18:50, wangsongbo wrote:

Sage and Loic,
Thanks for your reply.
I am running teuthology in our testing. I can send a traceroute to 64.90.32.37,
but when ceph-cm-ansible runs the "yum-complete-transaction --cleanup-only"
command, it gets this response:
"http://apt-mirror.front.sepia.ceph.com/misc-rpms/repodata/repomd.xml: [Errno 14] PYCURL
ERROR 7 - "Failed connect to apt-mirror.front.sepia.ceph.com:80; Connection timed
out""
If I replace "apt-mirror.front.sepia.ceph.com" with "64.90.32.37" in the repo
file and run "yum-complete-transaction --cleanup-only" again, I get a response
like this: "http://64.90.32.37/misc-rpms/repodata/repomd.xml: [Errno 14]
PYCURL ERROR 22 - "The requested URL returned error: 502 Bad Gateway""
I do not know whether this is related to last week's attack.

Querying the IP directly won't get you where the mirror is (it's a vhost). I 
think ansible fails because it queries the DNS and does not use the entry you 
set in the /etc/hosts file. The OpenStack teuthology backend sets a specific 
entry in the DNS to workaround the problem (see 
https://github.com/ceph/teuthology/blob/master/teuthology/openstack/setup-openstack.sh#L318)
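
A couple of quick checks along those lines (a sketch; the hostname is the one
from the repo file):

    # what the resolver stack, /etc/hosts included, returns
    getent hosts apt-mirror.front.sepia.ceph.com
    # what the DNS alone returns
    dig +short apt-mirror.front.sepia.ceph.com
    # the raw IP only works when the vhost name travels with the request
    curl -sI -H 'Host: apt-mirror.front.sepia.ceph.com' \
        http://64.90.32.37/misc-rpms/repodata/repomd.xml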

Cheers


Thanks and Regards,
WangSongbo

On 15/9/23 11:22 PM, Loic Dachary wrote:

On 23/09/2015 15:11, Sage Weil wrote:

On Wed, 23 Sep 2015, Loic Dachary wrote:

Hi,

On 23/09/2015 12:29, wangsongbo wrote:

64.90.32.37 apt-mirror.front.sepia.ceph.com

It works for me. Could you send a traceroute
apt-mirror.front.sepia.ceph.com ?

This is a private IP internal to the sepia lab.  Anything outside the lab
shouldn't be using it...

This is the public facing IP and is required for teuthology to run outside of 
the lab (http://tracker.ceph.com/issues/12212).

64.90.32.37 apt-mirror.front.sepia.ceph.com

suggests the workaround was used. And a traceroute will confirm if the 
resolution happens as expected (with the public IP) or with a private IP 
(meaning the workaround is not in place where it should).

Cheers



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ceph-users] Important security noticed regarding release signing key

2015-09-23 Thread wangsongbo

Hi  Ken,
Just now, I ran teuthology-suites in our testing, and it failed because these
packages are missing, for example qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64,
qemu-kvm-tools-0.12.1.2-2.415.el6.3ceph, etc.
The change "rm ceph-extras repository config #137" only removes the repository;
it does not resolve ansible's dependency on these packages.
How can this dependency be resolved?

Thanks and Regards,
WangSongbo


On 15/9/23 10:50 AM, Ken Dreyer wrote:

Hi Songbo, It's been removed from Ansible now:
https://github.com/ceph/ceph-cm-ansible/pull/137

- Ken

On Tue, Sep 22, 2015 at 8:33 PM, wangsongbo  wrote:

Hi Ken,
 Thanks for your reply. But in the ceph-cm-ansible project scheduled by
teuthology, "ceph.com/packages/ceph-extras" is still in use, for packages such
as qemu-kvm-0.12.1.2-2.415.el6.3ceph, qemu-kvm-tools-0.12.1.2-2.415.el6.3ceph,
etc.
 Will any new releases be provided?


On 15/9/22 10:24 PM, Ken Dreyer wrote:

On Tue, Sep 22, 2015 at 2:38 AM, Songbo Wang  wrote:

Hi, all,
  Since last week's attack, "ceph.com/packages/ceph-extras" can no longer
be opened. Where can I get the ceph-extras releases now?

Thanks and Regards,
WangSongbo


The packages in "ceph-extras" were old and subject to CVEs (the big
one being VENOM, CVE-2015-3456). So I don't intend to host ceph-extras
in the new location.

- Ken




--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Very slow recovery/peering with latest master

2015-09-23 Thread Samuel Just
Wow.  Why would that take so long?  I think you are correct that it's
only used for metadata, we could just add a config value to disable
it.
-Sam

On Wed, Sep 23, 2015 at 3:48 PM, Somnath Roy  wrote:
> Sam/Sage,
> I debugged it down and found out that the 
> get_device_by_uuid->blkid_find_dev_with_tag() call within 
> FileStore::collect_metadata() is hanging for ~3 mins before returning a 
> EINVAL. I saw this portion is newly added after hammer.
> Commenting it out resolves the issue. BTW, I saw this value is stored as 
> metadata but not used anywhere , am I missing anything ?
> Here is my Linux details..
>
> root@emsnode5:~/wip-write-path-optimization/src# uname -a
> Linux emsnode5 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8 09:43:57 
> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>
>
> root@emsnode5:~/wip-write-path-optimization/src# lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:Ubuntu 14.04.2 LTS
> Release:14.04
> Codename:   trusty
>
> Thanks & Regards
> Somnath
>
> -Original Message-
> From: Somnath Roy
> Sent: Wednesday, September 16, 2015 2:20 PM
> To: 'Gregory Farnum'
> Cc: 'ceph-devel'
> Subject: RE: Very slow recovery/peering with latest master
>
>
> Sage/Greg,
>
> Yeah, as we expected, it is not happening probably because of recovery 
> settings. I reverted it back in my ceph.conf , but, still seeing this problem.
>
> Some observation :
> --
>
> 1. First of all, I don't think it is something related to my environment. I 
> recreated the cluster with Hammer and this problem is not there.
>
> 2. I have enabled the messenger/monclient log (Couldn't attach here) in one 
> of the OSDs and found monitor is taking long time to detect the up OSDs. If 
> you see the log, I have started OSD at 2015-09-16 16:13:07.042463 , but, 
> there is no communication (only getting KEEP_ALIVE) till 2015-09-16 
> 16:16:07.180482 , so, 3 mins !!
>
> 3. During this period, I saw monclient trying to communicate with monitor but 
> not able to probably. It is sending osd_boot at 2015-09-16 16:16:07.180482 
> only..
>
> 2015-09-16 16:16:07.180450 7f65377fe700 10 monclient: _send_mon_message to 
> mon.a at 10.60.194.10:6789/0
> 2015-09-16 16:16:07.180482 7f65377fe700  1 -- 10.60.194.10:6820/20102 --> 
> 10.60.194.10:6789/0 -- osd_boot(osd.10 booted 0 features 72057594037927935 
> v45) v6 -- ?+0 0x7f6523c19100 con 0x7f6542045680
> 2015-09-16 16:16:07.180496 7f65377fe700 20 -- 10.60.194.10:6820/20102 
> submit_message osd_boot(osd.10 booted 0 features 72057594037927935 v45) v6 
> remote, 10.60.194.10:6789/0, have pipe.
>
> 4. BTW, the osd down scenario is detected very quickly (ceph -w output) , 
> problem is during coming up I guess.
>
>
> So, something related to mon communication getting slower ?
> Let me know if more verbose logging is required and how should I share the 
> log..
>
> Thanks & Regards
> Somnath
>
> -Original Message-
> From: Gregory Farnum [mailto:gfar...@redhat.com]
> Sent: Wednesday, September 16, 2015 11:35 AM
> To: Somnath Roy
> Cc: ceph-devel
> Subject: Re: Very slow recovery/peering with latest master
>
> On Tue, Sep 15, 2015 at 8:04 PM, Somnath Roy  wrote:
>> Hi,
>> I am seeing very slow recovery when I am adding OSDs with the latest master.
>> Also, If I just restart all the OSDs (no IO is going on in the cluster) , 
>> cluster is taking a significant amount of time to reach in active+clean 
>> state (and even detecting all the up OSDs).
>>
>> I saw the recovery/backfill default parameters are now changed (to lower 
>> value) , this probably explains the recovery scenario , but, will it affect 
>> the peering time during OSD startup as well ?
>
> I don't think these values should impact peering time, but you could 
> configure them back to the old defaults and see if it changes.
> -Greg
>
> 
>
> PLEASE NOTE: The information contained in this electronic mail message is 
> intended only for the use of the designated recipient(s) named above. If the 
> reader of this message is not the intended recipient, you are hereby notified 
> that you have received this message in error and that any review, 
> dissemination, distribution, or copying of this message is strictly 
> prohibited. If you have received this communication in error, please notify 
> the sender by telephone or e-mail (as shown above) immediately and destroy 
> any and all copies of this message in your possession (whether hard copies or 
> electronically stored copies).
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Very slow recovery/peering with latest master

2015-09-23 Thread Somnath Roy
I am not sure why it is taking so long. I installed the latest libblkid as
well, but got the same result. Yeah, a config option will be better; I will add
that along with my write-path pull request.
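
Roughly along these lines in ceph.conf, with a placeholder option name until
the pull request settles on one:

    [osd]
        # hypothetical option name; the default would stay true
        filestore collect device partition info = false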

Thanks & Regards
Somnath

-Original Message-
From: Samuel Just [mailto:sj...@redhat.com] 
Sent: Wednesday, September 23, 2015 4:07 PM
To: Somnath Roy
Cc: Samuel Just (sam.j...@inktank.com); Sage Weil (s...@newdream.net); 
ceph-devel
Subject: Re: Very slow recovery/peering with latest master

Wow.  Why would that take so long?  I think you are correct that it's only used 
for metadata, we could just add a config value to disable it.
-Sam

On Wed, Sep 23, 2015 at 3:48 PM, Somnath Roy  wrote:
> Sam/Sage,
> I debugged it down and found out that the 
> get_device_by_uuid->blkid_find_dev_with_tag() call within 
> FileStore::collect_metadata() is hanging for ~3 mins before returning a 
> EINVAL. I saw this portion is newly added after hammer.
> Commenting it out resolves the issue. BTW, I saw this value is stored as 
> metadata but not used anywhere , am I missing anything ?
> Here is my Linux details..
>
> root@emsnode5:~/wip-write-path-optimization/src# uname -a Linux 
> emsnode5 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8 09:43:57 
> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>
>
> root@emsnode5:~/wip-write-path-optimization/src# lsb_release -a No LSB 
> modules are available.
> Distributor ID: Ubuntu
> Description:Ubuntu 14.04.2 LTS
> Release:14.04
> Codename:   trusty
>
> Thanks & Regards
> Somnath
>
> -Original Message-
> From: Somnath Roy
> Sent: Wednesday, September 16, 2015 2:20 PM
> To: 'Gregory Farnum'
> Cc: 'ceph-devel'
> Subject: RE: Very slow recovery/peering with latest master
>
>
> Sage/Greg,
>
> Yeah, as we expected, it is not happening probably because of recovery 
> settings. I reverted it back in my ceph.conf , but, still seeing this problem.
>
> Some observation :
> --
>
> 1. First of all, I don't think it is something related to my environment. I 
> recreated the cluster with Hammer and this problem is not there.
>
> 2. I have enabled the messenger/monclient log (Couldn't attach here) in one 
> of the OSDs and found monitor is taking long time to detect the up OSDs. If 
> you see the log, I have started OSD at 2015-09-16 16:13:07.042463 , but, 
> there is no communication (only getting KEEP_ALIVE) till 2015-09-16 
> 16:16:07.180482 , so, 3 mins !!
>
> 3. During this period, I saw monclient trying to communicate with monitor but 
> not able to probably. It is sending osd_boot at 2015-09-16 16:16:07.180482 
> only..
>
> 2015-09-16 16:16:07.180450 7f65377fe700 10 monclient: 
> _send_mon_message to mon.a at 10.60.194.10:6789/0
> 2015-09-16 16:16:07.180482 7f65377fe700  1 -- 10.60.194.10:6820/20102 
> --> 10.60.194.10:6789/0 -- osd_boot(osd.10 booted 0 features 
> 72057594037927935 v45) v6 -- ?+0 0x7f6523c19100 con 0x7f6542045680
> 2015-09-16 16:16:07.180496 7f65377fe700 20 -- 10.60.194.10:6820/20102 
> submit_message osd_boot(osd.10 booted 0 features 72057594037927935 v45) v6 
> remote, 10.60.194.10:6789/0, have pipe.
>
> 4. BTW, the osd down scenario is detected very quickly (ceph -w output) , 
> problem is during coming up I guess.
>
>
> So, something related to mon communication getting slower ?
> Let me know if more verbose logging is required and how should I share the 
> log..
>
> Thanks & Regards
> Somnath
>
> -Original Message-
> From: Gregory Farnum [mailto:gfar...@redhat.com]
> Sent: Wednesday, September 16, 2015 11:35 AM
> To: Somnath Roy
> Cc: ceph-devel
> Subject: Re: Very slow recovery/peering with latest master
>
> On Tue, Sep 15, 2015 at 8:04 PM, Somnath Roy  wrote:
>> Hi,
>> I am seeing very slow recovery when I am adding OSDs with the latest master.
>> Also, If I just restart all the OSDs (no IO is going on in the cluster) , 
>> cluster is taking a significant amount of time to reach in active+clean 
>> state (and even detecting all the up OSDs).
>>
>> I saw the recovery/backfill default parameters are now changed (to lower 
>> value) , this probably explains the recovery scenario , but, will it affect 
>> the peering time during OSD startup as well ?
>
> I don't think these values should impact peering time, but you could 
> configure them back to the old defaults and see if it changes.
> -Greg
>
> 
>
> PLEASE NOTE: The information contained in this electronic mail message is 
> intended only for the use of the designated recipient(s) named above. If the 
> reader of this message is not the intended recipient, you are hereby notified 
> that you have received this message in error and that any review, 
> dissemination, distribution, or copying of this message is strictly 
> prohibited. If you have received this communication in error, please notify 
> the sender by telephone or e-mail (as shown above) immediately and destroy 
> any and all copies of 

Re: Copyright header

2015-09-23 Thread Handzik, Joe
Yes...HP corporate open source contribution standards require me to submit that 
copyright. Such additions exist all over the place in Linux and OpenStack too.

> On Sep 23, 2015, at 7:30 PM, Somnath Roy  wrote:
> 
> Hi Sage,
> In the latest master, I am seeing a new Copyright header entry for HP in the 
> file Filestore.cc. Is this incidental ?
> 
> * Copyright (c) 2015 Hewlett-Packard Development Company, L.P.
> 
> Thanks & Regards
> Somnath
> 
> 
> 
> 
> PLEASE NOTE: The information contained in this electronic mail message is 
> intended only for the use of the designated recipient(s) named above. If the 
> reader of this message is not the intended recipient, you are hereby notified 
> that you have received this message in error and that any review, 
> dissemination, distribution, or copying of this message is strictly 
> prohibited. If you have received this communication in error, please notify 
> the sender by telephone or e-mail (as shown above) immediately and destroy 
> any and all copies of this message in your possession (whether hard copies or 
> electronically stored copies).
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Very slow recovery/peering with latest master

2015-09-23 Thread Somnath Roy
Sam/Sage,
I debugged it and found that the
get_device_by_uuid->blkid_find_dev_with_tag() call within
FileStore::collect_metadata() hangs for ~3 mins before returning EINVAL.
I see this portion was newly added after hammer.
Commenting it out resolves the issue. BTW, I see this value is stored as
metadata but not used anywhere; am I missing anything?
Here are my Linux details..

root@emsnode5:~/wip-write-path-optimization/src# uname -a
Linux emsnode5 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8 09:43:57 UTC 
2015 x86_64 x86_64 x86_64 GNU/Linux


root@emsnode5:~/wip-write-path-optimization/src# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:Ubuntu 14.04.2 LTS
Release:14.04
Codename:   trusty

Thanks & Regards
Somnath

-Original Message-
From: Somnath Roy
Sent: Wednesday, September 16, 2015 2:20 PM
To: 'Gregory Farnum'
Cc: 'ceph-devel'
Subject: RE: Very slow recovery/peering with latest master


Sage/Greg,

Yeah, as we expected, it is not happening probably because of recovery 
settings. I reverted it back in my ceph.conf , but, still seeing this problem.

Some observation :
--

1. First of all, I don't think it is something related to my environment. I 
recreated the cluster with Hammer and this problem is not there.

2. I have enabled the messenger/monclient log (Couldn't attach here) in one of 
the OSDs and found monitor is taking long time to detect the up OSDs. If you 
see the log, I have started OSD at 2015-09-16 16:13:07.042463 , but, there is 
no communication (only getting KEEP_ALIVE) till 2015-09-16 16:16:07.180482 , 
so, 3 mins !!

3. During this period, I saw monclient trying to communicate with monitor but 
not able to probably. It is sending osd_boot at 2015-09-16 16:16:07.180482 
only..

2015-09-16 16:16:07.180450 7f65377fe700 10 monclient: _send_mon_message to 
mon.a at 10.60.194.10:6789/0
2015-09-16 16:16:07.180482 7f65377fe700  1 -- 10.60.194.10:6820/20102 --> 
10.60.194.10:6789/0 -- osd_boot(osd.10 booted 0 features 72057594037927935 v45) 
v6 -- ?+0 0x7f6523c19100 con 0x7f6542045680
2015-09-16 16:16:07.180496 7f65377fe700 20 -- 10.60.194.10:6820/20102 
submit_message osd_boot(osd.10 booted 0 features 72057594037927935 v45) v6 
remote, 10.60.194.10:6789/0, have pipe.

4. BTW, the osd down scenario is detected very quickly (ceph -w output) , 
problem is during coming up I guess.


So, something related to mon communication getting slower ?
Let me know if more verbose logging is required and how should I share the log..

Thanks & Regards
Somnath

-Original Message-
From: Gregory Farnum [mailto:gfar...@redhat.com]
Sent: Wednesday, September 16, 2015 11:35 AM
To: Somnath Roy
Cc: ceph-devel
Subject: Re: Very slow recovery/peering with latest master

On Tue, Sep 15, 2015 at 8:04 PM, Somnath Roy  wrote:
> Hi,
> I am seeing very slow recovery when I am adding OSDs with the latest master.
> Also, If I just restart all the OSDs (no IO is going on in the cluster) , 
> cluster is taking a significant amount of time to reach in active+clean state 
> (and even detecting all the up OSDs).
>
> I saw the recovery/backfill default parameters are now changed (to lower 
> value) , this probably explains the recovery scenario , but, will it affect 
> the peering time during OSD startup as well ?

I don't think these values should impact peering time, but you could configure 
them back to the old defaults and see if it changes.
-Greg



PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).



RE: Very slow recovery/peering with latest master

2015-09-23 Thread Somnath Roy
 On Sep 23, 2015, at 6:07 PM, Samuel Just  wrote:
> 
> Wow.  Why would that take so long?  I think you are correct that it's 
> only used for metadata, we could just add a config value to disable 
> it.
> -Sam
> 
>> On Wed, Sep 23, 2015 at 3:48 PM, Somnath Roy  wrote:
>> Sam/Sage,
>> I debugged it down and found out that the 
>> get_device_by_uuid->blkid_find_dev_with_tag() call within 
>> FileStore::collect_metadata() is hanging for ~3 mins before returning a 
>> EINVAL. I saw this portion is newly added after hammer.
>> Commenting it out resolves the issue. BTW, I saw this value is stored as 
>> metadata but not used anywhere , am I missing anything ?
>> Here is my Linux details..
>> 
>> root@emsnode5:~/wip-write-path-optimization/src# uname -a Linux 
>> emsnode5 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8 09:43:57 
>> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> 
>> 
>> root@emsnode5:~/wip-write-path-optimization/src# lsb_release -a No 
>> LSB modules are available.
>> Distributor ID: Ubuntu
>> Description:Ubuntu 14.04.2 LTS
>> Release:14.04
>> Codename:   trusty
>> 
>> Thanks & Regards
>> Somnath
>> 
>> -Original Message-
>> From: Somnath Roy
>> Sent: Wednesday, September 16, 2015 2:20 PM
>> To: 'Gregory Farnum'
>> Cc: 'ceph-devel'
>> Subject: RE: Very slow recovery/peering with latest master
>> 
>> 
>> Sage/Greg,
>> 
>> Yeah, as we expected, it is not happening probably because of recovery 
>> settings. I reverted it back in my ceph.conf , but, still seeing this 
>> problem.
>> 
>> Some observation :
>> --
>> 
>> 1. First of all, I don't think it is something related to my environment. I 
>> recreated the cluster with Hammer and this problem is not there.
>> 
>> 2. I have enabled the messenger/monclient log (Couldn't attach here) in one 
>> of the OSDs and found monitor is taking long time to detect the up OSDs. If 
>> you see the log, I have started OSD at 2015-09-16 16:13:07.042463 , but, 
>> there is no communication (only getting KEEP_ALIVE) till 2015-09-16 
>> 16:16:07.180482 , so, 3 mins !!
>> 
>> 3. During this period, I saw monclient trying to communicate with monitor 
>> but not able to probably. It is sending osd_boot at 2015-09-16 
>> 16:16:07.180482 only..
>> 
>> 2015-09-16 16:16:07.180450 7f65377fe700 10 monclient: 
>> _send_mon_message to mon.a at 10.60.194.10:6789/0
>> 2015-09-16 16:16:07.180482 7f65377fe700  1 -- 10.60.194.10:6820/20102 
>> --> 10.60.194.10:6789/0 -- osd_boot(osd.10 booted 0 features 
>> 72057594037927935 v45) v6 -- ?+0 0x7f6523c19100 con 0x7f6542045680
>> 2015-09-16 16:16:07.180496 7f65377fe700 20 -- 10.60.194.10:6820/20102 
>> submit_message osd_boot(osd.10 booted 0 features 72057594037927935 v45) v6 
>> remote, 10.60.194.10:6789/0, have pipe.
>> 
>> 4. BTW, the osd down scenario is detected very quickly (ceph -w output) , 
>> problem is during coming up I guess.
>> 
>> 
>> So, something related to mon communication getting slower ?
>> Let me know if more verbose logging is required and how should I share the 
>> log..
>> 
>> Thanks & Regards
>> Somnath
>> 
>> -Original Message-
>> From: Gregory Farnum [mailto:gfar...@redhat.com]
>> Sent: Wednesday, September 16, 2015 11:35 AM
>> To: Somnath Roy
>> Cc: ceph-devel
>> Subject: Re: Very slow recovery/peering with latest master
>> 
>>> On Tue, Sep 15, 2015 at 8:04 PM, Somnath Roy  
>>> wrote:
>>> Hi,
>>> I am seeing very slow recovery when I am adding OSDs with the latest master.
>>> Also, If I just restart all the OSDs (no IO is going on in the cluster) , 
>>> cluster is taking a significant amount of time to reach in active+clean 
>>> state (and even detecting all the up OSDs).
>>> 
>>> I saw the recovery/backfill default parameters are now changed 

Re: Very slow recovery/peering with latest master

2015-09-23 Thread Handzik, Joe
Ok. When configuring with ceph-disk, it does something nifty and actually gives 
the OSD the uuid of the disk's partition as its fsid. I bootstrap off that to 
get an argument to pass into the function you have identified as the 
bottleneck. I ran it by sage and we both realized there would be cases where it 
wouldn't work...I'm sure neither of us realized the failure would take three 
minutes though. 

In the short term, it makes sense to create an option to disable or 
short-circuit the blkid code. I would prefer that the default be left with the 
code enabled, but I'm open to default disabled if others think this will be a 
widespread problem. You could also make sure your OSD fsids are set to match 
your disk partition uuids for now too, if that's a faster workaround for you 
(it'll get rid of the failure).
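
A quick way to compare the two on one OSD (a sketch; the device, the OSD id
and the default data-dir layout here are assumptions):

    # partition uuid as libblkid sees it
    blkid -o value -s PARTUUID /dev/sdb1
    # fsid the OSD reports for its data dir
    cat /var/lib/ceph/osd/ceph-10/fsid

If they differ, the lookup in collect_metadata() has nothing to find, which
is the failing (and apparently very slow) path.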

Joe

> On Sep 23, 2015, at 6:26 PM, Somnath Roy  wrote:
> 
> < 
> -Original Message-
> From: Handzik, Joe [mailto:joseph.t.hand...@hpe.com] 
> Sent: Wednesday, September 23, 2015 4:20 PM
> To: Samuel Just
> Cc: Somnath Roy; Samuel Just (sam.j...@inktank.com); Sage Weil 
> (s...@newdream.net); ceph-devel
> Subject: Re: Very slow recovery/peering with latest master
> 
> I added that, there is code up the stack in calamari that consumes the path 
> provided, which is intended in the future to facilitate disk monitoring and 
> management.
> 
> [Somnath] Ok
> 
> Somnath, what does your disk configuration look like (filesystem, SSD/HDD, 
> anything else you think could be relevant)? Did you configure your disks with 
> ceph-disk, or by hand? I never saw this while testing my code, has anyone 
> else heard of this behavior on master? The code has been in master for 2-3 
> months now I believe.
> [Somnath] All SSD , I use mkcephfs to create cluster , I partitioned the disk 
> with fdisk beforehand. I am using XFS. Are you trying with Ubuntu 3.16.* 
> kernel ? It could be Linux distribution/kernel specific.
> 
> It would be nice to not need to disable this, but if this behavior exists and 
> can't be explained by a misconfiguration or something else I'll need to 
> figure out a different implementation.
> 
> Joe
> 
>> On Sep 23, 2015, at 6:07 PM, Samuel Just  wrote:
>> 
>> Wow.  Why would that take so long?  I think you are correct that it's 
>> only used for metadata, we could just add a config value to disable 
>> it.
>> -Sam
>> 
>>> On Wed, Sep 23, 2015 at 3:48 PM, Somnath Roy  
>>> wrote:
>>> Sam/Sage,
>>> I debugged it down and found out that the 
>>> get_device_by_uuid->blkid_find_dev_with_tag() call within 
>>> FileStore::collect_metadata() is hanging for ~3 mins before returning a 
>>> EINVAL. I saw this portion is newly added after hammer.
>>> Commenting it out resolves the issue. BTW, I saw this value is stored as 
>>> metadata but not used anywhere , am I missing anything ?
>>> Here is my Linux details..
>>> 
>>> root@emsnode5:~/wip-write-path-optimization/src# uname -a Linux 
>>> emsnode5 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8 09:43:57 
>>> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> 
>>> 
>>> root@emsnode5:~/wip-write-path-optimization/src# lsb_release -a No 
>>> LSB modules are available.
>>> Distributor ID: Ubuntu
>>> Description:Ubuntu 14.04.2 LTS
>>> Release:14.04
>>> Codename:   trusty
>>> 
>>> Thanks & Regards
>>> Somnath
>>> 
>>> -Original Message-
>>> From: Somnath Roy
>>> Sent: Wednesday, September 16, 2015 2:20 PM
>>> To: 'Gregory Farnum'
>>> Cc: 'ceph-devel'
>>> Subject: RE: Very slow recovery/peering with latest master
>>> 
>>> 
>>> Sage/Greg,
>>> 
>>> Yeah, as we expected, it is not happening probably because of recovery 
>>> settings. I reverted it back in my ceph.conf , but, still seeing this 
>>> problem.
>>> 
>>> Some observation :
>>> --
>>> 
>>> 1. First of all, I don't think it is something related to my environment. I 
>>> recreated the cluster with Hammer and this problem is not there.
>>> 
>>> 2. I have enabled the messenger/monclient log (Couldn't attach here) in one 
>>> of the OSDs and found monitor is taking long time to detect the up OSDs. If 
>>> you see the log, I have started OSD at 2015-09-16 16:13:07.042463 , but, 
>>> there is no communication (only getting KEEP_ALIVE) till 2015-09-16 
>>> 16:16:07.180482 , so, 3 mins !!
>>> 
>>> 3. During this period, I saw monclient trying to communicate with monitor 
>>> but not able to probably. It is sending osd_boot at 2015-09-16 
>>> 16:16:07.180482 only..
>>> 
>>> 2015-09-16 16:16:07.180450 7f65377fe700 10 monclient: 
>>> _send_mon_message to mon.a at 10.60.194.10:6789/0
>>> 2015-09-16 16:16:07.180482 7f65377fe700  1 -- 10.60.194.10:6820/20102 
>>> --> 10.60.194.10:6789/0 -- osd_boot(osd.10 booted 0 features 
>>> 72057594037927935 v45) v6 -- ?+0 0x7f6523c19100 con 0x7f6542045680
>>> 2015-09-16 16:16:07.180496 7f65377fe700 20 -- 10.60.194.10:6820/20102 
>>> submit_message osd_boot(osd.10 

Re: Very slow recovery/peering with latest master

2015-09-23 Thread Handzik, Joe
I added that; there is code up the stack in Calamari that consumes the path
provided, which is intended to facilitate disk monitoring and management in the
future.

Somnath, what does your disk configuration look like (filesystem, SSD/HDD, 
anything else you think could be relevant)? Did you configure your disks with 
ceph-disk, or by hand? I never saw this while testing my code, has anyone else 
heard of this behavior on master? The code has been in master for 2-3 months 
now I believe.

It would be nice to not need to disable this, but if this behavior exists and 
can't be explained by a misconfiguration or something else I'll need to figure 
out a different implementation.

Joe

> On Sep 23, 2015, at 6:07 PM, Samuel Just  wrote:
> 
> Wow.  Why would that take so long?  I think you are correct that it's
> only used for metadata, we could just add a config value to disable
> it.
> -Sam
> 
>> On Wed, Sep 23, 2015 at 3:48 PM, Somnath Roy  wrote:
>> Sam/Sage,
>> I debugged it down and found out that the 
>> get_device_by_uuid->blkid_find_dev_with_tag() call within 
>> FileStore::collect_metadata() is hanging for ~3 mins before returning a 
>> EINVAL. I saw this portion is newly added after hammer.
>> Commenting it out resolves the issue. BTW, I saw this value is stored as 
>> metadata but not used anywhere , am I missing anything ?
>> Here is my Linux details..
>> 
>> root@emsnode5:~/wip-write-path-optimization/src# uname -a
>> Linux emsnode5 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8 09:43:57 
>> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> 
>> 
>> root@emsnode5:~/wip-write-path-optimization/src# lsb_release -a
>> No LSB modules are available.
>> Distributor ID: Ubuntu
>> Description:Ubuntu 14.04.2 LTS
>> Release:14.04
>> Codename:   trusty
>> 
>> Thanks & Regards
>> Somnath
>> 
>> -Original Message-
>> From: Somnath Roy
>> Sent: Wednesday, September 16, 2015 2:20 PM
>> To: 'Gregory Farnum'
>> Cc: 'ceph-devel'
>> Subject: RE: Very slow recovery/peering with latest master
>> 
>> 
>> Sage/Greg,
>> 
>> Yeah, as we expected, it is not happening probably because of recovery 
>> settings. I reverted it back in my ceph.conf , but, still seeing this 
>> problem.
>> 
>> Some observation :
>> --
>> 
>> 1. First of all, I don't think it is something related to my environment. I 
>> recreated the cluster with Hammer and this problem is not there.
>> 
>> 2. I have enabled the messenger/monclient log (Couldn't attach here) in one 
>> of the OSDs and found monitor is taking long time to detect the up OSDs. If 
>> you see the log, I have started OSD at 2015-09-16 16:13:07.042463 , but, 
>> there is no communication (only getting KEEP_ALIVE) till 2015-09-16 
>> 16:16:07.180482 , so, 3 mins !!
>> 
>> 3. During this period, I saw monclient trying to communicate with monitor 
>> but not able to probably. It is sending osd_boot at 2015-09-16 
>> 16:16:07.180482 only..
>> 
>> 2015-09-16 16:16:07.180450 7f65377fe700 10 monclient: _send_mon_message to 
>> mon.a at 10.60.194.10:6789/0
>> 2015-09-16 16:16:07.180482 7f65377fe700  1 -- 10.60.194.10:6820/20102 --> 
>> 10.60.194.10:6789/0 -- osd_boot(osd.10 booted 0 features 72057594037927935 
>> v45) v6 -- ?+0 0x7f6523c19100 con 0x7f6542045680
>> 2015-09-16 16:16:07.180496 7f65377fe700 20 -- 10.60.194.10:6820/20102 
>> submit_message osd_boot(osd.10 booted 0 features 72057594037927935 v45) v6 
>> remote, 10.60.194.10:6789/0, have pipe.
>> 
>> 4. BTW, the osd down scenario is detected very quickly (ceph -w output) , 
>> problem is during coming up I guess.
>> 
>> 
>> So, something related to mon communication getting slower ?
>> Let me know if more verbose logging is required and how should I share the 
>> log..
>> 
>> Thanks & Regards
>> Somnath
>> 
>> -Original Message-
>> From: Gregory Farnum [mailto:gfar...@redhat.com]
>> Sent: Wednesday, September 16, 2015 11:35 AM
>> To: Somnath Roy
>> Cc: ceph-devel
>> Subject: Re: Very slow recovery/peering with latest master
>> 
>>> On Tue, Sep 15, 2015 at 8:04 PM, Somnath Roy  
>>> wrote:
>>> Hi,
>>> I am seeing very slow recovery when I am adding OSDs with the latest master.
>>> Also, If I just restart all the OSDs (no IO is going on in the cluster) , 
>>> cluster is taking a significant amount of time to reach in active+clean 
>>> state (and even detecting all the up OSDs).
>>> 
>>> I saw the recovery/backfill default parameters are now changed (to lower 
>>> value) , this probably explains the recovery scenario , but, will it affect 
>>> the peering time during OSD startup as well ?
>> 
>> I don't think these values should impact peering time, but you could 
>> configure them back to the old defaults and see if it changes.
>> -Greg
>> 
>> 
>> 
>> PLEASE NOTE: The information contained in this electronic mail message is 
>> intended only for the use of the designated recipient(s) 

Copyright header

2015-09-23 Thread Somnath Roy
Hi Sage,
In the latest master, I am seeing a new Copyright header entry for HP in the 
file Filestore.cc. Is this incidental ?

* Copyright (c) 2015 Hewlett-Packard Development Company, L.P.

Thanks & Regards
Somnath




PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Very slow recovery/peering with latest master

2015-09-23 Thread Sage Weil
On Wed, 23 Sep 2015, Handzik, Joe wrote:
> Ok. When configuring with ceph-disk, it does something nifty and 
> actually gives the OSD the uuid of the disk's partition as its fsid. I 
> bootstrap off that to get an argument to pass into the function you have 
> identified as the bottleneck. I ran it by sage and we both realized 
> there would be cases where it wouldn't work...I'm sure neither of us 
> realized the failure would take three minutes though.
> 
> In the short term, it makes sense to create an option to disable or 
> short-circuit the blkid code. I would prefer that the default be left 
> with the code enabled, but I'm open to default disabled if others think 
> this will be a widespread problem. You could also make sure your OSD 
> fsids are set to match your disk partition uuids for now too, if that's 
> a faster workaround for you (it'll get rid of the failure).

I think we should try to figure out where it is hanging.  Can you strace 
the blkid process to see what it is up to?
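
Something along these lines, for instance (a sketch; the paths and OSD id are
made up, and it assumes the stall also reproduces when libblkid is driven from
the command line, which may not hold since the OSD calls the library
in-process):

    # time a standalone lookup of the same tag the OSD searches for
    strace -f -tt -T -o /tmp/blkid.trace \
        blkid -t PARTUUID=$(cat /var/lib/ceph/osd/ceph-10/fsid)
    # or attach to the OSD while it sits in collect_metadata()
    strace -f -tt -T -p <ceph-osd pid> -o /tmp/osd.trace

The -tt/-T timestamps should show which syscall eats the three minutes.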

I opened http://tracker.ceph.com/issues/13219

I think as long as it behaves reliably with ceph-disk OSDs then we can 
have it on by default.

sage


> 
> Joe
> 
> > On Sep 23, 2015, at 6:26 PM, Somnath Roy  wrote:
> > 
> > < > 
> > -Original Message-
> > From: Handzik, Joe [mailto:joseph.t.hand...@hpe.com] 
> > Sent: Wednesday, September 23, 2015 4:20 PM
> > To: Samuel Just
> > Cc: Somnath Roy; Samuel Just (sam.j...@inktank.com); Sage Weil 
> > (s...@newdream.net); ceph-devel
> > Subject: Re: Very slow recovery/peering with latest master
> > 
> > I added that, there is code up the stack in calamari that consumes the path 
> > provided, which is intended in the future to facilitate disk monitoring and 
> > management.
> > 
> > [Somnath] Ok
> > 
> > Somnath, what does your disk configuration look like (filesystem, SSD/HDD, 
> > anything else you think could be relevant)? Did you configure your disks 
> > with ceph-disk, or by hand? I never saw this while testing my code, has 
> > anyone else heard of this behavior on master? The code has been in master 
> > for 2-3 months now I believe.
> > [Somnath] All SSD , I use mkcephfs to create cluster , I partitioned the 
> > disk with fdisk beforehand. I am using XFS. Are you trying with Ubuntu 
> > 3.16.* kernel ? It could be Linux distribution/kernel specific.
> > 
> > It would be nice to not need to disable this, but if this behavior exists 
> > and can't be explained by a misconfiguration or something else I'll need to 
> > figure out a different implementation.
> > 
> > Joe
> > 
> >> On Sep 23, 2015, at 6:07 PM, Samuel Just  wrote:
> >> 
> >> Wow.  Why would that take so long?  I think you are correct that it's 
> >> only used for metadata, we could just add a config value to disable 
> >> it.
> >> -Sam
> >> 
> >>> On Wed, Sep 23, 2015 at 3:48 PM, Somnath Roy  
> >>> wrote:
> >>> Sam/Sage,
> >>> I debugged it down and found out that the 
> >>> get_device_by_uuid->blkid_find_dev_with_tag() call within 
> >>> FileStore::collect_metadata() is hanging for ~3 mins before returning a 
> >>> EINVAL. I saw this portion is newly added after hammer.
> >>> Commenting it out resolves the issue. BTW, I saw this value is stored as 
> >>> metadata but not used anywhere , am I missing anything ?
> >>> Here is my Linux details..
> >>> 
> >>> root@emsnode5:~/wip-write-path-optimization/src# uname -a Linux 
> >>> emsnode5 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8 09:43:57 
> >>> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> >>> 
> >>> 
> >>> root@emsnode5:~/wip-write-path-optimization/src# lsb_release -a No 
> >>> LSB modules are available.
> >>> Distributor ID: Ubuntu
> >>> Description:Ubuntu 14.04.2 LTS
> >>> Release:14.04
> >>> Codename:   trusty
> >>> 
> >>> Thanks & Regards
> >>> Somnath
> >>> 
> >>> -Original Message-
> >>> From: Somnath Roy
> >>> Sent: Wednesday, September 16, 2015 2:20 PM
> >>> To: 'Gregory Farnum'
> >>> Cc: 'ceph-devel'
> >>> Subject: RE: Very slow recovery/peering with latest master
> >>> 
> >>> 
> >>> Sage/Greg,
> >>> 
> >>> Yeah, as we expected, it is not happening probably because of recovery 
> >>> settings. I reverted it back in my ceph.conf , but, still seeing this 
> >>> problem.
> >>> 
> >>> Some observation :
> >>> --
> >>> 
> >>> 1. First of all, I don't think it is something related to my environment. 
> >>> I recreated the cluster with Hammer and this problem is not there.
> >>> 
> >>> 2. I have enabled the messenger/monclient log (Couldn't attach here) in 
> >>> one of the OSDs and found monitor is taking long time to detect the up 
> >>> OSDs. If you see the log, I have started OSD at 2015-09-16 
> >>> 16:13:07.042463 , but, there is no communication (only getting 
> >>> KEEP_ALIVE) till 2015-09-16 16:16:07.180482 , so, 3 mins !!
> >>> 
> >>> 3. During this period, I saw monclient trying to communicate with monitor 
> >>> but