Re: use crushtool to simulate pg distribution for a specific pool

2015-09-22 Thread Sage Weil
Hi Zhang,

On Tue, 22 Sep 2015, Z Zhang wrote:
> Hi ceph-devel,
> 
> We enhanced the crushtool to simulate real pg distribution for a 
> specific pool. Recently we encountered the pg uneven issue again after 
> expanding our cluster, although we had re-weighted the cluster to make 
> the pg distribution even at the time we built up the cluster. 
> 
> It could be painful to re-weight osds after expanding the cluster because 
> in order to achieve an even pg distribution, we may need to re-weight a 
> couple of times and each time will trigger data movement.
> 
> Now we could use this crushtool to simulate pg distribution against the 
> crush map, re-weight osds and test the crush map again and again until 
> we are satisfied with the pg distribution. Then we could set back the final 
> crush map and trigger data movement only once.
> 
> Please check this PR: https://github.com/ceph/ceph/pull/6004 if you 
> guys are interested. 

This looks like it will work, but I'm not sure it's the right place to 
put it.  The osdmaptool has a --test-map-pgs option that maps all PGs and 
gives you the osd distribution and pg min/max per OSD.  The procedure is 
then slightly different:

 ceph osd getmap -o om
 osdmaptool om --export-crush cm
 repeat:
  adjust crush map cm...
  osdmaptool om --import-crush cm --test-map-pgs
 ceph osd setcrushmap -i cm

...but it will keep the details of ceph pools out of crushtool (where they 
probably don't belong).
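
For example, the adjust-and-test loop could be scripted roughly like this (a 
sketch only; the crush map is decompiled to text for editing and recompiled on 
each pass, and which weights you adjust is up to you):

 ceph osd getmap -o om
 osdmaptool om --export-crush cm
 while true; do
   crushtool -d cm -o cm.txt
   ${EDITOR:-vi} cm.txt                            # adjust weights
   crushtool -c cm.txt -o cm
   osdmaptool om --import-crush cm --test-map-pgs  # check pg min/max per OSD
   read -p "Happy with the distribution? [y/N] " ok
   [ "$ok" = y ] && break
 done
 ceph osd setcrushmap -i cm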

What do you think?
sage


Re: Adding Data-At-Rest compression support to Ceph

2015-09-22 Thread Sage Weil
On Tue, 22 Sep 2015, Igor Fedotov wrote:
> Hi guys,
> 
> I can find some talks about adding compression support to Ceph. Let me share
> some thoughts and proposals on that too.
> 
> First of all I'd like to consider several major implementation options
> separately. IMHO this makes sense since they have different applicability,
> value and implementation specifics. Besides, smaller parts are easier to
> both understand and implement.
> 
>   * Data-At-Rest Compression. This is about compressing basic data volume kept
> by the Ceph backing tier. The main reason for that is data store costs
> reduction. One can find similar approach introduced by Erasure Coding Pool
> implementation - cluster capacity increases (i.e. storage cost reduces) at the
> expense of additional computations. This is especially effective when combined
> with the high-performance cache tier.
>   *  Intermediate Data Compression. This case is about applying compression
> for intermediate data like system journals, caches etc. The intention is to
> improve expensive storage resource  utilization (e.g. solid state drives or
> RAM ). At the same time the idea to apply compression ( feature that
> undoubtedly introduces additional overhead ) to the crucial heavy-duty
> components probably looks contradictory.
>   *  Exchange Data Compression. This one is to be applied to messages transported
> between client and storage cluster components as well as internal cluster
> traffic. The rationale for that might be the desire to improve cluster
> run-time characteristics, e.g. limited data bandwidth caused by the network or
> storage devices throughput. The potential drawback is client overburdening -
> client computation resources might become a bottleneck since they take most of
> compression/decompression tasks.
> 
> Obviously it would be great to have support for all the above cases, e.g.
> object compression takes place at the client and cluster components handle
> that naturally during the object life-cycle. Unfortunately significant
> complexities arise along the way. Most of them are related to partial object
> access, both reading and writing. It looks like huge development (
> redesigning, refactoring and new code development ) and testing efforts are
> required along the way. It's hard to estimate the value of such aggregated
> support at the current moment too.
> Thus the approach I'm suggesting is to drive the progress incrementally and
> consider cases separately. At the moment my proposal is to add Data-At-Rest
> compression to Erasure Coded pools as the most definite one from both
> implementation and value points of view.
> 
> How we can do that.
> 
> Ceph Cluster Architecture suggests two-tier storage model for production
> usage. Cache tier built on high-performance expensive storage devices provides
> performance. Storage tier with low-cost less-efficient devices provides
> cost-effectiveness and capacity. Cache tier is supposed to use ordinary data
> replication while storage one can use erasure coding (EC) for effective and
> reliable data keeping. EC provides lower storage costs with the same reliability
> compared to the data replication approach at the expense of additional
> computations. Thus Ceph already has some trade off between capacity and
> computation efforts. Actually Data-At-Rest compression is exactly about the
> same. Moreover one can tie EC and Data-At-Rest compression together to achieve
> even better storage effectiveness.
> There are two possible ways of adding Data-At-Rest compression:
>   *  Use data compression built into the file system underneath Ceph.
>   *  Add compression to Ceph OSD.
> 
> At first glance Option 1. looks pretty attractive but there are some drawbacks
> for this approach. Here they are:
>   *  File System lock-in. BTRFS is the only file system supporting transparent
> compression among the ones recommended for Ceph usage.  Moreover
> AFAIK it's still not recommended for production usage, see:
> http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/
>*  Limited flexibility - one can use compression methods and policies
> supported by FS only.
>*  Data compression depends on volume or mount point properties (and is
> bound to OSD). Without additional support Ceph lacks the ability to have
> different compression policies for different pools residing at the same OSD.
>*  File Compression Control isn't standardized among file systems. If (or
> when) a new compression-equipped file system appears, Ceph might require
> corresponding changes to handle that properly.
> 
> Having compression at OSD helps to eliminate these drawbacks.
> As mentioned above, Data-At-Rest compression purposes are pretty much the same as
> for Erasure Coding. It looks quite easy to add compression support to EC
> pools. This way one can have even more storage space for higher CPU load.
> Additional Pros for combining compression and erasure coding are:
>   *  Both EC and compression 

Adding Data-At-Rest compression support to Ceph

2015-09-22 Thread Igor Fedotov

Hi guys,

I can find some talks about adding compression support to Ceph. Let me 
share some thoughts and proposals on that too.


First of all I’d like to consider several major implementation options 
separately. IMHO this makes sense since they have different 
applicability, value and implementation specifics. Besides, smaller 
parts are easier to both understand and implement.


  * Data-At-Rest Compression. This is about compressing basic data 
volume kept by the Ceph backing tier. The main reason for that is data 
store costs reduction. One can find similar approach introduced by 
Erasure Coding Pool implementation - cluster capacity increases (i.e. 
storage cost reduces) at the expense of additional computations. This is 
especially effective when combined with the high-performance cache tier.
  *  Intermediate Data Compression. This case is about applying 
compression for intermediate data like system journals, caches etc. The 
intention is to improve expensive storage resource  utilization (e.g. 
solid state drives or RAM ). At the same time the idea to apply 
compression ( feature that undoubtedly introduces additional overhead ) 
to the crucial heavy-duty components probably looks contradictory.
  *  Exchange Data Compression. This one is to be applied to messages 
transported between client and storage cluster components as well as 
internal cluster traffic. The rationale for that might be the desire to 
improve cluster run-time characteristics, e.g. limited data bandwidth 
caused by the network or storage devices throughput. The potential 
drawback is client overburdening - client computation resources might 
become a bottleneck since they take most of compression/decompression tasks.


Obviously it would be great to have support for all the above cases, 
e.g. object compression takes place at the client and cluster components 
handle that naturally during the object life-cycle. Unfortunately 
significant complexities arise along the way. Most of them are related to 
partial object access, both reading and writing. It looks like huge 
development ( redesigning, refactoring and new code development ) and 
testing efforts are required along the way. It’s hard to estimate the 
value of such aggregated support at the current moment too.
Thus the approach I’m suggesting is to drive the progress incrementally and 
consider cases separately. At the moment my proposal is to add 
Data-At-Rest compression to Erasure Coded pools as the most definite one 
from both implementation and value points of view.
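
As a rough way to gauge the value side on real data, one could sample a few 
objects from an existing pool and check how well they compress (a sketch; the 
pool name is a placeholder):

 pool=mypool
 for obj in $(rados -p "$pool" ls | head -n 20); do
   rados -p "$pool" get "$obj" /tmp/sample
   raw=$(stat -c %s /tmp/sample)
   gz=$(gzip -c /tmp/sample | wc -c)
   echo "$obj: $raw -> $gz bytes"
 done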


How we can do that.

Ceph Cluster Architecture suggests two-tier storage model for production 
usage. Cache tier built on high-performance expensive storage devices 
provides performance. Storage tier with low-cost less-efficient devices 
provides cost-effectiveness and capacity. Cache tier is supposed to use 
ordinary data replication while storage one can use erasure coding (EC) 
for effective and reliable data keeping. EC provides lower storage costs 
with the same reliability compared to the data replication approach at the 
expense of additional computations. Thus Ceph already has some trade 
off between capacity and computation efforts. Actually Data-At-Rest 
compression is exactly about the same. Moreover one can tie EC and 
Data-At-Rest compression together to achieve even better storage 
effectiveness.

There are two possible ways of adding Data-At-Rest compression:
  *  Use data compression built into the file system underneath Ceph.
  *  Add compression to Ceph OSD.

At first glance Option 1. looks pretty attractive but there are some 
drawbacks for this approach. Here they are:
  *  File System lock-in. BTRFS is the only file system supporting 
transparent compression among the ones recommended for Ceph usage. 
Moreover AFAIK it’s still not recommended for production 
usage, see:

http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/
   *  Limited flexibility - one can use compression methods and 
policies supported by FS only.
   *  Data compression depends on volume or mount point properties (and 
is bound to OSD). Without additional support Ceph lacks the ability to 
have different compression policies for different pools residing at the 
same OSD.
   *  File Compression Control isn’t standardized among file systems. 
If (or when) a new compression-equipped file system appears, Ceph might 
require corresponding changes to handle that properly.


Having compression at OSD helps to eliminate these drawbacks.
As mentioned above, Data-At-Rest compression purposes are pretty much the same 
as for Erasure Coding. It looks quite easy to add compression support to 
EC pools. This way one can have even more storage space for higher CPU load.

Additional Pros for combining compression and erasure coding are:
  *  Both EC and compression have complexities in partial writing. EC 
pools don’t have partial write support (data append only) and the 
solution for that is cache tier 

Re: RBD mirroring CLI proposal ...

2015-09-22 Thread Jason Dillaman
> > * rbd consistency-group create <journal-spec>
> > [--object-pool <pool-name>]
> > [--splay-width <splay-width>]
> > [--object-size <object-size>]
> > [--additional-journal-tweakable-settings]
> > This will create an empty journal for use with consistency groups
> > (i.e. attaching
> > multiple RBD images to the same journal to ensure consistent
> > replay).
> 
> For 'rbd feature' commands the option names have "journal" prefix
> (--journal-object-pool), while for 'rbd consistency-group' they
> don't. Is it intentional? I would prefer having "journal" prefix for
> both.

It was intentional with the rationale that the 'journal-' prefix was added to 
the 'rbd feature enable' command because in the future it might support other 
configurable params.  I have no issue keeping them consistent between the two.  
 

> > 
> > * rbd consistency-group rename <journal-spec>
> 
> s/rename/remove/ ?

Yup -- I think I merged the rename and remove commands by accident.  

> > * rbd consistency-group info <journal-spec>
> > This will display information about the specified consistency group
> > 
> > where <journal-spec> is [<pool-name>/]<journal-name>
> 
> Is a consistency-group just a journal (usually used for several images)
> or is it something more? If I enable journaling feature for an image,
> journal is automatically created, is it already a consistency-group?

Correct -- I'm just using the command name to describe the behavior you would 
receive if you attached multiple images to the same journal.  If you enable 
journaling for a given image (or if it is enabled by default due to the pool's 
mirroring policy), then you will have a journal created for you automatically 
that is only associated with the single image.  However, you wouldn't be able 
to attach another image to the same journal nor would you see such a journal 
listed as a named consistency group.  I was originally toying around with 
having all of these "rbd consistency-group" commands being just additional "rbd 
journal" commands and eliminating the semantic difference.

> > * rbd mirror pool enable <pool-name>
> > This will, by default, ensure that all images created in this
> > pool have exclusive lock, journaling, and mirroring feature bits
> > enabled.
> > 
> > * rbd mirror pool disable <pool-name>
> > This will clear the default image features for new images in this
> > pool.
> 
> Will 'rbd mirror pool enable|disable' change behaviour only for newly
> created images in the pool or will enable|disable mirroring for
> existent images too?

Since the goal is to set default pool behavior, it would only apply to newly 
created images.  You can enable/disable on specific images using the 'rbd 
mirror image enable/disable' commands.

> > * rbd mirror pool add <remote-pool-spec>
> > This will register a remote cluster/pool as a peer to the current,
> > local pool.  All existing mirrored images and all future mirrored
> > images will have this peer registered as a journal client.
> > 
> > * rbd mirror pool remove <remote-pool-spec>
> > This will deregister a remote cluster/pool as a peer to the
> > current,
> > local pool.  All existing mirrored images will have the remote
> > deregistered from image journals.
> 
> In remote-pool-spec, if it is "cluster/pool", where it is expected to
> find configuration for the cluster "cluster"? In /etc/ceph/cluster.conf?

Ceph conventions state that the configuration file for a given cluster should 
be named "<cluster-name>.conf" and placed in the search path (/etc/ceph being 
one such place). 

> Apart from the list of new commands, it would be very helpful to see
> typical use case examples. Are my examples below correct?
> 
> Enable mirroring for pool "volumes", using pool "journals" for journal
> objects:
> 
>   rbd mirror pool enable volumes --journal-object-pool journals

Yup -- I neglected to include journaling defaults in that command, but it would 
be good to include them.
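
E.g., something along these lines, reusing the journal option names from the 
original proposal (the values here are only placeholders):

 rbd mirror pool enable volumes --journal-object-pool journals \
 --journal-splay-width 4 --journal-object-size 4M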

> Start mirroring to the remote cluster ceph-remote:
> 
>   Specify the remote cluster configuration in
>   /etc/ceph/ceph-remote.conf
> 
>   rbd mirror pool add ceph-remote/volumes
> 
>   Start RBD mirroring daemon on the remote cluster. Both clusters
>   should be specified in config or via command line parameters. No
>   need to specify mirroring pools, I suppose?

Haven't thought about this level of detail, but I think you'd be fine only 
specifying the local cluster to the RBD mirroring daemon.  It could then scan 
all the pools, lookup peer cluster names, and connect.  The daemon should use 
the journal's primary / secondary state + epoch to know whether or not to 
replicate the events from the remote.

> To configure a consistency group cgroup1 for images volume1 and
> volume2 in pool volumes:
> 
>   rbd consistency-group create volume/cgroup1 --object-pool journals
>   rbd consistency-group attach volume/volume1 volume/cgroup1
>   rbd consistency-group attach volume/volume2 volume/cgroup1
> 
> I guess before 

Re: [ceph-users] Potential OSD deadlock?

2015-09-22 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

OK, looping in ceph-devel to see if I can get some more eyes. I've
extracted what I think are important entries from the logs for the
first blocked request. NTP is running on all the servers so the logs
should be close in terms of time. Logs for 12:50 to 13:00 are
available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz

2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0

In the logs I can see that osd.17 dispatches the I/O to osd.13 and
osd.16 almost simultaneously. osd.16 seems to get the I/O right away,
but for some reason osd.13 doesn't get the message until 53 seconds
later. osd.17 seems happy to just wait and doesn't resend the data
(well, I'm not 100% sure how to tell which entries are the actual data
transfer).

It looks like osd.17 is receiving responses to start the communication
with osd.13, but the op is not acknowledged until almost a minute
later. To me it seems that the message is getting received but not
passed to another thread right away or something. This test was done
with an idle cluster, a single fio client (rbd engine) with a single
thread.

The OSD servers are almost 100% idle during these blocked I/O
requests. I think I'm at the end of my troubleshooting, so I can use
some help.
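
Something like the following can be used to pull the relevant entries out of 
the per-OSD logs (paths assume the default log location):

 # slow-request warnings from every OSD, in time order
 grep -h 'slow request' /var/log/ceph/ceph-osd.*.log | sort
 # follow one client op, e.g. the first blocked write, across OSDs
 grep -h 'client.250874.0:1388' /var/log/ceph/ceph-osd.*.log | sort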

Single Test started about
2015-09-22 12:52:36

2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
30.439150 secs
2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
cluster [WRN] slow request 30.439150 seconds old, received at
2015-09-22 12:55:06.487451:
 osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0545
[set-alloc-hint object_size 4194304 write_size 4194304,write
0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
 currently waiting for subops from 13,16
2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
[WRN] 2 slow requests, 2 included below; oldest blocked for >
30.379680 secs
2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
[WRN] slow request 30.291520 seconds old, received at 2015-09-22
12:55:06.406303:
 osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0541
[set-alloc-hint object_size 4194304 write_size 4194304,write
0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
 currently waiting for subops from 13,17
2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
[WRN] slow request 30.379680 seconds old, received at 2015-09-22
12:55:06.318144:
 osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.053f
[set-alloc-hint object_size 4194304 write_size 4194304,write
0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
 currently waiting for subops from 13,14
2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
30.954212 secs
2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
cluster [WRN] slow request 30.954212 seconds old, received at
2015-09-22 12:57:33.044003:
 osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.070d
[set-alloc-hint object_size 4194304 write_size 4194304,write
0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
 currently waiting for subops from 16,17
2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
30.704367 secs
2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
cluster [WRN] slow request 30.704367 seconds old, received at
2015-09-22 12:57:33.055404:
 osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.070e
[set-alloc-hint object_size 4194304 write_size 4194304,write
0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
 currently waiting for subops from 13,17

Server   IP addr  OSD
nodev  - 192.168.55.11 - 12
nodew  - 192.168.55.12 - 13
nodex  - 192.168.55.13 - 16
nodey  - 192.168.55.14 - 17
nodez  - 192.168.55.15 - 14
nodezz - 192.168.55.16 - 15

fio job:
[rbd-test]
readwrite=write
blocksize=4M
#runtime=60
name=rbd-test
#readwrite=randwrite
#bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
#rwmixread=72
#norandommap
#size=1T
#blocksize=4k
ioengine=rbd
rbdname=test2
pool=rbd
clientname=admin
iodepth=8
#numjobs=4
#thread
#group_reporting
#time_based
#direct=1

Re: Ceph problem

2015-09-22 Thread huang jun
You can add debug_rados or debug_ms to the rbd create command to see what
happened during the 120s.
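
For example (a sketch; the image name, size and pool here are placeholders, and
the debug levels can be tuned as needed):

 rbd create testimg --size 1024 --pool rbd --debug-ms 1 --debug-rados 20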

2015-09-23 9:59 GMT+08:00 zhao.ming...@h3c.com :
>
> ---Original Message---
> From: "陈杰" <276648...@qq.com>
> Sent: 2015-09-21 17:52:19
> To: "ceph-devel@vger.kernel.org";
> Subject: Ceph problem
>
> Dear ceph-devel
>   I'm a software engineer at H3C. We found a Ceph problem in testing and 
> need your help.
>  If one of the hosts in the cluster is restarted, running the 'rbd create……' 
> command blocks for 120 seconds, whereas it normally blocks for 20 seconds.
> The cluster environment is composed of three hosts; each host runs a monitor 
> process and ten OSD processes.
>
> Looking forward to your reply,Thank you!
> -
> This e-mail and its attachments contain confidential information from H3C, 
> which is
> intended only for the person or entity whose address is listed above. Any use 
> of the
> information contained herein in any way (including, but not limited to, total 
> or partial
> disclosure, reproduction, or dissemination) by persons other than the intended
> recipient(s) is prohibited. If you receive this e-mail in error, please 
> notify the sender
> by phone or email immediately and delete it!



-- 
thanks
huangjun


Ceph problem

2015-09-22 Thread zhao.ming...@h3c.com

---Original Message---
From: "陈杰" <276648...@qq.com>
Sent: 2015-09-21 17:52:19
To: "ceph-devel@vger.kernel.org";
Subject: Ceph problem

Dear ceph-devel
  I'm a software engineer at H3C. We found a Ceph problem in testing and 
need your help.
 If one of the hosts in the cluster is restarted, running the 'rbd create……' 
command blocks for 120 seconds, whereas it normally blocks for 20 seconds.
The cluster environment is composed of three hosts; each host runs a monitor 
process and ten OSD processes.

Looking forward to your reply,Thank you!
-
This e-mail and its attachments contain confidential information from H3C, 
which is
intended only for the person or entity whose address is listed above. Any use 
of the
information contained herein in any way (including, but not limited to, total 
or partial
disclosure, reproduction, or dissemination) by persons other than the intended
recipient(s) is prohibited. If you receive this e-mail in error, please notify 
the sender
by phone or email immediately and delete it!


Where does the data go ??

2015-09-22 Thread Tomy Cheru
While benchmarking newstore with the rocksdb backend, I noticed that the data 
is missing in "dev/osd0/fragments".

Objects larger than 64k produce content in the above-mentioned dir; however, it 
is missing for objects of 64k or smaller.

The corresponding drive(s) show activity, and .sst files are in the "dev/osd0/db" dir.

Is it expected behavior ??

Where does the data go ??

Tomy




PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).



RE: use crushtool to simulate pg distribution for a specific pool

2015-09-22 Thread Z Zhang
Thanks, Sage. I will take a look at osdmaptool.


Thanks.
Zhi Zhang (David)



> Date: Tue, 22 Sep 2015 11:41:24 -0700
> From: s...@newdream.net
> To: zhangz.da...@outlook.com
> CC: ceph-devel@vger.kernel.org
> Subject: Re: use crushtool to simulate pg distribution for a specific pool
>
> Hi Zhang,
>
> On Tue, 22 Sep 2015, Z Zhang wrote:
>> Hi ceph-devel,
>>
>> We enhanced the crushtool to simulate real pg distribution for a
>> specific pool. Recently we encountered the pg uneven issue again after
>> expanding our cluster, although we had re-weighted the cluster to make
>> the pg distribution even at the time we built up the cluster.
>>
>> It could be painful to re-weight osds after expanding the cluster because
>> in order to achieve an even pg distribution, we may need to re-weight a
>> couple of times and each time will trigger data movement.
>>
>> Now we could use this crushtool to simulate pg distribution against the
>> crush map, re-weight osds and test the crush map again and again until
>> we are satisfied with the pg distribution. Then we could set back the final
>> crush map and trigger data movement only once.
>>
>> Please check this PR: https://github.com/ceph/ceph/pull/6004 if you
>> guys are interested.
>
> This looks like it will work, but I'm not sure it's the right place to
> put it. The osdmaptool has a --test-map-pgs option that maps all PGs and
> gives you the osd distribution and pg min/max per OSD. The procedure is
> then slightly different:
>
> ceph osd getmap -o om
> osdmaptool om --export-crush cm
> repeat:
> adjust crush map cm...
> osdmaptool om --import-crush cm --test-map-pgs
> ceph osd setcrushmap -i cm
>
> ...but it will keep the details of ceph pools out of crushtool (where they
> probably don't belong).
>
> What do you think?
> sage
  

Seek advice for using Ceph to provide NAS service

2015-09-22 Thread Jevon Qiao

Hi Sage and other Ceph experts,

This is a greeting from Jevon. I'm from China and working in a company 
which is using Ceph as the backend storage. At present, I'm evaluating 
the following two options for using a Ceph cluster to provide NAS service, 
and I need your advice from the perspective of stability and feasibility.


Option 1: Directly use CephFS
Since Ceph as a unified storage system can provide file system storage service 
via cephfs, this looks like an ideal solution for my case if CephFS is ready 
to be used in a production environment. However, based on the previous 
discussions on CephFS, I see that there are still some issues, such as support 
for multiple metadata servers not being ready, the lack of a fully 
functioning fsck and so on. Also, I learn from the official Ceph website that 
CephFS has been evaluated by a large community of users and that there are 
production systems using it with a single MDS. So it is difficult 
for me to make the decision on whether I should use it.


Option 2: Ceph rbd + NFS server
This might be a common architecture used in current NAS storage. But the 
problem is how to get rid of the single point of failure on the NFS server. 
What I have right now is to use Corosync and Pacemaker(the typical HA 
solution in Linux) to form a cluster. It seems that Sebastien Han has 
verified the feasibility.
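
For illustration, the basic building blocks such a setup has to manage look 
roughly like this (image, pool, export path and client network are 
placeholders; in a real deployment the Pacemaker resource agents would map, 
mount and export these on whichever node is currently active):

 rbd create nfsshare --pool rbd --size 102400   # 100 GB image
 rbd map nfsshare --pool rbd                    # e.g. /dev/rbd0
 mkfs.xfs /dev/rbd0
 mkdir -p /export/nfsshare
 mount /dev/rbd0 /export/nfsshare
 exportfs -o rw,sync,no_root_squash 192.168.0.0/24:/export/nfsshare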


Your comments/advices would be highly appreciated and I'm looking 
forward to your reply.


Thanks,
Jevon


Re: [ceph-users] Important security noticed regarding release signing key

2015-09-22 Thread wangsongbo

Hi Ken,
Thanks for your reply. But in the ceph-cm-ansible project scheduled 
by teuthology, "ceph.com/packages/ceph-extras" is still in use now, for packages such as 
qemu-kvm-0.12.1.2-2.415.el6.3ceph, 
qemu-kvm-tools-0.12.1.2-2.415.el6.3ceph etc.

Any new releases will be provided ?

On 15/9/22 at 10:24 PM, Ken Dreyer wrote:

On Tue, Sep 22, 2015 at 2:38 AM, Songbo Wang  wrote:

Hi, all,
 Since last week's attack, “ceph.com/packages/ceph-extras” can no longer be
opened, but where can I get the releases of ceph-extras now?

Thanks and Regards,
WangSongbo


The packages in "ceph-extras" were old and subject to CVEs (the big
one being VENOM, CVE-2015-3456). So I don't intend to host ceph-extras
in the new location.

- Ken



Re: [ceph-users] Important security noticed regarding release signing key

2015-09-22 Thread Ken Dreyer
Hi Songbo, It's been removed from Ansible now:
https://github.com/ceph/ceph-cm-ansible/pull/137

- Ken

On Tue, Sep 22, 2015 at 8:33 PM, wangsongbo  wrote:
> Hi Ken,
> Thanks for your reply. But in the ceph-cm-ansible project scheduled by
> teuthology, "ceph.com/packages/ceph-extras" is in used now, such as
> qemu-kvm-0.12.1.2-2.415.el6.3ceph, qemu-kvm-tools-0.12.1.2-2.415.el6.3ceph
> etc.
> Any new releases will be provided ?
>
>
>> On 15/9/22 at 10:24 PM, Ken Dreyer wrote:
>>
>> On Tue, Sep 22, 2015 at 2:38 AM, Songbo Wang  wrote:
>>>
>>> Hi, all,
>>>  Since the last week‘s attack, “ceph.com/packages/ceph-extras”
>>> can be
>>> opened never, but where can I get the releases of ceph-extra now?
>>>
>>> Thanks and Regards,
>>> WangSongbo
>>>
>> The packages in "ceph-extras" were old and subject to CVEs (the big
>> one being VENOM, CVE-2015-3456). So I don't intend to host ceph-extras
>> in the new location.
>>
>> - Ken
>
>


Re: [ceph-users] Potential OSD deadlock?

2015-09-22 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

This is IPoIB and we have the MTU set to 64K. There were some issues
pinging hosts with "No buffer space available" (hosts are currently
configured for 4GB to test SSD caching rather than page cache). I
found that an MTU under 32K worked reliably for ping, but still had the
blocked I/O.

I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
the blocked I/O.
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
> On Tue, 22 Sep 2015, Samuel Just wrote:
>> I looked at the logs, it looks like there was a 53 second delay
>> between when osd.17 started sending the osd_repop message and when
>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>> once see a kernel issue which caused some messages to be mysteriously
>> delayed for many 10s of seconds?
>
> Every time we have seen this behavior and diagnosed it in the wild it has
> been a network misconfiguration.  Usually related to jumbo frames.
>
> sage
>
>
>>
>> What kernel are you running?
>> -Sam
>>
>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>> > -BEGIN PGP SIGNED MESSAGE-
>> > Hash: SHA256
>> >
>> > OK, looping in ceph-devel to see if I can get some more eyes. I've
>> > extracted what I think are important entries from the logs for the
>> > first blocked request. NTP is running all the servers so the logs
>> > should be close in terms of time. Logs for 12:50 to 13:00 are
>> > available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>> >
>> > 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>> > 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>> > 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>> > 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>> > 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>> > 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>> > 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>> > 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>> > 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>> > 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>> >
>> > In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>> > osd.16 almost simultaneously. osd.16 seems to get the I/O right away,
>> > but for some reason osd.13 doesn't get the message until 53 seconds
>> > later. osd.17 seems happy to just wait and doesn't resend the data
>> > (well, I'm not 100% sure how to tell which entries are the actual data
>> > transfer).
>> >
>> > It looks like osd.17 is receiving responses to start the communication
>> > with osd.13, but the op is not acknowledged until almost a minute
>> > later. To me it seems that the message is getting received but not
>> > passed to another thread right away or something. This test was done
>> > with an idle cluster, a single fio client (rbd engine) with a single
>> > thread.
>> >
>> > The OSD servers are almost 100% idle during these blocked I/O
>> > requests. I think I'm at the end of my troubleshooting, so I can use
>> > some help.
>> >
>> > Single Test started about
>> > 2015-09-22 12:52:36
>> >
>> > 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>> > cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> > 30.439150 secs
>> > 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>> > cluster [WRN] slow request 30.439150 seconds old, received at
>> > 2015-09-22 12:55:06.487451:
>> >  osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0545
>> > [set-alloc-hint object_size 4194304 write_size 4194304,write
>> > 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>> >  currently waiting for subops from 13,16
>> > 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>> > [WRN] 2 slow requests, 2 included below; oldest blocked for >
>> > 30.379680 secs
>> > 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>> > [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>> > 12:55:06.406303:
>> >  osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0541
>> > [set-alloc-hint object_size 4194304 write_size 4194304,write
>> > 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>> >  currently waiting for subops from 13,17
>> > 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>> > [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>> > 12:55:06.318144:
>> >  osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.053f
>> > [set-alloc-hint object_size 4194304 write_size 4194304,write
>> > 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>> >  currently waiting for subops from 13,14
>> > 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>> > cluster 

Re: [ceph-users] Potential OSD deadlock?

2015-09-22 Thread Samuel Just
I looked at the logs, it looks like there was a 53 second delay
between when osd.17 started sending the osd_repop message and when
osd.13 started reading it, which is pretty weird.  Sage, didn't we
once see a kernel issue which caused some messages to be mysteriously
delayed for many 10s of seconds?

What kernel are you running?
-Sam

On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> OK, looping in ceph-devel to see if I can get some more eyes. I've
> extracted what I think are important entries from the logs for the
> first blocked request. NTP is running all the servers so the logs
> should be close in terms of time. Logs for 12:50 to 13:00 are
> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>
> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>
> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
> osd.16 almost simultaneously. osd.16 seems to get the I/O right away,
> but for some reason osd.13 doesn't get the message until 53 seconds
> later. osd.17 seems happy to just wait and doesn't resend the data
> (well, I'm not 100% sure how to tell which entries are the actual data
> transfer).
>
> It looks like osd.17 is receiving responses to start the communication
> with osd.13, but the op is not acknowledged until almost a minute
> later. To me it seems that the message is getting received but not
> passed to another thread right away or something. This test was done
> with an idle cluster, a single fio client (rbd engine) with a single
> thread.
>
> The OSD servers are almost 100% idle during these blocked I/O
> requests. I think I'm at the end of my troubleshooting, so I can use
> some help.
>
> Single Test started about
> 2015-09-22 12:52:36
>
> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> 30.439150 secs
> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
> cluster [WRN] slow request 30.439150 seconds old, received at
> 2015-09-22 12:55:06.487451:
>  osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0545
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>  currently waiting for subops from 13,16
> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
> [WRN] 2 slow requests, 2 included below; oldest blocked for >
> 30.379680 secs
> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
> 12:55:06.406303:
>  osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0541
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>  currently waiting for subops from 13,17
> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
> 12:55:06.318144:
>  osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.053f
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>  currently waiting for subops from 13,14
> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> 30.954212 secs
> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
> cluster [WRN] slow request 30.954212 seconds old, received at
> 2015-09-22 12:57:33.044003:
>  osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.070d
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>  currently waiting for subops from 16,17
> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> 30.704367 secs
> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
> cluster [WRN] slow request 30.704367 seconds old, received at
> 2015-09-22 12:57:33.055404:
>  osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.070e
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected 

[CEPH-DEVEL] MAX_RBD_IMAGES

2015-09-22 Thread Shinobu Kinjo
Hello,

Does any of you know why *MAX_RBD_IMAGES* was changed from 16 to 128?
I hope that Dan remember -;

http://resources.ustack.com/ceph/ceph/commit/2a6dcabf7f1b7550a0fa4fd223970ffc24ad7870

 - Shinobu


Ceph Days 2015

2015-09-22 Thread Patrick McGarry
Hey cephers,

All of the planned Ceph Days through the end of the year are now live:

http://ceph.com/cephdays/

16 Oct - Shanghai (hosted by Intel)
22 Oct - Tokyo (hosted by Cyber Agent)
05 Nov - Melbourne (hosted by Monash University)

Please sign up as soon as you are able to any of the events you are
interested in attending so we can have an accurate headcount. As
always, if you have questions or concerns please feel free to contact
me. Thanks!


-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph


Re: [ceph-users] Potential OSD deadlock?

2015-09-22 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

4.2.0-1.el7.elrepo.x86_64
- - 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Sep 22, 2015 at 3:41 PM, Samuel Just  wrote:
> I looked at the logs, it looks like there was a 53 second delay
> between when osd.17 started sending the osd_repop message and when
> osd.13 started reading it, which is pretty weird.  Sage, didn't we
> once see a kernel issue which caused some messages to be mysteriously
> delayed for many 10s of seconds?
>
> What kernel are you running?
> -Sam
>
> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>> extracted what I think are important entries from the logs for the
>> first blocked request. NTP is running all the servers so the logs
>> should be close in terms of time. Logs for 12:50 to 13:00 are
>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>>
>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>>
>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>> osd.16 almost simultaneously. osd.16 seems to get the I/O right away,
>> but for some reason osd.13 doesn't get the message until 53 seconds
>> later. osd.17 seems happy to just wait and doesn't resend the data
>> (well, I'm not 100% sure how to tell which entries are the actual data
>> transfer).
>>
>> It looks like osd.17 is receiving responses to start the communication
>> with osd.13, but the op is not acknowledged until almost a minute
>> later. To me it seems that the message is getting received but not
>> passed to another thread right away or something. This test was done
>> with an idle cluster, a single fio client (rbd engine) with a single
>> thread.
>>
>> The OSD servers are almost 100% idle during these blocked I/O
>> requests. I think I'm at the end of my troubleshooting, so I can use
>> some help.
>>
>> Single Test started about
>> 2015-09-22 12:52:36
>>
>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> 30.439150 secs
>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>> cluster [WRN] slow request 30.439150 seconds old, received at
>> 2015-09-22 12:55:06.487451:
>>  osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0545
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>>  currently waiting for subops from 13,16
>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>> 30.379680 secs
>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>> 12:55:06.406303:
>>  osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0541
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>>  currently waiting for subops from 13,17
>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>> 12:55:06.318144:
>>  osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.053f
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>>  currently waiting for subops from 13,14
>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> 30.954212 secs
>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>> cluster [WRN] slow request 30.954212 seconds old, received at
>> 2015-09-22 12:57:33.044003:
>>  osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.070d
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>>  currently waiting for subops from 16,17
>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> 30.704367 

Re: [ceph-users] Potential OSD deadlock?

2015-09-22 Thread Sage Weil
On Tue, 22 Sep 2015, Samuel Just wrote:
> I looked at the logs, it looks like there was a 53 second delay
> between when osd.17 started sending the osd_repop message and when
> osd.13 started reading it, which is pretty weird.  Sage, didn't we
> once see a kernel issue which caused some messages to be mysteriously
> delayed for many 10s of seconds?

Every time we have seen this behavior and diagnosed it in the wild it has 
been a network misconfiguration.  Usually related to jumbo frames.
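
A quick way to check for that kind of mismatch is to compare the interface MTUs 
and then confirm that large frames actually make it between the OSD hosts 
without fragmentation, e.g.:

 ip link show | grep -i mtu
 # 8972 = 9000-byte MTU minus 28 bytes of IP/ICMP headers
 ping -M do -s 8972 -c 3 192.168.55.12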

sage


> 
> What kernel are you running?
> -Sam
> 
> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
> > -BEGIN PGP SIGNED MESSAGE-
> > Hash: SHA256
> >
> > OK, looping in ceph-devel to see if I can get some more eyes. I've
> > extracted what I think are important entries from the logs for the
> > first blocked request. NTP is running all the servers so the logs
> > should be close in terms of time. Logs for 12:50 to 13:00 are
> > available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
> >
> > 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
> > 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
> > 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
> > 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
> > 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
> > 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
> > 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
> > 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
> > 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
> > 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
> >
> > In the logs I can see that osd.17 dispatches the I/O to osd.13 and
> > osd.16 almost simultaneously. osd.16 seems to get the I/O right away,
> > but for some reason osd.13 doesn't get the message until 53 seconds
> > later. osd.17 seems happy to just wait and doesn't resend the data
> > (well, I'm not 100% sure how to tell which entries are the actual data
> > transfer).
> >
> > It looks like osd.17 is receiving responses to start the communication
> > with osd.13, but the op is not acknowledged until almost a minute
> > later. To me it seems that the message is getting received but not
> > passed to another thread right away or something. This test was done
> > with an idle cluster, a single fio client (rbd engine) with a single
> > thread.
> >
> > The OSD servers are almost 100% idle during these blocked I/O
> > requests. I think I'm at the end of my troubleshooting, so I can use
> > some help.
> >
> > Single Test started about
> > 2015-09-22 12:52:36
> >
> > 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
> > cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> > 30.439150 secs
> > 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
> > cluster [WRN] slow request 30.439150 seconds old, received at
> > 2015-09-22 12:55:06.487451:
> >  osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0545
> > [set-alloc-hint object_size 4194304 write_size 4194304,write
> > 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
> >  currently waiting for subops from 13,16
> > 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
> > [WRN] 2 slow requests, 2 included below; oldest blocked for >
> > 30.379680 secs
> > 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
> > [WRN] slow request 30.291520 seconds old, received at 2015-09-22
> > 12:55:06.406303:
> >  osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0541
> > [set-alloc-hint object_size 4194304 write_size 4194304,write
> > 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
> >  currently waiting for subops from 13,17
> > 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
> > [WRN] slow request 30.379680 seconds old, received at 2015-09-22
> > 12:55:06.318144:
> >  osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.053f
> > [set-alloc-hint object_size 4194304 write_size 4194304,write
> > 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
> >  currently waiting for subops from 13,14
> > 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
> > cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> > 30.954212 secs
> > 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
> > cluster [WRN] slow request 30.954212 seconds old, received at
> > 2015-09-22 12:57:33.044003:
> >  osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.070d
> > [set-alloc-hint object_size 4194304 write_size 4194304,write
> > 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
> >  currently waiting for subops from 16,17
> > 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
> > cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> > 

Re: [ceph-users] Important security noticed regarding release signing key

2015-09-22 Thread wangsongbo

Hi Ken,
Thank you, I will update my repo and continue my test.

- Songbo

On 15/9/23 at 10:50 AM, Ken Dreyer wrote:

Hi Songbo, It's been removed from Ansible now:
https://github.com/ceph/ceph-cm-ansible/pull/137

- Ken

On Tue, Sep 22, 2015 at 8:33 PM, wangsongbo  wrote:

Hi Ken,
 Thanks for your reply. But in the ceph-cm-ansible project scheduled by
teuthology, "ceph.com/packages/ceph-extras" is in used now, such as
qemu-kvm-0.12.1.2-2.415.el6.3ceph, qemu-kvm-tools-0.12.1.2-2.415.el6.3ceph
etc.
 Any new releases will be provided ?


On 15/9/22 at 10:24 PM, Ken Dreyer wrote:

On Tue, Sep 22, 2015 at 2:38 AM, Songbo Wang  wrote:

Hi, all,
  Since the last week‘s attack, “ceph.com/packages/ceph-extras”
can be
opened never, but where can I get the releases of ceph-extra now?

Thanks and Regards,
WangSongbo


The packages in "ceph-extras" were old and subject to CVEs (the big
one being VENOM, CVE-2015-3456). So I don't intend to host ceph-extras
in the new location.

- Ken






Re: [CEPH-DEVEL] MAX_RBD_IMAGES

2015-09-22 Thread Josh Durgin

On 09/22/2015 02:55 PM, Shinobu Kinjo wrote:

Hello,

Does any of you know why *MAX_RBD_IMAGES* was changed from 16 to 128?
I hope that Dan remember -;

http://resources.ustack.com/ceph/ceph/commit/2a6dcabf7f1b7550a0fa4fd223970ffc24ad7870


I don't think there's a reason for the exact limit. rbd-fuse was
created as a prototype, and hasn't had much work on it since. A
hardcoded limit on open images is not necessary, for example. Some
more ideas for improvements if anyone's interested:

http://pad.ceph.com/p/rbd-fuse-2015
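
For anyone trying it out, a minimal invocation of the prototype looks roughly 
like this (assuming a pool named rbd; the mount point is a placeholder):

 mkdir -p /mnt/rbdimages
 rbd-fuse -p rbd /mnt/rbdimages
 ls /mnt/rbdimages          # each rbd image shows up as a file
 fusermount -u /mnt/rbdimages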

Josh


Re: [CEPH-DEVEL] MAX_RBD_IMAGES

2015-09-22 Thread Shinobu Kinjo
> I don't think there's a reason for the exact limit. rbd-fuse was
> created as a prototype, and hasn't had much work on it since.

That's why I've ended up asking. I've spent 2 days trying to find out the reason 
-;
The kernel says nothing to me, of course.
Thank you.

 Shinobu

- Original Message -
From: "Josh Durgin" 
To: "Shinobu Kinjo" , "ceph-devel" 

Sent: Wednesday, September 23, 2015 7:42:29 AM
Subject: Re: [CEPH-DEVEL] MAX_RBD_IMAGES

On 09/22/2015 02:55 PM, Shinobu Kinjo wrote:
> Hello,
>
> Does any of you know why *MAX_RBD_IMAGES* was changed from 16 to 128?
> I hope that Dan remember -;
>
> http://resources.ustack.com/ceph/ceph/commit/2a6dcabf7f1b7550a0fa4fd223970ffc24ad7870

I don't think there's a reason for the exact limit. rbd-fuse was
created as a prototype, and hasn't had much work on it since. A
hardcoded limit on open images is not necessary, for example. Some
more ideas for improvements if anyone's interested:

http://pad.ceph.com/p/rbd-fuse-2015

Josh


Re: RBD mirroring CLI proposal ...

2015-09-22 Thread Mykola Golub
On Mon, Sep 21, 2015 at 10:48:11AM -0400, Jason Dillaman wrote:
> The following describes a set of proposals to support the upcoming RBD 
> mirroring feature [1] via the rbd CLI. The RBD mirroring feature will 
> utilize a journal to allow modifications from a primary source to be 
> replicated to one or more backup destinations. 
> 
> To keep configuration simple (no need for synchronizing mirroring metadata), 
> RBD mirroring will be configured on a per-pool basis within a (conceptual) 
> availability zone. For a given image, each zone will have a copy of the image 
> and a journal.  When mirroring is enabled on an image (either at creation due 
> to the default pool settings or explicitly via the rbd CLI), the image's 
> journal will automatically register the known peer clusters.  If a peer is 
> added or rebuilt after mirroring is enabled, the peer will be registered with 
> the image and the RBD mirroring daemon on the remote peer will then be 
> responsible for snapshotting the image, transferring the base image, deleting 
> the snapshot, and starting journal replay.
> 
> A mirrored image can be in one of the following states: primary, secondary 
> (consistent), and secondary (inconsistent).  The images will also track an 
> epoch associated with the primary state for
> detecting when a secondary is now inconsistent due to a failover event. Any 
> modification (IO writes, resize, snap create, etc) to a mirrored image will 
> result in its status being automatically updated to primary.  Inconsistent 
> secondary images will disallow any modifications.
> 
> Proposed CLI Updates
> 
> To configure basic journaling support for an RBD image:
> 
> * rbd feature enable  journaling
> [--journal-object-pool ]
> [--journal-splay-width ]
> [--journal-object-size ]
> [--journal-additional-tweakable-settings]
> This will enable the RBD journaling feature bit.  A new journal will be
> created using default settings if not overridden.  
> 
> * rbd feature disable  journaling
> This will disable the RBD journaling feature bit. If there are 
> associated RBD
> mirroring peers connected to this image's journal, this will fail if 
> the mirror
> doesn't detach within a timeout.  If the image is not attached to a 
> consistency
> group, the journal will be automatically deleted.
> 
> To configure consistency groups:
> 
> * rbd consistency-group create 
> [--object-pool ]
> [--splay-width ]
> [--object-size ]
> [--additional-journal-tweakable-settings]
> This will create an empty journal for use with consistency groups 
> (i.e. attaching
> multiple RBD images to the same journal to ensure consistent replay).

For 'rbd feature' commands the option names have "journal" prefix
(--journal-object-pool), while for 'rbd consistency-group' they
don't. Is it intentional? I would prefer having "journal" prefix for
both.

> 
> * rbd consistency-group rename 

s/rename/remove/ ?

> This will remove the named consistency group journal.  If one or more 
> images
> are attached, this will fail.
> 
> * rbd consistency-group attach  
> This will enable the RBD journaling feature bit and will configure 
> the image to
> record all journal entries to the specified journal.  If journaling 
> is already
> enabled on the image, this will fail.
> 
> * rbd consistency-group detach 
> This will detach the specified image from its journal and disable the 
> RBD
> journaling feature. 
> 
> * rbd consistency-group ls
> This will list all consistency groups within the current pool.
> 
> * rbd consistency-group info 
> This will display information about the specified consistency group.
> 
> where  is [/]

Is a consistency-group just a journal (usually used for several images)
or is it something more? If I enable the journaling feature for an image, a
journal is created automatically; is that already a consistency group?

> 
> To configure mirroring support for an RBD image:
> 
> * rbd feature enable  mirroring
> This will enable mirroring for an existing image if it wasn't 
> auto-enabled
> by the default pool policy.
> 
> * rbd feature disable  mirroring
> This will disable mirroring for a specific image if enabled manually 
> or
> automatically via the default pool policy.
> 
> * rbd mirror pool enable 
> This will, by default, ensure that all images created in this
> pool have exclusive lock, journaling, and mirroring feature bits
> enabled.
> 
> * rbd mirror pool disable 
> This will clear the default image features for new images in this
> pool.

Will 'rbd mirror pool enable|disable' change behaviour only for newly
created images in the pool, or will it enable|disable mirroring for
existing images too?

> 
> * rbd mirror 
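
For readers trying to picture how these proposals fit together, here is a
rough sketch of the intended workflow. Every command and option name is
taken from the proposal above and may still change; the pool, group, and
image names, and the exact argument forms, are made up for illustration:

  # Default-enable mirroring (exclusive-lock, journaling and mirroring
  # feature bits) for all new images in a pool:
  rbd mirror pool enable rbd

  # Or enable journaling and mirroring explicitly on an existing image:
  rbd feature enable rbd/image1 journaling --journal-splay-width 4
  rbd feature enable rbd/image1 mirroring

  # Group several images behind one journal for consistent replay:
  rbd consistency-group create rbd/group1
  rbd consistency-group attach rbd/group1 rbd/image1
  rbd consistency-group attach rbd/group1 rbd/image2
  rbd consistency-group info rbd/group1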

Re: [ceph-users] radosgw + civetweb latency issue on Hammer

2015-09-22 Thread Giridhar Yasa
I encountered the same issue in my setup (with an AWS SDK client) and
on further investigation found that 'rgw print continue' was set to
false for a civetweb-driven rgw.

Updated http://tracker.ceph.com/issues/12640
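
For anyone hitting the same symptom, a sketch of the relevant ceph.conf
stanza follows. The section name and port are just the ones from Srikanth's
report below; whether enabling 100-continue is the right choice for a given
frontend should be confirmed against the tracker issue above:

  [client.radosgw.gateway]
  rgw frontends = civetweb port=5632
  # With 'rgw print continue = false', a client that sends
  # 'Expect: 100-continue' (as the AWS SDKs do) waits out its own timeout
  # before sending the request body, which shows up as time spent in
  # get_data(). Verify the correct value for your frontend.
  rgw print continue = true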

Giridhar
--
Giridhar Yasa | Flipkart Engineering | http://www.flipkart.com/


On Thu, Aug 6, 2015 at 5:31 AM, Srikanth Madugundi
 wrote:
> Hi,
>
> After upgrading to Hammer and moving from Apache to civetweb, we
> started seeing high PUT latency on the order of 2 sec for every PUT
> request. The GET request lo
>
> Attaching the radosgw logs for a single request. The ceph.conf has the
> following configuration for civetweb.
>
> [client.radosgw.gateway]
> rgw frontends = civetweb port=5632
>
>
> Further investigation revealed that the call to get_data() at
> https://github.com/ceph/ceph/blob/hammer/src/rgw/rgw_op.cc#L1786 is
> taking 2 sec to respond. The cluster is running the Hammer 0.94.2 release.
>
> Did anyone face this issue before? Is there some configuration I am missing?
>
> Regards
> Srikanth
>


test

2015-09-22 Thread wangsongbo

test


the release of ceph-extra

2015-09-22 Thread wangsongbo

Hi, all,
Since last week's attack, "ceph.com/packages/ceph-extras" can no longer be
opened; where can I get the ceph-extras releases now?


Thanks and Regards,
WangSongbo


OSD::op_is_discardable condition doubled

2015-09-22 Thread Lakis, Jacek
Hi ceph-devel!
We're checking the OSD::op_is_discardable condition twice: in the dispatcher
(handle_op) and in the worker threads
(do_request -> can_discard_request -> can_discard_op):

  if (!op->get_connection()->is_connected() &&
      op->get_version().version == 0) {
    return true;
  }

Is there any purpose in doing this twice? If not, which of the two should be
removed, in your opinion? If the check in handle_op remains, a discardable op
won't be passed on to the worker threads through the prioritized queue at all.
On the other hand, leaving it in do_request parallelizes this check (I started
parallelizing in PR 5211).
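
For reference, a simplified sketch of the two call paths follows. This is
not the exact OSD code -- types, locking and queueing are abbreviated -- it
only shows the same predicate running once in the dispatcher and once in the
worker thread after the op comes off the prioritized queue:

  // Simplified sketch; not the actual OSD sources.
  bool op_is_discardable(MOSDOp *m)
  {
    // Drop the request if the client has disconnected and this is the
    // first attempt (version == 0); a reconnecting client will resend it.
    if (!m->get_connection()->is_connected() &&
        m->get_version().version == 0) {
      return true;
    }
    return false;
  }

  void handle_op(MOSDOp *m)        // dispatcher path
  {
    if (op_is_discardable(m))
      return;                      // never enqueued at all
    enqueue_op(m);
  }

  void do_request(MOSDOp *m)       // worker-thread path
  {
    // Re-checked here because the op may have sat in the prioritized
    // queue (or been requeued) long enough for the client to go away.
    if (op_is_discardable(m))
      return;
    process_op(m);
  }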

Regards,
JJ





Ceph Tech Talk on Thursday

2015-09-22 Thread Patrick McGarry
Hey cephers,

Don't forget that this Thursday is the monthly Ceph Tech Talk
(starting @ 2p EDT instead of 1p).  This month Brent Compton will be
discussing the particulars of building a Ceph reference architecture.
Tune in on Thurs for details.

http://ceph.com/ceph-tech-talks/

If you have questions, concerns, or would like to sign up for a future
Ceph Tech Talk, please feel free to contact me! Thanks.


-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph


Re: [ceph-users] Important security noticed regarding release signing key

2015-09-22 Thread Ken Dreyer
On Tue, Sep 22, 2015 at 2:38 AM, Songbo Wang  wrote:
> Hi, all,
> Since last week's attack, "ceph.com/packages/ceph-extras" can no longer be
> opened; where can I get the ceph-extras releases now?
>
> Thanks and Regards,
> WangSongbo
>

The packages in "ceph-extras" were old and subject to CVEs (the big
one being VENOM, CVE-2015-3456). So I don't intend to host ceph-extras
in the new location.

- Ken


[no subject]

2015-09-22 Thread Redynk, Lukasz
subscribe ceph-devel




Re: OSD::op_is_discardable condition doubled

2015-09-22 Thread Sage Weil
On Tue, 22 Sep 2015, Lakis, Jacek wrote:
> Hi ceph-devel!
> We're checking OSD::op_is_discardable condition two times: in dispatcher 
> (handle_op) and in worker threads 
> (do_request->can_discard_request->can_discard_op)
> 
>   if (!op->get_connection()->is_connected() &&
>   op->get_version().version == 0) {
>     return true;
>   }
> 
> Is there any purpose to do this? If not, which of those should be 
> removed in your opinion? If condition in handle_op will remain, OP won't 
> pass to the worker threads through the prioritized queue. On the other 
> hand, leaving it in do_request will parallelize this check (I started 
> parallelizing in PR 5211).

Is there a reason not to do it twice?  It looks pretty cheap.

But, if we have to choose, I'd leave it in the worker thread.  It is 
unlikely to trigger on newly dispatched messages (that were *just* taken 
off the wire), and the worker thread check is critical to avoid wasting 
work on requeued/delayed messages that no longer have clients waiting on 
them.

sage

Re: OSD::op_is_discardable condition doubled

2015-09-22 Thread Gregory Farnum
On Tue, Sep 22, 2015 at 5:26 AM, Sage Weil  wrote:
> On Tue, 22 Sep 2015, Lakis, Jacek wrote:
>> Hi ceph-devel!
>> We're checking OSD::op_is_discardable condition two times: in dispatcher 
>> (handle_op) and in worker threads 
>> (do_request->can_discard_request->can_discard_op)
>>
>>   if (!op->get_connection()->is_connected() &&
>>   op->get_version().version == 0) {
>> return true;
>>   }
>>
>> Is there any purpose to do this? If not, which of those should be
>> removed in your opinion? If condition in handle_op will remain, OP won't
>> pass to the worker threads through the prioritized queue. On the other
>> hand, leaving it in do_request will parallelize this check (I started
>> parallelizing in PR 5211).
>
> Is there a reason not to do it twice?  It looks pretty cheap.
>
> But, if we have to choose, I'd leave it in the worker thread.  It is
> unlikely to trigger on newly dispatched messages (that were *just* taken
> off the wire), and the worker thread check is critical to avoid wasting
> work on requeued/delayed messages that no longer have clients waiting on
> them.

While this is definitely the right choice if eliminating one, I think
there was a specific reason to leave it in both. Perhaps just because
it's cheap, but I think maybe the rest of the dispatch work in that
thread relies on it having a connection. (Unless that got changed? We
are a lot more generous about lack of connections than we used to be.)
-Greg