Re: wakeup() in async messenger's event

2015-08-28 Thread Haomai Wang
On Fri, Aug 28, 2015 at 2:35 PM, Jianhui Yuan zuiwany...@gmail.com wrote:
 Hi Haomai,

 when we use the async messenger, the client (e.g. ceph -s) always gets stuck in
 WorkerPool::barrier for 30 seconds. It seems the wakeup doesn't work.

What's the ceph version and OS version? It should be a bug we already
fixed.


 Then I removed already_wakeup in wakeup(). It seems to work well. So
 can we just remove already_wakeup, and keep reading in C_handle_notify until
 there is no data to read?

 Jianhui Yuan
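
For context, the coalesced-wakeup pattern under discussion looks roughly like
this (a minimal Python sketch of the idea, not the actual AsyncMessenger C++):

import os
import threading

class EventCenter:
    def __init__(self):
        # self-pipe used to interrupt the event loop's poll/epoll wait
        self.rfd, self.wfd = os.pipe()
        os.set_blocking(self.rfd, False)
        self.already_wakeup = False      # coalesces redundant wakeups
        self.lock = threading.Lock()

    def wakeup(self):
        # Only the first caller since the last notify writes to the pipe.
        # If the flag ever stays set without a byte in the pipe, the event
        # loop sleeps until its poll timeout -- i.e. the kind of stall seen
        # in WorkerPool::barrier.
        with self.lock:
            if self.already_wakeup:
                return
            self.already_wakeup = True
        os.write(self.wfd, b"\x01")

    def handle_notify(self):
        # Drain everything that accumulated, then allow new wakeups again.
        while True:
            try:
                if not os.read(self.rfd, 256):
                    break
            except BlockingIOError:
                break
        with self.lock:
            self.already_wakeup = False

Removing already_wakeup simply makes every wakeup() write a byte, at the cost
of some extra writes; the important part is that handle_notify keeps reading
until there is no data left.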



-- 
Best Regards,

Wheat


RE: format 2TB rbd device is too slow

2015-08-28 Thread Ma, Jianpeng
Hi Ilya, 
   We can change the sector size from 512 to 4096. This can reduce the number of
writes.
I did a simple test: mkfs.xfs -f on 900G.
For the default sector size: 1m10s
Physical sector size = 4096: 0m10s

But if we change the sector size, we need the rbd metadata to record it.

Thanks!
Jianpeng
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of huang jun
 Sent: Thursday, August 27, 2015 8:44 AM
 To: Ilya Dryomov
 Cc: Haomai Wang; ceph-devel
 Subject: Re: format 2TB rbd device is too slow
 
 hi, Ilya
 
 2015-08-26 23:56 GMT+08:00 Ilya Dryomov idryo...@gmail.com:
  On Wed, Aug 26, 2015 at 6:22 PM, Haomai Wang haomaiw...@gmail.com
 wrote:
  On Wed, Aug 26, 2015 at 11:16 PM, huang jun hjwsm1...@gmail.com
 wrote:
  hi,all
  we created a 2TB rbd image; after mapping it locally, we formatted it
  to xfs with 'mkfs.xfs /dev/rbd0'. It took 318 seconds to finish,
  but a local physical disk of the same size needs just 6 seconds.
 
 
  I think librbd has two PRs related to this.
 
  After debugging, we found there are two steps in the rbd module during formatting:
  a) send 233093 DELETE requests to OSDs (number_of_requests = 2TB / 4MB);
     this step spent almost 92 seconds.
 
  I guess this(https://github.com/ceph/ceph/pull/4221/files) may help
 
  It's submitting deletes for non-existent objects, not zeroing.  The
  only thing that will really help here is the addition of rbd object
  map support to the kernel client.  That could happen in 4.4, but 4.5
  is a safer bet.
 
 
  b) send 4238 messages like this: [set-alloc-hint object_size 4194304
  write_size 4194304, write 0~512] to OSDs; that spent 227 seconds.
 
  I think kernel rbd also needs to use
  https://github.com/ceph/ceph/pull/4983/files
 
  set-alloc-hint may be a problem, but I think a bigger problem is the
  size of the write.  Are all those writes 512 bytes long?
 
 In another test formatting a 2TB rbd device, there were:
 2 messages, each writing 131072 bytes
 4000 messages, each writing 262144 bytes
 112 messages, each writing 4096 bytes
 194 messages, each writing 512 bytes
 
 the xfs info:
 meta-data=/dev/rbd/rbd/test2t   isize=256    agcount=33, agsize=16382976 blks
          =                      sectsz=512   attr=2, projid32bit=1
          =                      crc=0
 data     =                      bsize=4096   blocks=524288000, imaxpct=5
          =                      sunit=1024   swidth=1024 blks
 naming   =version 2             bsize=4096   ascii-ci=0
 log      =internal log          bsize=4096   blocks=256000, version=2
          =                      sectsz=512   sunit=8 blks, lazy-count=1
 realtime =none                  extsz=4096   blocks=0, rtextents=0
 
  Thanks,
 
  Ilya
 
 
 
 --
 thanks
 huangjun


Re: Proposal: data-at-rest encryption

2015-08-28 Thread Joshua Schmid


On 08/27/2015 03:38 PM, Sage Weil wrote:
 On Thu, 27 Aug 2015, Joshua Schmid wrote:

 On 08/27/2015 02:49 AM, Sage Weil wrote:
 Hi Joshua!


 Hi Sage,


 Overall the ceph-disk changes look pretty good, and it looks like Andrew 
 and David have both reviewed.  My only real concern/request is that the 
 key server be as pluggable as possible.  You're using ftps here, but we'd 
 also like to allow deo[1], or even the mon config-key service.

 Thanks for having a look!
 I think this should do:

 https://github.com/jschmid1/ceph/commit/7dd64c70bcb8d986568d6f379a6fbf9a0e40a441

 The service of choice can now be set in ceph.conf and will be handled
 separately. This currently covers only unlocking/mapping but will be
 extended to locking/new if this solution is acceptable.
 
 Yep!  It'd probably be much cleaner to wrap this up in a class with 
 fetch_key() and create_key(), etc. methods so that there is only one place 
 that has to instantiate the implementation based on type.
 

I don't know if I understood you correctly here. But since ceph-disk is
classless, wouldn't it be a bit strange to introduce one?
I have now put everything associated with fetching/creating a key into
separate methods, retrieve_key() and create_key(), which behave accordingly.
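
For illustration, a rough sketch (hypothetical names and option keys, not the
actual patch) of the wrapper Sage describes: one factory maps the configured
backend type to an implementation, and the rest of the code only ever calls
fetch_key()/create_key():

import os

class KeyBackend(object):
    # Interface each key-store backend (ftps, deo, mon config-key, ...) implements.
    def fetch_key(self, osd_uuid):
        raise NotImplementedError
    def create_key(self, osd_uuid):
        raise NotImplementedError

class LocalDirKeyBackend(KeyBackend):
    # Toy backend for testing: keeps keys in a local directory.
    def __init__(self, directory):
        self.directory = directory
    def _path(self, osd_uuid):
        return os.path.join(self.directory, osd_uuid + '.luks.key')
    def fetch_key(self, osd_uuid):
        with open(self._path(osd_uuid), 'rb') as f:
            return f.read()
    def create_key(self, osd_uuid):
        key = os.urandom(32)
        with open(self._path(osd_uuid), 'wb') as f:
            f.write(key)
        return key

def get_key_backend(conf):
    # The one place that instantiates the implementation based on type.
    kind = conf.get('osd_key_backend', 'local')    # hypothetical option name
    if kind == 'local':
        return LocalDirKeyBackend(conf.get('osd_key_dir', '/etc/ceph/keys'))
    # elif kind == 'ftps': return FtpsKeyBackend(...)  # as in the pull request
    # elif kind == 'mon':  return MonConfigKeyBackend(...)
    raise ValueError('unknown key backend: %s' % kind)

Whether that lives in a class or in module-level functions matters less than
having exactly one place that knows about the backend type.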



 With the original mon proposal, we also wanted an additional layer of 
 security (beyond simply access to the storage network) by 
 storing some key-fetching-key on the disk.

 Like deo does it? (From the deo README)

 
 Second, we will add a new random key to the pre-existing LUKS encrypted
 disk and then encrypt it using Deo in a known location.
 


   It looks like the ftps
 access is unauthenticated... is that right?  I would assume (I'm not the 
 expert!) that most key management systems require some credentials to 
 store/fetch keys?

 It's totally unauthenticated, that's right. It'd be possible to require
 USER/PASS for FTP.
 
 Yeah.  If we have a general method to store a key-fetching-key on the disk 
 (in the LUKS table?  I forget if this was practical), that 
 might work?  Hopefully such that various backends can all use it (e.g., as 
 the ftps password, or as a mon key)..


I guess putting it in the LuksKeySlot[1] might work. But plain dm-crypt
would then have to be handled differently; for compatibility reasons I suggest
avoiding that. Storing it on the root partition is not a good idea
either. This area definitely needs more heavy thinking.


 
 sage
 

-- 
Freundliche Grüße - Kind regards,
Joshua Schmid
Trainee - Storage
SUSE Linux GmbH - Maxfeldstr. 5 - 90409 Nürnberg

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard,
Jennifer Guild, Dilip Upmanyu, Graham Norton, HRB 21284 (AG Nürnberg)



Re: format 2TB rbd device is too slow

2015-08-28 Thread Ilya Dryomov
On Thu, Aug 27, 2015 at 3:43 AM, huang jun hjwsm1...@gmail.com wrote:
 hi, Ilya

 2015-08-26 23:56 GMT+08:00 Ilya Dryomov idryo...@gmail.com:
 On Wed, Aug 26, 2015 at 6:22 PM, Haomai Wang haomaiw...@gmail.com wrote:
 On Wed, Aug 26, 2015 at 11:16 PM, huang jun hjwsm1...@gmail.com wrote:
 hi,all
 we created a 2TB rbd image; after mapping it locally,
 we formatted it to xfs with 'mkfs.xfs /dev/rbd0'. It took 318
 seconds to finish, but a local physical disk of the same size needs
 just 6 seconds.


 I think librbd has two PRs related to this.

 After debugging, we found there are two steps in the rbd module during formatting:
 a) send 233093 DELETE requests to OSDs (number_of_requests = 2TB / 4MB);
    this step spent almost 92 seconds.

 I guess this(https://github.com/ceph/ceph/pull/4221/files) may help

 It's submitting deletes for non-existent objects, not zeroing.  The
 only thing that will really help here is the addition of rbd object map
 support to the kernel client.  That could happen in 4.4, but 4.5 is
 a safer bet.


 b) send 4238 messages like this: [set-alloc-hint object_size 4194304
 write_size 4194304, write 0~512] to OSDs; that spent 227 seconds.

 I think kernel rbd also needs to use
 https://github.com/ceph/ceph/pull/4983/files

 set-alloc-hint may be a problem, but I think a bigger problem is the
 size of the write.  Are all those writes 512 bytes long?

 In another test formatting a 2TB rbd device,
 there were:
 2 messages, each writing 131072 bytes
 4000 messages, each writing 262144 bytes
 112 messages, each writing 4096 bytes
 194 messages, each writing 512 bytes

So the majority of writes are not 512 bytes long.  I don't think
disabling set-alloc-hint (and, as of now at least, you can't disable it
anyway) would drastically change the numbers.  If you are doing mkfs
right after creating and mapping an image for the first time, you can
add the -K option to mkfs, which tells it not to try to discard.  As
for the write phase, I can't suggest anything offhand.

Thanks,

Ilya


Re: [ceph-users] Opensource plugin for pulling out cluster recovery and client IO metric

2015-08-28 Thread Gregory Farnum
On Mon, Aug 24, 2015 at 4:03 PM, Vickey Singh
vickey.singh22...@gmail.com wrote:
 Hello Ceph Geeks

 I am planning to develop a Python plugin that pulls out cluster recovery IO
 and client IO operation metrics, which can then be used with collectd.

 For example, I need to extract these values:

 recovery io 814 MB/s, 101 objects/s
 client io 85475 kB/s rd, 1430 kB/s wr, 32 op/s


 Could you please help me understand how the ceph -s and ceph -w outputs
 print cluster recovery IO and client IO information?
 Where is this information coming from? Is it coming from perf dump? If yes,
 which section of the perf dump output should I focus on? If not, how
 can I get these values?

 I tried ceph --admin-daemon /var/run/ceph/ceph-osd.48.asok perf dump, but
 it generates a huge amount of information and I am confused about which
 section of the output I should use.

This information is generated only on the monitors, based on pg stats
from the OSDs; it is slightly laggy, and can be most easily accessed by
calling ceph -s on a regular basis. You can get it with JSON output
that is easier to parse, and you can optionally set up an API server
for more programmatic access. I'm not sure of the details of doing
that last part, though.
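
A minimal sketch of that approach (the field names under pgmap are assumptions
and may differ between releases -- inspect the JSON on your cluster first):

import json
import subprocess

def read_cluster_io():
    # Poll the cluster status in JSON form instead of scraping the plain text.
    out = subprocess.check_output(['ceph', '-s', '--format', 'json'])
    pgmap = json.loads(out.decode('utf-8')).get('pgmap', {})
    return {
        'client_read_bytes_sec':  pgmap.get('read_bytes_sec', 0),
        'client_write_bytes_sec': pgmap.get('write_bytes_sec', 0),
        'client_op_per_sec':      pgmap.get('op_per_sec', 0),
        'recovery_bytes_sec':     pgmap.get('recovering_bytes_per_sec', 0),
        'recovery_objects_sec':   pgmap.get('recovering_objects_per_sec', 0),
    }

if __name__ == '__main__':
    print(read_cluster_io())
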
-Greg


Re: format 2TB rbd device is too slow

2015-08-28 Thread Ilya Dryomov
On Fri, Aug 28, 2015 at 10:36 AM, Ma, Jianpeng jianpeng...@intel.com wrote:
 Hi Ilya,
We can change the sector size from 512 to 4096. This can reduce the number of
 writes.
 I did a simple test: mkfs.xfs -f on 900G.
 For the default sector size: 1m10s
 Physical sector size = 4096: 0m10s

 But if we change the sector size, we need the rbd metadata to record it.

What exactly do you mean by changing the sector size?  xfs sector size
or physical block size of the rbd device?  If the latter, meaning
mkfs.xfs -s size 4096, I fail to see how it can result in the above
numbers.  If the former, you have to patch the kernel and I fail to see
how it can result in such an improvement too.  I probably didn't have
enough coffee, please elaborate.

Thanks,

Ilya


Re: Proposal: data-at-rest encryption

2015-08-28 Thread Shinobu Kinjo
Hi,

Thanks for your good comeback!

  to your KSS -;

Shinobu

- Original Message -
From: Joshua Schmid jsch...@suse.de
To: ski...@redhat.com, Sage Weil s...@newdream.net
Cc: Ceph Development ceph-devel@vger.kernel.org
Sent: Friday, August 28, 2015 5:11:34 PM
Subject: Re: Proposal: data-at-rest encryption



On 08/28/2015 01:32 AM, Shinobu wrote:
 Hello,
 
 Just a question.
 If the key is broken, how could ceph (or maybe the KSS, standing for key
 store server :0) recover, and where could the information to restore it be
 retrieved?
 Is there any blueprint?

Hi,

I don't know if I got your question right.
Are you asking what ceph will do if the key server is down, and where to
get the information on how to restore the OSD?

Well, there will be a timeout if the KSS is unreachable. But in a
production environment it might not be a bad idea to add HA to your KSS.
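
As a sketch of that failover idea (the host names and the <uuid>.key naming are
just assumptions for the example, not the actual layout):

import ftplib
import io

def fetch_luks_key(osd_uuid,
                   servers=('kss1.example.com', 'kss2.example.com'),
                   timeout=10):
    # Try each configured key server with a short timeout instead of hanging
    # when the first KSS is unreachable.
    last_err = None
    for host in servers:
        try:
            ftps = ftplib.FTP_TLS(host, timeout=timeout)
            ftps.login()        # anonymous, matching the current unauthenticated setup
            ftps.prot_p()       # protect the data channel
            buf = io.BytesIO()
            ftps.retrbinary('RETR %s.key' % osd_uuid, buf.write)
            ftps.quit()
            return buf.getvalue()
        except ftplib.all_errors as err:
            last_err = err      # fall through to the next server
    raise RuntimeError('no key server reachable: %s' % last_err)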

 
 Shinobu
 
 On Thu, Aug 27, 2015 at 10:38 PM, Sage Weil s...@newdream.net wrote:
 
 On Thu, 27 Aug 2015, Joshua Schmid wrote:

 On 08/27/2015 02:49 AM, Sage Weil wrote:
 Hi Joshua!


 Hi Sage,


 Overall the ceph-disk changes look pretty good, and it looks like
 Andrew
 and David have both reviewed.  My only real concern/request is that the
 key server be as pluggable as possible.  You're using ftps here, but
 we'd
 also like to allow deo[1], or even the mon config-key service.

  Thanks for having a look!
 I think this should do:


 https://github.com/jschmid1/ceph/commit/7dd64c70bcb8d986568d6f379a6fbf9a0e40a441

  The service of choice can now be set in ceph.conf and will be handled
  separately. This currently covers only unlocking/mapping but will be
  extended to locking/new if this solution is acceptable.

 Yep!  It'd probably be much cleaner to wrap this up in a class with
 fetch_key() and create_key(), etc. methods so that there is only one place
  that has to instantiate the implementation based on type.

 With the original mon proposal, we also wanted an additional layer of
 security (beyond simply access to the storage network) by
 storing some key-fetching-key on the disk.

 Like deo does it? (From the deo README)

 
 Second, we will add a new random key to the pre-existing LUKS encrypted
 disk and then encrypt it using Deo in a known location.
 


   It looks like the ftps
 access is unauthenticated... is that right?  I would assume (I'm not
  the
 expert!) that most key management systems require some credentials to
 store/fetch keys?

  It's totally unauthenticated, that's right. It'd be possible to require
  USER/PASS for FTP.

  Yeah.  If we have a general method to store a key-fetching-key on the disk
  (in the LUKS table?  I forget if this was practical), that
  might work?  Hopefully such that various backends can all use it (e.g., as
  the ftps password, or as a mon key)..

 sage

 
 
 

-- 
Freundliche Grüße - Kind regards,
Joshua Schmid
Trainee - Storage
SUSE Linux GmbH - Maxfeldstr. 5 - 90409 Nürnberg

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard,
Jennifer Guild, Dilip Upmanyu, Graham Norton, HRB 21284 (AG Nürnberg)



Re: S3:Permissions of access-key

2015-08-28 Thread Yehuda Sadeh-Weinraub
On Fri, Aug 28, 2015 at 2:17 AM, Zhengqiankun zheng.qian...@h3c.com wrote:
 hi, Yehuda:

   I have a question and hope that you can help me answer it. Different
 swift subusers can be given specific permissions, but why can't specific
 permissions be set for an S3 access key?


Probably because no one ever asked for it. It shouldn't be hard to do;
it sounds like an easy starter project if anyone wants to get their
hands dirty in the rgw code. Note that the canonical way to do it in
S3 is through user policies, which we don't (yet?) support.

Yehuda


Re: Async reads, sync writes, op thread model discussion

2015-08-28 Thread Samuel Just
Oh, yeah, we'll definitely test for correctness for async reads on
filestore, I'm just worried about validating the performance
assumptions.  The 3700s might be just fine for that validation though.
-Sam

On Fri, Aug 28, 2015 at 1:01 PM, Blinick, Stephen L
stephen.l.blin...@intel.com wrote:
 This sounds ok, with the synchronous interface still possible to the 
 ObjectStore based on return code.

  I'd think that the async read interface can be evaluated with any hardware, 
  at least for correctness, by observing the queue depth to the device during a 
  test run.  Also, I think asynchronous reads may benefit various types of NAND 
  SSDs as they do better with more parallelism, and I typically see very low 
  queue depth to them today with Filestore (one of the reasons I think doubling 
  up OSDs on a single flash device helps benchmarks).

 Thanks,

 Stephen



 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just
 Sent: Thursday, August 27, 2015 4:22 PM
 To: Milosz Tanski
 Cc: Matt Benjamin; Haomai Wang; Yehuda Sadeh-Weinraub; Sage Weil; ceph-devel
 Subject: Re: Async reads, sync writes, op thread model discussion

 It's been a couple of weeks, so I thought I'd send out a short progress 
 update.  I've started by trying to nail down enough of the threading 
 design/async interface to start refactoring do_op.  For the moment, I've 
 backtracked on the token approach mostly because it seemed more complicated 
  than necessary.  I'm thinking we'll keep a callback-like mechanism, but move 
 responsibility for queuing and execution back to the interface user by 
 allowing the user to pass a completion queue and an uninterpreted completion 
 pointer.  These two commits have the gist of the direction I'm going in (the 
  actual code is more of a placeholder today).  An OSDReactor instance will 
 replace each of the shards in the current sharded work queue.  Any aio 
 initiated by a pg operation from a reactor will pass that reactor's queue, 
 ensuring that the completion winds up back in the same thread.
 Writes would work pretty much the same way, but with two callbacks.

 My plan is to flesh this out to the point where the OSD works again, and then 
 refactor the osd write path to use this mechanism for basic rbd writes.  That 
 should be enough to let us evaluate whether this is a good path forward for 
 async writes.  Async reads may be a bit tricky to evaluate.  It seems like 
 we'd need hardware that needs that kind of queue depth and an objectstore 
 implementation which can exploit it.
 I'll wire up filestore to do async reads optionally for testing purposes, but 
 it's not clear to me that there will be cases where filestore would want to 
 do an async read rather than a sync read.

 https://github.com/athanatos/ceph/commit/642b7190d70a5970534b911f929e6e3885bf99c4
 https://github.com/athanatos/ceph/commit/42bee815081a91abd003bf7170ef1270f23222f6
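
A conceptual Python sketch of the queue/completion idea above (not the code in
those commits): each reactor owns one queue and one thread, and an aio started
from a reactor hands the objectstore that reactor's queue plus an opaque
completion, so the callback runs in the thread that started the op.

import queue
import threading

class OSDReactor:
    # Stands in for one shard of the sharded work queue: a thread draining a queue.
    def __init__(self, name):
        self.q = queue.Queue()
        self.thread = threading.Thread(target=self._run, name=name, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            fn, completion = self.q.get()
            if fn is None:
                break
            fn(completion)          # the completion runs in this reactor's thread

    def stop(self):
        self.q.put((None, None))

class FakeObjectStore:
    # Stand-in objectstore: always answers asynchronously by posting the
    # completion back onto whatever queue the caller passed in.
    def read_async(self, obj, on_done, completion_queue, completion):
        data = b'x' * 512
        completion_queue.put((lambda c: on_done(c, data), completion))

def handle_op(reactor, store, obj):
    # A pg operation running on `reactor` starts an aio and passes that
    # reactor's own queue, so the callback lands back on the same thread.
    def on_done(completion, data):
        print('%s finished on %s with %d bytes'
              % (completion, threading.current_thread().name, len(data)))
    store.read_async(obj, on_done, reactor.q, 'op-1')

A synchronous store could instead return the data directly and never touch the
queue, which is the return-code distinction Stephen mentions above.
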
 -Sam

 On Fri, Aug 14, 2015 at 3:36 PM, Milosz Tanski mil...@adfin.com wrote:
 On Fri, Aug 14, 2015 at 5:19 PM, Matt Benjamin mbenja...@redhat.com wrote:
 Hi,

 I tend to agree with your comments regarding swapcontext/fibers.  I am not 
 much more enamored of jumping to new models (new! frameworks!) as a single 
 jump, either.

  Not suggesting the libraries/frameworks. Just bringing up promises as
 an alternative technique to coroutines. Dealing with spaghetti
 evented/callback code gets old after doing it for 10+ years. Then
 throw in blocking IO.

  And FYI, dataflow promises go back in comp sci to the 80s.

 Cheers,
 - Milosz


 I like the way I interpreted Sam's design to be going, and in particular, 
 that it seems to allow for consistent handling of read, write transactions. 
  I also would like to see how Yehuda's system works before arguing 
 generalities.

 My intuition is, since the goal is more deterministic performance in
  a short horizon, you

  a. need to prioritize transparency over novel abstractions
  b. need to build solid microbenchmarks that encapsulate small, then larger
     pieces of the work pipeline

 My .05.

 Matt

 --
 Matt Benjamin
 Red Hat, Inc.
 315 West Huron Street, Suite 140A
 Ann Arbor, Michigan 48103

 http://www.redhat.com/en/technologies/storage

 tel.  734-761-4689
 fax.  734-769-8938
 cel.  734-216-5309

 - Original Message -
 From: Milosz Tanski mil...@adfin.com
 To: Haomai Wang haomaiw...@gmail.com
 Cc: Yehuda Sadeh-Weinraub ysade...@redhat.com, Samuel Just
 sj...@redhat.com, Sage Weil s...@newdream.net,
 ceph-devel@vger.kernel.org
 Sent: Friday, August 14, 2015 4:56:26 PM
 Subject: Re: Async reads, sync writes, op thread model discussion

 On Tue, Aug 11, 2015 at 10:50 PM, Haomai Wang haomaiw...@gmail.com wrote:
  On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub
  ysade...@redhat.com wrote:
  Already mentioned it on irc, adding to ceph-devel for the sake of
  completeness. I did some infrastructure work for rgw and it seems
  

Re: Hammer backport and bypassing procedure

2015-08-28 Thread Josh Durgin

On 08/28/2015 12:16 PM, Loic Dachary wrote:

Hi Abhishek,

We've just had an example of a backport merged into hammer although it did not 
follow the procedure : https://github.com/ceph/ceph/pull/5691

It's a key aspect of backports : we're bound to follow procedure, but 
developers are allowed to bypass it entirely. It may seem like something 
leading to chaos and frustration but it turns out to be exactly the opposite. 
In a nutshell, it would be a constant source of frustration for developers to 
learn and obey the rules documented at 
http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO because it would not 
benefit them significantly. It would also be a problem for us, backporters, 
because developers would not be as interested in backporting and our workload 
would significantly increase.

When a developer prepares a backport on his / her own, we update the pull 
request and the issues to obey the procedure so that (s)he does not have to. 
Sure, it's a little tedious but it's a small price to pay for the benefit of 
having a backport being dealt with. That's what I did for 
https://github.com/ceph/ceph/pull/5691 : updating the corresponding issues, 
adding cross references to the pull request.

Samuel Just felt confident enough about the backport that it did not need a 
rados run to verify it does the right thing. Since it's ultimately Sam's 
responsibility, that's also ok. The only thing we need to keep in mind when 
analyzing the next rados run is that this backport did not pass yet. We don't 
have a way to mark commits that bypassed tests just yet, if you have ideas let 
us know :-)


That was me merging it based on my local testing. I'll keep an eye out
for any fallout in the hammer runs.

Thanks for keeping everything updated Loic!
Josh


Hammer backport and bypassing procedure

2015-08-28 Thread Loic Dachary
Hi Abhishek,

We've just had an example of a backport merged into hammer although it did not 
follow the procedure : https://github.com/ceph/ceph/pull/5691

It's a key aspect of backports : we're bound to follow procedure, but 
developers are allowed to bypass it entirely. It may seem like something 
leading to chaos and frustration but it turns out to be exactly the opposite. 
In a nutshell, it would be a constant source of frustration for developers to 
learn and obey the rules documented at 
http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO because it would not 
benefit them significantly. It would also be a problem for us, backporters, 
because developers would not be as interested in backporting and our workload 
would significantly increase.

When a developer prepares a backport on his / her own, we update the pull 
request and the issues to obey the procedure so that (s)he does not have to. 
Sure, it's a little tedious but it's a small price to pay for the benefit of 
having a backport being dealt with. That's what I did for 
https://github.com/ceph/ceph/pull/5691 : updating the corresponding issues, 
adding cross references to the pull request.

Samuel Just felt confident enough about the backport that it did not need a 
rados run to verify it does the right thing. Since it's ultimately Sam's 
responsibility, that's also ok. The only thing we need to keep in mind when 
analyzing the next rados run is that this backport did not pass yet. We don't 
have a way to mark commits that bypassed tests just yet, if you have ideas let 
us know :-)

Cheers
-- 
Loïc Dachary, Artisan Logiciel Libre





v0.94.3 is published - toward v0.94.4

2015-08-28 Thread Loic Dachary
Hi Abhishek,

Since v0.94.3 was published after we started work on v0.94.4, part of 
http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO_start_working_on_a_new_point_release
 was not done yet. I've updated the HOWTO page to link to the v0.94.4 page:

http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO/diff?utf8=%E2%9C%93version=66version_from=65commit=View+differences

Please let me know if you see anything else I've missed.

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre





RE: Async reads, sync writes, op thread model discussion

2015-08-28 Thread Blinick, Stephen L
This sounds ok, with the synchronous interface still possible to the 
ObjectStore based on return code.   

I'd think that the async read interface can be evaluated with any hardware, at 
least for correctness, by observing the queue depth to the device during a test 
run.  Also, I think asynchronous reads may benefit various types of NAND SSDs 
as they do better with more parallelism, and I typically see very low queue depth 
to them today with Filestore (one of the reasons I think doubling up OSDs on a 
single flash device helps benchmarks).
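
One easy way to watch that during a run is the in-flight counter the kernel
exposes per block device (a small sketch; the device name is just an example):

import time

def sample_inflight(dev='rbd0', interval=1.0, samples=30):
    # /sys/block/<dev>/stat: the 9th field is the number of IOs currently
    # in flight to the device.
    path = '/sys/block/%s/stat' % dev
    for _ in range(samples):
        with open(path) as f:
            fields = f.read().split()
        print('%s in-flight: %s' % (dev, fields[8]))
        time.sleep(interval)

if __name__ == '__main__':
    sample_inflight()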

Thanks,

Stephen



-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just
Sent: Thursday, August 27, 2015 4:22 PM
To: Milosz Tanski
Cc: Matt Benjamin; Haomai Wang; Yehuda Sadeh-Weinraub; Sage Weil; ceph-devel
Subject: Re: Async reads, sync writes, op thread model discussion

It's been a couple of weeks, so I thought I'd send out a short progress update. 
 I've started by trying to nail down enough of the threading design/async 
interface to start refactoring do_op.  For the moment, I've backtracked on the 
token approach mostly because it seemed more complicated than necessary.  I'm 
thinking we'll keep a callback-like mechanism, but move responsibility for 
queuing and execution back to the interface user by allowing the user to pass a 
completion queue and an uninterpreted completion pointer.  These two commits 
have the gist of the direction I'm going in (the actual code is more of a 
placeholder today).  An OSDReactor instance will replace each of the shards in the 
current sharded work queue.  Any aio initiated by a pg operation from a reactor 
will pass that reactor's queue, ensuring that the completion winds up back in 
the same thread.
Writes would work pretty much the same way, but with two callbacks.

My plan is to flesh this out to the point where the OSD works again, and then 
refactor the osd write path to use this mechanism for basic rbd writes.  That 
should be enough to let us evaluate whether this is a good path forward for 
async writes.  Async reads may be a bit tricky to evaluate.  It seems like we'd 
need hardware that needs that kind of queue depth and an objectstore 
implementation which can exploit it.
I'll wire up filestore to do async reads optionally for testing purposes, but 
it's not clear to me that there will be cases where filestore would want to do 
an async read rather than a sync read.

https://github.com/athanatos/ceph/commit/642b7190d70a5970534b911f929e6e3885bf99c4
https://github.com/athanatos/ceph/commit/42bee815081a91abd003bf7170ef1270f23222f6
-Sam

On Fri, Aug 14, 2015 at 3:36 PM, Milosz Tanski mil...@adfin.com wrote:
 On Fri, Aug 14, 2015 at 5:19 PM, Matt Benjamin mbenja...@redhat.com wrote:
 Hi,

 I tend to agree with your comments regarding swapcontext/fibers.  I am not 
 much more enamored of jumping to new models (new! frameworks!) as a single 
 jump, either.

 Not suggesting the libraries/frameworks. Just bringing up promises as 
 an alternative technique to coroutines. Dealing with spaghetti 
 evented/callback code gets old after doing it for 10+ years. Then 
 throw in blocking IO.

 And FYI, dataflow promises go back in comp sci to the 80s.

 Cheers,
 - Milosz


 I like the way I interpreted Sam's design to be going, and in particular, 
 that it seems to allow for consistent handling of read, write transactions.  
 I also would like to see how Yehuda's system works before arguing 
 generalities.

 My intuition is, since the goal is more deterministic performance in 
  a short horizon, you

  a. need to prioritize transparency over novel abstractions
  b. need to build solid microbenchmarks that encapsulate small, then larger
     pieces of the work pipeline

 My .05.

 Matt

 --
 Matt Benjamin
 Red Hat, Inc.
 315 West Huron Street, Suite 140A
 Ann Arbor, Michigan 48103

 http://www.redhat.com/en/technologies/storage

 tel.  734-761-4689
 fax.  734-769-8938
 cel.  734-216-5309

 - Original Message -
 From: Milosz Tanski mil...@adfin.com
 To: Haomai Wang haomaiw...@gmail.com
 Cc: Yehuda Sadeh-Weinraub ysade...@redhat.com, Samuel Just 
 sj...@redhat.com, Sage Weil s...@newdream.net, 
 ceph-devel@vger.kernel.org
 Sent: Friday, August 14, 2015 4:56:26 PM
 Subject: Re: Async reads, sync writes, op thread model discussion

 On Tue, Aug 11, 2015 at 10:50 PM, Haomai Wang haomaiw...@gmail.com wrote:
  On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub 
  ysade...@redhat.com wrote:
  Already mentioned it on irc, adding to ceph-devel for the sake of 
  completeness. I did some infrastructure work for rgw and it seems 
  (at least to me) that it could at least be partially useful here.
  Basically it's an async execution framework that utilizes coroutines.
   It's comprised of an aio notification manager that can also be tied 
   into coroutine execution. The coroutines themselves are 
   stackless; they are implemented as state machines,