Re: wakeup() in async messenger's event
On Fri, Aug 28, 2015 at 2:35 PM, Jianhui Yuan zuiwany...@gmail.com wrote:

Hi Haomai, when we use the async messenger, the client (e.g. ceph -s) always gets stuck in WorkerPool::barrier for 30 seconds. It seems the wakeup doesn't work.

What's the ceph version and OS version? It should be a bug we already fixed before.

Then I removed already_wakeup in wakeup() and it seems to work well. So, can we just remove already_wakeup, and do the read in C_handle_notify until there is no more data to read?

Jianhui Yuan

--
Best Regards,
Wheat
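To make the "read in C_handle_notify until there is no data" suggestion concrete, here is a minimal sketch of the drain pattern, written in Python rather than the actual C++ AsyncMessenger code: the notify callback keeps reading a nonblocking wakeup pipe until it would block, so several queued wakeups are absorbed in one pass. The pipe setup and names are illustrative assumptions, not Ceph code.

```python
# Minimal sketch of the "drain until there is no data" pattern, assuming a
# nonblocking wakeup pipe; an illustration, not the AsyncMessenger code.
import fcntl
import os

def make_wakeup_pipe():
    """Create a pipe whose read end is nonblocking, like a wakeup fd."""
    r, w = os.pipe()
    flags = fcntl.fcntl(r, fcntl.F_GETFL)
    fcntl.fcntl(r, fcntl.F_SETFL, flags | os.O_NONBLOCK)
    return r, w

def handle_notify(read_fd):
    """Keep reading until the pipe is empty, so stacked wakeups collapse
    into a single callback invocation."""
    drained = 0
    while True:
        try:
            buf = os.read(read_fd, 256)
        except BlockingIOError:   # nothing left to read
            break
        if not buf:               # write end closed
            break
        drained += len(buf)
    return drained

if __name__ == '__main__':
    r, w = make_wakeup_pipe()
    os.write(w, b'x')             # two wakeups queued before the callback runs
    os.write(w, b'x')
    print(handle_notify(r))       # -> 2: both consumed in one pass
```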
RE: format 2TB rbd device is too slow
Hi Ilya,

We can change the sector size from 512 to 4096. This can reduce the number of writes. I did a simple test: for 900G, mkfs.xfs -f with the default took 1m10s; with a physical sector size of 4096 it took 0m10s. But if we change the sector size, the rbd metadata needs to record it.

Thanks!
Jianpeng

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of huang jun
Sent: Thursday, August 27, 2015 8:44 AM
To: Ilya Dryomov
Cc: Haomai Wang; ceph-devel
Subject: Re: format 2TB rbd device is too slow

hi, Ilya

2015-08-26 23:56 GMT+08:00 Ilya Dryomov idryo...@gmail.com:
On Wed, Aug 26, 2015 at 6:22 PM, Haomai Wang haomaiw...@gmail.com wrote:
On Wed, Aug 26, 2015 at 11:16 PM, huang jun hjwsm1...@gmail.com wrote:

hi, all. We created a 2TB rbd image; after mapping it locally, we formatted it to xfs with 'mkfs.xfs /dev/rbd0'. It took 318 seconds to finish, while a local physical disk of the same size needs just 6 seconds.

I think librbd has two PRs related to this.

After debugging, we found there are two steps in the rbd module during formatting:

a) send 233093 DELETE requests to the osds (number_of_requests = 2TB / 4MB); this step took almost 92 seconds.

I guess this (https://github.com/ceph/ceph/pull/4221/files) may help.

It's submitting deletes for non-existent objects, not zeroing. The only thing that will really help here is the addition of rbd object map support to the kernel client. That could happen in 4.4, but 4.5 is a safer bet.

b) send 4238 messages like this: [set-alloc-hint object_size 4194304 write_size 4194304,write 0~512] to the osds; that took 227 seconds.

I think kernel rbd also needs to use https://github.com/ceph/ceph/pull/4983/files

set-alloc-hint may be a problem, but I think a bigger problem is the size of the write. Are all those writes 512 bytes long?

In another test formatting a 2TB rbd device, there were:
2 messages, each writing 131072 bytes
4000 messages, each writing 262144 bytes
112 messages, each writing 4096 bytes
194 messages, each writing 512 bytes

the xfs info:
meta-data=/dev/rbd/rbd/test2t  isize=256    agcount=33, agsize=16382976 blks
         =                     sectsz=512   attr=2, projid32bit=1
         =                     crc=0
data     =                     bsize=4096   blocks=524288000, imaxpct=5
         =                     sunit=1024   swidth=1024 blks
naming   =version 2            bsize=4096   ascii-ci=0
log      =internal log         bsize=4096   blocks=256000, version=2
         =                     sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                 extsz=4096   blocks=0, rtextents=0

Thanks,
Ilya

--
thanks
huangjun
Re: Proposal: data-at-rest encryption
On 08/27/2015 03:38 PM, Sage Weil wrote:
On Thu, 27 Aug 2015, Joshua Schmid wrote:
On 08/27/2015 02:49 AM, Sage Weil wrote:

Hi Joshua!

Hi Sage,

Overall the ceph-disk changes look pretty good, and it looks like Andrew and David have both reviewed. My only real concern/request is that the key server be as pluggable as possible. You're using ftps here, but we'd also like to allow deo[1], or even the mon config-key service.

Thanks for having a look! I think this should do: https://github.com/jschmid1/ceph/commit/7dd64c70bcb8d986568d6f379a6fbf9a0e40a441 The service of choice can now be set in ceph.conf and will be handled separately. This is currently only for unlocking/mapping but will be extended to locking/new if this solution is acceptable.

Yep! It'd probably be much cleaner to wrap this up in a class with fetch_key() and create_key(), etc. methods, so that there is only one place that has to instantiate the implementation based on type.

I don't know if I understood you correctly here. But since ceph-disk is classless, wouldn't it be a bit strange to introduce one? I have now put everything associated with fetching/creating a key into separate methods, retrieve_key() and create_key(), which will behave accordingly.

With the original mon proposal, we also wanted an additional layer of security (beyond simply access to the storage network) by storing some key-fetching-key on the disk.

Like deo does it? (From the deo README:) Second, we will add a new random key to the pre-existing LUKS encrypted disk and then encrypt it using Deo in a known location.

It looks like the ftps access is unauthenticated... is that right? I would assume (I'm not the expert!) that most key management systems require some credentials to store/fetch keys?

It's totally unauthenticated, that's right. It'd be possible to require USER/PASS for ftp.

Yeah. If we have a general method to store a key-fetching-key on the LUKS disk (in the LUKS table? I forget if this was practical), that might work? Hopefully such that various backends can all use it (e.g., as the ftps password, or as a mon key)..

I guess putting it in the LuksKeySlot[1] might work. But plain dm-crypt would have to be handled differently then. For compatibility reasons I suggest avoiding that. But storing it on the root partition is not a good idea either. This area definitely needs more heavy thinking..

sage

--
Freundliche Grüße - Kind regards,
Joshua Schmid
Trainee - Storage
SUSE Linux GmbH - Maxfeldstr. 5 - 90409 Nürnberg
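For illustration, here is a rough sketch of the wrapper class Sage describes: each backend exposes fetch_key()/create_key(), and a single factory instantiates the implementation based on a config value. The class names, config option, and method bodies are hypothetical; the actual ceph-disk change may look quite different.

```python
# Hypothetical sketch of a pluggable key-service wrapper for ceph-disk;
# names and the config option are illustrative, not the real implementation.

class KeyService(object):
    """Common interface every key backend implements."""
    def fetch_key(self, osd_uuid):
        raise NotImplementedError

    def create_key(self, osd_uuid):
        raise NotImplementedError

class FtpsKeyService(KeyService):
    def __init__(self, host, user=None, password=None):
        self.host, self.user, self.password = host, user, password

    def fetch_key(self, osd_uuid):
        # retrieve the dm-crypt key for this OSD over FTPS
        pass

    def create_key(self, osd_uuid):
        # generate a random key and upload it to the FTPS server
        pass

class MonConfigKeyService(KeyService):
    def fetch_key(self, osd_uuid):
        # e.g. shell out to 'ceph config-key get <some key path>'
        pass

    def create_key(self, osd_uuid):
        # e.g. 'ceph config-key put <some key path> <key>'
        pass

def key_service_from_conf(conf):
    """The one place that instantiates a backend based on its type."""
    backend = conf.get('osd_dmcrypt_key_service', 'ftps')
    if backend == 'ftps':
        return FtpsKeyService(conf.get('key_service_host'))
    if backend == 'mon':
        return MonConfigKeyService()
    raise ValueError('unknown key service: %s' % backend)
```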
Re: format 2TB rbd device is too slow
On Thu, Aug 27, 2015 at 3:43 AM, huang jun hjwsm1...@gmail.com wrote:

hi, Ilya

2015-08-26 23:56 GMT+08:00 Ilya Dryomov idryo...@gmail.com:
On Wed, Aug 26, 2015 at 6:22 PM, Haomai Wang haomaiw...@gmail.com wrote:
On Wed, Aug 26, 2015 at 11:16 PM, huang jun hjwsm1...@gmail.com wrote:

hi, all. We created a 2TB rbd image; after mapping it locally, we formatted it to xfs with 'mkfs.xfs /dev/rbd0'. It took 318 seconds to finish, while a local physical disk of the same size needs just 6 seconds.

I think librbd has two PRs related to this.

After debugging, we found there are two steps in the rbd module during formatting:

a) send 233093 DELETE requests to the osds (number_of_requests = 2TB / 4MB); this step took almost 92 seconds.

I guess this (https://github.com/ceph/ceph/pull/4221/files) may help.

It's submitting deletes for non-existent objects, not zeroing. The only thing that will really help here is the addition of rbd object map support to the kernel client. That could happen in 4.4, but 4.5 is a safer bet.

b) send 4238 messages like this: [set-alloc-hint object_size 4194304 write_size 4194304,write 0~512] to the osds; that took 227 seconds.

I think kernel rbd also needs to use https://github.com/ceph/ceph/pull/4983/files

set-alloc-hint may be a problem, but I think a bigger problem is the size of the write. Are all those writes 512 bytes long?

In another test formatting a 2TB rbd device, there were:
2 messages, each writing 131072 bytes
4000 messages, each writing 262144 bytes
112 messages, each writing 4096 bytes
194 messages, each writing 512 bytes

So the majority of the writes are not 512 bytes long. I don't think disabling set-alloc-hint (and, as of now at least, you can't disable it anyway) would drastically change the numbers.

If you are doing mkfs right after creating and mapping an image for the first time, you can add the -K option to mkfs, which tells it not to try to discard. As for the write phase, I can't suggest anything off hand.

Thanks,
Ilya
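For reference, a sketch of the workflow Ilya suggests, with placeholder pool/image names: create and map a fresh image, then pass -K to mkfs.xfs so it skips the discard pass (the ~233k object DELETEs seen in the trace above). This assumes a brand-new image that has never been written to.

```python
# Sketch: skip the discard pass when formatting a freshly created image.
# Pool/image names are placeholders; run as root with the rbd module loaded.
import subprocess

def run(cmd):
    print('+', ' '.join(cmd))
    subprocess.check_call(cmd)

run(['rbd', 'create', 'rbd/test2t', '--size', str(2 * 1024 * 1024)])   # 2 TB (size in MB)
dev = subprocess.check_output(['rbd', 'map', 'rbd/test2t']).decode().strip()
# -K tells mkfs.xfs not to attempt to discard blocks at mkfs time, which
# avoids the per-object DELETE requests on a never-written image.
run(['mkfs.xfs', '-f', '-K', dev])
```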
Re: [ceph-users] Opensource plugin for pulling out cluster recovery and client IO metric
On Mon, Aug 24, 2015 at 4:03 PM, Vickey Singh vickey.singh22...@gmail.com wrote:

Hello Ceph Geeks,

I am planning to develop a Python plugin that pulls out cluster recovery IO and client IO operation metrics, which can then be used with collectd. For example, I need to extract these values:

recovery io 814 MB/s, 101 objects/s
client io 85475 kB/s rd, 1430 kB/s wr, 32 op/s

Could you please help me understand how the ceph -s and ceph -w outputs print cluster recovery IO and client IO information? Where does this information come from? Is it coming from perf dump? If yes, which section of the perf dump output should I focus on? If not, how can I get these values? I tried ceph --admin-daemon /var/run/ceph/ceph-osd.48.asok perf dump, but it generates a huge amount of information and I am confused about which section of the output I should use.

This information is generated only on the monitors, based on pg stats from the OSDs; it is slightly laggy, and can be most easily accessed by calling ceph -s on a regular basis. You can get it with JSON output, which is easier to parse, and you can optionally set up an API server for more programmatic access. I'm not sure on the details of doing that last part, though.

-Greg
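As a starting point for such a plugin, here is a small sketch of the approach Greg describes: poll ceph -s with JSON output and read the client/recovery rates from the pgmap section. The exact key names can vary between releases, so every field is treated as optional; this is an illustration, not a finished collectd plugin.

```python
# Sketch: pull client and recovery IO rates from 'ceph -s --format json'.
# Key names may differ between releases, so missing fields default to 0.
import json
import subprocess

def cluster_io_metrics():
    out = subprocess.check_output(['ceph', '-s', '--format', 'json'])
    pgmap = json.loads(out.decode('utf-8')).get('pgmap', {})
    return {
        'client_read_bytes_sec':  pgmap.get('read_bytes_sec', 0),
        'client_write_bytes_sec': pgmap.get('write_bytes_sec', 0),
        'client_op_per_sec':      pgmap.get('op_per_sec', 0),
        'recovery_bytes_sec':     pgmap.get('recovering_bytes_per_sec', 0),
        'recovery_objects_sec':   pgmap.get('recovering_objects_per_sec', 0),
    }

if __name__ == '__main__':
    # Called periodically, e.g. from a collectd python plugin read callback.
    print(cluster_io_metrics())
```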
Re: format 2TB rbd device is too slow
On Fri, Aug 28, 2015 at 10:36 AM, Ma, Jianpeng jianpeng...@intel.com wrote:

Hi Ilya, we can change the sector size from 512 to 4096. This can reduce the number of writes. I did a simple test: for 900G, mkfs.xfs -f with the default took 1m10s; with a physical sector size of 4096 it took 0m10s. But if we change the sector size, the rbd metadata needs to record it.

What exactly do you mean by changing the sector size? The xfs sector size or the physical block size of the rbd device? If the latter, meaning mkfs.xfs -s size=4096, I fail to see how it can result in the above numbers. If the former, you have to patch the kernel, and I fail to see how it can result in such an improvement either. I probably didn't have enough coffee, please elaborate.

Thanks,
Ilya
Re: Proposal: data-at-rest encryption
Hi,

Thanks for your good comeback regarding your KSS ;-)

Shinobu

----- Original Message -----
From: Joshua Schmid jsch...@suse.de
To: ski...@redhat.com, Sage Weil s...@newdream.net
Cc: Ceph Development ceph-devel@vger.kernel.org
Sent: Friday, August 28, 2015 5:11:34 PM
Subject: Re: Proposal: data-at-rest encryption

On 08/28/2015 01:32 AM, Shinobu wrote:

Hello, just a question. If the key is broken, how could ceph (maybe named KSS, standing for key store server :0) recover, and where could the information to restore it be retrieved? Any blueprint?

Hi, I don't know if I got your question right. Are you asking what ceph will do if the keyserver is down, and where to get the information on how to restore the OSD? Well, there will be a timeout if the KSS is unreachable. But in a production environment it might not be a bad idea to add HA to your KSS.

Shinobu

On Thu, Aug 27, 2015 at 10:38 PM, Sage Weil s...@newdream.net wrote:
On Thu, 27 Aug 2015, Joshua Schmid wrote:
On 08/27/2015 02:49 AM, Sage Weil wrote:

Hi Joshua!

Hi Sage,

Overall the ceph-disk changes look pretty good, and it looks like Andrew and David have both reviewed. My only real concern/request is that the key server be as pluggable as possible. You're using ftps here, but we'd also like to allow deo[1], or even the mon config-key service.

Thanks for having a look! I think this should do: https://github.com/jschmid1/ceph/commit/7dd64c70bcb8d986568d6f379a6fbf9a0e40a441 The service of choice can now be set in ceph.conf and will be handled separately. This is currently only for unlocking/mapping but will be extended to locking/new if this solution is acceptable.

Yep! It'd probably be much cleaner to wrap this up in a class with fetch_key() and create_key(), etc. methods, so that there is only one place that has to instantiate the implementation based on type.

With the original mon proposal, we also wanted an additional layer of security (beyond simply access to the storage network) by storing some key-fetching-key on the disk.

Like deo does it? (From the deo README:) Second, we will add a new random key to the pre-existing LUKS encrypted disk and then encrypt it using Deo in a known location.

It looks like the ftps access is unauthenticated... is that right? I would assume (I'm not the expert!) that most key management systems require some credentials to store/fetch keys?

It's totally unauthenticated, that's right. It'd be possible to require USER/PASS for ftp.

Yeah. If we have a general method to store a key-fetching-key on the LUKS disk (in the LUKS table? I forget if this was practical), that might work? Hopefully such that various backends can all use it (e.g., as the ftps password, or as a mon key)..

sage

--
Freundliche Grüße - Kind regards,
Joshua Schmid
Trainee - Storage
SUSE Linux GmbH - Maxfeldstr. 5 - 90409 Nürnberg
Re: S3:Permissions of access-key
On Fri, Aug 28, 2015 at 2:17 AM, Zhengqiankun zheng.qian...@h3c.com wrote:

Hi Yehuda, I have a question and hope that you can help me answer it. Different Swift subusers can be given specific permissions, so why can't a specific permission be set for an S3 access key?

Probably because no one ever asked for it. It shouldn't be hard to do; it sounds like an easy starter project if anyone wants to get their hands dirty in the rgw code. Note that the canonical way to do this in S3 is through user policies, which we don't (yet?) support.

Yehuda
Re: Async reads, sync writes, op thread model discussion
Oh, yeah, we'll definitely test for correctness for async reads on filestore; I'm just worried about validating the performance assumptions. The 3700s might be just fine for that validation, though.
-Sam

On Fri, Aug 28, 2015 at 1:01 PM, Blinick, Stephen L stephen.l.blin...@intel.com wrote:

This sounds ok, with the synchronous interface still possible to the ObjectStore based on return code. I'd think that the async read interface can be evaluated with any hardware, at least for correctness, by observing the queue depth to the device during a test run. Also, I think asynchronous reads may benefit various types of NAND SSDs, as they do better with more parallelism and I typically see very low queue depth to them today with Filestore (one of the reasons I think doubling up OSDs on a single flash device helps benchmarks).

Thanks,
Stephen

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just
Sent: Thursday, August 27, 2015 4:22 PM
To: Milosz Tanski
Cc: Matt Benjamin; Haomai Wang; Yehuda Sadeh-Weinraub; Sage Weil; ceph-devel
Subject: Re: Async reads, sync writes, op thread model discussion

It's been a couple of weeks, so I thought I'd send out a short progress update. I've started by trying to nail down enough of the threading design/async interface to start refactoring do_op. For the moment, I've backtracked on the token approach, mostly because it seemed more complicated than necessary. I'm thinking we'll keep a callback-like mechanism, but move responsibility for queuing and execution back to the interface user by allowing the user to pass a completion queue and an uninterpreted completion pointer. These two commits have the gist of the direction I'm going in (the actual code is more a placeholder today). An OSDReactor instance will replace each of the shards in the current sharded work queue. Any aio initiated by a pg operation from a reactor will pass that reactor's queue, ensuring that the completion winds up back in the same thread. Writes would work pretty much the same way, but with two callbacks.

My plan is to flesh this out to the point where the OSD works again, and then refactor the osd write path to use this mechanism for basic rbd writes. That should be enough to let us evaluate whether this is a good path forward for async writes. Async reads may be a bit tricky to evaluate. It seems like we'd need hardware that needs that kind of queue depth and an objectstore implementation which can exploit it. I'll wire up filestore to do async reads optionally for testing purposes, but it's not clear to me that there will be cases where filestore would want to do an async read rather than a sync read.

https://github.com/athanatos/ceph/commit/642b7190d70a5970534b911f929e6e3885bf99c4
https://github.com/athanatos/ceph/commit/42bee815081a91abd003bf7170ef1270f23222f6

-Sam

On Fri, Aug 14, 2015 at 3:36 PM, Milosz Tanski mil...@adfin.com wrote:
On Fri, Aug 14, 2015 at 5:19 PM, Matt Benjamin mbenja...@redhat.com wrote:

Hi, I tend to agree with your comments regarding swapcontext/fibers. I am not much more enamored of jumping to new models (new! frameworks!) as a single jump, either.

Not suggesting the libraries/frameworks, just bringing up promises as an alternative technique to coroutines. Dealing with spaghetti evented/callback code gets old after doing it for 10+ years. Then throw in blocking IO. And FYI, data-flow promises go back in comp sci to the 80s.

Cheers,
- Milosz

I like the way I interpreted Sam's design to be going, and in particular, that it seems to allow for consistent handling of read and write transactions. I also would like to see how Yehuda's system works before arguing generalities. My intuition is, since the goal is more deterministic performance in a short horizon, you a. need to prioritize transparency over novel abstractions, and b. need to build solid microbenchmarks that encapsulate small, then larger pieces of the work pipeline.

My .05.

Matt

--
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103
http://www.redhat.com/en/technologies/storage
tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309

----- Original Message -----
From: Milosz Tanski mil...@adfin.com
To: Haomai Wang haomaiw...@gmail.com
Cc: Yehuda Sadeh-Weinraub ysade...@redhat.com, Samuel Just sj...@redhat.com, Sage Weil s...@newdream.net, ceph-devel@vger.kernel.org
Sent: Friday, August 14, 2015 4:56:26 PM
Subject: Re: Async reads, sync writes, op thread model discussion

On Tue, Aug 11, 2015 at 10:50 PM, Haomai Wang haomaiw...@gmail.com wrote:
On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub ysade...@redhat.com wrote:

Already mentioned it on irc, adding to ceph-devel for the sake of completeness. I did some infrastructure work for rgw and it seems (at least to me) that it could at least be partially useful here. Basically it's an async execution framework that utilizes coroutines. It's comprised of an aio notification manager that can also be tied into coroutine execution. The coroutines themselves are stackless; they are implemented as state machines,
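To make the completion-queue idea easier to picture, here is a toy illustration in Python (the real work is C++ in the OSD): the caller hands the object store a completion queue plus an opaque completion, the store initiates the IO, and the result is delivered back on the caller's own queue so it completes in the issuing reactor's thread. All names here are invented for the illustration.

```python
# Toy model of "pass a completion queue and an uninterpreted completion
# pointer": results are routed back to the queue owned by the issuing reactor.
import queue
import threading

def async_read(store, oid, completion_queue, completion):
    """Initiate a read; deliver (completion, data) to the caller's queue."""
    def _do_read():
        data = store.get(oid)                      # stand-in for the backend read
        completion_queue.put((completion, data))   # opaque completion travels untouched
    threading.Thread(target=_do_read).start()

class Reactor(object):
    """One per shard; only this thread ever runs its completions."""
    def __init__(self):
        self.completions = queue.Queue()

    def run_once(self):
        completion, result = self.completions.get()
        completion(result)                         # interpreted only by the caller

if __name__ == '__main__':
    store = {'obj1': b'payload'}
    reactor = Reactor()
    async_read(store, 'obj1', reactor.completions,
               lambda data: print('read completed:', data))
    reactor.run_once()
```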
Re: Hammer backport and bypassing procedure
On 08/28/2015 12:16 PM, Loic Dachary wrote:

Hi Abhishek,

We've just had an example of a backport merged into hammer although it did not follow the procedure: https://github.com/ceph/ceph/pull/5691

It's a key aspect of backports: we're bound to follow the procedure, but developers are allowed to bypass it entirely. It may seem like something leading to chaos and frustration, but it turns out to be exactly the opposite. In a nutshell, it would be a constant source of frustration for developers to learn and obey the rules documented at http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO because it would not benefit them significantly. It would also be a problem for us, backporters, because developers would not be as interested in backporting and our workload would significantly increase. When a developer prepares a backport on his/her own, we update the pull request and the issues to obey the procedure so that (s)he does not have to. Sure, it's a little tedious, but it's a small price to pay for the benefit of having a backport being dealt with. That's what I did for https://github.com/ceph/ceph/pull/5691: updating the corresponding issues and adding cross-references to the pull request.

Samuel Just felt confident enough about the backport that it did not need a rados run to verify it does the right thing. Since it's ultimately Sam's responsibility, that's also ok. The only thing we need to keep in mind when analyzing the next rados run is that this backport has not passed it yet. We don't have a way to mark commits that bypassed tests just yet; if you have ideas, let us know :-)

That was me merging it based on my local testing. I'll keep an eye out for any fallout in the hammer runs. Thanks for keeping everything updated, Loic!

Josh
Hammer backport and bypassing procedure
Hi Abhishek,

We've just had an example of a backport merged into hammer although it did not follow the procedure: https://github.com/ceph/ceph/pull/5691

It's a key aspect of backports: we're bound to follow the procedure, but developers are allowed to bypass it entirely. It may seem like something leading to chaos and frustration, but it turns out to be exactly the opposite. In a nutshell, it would be a constant source of frustration for developers to learn and obey the rules documented at http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO because it would not benefit them significantly. It would also be a problem for us, backporters, because developers would not be as interested in backporting and our workload would significantly increase. When a developer prepares a backport on his/her own, we update the pull request and the issues to obey the procedure so that (s)he does not have to. Sure, it's a little tedious, but it's a small price to pay for the benefit of having a backport being dealt with. That's what I did for https://github.com/ceph/ceph/pull/5691: updating the corresponding issues and adding cross-references to the pull request.

Samuel Just felt confident enough about the backport that it did not need a rados run to verify it does the right thing. Since it's ultimately Sam's responsibility, that's also ok. The only thing we need to keep in mind when analyzing the next rados run is that this backport has not passed it yet. We don't have a way to mark commits that bypassed tests just yet; if you have ideas, let us know :-)

Cheers

--
Loïc Dachary, Artisan Logiciel Libre
v0.94.3 is published - toward v0.94.4
Hi Abhishek,

Since v0.94.3 was published after we started work on v0.94.4, part of http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO_start_working_on_a_new_point_release was not done yet. I've updated the HOWTO page to link to the v0.94.4 page: http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO/diff?utf8=%E2%9C%93&version=66&version_from=65&commit=View+differences

Please let me know if you see anything else I've missed.

Cheers

--
Loïc Dachary, Artisan Logiciel Libre
RE: Async reads, sync writes, op thread model discussion
This sounds ok, with the synchronous interface still possible to the ObjectStore based on return code. I'd think that the async read interface can be evaluated with any hardware, at least for correctness, by observing the queue depth to the device during a test run. Also, I think asynchronous reads may benefit various types of NAND SSDs, as they do better with more parallelism and I typically see very low queue depth to them today with Filestore (one of the reasons I think doubling up OSDs on a single flash device helps benchmarks).

Thanks,
Stephen

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just
Sent: Thursday, August 27, 2015 4:22 PM
To: Milosz Tanski
Cc: Matt Benjamin; Haomai Wang; Yehuda Sadeh-Weinraub; Sage Weil; ceph-devel
Subject: Re: Async reads, sync writes, op thread model discussion

It's been a couple of weeks, so I thought I'd send out a short progress update. I've started by trying to nail down enough of the threading design/async interface to start refactoring do_op. For the moment, I've backtracked on the token approach, mostly because it seemed more complicated than necessary. I'm thinking we'll keep a callback-like mechanism, but move responsibility for queuing and execution back to the interface user by allowing the user to pass a completion queue and an uninterpreted completion pointer. These two commits have the gist of the direction I'm going in (the actual code is more a placeholder today). An OSDReactor instance will replace each of the shards in the current sharded work queue. Any aio initiated by a pg operation from a reactor will pass that reactor's queue, ensuring that the completion winds up back in the same thread. Writes would work pretty much the same way, but with two callbacks.

My plan is to flesh this out to the point where the OSD works again, and then refactor the osd write path to use this mechanism for basic rbd writes. That should be enough to let us evaluate whether this is a good path forward for async writes. Async reads may be a bit tricky to evaluate. It seems like we'd need hardware that needs that kind of queue depth and an objectstore implementation which can exploit it. I'll wire up filestore to do async reads optionally for testing purposes, but it's not clear to me that there will be cases where filestore would want to do an async read rather than a sync read.

https://github.com/athanatos/ceph/commit/642b7190d70a5970534b911f929e6e3885bf99c4
https://github.com/athanatos/ceph/commit/42bee815081a91abd003bf7170ef1270f23222f6

-Sam

On Fri, Aug 14, 2015 at 3:36 PM, Milosz Tanski mil...@adfin.com wrote:
On Fri, Aug 14, 2015 at 5:19 PM, Matt Benjamin mbenja...@redhat.com wrote:

Hi, I tend to agree with your comments regarding swapcontext/fibers. I am not much more enamored of jumping to new models (new! frameworks!) as a single jump, either.

Not suggesting the libraries/frameworks, just bringing up promises as an alternative technique to coroutines. Dealing with spaghetti evented/callback code gets old after doing it for 10+ years. Then throw in blocking IO. And FYI, data-flow promises go back in comp sci to the 80s.

Cheers,
- Milosz

I like the way I interpreted Sam's design to be going, and in particular, that it seems to allow for consistent handling of read and write transactions. I also would like to see how Yehuda's system works before arguing generalities. My intuition is, since the goal is more deterministic performance in a short horizon, you a. need to prioritize transparency over novel abstractions, and b. need to build solid microbenchmarks that encapsulate small, then larger pieces of the work pipeline.

My .05.

Matt

--
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103
http://www.redhat.com/en/technologies/storage
tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309

----- Original Message -----
From: Milosz Tanski mil...@adfin.com
To: Haomai Wang haomaiw...@gmail.com
Cc: Yehuda Sadeh-Weinraub ysade...@redhat.com, Samuel Just sj...@redhat.com, Sage Weil s...@newdream.net, ceph-devel@vger.kernel.org
Sent: Friday, August 14, 2015 4:56:26 PM
Subject: Re: Async reads, sync writes, op thread model discussion

On Tue, Aug 11, 2015 at 10:50 PM, Haomai Wang haomaiw...@gmail.com wrote:
On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub ysade...@redhat.com wrote:

Already mentioned it on irc, adding to ceph-devel for the sake of completeness. I did some infrastructure work for rgw and it seems (at least to me) that it could at least be partially useful here. Basically it's an async execution framework that utilizes coroutines. It's comprised of an aio notification manager that can also be tied into coroutine execution. The coroutines themselves are stackless; they are implemented as state machines,