Re: newstore direction
On 10/21/2015 05:06 AM, Allen Samuels wrote: I agree that moving newStore to raw block is going to be a significant development effort. But the current scheme of using a KV store combined with a normal file system is always going to be problematic (FileStore or NewStore). This is caused by the transactional requirements of the ObjectStore interface, essentially you need to make transactionally consistent updates to two indexes, one of which doesn't understand transactions (File Systems) and can never be tightly-connected to the other one. You'll always be able to make this "loosely coupled" approach work, but it will never be optimal. The real question is whether the performance difference of a suboptimal implementation is something that you can live with compared to the longer gestation period of the more optimal implementation. Clearly, Sage believes that the performance difference is significant or he wouldn't have kicked off this discussion in the first place.

While I think we can all agree that writing a full-up KV and raw-block ObjectStore is a significant amount of work, I will offer the case that the "loosely coupled" scheme may not have as much time-to-market advantage as it appears to have. One example: NewStore performance is limited due to bugs in XFS that won't be fixed in the field for quite some time (it'll take at least a couple of years before a patched version of XFS will be widely deployed at customer environments). Another example: Sage has just had to substantially rework the journaling code of rocksDB.

In short, as you can tell, I'm full-throated in favor of going down the optimal route. Internally at Sandisk, we have a KV store that is optimized for flash (it's called ZetaScale). We have extended it with a raw block allocator just as Sage is now proposing to do. Our internal performance measurements show a significant advantage over the current NewStore. That performance advantage stems primarily from two things:

Has there been any discussion regarding open-sourcing ZetaScale?

(1) ZetaScale uses a B+-tree internally rather than an LSM tree (levelDB/RocksDB). LSM trees experience exponential increase in write amplification (cost of an insert) as the amount of data under management increases. B+tree write-amplification is nearly constant independent of the size of data under management. As the KV database gets larger (since newStore is effectively moving the per-file inode into the kv database; don't forget checksums that Sage wants to add :)) this performance delta swamps all others.

(2) Having a KV and a file-system causes a double lookup. This costs CPU time and disk accesses to page in data-structure indexes, and metadata efficiency decreases. You can't avoid (2) as long as you're using a file system.

Yes, an LSM tree performs better on HDD than does a B-tree, which is a good argument for keeping the KV module pluggable.
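To make the write-amplification comparison in (1) a bit more concrete, here is a rough back-of-the-envelope model in C. It is only a sketch: the fanout, memtable size, page size, and entry size are assumed placeholder numbers, not measurements of RocksDB or ZetaScale, and real stores differ in many details (compaction style, bloom filters, journaling). It only illustrates the directional claim that per-insert cost in a leveled LSM grows with the amount of data under management while a B+tree's stays roughly flat.

    #include <math.h>
    #include <stdio.h>

    /* Illustrative write-amplification estimate; all constants are assumptions. */
    int main(void)
    {
        const double fanout = 10.0;        /* size ratio between LSM levels        */
        const double memtable_mb = 256.0;  /* size of the first level              */
        const double btree_page = 4096.0;  /* B+tree page size in bytes            */
        const double entry = 512.0;        /* average KV entry size in bytes       */

        for (double db_gb = 1; db_gb <= 4096; db_gb *= 4) {
            double db_mb = db_gb * 1024.0;
            /* Leveled-compaction rule of thumb: each level is rewritten roughly
             * fanout/2 times, and the number of levels grows with data size. */
            double levels = ceil(log(db_mb / memtable_mb) / log(fanout));
            if (levels < 1)
                levels = 1;
            double lsm_wa = levels * fanout / 2.0;
            /* B+tree: an update dirties roughly one leaf page (plus a small,
             * size-independent journal cost), regardless of tree size. */
            double btree_wa = btree_page / entry;
            printf("%6.0f GB  levels=%2.0f  LSM WA ~%5.1f  B+tree WA ~%4.1f\n",
                   db_gb, levels, lsm_wa, btree_wa);
        }
        return 0;
    }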
Allen Samuels
Software Architect, Fellow, Systems and Software Solutions
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler
Sent: Tuesday, October 20, 2015 11:32 AM
To: Sage Weil; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/19/2015 03:49 PM, Sage Weil wrote: The current design is based on two simple ideas:

1) a key/value interface is a better way to manage all of our internal metadata (object metadata, attrs, layout, collection membership, write-ahead logging, overlay data, etc.)

2) a file system is well suited for storing object data (as files).

So far #1 is working out well, but I'm questioning the wisdom of #2. A few things:

- We currently write the data to the file, fsync, then commit the kv transaction. That's at least 3 IOs: one for the data, one for the fs journal, one for the kv txn to commit (at least once my rocksdb changes land... the kv commit is currently 2-3). So two people are managing metadata, here: the fs managing the file metadata (with its own journal) and the kv backend (with its journal).

If all of the fsync()'s fall into the same backing file system, are you sure that each fsync() takes the same time? Depending on the local FS implementation of course, but the order of issuing those fsync()'s can effectively make some of them no-ops.

- On read we have to open files by name, which means traversing the fs namespace. Newstore tries to keep it as flat and simple as possible, but at a minimum it is a couple btree lookups. We'd love to use open by handle (which would reduce this to 1 btree traversal), but running the daemon as ceph and not root makes that hard...

This seems like a pretty low hurdle to overcome.

- ...and file systems insist on updating mtime on writes, even when it is an overwrite with no allocation changes. (We don't care
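To spell out the write path quoted in the first bullet above, here is a rough sketch of the three separately synchronized steps. It is illustrative only, not NewStore code: the kv_commit_sync() helper below is a made-up stand-in for the KeyValueDB/rocksdb transaction commit (modeled here as an O_DSYNC append to a journal file), just to show where the third synchronous IO comes from. main() is omitted.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Hypothetical stand-in for the KV transaction commit (rocksdb WAL write). */
    static int kv_commit_sync(const char *key, const char *val)
    {
        int kfd = open("kv.journal", O_CREAT | O_WRONLY | O_APPEND | O_DSYNC, 0644);
        if (kfd < 0)
            return -1;
        char rec[512];
        int n = snprintf(rec, sizeof(rec), "%s=%s\n", key, val);
        int r = (write(kfd, rec, n) == n) ? 0 : -1;   /* IO #3: kv txn commit */
        close(kfd);
        return r;
    }

    int write_object(const char *path, const void *data, size_t len)
    {
        int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        int r = -1;
        if (write(fd, data, len) == (ssize_t)len &&   /* IO #1: object data      */
            fsync(fd) == 0)                           /* IO #2: fs journal flush */
            r = kv_commit_sync(path, "object metadata...");
        close(fd);
        return r;
    }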
rgw with keystone v3
Hi all, I've created a merge request in github (Rgw keystone v3 #6337) to add integration with keystone auth v3 to ceph rgw. We've tested the best we could, and it does seem to work as it should. There are some notes on the merge request. I don't know what's the usual testing process, but we would like, if possible, to have some binaries built to make further tests - to ensure that we're not suffering from some compiling/copying issues. PS: sorry for any mistakes with the process. This is my first time doing such a thing.
Re: newstore direction
On 10/21/2015 09:32 AM, Sage Weil wrote: On Tue, 20 Oct 2015, Ric Wheeler wrote: Now:
1 io to write a new file
1-2 ios to sync the fs journal (commit the inode, alloc change) (I see 2 journal IOs on XFS and only 1 on ext4...)
1 io to commit the rocksdb journal (currently 3, but will drop to 1 with xfs fix and my rocksdb change)

I think that might be too pessimistic - the number of discrete IO's sent down to a spinning disk makes much less impact on performance than the number of fsync()'s, since the IO's all land in the write cache. Some newer spinning drives have a non-volatile write cache, so even an fsync() might not end up doing the expensive data transfer to the platter.

True, but in XFS's case at least the file data and journal are not colocated, so it's 2 seeks for the new file write+fdatasync and another for the rocksdb journal commit. Of course, with a deep queue, we're doing lots of these so there'd be fewer journal commits on both counts, but the lower bound on latency of a single write is still 3 seeks, and that bound is pretty critical when you also have network round trips and replication (worst out of 2) on top.

What are the performance goals we are looking for? Small, synchronous writes/second? File creates/second? I suspect that looking at things like seeks/write is probably looking at the wrong level of performance challenges. Again, when you write to a modern drive, you write to its write cache and it decides internally when/how to destage to the platter. If you look at the performance of XFS with streaming workloads, it will tend to max out the bandwidth of the underlying storage. If we need IOP's/file writes, etc, we should be clear on what we are aiming at. It would be interesting to get the timings on the IO's you see to measure the actual impact.

I observed this with the journaling workload for rocksdb, but I assume the journaling behavior is the same regardless of what is being journaled. For a 4KB append to a file + fdatasync, I saw ~30ms latency for XFS, and blktrace showed an IO to the file, and 2 IOs to the journal. I believe the first one is the record for the inode update, and the second is the journal 'commit' record (though I forget how I decided that). My guess is that XFS is being extremely careful about journal integrity here and not writing the commit record until it knows that the preceding records landed on stable storage. For ext4, the latency was about ~20ms, and blktrace showed the IO to the file and then a single journal IO. When I made the rocksdb change to overwrite an existing, prewritten file, the latency dropped to ~10ms on ext4, and blktrace showed a single IO as expected. (XFS still showed the 2 journal commit IOs, but Dave just posted the fix for that on the XFS list today.)

Right, if we want to avoid metadata-related IO's, we can preallocate a file and use O_DIRECT. Effectively, there should be no updates outside of the data write itself. This also won't be a performance optimization by itself, but we could avoid redoing allocation and defragmentation again. Normally, best practice is to use batching to avoid paying worst-case latency when you do a synchronous IO. Write a batch of files or appends without fsync, then go back and fsync and you will pay that latency once (not per file/op).

Plumbing for T10 DIF/DIX already exists, what is missing is the normal block device that handles them (not enterprise SAS/disk array class)

Yeah
which unfortunately means that unless the cheap drives suddenly start shipping with DIF/DIX support we'll need to do the checksums ourselves. This is probably a good thing anyway as it doesn't constrain our choice of checksum or checksum granularity, and will still work with other storage devices (ssds, nvme, etc.).

sage

Might be interesting to see if a device mapper target could be written to support DIF/DIX. For what it's worth, XFS developers have talked loosely about looking at data block checksums (could do something like btrfs does, store the checksums in another btree)

ric
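To put numbers on the batching point made earlier in this thread (pay the sync latency once per batch, not per append), something like the small test below can be run on the filesystem in question. It is only a sketch of the measurement; the file name, 4KB record size, and iteration count are arbitrary, and it is not the blktrace-based methodology described above.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    static double now_ms(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
    }

    int main(void)
    {
        char buf[4096];
        memset(buf, 'x', sizeof(buf));
        int fd = open("testfile", O_CREAT | O_WRONLY | O_APPEND, 0644);
        if (fd < 0)
            return 1;

        /* Worst case: pay the journal-commit latency on every 4KB append. */
        double t0 = now_ms();
        for (int i = 0; i < 100; i++) {
            write(fd, buf, sizeof(buf));
            fdatasync(fd);
        }
        printf("per-op fdatasync: %.2f ms/op\n", (now_ms() - t0) / 100.0);

        /* Batched: queue up the appends, then pay the sync latency once. */
        t0 = now_ms();
        for (int i = 0; i < 100; i++)
            write(fd, buf, sizeof(buf));
        fdatasync(fd);
        printf("batched: %.2f ms total for 100 appends\n", now_ms() - t0);

        close(fd);
        return 0;
    }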
[PATCH 6/6] KEYS: Merge the type-specific data with the payload data
Merge the type-specific data with the payload data into one four-word chunk as it seems pointless to keep them separate. Use user_key_payload() for accessing the payloads of overloaded user-defined keys.

Signed-off-by: David Howells
cc: linux-c...@vger.kernel.org
cc: ecryp...@vger.kernel.org
cc: linux-e...@vger.kernel.org
cc: linux-f2fs-de...@lists.sourceforge.net
cc: linux-...@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: linux-ima-de...@lists.sourceforge.net
---
 Documentation/crypto/asymmetric-keys.txt         | 27 +++--
 Documentation/security/keys.txt                  | 41 ---
 crypto/asymmetric_keys/asymmetric_keys.h         |  5 --
 crypto/asymmetric_keys/asymmetric_type.c         | 44 -
 crypto/asymmetric_keys/public_key.c              |  4 +-
 crypto/asymmetric_keys/signature.c               |  2 -
 crypto/asymmetric_keys/x509_parser.h             |  1
 crypto/asymmetric_keys/x509_public_key.c         |  9 ++--
 fs/cifs/cifs_spnego.c                            |  6 +--
 fs/cifs/cifsacl.c                                | 25 ++--
 fs/cifs/connect.c                                |  9 ++--
 fs/cifs/sess.c                                   |  2 -
 fs/cifs/smb2pdu.c                                |  2 -
 fs/ecryptfs/ecryptfs_kernel.h                    |  5 +-
 fs/ext4/crypto_key.c                             |  4 +-
 fs/f2fs/crypto_key.c                             |  4 +-
 fs/fscache/object-list.c                         |  4 +-
 fs/nfs/nfs4idmap.c                               |  4 +-
 include/crypto/public_key.h                      |  1
 include/keys/asymmetric-subtype.h                |  2 -
 include/keys/asymmetric-type.h                   | 15 +++
 include/keys/user-type.h                         |  8
 include/linux/key-type.h                         |  3 -
 include/linux/key.h                              | 33 +++
 kernel/module_signing.c                          |  1
 lib/digsig.c                                     |  7 ++-
 net/ceph/ceph_common.c                           |  2 -
 net/ceph/crypto.c                                |  6 +--
 net/dns_resolver/dns_key.c                       | 20 +
 net/dns_resolver/dns_query.c                     |  7 +--
 net/dns_resolver/internal.h                      |  8
 net/rxrpc/af_rxrpc.c                             |  2 -
 net/rxrpc/ar-key.c                               | 32 +++
 net/rxrpc/ar-output.c                            |  2 -
 net/rxrpc/ar-security.c                          |  4 +-
 net/rxrpc/rxkad.c                                | 16 ---
 security/integrity/evm/evm_crypto.c              |  2 -
 security/keys/big_key.c                          | 47 +++---
 security/keys/encrypted-keys/encrypted.c         | 18
 security/keys/encrypted-keys/encrypted.h         |  4 +-
 security/keys/encrypted-keys/masterkey_trusted.c |  4 +-
 security/keys/key.c                              | 18
 security/keys/keyctl.c                           |  4 +-
 security/keys/keyring.c                          | 12 +++---
 security/keys/process_keys.c                     |  4 +-
 security/keys/request_key.c                      |  4 +-
 security/keys/request_key_auth.c                 | 12 +++---
 security/keys/trusted.c                          |  6 +--
 security/keys/user_defined.c                     | 14 +++
 49 files changed, 286 insertions(+), 230 deletions(-)

diff --git a/Documentation/crypto/asymmetric-keys.txt b/Documentation/crypto/asymmetric-keys.txt
index b7675904a747..8c07e0ea6bc0 100644
--- a/Documentation/crypto/asymmetric-keys.txt
+++ b/Documentation/crypto/asymmetric-keys.txt
@@ -186,7 +186,7 @@ and looks like the following:
 				  const struct public_key_signature *sig);
 	};

-Asymmetric keys point to this with their type_data[0] member.
+Asymmetric keys point to this with their payload[asym_subtype] member.

 The owner and name fields should be set to the owning module and the name of
 the subtype. Currently, the name is only used for print statements.
@@ -269,8 +269,7 @@ mandatory:
 	struct key_preparsed_payload {
 		char		*description;
-		void		*type_data[2];
-		void		*payload;
+		void		*payload[4];
 		const void	*data;
 		size_t		datalen;
 		size_t		quotalen;
@@ -283,16 +282,18 @@ mandatory:
    not theirs.

   If the parser is happy with the blob, it should propose a description for
-   the key and attach it to ->description, ->type_data[0] should be set to
-   point to the subtype to be used, ->payload should be set to point to the
-   initialised data for that
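As a reader's note on the user_key_payload() helper mentioned in the changelog: after this change, callers that previously poked at the key payload pointers directly are expected to go through that accessor. A minimal illustration of the calling convention might look like the following; the surrounding function is made up for illustration and is not taken from the patch.

    #include <keys/user-type.h>
    #include <linux/errno.h>
    #include <linux/key.h>
    #include <linux/string.h>

    /* Illustrative only: copy a user-defined key's payload into buf. */
    static int example_read_user_key(struct key *key, void *buf, size_t buflen)
    {
        const struct user_key_payload *ukp;
        int ret = 0;

        /* user_key_payload() dereferences the payload under RCU rules,
         * so hold the key semaphore across the access. */
        down_read(&key->sem);
        ukp = user_key_payload(key);
        if (!ukp || ukp->datalen > buflen)
            ret = -EINVAL;
        else
            memcpy(buf, ukp->data, ukp->datalen);
        up_read(&key->sem);
        return ret;
    }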
Re: [PATCH] mark rbd requiring stable pages
On 10/21/2015 03:57 PM, Ilya Dryomov wrote: > On Wed, Oct 21, 2015 at 10:51 PM, Ilya Dryomov wrote: >> On Fri, Oct 16, 2015 at 1:09 PM, Ilya Dryomov wrote: >>> Hmm... On the one hand, yes, we do compute CRCs, but that's optional, >>> so enabling this unconditionally is probably too harsh. OTOH we are >>> talking to the network, which means all sorts of delays, retransmission >>> issues, etc, so I wonder how exactly "unstable" pages behave when, say, >>> added to an skb - you can't write anything to a page until networking >>> is fully done with it and expect it to work. It's particularly >>> alarming that you've seen corruptions. >>> >>> Currently the only users of this flag are block integrity stuff and >>> md-raid5, which makes me wonder what iscsi, nfs and others do in this >>> area. There's an old ticket on this topic somewhere on the tracker, so >>> I'll need to research this. Thanks for bringing this up! >> >> Hi Mike, >> >> I was hoping to grab you for a few minutes, but you weren't there... >> >> I spent a better part of today reading code and mailing lists on this >> topic. It is of course a bug that we use sendpage() which inlines >> pages into an skb and do nothing to keep those pages stable. We have >> csums enabled by default, so setting BDI_CAP_STABLE_WRITES in the crc >> case is an obvious fix. >> >> I looked at drbd and iscsi and I think iscsi could do the same - ditch >> the fallback to sock_no_sendpage() in the datadgst_en case (and get rid >> of iscsi_sw_tcp_conn::sendpage member while at it). Using stable pages >> rather than having a roll-your-own implementation which doesn't close >> the race but only narrows it sounds like a win, unless copying through >> sendmsg() is for some reason cheaper than stable-waiting?

Yeah, that is what I was saying on the call the other day, but the reception was bad. We only have the sendmsg code path when digests are on because that code came before stable pages. When stable pages were created, it was on by default but did not cover all the cases, so we left the code. It then handled most scenarios, but I just never got around to removing the old code. However, it was set to off by default so I left it and made this patch for iscsi to turn on stable pages: [this patch only enabled stable pages when digests/crcs are on and did not remove the code yet] https://groups.google.com/forum/#!topic/open-iscsi/n4jvWK7BPYM I did not really like the layering so I have not posted it for inclusion.

>> >> drbd still needs the non-zero-copy version for its async protocol for >> when they free the pages before the NIC has a chance to put them on the >> wire. md-raid5 it turns out has an option to essentially disable most >> of its stripe cache and so it sets BDI_CAP_STABLE_WRITES to compensate >> if that option is enabled. >> >> What I'm worried about is the !crc (!datadgst_en) case. I'm failing to >> convince myself that mucking with sendpage()ed pages while they sit in >> the TCP queue (or anywhere in the networking stack, really), is safe - >> there is nothing to prevent pages from being modified after sendpage() >> returned and Ronny reports data corruptions that pretty much went away >> with BDI_CAP_STABLE_WRITES set. I may be, after prolonged staring at >> this, starting to confuse fs with block, though. How does that work in >> iscsi land?

This is what I was trying to ask about in the call the other day. Where is the corruption that Ronny was seeing? Was it checksum mismatches on data being written, or is incorrect metadata being written, etc?
If we are just talking about if stable pages are not used, and someone is re-writing data to a page after the page has already been submitted to the block layer (I mean the page is on some bio which is on a request which is on some request_queue scheduler list or basically anywhere in the block layer), then I was saying this can occur with any block driver. There is nothing that is preventing this from happening with a FC driver or nvme or cciss or in dm or whatever. The app/user can rewrite as late as when we are in the make_request_fn/request_fn. I think I am misunderstanding your question because I thought this is expected behavior, and there is nothing drivers can do if the app is not doing a flush/sync between these types of write sequences. >> >> (There was/is also this [1] bug, which is kind of related and probably >> worth looking into at some point later. ceph shouldn't be that easily >> affected - we've got state, but there is a ticket for it.) >> >> [1] http://www.spinics.net/lists/linux-nfs/msg34913.html > > And now with Mike on the CC and a mention that at least one scenario of > [1] got fixed in NFS by a6b31d18b02f ("SUNRPC: Fix a data corruption > issue when retransmitting RPC calls"). > iSCSI handles timeouts/retries and sequence numbers/responses differently so we are not affected. We go through some abort and possibly reconnect process.
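For anyone not familiar with the flag being discussed in this thread: the proposed fix amounts to the driver advertising BDI_CAP_STABLE_WRITES on its backing device, so the writeback path makes callers wait for writeback to finish before a page can be modified again. A rough sketch of what that looks like for a 4.x-era block driver is below; the function and the have_crc parameter are placeholders for illustration, not the actual rbd patch.

    #include <linux/backing-dev.h>
    #include <linux/blkdev.h>

    /*
     * Illustrative sketch only (4.x-era API, where backing_dev_info is still
     * embedded in the request_queue): request stable pages when data CRCs are
     * enabled, so pages cannot change while they sit in the networking stack.
     */
    static void example_set_stable_writes(struct request_queue *q, bool have_crc)
    {
        if (have_crc)
            q->backing_dev_info.capabilities |= BDI_CAP_STABLE_WRITES;
    }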
RE: newstore direction
We did evaluate whether NVMKV could be implemented on non-FusionIO SSDs, i.e. re-inventing an NVMKV; the conclusion was that it's not hard with persistent memory (which will be available soon). But yeah, NVMKV will not work if no PM is present---persisting the hashing table to SSD is not practicable. Range query doesn't seem to be a very big issue, as the random read performance of today's SSDs is more than enough; I mean, even if we break all sequential access into random (typically 70-80K IOPS, which is ~300MB/s), the performance is still good enough. Anyway, I think for the high-IOPS case it's hard for the consumer to play well on SSDs from different vendors; it would be better to leave it to the SSD vendor, something like OpenStack Cinder's structure: a vendor has the responsibility to maintain their driver for ceph and take care of the performance. > -Original Message- > From: Mark Nelson [mailto:mnel...@redhat.com] > Sent: Wednesday, October 21, 2015 9:36 PM > To: Allen Samuels; Sage Weil; Chen, Xiaoxi > Cc: James (Fei) Liu-SSI; Somnath Roy; ceph-devel@vger.kernel.org > Subject: Re: newstore direction > > Thanks Allen! The devil is always in the details. Know of anything else that > looks promising? > > Mark > > On 10/21/2015 05:06 AM, Allen Samuels wrote: > > I doubt that NVMKV will be useful for two reasons: > > > > (1) It relies on the unique sparse-mapping addressing capabilities of > > the FusionIO VSL interface, it won't run on standard SSDs > > (2) NVMKV doesn't provide any form of in-order enumeration (i.e., no > range operations on keys). This is pretty much required for deep scrubbing. > > > > > > Allen Samuels > > Software Architect, Fellow, Systems and Software Solutions > > > > 2880 Junction Avenue, San Jose, CA 95134 > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com > > > > -Original Message- > > From: ceph-devel-ow...@vger.kernel.org > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson > > Sent: Tuesday, October 20, 2015 6:20 AM > > To: Sage Weil; Chen, Xiaoxi > > Cc: James (Fei) Liu-SSI ; Somnath Roy > > ; ceph-devel@vger.kernel.org > > Subject: Re: newstore direction > > > > On 10/20/2015 07:30 AM, Sage Weil wrote: > >> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote: > >>> +1, nowadays K-V DB care more about very small key-value pairs, say > >>> several bytes to a few KB, but in SSD case we only care about 4KB or > >>> 8KB. In this way, NVMKV is a good design and seems some of the SSD > >>> vendor are also trying to build this kind of interface, we had a > >>> NVM-L library but still under development. > >> > >> Do you have an NVMKV link? I see a paper and a stale github repo.. > >> not sure if I'm looking at the right thing. > >> > >> My concern with using a key/value interface for the object data is > >> that you end up with lots of key/value pairs (e.g., $inode_$offset = > >> $4kb_of_data) that is pretty inefficient to store and (depending on > >> the > >> implementation) tends to break alignment. I don't think these > >> interfaces are targeted toward block-sized/aligned payloads. > >> Storing just the metadata (block allocation map) w/ the kv api and > >> storing the data directly on a block/page interface makes more sense to > me. > >> > >> sage > > > > I get the feeling that some of the folks that were involved with nvmkv at > Fusion IO have left. Nisha Talagala is now out at Parallel Systems for > instance. > http://pmem.io might be a better bet, though I haven't looked closely at it.
> > > > Mark > > > >> > >> > -Original Message- > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI > Sent: Tuesday, October 20, 2015 6:21 AM > To: Sage Weil; Somnath Roy > Cc: ceph-devel@vger.kernel.org > Subject: RE: newstore direction > > Hi Sage and Somnath, > In my humble opinion, There is another more aggressive > solution than raw block device base keyvalue store as backend for > objectstore. The new key value SSD device with transaction support > would be ideal to solve the issues. > First of all, it is raw SSD device. Secondly , It provides key > value interface directly from SSD. Thirdly, it can provide > transaction support, consistency will be guaranteed by hardware > device. It pretty much satisfied all of objectstore needs without > any extra overhead since there is not any extra layer in between device > and objectstore. > Either way, I strongly support to have CEPH own data format > instead of relying on filesystem. > > Regards, > James > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > ow...@vger.kernel.org] On Behalf Of Sage Weil > Sent: Monday, October 19, 2015 1:55 PM >
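To illustrate the split Sage describes in the quoted text above (the block allocation map in the KV store, bulk data on a raw block/page interface), the per-object KV value could be little more than an extent list, along the lines of the sketch below. This is purely illustrative: the struct layout, the fixed extent cap, and the encode helper are made up and are not NewStore code.

    #include <stdint.h>
    #include <string.h>

    /* One contiguous run of bytes on the raw block device. */
    struct extent {
        uint64_t offset;   /* byte offset on the device */
        uint64_t length;   /* length in bytes           */
    };

    /* The per-object value kept under a KV key such as "object/<oid>":
     * just the extent list (attrs, checksums, etc. could be added later). */
    struct onode {
        uint32_t num_extents;
        struct extent extents[16];   /* illustrative fixed cap */
    };

    /* Flat, host-order encode of the onode into a KV value buffer. */
    static size_t onode_encode(const struct onode *o, void *buf, size_t buflen)
    {
        size_t need = sizeof(o->num_extents) +
                      (size_t)o->num_extents * sizeof(struct extent);
        if (need > buflen)
            return 0;
        memcpy(buf, &o->num_extents, sizeof(o->num_extents));
        memcpy((char *)buf + sizeof(o->num_extents), o->extents,
               (size_t)o->num_extents * sizeof(struct extent));
        return need;
    }

On read, a single KV lookup of this extent list followed by direct block reads replaces the filesystem namespace traversal plus the fs's own extent lookup, which is the double-lookup cost discussed earlier in the thread.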
Re: newstore direction
On 10/21/2015 10:14 AM, Mark Nelson wrote: On 10/21/2015 06:24 AM, Ric Wheeler wrote: On 10/21/2015 06:06 AM, Allen Samuels wrote: I agree that moving newStore to raw block is going to be a significant development effort. But the current scheme of using a KV store combined with a normal file system is always going to be problematic (FileStore or NewStore). This is caused by the transactional requirements of the ObjectStore interface, essentially you need to make transactionally consistent updates to two indexes, one of which doesn't understand transactions (File Systems) and can never be tightly-connected to the other one. You'll always be able to make this "loosely coupled" approach work, but it will never be optimal. The real question is whether the performance difference of a suboptimal implementation is something that you can live with compared to the longer gestation period of the more optimal implementation. Clearly, Sage believes that the performance difference is significant or he wouldn't have kicked off this discussion in the first place. I think that we need to work with the existing stack - measure and do some collaborative analysis - before we throw out decades of work. Very hard to understand why the local file system is a barrier for performance in this case when it is not an issue in existing enterprise applications. We need some deep analysis with some local file system experts thrown in to validate the concerns. I think Sage has been working pretty closely with the XFS guys to uncover these kinds of issues. I know if I encounter something fairly FS specific I try to drag Eric or Dave in. I think the core of the problem is that we often find ourselves exercising filesystems in pretty unusual ways. While it's probably good that we add this kind of coverage and help work out somewhat esoteric bugs, I think it does make our job of making Ceph perform well harder. One example: I had been telling folks for several years to favor dentry and inode cache due to the way our PG directory splitting works (backed by test results), but then Sage discovered: http://www.spinics.net/lists/ceph-devel/msg25644.html This is just one example of how very nuanced our performance story is. I can keep many users at least semi-engaged when talking about objects being laid out in a nested directory structure, how dentry/inode cache affects that in a general sense, etc. But combine the kind of subtlety in the link above with the vastness of things in the data path that can hurt performance, and people generally just can't wrap their heads around all of it (With the exception of some of the very smart folks on this mailing list!) One of my biggest concerns going forward is reducing the user-facing complexity of our performance story. The question I ask myself is: Does keeping Ceph on a FS help us or hurt us in that regard? The upshot of that is that the kind of micro-optimization is already handled by the file system, so the application job should be easier. Better to fsync() each file from an application that you care about rather than to worry about using more obscure calls. While I think we can all agree that writing a full-up KV and raw-block ObjectStore is a significant amount of work. I will offer the case that the "loosely couple" scheme may not have as much time-to-market advantage as it appears to have. 
One example: NewStore performance is limited due to bugs in XFS that won't be fixed in the field for quite some time (it'll take at least a couple of years before a patched version of XFS will be widely deployed at customer environments). Not clear what bugs you are thinking of or why you think fixing bugs will take a long time to hit the field in XFS. Red Hat has most of the XFS developers on staff and we actively backport fixes and ship them, other distros do as well. Never seen a "bug" take a couple of years to hit users. Maybe a good way to start out would be to see how quickly we can get the patch dchinner posted here: http://oss.sgi.com/archives/xfs/2015-10/msg00545.html rolled out into RHEL/CentOS/Ubuntu. I have no idea how long these things typically take, but this might be a good test case. How quickly things land in a distro is up to the interested parties making the case for it. Ric Regards, Ric Another example: Sage has just had to substantially rework the journaling code of rocksDB. In short, as you can tell, I'm full throated in favor of going down the optimal route. Internally at Sandisk, we have a KV store that is optimized for flash (it's called ZetaScale). We have extended it with a raw block allocator just as Sage is now proposing to do. Our internal performance measurements show a significant advantage over the current NewStore. That performance advantage stems primarily from two things: (1) ZetaScale uses a B+-tree internally rather than an LSM tree (levelDB/RocksDB). LSM trees experience exponential
Re: newstore direction
Thanks Allen! The devil is always in the details. Know of anything else that looks promising? Mark On 10/21/2015 05:06 AM, Allen Samuels wrote: I doubt that NVMKV will be useful for two reasons: (1) It relies on the unique sparse-mapping addressing capabilities of the FusionIO VSL interface, it won't run on standard SSDs (2) NVMKV doesn't provide any form of in-order enumeration (i.e., no range operations on keys). This is pretty much required for deep scrubbing. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson Sent: Tuesday, October 20, 2015 6:20 AM To: Sage Weil; Chen, Xiaoxi Cc: James (Fei) Liu-SSI ; Somnath Roy ; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/20/2015 07:30 AM, Sage Weil wrote: On Tue, 20 Oct 2015, Chen, Xiaoxi wrote: +1, nowadays K-V DB care more about very small key-value pairs, say several bytes to a few KB, but in SSD case we only care about 4KB or 8KB. In this way, NVMKV is a good design and seems some of the SSD vendor are also trying to build this kind of interface, we had a NVM-L library but still under development. Do you have an NVMKV link? I see a paper and a stale github repo.. not sure if I'm looking at the right thing. My concern with using a key/value interface for the object data is that you end up with lots of key/value pairs (e.g., $inode_$offset = $4kb_of_data) that is pretty inefficient to store and (depending on the implementation) tends to break alignment. I don't think these interfaces are targetted toward block-sized/aligned payloads. Storing just the metadata (block allocation map) w/ the kv api and storing the data directly on a block/page interface makes more sense to me. sage I get the feeling that some of the folks that were involved with nvmkv at Fusion IO have left. Nisha Talagala is now out at Parallel Systems for instance. http://pmem.io might be a better bet, though I haven't looked closely at it. Mark -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI Sent: Tuesday, October 20, 2015 6:21 AM To: Sage Weil; Somnath Roy Cc: ceph-devel@vger.kernel.org Subject: RE: newstore direction Hi Sage and Somnath, In my humble opinion, There is another more aggressive solution than raw block device base keyvalue store as backend for objectstore. The new key value SSD device with transaction support would be ideal to solve the issues. First of all, it is raw SSD device. Secondly , It provides key value interface directly from SSD. Thirdly, it can provide transaction support, consistency will be guaranteed by hardware device. It pretty much satisfied all of objectstore needs without any extra overhead since there is not any extra layer in between device and objectstore. Either way, I strongly support to have CEPH own data format instead of relying on filesystem. Regards, James -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Monday, October 19, 2015 1:55 PM To: Somnath Roy Cc: ceph-devel@vger.kernel.org Subject: RE: newstore direction On Mon, 19 Oct 2015, Somnath Roy wrote: Sage, I fully support that. If we want to saturate SSDs , we need to get rid of this filesystem overhead (which I am in process of measuring). 
Also, it will be good if we can eliminate the dependency on the k/v dbs (for storing allocators and all). The reason is the unknown write amps they cause.

My hope is to keep behind the KeyValueDB interface (and/or change it as appropriate) so that other backends can be easily swapped in (e.g. a btree-based one for high-end flash).

sage

Thanks & Regards Somnath

-Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Monday, October 19, 2015 12:49 PM To: ceph-devel@vger.kernel.org Subject: newstore direction

The current design is based on two simple ideas: 1) a key/value interface is a better way to manage all of our internal metadata (object metadata, attrs, layout, collection membership, write-ahead logging, overlay data, etc.) 2) a file system is well suited for storing object data (as files). So far #1 is working out well, but I'm questioning the wisdom of #2. A few things: - We currently write the data to the file, fsync, then commit the kv transaction. That's at least 3 IOs: one for the data, one for the fs journal, one for the kv txn to commit (at least once my rocksdb changes land... the kv commit is currently 2-3). So two people are
Re: newstore direction
On 10/21/2015 06:24 AM, Ric Wheeler wrote: On 10/21/2015 06:06 AM, Allen Samuels wrote: I agree that moving newStore to raw block is going to be a significant development effort. But the current scheme of using a KV store combined with a normal file system is always going to be problematic (FileStore or NewStore). This is caused by the transactional requirements of the ObjectStore interface, essentially you need to make transactionally consistent updates to two indexes, one of which doesn't understand transactions (File Systems) and can never be tightly-connected to the other one. You'll always be able to make this "loosely coupled" approach work, but it will never be optimal. The real question is whether the performance difference of a suboptimal implementation is something that you can live with compared to the longer gestation period of the more optimal implementation. Clearly, Sage believes that the performance difference is significant or he wouldn't have kicked off this discussion in the first place. I think that we need to work with the existing stack - measure and do some collaborative analysis - before we throw out decades of work. Very hard to understand why the local file system is a barrier for performance in this case when it is not an issue in existing enterprise applications. We need some deep analysis with some local file system experts thrown in to validate the concerns. I think Sage has been working pretty closely with the XFS guys to uncover these kinds of issues. I know if I encounter something fairly FS specific I try to drag Eric or Dave in. I think the core of the problem is that we often find ourselves exercising filesystems in pretty unusual ways. While it's probably good that we add this kind of coverage and help work out somewhat esoteric bugs, I think it does make our job of making Ceph perform well harder. One example: I had been telling folks for several years to favor dentry and inode cache due to the way our PG directory splitting works (backed by test results), but then Sage discovered: http://www.spinics.net/lists/ceph-devel/msg25644.html This is just one example of how very nuanced our performance story is. I can keep many users at least semi-engaged when talking about objects being laid out in a nested directory structure, how dentry/inode cache affects that in a general sense, etc. But combine the kind of subtlety in the link above with the vastness of things in the data path that can hurt performance, and people generally just can't wrap their heads around all of it (With the exception of some of the very smart folks on this mailing list!) One of my biggest concerns going forward is reducing the user-facing complexity of our performance story. The question I ask myself is: Does keeping Ceph on a FS help us or hurt us in that regard? While I think we can all agree that writing a full-up KV and raw-block ObjectStore is a significant amount of work. I will offer the case that the "loosely couple" scheme may not have as much time-to-market advantage as it appears to have. One example: NewStore performance is limited due to bugs in XFS that won't be fixed in the field for quite some time (it'll take at least a couple of years before a patched version of XFS will be widely deployed at customer environments). Not clear what bugs you are thinking of or why you think fixing bugs will take a long time to hit the field in XFS. Red Hat has most of the XFS developers on staff and we actively backport fixes and ship them, other distros do as well. 
Never seen a "bug" take a couple of years to hit users. Maybe a good way to start out would be to see how quickly we can get the patch dchinner posted here: http://oss.sgi.com/archives/xfs/2015-10/msg00545.html rolled out into RHEL/CentOS/Ubuntu. I have no idea how long these things typically take, but this might be a good test case. Regards, Ric Another example: Sage has just had to substantially rework the journaling code of rocksDB. In short, as you can tell, I'm full throated in favor of going down the optimal route. Internally at Sandisk, we have a KV store that is optimized for flash (it's called ZetaScale). We have extended it with a raw block allocator just as Sage is now proposing to do. Our internal performance measurements show a significant advantage over the current NewStore. That performance advantage stems primarily from two things: (1) ZetaScale uses a B+-tree internally rather than an LSM tree (levelDB/RocksDB). LSM trees experience exponential increase in write amplification (cost of an insert) as the amount of data under management increases. B+tree write-amplification is nearly constant independent of the size of data under management. As the KV database gets larger (Since newStore is effectively moving the per-file inode into the kv data base. Don't forget checksums that Sage want's to add :)) this performance delta swamps all others. (2) Having a
Re: newstore direction
On Wed, 21 Oct 2015, Ric Wheeler wrote: > On 10/21/2015 04:22 AM, Orit Wasserman wrote: > > On Tue, 2015-10-20 at 14:31 -0400, Ric Wheeler wrote: > > > On 10/19/2015 03:49 PM, Sage Weil wrote: > > > > The current design is based on two simple ideas: > > > > > > > >1) a key/value interface is better way to manage all of our internal > > > > metadata (object metadata, attrs, layout, collection membership, > > > > write-ahead logging, overlay data, etc.) > > > > > > > >2) a file system is well suited for storage object data (as files). > > > > > > > > So far 1 is working out well, but I'm questioning the wisdom of #2. A > > > > few > > > > things: > > > > > > > >- We currently write the data to the file, fsync, then commit the kv > > > > transaction. That's at least 3 IOs: one for the data, one for the fs > > > > journal, one for the kv txn to commit (at least once my rocksdb changes > > > > land... the kv commit is currently 2-3). So two people are managing > > > > metadata, here: the fs managing the file metadata (with its own > > > > journal) and the kv backend (with its journal). > > > If all of the fsync()'s fall into the same backing file system, are you > > > sure > > > that each fsync() takes the same time? Depending on the local FS > > > implementation > > > of course, but the order of issuing those fsync()'s can effectively make > > > some of > > > them no-ops. > > > > > > >- On read we have to open files by name, which means traversing the > > > > fs > > > > namespace. Newstore tries to keep it as flat and simple as possible, > > > > but > > > > at a minimum it is a couple btree lookups. We'd love to use open by > > > > handle (which would reduce this to 1 btree traversal), but running > > > > the daemon as ceph and not root makes that hard... > > > This seems like a a pretty low hurdle to overcome. > > > > > > >- ...and file systems insist on updating mtime on writes, even when > > > > it is > > > > a overwrite with no allocation changes. (We don't care about mtime.) > > > > O_NOCMTIME patches exist but it is hard to get these past the kernel > > > > brainfreeze. > > > Are you using O_DIRECT? Seems like there should be some enterprisey > > > database > > > tricks that we can use here. > > > > > > >- XFS is (probably) never going going to give us data checksums, > > > > which we > > > > want desperately. > > > What is the goal of having the file system do the checksums? How strong do > > > they > > > need to be and what size are the chunks? > > > > > > If you update this on each IO, this will certainly generate more IO (each > > > write > > > will possibly generate at least one other write to update that new > > > checksum). > > > > > > > But what's the alternative? My thought is to just bite the bullet and > > > > consume a raw block device directly. Write an allocator, hopefully keep > > > > it pretty simple, and manage it in kv store along with all of our other > > > > metadata. > > > The big problem with consuming block devices directly is that you > > > ultimately end > > > up recreating most of the features that you had in the file system. Even > > > enterprise databases like Oracle and DB2 have been migrating away from > > > running > > > on raw block devices in favor of file systems over time. In effect, you > > > are > > > looking at making a simple on disk file system which is always easier to > > > start > > > than it is to get back to a stable, production ready state. > > The best performance is still on block device (SAN). 
> > File systems simplify the operation tasks, which is worth the performance > > penalty for a database. I think in a storage system this is not the > > case. > > In many cases they can use their own file system that is tailored for > > the database. > > You will have to trust me on this as the Red Hat person who spoke to pretty > much all of our key customers about local file systems and storage - customers > all have migrated over to using normal file systems under Oracle/DB2. > Typically, they use XFS or ext4. I don't know of any non-standard file > systems and only have seen one account running on a raw block store in 8 years > :) > > If you have a pre-allocated file and write using O_DIRECT, your IO path is > identical in terms of IO's sent to the device.

...except it's not. Preallocating the file gives you contiguous space, but you still have to mark the extent written (not zero/prealloc). The only way to get an identical IO pattern is to *pre-write* zeros (or whatever) to the file... which is hours on modern HDDs. Ted asked for a way to force prealloc to expose preexisting disk bits a couple years back at LSF and it was shot down for security reasons (and rightly so, IMO). If you're going down this path, you already have a "file system" in user space sitting on top of the preallocated file, and you could just as easily use the block device directly. If you're not, then you're writing smaller files (e.g.,
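For reference, the pattern being debated above looks roughly like the following on Linux. Note, per the point just made, that fallocate() leaves the extents flagged as unwritten, so the first write to each region still forces an extent-state update in the fs journal unless the file has been pre-written. The file name and sizes are arbitrary.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Preallocate 1 GB and write with O_DIRECT, as in the discussion above. */
        int fd = open("prealloc.dat", O_CREAT | O_WRONLY | O_DIRECT, 0644);
        if (fd < 0)
            return 1;
        if (fallocate(fd, 0, 0, 1ULL << 30) < 0)   /* extents allocated, but unwritten */
            return 1;

        /* O_DIRECT needs aligned buffers, sizes, and offsets. */
        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0)
            return 1;
        memset(buf, 0, 4096);

        /* The data IO itself is a single device write, but converting the
         * unwritten extent to written still dirties fs metadata the first time. */
        if (pwrite(fd, buf, 4096, 0) != 4096)
            return 1;
        if (fdatasync(fd) < 0)
            return 1;

        close(fd);
        free(buf);
        return 0;
    }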
Performance meeting next week?
Hi Mark, In light of OpenStack Summit, will we still have a meeting next week? -Paul
Re: MDS stuck in a crash loop
On Mon, Oct 19, 2015 at 8:31 AM, Milosz Tanski wrote: > On Wed, Oct 14, 2015 at 12:46 AM, Gregory Farnum wrote: >> On Sun, Oct 11, 2015 at 7:36 PM, Milosz Tanski wrote: >>> On Sun, Oct 11, 2015 at 6:44 PM, Milosz Tanski wrote: On Sun, Oct 11, 2015 at 6:01 PM, Milosz Tanski wrote: > On Sun, Oct 11, 2015 at 5:33 PM, Milosz Tanski wrote: >> On Sun, Oct 11, 2015 at 5:24 PM, Milosz Tanski wrote: >>> On Sun, Oct 11, 2015 at 1:16 PM, Gregory Farnum >>> wrote: On Sun, Oct 11, 2015 at 10:09 AM, Milosz Tanski wrote: > About an hour ago my MDSs (primary and follower) started ping-pong > crashing with this message. I've spent about 30 minutes looking into > it but nothing yet. > > This is from a 0.94.3 MDS > > 0> 2015-10-11 17:01:23.596008 7fd4f52ad700 -1 mds/SessionMap.cc: > In function 'virtual void C_IO_SM_Save::finish(int)' thread > 7fd4f52ad700 time 2015-10-11 17:01:23.594089 > mds/SessionMap.cc: 120: FAILED assert(r == 0) These "r == 0" asserts pretty much always mean that the MDS did a read or write to RADOS (the OSDs) and got an error of some kind back. (Or in the case of the OSDs, access to the local filesystem returned an error, etc.) I don't think these writes include any safety checks which would let the MDS break it which means that probably the OSD is actually returning an error — odd, but not impossible. Notice that the assert happened in thread 7fd4f52ad700, and look for the stuff in that thread. You should be able to find an OSD op reply (on the SessionMap object) coming in and reporting an error code. -Greg >>> >>> I only see two error ops in that whole MDS session. Neither one happened >>> on the same thread (7f5ab6000700 in this file). But it looks like the >>> only session map is the -90 "Message too long" one. >>> >>> mtanski@tiny:~$ cat single_crash.log | grep 'osd_op_reply' | grep -v >>> 'ondisk = 0' >>> -3946> 2015-10-11 20:51:11.013965 7f5ab20f2700 1 -- >>> 10.0.5.31:6802/27121 <== osd.25 10.0.5.57:6804/32341 6163 >>> osd_op_reply(46349 mds0_sessionmap [writefull 0~95168363] v0'0 uv0 >>> ondisk = -90 ((90) Message too long)) v6 182+0+0 (2955408122 0 0) >>> 0x3a55d340 con 0x3d5a3c0 >>> -705> 2015-10-11 20:51:11.374132 7f5ab22f4700 1 -- >>> 10.0.5.31:6802/27121 <== osd.28 10.0.5.50:6801/1787 5297 >>> osd_op_reply(48004 300.e274 [delete] v0'0 uv1349638 ondisk = -2 >>> ((2) No such file or directory)) v6 179+0+0 (1182549251 0 0) >>> 0x66c5c80 con 0x3d5a7e0 >>> >>> Any idea what this could be Greg? >> >> To follow this up I found this ticket from 9 months ago: >> http://tracker.ceph.com/issues/10449 In there Yan says: >> >> "it's a kernel bug. hang request prevents mds from trimming >> completed_requests in sessionmap. there is nothing to do with mds. >> (maybe we should add some code to MDS to show warning when this bug >> happens)" >> >> When I was debugging this I saw an OSD (not cephfs client) operation >> stuck for a long time along with the MDS error: >> >> HEALTH_WARN 1 requests are blocked > 32 sec; 1 osds have slow >> requests; mds cluster is degraded; mds0: Behind on trimming (709/30) >> 1 ops are blocked > 16777.2 sec >> 1 ops are blocked > 16777.2 sec on osd.28 >> >> I did eventually bounce the OSD in question and it hasn't become stuck >> since, but the MDS is still eating it every time with the "Message too >> long" error on the session map. >> >> I'm not quite sure where to go from here. > > First time I had a chance to use the new recover tools. I was able to > replay the journal, reset it and then reset the sessionmap.
MDS returned back to life and so far everything looks good. Yay. > > Triggering this bug/issue is a pretty interesting set of steps.

Spoke too soon, a missing dir is now causing MDS to restart itself.

-6> 2015-10-11 22:40:47.300169 7f580c7b9700 5 -- op tracker -- seq: 4, time: 2015-10-11 22:40:47.300168, event: finishing request, op: client_request(client.3597476:21480382 rmdir #100015e0be2/58 2015-10-11 21:34:49.224905 RETRY=36)
-5> 2015-10-11 22:40:47.300208 7f580c7b9700 5 -- op tracker -- seq: 4, time: 2015-10-11 22:40:47.300208, event: cleaned up request, op: client_request(client.3597476:21480382 rmdir #100015e0be2/58 2015-10-11 21:34:49.224905 RETRY=36)
-4> 2015-10-11 22:40:47.300231 7f580c7b9700 5 -- op tracker -- seq: 4, time: 2015-10-11
librbd regression with Hammer v0.94.4 -- use caution!
There is a regression in librbd in v0.94.4 that can cause VMs to crash. For now, please refrain from upgrading hypervisor nodes or other librbd users to v0.94.4. http://tracker.ceph.com/issues/13559 The problem does not affect server-side daemons (ceph-mon, ceph-osd, etc.). Jason's identified the bug and has a fix prepared, but it'll probably take a few days before we have v0.94.5 out. https://github.com/ceph/ceph/commit/4692c330bd992a06b97b5b8975ab71952b22477a Thanks! sage
Re: newstore direction
On 10/21/2015 10:51 AM, Ric Wheeler wrote: On 10/21/2015 10:14 AM, Mark Nelson wrote: On 10/21/2015 06:24 AM, Ric Wheeler wrote: On 10/21/2015 06:06 AM, Allen Samuels wrote: I agree that moving newStore to raw block is going to be a significant development effort. But the current scheme of using a KV store combined with a normal file system is always going to be problematic (FileStore or NewStore). This is caused by the transactional requirements of the ObjectStore interface, essentially you need to make transactionally consistent updates to two indexes, one of which doesn't understand transactions (File Systems) and can never be tightly-connected to the other one. You'll always be able to make this "loosely coupled" approach work, but it will never be optimal. The real question is whether the performance difference of a suboptimal implementation is something that you can live with compared to the longer gestation period of the more optimal implementation. Clearly, Sage believes that the performance difference is significant or he wouldn't have kicked off this discussion in the first place. I think that we need to work with the existing stack - measure and do some collaborative analysis - before we throw out decades of work. Very hard to understand why the local file system is a barrier for performance in this case when it is not an issue in existing enterprise applications. We need some deep analysis with some local file system experts thrown in to validate the concerns. I think Sage has been working pretty closely with the XFS guys to uncover these kinds of issues. I know if I encounter something fairly FS specific I try to drag Eric or Dave in. I think the core of the problem is that we often find ourselves exercising filesystems in pretty unusual ways. While it's probably good that we add this kind of coverage and help work out somewhat esoteric bugs, I think it does make our job of making Ceph perform well harder. One example: I had been telling folks for several years to favor dentry and inode cache due to the way our PG directory splitting works (backed by test results), but then Sage discovered: http://www.spinics.net/lists/ceph-devel/msg25644.html This is just one example of how very nuanced our performance story is. I can keep many users at least semi-engaged when talking about objects being laid out in a nested directory structure, how dentry/inode cache affects that in a general sense, etc. But combine the kind of subtlety in the link above with the vastness of things in the data path that can hurt performance, and people generally just can't wrap their heads around all of it (With the exception of some of the very smart folks on this mailing list!) One of my biggest concerns going forward is reducing the user-facing complexity of our performance story. The question I ask myself is: Does keeping Ceph on a FS help us or hurt us in that regard? The upshot of that is that the kind of micro-optimization is already handled by the file system, so the application job should be easier. Better to fsync() each file from an application that you care about rather than to worry about using more obscure calls. I hear you, and I don't want to discount the massive amount of work and experience that has gone into making XFS and the other filesystems as amazing as they are. I think Sage's argument that the fit isn't right has merit though. There's a lot of things that we end up working around. Take last winter when we ended up pushing past the 254byte inline xattr boundary. 
We absolutely want to keep xattrs inlined so the idea now is we break large ones down into smaller chunks to try to work around the limitation while continuing to employ a 2K inode size. (which from my conversations with Ben sounds like it's a little controversial in its own right) All of this by itself is fairly inconsequential, but you add enough of this kind of thing up and it's tough not to feel like we're trying to pound a square peg into a round hole.

While I think we can all agree that writing a full-up KV and raw-block ObjectStore is a significant amount of work, I will offer the case that the "loosely coupled" scheme may not have as much time-to-market advantage as it appears to have. One example: NewStore performance is limited due to bugs in XFS that won't be fixed in the field for quite some time (it'll take at least a couple of years before a patched version of XFS will be widely deployed at customer environments).

Not clear what bugs you are thinking of or why you think fixing bugs will take a long time to hit the field in XFS. Red Hat has most of the XFS developers on staff and we actively backport fixes and ship them, other distros do as well. Never seen a "bug" take a couple of years to hit users.

Maybe a good way to start out would be to see how quickly we can get the patch dchinner posted here: http://oss.sgi.com/archives/xfs/2015-10/msg00545.html rolled out
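For readers who haven't seen the xattr chunking mentioned at the top of this message, the general idea looks like the rough user-space sketch below: split a value that would blow past the inline-xattr budget into fixed-size pieces stored under suffixed names. The 2048-byte chunk size and the "@N" naming scheme are illustrative assumptions, not the actual NewStore/FileStore implementation.

    #include <stdio.h>
    #include <string.h>
    #include <sys/xattr.h>

    /*
     * Illustrative only: store a large xattr as user.name, user.name@1,
     * user.name@2, ... so each piece stays within the inline-xattr budget.
     */
    static int set_chunked_xattr(const char *path, const char *name,
                                 const void *val, size_t len)
    {
        const size_t chunk = 2048;           /* assumed per-chunk budget */
        const char *p = val;
        char key[256];
        size_t off = 0;

        for (int i = 0; off < len; i++, off += chunk) {
            size_t n = (len - off < chunk) ? (len - off) : chunk;
            if (i == 0)
                snprintf(key, sizeof(key), "user.%s", name);
            else
                snprintf(key, sizeof(key), "user.%s@%d", name, i);
            if (setxattr(path, key, p + off, n, 0) < 0)
                return -1;
        }
        return 0;
    }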
Re: [PATCH] mark rbd requiring stable pages
On Wed, Oct 21, 2015 at 10:51 PM, Ilya Dryomov wrote: > On Fri, Oct 16, 2015 at 1:09 PM, Ilya Dryomov wrote: >> Hmm... On the one hand, yes, we do compute CRCs, but that's optional, >> so enabling this unconditionally is probably too harsh. OTOH we are >> talking to the network, which means all sorts of delays, retransmission >> issues, etc, so I wonder how exactly "unstable" pages behave when, say, >> added to an skb - you can't write anything to a page until networking >> is fully done with it and expect it to work. It's particularly >> alarming that you've seen corruptions. >> >> Currently the only users of this flag are block integrity stuff and >> md-raid5, which makes me wonder what iscsi, nfs and others do in this >> area. There's an old ticket on this topic somewhere on the tracker, so >> I'll need to research this. Thanks for bringing this up! > > Hi Mike, > > I was hoping to grab you for a few minutes, but you weren't there... > > I spent a better part of today reading code and mailing lists on this > topic. It is of course a bug that we use sendpage() which inlines > pages into an skb and do nothing to keep those pages stable. We have > csums enabled by default, so setting BDI_CAP_STABLE_WRITES in the crc > case is an obvious fix. > > I looked at drbd and iscsi and I think iscsi could do the same - ditch > the fallback to sock_no_sendpage() in the datadgst_en case (and get rid > of iscsi_sw_tcp_conn::sendpage member while at it). Using stable pages > rather than having a roll-your-own implementation which doesn't close > the race but only narrows it sounds like a win, unless copying through > sendmsg() is for some reason cheaper than stable-waiting? > > drbd still needs the non-zero-copy version for its async protocol for > when they free the pages before the NIC has a chance to put them on the > wire. md-raid5 it turns out has an option to essentially disable most > of its stripe cache and so it sets BDI_CAP_STABLE_WRITES to compensate > if that option is enabled. > > What I'm worried about is the !crc (!datadgst_en) case. I'm failing to > convince myself that mucking with sendpage()ed pages while they sit in > the TCP queue (or anywhere in the networking stack, really), is safe - > there is nothing to prevent pages from being modified after sendpage() > returned and Ronny reports data corruptions that pretty much went away > with BDI_CAP_STABLE_WRITES set. I may be, after prolonged staring at > this, starting to confuse fs with block, though. How does that work in > iscsi land? > > (There was/is also this [1] bug, which is kind of related and probably > worth looking into at some point later. ceph shouldn't be that easily > affected - we've got state, but there is a ticket for it.) > > [1] http://www.spinics.net/lists/linux-nfs/msg34913.html

And now with Mike on the CC and a mention that at least one scenario of [1] got fixed in NFS by a6b31d18b02f ("SUNRPC: Fix a data corruption issue when retransmitting RPC calls").

Thanks, Ilya
Re: newstore direction
Adding 2c On Wed, 2015-10-21 at 14:37 -0500, Mark Nelson wrote: > My thought is that there is some inflection point where the userland > kvstore/block approach is going to be less work, for everyone I think, > than trying to quickly discover, understand, fix, and push upstream > patches that sometimes only really benefit us. I don't know if we've > truly hit that that point, but it's tough for me to find flaws with > Sage's argument. Regarding the userland / kernel land aspect of the topic, there are further aspects AFAIK not yet addressed in the thread: In the networking world, there's been development on memory mapped (multiple approaches exist) userland networking, which for packet management has the benefit of - for very, very specific applications of networking code - avoiding e.g. per-packet context switches etc, and streamlining processor cache management performance. People have gone as far as removing CPU cores from CPU scheduler to completely dedicate them to the networking task at hand (cache optimizations). There are various latency/throughput (bulking) optimizations applicable, but at the end of the day, it's about keeping the CPU bus busy with "revenue" bus traffic. Granted, storage IO operations may be much heavier in cycle counts for context switches to ever appear as a problem in themselves, certainly for slower SSDs and HDDs. However, when going for truly high performance IO, *every* hurdle in the data path counts toward the total latency. (And really, high performance random IO characteristics approaches the networking, per-packet handling characteristics). Now, I'm not really suggesting memory-mapping a storage device to user space, not at all, but having better control over the data path for a very specific use case, reduces dependency on the code that works as best as possible for the general case, and allows for very purpose-built code, to address a narrow set of requirements. ("Ceph storage cluster backend" isn't a typical FS use case.) It also decouples dependencies on users i.e. waiting for the next distro release before being able to take up the benefits of improvements to the storage code. A random google came up with related data on where "doing something way different" /can/ have significant benefits: http://phunq.net/pipermail/tux3/2015-April/002147.html I (FWIW) certainly agree there is merit to the idea. The scientific approach here could perhaps be to simply enumerate all corner cases of "generic FS" that actually are cause for the experienced issues, and assess probability of them being solved (and if so when). That *could* improve chances of approaching consensus which wouldn't hurt I suppose? BR, Martin
Re: [PATCH] mark rbd requiring stable pages
On Fri, Oct 16, 2015 at 1:09 PM, Ilya Dryomovwrote: > Hmm... On the one hand, yes, we do compute CRCs, but that's optional, > so enabling this unconditionally is probably too harsh. OTOH we are > talking to the network, which means all sorts of delays, retransmission > issues, etc, so I wonder how exactly "unstable" pages behave when, say, > added to an skb - you can't write anything to a page until networking > is fully done with it and expect it to work. It's particularly > alarming that you've seen corruptions. > > Currently the only users of this flag are block integrity stuff and > md-raid5, which makes me wonder what iscsi, nfs and others do in this > area. There's an old ticket on this topic somewhere on the tracker, so > I'll need to research this. Thanks for bringing this up! Hi Mike, I was hoping to grab you for a few minutes, but you weren't there... I spent a better part of today reading code and mailing lists on this topic. It is of course a bug that we use sendpage() which inlines pages into an skb and do nothing to keep those pages stable. We have csums enabled by default, so setting BDI_CAP_STABLE_WRITES in the crc case is an obvious fix. I looked at drbd and iscsi and I think iscsi could do the same - ditch the fallback to sock_no_sendpage() in the datadgst_en case (and get rid of iscsi_sw_tcp_conn::sendpage member while at it). Using stable pages rather than having a roll-your-own implementation which doesn't close the race but only narrows it sounds like a win, unless copying through sendmsg() is for some reason cheaper than stable-waiting? drbd still needs the non-zero-copy version for its async protocol for when they free the pages before the NIC has chance to put them on the wire. md-raid5 it turns out has an option to essentially disable most of its stripe cache and so it sets BDI_CAP_STABLE_WRITES to compensate if that option is enabled. What I'm worried about is the !crc (!datadgst_en) case. I'm failing to convince myself that mucking with sendpage()ed pages while they sit in the TCP queue (or anywhere in the networking stack, really), is safe - there is nothing to prevent pages from being modified after sendpage() returned and Ronny reports data corruptions that pretty much went away with BDI_CAP_STABLE_WRITES set. I may be, after prolonged staring at this, starting to confuse fs with block, though. How does that work in iscsi land? (There was/is also this [1] bug, which is kind of related and probably worth looking into at some point later. ceph shouldn't be that easily affected - we've got state, but there is a ticket for it.) [1] http://www.spinics.net/lists/linux-nfs/msg34913.html Thanks, Ilya -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
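Ilya's proposed fix - requiring stable pages only when data CRCs are in use - is observable from userspace once a driver sets BDI_CAP_STABLE_WRITES: the device's bdi exports a read-only stable_pages_required attribute. A small check of that flag is sketched below; the default device name is just a placeholder and this is only a diagnostic aid, not part of the rbd patch.

/*
 * Print whether a block device currently advertises that it requires
 * stable pages (i.e. its backing_dev_info has BDI_CAP_STABLE_WRITES).
 * Usage: ./stable_pages [device], e.g. ./stable_pages rbd0
 */
#include <stdio.h>

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "rbd0";   /* placeholder default */
    char path[256];

    snprintf(path, sizeof(path),
             "/sys/block/%s/bdi/stable_pages_required", dev);

    FILE *f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 1;
    }

    int required = 0;
    if (fscanf(f, "%d", &required) != 1) {
        fclose(f);
        return 1;
    }
    fclose(f);

    printf("%s: stable pages %s required\n", dev,
           required ? "ARE" : "are NOT");
    return 0;
}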
Re: MDS stuck in a crash loop
> John, I know you've got > https://github.com/ceph/ceph-qa-suite/pull/647. I think that's > supposed to be for this, but I'm not sure if you spotted any issues > with it or if we need to do some more diagnosing? That test path is just verifying that we do handle dirs without dying in at least one case -- it passes with the existing ceph code, so it's not reproducing this issue.
Re: MDS stuck in a crash loop
On Wed, Oct 21, 2015 at 2:33 PM, John Spray wrote: > On Wed, Oct 21, 2015 at 10:33 PM, John Spray wrote: >>> John, I know you've got >>> https://github.com/ceph/ceph-qa-suite/pull/647. I think that's >>> supposed to be for this, but I'm not sure if you spotted any issues >>> with it or if we need to do some more diagnosing? >> >> That test path is just verifying that we do handle dirs without dying >> in at least one case -- it passes with the existing ceph code, so it's >> not reproducing this issue. > > Clicked send too soon, I was about to add... > > Milosz mentioned that they don't have the data from the system in the > broken state, so I don't have any bright ideas about learning more > about what went wrong here unfortunately. Yeah, I guess we'll just need to watch out for it in the future. :/
Re: MDS stuck in a crash loop
On Wed, Oct 21, 2015 at 10:33 PM, John Spray wrote: >> John, I know you've got >> https://github.com/ceph/ceph-qa-suite/pull/647. I think that's >> supposed to be for this, but I'm not sure if you spotted any issues >> with it or if we need to do some more diagnosing? > > That test path is just verifying that we do handle dirs without dying > in at least one case -- it passes with the existing ceph code, so it's > not reproducing this issue. Clicked send too soon, I was about to add... Milosz mentioned that they don't have the data from the system in the broken state, so I don't have any bright ideas about learning more about what went wrong here unfortunately. John
RE: newstore direction
I am pushing internally to open-source ZetaScale. Recent events may or may not affect that trajectory -- stay tuned. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Mark Nelson [mailto:mnel...@redhat.com] Sent: Wednesday, October 21, 2015 10:45 PM To: Allen Samuels; Ric Wheeler ; Sage Weil ; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/21/2015 05:06 AM, Allen Samuels wrote: > I agree that moving newStore to raw block is going to be a significant > development effort. But the current scheme of using a KV store combined with > a normal file system is always going to be problematic (FileStore or > NewStore). This is caused by the transactional requirements of the > ObjectStore interface, essentially you need to make transactionally > consistent updates to two indexes, one of which doesn't understand > transactions (File Systems) and can never be tightly-connected to the other > one. > > You'll always be able to make this "loosely coupled" approach work, but it > will never be optimal. The real question is whether the performance > difference of a suboptimal implementation is something that you can live with > compared to the longer gestation period of the more optimal implementation. > Clearly, Sage believes that the performance difference is significant or he > wouldn't have kicked off this discussion in the first place. > > While I think we can all agree that writing a full-up KV and raw-block > ObjectStore is a significant amount of work. I will offer the case that the > "loosely couple" scheme may not have as much time-to-market advantage as it > appears to have. One example: NewStore performance is limited due to bugs in > XFS that won't be fixed in the field for quite some time (it'll take at least > a couple of years before a patched version of XFS will be widely deployed at > customer environments). > > Another example: Sage has just had to substantially rework the journaling > code of rocksDB. > > In short, as you can tell, I'm full throated in favor of going down the > optimal route. > > Internally at Sandisk, we have a KV store that is optimized for flash (it's > called ZetaScale). We have extended it with a raw block allocator just as > Sage is now proposing to do. Our internal performance measurements show a > significant advantage over the current NewStore. That performance advantage > stems primarily from two things: Has there been any discussion regarding opensourcing zetascale? > > (1) ZetaScale uses a B+-tree internally rather than an LSM tree > (levelDB/RocksDB). LSM trees experience exponential increase in write > amplification (cost of an insert) as the amount of data under management > increases. B+tree write-amplification is nearly constant independent of the > size of data under management. As the KV database gets larger (Since newStore > is effectively moving the per-file inode into the kv data base. Don't forget > checksums that Sage want's to add :)) this performance delta swamps all > others. > (2) Having a KV and a file-system causes a double lookup. This costs CPU time > and disk accesses to page in data structure indexes, metadata efficiency > decreases. > > You can't avoid (2) as long as you're using a file system. > > Yes an LSM tree performs better on HDD than does a B-tree, which is a good > argument for keeping the KV module pluggable. 
> > > Allen Samuels > Software Architect, Fellow, Systems and Software Solutions > > 2880 Junction Avenue, San Jose, CA 95134 > T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler > Sent: Tuesday, October 20, 2015 11:32 AM > To: Sage Weil ; ceph-devel@vger.kernel.org > Subject: Re: newstore direction > > On 10/19/2015 03:49 PM, Sage Weil wrote: >> The current design is based on two simple ideas: >> >>1) a key/value interface is better way to manage all of our >> internal metadata (object metadata, attrs, layout, collection >> membership, write-ahead logging, overlay data, etc.) >> >>2) a file system is well suited for storage object data (as files). >> >> So far 1 is working out well, but I'm questioning the wisdom of #2. >> A few >> things: >> >>- We currently write the data to the file, fsync, then commit the >> kv transaction. That's at least 3 IOs: one for the data, one for the >> fs journal, one for the kv txn to commit (at least once my rocksdb >> changes land... the kv commit is currently 2-3). So two people are >> managing metadata, here: the fs managing the file metadata (with its >> own >> journal) and the kv backend (with its journal). > > If all of
RE: newstore direction
One of the biggest changes that flash is making in the storage world is in how it shifts the basic trade-offs of storage management software architecture. In the HDD world CPU time per IOP was relatively inconsequential, i.e., it had little effect on overall performance, which was limited by the physics of the hard drive. Flash is now inverting that situation. When you look at the performance levels being delivered in the latest generation of NVMe SSDs you rapidly see that the storage itself is generally no longer the bottleneck (speaking about BW, not latency, of course) but rather the system sitting in front of the storage. Generally it's the CPU cost of an IOP. When Sandisk first started working with Ceph (Dumpling), the design of librados and the OSD led to a situation where the CPU cost of an IOP was dominated by context switches and network socket handling. Over time, much of that has been addressed. The socket handling code has been re-written (more than once!) and some of the internal queueing in the OSD (and the associated context switches) has been eliminated. As the CPU costs have dropped, performance on flash has improved accordingly. Because we didn't want to completely re-write the OSD (time-to-market and stability drove that decision), we didn't move it from the current "thread per IOP" model into a truly asynchronous "thread per CPU core" model that essentially eliminates context switches in the IO path. But a fully optimized OSD would go down that path (at least part-way). I believe it's been proposed in the past. Perhaps a hybrid "fast-path" style could get most of the benefits while preserving much of the legacy code. I believe this trend toward thread-per-core software development will also tend to support the "do it in user-space" trend. That's because most of the kernel and file-system interface is architected around the blocking "thread-per-IOP" model and is unlikely to change in the future. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Martin Millnert [mailto:mar...@millnert.se] Sent: Thursday, October 22, 2015 6:20 AM To: Mark Nelson Cc: Ric Wheeler ; Allen Samuels ; Sage Weil ; ceph-devel@vger.kernel.org Subject: Re: newstore direction Adding 2c On Wed, 2015-10-21 at 14:37 -0500, Mark Nelson wrote: > My thought is that there is some inflection point where the userland > kvstore/block approach is going to be less work, for everyone I think, > than trying to quickly discover, understand, fix, and push upstream > patches that sometimes only really benefit us. I don't know if we've > truly hit that point, but it's tough for me to find flaws with > Sage's argument. Regarding the userland / kernel land aspect of the topic, there are further aspects AFAIK not yet addressed in the thread: In the networking world, there's been development on memory mapped (multiple approaches exist) userland networking, which for packet management has the benefit of - for very, very specific applications of networking code - avoiding e.g. per-packet context switches etc, and streamlining processor cache management performance. People have gone as far as removing CPU cores from the CPU scheduler to completely dedicate them to the networking task at hand (cache optimizations). 
There are various latency/throughput (bulking) optimizations applicable, but at the end of the day, it's about keeping the CPU bus busy with "revenue" bus traffic. Granted, storage IO operations may be much heavier in cycle counts for context switches to ever appear as a problem in themselves, certainly for slower SSDs and HDDs. However, when going for truly high performance IO, *every* hurdle in the data path counts toward the total latency. (And really, high performance random IO characteristics approaches the networking, per-packet handling characteristics). Now, I'm not really suggesting memory-mapping a storage device to user space, not at all, but having better control over the data path for a very specific use case, reduces dependency on the code that works as best as possible for the general case, and allows for very purpose-built code, to address a narrow set of requirements. ("Ceph storage cluster backend" isn't a typical FS use case.) It also decouples dependencies on users i.e. waiting for the next distro release before being able to take up the benefits of improvements to the storage code. A random google came up with related data on where "doing something way different" /can/ have significant benefits: http://phunq.net/pipermail/tux3/2015-April/002147.html I (FWIW) certainly agree there is merit to the idea. The scientific approach here could perhaps be to simply enumerate
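As a rough illustration of the "thread per CPU core" model Allen contrasts with today's thread-per-IOP OSD, the sketch below pins one worker per online core; each worker is meant to drain its own request queue and run every IO to completion without handing it to another thread. Only the affinity/threading skeleton is shown (pthread_setaffinity_np is a GNU extension); the per-core queue is left as a comment and none of this is OSD code.

/*
 * Thread-per-core skeleton: one pinned worker per core, each owning a
 * private request queue, so IOs are handled run-to-completion on the
 * core that accepted them and never context-switch between workers.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>
#include <unistd.h>

struct worker {
    int core;
    /* a per-core SPSC request ring would live here */
};

static void *worker_main(void *arg)
{
    struct worker *w = arg;
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(w->core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    for (;;) {
        /* poll this core's queue and run each IO to completion here */
        break;  /* placeholder so the sketch terminates */
    }
    return NULL;
}

int main(void)
{
    long i, ncores = sysconf(_SC_NPROCESSORS_ONLN);
    pthread_t *tids = calloc(ncores, sizeof(*tids));
    struct worker *ws = calloc(ncores, sizeof(*ws));

    for (i = 0; i < ncores; i++) {
        ws[i].core = (int)i;
        pthread_create(&tids[i], NULL, worker_main, &ws[i]);
    }
    for (i = 0; i < ncores; i++)
        pthread_join(tids[i], NULL);

    free(ws);
    free(tids);
    return 0;
}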
RE: newstore direction
Fixing the bug doesn't take a long time. Getting it deployed is where the delay is. Many companies standardize on a particular release of a particular distro. Getting them to switch to a new release -- even a "bug fix" point release -- is a major undertaking that often is a complete roadblock. Just my experience. YMMV. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Ric Wheeler [mailto:rwhee...@redhat.com] Sent: Wednesday, October 21, 2015 8:24 PM To: Allen Samuels; Sage Weil ; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/21/2015 06:06 AM, Allen Samuels wrote: > I agree that moving newStore to raw block is going to be a significant > development effort. But the current scheme of using a KV store combined with > a normal file system is always going to be problematic (FileStore or > NewStore). This is caused by the transactional requirements of the > ObjectStore interface, essentially you need to make transactionally > consistent updates to two indexes, one of which doesn't understand > transactions (File Systems) and can never be tightly-connected to the other > one. > > You'll always be able to make this "loosely coupled" approach work, but it > will never be optimal. The real question is whether the performance > difference of a suboptimal implementation is something that you can live with > compared to the longer gestation period of the more optimal implementation. > Clearly, Sage believes that the performance difference is significant or he > wouldn't have kicked off this discussion in the first place. I think that we need to work with the existing stack - measure and do some collaborative analysis - before we throw out decades of work. Very hard to understand why the local file system is a barrier for performance in this case when it is not an issue in existing enterprise applications. We need some deep analysis with some local file system experts thrown in to validate the concerns. > > While I think we can all agree that writing a full-up KV and raw-block > ObjectStore is a significant amount of work. I will offer the case that the > "loosely couple" scheme may not have as much time-to-market advantage as it > appears to have. One example: NewStore performance is limited due to bugs in > XFS that won't be fixed in the field for quite some time (it'll take at least > a couple of years before a patched version of XFS will be widely deployed at > customer environments). Not clear what bugs you are thinking of or why you think fixing bugs will take a long time to hit the field in XFS. Red Hat has most of the XFS developers on staff and we actively backport fixes and ship them, other distros do as well. Never seen a "bug" take a couple of years to hit users. Regards, Ric > > Another example: Sage has just had to substantially rework the journaling > code of rocksDB. > > In short, as you can tell, I'm full throated in favor of going down the > optimal route. > > Internally at Sandisk, we have a KV store that is optimized for flash (it's > called ZetaScale). We have extended it with a raw block allocator just as > Sage is now proposing to do. Our internal performance measurements show a > significant advantage over the current NewStore. That performance advantage > stems primarily from two things: > > (1) ZetaScale uses a B+-tree internally rather than an LSM tree > (levelDB/RocksDB). 
LSM trees experience exponential increase in write > amplification (cost of an insert) as the amount of data under management > increases. B+tree write-amplification is nearly constant independent of the > size of data under management. As the KV database gets larger (Since newStore > is effectively moving the per-file inode into the kv data base. Don't forget > checksums that Sage want's to add :)) this performance delta swamps all > others. > (2) Having a KV and a file-system causes a double lookup. This costs CPU time > and disk accesses to page in data structure indexes, metadata efficiency > decreases. > > You can't avoid (2) as long as you're using a file system. > > Yes an LSM tree performs better on HDD than does a B-tree, which is a good > argument for keeping the KV module pluggable. > > > Allen Samuels > Software Architect, Fellow, Systems and Software Solutions > > 2880 Junction Avenue, San Jose, CA 95134 > T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler > Sent: Tuesday, October 20, 2015 11:32 AM > To: Sage Weil ; ceph-devel@vger.kernel.org > Subject: Re: newstore direction > > On 10/19/2015 03:49 PM, Sage Weil wrote: >> The current design is based on
RE: newstore direction
Actually Range queries are an important part of the performance story and random read speed doesn't really solve the problem. When you're doing a scrub, you need to enumerate the objects in a specific order on multiple nodes -- so that they can compare the contents of their stores in order to determine if data cleaning needs to take place. If you don't have in-order enumeration in your basic data structure (which NVMKV doesn't have) then you're forced to sort the directory before you can respond to an enumeration. That sort will either consume huge amounts of IOPS OR huge amounts of DRAM. Regardless of the choice, you'll see a significant degradation of performance while the scrub is ongoing -- which is one of the biggest problems with clustered systems (expensive and extensive maintenance operations). Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Chen, Xiaoxi [mailto:xiaoxi.c...@intel.com] Sent: Thursday, October 22, 2015 1:10 AM To: Mark Nelson; Allen Samuels ; Sage Weil Cc: James (Fei) Liu-SSI ; Somnath Roy ; ceph-devel@vger.kernel.org Subject: RE: newstore direction We did evaluate whether NVMKV could be implemented by non-fusionIO ssds, i.e re-invent an NVMKV, the final conclusion sounds like it's not hard with persistent memory(which will be available soon). But yeah, NVMKV will not work if no PM is present---persist the hashing table to SSD is not practicable. Range query seems not a very big issue as the random read performance of nowadays SSD is more than enough, I mean, even we break all sequential to random (typically 70-80K IOPS which is ~300MB/s), the performance still good enough. Anyway, I think for the high IOPS case, it's hard for the consumer to play well on SSDs from different vendors, would be better to leave it to SSD vendor, something like Openstack Cinder's structure. a vendor has the responsibility to maintain their drivers to ceph and take care the performance. > -Original Message- > From: Mark Nelson [mailto:mnel...@redhat.com] > Sent: Wednesday, October 21, 2015 9:36 PM > To: Allen Samuels; Sage Weil; Chen, Xiaoxi > Cc: James (Fei) Liu-SSI; Somnath Roy; ceph-devel@vger.kernel.org > Subject: Re: newstore direction > > Thanks Allen! The devil is always in the details. Know of anything > else that looks promising? > > Mark > > On 10/21/2015 05:06 AM, Allen Samuels wrote: > > I doubt that NVMKV will be useful for two reasons: > > > > (1) It relies on the unique sparse-mapping addressing capabilities > > of the FusionIO VSL interface, it won't run on standard SSDs > > (2) NVMKV doesn't provide any form of in-order enumeration (i.e., no > range operations on keys). This is pretty much required for deep scrubbing. 
> > > > > > Allen Samuels > > Software Architect, Fellow, Systems and Software Solutions > > > > 2880 Junction Avenue, San Jose, CA 95134 > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com > > > > -Original Message- > > From: ceph-devel-ow...@vger.kernel.org > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson > > Sent: Tuesday, October 20, 2015 6:20 AM > > To: Sage Weil ; Chen, Xiaoxi > > > > Cc: James (Fei) Liu-SSI ; Somnath Roy > > ; ceph-devel@vger.kernel.org > > Subject: Re: newstore direction > > > > On 10/20/2015 07:30 AM, Sage Weil wrote: > >> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote: > >>> +1, nowadays K-V DB care more about very small key-value pairs, > >>> +say > >>> several bytes to a few KB, but in SSD case we only care about 4KB > >>> or 8KB. In this way, NVMKV is a good design and seems some of the > >>> SSD vendor are also trying to build this kind of interface, we had > >>> a NVM-L library but still under development. > >> > >> Do you have an NVMKV link? I see a paper and a stale github repo.. > >> not sure if I'm looking at the right thing. > >> > >> My concern with using a key/value interface for the object data is > >> that you end up with lots of key/value pairs (e.g., $inode_$offset > >> = > >> $4kb_of_data) that is pretty inefficient to store and (depending on > >> the > >> implementation) tends to break alignment. I don't think these > >> interfaces are targetted toward block-sized/aligned payloads. > >> Storing just the metadata (block allocation map) w/ the kv api and > >> storing the data directly on a block/page interface makes more > >> sense to > me. > >> > >> sage > > > > I get the feeling that some of the folks that were involved with > > nvmkv at > Fusion IO have left. Nisha Talagala is now out at Parallel Systems for > instance. > http://pmem.io might be a better bet, though I
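To make the enumeration point concrete: with an ordered KV store (B+tree or LSM), scrub can simply iterate a key range and two OSDs can compare their contents in lockstep; a hash-addressed store like NVMKV has no such order, so it must first collect every key and sort it. The sketch below shows the unordered case, with qsort() standing in for the IOPS- or DRAM-hungry sort Allen describes; the object names are made up.

/*
 * Hash-style store: enumeration order is meaningless, so every key has
 * to be gathered and sorted before an in-order scrub comparison can run.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int cmp_keys(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

static void enumerate_unordered(const char **keys, size_t n)
{
    qsort(keys, n, sizeof(*keys), cmp_keys);   /* the extra work an ordered
                                                  structure avoids entirely */
    for (size_t i = 0; i < n; i++)
        printf("scrub compare: %s\n", keys[i]);
}

int main(void)
{
    const char *keys[] = { "obj.0007", "obj.0001", "obj.0042" };

    enumerate_unordered(keys, sizeof(keys) / sizeof(keys[0]));
    return 0;
}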
Re: newstore direction
On 10/21/2015 08:53 PM, Allen Samuels wrote: Fixing the bug doesn't take a long time. Getting it deployed is where the delay is. Many companies standardize on a particular release of a particular distro. Getting them to switch to a new release -- even a "bug fix" point release -- is a major undertaking that often is a complete roadblock. Just my experience. YMMV. Customers do control the pace at which they upgrade their machines, but we put out fixes at a very regular pace. A lot of customers will get fixes without having to qualify a full new release (i.e., fixes that come out between major and minor releases are easy). If someone is deploying a critical server for storage, then it falls back on the storage software team to help guide them and encourage them to update when needed (no promises of success, but people move if the win is big; if it is not, they can wait). ric
RE: newstore direction
I agree. My only point was that you still have to factor this time into the argument that by continuing to put NewStore on top of a file system you'll get to a stable system much sooner than the longer development path of doing your own raw storage allocator. IMO, once you factor that into the equation the "on top of an FS" path doesn't look like such a clear winner. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Ric Wheeler [mailto:rwhee...@redhat.com] Sent: Thursday, October 22, 2015 10:17 AM To: Allen Samuels; Sage Weil ; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/21/2015 08:53 PM, Allen Samuels wrote: > Fixing the bug doesn't take a long time. Getting it deployed is where the > delay is. Many companies standardize on a particular release of a particular > distro. Getting them to switch to a new release -- even a "bug fix" point > release -- is a major undertaking that often is a complete roadblock. Just my > experience. YMMV. > Customers do control the pace that they upgrade their machines, but we put out fixes on a very regular pace. A lot of customers will get fixes without having to qualify a full new release (i.e., fixes come out between major and minor releases are easy). If someone is deploying a critical server for storage, then it falls back on the storage software team to help guide them and encourage them to update when needed (and no promises of success, but people move if the win is big. If it is not, they can wait). ric
Re: newstore direction
On Tue, 20 Oct 2015, Ric Wheeler wrote: > > Now: > > 1 io to write a new file > > 1-2 ios to sync the fs journal (commit the inode, alloc change) > > (I see 2 journal IOs on XFS and only 1 on ext4...) > > 1 io to commit the rocksdb journal (currently 3, but will drop to > > 1 with xfs fix and my rocksdb change) > > I think that might be too pessimistic - the number of discrete IO's sent down > to a spinning disk makes much less impact on performance than the number of > fsync()'s since the IO's all land in the write cache. Some newer spinning > drives have a non-volatile write cache, so even an fsync() might not end up > doing the expensive data transfer to the platter. True, but in XFS's case at least the file data and journal are not colocated, so it's 2 seeks for the new file write+fdatasync and another for the rocksdb journal commit. Of course, with a deep queue, we're doing lots of these so there'd be fewer journal commits on both counts, but the lower bound on latency of a single write is still 3 seeks, and that bound is pretty critical when you also have network round trips and replication (worst out of 2) on top. > It would be interesting to get the timings on the IO's you see to measure the > actual impact. I observed this with the journaling workload for rocksdb, but I assume the journaling behavior is the same regardless of what is being journaled. For a 4KB append to a file + fdatasync, I saw ~30ms latency for XFS, and blktrace showed an IO to the file, and 2 IOs to the journal. I believe the first one is the record for the inode update, and the second is the journal 'commit' record (though I forget how I decided that). My guess is that XFS is being extremely careful about journal integrity here and not writing the commit record until it knows that the preceding records landed on stable storage. For ext4, the latency was about ~20ms, and blktrace showed the IO to the file and then a single journal IO. When I made the rocksdb change to overwrite an existing, prewritten file, the latency dropped to ~10ms on ext4, and blktrace showed a single IO as expected. (XFS still showed the 2 journal commit IOs, but Dave just posted the fix for that on the XFS list today.) > Plumbing for T10 DIF/DIX already exists; what is missing is the normal block > device that handles them (not enterprise SAS/disk array class) Yeah... which unfortunately means that unless the cheap drives suddenly start shipping with DIF/DIX support we'll need to do the checksums ourselves. This is probably a good thing anyway as it doesn't constrain our choice of checksum or checksum granularity, and will still work with other storage devices (ssds, nvme, etc.). sage
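The 4KB append + fdatasync measurement Sage quotes is easy to reproduce with the loop below; the file path and iteration count are placeholders, and blktrace should be run against the backing device separately to see the per-sync journal IOs he describes.

/*
 * Time a 4KB append followed by fdatasync(), the same operation whose
 * latency is discussed above (~30ms on XFS, ~20ms on ext4 in that test).
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    memset(buf, 0xab, sizeof(buf));

    int fd = open("/mnt/test/append.dat",
                  O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    for (int i = 0; i < 100; i++) {
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
            perror("write");
            return 1;
        }
        fdatasync(fd);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e6;
        printf("append+fdatasync #%d: %.2f ms\n", i, ms);
    }

    close(fd);
    return 0;
}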
RE: newstore direction
I agree that moving newStore to raw block is going to be a significant development effort. But the current scheme of using a KV store combined with a normal file system is always going to be problematic (FileStore or NewStore). This is caused by the transactional requirements of the ObjectStore interface, essentially you need to make transactionally consistent updates to two indexes, one of which doesn't understand transactions (File Systems) and can never be tightly-connected to the other one. You'll always be able to make this "loosely coupled" approach work, but it will never be optimal. The real question is whether the performance difference of a suboptimal implementation is something that you can live with compared to the longer gestation period of the more optimal implementation. Clearly, Sage believes that the performance difference is significant or he wouldn't have kicked off this discussion in the first place. While I think we can all agree that writing a full-up KV and raw-block ObjectStore is a significant amount of work. I will offer the case that the "loosely couple" scheme may not have as much time-to-market advantage as it appears to have. One example: NewStore performance is limited due to bugs in XFS that won't be fixed in the field for quite some time (it'll take at least a couple of years before a patched version of XFS will be widely deployed at customer environments). Another example: Sage has just had to substantially rework the journaling code of rocksDB. In short, as you can tell, I'm full throated in favor of going down the optimal route. Internally at Sandisk, we have a KV store that is optimized for flash (it's called ZetaScale). We have extended it with a raw block allocator just as Sage is now proposing to do. Our internal performance measurements show a significant advantage over the current NewStore. That performance advantage stems primarily from two things: (1) ZetaScale uses a B+-tree internally rather than an LSM tree (levelDB/RocksDB). LSM trees experience exponential increase in write amplification (cost of an insert) as the amount of data under management increases. B+tree write-amplification is nearly constant independent of the size of data under management. As the KV database gets larger (Since newStore is effectively moving the per-file inode into the kv data base. Don't forget checksums that Sage want's to add :)) this performance delta swamps all others. (2) Having a KV and a file-system causes a double lookup. This costs CPU time and disk accesses to page in data structure indexes, metadata efficiency decreases. You can't avoid (2) as long as you're using a file system. Yes an LSM tree performs better on HDD than does a B-tree, which is a good argument for keeping the KV module pluggable. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler Sent: Tuesday, October 20, 2015 11:32 AM To: Sage Weil; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/19/2015 03:49 PM, Sage Weil wrote: > The current design is based on two simple ideas: > > 1) a key/value interface is better way to manage all of our internal > metadata (object metadata, attrs, layout, collection membership, > write-ahead logging, overlay data, etc.) > > 2) a file system is well suited for storage object data (as files). 
> > So far 1 is working out well, but I'm questioning the wisdom of #2. A > few > things: > > - We currently write the data to the file, fsync, then commit the kv > transaction. That's at least 3 IOs: one for the data, one for the fs > journal, one for the kv txn to commit (at least once my rocksdb > changes land... the kv commit is currently 2-3). So two people are > managing metadata, here: the fs managing the file metadata (with its > own > journal) and the kv backend (with its journal). If all of the fsync()'s fall into the same backing file system, are you sure that each fsync() takes the same time? Depending on the local FS implementation of course, but the order of issuing those fsync()'s can effectively make some of them no-ops. > > - On read we have to open files by name, which means traversing the > fs namespace. Newstore tries to keep it as flat and simple as > possible, but at a minimum it is a couple btree lookups. We'd love to > use open by handle (which would reduce this to 1 btree traversal), but > running the daemon as ceph and not root makes that hard... This seems like a a pretty low hurdle to overcome. > > - ...and file systems insist on updating mtime on writes, even when > it is a overwrite with no allocation changes. (We don't care about > mtime.) O_NOCMTIME patches exist but it is hard to get these past the
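A back-of-the-envelope model of the write-amplification comparison Allen draws above: under leveled compaction with a size ratio (fanout) F between levels, each byte is rewritten roughly F times for every level it descends, and the number of levels grows with the data set, while a B+tree dirties roughly one leaf page per update regardless of data size. The constants below (fanout, memtable size, B+tree page/record ratio) are illustrative assumptions, not measurements, and real engines vary with compaction strategy and tuning.

/* Rough write-amplification estimate: leveled LSM vs B+tree. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double fanout = 10.0;       /* level size ratio (assumed) */
    const double memtable_gb = 0.25;  /* data absorbed before first flush (assumed) */
    const double btree_wa = 8.0;      /* ~page_size / record_size, size-independent (assumed) */

    for (double data_gb = 1; data_gb <= 100000; data_gb *= 10) {
        double levels = ceil(log(data_gb / memtable_gb) / log(fanout));
        if (levels < 1)
            levels = 1;
        double lsm_wa = 1.0 + levels * fanout;  /* flush + per-level rewrites */

        printf("%8.0f GB under management: LSM WA ~%4.0f, B+tree WA ~%3.0f\n",
               data_gb, lsm_wa, btree_wa);
    }
    return 0;
}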
RE: newstore direction
I doubt that NVMKV will be useful for two reasons: (1) It relies on the unique sparse-mapping addressing capabilities of the FusionIO VSL interface, it won't run on standard SSDs (2) NVMKV doesn't provide any form of in-order enumeration (i.e., no range operations on keys). This is pretty much required for deep scrubbing. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson Sent: Tuesday, October 20, 2015 6:20 AM To: Sage Weil; Chen, Xiaoxi Cc: James (Fei) Liu-SSI ; Somnath Roy ; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/20/2015 07:30 AM, Sage Weil wrote: > On Tue, 20 Oct 2015, Chen, Xiaoxi wrote: >> +1, nowadays K-V DB care more about very small key-value pairs, say >> several bytes to a few KB, but in SSD case we only care about 4KB or >> 8KB. In this way, NVMKV is a good design and seems some of the SSD >> vendor are also trying to build this kind of interface, we had a >> NVM-L library but still under development. > > Do you have an NVMKV link? I see a paper and a stale github repo.. > not sure if I'm looking at the right thing. > > My concern with using a key/value interface for the object data is > that you end up with lots of key/value pairs (e.g., $inode_$offset = > $4kb_of_data) that is pretty inefficient to store and (depending on > the > implementation) tends to break alignment. I don't think these > interfaces are targetted toward block-sized/aligned payloads. Storing > just the metadata (block allocation map) w/ the kv api and storing the > data directly on a block/page interface makes more sense to me. > > sage I get the feeling that some of the folks that were involved with nvmkv at Fusion IO have left. Nisha Talagala is now out at Parallel Systems for instance. http://pmem.io might be a better bet, though I haven't looked closely at it. Mark > > >>> -Original Message- >>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- >>> ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI >>> Sent: Tuesday, October 20, 2015 6:21 AM >>> To: Sage Weil; Somnath Roy >>> Cc: ceph-devel@vger.kernel.org >>> Subject: RE: newstore direction >>> >>> Hi Sage and Somnath, >>>In my humble opinion, There is another more aggressive solution >>> than raw block device base keyvalue store as backend for >>> objectstore. The new key value SSD device with transaction support would >>> be ideal to solve the issues. >>> First of all, it is raw SSD device. Secondly , It provides key value >>> interface directly from SSD. Thirdly, it can provide transaction >>> support, consistency will be guaranteed by hardware device. It >>> pretty much satisfied all of objectstore needs without any extra >>> overhead since there is not any extra layer in between device and >>> objectstore. >>> Either way, I strongly support to have CEPH own data format >>> instead of relying on filesystem. >>> >>>Regards, >>>James >>> >>> -Original Message- >>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- >>> ow...@vger.kernel.org] On Behalf Of Sage Weil >>> Sent: Monday, October 19, 2015 1:55 PM >>> To: Somnath Roy >>> Cc: ceph-devel@vger.kernel.org >>> Subject: RE: newstore direction >>> >>> On Mon, 19 Oct 2015, Somnath Roy wrote: Sage, I fully support that. 
If we want to saturate SSDs , we need to get rid of this filesystem overhead (which I am in process of measuring). Also, it will be good if we can eliminate the dependency on the k/v dbs (for storing allocators and all). The reason is the unknown write amps they causes. >>> >>> My hope is to keep behing the KeyValueDB interface (and/more change >>> it as >>> appropriate) so that other backends can be easily swapped in (e.g. a >>> btree- based one for high-end flash). >>> >>> sage >>> >>> Thanks & Regards Somnath -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Monday, October 19, 2015 12:49 PM To: ceph-devel@vger.kernel.org Subject: newstore direction The current design is based on two simple ideas: 1) a key/value interface is better way to manage all of our internal metadata (object metadata, attrs, layout, collection membership, write-ahead logging, overlay data, etc.) 2) a file system is well suited for storage object data (as files). So far 1 is working out well, but I'm questioning the wisdom of #2. A few things: - We currently write the data to the file, fsync, then commit the kv
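The contrast Sage draws in the quoted text - object data stored directly under per-block keys versus only the allocation metadata in the KV store with the data on a block/page interface - can be sketched as two key layouts. The key formats and the extent record below are invented purely for illustration; they are not newstore's actual encoding.

/* Two ways to map object bytes through a KV store. */
#include <stdint.h>
#include <stdio.h>

struct extent {              /* variant B: metadata-only KV value            */
    uint64_t logical_off;    /* offset within the object                     */
    uint64_t disk_block;     /* block number on the raw device               */
    uint32_t length;         /* bytes, block-aligned                         */
    uint32_t crc32c;         /* per-extent checksum, since the fs won't help */
};

int main(void)
{
    uint64_t ino = 1234, off = 8192;
    char key[64];

    /* Variant A: one KV pair per 4KB block, value = the data itself
     * ($inode_$offset = $4kb_of_data), which bloats the KV store and
     * tends to break alignment. */
    snprintf(key, sizeof(key), "data.%016llx.%016llx",
             (unsigned long long)ino, (unsigned long long)off);
    printf("A: %s -> <4096 bytes of payload stored in the KV value>\n", key);

    /* Variant B: one KV pair per extent, value = where the data lives
     * on the raw block device. */
    struct extent e = { .logical_off = off, .disk_block = 77312,
                        .length = 65536, .crc32c = 0 };
    snprintf(key, sizeof(key), "extent.%016llx.%016llx",
             (unsigned long long)ino, (unsigned long long)e.logical_off);
    printf("B: %s -> {block=%llu len=%u}\n", key,
           (unsigned long long)e.disk_block, e.length);
    return 0;
}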
Re: newstore direction
On 10/21/2015 06:06 AM, Allen Samuels wrote: I agree that moving newStore to raw block is going to be a significant development effort. But the current scheme of using a KV store combined with a normal file system is always going to be problematic (FileStore or NewStore). This is caused by the transactional requirements of the ObjectStore interface, essentially you need to make transactionally consistent updates to two indexes, one of which doesn't understand transactions (File Systems) and can never be tightly-connected to the other one. You'll always be able to make this "loosely coupled" approach work, but it will never be optimal. The real question is whether the performance difference of a suboptimal implementation is something that you can live with compared to the longer gestation period of the more optimal implementation. Clearly, Sage believes that the performance difference is significant or he wouldn't have kicked off this discussion in the first place. I think that we need to work with the existing stack - measure and do some collaborative analysis - before we throw out decades of work. Very hard to understand why the local file system is a barrier for performance in this case when it is not an issue in existing enterprise applications. We need some deep analysis with some local file system experts thrown in to validate the concerns. While I think we can all agree that writing a full-up KV and raw-block ObjectStore is a significant amount of work. I will offer the case that the "loosely couple" scheme may not have as much time-to-market advantage as it appears to have. One example: NewStore performance is limited due to bugs in XFS that won't be fixed in the field for quite some time (it'll take at least a couple of years before a patched version of XFS will be widely deployed at customer environments). Not clear what bugs you are thinking of or why you think fixing bugs will take a long time to hit the field in XFS. Red Hat has most of the XFS developers on staff and we actively backport fixes and ship them, other distros do as well. Never seen a "bug" take a couple of years to hit users. Regards, Ric Another example: Sage has just had to substantially rework the journaling code of rocksDB. In short, as you can tell, I'm full throated in favor of going down the optimal route. Internally at Sandisk, we have a KV store that is optimized for flash (it's called ZetaScale). We have extended it with a raw block allocator just as Sage is now proposing to do. Our internal performance measurements show a significant advantage over the current NewStore. That performance advantage stems primarily from two things: (1) ZetaScale uses a B+-tree internally rather than an LSM tree (levelDB/RocksDB). LSM trees experience exponential increase in write amplification (cost of an insert) as the amount of data under management increases. B+tree write-amplification is nearly constant independent of the size of data under management. As the KV database gets larger (Since newStore is effectively moving the per-file inode into the kv data base. Don't forget checksums that Sage want's to add :)) this performance delta swamps all others. (2) Having a KV and a file-system causes a double lookup. This costs CPU time and disk accesses to page in data structure indexes, metadata efficiency decreases. You can't avoid (2) as long as you're using a file system. Yes an LSM tree performs better on HDD than does a B-tree, which is a good argument for keeping the KV module pluggable. 
Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler Sent: Tuesday, October 20, 2015 11:32 AM To: Sage Weil; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/19/2015 03:49 PM, Sage Weil wrote: The current design is based on two simple ideas: 1) a key/value interface is better way to manage all of our internal metadata (object metadata, attrs, layout, collection membership, write-ahead logging, overlay data, etc.) 2) a file system is well suited for storage object data (as files). So far 1 is working out well, but I'm questioning the wisdom of #2. A few things: - We currently write the data to the file, fsync, then commit the kv transaction. That's at least 3 IOs: one for the data, one for the fs journal, one for the kv txn to commit (at least once my rocksdb changes land... the kv commit is currently 2-3). So two people are managing metadata, here: the fs managing the file metadata (with its own journal) and the kv backend (with its journal). If all of the fsync()'s fall into the same backing file system, are you sure that each fsync() takes the same time? Depending on the
Re: what does ms_objecter do in OSD ?
On Wed, 21 Oct 2015, Jaze Lee wrote: > Hello, > I find that this messenger does not bind to any IP, so I do not know why we do > that. > Does anyone know what ms_objecter can do? Thanks a lot. It is the librados client that is used by the rados copy-from operation and for cache tiering (to read/write to other OSDs). It doesn't bind to an IP because it is the client side. sage
Re: newstore direction
On 10/21/2015 04:22 AM, Orit Wasserman wrote: On Tue, 2015-10-20 at 14:31 -0400, Ric Wheeler wrote: On 10/19/2015 03:49 PM, Sage Weil wrote: The current design is based on two simple ideas: 1) a key/value interface is better way to manage all of our internal metadata (object metadata, attrs, layout, collection membership, write-ahead logging, overlay data, etc.) 2) a file system is well suited for storage object data (as files). So far 1 is working out well, but I'm questioning the wisdom of #2. A few things: - We currently write the data to the file, fsync, then commit the kv transaction. That's at least 3 IOs: one for the data, one for the fs journal, one for the kv txn to commit (at least once my rocksdb changes land... the kv commit is currently 2-3). So two people are managing metadata, here: the fs managing the file metadata (with its own journal) and the kv backend (with its journal). If all of the fsync()'s fall into the same backing file system, are you sure that each fsync() takes the same time? Depending on the local FS implementation of course, but the order of issuing those fsync()'s can effectively make some of them no-ops. - On read we have to open files by name, which means traversing the fs namespace. Newstore tries to keep it as flat and simple as possible, but at a minimum it is a couple btree lookups. We'd love to use open by handle (which would reduce this to 1 btree traversal), but running the daemon as ceph and not root makes that hard... This seems like a a pretty low hurdle to overcome. - ...and file systems insist on updating mtime on writes, even when it is a overwrite with no allocation changes. (We don't care about mtime.) O_NOCMTIME patches exist but it is hard to get these past the kernel brainfreeze. Are you using O_DIRECT? Seems like there should be some enterprisey database tricks that we can use here. - XFS is (probably) never going going to give us data checksums, which we want desperately. What is the goal of having the file system do the checksums? How strong do they need to be and what size are the chunks? If you update this on each IO, this will certainly generate more IO (each write will possibly generate at least one other write to update that new checksum). But what's the alternative? My thought is to just bite the bullet and consume a raw block device directly. Write an allocator, hopefully keep it pretty simple, and manage it in kv store along with all of our other metadata. The big problem with consuming block devices directly is that you ultimately end up recreating most of the features that you had in the file system. Even enterprise databases like Oracle and DB2 have been migrating away from running on raw block devices in favor of file systems over time. In effect, you are looking at making a simple on disk file system which is always easier to start than it is to get back to a stable, production ready state. The best performance is still on block device (SAN). File system simplify the operation tasks which worth the performance penalty for a database. I think in a storage system this is not the case. In many cases they can use their own file system that is tailored for the database. You will have to trust me on this as the Red Hat person who spoke to pretty much all of our key customers about local file systems and storage - customers all have migrated over to using normal file systems under Oracle/DB2. Typically, they use XFS or ext4. 
I don't know of any non-standard file systems and only have seen one account running on a raw block store in 8 years :) If you have a pre-allocated file and write using O_DIRECT, your IO path is identical in terms of IO's sent to the device. If we are causing additional IO's, then we really need to spend some time talking to the local file system gurus about this in detail. I can help with that conversation. I think that it might be quicker and more maintainable to spend some time working with the local file system people (XFS or other) to see if we can jointly address the concerns you have. Wins: - 2 IOs for most: one to write the data to unused space in the block device, one to commit our transaction (vs 4+ before). For overwrites, we'd have one io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ before). - No concern about mtime getting in the way - Faster reads (no fs lookup) - Similarly sized metadata for most objects. If we assume most objects are not fragmented, then the metadata to store the block offsets is about the same size as the metadata to store the filenames we have now. Problems: - We have to size the kv backend storage (probably still an XFS partition) vs the block storage. Maybe we do this anyway (put metadata on SSD!) so it won't matter. But what happens when we are storing gobs of rgw index data or cephfs metadata? Suddenly we are pulling storage out of a different pool and those
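Ric's preallocated-file-plus-O_DIRECT path looks roughly like the sketch below: allocate the space once with fallocate(), then issue block-aligned overwrites through O_DIRECT, so a rewrite of already-written blocks is a single data IO (the very first write into an unwritten extent still updates fs metadata, and fdatasync may still touch the journal, as discussed in the thread). The path, sizes and offsets are placeholders, and the alignment must match the device's logical block size.

/* Preallocated file + O_DIRECT aligned overwrite. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t align = 4096, len = 4096;
    void *buf;

    if (posix_memalign(&buf, align, len))   /* O_DIRECT needs aligned memory */
        return 1;
    memset(buf, 0x5a, len);

    int fd = open("/mnt/test/prealloc.dat",
                  O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    if (fallocate(fd, 0, 0, 1 << 20) < 0) { /* allocate 1MB up front, once */
        perror("fallocate");
        return 1;
    }

    /* later overwrites: one aligned IO straight to the allocated blocks */
    if (pwrite(fd, buf, len, 64 * 4096) != (ssize_t)len) {
        perror("pwrite");
        return 1;
    }
    fdatasync(fd);

    close(fd);
    free(buf);
    return 0;
}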
Re: newstore direction
On Tue, 2015-10-20 at 14:31 -0400, Ric Wheeler wrote: > On 10/19/2015 03:49 PM, Sage Weil wrote: > > The current design is based on two simple ideas: > > > > 1) a key/value interface is better way to manage all of our internal > > metadata (object metadata, attrs, layout, collection membership, > > write-ahead logging, overlay data, etc.) > > > > 2) a file system is well suited for storage object data (as files). > > > > So far 1 is working out well, but I'm questioning the wisdom of #2. A few > > things: > > > > - We currently write the data to the file, fsync, then commit the kv > > transaction. That's at least 3 IOs: one for the data, one for the fs > > journal, one for the kv txn to commit (at least once my rocksdb changes > > land... the kv commit is currently 2-3). So two people are managing > > metadata, here: the fs managing the file metadata (with its own > > journal) and the kv backend (with its journal). > > If all of the fsync()'s fall into the same backing file system, are you sure > that each fsync() takes the same time? Depending on the local FS > implementation > of course, but the order of issuing those fsync()'s can effectively make some > of > them no-ops. > > > > > - On read we have to open files by name, which means traversing the fs > > namespace. Newstore tries to keep it as flat and simple as possible, but > > at a minimum it is a couple btree lookups. We'd love to use open by > > handle (which would reduce this to 1 btree traversal), but running > > the daemon as ceph and not root makes that hard... > > This seems like a a pretty low hurdle to overcome. > > > > > - ...and file systems insist on updating mtime on writes, even when it is > > a overwrite with no allocation changes. (We don't care about mtime.) > > O_NOCMTIME patches exist but it is hard to get these past the kernel > > brainfreeze. > > Are you using O_DIRECT? Seems like there should be some enterprisey database > tricks that we can use here. > > > > > - XFS is (probably) never going going to give us data checksums, which we > > want desperately. > > What is the goal of having the file system do the checksums? How strong do > they > need to be and what size are the chunks? > > If you update this on each IO, this will certainly generate more IO (each > write > will possibly generate at least one other write to update that new checksum). > > > > > But what's the alternative? My thought is to just bite the bullet and > > consume a raw block device directly. Write an allocator, hopefully keep > > it pretty simple, and manage it in kv store along with all of our other > > metadata. > > The big problem with consuming block devices directly is that you ultimately > end > up recreating most of the features that you had in the file system. Even > enterprise databases like Oracle and DB2 have been migrating away from > running > on raw block devices in favor of file systems over time. In effect, you are > looking at making a simple on disk file system which is always easier to > start > than it is to get back to a stable, production ready state. The best performance is still on block device (SAN). File system simplify the operation tasks which worth the performance penalty for a database. I think in a storage system this is not the case. In many cases they can use their own file system that is tailored for the database. > I think that it might be quicker and more maintainable to spend some time > working with the local file system people (XFS or other) to see if we can > jointly address the concerns you have. 
> > > > Wins: > > > > - 2 IOs for most: one to write the data to unused space in the block > > device, one to commit our transaction (vs 4+ before). For overwrites, > > we'd have one io to do our write-ahead log (kv journal), then do > > the overwrite async (vs 4+ before). > > > > - No concern about mtime getting in the way > > > > - Faster reads (no fs lookup) > > > > - Similarly sized metadata for most objects. If we assume most objects > > are not fragmented, then the metadata to store the block offsets is about > > the same size as the metadata to store the filenames we have now. > > > > Problems: > > > > - We have to size the kv backend storage (probably still an XFS > > partition) vs the block storage. Maybe we do this anyway (put metadata on > > SSD!) so it won't matter. But what happens when we are storing gobs of > > rgw index data or cephfs metadata? Suddenly we are pulling storage out of > > a different pool and those aren't currently fungible. > > > > - We have to write and maintain an allocator. I'm still optimistic this > > can be reasonbly simple, especially for the flash case (where > > fragmentation isn't such an issue as long as our blocks are reasonbly > > sized). For disk we may beed to be moderately clever. > > > > - We'll need a fsck to ensure our internal metadata is consistent. The > > good news is it'll just need to validate
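For the "write and maintain an allocator" item in the quoted proposal, the simplest possible starting point is a bitmap allocator whose state is persisted through the same KV transaction as the rest of the metadata, so an allocation commits atomically with the extent map that references it. The toy sketch below is just that - a sketch, not newstore code - with the device size as a placeholder.

/* One bit per block, first-fit scan; persistence is left to the KV store. */
#include <stdint.h>
#include <stdio.h>

#define NBLOCKS 1024            /* placeholder device size, in blocks */

static uint8_t bitmap[NBLOCKS / 8];

static int64_t alloc_block(void)
{
    for (int64_t b = 0; b < NBLOCKS; b++) {
        if (!(bitmap[b / 8] & (1u << (b % 8)))) {
            bitmap[b / 8] |= (1u << (b % 8));
            return b;           /* caller records b in its extent map */
        }
    }
    return -1;                  /* out of space */
}

static void free_block(int64_t b)
{
    bitmap[b / 8] &= ~(1u << (b % 8));
}

int main(void)
{
    int64_t a = alloc_block(), b = alloc_block();

    printf("allocated blocks %lld and %lld\n", (long long)a, (long long)b);
    free_block(a);
    printf("freed %lld; next allocation reuses it: %lld\n",
           (long long)a, (long long)alloc_block());
    return 0;
}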
what does ms_objecter do in OSD ?
Hello, I find that this messenger does not bind to any IP, so I do not know why we do that. Does anyone know what ms_objecter can do? Thanks a lot. -- 谦谦君子