Re: RFC: mongo _id fields in the multi-environment juju server world
On 4 July 2014 02:01, Tim Penhey tim.pen...@canonical.com wrote:
> Hi folks,
>
> Very shortly we are going to start on the work to be able to store multiple environments within a single mongo database.
>
> Most of our current entities are stored in the database with their name or id fields serialized to bson as the _id field.
>
> As far as I know (and I may be wrong), if you are adding a document to the mongo collection, and you do not specify an _id field, mongo will create a unique value for you.
>
> In our new world, things that used to be unique, like machines, services, units etc, are now only unique when paired with the environment id.
>
> It seems we have a number of options here.
>
> 1. change the _id field to be a composed field where it is the concatenation of the environment id and the existing id or name field. If we do take this approach, I strongly recommend having the fields that make up the key be available by themselves elsewhere in the document structure.
>
> 2. let mongo create the _id field, and we ensure uniqueness over the pair of values with a unique index. One thing I am unsure about with this approach is how we currently do our insertion checks, where we do a "document does not exist" check. We wouldn't be able to do this as a transaction assertion as it can only check for _id values. How fast are the indices updated? Can having a unique index for a document work for us? I'm hoping it can if this is the way to go.
>
> 3. use a composite _id field such that the document may start like this:
> { _id: { env_uuid: blah, name: foo}, ...
> This gives the benefit of existence checks, and real names for the _id parts.
>
> Thoughts? Opinions? Recommendations?

There is another possibility: we could just use a different collection name prefix for each environment. There is no hard limit on the number of collections in mongo (see http://docs.mongodb.org/manual/reference/limits/).

That is, instead of using the current hard-coded collection names (machines, relations, etc) we'd prefix them with the environment id; either the UUID or an id stored elsewhere. This would entail very few changes to the existing code.

If we think that most operations on an environment will continue to be specific to that environment, I think this has a few advantages. Specifically, it minimises cross-talk between environments:

- one large environment with heavy traffic will not unduly influence the others.
- for a small environment, table indexes remain small and lookups fast even though the total number of entries might be huge.
- each environment could have a separate mongo txn log, so one busy environment that's constantly adding transactions will not necessarily slow down all the others. There is, in general, no need for sequential consistency between environments.
- database isolation between environments is an advantage when things go wrong - it's easier to fix or delete individual environments if their tables are isolated from one another.

The disadvantage is that you can't perform transactions that span multiple environments. I think that's something we probably would not want to do much anyway, but YMMV.

I suggest that, at the least, taking this approach would be a quick road to making the state work with multiple environments. It would not preclude a move to changing to use composite keys in the future.

cheers,
rog.

--
Juju-dev mailing list
Juju-dev@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/juju-dev
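The collection-prefix idea above could be sketched roughly as follows. This is only an illustration in Python, not juju code: the function names, the "_" separator, and the short list of base collections are all assumptions made for the example.

```python
# Collection names taken from the thread; juju's real list is longer.
BASE_COLLECTIONS = ("machines", "relations", "services", "units")

SEPARATOR = "_"  # assumed; nothing in the thread fixes the separator


def env_collection(env_id: str, base: str) -> str:
    """Return the per-environment name for one base collection."""
    if not env_id or "$" in env_id or "\0" in env_id:
        # mongo rejects empty collection names and names containing '$' or NUL
        raise ValueError(f"unsuitable environment id: {env_id!r}")
    return f"{env_id}{SEPARATOR}{base}"


def env_collections(env_id: str) -> dict:
    """Map each base collection name to its per-environment name."""
    return {base: env_collection(env_id, base) for base in BASE_COLLECTIONS}
```

Code that currently opens "machines" would instead look up env_collections(env_id)["machines"], which is why the change to existing code would stay small.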
Re: RFC: mongo _id fields in the multi-environment juju server world
My expectation is that:

1) We certainly need the environment UUID as a separate field for the shard key.

2) We *also* need the environment UUID as an _id prefix to keep our watchers sane.

2a) If we had separate collections per environment, we wouldn't; but AIUI, scaling mongo by adding collections tends to end badly (I don't have direct experience here myself; but it does indeed seem that we'd start consuming namespaces at a pretty terrifying rate, and I'm inclined to trust those who have done this and failed.)

2b) I'd ordinarily dislike the duplication across the _id and uuid fields, but there's a clear reason for doing so here, so I'm not going to complain. I *will* continue to complain about documents that duplicate info across fields in order to save a few runtime microseconds here and there ;).

If someone with direct experience can chip in reassuringly I *might* be prepared to back off on the N-collections-per-environment thing, but I'm certainly not willing to take it so far as to separate the txn logs and thus discard consistency across environments: I think there will certainly be references between individual hosted environments and the initial environment.

So, in short, I think Tim's (1) is the way to go. But *please* don't duplicate data that doesn't have to be -- the UUID is fine, the name is not. If we really end up spending a lot of time extracting names from _id fields we can cache them in the state documents -- but we don't need redundant copies in the DB, and we *really* don't need to make our lives harder by giving our data unnecessary opportunities for inconsistency.

Cheers
William

On Fri, Jul 4, 2014 at 6:42 AM, John Meinel j...@arbash-meinel.com wrote:
> According to the mongo docs:
> http://docs.mongodb.org/manual/core/document/#record-documents
>
> "The field name _id is reserved for use as a primary key; its value must be unique in the collection, is immutable, and may be of any type other than an array."
>
> That makes it sound like we *could* use an object for the _id field and do
> _id = {env_uuid:, name:}
>
> Though I thought the purpose of doing something like that is to allow efficient sharding in a multi-environment world.
>
> Looking here: http://docs.mongodb.org/manual/core/sharding-shard-key/
> The shard key must be indexed (which is just fine for us w/ the primary _id field or with any other field on the documents), and "The index on the shard key *cannot* be a multikey index" (http://docs.mongodb.org/manual/core/index-multikey/#index-type-multikey).
>
> I don't really know what that means in the case of wanting to shard based on an object instead of a simple string, but it does sound like it might be a problem.
>
> Anyway, for purposes of being *unique* we may need to put environ uuid in there, but for the purposes of sharding we could just put it on another field and index that field.
>
> John
> =:-
>
> On Fri, Jul 4, 2014 at 5:01 AM, Tim Penhey tim.pen...@canonical.com wrote:
>> ...
>> Thoughts? Opinions? Recommendations?
>>
>> BTW, I think that if we can make 3 work, then it is the best approach.
>>
>> Tim
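To make William's preference concrete (option 1, with the environment UUID duplicated as its own field but the name kept only inside the _id), here is a minimal sketch using plain Python dicts as stand-ins for mongo documents. The ":" separator, the env_uuid field name, and every function name are assumptions for illustration only.

```python
SEP = ":"  # assumed separator between environment UUID and entity name


def compose_id(env_uuid: str, name: str) -> str:
    """Build the composed _id: environment UUID first, then the old id/name."""
    if not env_uuid or SEP in env_uuid:
        raise ValueError(f"bad environment uuid: {env_uuid!r}")
    return f"{env_uuid}{SEP}{name}"


def split_id(doc_id: str) -> tuple:
    """Split a composed _id back into (env_uuid, name)."""
    env_uuid, sep, name = doc_id.partition(SEP)
    if not sep:
        raise ValueError(f"not a composed id: {doc_id!r}")
    return env_uuid, name


def new_doc(env_uuid: str, name: str, **fields) -> dict:
    """Document with the UUID as a separate field (shard key), name only in _id."""
    return {**fields, "_id": compose_id(env_uuid, name), "env_uuid": env_uuid}


def doc_name(doc: dict) -> str:
    """Recover the entity name from the _id instead of storing it twice."""
    return split_id(doc["_id"])[1]
```

Because the UUID never contains the separator, splitting on the first ":" stays unambiguous even when the name itself contains one.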
Re: RFC: mongo _id fields in the multi-environment juju server world
I would think that if we have to put environ-uuid into the _id field, then we wouldn't need yet-another field to shard on (at least if we put it at the beginning of the field).

John
=:-

On Fri, Jul 4, 2014 at 2:24 PM, William Reade william.re...@canonical.com wrote:
> My expectation is that:
>
> 1) We certainly need the environment UUID as a separate field for the shard key.
>
> 2) We *also* need the environment UUID as an _id prefix to keep our watchers sane.
>
> ...
Re: RFC: mongo _id fields in the multi-environment juju server world
On 4 July 2014 11:24, William Reade william.re...@canonical.com wrote:
> My expectation is that:
>
> 1) We certainly need the environment UUID as a separate field for the shard key.
>
> 2) We *also* need the environment UUID as an _id prefix to keep our watchers sane.
>
> 2a) If we had separate collections per environment, we wouldn't; but AIUI, scaling mongo by adding collections tends to end badly (I don't have direct experience here myself; but it does indeed seem that we'd start consuming namespaces at a pretty terrifying rate, and I'm inclined to trust those who have done this and failed.)
>
> 2b) I'd ordinarily dislike the duplication across the _id and uuid fields, but there's a clear reason for doing so here, so I'm not going to complain. I *will* continue to complain about documents that duplicate info across fields in order to save a few runtime microseconds here and there ;).
>
> If someone with direct experience can chip in reassuringly I *might* be prepared to back off on the N-collections-per-environment thing, but I'm certainly not willing to take it so far as to separate the txn logs and thus discard consistency across environments: I think there will certainly be references between individual hosted environments and the initial environment.

It can be a great advantage when scaling to be able to partition the transactions across different parts of the database. If we want this to be able to scale, I think we *have* to make it work without requiring transactions across environments. There is no way that we can scale as far as we'd like to by using a single mongo replica set for all environments.

This talk is about mysql, not mongo, but I believe some of the lessons are relevant to us.
https://www.youtube.com/watch?v=qATTTSg6zXk

By my calculations, with a maximum-sized namespace file, a single mongo should be able to support over 9 environments using a separate collection-set for each environment. From my recollection of juju performance, we will be lucky to scale a single mongo up to 1000 environments, let alone 9, so I suspect we'd never get remotely that far.

Perhaps there are other disadvantages from having many collections though. It would be nice if we could make this crucial architectural decision in the light of some actual measurements. We may all have some kind of gut feeling for how this might perform, but without actually measuring, we just don't know.

As usual, my first reaction is KISS.

cheers,
rog.
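The namespace arithmetic behind an estimate like the one above can be laid out as a short calculation. The 628-byte entry size and the 2047 MB maximum namespace file are the figures I recall from the MongoDB limits page of that era; the number of namespaces (collections plus indexes) one environment would consume is not given in the thread, so it is left as a parameter. Treat the result as an upper bound under those assumptions, not a measurement.

```python
NAMESPACE_ENTRY_BYTES = 628                    # assumed: size of one namespace entry
MAX_NAMESPACE_FILE_BYTES = 2047 * 1024 * 1024  # assumed: largest allowed .ns file


def max_namespaces(ns_file_bytes: int = MAX_NAMESPACE_FILE_BYTES) -> int:
    """Upper bound on namespaces: file size divided by the entry size."""
    return ns_file_bytes // NAMESPACE_ENTRY_BYTES


def max_environments(namespaces_per_env: int,
                     ns_file_bytes: int = MAX_NAMESPACE_FILE_BYTES) -> int:
    """Upper bound on environments if each one consumes namespaces_per_env.

    Every collection and every index counts as one namespace, so
    namespaces_per_env has to include both.
    """
    if namespaces_per_env <= 0:
        raise ValueError("namespaces_per_env must be positive")
    return max_namespaces(ns_file_bytes) // namespaces_per_env
```

Plugging in a measured count of juju's collections and indexes per environment would turn this into a concrete figure to compare against the performance ceiling mentioned above.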
Re: RFC: mongo _id fields in the multi-environment juju server world
...

> It can be a great advantage when scaling to be able to partition the transactions across different parts of the database. If we want this to be able to scale, I think we *have* to make it work without requiring transactions across environments. There is no way that we can scale as far as we'd like to by using a single mongo replica set for all environments.

You generally shard across replica sets, and if you shard by environ uuid (say by putting it as a prefix on all the _ids) then each of those is a different write master. It seems conceptually easier than trying to route to a different collection set. Certainly sharding will be easier to rebalance (I think) than moving the collections around.

John
=:-
Re: RFC: mongo _id fields in the multi-environment juju server world
On Fri, Jul 4, 2014 at 6:56 AM, Gustavo Niemeyer gust...@niemeyer.net wrote:
>> 1. change the _id field to be a composed field where it is the concatenation of the environment id and the existing id or name field. If we do take this approach, I strongly recommend having the fields that make up the key be available by themselves elsewhere in the document structure.
>
> I'd go with this, including your suggestion of splitting the data apart in proper fields. Sounds straightforward and comfortable to deal with.

I'd be interested in trying this approach with Actions. We've gone back and forth between encoding units *only* in the _id or *also* in the document. Both have pros and cons, but it seems to me that a composite _id would address most of the cons of each approach.

I'm also interested in figuring out how the watchers will work in this approach. The Actions watcher is a StringsWatcher, and the .Changes() are []string. I'm assuming that will have to become a more specialised watcher where .Changes() returns a list of objects representing the composite key? Also, how the watcher detects relevant events might have to be adjusted somewhat.

--
John Weldon
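One possible answer to the watcher question, sketched in Python rather than Go: if the _id is a string with the environment UUID as a prefix, a watcher could filter changed ids by that prefix and strip it, so that consumers keep receiving plain strings. The function name and the ":" separator are hypothetical; this is not how juju's StringsWatcher is implemented.

```python
def env_changes(changed_ids, env_uuid, sep=":"):
    """Keep only ids belonging to env_uuid and return them without the prefix.

    changed_ids is any iterable of composed _id strings, as a change feed
    covering all environments might deliver them.
    """
    prefix = env_uuid + sep
    return [doc_id[len(prefix):] for doc_id in changed_ids
            if doc_id.startswith(prefix)]
```

Including the separator in the prefix matters: without it, changes for an environment whose UUID merely starts with the same characters would leak through.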
Re: RFC: mongo _id fields in the multi-environment juju server world
On Fri, Jul 4, 2014 at 6:01 AM, roger peppe roger.pe...@canonical.com wrote:
> There is another possibility: we could just use a different collection name prefix for each environment. There is no hard limit on the number of collections in mongo (see http://docs.mongodb.org/manual/reference/limits/).

For sharding and for good space management in general it's better to have data in a collection that gets automatically managed by the cluster. It's also much simpler to deal with in general, even if it does require code changes to get started.

> - for a small environment, table indexes remain small and lookups fast even though the total number of entries might be huge.

Same as above: when it gets _huge_ you need sharding either way, and it's easier and more efficient to manage a single collection than 10k.

> - each environment could have a separate mongo txn log, so one busy environment that's constantly adding transactions will not necessarily slow down all the others. There is, in general, no need for sequential consistency between environments.

With txn there's no sequential consistency even within the same environment, if you're touching different documents.

> - database isolation between environments is an advantage when things go wrong - it's easier to fix or delete individual environments if their tables are isolated from one another.

Sure, it prevents bad mistakes caused by not taking the environment id into consideration, but deleting foo:* is just as easy.

> I suggest that, at the least, taking this approach would be a quick road to making the state work with multiple environments. It would not preclude a move to changing to use composite keys in the future.

We already know it's a bad idea today. Let's please not make that mistake.

gustavo @ http://niemeyer.net
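The "deleting foo:*" remark can be illustrated with a small sketch. A plain dict keyed by _id stands in for a mongo collection here; against a real database this would presumably be a prefix query on _id instead. The function name and the ":" separator are assumptions.

```python
def delete_environment(collection: dict, env_uuid: str, sep: str = ":") -> int:
    """Remove every document whose _id starts with '<env_uuid><sep>'.

    collection maps composed _id strings to documents. Returns the number
    of documents removed.
    """
    prefix = env_uuid + sep
    # Collect the keys first so the dict is not mutated while iterating.
    doomed = [doc_id for doc_id in collection if doc_id.startswith(prefix)]
    for doc_id in doomed:
        del collection[doc_id]
    return len(doomed)
```

The same call would have to be repeated for each shared collection (machines, services, units, and so on), which is the price of not having one collection set per environment.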
Re: RFC: mongo _id fields in the multi-environment juju server world
On Fri, Jul 4, 2014 at 10:32 AM, roger peppe roger.pe...@canonical.com wrote:
> It won't be possible to shard the transaction log.

Why not?

> The thing I'm trying to get across is: until we know one way or another, I believe it would be better to choose the (much) simpler option and use the (potential weeks of) dev time for other things.

We know it's a bad idea. Besides everything else I mentioned, there are _huge_ MongoDB databases out there that depend on sharding to scale; we're talking hundreds of machines. It seems very naive to go with a model that loses the benefits of all the lessons the MongoDB development team learned with those use cases, and the work they have done to support them well.

We have been there in Canonical. Ask folks about the CouchDB story.

gustavo @ http://niemeyer.net