On Dec 7, 2010, at 1:18 AM, Matt Adams wrote:

> Hi folks,
>
> I am writing with regards to best practices for scaling and the relative
> impacts of choosing to use many small databases vs. one (or more) very large
> databases.
>
> Given the scenario with which I am working my original intent was to use many
> small databases. In this situation users either need access to an entire
> database or not at all so the native CouchDB access permissions and/or a
> simple proxy would work quite well to secure data without the need for a more
> complicated authentication filter. This also means that replication is an
> either/or thing (I would not need to worry about partial replication of
> databases). There are other reasons why I lean towards many small databases
> but these are probably the primary ones (i.e., many smaller databases are
> simpler for me to implement for the purposes of getting CouchDB into play).
>
> In this scenario most of the databases would be quite small (in the <1GB
> range) so we're not dealing with large data sets and the ratio of users to
> databases is also fairly low.
>
> If users were to instead share one very large database (solely for the
> purpose of making things easier to cluster) they would usually only be
> accessing a very small portion of the database (e.g., a lot of the data would
> really belong to many small sets of users and not likely of interest to the
> user in question) and I would not want them to have any access to the
> remainder.
>
> Problems arise in my mind when I start thinking about many thousands of these
> small databases. What are the clustering implications? Am I going to be
> busier dealing with the reality of replicating thousands of smaller databases
> for fail-over than simply biting the bullet now and planning for a somewhat
> more complex setup? Are things like BigCouch really more suited to
> clustering (fewer) very large databases or do they thrive in environments
> where there are many small databases?
>
> Hopefully this will be enough information for anyone who wishes to chime in
> and give me some thoughts or other things to consider. I am not looking for
> specific solutions at this point but instead trying to weigh the pros and
> cons of moving in a particular direction.
>
> Thanks very much,
>
> Matt

Hi Matt, in general I would recommend the many small databases option for your situation. The only downside I see is that there's a limit to how many databases the server keeps open at any given time (default = 100). Once that limit is reached, it begins closing the least recently used (LRU) one. If all 100 open databases are in active use (i.e. in the middle of serving a request), CouchDB will respond with an error. That's only an issue if you anticipate massive concurrency.

Triggering lots of individual replications can be a chore, but it's not too bad to loop over _all_dbs and fire off a job for each one.

BigCouch can deal with lots of small databases. For instance, you might decide that the databases don't need to be sharded at all. In that case you could create them with q=1, and BigCouch will evenly distribute the individual database files across the nodes in the cluster. Of course, you can still have your data stored redundantly regardless of the value of q that you choose.

Best,

Adam
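
For anyone finding this thread later, the "loop over _all_dbs and fire off a job for each one" idea from the reply above might look roughly like the sketch below. It is not taken from the thread: the host URLs are hypothetical placeholders, the helper names are made up for illustration, and it assumes an unauthenticated CouchDB reachable over HTTP, using only the standard `_all_dbs` and `_replicate` endpoints. Skipping databases whose names start with an underscore and setting `create_target` are choices made here, not something the reply prescribes.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical hosts, for illustration only.
SOURCE = "http://source.example.com:5984"
TARGET = "http://backup.example.com:5984"


def list_databases(source):
    """Return the list of database names from GET {source}/_all_dbs."""
    with urllib.request.urlopen(f"{source}/_all_dbs") as resp:
        return json.load(resp)


def replication_jobs(db_names, source, target):
    """Build one _replicate request body per non-system database."""
    jobs = []
    for name in db_names:
        if name.startswith("_"):
            # Assumption: system databases such as _users are handled separately.
            continue
        # Database names may contain '/', so encode them for use in a URL.
        encoded = urllib.parse.quote(name, safe="")
        jobs.append({
            "source": f"{source}/{encoded}",
            "target": f"{target}/{encoded}",
            "create_target": True,  # create the target database if it is missing
        })
    return jobs


def start_replication(source, job):
    """POST a single job to {source}/_replicate and return the parsed reply."""
    req = urllib.request.Request(
        f"{source}/_replicate",
        data=json.dumps(job).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def replicate_all(source=SOURCE, target=TARGET):
    """Fire off one replication per database, one after another."""
    for job in replication_jobs(list_databases(source), source, target):
        print(job["source"], "->", start_replication(source, job))
```

Calling `replicate_all()` runs the jobs sequentially, which keeps the number of databases in use at once low relative to the open-database limit mentioned in the reply. With many thousands of databases you would probably want error handling and some throttling on top of this.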
