As another data point, I recently worked with a modest dataset (about
10 TiB) where the same choice could be made between lots of small
databases and a small number of very large ones; there were two fields
that were natural partition keys, and they differed in scope by a few
orders of magnitude.

I much preferred the small-database variant, and I also used q=1 since
the small databases did not need sharding. The advantage here is that
databases that are rarely referenced get pushed entirely out of the
disk cache, leaving more space for hotter data. It also made it easier
to deal with the inherent issues of multi-tenancy: the scope chosen
for each database coincided with the scope of access for the
application, so granting read/write permission at the database level
worked out very well.
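For what it's worth, the setup sketched above boils down to two HTTP
calls per tenant: create the database with q=1, then lock it down via
the per-database _security object. A rough Python sketch (the node URL
and the "tenant_a"/"alice" names are placeholders, and I'm using the
"readers" field of _security as in current CouchDB):

```python
# Sketch only: one unsharded (q=1) database per tenant, secured with
# CouchDB's /<db>/_security object. COUCH is an assumed local node.
COUCH = "http://localhost:5984"

def create_db_url(name, q=1):
    """URL that creates an unsharded (q=1) database on a BigCouch node."""
    return "%s/%s?q=%d" % (COUCH, name, q)

def security_doc(readers):
    """_security body granting access only to the named users.

    Non-admin users not listed here get no access to the database.
    """
    return {"admins": {"names": [], "roles": []},
            "readers": {"names": list(readers), "roles": []}}
```

In practice you would PUT create_db_url("tenant_a") to create the
database, then PUT the JSON-encoded security_doc(["alice"]) to
/tenant_a/_security.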

Best,
B.

On Tue, Dec 7, 2010 at 3:36 PM, Adam Kocoloski <[email protected]> wrote:
> On Dec 7, 2010, at 1:18 AM, Matt Adams wrote:
>
>> Hi folks,
>>
>> I am writing with regards to best practices for scaling and the relative 
>> impacts of choosing to use many small databases vs. one (or more) very large 
>> databases.
>>
>> Given the scenario with which I am working my original intent was to use 
>> many small databases.  In this situation users either need access to an 
>> entire database or not at all so the native CouchDB access permissions 
>> and/or a simple proxy would work quite well to secure data without the need 
>> for a more complicated authentication filter.  This also means that 
>> replication is an either/or thing (I would not need to worry about partial 
>> replication of databases).  There are other reasons why I lean towards many 
>> small databases but these are probably the primary ones (i.e., many smaller 
>> databases are simpler for me to implement for the purposes of getting 
>> CouchDB into play).
>>
>> In this scenario most of the databases would be quite small (in the <1GB 
>> range) so we're not dealing with large data sets and the ratio of users to 
>> databases is also fairly low.
>>
>> If users were to instead share one very large database (solely for the 
>> purpose of making things easier to cluster) they would usually only be 
>> accessing a very small portion of the database (e.g., a lot of the data 
>> would really belong to many small sets of users and not likely of interest 
>> to the user in question) and I would not want them to have any access to the 
>> remainder.
>>
>> Problems arise in my mind when I start thinking about many thousands of 
>> these small databases.  What are the clustering implications?  Am I going to 
>> be busier dealing with the reality of replicating thousands of smaller 
>> databases for fail-over than simply biting the bullet now and planning for a 
>> somewhat more complex setup?   Are things like BigCouch really more suited 
>> to clustering (fewer) very large databases or do they thrive in environments 
>> where there are many small databases?
>>
>>
>> Hopefully this will be enough information for anyone who wishes to chime in 
>> and give me some thoughts or other things to consider.  I am not looking for 
>> specific solutions at this point but instead trying to weigh the pros and 
>> cons of moving in a particular direction.
>>
>>
>> Thanks very much,
>>
>> Matt
>
> Hi Matt, in general I would recommend the many small databases option for 
> your situation.  The only downside I see is that there's a limit to how many 
> databases the server keeps open at any given time (default = 100).  After 
> that it begins closing the LRU one.  If all 100 open databases are in active 
> use (i.e. in the middle of serving a request) CouchDB will respond with an 
> error.  Only an issue if you anticipate massive concurrency.
>
> Triggering lots of individual replications can be a chore, but it's not too 
> bad to loop over _all_dbs and fire off a job for each one.
>
> BigCouch can deal with lots of small databases.  For instance, you might 
> decide that the databases don't need to be sharded at all.  In that case you 
> could create them with q=1 and BigCouch will evenly distribute the individual 
> database files across the nodes in the cluster.  Of course, you can still 
> have your data stored redundantly regardless of the value of q that you 
> choose.  Best,
>
> Adam
>
>

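P.S. Adam's point about looping over _all_dbs and firing off one
replication job per database is simple enough to sketch. The host
names and the push direction here are my assumptions, not anything
from the thread:

```python
import json

# Sketch: push every local database to a failover node, one POST to
# /_replicate per database. SOURCE/TARGET are assumed host names.
SOURCE = "http://couch-a:5984"
TARGET = "http://couch-b:5984"

def replication_doc(db):
    """Body for POST /_replicate, pushing one database to the target."""
    return {"source": "%s/%s" % (SOURCE, db),
            "target": "%s/%s" % (TARGET, db),
            "create_target": True}

def replicate_all(db_names, post):
    """Call post(url, body) once per database name; returns job count.

    db_names would come from GET /_all_dbs on the source node; post is
    whatever HTTP client you use (e.g. requests.post).
    """
    for name in db_names:
        post("%s/_replicate" % SOURCE, json.dumps(replication_doc(name)))
    return len(db_names)
```

With thousands of small databases you may want to throttle this loop
rather than fire every job at once.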