Hello, 

I'm working to implement a document warehouse for about 15 million documents.  
These documents range between 10 KB and 500 KB (legacy archived PDFs).  
Currently we do this by maintaining a MySQL database, with the documents stored 
across a variety of servers (about 6 TB in total).  Most of the problems we 
encounter are a) backups and b) physical access to the documents, as they are 
on a private network.  Since not much changes, backups aren't really that much 
of a problem (but restoring is very slow).  We are now looking to add documents 
regularly (about 10k per week), so we are looking to implement something new, 
or at least something more useful.

So we thought about using Amazon S3 for storage, but these documents fall under 
HIPAA constraints, so we have decided to do this in house.  

Looking at CouchDB, it pretty much does what we are looking to do.  We really 
only want to store a document and maybe some very basic metadata (which we 
currently do by having both a PDF and a metadata file).  Implementation doesn't 
look like a problem with the documented API.

So, the questions.

I would like to break this down into multiple servers and incorporate 
replication at the same time.  The documentation says that pull is recommended 
over push but doesn't mention why.  Does push replication require the slave (or 
other node) to acknowledge the PUT/POST request as completed?  

If we choose pull replication instead of push, I assume this is something we 
will need to schedule via crontab, or does it have a background process that 
constantly syncs?  The API looks like just a single request to the replication 
endpoint.
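For what it's worth, here is a minimal sketch of what a cron-driven pull could 
look like, assuming CouchDB's POST /_replicate endpoint, a local database named 
"docs", and a hypothetical remote host; a "pull" just names the remote node as 
the source:

```shell
#!/bin/sh
# Sketch: periodic pull replication triggered from cron.
# The hostnames and the database name "docs" are assumptions.

REMOTE=http://server-b:5984/docs

# Build the JSON body telling the local node to pull from the remote.
pull_body() {
  printf '{"source":"%s","target":"docs"}' "$1"
}

# A crontab entry to run this every 5 minutes might look like:
# */5 * * * * curl -s -X POST http://localhost:5984/_replicate \
#     -H 'Content-Type: application/json' \
#     -d "$(pull_body http://server-b:5984/docs)"
pull_body "$REMOTE"
```

A one-shot (non-continuous) replication like this returns when it finishes, so 
cron runs won't silently pile up work beyond the curl timeout.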

Either way, here is what we are looking to do at this time.  At two separate 
locations we will have multiple servers, set up in a master/master 
configuration.  We should not run into any conflicts, as updates are not 
allowed.  IDs are unique (MD5 checksum and some other unique information).  
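As an illustration of that ID scheme, here is a sketch in shell; the "other 
unique information" is assumed here to be the file's byte count, purely for 
the example:

```shell
#!/bin/sh
# Sketch: derive a document ID from a file's MD5 checksum plus its size.
# Using the size as the extra uniqueness component is an assumption for
# illustration only.
doc_id() {
  sum=$(md5sum "$1" | cut -d' ' -f1)
  size=$(wc -c < "$1" | tr -d ' ')
  printf '%s-%s' "$sum" "$size"
}
```

Since the ID is derived from the content, re-submitting the same PDF yields 
the same ID, which also gives you de-duplication for free.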

We wanted to use 4 servers at each location, partly because each server has 
4 TB of space (actually 3 TB after RAID 5).  Each server will hold files based 
upon the first digit of the MD5 checksum (0-3 on server A, 4-7 on server B, 
8-A on server C, and B-F on server D).  We were thinking of using Apache's URL 
rewriting to proxy each request to the proper server.  This should work for 
GET, PUT, and POST.
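For the proxy layer, a mod_rewrite sketch along these lines might work, 
assuming CouchDB nodes listening on port 5984, a database named "docs", and 
hypothetical backend hostnames (requires mod_rewrite and mod_proxy to be 
loaded):

```apache
# Route each document request to the node owning its MD5 prefix:
# 0-3 -> server A, 4-7 -> server B, 8-A -> server C, B-F -> server D.
# The [P] flag proxies the request via mod_proxy; rewrite rules apply
# to GET, PUT, and POST alike.
RewriteEngine On
RewriteRule "^/docs/([0-3].*)$"    "http://server-a:5984/docs/$1" [P]
RewriteRule "^/docs/([4-7].*)$"    "http://server-b:5984/docs/$1" [P]
RewriteRule "^/docs/([89aA].*)$"   "http://server-c:5984/docs/$1" [P]
RewriteRule "^/docs/([b-fB-F].*)$" "http://server-d:5984/docs/$1" [P]
```

One thing to watch: MD5 hex digests are usually emitted lowercase, so the 
character classes above accept both cases to be safe.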

We will also have the backup servers at the second location (which will be the 
active servers for that location) using the same scheme.

What would be most useful is to be able to ensure that, before a commit is 
accepted on a server, we can guarantee that it has been replicated to a second 
box.
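As far as I know CouchDB has no built-in synchronous multi-node commit, so the 
closest approximation is an application-level wrapper: write to the primary, 
trigger a one-shot (non-continuous) replication, which blocks until it 
completes, and only then acknowledge the commit.  A sketch, with hypothetical 
hosts server-a/server-b and a database named "docs":

```shell
#!/bin/sh
# Sketch: "confirm replication before acking" at the application level.
# Hostnames and the "docs" database name are assumptions.

PRIMARY=http://server-a:5984
REPLICA=http://server-b:5984

# Pure helper: build the URL for a document on a given node.
doc_url() { printf '%s/docs/%s' "$1" "$2"; }

store_and_replicate() {
  id=$1; file=$2
  # 1. Write the document to the primary node.
  curl -sf -X PUT "$(doc_url "$PRIMARY" "$id")" \
       --data-binary "@$file" || return 1
  # 2. Push-replicate to the replica; a non-continuous POST /_replicate
  #    does not return until the replication has finished.
  curl -sf -X POST "$PRIMARY/_replicate" \
       -H 'Content-Type: application/json' \
       -d "{\"source\":\"docs\",\"target\":\"$REPLICA/docs\"}" || return 1
  # 3. Verify the document is now readable on the replica before acking.
  curl -sf -o /dev/null "$(doc_url "$REPLICA" "$id")"
}
```

Usage would be something like `store_and_replicate "$id" document.pdf || fail`, 
with the commit only reported to the caller once all three steps succeed.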

Any ideas or suggestions on that?
