Hi good people on this list.
I was recently at ApacheCon Europe, where I followed the spirited and
spiritual Introduction to CouchDB by J. Chris Anderson and Jan Lehnardt.
I also browsed the CouchDB section on the ASF website. I don't know
Erlang, although I followed the brief tutorial linked to from the
website. It looked simple, which makes me suspect I missed quite a lot.
In fact, I have the impression that I missed a whole lot more than
Erlang, so I thank in advance whoever has the patience to read this and
provide some answers to my questions.
I very much like the "Relax" motto.
What I am mainly still trying to figure out is whether CouchDB would be
an appropriate tool for the following.
We basically manage information and documents for other people, as an
ASP service. We provide various easy ways for companies to upload their
electronic documents of all kinds to a dedicated Internet server. We
then process these documents à la Tika (but not with Tika) to extract
meta-data and content, and index them automatically. On the one side we
store the meta-data and text content in a search engine à la Lucene (but
not Lucene); on the other side we store the original electronic document
in a special passive file structure that we developed, which has proven
capable of reliably storing a few million documents so far. In that file
structure, each document is identified by a unique "logical number",
which we store along with the meta-data in the search engine. (So far in
our case, once a document is stored, it never changes.)
Then we provide means for the customer to search and find their
documents through a web interface to the search engine, and to retrieve
the corresponding original documents.
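To make the contract described above concrete, here is a toy in-memory
sketch of it. It is only an illustration, not our actual implementation:
the class name, method names and sample meta-data are invented for this
example, and a plain dict stands in for the search engine.

```python
import itertools


class WriteOnceStore:
    """Toy stand-in for a passive store: blobs are added once, never changed."""

    def __init__(self):
        self._blobs = {}                        # logical number -> original bytes
        self._next_number = itertools.count(1)  # source of unique logical numbers

    def put(self, blob: bytes) -> int:
        """Store a blob and return the logical number assigned to it."""
        number = next(self._next_number)
        self._blobs[number] = bytes(blob)
        return number

    def get(self, number: int) -> bytes:
        """Return the original blob for a logical number (KeyError if unknown)."""
        return self._blobs[number]


store = WriteOnceStore()
search_index = {}  # stand-in for the search engine: logical number -> meta-data

# Ingest: store the original, then index its meta-data under the same number.
number = store.put(b"%PDF-1.4 example bytes")
search_index[number] = {"title": "Invoice March", "type": "application/pdf"}

# Search first, then retrieve the original through its logical number.
hits = [n for n, meta in search_index.items() if "Invoice" in meta["title"]]
original = store.get(hits[0])
```

The point of the sketch is that the blob store only ever needs two
operations, "put" and "get by key"; everything else lives in the index.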
It works well and is very reliable, but we are slowly running into a
management issue due to the volume of original electronic documents,
which keeps increasing. That is because our customers never throw away
old documents, and they give us ever more varied data to handle.
So we are concerned about the increasing volumes to back up, and even
more about the volumes to restore should something go seriously wrong.
All of the above is to explain that when we ourselves talk about
"documents", we mean on the one hand a searchable index (which works
very well, takes comparatively little space and which we do not want to
change for now), and on the other hand the corresponding stored
electronic documents (blobs), each identified and accessible via one
single "key".
I would be interested to understand if CouchDB would provide a reliable
and efficient replacement for our self-developed and self-maintained
storage structure.
The first question is whether the notion of "document" in CouchDB is
compatible with our own notion of document. I mean, could I define in
CouchDB a document as consisting of a single text "key" (a globally
unique document-id), plus a "blob" of indeterminate size (e.g. an MS-Word
document, or a PDF, or an image, or a CAD drawing, or an email or
whatever)? And would I then be able to generate, for example, a search
result webpage where, next to a document summary, I can display a PDF
icon which, when clicked, retrieves the corresponding electronic document
from CouchDB and sends it to the browser?
Another aspect that seems particularly interesting - if I got this right
- is the self-replicating nature of CouchDB, which would allow us to
define, say, 3 "repositories" located in different places, which would
automatically synchronise themselves. Yes?
I also seem to have understood that if one of these repositories
suddenly became unavailable because the big one just hit, a document
request would automatically be satisfied by the next available one in
line. Yes?
Would there be some way in CouchDB to store one such document in some
logical group containing the original version (say OpenOffice text),
along with its PDF/A version (which we generate when the document is
originally stored) and an image of the first page (ditto), in such
a way that by using the "main key" plus some additional parameter, I can
retrieve whichever version I need at that moment?
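To illustrate what I mean by "main key plus some additional parameter",
here is a minimal sketch of the kind of addressing I have in mind. The
URL layout (database, then document id, then rendition name), the host
and port, and all the names used are assumptions made for this example;
whether CouchDB actually addresses things this way is part of the
question.

```python
from urllib.parse import quote


def rendition_url(base_url: str, database: str, doc_id: str, rendition: str) -> str:
    """Build a URL from the main key plus a rendition name.

    The /database/doc_id/rendition layout is an assumption for this sketch.
    Each path segment is percent-encoded so that keys with spaces or slashes
    cannot change the structure of the path.
    """
    parts = [quote(part, safe="") for part in (database, doc_id, rendition)]
    return base_url.rstrip("/") + "/" + "/".join(parts)


# One logical group, three renditions, all reached through the same main key.
for name in ("original.odt", "pdfa.pdf", "first-page.png"):
    print(rendition_url("http://localhost:5984", "archive", "doc-000123", name))
```

The search engine would then only need to store the main key; the
web front-end picks the rendition name depending on which icon was clicked.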
Would I need to become proficient in Erlang before I can store a new
document or retrieve a stored one, or can this be done using some simple
call from some interface routine in any programming language?
(For example, a click on a PDF icon generates a call to a mod_perl
add-on Apache module, which then retrieves the document from CouchDB and
returns it to the browser. Perl can "do JSON" or "do XML", for example.)
To generalise the above question, for what kind of action would I
necessarily need to know Erlang?
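As an illustration of the "simple call from some interface routine" I
am hoping for, here is a minimal sketch that fetches a blob over plain
HTTP and returns what a web handler would need to relay it to the
browser. It assumes the stored document can be reached with an ordinary
GET on some URL; that assumption, the function name and the fallback
content type are mine. Python is used only for brevity; the same request
could be made from Perl.

```python
from urllib.request import urlopen


def fetch_blob(url: str, opener=urlopen):
    """Fetch a stored blob over HTTP and return (content_type, body).

    `opener` defaults to urllib's urlopen; it is a parameter only so the
    function can be exercised without a running server.
    """
    with opener(url) as response:
        # Fall back to a generic binary type if the server sends no header.
        content_type = response.headers.get("Content-Type", "application/octet-stream")
        return content_type, response.read()
```

A handler behind the PDF icon would call fetch_blob() with the document's
URL, then send the returned content type and bytes back to the browser
unchanged.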
I'll no doubt have more questions if the answers to the above do not
discourage me, but I promise they will be shorter.
Thanks in advance.