Not-even-yet-newbie question

André Warnier Thu, 16 Apr 2009 17:02:30 -0700

Hi good people on this list.

I was recently at ApacheCON Europe, where I followed the spirited and spiritual Introduction to CouchDB by J. Chris Anderson and Jan Lehnardt. I also browsed the CouchDB section on the ASF website. I don't know Erlang, although I followed the brief tutorial linked to from the website. It looked simple, which makes me suspect I missed quite a lot.

In fact, I have the impression that I missed a whole lot more than Erlang, so I thank in advance whoever has the patience to read this and provide some answers to my questions.


I very much like the "Relax" motto.

What I am mainly still trying to figure out is whether CouchDB would be an appropriate tool for the following.

We basically manage information and documents for other people, as an ASP service. We provide various easy ways for companies to upload their electronic documents of all kinds to a dedicated Internet server. We then process these documents à la Tika (but not with Tika) to extract meta-data and content, automatically index them, and store on the one side the meta-data and text content in a search engine à la Lucene (but not Lucene), and on the other side the original electronic document in a special passive file structure that we developed, and which has proven capable of reliably storing a few million documents so far. In that file structure, each document is identified by a unique "logical number", which we store along with the meta-data in the search engine. (So far in our case, once a document is stored, it never changes.) Then we provide means for the customers to search and find their documents through a web interface to the search engine, and to retrieve the corresponding original documents.

It works well and is very reliable, but we are slowly getting into a management issue due to the volume of original electronic documents, which always increases. That is because our customers never throw away old documents, and they give us ever more varied data to handle. So we are concerned about the increasing volumes to back up, and even more about the volumes to restore in case something goes seriously wrong.

All of the above is to indicate that when we ourselves talk about "documents", we talk on the one hand about a searchable index (which works very well, takes comparatively little space, and which we do not want to change for now), and on the other hand about the corresponding stored electronic documents (blobs), each identified and accessible via one single "key".

I would be interested to understand if CouchDB would provide a reliable and efficient replacement for our self-developed and self-maintained storage structure.

The first question is whether the notion of "document" in CouchDB is compatible with our own notion of document. I mean, could I define in CouchDB a document as consisting of a single text "key" (a globally unique document-id), plus a "blob" of indeterminate size (e.g. an MS-Word document, or a PDF, or an image, or a CAD drawing, or an email or whatever)? And would I then be able to generate, for example, a search result webpage, where next to a document summary I can display a PDF icon, which when clicked retrieves the corresponding electronic document from CouchDB and sends it to the browser?

Another aspect that seems particularly interesting - if I got this right - is the self-replicating nature of CouchDB, which would allow us to define say 3 "repositories" located in different places, and which would automatically synchronise themselves. Yes?

I also seem to have understood that if one of these repositories suddenly became unavailable because the big one just hit, a document request would automatically be satisfied by the next available one in line. Yes?

Would there be some way in CouchDB to store one such document in some logical group containing the original version (say OpenOffice text), along with its PDF/A version (which we generate when the document is originally stored) and an image of the first page (ditto), in such a way that by using the "main key" plus some additional parameter, I can retrieve whichever version I need at the moment?
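To make this question concrete, here is a rough Python sketch of the addressing scheme I have in mind. It assumes that each version would be stored as a named attachment of one CouchDB document and fetched over HTTP at a URL of the form /database/doc-id/attachment-name. The database name "archive", the attachment names and the helper version_url are all made up for illustration; whether this is the recommended way to group versions is exactly what I am asking.

```python
from urllib.parse import quote

# Hypothetical mapping from the "additional parameter" to an attachment name.
# These names are illustrative only; nothing in CouchDB prescribes them.
VERSION_ATTACHMENTS = {
    "original": "original.odt",
    "pdfa": "archive.pdf",
    "preview": "page1.png",
}


def version_url(base_url, database, doc_id, version):
    """Build the URL of one stored version of a document.

    doc_id plays the role of our "logical number" (the main key);
    version is the additional parameter selecting which blob to fetch.
    """
    attachment = VERSION_ATTACHMENTS[version]  # KeyError for unknown versions
    # Percent-encode each path segment so that a key containing "/" stays
    # a single segment.
    parts = [quote(p, safe="") for p in (database, doc_id, attachment)]
    return base_url.rstrip("/") + "/" + "/".join(parts)


if __name__ == "__main__":
    print(version_url("http://localhost:5984", "archive", "doc-0001234", "pdfa"))
```

The PDF icon on a search result page would then simply link to (or proxy) the URL built for the "pdfa" version of the document's key.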

Would I need to become proficient in Erlang before I can store a new document or retrieve a stored one, or can this be done using some simple call from some interface routine in any programming language? (For example, a click on a PDF icon generates a call to a mod_perl add-on Apache module, which then retrieves the document from CouchDB and returns it to the browser. Perl can "do JSON" or "do XML", for instance.)
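As an illustration of the kind of "simple call" I mean, here is a minimal sketch in Python (the same idea would apply from Perl). It assumes, as I understood from the presentation, that CouchDB is driven over plain HTTP: a PUT to /database/doc-id/attachment-name stores a blob (with a rev parameter once the document already exists), and a GET on the same URL returns it. The helper names build_store_request and fetch are my own inventions, and the sketch only uses Python's standard urllib.

```python
import urllib.request


def build_store_request(doc_url, attachment_name, data, content_type, rev=None):
    """Prepare (but do not send) a PUT request that uploads one attachment.

    doc_url is the document URL, e.g. http://localhost:5984/archive/doc-1
    (an assumed local server and a hypothetical database name).
    """
    url = f"{doc_url}/{attachment_name}"
    if rev is not None:
        # Updating an existing document requires its current revision.
        url += f"?rev={rev}"
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Content-Type": content_type},
    )


def fetch(url):
    """GET a stored attachment; returns (content_type, body_bytes)."""
    with urllib.request.urlopen(url) as response:
        return response.headers.get("Content-Type"), response.read()


if __name__ == "__main__":
    request = build_store_request(
        "http://localhost:5984/archive/doc-1", "archive.pdf",
        b"%PDF-1.4 ...", "application/pdf",
    )
    print(request.get_method(), request.full_url)
    # Sending it would be: urllib.request.urlopen(request)
```

If something like this is all it takes, then the mod_perl module would only need an HTTP client and no knowledge of Erlang at all.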

To generalise the above question, for what kind of action would I necessarily need to know Erlang?

I'll no doubt have more questions if the answers to the above do not discourage me, but I promise they will be shorter.


Thanks in advance.
