Re: Using CouchDB to represent the tokenized text of a book

Ian Hobson Wed, 03 Nov 2010 14:34:52 -0700

On 03/11/2010 16:36, Nils Breunese wrote:

Weston Ruter wrote:

Specifically, I'm looking at books that are in a constant flux, i.e. books that 
are being edited. The application here is for Bible translations in particular, 
where each word token needs to be keyed into other metadata, like link to 
source word, insertion datetime, translator, etc. Now that I think of it, in 
order to be referencable, each token would have to exist as a separate document 
anyway since parts of documents aren't indexed by ID, I wouldn't think.

That's right. You'll definitely want to use a document per token here.

I'm not sure this is right. It appears most odd to treat a book that is being translated as a sequence of words and symbols. I would expect the translator to translate whole sentences, or paragraphs at a time. For the Bible, isn't the obvious choice the verse? This would imply two document types....

Verses - each contains a list of dictionaries, one for each token. Each dictionary contains the token and the notes about that token. You might use an ordered dictionary and make the token the key. From this, the source and target texts can be created. Each dictionary can point to lexicon entries and carry translation notes, dates, times, translators, etc.

Lexicon - each entry is the meaning of a word, in the context in which it is used. One entry may be referenced in many, many places. Translation notes would record data about inferences and implications to ensure the correct meaning is chosen.
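
As a rough illustration of the two document types, here is a minimal Python sketch. Every field name ("tokens", "lexicon_id", "translator" and so on), the ID formats, and the sample data are assumptions made up for this example; nothing here is prescribed by CouchDB.

```python
# Hypothetical lexicon documents, keyed by an invented ID scheme.
lexicon = {
    "lex:logos-1": {
        "type": "lexicon",
        "lemma": "logos",
        "meaning": "word, message",
        "notes": ["illustrative note only"],
    },
}

# Hypothetical verse document: one dictionary per token, kept in order.
verse = {
    "_id": "043:001:001",
    "type": "verse",
    "tokens": [
        {"source": "En", "target": "In", "lexicon_id": None},
        {"source": "arche", "target": "the beginning", "lexicon_id": None},
        {"source": "en", "target": "was", "lexicon_id": None},
        {"source": "ho", "target": "the", "lexicon_id": None},
        {"source": "logos", "target": "Word", "lexicon_id": "lex:logos-1",
         "translator": "example", "inserted": "2010-11-03"},
    ],
}


def render(verse_doc, field):
    """Rebuild the source or target text from the token dictionaries."""
    return " ".join(tok[field] for tok in verse_doc["tokens"])


def lexicon_for(verse_doc, index):
    """Follow a token's pointer to its lexicon entry, if it has one."""
    lex_id = verse_doc["tokens"][index].get("lexicon_id")
    return lexicon.get(lex_id)


print(render(verse, "source"))
print(render(verse, "target"))
```

The sketch uses a plain list rather than a dictionary keyed by token, since the same word can occur more than once in a verse; a token-keyed ordered dictionary would need some way to tell the repeats apart.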

I rather suspect that notes about the source or target language words, and how they have been translated, would be almost meaningless if separated from the context of the verse.

If verses are given a key computed from Book No, Chapter No, and Verse No, then a view that presents the verses in the correct order is trivial to construct. If there are situations where verses need to be re-ordered, then you need two views and two Verse Nos (one for each language) so you can build the correct keys.
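
One possible way to compute such keys, sketched in Python. The zero-padded string format and the "source_ref" / "target_ref" field names are assumptions for illustration; in CouchDB itself a view could equally emit an array such as [book, chapter, verse] as the key, since array keys collate element by element.

```python
def verse_key(book, chapter, verse):
    """Zero-pad each part so plain string ordering matches numeric ordering."""
    return f"{book:03d}:{chapter:03d}:{verse:03d}"


# Hypothetical verse documents carrying one reference per language,
# for the case where verse order differs between source and target.
verses = [
    {"name": "c", "source_ref": (43, 1, 10), "target_ref": (43, 1, 9)},
    {"name": "a", "source_ref": (43, 1, 2), "target_ref": (43, 1, 2)},
    {"name": "b", "source_ref": (43, 1, 9), "target_ref": (43, 1, 10)},
]


def in_order(docs, ref_field):
    """Emulate a view: sort documents by the key built from one reference."""
    return sorted(docs, key=lambda d: verse_key(*d[ref_field]))


source_order = [d["name"] for d in in_order(verses, "source_ref")]
target_order = [d["name"] for d in in_order(verses, "target_ref")]
```

Without the padding, the string "10" would sort before "2", which is why the sketch fixes the width of each part.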

As I mentioned above, metadata and related data are both going to be externally 
attached to each token at various sources, so each token needs to be referenced 
by ID. This fact alone invalidates a single-document approach because parts of 
a document can't be linked to, correct?

A list of dictionaries that include the token, and data about the token, will avoid this problem.

You will have the user interface problem of presenting a verse with words in one order, and receiving it back with new words in a new order. How do you get the program to match up the right notes with the right words?
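
One possible answer, offered only as a sketch: give each token dictionary a small identifier that is unique within its verse, and have the editing interface send that identifier back alongside each word. The "tid" field, the shape of the edited list, and the function name are all hypothetical.

```python
def reattach_notes(old_tokens, edited):
    """Carry notes over to an edited verse by token identifier.

    old_tokens: token dictionaries as stored in the verse document.
    edited: (tid, word) pairs from a hypothetical UI, in the new order;
            tid is None for words the translator has just added.
    """
    by_id = {tok["tid"]: tok for tok in old_tokens}
    result = []
    for tid, word in edited:
        old = by_id.get(tid)
        notes = list(old["notes"]) if old else []
        result.append({"tid": tid, "word": word, "notes": notes})
    return result


old = [
    {"tid": 1, "word": "was", "notes": ["verb"]},
    {"tid": 2, "word": "the", "notes": []},
    {"tid": 3, "word": "Word", "notes": ["logos"]},
]
# The translator reorders two words, drops one, and adds a new one.
edited = [(3, "Word"), (1, "was"), (None, "indeed")]
new_tokens = reattach_notes(old, edited)
```

Matching by position or by spelling would attach the wrong notes as soon as words are reordered or repeated; an identifier local to the verse avoids that without making every token a separate document. Newly added words would still need an identifier assigned before saving.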


Regards

Ian
